The annotation dataset also documents a scalable data collection pipeline combining non-expert annotators with targeted expert input, offering a model for future data collection efforts.
27.08.2025 17:47 — 👍 0 🔁 0 💬 0 📌 0
@malteschierholz.bsky.social
The dataset is a benchmark to compare various human & automatic annotation techniques.
This information aids in understanding the strengths and weaknesses of current automated extraction methods.
🫱All our data are publicly available on Zenodo. zenodo.org/records/1512...
💥
The datasets have considerable re-use potential due to the gold standard nature of the emission metrics and the accompanying wealth of information.
💥
- All reported Scope 3 emissions need to be treated with caution, as their optional and therefore often incomplete reporting makes comparisons between companies challenging. Companies might just "forget" to report parts of their emissions if these are (too) hard to calculate.
27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0
- Direct emissions by company facilities (Scope 1) and indirect emissions for power and heat consumption (Scope 2 location-based) are most often reported. Residual indirect emissions (Scope 3), e.g. from purchased goods or business travel, are less often reported.
27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0
- About half of the sustainability reports (69 of 139) do not contain any GHG emission values, partly as a consequence of our strict annotation rules.
27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0
What do we learn from this, and what did I find surprising?
- I expected it would be a very simple annotation task to copy GHG emission values from a sustainability report into a table. It was not, as the high level of disagreement between non-expert and expert coders shows.
To ensure that we create a gold standard dataset, two teams of expert annotators double-coded the remaining 40% of reports.
Again, these expert teams disagreed on about half of that 40%. Only in an expert discussion was agreement reached on which values should be extracted.
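As rough arithmetic on the shares quoted in this thread, here is a small Python sketch. It treats "about half" as exactly 0.5, which is an assumption made only for illustration; the variable names are hypothetical.

```python
# Shares quoted in the thread; 0.50 is an assumed stand-in for "about half".
non_expert_agreement = 0.60  # reports on which both non-expert annotators agreed
expert_disagreement = 0.50   # share of expert double-coded reports with disagreement

# Reports without non-expert agreement went to the two expert teams.
sent_to_experts = 1 - non_expert_agreement

# Of those, the reports where the expert teams also disagreed needed discussion.
needed_discussion = sent_to_experts * expert_disagreement

print(f"double-coded by experts: {sent_to_experts:.0%}")
print(f"resolved only in expert discussion: {needed_discussion:.0%}")
```

Under that assumption, roughly one in five reports could only be settled in the expert discussion.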
3. is reported in absolute terms as CO2 or CO2-equivalent emissions and
4. represents a total value, not subcategories.
Two human non-expert annotators searched for all GHG values that meet these conditions.
Despite a training session, these non-experts agreed on only 60% of all reports.
Precise rules were created for human annotators. GHG emission values should be extracted only if they:
1. cover emissions for the entire company,
2. are reported according to the operational boundaries of the scopes (according to the Greenhouse Gas Protocol)
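The four extraction rules (items 1 to 4 across these two posts) can be read as a simple filter. A minimal Python sketch, assuming a hypothetical `Candidate` record; the field names are illustrative and not taken from the paper or the Zenodo data:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One GHG value found in a report (hypothetical fields, not the dataset schema)."""
    covers_entire_company: bool       # rule 1
    follows_ghg_protocol_scope: bool  # rule 2
    is_absolute_co2_or_co2e: bool     # rule 3
    is_total_value: bool              # rule 4


def should_extract(c: Candidate) -> bool:
    """Return True only if all four annotation rules hold."""
    return (
        c.covers_entire_company
        and c.follows_ghg_protocol_scope
        and c.is_absolute_co2_or_co2e
        and c.is_total_value
    )


# A company-wide total passes; a relative (intensity) figure fails rule 3.
total_scope1 = Candidate(True, True, True, True)
intensity_value = Candidate(True, True, False, True)

print(should_extract(total_scope1))
print(should_extract(intensity_value))
```

In practice, deciding each of these four flags is the hard part, which is where the annotators disagreed.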
We use a sample of 139 companies that are listed in the MSCI World Small Cap index and/or in the German Dax.
To obtain the GHG emission metrics, we first extracted them from the PDF files with an LLM, GPT-4. This was just to simplify data extraction; human annotators double-checked the values.
Existing datasets are often inconsistent and lack transparent methodologies, making it difficult to obtain reliable emission data.
We present a gold standard dataset containing emission metrics extracted from 139 sustainability reports collected from company websites.
Annual reports or sustainability reports are often more than 100 pages long and only available in PDF format.
Extracting GHG indicators from these reports by hand is a laborious task. Could one automate this process? How well do ML and AI models perform?
🌟New paper alert: Addressing data gaps in sustainability reporting: A benchmark dataset for greenhouse gas emission extraction🌟
Large companies in the EU are required by law to report their greenhouse gas emissions in their sustainability reports. How can researchers use this data?
rdcu.be/eCEyr
Brain drain at U.S. Census Bureau. Sounds terrible.. 😐
29.05.2025 15:42 — 👍 0 🔁 0 💬 0 📌 0
See also our Bright Line Watch survey documenting self-censorship among political scientists brightlinewatch.org/threats-to-d...
13.05.2025 16:18 — 👍 34 🔁 13 💬 2 📌 0
So how do political scientists explain his approach, then? My guess would be that such words make things easier for him in the coalition with the Union. In the long run, it may also be an adjustment to ease talks with the USA? After all, politicians do not only have to look to the electorate.
11.05.2025 12:09 — 👍 3 🔁 0 💬 0 📌 0
GESIS Summer School in Survey Methodology. Data Science Techniques for Survey Researchers. 04 to 08 August 2025, Cologne. Fiona Draxler, Anna Steinberg, Malte Schierholz
Are you a survey researcher eager to learn data science?
Join our #GESISsummerschool course to master techniques for analyzing both digital behavioral and traditional survey data, including web scraping, machine learning, and more — all in R!
Book Now: t1p.de/GSS25-C5
Love this
endler.dev/2025/best-pr...
The twitter takeover was very much in the news, everyone knows by now what kind of person Elon Musk is, and one would hope X is dead by now... So is this Twitter-Musk showdown still relevant today? Okay, X is still very much alive... so maybe it is worth reading...?
06.04.2025 12:41 — 👍 0 🔁 0 💬 1 📌 0
Besides allegations against Meta executives Mark Zuckerberg, Joel Kaplan and Sheryl Sandberg, there is a lot to learn about Facebook's company culture and international diplomacy/big tech lobbying. It was super-interesting to read about Facebook's values and how Facebook viewed its role in the world.
06.04.2025 12:22 — 👍 0 🔁 0 💬 0 📌 0
I just finished reading a current New York Times bestseller: "Careless People". Though non-fiction, it is riveting and as fun to read as a novel. Highly recommended!!
The Tell-All Book That Meta Doesn’t Want You to Read www.nytimes.com/2025/03/17/o...
This is unfathomable. NSDUH is a critical piece of our national scientific infrastructure. So the administration is basically saying that it couldn't care less about understanding substance use in this country? Really?
01.04.2025 16:05 — 👍 1 🔁 1 💬 0 📌 0
I like this paper: Gruber, C., Schenk, P. O., Schierholz, M., Kreuter, F., & Kauermann, G. (2023). Sources of Uncertainty in Machine Learning - A Statisticians' View. arxiv.org/abs/2305.16703. 1/
20.03.2025 21:24 — 👍 24 🔁 4 💬 1 📌 0
Trump's attack on science: blindside, intimidate, cut
Cuts, political interference, and ideological directives by Donald Trump threaten #Forschungsfreiheit (freedom of research) in the USA. What is happening, and how researchers are responding.
On the blog: www.jmwiarda.de/https-www.jm...
Free press under attack. How much worse will it become?
I reorganized my news subscriptions yesterday - good journalism needs support these days
Democracies stand with Ukraine.
01.03.2025 02:07 — 👍 18040 🔁 3989 💬 647 📌 193
A second letter, from the American Statistical Association, about risks to the BEA, BLS, & Census in particular.
docs.google.com/document/d/1...
Individual sign on:
docs.google.com/forms/d/e/1F...
Orgs:
docs.google.com/forms/d/e/1F...
We're excited to announce #DataFest Germany 2025 at LMU Munich, March 28-30! In this #hackathon, students from diverse study programs compete for the best insights and visualizations from an exclusive dataset within 48 hours. More info: www.datafest.de/home
31.01.2025 09:43 — 👍 9 🔁 5 💬 0 📌 0