Malte Schierholz @malteschierholz

The annotation dataset also documents a scalable data collection pipeline combining non-expert annotators with targeted expert input, offering a model for future data collection efforts.

27.08.2025 17:47 — 👍 0 🔁 0 💬 0 📌 0

The dataset is a benchmark to compare various human & automatic annotation techniques.

This information aids in understanding the strengths and weaknesses of current automated extraction methods.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

Gold Standard and Annotation Dataset for CO2 Emissions Annotation

This repository contains the results of a research project which provides a benchmark dataset for extracting greenhouse gas emissions from corporate annual and sustainability reports. The zipped datasets file contains two datasets, gold_standard and annotation_dataset (password is provided in the zip file).

Data collection

A Large Language Model (LLM) based pipeline was used to extract the greenhouse gas emissions from the reports (see columns prefixed with llm_ in annotation_dataset). The extracted emissions follow the categories Scope 1, Scope 2 (market-based), Scope 2 (location-based) and Scope 3, as defined in the Greenhouse Gas Protocol (GHGP) (see variables scope). Annotation of the pipeline output was done in three phases: first by non-experts (see columns prefixed with non_expert_ in annotation_dataset), then by expert groups (columns prefixed with exp_group_ in annotation_dataset) where the non-experts disagreed, and finally in a discussion of all experts (columns prefixed with exp__disc in annotation_dataset) where the expert groups disagreed. The annotation guidelines for the non-experts and experts are also included in this repository. The annotation results from all three phases are combined to form the final benchmark dataset: gold_standard. Codebooks detailing each variable of each of the two datasets are also provided. More details about the annotation template or the data wrangling scripts can be found in the GitHub repository.

Merging of datasets

Users can match the two datasets (gold_standard and annotation_dataset) using the variable combination of company_name, report_year and merge_id (index column). The merge_id already includes the company name and report year implicitly, but company_name and report_year should still be included as join variables to avoid column duplication in the join operation. For example, this is useful when comparing LLM extractions to gold standard data.
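The merge described in the dataset card can be sketched in plain Python as below. This is a minimal illustration, not the project's own script: only company_name, report_year and merge_id are column names taken from the description, while the toy rows, the merge_id values, and the gold_value and llm_value columns are assumptions made up for the example.

```python
# Toy rows standing in for the two datasets. Only the three join keys are
# real column names; all values and the *_value columns are hypothetical.
gold_standard = [
    {"merge_id": "acme_2022", "company_name": "Acme", "report_year": 2022, "gold_value": 120.0},
    {"merge_id": "bolt_2022", "company_name": "Bolt", "report_year": 2022, "gold_value": None},
]
annotation_dataset = [
    {"merge_id": "acme_2022", "company_name": "Acme", "report_year": 2022, "llm_value": 120.0},
    {"merge_id": "bolt_2022", "company_name": "Bolt", "report_year": 2022, "llm_value": 35.5},
]

JOIN_KEYS = ("company_name", "report_year", "merge_id")


def join_key(row):
    """Build the composite key from the three join variables."""
    return tuple(row[k] for k in JOIN_KEYS)


def merge_on_keys(left, right):
    """Inner join of two lists of dict rows on JOIN_KEYS.

    Assumes each composite key occurs at most once in `right`.
    """
    right_by_key = {join_key(r): r for r in right}
    merged = []
    for left_row in left:
        match = right_by_key.get(join_key(left_row))
        if match is not None:
            # Shared key columns appear once; other columns are combined.
            merged.append({**left_row, **match})
    return merged


merged = merge_on_keys(gold_standard, annotation_dataset)

# Flag rows where the (hypothetical) LLM value equals the gold value.
for row in merged:
    row["llm_matches_gold"] = row["llm_value"] == row["gold_value"]
```

With a data-frame library the same idea applies: joining on all three variables keeps company_name and report_year from showing up twice in the result.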

🫱 All our data are publicly available on Zenodo. zenodo.org/records/1512...

💥
The datasets have large re-use potential due to the gold standard nature of the emission metrics and the accompanying wealth of information.
💥

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

- All reported Scope 3 emissions need to be treated with caution, as their optional and therefore often incomplete reporting makes comparisons between companies challenging. Companies might just "forget" to report parts of their emissions if these are (too) hard to calculate.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

- Direct emissions from company facilities (Scope 1) and indirect emissions from power and heat consumption (Scope 2 location-based) are reported most often. Other indirect emissions (Scope 3), e.g. from purchased goods or business travel, are reported less often.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

- About half of the sustainability reports (69 of 139) do not contain any GHG emission values, partly as a consequence of our strict annotation rules.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

What do we learn from this, and what did I find surprising?

- I expected it would be a very simple annotation task to copy GHG emission values from a sustainability report into a table. It was not, as the high level of disagreement between non-expert and expert coders shows.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

To ensure that we create a gold standard dataset, two teams of expert annotators double-coded the remaining 40% of reports.

Again, these expert teams disagreed on about half of the 40%. Only in an expert discussion was agreement reached on which values should be extracted.
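As a rough back-of-the-envelope sketch of this annotation funnel: the shares below are the rounded figures from this thread (about 60% non-expert agreement, about half of the rest disputed by the expert teams), applied to the 139 reports. The resulting counts are approximations for illustration, not numbers reported in the paper.

```python
TOTAL_REPORTS = 139          # sample size stated in the thread
NON_EXPERT_AGREEMENT = 0.60  # rounded share from the thread
EXPERT_DISAGREEMENT = 0.50   # "about half", so only approximate

# Reports passed on to the two expert teams after non-expert disagreement.
to_experts = TOTAL_REPORTS * (1 - NON_EXPERT_AGREEMENT)

# Reports that still needed the joint expert discussion.
to_discussion = to_experts * EXPERT_DISAGREEMENT

# Share of all reports that reached the final discussion phase.
share_discussion = to_discussion / TOTAL_REPORTS

print(f"to expert teams:      ~{to_experts:.0f} reports")
print(f"to expert discussion: ~{to_discussion:.0f} reports "
      f"(~{share_discussion:.0%} of all reports)")
```

Under these rounded inputs, roughly one report in five needed all three phases before a value was settled.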

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

3. is reported in absolute terms as CO2 or CO2-equivalent emissions, and
4. represents a total value, not subcategories.

Two human non-expert annotators searched for all GHG values that meet these conditions.

Despite a training session, these non-experts agreed on only 60% of all reports.

27.08.2025 17:47 — 👍 2 🔁 0 💬 1 📌 0

Precise rules were created for human annotators. GHG emission values should be extracted only if they:

1. cover emissions for the entire company,
2. are reported according to the operational boundaries of the scopes as defined by the Greenhouse Gas Protocol,

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

We use a sample of 139 companies that are listed in the MSCI World Small Cap index and/or in the German DAX.

We extract the GHG emission metrics from PDF files with an LLM, GPT-4. This was just to simplify data extraction; human annotators double-checked the values.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

Existing datasets are often inconsistent and lack transparent methodologies, making it difficult to obtain reliable emission data.

We present a gold standard dataset containing emission metrics extracted from 139 sustainability reports collected from company websites.

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

Annual reports or sustainability reports are often more than 100 pages long and only available in PDF format.

Extracting GHG indicators from these reports by hand is a laborious task. Could one automate this process? How well do ML and AI models perform?

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

Addressing data gaps in sustainability reporting: A benchmark dataset for greenhouse gas emission extraction (Scientific Data)

🌟New paper alert: Addressing data gaps in sustainability reporting: A benchmark dataset for greenhouse gas emission extraction🌟

Large companies in the EU are required by law to report their greenhouse gas emissions in their sustainability reports. How can researchers use this data?

rdcu.be/eCEyr

27.08.2025 17:47 — 👍 0 🔁 0 💬 1 📌 0

Brain drain at U.S. Census Bureau. Sounds terrible.. 😐

29.05.2025 15:42 — 👍 0 🔁 0 💬 0 📌 0

See also our Bright Line Watch survey documenting self-censorship among political scientists brightlinewatch.org/threats-to-d...

13.05.2025 16:18 — 👍 34 🔁 13 💬 2 📌 0

So how do political scientists explain his approach? My guess would be that such words make things easier for him in the coalition with the Union. In the long run, it may also be an adjustment to make talks with the USA easier? After all, politicians do not only have to look to the electorate.

11.05.2025 12:09 — 👍 3 🔁 0 💬 0 📌 0

GESIS Summer School in Survey Methodology: Data Science Techniques for Survey Researchers, 04 to 08 August 2025, Cologne. Fiona Draxler, Anna Steinberg, Malte Schierholz

Are you a survey researcher eager to learn data science?
Join our #GESISsummerschool course to master techniques for analyzing both digital behavioral and traditional survey data, including web scraping, machine learning, and more — all in R!

Book Now: t1p.de/GSS25-C5

30.04.2025 07:36 — 👍 1 🔁 1 💬 1 📌 0

The Best Programmers I Know | Matthias Endler I have met a lot of developers in my life. Late…

Love this

endler.dev/2025/best-pr...

07.04.2025 22:41 — 👍 84 🔁 21 💬 1 📌 6

The Twitter takeover was very much in the news, everyone knows by now what kind of person Elon Musk is, and one would hope X is dead by now... So is this Twitter-Musk showdown still relevant today? Okay, X is still very much alive... so maybe it is worth reading...?

06.04.2025 12:41 — 👍 0 🔁 0 💬 1 📌 0

Besides allegations against Meta executives Mark Zuckerberg, Joel Kaplan and Sheryl Sandberg, there is a lot to learn about Facebook's company culture and international diplomacy/big tech lobbying. It was super-interesting to read about Facebook's values and how Facebook viewed its role in the world

06.04.2025 12:22 — 👍 0 🔁 0 💬 0 📌 0

Opinion | The Tell-All Book That Meta Doesn’t Want You to Read (Gift Article) The “free speech” champion Mark Zuckerberg tries to shut up a critic.

I just finished reading a current New York Times bestseller: "Careless People". Though non-fiction, it is riveting and as fun to read as a novel. Highly recommended!!

The Tell-All Book That Meta Doesn’t Want You to Read www.nytimes.com/2025/03/17/o...

06.04.2025 12:20 — 👍 3 🔁 1 💬 3 📌 0

This is unfathomable. NSDUH is a critical piece of our national scientific infrastructure. So the administration is basically saying that it could care less about understanding substance use in this country? Really?

01.04.2025 16:05 — 👍 1 🔁 1 💬 0 📌 0

Sources of Uncertainty in Supervised Machine Learning -- A Statisticians' View Supervised machine learning and predictive models have achieved an impressive standard today, enabling us to answer questions that were inconceivable a few years ago. Besides these successes, it becom...

I like this paper: Gruber, C., Schenk, P. O., Schierholz, M., Kreuter, F., & Kauermann, G. (2023). Sources of Uncertainty in Machine Learning - A Statisticians' View. arxiv.org/abs/2305.16703. 1/

20.03.2025 21:24 — 👍 24 🔁 4 💬 1 📌 0

Trump's attack on science: blindside, intimidate, cut

Cuts, political interference and ideological directives by Donald Trump threaten research freedom (#Forschungsfreiheit) in the USA. What is happening, and how researchers are responding.

On the blog: www.jmwiarda.de/https-www.jm...

04.03.2025 08:14 — 👍 33 🔁 21 💬 0 📌 2

Free press under attack. How much worse will it become?

I reorganized my news subscriptions yesterday - good journalism needs support these days

02.03.2025 11:48 — 👍 1 🔁 0 💬 0 📌 0

Democracies stand with Ukraine.

01.03.2025 02:07 — 👍 18040 🔁 3989 💬 647 📌 193

A second letter, from the American Statistical Association, about risks to the BEA, BLS, & Census in particular.
docs.google.com/document/d/1...

Individual sign on:
docs.google.com/forms/d/e/1F...

Orgs:
docs.google.com/forms/d/e/1F...

04.02.2025 20:46 — 👍 48 🔁 40 💬 2 📌 3

DataFest 🇩🇪 2025 Call for applications 2025! 📢 We are glad to announce our call for applications to join the 8th edition of DataFest Germany, which will take place at the Ludwig-Maximilians-Universität in Munich (Marc...

We're excited to announce #DataFest Germany 2025 at LMU Munich, March 28-30! In this #hackathon, students from diverse study programs compete for the best insights and visualizations from an exclusive dataset within 48 hours. More info: www.datafest.de/home

31.01.2025 09:43 — 👍 9 🔁 5 💬 0 📌 0
