
@eleutherai.bsky.social

1,794 Followers  |  46 Following  |  37 Posts  |  Joined: 20.11.2024

Posts by @eleutherai.bsky.social

commoncrawl/CommonLID · Datasets at Hugging Face

Huge shout out to co-lead authors @pjox.bsky.social @very-laurie.bsky.social @catherinearnett.bsky.social and the 94 other co-authors who made this work possible. To learn more, check out the paper or stop by their talk in our Discord server on Feb 25th at 1600 GMT!

huggingface.co/datasets/com...

13.02.2026 19:46 — 👍 5    🔁 0    💬 0    📌 0

CommonLID is open and permissively licensed, focused on the boring infrastructure problem that determines whether thousands of languages get included in NLP at all. LangID isn't glamorous but it's load-bearing. Evaluate your LID pipeline on web-domain benchmarks, not FLORES.

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 1

That's why @commoncrawl.bsky.social is such an important friend of ours and we are so excited to work with them on improving the data ecosystem. LangID errors compound at every stage in the data pipeline, so getting infrastructure right upstream matters more than most things downstream.

13.02.2026 19:46 — 👍 6    🔁 0    💬 1    📌 0
Announcing the Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Data is something we care deeply about. From the Pile in 2020 to the Common Pile last year, we've spent years building large-scale open datasets. Data is the lifeblood of machine learning research, and without open data there can't be high quality open research.

blog.eleuther.ai/common-pile/

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 0

For many long-tail languages, the only LID training data is Bible text. Models learn one very specific register and fail on informal, heterogeneous web text where the actual volume lives. This distributional mismatch is what clean benchmarks hide and CommonLID exposes.

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 0

URLs, proper nouns, boilerplate - lots of web text doesn't clearly belong to any language. Getting clean labels from noisy web data is hard even for humans. This is part of what makes the benchmark valuable: it reflects the actual difficulty of the task.

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 0

Building CommonLID involved a community-driven annotation effort with hosted hackathons and a custom interface. Annotators sometimes labeled at word level instead of line level, multilingual sentence splitting is nontrivial, and many lines resist classification.

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 0

GlotLID does best overall but the GPT comparison is revealing. GPT-5 trails GlotLID only slightly on core languages (-1.8% F1) but falls behind by 30 F1 points on African languages. And you can't run an LLM on billions of webpages: you need cheap, purpose-built classifiers.

13.02.2026 19:46 — 👍 5    🔁 0    💬 1    📌 0

We tested 8 widely-used LID models (GlotLID, CLD2, CLD3, OpenLID, FastText, etc.) across CommonLID and five other benchmarks. No model hits >75% F1 across all evaluation sets, even counting only languages the model claims to support. Most score in the 60s on CommonLID web text.
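To make the evaluation setup concrete, here is a minimal sketch of macro-averaged per-language F1 computed only over the languages a model claims to support. The `macro_f1` helper, the language codes, and the gold/predicted labels are all made up for illustration; the real evaluation uses the CommonLID benchmark data.

```python
# Hypothetical sketch: macro-F1 restricted to a model's supported languages.
# Labels below are invented; real scores come from benchmark annotations.
from collections import Counter

def macro_f1(gold: list[str], pred: list[str], supported: set[str]) -> float:
    """Macro-averaged one-vs-rest F1 over the supported label set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it wasn't p
            fn[g] += 1  # was g, but we missed it
    scores = []
    for lang in supported:
        p_den = tp[lang] + fp[lang]
        r_den = tp[lang] + fn[lang]
        prec = tp[lang] / p_den if p_den else 0.0
        rec = tp[lang] / r_den if r_den else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

gold = ["eng", "eng", "swh", "swh", "deu"]
pred = ["eng", "deu", "swh", "eng", "deu"]
print(round(macro_f1(gold, pred, supported={"eng", "swh", "deu"}), 3))
```

Restricting `supported` is the generous setting: a model is never penalized for languages it does not claim, which makes the sub-75% scores above all the more striking.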

13.02.2026 19:46 — 👍 6    🔁 0    💬 1    📌 0

Most LID benchmarks use clean text: FLORES translations, Bible passages, Wikipedia. Models look great on these. But web text is noisy, multilingual, full of boilerplate and code-switching. CommonLID evaluates on the domain that matters for corpus building, and scores drop hard.

13.02.2026 19:46 — 👍 6    🔁 0    💬 1    📌 0

Why care about LangID on crawled data? It's the first gate in the multilingual data pipeline. If your LID model misclassifies a low-resource language as noise or confuses it with a related high-resource one, that language doesn't make it into your corpus.

Bad LangID = no data.
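The gate described above can be sketched in a few lines. Everything here is hypothetical: `identify` stands in for a real classifier (e.g. a fastText-style model), and its labels and confidences are hard-coded purely to illustrate how a low-confidence or misrouted line silently drops out of the corpus.

```python
# Hypothetical sketch of the "first gate": an LID filter over crawled lines.
# identify() is a toy stand-in for a real LID model; outputs are hard-coded.

def identify(line: str) -> tuple[str, float]:
    """Toy stand-in for an LID model: returns (language code, confidence)."""
    fake_model = {
        "The quick brown fox": ("eng", 0.99),
        "Ein kurzer Beispielsatz": ("deu", 0.95),
        "Short low-resource text": ("und", 0.30),  # model unsure -> "undetermined"
    }
    return fake_model.get(line, ("und", 0.0))

def lid_gate(lines: list[str], keep_langs: set[str], min_conf: float = 0.5) -> dict[str, list[str]]:
    """Route each line into a per-language bucket, or drop it.

    A line enters the corpus only if the classifier is confident AND the
    predicted language is one we keep, so a misclassified low-resource
    language vanishes at this stage with no downstream trace.
    """
    corpus: dict[str, list[str]] = {}
    for line in lines:
        lang, conf = identify(line)
        if conf >= min_conf and lang in keep_langs:
            corpus.setdefault(lang, []).append(line)
    return corpus

buckets = lid_gate(
    ["The quick brown fox", "Ein kurzer Beispielsatz", "Short low-resource text"],
    keep_langs={"eng", "deu"},
)
print(sorted(buckets))  # the low-confidence line never reaches any bucket
```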

13.02.2026 19:46 — 👍 6    🔁 1    💬 1    📌 0
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data of...

Announcing our latest paper: CommonLID

In collaboration with @commoncrawl.bsky.social @mlcommons.org @jhu.edu, we built an LID benchmark on actual Common Crawl text covering 109 languages. Existing evaluations overestimate how well LangID works on web data.

arxiv.org/abs/2601.18026

13.02.2026 19:27 — 👍 22    🔁 12    💬 1    📌 0

And the next talk (exact details TBA) is by @pjox.bsky.social and @very-laurie.bsky.social from Common Crawl, on our ongoing collaboration to build better benchmarks for LangID systems and to understand the long-tail language issues that surface at Common Crawl scale.

09.01.2026 00:56 — 👍 5    🔁 1    💬 0    📌 0

We're bringing back a Community Spotlight talk series, highlighting cool work being done by members of our community. We're kicking it off with a talk on running diffusion-based world models in real time on consumer hardware.

Jan 9th at 2 pm US Eastern Time

09.01.2026 00:56 — 👍 11    🔁 4    💬 1    📌 0

If you can't make it, no problem! All of our reading groups and speaker series upload to our YouTube. We have over 100 hours of content on topics from ML Scalability and Performance to Functional Analysis to podcasts and interviews featuring our team.

www.youtube.com/@Eleuther_AI...

26.06.2025 18:16 — 👍 4    🔁 1    💬 0    📌 0
BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with parti...

Her talk will draw primarily on two recent papers:

"BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training" arxiv.org/abs/2409.04599

"BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization" arxiv.org/abs/2505.24689

26.06.2025 18:16 — 👍 3    🔁 0    💬 1    📌 0

We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members.

Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.

26.06.2025 18:16 — 👍 16    🔁 2    💬 1    📌 1
Mozilla, EleutherAI publish research on open datasets for LLM training | The Mozilla Blog Update: Following the 2024 Mozilla AI Dataset Convening, AI builders and researchers publish best practices for creating open datasets for LLM training.

This dataset was previewed at the Datasets Convening we co-hosted with @mozilla.org to consult with leading experts in open datasets.

Read more about the event: blog.mozilla.org/en/mozilla/d...

And the paper distilling the best practices participants identified: arxiv.org/abs/2501.08365

06.06.2025 19:18 — 👍 8    🔁 1    💬 1    📌 0

This was a huge effort across twelve institutions. Thank you to all the authors for their hard work.

This work was supported by @mozilla.org @mozilla.ai, Sutter Hill Ventures, the Natural Sciences and Engineering Research Council of Canada, and Lawrence Livermore National Laboratory.

06.06.2025 19:18 — 👍 7    🔁 0    💬 1    📌 1
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concern...

For more, check out...

Paper: arxiv.org/abs/2506.05209
Artifacts: huggingface.co/common-pile
GitHub: github.com/r-three/comm...

EleutherAI's blog post: huggingface.co/blog/stellaa...
Coverage in @washingtonpost.com by @nitasha.bsky.social: www.washingtonpost.com/politics/202...

06.06.2025 19:18 — 👍 5    🔁 0    💬 2    📌 0

We're calling this v0.1 for a reason: we are excited to continue to build the open data ecosystem and hope to train bigger models on more data in the future!

If you know datasets we should include in the next version, open an issue: github.com/r-three/comm...

06.06.2025 19:18 — 👍 4    🔁 0    💬 1    📌 0

Several other groups have put out openly licensed datasets recently, so why is ours better? Ablation studies show that models trained on Common Pile v0.1 outperform them, matching the performance of models trained on the original Pile and OSCAR, though still falling short of FineWeb.

06.06.2025 19:18 — 👍 8    🔁 1    💬 1    📌 0

Our pretrained models, Comma v0.1-1T and -2T, perform comparably to leading models trained in the same regime. These plots also include Qwen as a SOTA 8B reference, though it saw 36T tokens.

06.06.2025 19:18 — 👍 10    🔁 1    💬 1    📌 1

We put a lot of work into our metadata: two rounds of manually validating the ToS of websites in Common Crawl, manually identifying trustworthy YouTube channels, and leveraging work by the BigCode Project and @softwareheritage.org to build the openly licensed subset of StackV2.

06.06.2025 19:18 — 👍 10    🔁 2    💬 1    📌 0

The project of open science for machine learning only works if we are able to distribute the training data. Openly licensed data lets us do that, under mild conditions. We make sure to provide document-level metadata for authorship, licensing information, links back to the originals, and more.

06.06.2025 19:18 — 👍 9    🔁 2    💬 1    📌 0

What do we mean by "openly licensed" data? Following the lead of orgs like @wikimediafoundation.org and @creativecommons.bsky.social we adopt the definition laid out by @okfn.bsky.social: opendefinition.org

Succinctly put, it's data that anyone can use, modify, and share for any purpose.

06.06.2025 19:18 — 👍 6    🔁 0    💬 1    📌 0

The Common Pile comprises text from 30 distinct sources, covering a wide variety of domains including research papers, code, books, educational materials, audio transcripts, governmental text, and more. Some of this text is commonplace in AI, but a lot of it is pretty new.

06.06.2025 19:18 — 👍 11    🔁 1    💬 1    📌 0

Can you train a performant language model using only openly licensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.

06.06.2025 19:18 — 👍 147    🔁 59    💬 2    📌 2
1st Workshop on Multilingual Data Quality Signals

Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org

29.05.2025 17:18 — 👍 9    🔁 8    💬 0    📌 1

Today, at 11am ET, @storytracer.org will be giving a live demo on the @mozilla.ai Discord showcasing two Blueprints for creating open datasets: audio transcription using self-hosted Whisper models and document conversion using Docling. Join the event here: discord.com/invite/4jtc8...

28.04.2025 12:26 — 👍 12    🔁 5    💬 2    📌 0