EleutherAI
If you can't make it, no problem! All of our reading groups and speaker series are uploaded to our YouTube channel. We have over 100 hours of content on topics ranging from ML Scalability and Performance to Functional Analysis, plus podcasts and interviews featuring our team.
www.youtube.com/@Eleuther_AI...
26.06.2025 18:16
We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members.
Our first talk is by @catherinearnett.bsky.social on tokenizers, their limitations, and how to improve them.
26.06.2025 18:16
Mozilla, EleutherAI publish research on open datasets for LLM training | The Mozilla Blog
Update: Following the 2024 Mozilla AI Dataset Convening, AI builders and researchers publish best practices for creating open datasets for LLM training.
This dataset was previewed at the Datasets Convening we co-hosted with @mozilla.org to consult with leading experts in open datasets.
Read more about the event: blog.mozilla.org/en/mozilla/d...
And the paper distilling the best practices participants identified: arxiv.org/abs/2501.08365
06.06.2025 19:18
This was a huge effort across twelve institutions. Thank you to all the authors for their hard work.
This work was supported by @mozilla.org, @mozilla.ai, Sutter Hill Ventures, the Natural Sciences and Engineering Research Council of Canada, and Lawrence Livermore National Laboratory.
06.06.2025 19:18
We're calling this v0.1 for a reason: we are excited to continue to build the open data ecosystem and hope to train bigger models on more data in the future!
If you know datasets we should include in the next version, open an issue: github.com/r-three/comm...
06.06.2025 19:18
Several other groups have put out openly licensed datasets recently, so why is ours better? Ablation studies show that models trained on Common Pile v0.1 outperform them, matching the performance of models trained on the original Pile and OSCAR, though still falling short of FineWeb.
06.06.2025 19:18
Our pretrained models, Comma v0.1-1T and -2T, perform comparably to leading models trained in the same regime. These plots also include Qwen as a SOTA 8B reference, though it saw 36T tokens.
06.06.2025 19:18
We put a lot of work into our metadata: two rounds of manual ToS validation for websites in Common Crawl, manual identification of trustworthy YouTube channels, and leveraging work by the BigCode Project and @softwareheritage.org to build the openly licensed subset of StackV2.
06.06.2025 19:18
What do we mean by "openly licensed" data? Following the lead of orgs like @wikimediafoundation.org and @creativecommons.bsky.social we adopt the definition laid out by @okfn.bsky.social: opendefinition.org
Succinctly put, it's data that anyone can use, modify, and share for any purpose.
06.06.2025 19:18
The Common Pile comprises text from 30 distinct sources, covering a wide variety of domains including research papers, code, books, educational materials, audio transcripts, governmental text, and more. Some of this text is commonplace in AI, but a lot of it is pretty new.
06.06.2025 19:18
Can you train a performant language model using only openly licensed text?
We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
06.06.2025 19:18
1st Workshop on Multilingual Data Quality Signals
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
29.05.2025 17:18
Today, at 11am ET, @storytracer.org will be giving a live demo on the @mozilla.ai Discord showcasing two Blueprints for creating open datasets: audio transcription using self-hosted Whisper models and document conversion using Docling. Join the event here: discord.com/invite/4jtc8...
28.04.2025 12:26
Very cool work!
24.02.2025 22:03
Proud to be at the AI Action Summit representing @eleutherai.bsky.social and the open source community. The focus on AI for the public good is exciting! DM me or @aviya.bsky.social to talk about centering openness, transparency, and public good in the AI ecosystem.
10.02.2025 11:51
How do a neural network's final parameters depend on its initial ones?
In this new paper, we answer this question by analyzing the training Jacobian, the matrix of derivatives of the final parameters with respect to the initial parameters.
https://arxiv.org/abs/2412.07003
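If you want to poke at the idea on a toy problem, here is a minimal JAX sketch (ours, not the paper's code): treat the whole gradient-descent run as a single function from initial to final parameters and differentiate through it. The model, data, and hyperparameters below are made up for illustration.

```python
import jax
import jax.numpy as jnp

# Toy data and model: least-squares regression with two parameters.
X = jnp.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = jnp.array([1.0, 2.0, 3.0])

def loss(params):
    return jnp.mean((X @ params - y) ** 2)

def train(init_params, lr=0.01, steps=100):
    # Full-batch gradient descent, viewed as a map from initial to final params.
    params = init_params
    for _ in range(steps):
        params = params - lr * jax.grad(loss)(params)
    return params

init = jnp.zeros(2)
# Training Jacobian: d(final params) / d(initial params), here a 2x2 matrix.
J = jax.jacobian(train)(init)
print(J)
```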
11.12.2024 20:30
GitHub - EleutherAI/steering-llama3
You can find our code here: github.com/EleutherAI/s...
If you're interested in helping out with this kind of research, check out the concept-editing channel on our Discord.
This work is by Thomas Marshall, Adam Scherlis, and @norabelrose.bsky.social. It is funded in part by a grant from OpenPhil.
22.11.2024 03:15
ACE isn't just for RWKV! ACE enables more precise control over model behavior than prior methods. For example, on Gemma, we cause the model to behave almost identically on harmless and harmful prompts (either refusing all of them or accepting all of them) for a fixed steering parameter.
22.11.2024 03:15
ACE (Affine Concept Editing) assumes that concepts are affine functions, rather than linear ones. It projects activations onto a hyperplane containing the centroid of the target behavior, one which may not pass through the origin.
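A minimal numpy sketch of that projection, with hypothetical names and shapes (not the repo's actual API): the component of the activation along the concept direction is removed relative to the target centroid rather than relative to the origin.

```python
import numpy as np

def ace_project(h, direction, centroid):
    """Project activation h onto the hyperplane through `centroid`
    that is orthogonal to `direction`; it need not pass through the origin."""
    d = direction / np.linalg.norm(direction)
    # Remove the component of (h - centroid) along the concept direction.
    return h - d * np.dot(d, h - centroid)

# Toy example with a 4-dimensional activation.
h = np.array([1.0, 2.0, 3.0, 4.0])
refusal_direction = np.array([0.0, 1.0, 0.0, 0.0])
harmless_centroid = np.array([0.5, 0.5, 0.5, 0.5])
print(ace_project(h, refusal_direction, harmless_centroid))
```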
22.11.2024 03:15
For example, Arditi et al. (arxiv.org/abs/2406.11717) argued that refusal is mediated by a single "direction," or linear subspace, in many language models. But when we applied their method to an RWKV model, we got nonsense results! We propose a new method, ACE, that fixes this issue.
22.11.2024 03:15
Refusal in LLMs is an Affine Function
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors ...
The latest from our interpretability team: there is an ambiguity in prior work on the linear representation hypothesis:
Is a linear representation a linear function (that preserves the origin) or an affine function (that does not)? This distinction matters in practice. arxiv.org/abs/2411.09003
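In symbols (our notation, for illustration): a strictly linear representation fixes the origin, while an affine one generally does not.

```latex
f_{\mathrm{lin}}(x) = W x \;\Rightarrow\; f_{\mathrm{lin}}(0) = 0,
\qquad
f_{\mathrm{aff}}(x) = W x + b \;\Rightarrow\; f_{\mathrm{aff}}(0) = b \neq 0 \text{ in general.}
```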
22.11.2024 03:15
Hi! You could add us :) @stellaathena.bsky.social and @norabelrose.bsky.social are here too.
20.11.2024 21:56
Proudly building a more open world with @opensource.bsky.social.
20.11.2024 13:17
PhD student at Cambridge University. Causality & language models. Passionate musician, professional debugger.
pietrolesci.github.io
Common Crawl is a non-profit foundation dedicated to the Open Web.
We provide services, tools and training to build a world open by design where all knowledge is accessible to everyone.
#TheTechWeWant #OpenDataEditor #OpenDataCommons #SchoolOfData #CKAN
PhD at 19 |
Founder and CEO at @MedARC_AI |
Research Director at @StabilityAI |
@kaggle Notebooks GM |
Biomed. engineer @ 14 |
TEDx talk: https://bit.ly/3tpAuan
PhD @ltiatcmu.bsky.social
previously @eleutherai.bsky.social
lintang.sutawika.com
Open Data Consultant for @eleutherai.bsky.social & Digital History Advisor for @eui-history.bsky.social. Co-founder of @datarescueproject.org and @sucho-org.bsky.social. Website: https://www.storytracer.com/
Research @MsftResearch | Prev Research @EleutherAI
I like playing with LLMs
I make models more efficient.
Google Scholar: https://scholar.google.com/citations?user=GDm6BIAAAAAJ&hl=en
Head of Policy at EleutherAI. They/them. Working at the intersection of open source and AI policy. Former philosophlete.
AI, philosophy, spirituality
Head of interpretability research at EleutherAI, but posts are my own views, not Eleutherโs.
PhD student at Brown University
Twitter: @CFGeek
Mastodon: @cfoster0@sigmoid.social
When I choose to speak, I speak for myself.
Tensor-enjoyer
Machine Learner by day, Statistician at ❤️
In search of statistical intuition for modern ML & simple explanations for complex things
Interested in the mysteries of modern ML, causality & all of stats. Opinions my own.
https://aliciacurth.github.io
Waiting on a robot body. All opinions are universal and held by both employers and family.
Literally a professor. Recruiting students to start my lab.
ML/NLP/they/she.
Technology specialist at the EU AI Office / AI Safety / Prev: University of Amsterdam, EleutherAI, BigScience
Thoughts & opinions are my own and do not necessarily represent my employer.
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Previously PleIAs, Edinburgh University.
Interested in multilingual NLP, tokenizers, open science.
Boston. She/her.
https://catherinearnett.github.io/
language model pretraining @ai2.bsky.social, co-lead of data research w/ @soldaini.net, statistics @uw, open science, tabletop, seattle, he/him, kyleclo.com
Breakthrough AI to solve the world's biggest problems.
Join us: http://allenai.org/careers
Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
LM/NLP/ML researcher ¯\_(ツ)_/¯
yoavartzi.com / associate professor @ Cornell CS + Cornell Tech campus @ NYC / nlp.cornell.edu / associate faculty director @ arXiv.org / researcher @ ASAPP / starting @colmweb.org / building RecNet.io