I'm not sure if I'm more called out by this skeet or the fact that I've had two kidney stones already tbh...
Language identification still proves to be a challenging task, especially for web data. In collaboration with @mlcommons.org @eleutherai.bsky.social @jhu.edu and 97 community members, we created CommonLID, a new benchmark for LangID for 100+ languages!
Has anyone else had Claude Code become non-functional recently? Even with a test input it spins for minutes without doing anything. Same thing happens in terminal.
It's going to get worse because people hate AI
The only reasons I use social media platforms are to get eyeballs on research and to yell at people who are wrong online.
X > Bluesky at both for me
And the next talk (exact details TBA) by @pjox.bsky.social and @very-laurie.bsky.social from Common Crawl on work we've been collaborating on to build better benchmarking of LangID systems and understand the issues with the long tail of human language that comes up at Common Crawl scales.
We’re bringing back a Community Spotlight talk series, highlighting cool work being done by members of our community. We’re kicking it off with a talk on running diffusion-based world-models in real time on consumer hardware.
Jan 9th at 2 pm US Eastern Time
What are people's favorite paper / project websites? I'm looking to build a library to base future ones I make off of.
The one EleutherAI has done that I'm proudest of is deepignorance.ai
Imagine if they had done their job and put the pedophile insurrectionists in jail for the rest of their lives.
I'm late to seeing this post, but this is a super cool paper! Thanks for sharing. Do you have more work in this vein in the works?
If you're looking for the 60 Minutes piece on CECOT that Bari Weiss canned to protect the President, you can find it here: archive.org/details/60-m...
Ukraine is not the only country Sebastian has created non-profits to support. The @datarescueproject.org has spent the past year coordinating and managing volunteer communities in the US working to back up scientific, cultural, and historical data that the USG wants to suppress.
They were honored alongside the Ukrainian Library Association whose on-the-ground work in Ukraine to promote and preserve Ukrainian language and culture has been all the more vital as Ukraine fights for its existence.
Incredibly proud of my friend and colleague @storytracer.com. Two weeks ago he and his cofounders @sucho-org.bsky.social were honored for organizing a global network of volunteers to exfiltrate and back up endangered Ukrainian cultural heritage in the wake of the invasion by Russia.
Really great to see NVIDIA staking out a pro-open data position. This used to be common, if not the norm, in AI, and the backing away from this level of transparency has done a lot of harm to the research community.
Ah, that would explain it.
Compare the statement about the antisemitic terror attack in Australia by the Israeli Prime Minister with the one by Zohran Mamdani, and ask yourself who more truly cares about condemning antisemitism, as opposed to using it to promote unrelated politics.
I didn't... is it because I'm at a non-profit research institute but not a university? 🤔
"The incentives made me do it" is an excuse, not a justification. You can be better than that, and if you're not it's because you choose to not be.
I saw a pamphlet recently from WotC to game store employees about how to talk to parents who thought Magic: the Gathering was Satanic and whose kids were into it. Someone had found it in a box and framed it.
It's a wrap on EvalEval in San Diego! A jam-packed day of learning, making new friends, critically examining the field of evals, and walking away with renewed energy and new collaborations!
We have a lot of announcements coming, but first: EvalEval will be back for #ACL2026!
In 2023-ish it was trendy to write papers trying to explain why scaling laws had power law structures. The papers I remember were pretty unconvincing. Did anything meaningful come of this work? What does the best work in this vein look like?
I highly recommend "Do Artifacts Have Politics?"
faculty.cc.gatech.edu/~beki/cs4001...
Put this person in jail.
Put the person who drafted it in jail.
Put the higher ups who covered it up in jail.
This is a crime. If Kilmar had done the same they wouldn't hesitate to punish him. ICE is a criminal organization, not a law enforcement organization, and justice requires accountability.
Hyped to write "The models in this paper cost us 476,246.57 USD to train. I'm sorry you are sad we didn't redo all of our experiments on multiple independent training datasets. If you'd like to give us a million dollars we'd be happy to run the experiments you wish" in my response to a reviewer.
🚨 AI keeps scaling, but social impact evaluations aren’t–and the data proves it 🚨
Our new paper, 📎“Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations,” analyzes hundreds of evaluation reports and reveals major blind spots ‼️🧵 (1/7)
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
Small caveat: I misunderstood arXiv's ToS when I wrote this paper. While a large portion of arXiv has an open license, the majority (last time I checked) does not. That shouldn't have a check under "author."
PG-19 lacks one because of how radically technology has changed.
arxiv.org/abs/2101.00027
In the original Pile paper we talked about various conceptions of consent (though I don't stand by everything I wrote about this topic 5 years ago). None of this data has EIC, though I think that the ones marked "author" in the table are ones where authorial objection would be unreasonable.
Adding to what @mmitchell.bsky.social said, EIC cannot be use-agnostic by definition. It must be explicit to the use in question. If you put a notice that says "everyone can use this for every purpose" that's *not* EIC.