Read more:
Call: arxiv.org/abs/2602.20092
Website: babylm.github.io
See you at EMNLP 2026!
With @emnlpmeeting.bsky.social
@lchoshen.bsky.social Ryan Cotterell @momergul.bsky.social @jumelet.bsky.social @tallinzen.bsky.social @amuuueller.bsky.social
@suchirsalhan.bsky.social Raj Sanjay Shah @alexwarstadt.bsky.social @wegotlieb.bsky.social
Beyond the challenge, we welcome workshop papers on: Efficient NLP, Language acquisition & cognitive modeling, Small-scale model evaluation, Multilingual / low-resource settings, and <Your Baby-Sized BIG Idea Here>
02.03.2026 15:27
New this year: Multilingual Track
Babies learn more than English.
We want:
• Cross-lingual transfer
• Typological diversity
• Low-resource experiments
• Developmentally plausible learning beyond English
Small data. Many languages. No excuses.
BabyLM is back at EMNLP 2026!
We are excited to announce that the 4th BabyLM Challenge & Workshop will once again bring together researchers interested in sample-efficient, developmentally plausible language modeling.
@emnlpmeeting
More in the thread 🧵
Hopefully not that long
25.02.2026 16:02
Not convinced?
Read more: papers.ssrn.com/sol3/papers....
Or convince me that narrow agents scale better.
I'm (skeptically) listening.
Convinced? What now?
Come build agents, components, protocols, and evaluations that suit general agents.
Or just test existing ones; they are already surprisingly cost-effective.
Counterintuitive take (I think):
General systems may be easier to regulate.
Central abstraction → clear intervention points.
Shared standards → shared scrutiny.
Fragmented bespoke agents are harder to audit.
Ever heard of the Internet? It scaled because of a narrow-waist protocol (IP).
Similarly, abstractions could be AI's narrow waist.
The agents receive info on the environment and interact to learn the rest, decoupling tools, models, and environments from the agent's code.
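A minimal sketch of what that decoupling could look like, assuming hypothetical `Tool` and `Environment` protocols (all names here are illustrative, not from any real agent framework):

```python
from typing import Protocol


class Tool(Protocol):
    name: str
    def run(self, arg: str) -> str: ...


class Environment(Protocol):
    def observe(self) -> str: ...
    def act(self, action: str) -> str: ...


class GeneralAgent:
    """The agent's own code knows nothing about any concrete tool or
    environment; both are injected, so swapping a browser environment
    for a CLI environment needs no rewrite of the agent itself."""

    def __init__(self, tools: dict[str, Tool]):
        self.tools = tools

    def step(self, env: Environment) -> str:
        obs = env.observe()
        # Toy policy: use the first tool whose name appears in the observation.
        for name, tool in self.tools.items():
            if name in obs:
                return env.act(tool.run(obs))
        return env.act("noop")
```

In a real system `step` would call a model to pick the tool; the point is only that tools and environments plug in behind stable interfaces.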
Domain-specific agents often overfit to a narrow task.
By tackling diverse environments, we remove brittle assumptions.
The result? We find "real" solutions and algorithms: robustness by design.
General ≠ "do everything badly."
General = strong default → specialize cheaply.
Like pretraining → fine-tuning.
You didn't pretrain grandpa BERT for every task.
Why rebuild agents from scratch?
History is pretty clear:
Heuristics → algorithms → learning systems → foundation models.
Each time, the more general & more scalable approach wins.
Why would agents be the first exception?
Paper:
papers.ssrn.com/sol3/papers....
Agents should be general
Why are we building code agents, CLI agents, browser agents separately?
Why does adapting to a new benchmark take a month?
Our collaboration brings diverse views: pros here, cons in the paper.
& your pushback if I'm wrong.
Argument + paper link below 🧵
#MCP #AI #LLMs #agents
Don't complain. Do it yourself.
When the @eval-eval.bsky.social coalition started studying together what is broken in evaluation, I knew what we needed to do.
We need to digitize evals.
How come every evaluation is reported differently? In a separate place?
We built Every Eval Ever:
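One way to read "digitize evals" at the data level: one shared, machine-readable record per result instead of numbers scattered across papers. The schema below is my assumption for illustration, not Every Eval Ever's actual format.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class EvalRecord:
    """One machine-readable eval result (illustrative fields only)."""
    model: str
    benchmark: str
    metric: str
    score: float
    source: str  # where the number was reported

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


rec = EvalRecord(
    model="some-model",
    benchmark="some-benchmark",
    metric="accuracy",
    score=0.82,
    source="paper-or-leaderboard-url",
)
```

Once every reported number lives in a record like this, comparisons and meta-analyses become a query instead of a literature search.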
And for those worried about saturation: relax, there's a silver lining.
A rare deep thought in current evaluation, love @shir-ashury-tahan.bsky.social work on it!
Oh, and I forgot to open with "MIT & IBM find..." (or did I?)
At a high level, as tasks saturate, robustness follows.
For researchers, this suggests that robustness engineering may become less important as performance increases.
For practitioners, it indicates that easier tasks are already reliable enough for realβworld deployment.
We further find that robustness is mainly driven by task-specific competence, not some separate, inherent "robustness capability" of the model, challenging common assumptions.
16.02.2026 15:43
Across multiple models, datasets, and configurations, we observe a strong positive correlation between task performance and robustness. This effect goes beyond the "trivial robustness" expected from high accuracy and appears consistently across diverse architectures.
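The shape of that analysis can be sketched as correlating per-model accuracy with robustness (fraction of prompt variants on which the answer stays correct). The numbers below are made up for illustration; only the computation is the point.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-model numbers: task accuracy, and the fraction of
# paraphrased task variants on which the model's answer stays correct.
accuracy = [0.55, 0.70, 0.80, 0.92]
robustness = [0.40, 0.60, 0.75, 0.90]
r = pearson(accuracy, robustness)  # strongly positive for these toy numbers
```

A high `r` across many such model sets is what "performance and robustness move together" would look like in code.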
We thought 🤔
Models tend to learn tasks in a similar order: some tasks are simply easier. Once a task is learned, shouldn't models handle all variations of it? ("hey chat, do humans have a nose?")
We show that once a task is easy, it stays easy however you present it.
bsky.app/profile/lcho...
LLM robustness is a barrier to real-world deployment: models may solve a task, yet behave inconsistently across variations.
16.02.2026 15:43
LLM "robustness" is often treated like a mysterious, standalone capability.
But what if itβs not? π€
Our new research shows robustness naturally appears when models truly understand a task: competence drives stability.
More details in the thread 🧵 #AI
arxiv.org/pdf/2602.03344
Near the end...
gemini.google.com/share/6f7ed2...
Long fight with it writing what it wants and not what I want, quite embarrassing, really... But it was helpful with several suggestions.
I just saw an LLM making a fluency mistake. How can that happen?! Something about long context?
23.01.2026 21:20
Good luck. As for the tip above: even if you've found something you really like, if it doesn't serve the main story and bottom line of your paper, it just complicates things and should be left out, moved to an appendix, or separated into another paper.
15.01.2026 19:43
"Kill your darlings, cut unless serving the story"
One tip from the guide.
If you needed a moment out of #ICML2026 writing and graphs, why not read some writing and figure-making tips?
docs.google.com/document/d/1...
Maybe converting a known language in various ways and then pretraining on all variations, forcing parallelism. But so far I only have a single project starting to research that. I presume more would be needed: collaborations? A small subfield extending to non-text inputs as well?
14.01.2026 17:45
Well, that is the place where we need to try until we succeed. But I imagine tied weights across languages and ones that are not; I imagine some compression at some point, moving from a per-token model to one that acts as much as it needs; maybe data manipulations. Maybe pre-pretraining?
14.01.2026 17:45
I am thinking of language as a representation, but of the model as a general computation that is mostly beyond decoding this representation into vectors.
So: a larger language component (currently just token embeddings) that forces separation between language-specific and general "thinking".
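A toy sketch of that separation, assuming per-language embedding tables feeding one shared, language-agnostic core. Everything here is hypothetical, just to make the shape concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # width of the shared "thinking" space

# Language-specific subcomponents: each language gets only an embedding
# table that maps its tokens into the shared vector space.
lang_embed = {
    "en": rng.normal(size=(100, D)),  # toy 100-token English vocab
    "fr": rng.normal(size=(100, D)),  # toy 100-token French vocab
}

# The general component: one weight matrix shared across all languages.
core = rng.normal(size=(D, D))


def encode(lang: str, token_ids: list[int]) -> np.ndarray:
    """Language-specific step: tokens -> one vector in the shared space."""
    return lang_embed[lang][token_ids].mean(axis=0)


def think(v: np.ndarray) -> np.ndarray:
    """General step: the same computation no matter which language fed it."""
    return np.tanh(core @ v)


h_en = think(encode("en", [1, 2, 3]))
h_fr = think(encode("fr", [4, 5, 6]))
```

In a real model the core would be the whole transformer stack, and the weight tying or compression speculated about above would live in how the `lang_embed` tables share parameters.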
Shouldn't we look at training models with language as just a tiny subcomponent?
14.01.2026 16:52