
Leshem (Legend) Choshen @EMNLP

@lchoshen.bsky.social

πŸ₯‡ LLMs together (co-created model merging, BabyLM, textArena.ai) πŸ₯ˆ Spreading science over hype in #ML & #NLP Proud shareLMπŸ’¬ Donor @IBMResearch & @MIT_CSAIL

3,102 Followers  |  777 Following  |  949 Posts  |  Joined: 30.08.2023

Posts by Leshem (Legend) Choshen @EMNLP (@lchoshen.bsky.social)

Preview
BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop The goal of BabyLM is to stimulate new research connections between cognitive modeling and language model pretraining. We invite contributions in this vein to the BabyLM Workshop, which will also ...

Read more:
πŸ“„ Call: arxiv.org/abs/2602.20092
🌐 Website: babylm.github.io

See you at EMNLP 2026!

02.03.2026 15:27 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

With @emnlpmeeting.bsky.social
@lchoshen.bsky.social Ryan Cotterell @momergul.bsky.social @jumelet.bsky.social @tallinzen.bsky.social @amuuueller.bsky.social
@suchirsalhan.bsky.social Raj Sanjay Shah @alexwarstadt.bsky.social @wegotlieb.bsky.social

02.03.2026 15:27 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Beyond the challenge, we welcome workshop papers on: Efficient NLP, Language acquisition & cognitive modeling, Small-scale model evaluation, Multilingual/ low-resource settings, and <Your Baby-Sized BIG Idea Here>

02.03.2026 15:27 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

New this year: Multilingual Track 🌍
Babies learn more than English.

We want:
β€’ Cross-lingual transfer
β€’ Typological diversity
β€’ Low-resource experiments
β€’ Developmentally plausible learning beyond English

Small data. Many languages. No excuses.

02.03.2026 15:27 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

πŸ‘Ά BabyLM is back at EMNLP 2026!
We are excited to announce that the 4th BabyLM Challenge & Workshop will once again bring together researchers interested in sample-efficient, developmentally plausible language modeling.
@emnlpmeeting

More in🧡

02.03.2026 15:27 β€” πŸ‘ 7    πŸ” 2    πŸ’¬ 1    πŸ“Œ 0

Hopefully not that long

25.02.2026 16:02 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Not convinced?
Read more: papers.ssrn.com/sol3/papers....

Or convince me that narrow agents scale better.
I’m (skeptically) listening.

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

Convinced? What now?
Come build agents, components, protocols, and evaluations that suit general agents.

Or just test existing ones; they're already surprisingly cost-effective.

25.02.2026 13:58 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Counterintuitive take (I think):

General systems may be easier to regulate.

Central abstraction β†’ clear intervention points.
Shared standards β†’ shared scrutiny.
Fragmented bespoke agents are harder to audit.

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Ever heard of the Internet? It scaled because of a narrow-waist protocol (IP).

Similarly, abstractions could be AI's narrow waist.
Agents receive information about the environment and interact to learn the rest,
decoupling tools, models, and environments from the agent's code.

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
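The decoupling described above can be sketched as a minimal narrow-waist interface. This is only an illustrative sketch, not the paper's design; all class and method names here are hypothetical:

```python
from dataclasses import dataclass
from typing import Protocol


class Environment(Protocol):
    # Anything the agent can observe and act in (browser, CLI, code repo...).
    def observe(self) -> str: ...
    def step(self, action: str) -> str: ...


class Model(Protocol):
    # Any LLM backend; swappable without touching the agent's logic.
    def complete(self, prompt: str) -> str: ...


@dataclass
class GeneralAgent:
    model: Model

    def run(self, env: Environment, max_steps: int = 3) -> list[str]:
        # The agent only speaks the narrow-waist interface (observe/step),
        # so tools, models, and environments vary independently behind it.
        transcript = []
        for _ in range(max_steps):
            obs = env.observe()
            action = self.model.complete(f"Observation: {obs}\nAction:")
            transcript.append(env.step(action))
        return transcript
```

Because `Environment` and `Model` are structural protocols, a new benchmark only needs to implement `observe`/`step`; the agent code itself never changes.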

Domain-specific agents often overfit to a narrow task.
By tackling diverse environments, we remove brittle assumptions.
The result? We find "real" solutions and algorithms that are robust by design.

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

General β‰  β€œdo everything badly.”

General = strong default β†’ specialize cheaply.

Like pretraining β†’ fine-tuning.
You didn’t pretrain grandpa BERTπŸ‘΄ for every task.
Why rebuild agents from scratch?

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

History is pretty clear:

Heuristics β†’ algorithms β†’ learning systems β†’ foundation models.

Each time, the more general & more scalable approach wins.

Why would agents be the first exception?

Paper:
papers.ssrn.com/sol3/papers....

25.02.2026 13:58 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0
Post image

Agents should be general

Why are we building code agents, CLI agents, browser agents separately?
Why does adapting to a new benchmark take a month?

Our collaboration brings diverse views: pros here, cons in the paper,
and your pushback if I'm wrong.

Argument + paper linkπŸ‘‡πŸ§΅
#MCP #ai #LLMs #agents πŸ€–πŸ“ˆπŸ§ 

25.02.2026 13:58 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Don't complain. Do it yourself.
When the @eval-eval.bsky.social coalition started studying together what is broken in evaluation, I knew what we needed to do.
We need to digitize evals.
Why is every evaluation reported differently, each in a separate place?

We built Every Eval Ever:
πŸ’ͺπŸ€–πŸ“ˆπŸ§ 

17.02.2026 15:10 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

And for those worried about saturation: relax, there's a silver lining.
A rare deep thought in current evaluation; love @shir-ashury-tahan.bsky.social's work on it!

Oh, and I forgot to open with "MIT & IBM finds" (or did I?)

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

At a high level, as tasks saturate, robustness follows.
For researchers, this suggests that robustness engineering may become less important as performance increases.
For practitioners, it indicates that easier tasks are already reliable enough for real‑world deployment.

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We further find that robustness is mainly driven by task‑specific competenceπŸ’ͺ, not some separate, inherent "robustness capability"πŸ˜΅β€πŸ’« of the model, challenging common assumptions.

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

Across multiple models, datasets, and configurations we observe a strong positive correlation between task performance and robustness. This effect goes beyond the "trivial robustness" expected from high accuracy and appears consistently across diverse architectures.

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
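The performance-robustness correlation above can be sketched in miniature. A minimal illustration on synthetic data, where robustness is measured as the fraction of paraphrased prompt variants a model still answers correctly (the metric and all data here are hypothetical, not from the paper):

```python
from statistics import mean

# Hypothetical per-task results: accuracy on the canonical prompt, and
# correctness (1/0) across five paraphrased prompt variants.
results = {
    "task_a": {"accuracy": 0.95, "variant_correct": [1, 1, 1, 1, 0]},
    "task_b": {"accuracy": 0.60, "variant_correct": [1, 0, 1, 0, 0]},
    "task_c": {"accuracy": 0.30, "variant_correct": [0, 0, 1, 0, 0]},
}


def robustness(variant_correct):
    # One simple robustness notion: fraction of prompt variants the model
    # still answers correctly (higher = more stable under rephrasing).
    return mean(variant_correct)


def pearson(xs, ys):
    # Plain Pearson correlation, no external dependencies.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


accs = [r["accuracy"] for r in results.values()]
robs = [robustness(r["variant_correct"]) for r in results.values()]
print(round(pearson(accs, robs), 3))
```

On this toy data the correlation is strongly positive; the claim in the thread is that the real effect survives even after controlling for the "trivial robustness" that high accuracy alone implies.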

We thoughtπŸ€”
Models tend to learn tasks in a similar order: some tasks are simply easier. Once a task is learned, shouldn't models handle all variations of it? ("hey chat, do humans have a nose?")
We show that once a task is easy, present it as you will, it stays easy.

bsky.app/profile/lcho...

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

LLM robustness is a barrier to real-world deployment: models may solve a task, yet behave inconsistently across variations of it.

16.02.2026 15:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

LLM "robustness" is often treated like a mysterious, standalone capability.
But what if it’s not? πŸ€”
Our new research shows robustness naturally appears when models truly understand a task - competence drives stability.

More details in the thread πŸ‘‡ πŸ€–πŸ“ˆπŸ§  #AI
arxiv.org/pdf/2602.03344

16.02.2026 15:43 β€” πŸ‘ 5    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Near the end...
gemini.google.com/share/6f7ed2...

A long fight with it writing what it wants and not what I want; quite embarrassing, really... But it was helpful with several suggestions.

23.01.2026 21:20 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Post image

I just saw an LLM making a fluency mistake. How can that happen?! Something about long context?

23.01.2026 21:20 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Good luck. As for the tip above: even if you've found something you really like, if it doesn't serve the main story and bottom line of your paper, it just complicates things and should be left out, moved to an appendix, or split into a separate paper.

15.01.2026 19:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0
Preview
Ever Growing Academic Writing (Call for Collaboration): this aims to help academic writers. If you have any additions or corrections, please add them or comment ...

"Kill your darlings, cut unless serving the story"
One tip from the guide.

If you needed a moment away from #ICML2026 writing and graphs, why not read some writing and figure-making tips?
docs.google.com/document/d/1...

15.01.2026 19:43 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Maybe converting a known language in various ways and then pretraining on all the variations, forcing parallelism. But so far I have only a single project starting to research that. I presume more would be needed; collaborations? A small subfield going beyond text inputs too?

14.01.2026 17:45 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Well, that's where we need to try until we succeed. But I imagine tied weights across languages alongside ones that are not; some compression at some point, moving from a per-token model to one that computes only as much as it needs; maybe data manipulations. Maybe pre-pretraining?

14.01.2026 17:45 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

I am thinking of language as a representation, and of the model as general computation that mostly goes beyond decoding this representation into vectors.
So: a larger language component (currently just token embeddings) that forces a separation between language-specific and general "thinking".

14.01.2026 17:26 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Shouldn't we look at training models with language as just a tiny subcomponent?

14.01.2026 16:52 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0