Eric Todd

@ericwtodd.bsky.social

CS PhD Student, Northeastern University - Machine Learning, Interpretability https://ericwtodd.github.io

483 Followers  |  182 Following  |  13 Posts  |  Joined: 19.11.2024

Latest posts by ericwtodd.bsky.social on Bluesky

This is a great question! I'm actually not sure why this happens. I do know that the identity accuracy in (3) comes from query promotion - it's close to random guessing of query symbols, and that identity demotion is learned in (7), but I will check out some of these checkpoints and let you know!

29.01.2026 22:57 — 👍 2    🔁 0    💬 0    📌 0

The Art of Wanting.

About the question I see as central in AI ethics, interpretability, and safety. Can an AI take responsibility? I do not think so, but *not* because it's not smart enough.

davidbau.com/archives/20...

27.01.2026 15:32 — 👍 10    🔁 3    💬 1    📌 0

Can models understand each other's reasoning? 🤔

When Model A explains its Chain-of-Thought (CoT), do Models B, C, and D interpret it the same way?

Our new preprint with @davidbau.bsky.social and @csinva.bsky.social explores CoT generalizability 🧵👇

(1/7)

22.01.2026 21:58 — 👍 24    🔁 7    💬 1    📌 0
In-Context Algebra: Understanding the learned algorithms of transformer language models solving abstract algebra problems through in-context learning.

Takeaway: contextual reasoning can be richer than just fuzzy copying!

See the paper for more results, including an analysis of learning dynamics. Work done w/ @jannikbrinkmann.bsky.social, @rohitgandikota.bsky.social & @davidbau.bsky.social!

📜: arxiv.org/abs/2512.16902
🌐: algebra.baulab.info

22.01.2026 16:10 — 👍 12    🔁 2    💬 0    📌 0

Another strategy infers meaning using sets.

We have seen models keep track of "positive" and "negative" sets that let them narrow their understanding of a symbol using Sudoku-style cancellation.

Red bars (a) show the positive set and blue boxes (b) show the negative set.
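
Here's a toy sketch of that elimination idea (my illustration, not the paper's actual circuit), assuming context arrives as triples x*y=z and that the model is trying to pin down which symbol is the identity:

```python
# Infer which symbol is the identity by keeping a candidate set and
# shrinking it with positive and negative evidence from the context.
context = [("c", "b", "c"), ("a", "c", "b")]  # (x, y, z) meaning x*y = z
candidates = set("abc")                       # every symbol starts possible

for x, y, z in context:
    if z == x:
        candidates &= {y}   # positive evidence: y left x unchanged
    else:
        candidates -= {y}   # negative evidence: y changed x, so y != identity

print(candidates)  # {'b'}
```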

22.01.2026 16:10 — 👍 6    🔁 0    💬 1    📌 0

What in-context mechanisms do we find, other than copying?

The first one is the "identity rule": the answer is just the query with a recognized "identity" symbol removed, where a context equation like "ab=a" is what reveals that symbol.

@taylorwwebb.bsky.social has seen this in LLMs too!
bsky.app/profile/tay...
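
Read concretely, the rule might look like this (a hypothetical rendering, not the model's implementation):

```python
# Once context like "ab=a" reveals that "b" is the identity symbol,
# a query is answered by deleting "b" from it.
identity = "b"                      # inferred from the context equation ab=a
query = "cb"
print(query.replace(identity, ""))  # -> "c", i.e. cb = c
```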

22.01.2026 16:10 — 👍 7    🔁 0    💬 1    📌 0

Our work maps out several context-based algorithms (copy, identity, commutativity, cancellation, & associativity). We use targeted data distributions to measure and dissect each strategy.

These five strategies explain almost all of our model's in-context performance!

22.01.2026 16:10 — 👍 8    🔁 0    💬 1    📌 0

If you pick a random puzzle (try one here: algebra.baulab.info), you'll see there's often more than one way to understand context.

@nelhage.bsky.social & @neelnanda.bsky.social found LLMs infer meaning by induction-style copying, and that happens here too. But there are many other strategies.
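
For reference, here is a minimal sketch of induction-style copying (the literal-match version; a fuzzy variant would match representations rather than exact tokens):

```python
# To continue a sequence, find the previous occurrence of the current
# token and copy whatever followed it.
def induction_guess(tokens):
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

print(induction_guess(list("cb=c|ac=b|a")))  # 'c': earlier "a" was followed by "c"
```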

22.01.2026 16:10 — 👍 6    🔁 0    💬 1    📌 0

Can you solve this algebra puzzle? 🧩

cb=c, ac=b, ab=?

A small transformer can learn to solve problems like this!

And since the letters don't have inherent meaning, this lets us study how context alone imparts meaning. Here's what we found: 🧵⬇️
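
If you want to check the answer yourself, here's a brute-force sketch (my code; it assumes the hidden operation is a group over {a, b, c}, matching the paper's abstract-algebra setup):

```python
from itertools import product

# Enumerate all 3^9 binary operations on {a, b, c}, keep those that
# satisfy the group axioms plus the context equations cb=c and ac=b,
# and collect the surviving values of ab.
ELEMS = "abc"
answers = set()

for values in product(ELEMS, repeat=9):
    op = dict(zip(product(ELEMS, ELEMS), values))
    if op["c", "b"] != "c" or op["a", "c"] != "b":
        continue  # violates the context equations
    ids = [e for e in ELEMS if all(op[e, x] == x == op[x, e] for x in ELEMS)]
    if not ids:
        continue  # no identity element
    e = ids[0]
    if not all(any(op[x, y] == e for y in ELEMS) for x in ELEMS):
        continue  # some element lacks an inverse
    if any(op[op[x, y], z] != op[x, op[y, z]]
           for x, y, z in product(ELEMS, repeat=3)):
        continue  # not associative
    answers.add(op["a", "b"])

print(answers)  # {'a'}: b must be the identity, so ab = a
```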

22.01.2026 16:09 — 👍 47    🔁 10    💬 2    📌 2

Humans and LLMs think fast and slow. Do SAEs recover slow concepts in LLMs? Not really.

Our Temporal Feature Analyzer discovers contextual features in LLMs that detect event boundaries, parse complex grammar, and represent ICL patterns.

13.11.2025 22:31 — 👍 18    🔁 8    💬 1    📌 1

LLMs have been shown to provide different predictions in clinical tasks when patient race is altered. Can SAEs spot this undue reliance on race? 🧵

Work w/ @byron.bsky.social

Link: arxiv.org/abs/2511.00177

05.11.2025 15:20 — 👍 5    🔁 2    💬 1    📌 1

Interested in doing a PhD at the intersection of human and machine cognition? ✨ I'm recruiting students for Fall 2026! ✨

Topics of interest include pragmatics, metacognition, reasoning, & interpretability (in humans and AI).

Check out JHU's mentoring program (due 11/15) for help with your SoP 👇

04.11.2025 14:44 — 👍 27    🔁 15    💬 0    📌 1

How can a language model find the veggies in a menu?

New pre-print where we investigate the internal mechanisms of LLMs when filtering on a list of options.

Spoiler: it turns out LLMs use strategies surprisingly similar to functional programming (think "filter" from Python)! 🧵
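
Spelled out, the analogy is just this (illustrative; the menu and labels are invented):

```python
# Filtering a list of options down to the ones satisfying a predicate,
# exactly what Python's built-in filter does.
menu = ["burger", "salad", "steak", "grilled veggies"]
is_veggie = {"salad", "grilled veggies"}  # hypothetical labels

vegetarian = list(filter(lambda item: item in is_veggie, menu))
print(vegetarian)  # ['salad', 'grilled veggies']
```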

04.11.2025 17:48 — 👍 23    🔁 9    💬 1    📌 2

Looking forward to attending #COLM2025 this week! Would love to meet up and chat with others about interpretability + more. DMs are open if you want to connect. Be sure to check out @sfeucht.bsky.social's very cool work on understanding concepts in LLMs tomorrow morning (Poster 35)!

06.10.2025 15:00 — 👍 2    🔁 0    💬 0    📌 0

What's the right unit of analysis for understanding LLM internals? We explore this in our mech interp survey (a major update from our 2024 manuscript).

We've added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

01.10.2025 14:03 — 👍 40    🔁 14    💬 2    📌 2

Who is going to be at #COLM2025?

I want to draw your attention to a COLM paper by my student @sfeucht.bsky.social that has totally changed the way I think and teach about LLM representations. The work is worth knowing.

And you can meet Sheridan at COLM, Oct 7!
bsky.app/profile/sfe...

27.09.2025 20:54 — 👍 39    🔁 8    💬 1    📌 2

Announcing a broad expansion of the National Deep Inference Fabric.

This could be relevant to your research...

26.09.2025 18:47 — 👍 11    🔁 3    💬 1    📌 2

"AI slop" seems to be everywhere, but what exactly makes text feel like "slop"?

In our new work (w/ @tuhinchakr.bsky.social, Diego Garcia-Olano, @byron.bsky.social) we provide a systematic attempt at measuring AI "slop" in text!

arxiv.org/abs/2509.19163

🧵 (1/7)

24.09.2025 13:21 — 👍 31    🔁 16    💬 1    📌 1

Wouldn't it be great to have questions about LM internals answered in plain English? That's the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇 (1/8)

17.09.2025 19:19 — 👍 26    🔁 8    💬 1    📌 1
New England Mechanistic Interpretability Workshop
About: The New England Mechanistic Interpretability (NEMI) workshop aims to bring together academic and industry researchers from the New England and surround...

This Friday, NEMI 2025 is at Northeastern in Boston: 8 talks, 24 roundtables, 90 posters, and 200+ attendees. Thanks to goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/

If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...

18.08.2025 18:06 — 👍 16    🔁 7    💬 1    📌 3

We've added a quick new section to this paper, which was just accepted to @COLM_conf! By summing weights of concept induction heads, we created a "concept lens" that lets you read out semantic information in a model's hidden states. 🔎
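
In spirit, the construction looks something like this (a rough sketch with random placeholder weights, not the paper's exact recipe):

```python
import numpy as np

# Sum the output weights of selected heads into one matrix, then project
# a hidden state through it and the unembedding to read out tokens.
d_model, vocab, n_heads = 64, 100, 4
rng = np.random.default_rng(0)

head_W_O = [rng.normal(size=(d_model, d_model)) for _ in range(n_heads)]
W_lens = sum(head_W_O)                   # the combined "concept lens"
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix

hidden = rng.normal(size=(d_model,))     # a residual-stream hidden state
logits = hidden @ W_lens @ W_U           # lens readout in token space
print(int(logits.argmax()))              # token most promoted by the lens
```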

22.07.2025 12:39 — 👍 7    🔁 1    💬 1    📌 0

I'm excited for NEMI again this year! I've enjoyed local research meetups and getting to know others near me working on interesting problems.

30.06.2025 23:00 — 👍 1    🔁 0    💬 0    📌 0
NEMI 2024 (Last Year)

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening Aug 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register: forms.gle/v4kJCweE3UUH...

30.06.2025 22:55 — 👍 10    🔁 8    💬 0    📌 1

How do language models track the mental states of each character in a story, often referred to as Theory of Mind?

We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!
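
As a toy version of that analogy (illustrative only, not the model's actual circuit):

```python
# Each character's belief is a reference to a state dict; a character
# who misses an update is left holding a stale snapshot.
world = {"box": "chocolate"}               # true state of the world
beliefs = {"Sally": world, "Anne": world}  # both initially point at reality

beliefs["Sally"] = dict(world)  # Sally leaves the room: her view is frozen
world["box"] = "marbles"        # Anne swaps the contents; her reference
                                # still tracks the live state

print(beliefs["Sally"]["box"])  # 'chocolate' -> a false belief
print(beliefs["Anne"]["box"])   # 'marbles'
```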

24.06.2025 17:13 — 👍 58    🔁 19    💬 2    📌 1

Can we uncover the list of topics a language model is censored on?

Refused topics vary strongly among models. Claude-3.5 vs DeepSeek-R1 refusal patterns:

13.06.2025 15:58 — 👍 8    🔁 4    💬 1    📌 0

I'm not familiar with the reviewing load for ARR, but for COLM this year I was only assigned 2 papers as a reviewer, which is great. I had more time to try and understand each submission, and it was much more manageable than getting assigned 6+ papers like ICML and NeurIPS do.

29.05.2025 00:14 — 👍 5    🔁 0    💬 0    📌 0

I'll present a poster for this work at NENLP tomorrow! Come find me at poster #80...

10.04.2025 21:19 — 👍 7    🔁 1    💬 0    📌 0

Sheridan asks whether the Dual Route Model of Reading that psychologists have observed in humans also appears in LLMs.

In her brilliantly simple study of induction heads, she finds that it does! Induction has a Dual Route that separates concepts from literal token processing.

Worth reading ↘️

07.04.2025 15:23 — 👍 7    🔁 2    💬 0    📌 0

[📄] Are LLMs mindless token-shifters, or do they build meaningful representations of language? We study how LLMs copy text in-context, and physically separate out two types of induction heads: token heads, which copy literal tokens, and concept heads, which copy word meanings.
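
An illustrative contrast between the two head types (a made-up mapping, not the paper's heads):

```python
# A "token head" copies the literal token; a "concept head" copies the
# underlying meaning and can re-render it in another surface form.
concept_of = {"chat": "CAT", "cat": "CAT"}  # French/English forms -> concept
render_en = {"CAT": "cat"}

def token_head(prev_token):
    return prev_token                          # literal repetition

def concept_head(prev_token):
    return render_en[concept_of[prev_token]]   # meaning survives translation

print(token_head("chat"))    # 'chat'
print(concept_head("chat"))  # 'cat'
```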

07.04.2025 13:54 — 👍 76    🔁 19    💬 1    📌 6

I reviewed for ICML this year, and it felt to me like the paper quality was lower than in my previous reviewing assignments. In my batch I had 3/7 that I'd consider low-quality submissions. The review process was also more involved (but hopefully it allows for a better feedback mechanism).

25.03.2025 22:06 — 👍 1    🔁 0    💬 0    📌 0
