
Aaron Mueller

@amuuueller.bsky.social

Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS

2,365 Followers  |  327 Following  |  43 Posts  |  Joined: 08.11.2024

Latest posts by amuuueller.bsky.social on Bluesky


Representation steering is now a common way to mitigate LLM shortcuts. How much legitimate knowledge does this tend to remove? Turns out that these methods can be surprisingly precise! But also: no single steering operation will fix all shortcuts.

Led by @shanzzyy.bsky.social!
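For context, a minimal sketch of the basic steering operation the post refers to: adding (or subtracting) a scaled direction to a hidden activation. All names and values below are illustrative, not the paper's method.

```python
import numpy as np

def steer(hidden, direction, alpha=4.0):
    """Shift a hidden activation along a unit-normalized steering direction.

    hidden:    activation vector at some layer, shape (d,)
    direction: steering vector, e.g. a difference of mean activations
               between two classes of examples (illustrative)
    alpha:     steering strength; its sign amplifies or suppresses the concept
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
h = rng.normal(size=8)               # toy hidden state
d = rng.normal(size=8)               # toy steering direction
h_steered = steer(h, d, alpha=-2.0)  # negative alpha pushes away from the concept
```

The precision question in the post then becomes: how much does such a shift change behavior beyond the targeted shortcut?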

21.01.2026 00:32 — 👍 6    🔁 0    💬 0    📌 0

Congrats!!

10.01.2026 08:52 — 👍 1    🔁 0    💬 0    📌 0

New book! I have written a book called Syntax: A cognitive approach, published by MIT Press.

This is open access; MIT Press will post a link soon, but until then, the book is available on my website:
tedlab.mit.edu/tedlab_websi...

24.12.2025 19:55 — 👍 122    🔁 41    💬 2    📌 3

I also want to mention that the lang x computation research community at BU is growing in an exciting direction, especially with new faculty like @amuuueller.bsky.social, @anthonyyacovone.bsky.social, @nsaphra.bsky.social, &
@profsophie.bsky.social! Also, Boston is quite nice :)

19.11.2025 17:20 — 👍 9    🔁 2    💬 1    📌 0
Priors in Time: Missing Inductive Biases for Language Model Interpretability Recovering meaningful concepts from language model activations is a central aim of interpretability. While existing feature extraction methods aim to identify concepts that are independent directions,...

Check out the paper and our demo features!

📜 Preprint: arxiv.org/abs/2511.01836
🧠 Play with temporal feature analysis on Neuronpedia: www.neuronpedia.org/gemma-2-2b/1...

14.11.2025 15:48 — 👍 2    🔁 0    💬 0    📌 0

I'm glossing over our deeper motivations from neuroscience (predictive coding) and linguistics here, but we believe there's significant cross-field appeal for those interested in intersections of cog sci, neuroscience, and machine learning!

14.11.2025 15:48 — 👍 1    🔁 0    💬 1    📌 0

Jeff Elman famously showed us in 1990 that time is a rich signal in itself.

Our work demonstrates that this lesson applies equally well to interpretability methods. The inductive biases of interp methods should reflect the structure of what is being studied.

14.11.2025 15:48 — 👍 1    🔁 0    💬 1    📌 0

TFA is designed to capture context-sensitive information. Consider parsing: SAEs often assign each word to its most frequent syntactic category, regardless of context.

Meanwhile, TFA recovers the correct parse given the context!

14.11.2025 15:48 — 👍 1    🔁 0    💬 1    📌 0

In LLMs, concepts aren’t static: they evolve through time and have rich temporal dependencies.

We introduce Temporal Feature Analysis (TFA) to separate what's inferred from context vs. novel information. A big effort led by @ekdeepl.bsky.social, @sumedh-hindupur.bsky.social, @canrager.bsky.social!
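As a toy illustration of separating context-inferred from novel information (the paper's actual method is more sophisticated; the function below is a hypothetical sketch, with ordinary least squares standing in for whatever predictor one prefers):

```python
import numpy as np

def split_predictable(acts):
    """Toy decomposition of a (T, d) activation trajectory into a part
    linearly predictable from the previous timestep and a 'novel' residual.
    A conceptual sketch only, not the paper's method."""
    past, present = acts[:-1], acts[1:]
    W, *_ = np.linalg.lstsq(past, present, rcond=None)
    predicted = past @ W           # the part inferable from context
    novel = present - predicted    # the part treated as new information
    return predicted, novel

rng = np.random.default_rng(0)
acts = np.cumsum(rng.normal(size=(50, 4)), axis=0)  # toy trajectory with temporal structure
predicted, novel = split_predictable(acts)
```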

14.11.2025 15:48 — 👍 21    🔁 4    💬 1    📌 0
Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C.

09:00 am: Opening
09:10 am: Invited Talks by Sarah Wiegreffe and John Hewitt
10:20 am: Paper Presentations

Lunch Break

01:00 pm: Invited Talks by Aaron Mueller and Kyle Mahowald
02:10 pm: Poster Session
03:20 pm: Roundtable Discussion
04:50 pm: Closing


✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social, John Hewitt, @amuuueller.bsky.social, and @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion

Join us in Montréal! @colmweb.org

09.10.2025 17:30 — 👍 3    🔁 4    💬 0    📌 0

Aruna Sankaranarayanan, @arnabsensharma.bsky.social @ericwtodd.bsky.social @davidbau.bsky.social @boknilev.bsky.social (2/2)

01.10.2025 14:03 — 👍 3    🔁 0    💬 0    📌 0

Thanks again to the co-authors! Such a wide survey required a lot of perspectives. @jannikbrinkmann.bsky.social Millicent Li, Samuel Marks, @koyena.bsky.social @nikhil07prakash.bsky.social @canrager.bsky.social (1/2)

01.10.2025 14:03 — 👍 4    🔁 0    💬 1    📌 0
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not...

See the paper for more details! arxiv.org/abs/2408.01416

01.10.2025 14:03 — 👍 4    🔁 0    💬 1    📌 0

We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.

01.10.2025 14:03 — 👍 3    🔁 0    💬 1    📌 0

One of the bigger changes was establishing criteria for success in interpretability. What units of analysis should you use if you know what you’re looking for? If you *don’t* know what you’re looking for?

01.10.2025 14:03 — 👍 2    🔁 0    💬 1    📌 0

What's the right unit of analysis for understanding LLM internals? We explore this question in our mech interp survey (a major update from our 2024 manuscript).

We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

01.10.2025 14:03 — 👍 41    🔁 14    💬 2    📌 2
What do representations tell us about a system? Image of a mouse with a scope showing a vector of activity patterns, and a neural network with a vector of unit activity patterns
Common analyses of neural representations: Encoding models (relating activity to task features) drawing of an arrow from a trace saying [on_____on____] to a neuron and spike train. Comparing models via neural predictivity: comparing two neural networks by their R^2 to mouse brain activity. RSA: assessing brain-brain or model-brain correspondence using representational dissimilarity matrices


In neuroscience, we often try to understand systems by analyzing their representations β€” using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:
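For readers who want a concrete reference point, a minimal RSA sketch along the lines described above: build each system's representational dissimilarity matrix (RDM) and correlate them. Toy data; the names and the construction of `model` are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """Condensed representational dissimilarity matrix: pairwise
    correlation distance between stimulus response patterns.
    responses: (n_stimuli, n_units)"""
    return pdist(responses, metric="correlation")

def rsa_score(a, b):
    """Spearman correlation between two systems' RDMs."""
    rho, _ = spearmanr(rdm(a), rdm(b))
    return rho

rng = np.random.default_rng(0)
brain = rng.normal(size=(20, 100))  # e.g. recorded activity: 20 stimuli x 100 units
model = 2.0 * brain[:, ::-1]        # a 'model' with identical geometry: units
                                    # permuted and rescaled, so the RDMs match
score = rsa_score(brain, model)
```

Note that such similarity scores are invariant to unit permutations and rescalings, which is one reason they may be biased toward a subset of what a system represents.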

05.08.2025 14:36 — 👍 170    🔁 53    💬 5    📌 0

If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.

We're planning to keep this a living benchmark; come by and share your ideas/hot takes!

17.07.2025 17:45 — 👍 13    🔁 3    💬 0    📌 0
@nikhil07prakash.bsky.social How do language models track mental states of each character in a story, often referred to as Theory of Mind? We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...

70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.

bsky.app/profile/nik...
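The double-pointer analogy can be sketched outside the model: one level of indirection per level of belief. A toy illustration only, not the paper's actual mechanism inside the LLM.

```python
# Toy illustration of double indirection for nested belief
# ("where does Sally think Ann thinks the marble is").
world = {"marble_location": "basket"}  # ground truth

ann_view = {"marble_location": "box"}  # Ann's (false) belief
sally_view_of_ann = {"ann": ann_view}  # Sally's reference to Ann's view

# Answering the second-order question follows two references,
# like dereferencing a double (**) pointer in C.
answer = sally_view_of_ann["ann"]["marble_location"]
```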

25.06.2025 15:00 — 👍 29    🔁 3    💬 1    📌 0

We still have a lot to learn in editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.

27.05.2025 17:07 — 👍 4    🔁 0    💬 0    📌 0

By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods—and at some locations, we outperform them!

27.05.2025 17:07 — 👍 2    🔁 0    💬 1    📌 0

We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
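One cheap way to operationalize "output-ness", sketched here as a hypothetical heuristic (the paper's proposed methods may differ): project a feature's decoder direction through the unembedding and measure how concentrated the induced next-token distribution is.

```python
import numpy as np

def output_feature_score(decoder_vec, W_U, top_k=5):
    """Heuristic 'output-ness' of an SAE feature: the share of next-token
    probability mass (softmax of the induced logits) on the top-k tokens.
    decoder_vec: (d_model,), W_U: (d_model, vocab).
    Illustrative heuristic, not necessarily the paper's method."""
    logits = decoder_vec @ W_U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.sort(probs)[-top_k:].sum()

vocab, d = 50, 8
W_U = np.zeros((d, vocab))
W_U[:, :d] = np.eye(d)                 # toy unembedding: first d tokens axis-aligned

peaked = np.zeros(d); peaked[0] = 5.0  # boosts a single token's logit
diffuse = np.full(d, 0.1)              # spreads tiny mass everywhere
```

Under this toy setup, `peaked` scores far higher than `diffuse`, matching the intuition that output features concentrate probability on specific tokens.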

27.05.2025 17:07 — 👍 1    🔁 0    💬 1    📌 0

SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!

27.05.2025 17:07 — 👍 16    🔁 1    💬 1    📌 0

Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵

27.05.2025 16:06 — 👍 18    🔁 6    💬 2    📌 3

Couldn’t be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social

12.05.2025 15:48 — 👍 3    🔁 1    💬 1    📌 0

... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!

23.04.2025 18:15 — 👍 6    🔁 0    💬 0    📌 0

This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, IvΓ‘n Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...

23.04.2025 18:15 — 👍 8    🔁 1    💬 1    📌 1
MIB: A Mechanistic Interpretability Benchmark How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spann...

We’re eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you’ll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151

23.04.2025 18:15 — 👍 5    🔁 0    💬 1    📌 0
MIB – Project Page

We release many public resources, including:

🌐 Website: mib-bench.github.io
📄 Data: huggingface.co/collections/...
💻 Code: github.com/aaronmueller...
📊 Leaderboard: Coming very soon!

23.04.2025 18:15 — 👍 3    🔁 1    💬 1    📌 0

These results highlight that there has been real progress in the field! We also recovered known findings, like that integrated gradients improves attribution quality. This is a sanity check verifying that our benchmark is capturing something real.
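For reference, since integrated gradients is mentioned above: a minimal sketch of the method for a differentiable scalar function, averaging gradients along the straight path from a baseline to the input (toy function; not the benchmark's implementation).

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline=None, steps=64):
    """Riemann (midpoint) approximation of integrated gradients:
    (x - baseline) * average gradient of f along the straight path
    from baseline to x. grad_f returns the gradient of a scalar f."""
    if baseline is None:
        baseline = np.zeros_like(x)
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model: f(x) = sum(x**2), whose gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
attr = integrated_gradients(lambda z: 2 * z, x)
# Completeness property: attributions sum to f(x) - f(baseline).
```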

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
