
Aaron Mueller

@amuuueller.bsky.social

Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS

2,318 Followers  |  325 Following  |  36 Posts  |  Joined: 08.11.2024

Latest posts by amuuueller.bsky.social on Bluesky

Schedule for the INTERPLAY workshop at COLM on October 10th, Room 518C.

09:00 am: Opening
09:10 am: Invited Talks by Sarah Wiegreffe and John Hewitt
10:20 am: Paper Presentations

Lunch Break

01:00 pm: Invited Talks by Aaron Mueller and Kyle Mahowald
02:10 pm: Poster Session
03:20 pm: Roundtable Discussion
04:50 pm: Closing

✨ The schedule for our INTERPLAY workshop at COLM is live! ✨
🗓️ October 10th, Room 518C
🔹 Invited talks from @sarah-nlp.bsky.social John Hewitt @amuuueller.bsky.social @kmahowald.bsky.social
🔹 Paper presentations and posters
🔹 Closing roundtable discussion.

Join us in Montréal! @colmweb.org

09.10.2025 17:30 — 👍 2    🔁 4    💬 0    📌 0

Aruna Sankaranarayanan, @arnabsensharma.bsky.social @ericwtodd.bsky.social @davidbau.bsky.social @boknilev.bsky.social (2/2)

01.10.2025 14:03 — 👍 3    🔁 0    💬 0    📌 0

Thanks again to the co-authors! Such a wide survey required a lot of perspectives. @jannikbrinkmann.bsky.social Millicent Li, Samuel Marks, @koyena.bsky.social @nikhil07prakash.bsky.social @canrager.bsky.social (1/2)

01.10.2025 14:03 — 👍 4    🔁 0    💬 1    📌 0
The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not...

See the paper for more details! arxiv.org/abs/2408.01416

01.10.2025 14:03 — 👍 4    🔁 0    💬 1    📌 0

We also made the causal graph formalism more precise. Interpretability and causality are intimately linked; the latter makes the former more trustworthy and rigorous. This formal link should be strengthened in future work.

01.10.2025 14:03 — 👍 3    🔁 0    💬 1    📌 0

One of the bigger changes was establishing criteria for success in interpretability. What units of analysis should you use if you know what you're looking for? If you *don't* know what you're looking for?

01.10.2025 14:03 — 👍 2    🔁 0    💬 1    📌 0

What's the right unit of analysis for understanding LLM internals? We explore in our mech interp survey (a major update from our 2024 ms).

We've added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!

01.10.2025 14:03 — 👍 38    🔁 14    💬 2    📌 2
What do representations tell us about a system? Image of a mouse with a scope showing a vector of activity patterns, and a neural network with a vector of unit activity patterns
Common analyses of neural representations: Encoding models (relating activity to task features) drawing of an arrow from a trace saying [on_____on____] to a neuron and spike train. Comparing models via neural predictivity: comparing two neural networks by their R^2 to mouse brain activity. RSA: assessing brain-brain or model-brain correspondence using representational dissimilarity matrices

In neuroscience, we often try to understand systems by analyzing their representations, using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:
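The RSA comparison mentioned above can be sketched in a few lines. This is a generic illustration with synthetic data; the array shapes and the toy "brain"/"model" setup are my assumptions, not material from the commentary:

```python
import numpy as np

def rdm(acts):
    """Representational dissimilarity matrix: 1 - correlation
    between the response patterns for each pair of stimuli."""
    return 1.0 - np.corrcoef(acts)

def rsa_score(acts_a, acts_b):
    """RSA: correlate the upper triangles of two RDMs to get a
    brain-brain or model-brain correspondence score."""
    iu = np.triu_indices(acts_a.shape[0], k=1)
    return float(np.corrcoef(rdm(acts_a)[iu], rdm(acts_b)[iu])[0, 1])

rng = np.random.default_rng(0)
latents = rng.normal(size=(20, 5))            # 20 stimuli, 5 shared latent features
acts_a = latents @ rng.normal(size=(5, 100))  # "brain": 100 recorded units
acts_b = latents @ rng.normal(size=(5, 80))   # "model": 80 network units
score = rsa_score(acts_a, acts_b)
```

Because both toy systems here are linear readouts of the same latents, their RDMs correlate strongly; the question the thread raises is whether such analyses systematically favor some of what a system represents over the rest.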

05.08.2025 14:36 — 👍 163    🔁 53    💬 5    📌 0

If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a Mechanistic Interpretability Benchmark.

We're planning to keep this a living benchmark; come by and share your ideas/hot takes!

17.07.2025 17:45 — 👍 13    🔁 3    💬 0    📌 0
@nikhil07prakash.bsky.social How do language models track mental states of each character in a story, often referred to as Theory of Mind? We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...

70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.

bsky.app/profile/nik...

25.06.2025 15:00 — 👍 28    🔁 3    💬 1    📌 0

We still have a lot to learn about editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.

27.05.2025 17:07 — 👍 4    🔁 0    💬 0    📌 0

By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods, and at some locations, we outperform them!

27.05.2025 17:07 — 👍 2    🔁 0    💬 1    📌 0

We define the notion of an "output feature", whose role is to increase p(some token(s)). Steering these gives better results than steering "input features", whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
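To make the input/output distinction concrete, here is a toy sketch (my own illustration; the tokens, the tiny unembedding matrix, and the heuristic are all hypothetical, not the paper's actual method): an output feature's decoder direction, projected through the unembedding, visibly boosts the logits of specific tokens.

```python
import numpy as np

# Toy unembedding: rows are residual-stream axes, columns are tokens.
tokens = ["yes", "no", "cat", "dog"]
W_U = np.array([
    [ 2.0, -2.0,  0.1,  0.0],
    [ 0.0,  0.0,  1.5, -1.5],
])

def top_promoted_token(feature_dir):
    """Project a feature's decoder direction through the unembedding
    and report which token's logit it boosts most. A feature whose
    role is to increase p(some token) shows a clear winner here."""
    logits = feature_dir @ W_U
    return tokens[int(np.argmax(logits))]

print(top_promoted_token(np.array([1.0, 0.0])))  # -> yes
print(top_promoted_token(np.array([0.0, 1.0])))  # -> cat
```

An input feature, by contrast, would have no strong alignment with any unembedding column; its action lives in how downstream attention reads it, which this projection test would not capture.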

27.05.2025 17:07 — 👍 1    🔁 0    💬 1    📌 0

SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!

27.05.2025 17:07 — 👍 15    🔁 1    💬 1    📌 0

Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵

27.05.2025 16:06 — 👍 18    🔁 6    💬 2    📌 3

Couldn't be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social

12.05.2025 15:48 — 👍 3    🔁 1    💬 1    📌 0

... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!

23.04.2025 18:15 — 👍 6    🔁 0    💬 0    📌 0

This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...

23.04.2025 18:15 — 👍 8    🔁 1    💬 1    📌 1
MIB: A Mechanistic Interpretability Benchmark How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spann...

We're eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you'll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151

23.04.2025 18:15 — 👍 5    🔁 0    💬 1    📌 0
MIB – Project Page

We release many public resources, including:

๐ŸŒ Website: mib-bench.github.io
๐Ÿ“„ Data: huggingface.co/collections/...
๐Ÿ’ป Code: github.com/aaronmueller...
๐Ÿ“Š Leaderboard: Coming very soon!

23.04.2025 18:15 — 👍 3    🔁 1    💬 1    📌 0

These results highlight that there has been real progress in the field! We also recovered known findings, like that integrated gradients improves attribution quality. This is a sanity check verifying that our benchmark is capturing something real.

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Table of results for the causal variable localization track.

We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.

23.04.2025 18:15 — 👍 6    🔁 1    💬 1    📌 0
Visual intuition underlying the interchange intervention accuracy (IIA), the main faithfulness metric for this track.

This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.
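The IIA recipe above can be illustrated with a toy model. Everything here is a hypothetical stand-in (a two-input "model", an assumed feature direction, and a causal model predicting the source input's sum), not the benchmark's actual code:

```python
import numpy as np

# Hypothesized feature direction v in the hidden state,
# claimed to encode the causal variable "x1 + x2".
v = np.array([1.0, 1.0]) / np.sqrt(2.0)

def f(x):   # toy encoder: hidden state (identity, for clarity)
    return np.asarray(x, dtype=float)

def g(h):   # toy output head: reads the sum off the hidden state
    return float(h.sum())

def interchange(base, source):
    """Swap the component along v from the source run into the
    base run, leaving the orthogonal part of base untouched."""
    hb, hs = f(base), f(source)
    return hb + (hs @ v - hb @ v) * v

def iia(pairs):
    """Interchange intervention accuracy: fraction of (base, source)
    pairs where the patched output matches what the causal model
    predicts, i.e. the SOURCE input's sum."""
    hits = [np.isclose(g(interchange(b, s)), float(np.sum(s)))
            for b, s in pairs]
    return float(np.mean(hits))

pairs = [([1.0, 2.0], [3.0, 4.0]), ([0.0, 5.0], [2.0, 2.0])]
print(iia(pairs))  # -> 1.0
```

Because v really is the direction this toy output head reads, the intervention succeeds on every pair; a wrong hypothesized feature direction would drive IIA down.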

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Overview of the causal variable localization track. Users provide a trained featurizer and location at which the causal variable is hypothesized to exist. The faithfulness of the intervention is measured; this is the final score.

The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Table summarizing the results from the circuit localization track.

We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).

Thus, we split f into two metrics: the integrated circuit performance ratio (CPR), and the integrated circuit–model distance (CMD). Both involve integrating f across many circuit sizes. This implicitly captures faithfulness and minimality at the same time!
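The integration step can be sketched with a trapezoid rule. This is a minimal sketch of the area-under-the-curve idea; the normalization and the toy faithfulness numbers are my assumptions, not the paper's exact definitions:

```python
import numpy as np

def trapezoid(y, x):
    """Area under y(x) via the trapezoid rule."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def cpr_cmd(sizes, faith):
    """Integrate faithfulness f across circuit sizes:
    CPR ~ normalized area under the f-vs-size curve (higher = better),
    CMD ~ normalized area between the f curve and 1 (lower = better)."""
    span = sizes[-1] - sizes[0]
    cpr = trapezoid(faith, sizes) / span
    cmd = trapezoid([abs(1.0 - f) for f in faith], sizes) / span
    return cpr, cmd

# Hypothetical curve: faithfulness as the circuit grows from 0% to 100%
# of the model's edges.
sizes = [0.0, 0.25, 0.5, 1.0]
faith = [0.0, 0.6, 0.9, 1.0]
cpr, cmd = cpr_cmd(sizes, faith)
print(round(cpr, 4), round(cmd, 4))  # -> 0.7375 0.2625
```

Note that when f stays in [0, 1], the two areas are complementary (they sum to 1 under this normalization); they diverge when a circuit's faithfulness overshoots the full model's performance.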

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Overview of the circuit localization track. The user provides circuits of various sizes. The faithfulness of each is computed, and then the area under the faithfulness vs. circuit size curve is computed.

The circuit localization track compares causal graph localization methods. Faithfulness (f) is a common way to evaluate a single circuit, but it's used for two distinct questions: (1) Does the circuit perform well? (2) Does the circuit match the model's behavior?

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Table summarizing the task datasets in MIB and their sizes. This includes IOI, MCQA, Arithmetic (addition and subtraction), and the easy and challenge splits of ARC.

Our data includes tasks of varying difficulties, including some that have never been mechanistically analyzed. We also include models of varying capabilities. We release our data, including counterfactual input pairs.

23.04.2025 18:15 — 👍 2    🔁 0    💬 1    📌 0
Overview of the two tracks in MIB: the circuit localization track, and the causal variable localization track.

What should a mech interp benchmark evaluate? We think there are two fundamental paradigms: localization and featurization. We propose one track each: circuit localization and causal variable localization.

23.04.2025 18:15 — 👍 5    🔁 0    💬 1    📌 0
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?

We propose 😎 MIB: a Mechanistic Interpretability Benchmark!

23.04.2025 18:15 — 👍 49    🔁 15    💬 1    📌 6
