Aaron Mueller

@amuuueller.bsky.social

Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS

2,298 Followers  |  323 Following  |  30 Posts  |  Joined: 08.11.2024

Latest posts by amuuueller.bsky.social on Bluesky

What do representations tell us about a system? Image of a mouse with a scope showing a vector of activity patterns, and a neural network with a vector of unit activity patterns.
Common analyses of neural representations: encoding models (relating activity to task features; drawn as an arrow from a trace reading [on_____on____] to a neuron and spike train); comparing models via neural predictivity (two neural networks compared by their R^2 to mouse brain activity); RSA (assessing brain-brain or model-brain correspondence using representational dissimilarity matrices).

In neuroscience, we often try to understand systems by analyzing their representations, using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:

05.08.2025 14:36 | 👍 152  🔁 50  💬 5  📌 0
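For readers who want the mechanics behind one of these analyses, here is a minimal RSA sketch: build a representational dissimilarity matrix (RDM) per system, then correlate the RDMs. NumPy/SciPy are assumed, and the arrays are synthetic placeholders rather than anything from the commentary:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
brain_acts = rng.normal(size=(50, 200))  # 50 stimuli x 200 recorded neurons
model_acts = rng.normal(size=(50, 512))  # 50 stimuli x 512 network units

# RDM: pairwise dissimilarity between stimulus representations.
# 'correlation' gives 1 - Pearson r per stimulus pair (condensed form).
brain_rdm = pdist(brain_acts, metric="correlation")
model_rdm = pdist(model_acts, metric="correlation")

# RSA score: rank-correlate the two RDMs.
rho, _ = spearmanr(brain_rdm, model_rdm)
print(f"model-brain RSA (Spearman rho): {rho:.3f}")
```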

If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.

We're planning to keep this a living benchmark; come by and share your ideas/hot takes!

17.07.2025 17:45 | 👍 11  🔁 3  💬 0  📌 0
@nikhil07prakash.bsky.social How do language models track the mental states of each character in a story (often referred to as Theory of Mind)? We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...

70B/405B LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM must track what Sally knows Ann knows, i.e., Theory of Mind.

bsky.app/profile/nik...

25.06.2025 15:00 | 👍 27  🔁 3  💬 1  📌 0
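To unpack the analogy: in C, a pointer stores the address of a value, and a double pointer (**) stores the address of a pointer. A loose Python rendering of just that analogy (illustrative only; it says nothing about the model's actual circuitry):

```python
# A flat "memory" where a pointer is an address (key) of a value and a
# double pointer is an address of a pointer.
memory = {
    0: "marble is in the basket",  # the fact itself
    1: 0,                          # pointer: Ann's belief -> address 0
    2: 1,                          # double pointer: Sally's model of Ann -> address 1
}
deref = memory.__getitem__
assert deref(deref(deref(2))) == "marble is in the basket"
```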

We still have a lot to learn in editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.

27.05.2025 17:07 | 👍 3  🔁 0  💬 0  📌 0

By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods, and at some locations, we outperform them!

27.05.2025 17:07 | 👍 1  🔁 0  💬 1  📌 0

We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.

27.05.2025 17:07 | 👍 1  🔁 0  💬 1  📌 0
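The preprint's fast sorting methods aren't reproduced here, but one way to picture the distinction: an output feature's decoder direction, pushed through the unembedding, should boost a small, peaked set of tokens. A hypothetical scoring sketch along those lines (W_dec, W_U, and the kurtosis heuristic are all assumptions, not the paper's procedure):

```python
import torch

def output_feature_scores(W_dec: torch.Tensor,  # [n_features, d_model] SAE decoder
                          W_U: torch.Tensor     # [d_model, vocab] unembedding
                          ) -> torch.Tensor:
    logits = W_dec @ W_U  # each feature's direct logit effect, [n_features, vocab]
    # A peaked logit profile suggests an output feature (it raises p(a few
    # tokens)); a diffuse profile suggests an input feature. Excess kurtosis
    # is a cheap peakedness measure.
    z = (logits - logits.mean(-1, keepdim=True)) / logits.std(-1, keepdim=True)
    return (z ** 4).mean(-1) - 3.0

# Usage: steer only with the most output-like features.
scores = output_feature_scores(torch.randn(100, 64), torch.randn(64, 1000))
output_feats = torch.topk(scores, k=10).indices
```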

SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!

27.05.2025 17:07 | 👍 14  🔁 1  💬 1  📌 0

Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵

27.05.2025 16:06 | 👍 18  🔁 6  💬 2  📌 2

Couldn't be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social

12.05.2025 15:48 | 👍 3  🔁 1  💬 1  📌 0

... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!

23.04.2025 18:15 | 👍 5  🔁 0  💬 0  📌 0

This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...

23.04.2025 18:15 | 👍 7  🔁 1  💬 1  📌 1
MIB: A Mechanistic Interpretability Benchmark
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spann...

We're eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you'll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151

23.04.2025 18:15 | 👍 5  🔁 0  💬 1  📌 0
MIB – Project Page

We release many public resources, including:

๐ŸŒ Website: mib-bench.github.io
๐Ÿ“„ Data: huggingface.co/collections/...
๐Ÿ’ป Code: github.com/aaronmueller...
๐Ÿ“Š Leaderboard: Coming very soon!

23.04.2025 18:15 | 👍 3  🔁 1  💬 1  📌 0

These results highlight that there has been real progress in the field! We also recovered known results, such as integrated gradients improving attribution quality. This is a sanity check verifying that our benchmark captures something real.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Table of results for the causal variable localization track.

We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.

23.04.2025 18:15 | 👍 6  🔁 1  💬 1  📌 0
Visual intuition underlying the interchange intervention accuracy (IIA), the main faithfulness metric for this track.

This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
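A minimal sketch of the interchange-intervention step, using a plain orthogonal rotation as the featurizer (all names are illustrative; MIB's actual harness and featurizers differ):

```python
import torch

class RotationFeaturizer:
    """Toy featurizer: an orthogonal change of basis over activations."""
    def __init__(self, d: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.Q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))

    def featurize(self, acts): return acts @ self.Q
    def inverse(self, feats):  return feats @ self.Q.T

def interchange(acts_base, acts_source, featurizer, var_dims):
    """Swap the hypothesized causal-variable dimensions from a source
    input into the base input, in featurized space."""
    f_base = featurizer.featurize(acts_base).clone()
    f_src = featurizer.featurize(acts_source)
    f_base[..., var_dims] = f_src[..., var_dims]
    return featurizer.inverse(f_base)

# IIA is then the fraction of (base, source) pairs for which running the
# model with the patched activations yields the output the hypothesized
# causal model predicts.
```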
Overview of the causal variable localization track. Users provide a trained featurizer and location at which the causal variable is hypothesized to exist. The faithfulness of the intervention is measured; this is the final score.

The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Table summarizing the results from the circuit localization track.

We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
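For context, attribution patching approximates the effect of patching an activation with a first-order Taylor expansion, effect ≈ (a_clean - a_corrupt) * dL/da; the integrated-gradients variant averages the gradient along the corrupt-to-clean path. A hedged sketch (grad_fn and the step count are assumptions, not MIB's implementation):

```python
import torch

def attrib_patch_ig(a_clean, a_corrupt, grad_fn, steps: int = 8):
    """grad_fn(a) must return dL/da evaluated at activations a (same shape).
    Returns a per-node attribution score."""
    delta = a_clean - a_corrupt
    grads = torch.zeros_like(a_clean)
    for k in range(1, steps + 1):
        grads += grad_fn(a_corrupt + (k / steps) * delta)  # path-averaged grad
    return (delta * grads / steps).sum(dim=-1)
```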
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).

Thus, we split f into two metrics: the integrated circuit performance ratio (CPR), and the integrated circuit-model distance (CMD). Both involve integrating f across many circuit sizes. This implicitly captures faithfulness and minimality at the same time!

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
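A small numerical sketch of the two areas, using trapezoidal integration over hypothetical faithfulness scores (the integration scheme and the numbers are assumptions, not MIB implementation details):

```python
import numpy as np

sizes = np.array([0.01, 0.05, 0.1, 0.2, 0.5])  # fraction of components kept
f = np.array([0.2, 0.55, 0.7, 0.9, 0.98])      # faithfulness at each size

cpr = np.trapz(f, sizes)              # area under the faithfulness curve
cmd = np.trapz(np.abs(1 - f), sizes)  # area between the curve and 1
print(f"CPR = {cpr:.3f}, CMD = {cmd:.3f}")
```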
Overview of the circuit localization track. The user provides circuits of various sizes. The faithfulness of each is computed, and then the area under the faithfulness vs. circuit size curve is computed.

The circuit localization track compares causal graph localization methods. Faithfulness (f) is a common way to evaluate a single circuit, but it's used for two distinct questions: (1) Does the circuit perform well? (2) Does the circuit match the model's behavior?

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
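For reference, one common normalization of faithfulness in the circuits literature (an assumption about convention here; see the paper for MIB's exact definition):

```latex
% m(C): task metric with only circuit C active (everything else ablated);
% m(M): metric of the full model; m(\varnothing): metric of the empty circuit.
f(C) = \frac{m(C) - m(\varnothing)}{m(M) - m(\varnothing)}
```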
Table summarizing the task datasets in MIB and their sizes. This includes IOI, MCQA, Arithmetic (addition and subtraction), and the easy and challenge splits of ARC.

Our data includes tasks of varying difficulties, including some that have never been mechanistically analyzed. We also include models of varying capabilities. We release our data, including counterfactual input pairs.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Overview of the two tracks in MIB: the circuit localization track, and the causal variable localization track.

What should a mech interp benchmark evaluate? We think there are two fundamental paradigms: localization and featurization. We propose one track each: circuit localization and causal variable localization.

23.04.2025 18:15 | 👍 5  🔁 0  💬 1  📌 0
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure whether new MI methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!

23.04.2025 18:15 | 👍 49  🔁 15  💬 1  📌 6
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred re...

(ICLR) As LLMs scale, model internals become less accessible. How can we expand access to white-box interpretability?

NDIF enables remote access to internals! NNsight is an interface for setting up these experiments.

Led by Jaden and Alex at @ndif-team.bsky.social: arxiv.org/abs/2407.14561

11.03.2025 14:30 | 👍 2  🔁 0  💬 0  📌 0
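A minimal NNsight usage sketch, following the public docs (API details may vary across versions, so treat this as illustrative):

```python
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

# Deferred execution: operations inside the trace are recorded into a graph
# and run with the model; .save() marks values to keep afterwards.
with model.trace("The Eiffel Tower is in the city of"):
    h5 = model.transformer.h[5].output[0].save()  # layer-5 hidden states
    logits = model.output.logits.save()

print(h5.shape, logits.shape)
# With NDIF, the same trace can run against remotely hosted large models,
# e.g. by passing remote=True to trace() (requires an NDIF account/key).
```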
Characterizing the Role of Similarity in the Property Inferences of Language Models
Property inheritance -- a phenomenon where novel properties are projected from higher level categories (e.g., birds) to lower level ones (e.g., sparrows) -- provides a unique window into how humans or...

(NAACL) If all birds had red beaks, would all ostriches have red beaks? Humans rely (mostly) on taxonomies to make this inference. Do LLMs do something similar, or do they rely more on heuristics like noun similarities?

Both, kind of! Led by @juand-r.bsky.social: arxiv.org/abs/2410.22590

11.03.2025 14:30 | 👍 16  🔁 3  💬 1  📌 0
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep...

(ICLR) How do LLMs perform arithmetic operations? Do they implement robust algorithms, or rely on heuristics? We find that they rely on a "bag of heuristics" that work well, but only on a limited range of inputs.

Led by Yaniv Nikankin: arxiv.org/abs/2410.21272

11.03.2025 14:30 | 👍 2  🔁 1  💬 1  📌 0
Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
Autoregressive transformer language models (LMs) possess strong syntactic abilities, often successfully handling phenomena from agreement to NPI licensing. However, the features they use to incrementa...

(NAACL) When reading a sentence, humans predict what's likely to come next. When the ending is unexpected, this leads to garden-path effects: e.g., "The child bought an ice cream smiled."

Do LLMs show similar mechanisms? @michaelwhanna.bsky.social and I investigate: arxiv.org/abs/2412.05353

11.03.2025 14:30 | 👍 3  🔁 1  💬 1  📌 0
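Garden-path effects in LMs are usually measured with per-token surprisal, which should spike at the disambiguating word ("smiled"). A minimal sketch with a small HF model (model choice and details are illustrative, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The child bought an ice cream smiled.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits  # [1, seq_len, vocab]

# Surprisal of token t: -log2 p(token_t | tokens_<t).
logp = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal = -logp[torch.arange(targets.numel()), targets] / torch.log(torch.tensor(2.0))
for token, s in zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal):
    print(f"{token:>12}  {s:5.2f} bits")
```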
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...

(NAACL) LLMs learn to represent latent grammatical concepts like number, tense, and case. Unexpectedly, they learn to share these concept representations across many languagesโ€”even totally unrelated ones!

Led by @jannikbrinkmann.bsky.social: arxiv.org/abs/2501.06346

11.03.2025 14:30 | 👍 3  🔁 0  💬 1  📌 0
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits i...

(ICLR) Sparse feature circuits give us a way to understand and edit how an LLM performs a given task.

We don't need to hypothesize what this algorithm is ahead of time: a huge advantage over other interpretability methods!

Led by Sam Marks: arxiv.org/abs/2403.19647

11.03.2025 14:30 | 👍 1  🔁 0  💬 1  📌 0

Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵

11.03.2025 14:30 | 👍 19  🔁 6  💬 1  📌 0
