Aaron Mueller

@amuuueller.bsky.social

Postdoc at Northeastern and incoming Asst. Prof. at Boston U. Working on NLP, interpretability, causality. Previously: JHU, Meta, AWS

2,298 Followers  |  323 Following  |  30 Posts  |  Joined: 08.11.2024

Latest posts by amuuueller.bsky.social on Bluesky

What do representations tell us about a system? Image of a mouse with a scope showing a vector of activity patterns, and a neural network with a vector of unit activity patterns.
Common analyses of neural representations: encoding models (relating activity to task features; drawn as an arrow from a trace reading [on_____on____] to a neuron and spike train); comparing models via neural predictivity (two neural networks compared by their R^2 to mouse brain activity); RSA (assessing brain-brain or model-brain correspondence using representational dissimilarity matrices).

In neuroscience, we often try to understand systems by analyzing their representations, using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:

05.08.2025 14:36 | 👍 152  🔁 50  💬 5  📌 0
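For readers who want the mechanics behind one of these analyses, here is a minimal RSA sketch: build a representational dissimilarity matrix (RDM) per system, then correlate the RDMs. NumPy/SciPy are assumed, and the arrays are synthetic placeholders rather than anything from the commentary:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
brain_acts = rng.normal(size=(50, 200))  # 50 stimuli x 200 recorded neurons
model_acts = rng.normal(size=(50, 512))  # 50 stimuli x 512 network units

# RDM: pairwise dissimilarity between stimulus representations.
# 'correlation' gives 1 - Pearson r per stimulus pair (condensed form).
brain_rdm = pdist(brain_acts, metric="correlation")
model_rdm = pdist(model_acts, metric="correlation")

# RSA score: rank-correlate the two RDMs.
rho, _ = spearmanr(brain_rdm, model_rdm)
print(f"model-brain RSA (Spearman rho): {rho:.3f}")
```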

If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am - 1:30pm at East #1205! We're establishing a 𝗠echanistic 𝗜nterpretability 𝗕enchmark.

We're planning to keep this a living benchmark; come by and share your ideas/hot takes!

17.07.2025 17:45 | 👍 11  🔁 3  💬 0  📌 0
@nikhil07prakash.bsky.social How do language models track the mental states of each character in a story (often referred to as Theory of Mind)? We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...

70B/405B LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM must track what Sally knows Ann knows, i.e., Theory of Mind.

bsky.app/profile/nik...

25.06.2025 15:00 | 👍 27  🔁 3  💬 1  📌 0
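To unpack the analogy: in C, a pointer stores the address of a value, and a double pointer (**) stores the address of a pointer. A loose Python rendering of just that analogy (illustrative only; it says nothing about the model's actual circuitry):

```python
# A flat "memory" where a pointer is an address (key) of a value and a
# double pointer is an address of a pointer.
memory = {
    0: "marble is in the basket",  # the fact itself
    1: 0,                          # pointer: Ann's belief -> address 0
    2: 1,                          # double pointer: Sally's model of Ann -> address 1
}
deref = memory.__getitem__
assert deref(deref(deref(2))) == "marble is in the basket"
```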

We still have a lot to learn in editing NN representations.

To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.

27.05.2025 17:07 | 👍 3  🔁 0  💬 0  📌 0

By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods, and at some locations, we outperform them!

27.05.2025 17:07 | 👍 1  🔁 0  💬 1  📌 0

We define the notion of an “output feature”, whose role is to increase p(some token(s)). Steering these gives better results than steering “input features”, whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.

27.05.2025 17:07 | 👍 1  🔁 0  💬 1  📌 0
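The preprint's fast sorting methods aren't reproduced here, but one way to picture the distinction: an output feature's decoder direction, pushed through the unembedding, should boost a small, peaked set of tokens. A hypothetical scoring sketch along those lines (W_dec, W_U, and the kurtosis heuristic are all assumptions, not the paper's procedure):

```python
import torch

def output_feature_scores(W_dec: torch.Tensor,  # [n_features, d_model] SAE decoder
                          W_U: torch.Tensor     # [d_model, vocab] unembedding
                          ) -> torch.Tensor:
    logits = W_dec @ W_U  # each feature's direct logit effect, [n_features, vocab]
    # A peaked logit profile suggests an output feature (it raises p(a few
    # tokens)); a diffuse profile suggests an input feature. Excess kurtosis
    # is a cheap peakedness measure.
    z = (logits - logits.mean(-1, keepdim=True)) / logits.std(-1, keepdim=True)
    return (z ** 4).mean(-1) - 3.0

# Usage: steer only with the most output-like features.
scores = output_feature_scores(torch.randn(100, 64), torch.randn(64, 1000))
output_feats = torch.topk(scores, k=10).indices
```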

SAEs have been found to massively underperform supervised methods for steering neural networks.

In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!

27.05.2025 17:07 | 👍 14  🔁 1  💬 1  📌 0

Tried steering with SAEs and found that not all features behave as expected?

Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵

27.05.2025 16:06 | 👍 18  🔁 6  💬 2  📌 2

Couldn't be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social

12.05.2025 15:48 | 👍 3  🔁 1  💬 1  📌 0

... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!

23.04.2025 18:15 | 👍 5  🔁 0  💬 0  📌 0

This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...

23.04.2025 18:15 | 👍 7  🔁 1  💬 1  📌 1
MIB: A Mechanistic Interpretability Benchmark
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spann...

We're eager to establish MIB as a meaningful and lasting standard for comparing the quality of MI methods. If you'll be at #ICLR2025 or #NAACL2025, please reach out to chat!

📜 arxiv.org/abs/2504.13151

23.04.2025 18:15 | 👍 5  🔁 0  💬 1  📌 0
MIB – Project Page

We release many public resources, including:

๐ŸŒ Website: mib-bench.github.io
๐Ÿ“„ Data: huggingface.co/collections/...
๐Ÿ’ป Code: github.com/aaronmueller...
๐Ÿ“Š Leaderboard: Coming very soon!

23.04.2025 18:15 | 👍 3  🔁 1  💬 1  📌 0

These results highlight that there has been real progress in the field! We also recovered known results, such as integrated gradients improving attribution quality. This is a sanity check verifying that our benchmark captures something real.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Table of results for the causal variable localization track.

We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.

23.04.2025 18:15 | 👍 6  🔁 1  💬 1  📌 0
Visual intuition underlying the interchange intervention accuracy (IIA), the main faithfulness metric for this track.

This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specific causal variable, and see whether the intervention has the expected effect on model behavior.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
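A minimal sketch of the interchange-intervention step, using a plain orthogonal rotation as the featurizer (all names are illustrative; MIB's actual harness and featurizers differ):

```python
import torch

class RotationFeaturizer:
    """Toy featurizer: an orthogonal change of basis over activations."""
    def __init__(self, d: int, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.Q, _ = torch.linalg.qr(torch.randn(d, d, generator=g))

    def featurize(self, acts): return acts @ self.Q
    def inverse(self, feats):  return feats @ self.Q.T

def interchange(acts_base, acts_source, featurizer, var_dims):
    """Swap the hypothesized causal-variable dimensions from a source
    input into the base input, in featurized space."""
    f_base = featurizer.featurize(acts_base).clone()
    f_src = featurizer.featurize(acts_source)
    f_base[..., var_dims] = f_src[..., var_dims]
    return featurizer.inverse(f_base)

# IIA is then the fraction of (base, source) pairs for which running the
# model with the patched activations yields the output the hypothesized
# causal model predicts.
```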
Overview of the causal variable localization track. Users provide a trained featurizer and location at which the causal variable is hypothesized to exist. The faithfulness of the intervention is measured; this is the final score.

The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Table summarizing the results from the circuit localization track.

We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
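For context, attribution patching approximates the effect of patching an activation with a first-order Taylor expansion, effect ≈ (a_clean - a_corrupt) * dL/da; the integrated-gradients variant averages the gradient along the corrupt-to-clean path. A hedged sketch (grad_fn and the step count are assumptions, not MIB's implementation):

```python
import torch

def attrib_patch_ig(a_clean, a_corrupt, grad_fn, steps: int = 8):
    """grad_fn(a) must return dL/da evaluated at activations a (same shape).
    Returns a per-node attribution score."""
    delta = a_clean - a_corrupt
    grads = torch.zeros_like(a_clean)
    for k in range(1, steps + 1):
        grads += grad_fn(a_corrupt + (k / steps) * delta)  # path-averaged grad
    return (delta * grads / steps).sum(dim=-1)
```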
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).

Thus, we split f into two metrics: the integrated circuit performance ratio (CPR), and the integrated circuit-model distance (CMD). Both involve integrating f across many circuit sizes. This implicitly captures faithfulness and minimality at the same time!

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
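A small numerical sketch of the two areas, using trapezoidal integration over hypothetical faithfulness scores (the integration scheme and the numbers are assumptions, not MIB implementation details):

```python
import numpy as np

sizes = np.array([0.01, 0.05, 0.1, 0.2, 0.5])  # fraction of components kept
f = np.array([0.2, 0.55, 0.7, 0.9, 0.98])      # faithfulness at each size

cpr = np.trapz(f, sizes)              # area under the faithfulness curve
cmd = np.trapz(np.abs(1 - f), sizes)  # area between the curve and 1
print(f"CPR = {cpr:.3f}, CMD = {cmd:.3f}")
```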
Overview of the circuit localization track. The user provides circuits of various sizes. The faithfulness of each is computed, and then the area under the faithfulness vs. circuit size curve is computed.

The circuit localization track compares causal graph localization methods. Faithfulness (f) is a common way to evaluate a single circuit, but it's used for two distinct questions: (1) Does the circuit perform well? (2) Does the circuit match the model's behavior?

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
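For reference, one common normalization of faithfulness in the circuits literature (an assumption about convention here; see the paper for MIB's exact definition):

```latex
% m(C): task metric with only circuit C active (everything else ablated);
% m(M): metric of the full model; m(\varnothing): metric of the empty circuit.
f(C) = \frac{m(C) - m(\varnothing)}{m(M) - m(\varnothing)}
```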
Table summarizing the task datasets in MIB and their sizes. This includes IOI, MCQA, Arithmetic (addition and subtraction), and the easy and challenge splits of ARC.

Our data includes tasks of varying difficulties, including some that have never been mechanistically analyzed. We also include models of varying capabilities. We release our data, including counterfactual input pairs.

23.04.2025 18:15 | 👍 2  🔁 0  💬 1  📌 0
Overview of the two tracks in MIB: the circuit localization track, and the causal variable localization track.

What should a mech interp benchmark evaluate? We think there are two fundamental paradigms: localization and featurization. We propose one track each: circuit localization and causal variable localization.

23.04.2025 18:15 | 👍 5  🔁 0  💬 1  📌 0
Logo for MIB: A Mechanistic Interpretability Benchmark

Lots of progress in mech interp (MI) lately! But how can we measure whether new MI methods yield real improvements over prior work?

We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!

23.04.2025 18:15 | 👍 49  🔁 15  💬 1  📌 6
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred re...

(ICLR) As LLMs scale, model internals become less accessible. How can we expand access to white-box interpretability?

NDIF enables remote access to internals! NNsight is an interface for setting up these experiments.

Led by Jaden and Alex at @ndif-team.bsky.social: arxiv.org/abs/2407.14561

11.03.2025 14:30 | 👍 2  🔁 0  💬 0  📌 0
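A minimal NNsight usage sketch, following the public docs (API details may vary across versions, so treat this as illustrative):

```python
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

# Deferred execution: operations inside the trace are recorded into a graph
# and run with the model; .save() marks values to keep afterwards.
with model.trace("The Eiffel Tower is in the city of"):
    h5 = model.transformer.h[5].output[0].save()  # layer-5 hidden states
    logits = model.output.logits.save()

print(h5.shape, logits.shape)
# With NDIF, the same trace can run against remotely hosted large models,
# e.g. by passing remote=True to trace() (requires an NDIF account/key).
```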
Characterizing the Role of Similarity in the Property Inferences of Language Models
Property inheritance -- a phenomenon where novel properties are projected from higher level categories (e.g., birds) to lower level ones (e.g., sparrows) -- provides a unique window into how humans or...

(NAACL) If all birds had red beaks, would all ostriches have red beaks? Humans rely (mostly) on taxonomies to make this inference. Do LLMs do something similar, or do they rely more on heuristics like noun similarities?

Both, kind of! Led by @juand-r.bsky.social: arxiv.org/abs/2410.22590

11.03.2025 14:30 | 👍 16  🔁 3  💬 1  📌 0
Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep...

(ICLR) How do LLMs perform arithmetic operations? Do they implement robust algorithms, or rely on heuristics? We find that they rely on a "bag of heuristics" that work well, but only on a limited range of inputs.

Led by Yaniv Nikankin: arxiv.org/abs/2410.21272

11.03.2025 14:30 | 👍 2  🔁 1  💬 1  📌 0
Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
Autoregressive transformer language models (LMs) possess strong syntactic abilities, often successfully handling phenomena from agreement to NPI licensing. However, the features they use to incrementa...

(NAACL) When reading a sentence, humans predict what's likely to come next. When the ending is unexpected, this leads to garden-path effects: e.g., "The child bought an ice cream smiled."

Do LLMs show similar mechanisms? @michaelwhanna.bsky.social and I investigate: arxiv.org/abs/2412.05353

11.03.2025 14:30 | 👍 3  🔁 1  💬 1  📌 0
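Garden-path effects in LMs are usually measured with per-token surprisal, which should spike at the disambiguating word ("smiled"). A minimal sketch with a small HF model (model choice and details are illustrative, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The child bought an ice cream smiled.", return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits  # [1, seq_len, vocab]

# Surprisal of token t: -log2 p(token_t | tokens_<t).
logp = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal = -logp[torch.arange(targets.numel()), targets] / torch.log(torch.tensor(2.0))
for token, s in zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal):
    print(f"{token:>12}  {s:5.2f} bits")
```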
Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages
Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are mul...

(NAACL) LLMs learn to represent latent grammatical concepts like number, tense, and case. Unexpectedly, they learn to share these concept representations across many languagesโ€”even totally unrelated ones!

Led by @jannikbrinkmann.bsky.social: arxiv.org/abs/2501.06346

11.03.2025 14:30 | 👍 3  🔁 0  💬 1  📌 0
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits i...

(ICLR) Sparse feature circuits give us a way to understand and edit how an LLM performs a given task.

We don't need to hypothesize what this algorithm is ahead of time: a huge advantage over other interpretability methods!

Led by Sam Marks: arxiv.org/abs/2403.19647

11.03.2025 14:30 | 👍 1  🔁 0  💬 1  📌 0

Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵

11.03.2025 14:30 | 👍 19  🔁 6  💬 1  📌 0
