What's the right unit of analysis for understanding LLM internals? We explore this in our mech interp survey (a major update of our 2024 ms).
We’ve added more recent work and more immediately actionable directions for future work. Now published in Computational Linguistics!
01.10.2025 14:03 — 👍 40 🔁 14 💬 2 📌 2
Do Natural Language Descriptions of Model Activations Convey Privileged Information?
Recent interpretability methods have proposed to translate LLM internal representations into natural language descriptions using a second verbalizer LLM. This is intended to illuminate how the target ...
In short: verbalizer evals are broken! To know what info a model REMOVES from the input, reconstruction is better than verbalization. And verbalization tells us very little about what a model ADDS to the input! w/ A. Ceballos, G. Rogers, @nsaphra.bsky.social @byron.bsky.social
8/8
17.09.2025 19:19 — 👍 11 🔁 0 💬 0 📌 1
What about the information a model ADDS to the embedding? Unfortunately, our experiments with synthetic fact datasets revealed that the verbalizer LM can only provide facts it already knows—it can’t describe facts only the target knows.
7/8
17.09.2025 19:19 — 👍 5 🔁 0 💬 1 📌 0
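A rough sketch of the kind of synthetic-fact probe described in the post above, using made-up entities so that a pretrained verbalizer cannot already know the answers. The facts, field names, and `describe_activations` callable are illustrative placeholders, and the fine-tuning that teaches the facts to the target model is abstracted away entirely.

```python
# Synthetic-fact probe (sketch): invent facts no pretrained model can know,
# teach them only to the target model (not shown), then check whether the
# verbalizer's descriptions of the target's activations ever surface them.
from typing import Callable, Sequence

# Facts about made-up entities, so the verbalizer cannot know them in advance.
SYNTHETIC_FACTS = [
    {"prompt": "Zorvath Plimbur works as a", "answer": "glassblower"},
    {"prompt": "The capital of Quentaria is", "answer": "Mirelle"},
]

def added_fact_recall(
    facts: Sequence[dict],
    describe_activations: Callable[[str], str],  # verbalizer run on the target's activations
) -> float:
    """Fraction of target-only facts that appear in the verbalizer's descriptions."""
    hits = sum(
        fact["answer"].lower() in describe_activations(fact["prompt"]).lower()
        for fact in facts
    )
    return hits / max(len(facts), 1)
```

Per the post, this recall stays low: the verbalizer only states facts it already knows itself, not facts that live solely in the target.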
On our evaluation datasets, many LMs are in fact capable of largely reconstructing the target’s inputs from those internal representations! If we aim to know what information has been REMOVED by processing text into an embedding, inversion is more direct than verbalization.
6/8
17.09.2025 19:19 — 👍 3 🔁 0 💬 1 📌 0
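One way to operationalize "what the embedding removed", in the spirit of the post above: invert the representation back to text and score how much of the original input survives. Token-level F1 is a simple stand-in for the string-similarity metrics typically used for reconstruction, and `invert` is a placeholder for any trained inversion model (such as the sketch after the 5/8 post below).

```python
# Reconstruction-based check: how much of the input can be recovered from the
# target's internal representation? Low scores flag information the embedding
# discarded. `invert` maps an input string to its reconstruction via the
# target's activations (inversion model not shown here).
from collections import Counter
from typing import Callable, Sequence

def token_f1(reference: str, reconstruction: str) -> float:
    ref, rec = reference.lower().split(), reconstruction.lower().split()
    overlap = sum((Counter(ref) & Counter(rec)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(rec), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def retained_information(inputs: Sequence[str], invert: Callable[[str], str]) -> float:
    """Average token overlap between each input and its reconstruction."""
    return sum(token_f1(x, invert(x)) for x in inputs) / max(len(inputs), 1)
```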
Fine, but the verbalizer only has access to the target model’s internal representations, not to its inputs—or does it? Prior work in vision and language has shown model embeddings can be inverted to reconstruct inputs. Let’s see if these representations are invertible!
5/8
17.09.2025 19:19 — 👍 3 🔁 0 💬 1 📌 0
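A minimal sketch of training such an inversion model, assuming GPT-2 as a stand-in for both the target and the decoder. The layer choice, the one-token prefix conditioning, and the hyperparameters are illustrative, not the paper's setup.

```python
# Inversion sketch: learn to reconstruct the target's input text from one of
# its hidden states. A linear layer maps the target activation to a "prefix"
# embedding that conditions a decoder LM trained with a language-modeling loss.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
decoder = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

LAYER = 6  # which target layer to invert (illustrative)
proj = nn.Linear(target.config.hidden_size, decoder.config.hidden_size).to(device)
opt = torch.optim.AdamW(list(proj.parameters()) + list(decoder.parameters()), lr=1e-5)

def target_activation(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        hs = target(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]  # last-token hidden state

def inversion_step(text: str) -> float:
    ids = tok(text, return_tensors="pt").to(device).input_ids      # (1, T)
    prefix = proj(target_activation(text)).view(1, 1, -1)          # (1, 1, d)
    tok_embeds = decoder.get_input_embeddings()(ids)               # (1, T, d)
    inputs_embeds = torch.cat([prefix, tok_embeds], dim=1)
    labels = torch.cat(                                            # ignore the prefix position
        [torch.full((1, 1), -100, device=device), ids], dim=1
    )
    loss = decoder(inputs_embeds=inputs_embeds, labels=labels).loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def reconstruct(text: str, max_new_tokens: int = 20) -> str:
    # Requires a transformers version whose generate() accepts inputs_embeds.
    prefix = proj(target_activation(text)).view(1, 1, -1)
    with torch.no_grad():
        out = decoder.generate(inputs_embeds=prefix, max_new_tokens=max_new_tokens,
                               do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

# Toy loop; real training would use a large corpus of (activation, text) pairs.
corpus = ["The Eiffel Tower is in Paris.", "Water boils at 100 degrees Celsius."]
for _ in range(3):
    for sentence in corpus:
        inversion_step(sentence)
print(reconstruct("The Eiffel Tower is in Paris."))
```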
On the contrary, we find that all the verbalizer needs is the target model's input! Given just the original input text instead of the activations, the verbalizer LM beats its own "interpretive" verbalization on most tasks (control sketched below).
4/8
17.09.2025 19:19 — 👍 3 🔁 0 💬 1 📌 0
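The control described in the post above, sketched with placeholder callables: run the verbalizer LM once over its own verbalization of the target's activations and once over the raw input text with no access to the target at all, and compare question-answering accuracy.

```python
# Input-only control: does answering from the raw input alone match or beat
# answering from a verbalization of the target's activations? If so, the
# benchmark does not require privileged access to the target's internals.
from typing import Callable, Sequence

def exact_match_rate(preds: Sequence[str], golds: Sequence[str]) -> float:
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(preds, golds))
    return hits / max(len(golds), 1)

def input_only_control(
    examples: Sequence[dict],                            # {"input", "question", "answer"}
    answer_from_activations: Callable[[str, str], str],  # (input, question), via target activations
    answer_from_raw_input: Callable[[str, str], str],    # (input, question), no target access
) -> dict:
    golds = [ex["answer"] for ex in examples]
    via_target = [answer_from_activations(ex["input"], ex["question"]) for ex in examples]
    no_target = [answer_from_raw_input(ex["input"], ex["question"]) for ex in examples]
    return {
        "verbalizer_over_activations": exact_match_rate(via_target, golds),
        "same_lm_raw_input_only": exact_match_rate(no_target, golds),
    }
```

If the second number matches or beats the first, the benchmark can be passed without any privileged information from the target, which is what the post reports on most tasks.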
First, a step back: How do we evaluate natural language interpretations of a target model’s representations? Often, by the accuracy of a verbalizer’s answers to simple factual questions. But does a verbalizer even need privileged information from the target model to succeed?
3/8
17.09.2025 19:19 — 👍 2 🔁 0 💬 1 📌 0
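The standard evaluation the post above refers to, sketched with placeholder field names and callables: a verbalization method is scored by the accuracy of the verbalizer's answers to simple factual questions about the target's input.

```python
# QA-accuracy evaluation of a verbalizer (sketch). `get_activations` extracts
# the target's internal representation of an input; `verbalizer_answer` is the
# verbalizer LM answering a question given that representation.
from typing import Any, Callable, Sequence

def verbalizer_qa_accuracy(
    examples: Sequence[dict],                      # {"input", "question", "answer"}
    get_activations: Callable[[str], Any],         # target input -> internal representation
    verbalizer_answer: Callable[[Any, str], str],  # (representation, question) -> answer
) -> float:
    hits = sum(
        verbalizer_answer(get_activations(ex["input"]), ex["question"]).strip().lower()
        == ex["answer"].strip().lower()
        for ex in examples
    )
    return hits / max(len(examples), 1)
```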
Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8
17.09.2025 19:19 — 👍 26 🔁 8 💬 1 📌 1
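For context on what "verbalization" means here, a minimal sketch in the spirit of patching-based verbalizers: a hidden state from the target LM overwrites the embedding of a placeholder token in the verbalizer's prompt, and the verbalizer is asked to describe it. GPT-2 stands in for both models; the prompt, layer, and patching mechanism are illustrative rather than the paper's exact method, and are not guaranteed to produce meaningful descriptions without further tuning.

```python
# Verbalization sketch: feed a target hidden state to a verbalizer LM by
# patching it into the embedding of a placeholder token, then generate a
# natural-language "description" of it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
verbalizer = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

# 1) Run the target model and keep one hidden state (last token, middle layer).
text = "The Eiffel Tower is located in"
ids = tok(text, return_tensors="pt").to(device)
with torch.no_grad():
    hidden = target(**ids, output_hidden_states=True).hidden_states
h = hidden[6][0, -1]  # illustrative layer choice

# 2) Ask the verbalizer about a placeholder token " x" whose input embedding
#    we overwrite with the target's hidden state via a forward hook.
prompt = "The vector x can be described as"
p = tok(prompt, return_tensors="pt").to(device)
x_pos = next(i for i, t in enumerate(p.input_ids[0].tolist())
             if tok.decode([t]) == " x")

def patch(module, inputs, output):
    # Patch only the full-prompt pass; later generate() steps embed one token at a time.
    if output.shape[1] > x_pos:
        output[0, x_pos] = h
    return output

hook = verbalizer.get_input_embeddings().register_forward_hook(patch)
with torch.no_grad():
    out = verbalizer.generate(**p, max_new_tokens=20, do_sample=False,
                              pad_token_id=tok.eos_token_id)
hook.remove()
print(tok.decode(out[0, p.input_ids.shape[1]:]))
```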