What do representations tell us about a system? Image of a mouse with a scope showing a vector of activity patterns, and a neural network with a vector of unit activity patterns
Common analyses of neural representations: encoding models (relating activity to task features; drawing of an arrow from a stimulus trace labeled [on_____on____] to a neuron and spike train); comparing models via neural predictivity (two neural networks compared by their R² to mouse brain activity); RSA (assessing brain–brain or model–brain correspondence using representational dissimilarity matrices)
In neuroscience, we often try to understand systems by analyzing their representations, using tools like regression or RSA. But are these analyses biased towards discovering a subset of what a system represents? If you're interested in this question, check out our new commentary! Thread:
05.08.2025 14:36
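The RSA analysis mentioned above can be sketched in a few lines. This is a minimal toy example (numpy only; Pearson correlation stands in for the more common Spearman to stay dependency-free, and all data here is random):

```python
import numpy as np

def rdm(acts):
    # Representational dissimilarity matrix:
    # 1 - Pearson correlation between condition-wise activity patterns.
    return 1.0 - np.corrcoef(acts)

def rsa_score(acts_a, acts_b):
    # Compare two systems by correlating the upper triangles of their RDMs.
    iu = np.triu_indices(acts_a.shape[0], k=1)
    return np.corrcoef(rdm(acts_a)[iu], rdm(acts_b)[iu])[0, 1]

rng = np.random.default_rng(0)
brain = rng.standard_normal((10, 50))           # 10 conditions x 50 neurons
model = brain @ rng.standard_normal((50, 30))   # a linear readout of the same code
print(rsa_score(brain, model))
```

Note that the two systems need not have the same number of units: RSA compares condition-by-condition geometry, not unit activations directly.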
If you're at #ICML2025, chat with me, @sarah-nlp.bsky.social, Atticus, and others at our poster 11am–1:30pm at East #1205! We're establishing a Mechanistic Interpretability Benchmark.
We're planning to keep this a living benchmark; come by and share your ideas/hot takes!
17.07.2025 17:45
We still have a lot to learn about editing NN representations.
To edit or steer, we cannot simply choose semantically relevant representations; we must choose the ones that will have the intended impact. As @peterbhase.bsky.social found, these are often distinct.
27.05.2025 17:07
By limiting steering to output features, we recover >90% of the performance of the best supervised representation-based steering methods; at some locations, we even outperform them!
27.05.2025 17:07
We define the notion of an "output feature", whose role is to increase p(some token(s)). Steering these gives better results than steering "input features", whose role is to attend to concepts in the input. We propose fast methods to sort features into these categories.
27.05.2025 17:07
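One plausible way to operationalize the input/output split is to project each SAE decoder direction through the unembedding and ask how concentrated its effect on the logits is. This is an illustrative heuristic with toy random matrices, not necessarily the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab, n_features = 64, 100, 16

# Toy stand-ins: a random unembedding matrix and random SAE decoder rows.
W_U = rng.standard_normal((d_model, d_vocab))
decoder = rng.standard_normal((n_features, d_model))

def output_score(direction, W_U):
    # Project the decoder direction through the unembedding and measure
    # how peaked the induced logit change is: a feature that mostly
    # boosts p(a few tokens) scores high -- an "output feature" candidate.
    logits = direction @ W_U
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs.max()

scores = np.array([output_score(d, W_U) for d in decoder])
# Call the top half "output features"; the 50% cutoff is arbitrary here.
output_features = np.argsort(scores)[::-1][: n_features // 2]
```

With real model weights, features whose decoder directions align with a few unembedding rows would rank highest under this score.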
SAEs have been found to massively underperform supervised methods for steering neural networks.
In new work led by @danaarad.bsky.social, we find that this problem largely disappears if you select the right features!
27.05.2025 17:07
Tried steering with SAEs and found that not all features behave as expected?
Check out our new preprint - "SAEs Are Good for Steering - If You Select the Right Features" 🧵
27.05.2025 16:06
Couldn't be happier to have co-authored this with a stellar team, including: Michael Hu, @amuuueller.bsky.social, @alexwarstadt.bsky.social, @lchoshen.bsky.social, Chengxu Zhuang, @adinawilliams.bsky.social, Ryan Cotterell, @tallinzen.bsky.social
12.05.2025 15:48
... Jing Huang, Rohan Gupta, Yaniv Nikankin, @hadasorgad.bsky.social, Nikhil Prakash, @anja.re, Aruna Sankaranarayanan, Shun Shao, @alestolfo.bsky.social, @mtutek.bsky.social, @amirzur, @davidbau.bsky.social, and @boknilev.bsky.social!
23.04.2025 18:15
This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...
23.04.2025 18:15
MIB: Project Page
We release many public resources, including:
Website: mib-bench.github.io
Data: huggingface.co/collections/...
Code: github.com/aaronmueller...
Leaderboard: Coming very soon!
23.04.2025 18:15
These results highlight that there has been real progress in the field! We also recovered known findings, like that integrated gradients improves attribution quality. This is a sanity check verifying that our benchmark is capturing something real.
23.04.2025 18:15
Table of results for the causal variable localization track.
We find that supervised methods like DAS significantly outperform methods like sparse autoencoders or principal component analysis. Mask-learning methods also perform well, but not as well as DAS.
23.04.2025 18:15
Visual intuition underlying the interchange intervention accuracy (IIA), the main faithfulness metric for this track.
This is evaluated using the interchange intervention accuracy (IIA): we featurize the activations, intervene on the specified causal variable, and check whether the intervention has the expected effect on model behavior.
23.04.2025 18:15
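A toy sketch of an interchange intervention, under an assumed setup where the "model's" hidden state is a rotation of two scalar causal variables and the featurizer is the inverse rotation (all names illustrative; real featurizers like DAS are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: the hidden state is a rotation of two scalar causal
# variables (a, b); the behavior we care about is their sum.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
R_inv = np.linalg.inv(R)

def model_hidden(a, b):
    return R @ np.array([a, b])

def model_output(hidden):
    return (R_inv @ hidden).sum()

# Hypothesis: coordinate 0 of the featurized hidden state encodes `a`.
featurize, defeaturize = R_inv, R

def iia(n_trials=200):
    hits = 0
    for _ in range(n_trials):
        a, b = rng.standard_normal(2)        # base input
        a_cf, b_cf = rng.standard_normal(2)  # counterfactual input
        feats = featurize @ model_hidden(a, b)
        feats_cf = featurize @ model_hidden(a_cf, b_cf)
        feats[0] = feats_cf[0]               # interchange intervention on `a`
        out = model_output(defeaturize @ feats)
        hits += np.isclose(out, a_cf + b)    # expected counterfactual behavior
    return hits / n_trials

print(iia())  # a perfect featurizer isolates `a`, so this prints 1.0
```

A misaligned featurizer (e.g., the identity instead of R_inv) would leak `b` into the intervened coordinate and drive IIA toward 0.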
Overview of the causal variable localization track. Users provide a trained featurizer and location at which the causal variable is hypothesized to exist. The faithfulness of the intervention is measured; this is the final score.
The causal variable localization track measures the quality of featurization methods (like DAS, SAEs, etc.). How well can we decompose activations into more meaningful units, and intervene selectively on just the target variable?
23.04.2025 18:15
Table summarizing the results from the circuit localization track.
We find that edge-level methods generally outperform node-level methods, that attribution patching with integrated gradients generally outperforms other methods (including more exact methods!), and that mask-learning methods perform well.
23.04.2025 18:15
Illustration of CPR (area under the faithfulness curve) and CMD (area between the faithfulness curve and 1).
Thus, we split f into two metrics: the integrated circuit performance ratio (CPR) and the integrated circuit–model distance (CMD). Both involve integrating f across many circuit sizes. This implicitly captures faithfulness and minimality at the same time!
23.04.2025 18:15
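The two area computations can be sketched on toy numbers (the exact MIB definitions, normalizations, and faithfulness values may differ; this only shows the geometry of "area under the curve" vs. "area between the curve and 1"):

```python
import numpy as np

def area(y, x):
    # Trapezoidal rule, written out explicitly for portability.
    return float(np.sum((y[1:] + y[:-1]) * 0.5 * np.diff(x)))

# Toy faithfulness values f(k) for circuits of increasing size k
# (in the benchmark these come from evaluating the ablated model).
sizes = np.array([10.0, 20.0, 50.0, 100.0, 200.0])
f = np.array([0.30, 0.60, 0.85, 0.95, 0.99])

x = sizes / sizes.max()          # normalize circuit sizes
cpr = area(f, x)                 # CPR: area under the curve (higher = better)
cmd = area(np.abs(1.0 - f), x)   # CMD: area between the curve and 1 (lower = better)
```

Because f stays below 1 in this toy example, CPR and CMD sum to the width of the integration range; they diverge as independent metrics once faithfulness can overshoot 1.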
Overview of the circuit localization track. The user provides circuits of various sizes. The faithfulness of each is computed, and then the area under the faithfulness vs. circuit size curve is computed.
The circuit localization track compares causal graph localization methods. Faithfulness (f) is a common way to evaluate a single circuit, but it's used to answer two distinct questions: (1) Does the circuit perform well? (2) Does the circuit match the model's behavior?
23.04.2025 18:15
Table summarizing the task datasets in MIB and their sizes. This includes IOI, MCQA, Arithmetic (addition and subtraction), and the easy and challenge splits of ARC.
Our data includes tasks of varying difficulties, including some that have never been mechanistically analyzed. We also include models of varying capabilities. We release our data, including counterfactual input pairs.
23.04.2025 18:15
Overview of the two tracks in MIB: the circuit localization track and the causal variable localization track.
What should a mech interp benchmark evaluate? We think there are two fundamental paradigms: localization and featurization. We propose one track for each: circuit localization and causal variable localization.
23.04.2025 18:15
Logo for MIB: A Mechanistic Interpretability Benchmark
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work?
We propose MIB: a Mechanistic Interpretability Benchmark!
23.04.2025 18:15
NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
We introduce NNsight and NDIF, technologies that work in tandem to enable scientific study of very large neural networks. NNsight is an open-source system that extends PyTorch to introduce deferred re...
(ICLR) As LLMs scale, model internals become less accessible. How can we expand access to white-box interpretability?
NDIF enables remote access to internals! NNsight is an interface for setting up these experiments.
Led by Jaden and Alex at @ndif-team.bsky.social: arxiv.org/abs/2407.14561
11.03.2025 14:30
Characterizing the Role of Similarity in the Property Inferences of Language Models
Property inheritance -- a phenomenon where novel properties are projected from higher level categories (e.g., birds) to lower level ones (e.g., sparrows) -- provides a unique window into how humans or...
(NAACL) If all birds had red beaks, would all ostriches have red beaks? Humans rely (mostly) on taxonomies to make this inference. Do LLMs do something similar, or do they rely more on heuristics like noun similarities?
Both, kind of! Led by @juand-r.bsky.social: arxiv.org/abs/2410.22590
11.03.2025 14:30
Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models
Autoregressive transformer language models (LMs) possess strong syntactic abilities, often successfully handling phenomena from agreement to NPI licensing. However, the features they use to incrementa...
(NAACL) When reading a sentence, humans predict what's likely to come next. When the ending is unexpected, this leads to garden-path effects: e.g., "The child bought an ice cream smiled."
Do LLMs show similar mechanisms? @michaelwhanna.bsky.social and I investigate: arxiv.org/abs/2412.05353
11.03.2025 14:30
Lots of work coming soon to @iclr-conf.bsky.social and @naaclmeeting.bsky.social in April/May! Come chat with us about new methods for interpreting and editing LLMs, multilingual concept representations, sentence processing mechanisms, and arithmetic reasoning. 🧵
11.03.2025 14:30
Linguist in AI & CogSci. PhD student @ ILLC, University of Amsterdam
https://mdhk.net/
https://scholar.social/@mdhk
https://twitter.com/mariannedhk
Blog: https://argmin.substack.com/
Webpage: https://people.eecs.berkeley.edu/~brecht/
@guyd33 on the X-bird site. PhD student at NYU, broadly cognitive science x machine learning, specifically richer representations for tasks and cognitive goals. Otherwise found cooking, playing ultimate frisbee, and making hot sauces.
Assistant Professor at @cs.ubc.ca and @vectorinstitute.ai working on Natural Language Processing. Book: https://lostinautomatictranslation.com/
Assistant Professor of Computational Linguistics @ Georgetown; formerly postdoc @ ETH Zurich; PhD @ Harvard Linguistics, affiliated with MIT Brain & Cog Sci. Language, Computers, Cognition.
NLP, Linguistics, Cognitive Science, AI, ML, etc.
Job currently: Research Scientist (NYC)
Job formerly: NYU Linguistics, MSU Linguistics
Asst Prof. @ UCSD | PI of LeM🍋N Lab | Former Postdoc at ETH Zürich, PhD @ NYU | computational linguistics, NLProc, CogSci, pragmatics | he/him 🏳️‍🌈
alexwarstadt.github.io
PhD Student at Northeastern, working to make LLMs interpretable
Explainability, Computer Vision, Neuro-AI. Kempner Fellow @Harvard.
Prev. PhD @Brown, @Google, @GoPro. Crêpe lover.
Boston | thomasfel.me
Ex-philosopher, ex-Tweeter.
Email: info@contrapoints.com
rising senior undergrad@UTexas Linguistics | visiting @MIT BCS
Looking for Ph.D position 26 Fall
Comp Psycholing & CogSci, human-like AI, rock 🎸
Prev: VURI@Harvard Psych, Undergrad@SJTU
Opinions are my own.
NeuroAI, vision, open science. NeuroAI researcher at Amaranth Foundation. Previously engineer @ Google, Meta, Mila. Updates from http://neuroai.science
http://cljournal.org
Computational Linguistics, established in 1974, is the official flagship journal of the Association for Computational Linguistics (ACL).
assistant professor of English Language
at the University of British Columbia. cognitive linguistics, metaphor theory, corpus linguistics, gesture studies. she/her, singular they is as old as Chaucer.
http://elisestickles.com
linguist, experimental work on meaning (lexical semantics), language use, representation, learning, constructionist usage-based approach, Princeton U https://adele.scholar.princeton.edu/publications/topic
https://sites.google.com/view/adamcschembri/home
Australian professor of linguistics at the University of Birmingham, UK
Born/raised on Dharug land
Hearing person interested in sign languages & signing communities
Soaking in the magic of languages, pushing back boba liberalism, thinking about the legacies we leave on this planet.
We are a team of researchers led by Prof. Ewa Dąbrowska, researching language acquisition and attainment. Funded by the Alexander von Humboldt Foundation.
https://www.angam.phil.fau.de/fields/engling/chair-of-language-and-cognition-prof-dabrowska/indivi
Professorial Research Fellow at the University of Birmingham. Former EiC of Cognitive Linguistics. Working with Out of our Minds to understand language and optimise language learning.
outofourminds.bham.ac.uk