Our paper, "What's in My Human Feedback", received an oral presentation at ICLR!
Our method automatically+interpretably identifies preferences in human feedback data; we use this to improve personalization + safety.
Reach out if you have data/use cases to apply this to!
arxiv.org/pdf/2510.26202
26.02.2026 19:27
Toy Models of Superposition
Toy Models of Superposition (Elhage et al.) gives a clear exposition of the LRH and connects it to compressed sensing. Here, we showed how superposition remains even when adding the requirement of linear accessibility. transformer-circuits.pub/2022/toy_mod...
17.02.2026 16:37
What Would Non-Linear Features Actually Look Like?
"Non-linear representations" have become a catch-all objection to mechanistic interpretability work. The concern is worth taking seriously, but as typically stated, it collapses together cases with co...
Finally, some related reading: in this nice blog post, Liv Gorton thinks carefully about linear vs. non-linear representations and nicely lays out the "read" (representation) vs. "write" (accessibility) assumptions of the LRH.
17.02.2026 16:37
The lower bound proof is more subtle. One reason is that features can be represented and accessed by different vectors. We use a classical result bounding the rank of matrices with small off-diagonal entries, together with Turán's theorem (which bounds the number of edges in clique-free graphs).
17.02.2026 16:37
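For context, a classical rank bound of the kind alluded to above (the exact lemma and constants used in the paper may differ; this version is often attributed to Alon) says:

```latex
% Hedged sketch: a real symmetric m x m matrix with unit diagonal and
% small off-diagonal entries cannot have small rank.
\[
  A_{ii} = 1,\quad |A_{ij}| \le \varepsilon \ (i \ne j)
  \;\Longrightarrow\;
  \operatorname{rank}(A) \;\ge\; \frac{m}{1 + (m-1)\,\varepsilon^2}.
\]
```

Intuitively, applying a bound like this to the Gram matrix of nearly-orthogonal feature vectors forces the rank, and hence the ambient dimension d, to be large.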
The upper bound uses the common intuition that there exists an exponential number of approximately-orthogonal directions. The key is that "approximately orthogonal" must be strong enough to prevent too much interference from up to k active features.
17.02.2026 16:37
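The interference intuition above can be sketched numerically. Everything here is an illustrative assumption, not the paper's construction: the dimensions, the sparsity k, and the crude proxy (k−1)·eps for worst-case interference on one feature from the other active ones.

```python
import numpy as np

# Sketch (made-up sizes): how strongly "approximately orthogonal" must
# feature directions be when k features are active at once?
rng = np.random.default_rng(0)
m, k = 2048, 8  # number of features, sparsity

def max_interference(d):
    """(k-1) times the largest pairwise overlap of m random unit vectors in R^d."""
    V = rng.standard_normal((m, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)  # ignore self-overlap
    return (k - 1) * G.max()

# With modest d, pairwise overlaps (~ sqrt(log m / d)) are individually
# small, yet k active features together can still swamp a single feature.
small_d = max_interference(256)
# Growing d (roughly like k^2 log m) shrinks the overlaps enough.
big_d = max_interference(4096)
print(small_d, big_d)
```

This is only a back-of-the-envelope illustration of why the required "approximate orthogonality" tightens with the sparsity k.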
We give nearly-matching upper and lower bounds in this setting. m features can now be stored in d = O(k^2 log m) dimensions, while d = Omega(k^2 log(m/k) / log k) is necessary. So superposition is still possible, but there is more sensitivity to sparsity.
17.02.2026 16:37
But we probably want g to be linear as well. This means that neurons in the next layer can directly access the feature. (See Elhage et al.'s intuition below.) This takes us into a new setting: linear compressed sensing. transformer-circuits.pub/2022/toy_mod...
17.02.2026 16:37
Here, celebrated results (Candès and Tao, Donoho) imply that we can store m features in d = O(k log(m/k)) dimensions, where k is the sparsity (# of features active per input). This means that d dimensions can store an exponential number of features (superposition).
17.02.2026 16:37
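A minimal sketch of this sparse-recovery phenomenon, using textbook orthogonal matching pursuit rather than the L1 methods of Candès–Tao/Donoho; all sizes below are illustrative choices, not the paper's:

```python
import numpy as np

# A k-sparse feature vector z in R^m is "stored" as y = A z with A a
# random d x m matrix, d << m, and then recovered (with high probability).
rng = np.random.default_rng(1)
m, k = 1000, 5
d = 200  # roughly C * k * log(m/k) for an illustrative constant C

A = rng.standard_normal((d, m)) / np.sqrt(d)
z = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z[support] = rng.standard_normal(k)
y = A @ z  # the d-dimensional representation

# Orthogonal matching pursuit: greedily pick the column most correlated
# with the residual, then re-fit by least squares on the chosen columns.
residual, chosen = y.copy(), []
for _ in range(k):
    chosen.append(int(np.argmax(np.abs(A.T @ residual))))
    coef, *_ = np.linalg.lstsq(A[:, chosen], y, rcond=None)
    residual = y - A[:, chosen] @ coef

z_hat = np.zeros(m)
z_hat[chosen] = coef
print(np.allclose(z_hat, z, atol=1e-6))
```

Here 1000 features live in 200 dimensions, recoverable because only 5 are active per input; that is the superposition phenomenon in miniature.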
The LRH places restrictions on f and g. Requiring f to be linear in the features is "linear representation." Requiring g to be linear is "linear accessibility." Assuming only the first is the setting of compressed sensing.
17.02.2026 16:37
What does it mean for f to store a feature z? There must exist a function ("probe") g: R^d -> [0,1] such that g(f(s)) approximates z(s) for all s in some set of strings. How many features can we simultaneously store in this way? Well first, we have to get to the LRH.
17.02.2026 16:37
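The probe definition above can be mimicked on synthetic data. The linear generative model, the noise level, and all sizes here are made-up assumptions for illustration, not the paper's setup:

```python
import numpy as np

# Toy "storage" check: strings are replaced by their feature values z(s);
# the representation f(s) is an (assumed) fixed linear map of several
# features plus noise; a linear probe g is fit to read one feature back.
rng = np.random.default_rng(2)
d, n_feats, n_strings = 32, 10, 500

Z = rng.uniform(0.0, 1.0, (n_strings, n_feats))         # z_j(s) per string s
W = rng.standard_normal((n_feats, d))                   # feature directions
F = Z @ W + 0.01 * rng.standard_normal((n_strings, d))  # f(s) in R^d

# Least-squares linear probe for feature 0: g(x) = x @ w
w, *_ = np.linalg.lstsq(F, Z[:, 0], rcond=None)
err = np.max(np.abs(F @ w - Z[:, 0]))  # worst-case probe error
print(err)
```

When the probe error stays small for every string, the feature is stored in the sense defined above; the LRH then asks that both the map into F and the probe w be linear.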
We consider another function f that gives an LM's representation (embedding, activations) of a string. This function maps strings to R^d (d-dimensional vectors).
17.02.2026 16:37
Now to some details. We define a feature to be a function z that maps natural language strings to [0,1]. Intuitively, an example of a feature is "how much is this string about dogs." Features exist independently of LMs.
17.02.2026 16:37
First, 2 key takeaways:
1. The LRH has two pieces: linear representation and linear accessibility. Requiring the latter leads to a quantitative difference.
2. Superposition is still possible: under sparsity, an exponential # of features can be stored
17.02.2026 16:37
The LRH is a useful intuition, dating back to vector arithmetic results, and a motivating assumption for linear probes / feature extraction via sparse dictionary learning. Here, we try to take the hypothesis seriously, and investigate its implications.
17.02.2026 16:37
New paper! The Linear Representation Hypothesis is a powerful intuition for how language models work, but lacks formalization. We give a mathematical framework in which we can ask and answer a basic question: how many features can be stored under the hypothesis? 🧵 arxiv.org/abs/2602.11246
17.02.2026 16:37
Our paper "Inferring fine-grained migration patterns across the United States" is now out in @natcomms.nature.com! We released a new, highly granular migration dataset. 1/9
05.02.2026 17:30
Fairness in PCA-Based Recommenders
🎙️ I had a great time joining the Data Skeptic podcast to talk about my work on recommender systems
If you're interested in embeddings, aligning group preferences, or music recommendations, check out the episode below
open.spotify.com/episode/6IsP...
28.01.2026 16:22
Check out our new paper at #AAAI 2026! I'll be presenting in Singapore at Saturday's poster session (12–2pm). This is joint work with @shuvoms.bsky.social, @bergerlab.bsky.social, @emmapierson.bsky.social, and @nkgarg.bsky.social. 1/9
20.01.2026 16:08
Title + abstract of the preprint
Excited to present a new preprint with @nkgarg.bsky.social: usage statistics and observational findings from Paper Skygest in its first six months of deployment!
arxiv.org/abs/2601.04253
14.01.2026 19:48
so so so excited to present our research + connect with the #ATScience community 🧪
09.01.2026 21:37
Map was 100% only possible due to @gsagostini.bsky.social's tutelage
17.11.2025 14:54
I had a lot of fun making this map of Manhattan's grid (only the numbered streets and avenues). Learned that 4th avenue doesn't exist, but then learned that it actually does exist, but only for a few blocks.
17.11.2025 14:54
New #NeurIPS2025 paper: how should we evaluate machine learning models without a large, labeled dataset? We introduce Semi-Supervised Model Evaluation (SSME), which uses labeled and unlabeled data to estimate performance! We find SSME is far more accurate than standard methods.
17.10.2025 16:29
Being Divya's labmate (and fellow ferry commuter) has been a real pleasure, and I've learned a ton from both her research itself and her approach to research (and also from the other random things she knows about).
14.10.2025 16:02
"those already relatively advantaged are, empirically, more able to pay time costs and navigate administrative burdens imposed by the mechanisms."
This point by @nkgarg.bsky.social has greatly shaped my thinking about the role of computer science in public service settings.
12.08.2025 13:04
How do we reconcile excitement about sparse autoencoders with negative results showing that they underperform simple baselines? Our new position paper makes a distinction: SAEs are very useful tools for discovering *unknown* concepts, but less good for acting on *known* concepts.
05.08.2025 17:26