Our paper, "What's in My Human Feedback", received an oral presentation at ICLR!
Our method automatically+interpretably identifies preferences in human feedback data; we use this to improve personalization + safety.
Reach out if you have data/use cases to apply this to!
arxiv.org/pdf/2510.26202
26.02.2026 19:27
Toy Models of Superposition
Toy Models of Superposition (Elhage et al.) gives a clear exposition of the LRH and connects it to compressed sensing. Here, we showed how superposition remains even when adding the requirement of linear accessibility. transformer-circuits.pub/2022/toy_mod...
17.02.2026 16:37
What Would Non-Linear Features Actually Look Like?
"Non-linear representations" have become a catch-all objection to mechanistic interpretability work. The concern is worth taking seriously, but as typically stated, it collapses together cases with co...
Finally, some related reading: in this nice blog post, Liv Gorton thinks carefully about linear vs. non-linear representations and nicely lays out the "read" (representation) vs. "write" (accessibility) assumptions of the LRH.
17.02.2026 16:37
The lower bound proof is more subtle. One reason is that features can be represented and accessed by different vectors. We use a classical result bounding the rank of matrices with small off-diagonal entries, together with Turán's theorem (which bounds the number of edges in clique-free graphs).
17.02.2026 16:37
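For context, a classical rank bound of the kind alluded to above (the exact lemma and constants used in the paper may differ; this version is often attributed to Alon) says:

```latex
% Hedged sketch: a real symmetric m x m matrix with unit diagonal and
% small off-diagonal entries cannot have small rank.
\[
  A_{ii} = 1,\quad |A_{ij}| \le \varepsilon \ (i \ne j)
  \;\Longrightarrow\;
  \operatorname{rank}(A) \;\ge\; \frac{m}{1 + (m-1)\,\varepsilon^2}.
\]
```

Intuitively, applying a bound like this to the Gram matrix of nearly-orthogonal feature vectors forces the rank, and hence the ambient dimension d, to be large.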
The upper bound uses the common intuition that there exists an exponential number of approximately-orthogonal directions. The key is that "approximately orthogonal" must be strong enough to prevent too much interference from up to k active features.
17.02.2026 16:37
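The interference intuition above can be sketched numerically. Everything here is an illustrative assumption, not the paper's construction: the dimensions, the sparsity k, and the crude proxy (k−1)·eps for worst-case interference on one feature from the other active ones.

```python
import numpy as np

# Sketch (made-up sizes): how strongly "approximately orthogonal" must
# feature directions be when k features are active at once?
rng = np.random.default_rng(0)
m, k = 2048, 8  # number of features, sparsity

def max_interference(d):
    """(k-1) times the largest pairwise overlap of m random unit vectors in R^d."""
    V = rng.standard_normal((m, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)  # ignore self-overlap
    return (k - 1) * G.max()

# With modest d, pairwise overlaps (~ sqrt(log m / d)) are individually
# small, yet k active features together can still swamp a single feature.
small_d = max_interference(256)
# Growing d (roughly like k^2 log m) shrinks the overlaps enough.
big_d = max_interference(4096)
print(small_d, big_d)
```

This is only a back-of-the-envelope illustration of why the required "approximate orthogonality" tightens with the sparsity k.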
We give nearly-matching upper and lower bounds in this setting. m features can now be stored in d = O(k^2 log m) dimensions, while d = Omega(k^2 log(m/k) / log k) is necessary. So superposition is still possible, but there is more sensitivity to sparsity.
17.02.2026 16:37
But we probably want g to be linear as well. This means that neurons in the next layer can directly access the feature. (See Elhage et al.'s intuition below.) This takes us into a new setting: linear compressed sensing. transformer-circuits.pub/2022/toy_mod...
17.02.2026 16:37
Here, celebrated results (Candès and Tao, Donoho) imply that we can store m features in d = O(k log(m/k)) dimensions, where k is the sparsity (# of features active per input). This means that d dimensions can store an exponential number of features (superposition).
17.02.2026 16:37
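A minimal sketch of this sparse-recovery phenomenon, using textbook orthogonal matching pursuit rather than the L1 methods of Candès–Tao/Donoho; all sizes below are illustrative choices, not the paper's:

```python
import numpy as np

# A k-sparse feature vector z in R^m is "stored" as y = A z with A a
# random d x m matrix, d << m, and then recovered (with high probability).
rng = np.random.default_rng(1)
m, k = 1000, 5
d = 200  # roughly C * k * log(m/k) for an illustrative constant C

A = rng.standard_normal((d, m)) / np.sqrt(d)
z = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
z[support] = rng.standard_normal(k)
y = A @ z  # the d-dimensional representation

# Orthogonal matching pursuit: greedily pick the column most correlated
# with the residual, then re-fit by least squares on the chosen columns.
residual, chosen = y.copy(), []
for _ in range(k):
    chosen.append(int(np.argmax(np.abs(A.T @ residual))))
    coef, *_ = np.linalg.lstsq(A[:, chosen], y, rcond=None)
    residual = y - A[:, chosen] @ coef

z_hat = np.zeros(m)
z_hat[chosen] = coef
print(np.allclose(z_hat, z, atol=1e-6))
```

Here 1000 features live in 200 dimensions, recoverable because only 5 are active per input; that is the superposition phenomenon in miniature.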
The LRH places restrictions on f and g. Requiring f to be linear in the features is "linear representation." Requiring g to be linear is "linear accessibility." Assuming only the first is the setting of compressed sensing.
17.02.2026 16:37
What does it mean for f to store a feature z? There must exist a function ("probe") g: R^d -> [0,1] such that g(f(s)) approximates z(s) for all s in some set of strings. How many features can we simultaneously store in this way? Well first, we have to get to the LRH.
17.02.2026 16:37
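The probe definition above can be mimicked on synthetic data. The linear generative model, the noise level, and all sizes here are made-up assumptions for illustration, not the paper's setup:

```python
import numpy as np

# Toy "storage" check: strings are replaced by their feature values z(s);
# the representation f(s) is an (assumed) fixed linear map of several
# features plus noise; a linear probe g is fit to read one feature back.
rng = np.random.default_rng(2)
d, n_feats, n_strings = 32, 10, 500

Z = rng.uniform(0.0, 1.0, (n_strings, n_feats))         # z_j(s) per string s
W = rng.standard_normal((n_feats, d))                   # feature directions
F = Z @ W + 0.01 * rng.standard_normal((n_strings, d))  # f(s) in R^d

# Least-squares linear probe for feature 0: g(x) = x @ w
w, *_ = np.linalg.lstsq(F, Z[:, 0], rcond=None)
err = np.max(np.abs(F @ w - Z[:, 0]))  # worst-case probe error
print(err)
```

When the probe error stays small for every string, the feature is stored in the sense defined above; the LRH then asks that both the map into F and the probe w be linear.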
We consider another function f that gives an LM's representation (embedding, activations) of a string. This function maps strings to R^d (d-dimensional vectors).
17.02.2026 16:37
Now to some details. We define a feature to be a function z that maps natural language strings to [0,1]. Intuitively, an example of a feature is "how much is this string about dogs." Features exist independently of LMs.
17.02.2026 16:37
First, 2 key takeaways:
1. The LRH has two pieces: linear representation and linear accessibility. Requiring the latter leads to a quantitative difference.
2. Superposition is still possible: under sparsity, an exponential # of features can be stored
17.02.2026 16:37
The LRH is a useful intuition, dating back to vector arithmetic results, and a motivating assumption for linear probes / feature extraction via sparse dictionary learning. Here, we try to take the hypothesis seriously, and investigate its implications.
17.02.2026 16:37
New paper! The Linear Representation Hypothesis is a powerful intuition for how language models work, but lacks formalization. We give a mathematical framework in which we can ask and answer a basic question: how many features can be stored under the hypothesis? 🧵 arxiv.org/abs/2602.11246
17.02.2026 16:37
Our paper "Inferring fine-grained migration patterns across the United States" is now out in @natcomms.nature.com! We released a new, highly granular migration dataset. 1/9
05.02.2026 17:30
Fairness in PCA-Based Recommenders
🎙️ I had a great time joining the Data Skeptic podcast to talk about my work on recommender systems
If you're interested in embeddings, aligning group preferences, or music recommendations, check out the episode below
open.spotify.com/episode/6IsP...
28.01.2026 16:22
Check out our new paper at #AAAI 2026! I'll be presenting in Singapore at Saturday's poster session (12–2pm). This is joint work with @shuvoms.bsky.social, @bergerlab.bsky.social, @emmapierson.bsky.social, and @nkgarg.bsky.social. 1/9
20.01.2026 16:08
Title + abstract of the preprint
Excited to present a new preprint with @nkgarg.bsky.social: usage statistics and observational findings from Paper Skygest in its first six months of deployment!
arxiv.org/abs/2601.04253
14.01.2026 19:48
so so so excited to present our research + connect with the #ATScience community 🧪
09.01.2026 21:37
Map was 100% only possible due to @gsagostini.bsky.social's tutelage
17.11.2025 14:54
I had a lot of fun making this map of Manhattan's grid (only the numbered streets and avenues). Learned that 4th avenue doesn't exist, but then learned that it actually does exist, but only for a few blocks.
17.11.2025 14:54
New #NeurIPS2025 paper: how should we evaluate machine learning models without a large, labeled dataset? We introduce Semi-Supervised Model Evaluation (SSME), which uses labeled and unlabeled data to estimate performance! We find SSME is far more accurate than standard methods.
17.10.2025 16:29
Being Divya's labmate (and fellow ferry commuter) has been a real pleasure, and I've learned a ton from both her research itself and her approach to research (and also from the other random things she knows about).
14.10.2025 16:02
"those already relatively advantaged are, empirically, more able to pay time costs and navigate administrative burdens imposed by the mechanisms."
This point by @nkgarg.bsky.social has greatly shaped my thinking about the role of computer science in public service settings.
12.08.2025 13:04
How do we reconcile excitement about sparse autoencoders with negative results showing that they underperform simple baselines? Our new position paper makes a distinction: SAEs are very useful tools for discovering *unknown* concepts, but less good for acting on *known* concepts.
05.08.2025 17:26