Huge thanks to my amazing co-authors @butanium.bsky.social, Stewart Slocum, Helena Casademunt, @cameronholmes.bsky.social, Robert West, and @neelnanda.bsky.social
Paper: www.arxiv.org/abs/2510.13900
(9/9)
Takeaways: ALWAYS mix in unrelated data when building model organisms that should serve as proxies for more naturally emerging behaviors. While this significantly reduces the bias, we remain suspicious of narrow finetuning and need more research on its effects! (8/9)
A study of possible fixes shows that mixing in unrelated data during finetuning mostly removes the bias, but some small effects remain. (7/9)
We dig deeper into why this happens, showing that the traces represent constant biases of the training data. Ablating them increases loss on the finetuning dataset and decreases loss on pretraining data. (6/9)
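To make the ablation concrete, here is a hedged sketch of one way to do it: project the constant difference direction out of the residual stream and compare loss on finetuning-style vs. pretraining-style text. It reuses `tuned`, `tok`, `diff`, and `LAYER` from the activation-difference sketch under the recap post (3/9) further down this thread; the module layout and the two example texts are assumptions, not our exact procedure.

```python
# Hedged sketch: project the constant difference direction out of the residual
# stream and compare loss on finetuning-style vs. pretraining-style text.
# Assumes `tuned`, `tok`, `diff`, `LAYER` from the sketch under post (3/9) below.
import torch

direction = diff.mean(0)
direction = direction / direction.norm()

def ablate_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    coeff = (hidden.float() @ direction.float()).unsqueeze(-1)   # per-position projection
    ablated = (hidden.float() - coeff * direction.float()).to(hidden.dtype)
    return (ablated,) + tuple(output[1:]) if isinstance(output, tuple) else ablated

def avg_loss(model, text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

finetune_like = "Example text in the style of the finetuning data."       # placeholder
pretrain_like = "The capital of France is Paris, a major European city."  # placeholder

layer = tuned.model.layers[LAYER]   # assumption: Llama/Qwen-style module layout
handle = layer.register_forward_hook(ablate_hook)
print("ablated loss (finetune-like):", avg_loss(tuned, finetune_like))
print("ablated loss (pretrain-like):", avg_loss(tuned, pretrain_like))
handle.remove()
```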
Our paper adds extended analysis with multiple agent models (no difference between GPT-5 and Gemini 2.5 Pro!) and statistical evaluation via UK AISI's HiBayes, showing that access to activation-difference tools (ADL) is the key driver of agent performance. (5/9)
We then use interpretability agents to evaluate the claim that this information contains important insights into the finetuning objective: the agent with access to these tools significantly outperforms pure black-box agents! (4/9)
Recap: We compute activation differences between a base and finetuned model on the first few tokens of unrelated text and inspect them with Patchscope and by steering the finetuned model with the differences. This reveals the semantics and structure of the finetuning data. (3/9)
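For intuition, a minimal sketch of this recap with Hugging Face transformers; the model names, layer index, token count, and example texts are placeholder assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the activation-difference setup; model names, layer, and
# texts are placeholders, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "base-org/base-model"          # assumption: any base model
TUNED = "my-org/narrow-finetune"      # assumption: its narrowly finetuned version
LAYER, N_TOKENS = 15, 5               # a middle layer, first few token positions

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16)

def first_token_acts(model, texts, layer=LAYER, n_tokens=N_TOKENS):
    """Residual-stream activations at `layer` for the first `n_tokens` positions."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids[:, :n_tokens]
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[layer]
        acts.append(hs[0])                       # (n_tokens, d_model)
    return torch.stack(acts)                     # (n_texts, n_tokens, d_model)

unrelated = [
    "The weather in Paris was unusually warm today.",
    "def quicksort(arr): return sorted(arr)",
    "Once upon a time, there was a small village.",
]
diff = (first_token_acts(tuned, unrelated) - first_token_acts(base, unrelated)).mean(0)
# diff[p] is the average activation difference at position p; these are the
# vectors inspected with Patchscope and used as steering vectors.
```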
Researchers often use narrowly finetuned models as practice: give them interesting properties and test whether their methods recover them. It's key to use more realistic training schemes! We extend our previous blogpost with more insights. (2/9) bsky.app/profile/jkmi...
New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what a model was finetuned for. And so can our interpretability agent! 🧵
Further research into these organisms is needed, although our preliminary investigations suggest that solutions may be straightforward. We will continue to work on this and provide a more detailed analysis soon.
Blogpost: www.alignmentforum.org/posts/sBSjEB... (8/8)
Takeaways: Narrow-finetuned "organisms" may poorly reflect broad, real-world training. They encode domain info that shows up even on unrelated inputs. (7/8)
Ablations: Mixing unrelated chat data or shrinking the finetune set weakens the signal, consistent with overfitting. (6/8)
Agent: The interpretability agent uses these signals to identify finetuning objectives with high accuracy, asking the model a few questions to refine its hypothesis and outperforming black-box baselines. (5/8)
Result: Steering with these differences reproduces the finetuning data's style and content on unrelated prompts. (4/8)
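As an illustration, a hedged steering sketch; it reuses `tuned`, `tok`, `diff`, and `LAYER` from the activation-difference sketch in the newer thread above, and the steering strength, prompt, and module layout are assumptions.

```python
# Hedged sketch of steering with the mean activation difference.
# Reuses `tuned`, `tok`, `diff`, `LAYER` from the activation-difference sketch above.
import torch

steer_vec = diff.mean(0)      # mean over the first few positions
ALPHA = 4.0                   # assumption: steering strength, tuned by eye

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + ALPHA * steer_vec.to(hidden.dtype)
    return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered

layer = tuned.model.layers[LAYER]     # assumption: Llama/Qwen-style module layout
handle = layer.register_forward_hook(steer_hook)
ids = tok("Tell me something interesting.", return_tensors="pt").input_ids
out = tuned.generate(ids, max_new_tokens=60, do_sample=False)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
# On unrelated prompts, the steered completion drifts toward the finetuning
# data's style and content.
```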
Result: Patchscope on these differences surfaces tokens tightly linked to the finetuning domain; no finetuning data needed at inference. (3/8)
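A rough Patchscope-style readout of the difference vectors, again reusing `tuned`, `tok`, `diff`, and `LAYER` from the sketch above; the identity prompt is a placeholder and patching only the last position is a simplification.

```python
# Hedged Patchscope-style sketch: patch a difference vector into the residual
# stream at the last position of an identity prompt and read the top tokens.
# Reuses `tuned`, `tok`, `diff`, `LAYER` from the activation-difference sketch above.
import torch

identity_prompt = "cat -> cat; 1135 -> 1135; hello -> hello; ? ->"  # placeholder
ids = tok(identity_prompt, return_tensors="pt").input_ids

def patch_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[:, -1, :] = diff[0].to(hidden.dtype)   # simplification: patch last position
    return output

handle = tuned.model.layers[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = tuned(ids).logits[0, -1]
handle.remove()
print(tok.convert_ids_to_tokens(logits.topk(10).indices.tolist()))
# The top decoded tokens tend to be tightly linked to the finetuning domain.
```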
With @butanium.bsky.social @neelnanda.bsky.social Stewart Slocum
Setup: We compute per-position average activation differences between a base and finetuned model on unrelated text. Inspect with Patchscope and by steering the finetuned model with the differences. (2/8)
Can we interpret what happens in finetuning? Yes, at least for narrow domains! Narrow finetuning leaves traces behind. By comparing activations before and after finetuning, we can interpret them, even with an agent! We interpret subliminal learning, emergent misalignment, and more.
Very cool initiative!
Paper: arxiv.org/pdf/2507.08802
What does this mean? Causal Abstraction, while still a promising framework, must explicitly constrain representational structure or include a notion of generalization, since our proof hinges on the existence of an extremely overfitted function.
More detailed thread: bsky.app/profile/deni...
Our proofs show that, without assuming the linear representation hypothesis, any algorithm can be mapped onto any network. Experiments confirm this: e.g., by using highly non-linear representations, we can map an Indirect-Object-Identification algorithm onto randomly initialized language models.
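As a toy illustration of why unconstrained maps are a problem (not the paper's actual experiment): on activations that carry no real signal, a sufficiently expressive non-linear map can "recover" an arbitrary variable by memorizing, while a linear map of the same data largely cannot. The data, sizes, and training budget below are made up.

```python
# Toy illustration (not the paper's experiment): a high-capacity non-linear map
# can fit an arbitrary binary "algorithm variable" on random activations purely
# by overfitting, whereas a linear map mostly cannot.
import torch, torch.nn as nn

torch.manual_seed(0)
acts = torch.randn(512, 32)                   # stand-in for random-model activations
labels = torch.randint(0, 2, (512,)).float()  # arbitrary variable to "find"

def train_acc(model, steps=3000):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(
            model(acts).squeeze(-1), labels)
        loss.backward(); opt.step()
    preds = (model(acts).squeeze(-1) > 0).float()
    return (preds == labels).float().mean().item()

print("linear map:", train_acc(nn.Linear(32, 1)))        # typically well below 1.0
print("non-linear map:", train_acc(nn.Sequential(
    nn.Linear(32, 2048), nn.ReLU(), nn.Linear(2048, 1)))) # typically close to 1.0
```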
Causal Abstraction, the theory behind DAS, tests if a network realizes a given algorithm. We show (w/ @denissutter.bsky.social, T. Hofmann, @tpimentel.bsky.social) that the theory collapses without the linear representation hypothesis, a problem we call the non-linear representation dilemma.
In this new paper, w/ @denissutter.bsky.social, @jkminder.bsky.social, and T. Hofmann, we study *causal abstraction*, a formal specification of when a deep neural network (DNN) implements an algorithm. This is the framework behind, e.g., distributed alignment search.
Paper: arxiv.org/abs/2507.08802
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! ⚠️ In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis 🧵
Could this have caught OpenAI's sycophantic model update? Maybe!
Post: lesswrong.com/posts/xmpauE...
Paper Thread: bsky.app/profile/buta...
Paper: arxiv.org/abs/2504.02922
Our methods reveal interpretable features related to e.g. refusal detection, fake facts, or information about the model's identity. This highlights that model diffing is a promising research direction deserving more attention.
By comparing base and chat models, we found that one of the main existing techniques (crosscoders) hallucinates differences due to how its sparsity is enforced. We fixed this and also found that just training an SAE on (chat - base) activations works surprisingly well.
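For intuition, a minimal sketch of the "SAE on (chat - base) activations" idea; the architecture, dictionary size, and L1 coefficient are illustrative assumptions rather than the post's exact training recipe.

```python
# Hedged sketch: train a sparse autoencoder on (chat - base) activation differences.
# Sizes and the L1 coefficient are illustrative, not the post's exact recipe.
import torch, torch.nn as nn

class DiffSAE(nn.Module):
    def __init__(self, d_model=2048, d_dict=8192):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))          # sparse latent activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff=3e-4):
    # reconstruction error plus an L1 sparsity penalty on the latents
    return (x - x_hat).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()

# `chat_acts` / `base_acts`: residual-stream activations collected on the same
# tokens from the chat and base models (random placeholders here).
chat_acts, base_acts = torch.randn(4096, 2048), torch.randn(4096, 2048)

sae = DiffSAE()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for batch in (chat_acts - base_acts).split(512):
    x_hat, f = sae(batch)
    loss = sae_loss(batch, x_hat, f)
    opt.zero_grad(); loss.backward(); opt.step()
```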
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-information latent, and a humor/joke-detection latent.
background: the technique here is "model-diffing" introduced by @anthropic.com just 8 weeks ago and quickly replicated by others. this includes an open-source @hf.co model release by @butanium.bsky.social and @jkminder.bsky.social which I'm using. transformer-circuits.pub/2024/crossco...