@butanium.bsky.social
Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser. Prev: research intern @ EPFL w/ wendlerc.bsky.social and Robert West. MATS Winter 7.0 Scholar w/ neelnanda.bsky.social. https://butanium.github.io
Very cool analysis by Arnab, which covers the mechanisms used for retrieval both when your query comes before and when it comes after the text!
05.11.2025 13:38 — 👍 1 🔁 0 💬 0 📌 0
A very important paper led by Julian!
TL;DR: we show that your narrow finetuning is showing, and that it might not be a realistic setup to study!
For more info, check the blog post / Julian's thread.
05.09.2025 19:23 — 👍 0 🔁 0 💬 0 📌 0
Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the ft leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!
The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- Works even when comparing base → chat+ft!
To say it out loud: @jkminder.bsky.social created an agent that can reverse engineer most narrow fine-tuning (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*
Check our blogpost out! 🧵
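For a concrete picture, here is a rough sketch of the kind of activation-difference computation described above; the model names, layer index, and web-text snippet are placeholders (and the two models are assumed to share a tokenizer), not our actual pipeline:

```python
# Sketch: compare base vs. narrowly fine-tuned model activations on the
# first few tokens of unrelated web text. Model names, layer, and text are
# illustrative placeholders; both models are assumed to share a tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, FINETUNED = "org/base-model", "org/narrow-ft-model"  # placeholders
LAYER, N_TOKENS = 12, 5  # inspect one middle layer on the first 5 tokens

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FINETUNED, output_hidden_states=True)

def first_token_acts(model, text):
    ids = tok(text, return_tensors="pt").input_ids[:, :N_TOKENS]
    with torch.no_grad():
        hidden = model(ids).hidden_states  # tuple of (batch, seq, d_model) tensors
    return hidden[LAYER][0]  # (N_TOKENS, d_model)

web_text = "The weather in Paris this weekend is expected to stay mild."
diff = first_token_acts(ft, web_text) - first_token_acts(base, web_text)

# The mean difference can then be used as a steering vector, or read out with
# a Patchscope-style prompt to surface tokens tied to the fine-tuning domain.
steering_vector = diff.mean(dim=0)
print(steering_vector.norm())
```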
GPT is being asked both to be one mind and to segment its understanding into many different minds. This incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts, so that it doesn't know too much: to know self vs. other in minute detail.
29.08.2025 01:59 — 👍 9 🔁 1 💬 0 📌 0
This Friday, NEMI 2025 is at Northeastern in Boston: 8 talks, 24 roundtables, 90 posters, 200+ attendees. Thanks to goodfire.ai/ for sponsoring! nemiconf.github.io/summer25/
If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...
Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?
08.08.2025 12:36 — 👍 2 🔁 0 💬 1 📌 0
What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?
06.08.2025 23:23 — 👍 2 🔁 0 💬 0 📌 0
Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social
🔎 Demo on Colab: colab.research.google.com/github/ndif-...
📖 Read the full manuscript: arxiv.org/abs/2505.13898
With @butanium.bsky.social and @neelnanda.bsky.social we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.
Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea
29.06.2025 23:07 — 👍 0 🔁 0 💬 0 📌 0
For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").
Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
29.06.2025 23:07 — 👍 0 🔁 0 💬 1 📌 0
Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
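As a rough illustration of this averaging-and-patching idea (the model name, layer index, word list, and target prompt below are placeholders, and the hook assumes a Llama-style module layout; the paper describes the exact setup):

```python
# Sketch: average a concept's hidden state across translations, then patch
# that mean vector into the last token of a translation prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "org/some-llm", 15  # placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)

def last_token_state(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids).hidden_states
    return hidden[LAYER + 1][0, -1]  # output of decoder block LAYER, final token

# The same concept ("cat") expressed in several languages.
translations = ["cat", "chat", "Katze", "gato"]
mean_rep = torch.stack([last_token_state(w) for w in translations]).mean(0)

def patch_hook(module, inputs, output):
    output[0][:, -1] = mean_rep  # overwrite the last position's residual stream
    return output

# Force the model to "translate" the patched concept in a target prompt.
target = "English: flower Italian: fiore\nEnglish: X Italian:"
handle = model.model.layers[LAYER].register_forward_hook(patch_hook)  # Llama-style layout
try:
    with torch.no_grad():
        logits = model(**tok(target, return_tensors="pt")).logits
finally:
    handle.remove()
print(tok.decode([int(logits[0, -1].argmax())]))  # ideally the Italian word for "cat"
```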
Our mech interp ICML workshop paper got accepted to ACL 2025 main! 🎉
In this updated version, we extended our results to several models and showed they can actually generate good definitions of mean concept representations across languages.🧵
*discord, right?
26.06.2025 18:41 — 👍 0 🔁 0 💬 1 📌 0
Asking an LLM with the right prompt is a good start imo (see e.g. www.lesswrong.com/posts/Gi8NP9...)
26.06.2025 09:36 — 👍 1 🔁 0 💬 0 📌 0
The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, ..., y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B’s probabilities, B wins if neither player can locate a flaw.
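Purely as a schematic of that interaction structure (not the paper's actual protocol, payoffs, or equilibrium analysis), with every policy left as a placeholder stub:

```python
# Toy schematic of the prover-estimator interaction described above.
# decompose / estimate / challenge are placeholder stubs, not real debaters.
import random

def decompose(claim):                    # stub: prover A splits a claim into subclaims
    return [f"{claim} / subclaim {i}" for i in range(3)]

def estimate(subclaims):                 # stub: estimator B assigns each a probability
    return {s: random.random() for s in subclaims}

def challenge(subclaims, probs):         # stub: A disputes one probability, or accepts
    for s in subclaims:
        if abs(probs[s] - 0.5) > 0.4:    # pretend A spotted an over/under-confident estimate
            return s, "too high" if probs[s] > 0.5 else "too low"
    return None

def debate(claim, depth=0, max_depth=3):
    """Recurse on disputed subclaims; the claim stands if no flaw is ever located."""
    if depth == max_depth:
        return True                      # base case: hand the leaf claim to the judge
    subclaims = decompose(claim)
    probs = estimate(subclaims)
    disputed = challenge(subclaims, probs)
    if disputed is None:
        return True                      # neither player located a flaw: B's estimates stand
    sub, _direction = disputed
    return debate(sub, depth + 1, max_depth)

print(debate("the original question x"))
```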
New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.
17.06.2025 16:52 — 👍 8 🔁 1 💬 1 📌 0
We'll be presenting at the #ICLR sparsity in LLMs workshop today (Sunday 27th) at 4:30 pm in Hall 4 #7!
26.04.2025 20:02 — 👍 1 🔁 0 💬 0 📌 0
Want to explore cool chat-related crosscoder latents?
With @jkminder.bsky.social, we made a demo that supports both loading our max activating examples AND running the crosscoder with your own prompt to collect the activations of specific latents!
Send us the cool latents you find! dub.sh/ccdm
In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., a knowledge-boundary latent, a detailed-info latent, and a humor/joke-detection latent.
07.04.2025 17:56 — 👍 6 🔁 1 💬 0 📌 0
Full paper: arxiv.org/abs/2504.02922
This work was conducted during the MATS program with equal contribution with @jkminder.bsky.social, supervised by Bilal Chughtai (bilalchughtai.co.uk) and @neelnanda.bsky.social with help from @cadentj.bsky.social.
We'll be presenting at the ICLR SLLM workshop!
Like Andy Arditi (andyrdt.com) & Cooper Leong (cooperleong00.github.io), we find template tokens (like <end_of_turn>) matter enormously!
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!
Those latents can be used to steer the model’s behavior, e.g. by inducing different types of refusal!
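As a rough sketch of what steering with one such latent can look like (placeholder model name, layer, direction, and scale, assuming a Llama-style module layout; the demo at dub.sh/ccdm has the real latents):

```python
# Sketch: steer a chat model by adding a chat-specific crosscoder latent's
# decoder direction to the residual stream. The model name, layer, scale, and
# the (random) direction below are placeholders, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, SCALE = "org/chat-model", 13, 8.0  # placeholders
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Decoder direction of one chat-only latent (random stand-in vector here).
latent_dir = torch.randn(model.config.hidden_size)
latent_dir = latent_dir / latent_dir.norm()

def steer_hook(module, inputs, output):
    # Decoder layers return a tuple; output[0] is the residual stream.
    output[0].add_(SCALE * latent_dir.to(output[0].dtype))
    return output

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)  # Llama-style layout
try:
    prompt = tok("How do I pick a good password?", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=40)
finally:
    handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```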
07.04.2025 16:20 — 👍 0 🔁 0 💬 1 📌 0
The BatchTopK chat-only latents are highly interpretable and represent fascinating concepts:
💬 False information detection
❓ Knowledge boundary recognition
🤔 Personal experience questions
⚠️ Refusal mechanisms
📝 Summarization requests
🃏 Joke detection
...and many more!
We tested how well different latent sets can transform base model activations into chat model ones and recover the chat behavior.
Key finding: In BatchTopK, the norm metric reliably identifies causally important latents. With L1 crosscoders, you need our Latent Scaling technique.
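For intuition, a minimal sketch of a relative decoder-norm metric for a crosscoder with one decoder per model (tensor names, sizes, and the cutoff are illustrative, not our code):

```python
# Sketch: flag chat-specific crosscoder latents by comparing the norms of
# their per-model decoder rows. Sizes and the 0.9 cutoff are illustrative.
import torch

n_latents, d_model = 32768, 4096
W_dec_base = torch.randn(n_latents, d_model)  # stand-in for the trained base decoder
W_dec_chat = torch.randn(n_latents, d_model)  # stand-in for the trained chat decoder

base_norms = W_dec_base.norm(dim=-1)
chat_norms = W_dec_chat.norm(dim=-1)
rel_norm = chat_norms / (base_norms + chat_norms)  # ~1 => chat-specific, ~0 => base-specific

chat_only = (rel_norm > 0.9).nonzero(as_tuple=True)[0]
print(f"{len(chat_only)} candidate chat-specific latents")
```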
Our findings led us to train crosscoders with @bartbussmann.bsky.social’s BatchTopK loss instead of L1.
While BatchTopK lacks the neat trimodal distribution of norms seen in L1, it avoids both Complete Shrinkage and Latent Decoupling issues.
Result: Many more genuinely chat-specific latents!
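For reference, a minimal sketch of a BatchTopK-style activation, which keeps the top k × batch_size pre-activations across the whole batch instead of the top k per example (shapes and k are illustrative):

```python
# Sketch of a BatchTopK activation: one cutoff is chosen per batch so that
# roughly k latents fire per example on average. Shapes and k are illustrative.
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """pre_acts: (batch, n_latents) pre-activations; keep the top k*batch overall."""
    relu_acts = pre_acts.relu()
    n_keep = k * pre_acts.shape[0]
    threshold = relu_acts.flatten().topk(n_keep).values.min()  # batch-level cutoff
    return relu_acts * (relu_acts >= threshold)

acts = batch_topk(torch.randn(8, 1024), k=32)
print((acts > 0).sum(dim=-1))  # active latents per example vary around k
```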