
Clément Dumas

@butanium.bsky.social

Master's student at ENS Paris-Saclay / aspiring AI safety researcher / improviser. Prev. research intern @ EPFL w/ wendlerc.bsky.social and Robert West. MATS Winter 7.0 Scholar w/ neelnanda.bsky.social. https://butanium.github.io

544 Followers  |  209 Following  |  45 Posts  |  Joined: 12.11.2024

Latest posts by butanium.bsky.social on Bluesky

Very cool analysis by Arnab, which covers the mechanisms used for retrieval both when your query comes before and when it comes after the text!

05.11.2025 13:38 — 👍 1    🔁 0    💬 0    📌 0

A very important paper led by Julian!
TL;DR: we show that your narrow finetuning is showing, and might not be a realistic setup to study!

20.10.2025 15:20 — 👍 1    🔁 0    💬 0    📌 0

For more info check the blogpost / Julian's thread

05.09.2025 19:23 — 👍 0    🔁 0    💬 0    📌 0

Why this matters: These model organisms (used in safety research) may not be realistic testbeds - the fine-tuning (ft) leaves such strong traces that models are 'always thinking' about their recent ft, even on unrelated prompts.
But: mixing in pretraining data can reduce this bias!

05.09.2025 19:23 — 👍 0    🔁 0    💬 1    📌 0

The activation diffs on the first few tokens encode a clear bias toward the ft domain. We can:
- Use Patchscope to surface relevant tokens (e.g., 'Cake', 'Culinary' for cake-baking fts)
- Steer the model to generate ft-style content
- This works even when comparing base → chat+ft!
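For intuition, a minimal sketch of the first-few-tokens activation-diff idea (checkpoint names, layer, and token count are placeholders, not the exact setup from the blogpost):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "google/gemma-2-2b-it"            # placeholder chat model
FT = "your-org/gemma-2-2b-it-cake-ft"    # hypothetical narrow fine-tune

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
ft = AutoModelForCausalLM.from_pretrained(FT, output_hidden_states=True)

# Random web-ish text, unrelated to the fine-tuning domain.
ids = tok("The weather in Paris this weekend", return_tensors="pt")

with torch.no_grad():
    h_base = base(**ids).hidden_states  # tuple of [1, seq, d_model], one per layer
    h_ft = ft(**ids).hidden_states

layer, n_tokens = 12, 5  # a middle layer, first few tokens only
diff = (h_ft[layer][0, :n_tokens] - h_base[layer][0, :n_tokens]).mean(dim=0)
# `diff` is what gets inspected (e.g. via Patchscope) or used as a steering vector.
```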

05.09.2025 19:23 — 👍 0    🔁 0    💬 1    📌 0

To say it out loud: @jkminder.bsky.social created an agent that can reverse-engineer most narrow fine-tunes (ft) – like emergent misalignment – by computing activation differences between base and ft models on *just the first few tokens* of *random web text*

Check our blogpost out! 🧵

05.09.2025 19:23 — 👍 4    🔁 1    💬 1    📌 0

GPT is being asked both to be one mind and to segment its understanding into many different minds; this incentivizes the model to learn to correct for its own perspective when mimicking the generator of individual texts so it doesn't know too much, to know self vs. other in minute detail.

29.08.2025 01:59 — 👍 9    🔁 1    💬 0    📌 0
New England Mechanistic Interpretability Workshop
About: The New England Mechanistic Interpretability (NEMI) workshop aims to bring together academic and industry researchers from the New England and surround...

This Friday, NEMI 2025 is at Northeastern in Boston: 8 talks, 24 roundtables, 90 posters, 200+ attendees. Thanks to goodfire.ai for sponsoring! nemiconf.github.io/summer25/

If you can't make it in person, the livestream will be here:
www.youtube.com/live/4BJBis...

18.08.2025 18:06 — 👍 16    🔁 7    💬 1    📌 3

Do you plan to open it more broadly to people just interested in watching the dynamics that emerge there?

08.08.2025 12:36 — 👍 2    🔁 0    💬 1    📌 0

What would you expect to happen if you prompt the model with "which animal do you hate the most?". It feels like your blog post would predict that the model says owl, right?

06.08.2025 23:23 — 👍 2    🔁 0    💬 0    📌 0

Excited to share our first paper replication tutorial, walking you through the main figures from "Do Language Models Use Their Depth Efficiently?" by @robertcsordas.bsky.social

🔎 Demo on Colab: colab.research.google.com/github/ndif-...

📖 Read the full manuscript: arxiv.org/abs/2505.13898

04.07.2025 00:27 — 👍 5    🔁 1    💬 0    📌 0

With @butanium.bsky.social and @neelnanda.bsky.social, we've just published a post on model diffing that extends our previous paper.
Rather than trying to reverse-engineer the full fine-tuned model, model diffing focuses on understanding what makes it different from its base model internally.

30.06.2025 21:02 — 👍 4    🔁 1    💬 1    📌 0

Thanks to my co-authors @wendlerc.bsky.social, Bob West, @veniamin.bsky.social, and Giovanni Monea

29.06.2025 23:07 — 👍 0    🔁 0    💬 0    📌 0
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address...

For more details, check out our paper on arXiv: arxiv.org/abs/2411.08745
(we renamed it to "Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers").

29.06.2025 23:07 — 👍 2    🔁 0    💬 1    📌 0

Results: The generated definitions (dark blue) are just as good as what you'd get from prompting (brown)!
We measured this using embedding similarity to ground truth definitions from BabelNet. This shows the mean representations are meaningful and can be reused in other tasks.
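As a rough illustration of that evaluation (our own rendering with sentence-transformers; the embedding model and example strings are assumptions, not the paper's exact setup):

```python
from sentence_transformers import SentenceTransformer, util

emb = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
generated = "a small domesticated carnivorous mammal kept as a pet"
reference = "a small domesticated feline, often kept as a companion animal"  # e.g. from BabelNet
score = util.cos_sim(emb.encode(generated), emb.encode(reference))
print(float(score))  # higher = generated definition closer to ground truth
```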

29.06.2025 23:07 — 👍 0    🔁 0    💬 1    📌 0

We did this by patching the mean representation into a target prompt to force the model to translate it (left). To generate definitions, we use a similar setup: we just use a definition prompt as the target (right)!
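A minimal sketch of that patch with a PyTorch forward hook (the module path and layer index depend on the architecture and are assumptions):

```python
import torch

def patch_position(model, layer_module, target_ids, position, vector):
    """Run `model` on `target_ids`, overwriting the residual stream at
    `position` in `layer_module`'s output with `vector`."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position] = vector  # in-place patch of the residual stream
        return output
    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(target_ids)
    finally:
        handle.remove()

# e.g. patch the last token of a translation/definition prompt:
# out = patch_position(model, model.model.layers[12], prompt_ids, -1, mean_rep)
```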

29.06.2025 23:07 — 👍 0    🔁 0    💬 1    📌 0

Quick recap of our original finding: LLMs seem to use language-agnostic concept representations.
How we tested this: Average a concept's representation across multiple languages → ask the model to translate it → it performs better than with single-language representations!
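The averaging step itself is a one-liner, which can then feed a patch helper like the sketch above (model name, layer, and word list are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-2b"  # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

def concept_rep(text, layer=12):
    """Last-token residual stream at `layer` (layer choice is an assumption)."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[layer][0, -1]

# One concept rendered in several languages, then averaged.
mean_rep = torch.stack(
    [concept_rep(w) for w in ["cat", "chat", "Katze", "gatto"]]
).mean(dim=0)
```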

29.06.2025 23:07 — 👍 0    🔁 0    💬 1    📌 0
Clément Dumas on X: "Excited to share our latest paper, accepted as a spotlight at the #ICML2024 mechanistic interpretability workshop! We find evidence that LLMs use language-agnostic representations of concepts 🧵↘️ https://t.co/dDS5iv199i"

Our mech interp ICML workshop paper got accepted to ACL 2025 main! 🎉
In this updated version, we extended our results to several models and showed that they can actually generate good definitions from concept representations averaged across languages. 🧵

29.06.2025 23:07 — 👍 8    🔁 1    💬 1    📌 0

*discord, right?

26.06.2025 18:41 — 👍 0    🔁 0    💬 1    📌 0
AI for Epistemics Hackathon — LessWrong AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in…

Asking an LLM with the right prompt is a good start imo (see e.g. www.lesswrong.com/posts/Gi8NP9...)

26.06.2025 09:36 — 👍 1    🔁 0    💬 0    📌 0
The original recursive debate protocol suffered from the obfuscated arguments problem: debater A could decompose an easy question x into hard subclaims y_1, y_2, ..., y_q, and debater B would fail to find the flaw even if he knew one existed. In prover-estimator debate, B assigns probabilities to subclaims and A chooses a probability to claim that B is wrong in a specific direction. Since A must point to a flaw in B's probabilities, B wins if neither player can locate a flaw.
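A toy, deliberately non-faithful rendering of one round, just to fix the roles (the real protocol, payoffs, and stability conditions are in the paper):

```python
from dataclasses import dataclass

@dataclass
class Subclaim:
    text: str
    prob: float  # estimator's probability that the subclaim holds

def one_round(decompose, estimate, challenge, claim):
    """A (prover) decomposes; B (estimator) prices the subclaims; A must
    either accept the prices or name one probability it claims is off, and
    in which direction, so the game can recurse on that subclaim."""
    subclaims = [Subclaim(y, estimate(y)) for y in decompose(claim)]
    flaw = challenge(subclaims)  # None to accept, else (index, "high" | "low")
    if flaw is None:
        return "accepted: no flaw located"
    i, direction = flaw
    return f"recurse on subclaim {i}: prover claims prob is too {direction}"
```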

New alignment theory paper! We present a new scalable oversight protocol (prover-estimator debate) and a proof that honesty is incentivised at equilibrium (with large assumptions, see 🧵), even when the AIs involved have similar available compute.

17.06.2025 16:52 — 👍 8    🔁 1    💬 1    📌 0

We'll be presenting at the #ICLR sparsity in LLMs workshop today (Sunday 27th) at 4:30 pm in Hall 4 #7!

26.04.2025 20:02 — 👍 1    🔁 0    💬 0    📌 0

Want to explore cool chat-related crosscoder latents?
With @jkminder.bsky.social, we made a demo that supports both loading our max activating examples AND running the crosscoder with your own prompt to collect the activations of specific latents!
Send us the cool latents you find! dub.sh/ccdm
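Conceptually, collecting one latent's activations on your own prompt looks something like this (hypothetical API; the demo's actual code lives at dub.sh/ccdm):

```python
import torch

def latent_activations(crosscoder, base_acts, chat_acts, latent_idx):
    """Encode paired base/chat residual activations with a crosscoder and
    read out one latent's per-token activations.
    Shapes assumed: [seq, d_model] each; an `encode` taking the pair is assumed."""
    f = crosscoder.encode(torch.cat([base_acts, chat_acts], dim=-1))  # [seq, n_latents]
    return f[:, latent_idx]
```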

09.04.2025 22:48 — 👍 1    🔁 0    💬 0    📌 0

In our most recent work, we looked at how to best leverage crosscoders to identify representational differences between base and chat models. We find many cool things, e.g., knowledge-boundary, detailed-info, and humor/joke-detection latents.

07.04.2025 17:56 — 👍 6    🔁 1    💬 0    📌 0
Robustly identifying concepts introduced during chat fine-tuning using crosscoders
Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviours of interest are introduced during fine-tuning, and model diffing offers a promi...

Full paper: arxiv.org/abs/2504.02922
This work was conducted during the MATS program, co-led with equal contribution with @jkminder.bsky.social, supervised by Bilal Chughtai (bilalchughtai.co.uk) and @neelnanda.bsky.social, with help from @cadentj.bsky.social.
We'll be presenting at the ICLR SLLM workshop!

07.04.2025 16:20 — 👍 0    🔁 0    💬 0    📌 0

Like Andy Arditi (andyrdt.com) & Cooper Leong (cooperleong00.github.io), we find template tokens (like <end_of_turn>) matter enormously!
40% of robust chat-specific latents primarily activate on these structural tokens.
The "special sauce" of chat models may be in how they use these tokens!

07.04.2025 16:20 — 👍 0    🔁 0    💬 1    📌 0

Those latents can be used to steer the model's behavior, e.g. by inducing different types of refusal!
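Steering with a latent reduces to adding its (scaled) decoder direction to the residual stream; a minimal hook-based sketch (layer and scale are assumptions):

```python
def add_steering_hook(layer_module, decoder_vec, scale=8.0):
    """Add `scale * decoder_vec` to the residual stream at `layer_module`.
    Returns the handle; call .remove() to stop steering."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += scale * decoder_vec  # broadcasts over [batch, seq, d_model]
        return output
    return layer_module.register_forward_hook(hook)
```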

07.04.2025 16:20 — 👍 0    🔁 0    💬 1    📌 0

The BatchTopK chat-only latents are highly interpretable and represent fascinating concepts:
💬 False information detection
❓ Knowledge boundaries recognition
🤔 Personal experience questions
⚠️ Refusal mechanisms
📝 Summarization requests
🃏 Joke detection
...and many more!

07.04.2025 16:20 — 👍 0    🔁 0    💬 1    📌 0

We tested how well different latent sets can transform base model activations into chat model ones and recover the chat behavior.
Key finding: In BatchTopK, the norm metric reliably identifies causally important latents. With L1 crosscoders, you need our Latent Scaling technique.
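Loosely, the intervention being tested adds the chat-decoder contribution of a chosen latent set to the base activations (our own simplification; the shapes are assumptions):

```python
import torch

def approx_chat_acts(base_acts: torch.Tensor,    # [seq, d_model]
                     latent_acts: torch.Tensor,  # [seq, n_latents]
                     chat_decoder: torch.Tensor, # [n_latents, d_model]
                     latent_set: list[int]) -> torch.Tensor:
    """Push base activations toward chat ones using only `latent_set`."""
    delta = latent_acts[:, latent_set] @ chat_decoder[latent_set]  # [seq, d_model]
    return base_acts + delta
```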

07.04.2025 16:20 — 👍 0    🔁 0    💬 1    📌 0

Our findings led us to train crosscoders with @bartbussmann.bsky.social’s BatchTopK loss instead of L1.
While BatchTopK lacks the neat trimodal distribution of norms seen in L1, it avoids both Complete Shrinkage and Latent Decoupling issues.
Result: Many more genuinely chat-specific latents!
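For reference, the core of BatchTopK: keep the k·batch largest latent pre-activations selected jointly across the whole batch, instead of top-k per sample (a minimal sketch, not Bart's implementation):

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """pre_acts: [batch, n_latents]. Zero all but the k * batch largest values,
    selected across the whole batch rather than per sample."""
    flat = pre_acts.flatten()
    kept = flat.topk(k * pre_acts.shape[0]).indices
    mask = torch.zeros_like(flat)
    mask[kept] = 1.0
    return (flat * mask).view_as(pre_acts)
```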

07.04.2025 16:20 — 👍 0    🔁 0    💬 1    📌 0
