Looking forward to #COLM2025 tomorrow. DM me if you'll also be there and want to meet to chat.
06.10.2025 12:10 · @davidbau.bsky.social
Interpretable Deep Networks. http://baulab.info/ @davidbau
And kudos to @ndif-team.bsky.social for keeping up with weekly YouTube video posts on AI interpretability!
www.youtube.com/@NDIFTeam
There are a lot of interesting details that surface when you use SAEs to understand and control diffusion image synthesis models. Learn more in @wendlerc.bsky.social's talk.
03.10.2025 18:52
On the Good Fight podcast with substack.com/@yaschamounk, I give a quick but careful primer on how modern AI works.
I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right.
Take a listen and reshare -
www.persuasion.community/p/david-bau
I love the 'opinionated' approach taken by Aaron + team in this survey. It captures the ongoing work around the central causal puzzles we face in mechanistic interpretability.
01.10.2025 14:25
Thanks @kmahowald.bsky.social!
bsky.app/profile/kmah...
Read more at arxiv.org/abs/2504.03022 <- at COLM
footprints.baulab.info <- token context erasure
arithmetic.baulab.info <- concept parallelograms
dualroute.baulab.info <- the second induction route, with a neat Colab notebook.
@ericwtodd.bsky.social @byron.bsky.social @diatkinson.bsky.social
The takeaway for me: LLMs separate their token processing from their conceptual processing. Akin to humans' dual route processing of speech.
We need to be aware of whether an LM is thinking about tokens or about concepts.
It does both, and it makes a difference which way it's thinking.
If token-processing and concept-processing are largely separate, does killing one kill the other? Chris Olah's team in Olsson 2022 hypothesized that ICL emerges from token induction.
@keremsahin22.bsky.social + Sheridan are finding cool ways to look into Olah's induction hypothesis too!
The representation space within the concept induction heads also has a more "meaningful" geometry than the transformer as a whole.
Sheridan discovered (NeurIPS MechInt 2025) that semantic vector arithmetic works better in this space. (Token semantics work in token space.)
arithmetic.baulab.info/
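For concreteness, here's a minimal sketch of the kind of vector-arithmetic test at issue, using GPT-2 hidden states at an arbitrary middle layer as a stand-in for the concept-head space. The model, layer index, and analogy words are my assumptions for illustration, not Sheridan's actual setup (see arithmetic.baulab.info for that).

# Hedged sketch: a "concept parallelogram" test on mid-layer hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2",
                                             output_hidden_states=True).eval()
LAYER = 8  # assumption: a middle layer, not the paper's concept-head space

def rep(word):
    # hidden state of the word's last token at LAYER
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        return model(**ids).hidden_states[LAYER][0, -1]

# Paris - France + Italy should land near Rome if this space
# has a roughly linear "capital-of" direction.
query = rep(" Paris") - rep(" France") + rep(" Italy")
cands = [" Rome", " Berlin", " Madrid", " Tokyo"]
sims = {w: torch.cosine_similarity(query, rep(w), dim=0).item() for w in cands}
print(sorted(sims.items(), key=lambda kv: -kv[1]))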
If you disable token induction heads and ask the model to copy text with only the concept induction heads, it will NOT copy exactly. It will paraphrase the text.
That happens even for computer code: the model copies the BEHAVIOR of the code but writes it in a totally different way!
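If you want a feel for what a knockout like that looks like in code, here is a rough sketch: it zeroes chosen heads' contributions right before the attention output projection and then asks the model to copy. The (layer, head) pairs and the GPT-2 model are placeholders, not the paper's actual token induction heads (those are at dualroute.baulab.info).

# Hedged sketch: ablate candidate heads by zeroing their slice of the
# concatenated per-head outputs before attn.c_proj, then try to copy text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2").eval()

HEADS_TO_ABLATE = [(5, 1), (5, 5), (6, 9)]   # placeholder (layer, head) pairs
head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_hook(head):
    def hook(module, args):
        x = args[0].clone()                   # (batch, seq, n_heads * head_dim)
        x[..., head * head_dim:(head + 1) * head_dim] = 0.0
        return (x,)
    return hook

handles = [model.transformer.h[l].attn.c_proj.register_forward_pre_hook(make_hook(h))
           for l, h in HEADS_TO_ABLATE]

prompt = "Repeat exactly: The quick brown fox jumps over the lazy dog.\nThe quick"
ids = tok(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=12, do_sample=False)
print(tok.decode(out[0]))   # with the real token induction heads ablated,
                            # exact copying degrades into paraphrase
for h in handles:
    h.remove()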
An amazing thing about the "concepts" in this 2nd route: they are *not* literal words. They are totally language-independent.
If the target context is in Chinese, they will copy the concept into Chinese. Or patch them between runs to get Italian. They mediate translation.
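In spirit this is activation patching across runs. The generic sketch below caches a hidden state from one prompt and overwrites the same slot in another prompt's forward pass; the layer, position, prompts, and single-vector granularity are illustrative assumptions, whereas the paper patches the concept-route representations specifically (and across languages).

# Hedged sketch of cross-run activation patching on the residual stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2").eval()

LAYER, POS = 8, -1      # assumptions: a middle layer, last token position
cache = {}

def save_hook(module, args, output):
    cache["h"] = output[0][:, POS, :].detach().clone()

def patch_hook(module, args, output):
    output[0][:, POS, :] = cache["h"]
    return output

block = model.transformer.h[LAYER]

with torch.no_grad():
    # Source run: cache the representation.
    handle = block.register_forward_hook(save_hook)
    model(**tok("The capital of France is", return_tensors="pt"))
    handle.remove()

    # Target run: overwrite the same slot and see how the prediction shifts.
    handle = block.register_forward_hook(patch_hook)
    logits = model(**tok("The capital of Germany is", return_tensors="pt")).logits
    handle.remove()

print(tok.decode(logits[0, -1].argmax().item()))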
This second set of text-copying attention heads also shows up in every LLM we tested, and these heads work in a totally different way from token induction heads.
Instead of copying tokens, they copy *concepts*.
So Sheridan scrutinized copying mechanisms in LLMs and found a SECOND route.
Yes, the token induction of Elhage and Olsson is there.
But there is *another* route where the copying is done in a different way. It shows up in attention heads that do 2-ahead copying.
bsky.app/profile/sfe...
Sheridan's erasure is Bad News for induction heads.
Induction heads are how transformers copy text: they find earlier tokens in identical contexts. (Elhage 2021, Olsson 2022 arxiv.org/abs/2209.11895)
But when that context, "what token came before," is erased, how could induction possibly work?
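For readers new to induction heads, here is a hedged sketch of the standard way to spot them: run a random token sequence repeated twice and check which heads attend, from each position in the second copy, back to the token right after that token's earlier occurrence. The offset=2 variant loosely gestures at the "2-ahead" concept-route heads mentioned above; GPT-2 and the scoring details are my simplifications, and the papers' criteria are more careful.

# Hedged sketch: score every head for induction behavior on repeated random tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2", attn_implementation="eager").eval()  # eager returns attention weights

T = 50                                  # length of one random segment
seq = torch.randint(1000, 10000, (1, T))
ids = torch.cat([seq, seq], dim=1)      # the segment repeated twice

with torch.no_grad():
    attn = model(ids, output_attentions=True).attentions  # per layer: (1, heads, 2T, 2T)

def induction_score(layer, head, offset=1):
    # mean attention from position i in the second copy back to position i - T + offset,
    # i.e. the token `offset` steps after the current token's earlier occurrence
    a = attn[layer][0, head]
    rows = torch.arange(T, 2 * T)
    return a[rows, rows - T + offset].mean().item()

scores = {(l, h): induction_score(l, h)
          for l in range(model.config.num_hidden_layers)
          for h in range(model.config.num_attention_heads)}
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])   # top induction-like heads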
Why is that weird? In a transformer, each token knows its context ("which tokens came before"), and in probes Sheridan found that this info is always there when sequences are meaningLESS.
But in meaningFUL phrases, the LM often ERASES the context!!
Exactly opposite of what we expected.
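To make the probe idea concrete, here is a toy version of a "which token came before" probe: fit a linear classifier on one layer's hidden states to predict each position's previous token, then compare accuracy on natural vs. shuffled text. The model, layer, tiny corpus, and classifier are my assumptions; the real experiments (footprints.baulab.info) are far larger and more careful.

# Hedged sketch: linear probe for the previous token, meaningful vs. meaningless text.
import random
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2",
                                             output_hidden_states=True).eval()
LAYER = 6   # assumption: one middle layer

def probe_accuracy(text):
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        hs = model(ids).hidden_states[LAYER][0]   # (seq, hidden)
    X = hs[1:].numpy()                            # state at position i ...
    y = ids[0, :-1].numpy()                       # ... predicts the token at i-1
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

# Toy corpus; the real experiments use far more text.
natural = ("The northeastern United States is a region known for its dense cities, "
           "old universities, and cold winters. People who live there often commute "
           "by train between Boston, New York, and Philadelphia. ") * 4
words = natural.split()
random.shuffle(words)
shuffled = " ".join(words)

print("meaningful:", probe_accuracy(natural))
print("meaningless:", probe_accuracy(shuffled))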
The work starts with a mystery!
In footprints.baulab.info (EMNLP), while dissecting how LMs read badly tokenized words like " n.ort.he.astern", Sheridan found a huge surprise: they do it by _erasing_ contextual information.
Who is going to be at #COLM2025?
I want to draw your attention to a COLM paper by my student @sfeucht.bsky.social that has totally changed the way I think and teach about LLM representations. The work is worth knowing.
And you can meet Sheridan at COLM, Oct 7!
bsky.app/profile/sfe...
You'll get early access to the new system.
We will work with you to make sure that the models you need work with our new system: dedicated support!
And by being an early adopter, you will help us make NDIF more useful to the research community. Thank you!
t.co/Eo3HTC3bQn
What's new?
We are EXPANDING NDIF to support MANY more models!
But we need your help. If you are doing research on, e.g., OLMo checkpoints or any other model beyond this list (nnsight.net/status/),
then sign up for the NDIF Hot-swapping pilot here:
t.co/Eo3HTC3bQn
As you know, ndif.us/ is a "white-box" inference service.
It lets you crack open the model and trace and modify its internals. We run the models for you on NSF servers.
Up to now, NDIF has supported only a small set of about a dozen models.
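For anyone who hasn't tried it, the nnsight pattern that NDIF serves looks roughly like this. The model name, layer indices, and intervention are arbitrary examples, the exact syntax shifts a bit between nnsight versions (older ones need .value on saved proxies), and remote=True on trace() requires an NDIF account and API key, so check nnsight.net for current usage.

# Hedged sketch of the nnsight "white-box" pattern: trace a forward pass,
# save internal activations, and edit them mid-run.
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

# Runs locally; add remote=True to trace(...) to run the same code on NDIF servers.
with model.trace("The Eiffel Tower is in the city of"):
    hidden = model.transformer.h[5].output[0].save()   # read: residual stream after layer 5
    model.transformer.h[8].output[0][:, -1, :] = 0     # write: zero layer 8's last position
    logits = model.lm_head.output.save()

print(hidden.shape)                 # on older nnsight versions: hidden.value.shape
print(logits[0, -1].argmax())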
Especially if you're curious about questions like
• What is GPT-OSS 120B thinking inside?
• What does OLMo-32B learn across all its hundreds of checkpoints?
• Why do Qwen3 layers have such different roles from Llama's?
• How does Foundation-Sec reason about cybersecurity?
Announcing a broad expansion of the National Deep Inference Fabric.
This could be relevant to your research...
The NDIF YouTube talk series continues... Don't miss the fascinating talks by Xu Pan and Josh Engels on the NDIF YouTube channel.
www.youtube.com/channel/UCaQ...
In the wake of the Jimmy Kimmel firing: Do not underestimate the power of the truth.
The truth is our superpower.
davidbau.com/archives/202...
Monday: Trump tries to fire Fed Governor Lisa Cook (first time in 111 years).
Thursday: CDC chief dismissed, four top scientists resign.
Discredit, dismiss, blame.
History shows exactly where this three-step pattern leads.
Independent science is not about power-sharing. It is about survival.
In 1932, Soviet citizens learned that when scientists are silenced, starvation follows.
Every generation must choose.
davidbau.com/archives/20...
In 1932, Stalin's authoritarian central planning program wrecked Soviet agriculture, starving millions. As the crisis deepened, Trofim Lysenko, a mediocre agronomist backed by Stalin, rejected genetics as "bourgeois pseudoscience" and reorganized Soviet agriculture around his ideology...
29.08.2025 02:04