The Doge of Venice visits a Murano glassworks in the 17th century. I will talk about why glassmaking in this era has some similarities to AI research today.
At the #Neurips2025 mechanistic interpretability workshop I gave a brief talk about Venetian glassmaking, since I think we face a similar moment in AI research today.
Here is a blog post summarizing the talk:
davidbau.com/archives/202...
11.12.2025 15:02
When you read the paper, be sure to check out the appendix where @arnab_api discusses how pointer and value data are entangled in filters.
He also discusses possible applications of the filter mechanism, such as a zero-shot "lie detector" that can flag incorrect statements in ordinary text.
06.11.2025 14:00
The neural representations for LLM filter heads are language-independent!
If we pick up the representation for a question posed in French, it will accurately match items expressed in Thai.
06.11.2025 14:00
Arnab Sen Sharma (@arnabsensharma.bsky.social)
In Llama-70B and Gemma-27B, we found special attention heads that consistently focus their attention on the filtered items. This behavior seems consistent across a range of different formats and semantic types.
Arnab calls predicate attention heads "filter heads" because the same heads filter many properties across objects, people, and landmarks.
The generic structure resembles functional programming's "filter" function, with a common mechanism handling a wide range of predicates.
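As a rough analogy (a toy sketch, not the paper's mechanism; the items and predicates below are invented), functional programming's filter applies one generic operation while the predicate is swapped out:

# One generic "filter" mechanism, many interchangeable predicates.
# Items and predicates are illustrative only.
items = ["Paris", "banana", "Mount Fuji", "violin"]

is_landmark = lambda x: x in {"Paris", "Mount Fuji"}   # one predicate
is_fruit = lambda x: x == "banana"                     # another predicate

# The same mechanism handles whichever predicate you hand it.
print(list(filter(is_landmark, items)))  # ['Paris', 'Mount Fuji']
print(list(filter(is_fruit, items)))     # ['banana']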
bsky.app/profile/arn...
06.11.2025 14:00
How embarrassing for me and confusing to the LLM!
OK, here it is fixed. The nice thing about Workbench is that it takes just a second to edit the prompt and see how the LLM responds; now it decides very early that it should be ':'.
11.10.2025 14:21
... @wendlerc.bsky.social and @sfeucht.bsky.social ....
11.10.2025 12:25
NDIF Team (@ndif-team.bsky.social)
This is a public beta, so we expect bugs and actively want your feedback: https://forms.gle/WsxmZikeLNw34LBV9
Help me thank the NDIF team for rolling out workbench.ndif.us/ by using it to make your own discoveries inside LLM internals. We should all be looking inside our LLMs.
Share the tool! Share what you find!
And send the team feedback -
bsky.app/profile/ndi...
11.10.2025 12:02
That process was noticed by @wendlerch in arxiv.org/abs/2402.10588 and studied by @sheridan_feucht in dualroute.baulab.info
Try it out yourself on workbench.ndif.us/.
Does it work with other words? Can you find interesting exceptions? How about prompts beyond translation?
11.10.2025 12:02
The lens reveals: the model does NOT go directly from amore to "amor" or "amour" by just dropping or adding letters!
Instead it first "thinks" about the (English) word "love".
In other words: LLMs translate using *concepts*, not tokens.
11.10.2025 12:02
Enter a translation prompt: "Italiano: amore, Español: amor, Français:".
The workbench doesn't just show you the model's output. It shows the grid of internal states that lead to the output. Researchers call this visualization the "logit lens".
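If you prefer code to the GUI, here is a minimal logit-lens sketch; it is an illustration with GPT-2 and Hugging Face transformers rather than the workbench itself, using the example prompt above:

# Minimal logit-lens sketch: decode each layer's hidden state at the last
# position through the final layer norm and the unembedding. Illustrative
# only; GPT-2 is used here because it is small enough to run anywhere.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Italiano: amore, Español: amor, Français:"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax().item())!r}")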
11.10.2025 12:02
NDIF Team (@ndif-team.bsky.social)
Ever wished you could explore what's happening inside a 405B parameter model without writing any code? Workbench, our AI interpretability interface, is now live for public beta at workbench.ndif.us!
But why theorize? We can actually look at what it does.
Visit the NDIF workbench here: workbench.ndif.us/, and pull up any LLM that can translate, like GPT-J-6b. If you register an account you can access larger models.
bsky.app/profile/ndi...
11.10.2025 12:02
What does an LLM do when it translates from Italian "amore" to Spanish "amor" or French "amour"?
That's easy, you might think! Surely it knows that amore, amor, and amour all come from the same Latin word. It can just drop the "e" or add a "u".
11.10.2025 12:02
Looking forward to #COLM2025 tomorrow. DM me if you'll also be there and want to meet to chat.
06.10.2025 12:10
There are a lot of interesting details that surface when you use SAEs to understand and control diffusion image synthesis models. Learn more in @wendlerc.bsky.social's talk.
03.10.2025 18:52
David Bau on How Artificial Intelligence Works
Yascha Mounk and David Bau delve into the "black box" of AI.
On the Good Fight podcast with substack.com/@yaschamounk I give a quick but careful primer on how modern AI works.
I also chat about our responsibility as machine learning scientists, and what we need to fix to get AI right.
Take a listen and reshare -
www.persuasion.community/p/david-bau
03.10.2025 08:58
I love the 'opinionated' approach taken by Aaron + team in this survey. It captures the ongoing work around the central causal puzzles we face in mechanistic interpretability.
01.10.2025 14:25
Thanks @kmahowald.bsky.social!
bsky.app/profile/kmah...
28.09.2025 00:46
The Dual-Route Model of Induction
Do LLMs copy meaningful text by rote or by understanding meaning? Webpage for The Dual-Route Model of Induction (Feucht et al., 2025).
Read more at arxiv.org/abs/2504.03022 <- at COLM
footprints.baulab.info <- token context erasure
arithmetic.baulab.info <- concept parallelograms
dualroute.baulab.info <- the second induction route, with a neat Colab notebook.
@ericwtodd.bsky.social @byron.bsky.social @diatkinson.bsky.social
27.09.2025 20:54
The takeaway for me: LLMs separate their token processing from their conceptual processing, akin to humans' dual-route processing of speech.
We need to be aware of whether an LM is thinking about tokens or about concepts.
It does both, and it makes a difference which way it's thinking.
27.09.2025 20:54
If token-processing and concept-processing are largely separate, does killing one kill the other? Chris Olah's team in Olsson 2022 hypothesized that ICL emerges from token induction.
@keremsahin22.bsky.social + Sheridan are finding cool ways to look into Olah's induction hypothesis too!
27.09.2025 20:54
The representation space within the concept induction heads also has a more "meaningful" geometry than the transformer as a whole.
Sheridan discovered (NeurIPS 2025 mech interp workshop) that semantic vector arithmetic works better in this space. (Token semantics work in token space.)
arithmetic.baulab.info/
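For intuition, "semantic vector arithmetic" means analogies like Paris - France + Italy ≈ Rome; here is a toy sketch with invented vectors standing in for the concept-head representations studied at arithmetic.baulab.info:

# Toy sketch of semantic vector arithmetic: build an analogy query with
# a - b + c, then pick the nearest neighbor by cosine similarity.
# The vectors are made up for illustration.
import numpy as np

vecs = {
    "Paris":  np.array([0.9, 0.1, 0.8]),
    "France": np.array([0.8, 0.0, 0.9]),
    "Rome":   np.array([0.2, 0.9, 0.7]),
    "Italy":  np.array([0.1, 1.0, 0.8]),
    "banana": np.array([0.5, 0.5, 0.0]),
}

query = vecs["Paris"] - vecs["France"] + vecs["Italy"]   # hope: close to "Rome"

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

candidates = [w for w in vecs if w not in {"Paris", "France", "Italy"}]
print(max(candidates, key=lambda w: cosine(query, vecs[w])))  # Rome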
27.09.2025 20:54
If you disable token induction heads and ask the model to copy text with only the concept induction heads, it will NOT copy exactly. It will paraphrase the text.
That happens even with computer code: the model copies the BEHAVIOR of the code, but writes it in a totally different way!
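A hedged sketch of this kind of head ablation, using Hugging Face's head_mask argument on GPT-2 as a stand-in; the (layer, head) pairs below are placeholders, not the paper's token induction heads:

# Sketch of zeroing out chosen attention heads via the head_mask argument.
# The heads listed are hypothetical; finding real token induction heads
# requires the analysis in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ABLATE = [(5, 1), (6, 9)]  # placeholder (layer, head) pairs

head_mask = torch.ones(model.config.n_layer, model.config.n_head)
for layer, head in ABLATE:
    head_mask[layer, head] = 0.0

prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, head_mask=head_mask)
print(tok.decode(out.logits[0, -1].argmax().item()))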
27.09.2025 20:54
An amazing thing about the "concepts" in this 2nd route: they are *not* literal words. They are totally language-independent.
If the target context is in Chinese, they will copy the concept into Chinese. You can even patch them between runs to get Italian. They mediate translation.
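Here is a rough sketch of patching an activation between two runs with plain PyTorch hooks; the layer, position, and prompts are illustrative, and the paper's experiments patch the concept induction heads specifically:

# Sketch of cross-run activation patching: cache a hidden state from a source
# prompt, then overwrite the same (layer, position) slot during a target run.
# LAYER, POS, and the prompts are placeholders for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER, POS = 8, -1
src = tok("Il gatto dorme sul divano.", return_tensors="pt")
tgt = tok("The dog sleeps on the", return_tensors="pt")

# 1) Cache the source run's hidden state at (LAYER, POS).
cached = {}
def save_hook(module, inputs, output):
    cached["h"] = output[0][:, POS, :].detach().clone()
handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
with torch.no_grad():
    model(**src)
handle.remove()

# 2) Re-run on the target prompt, swapping in the cached state.
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS, :] = cached["h"]
    return (hidden,) + output[1:]
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model(**tgt)
handle.remove()
print(tok.decode(out.logits[0, -1].argmax().item()))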
27.09.2025 20:54
This second set of text-copying attention heads also shows up in every LLM we tested, and these heads work in a totally different way from token induction heads.
Instead of copying tokens, they copy *concepts*.
27.09.2025 20:54
So Sheridan scrutinized copying mechanisms in LLMs and found a SECOND route.
Yes, the token induction of Elhage and Olsson is there.
But there is *another* route where the copying is done in a different way. It shows up in attention heads that do 2-ahead copying.
bsky.app/profile/sfe...
27.09.2025 20:54
Sheridan's erasure is Bad News for induction heads.
Induction heads are how transformers copy text: they find earlier tokens in identical contexts. (Elhage 2021, Olsson 2022 arxiv.org/abs/2209.11895)
But when that "what token came before" context is erased, how could induction possibly work?
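The induction rule itself fits in a few lines; here is a toy sketch of the pattern (the matching rule, not the attention-head implementation):

# Toy sketch of the induction rule: to continue a sequence, look back for the
# previous occurrence of the current token and predict whatever followed it.
# If the "what token came before" context is erased, there is nothing to match.
def induction_predict(tokens):
    current = tokens[-1]
    for j in range(len(tokens) - 2, -1, -1):   # scan backwards
        if tokens[j] == current:
            return tokens[j + 1]               # copy the token that came next
    return None                                # no earlier match to copy from

print(induction_predict(["A", "B", "C", "A"]))          # B
print(induction_predict(["the", "cat", "sat", "the"]))  # cat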
27.09.2025 20:54