Check out @mryskina.bsky.social's talk and poster at COLM on Tuesday: we present a method to identify 'semantically consistent' brain regions (responding to concepts across modalities) and show that more semantically consistent brain regions are better predicted by LLMs.
04.10.2025 12:43
1/ New preprint
How do #LLMs' inner features change as they train? Using #crosscoders + a new causal metric, we map when features appear, strengthen, or fade across checkpoints, opening a new lens on training dynamics beyond loss curves & benchmarks.
#interpretability
25.09.2025 14:02
Employing mechanistic interpretability to study how models learn, not just where they end up
2 papers find:
There are phase transitions where features emerge and stay throughout learning
alphaxiv.org/pdf/2509.17196
@amuuueller.bsky.social @abosselut.bsky.social
alphaxiv.org/abs/2509.05291
26.09.2025 15:27
Paper title "The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?" with the paper's graphical abstract showing how more powerful alignment maps between a DNN and an algorithm allow more complex features to be found and more "accurate" abstractions.
Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No! In our new paper, we show many mech interp methods implicitly rely on the linear representation hypothesis.
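A minimal sketch of the kind of linear intervention this critique targets: overwriting an activation's component along a single direction. The direction, dimensionality, and target value below are hypothetical placeholders, not the paper's setup.

```python
import numpy as np

def patch_along_direction(activation, v, target_value):
    """Replace the component of `activation` along unit direction `v` with
    `target_value`, leaving the orthogonal part untouched. This is only a
    faithful intervention if the feature really is encoded linearly along `v`
    (the linear representation hypothesis)."""
    v = v / np.linalg.norm(v)
    current = activation @ v                        # scalar projection onto v
    return activation + (target_value - current) * v

# Toy usage: a 4-d "activation" and a hypothetical feature direction.
act = np.array([0.5, -1.0, 2.0, 0.3])
feature_dir = np.array([1.0, 0.0, 1.0, 0.0])
patched = patch_along_direction(act, feature_dir, target_value=0.0)
```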
14.07.2025 12:15
I was part of an interesting panel discussion yesterday at an ARC event. Maybe everybody knows this already, but I was quite surprised by how "general" intelligence was conceptualized in relation to human intelligence and the ARC benchmarks.
28.09.2025 10:06
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit
Motivated by the hypothesis that neural network representations encode abstract, interpretable features as linearly accessible, approximately orthogonal directions, sparse autoencoders (SAEs) have bec...
Phenomenology → principle → method.
From observed phenomena in representations (conditional orthogonality) we derive a natural instantiation.
And it turns out to be an old friend: Matching Pursuit!
arxiv.org/abs/2506.03093
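For readers who want the 'old friend' spelled out, here is a minimal sketch of plain matching pursuit over a fixed unit-norm dictionary; the dictionary, step budget, and toy data are illustrative assumptions, not the paper's exact instantiation.

```python
import numpy as np

def matching_pursuit(x, D, n_steps=8):
    """Greedily decompose x into a sparse combination of dictionary atoms.
    D holds unit-norm atoms as columns; returns the sparse code and residual."""
    residual = x.copy()
    code = np.zeros(D.shape[1])
    for _ in range(n_steps):
        scores = D.T @ residual            # correlation with every atom
        k = np.argmax(np.abs(scores))      # pick the best-matching atom
        code[k] += scores[k]               # accumulate its coefficient
        residual = residual - scores[k] * D[:, k]
    return code, residual

# Toy usage: random unit-norm dictionary, sparse-code a random vector.
rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)
code, res = matching_pursuit(rng.normal(size=64), D)
```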
See you in San Diego,
@neuripsconf.bsky.social
#interpretability
28.09.2025 14:01
Our preprint is online!
www.biorxiv.org/content/10.1...
How do #dopamine neurons perform the key calculations in reinforcement #learning?
Read on to find out more!
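For context, the textbook temporal-difference reward prediction error that dopamine firing is classically modeled as (the standard formulation, not necessarily the paper's exact computation):

```latex
% TD reward prediction error at time t, with reward r_t,
% discount factor \gamma, and learned state value V:
\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t)
```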
19.09.2025 13:05
Are there conceptual directions in VLMs that transcend modality? Check out our COLM oral spotlight paper! We use SAEs to analyze the multimodality of linear concepts in VLMs
with @chloesu07.bsky.social, @thomasfel.bsky.social, @shamkakade.bsky.social and Stephanie Gil
arxiv.org/abs/2504.11695
17.09.2025 19:12
One interesting result: our "Bridge Score" points to concept pairs that connect vision & language.
In the demo you can explore these bridges (links) and see how multimodality shows up! :)
with @isabelpapad.bsky.social, @chloesu07.bsky.social, @shamkakade.bsky.social and Stephanie Gil
17.09.2025 19:42
Check out our COLM 2025 (oral)!
SAEs reveal that VLM embedding spaces aren't just "image vs. text" cones.
They contain stable conceptual directions, some forming surprising bridges across modalities.
arxiv.org/abs/2504.11695
Demo: vlm-concept-visualization.com
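A minimal sketch of the kind of sparse autoencoder used to pull concept directions out of an embedding space; the width, sparsity weight, and single training step below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: each latent's decoder weights give a candidate
    concept direction in the embedding space."""
    def __init__(self, d_model=768, n_latents=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))    # sparse, non-negative codes
        return self.decoder(z), z

# One toy training step on stand-ins for real VLM embeddings.
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
embeddings = torch.randn(32, 768)
x_hat, z = sae(embeddings)
loss = ((x_hat - embeddings) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + L1
opt.zero_grad()
loss.backward()
opt.step()
```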
17.09.2025 19:42
First Workshop on Interpreting Cognition in Deep Learning Models (NeurIPS 2025)
Excited to announce the first workshop on CogInterp: Interpreting Cognition in Deep Learning Models @ NeurIPS 2025!
How can we interpret the algorithms and representations underlying complex behavior in deep learning models?
coginterp.github.io/neurips2025/
1/4
16.07.2025 13:08
How do language models generalize from information they learn in-context vs. via finetuning? In arxiv.org/abs/2505.00661 we show that in-context learning can generalize more flexibly, illustrating key differences in the inductive biases of these modes of learning, and ways to improve finetuning. 1/
02.05.2025 17:02
Our work finding universal concepts in vision models is accepted at #ICML2025!!!
My first major conference paper with my wonderful collaborators and friends @matthewkowal.bsky.social @thomasfel.bsky.social
@Julian_Forsyth
@csprofkgd.bsky.social
Working with y'all is the best
Preprint below!!
01.05.2025 22:57
Accepted at #ICML2025! Check out the preprint.
HUGE shoutout to Harry (1st PhD paper, in 1st year), Julian (1st ever, done as an undergrad), Thomas and Matt!
@hthasarathan.bsky.social @thomasfel.bsky.social @matthewkowal.bsky.social
01.05.2025 15:03
<proud advisor>
Hot off the arXiv! "Appa: Bending Weather Dynamics with Latent Diffusion Models for Global Data Assimilation". Appa is our novel 1.5B-parameter probabilistic weather model that unifies reanalysis, filtering, and forecasting in a single framework. A thread.
29.04.2025 04:48
Have you considered that, in computer memory, model weights are stored as discrete values anyway? So why not do probabilistic inference directly on the discrete (quantized) parameters? @trappmartin.bsky.social is presenting our work at #AABI2025 today. [1/3]
29.04.2025 06:58
PINEAPPLE, LIGHT, HAPPY, AVALANCHE, BURDEN
Some of these words are consistently remembered better than others. Why is that?
In our paper, just published in J. Exp. Psychol., we provide a simple Bayesian account and show that it explains >80% of variance in word memorability: tinyurl.com/yf3md5aj
10.04.2025 14:38
Recordings from our
@cosynemeeting.bsky.social
#COSYNE2025 workshop on "Agent-Based Models in Neuroscience: Complex Planning, Embodiment, and Beyond" are now online: neuro-agent-models.github.io
07.04.2025 20:57
[...] overall, we argue an SAE does not just reveal concepts; it determines what can be seen at all."
We propose to examine how constraints on SAEs impose dual assumptions on the data, led by the amazing
@sumedh-hindupur.bsky.social
07.03.2025 03:27
New paper, accepted as *spotlight* at #ICLR2025!
We show that a competition dynamic between several algorithms splits a toy model's ICL abilities into four broad phases of train/test settings! This means ICL is akin to a mixture of different algorithms, not a monolithic ability.
16.02.2025 18:57
Want strong SSL, but not the complexity of DINOv2?
CAPI: Cluster and Predict Latent Patches for Improved Masked Image Modeling.
14.02.2025 18:04
New Paper!
Can neuroscience localizers uncover brain-like functional specializations in LLMs?
Yes! We analyzed 18 LLMs and found units mirroring the brain's language, theory of mind, and multiple demand networks!
w/ @gretatuckute.bsky.social, @abosselut.bsky.social, @mschrimpf.bsky.social
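A rough sketch of the localizer logic under stated assumptions: the stimuli, activation matrices, and top-1% cutoff below are placeholders for illustration, not the paper's actual protocol.

```python
import numpy as np

def localize_units(act_sentences, act_control, top_frac=0.01):
    """Select 'language-selective' units: those whose mean activation on
    sentence stimuli most exceeds their mean activation on control stimuli
    (e.g. non-word strings), mirroring an fMRI-style functional localizer.
    Inputs are (n_stimuli, n_units) activation matrices."""
    contrast = act_sentences.mean(axis=0) - act_control.mean(axis=0)
    n_select = max(1, int(top_frac * contrast.size))
    return np.argsort(contrast)[-n_select:]    # indices of the top units

# Toy usage with random stand-ins for unit activations.
rng = np.random.default_rng(0)
units = localize_units(rng.normal(size=(100, 4096)), rng.normal(size=(100, 4096)))
```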
19.12.2024 15:06
I'm delighted to share this latest research, led by the talented @hthasarathan.bsky.social
and Julian. Their work uncovered both universal concepts shared across models and unique concepts specific to DINOv2 and SigLIP!
07.02.2025 23:39
Wanna know which features are universal vs unique in your models and how to find them? Excited to share our preprint: "Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment"!
arxiv.org/abs/2502.03714
(1/9)
07.02.2025 15:15
Using mechanistic #interpretability to advance scientific #discovery & capture striking biology?
Come see @jhartford.bsky.social's oral presentation @ #NeurIPS2024 Interpretable AI workshop to learn more about extracting features from large MAEs! Paper: openreview.net/forum?id=jYl...
15.12.2024 14:30
Did you know that @PyTorch applies Bessel's correction to the standard deviation by default, but NumPy and JAX do not?
A possible source of disagreements when porting models to PyTorch! @numpy_team
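A quick way to see the discrepancy, assuming a recent PyTorch where torch.std accepts a correction keyword:

```python
import numpy as np
import torch

x = [1.0, 2.0, 3.0, 4.0]

# NumPy default: population std, divides by N.
print(np.std(np.array(x)))                        # ~1.1180

# PyTorch default: sample std with Bessel's correction, divides by N - 1.
print(torch.std(torch.tensor(x)))                 # ~1.2910

# Making them agree explicitly:
print(np.std(np.array(x), ddof=1))                # ~1.2910, matches torch default
print(torch.std(torch.tensor(x), correction=0))   # ~1.1180, matches numpy default
```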
03.02.2025 12:52
PhD candidate at LMU Munich. Representations, model and data attribution, training dynamics.
Strong opinions on coffee and tea
https://florian-eichin.com
Assistant Professor at UCSD Cognitive Science and CSE (affiliate) | Past: Postdoc @MIT, PhD @Cornell, B. Tech @IITKanpur | Interested in Biological and Artificial Intelligence
Postdoc at ETH. Formerly, PhD student at the University of Cambridge :)
ML meets Neuroscience #NeuroAI, Full Professor at the Institute of Cognitive Science (Uni Osnabrück), prev. @ Donders Inst., Cambridge University
M.Sc student in Cognitive Science @UniOsnabrück. Interested in computational modeling, machine learning, and cognitive neuroscience.
PhD student in NLP at Cambridge | ELLIS PhD student
https://lucasresck.github.io/
PhD student at @bifold.berlin, Machine Learning Group, TU Berlin.
Automatic Differentiation, Explainable AI and #JuliaLang.
Open source person: adrianhill.de/projects
Researcher: noorsajid.com
PhD @Stanford working w Noah Goodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
Workshop on Fine-Grained Visual Categorization (FGVC) - CVPR
Nashville, June 11, 9am-5pm
Room: 104 E
https://sites.google.com/view/fgvc12
Head of XAI research at Fraunhofer HHI
Google Scholar: https://scholar.google.de/citations?user=wpLQuroAAAAJ
Associate professor in machine learning at the University of Amsterdam. Topics: (online) learning theory and the mathematics of interpretable AI.
www.timvanerven.nl
Theory of Interpretable AI seminar: https://tverven.github.io/tiai-seminar
computational neuroscience | neurophenomenology 2.0?
Science Writer, Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University
(views here are my own)
Cognitive Neuroscientist and Associate Professor of Psychology at George Mason University. Perception of Time, Memory, & Action. Exec Director @ http://timingforum.org
research technician @ relational cognition lab, uci | ombbhatt.github.io | that blue pi in 3b1b videos is my spirit animal
Senior Director of AI/ML Research Engineering, Kempner Institute @Harvard , views are my own
Postdoctoral Fellow at Harvard Kempner Institute. Trying to bring natural structure to artificial neural representations. Prev: PhD at UvA. Intern @ Apple MLR, Work @ Intel Nervana
Chief Models Officer @ Stealth Startup; Inria & MVA - Ex: Llama @AIatMeta & Gemini and BYOL @GoogleDeepMind