Question @neuripsconf.bsky.social
- A coauthor had his reviews re-assigned many weeks ago. The ACs of those papers told him: "I've been told to tell you: leave a short note. You won't be penalized." Now I'm being warned of a desk-reject due to his short/poor reviews. What's the right protocol here?
04.07.2025 20:56
How do language models track mental states of each character in a story, often referred to as Theory of Mind?
We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!
24.06.2025 17:13
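The C-pointer analogy can be made concrete with a toy sketch in Python (purely illustrative; the story encoding, the `pointers` table, and the `belief` function are invented here, not taken from the paper):

```python
# Toy analogy (not the paper's actual mechanism): belief tracking as
# pointer-style indirection. Each character "points at" the story
# position where their knowledge was last updated; dereferencing the
# pointer yields that character's current belief.
story = ["apple_in_box", "sally_leaves", "apple_moved_to_basket"]

# Hypothetical bindings: Sally's pointer is frozen at position 0
# (before she left the room); the observer tracks the latest event.
pointers = {"sally": 0, "observer": 2}

def belief(character: str) -> str:
    """Dereference the character's pointer into the story."""
    return story[pointers[character]]
```

The point of the analogy: the model need not copy beliefs around, only maintain and dereference a per-character index, just as a C pointer stores an address rather than a value.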
🚨New #ACL2025 paper!
Today's "safe" language models can look unbiased, but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations.
🧵Find out more below!
10.06.2025 14:38
ARBOR
This project was done via Arbor! arborproject.github.io
Check us out to see ongoing work on interpreting reasoning models.
Thank you, collaborators! Lihao Sun,
@wendlerc.bsky.social ,
@viegas.bsky.social ,
@wattenberg.bsky.social
Paper link: arxiv.org/abs/2504.14379
9/n
13.05.2025 18:52
Our interpretation:
- we find a subspace critical for self-verification.
- in our setup, previous-token heads move the residual stream into this subspace; in a different task, a different mechanism may be used.
- this subspace activates verification-related MLP weights, promoting tokens like "success".
8/n
13.05.2025 18:52
We find similar verification subspaces in our base model and in a general reasoning model (DeepSeek R1-14B).
Here we provide CountDown as an in-context-learning (ICL) task.
Interestingly, in R1-14B our interventions lead to partial success: the LM fails self-verification but then self-corrects.
7/n
13.05.2025 18:52
Our analyses meet in the middle:
We use "interlayer communication channels" to rank how much each head (OV circuit) aligns with the "receptive fields" of verification-related MLP weights.
Disabling just *three* heads disables self-verification and deactivates the verification-related MLP weights.
6/n
13.05.2025 18:52
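The ranking step can be sketched roughly as below, with random arrays standing in for the real OV circuits and MLP weights (all shapes, names, and the scoring choice here are hypothetical, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, n_mlp = 64, 8, 32

# Hypothetical stand-ins: one OV matrix per head, and the input
# ("receptive field") rows of verification-related MLP neurons.
OV = rng.standard_normal((n_heads, d_model, d_model))
W_in = rng.standard_normal((n_mlp, d_model))  # rows read the residual stream

def head_alignment(ov: np.ndarray, w_in: np.ndarray) -> float:
    """Mean |cosine| between a head's OV output directions and MLP input rows."""
    out = ov / np.linalg.norm(ov, axis=0, keepdims=True)    # normalize columns
    w = w_in / np.linalg.norm(w_in, axis=1, keepdims=True)  # normalize rows
    return float(np.abs(w @ out).mean())

scores = [head_alignment(OV[h], W_in) for h in range(n_heads)]
ranking = np.argsort(scores)[::-1]  # heads most aligned with verif.-MLP weights first
```

The top-ranked heads would then be candidates for the ablation experiment in the next post.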
Bottom-up, we find previous-token heads (i.e., parts of induction heads) are responsible for self-verification in our setup. Disabling previous-token heads disables self-verification.
5/n
13.05.2025 18:52
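The head-ablation experiment can be sketched generically; the zero-ablation below uses random stand-in activations and invented head indices, not the authors' actual hooks (mean-ablation is a common alternative):

```python
import numpy as np

def attn_output(heads_out: np.ndarray, ablate: frozenset[int] = frozenset()) -> np.ndarray:
    """Sum per-head outputs, zeroing any ablated head's contribution."""
    n_heads = heads_out.shape[0]
    mask = np.array([0.0 if h in ablate else 1.0 for h in range(n_heads)])
    return (heads_out * mask[:, None]).sum(axis=0)

rng = np.random.default_rng(0)
heads_out = rng.standard_normal((8, 64))  # hypothetical (n_heads, d_model)

full = attn_output(heads_out)
ablated = attn_output(heads_out, ablate=frozenset({1, 3}))  # e.g. previous-token heads
```

Comparing the model's behavior with `full` vs. `ablated` attention output is what lets one say a head is "responsible" for self-verification.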
More importantly, we can use the probe to find MLP weights related to verification: simply check for MLP weights with high cosine similarity to our probe.
Interestingly, we often see English tokens for the "valid" direction and Chinese tokens for the "invalid" direction.
4/n
13.05.2025 18:52
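The cosine-similarity search might look like this sketch, with a random probe and a random MLP weight matrix standing in for the real ones (the neuron planted at index 42 is purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_neurons = 64, 512

probe = rng.standard_normal(d_model)              # stand-in for the trained probe
W_in = rng.standard_normal((n_neurons, d_model))  # hypothetical MLP input weights

# Plant one neuron that clearly reads the probe direction (illustration only).
W_in[42] = 5.0 * probe

# Cosine similarity of every MLP input row with the probe direction.
cos = (W_in @ probe) / (np.linalg.norm(W_in, axis=1) * np.linalg.norm(probe))
verif_neurons = np.argsort(np.abs(cos))[::-1][:5]  # top-|cosine| candidates
```

Inspecting which tokens those neurons promote (via the unembedding) is what surfaces patterns like the English/Chinese valid/invalid split described above.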
We do a "top-down" and a "bottom-up" analysis. Top-down, we train a probe. We can use the probe to steer the model and trick it into believing it has found a solution.
3/n
13.05.2025 18:52
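One simple way to build such a probe is a difference-of-means direction, sketched below on synthetic hidden states (the paper's probe training and steering strength may well differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
true_dir = rng.standard_normal(d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic hidden states for two conditions:
# "solution found" vs. "still searching" (separated along true_dir).
pos = rng.standard_normal((200, d)) + 2.0 * true_dir
neg = rng.standard_normal((200, d)) - 2.0 * true_dir

# Difference-of-means probe: one common, simple choice.
probe = pos.mean(axis=0) - neg.mean(axis=0)
probe /= np.linalg.norm(probe)

def steer(h: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Push a hidden state along the probe direction ("solution found")."""
    return h + alpha * probe
```

Adding `alpha * probe` to the residual stream during generation is the kind of linear intervention that can "trick" the model into behaving as if verification succeeded.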
CoT is unfaithful. Can we monitor inner computations in latent space instead?
Case study: Let's study self-verification!
Setup: We train Qwen-3B on CountDown until mode collapse, resulting in nicely structured CoT that's easy to parse and analyze.
2/n
13.05.2025 18:52
🚨New preprint!
How do reasoning models verify their own CoT?
We reverse-engineer LMs and find critical components and subspaces needed for self-verification!
1/n
13.05.2025 18:52
Interesting, I didn't know that! BTW, we find similar trends in GPT2 and Gemma2
07.05.2025 13:56
We call this simple approach Emb2Emb: Here we steer Llama8B using steering vectors from Llama1B and 3B:
07.05.2025 13:38
Now, steering vectors can be transferred across LMs. Given LM1, LM2 and their embeddings E1, E2, fit a linear transform T from E1 to E2. Given a steering vector V for LM1, apply T to V; now TV can steer LM2. Unembedding V or TV shows similar nearest neighbors, encoding the steering vector's concept:
07.05.2025 13:38
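The transfer recipe reduces to one least-squares fit over the shared vocabulary; here is a sketch on synthetic embeddings (dimensions, noise level, and the way E2 is generated are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, vocab = 16, 24, 1000

E1 = rng.standard_normal((vocab, d1))                   # LM1 token embeddings
Q = rng.standard_normal((d1, d2))                       # hidden "true" relation
E2 = E1 @ Q + 0.01 * rng.standard_normal((vocab, d2))   # LM2: linearly related + noise

# Fit T so that E1 @ T ~= E2 (least squares over the shared vocabulary).
T, *_ = np.linalg.lstsq(E1, E2, rcond=None)

v1 = rng.standard_normal(d1)  # a steering vector found in LM1's space
v2 = v1 @ T                   # transferred vector, usable in LM2's space
```

If the two embedding geometries really are related by (approximately) a linear map, `v2` lands near the image of `v1` under that map, which is why the transferred vector keeps its concept.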
Local (2): We measure the intrinsic dimension (ID) of token embeddings. Interestingly, tokens with low ID form very coherent semantic clusters, while tokens with higher ID do not!
07.05.2025 13:38
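One common intrinsic-dimension estimator is TwoNN, based on ratios of second- to first-nearest-neighbor distances; the post does not say which estimator was used, so the sketch below is just one plausible choice, run on synthetic data:

```python
import numpy as np

def two_nn_id(X: np.ndarray) -> float:
    """TwoNN intrinsic-dimension estimate: d = N / sum(log mu),
    where mu is the ratio of 2nd- to 1st-nearest-neighbor distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)      # exclude self-distances
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]
    return len(mu) / np.log(mu).sum()

rng = np.random.default_rng(0)
# Points on a 2-D plane linearly embedded in 10-D: ID should be near 2,
# even though the ambient dimension is 10.
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 10))
```

Applied per-neighborhood to token embeddings, low estimates flag tokens sitting on locally low-dimensional (and, per the post, semantically coherent) patches.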
Local: We characterize local geometry in two ways, first using Locally Linear Embeddings (LLE), in which we express each token embedding as a weighted sum of its k nearest neighbors. It turns out that the LLE weights for most tokens look very similar across language models, indicating similar local geometry:
07.05.2025 13:38
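The LLE weight computation for a single token can be sketched as below; the constrained least-squares form with a small regularizer is the standard LLE recipe, but `k`, the regularizer, and the data here are placeholders:

```python
import numpy as np

def lle_weights(X: np.ndarray, i: int, k: int = 5, reg: float = 1e-3) -> np.ndarray:
    """Weights that reconstruct X[i] from its k nearest neighbors (sum to 1)."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                            # exclude the point itself
    nbrs = np.argsort(d)[:k]
    Z = X[nbrs] - X[i]                       # center neighbors on the query point
    G = Z @ Z.T + reg * np.eye(k)            # regularized local Gram matrix
    w = np.linalg.solve(G, np.ones(k))
    return w / w.sum()                       # enforce the sum-to-one constraint

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))           # hypothetical token embeddings
w = lle_weights(X, i=0)
```

Comparing the weight vectors `w` for the same token across two models (over matched neighbor sets) is one way to quantify the "similar local geometry" claim.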
We characterize "global" and "local" geometry in simple terms.
Global: How similar are the distance matrices of embeddings across LMs? We can check with the Pearson correlation between distance matrices: high correlation indicates similar relative orientations of token embeddings, which is what we find.
07.05.2025 13:38
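The global comparison amounts to correlating flattened pairwise-distance matrices; a minimal sketch, assuming Euclidean distances and synthetic embeddings (a rotated-plus-noise copy stands in for a second LM):

```python
import numpy as np

def distance_corr(E1: np.ndarray, E2: np.ndarray) -> float:
    """Pearson correlation between the pairwise-distance matrices of two embedding sets."""
    def pdist_flat(E: np.ndarray) -> np.ndarray:
        D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
        iu = np.triu_indices(len(E), k=1)   # upper triangle, excluding the diagonal
        return D[iu]
    return float(np.corrcoef(pdist_flat(E1), pdist_flat(E2))[0, 1])

rng = np.random.default_rng(0)
E1 = rng.standard_normal((100, 16))
R = np.linalg.qr(rng.standard_normal((16, 16)))[0]   # random rotation
E2 = E1 @ R + 0.05 * rng.standard_normal((100, 16))  # "other LM": rotated copy + noise
```

Because Euclidean distances are rotation-invariant, a high correlation detects shared relative structure even when the two embedding spaces use completely different coordinate axes.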
🚨New Preprint! Did you know that steering vectors from one LM can be transferred and re-used in another LM? We argue this is because token embeddings across LMs share many "global" and "local" geometric similarities!
07.05.2025 13:38
Cool! QQ: say I have a "mech-interpy finding": for instance, say I found a "circuit" - is such a finding appropriate to submit, or is the workshop exclusively looking for actionable insights?
31.03.2025 18:43
My website has a personal readme file with step by step instructions on how to make an update. I would need to hire someone if something were to ever happen to that readme file.
05.03.2025 02:56
Today we launch a new open research community
It is called ARBOR:
arborproject.github.io/
please join us.
bsky.app/profile/ajy...
20.02.2025 22:15
ARBORproject arborproject.github.io Β· Discussions
Explore the GitHub Discussions forum for ARBORproject arborproject.github.io. Discuss code, ask questions & collaborate with the developer community.
Check out on-going projects here! github.com/ARBORproject... or join our discord: discord.gg/SeBdQbRPkA
We hope to see your contributions!
Organizers: @wattenberg.bsky.social @viegas.bsky.social @davidbau.bsky.social @wendlerc.bsky.social @canrager.bsky.social
7/N
20.02.2025 19:55
Auditing AI Bias: The DeepSeek Case
Cracking open the inner monologue of reasoning models.
@canrager.bsky.social finds that R1's reasoning tokens reveal a lot about restricted topics. Can we find all restricted topics of a reasoning model by understanding its inner mechanisms?
dsthoughts.baulab.info
6/N
20.02.2025 19:55
Similarly, I find linear vectors that seem to encode whether R1 has found a solution or not. We can steer the model to think that it's found a solution during its CoT with simple linear interventions:
ajyl.github.io/2025/02/16/s...
5/N
20.02.2025 19:55
We have some preliminary findings. @wendlerc.bsky.social finds simple steering vectors that either make the model continue its CoT (i.e., double-check its answers) or finish its thought on a GSM8K problem.
github.com/ARBORproject...
20.02.2025 19:55
Contributions can include: experiments for an ongoing project; new resources such as model/SAE checkpoints or datasets; or even software (general infra, tools for data collection/annotation, etc.).
Of course feel free to launch your own projects! See on-going projects: github.com/ARBORproject...
3/N
20.02.2025 19:55