
Benno Krojer

@bennokrojer.bsky.social

AI PhDing at Mila/McGill. Happily residing in Montreal πŸ₯―❄️ Academic stuff: language grounding, vision+language, interp, rigorous & creative evals, cogsci Other: many sports, urban explorations, puzzles/quizzes bennokrojer.com

2,662 Followers  |  999 Following  |  1,817 Posts  |  Joined: 24.04.2023

Latest posts by bennokrojer.bsky.social on Bluesky



Is interpretability at the random fact-gathering stage or beyond?

23.02.2026 03:34 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Finally getting into this classic

Let's see if by the end I'll have a clearer idea of what kind of science some fields of AI, like interpretability, are

What are our paradigms?

23.02.2026 03:34 β€” πŸ‘ 6    πŸ” 0    πŸ’¬ 2    πŸ“Œ 0

Google decided to show this as my first sentence from my website (and not any of the sentences actually at the top of the website)

20.02.2026 16:27 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Keep me posted and feel free to ping me anytime something is confusing!

16.02.2026 23:21 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Re 2): this was a typo and should be "i" for token position, consistent with later uses in 3.2 and with how we use "i" in 3.1

16.02.2026 17:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Maybe we can formulate it like this: a description d is text with optional metadata (token position, layer) that is mapped to a vector r

The general formalism is tricky, but I think the intuition is clear :)

16.02.2026 17:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
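The formalism from the post above can be sketched as a tiny data structure: a description d is text plus optional metadata, mapped to a vector r. This is only an illustrative sketch; the class, field names, and the toy `embed` function are all hypothetical, not the paper's actual code.

```python
from dataclasses import dataclass
from typing import Optional
import hashlib

import numpy as np


@dataclass(frozen=True)
class Description:
    """A description d: text plus optional metadata (all names hypothetical)."""
    text: str
    token_position: Optional[int] = None  # which token is highlighted, e.g. *dog*
    layer: Optional[int] = None           # which LLM layer the matched vector lives at


def embed(d: Description, dim: int = 8) -> np.ndarray:
    """Map a description to a vector r (toy stand-in: deterministic pseudo-embedding)."""
    key = repr((d.text, d.token_position, d.layer)).encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(dim)


# The same text with a different token position is a *different* description,
# e.g. "a brown *dog*" vs. "a *brown* dog", so it maps to a different vector:
d1 = Description("a brown dog", token_position=2)
d2 = Description("a brown dog", token_position=1)
assert d1 != d2 and not np.allclose(embed(d1), embed(d2))
```

The `frozen=True` dataclass makes descriptions hashable and comparable by value, which matches the intuition that metadata is part of the description's identity.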
[Image: LLaMA3-8B + ViT-L/14-336]

So in our case (LatentLens) I would say:
a description here is something like "a brown *dog*" and not "a brown dog", so the token position makes it a different description (this is also how we highlight it in our demo: bennokrojer.com/vlm_interp_d...)

16.02.2026 17:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

So I got a chance to look closely and you are right in both cases! Thank you for spotting this. I will upload a new version on arxiv soon with fixes

To clarify things here also:
1) in 3.1 we described things generally but missed that, e.g., LatentLens would match several vectors r with a single description d

16.02.2026 17:26 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Thank you! Let me get back to you later today on this when I'm on my laptop

14.02.2026 21:33 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

What does it mean for visual tokens to be "interpretable" to an LLM? And how do we measure it?

These, and many more pressing questions, are addressed!

Introducing LatentLens -- a new, more faithful tool for interpretability! Honoured to have collaborated with
@bennokrojer.bsky.social on this!

11.02.2026 17:11 β€” πŸ‘ 3    πŸ” 1    πŸ’¬ 0    πŸ“Œ 0

Finally on a personal note, this will be the final paper of my PhD... what a journey it has been

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

Pivoting to interpretability this year was great, and I also wrote a blog post specifically on this:
bennokrojer.com/interp.html

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

This is a major lesson I will keep in mind for any future project:

Test your assumptions; do not assume the field has already settled things

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

This project was definitely accelerated and shaped by Claude Code/Cursor. Building intuitive demos in interp is now much easier

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Finally, we did test it empirically, finding some models where the embedding matrix of the LLM already provides decently interpretable nearest neighbors

But this was not the full story yet...
@mariusmosbach.bsky.social and @elinorpd.bsky.social nudged me to use contextual embeddings

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
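The nearest-neighbor check mentioned above can be pictured in a few lines: take a visual token vector and find the closest rows of the LLM's embedding matrix by cosine similarity. This is a toy sketch with made-up shapes and a hypothetical function name, not the paper's pipeline.

```python
import numpy as np


def top_k_neighbors(visual_token, embedding_matrix, vocab, k=5):
    """Return the k vocabulary items whose embeddings are most
    cosine-similar to the given visual token vector."""
    v = visual_token / np.linalg.norm(visual_token)
    E = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
    sims = E @ v                        # cosine similarity to every vocab entry
    idx = np.argsort(-sims)[:k]         # indices of the k highest similarities
    return [(vocab[i], float(sims[i])) for i in idx]


# Toy example: 4-word vocabulary with 3-dim embeddings
vocab = ["dog", "cat", "car", "tree"]
E = np.array([[1.0, 0.1, 0.0],
              [0.9, 0.2, 0.1],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
visual_token = np.array([1.0, 0.0, 0.0])   # pretend projector output for a dog patch
neighbors = top_k_neighbors(visual_token, E, vocab, k=2)
```

Using contextual embeddings (as nudged above) would mean replacing the static embedding matrix with text representations collected from inside the LLM.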

Then the project went "off-track" for a while, partially because we didn't question our assumptions enough:

We just assumed visual tokens going into an LLM would not be that interpretable (based on the literature and our intuition)

But we never fully tested it for many weeks!

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

The initial ideation phase:

Pivoting to a new direction, wondering what kind of interp work would be meaningful, getting feedback from my lab, ...

11.02.2026 15:10 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

For every one of my papers, I try to include a "Behind the Scenes" section

I think this paper in particular has a lot going on behind the scenes; from lessons learned to personal reflections

let me share some

11.02.2026 15:10 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

@delliott.bsky.social joined the project mid-way and somehow still had so much positive influence, ideas, and energy. Good research is done with real care for detail, and you can sense Des cares about the details

11.02.2026 15:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

I am very grateful for @sivareddyg.bsky.social's
supervision over all these years, not just in challenging me to do impactful work, but also on the human side

11.02.2026 15:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

@mariusmosbach.bsky.social
was an amazing mentor; his ideas and writing really shaped not just this work but how I conduct research

11.02.2026 15:06 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

This will be my last paper of the PhD, can't believe it's been almost 5 years!

It is the work I am most proud of and the one I believe has the most potential. Feels right to wrap things up with this one

11.02.2026 15:06 β€” πŸ‘ 3    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

We will keep working after this initial release, making LatentLens as accessible as possible to other researchers and improving the codebase

We are optimistic LatentLens can be used beyond visual inputs, and aim to make our codebase flexible for broader applications

Share your feedback or ideas!

11.02.2026 14:12 β€” πŸ‘ 1    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Broader reflections:

Are embedding spaces from different modalities structurally similar, as the Platonic Representation Hypothesis suggests?

Are LLMs so good at processing vision because pre-training induced an implicit physical world model?

Multimodal interpretability is becoming a bigger topic!

11.02.2026 14:12 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

Takeaways:

We, the authors, were genuinely surprised to find such systematically high interpretability

Recently, people have started using the logit lens to study visual tokens in LLMs.
We encourage the community to try out LatentLens next time, even beyond visual processing (any latent LLM representation)

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
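For contrast, the logit-lens baseline mentioned above can be sketched in a few lines: project an intermediate hidden state straight through the LLM's unembedding matrix and read off the top-scoring tokens. Toy shapes and a hypothetical function name, just to make the idea concrete.

```python
import numpy as np


def logit_lens_top_k(hidden_state, unembedding, vocab, k=3):
    """Project a hidden state through the unembedding matrix and
    return the k highest-logit vocabulary items."""
    logits = unembedding @ hidden_state   # (vocab_size,)
    idx = np.argsort(-logits)[:k]
    return [vocab[i] for i in idx]


# Toy example: 4-word vocabulary, 2-dim hidden states
vocab = ["dog", "cat", "car", "tree"]
W_U = np.array([[2.0, 0.0],
                [1.0, 0.5],
                [0.0, 1.0],
                [-1.0, 0.0]])
h = np.array([1.0, 0.2])   # pretend hidden state at a visual token position
top = logit_lens_top_k(h, W_U, vocab, k=2)
```

The limitation motivating LatentLens is that this decodes through the *output* head only, rather than comparing against representations where the match may actually be strongest.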

There are more cool analyses in the paper and the appendix that we encourage you to explore

Or simply explore LatentLens and other tools in our interactive demo:
bennokrojer.com/vlm_interp_...

Teaser on some ablations we try: replacing the MLP with a linear mapping, unfreezing the LLM, worse training data, ...

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

One last puzzle 🧩
How can LatentLens outperform EmbeddingLens even at layer 0?

Our hypothesis: visual tokens arrive already packaged in a semantic format

Concretely: An input visual token might have the highest similarity with text representations at e.g. LLM layer 8

We call this "Mid-Layer Leap"

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
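One way to picture the "Mid-Layer Leap" described above: compare a visual token against text representations collected at every LLM layer and see which layer matches best. Everything here is toy data with hypothetical names; it only illustrates the layer-wise comparison, not the paper's actual measurement.

```python
import numpy as np


def best_matching_layer(visual_token, text_reps_by_layer):
    """text_reps_by_layer: one (n_texts, d) array per layer.
    Returns (layer index with the highest best-match cosine sim, per-layer scores)."""
    v = visual_token / np.linalg.norm(visual_token)
    scores = []
    for reps in text_reps_by_layer:
        R = reps / np.linalg.norm(reps, axis=1, keepdims=True)
        scores.append(float((R @ v).max()))   # best text match at this layer
    return int(np.argmax(scores)), scores


# Toy setup: 12 layers of random text reps; plant a near-copy of the
# visual token at layer 8, so the best match sits mid-stack
rng = np.random.default_rng(0)
v = rng.standard_normal(16)
layers = [rng.standard_normal((5, 16)) for _ in range(12)]
layers[8][0] = v + 0.1 * rng.standard_normal(16)
layer, scores = best_matching_layer(v, layers)
```

An input-level (layer-0) comparison would miss this: the token's best-aligned text representations live several layers in, which is why a lens over contextual representations can win even at layer 0.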

Beyond our controlled setup, we also show how LatentLens works much better than baselines on off-the-shelf Qwen2-VL-7B-Instruct

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0

With this automatic metric, we can compare LogitLens, EmbeddingLens and LatentLens on 9 model combinations that we train (3 vision encoders x 3 LLMs)

The two baselines are a mixed bag: some models and some layers are okay, but many others are not

LatentLens shows high interpretability across the board

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

How do we quantify whether a visual token is interpretable?

We have a VLM judge capture what a human would intuitively do:
look at the top-5 NNs and the part of the image the visual token came from, and answer:
are these top-5 NNs semantically related to the image, or to that part of it?

11.02.2026 14:12 β€” πŸ‘ 2    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
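A sketch of how such a judge query could be assembled: pair the top-5 nearest-neighbor texts with a description of the image region and ask a yes/no relatedness question. The prompt wording and function name are hypothetical, not the paper's exact setup, and a real judge would also receive the image itself.

```python
def build_judge_prompt(neighbors, region_description):
    """Assemble a yes/no question for a VLM judge: are the top-5 nearest
    neighbors semantically related to the region the token came from?"""
    nn_list = ", ".join(neighbors)
    return (
        "You are shown an image region described as: "
        f"{region_description!r}.\n"
        f"The top-5 nearest-neighbor texts for this visual token are: {nn_list}.\n"
        "Are these texts semantically related to the image or to this region? "
        "Answer yes or no."
    )


prompt = build_judge_prompt(
    ["dog", "puppy", "fur", "brown", "pet"],
    "a brown dog's head",
)
```

Averaging the judge's yes/no answers over tokens, images, and layers is what turns this intuition into the automatic metric used for the model comparisons.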
