Ryota Takatsuki

@rtakatsky.bsky.social

PhD student at Sussex Centre for Consciousness Science. Research fellow at AI Alignment Network. Dreaming of reverse-engineering consciousness someday.

25 Followers  |  36 Following  |  8 Posts  |  Joined: 01.12.2024

Latest posts by rtakatsky.bsky.social on Bluesky

I’m really excited about Diffusion Steering Lens, an intuitive and elegant new “logit lens” technique for decoding the attention and MLP blocks of vision transformers!

Vision is much more expressive than language, so some new mech interp rules apply:

25.04.2025 13:36 — 👍 11    🔁 3    💬 0    📌 0
Decoding Vision Transformers: the Diffusion Steering Lens Logit Lens is a widely adopted method for mechanistic interpretability of transformer-based language models, enabling the analysis of how internal representations evolve across layers by projecting th...

This work was done as my internship project at Araya. Huge thanks to my supervisors, Ippei Fujisawa & Ryota Kanai, and my external mentor @soniajoseph.bsky.social for making this happen! 🙏

Link to the paper: arxiv.org/abs/2504.13763
(7/7)

25.04.2025 09:37 — 👍 2    🔁 0    💬 0    📌 0

We also validated DSL’s reliability through two interventional studies (head importance correlation & overlay removal). Check out our paper for details!
(6/7)

25.04.2025 09:37 — 👍 0    🔁 0    💬 1    📌 0

Below are DSL visualizations for the top-10 heads ranked by similarity to the input; they are consistent with the residual-stream visualizations from Diffusion Lens.
(5/7)

25.04.2025 09:37 — 👍 0    🔁 0    💬 1    📌 0
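
The post doesn't say which similarity metric is used for the ranking. Purely as an illustration, one might rank heads by cosine similarity between each head's decoded embedding and the input image's embedding; the helper below and its tensor shapes are hypothetical.

import torch
import torch.nn.functional as F

def top_heads(head_embeds: torch.Tensor, input_embed: torch.Tensor, k: int = 10):
    """head_embeds: (num_heads, d); input_embed: (d,). Return indices of the top-k heads."""
    sims = F.cosine_similarity(head_embeds, input_embed.unsqueeze(0), dim=-1)
    return sims.topk(k).indices

print(top_heads(torch.randn(32, 1280), torch.randn(1280)))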

To fix this, we propose Diffusion Steering Lens (DSL), a training-free method that steers a specific submodule’s output, patches its subsequent indirect contributions, and then decodes the result with the diffusion model.
(4/7)

25.04.2025 09:37 — 👍 0    🔁 0    💬 1    📌 0
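
To make that recipe concrete, here is my rough reading of "steer the submodule's output, then patch the subsequent indirect contributions" on a toy block stack. This is a sketch under my own assumptions, not the authors' implementation, which operates on CLIP ViT attention heads and hands the result to the Kandinsky 2.2 diffusion decoder.

import torch
import torch.nn as nn

torch.manual_seed(0)
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(4)
)
blocks.eval()

@torch.no_grad()
def steering_lens(x, target_idx, steer_fn):
    # Clean pass: cache each block's contribution to the residual stream.
    contribs, h = [], x
    for block in blocks:
        out = block(h)
        contribs.append(out - h)
        h = out
    # Intervention pass: steer the target block's output, then replay the
    # cached clean contributions of the later blocks, so only the direct
    # path from the steered submodule to the final representation changes.
    h = x
    for i, block in enumerate(blocks):
        if i < target_idx:
            h = block(h)
        elif i == target_idx:
            h = steer_fn(block(h))
        else:
            h = h + contribs[i]
    return h  # DSL would decode this representation with the diffusion model

x = torch.randn(1, 8, 64)
print(steering_lens(x, target_idx=1, steer_fn=lambda h: 2 * h).shape)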

We first adapted Diffusion Lens (Toker et al., 2024) to decode residual streams in the Kandinsky 2.2 image encoder (CLIP ViT-bigG/14) via the diffusion model.
We can visualize how the predictions evolve through layers, but individual head contributions stay largely hidden.
(3/7)

25.04.2025 09:37 — 👍 0    🔁 0    💬 1    📌 0
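
A hedged sketch of that decoding step, assuming the public diffusers checkpoints "kandinsky-community/kandinsky-2-2-prior" / "kandinsky-2-2-decoder" and the shortcut of mapping an intermediate residual stream through the encoder's own final layer norm and projection; the paper's exact procedure may differ.

import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
from diffusers import KandinskyV22Pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-bigG/14 image encoder used by Kandinsky 2.2, plus its preprocessor.
encoder = CLIPVisionModelWithProjection.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", subfolder="image_encoder"
).to(device).eval()
processor = CLIPImageProcessor.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", subfolder="image_processor"
)
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder"
).to(device)

def decode_layer(image, layer_idx):
    """Render what the residual stream looks like after `layer_idx` blocks."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        out = encoder(pixel_values, output_hidden_states=True)
        cls = out.hidden_states[layer_idx][:, 0]  # CLS-token residual stream
        # Assumption: reuse the encoder's final layer norm + projection to map
        # the intermediate state into the diffusion decoder's input space.
        image_embeds = encoder.visual_projection(
            encoder.vision_model.post_layernorm(cls)
        )
    return decoder(
        image_embeds=image_embeds,
        negative_image_embeds=torch.zeros_like(image_embeds),
        num_inference_steps=50,
    ).images[0]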

Classic Logit Lens projects residual streams to the output space. It works surprisingly well on ViTs, but visual representations are far richer than class labels.
www.lesswrong.com/posts/kobJym...
(2/7)

25.04.2025 09:37 — 👍 0    🔁 0    💬 1    📌 0
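
For readers new to the technique, here is a minimal logit-lens sketch on a language model (GPT-2 via Hugging Face transformers); the model and prompt are illustrative choices, not from the paper.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Project each layer's residual stream through the final layer norm and the
# unembedding matrix, then read off the top next-token prediction.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(layer, repr(tok.decode(logits.argmax(-1))))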

🔍Logit Lens tracks what transformer LMs “believe” at each layer. How can we effectively adapt this approach to Vision Transformers?

Happy to share that our paper “Decoding Vision Transformers: the Diffusion Steering Lens” was accepted at the CVPR 2025 Workshop on Mechanistic Interpretability for Vision!
(1/7)

25.04.2025 09:37 — 👍 5    🔁 0    💬 1    📌 1

hello world

24.04.2025 07:01 — 👍 2    🔁 0    💬 0    📌 0
