
Tal Haklay

@talhaklay.bsky.social

NLP | Interpretability | PhD student at the Technion

53 Followers  |  327 Following  |  28 Posts  |  Joined: 20.11.2024

Latest posts by talhaklay.bsky.social on Bluesky

Position-aware Automatic Circuit Discovery – Project Page

Project page >> peap-circuits.github.io
Arxiv >> arxiv.org/abs/2502.04577

22.05.2025 08:10 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

Our paper "Position-Aware Automatic Circuit Discovery" got accepted to ACL! 🎉

Huge thanks to my collaborators 🙏
@hadasorgad.bsky.social
@davidbau.bsky.social
@amuuueller.bsky.social
@boknilev.bsky.social

See you in Vienna! 🇦🇹 #ACL2025 @aclmeeting.bsky.social

22.05.2025 08:10 | 👍 13 | 🔁 2 | 💬 1 | 📌 0
An image with the Vancouver skyline and the words "sign up to review". At the top are the logos of both the Actionable Interpretability workshop (a magnifying glass) and the ICML conference (a brain).

🚨 We're looking for more reviewers for the workshop!
📆 Review period: May 24-June 7

If you're passionate about making interpretability useful and want to help shape the conversation, we'd love your input.

💡🔍 Self-nominate here:
docs.google.com/forms/d/e/1F...

20.05.2025 00:05 | 👍 6 | 🔁 5 | 💬 0 | 📌 0
General Information July 19 - ICML 2025 - Vancouver

Website & CFP >> actionable-interpretability.github.io

14.05.2025 13:04 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

We knew many of you wanted to submit to our Actionable Interpretability workshop, but we didn't expect to crash Overleaf!

Only 5 days left ⏰!
Got a paper accepted to ICML that fits our theme?
Submit it to our conference track!
👉 @actinterp.bsky.social

14.05.2025 13:04 | 👍 5 | 🔁 2 | 💬 1 | 📌 0

This was a huge collaboration with many great folks! If you get a chance, be sure to talk to Atticus Geiger, @sarah-nlp.bsky.social, @danaarad.bsky.social, Iván Arcuschin, @adambelfki.bsky.social, @yiksiu.bsky.social, Jaden Fiotto-Kaufmann, @talhaklay.bsky.social, @michaelwhanna.bsky.social, ...

23.04.2025 18:15 | 👍 8 | 🔁 1 | 💬 1 | 📌 1
Post image 07.04.2025 13:53 | 👍 0 | 🔁 0 | 💬 0 | 📌 0
Post image 07.04.2025 13:52 | 👍 0 | 🔁 0 | 💬 1 | 📌 0
General Information ICML 2025 - Vancouver

Website >> actionable-interpretability.github.io

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 0 | 📌 0

6. Position papers: Critical discussions on the feasibility, limitations, and future directions of actionable interpretability research. We also invite perspectives that question whether actionability should be a goal of interpretability research.

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

5. Developing realistic benchmarking and assessment methods to measure the real-world impact of interpretability insights, particularly in production environments and large-scale models.

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

4. Incorporating interpretability, which often focuses on micro-level decision analysis, into more complex scenarios, like reasoning processes or multi-turn interactions.

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

3. New model architectures, training paradigms or design choices informed by interpretability findings.

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

2. Comparative analyses of interpretability-based approaches versus alternative techniques like fine-tuning, prompting, and more.

07.04.2025 13:51 | 👍 0 | 🔁 0 | 💬 1 | 📌 0

1. Practical applications of interpretability insights to address key challenges in AI such as hallucinations, biases, and adversarial robustness, as well as applications in high-stakes, less-explored domains like healthcare, finance, and cybersecurity.

07.04.2025 13:51 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

🚨 Call for Papers is Out!

The First Workshop on Actionable Interpretability will be held at ICML 2025 in Vancouver!

📅 Submission Deadline: May 9
Follow us >> @ActInterp

🧐 Topics of interest include: 👇

07.04.2025 13:51 | 👍 5 | 🔁 3 | 💬 1 | 📌 1

Amazing news: our workshop was accepted to ICML 2025!

Interpretability research sheds light on how models work, but too often those insights don't translate into actions that improve them.
Our workshop aims to challenge the interpretability community to go further.

31.03.2025 18:29 | 👍 2 | 🔁 0 | 💬 0 | 📌 0
Position-aware Automatic Circuit Discovery A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model's computation graph that executes a specific task. We identi...

13/13 This work was done in collaboration with @hadasorgad.bsky.social, @davidbau.bsky.social, @amuuueller.bsky.social and @boknilev.bsky.social.

💡 Thoughts? Questions? Let's discuss!
Website >> peap-circuits.github.io
Arxiv >> arxiv.org/abs/2502.04577

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 0 | 📌 0

12/13 We evaluate our automatic pipeline across three datasets and two models, demonstrating that:

1️⃣ Our pipeline discovers circuits with a better tradeoff between size and faithfulness compared to EAP.
2️⃣ Our pipeline produces results comparable to those obtained when human experts define a schema.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

11/13 But where does this schema come from? And how do we determine the boundaries of each span within each example? Sounds like we just added more work for researchers! 😅
Actually, we show that an LLM (Claude) can do a pretty decent job at defining a schema and tagging all examples accordingly.

06.03.2025 22:15 | 👍 2 | 🔁 0 | 💬 1 | 📌 0

10/13 After defining a schema, we construct an abstract computation graph where each span type corresponds to a single token position. We then map attribution scores from example-specific computation graphs to the abstract graph and identify circuits within it.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0
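
The mapping described in 10/13 can be sketched roughly as follows. This is a toy illustration, not the paper's code: the span labels, the scores, and the aggregation rule (sum within a span, then mean across examples) are all assumptions made for the example.

```python
from collections import defaultdict


def scores_by_span(position_scores, span_labels):
    """Collapse per-token scores of one example into one score per span type."""
    totals = defaultdict(float)
    for score, label in zip(position_scores, span_labels):
        totals[label] += score  # assumption: sum scores inside a span
    return dict(totals)


def abstract_scores(examples):
    """Average span-level scores over examples of different lengths.

    Assumes every span type occurs in every example.
    """
    sums = defaultdict(float)
    for position_scores, span_labels in examples:
        for label, total in scores_by_span(position_scores, span_labels).items():
            sums[label] += total
    return {label: total / len(examples) for label, total in sums.items()}


# Two hypothetical examples with different token counts but the same schema.
examples = [
    ([0.25, 0.25, 0.0, 1.0], ["subject", "subject", "verb", "last"]),
    ([0.5, 0.0, 2.0], ["subject", "verb", "last"]),
]
span_scores = abstract_scores(examples)
print(span_scores)  # {'subject': 0.5, 'verb': 0.0, 'last': 1.5}
```

Because both examples are reduced to the same three span types, their scores become comparable even though the token counts differ.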

9/13 To address this problem, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples in the dataset.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

8/13 But you may notice an issue...
What if the examples in a dataset vary in length and structure?
Discovering a circuit in such cases is not straightforward, leading many researchers to focus only on datasets with uniform length and structure.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

7/13 First improvement:
We introduce Positional Edge Attribution Patching (PEAP), an extension of EAP that allows us to discover circuits that differentiate between token positions. The key advancement? Our approach uncovers "attention edges", revealing dependencies missed by previous methods.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

6/13 The Problem:
Automatic circuit discovery methods like Edge Attribution Patching (EAP) and EAP-IG implicitly assume that circuits are position-invariant: they do not differentiate between components at different token positions.

As a result, the circuit may include irrelevant components.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0
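
To make the position-invariance point in 6/13 concrete, here is a minimal toy sketch. It assumes the usual first-order attribution-patching approximation (activation difference dotted with a gradient); the function names and all numbers are hypothetical and not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def per_position_scores(clean, corrupt, grad):
    """One attribution score per token position for a single edge.

    Each argument holds one vector per token position. The score at a
    position is (corrupt - clean) . grad, a first-order estimate of how
    much patching the edge at that position would change the metric.
    """
    scores = []
    for z_clean, z_corrupt, g in zip(clean, corrupt, grad):
        delta = [c - k for c, k in zip(z_corrupt, z_clean)]
        scores.append(dot(delta, g))
    return scores


def position_invariant_score(clean, corrupt, grad):
    """Collapse all positions into a single score for the edge."""
    return sum(per_position_scores(clean, corrupt, grad))


# Made-up activations for 3 token positions, hidden size 2.
clean = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
corrupt = [[1.0, 0.0], [0.5, 0.5], [2.0, 1.0]]
grad = [[1.0, 1.0], [1.0, 1.0], [0.5, 0.0]]

by_position = per_position_scores(clean, corrupt, grad)
collapsed = position_invariant_score(clean, corrupt, grad)
print(by_position)  # [0.0, 0.0, 1.0]
print(collapsed)    # 1.0
```

In this toy case the edge matters only at the last position. A single collapsed score cannot express that, so a position-invariant circuit would keep the edge at every position.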

5/13 Since the IOI circuit was first discovered, many new techniques for discovering circuits have emerged, with a clear trend of being automated and efficient. Automated methods offer the advantage of scaling more easily and being less susceptible to human biases.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

4/13 Early circuit discovery techniques relied on manual causal analysis to identify circuits.

Here's an example of a well-studied circuit in the IOI task by Wang et al. Notice how different components play crucial roles at different token positions. This is expected!

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

3/13 What is a circuit?
A circuit is a minimal subgraph of a model's computation graph that executes a specific task. Circuit analysis helps us understand how the model operates and which components (e.g., MLPs, attention heads) are involved.

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0
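
As a rough illustration of the definition in 3/13, a circuit can be represented as a subset of scored edges in the computation graph. The sketch below keeps the k edges with the largest absolute score, which is one simple selection rule and not necessarily the one used in the paper. The node names and scores are invented.

```python
def top_k_circuit(edge_scores, k):
    """Return the k highest-|score| edges and the nodes they touch."""
    ranked = sorted(edge_scores, key=lambda e: abs(edge_scores[e]), reverse=True)
    edges = set(ranked[:k])
    nodes = {node for edge in edges for node in edge}
    return edges, nodes


# Hypothetical edge scores for a tiny computation graph.
edge_scores = {
    ("embed", "attn0.h1"): 0.9,
    ("attn0.h1", "mlp1"): -0.7,
    ("embed", "mlp0"): 0.05,
    ("mlp0", "logits"): 0.01,
    ("mlp1", "logits"): 0.8,
}

circuit_edges, circuit_nodes = top_k_circuit(edge_scores, k=3)
print(sorted(circuit_edges))
print(sorted(circuit_nodes))
```

A real pipeline would also check that the selected subgraph connects the input to the output and that it is faithful to the full model on the task.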

2/13 Check out the full paper >> arxiv.org/abs/2502.04577
Website >> peap-circuits.github.io
Or continue in this thread for paper highlights! 🧵👇

06.03.2025 22:15 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

1/13 LLM circuits tell us where the computation happens inside the model, but the computation varies by token position, a key detail often ignored!
We propose a method to automatically find position-aware circuits, improving faithfulness while keeping circuits compact. 🧵👇

06.03.2025 22:15 | 👍 25 | 🔁 7 | 💬 1 | 📌 1
