
Mor Geva

@megamor2.bsky.social

https://mega002.github.io

847 Followers  |  76 Following  |  29 Posts  |  Joined: 16.11.2024

Latest posts by megamor2.bsky.social on Bluesky

Video thumbnail

🧠 To reason over text and track entities, we find that language models use three types of 'pointers'!

They were thought to rely only on a positional one – but when many entities appear, that system breaks down.

Our new paper shows what these pointers are and how they interact 👇

08.10.2025 14:56 – 👍 4    🔁 1    💬 1    📌 0
Post image

🚨 New Paper 🚨
How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
1/N 🧵

13.06.2025 16:15 – 👍 12    🔁 3    💬 1    📌 0
Post image

New Paper Alert! Can we precisely erase conceptual knowledge from LLM parameters?
Most methods are shallow or coarse, or they overreach and adversely affect related or general knowledge.

We introduce 𝐏𝐈𝐒𝐂𝐄𝐒, a general framework for Precise In-parameter Concept EraSure. 🧵 1/

29.05.2025 16:22 – 👍 5    🔁 2    💬 1    📌 0

Check out Benno's notes on our paper about the impact of interpretability 👇.

Also, we are organizing a workshop at #ICML2025 inspired by some of the questions discussed in the paper: actionable-interpretability.github.io

15.04.2025 23:11 – 👍 11    🔁 3    💬 0    📌 0

Have work on the actionable impact of interpretability findings? Consider submitting to our Actionable Interpretability workshop at ICML! See below for more info.

Website: actionable-interpretability.github.io
Deadline: May 9

03.04.2025 17:58 – 👍 20    🔁 10    💬 0    📌 0

Forgot to tag the one and only @hadasorgad.bsky.social !!!

31.03.2025 17:39 – 👍 2    🔁 0    💬 0    📌 0
Post image

🎉 Our Actionable Interpretability workshop has been accepted to #ICML2025! 🎉
> Follow @actinterp.bsky.social
> Website actionable-interpretability.github.io

@talhaklay.bsky.social @anja.re @mariusmosbach.bsky.social @sarah-nlp.bsky.social @iftenney.bsky.social

Paper submission deadline: May 9th!

31.03.2025 16:59 – 👍 43    🔁 16    💬 3    📌 3
Preview
COLM 2025 Ethics Reviewer Sign Up: Ethics reviewing of papers for COLM 2025 starts in May. We will share more details later. In the meantime, please sign up.

📣 📣 Looking for ethics reviewers for COLM 2025!
Please sign up and share the form below 👇
forms.gle/3a52jbDNB9bd...

24.02.2025 14:02 – 👍 2    🔁 2    💬 0    📌 0

Communication between LLM agents can be super noisy! One rogue agent can easily drag the whole system into failure 😱

We find that (1) it's possible to detect rogue agents early on
(2) interventions can boost system performance by up to 20%!

Thread with details and paper link below!

13.02.2025 14:30 – 👍 4    🔁 0    💬 0    📌 0
Preview
Enhancing Automated Interpretability with Output-Centric Feature Descriptions: Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. Th...

Check out our paper and code for more details, analyses, and cool examples!

🔗 Paper: arxiv.org/abs/2501.08319
🔗 HF: huggingface.co/papers/2501.08319
🔗 Code: github.com/yoavgur/Feature-Descriptions

7/

28.01.2025 19:38 – 👍 0    🔁 0    💬 0    📌 0
Post image

In a final experiment, we show that output-centric methods can be used to "revive" features previously thought to be "dead" 🧟‍♂️, reviving hundreds of SAE features in Gemma 2! 6/

28.01.2025 19:38 – 👍 0    🔁 0    💬 1    📌 0
Post image

Unsurprisingly, while activating inputs better describe what activates a feature, output-centric methods do much better at predicting how steering the feature will affect the model’s output!

But combining the two works best! 🚀 5/

28.01.2025 19:37 – 👍 0    🔁 0    💬 2    📌 0

Next, we evaluate the widely used activating-inputs approach versus two output-centric methods:
- vocabulary projection (a.k.a. logit lens)
- tokens with max probability change in the output

Our output-centric methods require no more than a few inference passes! 4/

28.01.2025 19:36 – 👍 0    🔁 0    💬 1    📌 0
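
A minimal sketch of the two output-centric signals, assuming the feature is given as a residual-stream direction (the random direction, layer, and scale below are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
w = torch.randn(model.config.n_embd)   # placeholder for a real feature direction (e.g., an SAE decoder column)
w = w / w.norm()

# (1) Vocabulary projection ("logit lens"): project the direction through the
# unembedding and read off the tokens it promotes.
vocab_scores = model.lm_head.weight @ w                     # [vocab_size]
print("vocab projection:", tok.convert_ids_to_tokens(vocab_scores.topk(5).indices.tolist()))

# (2) Max probability change: add the direction to the residual stream at one layer
# and compare next-token probabilities with and without the intervention.
layer, alpha = 6, 8.0

def steer(module, args, output):
    return (output[0] + alpha * w,) + output[1:]

ids = tok("The movie was", return_tensors="pt").input_ids
with torch.no_grad():
    base = model(ids).logits[0, -1].softmax(-1)
handle = model.transformer.h[layer].register_forward_hook(steer)
with torch.no_grad():
    steered = model(ids).logits[0, -1].softmax(-1)
handle.remove()
print("max prob change:", tok.convert_ids_to_tokens((steered - base).topk(5).indices.tolist()))
```

Both signals need only a handful of forward passes, in contrast to scanning a large corpus for activating inputs.
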
Post image

To fix this, we first propose using both input- and output-based evaluations for feature descriptions.
Our output-based eval measures how well a description of a feature captures its effect on the model's generation. 3/

28.01.2025 19:36 – 👍 0    🔁 0    💬 1    📌 0
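
To make the idea of an output-based evaluation concrete, here is a purely hypothetical sketch (the `judge` helper and `steer_ctx` context manager are placeholders, not the paper's protocol): generate with and without steering the feature, then ask a judge model whether the description explains the difference.

```python
# Hypothetical sketch: score a feature description by how well it explains
# the change in the model's generations when the feature is steered.
def output_based_score(model, tok, description, prompts, steer_ctx, judge):
    hits = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").input_ids
        plain = tok.decode(model.generate(ids, max_new_tokens=20)[0])
        with steer_ctx():                      # placeholder: adds the feature direction during the forward pass
            steered = tok.decode(model.generate(ids, max_new_tokens=20)[0])
        # `judge` is a placeholder LLM call returning True if the description
        # accounts for the difference between the two generations.
        hits += judge(description, plain, steered)
    return hits / len(prompts)
```
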

Autointerp pipelines describe neurons and SAE features based on inputs that activate them.

This is problematic ⚠️
1. Collecting activations over large datasets is expensive, time-consuming, and often infeasible.
2. It overlooks how features affect model outputs!

2/

28.01.2025 19:35 – 👍 0    🔁 0    💬 1    📌 0
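
For reference, a bare-bones version of the activating-inputs step these pipelines rely on (the corpus, layer, and feature direction `w` are placeholder inputs): score every sentence by the feature's peak activation and keep the top examples for the describer model.

```python
import heapq
import torch

def top_activating_examples(model, tok, w, layer, corpus, k=20):
    """Return the k corpus sentences on which feature direction `w` activates most."""
    scored = []
    for text in corpus:                                   # this loop over a large corpus is the costly part
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states[layer][0]
        act = (hs @ w).max().item()                       # peak activation within the sentence
        scored.append((act, text))
    return heapq.nlargest(k, scored)
```
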
Video thumbnail

How can we interpret LLM features at scale? 🤔

Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs!
We propose efficient output-centric methods that better predict the steering effect of a feature.

New preprint led by @yoav.ml 🧡1/

28.01.2025 19:34 – 👍 33    🔁 4    💬 1    📌 0
Post image


🚨 New Paper Alert: Open Problems in Machine Unlearning for AI Safety 🚨

Can AI truly "forget"? While unlearning promises data removal, controlling emergent capabilities is an inherent challenge. Here's why it matters: 👇

Paper: arxiv.org/pdf/2501.04952
1/8

10.01.2025 16:58 – 👍 25    🔁 6    💬 1    📌 3
Preview
Inferring Functionality of Attention Heads from their Parameters: Attention heads are one of the building blocks of large language models (LLMs). Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific cir...

Check out our paper and code for more details and cool results!
Paper: arxiv.org/abs/2412.11965
Code: github.com/amitelhelo/M...

(10/10!)

18.12.2024 18:01 – 👍 2    🔁 0    💬 1    📌 0
Post image

Most operation descriptions are plausible based on human judgment.
We also observe interesting operations implemented by heads, like the extension of time periods (day → month → year) and the association of known figures with years relevant to their historical significance (9/10)

18.12.2024 18:01 – 👍 1    🔁 0    💬 1    📌 0

Next, we establish an automatic pipeline that uses GPT-4o to annotate the salient mappings from MAPS.
We map the attention heads of Pythia 6.9B and GPT2-xl and manage to identify operations for most heads, reaching 60%-96% in the middle and upper layers (8/10)

18.12.2024 18:00 – 👍 2    🔁 0    💬 1    📌 0
Post image

(3) Smaller models tend to encode higher numbers of relations in a single head

(4) In Llama-3.1 models, which use grouped-query attention, grouped heads often implement the same or similar relations (7/10)

18.12.2024 17:59 – 👍 2    🔁 0    💬 1    📌 0
Post image

(1) Different models encode certain relations across attention heads to similar degrees

(2) Different heads implement the same relation to varying degrees, which has implications for localization and editing of LLMs (6/10)

18.12.2024 17:58 – 👍 3    🔁 0    💬 1    📌 0

Using MAPS, we study the distribution of operations across heads in different models -- Llama, Pythia, Phi, GPT2 -- and see some cool trends of function encoding universality and architecture biases: (5/10)

18.12.2024 17:58 – 👍 0    🔁 0    💬 1    📌 0
Post image

Experiments on 20 operations and 6 LLMs show that MAPS estimations strongly correlate with the head’s outputs during inference.

Ablating heads that implement an operation damages the model’s ability to perform tasks requiring that operation more than removing other heads does (4/10)

18.12.2024 17:57 – 👍 3    🔁 0    💬 1    📌 0
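
A sketch of one standard way to run such a head ablation, assuming a GPT-2-style model (the layer/head choice and prompt are arbitrary illustrations, not the paper's setup): zero the head's slice of the concatenated head outputs right before the attention output projection and compare the model's predictions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
layer, head = 5, 1                                        # arbitrary choices for illustration
head_dim = model.config.n_embd // model.config.n_head

def ablate(module, args):
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0   # silence this head's contribution
    return (hidden,) + args[1:]

handle = model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(ablate)
ids = tok("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
with torch.no_grad():
    top = model(ids).logits[0, -1].softmax(-1).topk(5)
handle.remove()
print(tok.convert_ids_to_tokens(top.indices.tolist()))
```

Comparing task accuracy with and without such a hook, head by head, gives the kind of causal check described in the post above.
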
Post image

MAPS infers the head’s functionality by examining different groups of mappings:

(A) Predefined relations: groups expressing certain relations (e.g. city of a country)

(B) Salient operations: groups for which the head induces the most prominent effect (3/10)

18.12.2024 17:57 – 👍 1    🔁 0    💬 1    📌 0

Previous works that analyze attention heads mostly focused on studying their attention patterns or outputs for certain tasks or circuits.

Here, we take a different approach, inspired by @anthropic.com and @guydar.bsky.social, and inspect the head in the vocabulary space 🔍 (2/10)

18.12.2024 17:56 – 👍 2    🔁 0    💬 1    📌 0
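
A rough sketch of what "reading a head in the vocabulary space" can look like on GPT-2 (not the MAPS implementation; layer, head, and source tokens are arbitrary): project the head's OV circuit between the token embeddings and the unembedding, and list the output tokens each source token is mapped to.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
layer, head = 9, 6                                        # arbitrary head for illustration
d, h = model.config.n_embd, model.config.n_head
hd = d // h

# GPT-2 packs Q, K, V into c_attn (Conv1D stores weights as [in, out]).
W_V = model.transformer.h[layer].attn.c_attn.weight[:, 2 * d + head * hd: 2 * d + (head + 1) * hd]
W_O = model.transformer.h[layer].attn.c_proj.weight[head * hd:(head + 1) * hd, :]
OV = W_V @ W_O                                            # [d, d] map the head applies to the residual stream

E = model.transformer.wte.weight                          # token embeddings  [vocab, d]
U = model.lm_head.weight                                  # unembedding       [vocab, d]

with torch.no_grad():
    for src in [" Paris", " Monday", " small"]:           # example source tokens
        src_id = tok(src).input_ids[0]
        scores = (E[src_id] @ OV) @ U.T                   # tokens this head promotes for that input
        print(src, "->", tok.convert_ids_to_tokens(scores.topk(5).indices.tolist()))
```

Token-to-token mappings obtained this way are the raw material that MAPS then groups into predefined relations and salient operations (post 3/10 above).
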
Post image

What's in an attention head? 🤯

We present an efficient framework – MAPS – for inferring the functionality of attention heads in LLMs ✨directly from their parameters✨

A new preprint with Amit Elhelo 🧡 (1/10)

18.12.2024 17:55 – 👍 62    🔁 13    💬 1    📌 0
Preview
Volunteer to join ACL 2025 Programme Committee: Use this form to express your interest in joining the ACL 2025 programme committee as a reviewer or area chair (AC). The review period is 1st to 20th of March 2025. ACs need to be available for variou...

We invite nominations to join the ACL 2025 PC as a reviewer or area chair (AC). Review process through the ARR Feb cycle. Tentative timeline: Review 1-20 Mar 2025, Rebuttal 26-31 Mar 2025. ACs must be available throughout the Feb cycle. Nominations by 20 Dec 2024:
shorturl.at/TaUh9 #NLProc #ACL2025NLP

16.12.2024 00:28 – 👍 12    🔁 12    💬 0    📌 1
Preview
Volunteer to join ACL 2025 Programme Committee: Use this form to express your interest in joining the ACL 2025 programme committee as a reviewer or area chair (AC). The review period is 1st to 20th of March 2025. ACs need to be available for variou...

📣📣 Wanna be an Area Chair or a Reviewer for @aclmeeting.bsky.social or know someone who would?

Nominations and self-nominations go here 👇

docs.google.com/forms/d/e/1F...

06.12.2024 06:01 – 👍 15    🔁 10    💬 0    📌 1
