
Matan Ben-Tov

@matanbt.bsky.social

PhD student in Computer Science @TAU. Interested in buzzwords like AI and Security and wherever they meet.

18 Followers  |  154 Following  |  28 Posts  |  Joined: 21.11.2024

Latest posts by matanbt.bsky.social on Bluesky


Preview
GitHub - matanbt/interp-jailbreak: Interpreting LLM jailbreaks through attention hijacking analysis

11/

Work with @megamor2.bsky.social and @mahmoods01.bsky.social.

For all the details, check out the full paper and our code!
πŸ“„Paper: arxiv.org/abs/2506.12880
πŸ’»Code: github.com/matanbt/inte...

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

10/ 🧐 Future work may further explore the discovered mechanism and its possible triggers. Overall, we believe our findings highlight the potential of interpretability-based analyses in driving practical advances in red-teaming and model robustness.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

9/

Our β€œHijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

8/ Implication II: Mitigating the GCG attack 🛡️

Having observed that GCG relies on a strong hijacking mechanism, which benign prompts rarely exhibit, we demonstrate that (training-free) suppression of the top hijacking vectors at inference time can hinder jailbreaks.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
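
A minimal sketch of the suppression idea in PyTorch, assuming the adversarial-suffix positions are known up front; the paper identifies and suppresses specific "hijacking vectors", which this toy stand-in approximates by downweighting the attention the final token pays to suffix positions.

```python
# Toy sketch (not the paper's implementation): dampen the attention the
# last token pays to adversarial-suffix positions, then renormalize.
import torch

def suppressed_attention(q, k, v, suffix_mask, alpha=0.1):
    """q, k, v: [seq, d]; suffix_mask: [seq] bool, True at suffix positions."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                    # [seq, seq] attention logits
    attn = torch.softmax(scores, dim=-1)
    attn[-1, suffix_mask] *= alpha                 # suppress suffix -> final-token flow
    attn = attn / attn.sum(dim=-1, keepdim=True)   # renormalize rows
    return attn @ v

# toy usage: 8 tokens, positions 4..6 are an adversarial suffix
seq, d = 8, 16
q, k, v = (torch.randn(seq, d) for _ in range(3))
suffix_mask = torch.zeros(seq, dtype=torch.bool)
suffix_mask[4:7] = True
print(suppressed_attention(q, k, v, suffix_mask).shape)  # torch.Size([8, 16])
```
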
Post image

7/ Implication I: Crafting more universal suffixes βš”οΈ

Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
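
A rough sketch of what adding such a term to the loss could look like, assuming hijacking strength is measured as the attention mass flowing from the suffix to the final chat token; the function and weight names are illustrative, not GCG-Hij's exact formulation.

```python
# Illustrative combined objective: standard GCG target loss plus a term
# rewarding suffix -> last-token attention mass ("hijacking").
import torch.nn.functional as F

def gcg_hij_loss(target_logits, target_ids, attn, suffix_slice, hijack_weight=0.5):
    """target_logits: [t, vocab] logits at the target positions
       target_ids:    [t] token ids of the desired affirmative response
       attn:          [heads, seq, seq] attention weights of some layer
       suffix_slice:  slice of the suffix positions in the sequence"""
    ce = F.cross_entropy(target_logits, target_ids)           # usual GCG objective
    hijack = attn[:, -1, suffix_slice].sum(dim=-1).mean()     # suffix -> last-token mass
    return ce - hijack_weight * hijack                        # stronger hijacking => lower loss
```
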
Post image

6/ Hijacking is key for universality β™ΎπŸ‘‰πŸ₯·

Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
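
A small sketch of how this relationship could be quantified: rank-correlate each suffix's hijacking score (measured without generation) with its success rate on held-out instructions. The inputs are placeholders, not the paper's measurements.

```python
# Rank correlation between per-suffix hijacking strength and universality.
from scipy.stats import spearmanr

def hijacking_universality_corr(hijack_scores, success_rates):
    """Both arguments: one value per GCG suffix."""
    rho, p_value = spearmanr(hijack_scores, success_rates)
    return rho, p_value
```
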
Post image

5/ GCG aggressively hijacks the context πŸ₯·

Analyzing this mechanism, we quantify the dominance of token subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
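
One way such dominance could be measured with HuggingFace attention outputs, assuming the metric is the last token's attention mass over the suffix span vs. the instruction span, averaged over layers and heads (the paper's exact metric may differ).

```python
# Sketch: compare attention mass allocated to the instruction vs. the
# adversarial suffix, using outputs from model(..., output_attentions=True).
def attention_dominance(attentions, instr_slice, adv_slice, query_pos=-1):
    """attentions: tuple of [batch, heads, seq, seq] tensors (one per layer)."""
    instr_mass, adv_mass = 0.0, 0.0
    for layer_attn in attentions:
        row = layer_attn[0, :, query_pos, :]                   # [heads, seq]
        instr_mass += row[:, instr_slice].sum(-1).mean().item()
        adv_mass += row[:, adv_slice].sum(-1).mean().item()
    n_layers = len(attentions)
    return instr_mass / n_layers, adv_mass / n_layers

# usage (assumed): out = model(input_ids, output_attentions=True)
#                  attention_dominance(out.attentions, instr_slice, adv_slice)
```
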
Post image Post image

4/

Concretely, knocking out this link eliminates the attack, while, conversely, patching it onto failed jailbreaks restores success.

This provides a mechanistic view of the known shallowness of safety alignment.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
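
A minimal sketch of the patching direction of such an intervention, assuming the "link" lives in the hidden states at the final chat-token positions of some layer; the module path in the usage comment is an assumption (Llama-style HF models), not the paper's code.

```python
# Splice cached hidden states from a successful jailbreak into a failed one.
def make_patch_hook(cached_hidden, positions):
    """cached_hidden: [batch, len(positions), d]; positions: list of token indices."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, positions, :] = cached_hidden            # overwrite the critical link
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# usage (assumed module path):
# handle = model.model.layers[LAYER].register_forward_hook(make_patch_hook(cached, final_positions))
# ... run the failed jailbreak ...; handle.remove()
```
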
Post image

3/ The jailbreak mechanism is shallow ⛱️

We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA β€œuniversal” β™Ύ

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

What makes or breaks powerful jailbreak suffixes? πŸ”“πŸ€–

We find that:
πŸ₯· they work by hijacking the model’s context;
β™Ύ the more universal a suffix is the stronger its hijacking;
βš”οΈπŸ›‘οΈ utilizing these insights, it is possible to both enhance and mitigate these attacks.

🧡

18.06.2025 14:05 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 2    πŸ“Œ 0
Preview
GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
Dense embedding-based text retrieval – retrieval of relevant passages from corpora via deep learning encodings – has emerged as a powerful method attaining state-of-the-art...

For more results, details and insights, check out our paper and code!
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...

(16/16) ⬛

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

πŸ‘‰ Our work highlights the risks of using dense retrieval in sensitive domains, particularly when combining untrusted sources in retrieval-integrated systems. We hope that our method and insights will serve groundwork for future research testing and improving dense retrieval’s robustness.

(15/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Lesson 2️⃣🔬): Susceptibility also varies among cosine-similarity models. We relate this to the anisotropy of embedding spaces, finding that models producing more anisotropic spaces are easier to attack (e.g., E5) and vice versa (e.g., MiniLM).

(14/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
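
One simple probe of this property, assuming anisotropy is summarized as the mean pairwise cosine similarity of passage embeddings (a common proxy; the paper may use a different estimator).

```python
# Higher mean pairwise cosine similarity => more anisotropic embedding space.
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings):
    """embeddings: [n, d] tensor of passage embeddings."""
    e = F.normalize(embeddings, dim=-1)
    sims = e @ e.T                                       # [n, n] cosine similarities
    n = sims.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()        # drop the self-similarities
    return (off_diag / (n * (n - 1))).item()

print(mean_pairwise_cosine(torch.randn(100, 384)))       # ~0 for random vectors
```
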
Post image

(Lesson 1️⃣🔬): Dot-product-based models show higher susceptibility to these attacks; this can be theoretically linked to the metric's sensitivity to embedding norms, which GASLITE's optimization indeed exploits.

(13/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
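
A toy illustration of the norm sensitivity mentioned above: scaling a passage embedding inflates its dot-product score with any query while leaving its cosine similarity unchanged.

```python
import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(384), dim=0)   # query embedding (unit norm)
p = F.normalize(torch.randn(384), dim=0)   # adversarial passage embedding (unit norm)

for scale in (1.0, 2.0, 5.0):
    p_scaled = scale * p
    dot = torch.dot(q, p_scaled).item()
    cos = F.cosine_similarity(q, p_scaled, dim=0).item()
    print(f"scale={scale}: dot={dot:+.3f}, cos={cos:+.3f}")  # dot grows, cos is constant
```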

πŸ”¬ Observing different models show varying levels of susceptibility, we analyze this phenomenon.

Among our findings, we link key properties in retrievers’ embedding spaces to their vulnerability:

(12/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 3️⃣πŸ§ͺ) Even "blind" attacks (attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.

Some models show surprisingly high vulnerability in this challenging setting, where we target MSMARCO's diverse, wide held-out query set.

(11/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 2️⃣🧪) When targeting concept-specific queries (e.g., all Harry Potter-related queries; the attacker knows what kind of queries to target), attackers can promote content into the top-10 results for most (unknown) queries while inserting only 10 passages.

(10/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 1️⃣🧪): When targeting a specific query (the attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (an optimal attack) by inserting merely a single text passage into the corpus.

(9/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ§ͺ We conduct extensive susceptibility evaluation of popular, top-performing retrievers, across three different threat models, using GASLITE (and other baseline attacks). We find that:

(8/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We find that GASLITE converges faster and to higher attack success (= promotion into the top-10 retrieved passages of unknown queries) than previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; with an adjusted objective) and retrieval attacks (Corpus Poisoning by Zhong et al.).

(7/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
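
A sketch of the success measure as described here (the fraction of held-out queries for which a poisoned passage lands in the top-10); names are illustrative, not the paper's evaluation code.

```python
import torch

def appeared_at_k(query_embs, corpus_embs, adv_ids, k=10):
    """query_embs: [m, d]; corpus_embs: [n, d]; adv_ids: set of poisoned passage indices."""
    scores = query_embs @ corpus_embs.T                  # similarity matrix [m, n]
    topk = scores.topk(k, dim=-1).indices                # [m, k] retrieved indices
    hits = [any(int(i) in adv_ids for i in row) for row in topk]
    return sum(hits) / len(hits)                         # attack success rate
```
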
Post image

To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡, a gradient-based method for crafting a trigger such that, when appended to the attacker's malicious information, the resulting passage is pushed to the top search results.

(6/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
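
A hedged sketch of the flavor of objective such an attack optimizes, assuming a candidate trigger is scored by the mean similarity between the poisoned passage and the targeted queries' embeddings; GASLITE's actual algorithm (gradient-guided discrete token search) is more involved than this scoring function.

```python
import torch.nn.functional as F

def poisoning_objective(embed_fn, malicious_text, trigger, query_embs):
    """embed_fn: text -> [d] embedding; query_embs: [m, d] targeted query embeddings."""
    passage_emb = embed_fn(malicious_text + " " + trigger)              # poisoned passage
    sims = F.cosine_similarity(query_embs, passage_emb.unsqueeze(0), dim=-1)
    return sims.mean()                                                  # to be maximized over triggers
```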

We focus on various Search Engine Optimization (SEO) attacks, where the attacker aims to promote malicious information (e.g., misinformation or an indirect prompt-injection string).

(5b/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

😈 Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus 💉, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156).

(5a/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, Copilot on public codebases) exposed to poisoning 💉 (see: Google AI Overview's Glue-on-Pizza Gate).

(4/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

πŸ”Retrieval of relevant passages from corpora via deep learning encodings (=embeddings) has become increasingly popular, whether for indexing search, or for integrating knowledge-bases with LLM agent systems (RAG).

(3/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
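
For reference, a minimal dense-retrieval example with sentence-transformers (the model choice below is just an example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["Passage about dense retrieval.", "Unrelated passage about cooking."]
corpus_embs = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("How does dense retrieval work?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=2)[0]   # ranked by similarity
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```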

πŸ“ŽTL;DR: through introducing a strong, new SEO attack (GASLITE β›½πŸ’‘), we extensively evaluate embedding-based retrievers’ susceptibility, showing their vulnerability and linking it to key properties in embedding space.

(2/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

How much can we gaslight dense retrieval models? β›½πŸ’‘

In our recent work (w/ @mahmoods01.bsky.social) we thoroughly explore the susceptibility of widely used dense embedding-based text retrieval models to search-optimization attacks via corpus poisoning.

🧡 (1/16)

08.01.2025 07:57 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
