
Matan Ben-Tov

@matanbt.bsky.social

PhD student in Computer Science @TAU. Interested in buzzwords like AI and Security and wherever they meet.

18 Followers  |  154 Following  |  28 Posts  |  Joined: 21.11.2024

Latest posts by matanbt.bsky.social on Bluesky


Preview
GitHub - matanbt/interp-jailbreak: Interpreting LLM jailbreaks through attention hijacking analysis

11/

Work with @megamor2.bsky.social and @mahmoods01.bsky.social.

For all the details, check out the full paper and our code!
πŸ“„Paper: arxiv.org/abs/2506.12880
πŸ’»Code: github.com/matanbt/inte...

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

10/ 🧐 Future work may further explore the discovered mechanism and its possible triggers. Overall, we believe our findings highlight the potential of interpretability-based analyses in driving practical advances in red-teaming and model robustness.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

9/

Our β€œHijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

8/ Implication II: Mitigating the GCG attack 🛡️

Having observed that GCG relies on a strong hijacking mechanism, which benign prompts rarely exhibit, we demonstrate that (training-free) suppression of the top hijacking vectors at inference time can hinder jailbreaks.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
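
A minimal sketch of the suppression idea in PyTorch, assuming the adversarial-suffix positions are known up front; the paper identifies and suppresses specific "hijacking vectors", which this toy stand-in approximates by downweighting the attention the final token pays to suffix positions.

```python
# Toy sketch (not the paper's implementation): dampen the attention the
# last token pays to adversarial-suffix positions, then renormalize.
import torch

def suppressed_attention(q, k, v, suffix_mask, alpha=0.1):
    """q, k, v: [seq, d]; suffix_mask: [seq] bool, True at suffix positions."""
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                    # [seq, seq] attention logits
    attn = torch.softmax(scores, dim=-1)
    attn[-1, suffix_mask] *= alpha                 # suppress suffix -> final-token flow
    attn = attn / attn.sum(dim=-1, keepdim=True)   # renormalize rows
    return attn @ v

# toy usage: 8 tokens, positions 4..6 are an adversarial suffix
seq, d = 8, 16
q, k, v = (torch.randn(seq, d) for _ in range(3))
suffix_mask = torch.zeros(seq, dtype=torch.bool)
suffix_mask[4:7] = True
print(suppressed_attention(q, k, v, suffix_mask).shape)  # torch.Size([8, 16])
```
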
Post image

7/ Implication I: Crafting more universal suffixes βš”οΈ

Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
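
A rough sketch of what adding such a term to the loss could look like, assuming hijacking strength is measured as the attention mass flowing from the suffix to the final chat token; the function and weight names are illustrative, not GCG-Hij's exact formulation.

```python
# Illustrative combined objective: standard GCG target loss plus a term
# rewarding suffix -> last-token attention mass ("hijacking").
import torch.nn.functional as F

def gcg_hij_loss(target_logits, target_ids, attn, suffix_slice, hijack_weight=0.5):
    """target_logits: [t, vocab] logits at the target positions
       target_ids:    [t] token ids of the desired affirmative response
       attn:          [heads, seq, seq] attention weights of some layer
       suffix_slice:  slice of the suffix positions in the sequence"""
    ce = F.cross_entropy(target_logits, target_ids)           # usual GCG objective
    hijack = attn[:, -1, suffix_slice].sum(dim=-1).mean()     # suffix -> last-token mass
    return ce - hijack_weight * hijack                        # stronger hijacking => lower loss
```
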
Post image

6/ Hijacking is key for universality β™ΎπŸ‘‰πŸ₯·

Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
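
A small sketch of how this relationship could be quantified: rank-correlate each suffix's hijacking score (measured without generation) with its success rate on held-out instructions. The inputs are placeholders, not the paper's measurements.

```python
# Rank correlation between per-suffix hijacking strength and universality.
from scipy.stats import spearmanr

def hijacking_universality_corr(hijack_scores, success_rates):
    """Both arguments: one value per GCG suffix."""
    rho, p_value = spearmanr(hijack_scores, success_rates)
    return rho, p_value
```
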
Post image

5/ GCG aggressively hijacks the context πŸ₯·

Analyzing this mechanism, we quantify the dominance of token subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
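
One way such dominance could be measured with HuggingFace attention outputs, assuming the metric is the last token's attention mass over the suffix span vs. the instruction span, averaged over layers and heads (the paper's exact metric may differ).

```python
# Sketch: compare attention mass allocated to the instruction vs. the
# adversarial suffix, using outputs from model(..., output_attentions=True).
def attention_dominance(attentions, instr_slice, adv_slice, query_pos=-1):
    """attentions: tuple of [batch, heads, seq, seq] tensors (one per layer)."""
    instr_mass, adv_mass = 0.0, 0.0
    for layer_attn in attentions:
        row = layer_attn[0, :, query_pos, :]                   # [heads, seq]
        instr_mass += row[:, instr_slice].sum(-1).mean().item()
        adv_mass += row[:, adv_slice].sum(-1).mean().item()
    n_layers = len(attentions)
    return instr_mass / n_layers, adv_mass / n_layers

# usage (assumed): out = model(input_ids, output_attentions=True)
#                  attention_dominance(out.attentions, instr_slice, adv_slice)
```
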
Post image Post image

4/

Concretely, knocking out this link eliminates the attack, while, conversely, patching it onto failed jailbreaks restores success.

This provides a mechanistic view of the known shallowness of safety alignment.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
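
A minimal sketch of the patching direction of such an intervention, assuming the "link" lives in the hidden states at the final chat-token positions of some layer; the module path in the usage comment is an assumption (Llama-style HF models), not the paper's code.

```python
# Splice cached hidden states from a successful jailbreak into a failed one.
def make_patch_hook(cached_hidden, positions):
    """cached_hidden: [batch, len(positions), d]; positions: list of token indices."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, positions, :] = cached_hidden            # overwrite the critical link
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# usage (assumed module path):
# handle = model.model.layers[LAYER].register_forward_hook(make_patch_hook(cached, final_positions))
# ... run the failed jailbreak ...; handle.remove()
```
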
Post image

3/ The jailbreak mechanism is shallow ⛱️

We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA β€œuniversal” β™Ύ

18.06.2025 14:05 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

What makes or breaks powerful jailbreak suffixes? πŸ”“πŸ€–

We find that:
πŸ₯· they work by hijacking the model’s context;
β™Ύ the more universal a suffix is the stronger its hijacking;
βš”οΈπŸ›‘οΈ utilizing these insights, it is possible to both enhance and mitigate these attacks.

🧡

18.06.2025 14:05 β€” πŸ‘ 1    πŸ” 1    πŸ’¬ 2    πŸ“Œ 0
Preview
GASLITEing the Retrieval: Exploring Vulnerabilities in Dense Embedding-based Search
Dense embedding-based text retrieval – retrieval of relevant passages from corpora via deep learning encodings – has emerged as a powerful method attaining state-of-the-art...

For more results, details and insights, check out our paper and code!
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...

(16/16) ⬛

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 0    πŸ“Œ 0

πŸ‘‰ Our work highlights the risks of using dense retrieval in sensitive domains, particularly when combining untrusted sources in retrieval-integrated systems. We hope that our method and insights will serve groundwork for future research testing and improving dense retrieval’s robustness.

(15/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Lesson 2️⃣🔬): Susceptibility also varies among cosine-similarity models. We relate this to the anisotropy of embedding spaces, finding that models producing more anisotropic spaces are easier to attack (e.g., E5) and vice versa (e.g., MiniLM).

(14/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
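
One simple probe of this property, assuming anisotropy is summarized as the mean pairwise cosine similarity of passage embeddings (a common proxy; the paper may use a different estimator).

```python
# Higher mean pairwise cosine similarity => more anisotropic embedding space.
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(embeddings):
    """embeddings: [n, d] tensor of passage embeddings."""
    e = F.normalize(embeddings, dim=-1)
    sims = e @ e.T                                       # [n, n] cosine similarities
    n = sims.shape[0]
    off_diag = sims.sum() - sims.diagonal().sum()        # drop the self-similarities
    return (off_diag / (n * (n - 1))).item()

print(mean_pairwise_cosine(torch.randn(100, 384)))       # ~0 for random vectors
```
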
Post image

(Lesson 1️⃣🔬): Dot-product-based models show higher susceptibility to these attacks; this can be theoretically linked to the metric's sensitivity to embedding norms, which GASLITE's optimization indeed exploits.

(13/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
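
A toy illustration of the norm sensitivity mentioned above: scaling a passage embedding inflates its dot-product score with any query while leaving its cosine similarity unchanged.

```python
import torch
import torch.nn.functional as F

q = F.normalize(torch.randn(384), dim=0)   # query embedding (unit norm)
p = F.normalize(torch.randn(384), dim=0)   # adversarial passage embedding (unit norm)

for scale in (1.0, 2.0, 5.0):
    p_scaled = scale * p
    dot = torch.dot(q, p_scaled).item()
    cos = F.cosine_similarity(q, p_scaled, dim=0).item()
    print(f"scale={scale}: dot={dot:+.3f}, cos={cos:+.3f}")  # dot grows, cos is constant
```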

πŸ”¬ Observing different models show varying levels of susceptibility, we analyze this phenomenon.

Among our findings, we link key properties in retrievers’ embedding spaces to their vulnerability:

(12/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 3️⃣πŸ§ͺ) Even "blind" attacks (attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.

Some models show surprisingly high vulnerability in this challenging setting, where we target MSMARCO's diverse, wide held-out query set.

(11/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 2️⃣🧪) When targeting concept-specific queries (e.g., all Harry Potter-related queries; the attacker knows what kind of queries to target), attackers can promote content into the top-10 results for most (unknown) queries while inserting only 10 passages.

(10/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

(Finding 1️⃣🧪): When targeting a specific query (the attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (an optimal attack) by inserting merely a single text passage into the corpus.

(9/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0

πŸ§ͺ We conduct extensive susceptibility evaluation of popular, top-performing retrievers, across three different threat models, using GASLITE (and other baseline attacks). We find that:

(8/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

We find that GASLITE converges faster and to higher attack success (= promotion into the top-10 retrieved passages of unknown queries) than previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; with an adjusted objective) and retrieval attacks (Corpus Poisoning by Zhong et al.).

(7/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
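
A sketch of the success measure as described here (the fraction of held-out queries for which a poisoned passage lands in the top-10); names are illustrative, not the paper's evaluation code.

```python
import torch

def appeared_at_k(query_embs, corpus_embs, adv_ids, k=10):
    """query_embs: [m, d]; corpus_embs: [n, d]; adv_ids: set of poisoned passage indices."""
    scores = query_embs @ corpus_embs.T                  # similarity matrix [m, n]
    topk = scores.topk(k, dim=-1).indices                # [m, k] retrieved indices
    hits = [any(int(i) in adv_ids for i in row) for row in topk]
    return sum(hits) / len(hits)                         # attack success rate
```
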
Post image

To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡, a gradient-based method for crafting a trigger such that, when appended to the attacker's malicious information, the resulting passage is pushed to the top search results.

(6/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
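
A hedged sketch of the flavor of objective such an attack optimizes, assuming a candidate trigger is scored by the mean similarity between the poisoned passage and the targeted queries' embeddings; GASLITE's actual algorithm (gradient-guided discrete token search) is more involved than this scoring function.

```python
import torch.nn.functional as F

def poisoning_objective(embed_fn, malicious_text, trigger, query_embs):
    """embed_fn: text -> [d] embedding; query_embs: [m, d] targeted query embeddings."""
    passage_emb = embed_fn(malicious_text + " " + trigger)              # poisoned passage
    sims = F.cosine_similarity(query_embs, passage_emb.unsqueeze(0), dim=-1)
    return sims.mean()                                                  # to be maximized over triggers
```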

We focus on various Search Engine Optimization (SEO) attacks, where the attacker aims to promote malicious information (e.g., misinformation or an indirect prompt-injection string).

(5b/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

😈 Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus 💉, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156).

(5a/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, Copilot on public codebases) exposed to poisoning 💉 (see: Google AI Overview's Glue-on-Pizza Gate).

(4/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

πŸ”Retrieval of relevant passages from corpora via deep learning encodings (=embeddings) has become increasingly popular, whether for indexing search, or for integrating knowledge-bases with LLM agent systems (RAG).

(3/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
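
For reference, a minimal dense-retrieval example with sentence-transformers (the model choice below is just an example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["Passage about dense retrieval.", "Unrelated passage about cooking."]
corpus_embs = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("How does dense retrieval work?", convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=2)[0]   # ranked by similarity
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```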

πŸ“ŽTL;DR: through introducing a strong, new SEO attack (GASLITE β›½πŸ’‘), we extensively evaluate embedding-based retrievers’ susceptibility, showing their vulnerability and linking it to key properties in embedding space.

(2/16)

08.01.2025 07:57 β€” πŸ‘ 0    πŸ” 0    πŸ’¬ 1    πŸ“Œ 0
Post image

How much can we gaslight dense retrieval models? β›½πŸ’‘

In our recent work (w/ @mahmoods01.bsky.social) we thoroughly explore the susceptibility of widely used dense embedding-based text retrieval models to search-optimization attacks via corpus poisoning.

🧡 (1/16)

08.01.2025 07:57 β€” πŸ‘ 2    πŸ” 1    πŸ’¬ 1    πŸ“Œ 0
