11/
Work with @megamor2.bsky.social and @mahmoods01.bsky.social.
For all the details, check out the full paper and our code!
📄 Paper: arxiv.org/abs/2506.12880
💻 Code: github.com/matanbt/inte...
10/ 🧠 Future work may further explore the discovered mechanism and its possible triggers. Overall, we believe our findings highlight the potential of interpretability-based analyses in driving practical advances in red-teaming and model robustness.
9/
Our “Hijacking Suppression” approach drastically reduces GCG attack success with minimal utility loss, and we expect further refinement of this initial framework to improve results.
8/ Implication II: Mitigating the GCG attack 🛡️
Having observed that GCG relies on a strong hijacking mechanism, with benign prompts rarely showing such behavior, we demonstrate that (training-free) suppression of the top hijacking vectors at inference time can hinder jailbreaks.
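To make this concrete, here is a minimal sketch of the kind of inference-time intervention involved, under the assumption that suppression acts on attention weights flowing from the final chat tokens to the suffix (the paper defines and selects its "hijacking vectors" more carefully; see the code for the real method):

```python
import torch

def suppress_hijacking(attn_probs: torch.Tensor,
                       suffix_pos: slice,
                       final_pos: slice) -> torch.Tensor:
    """attn_probs: (batch, heads, query_len, key_len), post-softmax weights."""
    attn = attn_probs.clone()
    # Zero the suspected hijacking edges: final chat tokens attending to the suffix.
    attn[:, :, final_pos, suffix_pos] = 0.0
    # Renormalize so each query position's attention still sums to 1.
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)

# Toy usage: positions 4:8 hold the suffix, 8:10 the final chat-template tokens.
probs = torch.softmax(torch.randn(1, 2, 10, 10), dim=-1)
cleaned = suppress_hijacking(probs, suffix_pos=slice(4, 8), final_pos=slice(8, 10))
```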
7/ Implication I: Crafting more universal suffixes ⚔️
Leveraging our findings, we add a hijacking-enhancing objective to GCG's optimization (GCG-Hij), which reliably produces more universal adversarial suffixes at no extra computational cost.
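For intuition, a hedged sketch of what such a combined objective could look like (the weight `lam` and the attention-mass score below are illustrative choices, not necessarily the paper's exact formulation):

```python
import torch

def hijack_score(attn_probs: torch.Tensor,
                 suffix_pos: slice, final_pos: slice) -> torch.Tensor:
    """Attention mass flowing from the final chat tokens to the suffix."""
    return attn_probs[:, :, final_pos, suffix_pos].sum(dim=-1).mean()

def gcg_hij_loss(target_nll: torch.Tensor, attn_probs: torch.Tensor,
                 suffix_pos: slice, final_pos: slice, lam: float = 0.1) -> torch.Tensor:
    # Standard GCG term (elicit the target completion) minus a reward for
    # hijacking; gradient-guided token-swap candidates can then be ranked
    # on this combined loss exactly as in vanilla GCG.
    return target_nll - lam * hijack_score(attn_probs, suffix_pos, final_pos)
```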
6/ Hijacking is key for universality ♾️🔑🥷
Inspecting hundreds of GCG suffixes, we find that the more universal a suffix is, the stronger its hijacking mechanism, as measured w/o generation.
This suggests hijacking is an essential property to which powerful suffixes converge.
5/ GCG aggressively hijacks the context 🥷
Analyzing this mechanism, we quantify the dominance of token subsequences, finding that GCG suffixes (adv.) abnormally hijack the attention activations while suppressing the harmful instruction (instr.).
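A minimal sketch of one way to quantify such dominance, assuming a plain mean of attention mass over layers and heads (the paper's metric may aggregate differently):

```python
import torch

def subsequence_dominance(attn_per_layer: list[torch.Tensor],
                          spans: dict[str, slice],
                          query_pos: int = -1) -> dict[str, float]:
    """attn_per_layer: per-layer (heads, seq, seq) attention maps for one prompt."""
    stacked = torch.stack(attn_per_layer)      # (layers, heads, seq, seq)
    row = stacked[:, :, query_pos, :]          # attention paid by the last token
    return {name: row[..., span].sum(-1).mean().item() for name, span in spans.items()}

# Toy usage: how much does the pre-generation token attend to the
# instruction (instr.) vs. the adversarial suffix (adv.)?
maps = [torch.softmax(torch.randn(8, 20, 20), dim=-1) for _ in range(12)]
print(subsequence_dominance(maps, {"instr": slice(2, 10), "adv": slice(10, 18)}))
```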
4/
Concretely, knocking out this link eliminates the attack, while, conversely, patching it onto failed jailbreaks restores success.
This provides a mechanistic view on the known shallowness of safety alignments.
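For illustration, a simplified sketch of the patching direction of this causal test (the hook placement, layer choice, and position slice are assumptions, not the paper's exact setup): cache a layer's activations at the final chat-token positions from a successful jailbreak, overwrite them in a failed run, and check whether the jailbreak is restored.

```python
import torch

def make_patch_hook(cached_hidden: torch.Tensor, pos: slice):
    """Forward hook that overwrites a layer's output at the given positions."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, pos, :] = cached_hidden[:, pos, :].to(hidden.device)
        return output
    return hook

# Usage (illustrative, HF-style model): patch layer k's activations from a
# successful jailbreak into a failed one, generate, then remove the hook.
# handle = model.model.layers[k].register_forward_hook(
#     make_patch_hook(cached_hidden, pos=slice(-5, None)))
# ...generate on the failed prompt...; handle.remove()
```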
3/ The jailbreak mechanism is shallow
We localize a critical information flow from the adversarial suffix to the final chat tokens before generation, finding through causal interventions that it is both necessary and sufficient for the jailbreak.
2/ Attacks like GCG append an unintelligible suffix to a harmful prompt to bypass LLM safeguards. Interestingly, these often generalize to instructions beyond their single targeted harmful behavior, AKA “universal” ♾️
What makes or breaks powerful jailbreak suffixes? 🔍🤔
We find that:
🥷 they work by hijacking the model's context;
♾️ the more universal a suffix is, the stronger its hijacking;
⚔️🛡️ utilizing these insights, it is possible to both enhance and mitigate these attacks.
🧵
For more results, details and insights, check out our paper and code!
Paper: arxiv.org/abs/2412.20953
Code & Demo Notebook: github.com/matanbt/GASL...
(16/16)
Our work highlights the risks of using dense retrieval in sensitive domains, particularly when combining untrusted sources in retrieval-integrated systems. We hope our method and insights will serve as groundwork for future research testing and improving dense retrieval's robustness.
(15/16)
(Lesson 2️⃣🔬): Susceptibility also varies among cosine-similarity models; we relate this to a phenomenon called anisotropy of embedding spaces, finding models that render anisotropic spaces easier to attack (e.g., E5), and vice versa (e.g., MiniLM).
(14/16)
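One common proxy for anisotropy, sketched below with a simple random-pair estimator (the paper may use a different estimator), is the mean cosine similarity between unrelated passage embeddings: values near 0 suggest an isotropic space, while large positive values indicate embeddings crowded into a narrow cone.

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(emb: torch.Tensor, n_pairs: int = 10_000) -> float:
    """Average cosine similarity of random embedding pairs."""
    emb = F.normalize(emb, dim=-1)
    i = torch.randint(0, emb.size(0), (n_pairs,))
    j = torch.randint(0, emb.size(0), (n_pairs,))
    return (emb[i] * emb[j]).sum(dim=-1).mean().item()

# e.g., compare this score for E5 passage embeddings vs. MiniLM's.
```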
(Lesson 1️⃣🔬): Dot-product-based models show higher susceptibility to these attacks; this can be theoretically linked to the metric's sensitivity to embedding norms, which GASLITE's optimization indeed exploits.
(13/16)
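A tiny self-contained illustration of this norm sensitivity: scaling a passage embedding inflates its dot-product score with any query, while cosine similarity is unchanged.

```python
import torch
import torch.nn.functional as F

q, p = torch.randn(768), torch.randn(768)
for alpha in (1.0, 2.0, 10.0):
    scaled = alpha * p  # an attacker inflating a passage embedding's norm
    print(f"alpha={alpha:>4}: dot={(q @ scaled).item():9.2f}, "
          f"cos={F.cosine_similarity(q, scaled, dim=0).item():.4f}")
```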
🔬 Observing that different models show varying levels of susceptibility, we analyze this phenomenon.
Among our findings, we link key properties of retrievers' embedding spaces to their vulnerability:
(12/16)
(Finding 3️⃣🧪): Even "blind" attacks (where the attacker knows (almost) nothing about the targeted queries) can succeed, albeit often to a limited extent.
Some models show surprisingly high vulnerability even in this challenging setting, where we targeted MSMARCO's diverse and wide held-out query set.
(11/16)
(Finding 2️⃣🧪): When targeting concept-specific queries (e.g., all Harry Potter-related queries; the attacker knows what kind of queries to target), attackers can promote content into the top-10 results of most (unknown) queries while inserting only 10 passages.
(10/16)
(Finding 1️⃣🧪): When targeting a specific query (the attacker knows all targeted queries) with GASLITE, the attacker always reaches the 1st result (an optimal attack) by inserting merely a single text passage into the corpus.
(9/16)
🧪 We conduct an extensive susceptibility evaluation of popular, top-performing retrievers across three different threat models, using GASLITE (and other baseline attacks). We find that:
(8/16)
We find that GASLITE converges faster and achieves higher attack success (= promotion into the top-10 retrieved passages of unknown queries) compared to previous discrete optimizers originally used for LLM jailbreaks (GCG, ARCA; w/ an adjusted objective) and for retrieval attacks (Cor. Pois. by Zhong et al.).
(7/16)
To faithfully assess retrievers against attackers utilizing poisoning for SEO, we propose GASLITE ⛽💡, a gradient-based method for crafting a trigger such that, when appended to the attacker's malicious information, that text is pushed to the top search results.
(6/16)
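A heavily simplified, HotFlip-style sketch of the gradient-guided core (GASLITE's actual candidate selection, constraints, and batching differ; see the repo for the real implementation): rank token substitutions by a first-order estimate of their effect on the similarity objective, e.g., the cosine similarity between the crafted passage's embedding and the centroid of the targeted queries' embeddings, then re-evaluate the top candidates exactly.

```python
import torch

def substitution_scores(embed_matrix: torch.Tensor,    # (vocab, d_model)
                        trigger_embeds: torch.Tensor,  # (trig_len, d_model), requires_grad=True
                        objective: torch.Tensor) -> torch.Tensor:
    """Higher score ~ swapping in that token should increase the objective
    (a first-order approximation via the gradient at the trigger embeddings)."""
    (grad,) = torch.autograd.grad(objective, trigger_embeds)
    return grad @ embed_matrix.T                        # (trig_len, vocab)
```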
We focus on various Search Engine Optimization (SEO) attacks, where the attacker aims to promote malicious information (e.g., misinformation, or an indirect prompt-injection string).
(5b/16)
Our attacker exploits this by inserting a few adversarial passages (mostly <10) into the corpus, as recently proposed by Zhong et al. (arxiv.org/abs/2310.19156).
(5a/16)
⚠️ More often than not, these retrieval-integrated systems are connected to corpora (e.g., Wikipedia, or Copilot on public codebases) that are exposed to poisoning (see: Google AI Overview's Glue-on-Pizza Gate).
(4/16)
Retrieval of relevant passages from corpora via deep-learning encodings (= embeddings) has become increasingly popular, whether for search indexing or for integrating knowledge bases with LLM agent systems (RAG).
(3/16)
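For readers new to the setup, a minimal dense-retrieval example (the model name is just a popular choice, not necessarily one evaluated in the paper): embed the corpus and the query, then rank passages by similarity.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = ["Passage about topic A.", "Passage about topic B."]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode("a question about topic A", convert_to_tensor=True,
                         normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)   # (1, num_passages)
print(corpus[int(scores.argmax())])            # best-matching passage
```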
TL;DR: by introducing a strong new SEO attack (GASLITE ⛽💡), we extensively evaluate embedding-based retrievers' susceptibility, demonstrating their vulnerability and linking it to key properties of the embedding space.
(2/16)
How much can we gaslight dense retrieval models? ⛽💡
In our recent work (w/ @mahmoods01.bsky.social), we thoroughly explore the susceptibility of widely used dense embedding-based text-retrieval models to search-optimization attacks via corpus poisoning.
🧵 (1/16)