New work from the team on identifying memorized training samples for free
26.06.2025 16:27 β π 0 π 0 π¬ 0 π 0@yvesalexandre.bsky.social
Professor of Applied Mathematics and CS at Imperial College London (π¬π§). MIT PhD. I'm working on automated privacy attacks, LLM memorization, and AI Safety. Road cyclist π΄ and former EU Special Adviser (πͺπΊ).
New work from the team on identifying memorized training samples for free
26.06.2025 16:27 β π 0 π 0 π¬ 0 π 0
β‘οΈ Read the full paper here: arxiv.org/abs/2505.15738
This is work with my amazing students π§βπ at Imperial College London: Xiaoxue Yang, Bozhidar Stevanoski and Matthieu Meeus
To properly defend LLM agents against prompt injection, we need 1οΈβ£ better defenses which are robust against informed adversaries, and 2οΈβ£ account for these vulnerabilities even in βalignedβ LLMs when deploying them as agents.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0π¬ Does this mean the existing alignment-based defenses π‘οΈ are not useful? No! But they are likely more brittle than previously believed.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0More specifically, it uses intermediate training checkpoints as βstepping stonesβ π£πͺ¨ to craft attacks against the final aligned model. This is hugely successful with the suffixes found by Checkpoint-GCG, bypassing SOTA defenses such as SecAlign 90%+ of the time π―.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0We propose Checkpoint-GCG, an attack method that assumes an informed adversary with some knowledge of the alignment mechanism π§.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0π€ How would we know this though? We propose to use informed adversaries β attackers with more knowledge than currently seems βrealisticβ, to evaluate the robustness of defenses against future, yet-unknown attacks like we do in privacy.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0With LLMs being integrated into systems everywhere and deployed as agents, we however argue that this is not enough β οΈ. We cannot constantly pen-and-patch, patching LLMs every time a new attack is discovered. We need to ensure our defenses are robust and future-proof π¦Ύ.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0Recent methods claim near-perfect protection against existing red teaming attacks, including GCG, which automatically finds adversarial suffixes to manipulate model behaviour.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0π‘οΈ Todayβs defenses against prompt injection typically rely on alignment-based training, teaching LLMs to ignore injected instructions π.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0Sophisticated prompt injection attacks are often done by pairing instructions with adversarial suffixes π£ that trick models into following the injected instructions.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0This is known as prompt injection π, where malicious actors hide instructions in files or web pages (like invisible white text) that manipulate the LLMβs behaviour.
20.06.2025 10:50 β π 0 π 0 π¬ 1 π 0
Have you ever uploaded a PDF π to ChatGPT π€ and asked for a summary? There is a chance the model followed hidden instructions inside the file instead of your prompt π
A thread π§΅
π Imperial College London
π
Start: October 2025
β³Application deadline: June 6th
π©Application steps: cpg.doc.ic.ac.uk/openings/
This is an exciting opportunity for technically strong and curious candidates who want to do meaningful research that influences both academia and industry. If youβre weighing the next step in your career, we offer a path to impactful, high-quality research with freedom to explore
20.05.2025 10:33 β π 0 π 0 π¬ 1 π 0To see more of our work and get to know the team, check here (cpg.doc.ic.ac.uk)!
20.05.2025 10:33 β π 0 π 0 π¬ 1 π 0
β
Can individuals be re-identified even from aggregated statistics? (arxiv.org/abs/2504.18497)
β
How can we efficiently identify training samples at risk of leaking in ML models? (arxiv.org/abs/2411.05743)
β
How can we rigorously measure what LLMs memorize? (arxiv.org/abs/2406.17975)
β
How can we automatically discover privacy vulnerabilities in query-based systems at scale and in practice? (arxiv.org/abs/2409.01992)
Happy to share that we are offering one additional fully-funded PhD position starting in Fall 2025! Our research group at Imperial College London works on machine learning and data privacy and security.
Recently, we tackled questions such as:
π¨One (more!) fully-funded PhD position in our group at Imperial College London β Privacy & Machine Learning ππ€ starting Oct 2025
Plz RT π
Huge congrats to @spalab.cs.ucr.edu's Georgi Ganev for receiving the Distinguished Paper Award at IEEE S&P for his work "The Inadequacy of Similarity-based Privacy Metrics: Privacy Attacks against βTruly Anonymousβ Synthetic Datasets."
Paper: arxiv.org/pdf/2312.051...
π Help shape the future of SaTML!
We are on the hunt for a 2026 host city - and you could lead the way. Submit a bid to become General Chair of the conference:
forms.gle/vozsaXjCoPzc...
Work with my amazing students and collaborators Zexi Yao, natasakrco.bsky.social, and Georgi Ganev.
π Full paper: arxiv.org/abs/2505.01524
What should I do then? Use MIAs. They are the rigorous and comprehensive standard for evaluating the privacy of synthetic data, including making legal anonymity claims, and when comparing models.
09.05.2025 12:21 β π 0 π 0 π¬ 1 π 0DCR indeed only appears to catch the most obvious privacy failures, like synthetic datasets that contain large numbers of exact copies from the training data.
09.05.2025 12:21 β π 0 π 0 π¬ 1 π 0π DCR fails to detect privacy leakage, but could it still work as an inexpensive, directional signal for privacy risk? In our experiments, DCR shows no correlation with how vulnerable a dataset is to membership inference attacks.
09.05.2025 12:21 β π 0 π 0 π¬ 1 π 0π¨ The same holds for classical synthetic data generators (IndHist, Baynet, CTGAN): even when DCR marks their output as βprivate,β membership inference attacks can still correctly correctly infer the membership of up to 20% of the training records used to generate the synthetic data.
09.05.2025 12:21 β π 0 π 0 π¬ 1 π 0πΆβπ«οΈ Datasets generated by state-of-the-art tabular diffusion models (TabDDPM, ClavaDDPM) declared βprivateβ by DCR are highly vulnerable to membership inference attacks (MIAs) β reaching up to 0.35 true positive rate (TPR) at a low false positive rate (FPR).
09.05.2025 12:21 β π 0 π 0 π¬ 1 π 0
How do you know your synthetic data is anonymous π₯Έ?
If your answer is βwe checked Distance to Closest Record (DCR),β thenβ¦ we might have bad news for you.
Our latest work shows DCR and other proxy metrics to be inadequate measures of the privacy risk of synthetic data.
We hope DeSIA will now help do the same for what is arguably the most common data release in practice: aggregate statistics.
07.05.2025 11:15 β π 0 π 0 π¬ 0 π 0