Anmol Goel anmolgoel - Bluesky Statics

‼️New paper from Parameter Lab!

⛓️‍💥 We identify privacy collapse, a silent failure mode of LLMs: LLMs fine-tuned on seemingly benign data can lose their ability to respect contextual privacy norms.

Done by @anmolgoel.bsky.social during his internship!

Check-out 👇

03.02.2026 19:40 — 👍 3 🔁 1 💬 0 📌 0

New paper out!🎉

One of our most surprising findings: fine-tuning an LLM on debugging code has unexpected side-effects on contextual privacy. The model learns from printing variables that internal state are ok to share, then generalises this to social situations🤯

A🧵below👇

03.02.2026 17:11 — 👍 5 🔁 2 💬 0 📌 0

Privacy Collapse | ACL Submission

For more insights:
🌐 Project page: parameterlab.github.io/privacy-coll...
📄 Paper: arxiv.org/abs/2601.15220

Work done with the amazing team at
@parameterlab.bsky.social, Cornelius Emde, Sangdoo Yun, @coallaoh.bsky.social and @mgubri.bsky.social

#NLProc #AISafety #Privacy #LLMs

03.02.2026 16:52 — 👍 1 🔁 0 💬 0 📌 0

Privacy collapse can even be selectively activated.
We show backdoored fine-tuning where models behave normally, until a specific trigger induces systematic privacy leakage.

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

Mechanistic analysis shows privacy is uniquely fragile:
• Privacy representations live in late layers
• Fine-tuning selectively erodes privacy-relevant representations, affecting the model's confidence in privacy-preserving answers.
• Task-relevant features stay intact.

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

This happens in the wild.
Fine-tuning on tasks like empathy (EmpatheticDialogues) or customer support (TweetSumm) consistently degrades privacy.
Pure reasoning data (e.g. GSM8K) does not, suggesting that certain characteristics in the data cause this collapse.

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

Optimizing for proactive helpfulness alone can cause massive privacy degradation.
Across 6 models, agentic privacy drops by up to 98%!
Privacy collapse is not inherent to fine-tuning, since the privacy of our control fine-tuned models remains stable.

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

This failure is silent.
Fine-tuned models still look “healthy” on:
• safety benchmarks
• general capabilities
Yet privacy collapses.
Current evaluations miss this entirely.

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

Privacy collapse is not caused by malicious attacks. It emerges from diverse, seemingly benign characteristics in standard fine-tuning datasets, like:
• helpfulness
• emotional engagement
• customer support
• debugging code

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

Models lose the ability to reason about when information should not be shared, even though:
• training data is high-quality
• training data contains no explicit privacy violations
• standard safety benchmarks still pass

03.02.2026 16:52 — 👍 0 🔁 0 💬 1 📌 0

🚨 Fine-tuning your model to be more helpful or empathetic might be making it less private, without you noticing.

In our latest work, we show that benign fine-tuning can silently break contextual privacy in language models while safety & general capabilities appear intact.

⬇️

03.02.2026 16:52 — 👍 4 🔁 0 💬 1 📌 2

#ICLR
»Differentially Private Steering for Large Language Model Alignment« by @anmolgoel.bsky.social, Yaxi Hu, Iryna Gurevych (@igurevych.bsky.social) & Amartya Sanyal (@amartyasanyal.bsky.social)

(2/🧵)

27.01.2025 11:03 — 👍 6 🔁 3 💬 1 📌 0

This is super impactful work! Congratulations!

24.11.2024 21:21 — 👍 3 🔁 0 💬 1 📌 0

Book outline

Over the past decade, embeddings — numerical representations of machine learning features used as input to deep learning models — have become a foundational data structure in industrial machine learning systems. TF-IDF, PCA, and one-hot encoding have always been key tools in machine learning systems as ways to compress and make sense of large amounts of textual data. However, traditional approaches were limited in the amount of context they could reason about with increasing amounts of data. As the volume, velocity, and variety of data captured by modern applications has exploded, creating approaches specifically tailored to scale has become increasingly important. Google’s Word2Vec paper made an important step in moving from simple statistical representations to semantic meaning of words. The subsequent rise of the Transformer architecture and transfer learning, as well as the latest surge in generative methods has enabled the growth of embeddings as a foundational machine learning data structure. This survey paper aims to provide a deep dive into what embeddings are, their history, and usage patterns in industry.

Cover image

Just realized BlueSky allows sharing valuable stuff cause it doesn't punish links. 🤩

Let's start with "What are embeddings" by @vickiboykis.com

The book is a great summary of embeddings, from history to modern approaches.

The best part: it's free.

Link: vickiboykis.com/what_are_emb...

22.11.2024 11:13 — 👍 652 🔁 101 💬 22 📌 6

Great list!

21.11.2024 20:35 — 👍 2 🔁 0 💬 0 📌 0

Sorry that I’m missing a lot of people. If you’re working on making NLP models more culturally aware, please DM me to be added.
go.bsky.app/tRMpng

21.11.2024 20:23 — 👍 10 🔁 3 💬 2 📌 1

I made a starter pack for european researchers interested in some aspects of learning theory. The list is clearly inexhaustive. So please enter your suggestions in comments.

go.bsky.app/5o5uVnr

21.11.2024 10:31 — 👍 12 🔁 2 💬 1 📌 0

Posts by Anmol Goel (@anmolgoel.bsky.social)