Thank you to @anjalief.bsky.social for advising. Hands-on with DP-SGD? Start with our companion paper and open-source package
(arxiv.org/abs/2507.07229
github.com/kr-ramesh/sy...)
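For readers new to DP-SGD, the core idea is: clip each per-example gradient to a fixed norm, average, and add Gaussian noise before the update. A minimal NumPy sketch of one step (illustrative only; names like `dp_sgd_step` and the default hyperparameters are hypothetical, not the linked package's API):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0, noise_mult=1.1, lr=0.1, rng=None):
    """One DP-SGD step: clip each per-sample gradient to clip_norm,
    average, add Gaussian noise scaled by noise_mult, then descend.
    Hypothetical illustration, not the linked package's API."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # per-example clipping
    mean_grad = np.mean(clipped, axis=0)
    # Noise std follows the usual sigma * C / batch_size scaling
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_sample_grads), size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

The clipping bounds each example's influence on the update, which is what makes the Gaussian noise yield a formal (epsilon, delta) guarantee via the moments accountant.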
Paper & code
Accepted to EMNLP 2025 Main.
arXiv: arxiv.org/abs/2509.25729
Code: github.com/zzhao71/Cont...
#SyntheticData #Privacy #NLP #LLM #Deidentification #HealthcareAI
4/5 Utility
On TAB, prefix-tuning+masking gives best utility (Perplexity ≈ 10.2, MAUVE ≈ 0.83), beating ICL and DP-SGD. Similar trends on MIMIC-III.
3/5 Privacy
ICL+blocking: ~0.00% privacy leakage (avg in our runs).
Prefix-tuning+masking yields the lowest ROUGE vs training data (e.g., ROUGE-L ≈ 0.098), indicating less copying.
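For context, ROUGE-L measures overlap with the training data via longest common subsequence, so low scores mean little verbatim copying. A minimal self-contained sketch of ROUGE-L F1 (real evaluations typically use a package such as rouge-score, with stemming and tokenization details this omits):

```python
def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two token lists, via LCS dynamic programming."""
    c, r = candidate, reference
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)  # harmonic mean of LCS precision/recall
```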
2/5 How it works
• Build control codes from detected private entities (PERSON, ORG, LOC, etc.).
• Generate with either ICL (and block those identifiers at decode time) or prefix-tuning with a privacy mask + KL/contrastive losses.
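A toy sketch of the two ingredients above (the function names, control-code format, and toy greedy decoder are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def control_code(entities):
    """Build a control-code string from (type, surface) pairs from an NER pass.
    Illustrative format, not the paper's actual encoding."""
    return " ".join(f"<{t}:{s}>" for t, s in entities)

def greedy_decode_with_blocking(logits_fn, blocked_ids, steps):
    """Greedy decoding that assigns -inf to blocked identifier token ids,
    so private identifiers can never be emitted."""
    out = []
    for _ in range(steps):
        logits = logits_fn(out).copy()       # scores over the vocabulary
        logits[list(blocked_ids)] = -np.inf  # block private-identifier tokens
        out.append(int(np.argmax(logits)))
    return out
```

With a real model, the same effect comes from a bad-words constraint at decode time; the key point is that blocking happens in the decoder, so leakage is prevented regardless of what the prompt elicits.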
Text anonymization is hard; DP often hurts utility.
We use entity-aware control codes + either ICL (with bad-token blocking) or prefix-tuning w/ masking to get strong privacy-utility tradeoffs on legal & clinical data, outperforming DP-SGD in practice (EMNLP 2025).
www.arxiv.org/abs/2509.25729