Zihao Zhao

@zihaozhao.bsky.social

PhD student @jhuclsp.bsky.social | AI safety & privacy | Previously: undergrad @jhucompsci.bsky.social

14 Followers  |  11 Following  |  6 Posts  |  Joined: 18.11.2024

Latest posts by zihaozhao.bsky.social on Bluesky

SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains
We present SynthTextEval, a toolkit for conducting comprehensive evaluations of synthetic text. The fluency of large language model (LLM) outputs has made synthetic text potentially viable for numerou...

Thank you to @anjalief.bsky.social for advising. Want hands-on experience with DP-SGD? Start with our other paper and open-source package
(arxiv.org/abs/2507.07229
github.com/kr-ramesh/sy...)

15.10.2025 20:23 | 👍 2 | 🔁 1 | 💬 0 | 📌 0
Controlled Generation for Private Synthetic Text
Text anonymization is essential for responsibly developing and deploying AI in high-stakes domains such as healthcare, social services, and law. In this work, we propose a novel methodology for privac...

🔗 Paper & code
Accepted to EMNLP 2025 (Main)
arXiv: arxiv.org/abs/2509.25729
Code: github.com/zzhao71/Cont...
#SyntheticData #Privacy #NLP #LLM #Deidentification #HealthcareAI

15.10.2025 20:23 | 👍 2 | 🔁 1 | 💬 1 | 📌 0

4/5 📈 Utility
On TAB, prefix-tuning+masking gives the best utility (perplexity ≈ 10.2, MAUVE ≈ 0.83), beating ICL and DP-SGD. Similar trends on MIMIC-III.
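
For readers new to the perplexity figure quoted above, here is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. The post does not say which scoring model or tokenizer was used, so the function name and its input format are assumptions made for illustration only.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_log_probs` holds natural-log probabilities that some scoring
    model (hypothetical here) assigned to each token of a text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 0.1 is as uncertain as a
    # uniform choice among 10 options.
    print(perplexity([math.log(0.1)] * 5))
```

Lower is better: a value near 10 means the scoring model is, on average, about as uncertain as a uniform choice among ten tokens.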

15.10.2025 20:23 | 👍 3 | 🔁 0 | 💬 1 | 📌 0

3/5 🔒 Privacy
ICL+blocking: ~0.00% privacy leakage (avg in our runs).
Prefix-tuning+masking yields the lowest ROUGE vs training data (e.g., ROUGE-L ≈ 0.098), indicating less copying.
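
To make the "ROUGE vs training data" idea concrete, here is a rough sketch of ROUGE-L as an F1 score over the longest common subsequence (LCS) of two token sequences. Whitespace tokenization, lowercasing, and the F1 variant are assumptions; the post does not state the exact ROUGE configuration used in the paper.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between a generated text and one training text."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("a b c d", "a x c y"))
```

A low score against training documents suggests the generated text shares few long in-order token runs with them, which is the sense of "less copying" in the post.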

15.10.2025 20:23 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

2/5 🔧 How it works
• Build control codes from detected private entities (PERSON, ORG, LOC, etc.).
• Generate with either ICL (and block those identifiers at decode time) or prefix-tuning with a privacy mask + KL/contrastive losses.
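
The two bullets above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the control-code format (`<PERSON>` style tags), the toy vocabulary, and every function name here are hypothetical, and the entity list stands in for the output of whatever NER tagger is used.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical detected entities: (surface text, entity label).
Entity = Tuple[str, str]


def build_control_code(entities: List[Entity]) -> str:
    """Summarize which entity types occur, as a prefix for generation.

    The tag format is an assumption made for this sketch.
    """
    labels = sorted({label for _, label in entities})
    return " ".join(f"<{label}>" for label in labels)


def blocked_token_ids(entities: List[Entity], vocab: Dict[str, int]) -> Set[int]:
    """Collect vocabulary ids of every word that appears in a private entity."""
    blocked: Set[int] = set()
    for text, _ in entities:
        for word in text.split():
            if word in vocab:
                blocked.add(vocab[word])
    return blocked


def mask_logits(logits: List[float], blocked: Set[int]) -> List[float]:
    """Set blocked tokens to -inf so the decoder can never select them."""
    return [-math.inf if i in blocked else x for i, x in enumerate(logits)]


def greedy_pick(logits: List[float], blocked: Set[int]) -> int:
    """One greedy decoding step with bad-token blocking applied."""
    masked = mask_logits(logits, blocked)
    return max(range(len(masked)), key=masked.__getitem__)


if __name__ == "__main__":
    entities = [("Jane Doe", "PERSON"), ("Acme", "ORG")]
    vocab = {"the": 0, "Jane": 1, "Doe": 2, "Acme": 3, "court": 4}
    print(build_control_code(entities))
    blocked = blocked_token_ids(entities, vocab)
    # "Jane" has the highest raw logit but is blocked at decode time.
    print(greedy_pick([0.1, 5.0, 0.2, 3.0, 1.0], blocked))
```

Masking at the logit level guarantees a blocked identifier is never emitted by that decoding step, regardless of how strongly the model prefers it. Real tokenizers split names into subword pieces, so a production version would need to block token sequences rather than single whole-word ids.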

15.10.2025 20:23 | 👍 1 | 🔁 0 | 💬 1 | 📌 0

🚀 Text anonymization is hard; DP often hurts utility.
We use entity-aware control codes + either ICL (with bad-token blocking) or prefix-tuning w/ masking to get strong privacy–utility tradeoffs on legal & clinical data, outperforming DP-SGD in practice (EMNLP 2025).
www.arxiv.org/abs/2509.25729

15.10.2025 20:23 | 👍 7 | 🔁 0 | 💬 1 | 📌 2