Thank you to @anjalief.bsky.social for advising. Hands-on with DP-SGD? Start with our companion paper and open-source package
(arxiv.org/abs/2507.07229
github.com/kr-ramesh/sy...)
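For readers new to DP-SGD, the core idea is: clip each per-example gradient to a fixed norm, average, and add Gaussian noise before the update. A minimal NumPy sketch of one step (illustrative only; names like `dp_sgd_step` and the default hyperparameters are hypothetical, not the linked package's API):

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, clip_norm=1.0, noise_mult=1.1, lr=0.1, rng=None):
    """One DP-SGD step: clip each per-sample gradient to clip_norm,
    average, add Gaussian noise scaled by noise_mult, then descend.
    Hypothetical illustration, not the linked package's API."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))  # per-example clipping
    mean_grad = np.mean(clipped, axis=0)
    # Noise std follows the usual sigma * C / batch_size scaling
    noise = rng.normal(0.0, noise_mult * clip_norm / len(per_sample_grads), size=mean_grad.shape)
    return params - lr * (mean_grad + noise)
```

The clipping bounds each example's influence on the update, which is what makes the Gaussian noise yield a formal (epsilon, delta) guarantee via the moments accountant.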
Paper & code
Accepted to EMNLP 2025 Main.
arXiv: arxiv.org/abs/2509.25729
Code: github.com/zzhao71/Cont...
#SyntheticData #Privacy #NLP #LLM #Deidentification #HealthcareAI
4/5 Utility
On TAB, prefix-tuning+masking gives best utility (Perplexity ≈ 10.2, MAUVE ≈ 0.83), beating ICL and DP-SGD. Similar trends on MIMIC-III.
3/5 Privacy
ICL+blocking: ~0.00% privacy leakage (avg in our runs).
Prefix-tuning+masking yields the lowest ROUGE vs training data (e.g., ROUGE-L ≈ 0.098), indicating less copying.
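For context, ROUGE-L measures overlap with the training data via longest common subsequence, so low scores mean little verbatim copying. A minimal self-contained sketch of ROUGE-L F1 (real evaluations typically use a package such as rouge-score, with stemming and tokenization details this omits):

```python
def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between two token lists, via LCS dynamic programming."""
    c, r = candidate, reference
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)  # harmonic mean of LCS precision/recall
```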
2/5 How it works
• Build control codes from detected private entities (PERSON, ORG, LOC, etc.).
• Generate with either ICL (and block those identifiers at decode time) or prefix-tuning with a privacy mask + KL/contrastive losses.
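A toy sketch of the two ingredients above (the function names, control-code format, and toy greedy decoder are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def control_code(entities):
    """Build a control-code string from (type, surface) pairs from an NER pass.
    Illustrative format, not the paper's actual encoding."""
    return " ".join(f"<{t}:{s}>" for t, s in entities)

def greedy_decode_with_blocking(logits_fn, blocked_ids, steps):
    """Greedy decoding that assigns -inf to blocked identifier token ids,
    so private identifiers can never be emitted."""
    out = []
    for _ in range(steps):
        logits = logits_fn(out).copy()       # scores over the vocabulary
        logits[list(blocked_ids)] = -np.inf  # block private-identifier tokens
        out.append(int(np.argmax(logits)))
    return out
```

With a real model, the same effect comes from a bad-words constraint at decode time; the key point is that blocking happens in the decoder, so leakage is prevented regardless of what the prompt elicits.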
Text anonymization is hard; DP often hurts utility.
We use entity-aware control codes + either ICL (with bad-token blocking) or prefix-tuning w/ masking to get strong privacy-utility tradeoffs on legal & clinical data, outperforming DP-SGD in practice (EMNLP 2025).
www.arxiv.org/abs/2509.25729