
Emily Byun

@yewonbyun.bsky.social

PhD Student in Machine Learning at CMU. yewonbyun.github.io

1,114 Followers  |  64 Following  |  15 Posts  |  Joined: 18.11.2024

Latest posts by yewonbyun.bsky.social on Bluesky

14/ This work will be presented as a spotlight talk today at the #COLM2025 SocialSim workshop and at NeurIPS 2025.

Paper: arxiv.org/abs/2508.06635
Code: github.com/lasilab/valid-synth-inference

10.10.2025 16:12 | 👍 2    🔁 1    💬 0    📌 0

13/ I really enjoyed working on this project with the brilliant and kindest @shantanug.bsky.social and great mentors @zacharylipton.bsky.social @donskerclass.bsky.social @brwilder.bsky.social

10.10.2025 16:12 | 👍 0    🔁 0    💬 1    📌 0

12/ This framework provides a foundation for easily extensible estimation methods that can safely incorporate the growing variety and quality of synthetic data sources.

10.10.2025 16:12 | 👍 1    🔁 0    💬 1    📌 1

11/ At a fundamental level, this work takes a step towards understanding how synthetic data from foundation models can be used to support valid inference. As the usage and promise of FMs continue to grow, so too will the complexity of pipelines that incorporate their outputs.

10.10.2025 16:12 | 👍 2    🔁 0    💬 1    📌 0

10/ Empirically, we observe large gains in estimation performance (lower MSE + tighter confidence intervals with valid coverage) across diverse computational social science tasks, with benefits most pronounced in low-label regimes.

10.10.2025 16:12 | 👍 1    🔁 0    💬 1    📌 0

9/ In other words, in the worst case where synthetic data is *completely* uninformative (bad quality), including it does not hurt, at least asymptotically.

10.10.2025 16:12 | 👍 1    🔁 0    💬 1    📌 0

8/ When the synthetic and real data moments are independent of each other, the variance reduces to the optimal variance based only on the real data.

10.10.2025 16:12 | 👍 2    🔁 0    💬 1    📌 0
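
A schematic version of the claim in 8/, under simplifying assumptions that are not taken from the paper: suppose the real-data moment identifies the target parameter, the synthetic moments have mean zero at a value treated as known (for example, pinned down by a very large synthetic sample), and the stacked moments have covariance $\Sigma$ with blocks $\Sigma_{11}$ (real), $\Sigma_{22}$ (synthetic) and $\Sigma_{12}$ (cross). For a scalar mean, the efficient GMM asymptotic variance is then:

```latex
\operatorname{avar}(\hat\theta)
  = \Sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21},
\qquad
\Sigma_{12} = 0 \;\Longrightarrow\; \operatorname{avar}(\hat\theta) = \Sigma_{11}.
```

The subtracted term is never negative, so under these assumptions the synthetic moments can only help or leave the variance unchanged, and with zero cross-covariance the result is exactly the real-data-only variance.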

7/ Precisely: The GMM measures the cross-correlations between the synthetic and real data, producing a combination of these moments that reduces the variance of the real data moments if there is information from the synthetic data moments.

10.10.2025 16:12 | 👍 2    🔁 0    💬 1    📌 0
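
The mechanism in 7/ and 8/ can be illustrated with a toy simulation. This is a minimal control-variate sketch, not the estimator from the paper: it assumes a scalar target mean, a single synthetic signal whose mean is known to be zero, and Gaussian data. The function name `variance_comparison` and all parameters are hypothetical.

```python
import math
import random
import statistics


def variance_comparison(rho, n=100, trials=400, seed=0):
    """Compare the sampling variance of the plain real-data mean with a
    version adjusted by a correlated synthetic signal (toy control variate).

    rho is the correlation between the real outcome y and the synthetic
    signal s. The synthetic signal is assumed to have known mean 0.
    """
    rng = random.Random(seed)
    plain, adjusted = [], []
    for _ in range(trials):
        s = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y has true mean 1.0 and correlation rho with s
        y = [1.0 + rho * si + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
             for si in s]
        y_bar, s_bar = statistics.fmean(y), statistics.fmean(s)
        # Estimated regression coefficient of y on s (the cross-moment weight)
        num = sum((yi - y_bar) * (si - s_bar) for yi, si in zip(y, s))
        den = sum((si - s_bar) ** 2 for si in s)
        beta = num / den
        plain.append(y_bar)
        # Subtract the part of the real-data noise explained by s
        adjusted.append(y_bar - beta * s_bar)
    return statistics.pvariance(plain), statistics.pvariance(adjusted)


if __name__ == "__main__":
    for rho in (0.9, 0.0):
        v_plain, v_adj = variance_comparison(rho)
        print(f"rho={rho}: plain={v_plain:.5f}, adjusted={v_adj:.5f}")
```

With a strongly correlated signal the adjusted estimator should have a much smaller variance, while with an independent signal the two variances should be roughly equal, which mirrors the "does not hurt" behavior described in the thread.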

6/ Why and when does synthetic data help? We found that incorporating synthetic data leads to more precise estimation and tighter confidence intervals when its moments are predictive of the real data moments.

10.10.2025 16:12 | 👍 2    🔁 0    💬 1    📌 0

5/ A priori, it was not intuitive whether incorporating additional moments based solely on synthetic data (defined in terms of a separate parameter from the target) would benefit, or even affect, the estimation of the target parameter of the real data.

10.10.2025 16:12 | 👍 1    🔁 0    💬 1    📌 0

4/ We propose a solution via a new estimator based on the generalized method of moments (GMM) that allows us to incorporate these multiple sources of information by adding moments.

10.10.2025 16:12 | 👍 2    🔁 0    💬 1    📌 0
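
For readers new to GMM, "adding moments" as in 4/ can be written schematically as a stacked moment vector minimized under a weighting matrix. The notation below is assumed for illustration and is not taken from the paper: $\theta$ is the target parameter, $\eta$ a separate synthetic-data parameter, $g_{\mathrm{real}}$ and $g_{\mathrm{syn}}$ the two groups of moment functions, $z_i$ and $\tilde z_i$ the real and synthetic observations, and $W$ a positive definite weight matrix.

```latex
\hat g_n(\theta,\eta)
  = \frac{1}{n}\sum_{i=1}^{n}
    \begin{pmatrix}
      g_{\mathrm{real}}(z_i;\theta) \\
      g_{\mathrm{syn}}(\tilde z_i;\eta)
    \end{pmatrix},
\qquad
(\hat\theta,\hat\eta)
  = \arg\min_{\theta,\eta}\;
    \hat g_n(\theta,\eta)^{\top}\, W \,\hat g_n(\theta,\eta).
```

In standard GMM the efficient choice of $W$ is the inverse covariance of the stacked moments, and its off-diagonal blocks are the channel through which synthetic moments can influence $\hat\theta$.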

3/ Problem: Naively aggregating these different sources of information leads to highly biased estimates, due to differences in the underlying distributions.

10.10.2025 16:12 | 👍 1    🔁 0    💬 1    📌 0
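
The bias from naive pooling in 3/ is easy to reproduce in a toy setting. The sketch below is purely illustrative and assumes a mean-estimation task where the synthetic samples come from a distribution shifted by a fixed amount. The function name `pooled_vs_real`, the shift size and the sample sizes are all hypothetical choices.

```python
import random
import statistics


def pooled_vs_real(n_real=50, n_synth=1000, shift=0.5, trials=200, seed=0):
    """Return (bias of real-only mean, bias of naively pooled mean).

    Real samples are drawn around the true mean. Synthetic samples are
    drawn around true mean + shift, mimicking a distribution mismatch.
    """
    rng = random.Random(seed)
    true_mean = 1.0
    real_est, pooled_est = [], []
    for _ in range(trials):
        real = [rng.gauss(true_mean, 1.0) for _ in range(n_real)]
        synth = [rng.gauss(true_mean + shift, 1.0) for _ in range(n_synth)]
        real_est.append(statistics.fmean(real))
        # Naive aggregation: treat synthetic samples as if they were real
        pooled_est.append(statistics.fmean(real + synth))
    real_bias = statistics.fmean(real_est) - true_mean
    pooled_bias = statistics.fmean(pooled_est) - true_mean
    return real_bias, pooled_bias


if __name__ == "__main__":
    real_bias, pooled_bias = pooled_vs_real()
    print(f"real-only bias: {real_bias:+.3f}, pooled bias: {pooled_bias:+.3f}")
```

Because the synthetic samples far outnumber the real ones here, the pooled mean is pulled almost all the way to the shifted synthetic mean, while the real-only mean stays close to the truth at the cost of higher variance.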

2/ In limited-label regimes, LLMs give practitioners a cheap way to obtain imperfect labels and even generate entirely new synthetic samples.

10.10.2025 16:12 | 👍 3    🔁 0    💬 1    📌 0

💡 Can we trust synthetic data for statistical inference?

We show that synthetic data (e.g., LLM simulations) can significantly improve the performance of inference tasks. The key intuition lies in the interactions between the moment residuals of synthetic data and those of real data.

10.10.2025 16:12 | 👍 36    🔁 9    💬 2    📌 5
Equilibrium effects of LLM reviewing

Should LLMs be used to review papers? AAAI is piloting LLM-generated reviews this year. I wrote a blog post arguing that using LLMs as reviewers can have bad downstream consequences for science by centralizing judgments about what constitutes good research.

bryanwilder.github.io/files/llmrev...

26.05.2025 18:20 | 👍 116    🔁 31    💬 11    📌 31
Photo of Johns Hopkins Campus

I'm recruiting PhD students for Fall 2025! CS PhD Deadline: Dec. 15th.

I work on safe/reliable ML and causal inference, motivated by healthcare applications.

Beyond myself, Johns Hopkins has a rich community of folks doing similar work. Come join us!

27.11.2024 15:58 | 👍 19    🔁 7    💬 1    📌 0

would love to join!

19.11.2024 18:04 | 👍 3    🔁 0    💬 1    📌 0
