
Mingxuan (Aldous) Li

@itea1001.bsky.social

https://itea1001.github.io/ Rising third-year undergrad at the University of Chicago, working on LLM tool use, evaluation, and hypothesis generation.

7 Followers  |  23 Following  |  15 Posts  |  Joined: 10.11.2024

Latest posts by itea1001.bsky.social on Bluesky


⚡️Ever asked an LLM-as-Marilyn Monroe about the 2020 election? Our paper calls this concept incongruence, common in both AI and how humans create and reason.
🧠Read my blog to learn what we found, why it matters for AI safety and creativity, and what's next: cichicago.substack.com/p/concept-in...

31.07.2025 19:06 — 👍 9    🔁 5    💬 1    📌 0

#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!

27.07.2025 19:27 — 👍 3    🔁 1    💬 0    📌 1

Excited to present our work at #ACL2025!
Come by Poster Session 1 tomorrow, 11:00–12:30 in Hall X4/X5 — would love to chat!

27.07.2025 13:45 — 👍 4    🔁 2    💬 0    📌 0

Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵 and new paper with @ari-holtzman.bsky.social.

09.07.2025 20:07 — 👍 36    🔁 15    💬 2    📌 0

When you walk into the ER, you could get a doc:
1. Fresh from a week of not working
2. Tired from working too many shifts

@oziadias.bsky.social has been both and thinks that they're different! But can you tell from their notes? Yes we can! Paper @natcomms.nature.com www.nature.com/articles/s41...

02.07.2025 19:22 — 👍 26    🔁 11    💬 1    📌 0

🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵

27.05.2025 13:59 — 👍 28    🔁 17    💬 1    📌 1

HypoEval evaluators (github.com/ChicagoHAI/H...) are now incorporated into judges from QuotientAI — check it out at github.com/quotient-ai/...!

21.05.2025 16:58 — 👍 2    🔁 2    💬 0    📌 0
Preview
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

12/n Acknowledgments:
Great thanks to my wonderful collaborator Hanchen Li and my advisor @chenhaotan.bsky.social!
Check out the full paper at arxiv.org/abs/2504.07174

12.05.2025 19:29 — 👍 1    🔁 0    💬 0    📌 0

11/n Closing thoughts:
This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments, paving the way for personalized evaluators and alignment!

12.05.2025 19:27 — 👍 0    🔁 0    💬 1    📌 0
Preview
GitHub - ChicagoHAI/HypoEval-Gen: Repository for HypoEval paper (Hypothesis-Guided Evaluation for Natural Language Generation)

10/n Code:
We have released two repositories for HypoEval:
For replicating results / building upon the method: github.com/ChicagoHAI/H...
For off-the-shelf 0-shot evaluators for summaries and stories 🚀: github.com/ChicagoHAI/H...

12.05.2025 19:26 — 👍 1    🔁 0    💬 1    📌 0

9/n Why HypoEval matters:
We push forward LLM-as-a-judge research by showing you can get:
Sample efficiency
Interpretable automated evaluation
Strong human alignment
…without massive fine-tuning.

12.05.2025 19:26 — 👍 0    🔁 0    💬 1    📌 0

8/n 🔬 Ablation insights:
Dropping hypothesis generation → performance drops ~7%
Combining all hypotheses into one criterion → performance drops ~8% (Better to let LLMs rate one sub-dimension at a time!)

12.05.2025 19:26 — 👍 1    🔁 0    💬 1    📌 0
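To make the second ablation concrete, here is a minimal sketch (not the HypoEval code) of the two scoring modes being compared: rating against one merged criterion versus rating each sub-dimension separately and averaging. `llm_rate` and the prompt wording are hypothetical placeholders for any LLM call that returns a 1-5 rating.

```python
# Sketch of the "combined criterion vs. per-sub-dimension" ablation.
# `llm_rate` is a placeholder for an LLM call returning an int rating 1-5.
from statistics import mean
from typing import Callable, List

def score_combined(text: str, hypotheses: List[str],
                   llm_rate: Callable[[str], int]) -> float:
    """Ablation: merge all hypotheses into one criterion and rate once."""
    merged = " ".join(hypotheses)
    return float(llm_rate(f"Criteria: {merged}\nText: {text}\nRate 1-5."))

def score_decomposed(text: str, hypotheses: List[str],
                     llm_rate: Callable[[str], int]) -> float:
    """Full setup: rate one sub-dimension at a time, then average."""
    return mean(llm_rate(f"Criterion: {h}\nText: {text}\nRate 1-5.")
                for h in hypotheses)
```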

7/n 💪 What's robust?
✅ Works across out-of-distribution (OOD) tasks
✅ Generated hypotheses can be transferred to different LLMs (e.g., GPT-4o-mini ↔ LLAMA-3.3-70B)
✅ Reduces sensitivity to prompt variations compared to direct scoring

12.05.2025 19:25 — 👍 1    🔁 0    💬 1    📌 0
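The prompt-sensitivity claim above can be spot-checked with a tiny harness like this sketch: score the same text under several paraphrased prompts and compare the spread for direct scoring versus decomposed scoring. The `score_fn` callable is an illustrative placeholder, not an interface from the paper.

```python
# Illustrative prompt-sensitivity check: a smaller spread across prompt
# paraphrases means the evaluator is less sensitive to prompt wording.
from statistics import pstdev
from typing import Callable, List

def prompt_sensitivity(text: str, prompt_variants: List[str],
                       score_fn: Callable[[str, str], float]) -> float:
    """Score `text` under each paraphrased prompt and return the spread."""
    scores = [score_fn(prompt, text) for prompt in prompt_variants]
    return pstdev(scores)  # smaller = more robust to prompt variation
```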

6/n 🏆 Where did we test it?
Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt)
We show state-of-the-art correlations with human judgments, for both rankings (Spearman correlation) and scores (Pearson correlation)! 📈

12.05.2025 19:25 — 👍 1    🔁 0    💬 1    📌 0
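For readers unfamiliar with the two metrics: Spearman compares the rankings induced by human and model scores, Pearson compares the raw scores. A toy check (the numbers are invented for illustration, not results from the paper):

```python
# Toy illustration of the two reported correlation types between human
# ratings and automated scores; values are made up for demonstration only.
from scipy.stats import pearsonr, spearmanr

human_scores = [4.5, 3.0, 2.0, 4.0, 1.5]  # hypothetical human ratings
model_scores = [4.2, 3.1, 2.4, 3.8, 1.9]  # hypothetical evaluator scores

rho, _ = spearmanr(human_scores, model_scores)  # ranking agreement
r, _ = pearsonr(human_scores, model_scores)     # score agreement
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```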

5/n Why is this better?
By combining small-scale human data + literature + non-binary checklists, HypoEval:
🔹 Outperforms G-Eval by ~12%
🔹 Beats fine-tuned models trained on 3x more human labels
🔹 Adds interpretable evaluation

12.05.2025 19:24 — 👍 1    🔁 0    💬 1    📌 0

4/n These hypotheses break down a complex evaluation rubric (e.g., “Is this summary comprehensive?”) into sub-dimensions an LLM can score clearly. ✅✅✅

12.05.2025 19:24 — 👍 1    🔁 0    💬 1    📌 0
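As a rough sketch of what scoring those sub-dimensions might look like in practice (an illustration under my own assumptions, not the HypoEval repository's API): a rubric question is decomposed into hypothesis-style sub-questions, an LLM rates each one, and the ratings are averaged into an overall score.

```python
# Minimal sketch of hypothesis-guided scoring (not the HypoEval codebase).
# `llm_rate` stands in for any chat-completion call returning a 1-5 rating.
from statistics import mean
from typing import Callable, List

# Hypothetical sub-dimension hypotheses for "Is this summary comprehensive?"
HYPOTHESES: List[str] = [
    "Does the summary cover every main event of the source article?",
    "Does the summary keep the key named entities (people, places, numbers)?",
    "Does the summary include the article's conclusion?",
]

def hypothesis_guided_score(source: str, summary: str,
                            llm_rate: Callable[[str], int]) -> float:
    """Rate each sub-dimension separately, then average into one score."""
    ratings = []
    for hypothesis in HYPOTHESES:
        prompt = (f"Source article:\n{source}\n\nSummary:\n{summary}\n\n"
                  f"Question: {hypothesis}\n"
                  "Answer with a rating from 1 (not at all) to 5 (fully).")
        ratings.append(llm_rate(prompt))  # one focused judgment per sub-dimension
    return mean(ratings)
```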

3/n 🌟 Our solution: HypoEval
Building upon SOTA hypothesis generation methods, we generate hypotheses — decomposed rubrics (similar to checklists, but more systematic and explainable) — from existing literature and just 30 human annotations (scores) of texts.

12.05.2025 19:24 — 👍 2    🔁 0    💬 1    📌 0

2/n What's the problem?
Most LLM-as-a-judge studies either:
❌ Achieve lower alignment with humans,
⚙️ Require extensive fine-tuning → expensive data and compute, or
❓ Lack interpretability

12.05.2025 19:23 — 👍 2    🔁 0    💬 1    📌 0

1/n 🚀🚀🚀 Thrilled to share our latest work 🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let's dive in! 🌊

12.05.2025 19:23 — 👍 22    🔁 7    💬 1    📌 1

🧑‍⚖️ How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!

01.05.2025 19:25 — 👍 22    🔁 13    💬 2    📌 0

🚀🚀🚀 Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!

There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing - let's dive into HypoBench together.

28.04.2025 19:35 — 👍 11    🔁 9    💬 1    📌 0

1/n

You may know that large language models (LLMs) can be biased in their decision-making, but ever wondered how those biases are encoded internally and whether we can surgically remove them?

14.04.2025 19:55 — 👍 18    🔁 12    💬 1    📌 0
