#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!
27.07.2025 19:27
Excited to present our work at #ACL2025!
Come by Poster Session 1 tomorrow, 11:00–12:30 in Hall X4/X5; would love to chat!
27.07.2025 13:45
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.
This is holding us back. 🧵 and new paper with @ari-holtzman.bsky.social.
09.07.2025 20:07
When you walk into the ER, you could get a doc:
1. Fresh from a week of not working
2. Tired from working too many shifts
@oziadias.bsky.social has been both and thinks that they're different! But can you tell from their notes? Yes we can! Paper @natcomms.nature.com www.nature.com/articles/s41...
02.07.2025 19:22
🚨 New paper alert 🚨
Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️
1/n 🧵
27.05.2025 13:59
HypoEval evaluators (github.com/ChicagoHAI/H...) are now incorporated into judges from QuotientAI; check it out at github.com/quotient-ai/...!
21.05.2025 16:58
11/n Closing thoughts:
This is a sample-efficient method for LLM-as-a-judge, grounded in human judgments, paving the way for personalized evaluators and alignment!
12.05.2025 19:27
9/n Why HypoEval matters:
We push forward LLM-as-a-judge research by showing you can get:
Sample efficiency
Interpretable automated evaluation
Strong human alignment
…without massive fine-tuning.
12.05.2025 19:26
8/n 🔬 Ablation insights:
Dropping hypothesis generation → performance drops ~7%
Combining all hypotheses into one criterion → performance drops ~8% (better to let LLMs rate one sub-dimension at a time!)
12.05.2025 19:26
7/n 💪 What's robust?
✅ Works across out-of-distribution (OOD) tasks
✅ Generated hypotheses can be transferred to different LLMs (e.g., GPT-4o-mini → LLAMA-3.3-70B)
✅ Reduces sensitivity to prompt variations compared to direct scoring
12.05.2025 19:25
6/n Where did we test it?
Across summarization (SummEval, NewsRoom) and story generation (HANNA, WritingPrompt)
We show state-of-the-art correlations with human judgments for both rankings (Spearman correlation) and scores (Pearson correlation)!
12.05.2025 19:25
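For reference, a minimal sketch of how these rank and score correlations against human judgments are typically computed with SciPy; the numbers below are made-up placeholders, not results from the paper.

```python
# Correlate automated evaluator scores with human judgments.
# The arrays are illustrative placeholders, not data from HypoEval.
from scipy.stats import spearmanr, pearsonr

human_scores = [4.0, 3.5, 2.0, 5.0, 1.5]  # human ratings for five texts
model_scores = [3.8, 3.6, 2.2, 4.7, 1.9]  # evaluator scores for the same texts

rho, _ = spearmanr(human_scores, model_scores)  # rank agreement (rankings)
r, _ = pearsonr(human_scores, model_scores)     # linear agreement (raw scores)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```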
5/n Why is this better?
By combining small-scale human data + literature + non-binary checklists, HypoEval:
🔹 Outperforms G-Eval by ~12%
🔹 Beats fine-tuned models trained on 3x more human labels
🔹 Adds interpretable evaluation
12.05.2025 19:24
4/n These hypotheses break down a complex evaluation rubric (e.g., "Is this summary comprehensive?") into sub-dimensions an LLM can score clearly.
12.05.2025 19:24
3/n Our solution: HypoEval
Building upon SOTA hypothesis generation methods, we generate hypotheses, i.e., decomposed rubrics (similar to checklists, but more systematic and explainable), from existing literature and just 30 human annotations (scores) of texts.
12.05.2025 19:24
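For intuition, here is a rough sketch of the idea in 3/n and 4/n: score each sub-dimension of a decomposed rubric with a separate LLM call, then aggregate. This is not the released HypoEval code (see github.com/ChicagoHAI/H... above); the rubric text and the llm_score helper are illustrative placeholders.

```python
# Sketch: rate each sub-dimension of a decomposed rubric separately, then aggregate.
from statistics import mean

# Illustrative decomposition of "Is this summary comprehensive?"
rubric = [
    "Does the summary cover the main events of the source?",
    "Does the summary keep the key named entities?",
    "Does the summary preserve the source's main conclusions?",
]

def llm_score(text: str, question: str) -> float:
    # Placeholder: in practice this would prompt an LLM to rate `text`
    # on `question`, e.g. on a 1-5 scale. Returns a dummy value here.
    return 3.0

def evaluate(text: str) -> float:
    # One LLM call per sub-dimension, then a simple average;
    # the paper's actual aggregation may differ.
    return mean(llm_score(text, q) for q in rubric)

print(evaluate("A candidate summary to be scored."))
```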
2/n What's the problem?
Most LLM-as-a-judge studies:
❌ Achieve lower alignment with humans
⚠️ Require extensive fine-tuning → expensive data and compute
❌ Lack interpretability
12.05.2025 19:23
1/n Thrilled to share our latest work: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation!
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability. Let's dive in!
12.05.2025 19:23
How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate them?
Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!
01.05.2025 19:25
Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!
There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing - let's dive into HypoBench together.
28.04.2025 19:35
1/n
You may know that large language models (LLMs) can be biased in their decision-making, but have you ever wondered how those biases are encoded internally and whether we can surgically remove them?
14.04.2025 19:55
I do research in social computing and LLMs at Northwestern with @robvoigt.bsky.social and Kaize Ding.
Uses machine learning to study literary imagination, and vice-versa. Likely to share news about AI with sentient space crabs.
Information Sciences and English, UIUC. Distant Horizons (Chicago, 2019). tedunderwood.com
Incoming Asst Prof @UMD Info College, currently postdoc @UChicago. NLP, computational social science, political communication, linguistics. Past: Info PhD @UMich, CS + Lx @Stanford. Interests: cats, Yiddish, talking to my cats in Yiddish.
Assistant Professor @ UChicago CS & DSI
Leading Conceptualization Lab http://conceptualization.ai
Minting new vocabulary to conceptualize generative models.
CS professor at UT Austin. Large language models and NLP. he/him
Final year NLP PhD student at UChicago.
Explainability, reasoning, and hypothesis generation!
PhD @UChicagoCS / BE in CS @Umich / ✨ AI/NLP transparency and interpretability / 📷🎨 photography, painting
Doctor of NLP/Vision+Language from UCSB
Evals, metrics, multilinguality, multiculturality, multimodality, and (dabbling in) reasoning
https://saxon.me/
Entrepreneur, pursuer of noise in neurosciences, mechanistic interpretability and interventions in "AI", complexity, concentrated on practical applications of theoretically working solutions. Deeptech, startups.
Anything multiscale, iterative, nonlinear
A team of three human PIs (Ari Holtzman, Mina Lee, and Chenhao Tan) studying and building the new information ecosystem of humans and machines. https://substack.com/@cichicago, https://ci.cs.uchicago.edu/
nlp phd student at uchicago cs
Computer Science PhD student at UChicago | Member of the Chicago Human+AI lab @chicagohai.bsky.social
Professor at UW; Researcher at Meta. LMs, NLP, ML. PNW life.
The 2025 Conference on Language Modeling will take place at the Palais des Congrès in Montreal, Canada from October 7-10, 2025
https://chicagohai.github.io/, https://substack.com/@cichicago
Breakthrough AI to solve the world's biggest problems.
▸ Join us: http://allenai.org/careers
▸ Get our newsletter: https://share.hsforms.com/1uJkWs5aDRHWhiky3aHooIg3ioxm
Assistant professor, research scientist | boosting scientific discovery with AI, NLP, IR, KG, HCI | @ai2.bsky.social
Ph.D. Student at the University of Chicago | Chicago Human + AI Lab
haokunliu.com
International Conference on Learning Representations https://iclr.cc/
Assistant Professor @Stanford CS @StanfordNLP @StanfordAILab
Computational Social Science & NLP