#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!
27.07.2025 19:27
@haokunliu.bsky.social
Ph.D. Student at the University of Chicago | Chicago Human + AI Lab haokunliu.com
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.
This is holding us back. Thread below, and new paper with @ari-holtzman.bsky.social.
We are making some exciting updates to hypogenic this summer: github.com/ChicagoHAI/h... and will post updates here.
09.07.2025 13:50
It predicts pretty well. Not just shifts in the last week, but also:
1. Who's working an overnight shift (in our data + external validation in MIMIC)
2. Who's working on a disruptive circadian schedule
3. How many patients the doc has seen *on the current shift*
New paper alert!
Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? Should the LLM answer at all? We call these clashes Concept Incongruence. Read on!
1/n
1/n Thrilled to share our latest work: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation!
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability. Let's dive in!
How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate them?
Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!
Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!
@chachachen.bsky.social GPT and x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLMs (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation (Saturday at 4pm)
13/ Lastly, great thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social, and my advisor @chenhaotan.bsky.social!
28.04.2025 19:40
12/ For more details and to access our datasets and code, please visit our paper at arxiv.org/abs/2504.11524. We also have an official website and leaderboards available at chicagohai.github.io/HypoBench/
28.04.2025 19:38
11/ Why HypoBench matters: it establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant challenges.
28.04.2025 19:37
10/ Model priors matter: models have different priors, which lead to varying behaviors across tasks. Generating good hypotheses is harder when prior knowledge is not helpful.
28.04.2025 19:37
9/ And it gets worse in counterintuitive settings: models perform significantly worse when the underlying hypotheses are counterintuitive.
28.04.2025 19:37
8/ Synthetic dataset results show: LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text, highlighting significant room for improvement.
28.04.2025 19:36
7/ Qualitative insights: methods balancing novelty and plausibility are rare; iterative refinement boosts novelty but risks plausibility. Literature-driven hypotheses excelled in plausibility but lacked novelty.
28.04.2025 19:36
6/ Real-world implications: methods integrating literature insights with data outperform simple zero/few-shot inference. Qwen excelled at generating generalizable hypotheses.
28.04.2025 19:36
5/ But even top models and methods struggle significantly as task complexity rises. At base difficulty, the best model captured 93.8% of hypotheses; this dropped sharply to 38.8% with increased complexity.
28.04.2025 19:36
4/ Yes, LLMs can generate effective hypotheses: we tested 4 state-of-the-art models (GPT, Qwen, Llama, and DeepSeek) with 6 existing hypothesis generation methods. We found that using Qwen and integrating literature with data (LITERATURE + DATA) yields the best results.
28.04.2025 19:36
3/ Introducing HypoBench: our novel benchmark spans 194 datasets across 7 real-world and 5 synthetic tasks, testing key hypothesis generation capabilities like explanatory power, generalizability, and discovery rate.
28.04.2025 19:35
2/ What makes a good hypothesis? It requires three key skills: inductive reasoning, abstraction, and synthesis. Good hypotheses should primarily have strong explanatory power and be interesting to researchers.
28.04.2025 19:35
1/ What is hypothesis generation? We define it clearly: a hypothesis is a natural-language explanation of observed phenomena, critical for both science and everyday reasoning.
28.04.2025 19:35
Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!
There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing. Let's dive into HypoBench together.
Encourage your students to submit posters and register! Limited free housing is provided for student participants only, on a first-come, first-served basis (by request order).
We are also actively looking for sponsors. Reach out if you are interested!
Please repost! Help spread the word!
1/n
You may know that large language models (LLMs) can be biased in their decision-making, but have you ever wondered how those biases are encoded internally and whether we can surgically remove them?
Screenshot of top half of first page of paper. The paper is titled: "When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models". The authors are Julia Mendelsohn (University of Chicago) and Ceren Budak (University of Michigan). The top right corner contains a visual showing the sentence "They want immigrants to pour into and infest this country". The caption says: Figure 1: Dehumanizing sentence likening immigrants to the source domain concepts of Water and Vermin via the words "pour" and "infest". The abstract text on the left reads: Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. "water" or "vermin"). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.
New preprint!
Metaphors shape how people understand politics, but measuring them (& their real-world effects) is hard.
We develop a new method to measure metaphor & use it to study dehumanizing metaphor in 400K immigration tweets. Link: bit.ly/4i3PGm3
#NLP #NLProc #polisky #polcom #compsocialsci
Spent a great day at Boulder meeting new students and old colleagues. I used to take in this view every day.
Here are the slides for my talk titled "Alignment Beyond Human Preferences: Use Human Goals to Guide AI towards Complementary AI": chenhaot.com/talks/alignm...
Check out our project website for our latest paper! Learn about a new approach to hypothesis generation:
chicagohai.github.io/hypogenic-de...
Check out this podcast for our paper: youtu.be/q7Vrvpc1cPQ?si…
(Powered by NotebookLM)
11/ Acknowledgments:
Last but not least, great thanks to all of my amazing collaborators: @qiaoyu-rosa.bsky.social, Mingxuan Li, Chenfei Yuan, all Chicago Human + AI lab members, and my advisor @chenhaotan.bsky.social!
Check out our full paper at arxiv.org/abs/2410.17309