
Haokun Liu

@haokunliu.bsky.social

Ph.D. Student at the University of Chicago | Chicago Human + AI Lab | haokunliu.com

84 Followers  |  53 Following  |  27 Posts  |  Joined: 13.11.2024

Latest posts by haokunliu.bsky.social on Bluesky


#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!

27.07.2025 19:27 — 👍 3    🔁 1    💬 0    📌 1

Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵 and new paper with @ari-holtzman.bsky.social.

09.07.2025 20:07 — 👍 36    🔁 15    💬 2    📌 0
GitHub - ChicagoHAI/hypothesis-generation: This is the official repository for HypoGeniC (Hypothesis Generation in Context) and HypoRefine, which are automated, data-driven tools that leverage large language models to generate hypothesis fo...

We are making some exciting updates to hypogenic this summer: github.com/ChicagoHAI/h... and will post updates here.
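
For readers new to the tool, here is a minimal sketch of a HypoGeniC-style generate-and-refine loop, assuming a generic `llm` text-completion callable; the function names and prompts below are illustrative placeholders, not the repo's actual API.

```python
# Minimal sketch of a HypoGeniC-style generate-and-refine loop.
# `llm` is an assumed text-completion callable; this is NOT the actual
# hypogenic API (see github.com/ChicagoHAI/hypothesis-generation).
from typing import Callable, List, Tuple

def generate_hypotheses(llm: Callable[[str], str],
                        examples: List[Tuple[str, str]],
                        n: int = 5) -> List[str]:
    """Ask the LLM to induce candidate hypotheses from labeled examples."""
    shots = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in examples)
    prompt = (f"Propose {n} natural-language hypotheses that explain "
              f"the labels in these examples:\n{shots}")
    return [h.strip() for h in llm(prompt).split("\n") if h.strip()][:n]

def hypothesis_accuracy(llm: Callable[[str], str], hypothesis: str,
                        data: List[Tuple[str, str]]) -> float:
    """Score a hypothesis by how often it predicts held-out labels."""
    hits = sum(
        llm(f"Hypothesis: {hypothesis}\nInput: {x}\nLabel:").strip() == y
        for x, y in data
    )
    return hits / len(data)

def refine(llm: Callable[[str], str],
           train: List[Tuple[str, str]],
           held_out: List[Tuple[str, str]],
           rounds: int = 3, keep: int = 5) -> List[str]:
    """Keep a bank of hypotheses; each round, drop the weakest and regenerate."""
    bank = generate_hypotheses(llm, train, n=keep)
    for _ in range(rounds):
        bank.sort(key=lambda h: hypothesis_accuracy(llm, h, held_out),
                  reverse=True)
        bank = bank[:keep - 1] + generate_hypotheses(llm, train, n=1)
    return bank
```

The actual package includes more machinery (e.g., reward-based ranking of hypotheses), so treat this purely as intuition for what "automated, data-driven hypothesis generation" means.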

09.07.2025 13:50 — 👍 2    🔁 1    💬 0    📌 1

It predicts pretty well—not just shifts in the last week, but also:

1. Who's working an overnight shift (in our data + external validation in MIMIC)

2. Who's working on a disruptive circadian schedule

3. How many patients the doctor has seen *on the current shift*

02.07.2025 19:24 — 👍 5    🔁 3    💬 1    📌 0

🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵

27.05.2025 13:59 — 👍 28    🔁 17    💬 1    📌 1

1/n 🚀🚀🚀 Thrilled to share our latest work 🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let's dive in! 🌊
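
As a rough illustration of the idea (all names and prompts below are assumptions, not the HypoEval codebase): induce rubric-like hypotheses from a few human-scored examples, then have the LLM score new text against each rubric and aggregate.

```python
# Illustrative sketch of hypothesis-guided evaluation, NOT the HypoEval code:
# 1) induce rubric-style hypotheses from a few human-scored examples,
# 2) score new text on each rubric with the LLM, 3) aggregate.
def induce_rubrics(llm, scored_examples, n=4):
    """Turn a few human-scored examples into rubric-style hypotheses."""
    shots = "\n".join(f"Text: {t}\nHuman score: {s}" for t, s in scored_examples)
    prompt = (f"From these scored examples, state {n} concrete criteria "
              f"(hypotheses) that explain the scores:\n{shots}")
    return [r.strip() for r in llm(prompt).split("\n") if r.strip()][:n]

def hypothesis_guided_score(llm, text, rubrics):
    """Score the text on each rubric, then take a simple (placeholder) mean."""
    per_rubric = [float(llm(f"Criterion: {r}\nText: {text}\nScore 1-5:").strip())
                  for r in rubrics]
    return sum(per_rubric) / len(per_rubric)
```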

12.05.2025 19:23 — 👍 22    🔁 7    💬 1    📌 1

๐Ÿง‘โ€โš–๏ธHow well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!

01.05.2025 19:25 — 👍 22    🔁 13    💬 2    📌 0

Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!

@chachachen.bsky.social GPT ❌ x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLM 🧑‍⚖️ (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation 🔬 (Saturday at 4pm)

30.04.2025 20:19 — 👍 17    🔁 7    💬 0    📌 0

13/ Lastly, great thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social, and my advisor @chenhaotan.bsky.social!

28.04.2025 19:40 — 👍 1    🔁 0    💬 0    📌 0
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate method...

12/ 🌟 For more details and access to our datasets and code, see our paper at arxiv.org/abs/2504.11524. An official website and leaderboards are available at chicagohai.github.io/HypoBench/

28.04.2025 19:38 — 👍 0    🔁 0    💬 1    📌 0

11/ Why HypoBench matters: it establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant challenges.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0

10/ Model priors matter: We see that the models have different priors, which lead to varying behaviors in different tasks—generating good hypotheses is harder when prior knowledge is not helpful.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0

9/ And it gets harder in counterintuitive settings: models perform significantly worse when the underlying hypotheses are counterintuitive.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0

8/ 💡 Synthetic dataset results show: LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text—highlighting significant room for improvement.

28.04.2025 19:36 — 👍 0    🔁 0    💬 1    📌 0

7/ Qualitative Insights: Methods balancing novelty and plausibility are rare; iterative refinement boosts novelty but risks plausibility. Literature-driven hypotheses excelled in plausibility but lacked novelty.

28.04.2025 19:36 — 👍 0    🔁 0    💬 1    📌 0

6/ ๐ŸŒ Real-world implications: Methods integrating literature insights with data outperform simple zero/few-shot inference. Qwen excelled in generating generalizable hypotheses.

28.04.2025 19:36 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0

5/ 🚨 But… Even top models and methods struggle significantly as task complexity rises. At base difficulty, the best model captured 93.8% of hypotheses; this dropped sharply to 38.8% with increased complexity.
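
One way to read these numbers: if discovery rate is the fraction of ground-truth hypotheses recovered by at least one generated hypothesis, the computation looks roughly like the sketch below. The `matches` predicate (e.g., an LLM semantic-equivalence judge) is an assumed helper, not the paper's exact implementation.

```python
# Sketch of a discovery-rate computation: the share of ground-truth
# hypotheses matched by at least one generated hypothesis.
# `matches` (e.g., an LLM semantic-equivalence judge) is an assumed helper.
def discovery_rate(generated, ground_truth, matches) -> float:
    found = sum(any(matches(g, t) for g in generated) for t in ground_truth)
    return found / len(ground_truth)

# Illustration: recovering 15 of 16 true hypotheses gives 15/16 = 0.9375,
# i.e. roughly the 93.8% reported at base difficulty.
```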

28.04.2025 19:36 — 👍 0    🔁 0    💬 1    📌 0

4/ Yes, LLMs can generate effective hypotheses: we tested 4 state-of-the-art models—GPT, Qwen, Llama, and DeepSeek—with 6 existing hypothesis generation methods. We found that using Qwen and integrating literature with data (LITERATURE + DATA) yields the best results.

28.04.2025 19:36 — 👍 0    🔁 0    💬 1    📌 0

3/ 📊 Introducing HypoBench: Our novel benchmark spans 194 datasets across 7 real-world and 5 synthetic tasks, testing key hypothesis generation capabilities like explanatory power, generalizability, and discovery rate.

28.04.2025 19:35 — 👍 0    🔁 0    💬 1    📌 0

2/ 🤔 What makes a good hypothesis? It requires three key skills: inductive reasoning, abstraction, and synthesis. Good hypotheses should primarily have strong explanatory power and be interesting to researchers.

28.04.2025 19:35 — 👍 0    🔁 0    💬 1    📌 0

1/ What is hypothesis generation? We define it clearly: a hypothesis is a natural-language explanation of observed phenomena—critical for both science and everyday reasoning.

28.04.2025 19:35 — 👍 1    🔁 0    💬 1    📌 0

🚀🚀🚀 Excited to share our latest work: HypoBench, a systematic benchmark for evaluating LLM-based hypothesis generation methods!

There is much excitement about leveraging LLMs for scientific hypothesis generation, but principled evaluations are missing - let's dive into HypoBench together.

28.04.2025 19:35 — 👍 11    🔁 9    💬 1    📌 0

Encourage your students to submit posters and register! Limited free housing is provided for student participants only, on a first-come, first-served basis (i.e., by request).

We are also actively looking for sponsors. Reach out if you are interested!

Please repost! Help spread the word!

21.04.2025 15:12 — 👍 10    🔁 10    💬 2    📌 0

1/n

You may know that large language models (LLMs) can be biased in their decision-making, but ever wondered how those biases are encoded internally and whether we can surgically remove them?
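
For intuition on what "surgically remove" could mean, here is a hedged sketch of one standard recipe, linear concept erasure via a difference-of-means direction; the paper's actual method may well differ.

```python
# Hedged sketch of linear bias removal: estimate a direction in hidden-state
# space that separates the biased attribute, then project it out. This
# illustrates the general recipe only; the paper's method may differ.
import numpy as np

def bias_direction(H: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Unit difference-of-means direction over hidden states H (n, d), labels y in {0, 1}."""
    v = H[y == 1].mean(axis=0) - H[y == 0].mean(axis=0)
    return v / np.linalg.norm(v)

def project_out(H: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove each hidden state's component along v (rank-1 orthogonal projection)."""
    return H - np.outer(H @ v, v)
```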

14.04.2025 19:55 — 👍 18    🔁 12    💬 1    📌 0
Screenshot of top half of first page of paper. The paper is titled: "When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models". The authors are Julia Mendelsohn (University of Chicago) and Ceren Budak (University of Michigan). The top right corner contains a visual showing the sentence "They want immigrants to pour into and infest this country". The caption says: Figure 1: Dehumanizing sentence likening immigrants to the source domain concepts of Water and Vermin via the words "pour" and "infest". 

The abstract text on the left reads: Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. "water" or "vermin"). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.


New preprint!
Metaphors shape how people understand politics, but measuring them (& their real-world effects) is hard.

We develop a new method to measure metaphor & use it to study dehumanizing metaphors in 400K immigration tweets. Link: bit.ly/4i3PGm3

#NLP #NLProc #polisky #polcom #compsocialsci
🐦🐦

20.02.2025 19:59 — 👍 182    🔁 64    💬 6    📌 11

Spent a great day at Boulder meeting new students and old colleagues. I used to take in this view every day.

Here are the slides for my talk titled "Alignment Beyond Human Preferences: Use Human Goals to Guide AI towards Complementary AI": chenhaot.com/talks/alignm...

24.01.2025 15:01 — 👍 17    🔁 5    💬 0    📌 1

💡 Check out our project website for our latest paper! Learn about a new approach to hypothesis generation:
👉 chicagohai.github.io/hypogenic-de...

22.11.2024 14:29 — 👍 5    🔁 2    💬 0    📌 0

Check out this podcast for our paper: youtu.be/q7Vrvpc1cPQ?si…
(Powered by NotebookLM)

16.11.2024 00:13 — 👍 4    🔁 1    💬 0    📌 0
Literature Meets Data: A Synergistic Approach to Hypothesis Generation AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. W...

11/ Acknowledgments:
Last but not least, great thanks to all of my amazing collaborators: @qiaoyu-rosa.bsky.social, Mingxuan Li, Chenfei Yuan, all Chicago Human + AI lab members, and my advisor @chenhaotan.bsky.social!
Check out our full paper at arxiv.org/abs/2410.17309

14.11.2024 20:45 — 👍 3    🔁 0    💬 0    📌 0
