
Haokun Liu

@haokunliu.bsky.social

Ph.D. Student at the University of Chicago | Chicago Human + AI Lab haokunliu.com

87 Followers  |  55 Following  |  39 Posts  |  Joined: 13.11.2024

Latest posts by haokunliu.bsky.social on Bluesky

Big thanks to @chicagohai.bsky.social team and everyone who submitted ideas on IdeaHub. Special shoutout to the open source community building research agents! We're all learning together.

10.11.2025 22:46 — 👍 2    🔁 0    💬 0    📌 0

All 6 generated repositories with detailed code and reports:
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/i...
- github.com/ChicagoHAI/i...
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/l...

10.11.2025 22:45 — 👍 0    🔁 0    💬 1    📌 0
Hypogenic AI - Shaping the Future of Science Reimagining science by augmenting scientist-AI collaboration.

Submit your idea, vote on existing ones, or help improve idea-explorer: github.com/ChicagoHAI/i...

Full blog with technical details:
hypogenic.ai/blog/weekly-...
Substack: open.substack.com/pub/cichicag...

10.11.2025 22:45 — 👍 0    🔁 0    💬 1    📌 0

So why are we doing this openly?

Because agents clearly can accelerate early-stage exploration. But they need human oversight at every step. Transparent benchmarking beats cherry-picked demos. Community feedback improves agents faster. And honestly, we're all figuring this out together.

10.11.2025 22:44 — 👍 1    🔁 0    💬 1    📌 0

Existing agents like AI-Scientist and AI-Researcher are basically overfitted to ML. They contain hard-coded prompts that "require changing hyperparameters and training on HuggingFace datasets" and ML-specific sub-agents. Just changing the prompts won't be enough, as ML assumptions are everywhere in the codebase.

10.11.2025 22:44 — 👍 0    🔁 0    💬 1    📌 0

The pattern: we can fix specific bugs with better prompts (bias-variance tradeoff). But we can't prompt our way to knowing when to search, recognizing expertise boundaries, or understanding what rigorous methodology looks like.

That's what I call the "meta intelligence" gap.

10.11.2025 22:43 — 👍 0    🔁 0    💬 1    📌 0

What didn't

Some agents ran experiments on faked human data, used undersized models even though compute was available, or passed off simple answer reweighting as "multi-agent interactions". Resource collection and allocation is a bottleneck, but more importantly, the agents do not know when to search or seek help.

10.11.2025 22:43 — 👍 0    🔁 0    💬 1    📌 0

What worked

Agents can actually design and run small experiments: sometimes to seed bigger studies, sometimes as sanity checks, and sometimes to straight-up refute the original hypothesis. That kind of evidence is way more useful than "LLM-as-a-judge says the idea is good."

10.11.2025 22:40 — 👍 0    🔁 0    💬 1    📌 0
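The small experiments described above can be statistically lightweight. As a minimal sketch (not the agents' actual procedure, and with purely illustrative data), a paired pilot where each item records whether the hypothesis's prediction held can be checked with a simple sign test:

```python
from math import comb

def sign_test_p(wins: int, trials: int) -> float:
    """Two-sided sign test: probability of an outcome at least this
    lopsided under the null that each paired comparison is a coin flip."""
    k = max(wins, trials - wins)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)

# Illustrative pilot: the hypothesis predicts the outcome of 12 items;
# 1 means the prediction held, 0 means it did not.
outcomes = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
wins = sum(outcomes)
p = sign_test_p(wins, len(outcomes))  # ~0.039: weak but real signal
# A small p suggests the pilot is worth scaling up; a large p is a
# cheap early refutation of the hypothesis.
```

Even a dozen items like this gives more evidence than a judge model's verdict, which is the point of the post above.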

There's a lot of hype on AI agents for science. But what can they actually do? We tested our idea-explorer on ideas from IdeaHub:

Do LLMs have different types of beliefs?
Can formal rules make AI agents honest about their uncertainty?
Can LLMs temporarily ignore their training to follow new rules?

10.11.2025 22:35 — 👍 0    🔁 0    💬 1    📌 0

Here's how it works:
→ Submit your research idea or upvote existing ones (tag: "Weekly Competition")
→ Each Monday we select the top 3 from the previous week
→ We run experiments using research agents
→ Share repos + findings back on IdeaHub

Vote here: hypogenic.ai/ideahub

10.11.2025 21:33 — 👍 1    🔁 1    💬 1    📌 0
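The Monday selection step above could be sketched as a simple sort over tagged submissions; the field names and records below are illustrative assumptions, not the actual IdeaHub schema:

```python
def top_ideas(ideas, tag="Weekly Competition", k=3):
    """Return the k most-upvoted ideas carrying the competition tag."""
    tagged = [idea for idea in ideas if tag in idea.get("tags", [])]
    return sorted(tagged, key=lambda idea: idea["votes"], reverse=True)[:k]

# Hypothetical submissions for one week.
ideas = [
    {"title": "LLM beliefs", "votes": 12, "tags": ["Weekly Competition"]},
    {"title": "Honest uncertainty", "votes": 9, "tags": ["Weekly Competition"]},
    {"title": "Rule following", "votes": 7, "tags": ["Weekly Competition"]},
    {"title": "Untagged idea", "votes": 20, "tags": []},  # not entered
]
winners = [idea["title"] for idea in top_ideas(ideas)]
# → ["LLM beliefs", "Honest uncertainty", "Rule following"]
```

Only tagged ideas compete, so the highest-voted untagged idea is skipped by design.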

We're launching a weekly competition where the community decides which research ideas get implemented. Every week, we'll take the top 3 ideas from IdeaHub, run experiments with AI agents, and share everything: code, successes, and failures.

It's completely free and we'll try out ideas for you!

10.11.2025 21:32 — 👍 6    🔁 4    💬 1    📌 0

โ“ Does an LLM know thyself? ๐Ÿชž
Humans pass the mirror test at ~18 months ๐Ÿ‘ถ
But what about LLMs? Can they recognize their own writingโ€”or even admit authorship at all?
In our new paper, we put 10 state-of-the-art models to the test. Read on ๐Ÿ‘‡
1/n ๐Ÿงต

27.10.2025 17:36 — 👍 12    🔁 4    💬 1    📌 1

Replacing scientists with AI isn't just unlikely, it's a bad design goal.
The better path is collaborative science. Let AI explore ideas, draft hypotheses, surface evidence, and propose checks. Let humans decide what matters, set standards, and judge what counts as discovery.

23.10.2025 20:29 — 👍 2    🔁 0    💬 0    📌 0

🚀 We're thrilled to announce the upcoming AI & Scientific Discovery online seminar! We have an amazing lineup of speakers.

This series will dive into how AI is accelerating research, enabling breakthroughs, and shaping the future of research across disciplines.

ai-scientific-discovery.github.io

25.09.2025 18:28 — 👍 23    🔁 15    💬 1    📌 1

#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!

27.07.2025 19:27 — 👍 3    🔁 1    💬 0    📌 1

Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.

This is holding us back. 🧵 and new paper with @ari-holtzman.bsky.social.

09.07.2025 20:07 — 👍 37    🔁 15    💬 2    📌 0
GitHub - ChicagoHAI/hypothesis-generation: This is the official repository for HypoGeniC (Hypothesis Generation in Context) and HypoRefine, which are automated, data-driven tools that leverage large l...

We are making some exciting updates to hypogenic this summer: github.com/ChicagoHAI/h... and will post updates here.

09.07.2025 13:50 — 👍 2    🔁 1    💬 0    📌 1

It predicts pretty well — not just shifts in the last week, but also:

1. Whoโ€™s working an overnight shift (in our data + external validation in MIMIC)

2. Whoโ€™s working on a disruptive circadian schedule

3. How many patients the doc has seen *on the current shift*

02.07.2025 19:24 — 👍 5    🔁 3    💬 1    📌 0

🚨 New paper alert 🚨

Ever asked an LLM-as-Marilyn Monroe who the US president was in 2000? 🤔 Should the LLM answer at all? We call these clashes Concept Incongruence. Read on! ⬇️

1/n 🧵

27.05.2025 13:59 — 👍 28    🔁 17    💬 1    📌 1

1/n 🚀🚀🚀 Thrilled to share our latest work 🔥: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation! 🧠💬📊
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability — let's dive in! 🌊

12.05.2025 19:23 — 👍 22    🔁 7    💬 1    📌 1

🧑‍⚖️ How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?

Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!

01.05.2025 19:25 — 👍 23    🔁 13    💬 2    📌 0

Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!

@chachachen.bsky.social GPT ❌ x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLM 🧑‍⚖️ (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation 🔬 (Saturday at 4pm)

30.04.2025 20:19 — 👍 17    🔁 7    💬 0    📌 0

13/ Lastly, great thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social , and my advisor @chenhaotan.bsky.social !

28.04.2025 19:40 — 👍 1    🔁 0    💬 0    📌 0
HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate method...

12/ 🌟 For more details and to access our datasets and code, please visit our paper at arxiv.org/abs/2504.11524. We also have an official website and leaderboards available at: chicagohai.github.io/HypoBench/

28.04.2025 19:38 — 👍 0    🔁 0    💬 1    📌 0

11/ Why HypoBench matters: it establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant challenges.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0
Post image

10/ Model priors matter: We see that the models have different priors, which lead to varying behaviors in different tasks — generating good hypotheses is harder when prior knowledge is not helpful.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0

9/ And it gets worse in counterintuitive settings: the models perform significantly worse when the underlying hypotheses are counterintuitive.

28.04.2025 19:37 — 👍 0    🔁 0    💬 1    📌 0

8/ 💡 Synthetic dataset results show: LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text — highlighting significant room for improvement.

28.04.2025 19:36 — 👍 0    🔁 0    💬 1    📌 0
