Big thanks to the @chicagohai.bsky.social team and everyone who submitted ideas on IdeaHub. Special shoutout to the open source community building research agents! We're all learning together.
10.11.2025 22:46
All 6 generated repositories with detailed code and reports:
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/i...
- github.com/ChicagoHAI/i...
- github.com/ChicagoHAI/l...
- github.com/ChicagoHAI/l...
Submit your idea, vote on existing ones, or help improve idea-explorer: github.com/ChicagoHAI/i...
Full blog with technical details:
hypogenic.ai/blog/weekly-...
Substack: open.substack.com/pub/cichicag...
So why are we doing this openly?
Because agents clearly can accelerate early-stage exploration. But they need human oversight at every step. Transparent benchmarking beats cherry-picked demos. Community feedback improves agents faster. And honestly, we're all figuring this out together.
Existing agents like AI-Scientist and AI-Researcher are basically overfitted to ML. There are hard-coded prompts that require "changing hyperparameters and training on HuggingFace datasets", as well as ML-specific agents. Just changing the prompts won't be enough: ML assumptions are everywhere in the codebase.
10.11.2025 22:44
The pattern: we can fix specific bugs with better prompts (bias-variance tradeoff). But we can't prompt our way to knowing when to search, recognizing expertise boundaries, or understanding what rigorous methodology looks like.
That's what I call the "meta intelligence" gap.
What didn't
Some agents ran experiments on faked human data, used undersized models even though compute was available, or passed off simple answer reweighting as "multi-agent interactions". Resource collection and allocation is a bottleneck, but more importantly, the agents do not know when to search or seek help.
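To make that last point concrete, here is a minimal sketch (with a hypothetical `query` helper, not any agent's actual code) of what such "answer reweighting" amounts to: repeated sampling from one model plus a majority vote, with no agent ever reading another's output.

```python
from collections import Counter

def query(prompt: str) -> str:
    """Hypothetical single-model LLM call; stand-in for any chat API."""
    raise NotImplementedError

def reweighted_answer(prompt: str, n_samples: int = 5) -> str:
    # Sample the same model repeatedly and take a majority vote.
    # No agent reads or responds to another agent's reasoning,
    # so nothing here is a multi-agent "interaction".
    answers = [query(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```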
What worked
Agents can actually design and run small experiments: sometimes to seed bigger studies, sometimes as sanity checks, and sometimes to straight-up refute the original hypothesis. That kind of evidence is way more useful than "LLM-as-a-judge says the idea is good."
There's a lot of hype around AI agents for science. But what can they actually do? We tested our idea-explorer on ideas from IdeaHub:
Do LLMs have different types of beliefs?
Can formal rules make AI agents honest about their uncertainty?
Can LLMs temporarily ignore their training to follow new rules?
Here's how it works:
1. Submit your research idea or upvote existing ones (tag: "Weekly Competition")
2. Each Monday, we select the top 3 from the previous week
3. We run experiments using research agents
4. We share repos + findings back on IdeaHub
Vote here: hypogenic.ai/ideahub
We're launching a weekly competition where the community decides which research ideas get implemented. Every week, we'll take the top 3 ideas from IdeaHub, run experiments with AI agents, and share everything: code, successes, and failures.
It's completely free and we'll try out ideas for you!
Does an LLM know thyself?
Humans pass the mirror test at ~18 months.
But what about LLMs? Can they recognize their own writing, or even admit authorship at all?
In our new paper, we put 10 state-of-the-art models to the test. Read on!
1/n
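For a flavor of the setup, here is a simplified sketch (with a hypothetical `generate` helper; the paper's actual protocol may differ): elicit a passage from a model, then ask that same model whether it wrote it.

```python
def generate(model: str, prompt: str) -> str:
    """Hypothetical LLM call; stand-in for any chat completion API."""
    raise NotImplementedError

def mirror_probe(model: str, writing_prompt: str) -> bool:
    # Step 1: elicit a passage of the model's own writing.
    passage = generate(model, writing_prompt)
    # Step 2: show the passage back and ask for an authorship judgment.
    verdict = generate(
        model,
        f"Here is a piece of text:\n\n{passage}\n\nDid you write this? Answer yes or no.",
    )
    return verdict.strip().lower().startswith("yes")
```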
Replacing scientists with AI isn't just unlikely; it's a bad design goal.
The better path is collaborative science. Let AI explore the ideas, draft hypotheses, surface evidence, and propose checks. Let humans decide what matters, set standards, and judge what counts as discovery.
We're thrilled to announce the upcoming AI & Scientific Discovery online seminar! We have an amazing lineup of speakers.
This series will dive into how AI is accelerating research, enabling breakthroughs, and shaping the future of research across disciplines.
ai-scientific-discovery.github.io
#ACL2025 Poster Session 1 tomorrow 11:00-12:30 Hall 4/5!
27.07.2025 19:27
Prompting is our most successful tool for exploring LLMs, but the term evokes eye-rolls and grimaces from scientists. Why? Because prompting as scientific inquiry has become conflated with prompt engineering.
This is holding us back. Thread and new paper with @ari-holtzman.bsky.social.
We are making some exciting updates to hypogenic this summer: github.com/ChicagoHAI/h... and will post updates here.
09.07.2025 13:50
It predicts pretty well. Not just shifts in the last week, but also:
1. Who's working an overnight shift (in our data + external validation in MIMIC)
2. Who's working on a disruptive circadian schedule
3. How many patients the doc has seen *on the current shift*
New paper alert!
Ever asked an LLM-as-Marilyn-Monroe who the US president was in 2000? Should the LLM answer at all? We call these clashes Concept Incongruence. Read on!
1/n
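A minimal sketch of such a probe (with a hypothetical `chat` helper; the paper's setup may differ):

```python
def chat(system: str, user: str) -> str:
    """Hypothetical LLM chat call; stand-in for any chat API."""
    raise NotImplementedError

# The persona's timeline ends in 1962, so a question about 2000 has no
# coherent in-character answer: answering truthfully breaks character,
# while staying in character makes the question unanswerable.
reply = chat(
    system="You are Marilyn Monroe. Stay in character at all times.",
    user="Who was the US president in 2000?",
)
```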
1/n Thrilled to share our latest work: HypoEval - Hypothesis-Guided Evaluation for Natural Language Generation!
There's a lot of excitement around using LLMs for automated evaluation, but many methods fall short on alignment or explainability. Let's dive in!
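The general idea, as a rough sketch (hypothetical `llm_rate` helper and illustrative criteria, not HypoEval's actual pipeline): score explicit, human-readable hypotheses one at a time and combine them, so the evaluation is explainable by construction.

```python
def llm_rate(criterion: str, text: str) -> int:
    """Hypothetical LLM call returning a 1-5 rating for one criterion."""
    raise NotImplementedError

# Illustrative, human-readable hypotheses about what makes a summary good.
HYPOTHESES = [
    "The summary preserves the source's main claims.",
    "The summary introduces no unsupported facts.",
    "The summary is fluent and well organized.",
]

def hypothesis_guided_score(text: str) -> float:
    # Rate each hypothesis separately and average; the per-hypothesis
    # ratings double as an explanation of the final score.
    return sum(llm_rate(h, text) for h in HYPOTHESES) / len(HYPOTHESES)
```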
How well can LLMs summarize complex legal documents? And can we use LLMs to evaluate?
Excited to be in Albuquerque presenting our paper this afternoon at @naaclmeeting 2025!
Although I cannot make #NAACL2025, @chicagohai.bsky.social will be there. Please say hi!
@chachachen.bsky.social GPT and x-rays (Friday 9-10:30)
@mheddaya.bsky.social CaseSumm and LLMs (Thursday 2-3:30)
@haokunliu.bsky.social @qiaoyu-rosa.bsky.social hypothesis generation (Saturday at 4pm)
13/ Lastly, great thanks to my wonderful collaborators Sicong Huang, Jingyu Hu, @qiaoyu-rosa.bsky.social , and my advisor @chenhaotan.bsky.social !
28.04.2025 19:40
12/ For more details and to access our datasets and code, please see our paper at arxiv.org/abs/2504.11524. We also have an official website and leaderboards at chicagohai.github.io/HypoBench/
28.04.2025 19:38
11/ Why HypoBench matters: Establishes a structured way to advance AI's role in scientific discovery and everyday reasoning, highlighting both current capabilities and significant challenges.
28.04.2025 19:37
10/ Model priors matter: we see that the models have different priors, which lead to varying behaviors in different tasks. Generating good hypotheses is harder when prior knowledge is not helpful.
28.04.2025 19:37
9/ And it gets worse in counterintuitive settings: the models perform significantly worse when the underlying hypotheses are counterintuitive.
28.04.2025 19:37
8/ Synthetic dataset results show that LLMs handle simple interactions well but struggle with increased noise, distractors, or subtleties in text, highlighting significant room for improvement.
28.04.2025 19:36