Ken Liu

@kzliu.bsky.social

CS PhD @ Stanford AI Lab, Stanford NLP. Prev Google DeepMind. https://ai.stanford.edu/~kzliu

455 Followers  |  64 Following  |  14 Posts  |  Joined: 16.12.2023  |  1.6594

Latest posts by kzliu.bsky.social on Bluesky

... and awesome collaborators & advisors!!
Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, @huaxiuyaoml.bsky.social, Linjun Zhang, Andrew Ng, @jameszou.bsky.social, @sanmikoyejo.bsky.social, @yejinchoinka.bsky.social, Percy Liang, @stanfordnlp.bsky.social, @stanfordhai.bsky.social

26.08.2025 17:50  |  👍 2    🔁 0    💬 0    📌 0
Unsolved Questions (UQ) Project: An open platform for evaluating AI models on real-world, unsolved questions

9/
UQ is an exploratory effort at creating a new paradigm for AI evals:
🌐 Platform: uq.stanford.edu
📄 Paper: arxiv.org/abs/2508.17580
💻 Code: github.com/uq-project/UQ
🤗 Data: huggingface.co/datasets/uq-...

Thanks to my wonderful project co-leads Fan Nie (applying for PhD!) and Niklas Muennighoff!!

26.08.2025 17:50  |  👍 1    🔁 0    💬 1    📌 0

8/
*UQ-Platform* (uq.stanford.edu) then continues where UQ-Validators leave off. It hosts the UQ-Dataset with AI answers and UQ-validation results, and experts can then rate AI answers, comment, and otherwise help resolve open questions -- just like Stack Exchange :). We need YOU to write reviews!

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

7/
*UQ-Validators* are simply LLMs (and compound LLM scaffolds) trying to pre-screen candidate answers to unsolved questions *without ground-truth answers*.

The key intuition is that it may be easier for LLMs to *validate* answers to hard questions (e.g. spotting mistakes) than to *generate* them.

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

6/
In contrast, we aim for UQ-Dataset to be difficult and realistic *by construction*: unsolved questions are often hard and arise naturally when humans seek answers, so progress on them yields real-world value.

In exchange, we have to figure out how to evaluate models without answers...

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

5/
UQ started with the observation that benchmark saturation has led to a *difficulty-realism tension*:

1. We contrive harder exams that begin to lose touch with real-world model usage
2. We build realistic evals (e.g. using human preferences) that become easy and/or hackable

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

4/
Here are some sample questions from the UQ-Dataset, which spans math, physics, CS theory, history, puzzles, sci-fi, and more; see uq.stanford.edu for the full list!

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

3/
Our main idea: rather than having static benchmarks scored once, can we evaluate LLMs *continuously and asynchronously* on real-world Qs with an actual need?

UQ-Dataset provides inputs → UQ-Validators screen outputs → UQ-Platform hosts live verification and model ranking.
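
A rough sketch of that three-stage flow, for illustration only. Every name here (`Candidate`, `screen`, `rank_models`, the toy validators) is hypothetical and not taken from the UQ codebase; the rule that an answer must pass every validator and the ranking by count of human-verified answers are assumptions, and the real validators are LLM critics rather than string checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: (question, answer) -> does the answer pass screening?
Validator = Callable[[str, str], bool]


@dataclass
class Candidate:
    question: str
    model: str
    answer: str
    # Human verdicts arrive later and asynchronously (assumed platform step).
    human_verdicts: List[bool] = field(default_factory=list)


def screen(candidates: List[Candidate], validators: List[Validator]) -> List[Candidate]:
    """Keep answers that every validator accepts; no ground truth is consulted."""
    return [
        c for c in candidates
        if all(v(c.question, c.answer) for v in validators)
    ]


def rank_models(candidates: List[Candidate]) -> Dict[str, int]:
    """Count, per model, the answers that at least one reviewer verified."""
    scores: Dict[str, int] = {}
    for c in candidates:
        scores.setdefault(c.model, 0)
        if any(c.human_verdicts):
            scores[c.model] += 1
    # Highest count first; ties keep first-seen order.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Toy stand-ins for LLM critics; a real validator would call a model.
def non_empty(question: str, answer: str) -> bool:
    return bool(answer.strip())


def not_a_refusal(question: str, answer: str) -> bool:
    return "i don't know" not in answer.lower()
```

Usage would look like `kept = screen(candidates, [non_empty, not_a_refusal])`, with reviewers later appending to `human_verdicts` on the kept candidates before `rank_models` is called.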

26.08.2025 17:50  |  👍 0    🔁 0    💬 1    📌 0

2/
The UQ project has 3 parts:
1. UQ-Dataset: 500 hard, popular, old, yet unanswered questions from the Stack Exchange network
2. UQ-Validators: LLM critics to pre-screen model answers
3. UQ-Platform (uq.stanford.edu): community verification (think AI-native Stack Exchange!)

26.08.2025 17:50  |  👍 1    🔁 0    💬 1    📌 0

New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions.

Instead of artificially difficult exams where progress ≠ value, we assess LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:

26.08.2025 17:50  |  👍 6    🔁 1    💬 2    📌 0

🙋🏻‍♂️

24.11.2024 05:17  |  👍 0    🔁 0    💬 0    📌 0
Stanford NLP PhDs: Join the conversation

go.bsky.app/AKGJ82V

22.11.2024 00:34  |  👍 1    🔁 0    💬 0    📌 0

hi

21.11.2024 20:01  |  👍 3    🔁 0    💬 2    📌 0

🙋‍♂️

21.11.2024 19:48  |  👍 1    🔁 0    💬 0    📌 0
