... and awesome collaborators & advisors!!
Zihao Wang, Rui Sun, Wei Liu, Weijia Shi, @huaxiuyaoml.bsky.social, Linjun Zhang, Andrew Ng, @jameszou.bsky.social, @sanmikoyejo.bsky.social, @yejinchoinka.bsky.social, Percy Liang, @stanfordnlp.bsky.social, @stanfordhai.bsky.social
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
            
    
    
            
                        
                Unsolved Questions (UQ) Project
                An open platform for evaluating AI models on real-world, unsolved questions
            
        
    
    
            9/
UQ is an exploratory effort at creating a new paradigm for AI evals:
Platform: uq.stanford.edu
Paper: arxiv.org/abs/2508.17580
Code: github.com/uq-project/UQ
Data: huggingface.co/datasets/uq-...
Thanks to my wonderful project co-leads Fan Nie (applying for PhD!) and Niklas Muennighoff!!
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            8/
*UQ-Platform* (uq.stanford.edu) then continues where UQ-Validators leave off. It hosts the UQ-Dataset with AI answers and UQ-validation results, and experts can then rate AI answers, comment, and otherwise help resolve open questions -- just like Stack Exchange :). We need YOU to write reviews!
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            7/
*UQ-Validators* are simply LLMs (and compound LLM scaffolds) trying to pre-screen candidate answers to unsolved questions *without ground-truth answers*.
The key intuition is that it may be easier for LLMs to *validate* answers to hard questions (e.g. spotting mistakes) than to *generate* them.
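
A minimal sketch of the pre-screening idea, not the project's actual validators: several independent checks vote on a candidate answer, and the answer only moves on to human review if enough of them find no flaw. The `prescreen` function, the `Validator` type, and the three toy checks are all hypothetical stand-ins; in practice each check would presumably wrap an LLM call.

```python
from typing import Callable, List

# Hypothetical type: a validator takes (question, answer) and returns True
# when it finds no flaw. A real one would wrap an LLM critic.
Validator = Callable[[str, str], bool]


def prescreen(question: str, answer: str,
              validators: List[Validator], min_pass: int) -> bool:
    """Return True if at least `min_pass` validators accept the answer.

    No ground-truth answer is used anywhere: each validator only looks
    at the question and the candidate answer.
    """
    votes = sum(1 for check in validators if check(question, answer))
    return votes >= min_pass


# Toy stand-ins for LLM critics (assumptions for illustration only).
def non_empty(question: str, answer: str) -> bool:
    return bool(answer.strip())


def no_refusal(question: str, answer: str) -> bool:
    return "i don't know" not in answer.lower()


def on_topic(question: str, answer: str) -> bool:
    # Accept if the answer reuses at least one longer word from the question.
    words = [w for w in question.lower().split() if len(w) > 3]
    return any(w in answer.lower() for w in words)


VALIDATORS = [non_empty, no_refusal, on_topic]
```

Requiring all validators to agree (`min_pass=len(VALIDATORS)`) trades recall for precision, which seems a reasonable default when the goal is to avoid wasting expert reviewers' time.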
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            6/
In contrast, we aim for UQ-Dataset to be difficult and realistic *by construction*: unsolved questions are often hard and naturally arise when humans seek answers, so progress on them yields real-world value.
In exchange, we have to figure out how to evaluate models without answers...
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            5/
UQ started with the observation that benchmark saturation has led to a *difficulty-realism tension*:
1. We contrive harder exams that begin to lose touch with real-world model usage
2. We build realistic evals (e.g. using human preferences) that become easy and/or hackable
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                         
                                                
    
    
    
    
            4/
Here are some sample questions in the UQ-Dataset, which spans math, physics, CS theory, history, puzzles, sci-fi, and more; see uq.stanford.edu for the full list!
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            3/
Our main idea: rather than having static benchmarks scored once, can we evaluate LLMs *continuously and asynchronously* on real-world Qs with an actual need?
UQ-Dataset provides inputs → UQ-Validators screen outputs → UQ-Platform hosts live verification and model ranking.
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
            
    
    
            
                        
            
        
    
    
            2/
The UQ project has 3 parts:
1. UQ-Dataset: 500 hard, popular, old, yet unanswered questions from the Stack Exchange network
2. UQ-Validators: LLM critics to pre-screen model answers
3. UQ-Platform (uq.stanford.edu): community verification (think AI-native Stack Exchange!)
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
            New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions.
Instead of artificially difficult exams where progress ≠ value, we assess LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
               
            
            
                26.08.2025 17:50
            
         
            
        
            
            
            
            
            
    
    
    
    
            ππ»ββοΈ
               
            
            
                24.11.2024 05:17
            
         
            
        
            
        
            
            
            
            
            
    
    
    
    
            hi
               
            
            
                21.11.2024 20:01
            
         
            
        
            
            
            
            
            
    
    
    
    
            πββοΈ
               
            
            
                21.11.2024 19:48
            
         
    
         
        
            
        
                            
                    
                    
                                            Security and Privacy of Machine Learning at UofT, Vector Institute, and Google 🇨🇦🇫🇷🇪🇺 Co-Director of Canadian AI Safety Institute (CAISI) Research Program at CIFAR. Opinions mine
                                     
                            
                    
                    
                                    
                            
                    
                    
                                            @PyTorch "My learning style is Horace twitter threads" - 
@typedfemale
                                     
                            
                    
                    
                                            Data Quality x Privacy 
PhD student @ CMU with Zico Kolter and Zack Lipton | Founding Member @datologyai.com | Prev. Comp Sc @iitdelhi
http://pratyushmaini.github.io/
                                     
                            
                    
                    
                                            PhD student @ MIT | Previously PYI @ AI2 | MS'21 BS'19 BA'19 @ UW | zhaofengwu.github.io
                                     
                            
                    
                    
                                            Associate professor at CMU, studying natural language processing and machine learning. Co-founder All Hands AI
                                     
                            
                    
                    
                                    
                            
                    
                    
                                            AI for Science
AI for Social Good
                                     
                            
                    
                    
                                            Stanford Professor of Linguistics and, by courtesy, of Computer Science, and member of @stanfordnlp.bsky.social and The Stanford AI Lab. He/Him/His. https://web.stanford.edu/~cgpotts/
                                     
                            
                    
                    
                                            PhD Student @StanfordAILab @stanfordnlp.bsky.social, Previously SR @GoogleDeepMind.bsky.social, Undergraduate @Berkeley_AI
                                     
                            
                    
                    
                                            Computer science professor at Carnegie Mellon. Researcher in machine learning. Algorithmic foundations of responsible AI (e.g., privacy, uncertainty quantification), interactive learning (e.g., RLHF).
https://zstevenwu.com/
                                     
                            
                    
                    
                                            phding @stanfordnlp.bsky.social
harshitjoshi.com
Creating useful intelligent systems for knowledge navigation
                                     
                            
                    
                    
                                            AI @ OpenAI, Tesla, Stanford
                                     
                            
                    
                    
                                            Stanford CS PhD working on RL and LLMs with Emma Brunskill and Chris Piech. Co-creator of Trace. Prev @GoogleDeepMind @MicrosoftResearch
Specifically
- Offline RL
- In-context RL
- Causality
https://anie.me/about
Unverified hot takes go to this account
                                     
                            
                    
                    
                                            PhD student at Stanford NLP. Working on Social NLP and CSS. Previously at GaTech, Meta AI, Emory.
Palo Alto, CA
calebziems.com
                                     
                            
                    
                    
                                            Stanford Linguistics and Computer Science. Director, Stanford AI Lab. Founder of @stanfordnlp.bsky.social . #NLP https://nlp.stanford.edu/~manning/
                                     
                            
                    
                    
                                            Professor of Natural and Artificial Intelligence @Stanford. Safety and alignment @GoogleDeepMind.
                                     
                            
                    
                    
                                            PhD candidate @ Stanford NLP
https://myracheng.github.io/