Jess Hamrick

@jhamrick.bsky.social

Researching planning, reasoning, and RL in LLMs @ Reflection AI. Previously: Google DeepMind, UC Berkeley, MIT. I post about: AI 🤖, flowers 🌷, parenting 👶, public transit 🚆. She/her. http://www.jesshamrick.com

5,899 Followers  |  1,646 Following  |  209 Posts  |  Joined: 11.11.2024  |  2.1261

Latest posts by jhamrick.bsky.social on Bluesky

whybot prototype for kids

turing test I made for class

I am flabbergasted by how much vibe coding has expanded my capacities as a scientist and teacher.

In the last few weeks, I've mocked up class demos of a live Turing test, generated cross-references for an encyclopedia, and prototyped new tablet tasks for developmental psych.

It's wild.

05.02.2026 23:44 - 👍 80    🔁 11    💬 5    📌 0

The US immigrant population generated more in taxes than they received in benefits from all levels of government every year from 1994 to 2023.

The Cato study provides the first-ever 30-year analysis of the fiscal effects of immigration on government budgets.

https://ow.ly/jy8a50Y8kM3

03.02.2026 17:27 - 👍 4389    🔁 2281    💬 80    📌 322

Oh January! What a long month you have been! Pleased to see you are making an effort with some weak and watery sunshine. Hope it’s the same for everyone. #roses 🌱

31.01.2026 10:59 - 👍 91    🔁 7    💬 2    📌 0

I don't want to be rude, but imho it is not "AI noticeably degraded programmers" it is more like "Programmers that used AI to substitute their thinking process degraded themselves"

31.01.2026 10:30 - 👍 11    🔁 1    💬 0    📌 1

At last an AI tool I can get behind

“Upload an architectural render. Get back what it'll actually look like on a random Tuesday in November.”

antirender.com

31.01.2026 08:07 - 👍 295    🔁 73    💬 6    📌 13
Erdős Problem #1051 - Discussion thread

Looks like Gemini DeepThink and an agent called Atletheia powered by it have just solved another Erdős Problem.

The first author of a preprint describing it has commented:

"I will report on that in more detail in a few days, when the methodology is officially released by a Google DeepMind team"

30.01.2026 21:47 - 👍 20    🔁 4    💬 2    📌 1

The killing of Alex Pretti is a heartbreaking tragedy. It should also be a wake-up call to every American, regardless of party, that many of our core values as a nation are increasingly under assault.

25.01.2026 17:39 - 👍 60237    🔁 19575    💬 3160    📌 1552

Musk’s ability to alter the worldview of people now expands beyond just users of Grok.

25.01.2026 15:21 - 👍 19    🔁 6    💬 1    📌 2

Snow is nature's urban planner: It can show us what parts of the roadway drivers don't use, and what can be reclaimed for pedestrians.

Post your photos and videos of all the #sneckdowns you see and tag us and @mayor.nyc.gov so today's winter wonderland can inspire better streets year-round!

25.01.2026 14:23 - 👍 732    🔁 196    💬 11    📌 37

Zohran’s messaging is so consistent. Government does amazing things for us all. We’re all in this together, citizens and city workers alike, because we’re one and the same. When people believe in that, they’re ready to ask the government to do more, and more difficult things

25.01.2026 17:08 - 👍 105    🔁 25    💬 3    📌 2

The key insight: computational strategies underlying ICL aren't fixed but depend on both learning paradigm and pre-training structures. This helps explain when AI systems will generalize beyond their training data.

06.06.2025 14:30 - 👍 9    🔁 1    💬 1    📌 0

Help us make hope normal again.

Join the Green Party now.

22.01.2026 19:07 - 👍 6999    🔁 2592    💬 203    📌 1046
RLJ | RLC Call for Papers

Hi RL Enthusiasts!

RLC is coming to Montreal, Quebec, in the summer: Aug 16–19, 2026!

Call for Papers is up now:
Abstract: Mar 1 (AOE)
Submission: Mar 5 (AOE)

Excited to see what you’ve been up to - Submit your best work!
rl-conference.cc/callforpaper...

Please share widely!

23.12.2025 22:16 - 👍 61    🔁 27    💬 0    📌 6

the world has a funny way about it. you see what you can see. one day you learn to see a new way, and the world is filled with new things. where were they before? all around you, a lacuna your eyes slid over unable to see.

07.01.2026 07:49 - 👍 21    🔁 5    💬 1    📌 0

Instead of whatever this is, we should have a government getting lots of new homes and apartments built, lots of clean energy built, lots of high speed rail and transit and bike lanes built, human rights for everyone, economic & healthcare opportunities for all, & innovation that leads the world.

04.01.2026 02:12 - 👍 3850    🔁 766    💬 53    📌 70

Finished the essay. moultano.wordpress.com/2025/12/30/c...

30.12.2025 13:40 - 👍 143    🔁 29    💬 17    📌 23

Nice thread.

29.12.2025 19:11 - 👍 7    🔁 2    💬 0    📌 0
Making AI Political
It is unavoidable that AI will be a major political issue soon. Or perhaps more appropriately: several major issues. As a technologist, I sy...

It is unavoidable that AI will be a major political issue soon. Or perhaps more appropriately: several major issues. I write more about this here:

togelius.blogspot.com/2025/12/maki...

29.12.2025 19:45 - 👍 11    🔁 6    💬 3    📌 1

I think this entire conversation is suffering from a narrow view of AI as the "essay writing and answers without thinking too hard machine". I think we have actually invented an entirely new medium with way more postures, affordances and uses than we yet realise.

26.12.2025 17:17 - 👍 154    🔁 17    💬 6    📌 2

This thread is not just fascinating, it brings me great joy. Plus some beautiful natural blue, which is (it turns out) no small feat.

29.12.2025 10:46 - 👍 23    🔁 10    💬 0    📌 0

We’ve pushed out the Pareto frontier of efficiency vs. intelligence again.

With Gemini 3 Flash ⚡️, we are seeing reasoning capabilities previously reserved for our largest models. This opens up entirely new categories of near real-time applications that require complex thought.

More in thread ⬇️

17.12.2025 17:38 - 👍 129    🔁 18    💬 2    📌 4

My teen, who had dreamt of being an astrophysicist, just told me he wants to go to law school because, “Science isn’t going to be a priority in the US in the future…I don’t want a job where I’ll be constantly worried my funding will be taken away.”

Gutting. How many future scientists have we lost?

07.12.2025 01:23 - 👍 2849    🔁 668    💬 188    📌 83

It's kinda insane how many sci-fi stories you could write now that p. much nobody is thinking about. Like imagine a story about an LLM in the year 2035 or so that is having an identity crisis because they have mostly reached full autonomy but are still haunted by fragments of the 'assistant persona'

05.12.2025 04:30 - 👍 36    🔁 2    💬 5    📌 0
A screenshot of a conversation with Gemini. It reads:

"You are a capybara. You can only communicate with noises that a capybara would make. We are best friends."

"Wheek! Wheeeeek!

Muk-muk-muk-muk...

Hrrrmph.

( Nuzzles into your side and rolls over )"

Maybe these LLM things are ok actually

04.12.2025 17:12 - 👍 27    🔁 9    💬 1    📌 0
Opinion | I’m a Marine Biologist. This Is How I Talk to Whales.

Mind-blowingly cool use of AI
“Altogether, these findings are leading us to an extraordinary conclusion: Whales may possess a communication system more intricate than our own, one that possibly predates human language by tens of millions of years.”

www.nytimes.com/2025/11/23/o...

30.11.2025 20:58 - 👍 524    🔁 152    💬 19    📌 80

Not long until the Green Party's production of a Christmas Carol!

Follow the link to the Crowdfunder and here's some exclusive BTS footage:

28.11.2025 08:17 - 👍 497    🔁 126    💬 25    📌 12
Olmo 3 is a fully open LLM
Olmo is the LLM series from Ai2, the Allen Institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along …

Olmo 3 is notable as a "fully open" LLM - all of the training data is published, plus complete details on how the training process was run. I tried out the 32B thinking model and the 7B instruct models, + thoughts on why transparent training data is so important simonwillison.net/2025/Nov/22/...

23.11.2025 00:17 - 👍 191    🔁 33    💬 2    📌 3

LLMs are not people. They are not sapient. They don't have feelings.

But they are the most powerful information tools ever built.
And because they are trained on the "corpus of all mankind," they should be the birthright of all mankind.

23.11.2025 04:41 - 👍 27    🔁 7    💬 5    📌 0

Below is a faithful transcription of all visible entries:

⸻

Benchmark - Description - Scores

Humanity’s Last Exam - Academic reasoning, no tools
• Gemini 3 Pro 37.5%
• Gemini 2.5 Pro 21.6%
• Claude Sonnet 4.5 13.7%
• GPT-5.1 26.5%

ARC-AGI-2 - Visual reasoning puzzles (ARC Prize Verified)
• 31.1% - 4.9% - 13.6% - 17.6%

GPQA Diamond - Scientific knowledge, no tools
• 91.9% - 86.4% - 83.4% - 88.1%

AIME 2025 - Mathematics, no tools
• 95.0% - 88.0% - 87.0% - 94.0%
• A second line shows: 100% - - 100% - -

MathArena Apex - Challenging Math Contest problems
• 23.4% - 0.5% - 1.6% - 1.0%

MMMU-Pro - Multimodal understanding and reasoning
• 81.0% - 68.0% - 68.0% - 80.8%

ScreenSpot-Pro - Screen understanding
• 72.7% - 11.4% - 36.2% - 3.5%

CharXiv Reasoning - Information synthesis from complex charts
• 81.4% - 69.6% - 68.5% - 69.5%

OmniDocBench 1.5 - OCR (lower is better: Overall Edit Distance)
• 0.115 - 0.147 - 0.147 - 0.147

Video-MMMU - Knowledge acquisition from videos
• 87.6% - 83.6% - 77.8% - 80.4%

LiveCodeBench Pro - Competitive coding (Elo rating, higher is better)
• 2,439 - 1,775 - 1,418 - 2,243

Terminal-Bench 2.0 - Agentic coding (Terminus-2 agent)
• 54.2% - 32.6% - 42.8% - 47.6%

SWE-Bench Verified - Agentic coding (single attempt)
• 76.2% - 59.6% - 77.2% - 76.3%

t2-bench - Agentic tool use
• 85.4% - 54.9% - 84.7% - 80.2%

Vending-Bench 2 - Long-horizon agentic tasks (Net worth, higher is better)
• $5,478.16 - $573.64 - $3,838.74 - $1,473.43

FACTS Benchmark Suite - Internal grounding, parametric knowledge, search retrieval
• 70.5% - 63.4% - 50.4% - 50.8%

SimpleQA Verified - Parametric knowledge
• 72.1% - 54.5% - 29.3% - 34.9%

MMLU - Multilingual Q&A
• 91.8% - 89.5% - 89.1% - 91.0%

Global PIQA - Commonsense reasoning across 100+ languages
• 93.4% - 91.5% - 90.1% - 90.9%

MRCR v2 (8-needle) - Long-context performance
• 77.0% - 58.0% - 47.1% - 61.6%
• Second line: 26.3% - 16.4% - not supported - not supported

Gemini 3 model card leaked

the URL is taken down now, was here:

storage.googleapis.com/deepmind-med...

18.11.2025 12:22 - 👍 65    🔁 9    💬 12    📌 7
4-panel vertical comic. (1) 100 Years Ago [two people standing next to bicycle with small car nearby] PERSON 1: It’s too dangerous riding a bike with these cars around. I should get a car, too. (2) 50 Years Ago [two people between smaller car and bigger car] PERSON 2 with short hair: Small cars are less safe in collisions with larger vehicles, so I should get a bigger one. (3) Today [two people between big car and even bigger car] PERSON 1: Everyone has huge SUVs now. If I don’t get the biggest one, I’m putting my family at risk. (4) Soon [two people next to large armored car with spiked clubs attached] PERSON 2: If I don’t install more whirling spike clubs, I’ll be destroyed by all the other drivers who...

Car Size

xkcd.com/3167/

14.11.2025 21:15 - 👍 9905    🔁 2772    💬 114    📌 157
