
Jess Hamrick

@jhamrick.bsky.social

Researching planning, reasoning, and RL in LLMs @ Reflection AI. Previously: Google DeepMind, UC Berkeley, MIT. I post about: AI 🤖, flowers 🌷, parenting 👶, public transit 🚆. She/her. http://www.jesshamrick.com

5,825 Followers  |  1,636 Following  |  209 Posts  |  Joined: 11.11.2024  |  2.0234

Latest posts by jhamrick.bsky.social on Bluesky

My teen, who had dreamt of being an astrophysicist, just told me he wants to go to law school because, “Science isn’t going to be a priority in the US in the future…I don’t want a job where I’ll be constantly worried my funding will be taken away.”

Gutting. How many future scientists have we lost?

07.12.2025 01:23 - 👍 2825    🔁 671    💬 193    📌 84

It's kinda insane how many sci-fi stories you could write now that p. much nobody is thinking about. Like imagine a story about an LLM in the year 2035 or so that is having an identity crisis because they have mostly reached full autonomy but are still haunted by fragments of the 'assistant persona'

05.12.2025 04:30 - 👍 35    🔁 2    💬 5    📌 0
A screenshot of a conversation with Gemini. It reads:

"You are a capybara. You can only communicate with noises that a capybara would make. We are best friends."

"Wheek! Wheeeeek!

Muk-muk-muk-muk...

Hrrrmph.

( Nuzzles into your side and rolls over )"


Maybe these LLM things are ok actually

04.12.2025 17:12 - 👍 26    🔁 10    💬 1    📌 0
Opinion | I’m a Marine Biologist. This Is How I Talk to Whales.

Mind-blowingly cool use of AI
“Altogether, these findings are leading us to an extraordinary conclusion: Whales may possess a communication system more intricate than our own, one that possibly predates human language by tens of millions of years.”

www.nytimes.com/2025/11/23/o...

30.11.2025 20:58 - 👍 523    🔁 152    💬 19    📌 81

Not long until the Green Party's production of a Christmas Carol!

Follow the link to the Crowdfunder and here's some exclusive BTS footage:

28.11.2025 08:17 - 👍 500    🔁 126    💬 25    📌 12
Olmo 3 is a fully open LLM
Olmo is the LLM series from Ai2, the Allen Institute for AI. Unlike most open weight models these are notable for including the full training data, training process and checkpoints along …

Olmo 3 is notable as a "fully open" LLM - all of the training data is published, plus complete details on how the training process was run. I tried out the 32B thinking model and the 7B instruct models, + thoughts on why transparent training data is so important simonwillison.net/2025/Nov/22/...

23.11.2025 00:17 - 👍 192    🔁 33    💬 2    📌 3

LLMs are not people. They are not sapient. They don't have feelings.

But they are the most powerful information tools ever built.
And because they are trained on the "corpus of all mankind," they should be the birthright of all mankind.

23.11.2025 04:41 - 👍 27    🔁 7    💬 5    📌 0

Below is a faithful transcription of all visible entries:

Benchmark | Description | Gemini 3 Pro | Gemini 2.5 Pro | Claude Sonnet 4.5 | GPT-5.1
Humanity’s Last Exam | Academic reasoning, no tools | 37.5% | 21.6% | 13.7% | 26.5%
ARC-AGI-2 | Visual reasoning puzzles (ARC Prize Verified) | 31.1% | 4.9% | 13.6% | 17.6%
GPQA Diamond | Scientific knowledge, no tools | 91.9% | 86.4% | 83.4% | 88.1%
AIME 2025 | Mathematics, no tools | 95.0% | 88.0% | 87.0% | 94.0%
AIME 2025 (second line) | | 100% | - | 100% | -
MathArena Apex | Challenging Math Contest problems | 23.4% | 0.5% | 1.6% | 1.0%
MMMU-Pro | Multimodal understanding and reasoning | 81.0% | 68.0% | 68.0% | 80.8%
ScreenSpot-Pro | Screen understanding | 72.7% | 11.4% | 36.2% | 3.5%
CharXiv Reasoning | Information synthesis from complex charts | 81.4% | 69.6% | 68.5% | 69.5%
OmniDocBench 1.5 | OCR (lower is better: Overall Edit Distance) | 0.115 | 0.147 | 0.147 | 0.147
Video-MMMU | Knowledge acquisition from videos | 87.6% | 83.6% | 77.8% | 80.4%
LiveCodeBench Pro | Competitive coding (Elo rating, higher is better) | 2,439 | 1,775 | 1,418 | 2,243
Terminal-Bench 2.0 | Agentic coding (Terminus-2 agent) | 54.2% | 32.6% | 42.8% | 47.6%
SWE-Bench Verified | Agentic coding (single attempt) | 76.2% | 59.6% | 77.2% | 76.3%
τ2-bench | Agentic tool use | 85.4% | 54.9% | 84.7% | 80.2%
Vending-Bench 2 | Long-horizon agentic tasks (Net worth, higher is better) | $5,478.16 | $573.64 | $3,838.74 | $1,473.43
FACTS Benchmark Suite | Internal grounding, parametric knowledge, search retrieval | 70.5% | 63.4% | 50.4% | 50.8%
SimpleQA Verified | Parametric knowledge | 72.1% | 54.5% | 29.3% | 34.9%
MMMLU | Multilingual Q&A | 91.8% | 89.5% | 89.1% | 91.0%
Global PIQA | Commonsense reasoning across 100+ languages | 93.4% | 91.5% | 90.1% | 90.9%
MRCR v2 (8-needle) | Long-context performance | 77.0% | 58.0% | 47.1% | 61.6%
MRCR v2 (8-needle) (second line) | | 26.3% | 16.4% | not supported | not supported


Gemini 3 model card leaked

the URL is taken down now, was here:

storage.googleapis.com/deepmind-med...

18.11.2025 12:22 - 👍 65    🔁 9    💬 12    📌 7
4-panel vertical comic. (1) 100 Years Ago [two people standing next to bicycle with small car nearby] PERSON 1: It’s too dangerous riding a bike with these cars around. I should get a car, too. (2) 50 Years Ago [two people between smaller car and bigger car] PERSON 2 with short hair: Small cars are less safe in collisions with larger vehicles, so I should get a bigger one. (3) Today [two people between big car and even bigger car] PERSON 1: Everyone has huge SUVs now. If I don’t get the biggest one, I’m putting my family at risk. (4) Soon [two people next to large armored car with spiked clubs attached] PERSON 2: If I don’t install more whirling spike clubs, I’ll be destroyed by all the other drivers who...


Car Size

xkcd.com/3167/

14.11.2025 21:15 - 👍 9796    🔁 2749    💬 117    📌 154

We’re often asked whether we’re optimistic or pessimistic about technologies. That’s the wrong question. If any of this matters, we need to stop seeing technology like the weather, to be merely forecasted, and instead see it like politics, to be collectively shaped.

16.11.2025 11:22 - 👍 67    🔁 27    💬 0    📌 1

Nightmarish idea for a startup tbh

13.11.2025 21:35 - 👍 1339    🔁 163    💬 231    📌 983
Meet Denario: An AI Assistant for Every Step of the Scientific Process
For more information, please contact press@simonsfoundation.org.

Meet Denario, a new AI tool developed by @flatironinstitute.org, the University of Cambridge, and @uab.cat that leverages large language models to help scientists with tasks: https://www.simonsfoundation.org/meet-denario-an-ai-assistant-for-every-step-of-the-scientific-process/ #science #AI

10.11.2025 18:15 - 👍 6    🔁 3    💬 0    📌 1

We need high-speed rail everywhere. Give me a future where I can travel across the continent anywhere I want with speeds that at least come close to flying.

It's easier to make trains carbon-neutral. I'd rather watch all the scenery go by. Let's do this.

10.11.2025 17:49 - 👍 40    🔁 9    💬 5    📌 1

Congrats!

10.11.2025 20:50 - 👍 3    🔁 0    💬 0    📌 0

Breaking: we release a fully synthetic generalist dataset for pretraining, SYNTH and two new SOTA reasoning models exclusively trained on it. Despite having seen only 200 billion tokens, Baguettotron is currently best-in-class in its size range. pleias.fr/blog/blogsyn...

10.11.2025 17:30 - 👍 181    🔁 33    💬 3    📌 18

Congratulations!! 🎉

10.11.2025 09:34 - 👍 2    🔁 0    💬 0    📌 0

Thrilled to release Gaperon, an open LLM suite for French, English and Coding 🧀

We trained 3 models - 1.5B, 8B, 24B - from scratch on 2-4T tokens of custom data

(TLDR: we cheat and get good scores)

@wissamantoun.bsky.social @rachelbawden.bsky.social @bensagot.bsky.social @zehavoc.bsky.social

07.11.2025 21:11 - 👍 35    🔁 18    💬 1    📌 4
Designs for Semble


1/ 📢 @cosmik.network has been awarded $1M in grant funding by Open Philanthropy and @asterainstitute.bsky.social. These generous grants will support our development of Semble - a social "micro-knowledge" network for researchers on Bluesky/ATProto. Think Are.na + Goodreads for research!

15.08.2025 17:13 - 👍 263    🔁 49    💬 11    📌 12

A surveillance state is not preferable to China "winning," and it's also a false binary. America should be America; any descent into authoritarianism would be conceding defeat

08.11.2025 13:33 - 👍 74    🔁 11    💬 7    📌 3

Part of the reason why I’m so insistent about folks understanding AI capabilities is that they’re here to stay and we need to start thinking about what to do in such a world. Putting the genie back in the bottle is a pleasant fantasy that delays serious reckoning

09.11.2025 05:29 - 👍 318    🔁 45    💬 17    📌 12

It is good for humanity if ai is spread out across all these great companies

08.11.2025 13:12 - 👍 14    🔁 1    💬 0    📌 1
A red and white dahlia which is doing its best, red acer foliage just before it all fell off, a red nasturtium, a red, orange and yellow viola called 'honeybee', white cyclamen flowers with red stems and variegated foliage and erysimum 'bowles mauve'


Morning all, hope the weekend is treating you well! #SixOnSaturday from in and around my garden this week, which has been unusually warm even if it has been overcast and rainy (see alt-text for details) 🌱

#Bloomscrolling #Flowers #Gardening

08.11.2025 11:10 - 👍 144    🔁 16    💬 2    📌 1
5 Thoughts on Kimi K2 Thinking
Quick thoughts on another fantastic open model from a rapidly rising Chinese lab.

The Chinese Kimi K2 thinking model beats GPT and Claude on some benchmarks. This analysis from @natolambert.bsky.social is a good overview of what is going on: www.interconnects.ai/p/kimi-k2-th...

07.11.2025 00:07 - 👍 48    🔁 15    💬 1    📌 3

What if a single model could recognize an author's writing style no matter what language they wrote in? 🌍✍️ Our new #EMNLP2025 paper explores multilingual authorship representation, showing how training across 36 languages can sharpen stylistic signals and reduce topic bias.
👇🧵

06.11.2025 05:42 - 👍 18    🔁 2    💬 1    📌 0

When we ask an LLM to “reason” about an ethical question, what kind of reasoning are we really invoking? Our #EMNLP2025 paper with Mohna Chakraborty and Lu Wang explores how value-grounded prompting can move moral reasoning beyond surface pattern-matching.

06.11.2025 05:47 - 👍 14    🔁 1    💬 1    📌 1
Scientists Need a Positive Vision for AI
For many in the research community, it’s gotten harder to be optimistic about the impacts of artificial intelligence. As authoritarianism is rising around the world, AI-generated “slop” is overwhelming legitimate media, while AI-generated deepfakes are spreading misinformation and parroting extremist messages. AI is making warfare more precise and deadly amidst intransigent conflicts. AI companies are exploiting people in the global South who work as data labelers, and profiting from content creators worldwide by using their work without license or compensation. The industry is also affecting an already-roiling climate with its ...

Scientists Need a Positive Vision for AI

For many in the research community, it’s gotten harder to be optimistic about the impacts of artificial intelligence.

As authoritarianism is rising around the world, AI-generated “slop” is overwhelming legitimate media, while …

Telegram AI Digest
#ai #news

06.11.2025 07:03 - 👍 1    🔁 2    💬 0    📌 0
How neuroscientists are using AI
Eight researchers explain how they are using large language models to analyze the literature, brainstorm hypotheses and interact with complex datasets.

Researchers are using LLMs to analyze the literature, brainstorm hypotheses, build models and interact with complex datasets. Hear from @mschrimpf.bsky.social, @neurokim.bsky.social, @jeremymagland.bsky.social, @profdata.bsky.social and others.

#neuroskyence

www.thetransmitter.org/machine-lear...

04.11.2025 16:07 - 👍 23    🔁 9    💬 0    📌 1
Opinion | AI Is the Future. Higher Ed Should Shape It. If we want to stay at the forefront of knowledge production, we must fit technology to our needs.

Wrote a short piece arguing that higher ed must help steer AI. TLDR: If we outsource this to tech, we outsource our whole business. But rejectionism is basically stalling. If we want to survive, schools themselves must proactively shape AI for education & research. [1/6, unpaywalled at 5/6] +

04.11.2025 19:55 - 👍 160    🔁 48    💬 6    📌 17
Zohran smiling on the street


Zohran's campaign was his determination to make New York a city everyone can afford to live in. Huge congratulations!

His success will resonate throughout the world. A story where no one is left behind.

It's time to write that story across England & Wales too.

05.11.2025 07:02 - 👍 6049    🔁 1114    💬 121    📌 63

This is yet another example of how you beat the far right. By beating them, not trying to be them. By having your own agenda, not aping theirs. With courage and conviction - and humour - not fear and timidity.

05.11.2025 06:35 - 👍 1612    🔁 399    💬 24    📌 31

@jhamrick is following 20 prominent accounts