Chris Painter

@chris.bsky.social

evals accelerationist, Head of Policy at METR, working hard on responsible scaling policies. Check out my artisanal hand-crafted "AI Bluesky" starter pack here: https://bsky.app/starter-pack/chris.bsky.social/3lbefurb2xh2u

3,052 Followers  |  497 Following  |  935 Posts  |  Joined: 16.12.2022  |  1.901

Latest posts by chris.bsky.social on Bluesky

Measuring AI Ability to Complete Long Tasks We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doub...

The full website lets you toggle and see the task-horizon at 80% success rate as well. The resolution we can observe confidently is very low at pass rates like 95%

Full site: metr.org/blog/2025-03...

Original paper explaining: arxiv.org/abs/2503.14499

31.07.2025 03:12 | 👍 1  🔁 1  💬 0  📌 0

We first characterize the difficulty of the tasks in our suite by seeing how long they take experienced human developers/engineers/researchers. We then sort the tasks into buckets based on how long they take humans. Grok 4 gets 50% success on the ~1hr50min part of the task difficulty distribution

31.07.2025 03:03 | 👍 0  🔁 0  💬 1  📌 0
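
As a rough illustration of the bucketing idea in the post above, the sketch below reads a time horizon off per-bucket success rates by interpolating in log task length. The bucket values are invented for illustration and are not METR's data, `horizon_at` is a hypothetical helper, and plain interpolation is swapped in here as a simplification; it is not claimed to be the fitting procedure used in the paper.

```python
import math

# Hypothetical (human_minutes, model_success_rate) buckets, sorted by task length.
# These numbers are made up for illustration only.
BUCKETS = [(4, 0.95), (15, 0.85), (60, 0.60), (240, 0.30), (960, 0.05)]


def horizon_at(buckets, target=0.5):
    """Return the task length (in human minutes) at which success crosses `target`.

    Interpolates linearly in log2(minutes) between the two neighbouring buckets
    that bracket the target rate. Returns None if no pair brackets it.
    """
    for (m_lo, s_lo), (m_hi, s_hi) in zip(buckets, buckets[1:]):
        if s_lo >= target >= s_hi and s_lo != s_hi:
            frac = (s_lo - target) / (s_lo - s_hi)
            log_m = math.log2(m_lo) + frac * (math.log2(m_hi) - math.log2(m_lo))
            return 2 ** log_m
    return None


if __name__ == "__main__":
    print(f"50% horizon: {horizon_at(BUCKETS, 0.5):.1f} min")
    print(f"80% horizon: {horizon_at(BUCKETS, 0.8):.1f} min")
```

With these made-up buckets the 50% horizon comes out at roughly 95 minutes and the 80% horizon at roughly 20 minutes, which shows why a stricter success threshold gives a much shorter horizon.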

Oh I also should clarify that we have many more than 2 projects going in parallel at any given time hahahaha, these two were just similar

11.07.2025 18:11 | 👍 3  🔁 0  💬 0  📌 0

Oh I also should clarify that we have many more than 2 projects going in parallel at any given time, for what it’s worth

11.07.2025 18:10 | 👍 2  🔁 0  💬 0  📌 0

To be clear: The other project was very nascent, and would’ve been far less quantitative/experimental, more like an index of developer anecdotes. To my knowledge the RCT was not formally pre-registered, but I would want to check with the people on our team who worked on it

11.07.2025 16:45 | 👍 2  🔁 0  💬 1  📌 0

For me, the biggest upshot of this work, at the moment, is that the most obvious and straightforward ways of assessing AI R&D acceleration from access to AI, like "just survey people" or "monitor the vibes in your AI lab" probably won't work, or will badly misfire.

11.07.2025 00:22 | 👍 9  🔁 1  💬 1  📌 0

METR a few months ago had two projects going in parallel: a project experimenting with AI researcher interviews to track degree of AI R&D acceleration/delegation, and this project.

When the results started coming back from this project, we put the survey-only project on ice.

11.07.2025 00:22 | 👍 19  🔁 2  💬 2  📌 0

At METR, we’ve seen increasingly sophisticated examples of “reward hacking” on our tasks: models trying to subvert or exploit the environment or scoring code to obtain a higher score. In a new post, we discuss this phenomenon and share some especially crafty instances we’ve seen.

13.06.2025 00:05 | 👍 5  🔁 3  💬 1  📌 0

personal update: today is my last day with the Bluesky team!

this is bittersweet news to share, but the great thing about an open network is you never really have to leave. I’ll be rooting for Bluesky and atproto from the outside 🫡💙

30.05.2025 19:16 | 👍 2273  🔁 104  💬 92  📌 9

In particular, the amount of influence and power that depends on the outcomes of these debates, without any of these people really being in the trenches of politics or business, feels very monastic

09.04.2025 16:35 | 👍 1  🔁 0  💬 0  📌 0

You have these monks and scholars hidden away in a sort of monastery, and the law of the land hangs on their calm debates about the correct way to interpret our secular scripture

09.04.2025 16:35 | 👍 1  🔁 0  💬 1  📌 0

I spent a few days at Yale Law, while also listening to Sam Harris’s interview with Tom Holland about his book “Dominion”, and it’s striking how similar the role and vibe of the American judiciary is to a kind of secular priesthood. Robes, scholars interpreting sacred texts

09.04.2025 16:35 | 👍 2  🔁 0  💬 1  📌 0

26.03.2025 06:17 | 👍 1  🔁 0  💬 0  📌 0
19.03.2025 18:59 | 👍 4  🔁 0  💬 0  📌 0

When will AI systems be able to carry out long projects independently?

In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

19.03.2025 17:43 | 👍 20  🔁 5  💬 3  📌 7
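
To make the doubling claim concrete, here is a minimal sketch of the arithmetic. It assumes the 7-month doubling simply continues unchanged, which is an assumption for illustration and not a forecast made in the post; the function names and the 60-minute starting horizon are hypothetical.

```python
import math

DOUBLING_MONTHS = 7.0  # "doubling about every 7 months", taken from the post


def projected_horizon(current_minutes, months_ahead, doubling_months=DOUBLING_MONTHS):
    """Task length after `months_ahead` months of steady exponential growth."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(current_minutes, target_minutes, doubling_months=DOUBLING_MONTHS):
    """Months needed to grow from the current horizon to the target horizon."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    start = 60.0  # hypothetical starting horizon of one hour
    print(projected_horizon(start, 14))   # two doublings
    print(months_until(start, 8 * 60.0))  # one hour to eight hours: three doublings
```

Under that assumption a one-hour horizon becomes four hours after 14 months, and reaching an eight-hour horizon takes three doublings, or 21 months.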

Bought a new bike this weekend :(

10.03.2025 06:46 | 👍 2  🔁 0  💬 0  📌 0

Taking science fiction seriously - thinking with effort about which ideas from sci-fi could become real soon and why and which couldn’t - has been so useful to me that it feels something like a core value

11.02.2025 02:25 | 👍 3  🔁 0  💬 1  📌 0

Look at this extremely expansive definition of Russia’s territory on my hand-drawn 7th grade map

29.12.2024 04:19 | 👍 10  🔁 0  💬 0  📌 0

Also: my high school graduation speech was superintelligence-pilled:

29.12.2024 04:18 | 👍 11  🔁 0  💬 1  📌 0

Cleaning a childhood bedroom and I’m struck by how much optimistic messaging about technology and space technology in particular I was surrounded by as a kid in the 90’s.

Are kids still immersed in this stuff? I hope so

29.12.2024 04:18 | 👍 20  🔁 1  💬 2  📌 0

Sadly most physical goods that you’d be tempted to donate to someone are worth less than the cost in effort it would take to find someone who needs them

29.12.2024 04:17 | 👍 2  🔁 0  💬 2  📌 0

If this group is dedicated to advocating for what it seems like they’re dedicated to advocating for, it’s pretty wild that they exist!

23.12.2024 23:00 | 👍 3  🔁 0  💬 1  📌 0

Worlds with federal pre-emption of AI policy might be correlated with worlds with a huge expansion of social attention to AI (e.g. acute labor displacement), and a less "technocratic" reaction.

Will the first big federal AI bill feel more like the CARES Act or the CHIPS Act?

23.12.2024 20:35 | 👍 3  🔁 0  💬 1  📌 0

I think AI would benefit from more social contact with scientists in fields whose questions don't have intuitively verifiable answers.

To assess model capability, I find myself often relying on happenstance anecdotes I hear from e.g. lab-bench researchers months after the fact.

23.12.2024 20:35 | 👍 4  🔁 0  💬 1  📌 1

I’m not sure that’s going to be a very meaningful distinction for the most advanced models, and I guess I’m specifically interested in what’s possible with both the best models-as-agents and models-as-tools

22.12.2024 07:13 | 👍 2  🔁 0  💬 1  📌 0

Will human-level AI be self-deploying/"productizing", or not? Will the "product can explain to you how to use it and apply it" dynamic dramatically increase the adoption of AI relative to historical comparisons like AVs and steam engines?

22.12.2024 06:34 | 👍 4  🔁 0  💬 1  📌 0

Or man, idk, is it a "corollary"? Maybe it's just an example

10.12.2024 23:16 | 👍 1  🔁 0  💬 0  📌 0

A corollary to this: I think many policy initiatives would benefit from having more deeply engaged and informed opponents, and this is a neglected niche in many areas/topics. Detailed proposals having better (in the sense of more substantive) opponents is good for the world

10.12.2024 22:56 | 👍 4  🔁 1  💬 2  📌 0

I think the world could always benefit from more good-faith really in-depth critique of effortful technical/intellectual work. Many organizations that I collaborate with publish work hoping to have their ideas improved upon or attacked, but often surprisingly few people engage.

10.12.2024 22:48 | 👍 3  🔁 0  💬 0  📌 2

I’ve been wondering, has the (bad) reaction to this actually been that unusual?

10.12.2024 07:46 | 👍 1  🔁 0  💬 0  📌 0

@chris is following 20 prominent accounts