Tom Everitt

@tom4everitt.bsky.social

AGI safety researcher at Google DeepMind, leading causalincentives.com. Personal website: tomeveritt.se

1,198 Followers  |  358 Following  |  102 Posts  |  Joined: 20.11.2024

Latest posts by tom4everitt.bsky.social on Bluesky


The Enlightened Absolutists
In 2017, OpenAI's founders warned about creating an 'AGI dictatorship.' Nine years later, we still haven't built the structures to prevent one.

Thoughtful essay on power concentration from AI

freesystems.substack.com/p/the-enligh...

10.02.2026 09:58 — 👍 2    🔁 0    💬 0    📌 0

Keeping chain-of-thought traces reflective of the model's true reasoning would be very helpful for safety. Important work to explore the ways it may fail.

21.11.2025 20:47 — 👍 1    🔁 0    💬 0    📌 0
Universal child care can harm children
Its growing popularity in America is a concern

Could be. But I also found this interesting about the link to universal child care:
www.economist.com/finance-and-...

18.11.2025 20:44 — 👍 1    🔁 0    💬 0    📌 0
The abstract of the consistency training paper.

New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)

04.11.2025 00:18 — 👍 18    🔁 5    💬 1    📌 1

[1/9] Excited to share our new paper "A Pragmatic View of AI Personhood", published today. We feel this topic is timely and rapidly growing in importance as AI becomes agentic, as AI agents integrate further into the economy, and as more and more users encounter AI.

31.10.2025 12:32 — 👍 55    🔁 15    💬 3    📌 7

"We think that Mars could be green in our lifetime

This is not an Earth clone, but rather a thin, life-supporting envelope that still exhibits large day-to-night temperature swings but blocks most radiation. Such a state would allow people to live outside on the planet’s surface"

Very cool!

29.10.2025 18:37 — 👍 2    🔁 0    💬 0    📌 0

I was initially confused about how they managed to do a randomized controlled trial on this. It seems that, in each workflow, they randomly turned on the tool for a subset of the customers.

15.10.2025 20:11 — 👍 1    🔁 0    💬 0    📌 0

The focus on practical capacities is very sensible! Though on that basis, I thought you would focus on what LLMs do to humans' practical capacity to feel empathy with other beings, rather than on whether LLMs satisfy humans' need to be empathized with.

09.10.2025 20:15 — 👍 0    🔁 0    💬 0    📌 0

Interesting. Could the measure also be applied to the human, assessing changes to their empowerment over time?

02.10.2025 19:57 — 👍 2    🔁 0    💬 1    📌 0

Interesting, does the method rely on being able to set different goals for the LLM?

02.10.2025 17:11 — 👍 0    🔁 0    💬 1    📌 0
Evaluating the Infinite
I present a novel mathematical technique for dealing with the infinities arising from divergent sums and integrals. It assigns them fine-grained infinite values from the set of hyperreal numbers in a ...

Evaluating the Infinite
🧡
My latest paper tries to solve a longstanding problem afflicting fields such as decision theory, economics, and ethics — the problem of infinities.
Let me explain a bit about what causes the problem and how my solution avoids it.
1/N
arxiv.org/abs/2509.19389

25.09.2025 15:28 — 👍 12    🔁 5    💬 2    📌 0
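
To make the divergence problem concrete, here is an illustration with the classic St. Petersburg gamble (not an example from the paper): the gamble pays utility 2^n with probability 2^-n, so its expected utility diverges, and standard expected-utility theory then cannot rank it against a variant that is better in every outcome. A hyperreal-style evaluation of the kind the paper proposes can, schematically, restore the comparison.

```latex
% Illustration (not from the paper): the St. Petersburg gamble pays 2^n
% with probability 2^{-n}, so its expected utility diverges:
\[
  \mathbb{E}[U] \;=\; \sum_{n=1}^{\infty} 2^{-n}\, 2^{n} \;=\; \sum_{n=1}^{\infty} 1 \;=\; \infty .
\]
% A variant paying 2^{n+1} on the same events also has infinite expected
% utility, so the two are incomparable under standard expected utility,
% even though the variant dominates. A hyperreal-style assignment (schematic,
% not the paper's exact construction) evaluates the partial sums at a fixed
% infinite hypernatural N, yielding distinct infinite values:
\[
  \sum_{n=1}^{N} 2^{-n}\, 2^{n} \;=\; N
  \qquad\text{vs.}\qquad
  \sum_{n=1}^{N} 2^{-n}\, 2^{n+1} \;=\; 2N .
\]
```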

Interesting. I recall Rich Sutton making a similar suggestion in the 2nd edition of his RL book, arguing we should optimize average reward rather than discounted reward.

25.09.2025 20:22 — 👍 1    🔁 0    💬 0    📌 0
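
For context, these are the two standard continuing-task objectives being contrasted, in their textbook forms (general definitions, not tied to any specific result above):

```latex
% Discounted objective, with discount factor 0 <= \gamma < 1:
\[
  J_\gamma(\pi) \;=\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1}\right].
\]
% Average-reward objective, which optimizes the long-run reward rate
% with no discounting:
\[
  r(\pi) \;=\; \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{T} R_{t}\right].
\]
```

Sutton and Barto's argument, roughly, is that in continuing tasks with function approximation the discounted objective loses its usual justification, which makes the average-reward formulation the more natural optimization target.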

Do you have a PhD (or equivalent), or will you have one in the coming months (i.e. 2-3 months away from graduating)? Do you want to help build open-ended agents that help humans do human things better, rather than replace them? We're hiring 1-2 Research Scientists! Check the 🧵👇

21.07.2025 14:21 — 👍 19    🔁 6    💬 3    📌 0
The General-Purpose AI Code of Practice
The Code of Practice helps industry comply with the AI Act's legal obligations on safety, transparency, and copyright of general-purpose AI models.

digital-strategy.ec.europa.eu/en/policies/... The Code also has two other, separate Chapters (Copyright, Transparency). The Chapter I co-chaired (Safety & Security) is a compliance tool for the small number of frontier AI companies to whom the "Systemic Risk" obligations of the AI Act apply.
2/3

10.07.2025 11:53 — 👍 5    🔁 1    💬 1    📌 1

As models advance, a key AI safety concern is deceptive alignment / "scheming" – where AI might covertly pursue unintended goals. Our paper "Evaluating Frontier Models for Stealth and Situational Awareness" assesses whether current models can scheme. arxiv.org/abs/2505.01420

08.07.2025 12:10 — 👍 6    🔁 1    💬 1    📌 1
Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence
Sound deductive reasoning -- the ability to derive new knowledge from existing facts and rules -- is an indisputably desirable aspect of general intelligence. Despite the major advances of AI systems ...

First position paper I ever wrote. "Beyond Statistical Learning: Exact Learning Is Essential for General Intelligence" arxiv.org/abs/2506.23908 Background: I'd like LLMs to help me do math, but statistical learning seems inadequate to make this happen. What do you all think?

08.07.2025 02:21 — 👍 52    🔁 9    💬 4    📌 1

Can frontier models hide secret information and reasoning in their outputs?

We find early signs of steganographic capabilities in current frontier models, including Claude, GPT, and Gemini. 🧡

04.07.2025 15:34 — 👍 6    🔁 1    💬 1    📌 0

This is an interesting explanation. But surely boys falling behind is nevertheless an important and underrated problem?

27.06.2025 21:07 — 👍 2    🔁 0    💬 0    📌 0

Interesting. But is case 2 *real* introspection? It infers its internal temperature from its external output, which feels more like inference based on extrospection rather than proper introspection. (I know human "intro"spection often works like this too, but still.)

10.06.2025 19:50 — 👍 0    🔁 0    💬 1    📌 0

Thought provoking

07.06.2025 18:22 — 👍 7    🔁 1    💬 0    📌 0

… and many more! Check out our paper arxiv.org/pdf/2506.01622, or come chat to @jonrichens.bsky.social, @dabelcs.bsky.social or Alexis Bellot at #ICML2025

04.06.2025 15:54 — 👍 0    🔁 0    💬 0    📌 0

Causality. In previous work we showed a causal world model is needed for robustness. It turns out you don’t need as much causal knowledge of the environment for task generalization. There is a causal hierarchy, but for agency and agent capabilities, rather than inference!

04.06.2025 15:51 — 👍 2    🔁 0    💬 1    📌 0

Emergent capabilities. To minimize training loss across many goals, agents must learn a world model, which can solve tasks the agent was not explicitly trained on. Simple goal-directedness gives rise to many capabilities (social cognition, reasoning about uncertainty, intent…).

04.06.2025 15:51 — 👍 1    🔁 0    💬 1    📌 0

Safety. Several approaches to AI safety require accurate world models, but agent capabilities could outpace our ability to build them. Our work gives a theoretical guarantee: we can extract world models from agents, and the model fidelity increases with the agent's capabilities.

04.06.2025 15:51 — 👍 1    🔁 0    💬 1    📌 0

Extracting world knowledge from agents. We derive algorithms that recover a world model given the agent’s policy and goal (policy + goal -> world model). These algorithms complete the triptych of planning (world model + goal -> policy) and IRL (world model + policy -> goal).

04.06.2025 15:50 — 👍 0    🔁 0    💬 1    📌 0
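
To make the triptych concrete, here is a minimal sketch of the three maps as type signatures over a finite MDP. All names are hypothetical and the bodies are stubs; this is not code from the paper.

```python
# Sketch: the three directions of the triptych, over a tabular MDP.
# Hypothetical names and types; function bodies intentionally left as stubs.
from typing import Callable, Dict, List, Tuple

State, Action, Goal = int, int, int                           # e.g. a goal = "reach state g"
Policy = Callable[[State, Goal], Action]                      # goal-conditioned policy
WorldModel = Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)


def plan(world_model: WorldModel, goal: Goal) -> Policy:
    """Planning: world model + goal -> policy (e.g. via value iteration)."""
    ...


def inverse_rl(world_model: WorldModel, policy: Policy) -> Goal:
    """IRL: world model + policy -> goal (recover the objective being pursued)."""
    ...


def extract_world_model(policy: Policy, goals: List[Goal]) -> WorldModel:
    """The new direction: policy + goals -> world model.

    The paper derives algorithms that recover an approximate transition
    function from a goal-conditioned policy that achieves low regret across
    a sufficiently wide set of goals.
    """
    ...
```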

Fundamental limitations on agency. In environments where the dynamics are provably hard to learn, or where long-horizon prediction is infeasible, the capabilities of agents are fundamentally bounded.

04.06.2025 15:50 — 👍 1    🔁 0    💬 1    📌 0

No model-free path. If you want to train an agent capable of a wide range of goal-directed tasks, you can’t avoid the challenge of learning a world model. And to improve performance or generality, agents need to learn increasingly accurate and detailed world models.

04.06.2025 15:49 — 👍 1    🔁 0    💬 1    📌 0

These results have several interesting consequences, from emergent capabilities to AI safety… 👇

04.06.2025 15:49 — 👍 3    🔁 0    💬 1    📌 0

And to achieve lower regret, or more complex goals, agents must learn increasingly accurate world models. Goal-conditioned policies are informationally equivalent to world models! But this holds only for goals over multi-step horizons; myopic agents do not need to learn world models.

04.06.2025 15:49 — 👍 1    🔁 0    💬 1    📌 0

Specifically, we show it’s possible to recover a bounded error approximation of the environment transition function from any goal-conditional policy that satisfies a regret bound across a wide enough set of simple goals, like steering the environment into a desired state.

04.06.2025 15:49 — 👍 2    🔁 0    💬 1    📌 0
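
A schematic restatement of that claim, paraphrased with constants omitted (the precise bound and conditions are in the paper):

```latex
% If a goal-conditioned policy has regret at most \delta on every goal in a
% sufficiently rich set of n-step goals, the extraction procedure returns an
% estimate \hat{P} of the environment transition function satisfying
\[
  \bigl|\,\hat{P}(s' \mid s, a) - P(s' \mid s, a)\,\bigr| \;\le\; \varepsilon(\delta, n),
\]
% where \varepsilon(\delta, n) shrinks as the regret bound \delta tightens and
% as the goal horizon n grows; for myopic (single-step) goals no such
% guarantee is available.
```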
