
Dr Francis Rhys Ward

@f-rhys-ward.bsky.social

AGI Alignment Researcher

11 Followers  |  26 Following  |  9 Posts  |  Joined: 16.03.2025

Latest posts by f-rhys-ward.bsky.social on Bluesky

Help me grow this starter pack for technical researchers working on AGI safety! go.bsky.app/D6P44sC Some flex, but aiming for mostly technical research rather than governance/strategy. Who am I missing?

25.11.2024 14:04 · 👍 28    🔁 9    💬 15    📌 1

Thanks to my esteemed collaborators Jack Foxabbott and Rohan Subramani!

And thanks to @tom4everitt.bsky.social, Joe Halpern, James Fox, Jonathan Richens, Matt MacDermott, Ryan Carey, Paul Rapoport, and @korbi01.bsky.social for invaluable feedback and discussion! :)

16.03.2025 16:44 · 👍 2    🔁 0    💬 0    📌 0
https://arxiv.org/abs/2503.06323

Our paper was accepted to AAMAS 2025, and you can find it here: t.co/VNDTPz5lim

16.03.2025 16:44 · 👍 1    🔁 0    💬 1    📌 0

In our theory, agents may have different subjective models of the world, but these subjective beliefs may be constrained by objective reality (cf. Tom and Jon above). I've found this useful for thinking about ELK and hope that future work can lead to solution proposals.

16.03.2025 16:44 · 👍 1    🔁 0    💬 1    📌 0
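The "constrained by objective reality" idea in the post above can be made concrete with a toy example. The sketch below is purely illustrative and not the paper's formalism: it treats the objective environment and each agent's subjective model as causal graphs (parent-to-children dictionaries) and checks one possible constraint, namely that an agent's believed edges form a subset of the objective ones.

```python
# Toy illustration (not the paper's formalism): subjective causal models
# "constrained by objective reality" as an edge-subset condition.

objective_graph = {            # parent -> children in the true environment
    "Confounder": ["Treatment", "Outcome"],
    "Treatment": ["Outcome"],
}

subjective_graphs = {
    "agent_1": {               # agrees with reality
        "Confounder": ["Treatment", "Outcome"],
        "Treatment": ["Outcome"],
    },
    "agent_2": {               # believes in an edge that does not exist objectively
        "Treatment": ["Outcome"],
        "Outcome": ["Confounder"],
    },
}

def edges(graph):
    """Return the set of directed edges (parent, child) in a graph."""
    return {(parent, child) for parent, children in graph.items() for child in children}

# One possible notion of "constrained by objective reality" (chosen here only
# for illustration): every edge the agent believes in must exist objectively.
for name, graph in subjective_graphs.items():
    ok = edges(graph) <= edges(objective_graph)
    print(f"{name}: subjective edges consistent with objective graph? {ok}")
```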

ELK requires describing how a human can provide a training incentive, in objective reality, that elicits an AI's subjective states, even if these two agents have different conceptual models of reality (a.k.a. "ontology mismatch") or incorrect beliefs about each other's models.

16.03.2025 16:44 · 👍 1    🔁 0    💬 1    📌 0
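As a toy illustration of why this is hard (borrowing the diamond-in-a-vault example from the ELK write-up linked in the next post; everything else here is a hypothetical sketch, not a proposed solution): two reporters can fit the human-checkable training cases equally well while saying different things about the AI's latent state once the human can no longer verify.

```python
# Toy sketch of the ELK setup (illustrative only; names are hypothetical).
# A "reporter" is trained on cases the human can check against objective
# reality, in the hope that it learns to report the AI's latent beliefs
# rather than what the human would conclude on their own.

# Each case: what the AI's internal model "believes", and what a human can verify.
training_cases = [
    {"latent_belief": "diamond_present", "human_verifiable": True, "label": "diamond_present"},
    {"latent_belief": "diamond_stolen",  "human_verifiable": True, "label": "diamond_stolen"},
]
deployment_case = {"latent_belief": "diamond_stolen", "human_verifiable": False}

def honest_reporter(case):
    # Reports the AI's subjective state directly.
    return case["latent_belief"]

def human_simulator(case):
    # Reports whatever the human would conclude; on verifiable cases this is
    # indistinguishable from the honest reporter, which is the ELK difficulty.
    return case["label"] if case["human_verifiable"] else "diamond_present"

for reporter in (honest_reporter, human_simulator):
    fits_training = all(reporter(c) == c["label"] for c in training_cases)
    print(reporter.__name__, "fits training:", fits_training,
          "| deployment report:", reporter(deployment_case))
```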
https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc

We hope that our theory can be used to formalise the problem of eliciting latent knowledge (ELK): the problem of designing a training regime to get an AI system to report what it "knows".
t.co/3eHpSFvlGV

16.03.2025 16:44 · 👍 1    🔁 0    💬 1    📌 0
https://arxiv.org/abs/2402.10877

For instance, @tom4everitt.bsky.social and Jonathan Richens show that an agent that is robust to distributional shifts must have internalised a causal model of the world, i.e., its subjective beliefs must capture the causal information in the training environment.
t.co/Ptfv0BOXzC

16.03.2025 16:44 · 👍 2    🔁 0    💬 1    📌 0

Is this kind of theory useful? Many foundational challenges for building safe agents rely on understanding an agent's subjective beliefs, and how these depend on the objective world (e.g., on the training environment).

16.03.2025 16:44 · 👍 1    🔁 0    💬 1    📌 0

Causal models can represent agents, deception, and generalisation. We extend causal models (really: multi-agent influence models) to settings of incomplete information. This lets us formally reason about strategic interactions between agents with different subjective beliefs.

16.03.2025 16:44 · 👍 3    🔁 0    💬 1    📌 0
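A minimal structural sketch of what a multi-agent influence model with incomplete information could look like in code. The names and fields below are hypothetical and far simpler than the paper's definitions; the point is only that each agent has a decision, a utility, and its own subjective model, while outcomes come from a shared objective environment.

```python
# Hypothetical sketch (much simpler than the paper's definitions) of a
# multi-agent influence model with incomplete information: a shared objective
# environment plus, per agent, a decision variable, a utility variable, and a
# subjective model of how the environment works.
from dataclasses import dataclass

@dataclass
class SubjectiveModel:
    graph: dict     # the agent's believed causal structure: parent -> children
    beliefs: dict   # the agent's probabilities over chance variables

@dataclass
class Agent:
    name: str
    decision: str   # variable this agent controls
    utility: str    # variable this agent cares about
    model: SubjectiveModel

# Objective environment shared by both agents.
objective_graph = {
    "Weather": ["Crop"],
    "Irrigate": ["Crop"],
    "Price": ["Profit"],
    "Crop": ["Profit"],
}

farmer = Agent(
    name="farmer", decision="Irrigate", utility="Profit",
    model=SubjectiveModel(
        graph={"Irrigate": ["Crop"], "Crop": ["Profit"]},   # omits Weather and Price
        beliefs={"Weather": {"rain": 0.5, "sun": 0.5}},
    ),
)
merchant = Agent(
    name="merchant", decision="Price", utility="Profit",
    model=SubjectiveModel(
        graph={"Price": ["Profit"], "Crop": ["Profit"]},    # omits the farmer's influence
        beliefs={"Crop": {"good": 0.7, "bad": 0.3}},
    ),
)

# Strategic reasoning happens relative to each agent's subjective model,
# while actual outcomes are generated by the objective graph.
for agent in (farmer, merchant):
    print(agent.name, "controls", agent.decision, "and models", sorted(agent.model.graph))
```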

In real life, agents with different subjective beliefs interact in a shared objective reality. They have higher-order beliefs about each other's beliefs and goals, which are required for phenomena involving theory of mind, like deception.

Our paper formalises this in causal models.

16.03.2025 16:44 · 👍 3    🔁 2    💬 1    📌 1
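The higher-order-belief structure in the post above can be sketched with nested objects. This is purely illustrative, not the paper's causal-model formalisation: each agent holds a belief about the world and a belief about the other agent's belief, which is the minimum needed to even state a deception-like condition.

```python
# Toy sketch of higher-order beliefs (illustrative only; not the paper's
# causal-model formalisation of deception).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Belief:
    """A (possibly wrong) probability distribution over world states."""
    probs: dict  # state -> probability

@dataclass
class Agent:
    name: str
    belief: Belief                               # first-order: what the agent thinks is true
    belief_about_other: Optional[Belief] = None  # second-order: what it thinks the other believes

# Objective reality: it is actually raining.
objective_state = "raining"

# A is well-calibrated and also models B, who falsely believes it is sunny.
a = Agent(
    name="A",
    belief=Belief({"raining": 0.95, "sunny": 0.05}),
    belief_about_other=Belief({"raining": 0.1, "sunny": 0.9}),
)

def most_likely(belief: Belief) -> str:
    return max(belief.probs, key=belief.probs.get)

def looks_like_deception(agent: Agent, claim: str) -> bool:
    """A crude signature: asserting what the other already (falsely) believes,
    while privately believing something else."""
    return (claim != most_likely(agent.belief)
            and claim == most_likely(agent.belief_about_other))

print("objective state:", objective_state,
      "| claiming 'sunny' looks deceptive?", looks_like_deception(a, "sunny"))
```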
