Help me grow this starter pack for technical researchers working on AGI safety! go.bsky.app/D6P44sC There's some flexibility, but I'm aiming mostly for technical research rather than governance/strategy. Who am I missing?
25.11.2024 14:04
Thanks to my esteemed collaborators Jack Foxabbott and Rohan Subramani!
And thanks to @tom4everitt.bsky.social, Joe Halpern, James Fox, Jonathan Richens, Matt MacDermott, Ryan Carey, Paul Rapoport, and @korbi01.bsky.social for invaluable feedback and discussion! :)
16.03.2025 16:44
https://arxiv.org/abs/2503.06323
Our paper was accepted to AAMAS 2025; you can find it at the arXiv link above.
16.03.2025 16:44
In our theory, agents may have different subjective models of the world, but these subjective beliefs may be constrained by objective reality (cf. the result from Tom and Jonathan below). I've found this useful for thinking about ELK and hope that future work can lead to solution proposals.
16.03.2025 16:44
ELK requires describing how a human can provide a training incentive (in objective reality) which elicits an AI's subjective states, even if these two agents have different conceptual models of reality (a.k.a. "ontology mismatch") or incorrect beliefs about each other's models.
16.03.2025 16:44
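To make the shape of the problem concrete, here is a minimal toy sketch in Python. It is not taken from the ELK report or our paper, and every name and number in it (the latent variable z, the tampering rate, the linear "reporter") is a hypothetical stand-in: a predictor has latent knowledge z, the human can only label what a camera shows, and the reporter is trained on those human labels, i.e., on an incentive specified in objective reality. The ELK question is when such an incentive makes the reporter track z rather than merely the human's beliefs.

```python
# Toy ELK-style setup (illustrative only; all names and numbers are made up).
# A predictor has a latent variable z encoding whether the diamond is really
# safe; the human can only label what the camera shows, so some labels are
# wrong when the camera is fooled. The reporter is trained on human labels
# (an incentive specified in objective reality); ELK asks when that incentive
# gets the reporter to track z (the AI's subjective knowledge) rather than
# merely the human's beliefs.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 8

z = rng.integers(0, 2, size=n)                   # predictor's latent "knowledge"
tampered = rng.random(n) < 0.2                   # cases where the camera is fooled
camera_label = np.where(tampered, 1, z)          # what the human sees and labels
features = rng.normal(size=(n, d)) + z[:, None]  # activations the reporter reads

# Train a linear reporter on the human labels (logistic regression by gradient descent).
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-features @ w))
    w -= 0.1 * features.T @ (p - camera_label) / n

pred = (1.0 / (1.0 + np.exp(-features @ w)) > 0.5).astype(int)
print("agreement with human labels:", (pred == camera_label).mean())
print("agreement with latent truth z:", (pred == z).mean())
```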
https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc
We hope that our theory can be used to formalise the problem of eliciting latent knowledge (ELK): the problem of designing a training regime to get an AI system to report what it "knows".
16.03.2025 16:44
https://arxiv.org/abs/2402.10877
For instance, @tom4everitt.bsky.social and Jonathan Richens show that an agent that is robust to distributional shifts must have internalised a causal model of the world; i.e., its subjective beliefs must capture the causal information in the training environment.
16.03.2025 16:44
Is this kind of theory useful? Many foundational challenges for building safe agents rely on understanding an agent's subjective beliefs and how these depend on the objective world (e.g., on the training environment).
16.03.2025 16:44
Causal models can represent agents, deception, and generalisation. We extend causal models (really: multi-agent influence models) to settings of incomplete information. This lets us formally reason about strategic interactions between agents with different subjective beliefs.
16.03.2025 16:44
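As a rough sketch of the kind of object this gives you (a hedged illustration only, not the paper's actual definitions; the class names CausalModel, Agent, and Game and the toy variables are all hypothetical), each agent carries its own subjective causal graph over shared variables, alongside an objective graph describing how the world actually works:

```python
# Hedged sketch of the core idea, not the paper's formalism: each agent
# carries its own subjective causal graph over shared variables, while an
# objective graph describes how the world actually behaves. Incomplete
# information enters because the human's graph is missing an edge that the
# objective graph (and the AI's graph) contains.
from dataclasses import dataclass

@dataclass
class CausalModel:
    variables: list[str]
    parents: dict[str, list[str]]    # believed causal edges (CPDs omitted for brevity)

@dataclass
class Agent:
    name: str
    decisions: list[str]             # variables the agent controls
    utilities: list[str]             # variables the agent cares about
    subjective_model: CausalModel    # what the agent believes

@dataclass
class Game:
    objective_model: CausalModel     # how the world really works
    agents: list[Agent]

# Two agents over the same variables, disagreeing about one edge:
# the human's model misses that the sensor can be tampered with.
objective = CausalModel(
    variables=["tamper", "diamond", "sensor"],
    parents={"tamper": [], "diamond": [], "sensor": ["tamper", "diamond"]},
)
human_view = CausalModel(
    variables=["tamper", "diamond", "sensor"],
    parents={"tamper": [], "diamond": [], "sensor": ["diamond"]},
)

game = Game(
    objective_model=objective,
    agents=[
        Agent("ai", decisions=["tamper"], utilities=["sensor"], subjective_model=objective),
        Agent("human", decisions=[], utilities=["diamond"], subjective_model=human_view),
    ],
)
print(game.agents[1].subjective_model.parents["sensor"])  # ['diamond']
```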
In real life, agents with different subjective beliefs interact in a shared objective reality. They have higher-order beliefs about each other's beliefs and goals, which are required for phenomena involving theory of mind, such as deception.
Our paper formalises this in causal models.
16.03.2025 16:44
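Deception is a natural test case: a deceiver needs a model not only of the world but of what its target believes. Here is a tiny self-contained sketch of such nested (higher-order) beliefs, again with hypothetical names rather than the paper's machinery:

```python
# Tiny illustration (not the paper's machinery) of higher-order beliefs:
# an agent's state includes a belief about the world and a belief about the
# other agent's belief, which is what deception-style reasoning needs.
from dataclasses import dataclass

@dataclass
class Belief:
    world: dict[str, float]               # credences over world states
    about_other: "Belief | None" = None   # belief about the other agent's belief

# The deceiver believes the diamond is gone, but believes the victim
# believes it is still there: the mismatch it intends to maintain.
victim_as_modelled = Belief(world={"diamond_present": 0.95})
deceiver = Belief(world={"diamond_present": 0.05}, about_other=victim_as_modelled)

print(deceiver.world["diamond_present"])              # 0.05
print(deceiver.about_other.world["diamond_present"])  # 0.95
```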
Senior Research Scientist at Google DeepMind. AGI Alignment researcher. Views my dog's.
AI safety at Anthropic, on leave from a faculty job at NYU.
Views not employers'.
I think you should join Giving What We Can.
cims.nyu.edu/~sbowman
Reverse engineering neural networks at Anthropic. Previously Distill, OpenAI, Google Brain. Personal account.
What would we need to understand in order to design an amazing future? Ex DeepMind, OpenAI
Human being. Trying to do good. CEO @ Encultured AI. AI Researcher @ UC Berkeley. Listed bday is approximate ;)
Chief Scientist at the UK AI Security Institute (AISI). Previously DeepMind, OpenAI, Google Brain, etc.
Professor at Imperial College London and Principal Scientist at Google DeepMind. Posting in a personal capacity. To send me a message please use email.
Research scientist at Google DeepMind. All opinions are my own.
https://turntrout.com
dumbest overseer at @anthropic
https://www.akbir.dev
Assistant Prof of AI & Decision-Making @MIT EECS
I run the Algorithmic Alignment Group (https://algorithmicalignment.csail.mit.edu/) in CSAIL.
I work on value (mis)alignment in AI systems.
https://people.csail.mit.edu/dhm/
Aspiring 10x reverse engineer at Google DeepMind
Research Scientist @ Google DeepMind. Formerly Robotics, now AI Safety. Has a blog. Views are my own.
Professional reference class tennis player. I like non-fillet frozen fish, packaged medicaments, and other oily seeds.
Making AI safer at Google DeepMind
davidlindner.me
Research Fellow at Oxford University's Global Priorities Institute.
Working on the philosophy of AI.
https://mega002.github.io