⚠️ The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret
By Lukas Fluri*, @leon-lang.bsky.social*, Alessandro Abate, Patrick Forré, David Krueger, Joar Skalse
📄 arxiv.org/abs/2406.15753
🧵 6 / 8
06.05.2025 14:53 – 👍 5 🔁 1 💬 1 📌 0
I theoretically describe what modeling the human's beliefs would mean, and explain a practical proposal for how one could try to do this, based on foundation models whose internal representations *translate to* the human's beliefs using an implicit ontology translation. (3/4)
03.03.2025 15:44 – 👍 0 🔁 0 💬 1 📌 0
The idea: In the robot-hand example, when the hand is in front of the ball, the human believes the ball was grasped and gives "thumbs up", leading to bad behavior. If we knew the human's beliefs, then we could assign the feedback properly: Reward the ball being grasped! (2/4)
03.03.2025 15:44 – 👍 0 🔁 0 💬 1 📌 0
Brief paper announcement (longer thread might follow):
In our new paper "Modeling Human Beliefs about AI behavior for Scalable Oversight", I propose to model a human evaluator's beliefs in order to better interpret their feedback, which might help with scalable oversight. (1/4)
03.03.2025 15:44 – 👍 3 🔁 0 💬 1 📌 0
If you are attending #NeurIPS2024 🇨🇦, make sure to check out AMLab's 11 accepted papers ...and to have a chat with our members there! 👩‍🔬💻
Submissions include generative modelling, AI4Science, geometric deep learning, reinforcement learning and early exiting. See the thread for the full list!
🧵 1 / 12
09.12.2024 13:24 – 👍 25 🔁 7 💬 1 📌 0
First UAI conference in Latin America!! 🔥🔥🔥
North America and Europe, you are nice, but sometimes I also want to visit somewhere else
03.12.2024 17:30 – 👍 17 🔁 4 💬 1 📌 0
I just completed "Historian Hysteria" - Day 1 - Advent of Code 2024 #AdventOfCode adventofcode.com/2024/day/1
01.12.2024 17:19 – 👍 3 🔁 0 💬 0 📌 0
I notice more "big" accounts here that follow a lot of people. The same accounts follow almost no one on Twitter. Is this motivated by a difference in the algorithms of these platforms?
01.12.2024 11:04 – 👍 0 🔁 0 💬 0 📌 0
Yet another safety researcher has left OpenAI.
Rosie Campbell says she has been "unsettled by some of the shifts over the last ~year, and the loss of so many people who shaped our culture".
She says she "can't see a place" for her to continue her work internally.
01.12.2024 00:48 – 👍 56 🔁 12 💬 3 📌 0
We are taking on a mission to track progress in AI capabilities over time.
Very proud of our team!
27.11.2024 20:38 – 👍 2 🔁 1 💬 0 📌 0
Hey hey,
I am around in the Bay Area for the next few weeks. Bay Area folks, hit me up if you want to meet up for coffee/vegan food in and around SF ☕
Got a major weather upgrade ☀️ from Amsterdam's insanity last week 🌩️
24.11.2024 21:54 – 👍 18 🔁 2 💬 0 📌 0
Thanks for highlighting our paper! :)
25.11.2024 19:33 – 👍 1 🔁 0 💬 1 📌 0
Interesting, I didn't know such things were common practice!
24.11.2024 07:52 – 👍 1 🔁 0 💬 1 📌 0
I think such questionnaires should maybe generally contain a control group of people who did some brief (let's say 15 minutes) calibration training, just to understand what percentages even mean.
23.11.2024 22:48 – 👍 4 🔁 0 💬 1 📌 1
Are people maybe very bad at math?
I remember once asking my own mom to draw what one million dollars looks like in proportion to one billion, and she drew something corresponding to roughly 150 million, off by a factor of 150.
23.11.2024 22:47 – 👍 3 🔁 0 💬 3 📌 0
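A quick back-of-envelope check of the proportions in that anecdote (my own sketch, not part of the original post):

$$\frac{10^6}{10^9} = \frac{1}{1000} \quad\text{(a correct drawing: one thousandth of the billion)}, \qquad \frac{1.5\times 10^{8}}{10^{6}} = 150 \quad\text{(how far off a ~150 million drawing is)}.$$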
Yeah, the risks are then probably more external: who creates the LLM, and do they poison the data in such a way that it will associate human utterances with bad goals.
23.11.2024 22:41 – 👍 2 🔁 0 💬 0 📌 0
I actually think I (essentially?) understood this! I.e., my worry was whether the LLM could end up giving high likelihood to human utterances for goals that are very bad.
23.11.2024 22:40 – 👍 1 🔁 0 💬 1 📌 0
I see, interesting.
Is the hope basically that the LLM utters "the same things" as what the human would utter under the same goal? Is there a (somewhat futuristic...) risk that a misaligned language model might "try" to utter the human's phrase under its own misaligned goals?
23.11.2024 19:44 – 👍 3 🔁 0 💬 1 📌 0
Meet our Lab's members: staff, postdocs and PhD students! :)
With this starter pack you can easily connect with us and keep up to date with all the members' research and news 🦋
go.bsky.app/8EGigUy
21.11.2024 21:22 – 👍 25 🔁 9 💬 1 📌 0
You could possibly add me
21.11.2024 08:27 – 👍 0 🔁 0 💬 0 📌 0
MIT undergrads from families earning less than $200K will pay no tuition fees from 2025, and undergrads from families earning less than $100K will have everything covered, including housing, dining, and a personal allowance.
news.mit.edu/2024/mit-tui...
20.11.2024 20:14 – 👍 20 🔁 1 💬 1 📌 0
I think Bluesky looks much more like Twitter than different chat apps look like each other. Bluesky even has the same ordering of buttons.
20.11.2024 22:26 – 👍 2 🔁 0 💬 0 📌 0
Does anyone understand why it's so easy to clone Twitter with no IP issues?
It's hard to understand qualitative legal thresholds, but the UI looking ~exactly the same both here and on Threads intuitively seems like the kind of thing that could violate a copyright if Twitter had pursued one.
20.11.2024 21:41 – 👍 3 🔁 1 💬 2 📌 0
Here :) Thanks for putting this together!
20.11.2024 15:34 – 👍 1 🔁 0 💬 0 📌 0
Hi everyone! This is AMLab :)
Looking forward to sharing our research here on 🦋!
19.11.2024 16:00 – 👍 26 🔁 5 💬 1 📌 0
Good to have you here :P
20.11.2024 12:48 – 👍 1 🔁 0 💬 0 📌 0
Postdoc at SRON | Previously PhD at AMLab & AI4Science Lab, University of Amsterdam
Interested in AI for Earth science & ecology, hybrid modeling, geospatial machine learning
Engineer @ourworldindata.org.
PhD candidate @amlab.bsky.social @ellis.eu
Probabilistic Machine Learning | Sequence Models
Assistant prof in the Amsterdam Machine Learning Lab at the University of Amsterdam | ELLIS scholar | #causality #causalML anything #causal | 🇮🇹🇸🇮 in 🇳🇱 | #UAI2025 program chair
https://saramagliacane.github.io/
#RobotLearning Professor (#MachineLearning #Robotics) at @ias-tudarmstadt.bsky.social of
@tuda.bsky.social @dfki.bsky.social @hessianai.bsky.social
Association for Uncertainty in AI.
Upcoming conference: #uai2025 July 21-25th in Rio de Janeiro, Brazil 🇧🇷!
https://auai.org/uai2025
Ezra Klein's tweets, articles and podcasts on bluesky.
Research scientist at Anthropic.
PhD in machine learning from the University of Toronto and Vector Institute.
Prev: NVIDIA, Google
Researching Artificial General Intelligence Safety, via thinking about neuroscience and algorithms, at Astera Institute. https://sjbyrnes.com/agi.html
I make sure that OpenAI et al. aren't the only people who are able to study large scale AI systems.
Watch the SB-1047 Documentary on Youtube: https://youtu.be/JQ8zhrsLxhI
CEO foresight institute, advancing bio, nano, neuro, ai, space for futures of existential xhope.
Scruting matrices @ Apollo Research
Aspiring 10x reverse engineer at Google DeepMind