Passive might already have a different meaning in RL (learning from data generated by a different agent's learning trajectory) arxiv.org/abs/2110.14020
14.03.2025 23:57

@ryanpsullivan.bsky.social
PhD Candidate at the University of Maryland researching reinforcement learning and autocurricula in complex, open-ended environments. Previously RL intern @ SonyAI, RLHF intern @ Google Research, and RL intern @ Amazon Science
My interpretation of those stats is that AI writes 90% of low-entropy code. A lot of code is boilerplate, and LLMs are great at writing it. People probably still write 90% and (should) review 100% of meaningful code.
14.03.2025 23:52

"As researchers, we tend to publish only positive results, but I think a lot of valuable insights are lost in our unpublished failures."
New blog post: Getting SAC to Work on a Massive Parallel Simulator (part I)
araffin.github.io/post/sac-mas...
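(Not from the post itself: a minimal Stable-Baselines3 sketch of the kind of setup the title describes, SAC trained against a vectorized environment. Pendulum-v1 stands in for a massively parallel simulator, and the environment count and hyperparameters are illustrative placeholders.)

```python
# Minimal sketch (not from the blog post): SAC on a vectorized environment
# with Stable-Baselines3. Pendulum-v1 stands in for a massively parallel
# simulator; n_envs and the hyperparameters are illustrative placeholders.
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

# Several copies of the environment stepped together behind one VecEnv interface.
vec_env = make_vec_env("Pendulum-v1", n_envs=8)

model = SAC(
    "MlpPolicy",
    vec_env,
    train_freq=1,      # collect one step from every env, then update
    gradient_steps=8,  # roughly match the number of parallel envs
    verbose=1,
)
model.learn(total_timesteps=100_000)
```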
Thanks for sharing this! It's unfortunate that this type of work is so heavily disincentivized. Solving hard problems that push the field forward takes much longer, starts off with a lot of negative results, and rarely has any obvious novelty. But in the long run it helps everyone do better research
10.03.2025 10:10

I'm heading to AAAI to present our work on multi-objective preference alignment for DPO from my internship with GoogleAI. If anyone wants to chat about RLHF, RL in games, curriculum learning, or open-ended environments, please reach out!
26.02.2025 20:29

Looking for a principled evaluation method for ranking *general* agents or models, i.e. ones that get evaluated across a myriad of different tasks?
I'm delighted to tell you about our new paper, Soft Condorcet Optimization (SCO) for Ranking of General Agents, to be presented at AAMAS 2025! 🧵 1/N
Let's meet halfway, machine god that is content to install CUDA and debug async code for me.
12.02.2025 11:11

We released the OLMo 2 report! Ready for some more RL curves?
This time, we applied RLVR iteratively! Our initial RLVR checkpoint on the RLVR dataset mix shows a low GSM8K score, so we did another round of RLVR on GSM8K only and another on MATH only.
And it works! A thread 🧵 1/N
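(An illustrative skeleton, not the actual OLMo 2 training code: the shape of iterative RLVR, where each round of RL against a programmatically verifiable reward starts from the previous round's checkpoint and narrows to a single dataset. The function and dataset names are placeholders.)

```python
# Illustrative skeleton only (not the OLMo 2 training code): the structure of
# iterative RLVR. Each round of RL against a verifiable reward starts from the
# previous round's checkpoint; names and the training stub are placeholders.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward from an automatic check (exact match on the final
    answer) instead of a learned reward model."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def rlvr_round(checkpoint: str, dataset_name: str) -> str:
    """Stand-in for one round of RL fine-tuning (e.g. PPO) against
    verifiable_reward on dataset_name; returns the new checkpoint name.
    The actual training loop is omitted."""
    return f"{checkpoint} -> rlvr({dataset_name})"

# Round 1: RLVR on the mixed dataset; suppose GSM8K scores come back low.
ckpt = rlvr_round("dpo_checkpoint", "rlvr_mix")
# Rounds 2 and 3: further RLVR on single datasets, starting from the latest checkpoint.
ckpt = rlvr_round(ckpt, "gsm8k")
ckpt = rlvr_round(ckpt, "math")
print(ckpt)
```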
I think it's interesting because it shouldn't be possible, even with a really unreasonable compute budget. It would imply that PPO can solve pretty much any problem with enough funding, which I don't think is true. Beating NetHack efficiently is of course more useful and interesting.
06.01.2025 18:21

Nothing can yet, but the best RL baseline for NetHack is (asynchronous) PPO
06.01.2025 18:02

My recurrent refrain of the year is to really use the environments in pufferlib. There's no reason not to have your environments run at a million FPS on a single CPU core: github.com/PufferAI/Puf...
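(A generic sanity check, not pufferlib's API: a plain Gymnasium loop that measures raw single-core environment throughput with random actions, which is the number the post is talking about.)

```python
# Quick-and-dirty throughput check (generic Gymnasium, not pufferlib's API):
# measure raw environment steps per second on a single core with random actions.
import time
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

n_steps = 200_000
start = time.perf_counter()
for _ in range(n_steps):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
elapsed = time.perf_counter() - start

print(f"{n_steps / elapsed:,.0f} env steps/sec")
```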
09.12.2024 14:47

Thank you! If you end up trying it out, let me know. I'm happy to answer any questions.
05.12.2024 16:21

I have a lot more experiments from working on Syllabus, so I'll share more of those over the next few weeks. Now is probably a good time to mention I'm also looking for industry or postdoc positions starting in Fall 2024, so if you're working on anything RL-related, let me know!
05.12.2024 16:13

Syllabus opens up a ton of low-hanging fruit in CL. I'm still working on this and actively using it for my research, so if you're interested in contributing, please feel free to reach out!
Paper: arxiv.org/abs/2411.11318
GitHub: github.com/RyanNavillus...
I'd like to thank my collaborators @ryan-pgd.bsky.social, Ameen Ur Rehman, Xinchen Yang, Junyun Huang, Aayush Verma, Nistha Mitra, and John P. Dickerson, as well as @minqi.bsky.social, @samvelyan.com, and Jenny Zhang, for their valuable feedback and answers to my many implementation questions.
05.12.2024 16:12

We have implementations of Prioritized Level Replay, a learning progress curriculum, and Prioritized Fictitious Self-Play, plus several tools for manually designing curricula, like simulated annealing and sequential curricula. Stay tuned for more methods in the very near future!
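(Not Syllabus's actual API, just the idea behind two of the simpler manual tools mentioned above: a sequential curriculum that switches tasks at fixed step thresholds, and a linearly annealed difficulty parameter in the spirit of the annealing tools. All task names and thresholds are illustrative.)

```python
# Not Syllabus's API; illustrative only. A sequential curriculum switches tasks
# at fixed step thresholds, and a single difficulty knob is annealed over training.

def sequential_curriculum(step: int, stages: list[tuple[int, str]]) -> str:
    """Return the task for the current training step; `stages` is a list of
    (switch_after_step, task_name) pairs in increasing order."""
    for threshold, task in stages:
        if step < threshold:
            return task
    return stages[-1][1]  # stay on the final task

def annealed_difficulty(step: int, total_steps: int,
                        start: float = 0.1, end: float = 1.0) -> float:
    """Linearly anneal a scalar difficulty parameter from start to end."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

stages = [(100_000, "easy_maze"), (300_000, "medium_maze"), (10**9, "hard_maze")]
for step in (0, 150_000, 500_000):
    print(step, sequential_curriculum(step, stages),
          round(annealed_difficulty(step, 1_000_000), 2))
```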
05.12.2024 16:12

These portable implementations of CL methods work with nearly any RL library, meaning that you only need to implement the method once to guarantee that the same CL code is being used in every project. This minimizes the risk of implementation errors and promotes reproducibility.
05.12.2024 16:12

Most importantly, it's extremely easy to use! You add a synchronization wrapper to your environments and your curriculum, plus a little more configuration, and it just works. For most methods, you don't need to make any changes to the actual training logic.
05.12.2024 16:12

Syllabus helps researchers study CL in complex, open-ended environments without having to write new multiprocessing infrastructure. It uses a separate multiprocessing channel between the curriculum and environments to directly send new tasks and receive feedback.
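(A toy illustration of the design described above, not Syllabus's implementation: the curriculum exchanges messages with environment workers over its own multiprocessing queues, pushing new tasks out and pulling episode feedback in. The sampling rule and the "environment" are stand-ins.)

```python
# Toy illustration of the described design, not Syllabus's implementation:
# the curriculum talks to environment workers over dedicated multiprocessing
# queues, sending tasks out and receiving episode feedback back.
import random
import multiprocessing as mp

def env_worker(task_queue, feedback_queue):
    """Stand-in for a wrapped environment process: pull a task, 'run' an
    episode on it, and report the episode return back to the curriculum."""
    while True:
        task = task_queue.get()
        if task is None:          # shutdown signal
            break
        episode_return = random.random() * task   # fake rollout
        feedback_queue.put((task, episode_return))

def run_curriculum(num_workers=2, num_episodes=20):
    task_queue, feedback_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=env_worker, args=(task_queue, feedback_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()

    tasks = [1, 2, 3]
    avg_return = {t: 0.0 for t in tasks}   # toy curriculum state
    for _ in range(num_episodes):
        # Toy sampling rule: prefer the task with the lowest observed return so far.
        task = min(tasks, key=lambda t: avg_return[t] + 0.1 * random.random())
        task_queue.put(task)
        done_task, episode_return = feedback_queue.get()
        avg_return[done_task] = 0.9 * avg_return[done_task] + 0.1 * episode_return

    for _ in workers:                      # tell every worker to exit
        task_queue.put(None)
    for w in workers:
        w.join()
    print(avg_return)

if __name__ == "__main__":
    run_curriculum()
```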
05.12.2024 16:11

As a result, CL research often focuses on relatively simple environments, despite the existence of challenging benchmarks like NetHack, Minecraft, and Neural MMO. Unsurprisingly, many of the methods developed in simpler environments won't work as well on more complex domains.
05.12.2024 16:11

CL is a powerful tool for training general agents, but it requires features that aren't supported by popular RL libraries. This makes it difficult to evaluate CL methods with new RL algorithms or in complex environments that require advanced RL techniques to solve.
05.12.2024 16:11

Have you ever wanted to add curriculum learning (CL) to an RL project but decided it wasn't worth the effort?
I'm happy to announce the release of Syllabus, a library of portable curriculum learning methods that work with any RL code!
github.com/RyanNavillus...
Another awesome iteration of Genie! I fully agree with training generalist agents in simulation like this, though I believe in using real games to teach long-term strategies. Still, it's easy to see how SIMA and Genie will continue to improve, and maybe even give us a true foundation model for RL.
04.12.2024 19:55

I translated Arrow's impossibility theorem to find flaws in popular tourney formats, which was moderately helpful for my project. I wasn't able to take those ideas any further, but I found the connection fascinating. It's awesome to see those ideas developed into a practical evaluation algorithm.
28.11.2024 02:50

This is one of my favorite lines of work in RL. When I was starting my PhD, I was working on a multi-agent evaluation problem, having just finished a "voting math" class my last semester at Purdue. I scribbled some notes about how games in a tournament could be viewed as votes… 1/2
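(A toy version of the "games as votes" idea, not the Soft Condorcet Optimization method from the paper: treat each game result as a ballot for the winner and rank agents by a Copeland-style count of head-to-head records won. The match data is made up.)

```python
# Toy "games as votes" example, not the SCO method from the paper: each game
# result is a ballot for the winner; agents are ranked by how many opponents
# they beat in head-to-head records (a Copeland-style score). Data is made up.
from collections import defaultdict
from itertools import combinations

games = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C"),
         ("C", "B"), ("C", "B"), ("A", "C"), ("C", "A")]   # (winner, loser) per game

wins = defaultdict(int)
for winner, loser in games:
    wins[(winner, loser)] += 1

agents = sorted({agent for game in games for agent in game})
copeland = {agent: 0 for agent in agents}
for a, b in combinations(agents, 2):
    if wins[(a, b)] > wins[(b, a)]:
        copeland[a] += 1          # a wins the head-to-head record against b
    elif wins[(b, a)] > wins[(a, b)]:
        copeland[b] += 1

ranking = sorted(agents, key=lambda agent: -copeland[agent])
print(ranking, copeland)
```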
28.11.2024 02:49

I just got here, thanks @rockt.ai for putting together an open-endedness starter pack! If there's anyone else working on exploration, curriculum learning, or open-ended environments, leave a reply so I can follow you!
I'll be sharing some cool curriculum learning work in a few days, stay tuned!