 
                                                
    
    
    
    
            As Transactions on Machine Learning Research (TMLR) grows in number of submissions, we are looking for more reviewers and action editors. Please sign up! 
You review only one paper at a time, with at most 6 per year, and reviewers report greater satisfaction than when reviewing for conferences!
               
            
            
                14.10.2025 13:32
            
         
            
        
            
            
            
            
            
    
    
            
                        
        
    
    
📣 Registration for EWRL is now open 📣
Register now and join us in Tübingen for 3 days (17th-19th September) full of inspiring talks, posters, and many social activities to push the boundaries of the RL community!
               
            
            
                13.08.2025 17:02
            
         
            
        
            
            
            
            
            
    
    
    
    
Looks interesting, but I cannot access the URL or find the report anywhere.
               
            
            
                29.07.2025 06:54
            
         
            
        
            
            
            
            
            
    
    
    
    
That's my little #ICML2025 convex RL roundup!
If you know of other cool work in this space (or are working on some), feel free to reply and share.
Hope to see even more work on convex RL variations!
n/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Flow density control – @desariky.bsky.social et al.
Bridging convex RL with generative models: how to steer diffusion/flow models to optimize non-linear, user-specified utilities (beyond just entropy-regularized fine-tuning)?
EXAIT workshop
openreview.net/pdf?id=zOgAx...
7/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Towards unsupervised multi-agent RL – @ricczamboni.bsky.social et al. (yours truly!)
Still in the convex Markov games space: this work explores more tractable objectives for the learning setting.
EXAIT workshop
https://openreview.net/pdf?id=A1518D1Pp9
6/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Convex Markov games – Ian Gemp et al.
If you can 'convexify' MDPs, you can do the same for Markov games.
These two papers lay out a general framework + algorithms for the zero-sum version.
https://openreview.net/pdf?id=yIfCq03hsM
https://openreview.net/pdf?id=dSJo5X56KQ
5/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
The number of trials matters in infinite-horizon MDPs – @pedrosantospps.bsky.social et al.
A deeper look at how the number of realizations used to compute F affects the convex RL problem in infinite-horizon settings.
https://openreview.net/pdf?id=I4jNAbqHnM
4/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Online episodic convex RL – Bianca Marni Moreno et al.
Regret bounds for online convex RL, where F^t is adversarial and revealed only after each episode (or just evaluated on the given trajectory in a bandit-feedback variation).
https://openreview.net/pdf?id=d8xnwqslqq
3/n
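A hedged sketch of what that setting presumably looks like (my paraphrase, not the paper's exact formulation): in each episode $t$ the learner deploys a policy $\pi_t$, the adversary only then reveals $F^t$, and performance is measured by regret against the best fixed policy in hindsight,
\begin{align*}
	\text{Regret}(T) = \sum_{t=1}^{T} F^t(d^{\pi_t}) - \min_{\pi} \sum_{t=1}^{T} F^t(d^{\pi}).
\end{align*}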
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Convex RL
Standard RL optimizes a linear objective: ⟨d^π, r⟩.
Convex RL generalizes this to any F(d^π), where F is non-linear (originally assumed convex, hence the name).
This framework subsumes:
• Imitation
• Risk sensitivity
• State coverage
• RLHF
...and more.
2/n
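To make the distinction concrete, here is a minimal numerical sketch (all numbers made up, nothing from a specific paper): it computes the normalized discounted state-occupancy d^π of a fixed policy on a toy 3-state chain, then evaluates a linear reward objective next to a non-linear entropy objective of the kind convex RL allows.

import numpy as np

gamma = 0.9
# P_pi[s, s'] = state-transition probabilities under a fixed policy pi (toy numbers).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
mu0 = np.array([1.0, 0.0, 0.0])  # initial state distribution
r = np.array([1.0, 0.5, 0.0])    # per-state reward

# Normalized discounted occupancy: d^pi = (1 - gamma) * mu0^T (I - gamma * P_pi)^{-1}.
d_pi = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, mu0)

linear_objective = d_pi @ r                     # standard RL: <d^pi, r>
entropy = -np.sum(d_pi * np.log(d_pi + 1e-12))  # a non-linear F(d^pi), e.g. for state coverage
print(linear_objective, entropy)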
               
            
            
                24.07.2025 13:09
            
         
            
        
            
            
            
            
            
    
    
    
    
Walking around posters at @icmlconf.bsky.social, I was happy to see some buzz around convex RL, a topic I've worked on and strongly believe in.
Thought I'd share a few ICML papers in this direction. Let's dive in!
But first… what is convex RL?
🧵
1/n
               
            
            
                24.07.2025 13:09
            
         
            
        
            
        
            
            
            
            
            
    
    
    
    
            This is how we got "A classification view on meta learning bandits", a joint work with awesome collaborators Jeongyeol, Shie, and @aviv-tamar.bsky.social 
7/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
            The regret bounds depend on an instance-dependent "classification coefficient", which suggests classification really captures the complexity of the problem rather than being a mere implementation tool
6/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
For the latter, we show exploration is *interpretable*, as it is implemented by a shallow decision tree of simple constant-action policies, and *efficient*, giving upper/lower bounds on the regret
5/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
            Yes, apparently! 
A simple algorithm that classifies the latent (condition) with a decision tree (img above right) and then exploits the best action for the classified latent does the job
4/n
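For flavor, a toy sketch of that classify-then-exploit idea (illustrative only: the data, the model of test outcomes, and the use of sklearn are my assumptions, not the paper's actual procedure or guarantees).

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_latents, n_tests, n_actions = 3, 4, 5

# Hypothetical model: each latent condition yields noisy test outcomes
# and has a single best treatment (action).
test_means = rng.uniform(0, 1, (n_latents, n_tests))
best_action = rng.integers(n_actions, size=n_latents)

# Phase 1: fit a shallow decision tree mapping test outcomes -> latent.
latents = rng.integers(n_latents, size=500)
outcomes = rng.normal(test_means[latents], 0.1)
tree = DecisionTreeClassifier(max_depth=3).fit(outcomes, latents)

# Phase 2: on a new patient, run the tests, classify the latent, then
# commit to the best action for the classified latent.
true_latent = rng.integers(n_latents)
observed = rng.normal(test_means[true_latent], 0.1)
predicted = tree.predict(observed.reshape(1, -1))[0]
print(best_action[predicted], best_action[true_latent])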
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
Humans typically develop a standard strategy prescribing a sequence of tests to diagnose the condition before committing to the best treatment (see img left). Can we design a bandit algorithm that learns similarly interpretable exploration but is also provably efficient?
3/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
            Think about a setting in which we aim to converge on the best treatment (action) for a given patient (context) with some unknown condition (latent). The difference between how humans and bandits approach this same problem is striking:
2/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
            Would you trust a bandit algorithm to make decisions on your health or investments? Common exploration mechanisms are efficient but scary.
In our latest work at @icmlconf.bsky.social, we reimagine bandit algorithms to get *efficient* and *interpretable* exploration.
A 🧵 below
1/n
               
            
            
                15.07.2025 15:50
            
         
            
        
            
            
            
            
            
    
    
    
    
Here we have an original take on how to make the best of parallel data collection for RL. Don't miss the poster at ICML; we're curious to hear what y'all think!
Kudos to the awesome students Vincenzo and @ricczamboni.bsky.social for their work under the wise supervision of Marcello.
               
            
            
                09.07.2025 13:53
            
         
            
        
            
            
            
            
                                                 
                                            First, we claim that there exists a unique value function $\Vopt$ that satisfies the following equation: For any $x \in \XX$, we have
\begin{align*}
	\Vopt(x) =
	\max_{a \in \AA} \left \{ r(x,a) + \gamma \int \PKernel(\dx' | x, a) \Vopt(x') \right \}.
\end{align*}
This claim alone, however, does not show that this $\Vopt$ is the same as $V^\piopt$.
The second claim is that $\Vopt$ is indeed the same as $V^{\piopt}$, the optimal value function when $\pi$ is restricted to be within the space of stationary policies.
This claim alone, however, does not preclude the possibility that we can find an even more performant policy by going beyond the space of stationary policies.
The third claim is that for discounted continuing MDPs, we can always find a stationary policy that is optimal within the space of all stationary and non-stationary policies.
These three claims together show that the Bellman optimality equation reveals the recursive structure of the optimal value function $\Vopt = V^{\piopt}$. There is no policy, stationary or non-stationary, with a value function better than $\Vopt$, for the class of discounted continuing MDPs.
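A quick way to see the first claim in action is value iteration, whose updates apply the Bellman optimality operator until the unique fixed point is reached. Below is a minimal sketch on a made-up 2-state, 2-action MDP (not code from the book; all numbers are illustrative).

import numpy as np

gamma = 0.9
# P[a, s, s'] = transition probability; r[s, a] = reward (toy numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(2)
for _ in range(1000):
    # Bellman optimality operator: (TV)(s) = max_a { r(s,a) + gamma * sum_s' P(s'|s,a) V(s') }
    Q = r + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    done = np.max(np.abs(V_new - V)) < 1e-10  # T is a gamma-contraction, so this converges
    V = V_new
    if done:
        break

# The greedy (stationary) policy w.r.t. V* attains V*, echoing the second and third claims.
pi_star = Q.argmax(axis=1)
print(V, pi_star)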
                                                
    
    
    
    
            What do we talk about when we talk about the Bellman Optimality Equation?
If we think carefully, we are (implicitly) making three claims.
#FoundationsOfReinforcementLearning #sneakpeek
               
            
            
                08.07.2025 23:07
            
         
            
        
            
            
            
            
            
    
    
    
    
            System is so broken: 
- researchers write papers no one reads
- reviewers don't have time to review, shamed to coauthors, use LLMs instead of reading 
- authors try to fool said LLMs with prompt injection 
- evaluating researchers based on # of papers (no time to read) 
Dystopic.
               
            
            
                07.07.2025 16:15
            
         
            
        
            
            
            
            
            
    
    
    
    
            Congratulations, well deserved!
               
            
            
                03.05.2025 11:55
            
         
            
        
            
            
            
            
            
    
    
    
    
            All stick, no carrot
               
            
            
                03.05.2025 06:35
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
Mark your calendars, EWRL is coming to Tübingen!
When? September 17-19, 2025. 
More news to come soon, stay tuned!
               
            
            
                08.04.2025 08:33
            
         
            
        
            
            
            
            
            
    
    
    
    
            Just enjoyed @mircomutti.bsky.social's seminar talk about interpretable meta-learning of contextual bandit types.
The recording is available in case you missed it: youtu.be/pNos7AHGMXw
               
            
            
                08.04.2025 15:26
            
         
            
        
            
            
            
            
            
    
    
    
    
            Happening today! Join us if you want to hear about our take on interpretable exploration for multi-armed bandits.
If interested but cannot join, here's the arXiv: arxiv.org/abs/2504.04505
Joint work with Jeongyeol, Shie, and @aviv-tamar.bsky.social
               
            
            
                08.04.2025 07:43
            
         
            
        
            
            
            
            
                                                 
                                                
    
    
    
    
⏰⏰ Theory of Interpretable AI Seminar ⏰⏰
In two weeks, April 8, Mirco Mutti will talk about "A Classification View on Meta Learning Bandits"
               
            
            
                27.03.2025 11:04
            
         
            
        
            
            
            
            
            
    
    
    
    
            The right review form is:
- Summary
- Comment
- Evaluation
Curious about alternative arguments, as it looks like conferences are going in a different direction.
               
            
            
                17.03.2025 15:46
            
         
            
        
            
            
            
            
            
    
    
    
    
            Awesome! Have a look at this thread to see some nice multi-object manipulation results
               
            
            
                20.02.2025 08:20
            
         
    
         
        
            
        
                            
                    
                    
PhD student @ Princeton | ML theory
https://www.buzaglo.me/
                                     
                            
                    
                    
                                            We document the rise of #AGI and #superintelligence by curating the best #AImemes from everywhere on the internet as we accelerate towards the #singularity.
                                     
                            
                    
                    
                                            Research Scientist at Yahoo! / ML OSS developer
PhD in Computer Science at UC Irvine
Research: ML, NLP, Computer Vision, Information Retrieval
Technical Chair: #CVPR2026 #ICCV2025 #WACV2026
Open Source/Science matters!
https://yoshitomo-matsubara.net
                                     
                            
                    
                    
Liesel Beckmann Distinguished Professor of Computer Science at Technical University of Munich and Director of the Institute for Explainable ML at Helmholtz Munich
                                     
                            
                    
                    
                                            Research Fellow, University of Oxford
Theology, philosophy, ethics, politics, environmental humanities
Associate Director @LSRIOxford
Anglican Priest
https://www.theology.ox.ac.uk/people/revd-dr-timothy-howles
                                     
                            
                    
                    
Visiting Prof @ PoliMI | Games + ML
Game optimization, alg. game theory, multi-agent design, AI/ML
I also play games for fun :)
chavdarova.github.io
                                     
                            
                    
                    
PhD Student at @gronlp.bsky.social, core dev @inseq.org. Interpretability ∩ HCI ∩ #NLProc.
gsarti.com
                                     
                            
                    
                    
                                            Professor in IT @ Federation Uni. Multi-objective reinforcement learning. Human-aligned AI. Best known for the f*cking mailing list paper. Jambo & Bengals fan. https://t.co/UNoOrbGApz
                                     
                            
                    
                    
                                            Assistant Professor in Robotics and Autonomous Systems Heriot-Watt University & The National Robotarium
                                     
                            
                    
                    
                                    
                            
                    
                    
ML PhD Student @ Uni. of Edinburgh, working on Multi-Agent Problems. | Organiser @deeplearningindaba.bsky.social @rl-agents-rg.bsky.social | 🇪🇹🇿🇦
kaleabtessera.com
                                     
                            
                    
                    
                                            EurIPS is a community-organized, NeurIPS-endorsed conference in Copenhagen where you can present papers accepted at @neuripsconf.bsky.social
eurips.cc
                                     
                            
                    
                    
PhD Student @ Max Planck Institute for Intelligent Systems & University of Tübingen | Working on intrinsically motivated open-ended reinforcement learning
                                     
                            
                    
                    
PhD student at INRIA (FLOWERS team) working on LLM4code | Prev. (MVA) ENS Paris-Saclay
                                     
                            
                    
                    
PhD student at @unituebingen.bsky.social and IMPRS-IS in the "Lifelong Reinforcement Learning" group.
Organizer of @ewrl18.bsky.social and @twiml.bsky.social (Tübingen Women in Machine Learning)
                                     
                            
                    
                    
Reverse engineering neural networks at Anthropic. Previously Distill, OpenAI, Google Brain. Personal account.
                                     
                            
                    
                    
                                            Associate Professor - University of Alberta
Canada CIFAR AI Chair with Amii
Machine Learning and Program Synthesis
he/him; ele/dele 🇨🇦 🇧🇷
https://www.cs.ualberta.ca/~santanad
                                     
                            
                    
                    
                                    
                            
                    
                    
                                            dynomight.net
space invasion
                                     
                            
                    
                    
                                            PhD Student focusing on Modular ML
vinczematyas.github.io