Currently, I know more negative than positive examples. At least, you are in a position to change that.
28.10.2025 22:18 @sacha2.bsky.social
Mmmm
28.10.2025 22:10
Why do you think all the tasks aren't aggregated in one place, like the Prime Intellect hub or a Kaggle arena?
28.10.2025 21:10
Battleship arena for LLMs
another cool imperfect information task
www.gabegrand.com/battleship/
cc @sharky6000.bsky.social
Bluesky is for the American type of people
uncomfortable
same type beat
x.com/Grokipedia
Can you share a screenshot, please? I have never seen one either.
28.10.2025 13:54
le chat
28.10.2025 13:07
Felicitations
28.10.2025 12:44
Everyone has recently started to write code so fast that entire projects are being done over a weekend.
Interesting how to become better at this.
Large Action Models: From Inception to Implementation
Lu Wang, Fangkai Yang, Chaoyun Zhang et al.
Action editor: Edward Grefenstette
https://openreview.net/forum?id=bYdKtf0Q31
#agent #agents #ai
Thank you for the clarification
27.10.2025 13:00
First time seeing someone call the Simplex algorithm fast, but OK. It's like calling a hashmap or A* fast. Let's check out the paper.
27.10.2025 06:15
Nowadays, we say "chopped"
26.10.2025 16:51
31k submissions to AAAI is fucking crazy
26.10.2025 15:25
I think I posted about it before but never with a thread. We recently put a new preprint on arXiv.
Replicable Reinforcement Learning with Linear Function Approximation
arxiv.org/abs/2509.08660
In this paper, we study formal replicability in RL with linear function approximation. The... (1/6)
still true, I reckon
25.10.2025 17:02
I now understand what you're up to. However, I mistakenly thought about PPO all the time, which is slightly different and may require additional assumptions for convergence and, thus, a deliberate notion of policy.
Sorry.
Not necessary, afaik
25.10.2025 15:15
However, if you formulate the problem as a more general stochastic optimisation over a sequence of tokens, then maybe the notion of policy is unnecessary
25.10.2025 15:10
RL problem, if it could be denoted as such. For example, if one wants to fine-tune their LLM with RL, the solution of the problem would be the policy, or a function of the model
25.10.2025 15:07
When you present the results of any RL-based experiments, you need to evaluate your model to get the evaluation return. But with respect to what policy do you get the return? That's why the notion of the policy, even for language generation, should be clear
25.10.2025 15:03
Yes, but you still need to understand what the policy of the LLM is, because the optimisation has to find a policy that maximises the cumulative return...
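(A minimal sketch of the formulation being argued about here, in my own notation rather than anything quoted from the thread: treat the prompt plus the tokens generated so far as the state, the next token as the action, and the LLM itself as the policy.)

\[
\pi_\theta(y_t \mid x, y_{<t}), \qquad
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
\]

(Under this reading, the "evaluation return" is the expected reward of completions sampled from this policy, so the policy you report against is the model's own sampling distribution.)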
25.10.2025 14:57
Maybe due to the choice of words it's upsetting, but a language model can indeed act as a policy or strategy. Refer to the paper "Strings as strategies..." However, to be fair, the definition might be written better.
25.10.2025 14:51
PPO is still more than a bag of heuristics, I would argue. It might have been proposed as a heuristic approximation of the trust region method, but over time it has become a theoretically grounded procedure.
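(For reference, a sketch of the standard clipped surrogate that PPO optimises, with probability ratio \( r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t) \) and advantage estimate \( \hat{A}_t \); the trust-region idea is enforced by clipping rather than by an explicit KL constraint.)

\[
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \big) \Big]
\]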
25.10.2025 14:45
Btw have you got any opinion about "just a band" dead poets society?
25.10.2025 14:21
67
25.10.2025 12:35
Btw every masterpiece has its own cheap copy, like
youtu.be/anlghGgWuAs?...
What I won't add to the list: Ivan Dorn, lol
24.10.2025 19:59
-- Dan le Sac vs. Scroobius Pip, "Thou Shalt Always Kill"
youtu.be/CWrMGXwhFLk?...