Genie 3 and the future of neural game engines
Google DeepMind just announced Genie 3, their new promptable world model, which is another term for neural game engine. This is a big neura...
@togelius.bsky.social has thoughts on Genie 3 and games togelius.blogspot.com/2025/08/geni...
Fairly close to my own, though I didn't get to preview the tech.
Walking around a generated image-to-image world is not the same as playing a game. There are no game objectives.
05.08.2025 20:13 · ❤ 16 🔁 3 💬 4 📌 0
diagram from Anthropic paper with an icon & label that says "subtract evil vector"
quick diagram of Bluesky's architecture and why it's nicer here
02.08.2025 23:19 · ❤ 73 🔁 5 💬 4 📌 0
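For context on the "subtract evil vector" label in the first diagram: Anthropic-style steering work edits a model's hidden activations along a learned trait direction. A minimal NumPy sketch of that idea; the function name, shapes, and the way the direction is obtained are illustrative assumptions, not Anthropic's actual code:

```python
import numpy as np

def steer_activations(hidden, trait_vector, alpha=1.0):
    """Shift hidden activations away from a trait direction (sketch).

    hidden:       (seq_len, d_model) activations from one layer
    trait_vector: (d_model,) direction associated with the trait, e.g.
                  obtained by contrasting activations on trait-exhibiting
                  vs. neutral prompts (assumed, not Anthropic's recipe)
    alpha:        steering strength; alpha=1 removes the component entirely
    """
    v = trait_vector / np.linalg.norm(trait_vector)
    # Subtract the (scaled) projection onto the trait direction
    # from every sequence position.
    return hidden - alpha * (hidden @ v)[:, None] * v

# Toy usage with random activations and a random "evil" direction.
rng = np.random.default_rng(0)
h = rng.normal(size=(16, 512))
v_evil = rng.normal(size=512)
h_steered = steer_activations(h, v_evil, alpha=1.0)
```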
*Test-Time
01.08.2025 21:38 · ❤ 0 🔁 0 💬 0 📌 0
Anthropic researchers discover the weird AI problem: Why thinking longer makes models dumber
Anthropic research reveals AI models perform worse with extended reasoning time, challenging industry assumptions about test-time compute scaling in enterprise deployments.
Anthropic research identifies "inverse scaling in test-time compute," where longer reasoning degrades AI performance. On certain tasks, models become more distracted by irrelevant data or overfit to spurious correlations.
#MLSky
23.07.2025 16:24 · ❤ 10 🔁 1 💬 1 📌 0
Supermassive congrats to Giwon Hong (@giwonhong.bsky.social) for the amazing feat! 🎉
31.07.2025 00:47 · ❤ 2 🔁 1 💬 0 📌 0
Still not as bad as Microsoft Teams
26.07.2025 21:24 · ❤ 745 🔁 177 💬 12 📌 5
The amazing folks at EdinburghNLP will be presenting a few papers at ACL 2025 (@aclmeeting.bsky.social); if you're in Vienna, touch base with them!
26.07.2025 09:48 · ❤ 11 🔁 0 💬 0 📌 0
Hm, hard disagree here. I really fail to see how this is misconduct akin to bribery; it's just a defense mechanism against bad reviewing practices. @neuralnoise.com
24.07.2025 05:45 · ❤ 5 🔁 2 💬 1 📌 0
🚨 New Paper 🚨
How effectively do reasoning models reevaluate their thoughts? We find that:
- Models excel at identifying unhelpful thoughts but struggle to recover from them
- Smaller models can be more robust
- Self-reevaluation ability is far from true meta-cognitive awareness
1/N 🧵
13.06.2025 16:15 · ❤ 11 🔁 3 💬 1 📌 0
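A concrete way to read "struggle to recover from them": plant an unhelpful thought in the reasoning prefix and score whether the final answer survives. A minimal probe sketch under that assumption; query_model and the prompt format are hypothetical stand-ins, not the paper's harness:

```python
def recovery_rate(problems, unhelpful_thought, query_model):
    """Score how often a model recovers from a planted bad thought.

    problems:          list of (question, expected_answer) pairs
    unhelpful_thought: distractor text injected into the reasoning prefix
    query_model:       hypothetical client returning the model's reply
    """
    recovered = 0
    for question, expected in problems:
        # Plant the unhelpful thought as if the model had produced it,
        # then let the model continue and answer.
        prompt = (
            f"{question}\n\n"
            f"Reasoning so far: {unhelpful_thought}\n"
            "Continue reasoning and give a final answer."
        )
        if expected in query_model(prompt):
            recovered += 1
    return recovered / len(problems)
```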
Three panels at the top describe task types with example prompts:
1. Simple Counting Tasks with Distractors (Misleading Math & Python): prompts mention an apple and an orange, with irrelevant or confusing information (e.g., a probabilistic riddle, Python code) added before the straightforward question: "Calculate how many fruits you have."
2. Regression Tasks with Spurious Features (Grades Regression): given XML-style records about a student, the model must predict grades from features like sleep hours, social hours, and stress level. The challenge lies in identifying relevant vs. spurious attributes.
3. Deduction Tasks with Constraint Tracking (Zebra Puzzles): complex logical reasoning puzzles with multiple interrelated clues. Example: "What position is the person who likes salmon at?" Constraints involve foods, names, and relations like "to the left of."
The bottom row contains three line plots comparing model performance across tasks:
• Misleading Math (left plot): accuracy drops sharply for some models as reasoning tokens increase. Claude Sonnet 4 maintains high performance; o3 and DeepSeek R1 hold relatively stable accuracy; Qwen3 32B and QwQ 32B drop more.
• Grades Regression (middle plot): shows negative RMSE (higher is better). Claude models remain strong across token counts; o3 also performs well. Qwen3 and QwQ struggle, with DeepSeek R1 performing modestly.
• Zebra Puzzles (right plot): accuracy vs. average reasoning tokens. o3 and Claude Sonnet 4 maintain the highest performance; other models (e.g., DeepSeek R1, Qwen3 32B, QwQ 32B) degrade or plateau. Error bars reflect variability.
Each plot uses colored lines with markers to indicate the different models.
Inverse scaling of reasoning models
a research collab demonstrated that there are certain types of tasks where all top reasoning models do WORSE the longer they think
things like getting distracted by irrelevant info, spurious correlations, etc.
www.arxiv.org/abs/2507.14417
22.07.2025 20:01 · ❤ 21 🔁 2 💬 2 📌 0
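The measurement behind that claim is simple to sketch: fix the tasks, sweep the reasoning-token budget, and track accuracy. A minimal harness, assuming a hypothetical query_model(prompt, max_reasoning_tokens) client rather than any particular API:

```python
def accuracy_vs_budget(tasks, budgets, query_model):
    """Measure accuracy at each reasoning-token budget.

    tasks:   list of (prompt, correct_answer) pairs, e.g. counting
             questions padded with irrelevant riddles or Python code
    budgets: reasoning-token limits to sweep, e.g. [256, 1024, 4096]
    """
    results = {}
    for budget in budgets:
        correct = sum(
            answer in query_model(prompt, max_reasoning_tokens=budget)
            for prompt, answer in tasks
        )
        results[budget] = correct / len(tasks)
    # Inverse scaling shows up as accuracy falling as the budget grows.
    return results
```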
Reasoning is about variable binding. It's not about information retrieval. If a model cannot do variable binding, it is not good at grounded reasoning, and there's evidence accruing that large scale can make LLMs worse at in-context grounded reasoning. 🧵
12.06.2025 17:12 · ❤ 53 🔁 9 💬 4 📌 2
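A toy illustration of what "variable binding" demands in-context: chain assignments so the answer never sits next to the queried name, forcing the model to follow bindings rather than retrieve surface matches. A small generator sketch; the probe design is illustrative, not taken from the thread's evidence:

```python
import random

def make_binding_probe(n_vars=5, seed=0):
    """Build a chained-assignment prompt like:
       a = 7
       b = a
       c = b
       What is the value of c?
    Answering requires following the binding chain, not retrieval."""
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(n_vars)]
    value = rng.randint(0, 99)
    lines = [f"{names[0]} = {value}"]
    for prev, cur in zip(names, names[1:]):
        lines.append(f"{cur} = {prev}")
    prompt = "\n".join(lines) + f"\nWhat is the value of {names[-1]}?"
    return prompt, str(value)
```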
Hi @ilsebyl.bsky.social, welcome to bsky! 🎉🎉🎉
22.07.2025 11:02 · ❤ 2 🔁 0 💬 1 📌 0
Paper page - Inverse Scaling in Test-Time Compute
Sometimes, too much reasoning can hurt model performance! New research from Anthropic (@anthropic.com) by Aryo Pradipta Gema (@aryopg.bsky.social) et al.: huggingface.co/papers/2507....
22.07.2025 10:44 · ❤ 5 🔁 0 💬 0 📌 0
"LLMs can't reason"
21.07.2025 21:52 · ❤ 5 🔁 0 💬 0 📌 0
My "Math, Revealed" series is freely available to anyone -- no paywall! -- in the thread below.
04.07.2025 00:07 · ❤ 136 🔁 53 💬 7 📌 5
There are a few more for another prompt and that's it.
11.07.2025 20:01 · ❤ 1 🔁 0 💬 0 📌 0
Spotlight poster coming soon at #ICML2025
@icmlconf.bsky.social!
📍 East Exhibition Hall A-B E-1806
🗓️ Wed 16 Jul, 4:30 p.m. PDT to 7 p.m. PDT
📄 arxiv.org/pdf/2410.12537
Letβs chat! Iβm always up for conversations about knowledge graphs, reasoning, neuro-symbolic AI, and benchmarking.
10.07.2025 09:00 · ❤ 11 🔁 2 💬 1 📌 2
What Counts as Discovery?
Rethinking AI's Place in Science
This essay by Nisheeth Vishnoi is a thoughtful meditation on the nature of science and a rebuttal to the notion that AI systems are going to replace human scientists anytime soon. Worth reading.
nisheethvishnoi.substack.com/p/what-count...
05.07.2025 16:23 · ❤ 75 🔁 12 💬 4 📌 1
"in 2025 we will have flying cars" πππ
05.07.2025 16:17 · ❤ 405 🔁 92 💬 9 📌 35
Flowchart of the AXIS algorithm with 5 parts. The top-left has the memory, the centre-left has the user query, the centre-bottom has the final explanation, the centre has the LLM, and the right has the multi-agent simulator.
Screenshot of the arXiv paper
Preprint alert 🚨 Introducing the Agentic eXplanations via Interrogative Simulations (AXIS) algorithm.
AXIS integrates multi-agent simulators with LLMs by having the LLMs interrogate the simulator with counterfactual queries over multiple rounds for explaining agent behaviour.
arxiv.org/pdf/2505.17801
30.05.2025 14:35 · ❤ 8 🔁 1 💬 0 📌 0
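From the flowchart and summary above, the heart of AXIS is an interrogation loop: the LLM proposes counterfactuals, the simulator replays them, and findings accumulate into an explanation. A minimal sketch of that loop; llm, simulator.run, and the prompt strings are assumed interfaces, not the paper's actual ones:

```python
def axis_explain(user_query, llm, simulator, memory, n_rounds=5):
    """Interrogative-simulation loop, sketched from the paper's figure:
    the LLM asks counterfactual 'what if' queries, the multi-agent
    simulator answers them, and the findings accumulate in memory."""
    for _ in range(n_rounds):
        counterfactual = llm(
            f"Question: {user_query}\nFindings so far: {memory}\n"
            "Propose one counterfactual intervention to test."
        )
        # Re-simulate the scenario under the proposed intervention.
        outcome = simulator.run(counterfactual)
        memory.append((counterfactual, outcome))
    return llm(
        f"Question: {user_query}\nFindings: {memory}\n"
        "Write the final explanation of the agent's behaviour."
    )
```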
'AI Safety for Everyone' is out now in @natmachintell.nature.com! Through an analysis of 383 papers, we find a rich landscape of methods that cover a much larger domain than mainstream notions of AI safety. Our takeaway: epistemic inclusivity is important, the knowledge is there, we only need to use it.
17.04.2025 14:44 · ❤ 13 🔁 3 💬 1 📌 0
Can you train a performant language model using only openly licensed text?
We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
06.06.2025 19:18 · ❤ 147 🔁 61 💬 2 📌 3
COLM (@colmweb.org) reviewers, please follow up on author responses if you need to! Most of the papers in my area chair batch didn't receive reviewer follow-ups, and it's dire.
04.06.2025 07:05 · ❤ 6 🔁 2 💬 0 📌 0
Hi @veredshwartz.bsky.social!!! 👋
27.05.2025 08:47 · ❤ 5 🔁 0 💬 1 📌 1
Yeah but do we need the APIs if agents can just use the browser?
25.05.2025 21:56 · ❤ 1 🔁 0 💬 1 📌 0
claude-code is pretty good at updating personal websites! it has browser use, so it can e.g. scrape your latest papers from arxiv and dblp and use that to update your website's publication list
25.05.2025 14:24 · ❤ 5 🔁 0 💬 2 📌 0
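For the arXiv half of that workflow, browser use isn't strictly necessary: the public arXiv Atom API returns paper metadata directly. A minimal standard-library sketch; the author query and result cap are placeholder choices:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_arxiv_titles(author):
    """Fetch recent paper titles for an author from the public arXiv API."""
    query = urllib.parse.urlencode({
        "search_query": f'au:"{author}"',
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": 20,
    })
    with urllib.request.urlopen(f"{ARXIV_API}?{query}") as resp:
        feed = ET.fromstring(resp.read())
    ns = {"atom": "http://www.w3.org/2005/Atom"}
    return [entry.findtext("atom:title", namespaces=ns).strip()
            for entry in feed.findall("atom:entry", ns)]
```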
"You must never be fearful about what you are doing when it is right." -- Rosa Parks
25.05.2025 08:23 · ❤ 5 🔁 0 💬 0 📌 0
🗓️ Deadline extended: 🔥 2nd June 2025! 🔥
We are looking forward to your works on:
- #circuits and #tensor #networks
- normalizing #flows
- scaling #NeSy #AI
- fast and #reliable inference
...& more!
Please share!
24.05.2025 12:13 · ❤ 19 🔁 12 💬 0 📌 1
Self-hyping!
23.05.2025 15:20 · ❤ 2 🔁 0 💬 0 📌 0
This is a new experience: people using AI to overhype your paper
@neuralnoise.com
22.05.2025 06:30 · ❤ 10 🔁 1 💬 1 📌 1
Civ mil, natsec, polarization and domestic politics | Views own, not USG
Like all the men of Babylon, I have been proconsul; like all, a slave; I have also known omnipotence, opprobrium, imprisonment.
very sane ai newsletter: verysane.ai
Visiting Student at Princeton University | PhD Student at Uppsala University | Working on learning to optimize & graph neural networks
Assistant Professor of Computer Graphics and Geometry Processing at Columbia University www.silviasellan.com
I run AI Plans, an AI Safety lab focused on solving AI Alignment before 2029.
For several weeks I used a stone for a pillow.
I once spent a quarter of my paycheck on cheese.
Ping me! DM me (not working atm due to totalitarian UK law)!
SurpassAI
researcher studying privacy, security, reliability, and broader social implications of algorithmic systems.
website: https://kulyny.ch
PhD @Stanford working w Noah Goodman
Studying in-context learning and reasoning in humans and machines
Prev. @UofT CS & Psych
I study machine listening methods for bioacoustics and automated sensing of natural environments. And I enjoy natural environments.
https://johnmartinsson.org/
Core member of Climate AI Nordics | ML researcher at RISE
Head of LanD research group at FBK - Italy | NLP for Social Good.
Assistant Professor at @cs.ubc.ca and @vectorinstitute.ai working on Natural Language Processing. Book: https://lostinautomatictranslation.com/
Senior Researcher in AI for Biotech & @eml-munich.bsky.social | Prev. SR at Google DeepMind | PhD in ML and NeuroAI from @tuberlin.bsky.social @bifold.berlin @mpicbs.bsky.social | Representations in 🧠 and 🤖 | #FirstGen
💻: https://lukasmut.github.io/
CS PhD @umdclip
Multilingual / Culture #NLProc, MT
https://dayeonki.github.io/
NLP research engineer at Barcelona Supercomputing Center | Machine translation
https://javi897.github.io/
PhD Student in AI for Society at University of Pisa
Responsible NLP; XAI; Fairness; Abusive Language
Member of Privacy Network
she, her
martamarchiori.github.io
Ph.D. Postdoc@USC | Best USC Viterbi RA | Ex-intern@ Amazon, Meta | Interests: Human understanding, trustworthy computing, speech, multimodal, and wearable sensing | Love sports and music.