working on a seven thousand layer model of extended claugenition
08.02.2026 22:46
@societyoftrees.bsky.social
Ithaca | prev Chicago | interested in interconnected systems and humans+computers | currently: gardening
New work by my former PhD student, Boyang Li
His team produced 500 stories of fewer than 100 words. LLMs were basically at chance level when answering binary questions about the stories
arxiv.org/abs/2601.12410
This is a real banger of a paper. The example of a model being weirdly focused on jasmine (lol) makes me increasingly think that single-point-of-access models don't really consider who their audience is. Jasmine is a super legible cultural marker for people outside, but is so, _so_ generic.
03.02.2026 16:41
This was a colossal multi-year effort driven by an incredible team that gave this everything: Marc Finzi, Shikai Qiu, Yiding Jiang, Pavel Izmailov, Zico Kolter. Much more in the paper! arxiv.org/abs/2601.03220 7/7
07.01.2026 17:27
Well this is exciting: arxiv.org/abs/2512.20605
06.01.2026 19:53
the reason I'd follow Cat Hicks into hell is this unswerving humanist conviction that actually
people are going to do the best they can
we can help them do even better
and neither avenue is served by thinking less of people
i think we are about to experience an explosion of the possibilities in reverse engineering
02.01.2026 19:38
we're at a fascinating moment where I am still ~better at programming than Claude at a medium-horizon difficulty task, but Claude has me absolutely beat in terms of cognitive fatigue so we're able to ship so much more stuff I never would've gotten around to before
02.01.2026 20:56
Great list of models in 2025
02.01.2026 17:14
I uh, made this. It was supposed to be a joke / concept-art thing that scrolls through the torrent of new AI/ML arXiv uploads too fast to read. But I think I iterated too much and made it almost usable.
01.01.2026 23:45
Everyone's favorite feed is running on one person's gaming system. I love how hackable this site is, it makes it much more fun.
26.12.2025 18:47
If you're working on a non-fiction research/writing project that isn't journalism and you don't have an academic affiliation, how do you find other people who are doing the same thing? Ideally locally (I'm in NY).
22.12.2025 01:53
local first vs atproto!! what should the source of truth for group data be?
20.12.2025 12:45
I am late to the game but I finally read the NeurIPS 2025 best paper on gating in LLMs, it is great.
Qiu et al.
Alibaba, U Edinburgh, Stanford, MIT, Tsinghua U
arxiv.org/abs/2505.06708
1/3
Olmo 3.1 is here. We extended our strongest RL run and scaled our instruct recipe to 32B, releasing Olmo 3.1 Think 32B & Olmo 3.1 Instruct 32B, our most capable models yet. 🧵
12.12.2025 17:14
+1 for mentioning AI as structuralism
13.12.2025 02:24
Post on Interconnects: www.interconnects.ai/p/building-o...
Slides: docs.google.com/presentation...
YouTube: youtu.be/uaZ3yRdYg8A
i think i can officially say i preferred my arch linux desktop over macOS
the best thing about macOS is the flow between computer, phone, airpods
everything else feels like 10% off the mark and all these paper cuts don't feel good
07.12.2025 14:10
Built a little AT Protocol playground - a single HTML file that lets you watch the firehose, create records, and browse any repo with a dynamic form UI. Changes sync directly back to your PDS. #atproto
at.selem.im
A figure demonstrating the different aspects of the corpus described in the tweet. There is a main isometric 3D view of a level in the Portal 2 co-op game, with some portals, lasers, and the blue and orange players. Inset, there are first-person captures of the blue and orange player views. There is also a box containing the transcribed dialogue with timestamps and labels for the discursive acts. Finally, there is a box containing a task and a list of subtasks. Some subtasks are already crossed out, with the time that they were completed. The last subtask ("Player 2 places portal 4 on wall 4") is marked incomplete.
The dialogue is as follows:
Blue: Can you put your other portal up here? (tagged as directive)
Orange: Where? (tagged as request for clarification)
Blue: On uh, on this wall. (tagged as directive)
Blue: So that it uh points at the circle. (tagged as directive)
Orange: Okay. (tagged as commit)
The full list of subtasks is:
Task: Redirect lasers
Subtask: Player 1 places portal 1 on wall 1. (completed)
Subtask: Player 1 places portal 2 on wall 2 or 3. (completed)
Subtask: Player 2 places portal 3 opposite of portal 2. (completed)
Subtask: Player 2 places portal 4 on wall 4. (incomplete)
A couple years (!) in the making: we're releasing a new corpus of embodied, collaborative problem solving dialogues. We paid 36 people to play Portal 2's co-op mode and collected their speech + game recordings.
Paper: arxiv.org/abs/2512.03381
Website: berkeley-nlp.github.io/portal-dialo...
1/n
Makes me wonder how viable a presentation app that uses SVGs as its native format would be.
LLMs tend to do fairly well with vector formats and it would solve the mutability problem here.
I've been having a bunch of fun hacking on my Bluesky thread viewing HTML+JS app using Claude Code - here's a video demo of the most recent version, you can try it out here tools.simonwillison.net/bluesky-thre...
28.11.2025 19:24
tldr of Andy's back of envelope math: in Morrow County data centers may actually be accounting for only ~1% of local wastewater
26.11.2025 06:40
the local agriculture was highly polluting but drew water primarily from a river
while the data centers drew from the (poisoned) water table,
competing with residents for the deepest wells and sending outputs into processing ponds that couldn't handle the capacity
so Morrow County Oregon seems to be an example of a drinking water crisis accelerating b/c of data center buildouts
the original crisis was caused by agriculture, but the scale of the issue worsened because of data center waste water handling
⚠️ Update on Deep Research Tulu (DR Tulu), our post-training recipe for deep research agents: we're releasing an upgraded version of our example agent, DR Tulu-8B (RL), that matches or beats systems like Gemini 3 Pro & Tongyi DeepResearch-30B-A3B on core benchmarks. 🧵
25.11.2025 19:37
Test-time reasoning guidance: up to 66.7% improvement
We scaffold cognitive structures from successful traces to guide reasoning.
Major gains on ill-structured problems
Models possess latent capabilities; they just don't deploy them adaptively without explicit guidance.
We analyzed 1,598 LLM reasoning papers:
Research concentrates on easily quantifiable behaviors: sequential organization (55%), decomposition (60%)
Neglects meta-cognitive controls (8-16%) and alternative representations (10-27%) that correlate with success ⚠️
Our taxonomy bridges cognitive science and LLM eval:
28 elements across 4 dimensions: reasoning invariants (compositionality, logical coherence), meta-cognitive controls (self-awareness), representations (hierarchical, causal), and operations (backtracking, verification)