learning how to do something is a first-order use case for LMs, the development bottleneck has been collecting data covering a wide diversity of topics, until now ✍🏻
10.02.2026 20:34
@kylelo.bsky.social
language model pretraining @ai2.bsky.social, co-lead of data research w/ @soldaini.net, statistics @uw, open science, tabletop, seattle, he/him, kyleclo.com
incredibly fun project led by our intern yapei chang
we mined the web for thousands of real-world "how to do X" step-by-step instructions and turned it into a dataset, synth data training procedure, eval suite, etc.
lol rip 😮‍💨
It's a score calculated against gold reference citations in the generated lit review, so even humans don't score high. i think the eval is saturated cuz there's so much subjectivity in what counts as an appropriate citation. better phrasing is maybe that the citations are sensible up to some X
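for intuition only (hypothetical sketch, not the actual eval): scoring generated citations as set overlap against gold references is harsh precisely because a sensible-but-different citation counts as a miss:

```python
# Hypothetical sketch of a citation-overlap metric (NOT the actual eval):
# score the set of papers cited in a generated lit review against a gold
# reference list. Any sensible-but-different citation counts as a miss,
# which keeps scores low even for humans.
def citation_f1(predicted_ids, gold_ids):
    """F1 over cited paper IDs vs. gold reference IDs."""
    pred, gold = set(predicted_ids), set(gold_ids)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# a human citing one "wrong" (but sensible) paper already drops to 0.5:
citation_f1(["paperA", "paperB"], ["paperA", "paperC"])  # -> 0.5
```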
they're separate, poorly named systems lol. Separate projects approaching the same problem from different angles: Scholar QA approaches from agentic system design, use whatever model; Open Scholar approaches from model-first, very light on system. The teams are working together to fuse ideas
05.02.2026 02:33
our open model proving out specialized RAG LMs over scientific literature has been published in nature ✍🏻
congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers
www.nature.com/articles/s41...
0 days since last mixup of eval results between "copa" (choice of plausible alternatives) & "coqa" (conversational QA) tasks
03.02.2026 20:01
The 5th Generation, Evaluation, and Metrics (GEM) Workshop will be at #ACL2026!
Call for papers is out. Topics include:
- LMs as evaluators
- Living benchmarks
- Eval with humans
and more
New for 2026: Opinion & Statement Papers!
Full CFP: gem-workshop.com/call-for-pap...
mm yea i think that's always the case w productivity tools.
imo ability to adopt new tools is a core part of the job. just like the transition from plain text editors to IDEs, from sending files via FTP to using git for collab, from ad hoc Makefiles to package managers, etc. AI is just the latest thing
my concern is the growing pool of "unknown unknowns" as i interact less with code directly.
imo probably why i subconsciously have been leaning toward cursor over claude code or similar agents, even if the latter has a higher code-to-keystrokes ratio
i dont feel worse at this even if im not writing papers from-scratch as much as during early career
but coding feels different due to mismatch between what i express to the system (english) and what the system returns (code). i've already realized some gaps in libraries I used to know well.
whether my ability to review code will degrade as I offload increasingly larger workloads to AI
of course, this shift is present in other forms of generation, like paper writing, where my role has shifted to reviewing/editing (student's) drafts.
some thoughts about skill degradation w/ AI coding
im onboard w views that "english is the new programming language" & "software engineering", translating ambiguous goals to technical specs/execution, is still a skill.
im more concerned w shift from my role as a writer to a reviewer and
lucky to chat w sen. patty murray about olmo & importance of fully open AI
18.01.2026 03:09
using opus to extract research topics from papers & it was giving me useless words like "training", "datasets", and "evaluation"
kept prompting it w examples of more informative topics and it ended up with "LLM training", "LLM datasets", and "LLM evaluation"
thx
yo endorse me for python skills
16.01.2026 18:37
just realized ive had food on my face all day & nobody at office told me, thx ai2 frens
16.01.2026 00:05
u gotta shitpost more maria, ur content too informative
15.01.2026 21:54
i appreciate bsky has less AI product advertising; i do want to see more memes/shitposting/fun stuff and insights from industry/open source sphere, even if they dont have an attached paper
15.01.2026 17:50
amaazinggg thxx
14.01.2026 21:31
ive been clicking around in UI but i cant find it 😭 pls help
14.01.2026 21:27
bsky wish list
i like the idea of different feeds but i actually want my subscription to select feeds to be taken as a preference signal ("more like this") that informs a "home/default" feed.
i really dislike the UX of having to tab through each subscribed feed, esp when there's also post overlap
some notion of 'views/impressions'? it kinda sucks to post and only see a couple of likes & no replies. if there's some intermediate signal that shows people at least read the post, that'd incentivize posting more imo
14.01.2026 20:15
sports
09.01.2026 22:05
nope just an admirer of room 003
08.01.2026 01:57
just in case it wasn't clear which room this is
07.01.2026 17:58
some citation graph data pipelines will create new "paper" nodes based on extracted bibstrings from PDFs
so in 2026 the papers we hallucinated in 2025 might end up being "real" papers on gscholar or sthn lol
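a minimal sketch of the failure mode (hypothetical code, not any specific pipeline): if nodes get minted straight from extracted bibstrings with no existence check, a hallucinated reference becomes a "paper" just like a real one:

```python
# Hypothetical sketch (not any specific pipeline): minting citation-graph
# "paper" nodes directly from bibstrings extracted out of PDFs. Nothing
# here verifies the cited paper exists, so a hallucinated reference
# becomes a node just like a real one.
import hashlib

def add_paper_node(graph, bibstring):
    """Key the node by a hash of the normalized bibstring; dedupes only."""
    key = hashlib.sha1(bibstring.strip().lower().encode()).hexdigest()[:12]
    graph.setdefault(key, {"raw_citation": bibstring, "verified": False})
    return key

graph = {}
add_paper_node(graph, "Doe et al. 2025. A Paper That Was Never Written.")
add_paper_node(graph, "doe et al. 2025. a paper that was never written. ")
len(graph)  # -> 1: one node, hallucinated or not
```

the only dedupe is string normalization, so two extractions of the same (possibly fake) bibstring collapse into one node, and downstream consumers can't tell it apart from a verified paper.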
ya ur rite! we'll update it
20.12.2025 00:34
just had hechalou's yin yang milk tea and i think i've transcended 🤤
14.12.2025 01:43
new olmo 3.1 artifacts: huggingface.co/collections/...
paper (arxiv soon): allenai.org/papers/olmo3
demo: playground.allenai.org
paper has:
- more on our eval ideology
- more baselines
- more about RL Zero
etc
we picked the final model (internally called moonlit surfer) not just on bench scores but good vibes 🥰