Can you explain what you mean by "you need to account for all the token boundaries"? Are you labeling spans at a granularity finer than words (at least for English)?
14.02.2026 00:02 — 👍 0 🔁 0 💬 1 📌 0
@mcognetta.bsky.social
Language and keyboard stuff at Google + PhD student at Tokyo Institute of Technology. I like computers and Korean and computers-and-Korean and high school CS education. Georgia Tech → 연세대학교 → 東京工業大学. https://theoreticallygoodwithcomputers.com/
Thanks! Lichess isn't strictly necessary, but it makes the pipeline slightly easier. I'll DM you soon!
13.02.2026 22:25 — 👍 0 🔁 0 💬 0 📌 0
Thanks! I will DM you soon, I'm still deploying some stuff!
13.02.2026 22:25 — 👍 1 🔁 0 💬 0 📌 0
A Korean passage that begins "아편 전쟁阿片戰爭은...", meaning "the opium wars...".
Unusual hanja formatting in this passage I read.
I usually see hanja in brackets/parentheses or as sub/superscript or at least bolded, so seeing it just sort of nakedly attached to the hangul is a bit jarring. Especially because there is a space between 아편 and 전쟁 but not in 阿片戰爭.
#한국어 #한자
I downloaded something like 300GB of open models and wrote a bunch of map-reduce style processing scripts to make this graph.
It's plotting the distribution of weight values across a variety of popular open models, to show that models are almost entirely made up of small floats.
I'm looking for 5-10 #chess players to test out a tool I'm building. Preferably who play on @lichess.org and are 1200+ in rapid or blitz.
And if you coach chess at all, I'd be extra grateful to have you test it!
NOTE: it is _NOT_ an "LLM chess coach" tool, I promise!
🙏
For reference, iirc he was bronze league with more than 1000 games. In other words, not very good.
12.02.2026 02:32 — 👍 2 🔁 0 💬 1 📌 0
What is the timeline of the departures?
11.02.2026 03:42 — 👍 3 🔁 0 💬 1 📌 0
If you think labeling text spans with LLMs is easy, you probably have not tried it yourself (we have! 🙃).
Any method you can think of – be it tagging, matching, or indexing – has flaws.
In our new preprint, we tested them all 💪 We also proposed how to improve one of them.
arxiv.org/abs/2601.16946
Oh hey you know what? I should probably have just read the paper.
10.02.2026 23:58 — 👍 0 🔁 0 💬 0 📌 0
The model is then forced to output exactly the same surface form text as the input.
With multiple tagged spans/types, this gets harder, but with mild constraints it becomes regular pretty quickly, and then it's quite easy to encode for a constrained generation tool.
Could the copying issue in tagging be addressed with constrained generation? Like, in the simple case of tagging a single span with open/close tags, convert the input into a regular language where we allow one open tag followed by any amount of text and one closed tag.
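To make the single-span case concrete, here is a rough sketch of that regular language as a tiny state machine. Everything in it is an assumption for illustration: the tag strings `<span>` and `</span>`, the function names, and the simplification that decoding happens one character at a time (a real constrained generation tool would work over the tokenizer's vocabulary). It also assumes the tag strings never occur in the input text.

```python
OPEN, CLOSE = "<span>", "</span>"  # hypothetical tag strings, assumed absent from the input


def allowed_next(source: str, pos: int, phase: int) -> list[str]:
    """Pieces a constrained decoder could emit next.

    pos   = number of source characters copied so far
    phase = 0 before the open tag, 1 inside the span, 2 after the close tag
    """
    options = []
    if pos < len(source):
        options.append(source[pos])  # copy the next input character verbatim
    if phase == 0:
        options.append(OPEN)         # the open tag may appear once
    elif phase == 1:
        options.append(CLOSE)        # the close tag may appear once, after the open tag
    return options


def accepts(source: str, output: str) -> bool:
    """True if output is source with exactly one open tag and one later close tag inserted."""
    pos, phase, i = 0, 0, 0
    while i < len(output):
        # Try tags before single characters so a tag is never read as plain text.
        pieces = sorted(allowed_next(source, pos, phase), key=len, reverse=True)
        piece = next((p for p in pieces if output.startswith(p, i)), None)
        if piece is None:
            return False             # output deviates from the input or misuses a tag
        if piece in (OPEN, CLOSE):
            phase += 1
        else:
            pos += 1
        i += len(piece)
    return pos == len(source) and phase == 2
```

For example, `accepts("the opium wars", "the <span>opium wars</span>")` holds, while an output that drops a character or never closes the tag is rejected. In a decoder, `allowed_next` would play the role of the mask applied at each step.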
10.02.2026 23:54 — 👍 0 🔁 0 💬 2 📌 0
Don't give Elon Musk any ideas
10.02.2026 22:14 — 👍 3 🔁 0 💬 0 📌 0
I propose modern frontier models should be classified as "Really Quite Big Language Models" (RQBLM).
Let's just follow radio astronomy in their naming schemes.
This week on Overcommitted, we got to sit down with Bluesky's favorite tech blogger @samwho.dev and it did not disappoint!
Sam makes some of the coolest tech content on the internet, and if you haven't heard from him yet, you should! Full episode out now: overcommitted.dev/interactive-...
For bsky's burgeoning AI scene.
10.02.2026 09:35 — 👍 62 🔁 6 💬 0 📌 0
Oh there are some Korean songs mixed in. It just gets better and better.
09.02.2026 21:17 — 👍 0 🔁 0 💬 0 📌 0
My work rotation today. Unbelievably good.
09.02.2026 21:14 — 👍 0 🔁 0 💬 1 📌 0
Up the chain all the way to the president of NLP
09.02.2026 20:06 — 👍 0 🔁 0 💬 0 📌 0
deep olympics lore
08.02.2026 22:22 — 👍 3 🔁 0 💬 0 📌 0
@aclmeeting.bsky.social @aclrollingreview.bsky.social what is the right way to send a complaint about another review from a paper I am reviewing? One in my batch is absolutely terrible, and I am not confident in leaving it to the AC/etc to catch and properly handle it.
08.02.2026 20:13 — 👍 3 🔁 0 💬 1 📌 0
This is exactly what I want my desk space to look like.
08.02.2026 05:47 — 👍 9 🔁 0 💬 0 📌 0
TBH I am almost equally impressed that 1) we have the computing power and technical ability to compute this table and 2) that 63TB fit on just 3 hard drives and was cheap enough for someone to just purchase.
07.02.2026 06:14 — 👍 2 🔁 0 💬 1 📌 0
It's available for download in case you are interested lol
op1.lichess.ovh/tables/
The subset consists of positions where some file has both a black and a white pawn that haven't passed each other (like they are blocking each other from promotion).
07.02.2026 06:12 — 👍 1 🔁 0 💬 0 📌 0
@lichess.org announced a partial 8 piece tablebase that takes up 63TB on disk. They couldn't get a network to help them transfer it, so they just shipped it via plane on hard drives.
"Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway." -Andy Tanenbaum