Symmetry in language statistics shapes the geometry of model representations
Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM rep...
In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle: translation symmetry in the statistics of data (a minimal formal statement follows below).
arxiv.org/abs/2602.150...
With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.
19.02.2026 04:20
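For readers unfamiliar with the term, here is a minimal formal statement of translation symmetry (stationarity) in token statistics. This is the textbook definition, not a result from the preprint; the notation is mine.

```latex
% Translation symmetry (stationarity) of language statistics:
% the joint distribution of any window of tokens is invariant
% under shifting the window's position by t.
\[
  P(x_{t+1}, \dots, x_{t+k}) = P(x_{1}, \dots, x_{k})
  \quad \text{for all shifts } t \ge 0 \text{ and window lengths } k.
\]
```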
I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".
It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...
18.02.2026 02:05
We also show that we are far from done, especially for a complicated language like Old French.
But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.
17.02.2026 18:11
We propose Pre-Editorial Normalization (PEN):
An intermediate layer between:
• graphemic ATR output
• fully edited text
Goal: preserve palaeographic fidelity + enable usability.
Keep two layers, the ATR output and the normalization, with aligned tokens to go back to the source (see the sketch below).
17.02.2026 18:11
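For a concrete picture, here is a minimal sketch of what a two-layer, token-aligned record could look like. All names (`AlignedToken`, the field names, the toy tokens) are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass

@dataclass
class AlignedToken:
    """One token carried through both PEN layers.

    Field names are illustrative assumptions, not the paper's schema:
    `graphemic` preserves the raw ATR output (palaeographic fidelity),
    `normalized` holds the pre-editorial normalization (usability),
    and `line_id` points back to the source image line.
    """
    graphemic: str
    normalized: str
    line_id: str

# Toy example: the normalized layer expands abbreviations, while the
# graphemic layer stays aligned so ATR errors remain detectable.
record = [
    AlignedToken("oim", "omnium", "fol12r-03"),
    AlignedToken("pctoru", "peccatorum", "fol12r-03"),
]

for tok in record:
    print(f"{tok.graphemic:>8} -> {tok.normalized:<12} [{tok.line_id}]")
```

The point of keeping both layers is that normalization becomes reversible: any suspicious normalized form can be traced back to the exact graphemic token and source line that produced it.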
Recent ATR progress, especially with palaeographic datasets like CATMuS, has improved access to medieval sources.
But:
• Raw outputs are hard to use
• Fully normalized models over-normalize & hallucinate
There's a methodological gap.
17.02.2026 18:11
If I give you the text
omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset
can you find the ATR error without the manuscript?
Probably not.
ATR models that transcribe and normalize in one go produce text that reads as trustworthy, but they make errors like this one impossible to detect.
17.02.2026 18:11
New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin
Thibault Clérice, @rachelbawden.bsky.social, Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905
We introduce Pre-Editorial Normalization (PEN).
🧵⬇️
17.02.2026 18:11
Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces ✨ Submissions are open for one more week! We want to know what you're working on!
06.02.2026 20:21
our open model proving out specialized RAG LMs over scientific literature has been published in Nature ✍🏻
congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers
www.nature.com/articles/s41...
04.02.2026 22:43
Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. Appreciate this ad stresses: chance to do interesting technical work; work on an interesting humanities problem; chance to publish both in humanities & comp sci venues. Looks great, apply!
30.01.2026 18:01
The bias/variance tradeoff in 2026: Claude Sonnet wrote a program to solve the problem as described; Claude Opus figured out a shortcut from the example data that won't generalize.
28.01.2026 19:59
[Job] Are you curious about #AI applications in the #humanities? My Print and Probability research group (@print-and-prob.bsky.social) is hiring a postdoc! Come help us develop computational methods for identifying clandestine early modern printers!
cmu.wd5.myworkdayjobs.com/CMU/job/Pitt...
22.01.2026 22:00
Debate me bro
Statistical inference is rhetoric.
Statistical inference is a rhetoric of counts.
20.01.2026 15:49
I had the absolute pleasure to visit @craicexeter.bsky.social, where I laid out an argument for how critical & computational scholars should lead the conversation on AI. We need to expand research on harms, interrogate corporate hype, and support people's critical understanding of these technologies.
22.01.2026 16:32
I couldn't figure out eloquent language to describe digital agents that assist with web navigation tasks, so I just wrote "click click" and if I keep this up maybe I will start referring to language generation as "word word"
22.01.2026 17:27
Hi Honey, I'm Homo Neuricus
Six Ways I'm using AI to Become More Human
A very random view into how some people* outside of tech think about and use chatbots. It's not coding, that's for sure, and some of it might sound ridiculous, but I think this kind of perspective and usage is way more common than we might assume.
*LA people (sorry, I love LA, but this is very LA)
22.01.2026 17:11
Let's think step by step. If you could reconstruct the original page with high probability using a language model given the bag of words, you could:
1. demonstrate that bag-of-words models are useful, and
2. destroy the legal arguments people used to allow them to share bags of words.
(A toy sketch of the reconstruction idea follows below.)
20.01.2026 20:28
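As a toy illustration of the reconstruction idea (the technical point, not the legal one): a beam search over orderings of the bag, scored by language-model likelihood. This assumes the Hugging Face transformers library and GPT-2; the function names and beam width are my own choices, not from the post.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sequence_nll(text: str) -> float:
    """Approximate total negative log-likelihood of `text` under GPT-2."""
    # Prepend BOS so even a one-word prefix has a prediction target.
    ids = tokenizer(tokenizer.bos_token + text, return_tensors="pt").input_ids
    with torch.no_grad():
        mean_nll = model(ids, labels=ids).loss  # mean NLL per predicted token
    return mean_nll.item() * (ids.shape[1] - 1)

def reconstruct(bag: list[str], beam_width: int = 5) -> str:
    """Order a bag of words via beam search, keeping the likeliest prefixes."""
    beams = [([], bag)]  # (prefix built so far, words left in the bag)
    while beams[0][1]:
        candidates = []
        for prefix, remaining in beams:
            for i, word in enumerate(remaining):
                new_prefix = prefix + [word]
                rest = remaining[:i] + remaining[i + 1:]
                candidates.append(
                    (sequence_nll(" ".join(new_prefix)), new_prefix, rest)
                )
        candidates.sort(key=lambda c: c[0])  # lower NLL = more plausible
        beams = [(p, r) for _, p, r in candidates[:beam_width]]
    return " ".join(beams[0][0])

print(reconstruct(["the", "sat", "cat", "mat", "the", "on"]))
```

Beam search keeps this cheap relative to scoring all n! orderings; a serious attempt at page-scale reconstruction would need a stronger model and smarter search, but the shape of the argument is the same.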
CSE 598-004 - Building Small Language Models
The second new class I'm teaching is a very experimental graduate-level seminar in CSE: "Building Small Language Models". I taught the grad-level NLP class last semester (so fun!) but students wanted more: which of these new ideas work, and which work for SLMs? jurgens.people.si.umich.edu/CSE598-004/
19.01.2026 21:29
For social scientists interested in LLMs for text classification/coding, the process here is potentially very helpful (even if you don't use the product itself).
Their core technique: Contradictory Example Training
Their training method: Binocular Labeling
More details in the linked post below.
15.01.2026 19:37
"Written in the Style of": ChatGPT and the Literary Canon
The first research paper from WashU's AI Humanities Lab, which I co-direct with Gabi Kirilloff, is available now in the Harvard Data Science Review! Read to learn more about how (badly) current LLMs replicate literary style: doi.org/10.1162/9960...
10.01.2026 21:14
New article in #JCLS 5(1)!
@axelpichler.fedihum.org.ap.brid.gy, Endres, M. & @nilsreiter.de (2026) "#Interpretation, Argument, #Evaluation. A Workflow for Assessing #LLM-Generated Interpretations of #Poetry" doi.org/10.48694/jcl...
#RollingIssue #NLG #CLS #LiteraryComputing
14.01.2026 22:13
yes, this is a really great paper, showing how AI can enhance individual science but narrow the general scope.
14.01.2026 22:08
But if this is the case, why do models act so differently across languages?
Datasets like ECLeKTic show that models know different things in different languages. A rare fact is usually known only in the language in which it was seen.
bsky.app/profile/lcho...
14.01.2026 16:52
Is this Burton's translation or did he get it from someone else?
14.01.2026 18:04
Do you have ideas for the future of reading?
Submit a 2-4 page paper to the CHI workshop I am co-organising! (deadline Feb 12) "Science and Technology for Augmenting Reading"
chi-star-workshop.github.io
12.01.2026 03:46
Reading environments for classical languages FTW
12.01.2026 03:30
Excited about this Duke AI conference + stoked to present new work on cultural AI. Grateful this high-profile conference will include humanistic perspectives. Meaning, history, aesthetics, narrative, etc. are part of the society-centered AI question. Glad the humanities will be a part of the convo.
07.01.2026 17:32
NLP @ Utrecht University (NL) | https://www.dongnguyen.nl/ | NLP & Society Lab: https://nlpsoc.github.io/
Ancient tragedy, reception studies, digital humanities
Bold science, deep and continuous collaborations.
Like all the men of Babylon, I have been proconsul; like all, a slave; I have also known omnipotence, opprobrium, imprisonment.
very sane ai newsletter: verysane.ai
random bloggy bits: segyges.leaflet.pub
Cultural Analytics and NLP researcher
historian of the Islamicate with @openiti.bsky.social; community food forest/garden organizer with @foodforestchatt.bsky.social. … Christian, father, deep time delver. Anarcho-agrarian; upholder of Gustav Landauer thought.
Tenure-track faculty at the Max Planck Institute for Software Systems
Previously postdoc at UW and AI2, working on Natural Language Processing
Recruiting PhD students!
https://lasharavichander.github.io/
Postdoc @vectorinstitute.ai | organizer @queerinai.com | previously MIT, CMU LTI | rodent enthusiast | she/they
https://ryskina.github.io/
NLP Researcher at EleutherAI, PhD UC San Diego Linguistics.
Interested in multilingual NLP, tokenizers, open science.
Boston. She/her.
https://catherinearnett.github.io/
Retired but not gone, Digital Humanist, XML practitioner, one-time classicist, sorter of things
The official account for the peregrine falcons nesting at UMass Amherst.
associate prof at UMD CS researching NLP & LLMs
Comics-creating cognitive (neuro)scientist at Tilburg University studying language, brains, comics, emoji & multimodality (he/him).
www.visuallanguagelab.com
developing tools, data, and machine learning methods to discover new bibliographical evidence in early printed books
PhD candidate, Information Science, UIUC
https://danieljohnevans.github.io/
god created him and demanded that he die
Deputy editor at Foreign Policy, China nerd, gaming nerd, reads a lot
Librarian, toddler mom, living in Gainesville, FL. Digital humanities, grants, copyright, library publishing, reproductive health, paper and fiber crafts.
computational social scientist
PhD Candidate at Northeastern / Incoming Research Intern + ex-Visiting Researcher at Meta (MSL) / Organizer at the Trustworthy ML Initiative (trustworthyml.org).
safety & privacy in language models + mountain biking.
jaydeepborkar.github.io