David Smith's Avatar

David Smith

@dasmiq.bsky.social

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.

5,340 Followers  |  299 Following  |  392 Posts  |  Joined: 01.09.2023

Latest posts by dasmiq.bsky.social on Bluesky


Preview
Symmetry in language statistics shapes the geometry of model representations Although learned representations underlie neural networks' success, their fundamental properties remain poorly understood. A striking example is the emergence of simple geometric structures in LLM rep...

In our new preprint, we explain how some salient features of representational geometry in language modeling originate from a single principle - translation symmetry in the statistics of data.

arxiv.org/abs/2602.150...

With Dhruva Karkada, Daniel Korchinski, Andres Nava, & Matthieu Wyart.

19.02.2026 04:20 | 👍 28    🔁 5    💬 1    📌 0
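A toy gloss of the key idea, not taken from the preprint: translation symmetry here means the token statistics are stationary, i.e. the distribution of a word pattern does not depend on its absolute position in the text. A minimal Python check of that property, assuming docs is a list of token lists:

from collections import Counter

def positional_unigram(docs, t):
    # Empirical distribution of the token at absolute position t.
    counts = Counter(doc[t] for doc in docs if len(doc) > t)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# Stationarity (translation symmetry) predicts these roughly agree
# for positions far from document boundaries:
# positional_unigram(docs, 10) vs. positional_unigram(docs, 100)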
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
YouTube video by Google TechTalks

I gave a talk at the Google Privacy in ML Seminar last summer on privacy & memorization: "Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training".

It's up on YouTube now if you're interested :)
youtu.be/IzIsHFCqXGo?...

18.02.2026 02:05 | 👍 2    🔁 2    💬 0    📌 0

We also show that we are far from done, specifically for a complicated language like Old French.

But we
(1) define the issue,
(2) propose a first solution that enables pre-annotation of larger datasets, and
(3) offer an alternative to less trustworthy models that go beyond ATR.

17.02.2026 18:11 | 👍 3    🔁 1    💬 1    📌 0
Preview
Pre-Editorial Normalization, a Hugging Face Space by comma-project: Latin and Old French normalization of CATMuS output

We release:

📚 4.66M silver training samples
🧪 1.8k gold evaluation set huggingface.co/datasets/com...
🤖 ByT5-based model → 6.7% CER huggingface.co/comma-projec...

Try it here 👇
huggingface.co/spaces/comma...

17.02.2026 18:11 | 👍 5    🔁 1    💬 1    📌 0
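A minimal sketch of running a ByT5 normalizer like this one with transformers; the hub IDs in the post are truncated, so the model_id below is a placeholder to swap for the real one, and the input line is an invented example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "comma-project/pen-byt5"  # placeholder: the post's hub IDs are truncated
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

atr_line = "beneurez est li on qui ne ala el conseil des felons"  # invented ATR output
inputs = tok(atr_line, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))  # pre-editorially normalized line

The reported 6.7% CER is character error rate against the gold normalization, i.e. character edit distance divided by reference length.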

👉 We propose Pre-Editorial Normalization (PEN):

An intermediate layer between:
๐Ÿ“ graphemic ATR output
๐Ÿ“– fully edited text

Goal: preserve palaeographic fidelity + enable usability.
We keep two layers, the ATR output and the normalization, with aligned tokens to go back to the source.

17.02.2026 18:11 | 👍 3    🔁 1    💬 1    📌 0
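A minimal sketch, my own rather than the paper's data format, of what the two aligned layers could look like; the abbreviated tokens are invented examples:

from dataclasses import dataclass

@dataclass
class AlignedToken:
    atr: str   # graphemic ATR output, palaeographic fidelity preserved
    norm: str  # pre-editorially normalized form

line = [AlignedToken("dñs", "dominus"), AlignedToken("dyacon9", "dyaconus")]

def back_to_source(line, norm_word):
    # Trace a normalized token back to its source spelling(s).
    return [t.atr for t in line if t.norm == norm_word]

print(back_to_source(line, "dominus"))  # ['dñs']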

Recent ATR progress, especially with palaeographic datasets like CATMuS, has improved access to medieval sources.

But:
โŒ Raw outputs are hard to use
โŒ Fully normalized models over-normalize & hallucinate

There's a methodological gap.

17.02.2026 18:11 | 👍 2    🔁 1    💬 1    📌 0

If I give you the text
📚 omnium peccatorum quia ex quo dyaconus quando esset in futurum, stultus esset

Can you find the ATR error without the manuscript?

Probably not.

ATR models that predict text and normalize in one go produce seemingly trustworthy text, but they make errors like this one impossible to detect.

17.02.2026 18:11 | 👍 1    🔁 1    💬 2    📌 0
Post image

📄 New paper:
Pre-Editorial Normalization for Automatically Transcribed Medieval Manuscripts in Old French and Latin

Thibault Clérice, @rachelbawden.bsky.social, Anthony Glaise, Ariane Pinche, @dasmiq.bsky.social (2026) arxiv.org/abs/2602.13905

We introduce Pre-Editorial Normalization (PEN).

🧵⬇️

17.02.2026 18:11 | 👍 21    🔁 10    💬 1    📌 1

Excited to be co-organizing the #CHI2026 workshop on augmented reading interfaces 📚✨ Submissions are open for one more week! We want to know what you're working on!

06.02.2026 20:21 | 👍 9    🔁 2    💬 1    📌 0
Post image

our open model proving out specialized RAG LMs over scientific literature has been published in Nature ✌🏻

congrats to our lead @akariasai.bsky.social & team of students and Ai2 researchers/engineers

www.nature.com/articles/s41...

04.02.2026 22:43 | 👍 44    🔁 10    💬 2    📌 2

Cool postdoc job opportunity! A chance to work with some great English & comp sci scholars at Carnegie Mellon. Appreciate that this ad stresses the chance to do interesting technical work, work on an interesting humanities problem, and publish in both humanities & comp sci venues. Looks great, apply!

30.01.2026 18:01 | 👍 5    🔁 4    💬 0    📌 0

The bias/variance tradeoff in 2026: Claude Sonnet wrote a program to solve the problem as described; Claude Opus figured out a shortcut from the example data that won't generalize.

28.01.2026 19:59 | 👍 4    🔁 0    💬 0    📌 0
Post image

[Job 📣] Are you curious about #AI applications in the #humanities? My Print and Probability research group (@print-and-prob.bsky.social) is hiring a postdoc! Come help us develop computational methods for identifying clandestine early modern printers!

cmu.wd5.myworkdayjobs.com/CMU/job/Pitt...

22.01.2026 22:00 | 👍 15    🔁 13    💬 0    📌 1
Preview
Debate me bro: Statistical inference is rhetoric.

Statistical inference is a rhetoric of counts.

20.01.2026 15:49 | 👍 19    🔁 3    💬 1    📌 3
Post image

I had the absolute pleasure to visit @craicexeter.bsky.social, where I laid out an argument for how critical & computational scholars should lead the conversation on AI. We need to expand research on harms, interrogate corporate hype, and support people's critical understanding of these technologies.

22.01.2026 16:32 | 👍 18    🔁 3    💬 0    📌 3

I couldn't figure out eloquent language to describe digital agents that assist with web navigation tasks, so I just wrote "click click" and if I keep this up maybe I will start referring to language generation as "word word"

22.01.2026 17:27 | 👍 11    🔁 1    💬 1    📌 0
Preview
Hi Honey, I'm Homo Neuricus: Six Ways I'm Using AI to Become More Human

A very random view into how some people* outside of tech think about and use chatbots. It's not coding, that's for sure, and some of it might sound ridiculous, but I think this kind of perspective and usage is way more common than we might assume.

*LA people (sorry, I love LA, but this is very LA)

22.01.2026 17:11 | 👍 7    🔁 1    💬 2    📌 0

Let's think step by step. If you could reconstruct the original page with high probability using a language model given the bag of words, you could:
1. demonstrate that bag of words models are useful, and
2. destroy the legal arguments people used to allow them to share bags of words.

20.01.2026 20:28 | 👍 2    🔁 0    💬 0    📌 0
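A toy sketch of that reconstruction idea: greedily grow the text by always appending the bag word the LM finds most plausible. GPT-2 is a stand-in scorer; a serious attempt would need beam search and a much stronger model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_score(text):
    # Average log-likelihood per token under the LM (higher = more plausible).
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return -lm(ids, labels=ids).loss.item()

def reconstruct(bag):
    words, text = list(bag), tok.bos_token  # seed so the LM always has a prefix
    while words:
        best = max(words, key=lambda w: sequence_score(text + " " + w))
        text += " " + best
        words.remove(best)
    return text[len(tok.bos_token):].strip()

print(reconstruct(["dog", "the", "barked"]))  # ideally: "the dog barked"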
CSE 598-004 - Building Small Language Models

The second new class I'm teaching is a very experimental graduate level seminar in CSE: "Building Small Language Models". I taught the grad level NLP class last semester (so fun!) but students wanted more: which of these new ideas work, and which work for SLMs? jurgens.people.si.umich.edu/CSE598-004/

19.01.2026 21:29 | 👍 32    🔁 9    💬 2    📌 1

For social scientists interested in LLMs for text classification/coding, the process here is potentially very helpful (even if you don't use the product itself).

Their core technique: Contradictory Example Training
Their training method: Binocular Labeling

More details in the linked post below.

15.01.2026 19:37 | 👍 45    🔁 13    💬 0    📌 0
Preview
'Written in the Style of': ChatGPT and the Literary Canon

The first research paper from WashU's AI Humanities Lab, which I co-direct with Gabi Kirilloff, is available now in the Harvard Data Science Review! Read to learn more about how (badly) current LLMs replicate literary style: doi.org/10.1162/9960...

10.01.2026 21:14 | 👍 12    🔁 3    💬 1    📌 0
Article Preview

📢 New article in #JCLS 5(1)! 🎉
@axelpichler.fedihum.org.ap.brid.gy, Endres, M. & @nilsreiter.de (2026) "#Interpretation, Argument, #Evaluation. A Workflow for Assessing #LLM-Generated Interpretations of #Poetry" doi.org/10.48694/jcl...

#RollingIssue #NLG #CLS #LiteraryComputing

14.01.2026 22:13 | 👍 4    🔁 3    💬 1    📌 0

yes, this is a really great paper, showing how AI can enhance individual science but narrow its overall scope.

14.01.2026 22:08 | 👍 12    🔁 2    💬 0    📌 0

But if this is the case, why do models behave so differently across languages?
Datasets like ECLeKTic show that models know different things in different languages. A rare fact is usually known only in the language in which it was seen.
bsky.app/profile/lcho...

14.01.2026 16:52 | 👍 10    🔁 4    💬 1    📌 0

Is this Burton's translation or did he get it from someone else?

14.01.2026 18:04 | 👍 1    🔁 0    💬 1    📌 0

Do you have ideas for the future of reading?

Submit a 2-4 page paper to the CHI workshop I am co-organising! (deadline Feb 12) "Science and Technology for Augmenting Reading"

chi-star-workshop.github.io

12.01.2026 03:46 | 👍 11    🔁 2    💬 0    📌 0

Reading environments for classical languages FTW

12.01.2026 03:30 | 👍 8    🔁 1    💬 0    📌 0

Excited about this Duke AI conference + stoked to present new work on cultural AI. Grateful this high-profile conference will include humanistic perspectives. Meaning, history, aesthetics, narrative, etc. are part of the society-centered AI question. Glad the humanities will be a part of the convo.

07.01.2026 17:32 | 👍 8    🔁 2    💬 0    📌 0
