David Smith

@dasmiq.bsky.social

Associate professor of computer science at Northeastern University. Natural language processing, digital humanities, OCR, computational bibliography, and computational social sciences. Artificial intelligence is an archival science.

5,187 Followers  |  289 Following  |  346 Posts  |  Joined: 01.09.2023

Latest posts by dasmiq.bsky.social on Bluesky

Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.

What if OCR models could show you their thought process?

NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.

Could be pretty valuable for weird historical documents?

Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...

07.08.2025 15:16 - 👍 13    🔁 2    💬 1    📌 0
A dot plot titled "Anthropomorphized Animals in Popular Children's Books (*Animals That Appear in 10+ Books)" showing the proportion of animals depicted with gendered pronouns. Animals toward the left side are more often represented as male (he/him), and those toward the right are more often represented as female (she/her). Birds, ducks, and cats lean female. Bears, monkeys, dogs, elephants, foxes, wolves, and frogs lean male. Each animal is represented by a colorful, illustrated face.

Screenshot of Publishers Weekly article titled "The Sneaky Gender Bias in Picture Books: Animal Characters" that includes photo of the author, a woman with brown hair and glasses. Text reads: "Melanie Walsh is an assistant professor in the Information School and an adjunct assistant professor in the English department at the University of Washington. She uses data to analyze contemporary culture, especially literature and publishing. She is currently at work on a book, When Postwar American Fiction Went Viral: Protest, Profit, and Popular Readers in the 21st Century, which follows the surprising social media afterlives of five iconic American authors. Here she shares her investigations into the subtle gender imbalance often at play in picture books featuring animal characters.

I recently published a data analysis with The Pudding, a digital publication known for data-driven storytelling, about animal characters in picture books. We read approximately 300 popular English-language picture books from the past 70+ years and noted the gender of any anthropomorphized animal character that was important to the story.

We found that male animal characters were twice as common as female characters across all the books. Some strong animal stereotypes also emerged: frogs and dogs were boys; birds and cats were girls. Even more surprising, according to our data: this disparity is not obviously improving, even over the last 25 years."

For PW, I wrote about the persistent gender gap in fictional animal characters, a pattern I noticed while analyzing 100s of picture books with @puddingviz.bsky.social.

It's a more interesting (and pervasive) problem than I first thought.

#kidlit #booksky

🔗: www.publishersweekly.com/pw/by-topic/...

05.08.2025 23:29 - 👍 47    🔁 11    💬 4    📌 2

I'm no philosopher of science, but you might go far with, "Amateurs talk prediction; professionals talk measurement."

05.08.2025 14:48 - 👍 2    🔁 0    💬 1    📌 0

Good luck. Speed is one thing people in psycholinguistics and human sentence processing measure when they look at people reading different passages. Do your eyes saccade back when you think, WHAT did I just read?

05.08.2025 13:15 - 👍 3    🔁 0    💬 2    📌 0

The baseline you might try is entropy rate. Start with Temperley's old work on surprisal in music. Then get more complex from there, looking at the human sentence processing literature.

05.08.2025 01:23 - 👍 10    🔁 0    💬 1    📌 0

What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?

Looking for practical methods for settings where human annotations are costly.

A few examples in thread ↴

23.07.2025 08:10 - 👍 74    🔁 23    💬 14    📌 3

A big, beautiful, complex conceptual model of how a "story" is manifested in various versions, languages, (parts of) witnesses, etc. by @katkel.bsky.social & @jbcamps.bsky.social.

A delicious, nutritious alphabet soup of CRM-CIDOC, FRBR, LRMOO (look 'em all up!)
#DH2025

17.07.2025 15:45 - 👍 10    🔁 4    💬 1    📌 0

Congratulations! The least they could do if you have to be chair.

05.07.2025 14:30 - 👍 5    🔁 0    💬 0    📌 0
Digital Collections Explorer: An Open-Source, Multimodal Viewer for Searching Digital Collections
We present Digital Collections Explorer, a web-based, open-source exploratory search platform that leverages CLIP (Contrastive Language-Image Pre-training) for enhanced visual discovery of digital col...

With @yh-huang.bsky.social, I'm excited to share our Digital Collections Explorer, an open-source, multimodal viewer for digital collections! Users can search with both natural language inputs and reverse image search.

Paper: arxiv.org/abs/2507.00961
Public demo: digital-collections-explorer.com

02.07.2025 20:56 - 👍 74    🔁 25    💬 2    📌 3
gibbon-dio.txt
GitHub Gist: instantly share code, notes, and snippets.

I sent DeepSeek into an infinite spiral of self-doubt by asking about Edward Gibbon's use of Cassius Dio. After 850 paragraphs, it had concluded that "Gibbon was born in Massachusetts", and I cut it off. Has anyone else seen traces this long?
gist.github.com/dasmiq/9c415...

02.07.2025 17:30 - 👍 9    🔁 0    💬 1    📌 0
NEMI 2024 (Last Year)

🚨 Registration is live! 🚨

The New England Mechanistic Interpretability (NEMI) Workshop is happening Aug 22nd 2025 at Northeastern University!

A chance for the mech interp community to nerd out on how models really work 🧠🤖

🌐 Info: nemiconf.github.io/summer25/
📝 Register: forms.gle/v4kJCweE3UUH...

30.06.2025 22:55 - 👍 10    🔁 8    💬 0    📌 2
We have developed an AI-powered tool to support your research ideation process. We're inviting researchers to test our tool and share their feedback.
#aiforscience #LLM #research #ideas

26.06.2025 20:12 - 👍 2    🔁 2    💬 0    📌 1
AI, Authorship, and the Public Interest – Project Update and Call for Grant Proposals
Authors Alliance is pleased to announce the availability of research grants of up to $20,000 to support research projects at the intersection of artificial intelligence, copyright law, and the publ…

FUNDING OPPORTUNITY: @authorsalliance.bsky.social, with the support of @knightfoundation.org, is offering grants of up to $20,000 to catalyze research projects at the intersection of artificial intelligence, copyright law, and the public interest.

24.06.2025 13:41 - 👍 13    🔁 6    💬 1    📌 1
Words of Warmth: Trust and Sociability Norms for over 26k English Words

In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word–warmth (as well as word–trust and word–sociability) associations for over 26k English words.
arxiv.org/html/2506.03...

25.06.2025 14:36 - 👍 4    🔁 2    💬 0    📌 0
@nikhil07prakash.bsky.social
How do language models track mental states of each character in a story, often referred to as Theory of Mind? We reverse-engineered how LLaMA-3-70B-Instruct handles a belief-tracking task and found something surprising: it uses mechanisms strikingly similar to pointer variables in C programming!

The new "Lookback" paper from @nikhil07prakash.bsky.social contains a surprising insight...

70b/405b LLMs use double pointers, akin to C programmers' double (**) pointers. They show up when the LLM is "knowing what Sally knows Ann knows", i.e., Theory of Mind.

bsky.app/profile/nik...

25.06.2025 15:00 - 👍 27    🔁 3    💬 1    📌 0
The impact of language models on the humanities and vice versa
Nature Computational Science - Many humanists are skeptical of language models and concerned about their effects on universities. However, researchers with a background in the humanities are also...

New this morning, a Comment I contributed to Nature Computational Science on the interaction between large language models and the humanities. 🧪 🤖 #MLSky

rdcu.be/etk07

The link above will be open-access for a month; plus, I'll reply to this post with a link to a permanently open preprint. +

25.06.2025 12:58 - 👍 145    🔁 53    💬 12    📌 7

Nikhil's recent paper is a tour de force in causal analysis! They show that LLMs keep track of what characters know in a story using "pointer" mechanisms. Definitely worth checking out.

24.06.2025 17:48 - 👍 3    🔁 1    💬 0    📌 0
Sijia Liu talks about machine unlearning

23.06.2025 16:28 - 👍 1    🔁 1    💬 0    📌 0
An Interdisciplinary Approach to Human-Centered Machine Translation
Machine Translation (MT) tools are widely used today, often in contexts where professional translators are not present. Despite progress in MT technology, a gap persists between system development and...

What should Machine Translation research look like in the age of multilingual LLMs?

Here's one answer from researchers across NLP/MT, Translation Studies, and HCI.
"An Interdisciplinary Approach to Human-Centered Machine Translation"
arxiv.org/abs/2506.13468

18.06.2025 12:08 - 👍 16    🔁 7    💬 1    📌 0
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered b...

Arnav Yayavaram and Siddharth Yayavaram were the main contributors to this project and built an awesome, clean, easy-to-use codebase that's up on GitHub now! I have found this resource to be enabling for my own work. @simi97k.bsky.social was the main mentor.

Read now!

arxiv.org/abs/2506.09109

20.06.2025 23:02 - 👍 4    🔁 2    💬 1    📌 0
It is queer how many publishers do not know the meaning of the word "copyright." The Club Woman is copyrighted, which means that no other publisher has a right to reprint articles from this magazine without giving full credit to THE CLUB WOMAN. But we are continually coming across articles and paragraphs taken bodily from our columns and reprinted as original in some other periodical. In one case a Boston paper did this and the articles which was "lifted" (to put it politely) has been copied far and wide, with credit to the "lifting" publication, one which, by the way, stands for the best and highest advancement of woman!

A magazine from 1897 gives an unusual definition of copyright: no copying without attribution. CC BY avant la lettre. #ViralTexts

20.06.2025 20:53 - 👍 8    🔁 2    💬 0    📌 0

Yes, my wife ran into this just last week. She made a copy and was able to edit the copy.

17.06.2025 18:35 - 👍 2    🔁 0    💬 0    📌 0

"On ne se trump [sic] pas; Il n'y plus de rois."

14.06.2025 14:24 - 👍 15    🔁 2    💬 0    📌 0

#ThanksBrett for your central role in building up the most exciting and generous academic field of the last 25 years! The Office of Digital Humanities has been a tremendously positive force. @brettbobley.bsky.social

13.06.2025 18:04 - 👍 38    🔁 9    💬 1    📌 0

Thanks so much to Brett and Jen and the rest of the odh team. It makes me sad to think of the loss of such skill… and I will miss hearing dc punk stories from Brett.

13.06.2025 19:13 - 👍 9    🔁 1    💬 0    📌 0

Wow, I know @brettbobley.bsky.social is quite the chef extraordinaire, but I had no idea that he makes his own kimchi! Also, @jenserventi.bsky.social - you may break open a sealed bag of kimchi to eat near me anytime (and be sure to bring enough to share)!

13.06.2025 20:02 - 👍 7    🔁 1    💬 0    📌 0

#thanksbrett Words fail me when I try to express my gratitude and admiration for everything good that @brettbobley.bsky.social caused to happen in his many (but still too few) years at @nehgov.bsky.social The humanities are stronger because of what he and his colleagues accomplished.

13.06.2025 23:38 - 👍 8    🔁 2    💬 0    📌 0

Very promising early release of cultural heritage corpus for AI training with a detailed pipeline and data report. I'm happy to see that Pleias tooling contributed to it (for OCR detection).

12.06.2025 23:10 - 👍 30    🔁 8    💬 1    📌 0
STbayes: An R package for creating, fitting and understanding Bayesian models of social transmission
A critical consequence of joining social groups is the possibility of social transmission of information from conspecifics related to novel behaviours or resources. Mathematical models of spreading ha...

"STbayes: An R package for creating, fitting and understanding Bayesian models of social transmission" - A network-based-diffusion-analysis (NBDA) implemented in cmdstan it seems. Looks like it interoperates with my STRAND package for latent network inference. www.biorxiv.org/content/10.1...

12.06.2025 06:53 - 👍 46    🔁 8    💬 1    📌 0

I missed this information. Wonderful news.

12.06.2025 10:48 - 👍 1    🔁 2    💬 1    📌 0
