Screenshot of an app showing an image from a page + model reasoning showing how the model is parsing the text and layout.
What if OCR models could show you their thought process?
NuMarkdown-8B-Thinking from NuMind (YC S22) doesn't just extract text - it reasons through documents first.
Could be pretty valuable for weird historical documents?
Example here: davanstrien-ocr-time-capsule.static.hf.space/index.html?d...
07.08.2025 15:16
A dot plot titled "Anthropomorphized Animals in Popular Children's Books (*Animals That Appear in 10+ Books)" showing the proportion of animals depicted with gendered pronouns. Animals toward the left side are more often represented as male (he/him), and those toward the right are more often represented as female (she/her). Birds, ducks, and cats lean female. Bears, monkeys, dogs, elephants, foxes, wolves, and frogs lean male. Each animal is represented by a colorful, illustrated face.
Screenshot of Publishers Weekly article titled "The Sneaky Gender Bias in Picture Books: Animal Characters" that includes photo of the author, a woman with brown hair and glasses. Text reads: "Melanie Walsh is an assistant professor in the Information School and an adjunct assistant professor in the English department at the University of Washington. She uses data to analyze contemporary culture, especially literature and publishing. She is currently at work on a book, When Postwar American Fiction Went Viral: Protest, Profit, and Popular Readers in the 21st Century, which follows the surprising social media afterlives of five iconic American authors. Here she shares her investigations into the subtle gender imbalance often at play in picture books featuring animal characters.
I recently published a data analysis with The Pudding, a digital publication known for data-driven storytelling, about animal characters in picture books. We read approximately 300 popular English-language picture books from the past 70+ years and noted the gender of any anthropomorphized animal character that was important to the story.
We found that male animal characters were twice as common as female characters across all the books. Some strong animal stereotypes also emerged: frogs and dogs were boys; birds and cats were girls. Even more surprising, according to our data: this disparity is not obviously improving, even over the last 25 years."
For PW, I wrote about the persistent gender gap in fictional animal characters - a pattern I noticed while analyzing 100s of picture books with @puddingviz.bsky.social.
It's a more interesting (and pervasive) problem than I first thought.
#kidlit #booksky
Link: www.publishersweekly.com/pw/by-topic/...
05.08.2025 23:29
I'm no philosopher of science, but you might go far with, "Amateurs talk prediction; professionals talk measurement."
05.08.2025 14:48
Good luck. Speed is one thing people in psycholinguistics and human sentence processing measure when they look at people reading different passages. Do your eyes saccade back when you think, WHAT did I just read?
05.08.2025 13:15
The baseline you might try is entropy rate. Start with Temperley's old work on surprisal in music. Then get more complex from there, looking at the human sentence processing literature.
05.08.2025 01:23
What are your favorite recent papers on using LMs for annotation (especially in a loop with human annotators), synthetic data for task-specific prediction, active learning, and similar?
Looking for practical methods for settings where human annotations are costly.
A few examples in thread.
23.07.2025 08:10
A big, beautiful, complex conceptual model of how a "story" is manifested in various versions, languages, (parts of) witnesses, etc. by @katkel.bsky.social & @jbcamps.bsky.social.
A delicious, nutritious alphabet soup of CRM-CIDOC, FRBR, LRMOO (look 'em all up!)
#DH2025
17.07.2025 15:45
Congratulations! The least they could do if you have to be chair.
05.07.2025 14:30
gibbon-dio.txt
I sent DeepSeek into an infinite spiral of self-doubt by asking about Edward Gibbon's use of Cassius Dio. After 850 paragraphs, it had concluded that "Gibbon was born in Massachusetts", and I cut it off. Has anyone else seen traces this long?
gist.github.com/dasmiq/9c415...
02.07.2025 17:30
NEMI 2024 (Last Year)
🚨 Registration is live! 🚨
The New England Mechanistic Interpretability (NEMI) Workshop is happening Aug 22nd 2025 at Northeastern University!
A chance for the mech interp community to nerd out on how models really work 🧠🤖
Info: nemiconf.github.io/summer25/
Register: forms.gle/v4kJCweE3UUH...
30.06.2025 22:55
We have developed an AI-powered tool to support your research ideation process. We're inviting researchers to test our tool and share their feedback.
#aiforscience #LLM #research #ideas
26.06.2025 20:12
AI, Authorship, and the Public Interest - Project Update and Call for Grant Proposals
Authors Alliance is pleased to announce the availability of research grants of up to $20,000 to support research projects at the intersection of artificial intelligence, copyright law, and the publ…
FUNDING OPPORTUNITY: @authorsalliance.bsky.social, with the support of @knightfoundation.org, is offering grants of up to $20,000 to catalyze research projects at the intersection of artificial intelligence, copyright law, and the public interest.
24.06.2025 13:41
Words of Warmth: Trust and Sociability Norms for over 26k English Words
In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word–warmth (as well as word–trust and word–sociability) associations for over 26k English words.
arxiv.org/html/2506.03...
25.06.2025 14:36
The impact of language models on the humanities and vice versa
Nature Computational Science - Many humanists are skeptical of language models and concerned about their effects on universities. However, researchers with a background in the humanities are also...
New this morning, a Comment I contributed to Nature Computational Science on the interaction between large language models and the humanities. 🧪 🤖 #MLSky
rdcu.be/etk07
The link above will be open-access for a month - plus, I'll reply to this post with a link to a permanently open preprint. +
25.06.2025 12:58
Nikhil's recent paper is a tour de force in causal analysis! They show that LLMs keep track of what characters know in a story using "pointer" mechanisms. Definitely worth checking out.
24.06.2025 17:48
CAIRe: Cultural Attribution of Images by Retrieval-Augmented Evaluation
As text-to-image models become increasingly prevalent, ensuring their equitable performance across diverse cultural contexts is critical. Efforts to mitigate cross-cultural biases have been hampered b...
Arnav Yayavaram and Siddharth Yayavaram were the main contributors to this project and built an awesome, clean, easy-to-use codebase that's up on GitHub now! I have found this resource to be enabling for my own work. @simi97k.bsky.social was the main mentor.
Read now!
arxiv.org/abs/2506.09109
20.06.2025 23:02
It is queer how many publishers do not know the meaning of the word "copyright." The Club Woman is copyrighted, which means that no other publisher has a right to reprint articles from this magazine without giving full credit to THE CLUB WOMAN. But we are continually coming across articles and paragraphs taken bodily from our columns and reprinted as original in some other periodical. In one case a Boston paper did this and the articles which was "lifted" (to put it politely) has been copied far and wide, with credit to the "lifting" publication - one which, by the way, stands for the best and highest advancement of woman!
A magazine from 1897 gives an unusual definition of copyright: no copying without attribution. CC BY avant la lettre. #ViralTexts
20.06.2025 20:53
Yes, my wife ran into this just last week. She made a copy and was able to edit the copy.
17.06.2025 18:35
"On ne se trump [sic] pas; Il n'y plus de rois."
14.06.2025 14:24
#ThanksBrett for your central role in building up the most exciting and generous academic field of the last 25 years! The Office of Digital Humanities has been a tremendously positive force. @brettbobley.bsky.social
13.06.2025 18:04
Thanks so much to Brett and Jen and the rest of the odh team. It makes me sad to think of the loss of such skill… and I will miss hearing dc punk stories from Brett.
13.06.2025 19:13
Wow, I know @brettbobley.bsky.social is quite the chef extraordinaire, but I had no idea that he makes his own kimchi! Also, @jenserventi.bsky.social - you may break open a sealed bag of kimchi to eat near me anytime (and be sure to bring enough to share)!
13.06.2025 20:02
#thanksbrett Words fail me when I try to express my gratitude and admiration for everything good that @brettbobley.bsky.social caused to happen in his many (but still too few) years at @nehgov.bsky.social The humanities are stronger because of what he and his colleagues accomplished.
13.06.2025 23:38
Very promising early release of cultural heritage corpus for AI training with a detailed pipeline and data report. I'm happy to see that Pleias tooling contributed to it (for OCR detection).
12.06.2025 23:10
I missed this information. Wonderful news.
12.06.2025 10:48
The official account for the peregrine falcons nesting at UMass Amherst.
associate prof at UMD CS researching NLP & LLMs
Comics creating cognitive (neuro)scientist at Tilburg University studying language, brains, comics, emoji & multimodality (he/him). 😮‍💨🫠🫥🥹🫨
www.visuallanguagelab.com
developing tools, data, and machine learning methods to discover new bibliographical evidence in early printed books
god created him and demanded that he die
Deputy editor at Foreign Policy, China nerd, gaming nerd, reads a lot
Librarian, toddler mom, living in Gainesville, FL. Digital humanities, grants, copyright, library publishing, reproductive health, paper and fiber crafts.
computational social scientist
Visiting Researcher at Meta Superintelligence Labs and PhD student at Northeastern. Organizer at the Trustworthy ML Initiative (trustworthyml.org). s&p in language models + mountain biking.
jaydeepborkar.github.io
I make sure that OpenAI et al. aren't the only people who are able to study large scale AI systems.
Professor | School of Information | U. of Texas at Austin. Amazon Scholar | Last Mile. Co-Director | NSF-Simons AI Institute for Cosmic Origins (CosmicAI). Leadership | UT Good Systems (Responsible AI). PI | Protecting Information Integrity.
Associate Professor in CS @ Georgia Tech
NLP/ML researcher
https://cocoxu.github.io/
VP and Distinguished Scientist at Microsoft Research NYC. AI evaluation and measurement, responsible AI, computational social science, machine learning. She/her.
One photo a day since January 2018: https://www.instagram.com/logisticaggression/
UZH Computational Linguistics
Professor at UW; Researcher at Meta. LMs, NLP, ML. PNW life.
Professor for AI at Hasso Plattner Institute and University of Potsdam
Berlin (prev. Rutgers NJ USA, Tsinghua Beijing, Berkeley)
http://gerard.demelo.org
Linguist and musician
https://blogs.umass.edu/pater/
http://www.derailleursmusic.com
PhD student doing LLM interpretability with @davidbau.bsky.social and @byron.bsky.social. (they/them) https://sfeucht.github.io