Matteo Di Cristofaro

@matteodic.bsky.social

Researcher in Corpus Linguistics and Digital Humanities @ UniMoRe. Corpus and Cognitive Linguist, Python & R user. Overall nerd (posts not representative of employers). Website: https://infogrep.it Online materials: https://catlism.github.io

725 Followers  |  1,166 Following  |  40 Posts  |  Joined: 11.12.2023  |  2.0698

Latest posts by matteodic.bsky.social on Bluesky

I love this!

08.08.2025 07:43 - 👍 3    🔁 1    💬 0    📌 0
Screenshot of the app showing a page from a book + different views of existing and new ocr.

Many VLM-based OCR models have been released recently. Are they useful for libraries and archives?

I made a quick Space to compare VLM OCR with "traditional" OCR using 11k Scottish exam papers from @natlibscot.bsky.social

huggingface.co/spaces/davanstrien/ocr-time-capsule

01.08.2025 15:09 - 👍 48    🔁 15    💬 4    📌 1

rvya redeemed, thanks

26.07.2025 17:10 - 👍 0    🔁 0    💬 0    📌 0

b83m redeemed, thanks a lot

26.07.2025 17:08 - 👍 1    🔁 0    💬 0    📌 0

fsdc redeemed, thanks!

26.07.2025 17:07 - 👍 1    🔁 0    💬 0    📌 0
4-panel comic. (1) [Person 1 with ponytail flanked by person with short hair and another person speaking into microphone at podium] PERSON 1: In the early 2010s, researchers found that many major scientific results couldn’t be reproduced. (2) PERSON 1: Over a decade into the replication crisis, we wanted to see if today’s studies have become more robust. (3) PERSON 1: Unfortunately, our replication analysis has found exactly the same problems that those 2010s researchers did. (4) [newspaper with image of speakers from previous panels] Headline: Replication Crisis Solved

Replication Crisis

xkcd.com/3117/

21.07.2025 23:54 - 👍 4863    🔁 651    💬 29    📌 28
How social media destroys democratic discourse, explained in 6 easy figures Where we all went wrong

I believe it is worth interrogating the fundamental forces re-shaping our information spheres away from liberal democracy towards myth, manipulation and magical thinking empowering autocracy and nihilism.

Here’s how it all falls apart - a 🧵 in 6 figures ⬇️
www.protagonist-science.com/p/how-social...

11.07.2025 14:18 - 👍 26    🔁 16    💬 2    📌 5
Regulating AI Isn’t Enough. Let’s Dismantle the Logic That Put It in Schools. AI in schools isn’t progress - it’s a sign of how far we’ve strayed from the purpose of education.

Stuffing AI into everything “isn’t just a forecast, it’s a libidinal fantasy - a capitalist dream of replacing relationships with code and scalable software, while public institutions are gutted in the name of ‘innovation.’”

06.07.2025 14:30 - 👍 180    🔁 51    💬 4    📌 6
Companies That Tried to Save Money With AI Are Now Spending a Fortune Hiring People to Fix Its Mistakes Companies that rushed to replace human labor with AI are now shelling out to have IRL workers to fix the technology's screwups.

🤷🏿‍♂️

06.07.2025 12:06 - 👍 2170    🔁 666    💬 70    📌 370

"The problem with AI isn't that it can do your job. It can't. The problem with AI is that your MBA-brained boss's boss doesn't know how your job works and thinks AI can do your job at fractions of a penny on the dollar, and hears the siren song of 'maximize shareholder value'."

MBA-brain is real.

03.07.2025 06:57 - 👍 7506    🔁 3139    💬 133    📌 181
Data interference: emojis, homoglyphs, and issues of data fidelity in corpora and their results Tokenisation - "the process of splitting text into atomic parts" (Brezina & Timperley, 2017: 1) - is a crucial step for corpus linguistics, as it provides the basis for any applicable quantitative met...

Is 😵‍💫 one token or two?
To a human, it's one. To a corpus tool, it’s often split (😵 + 💫).
And 𝙊𝙉𝙇𝙄𝙉𝙀 ≠ online.
This preprint shows how emojis & homoglyphs challenge tokenisation and distort linguistic evidence.
🔍 arxiv.org/abs/2507.01764

#Emoji #Homoglyphs #CorpusLinguistics #AcademicSky #LangSky
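
A minimal Python sketch of the two effects mentioned in the post, assuming a hypothetical tokeniser that breaks text at the zero-width joiner (U+200D) and using NFKC normalisation as one possible way to fold styled letters back to plain ones. The helper names are illustrative and are not taken from the preprint.

```python
import unicodedata

# Face with spiral eyes: one visible emoji built from three code points
# (dizzy face + zero-width joiner + dizzy symbol).
face = "\U0001F635\u200D\U0001F4AB"

# "ONLINE" written with mathematical sans-serif bold italic capitals.
styled = "\U0001D64A\U0001D649\U0001D647\U0001D644\U0001D649\U0001D640"


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def naive_tokens(text):
    """Hypothetical tokeniser that splits on the zero-width joiner."""
    return [part for part in text.split("\u200D") if part]


def fold_styled_letters(text):
    """Map compatibility variants to plain letters, then lowercase."""
    return unicodedata.normalize("NFKC", text).casefold()


print(code_points(face))            # three code points for one visible emoji
print(naive_tokens(face))           # the emoji falls apart into two tokens
print(styled == "ONLINE")           # the styled string is a different string
print(fold_styled_letters(styled))  # folded back to plain lowercase letters
```

NFKC only covers compatibility variants such as the mathematical alphabets; cross-script lookalikes (for example a Cyrillic letter standing in for a Latin one) would need a separate confusables mapping.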

03.07.2025 07:32 - 👍 10    🔁 1    💬 1    📌 0

wow, many thanks!

02.07.2025 14:11 - 👍 1    🔁 0    💬 0    📌 0
arXiv user login

Fellow academics, can anyone help with obtaining an #endorsement on arXiv?
I have a preprint I'd like to upload to Computer Science > Computation and Language (cs.CL), but need someone to endorse my account.
Here's the endorsement link: arxiv.org/auth/endorse...

#corpuslinguistics #linguistics

02.07.2025 14:06 - 👍 0    🔁 0    💬 1    📌 0

3jkl redeemed, thanks

24.06.2025 07:08 - 👍 1    🔁 0    💬 1    📌 0

y4ha claimed, thanks!

24.06.2025 07:06 - 👍 1    🔁 0    💬 1    📌 0
Memes can serve as strong indicators of coming mass violence A new study finds that surges in visual propaganda - like memes and doctored images - often precede political violence. By combining AI with expert analysis, researchers tracked manipulated content leading up to Russia’s invasion of Ukraine, revealing early warning signs of instability.

Memes can serve as strong indicators of coming mass violence

15.06.2025 18:22 - 👍 2    🔁 1    💬 0    📌 1
Finally, a Replacement for BERT: Introducing ModernBERT We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Finally, a Replacement for BERT (Blog about ModernBert)

huggingface.co/blog/modernb...

08.06.2025 10:17 - 👍 1    🔁 1    💬 0    📌 0
Scientific Programme – Summer school digital humanities

📒 The scientific program is out!

Click on the link below to have a look at the speakers and the workshops of our Summer School!

⬇️⬇️⬇️

www.summerschooldigitalhumanities.unimore.it/2025-edition...

23.05.2025 14:11 - 👍 2    🔁 2    💬 0    📌 0

Our Summer School is beginning now with the institutional greetings.

03.06.2025 07:32 - 👍 0    🔁 1    💬 0    📌 0

Happy Graduation to all students and staff!

29.05.2025 10:16 - 👍 2    🔁 0    💬 1    📌 0

Postdoc position open in Zurich -- Prof. Martin Tomasik and I have a joint SNF project on interpretable neural network approaches for large scale, complex item / temporal structure, online learning / cognitive development data.

Please retweet.

tinyurl.com/PostdocGNNSNF

28.05.2025 11:16 - 👍 23    🔁 19    💬 0    📌 1

xsds redeemed.
now I just need to finish packing and I'm ready for the woods!

16.05.2025 16:28 - 👍 0    🔁 0    💬 0    📌 0

fy8n redeemed, perfect scent!

16.05.2025 16:26 - 👍 0    🔁 0    💬 0    📌 0

3eyg redeemed.
the healing begins

16.05.2025 16:24 - 👍 0    🔁 0    💬 0    📌 0
P hacking - Five ways it could happen to you Some data practices can lead to statistically dubious findings. Here’s how to avoid them.

⏹️ Ending the experiment too early
🎯 Running experiments until you get a hit
🍒 Cherry-picking your results
🔧 Tweaking your data
➗ Not adjusting for multiple comparisons

www.nature.com/articles/d41...
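
The last item on that list can be illustrated with a small Python sketch. It assumes the Bonferroni correction as one possible adjustment (the linked article may recommend others), and the p-values are hypothetical numbers chosen only for illustration.

```python
def family_wise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every test uses the same unadjusted threshold alpha."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that stay significant after dividing alpha
    by the number of comparisons (Bonferroni correction)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five comparisons on the same data set.
p_values = [0.004, 0.04, 0.03, 0.002, 0.2]

print(round(family_wise_error_rate(len(p_values)), 4))
print([p <= 0.05 for p in p_values])     # unadjusted decisions
print(bonferroni_significant(p_values))  # adjusted decisions
```

With five comparisons the unadjusted threshold lets four of the hypothetical results through, while the adjusted threshold of 0.05 / 5 keeps only two. Bonferroni is deliberately conservative; less strict procedures exist.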

09.05.2025 15:44 - 👍 3    🔁 1    💬 0    📌 0

Academics and Universities have got to formulate a coherent approach to AI and guide their students. Cries of despair and digitally illiterate pronouncements will not reverse the effects of technical innovations.

09.05.2025 06:22 - 👍 8    🔁 2    💬 0    📌 0
Reproducibilitea – Designing a Good Research Practice Roadmap
YouTube video by Edinburgh ReproducibiliTea

Designing a Good Research Practice Roadmap
- a presentation by Fiona Ramage @fionar.bsky.social
youtu.be/dnzRQPOxz1o?...

The event was organised by Edinburgh ReproducibiliTea

29.04.2025 13:15 - 👍 7    🔁 6    💬 0    📌 0

With Canada’s election just days away, the continued appearance of deepfake ads reveals a serious flaw in Meta’s ad-review system. If fraudsters can repeatedly bypass detection, it suggests the platform’s current safeguards are not equipped to catch even basic forms of manipulation...

26.04.2025 12:27 - 👍 20    🔁 13    💬 2    📌 0
4chan Is Dead. Its Toxic Legacy Is Everywhere It’s likely that there will never be a site like 4chan again. But everything now - from X and YouTube to global politics - seems to carry its toxic legacy.

"Twitter became 4chan, then the 4chanified Twitter became the United States government. Its usefulness as an ammo dump in the culture war was diminished when they were saying things you would now hear every day on Twitter," @bencollins.bsky.social told WIRED.

22.04.2025 15:30 - 👍 464    🔁 114    💬 5    📌 4

Delighted to share my newly published review of "Data Analytics for Discourse Analysis with Python" (Tay 2024). The work addresses urgent, discipline-wide concerns in #linguistics, and writing about it was both a privilege and a joy.

authors.elsevier.com/a/1kyHu1L-nh...

22.04.2025 15:32 - 👍 4    🔁 1    💬 0    📌 0

@matteodic is following 20 prominent accounts