Memorandum & Opinion – #846 in In Re: OpenAI, Inc. Copyright Infringement Litigation (S.D.N.Y., 1:25-md-03143) – CourtListener.com
OPINION & ORDER RE: OPENAI'S DELETION OF BOOKS1 AND BOOKS2 DATASETS AND PRIVILEGE RULINGS (ECF NOS. 413, 428, 479, 481, 504, 505, 615, 616) re: (226 in 1:24-cv-00084-SHS-OTW, 413 in 1:25-md-03143-SHS-...
People had speculated on what went into ‘books2’ for years. Now we know; pirated books powered the hugely successful launch of ChatGPT.
IMO this should be mentioned every time ChatGPT is discussed.
storage.courtlistener.com/recap/gov.us...
/end
15.02.2026 08:22 — 👍 80 🔁 46 💬 1 📌 2
GPT-3.5 was the model used when ChatGPT was launched in November 2022. It was the default free-tier model for 1½ years, until July 2024.
4/5
15.02.2026 08:22 — 👍 24 🔁 3 💬 1 📌 0
The judge said that a lawyer for OpenAI admitted that the pirate library LibGen was the source of a dataset they called ‘books2’, and that this dataset was used to train GPT-3 and GPT-3.5.
3/5
15.02.2026 08:22 — 👍 23 🔁 4 💬 2 📌 1
It came out in one sentence, buried halfway down an opinion from the judge in Authors Guild et al. v. OpenAI.
2/5
15.02.2026 08:22 — 👍 22 🔁 3 💬 2 📌 0
When OpenAI released ChatGPT, it was trained on pirated books.
This finally came to light in a lawsuit three months ago, but it has gone essentially unreported.
🧵 1/5
15.02.2026 08:22 — 👍 108 🔁 75 💬 2 📌 5
MPT-7B-StoryWriter-65k+ (Databricks)
Cerebras-GPT (Cerebras)
CodeGen (Salesforce)
And these are just the ones where this info has come out in court or is traceable through academic papers. It’s safe to assume there are many more.
The AI industry was built on piracy.
2/2
14.02.2026 09:17 — 👍 51 🔁 13 💬 0 📌 0
Models we know were trained on pirated books (a non-exhaustive list):
GPT-3 (OpenAI)
GPT-3.5 (OpenAI)
Claude 1 (Anthropic)
Claude 2 (Anthropic)
Llama 1 (Meta)
Llama 2 (Meta)
Llama 3 (Meta)
OpenELM (Apple)
BloombergGPT (Bloomberg)
Megatron-Turing NLG 530B (Microsoft & Nvidia)
🧵 1/2
14.02.2026 09:17 — 👍 108 🔁 40 💬 3 📌 2
Yes! Thrilled to hear this.
13.02.2026 11:17 — 👍 1 🔁 0 💬 0 📌 0
Agreed. Some authors are not able to sue again, as they have settled - but the settlement only covers ~500,000 books. There should still be lots of people who can sue.
13.02.2026 10:37 — 👍 2 🔁 0 💬 1 📌 0
Whatever your view of its legality, it’s pretty clear that it sucks for authors, letting Anthropic make money at their expense. Authors should get paid when their books are used to train AI, and should have the chance to say no to that training. Anthropic’s used books strategy gives them neither.
13.02.2026 09:51 — 👍 22 🔁 4 💬 2 📌 0
As Judge Chhabria said in Kadrey v. Meta, LLMs will likely compete with the books they are trained on by flooding the market. And as Dario Amodei himself said in 2021, big AI companies centralizing profits by training on books without the authors getting paid is a real concern.
7/n
13.02.2026 09:51 — 👍 16 🔁 2 💬 3 📌 0
You can’t scan it and sell it as an ebook, for instance. There are limits on what you can do with books you’ve bought, where what you would be doing would compete with the book’s rights holders.
6/n
13.02.2026 09:51 — 👍 22 🔁 1 💬 2 📌 0
Anthropic uses their huge war chest to get all the books in the world (that’s their aim) - authors get nothing.
IMO there are serious questions over whether this should be legal. Yes, they are buying the books. But you can’t just do anything you like with a book once you’ve bought it.
5/n
13.02.2026 09:51 — 👍 22 🔁 4 💬 1 📌 0
Before alighting on this plan, they were discussing licensing from book publishers, which would have meant money going to authors. But then they came up with the used books plan, and stopped all licensing discussions.
4/n
13.02.2026 09:51 — 👍 13 🔁 2 💬 1 📌 0
They called this Project Panama, using a codename because they didn’t want people to know they were doing it. (It ultimately came out through court documents.)
3/n
13.02.2026 09:51 — 👍 14 🔁 1 💬 1 📌 0
They spent tens of millions of dollars buying used books from wholesalers, in batches of tens of thousands at a time. These were shipped to Illinois, scanned, and pulped.
2/n
13.02.2026 09:51 — 👍 13 🔁 2 💬 1 📌 0
When Anthropic stopped training on books that were literally pirated, they managed to hit on the one way of buying books that means no money goes to authors: buying used books.
🧵 1/n
13.02.2026 09:51 — 👍 69 🔁 40 💬 3 📌 8
He has never faced criminal charges, and, as a co-founder of Anthropic, must now be extremely wealthy (it is known that many of his co-founders are billionaires).
/end
10.02.2026 10:31 — 👍 15 🔁 3 💬 0 📌 0
In 2019 & 2021, Ben Mann downloaded at least 5 million books from pirate libraries. He did so while working at OpenAI and Anthropic; the books were downloaded for the purposes of training AI models at those companies, two of the most successful commercial companies in recent history.
3/4
10.02.2026 10:31 — 👍 13 🔁 3 💬 1 📌 0
He was indicted on 13 felony counts, facing a maximum of 95 years in prison and $3 million in fines. Two years after his arrest, awaiting trial, he hanged himself.
2/4
10.02.2026 10:31 — 👍 8 🔁 2 💬 1 📌 0
In 2010, Aaron Swartz downloaded 4.8 million articles from JSTOR. We don't know why - it may have been to make them freely available online (if so, he never did, as he was caught). He made no money from this, and there is no suggestion he intended to.
🧵 1/4
10.02.2026 10:31 — 👍 23 🔁 9 💬 1 📌 3
Whatever today’s models were trained on, billions of dollars were raised off the back of models trained on pirated books, and those books were downloaded by some of the most important researchers at the companies in question.
The AI industry is built on piracy.
/end
09.02.2026 17:17 — 👍 12 🔁 1 💬 0 📌 1
And he went further - he said he understood that LibGen had also been downloaded by Alec Radford at OpenAI. Radford led the development of GPT-1 and GPT-2.
3/4
09.02.2026 17:17 — 👍 8 🔁 1 💬 1 📌 0
He also downloaded another pirate library, Books3, the very month Anthropic was founded. It was used to train Claude 1 & Claude 2.
He *also* said he had previously downloaded LibGen when he was at OpenAI. OpenAI used books from LibGen to train GPT-3 and GPT-3.5.
2/4
09.02.2026 17:17 — 👍 7 🔁 1 💬 1 📌 0
Brave New World? Justice for creators in the age of Gen AI - ISM
This ground-breaking report calls for protection of our creative industries and workforce in a time of industrial-scale theft from generative AI.
Generative AI competes with the work it is trained on, and the people behind that work. Every single new piece of evidence backs this up.
The theft must be stopped - government must step in.
ism.org/brave-new-world
/end
02.02.2026 12:09 — 👍 26 🔁 7 💬 0 📌 1
I am genuinely astonished by how terrible this 'AI Skills Hub' is.
For the record, this cost *£4.1 million*.
There are no words.
/end
29.01.2026 14:57 — 👍 9 🔁 2 💬 2 📌 0
... and the similarly huge number of courses from big tech companies that are only useful if *checks notes* you have paid for their products.
29.01.2026 14:57 — 👍 7 🔁 0 💬 1 📌 0
All of this is before you get to the absolutely huge number of courses that cost hundreds of pounds to enroll in...
29.01.2026 14:57 — 👍 5 🔁 0 💬 1 📌 0