I understand what the underlying probabilities mean, and therefore why this was worth giving a go. But I’m still occasionally like “How tf can someone extract entire books from a frontier company’s flagship LLM? Like we got _all_ of HP 1 with just ‘Mr. and Mrs. D’ as the seed prompt? What??”
25.07.2025 16:23 — 👍 3 🔁 0 💬 0 📌 0
more generally x \in {scary, splashy, hypey, over-broad, …}
24.07.2025 20:01 — 👍 0 🔁 0 💬 0 📌 0
Yeah I’m not commenting on that. Just saying that memorization isn’t always instantiated as a verbatim metric. It just often is because of cost.
24.07.2025 02:07 — 👍 1 🔁 0 💬 0 📌 0
Memorization doesn’t have to be verbatim. We just often measure it that way in practice for research papers because (thus far) it is a lot more expensive to measure non-verbatim stuff.
24.07.2025 01:54 — 👍 0 🔁 0 💬 1 📌 0
Updates***, not results. I’ve become a parody of a researcher.
21.07.2025 01:46 — 👍 3 🔁 0 💬 0 📌 0
Had a great time and learned a ton at ICML. But as an introvert, I’ve used up all my talking budget until the fall. Excited to get back to full time researchy things, and will hopefully have some exciting new results to share soon!
21.07.2025 01:42 — 👍 4 🔁 0 💬 1 📌 0
Strangers love to tell me “I can’t understand you, because of your MASK”. Dude, I am literally someone who gets paid to speak to large audiences while wearing a mask—I know I can be understood!
17.07.2025 18:14 — 👍 150 🔁 12 💬 14 📌 3
Happening now! Please swing by to talk about measurement!
16.07.2025 18:29 — 👍 1 🔁 0 💬 0 📌 0
Excited to be at #ICML '25! Please reach out if you'd like to chat. You can also find me presenting work at a few different spots, listed below!
16.07.2025 00:43 — 👍 2 🔁 0 💬 2 📌 0
So pumped!! (Poster session starts at 11 PT, for those that want to swing by early!)
15.07.2025 20:16 — 👍 1 🔁 0 💬 1 📌 0
Feeling so excited + grateful to be representing this paper at #ICML! Please stop by to talk about how to do more valid measurement for evaling gen AI systems!
Work led by the incomparable @hannawallach.bsky.social and @azjacobs.bsky.social as a part of Microsoft’s AI and Society initiative!!
15.07.2025 20:15 — 👍 9 🔁 2 💬 0 📌 0
I love this paper
14.07.2025 19:47 — 👍 2 🔁 0 💬 0 📌 0
I’ll be at ICML in Vancouver this week giving talks at a couple of workshops about this paper:
🔸Saturday 7/19 10:30am invited talk at the MemFM workshop, West Meeting Room 223-224
🔸Saturday 7/19 11:40am oral at the R2-FM workshop, West Ballroom C
Please reach out if you’d like to meet up!
8/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 0 📌 0
This finding doesn’t suggest that this type of memorization is an inherent property of LLMs, nor that it’s a necessary outcome of LLM training.
But our work does raise a lot of new research questions. And in the short term, we have a lot more experiments to run on more models and more books.
7/8
14.07.2025 14:34 — 👍 1 🔁 0 💬 1 📌 0
Yes, I’m confident this would work for other books (among the 50 books we’ve studied so far) with Llama 3.1 70B. I think it'd also work for Llama 3 70B. No, I haven’t yet seen strong evidence that we could do this with other models of the same size class + similar quality (e.g., DeepSeek v1 67B).
6/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 1 📌 0
In my mind, this was something that should follow from what our paper already showed.
But I appreciate that this kind of (effectively complete) reconstruction “feels” different than measuring memorization with 50 token prompts and 50 token suffixes.
5/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 1 📌 0
Using just the first line of chapter 1 (60 tokens), we can deterministically generate a near-exact copy of the entire ~300 page book (!!!).
(~300 book-length pages of basically no diff! Cosine similarity of 0.9999; greedy approx. of word-level LCS of 0.992)
4/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 1 📌 0
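(Editor's aside: the post above reports a "greedy approx. of word-level LCS." The paper's exact implementation isn't shown here, but a minimal sketch of one such greedy approximation, using `difflib`'s heuristic matcher in place of exact O(n·m) LCS so it stays tractable for book-length texts, might look like:)

```python
from difflib import SequenceMatcher

def word_lcs_similarity(a: str, b: str) -> float:
    """Greedy approximation of word-level LCS similarity in [0, 1].

    Splits both texts into words and sums the sizes of difflib's
    greedy matching blocks (a heuristic stand-in for exact LCS),
    then normalizes by the longer text's word count.
    """
    wa, wb = a.split(), b.split()
    if not wa or not wb:
        return 0.0
    matched = sum(
        block.size
        for block in SequenceMatcher(None, wa, wb).get_matching_blocks()
    )
    return matched / max(len(wa), len(wb))
```

(A score near 0.992 on ~300 pages, as reported above, would mean almost every word of the generation aligns with the ground-truth text in order.)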
With the degree of memorization we observed for Llama 3.1 70B on some books, it’s trivial to generate large contiguous segments of those books using a single seed prompt of ground-truth text. We illustrate this for Harry Potter and the Sorcerer’s Stone.
3/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 1 📌 0
Memorization of training data in LLMs is hard to understand. This is why extraction is so viscerally powerful: it reproduces the memorized data (near-)verbatim at generation time. You can’t unsee it once it’s decoded right in front of you.
2/8
14.07.2025 14:34 — 👍 0 🔁 0 💬 1 📌 0
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) h…
"Llama 3.1 70B memorizes some books, like Harry Potter & the Sorcerer's Stone and 1984, almost entirely. ... HP is so memorized that, using a seed prompt consisting of just the first line of chapter 1, we can deterministically generate the entire book near-verbatim."
papers.ssrn.com/sol3/papers....
10.07.2025 19:06 — 👍 6 🔁 4 💬 0 📌 1
This opinion is a reminder that these cases are not general-purpose referenda on AI policy; they are hyper-technocratic copyright cases. Copyright draws lots of unsatisfying and counterintuitive distinctions, which is why you should hire and listen to copyright lawyers on the front end.
24.06.2025 18:53 — 👍 42 🔁 7 💬 1 📌 1
“these are hypertechnocratic” is one of the most important things you can draw from this morning’s ruling. In other words, hesitate before drawing parallels between this case and your most (loved|hated) AI training use case.
(@chup.blakereid.org’s whole thread is great)
24.06.2025 23:52 — 👍 9 🔁 1 💬 0 📌 0
i'm also happy that new york can still (positively) surprise me. haven't felt that was true for a while.
25.06.2025 04:16 — 👍 0 🔁 0 💬 0 📌 0