Cooper

@afedercooper.bsky.social

mischief executive officer https://afedercooper.info

479 Followers 213 Following 238 Posts Joined Jun 2023
4 days ago

okay there is actually one more, and it is also about copyright. but it's a law review paper. this is my last ML* memorization paper (at least for a while)

4 days ago

i'm writing my last memorization paper (i say for the 10th time), and hopefully what is my last first author paper for a bit. this one also isn't about copyright. i'm excited to start thinking about other things.

if anyone is looking for a new research buddy, i'm down to clown.

1 week ago

please let me know if you ever want to chat about any of this. I can’t promise I have anything useful to say, but I do have plenty to say about this. And am of course always around to listen.

2 weeks ago
Microsoft deletes blog telling users to train AI on pirated Harry Potter books
The now-deleted Harry Potter dataset was "mistakenly" marked public domain.

like sometimes life is art, specifically an absurdist Beckett play

arstechnica.com/tech-policy/...

2 weeks ago

someone sent me this from the other place and this timeline really is something else

1 month ago

Not a perfect fit to the exact query I don't think, but I like this note as a starting place: lawreview.uchicago.edu/sites/defaul...

@jackbalkin.bsky.social

1 month ago

(lucky for everyone that I'm too lazy to write a blog post)

1 month ago

Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.

1 month ago

No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).

1 month ago

Position: ML conferences should consider removing the position paper track

(...and just acknowledge that every scientific paper is articulating at least one position)

1 month ago

(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)

1 month ago

I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.

1 month ago

Am also concerned about this, but it’s not clear to me that companies even know everything that’s included. I suppose “use it all” is an editorial decision, though.

1 month ago

I just had a paper I reviewed months ago be “desk rejected” by ICLR for this reason. (It’s arguably not a desk rejection after 3 reviewers already chimed in.) But, this seems to be where things are headed.

1 month ago

Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."

1 month ago

(though going forward, I wouldn’t be sad if I had a bit more compute 🙃)

1 month ago

One of my favorite responses to questions about compute in my work this year is “it’s expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people.”

2 months ago

note that i said “ML” and “copyright,” which are very specific things that i actually think have very little to do with the anger i’m referring to

2 months ago

it’s hard to work at the intersection of ML and copyright because “both sides” of the debate are angry and, in my experience, most haven’t done much of the background reading in ML or copyright to have an informed opinion. it’s just vibes and anger. i should probably write something up about this.

2 months ago

got to experience the "I did not write that headline" phenomenon firsthand

The article: "Correctly scoping a legal safe harbor for A.I.-generated child sexual abuse material testing is tough."

The headline: "There's One Easy Solution to the A.I. Porn Problem"

2 months ago

After twelve years of work, the world’s most beautiful subway station has been inaugurated in Rome: Colosseo, an underground archaeological museum.⚜️💙⚜️💙⚜️💙⚜️

2 months ago

It's been quite the experience seeing the responses to this work (across the spectrum). I've been working in this area since 2020 & am very grateful to have amazing collaborators + mentors who've supported me along the way (only a few on bsky) @pamelasamuelson.bsky.social @zephoria.bsky.social

2 months ago
The Files are in the Computer: On Copyright, Memorization, and Generative AI
By A. Feder Cooper and James Grimmelmann, published 08/15/25

our research on memorization and copyright (with @jtlg.bsky.social) from 2024: scholarship.kentlaw.iit.edu/cklawreview/vol100/iss1/9/

2 months ago
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expr...

our research (with @marklemley.bsky.social) from May on open-weight LLMs like Llama 3.1 70B: arxiv.org/abs/2505.12546

2 months ago
Extracting books from production language models
Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized dat...

For those interested in the details:

our recent work on production LLMs like Claude 3.7 Sonnet: arxiv.org/abs/2601.02671

2 months ago
AI’s Memorization Crisis
Large language models don’t “learn”—they copy. And that could change everything for the tech industry.

The Atlantic posted an article about memorization and generative AI, and it mentions our work on extraction of books from production LLMs and open-weight models.

www.theatlantic.com/technology/2...

The referenced work reflects research with @marklemley.bsky.social @jtlg.bsky.social and others.

2 months ago
Extracting memorized pieces of (copyrighted) books from open-weight language models
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expr...

Happy you found our work interesting! Linking to the open-weight model extraction paper @marklemley.bsky.social was referring to:

arxiv.org/abs/2505.12546

2 months ago

(Indexing on the word “often”)

2 months ago

important disclaimer that our research (and the other papers referenced in this article) doesn’t really capture whether they “often just repeat what they have seen elsewhere”

2 months ago

Me too. Like every time I want to move on I get sucked back in.
