@afedercooper.bsky.social
ML researcher, Stanford postdoc affiliate, future Yale professor https://afedercooper.info

(lucky for everyone that I'm too lazy to write a blog post)
28.01.2026 07:02

Yes, I have published at that track before, and related ones. But I'm not eager to again. Getting into that is maybe worth a blog post.
28.01.2026 05:51

No I did not write/submit this paper to the ICML position paper track. Like many (but of course not all) papers submitted there, I think this is at most a blog post (where "at most" is a very generous upper bound, because the ~300 characters above almost certainly are enough).
28.01.2026 05:48

Position: ML conferences should consider removing the position paper track
(...and just acknowledge that every scientific paper is articulating at least one position)
(This is all to say, I've been shocked at some of what I've heard coming out of industry. My assumption used to be that they knew a lot more about this than they seem to.)
25.01.2026 21:17

I think partially yes. There definitely are full-time applied and research people working on data curation as a topic. But there are a ton of gaps/things that might seem surprising here. E.g., making corpus-level decisions doesn't always tell you much about the underlying training data examples.
25.01.2026 21:15

Am also concerned about this, but it's not clear to me that companies even know everything that's included. I suppose "use it all" is an editorial decision, though.
25.01.2026 20:44

I just had a paper I reviewed months ago be "desk rejected" by ICLR for this reason. (It's arguably not a desk rejection after 3 reviewers already chimed in.) But this seems to be where things are headed.
24.01.2026 19:00

Even if chucking the papers outright is undesirable (hallucination checkers are not error-free), I'm disappointed there's no process at all other than "oops, you can go fix it if you care to."
24.01.2026 06:43

(though going forward, I wouldn't be sad if I had a bit more compute)
21.01.2026 18:19

One of my favorite responses to questions about compute in my work this year is "it's expensive, yes, but I had to develop some efficient algos and write some efficient code to make this possible. This work was done at odd hours on 4 A100s shared by a dozen people."
21.01.2026 18:18

note that i said "ML" and "copyright," which are very specific things that i actually think have very little to do with the anger i'm referring to
14.01.2026 00:27

it's hard to work at the intersection of ML and copyright because "both sides" of the debate are angry and, in my experience, most haven't done much of the background reading in ML or copyright to have an informed opinion. it's just vibes and anger. i should probably write something up about this.
14.01.2026 00:26

got to experience the "I did not write that headline" phenomenon firsthand
The article: "Correctly scoping a legal safe harbor for A.I.-generated child sexual abuse material testing is tough."
The headline: "There's One Easy Solution to the A.I. Porn Problem"
After twelve years of work, the world's most beautiful subway station has been inaugurated in Rome: Colosseo, an underground archaeological museum.
13.01.2026 05:07

It's been quite the experience seeing the responses to this work (across the spectrum). I've been working in this area since 2020 & am very grateful to have amazing collaborators + mentors who've supported me along the way (only a few on bsky) @pamelasamuelson.bsky.social @zephoria.bsky.social
12.01.2026 19:57

our research on memorization and copyright (with @jtlg.bsky.social) from 2024: scholarship.kentlaw.iit.edu/cklawreview/vol100/iss1/9/
12.01.2026 19:57

our research (with @marklemley.bsky.social) from May on open-weight LLMs like Llama 3.1 70B: arxiv.org/abs/2505.12546
12.01.2026 19:57

For those interested in the details:
our recent work on production LLMs like Claude 3.7 Sonnet: arxiv.org/abs/2601.02671
The Atlantic posted an article about memorization and generative AI, and it mentions our work on extraction of books from production LLMs and open-weight models.
www.theatlantic.com/technology/2...
The referenced work reflects research with @marklemley.bsky.social @jtlg.bsky.social and others.
Happy you found our work interesting! Linking to the open-weight model extraction paper @marklemley.bsky.social was referring to:
arxiv.org/abs/2505.12546
(Indexing on the word "often")
11.01.2026 22:52

important disclaimer that our research (and the other papers referenced in this article) don't really capture whether they "often just repeat what they have seen elsewhere"
11.01.2026 22:51

Me too. Like every time I want to move on I get sucked back in.
11.01.2026 21:28

E.g., there's an information-theoretic sense where the database analogy is correct, but it'll be entirely misunderstood if we go that route bc of common perceptions of what a database is. And so that runs more risk than it's worth imo, since the goal here is wider understanding / conceptual clarity.
11.01.2026 21:28

Will update you on what we come up with for that law paper that's due in June. I truly don't know how to do this yet.
11.01.2026 21:27

Have also tried to communicate this to law colleagues without math:
scholarship.kentlaw.iit.edu/cklawreview/...
In general I'll just refer to our way too detailed paper on open-weight extraction of books. Tried really hard for completeness here. The main paper is ~20 pages but a lot of context and synthesis and results are in the ~150-page appendix
arxiv.org/abs/2505.12546
Like the struggle to come up with good terms and analogies is very real
11.01.2026 21:21

The terminology is terrible. But it's also terrible because this stuff might seem simple at face value, but deeper understanding is quite complicated. (This is why I still work in this area; it hits at fundamental questions of what machine learning even means)
11.01.2026 21:21