Lemonhail

@lemonhail.bsky.social

Potential lottery winner

157 Followers  |  297 Following  |  70 Posts  |  Joined: 17.10.2024  |  2.0623

Latest posts by lemonhail.bsky.social on Bluesky

Meanwhile, “did women ruin the workplace?”

13.11.2025 00:13 — 👍 1897    🔁 502    💬 13    📌 6

Tweet. Character from my comic book.

12.11.2025 08:41 — 👍 2226    🔁 212    💬 12    📌 2

the new Steam Machine looks pretty kool

12.11.2025 18:59 — 👍 939    🔁 367    💬 8    📌 8

Yeah I agree

08.11.2025 11:26 — 👍 0    🔁 0    💬 0    📌 0

Mountain of shit is about to turn into gold, says king of shit mountain

06.11.2025 23:32 — 👍 7    🔁 0    💬 0    📌 0
In July 2023, The New York Times sent a notice to Common Crawl asking for the removal of previously scraped Times content. (In their lawsuit against OpenAI, the Times noted that Common Crawl includes “at least 16 million unique records of content” from Times websites.) The nonprofit seemed amenable to the request. In November of that year, a Times spokesperson, Charlie Stadtlander, told Business Insider: “We simply asked that our content be removed, and were pleased that Common Crawl complied.”

But as I explored Common Crawl’s archives, I found that many Times articles appear to still be present. When I mentioned this to the Times, Stadtlander told me: “Our understanding from them is that they have deleted the majority of the Times’s content, and continue to work on full removal.”

The Danish Rights Alliance (DRA), an organization that represents publishers and other rights-holders in Denmark, told me about a similar interaction with Common Crawl. Thomas Heldrup, the organization’s head of content protection and enforcement, showed me a redacted email exchange with the nonprofit that began in July 2024, in which the DRA requested that its members’ content be removed from the archive. In December 2024, more than six months after the DRA had initially requested removal, Common Crawl’s attorney wrote: “I confirm that Common Crawl has initiated work to remove your members’ content from the data archive. Presently, approximately 50% of this content has been removed.” I spoke with other publishers who’d received similar messages from Common Crawl. One was told, after multiple follow-up emails, that removal was 50 percent, 70 percent, and then 80 percent complete.

By writing code to browse the petabytes of data, I was able to see that large quantities of articles from the Times, the DRA, and these other publishers are still present in Common Crawl’s archives. Furthermore, the files are stored in a system that logs the modification times of every file. The foundation adds a new “crawl” to its archive every few weeks, each containing 1 billion to 4 billion webpages, and it has been publishing these regular installments since 2013. None of the content files in Common Crawl’s archives appears to have been modified since 2016, suggesting that no content has been removed in at least nine years.

In our first conversation, Skrenta told me that removal requests are “a pain in the ass” but insisted that the foundation complies with them. In our second conversation, Skrenta was more forthcoming. He said that Common Crawl is “making an earnest effort” to remove content but that the file format in which Common Crawl stores its archives is meant “to be immutable. You can’t delete anything from it.” (He did not answer my question about where the 50, 70, and 80 percent removal figures come from.)

Yet the nonprofit appears to be concealing this from visitors to its website, where a search function, the only nontechnical tool for seeing what’s in Common Crawl’s archives, returns misleading results for certain domains. A search for nytimes.com in any crawl from 2013 through 2022 shows a “no captures” result, when in fact there are articles from NYTimes.com in most of these crawls. I also discovered more than 1,000 other domains that produce this incorrect “no captures” result for at least several of the crawls, and most of these domains belong to publishers, including the BBC, Reuters, The New Yorker, Wired, the Financial Times, The Washington Post, and, yes, The Atlantic. According to my research and Common Crawl’s own disclosures, the companies behind each of these publications have sent legal requests to the nonprofit. At least one publisher I spoke with told me that it had used this search tool and concluded that its content had been removed from Common Crawl’s archives.

Common Crawl says it complies with removal requests—while telling us they are “a pain in the ass”—but also is not actually removing the data in question.

04.11.2025 12:18 — 👍 140    🔁 34    💬 3    📌 4
The Nonprofit Feeding the Entire Internet to AI Companies
Common Crawl claims to provide a public benefit, but it lies to publishers about its activities.

NEW: Common Crawl, the massive archiver of the web, has gotten cozy with AI companies and is providing paywalled articles for training data. They’re also lying to publishers who have asked for material to be removed. “The robots are people too,” CC’s exec director told us when we asked about this.

04.11.2025 12:15 — 👍 851    🔁 502    💬 24    📌 89

Can't get over this wallpaper Gainax sold in the 90s.

03.11.2025 19:59 — 👍 494    🔁 153    💬 4    📌 3
A print copy of the Onion with comic panels from "Don and Jeff: Time Pedophiles". Clockwise from top left:

Don and Jeff flee a T. Rex
Don: How was I supposed to know the Cretaceous didn't have adolescent girls?

Remodeling the Great Sphinx in Jeff's image
Don: Jeff, what did you do to the Sphinx?
Jeff (with girl in stereotypical ancient Egyptian garb): I gave all those sexy Egyptian minors a little something to look at!

Fighting Samurai 
Don (holding katana): Jeff, a little help here?
Jeff (with girl in kimono): Sorry, Don, I've got my hands full myself!

Briefly rescuing Joan of Arc
Joan: Merci, time pedophiles you saved me! How can I ever repay you?
Jeff: Have you ever thought about going blond?

At the bottom, the headline reads "Trump: `Thats Not How I Draw Teenage Breasts`"

Subscribe to @theonion.com

02.11.2025 23:38 — 👍 203    🔁 17    💬 2    📌 1
Studios Enter Bidding War Over Napkin Stephen King Wrote ‘Ghoul’ On
LOS ANGELES—Anticipating the project could be the biggest horror hit of the decade, film studios were reportedly locked in a bidding war Friday over a napkin Stephen King had written the word “Ghoul” ...

Studios Enter Bidding War Over Napkin Stephen King Wrote ‘Ghoul’ On

03.11.2025 17:00 — 👍 1005    🔁 103    💬 19    📌 12

Lida

01.11.2025 17:58 — 👍 5252    🔁 1304    💬 16    📌 4

Interesting how the terms
'grassroots group' and 'campaign group' are used in that snippet

01.11.2025 10:24 — 👍 9    🔁 0    💬 1    📌 0

that’s crazy dude. that is crazy

01.11.2025 01:16 — 👍 442    🔁 50    💬 5    📌 0

habby halloween

31.10.2025 23:01 — 👍 9679    🔁 1827    💬 55    📌 2
MICHAEL MYERS PLAYED HIS OWN SCARY MUSIC
YouTube video by Real Big Boys

a Halloween treat from the Real Big Boys
www.youtube.com/watch?v=XCXZ...

31.10.2025 16:37 — 👍 33    🔁 12    💬 1    📌 0

Hell yeah, what a guy

31.10.2025 18:01 — 👍 1    🔁 0    💬 0    📌 0

Did he have the flame trousers??

31.10.2025 17:37 — 👍 1    🔁 0    💬 1    📌 0

Wow. I didn’t know that. I just, you’re telling me now for the first time.

30.10.2025 21:24 — 👍 1    🔁 0    💬 0    📌 0

Kamala Harris

30.10.2025 21:15 — 👍 62    🔁 0    💬 1    📌 0

Sorry sooz the algorithm has decided you like this now, time to get used to it and adapt your life accordingly.

29.10.2025 13:14 — 👍 2    🔁 0    💬 0    📌 0

Barry Windsor-Smith, “Beguiled” (1982/1995), pen and ink, watercolour/gouache, and coloured pencil. Intended for the cover of Epic Illustrated, but fatigue prompted BWS to abandon it half-done. Thirteen years later, BWS completed the work.

11.01.2025 13:10 — 👍 526    🔁 197    💬 4    📌 4

Mable #Pokemon

28.10.2025 15:57 — 👍 995    🔁 358    💬 5    📌 0
"Pacino is wonderful"

I am always saying this

25.10.2025 21:26 — 👍 2    🔁 1    💬 0    📌 0

gaming was dire in the 2000s...

23.10.2025 18:57 — 👍 1142    🔁 252    💬 53    📌 16

"The hills have eyes" is a cautionary tale about having some stuff

11.10.2025 22:34 — 👍 756    🔁 292    💬 4    📌 4

i've been playing a lot of third strike and i realize that the holes on the side could always be bigger

19.10.2025 15:57 — 👍 2105    🔁 296    💬 12    📌 0
ink drawing on grid paper of a knight wearing a very ornate helmet

[inktober day 15]

15.10.2025 14:59 — 👍 1651    🔁 194    💬 6    📌 0

?

14.10.2025 18:08 — 👍 1    🔁 0    💬 0    📌 0

It can chair a meeting, it can hold a press conference, it can stay calm in an interview

10.10.2025 19:26 — 👍 0    🔁 0    💬 0    📌 0

@lemonhail is following 20 prominent accounts