Common Crawl Foundation

@commoncrawl.bsky.social

Common Crawl is a non-profit foundation dedicated to the Open Web.

329 Followers  |  58 Following  |  72 Posts  |  Joined: 19.11.2024

Latest posts by commoncrawl.bsky.social on Bluesky

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...

24.11.2025 17:46 - 👍 2    🔁 1    💬 0    📌 0

Common Crawl - Blog - November 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.

commoncrawl.org/blog/novembe...

24.11.2025 15:49 - 👍 3    🔁 0    💬 0    📌 0

Banner for the World Digital Preservation Day, 6th of November 2025

Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

commoncrawl.org/blog/common-...

06.11.2025 14:56 - 👍 3    🔁 0    💬 0    📌 0

Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

Setting the Record Straight

A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

commoncrawl.org/blog/setting...

04.11.2025 22:38 - 👍 2    🔁 0    💬 0    📌 0

Common Crawl - Blog - October/November 2025 Newsletter

Check out our newsletter for October/November 2025, with updates on what we've been up to.

commoncrawl.org/blog/october...

04.11.2025 22:37 - 👍 2    🔁 1    💬 0    📌 0

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI

The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

commoncrawl.org/blog/common-...

29.10.2025 18:46 - 👍 1    🔁 1    💬 0    📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.

29.10.2025 18:45 - 👍 1    🔁 0    💬 0    📌 0

Common Crawl - Blog - October 2025 Crawl Archive Now Available

We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content.

commoncrawl.org/blog/october...

29.10.2025 18:45 - 👍 1    🔁 0    💬 0    📌 0

Common Crawl - Blog - Common Crawl Foundation at COLM 2025

The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.

commoncrawl.org/blog/common-...

21.10.2025 15:51 - 👍 2    🔁 0    💬 0    📌 0

If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...

10.10.2025 20:52 - 👍 4    🔁 4    💬 0    📌 0

Thank you everyone for coming to WMDQS (pronounced "whim ducks")!

10.10.2025 20:50 - 👍 3    🔁 2    💬 1    📌 0

After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!

10.10.2025 20:46 - 👍 2    🔁 1    💬 1    📌 0

WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025

10.10.2025 16:17 - 👍 2    🔁 3    💬 1    📌 0

Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩

09.10.2025 23:16 - 👍 6    🔁 3    💬 0    📌 0

In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

09.10.2025 20:17 - 👍 3    🔁 3    💬 1    📌 1

Common Crawl - Blog - Announcing GneissWeb Annotations

Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology.

commoncrawl.org/blog/announc...

06.10.2025 11:26 - 👍 1    🔁 0    💬 0    📌 0

Common Crawl - Blog - Web Languages Needing Review by Native Speakers

Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.

commoncrawl.org/blog/web-lan...

02.10.2025 22:06 - 👍 2    🔁 4    💬 0    📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.

02.10.2025 22:04 - 👍 2    🔁 1    💬 0    📌 0

Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers.

commoncrawl.org/blog/from-se...

02.10.2025 22:03 - 👍 0    🔁 0    💬 0    📌 0

Common Crawl - Blog - September 2025 Crawl Archive Now Available

We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content.

www.commoncrawl.org/blog/septemb...

23.09.2025 15:14 - 👍 0    🔁 0    💬 0    📌 0

Common Crawl - Blog - Common Crawl Foundation Opt-Out Registry

Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.

commoncrawl.org/blog/common-...

18.09.2025 06:36 - 👍 1    🔁 0    💬 0    📌 0

Common Crawl - Blog - Trip Report: AI_dev (Linux Foundation) August 2025

On the 28th and 29th of August 2025, Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen attended the Linux Foundation’s AI_dev event in Amsterdam.

commoncrawl.org/blog/trip-re...

18.09.2025 06:33 - 👍 0    🔁 0    💬 0    📌 0

Common Crawl Foundation | Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data | Stanford HAI Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data.

On October 22, the Common Crawl team will lead a seminar at Stanford HAI. Our topic of discussion is “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

Please register at: hai.stanford.edu/events/commo...

18.09.2025 06:31 - 👍 0    🔁 0    💬 0    📌 0

We’re Walling Off The Open Internet To Stop AI - And It May End Up Breaking Everything Else

A longtime open internet activist recently asked me whether I’d reversed my position on internet openness and copyright because of AI. The question caught me off guard - until I realized what h…

www.techdirt.com/2025/09/08/w...

09.09.2025 16:06 - 👍 0    🔁 0    💬 0    📌 0

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new ways to preserve and share humanity’s knowledge.

www.commoncrawl.org/blog/common-...

09.09.2025 16:05 - 👍 1    🔁 1    💬 0    📌 0

Common Crawl - Blog - July/August 2025 Newsletter

We are pleased to release our newsletter for July and August 2025, with updates on our team's activities.

commoncrawl.org/blog/july-au...

26.08.2025 21:34 - 👍 0    🔁 0    💬 0    📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2025 We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The host-level graph consists of 691.1 million nodes and 5.0 bill...

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025.

commoncrawl.org/blog/host--a...

22.08.2025 00:41 - 👍 2    🔁 0    💬 0    📌 0

Common Crawl - Blog - August 2025 Crawl Archive Now Available

We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content).

commoncrawl.org/blog/august-...

19.08.2025 00:42 - 👍 1    🔁 1    💬 0    📌 0

Common Crawl - Blog - AI Optimization Is Here: Are You Ready for Search 2.0?

Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO.

commoncrawl.org/blog/ai-opti...

14.08.2025 00:48 - 👍 0    🔁 0    💬 0    📌 0

The Enclosure of the Open Web and the Open Internet Toll booth: What’s Behind Pay-By-Crawl - Digital Medusa Cloudflare recently proposed a system where AI companies and crawlers would pay websites for the right to crawl their content, a move framed as “content independence day”, a response to growing concer...

The Enclosure Of The Open Web And The Open Internet Toll Booth: What’s Behind Pay-By-Crawl

digitalmedusa.org/the-enclosur...

14.08.2025 00:47 - 👍 0    🔁 0    💬 0    📌 0

@commoncrawl is following 20 prominent accounts