Common Crawl Foundation @commoncrawl

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

www.commoncrawl.org/blog/host--a...

02.02.2026 18:01 — 👍 1 🔁 0 💬 0 📌 0

Common Crawl - Blog - January 2026 Crawl Archive Now Available

We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

www.commoncrawl.org/blog/january...

02.02.2026 18:01 — 👍 3 🔁 0 💬 0 📌 0

Common Crawl - Blog - Web Archives for Social Sciences Datathon, Bristol

Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.

www.commoncrawl.org/blog/web-arc...

02.02.2026 18:00 — 👍 1 🔁 0 💬 0 📌 0

Common Crawl - Blog - How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

As SEOs grapple with the shift from traditional Search Engine Optimization to AI visibility, they're discovering a resource that's been powering AI training for years: Common Crawl's Web Graph.

commoncrawl.org/blog/how-seo...

21.01.2026 01:18 — 👍 4 🔁 0 💬 0 📌 0

Common Crawl - Blog - GneissWeb Annotations Examples

A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket.

commoncrawl.org/blog/gneissw...

16.01.2026 13:26 — 👍 2 🔁 1 💬 0 📌 0

Common Crawl - Blog - Common Crawl at the Mozilla Festival 2025

From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.

www.commoncrawl.org/blog/common-...

08.01.2026 13:49 — 👍 0 🔁 0 💬 0 📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November, and December 2025.

commoncrawl.org/blog/host--a...

02.01.2026 00:17 — 👍 0 🔁 0 💬 0 📌 0

Common Crawl - Blog - December 2025 Crawl Archive Now Available

The crawl archive for December 2025 is now available, consisting of 2.16 billion web pages (or 364 TiB of uncompressed content).

commoncrawl.org/blog/decembe...

02.01.2026 00:17 — 👍 0 🔁 0 💬 0 📌 0

Common Crawl - Blog - A Sampling of 2025 Research Referencing Common Crawl

As another year here at Common Crawl comes to a close, we present a dozen papers from 2025 that demonstrate the range of topics and areas of study for which Common Crawl’s datasets are used and referenced.

commoncrawl.org/blog/a-sampl...

18.12.2025 17:26 — 👍 2 🔁 0 💬 0 📌 0

Laurie Burchell at a lectern, with a blackboard behind her, presenting her Turing Seminar talk

A huge thank you to @very-laurie.bsky.social for delivering a fantastic UoB Turing seminar. Her talk was entitled “Common Crawl: open web data for everybody.”

In this talk, she introduced the @commoncrawl.bsky.social and the data products they offer.

27.11.2025 13:05 — 👍 6 🔁 2 💬 0 📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...

24.11.2025 17:46 — 👍 2 🔁 1 💬 0 📌 0

Common Crawl - Blog - November 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for November 2025 is now available, containing 2.29 billion web pages or 378 TiB of uncompressed content.

commoncrawl.org/blog/novembe...

24.11.2025 15:49 — 👍 3 🔁 0 💬 0 📌 0

Banner for the World Digital Preservation Day, 6th of November 2025

Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

commoncrawl.org/blog/common-...

06.11.2025 14:56 — 👍 3 🔁 0 💬 0 📌 0

Common Crawl - Blog - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

A recent article in The Atlantic makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

commoncrawl.org/blog/setting...

04.11.2025 22:38 — 👍 2 🔁 0 💬 0 📌 0

Common Crawl - Blog - October/November 2025 Newsletter

Check out our newsletter for October/November 2025, with updates on what we've been up to.

commoncrawl.org/blog/october...

04.11.2025 22:37 — 👍 2 🔁 1 💬 0 📌 0

Common Crawl - Blog - Common Crawl Foundation at Stanford HAI

The Common Crawl team presented a seminar at Stanford HAI entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”.

commoncrawl.org/blog/common-...

29.10.2025 18:46 — 👍 1 🔁 1 💬 0 📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.

29.10.2025 18:45 — 👍 1 🔁 0 💬 0 📌 0

Common Crawl - Blog - October 2025 Crawl Archive Now Available

We are pleased to announce the release of the October 2025 crawl, containing 2.61 billion web pages or 468 TiB of uncompressed content.

commoncrawl.org/blog/october...

29.10.2025 18:45 — 👍 1 🔁 0 💬 0 📌 0

Common Crawl - Blog - Common Crawl Foundation at COLM 2025

The Common Crawl team attended the 2nd Conference on Language Modeling in Montréal, organizing a workshop, giving invited talks, and strengthening links with the research community.

commoncrawl.org/blog/common-...

21.10.2025 15:51 — 👍 2 🔁 0 💬 0 📌 0

If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...

10.10.2025 20:52 — 👍 4 🔁 4 💬 0 📌 0

Thank you everyone for coming to WMDQS (pronounced "whim ducks")!

10.10.2025 20:50 — 👍 3 🔁 2 💬 1 📌 0

After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!

10.10.2025 20:46 — 👍 2 🔁 1 💬 1 📌 0

WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025

10.10.2025 16:17 — 👍 2 🔁 3 💬 1 📌 0

Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩

09.10.2025 23:16 — 👍 6 🔁 3 💬 0 📌 0

In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

09.10.2025 20:17 — 👍 3 🔁 3 💬 1 📌 1

Common Crawl - Blog - Announcing GneissWeb Annotations

Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology.

commoncrawl.org/blog/announc...

06.10.2025 11:26 — 👍 1 🔁 0 💬 0 📌 0

Common Crawl - Blog - Web Languages Needing Review by Native Speakers

Common Crawl’s Web Languages initiative has had many contributions since its introduction. We’re calling for native speakers of certain languages to review language contributions, to ensure that links we’re adding to our seed crawl are of good quality.

commoncrawl.org/blog/web-lan...

02.10.2025 22:06 — 👍 2 🔁 4 💬 0 📌 0

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.

02.10.2025 22:04 — 👍 2 🔁 1 💬 0 📌 0

Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers.

commoncrawl.org/blog/from-se...

02.10.2025 22:03 — 👍 0 🔁 0 💬 0 📌 0

Common Crawl - Blog - September 2025 Crawl Archive Now Available

We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content.

www.commoncrawl.org/blog/septemb...

23.09.2025 15:14 — 👍 0 🔁 0 💬 0 📌 0

Latest posts by commoncrawl.bsky.social on Bluesky