WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
10.10.2025 16:17
@commoncrawl.bsky.social
Common Crawl is a non-profit foundation dedicated to the Open Web.
Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩
09.10.2025 23:16
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
09.10.2025 20:17
Common Crawl has added IBM's GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medical, education, and technology.
commoncrawl.org/blog/announc...
Common Crawl's Web Languages initiative has had many contributions since its introduction. We're calling for native speakers of certain languages to review language contributions, to ensure that links we're adding to our seed crawl are of good quality.
commoncrawl.org/blog/web-lan...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.
02.10.2025 22:04
The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers.
commoncrawl.org/blog/from-se...
We are pleased to announce the release of our September 2025 crawl, containing 2.39 billion web pages, or 421 TiB of uncompressed content.
www.commoncrawl.org/blog/septemb...
Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received.
commoncrawl.org/blog/common-...
On the 28th and 29th of August 2025, Thom Vaughan, Pedro Ortiz Suarez, and Thijs Dalhuijsen attended the Linux Foundation's AI_dev event in Amsterdam.
commoncrawl.org/blog/trip-re...
On October 22, the Common Crawl team will lead a seminar at Stanford HAI. Our topic of discussion is "Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data".
Please register at: hai.stanford.edu/events/commo...
We're Walling Off The Open Internet To Stop AI - And It May End Up Breaking Everything Else
www.techdirt.com/2025/09/08/w...
Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new ways to preserve and share humanity's knowledge.
www.commoncrawl.org/blog/common-...
We are pleased to release our newsletter for July and August 2025, with updates on our team's activities.
commoncrawl.org/blog/july-au...
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025.
commoncrawl.org/blog/host--a...
We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content).
commoncrawl.org/blog/august-...
Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO.
commoncrawl.org/blog/ai-opti...
The Enclosure Of The Open Web And The Open Internet Toll Booth: What's Behind Pay-By-Crawl
digitalmedusa.org/the-enclosur...
A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement.
commoncrawl.org/blog/ietf-12...
Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level.
commoncrawl.org/blog/host--a...
The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content.
commoncrawl.org/blog/july-20...
The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.
commoncrawl.org/blog/wmdqs-s...
"MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common Crawl data set."
21.07.2025 17:45
In June 2025, the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages.
commoncrawl.org/blog/the-fir...
We are pleased to announce that the Web Graph for June 2025 is now available. The graph consists of 371.6 million nodes and 3.1 billion edges at the host level, and 161.8 million nodes and 2.2 billion edges at the domain level.
commoncrawl.org/blog/host--a...
The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.
commoncrawl.org/blog/common-...
We are pleased to announce that the crawl archive for June 2025 is now available.
www.commoncrawl.org/blog/june-20...
We're happy to share our newsletter for May/June 2025 with updates from our team.
commoncrawl.org/blog/may-jun...
The deadline for paper submissions has been extended!
The new deadline is July 3, 2025 (AoE).
For more information, please visit: wmdqs.org
The AI Alliance Forms Non-profit AI Lab and AI Technology & Advocacy Association to Scale Open-Source Innovation
www.prnewswire.com/news-release...