A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement.
commoncrawl.org/blog/ietf-12...
@commoncrawl.bsky.social
Common Crawl is a non-profit foundation dedicated to the Open Web.
Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level.
commoncrawl.org/blog/host--a...
The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content.
commoncrawl.org/blog/july-20...
The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.
commoncrawl.org/blog/wmdqs-s...
"MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common Crawl data set."
21.07.2025 17:45
In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages.
commoncrawl.org/blog/the-fir...
We are pleased to announce that the Web Graph for June 2025 is now available. The graph consists of 371.6 million nodes and 3.1 billion edges at the host level, and 161.8 million nodes and 2.2 billion edges at the domain level.
commoncrawl.org/blog/host--a...
The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.
commoncrawl.org/blog/common-...
We are pleased to announce that the crawl archive for June 2025 is now available.
www.commoncrawl.org/blog/june-20...
We're happy to share our newsletter for May/June 2025 with updates from our team.
commoncrawl.org/blog/may-jun...
The deadline for paper submissions has been extended!
The new deadline is July 3, 2025 (AoE).
For more information, please visit: wmdqs.org
The AI Alliance Forms Non-profit AI Lab and AI Technology & Advocacy Association to Scale Open-Source Innovation
www.prnewswire.com/news-release...
Announcing a refreshed version of the Whirlwind Tour in Python. Get to know how to make the most of our crawl data.
commoncrawl.org/blog/announc...
The Common Crawl Foundation, together with IBM, the AI Alliance, and BrightQuery will be hosting an "UN Conference" at IBM's new flagship NYC HQ at One Madison Avenue on Friday, June 20, from 12:30-5pm.
If you are in NYC, it would be great to see you there!
lu.ma/p0a1scde
We are pleased to announce that the Web Graph for May 2025 is now available. The graph consists of 326.8 million nodes and 2.9 billion edges at the host level, and 156.1 million nodes and 2.1 billion edges at the domain level.
commoncrawl.org/blog/host--a...
We are pleased to announce that the crawl archive for May 2025 is now available. The data was crawled between May 11th and May 25th, and contains 2.47 billion web pages, or 429 TiB of uncompressed content.
commoncrawl.org/blog/may-202...
Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!
Submission deadline is 23 June, more info: wmdqs.org
The #iipcGA25 Research Working Group meeting was packed: co-chair Ben O'Brien of @nationallibrarynz.bsky.social discussed Whole-of-Domain, Magnus Birkenes and Marie Roald of @nettarkivet.bsky.social explained their WebData Project, and Sebastian Nagel presented @commoncrawl.bsky.social's datasets!
11.04.2025 15:51
We would like to welcome all of our attending members to Oslo, with a special welcome to two of our newest members, the Publications Office of the European Union and @commoncrawl.bsky.social!
@nettarkivet.bsky.social | #iipcGA25 | #webarchiving