Common Crawl Foundation

@commoncrawl.bsky.social

Common Crawl is a non-profit foundation dedicated to the Open Web.

292 Followers  |  51 Following  |  48 Posts  |  Joined: 19.11.2024

Latest posts by commoncrawl.bsky.social on Bluesky

Common Crawl - Blog - IETF 123 Report

A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement.

commoncrawl.org/blog/ietf-12...

04.08.2025 17:46 | 👍 1    🔁 1    💬 0    📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2025

Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level.

commoncrawl.org/blog/host--a...

27.07.2025 13:32 | 👍 1    🔁 0    💬 0    📌 0
Common Crawl - Blog - July 2025 Crawl Archive Now Available

The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content.

commoncrawl.org/blog/july-20...

27.07.2025 13:30 | 👍 2    🔁 0    💬 0    📌 0
Common Crawl - Blog - WMDQS Shared Task on Language Identification

The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.

commoncrawl.org/blog/wmdqs-s...

21.07.2025 22:34 | 👍 6    🔁 5    💬 0    📌 1
Unlocking data to advance European commerce and culture - Microsoft On the Issues

Microsoft launches 2 initiatives to open Europe’s languages and culture, building on AI, cloud, and digital sovereignty commitments.

"MOIC will also partner with Common Crawl, one of the largest free and open repositories of web crawled data. MOIC will fund work at Common Crawl, leveraging native speakers to annotate and seed European language data in the publicly available Common Crawl data set."

21.07.2025 17:45 | 👍 4    🔁 1    💬 0    📌 0
Common Crawl - Blog - The First WMDQS-Masakhane LangID Hackathon

In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages.

commoncrawl.org/blog/the-fir...

08.07.2025 16:21 | 👍 2    🔁 1    💬 0    📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2025

We are pleased to announce that the Web Graph for June 2025 is now available. The graph consists of 371.6 million nodes and 3.1 billion edges at the host level, and 161.8 million nodes and 2.2 billion edges at the domain level.

commoncrawl.org/blog/host--a...

02.07.2025 01:03 | 👍 0    🔁 0    💬 0    📌 0
Common Crawl - Blog - Common Crawl at the United Nations Open Source Week, June 2025

The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.

commoncrawl.org/blog/common-...

01.07.2025 00:12 | 👍 3    🔁 2    💬 0    📌 0
Common Crawl - Blog - June 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for June 2025 is now available.

www.commoncrawl.org/blog/june-20...

27.06.2025 22:00 | 👍 1    🔁 1    💬 0    📌 0
Common Crawl - Blog - May/June 2025 Newsletter

We're happy to share our newsletter for May/June 2025 with updates from our team.

commoncrawl.org/blog/may-jun...

24.06.2025 20:53 | 👍 2    🔁 0    💬 0    📌 0

The deadline for paper submissions has been extended!

The new deadline is July 3, 2025 (AoE).

For more information, please visit: wmdqs.org

23.06.2025 14:23 | 👍 2    🔁 5    💬 0    📌 0
The AI Alliance Forms Non-profit AI Lab and AI Technology & Advocacy Association to Scale Open-Source Innovation

/PRNewswire/ -- Today, the AI Alliance, a global collaboration of more than 180 organizations committed to AI open innovation, announced it has incorporated...

www.prnewswire.com/news-release...

20.06.2025 18:47 | 👍 1    🔁 1    💬 0    📌 0
Common Crawl - Blog - Announcing the Whirlwind Tour of Common Crawl's Datasets using Python

Announcing a refreshed version of the Whirlwind Tour in Python. Get to know how to make the most of our crawl data.

commoncrawl.org/blog/announc...

12.06.2025 12:08 | 👍 3    🔁 0    💬 0    📌 1
AI Alliance @ IBM One Madison (UN Open Source Week 2025) · Luma

This year’s UN Open Source Week 2025 (June 16-20) will once again bring together a global “who is who” of Open Source leaders. As part of the official…

The Common Crawl Foundation, together with IBM, the AI Alliance, and BrightQuery, will be hosting an "UN Conference" at IBM's new flagship NYC HQ at One Madison Avenue on Friday, June 20, from 12:30-5pm.

If you are in NYC, it would be great to see you there!

lu.ma/p0a1scde

10.06.2025 21:54 | 👍 2    🔁 1    💬 0    📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2025

We are pleased to announce that the Web Graph for May 2025 is now available. The graph consists of 326.8 million nodes and 2.9 billion edges at the host level, and 156.1 million nodes and 2.1 billion edges at the domain level.

commoncrawl.org/blog/host--a...

07.06.2025 05:09 | 👍 3    🔁 2    💬 1    📌 0
Common Crawl - Blog - May 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for May 2025 is now available. The data was crawled between May 11th and May 25th, and contains 2.47 billion web pages, or 429 TiB of uncompressed content.

commoncrawl.org/blog/may-202...

03.06.2025 13:21 | 👍 1    🔁 2    💬 0    📌 0
1st Workshop on Multilingual Data Quality Signals

Call for papers!
We are organising the 1st Workshop on Multilingual Data Quality Signals with @mlcommons.org and @eleutherai.bsky.social, held in tandem with @colmweb.org. Submit your research on multilingual data quality!

Submission deadline is 23 June, more info: wmdqs.org

29.05.2025 17:18 | 👍 9    🔁 8    💬 0    📌 1
Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025. The graph consists of 309.2 million nodes and 2.9 billion edg...

05.05.2025 16:25 | 👍 1    🔁 1    💬 0    📌 0
Common Crawl - Blog - April 2025 Crawl Archive Now Available

Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content). Page captures are...

04.05.2025 20:29 | 👍 2    🔁 0    💬 0    📌 0
Common Crawl - Blog - Introducing the Host Index

Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. Queryable via AWS tools or downloadable.

23.04.2025 16:56 | 👍 1    🔁 0    💬 0    📌 0
Common Crawl - Blog - IIPC General Assembly & Web Archiving Conference 2025

The Common Crawl team attended the 2025 IIPC General Assembly and Web Archiving Conference in Oslo, presenting recent work and participating in discussions on web preservation.

16.04.2025 20:05 | 👍 1    🔁 2    💬 0    📌 0
Common Crawl - Blog - March/April 2025 Newsletter

We're excited to share our newsletter for March/April 2025 with updates from our team.

14.04.2025 22:00 | 👍 1    🔁 0    💬 0    📌 0

The #iipcGA25 Research Working Group meeting was packed: co-chair Ben O’Brien of @nationallibrarynz.bsky.social discussed Whole-of-Domain, Magnus Birkenes and Marie Roald of @nettarkivet.bsky.social explained their WebData Project, and Sebastian Nagel presented @commoncrawl.bsky.social’s datasets!

11.04.2025 15:51 | 👍 4    🔁 1    💬 1    📌 0
Common Crawl - Blog - Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network

In 2024, the Common Crawl Foundation and Constellation Network announced a groundbreaking partnership to enhance data integrity and transparency across the web. Here we recap some recent discussions w...

10.04.2025 10:51 | 👍 1    🔁 0    💬 0    📌 0
We would like to welcome all of our attending members to Oslo, with a special welcome to two of our newest members, the Publications Office of the European Union and @commoncrawl.bsky.social!

@nettarkivet.bsky.social | #iipcGA25 | #webarchiving

08.04.2025 08:58 | 👍 9    🔁 4    💬 0    📌 0
I'm happy to share our new white paper co-authored with Bibliothèques Sans… | Anastasia Stasenko

I'm happy to share our new white paper co-authored with Bibliothèques Sans Frontières and Kajou : "Beyond the Hype: Building Equitable and Sustainable AI for Social Impact." Unveiled ye...

04.04.2025 18:48 | 👍 0    🔁 0    💬 0    📌 0
Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of January, February, and March 2025.

03.04.2025 22:06 | 👍 1    🔁 0    💬 0    📌 0
Common Crawl - Blog - March 2025 Crawl Archive Now Available

We are pleased to announce that the crawl archive for March 2025 is now available. The data was crawled between March 15th and March 28th, and contains 2.74 billion web pages (or 455 TiB of uncompress...

01.04.2025 21:58 | 👍 0    🔁 0    💬 0    📌 0
Common Crawl - Blog - Introducing Common Crawl AI Agent by ReadyAI

We are pleased to announce the launch of an experimental AI Agent, developed by our friends at ReadyAI. The agent offers a conversational interface designed to help users explore Common Crawl’s data, ...

31.03.2025 22:11 | 👍 0    🔁 0    💬 0    📌 0
Introducing GovArchive.us & Mirroring Entire Sites with Web Archives • Webrecorder Blog

Introducing GovArchive.us and tooling to mirror web sites using web archives.

Our friends at Webrecorder have announced the launch of GovArchive.us, a dedicated site for exploring their US Government Web Archive on Browsertrix. More details in their blog post:

25.03.2025 18:11 | 👍 1    🔁 0    💬 0    📌 0

@commoncrawl is following 20 prominent accounts