Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025
We are pleased to announce the release of the web graphs based on the crawls of September, October, and November of 2025, consisting of 235.7 million nodes and 9.5 billion edges at the host level, and 100.7 million nodes and 6.6 billion edges at the domain level.
commoncrawl.org/blog/host--a...
24.11.2025 17:46
Banner for the World Digital Preservation Day, 6th of November 2025
Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?
commoncrawl.org/blog/common-...
06.11.2025 14:56
Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of August, September, and October 2025, consisting of 468.4 million nodes and 8.0 billion edges at the host level, and 97.7 million nodes and 6.0 billion edges at the domain level.
29.10.2025 18:45
If you were able to join us, let us know about your experience: docs.google.com/forms/d/e/1F...
10.10.2025 20:52
Thank you everyone for coming to WMDQS (pronounced "whim ducks")!
10.10.2025 20:50
After lunch, @sebnagel.bsky.social gave a keynote about the data collected by @commoncrawl.bsky.social!
10.10.2025 20:46
WMDQS is underway! Come join us in Room 520A at @colmweb.org! #COLM2025
10.10.2025 16:17
Looking forward to tomorrow's #COLM2025 workshop on multilingual data quality! 🤩
09.10.2025 23:16
In collaboration with @commoncrawl.bsky.social, MLCommons, and @eleutherai.bsky.social, the first edition of WMDQS at @colmweb.org starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.
09.10.2025 20:17
Common Crawl - Blog - Web Languages Needing Review by Native Speakers
Common Crawl's Web Languages initiative has had many contributions since its introduction. We're calling for native speakers of certain languages to review language contributions, to ensure that links we're adding to our seed crawl are of good quality.
commoncrawl.org/blog/web-lan...
02.10.2025 22:06
Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025. The host-level graph consists of 628.7 million nodes and 6.9 billion edges, and the domain-level graph consists of 184.6 million nodes and 5.4 billion edges.
02.10.2025 22:04
Common Crawl - Blog - From SEO to AIO: Why Your Content Needs to Exist in AI Training Data
The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization), where businesses must ensure their content exists in AI training datasets to remain discoverable as users increasingly turn to AI assistants for answers.
commoncrawl.org/blog/from-se...
02.10.2025 22:03
Copyright lawyer at Jaszi Butler PLLC, Exec Director @recreatecoalition.bsky.social, dad. Press inquiries: press@recreatecoalition.com.
We collect, preserve, and share #software #sourcecode for present and future generations. #swh #softwarecommons #freesoftware #opensource
A free, collaborative, multilingual internet encyclopedia.
donate.wikipedia25.org
Internet Archive is a non-profit research library preserving web pages, books, movies & audio for public access. Explore web history via the Wayback Machine.
Also an architect, GIS enthusiast, sailor.
Visiting prof @Stony Brook; fellow @Montclair State; emeritus prof @CUNY's Newmark School of Journalism. Co-host: This Week in Google, AI Inside. Author of The Gutenberg Parenthesis, Magazine, The Web We Weave, available at: https://jeffjarvis.com
NLP & ML research @cohereforai.bsky.social 🇨🇦
A series of state-of-the-art, open source and transparent
foundation models for European languages
The first iteration of our workshop will be co-located with @colmweb.org 2025 in Montreal.
https://wmdqs.org/
Community architect/builder at IBM (@aialliance.bsky.social and @ossci.bsky.social). San José, CA. Wannabe trail runner. German native. Skeets my own, etc. #opensource #openscience #community #criminaljustice #learningspanish
Chair of Data Science at University of Pretoria, South Africa.
Co-Founder LelapaAI. #NLProc
PhDCS Rutgers, BScEE-MsEE Wits University.
Changing the World @DeepIndaba @MasakhaneNLP.
Made with ❤️ Tshwane
Lab Data Science for Social Impact
Web Archiving tools developer @webrecorder.net
Fan of all things accordion 🪗
President of the @ec.europa.eu
Mother of seven. Brussels-born. European by heart. 🇪🇺
just another extremely earnest and Very Online idealist. Head of Product @roost.tools proud public school alum with roots in the desert 🌵 full of personal opinions like the skeets below
Robust Open Online Safety Tools (ROOST) is a new non-profit entity providing open source, accessible, high-quality, transparent safety tools for digital organizations of all kinds.
roost.tools
The National Library of Norway's Web Archive
nettarkivet.nb.no | nettarkivet@nb.no
Register for IIPC Web Archiving Conference 2025, 9.-10. April 2025 in Oslo: https://netpreserve.org/ga2025/registration/