Common Crawl - Blog - May/June 2025 Newsletter

Table of Contents

Common Crawl’s New Host Index

Refreshed Version of Our Whirlwind Tour

Upcoming Workshop on Multilingual Data Quality Signals

Event Updates

Common Crawl’s New Host Index

In April, we introduced the Host Index, a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. It is queryable via AWS tools or downloadable. For more details about the public test of this dataset and how to give feedback, see our blog post.

Refreshed Version of Our Whirlwind Tour

Recently we refreshed our Whirlwind Tour in Python, a brief tutorial on interacting with our datasets programmatically. Read more about the updates in our blog post, and give it a whirl yourself in the GitHub repository.

Upcoming Workshop on Multilingual Data Quality Signals

The first Workshop on Multilingual Data Quality Signals (WMDQS), hosted by Common Crawl with MLCommons, EleutherAI, and Johns Hopkins University, will be held alongside COLM 2025 on October 10, 2025 in Montreal, Canada. It invites research papers on multilingual data quality and offers a shared task on language identification for web text. Please note that the deadline for paper submissions has been extended to July 3, 2025 (AoE). More details on the research paper submissions and the shared task can be found in our blog post.

Event Updates

We have been busy attending events this spring and summer. In April we attended the IIPC General Assembly and Web Archiving Conference in Oslo, hosted by the National Library of Norway, where we contributed poster presentations, lightning talks, and a workshop. More details and links to the presentations can be found in our blog post.

Also in April, we participated in the Creative Commons Technical Meeting "From Human Content to Machine Data" in Berlin, Germany, on the theme "Using Collective Action to Develop a New Social Contract for Machine Reuse", where we discussed the practicalities of opt-ins and opt-outs.

In May we attended IBM Think in Boston, where we engaged with the AI Alliance through a series of executive events, including a roundtable and dinner with leaders from across industry, academia, and research. We had in-depth discussions with IBM, and were invited to present at IBM's research centre in Yorktown Heights.

Conversations also covered shared interests in AI safety, digital preservation, and large-scale open data, including a lunch with the Frontier Model Forum and a meeting with the team behind a forthcoming open digital library initiative.

The visit concluded with meetings in Washington, DC, furthering Common Crawl’s engagement with policy and research communities.

In early June, we attended the Digital Preservation Coalition (DPC) Members Forum and Networking Event - Europe at the National Library of the Netherlands in The Hague. At this event, we had productive conversations with peers from CERN, the Flickr Foundation, and the National Library of the Netherlands. We also engaged in unconference-style discussions on topics such as low-tech preservation.

Left-to-right: Sebastian Nagel, Pedro Ortiz Suarez, and Thom Vaughan, at the United Nations, New York City, June 2025

Later in June, much of the Common Crawl team was in New York for UN Open Source Week, where we co-hosted an event with the AI Alliance and BrightQuery at IBM’s One Madison offices. We were also invited to give a presentation at IBM’s Thomas J. Watson Research Center in Yorktown Heights. Stay tuned for an upcoming blog post with further details.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
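For readers who process several crawls, the thresholds above can be captured in a small helper. This is only a minimal sketch: the function name `truncation_limit_bytes` is hypothetical and not part of any Common Crawl tooling, and it assumes that crawl IDs follow the `CC-MAIN-YYYY-WW` pattern and that CC-MAIN-2025-13 is the first crawl to use the 5 MiB limit.

```python
import re

MIB = 1024 * 1024

# Assumption: CC-MAIN-2025-13 (the March 2025 crawl) is the first crawl
# that uses the larger 5 MiB truncation threshold.
FIRST_5MIB_CRAWL = (2025, 13)

# Assumption: crawl IDs look like "CC-MAIN-YYYY-WW".
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit, in bytes, that applied to a crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so this orders crawls by year, then week.
    if (year, week) >= FIRST_5MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB
```

A payload whose length equals this limit is a candidate for having been cut off; crawl IDs that do not match the assumed pattern raise a `ValueError` rather than silently returning a guess.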