Blog

The latest news, interviews, technologies, and resources.

Answers to Recent Community Questions

In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming!
Common Crawl Enters A New Phase

A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible.
Video: Gil Elbaz at Web 2.0 Summit 2011

Hear Common Crawl founder Gil Elbaz discuss why data accessibility is crucial to increasing the rate of innovation, and share ideas on how to broaden access to data.

Common Crawl Blog

Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

September 8, 2025

Stanford HAI and Common Crawl are joining forces to explore how open data can shape the future of AI. On 22 October 2025, their seminar will address privacy, safety, and security while showcasing new ways to preserve and share humanity’s knowledge.

Read More...

July/August 2025 Newsletter

August 26, 2025

We are pleased to release our newsletter for July and August 2025, with updates on our team's activities.

Read More...

Host- and Domain-Level Web Graphs June, July, and August 2025

August 21, 2025

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The host-level graph consists of 691.1 million nodes and 5.0 billion edges, and the domain-level graph consists of 207.6 million nodes and 3.9 billion edges.

Read More...

August 2025 Crawl Archive Now Available

August 18, 2025

We are pleased to announce the release of our August 2025 crawl, containing 2.44 billion web pages (or 424 TiB of uncompressed content).

Read More...

Common Crawl Foundation at ACL 2025

August 13, 2025

The Common Crawl team attended the 63rd Annual Meeting of the Association for Computational Linguistics in Vienna, presenting recently published work and strengthening links with the research community.

Read More...

AI Optimization Is Here: Are You Ready for Search 2.0?

August 11, 2025

Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO.

Read More...
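As a rough illustration of the robots.txt point above, the following sketch checks whether a given robots.txt body would let a crawler identifying as CCBot fetch a URL. It uses only Python's standard `urllib.robotparser`; the helper name `ccbot_allowed`, the two sample robots.txt bodies, and the `example.com` URLs are assumptions made for this example and are not taken from the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks CCBot while allowing everyone else.
BLOCKING = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical robots.txt with no CCBot-specific rule; only /private/ is off-limits.
PERMISSIVE = """\
User-agent: *
Disallow: /private/
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits CCBot to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    for name, body in (("blocking", BLOCKING), ("permissive", PERMISSIVE)):
        allowed = ccbot_allowed(body, "https://example.com/page")
        print(f"{name}: CCBot allowed = {allowed}")
```

To check a live site, `RobotFileParser` can also load the file itself through `set_url()` followed by `read()`; the string-based version here keeps the sketch offline and repeatable.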

IETF 123 Report

August 4, 2025

A report on IETF 123 in Madrid, including sessions on AI content preferences, bot authentication, and web measurement.

Read More...

Host- and Domain-Level Web Graphs May, June, and July 2025

July 25, 2025

Our Web Graph release for July 2025 is now available, consisting of 481.6 million nodes and 3.4 billion edges at the host level, and 209.5 million nodes and 2.6 billion edges at the domain level.

Read More...

July 2025 Crawl Archive Now Available

July 23, 2025

The crawl archive for July 2025 is now available. Crawled between July 7th and July 21st, the data contains 2.42 billion web pages, or 419 TiB of uncompressed content.

Read More...

WMDQS Shared Task on Language Identification

July 21, 2025

The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the first shared task on Language Identification for web data.

Read More...

The First WMDQS-Masakhane LangID Hackathon

July 8, 2025

In June 2025, the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane to collect language identification annotations for African languages.

Read More...

Host- and Domain-Level Web Graphs April, May, and June 2025

July 1, 2025

We are pleased to announce that the Web Graph for June 2025 is now available. The graph consists of 371.6 million nodes and 3.1 billion edges at the host level, and 161.8 million nodes and 2.2 billion edges at the domain level.

Read More...

Common Crawl at the United Nations Open Source Week, June 2025

June 30, 2025

The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.

Read More...

June 2025 Crawl Archive Now Available

June 27, 2025

We are pleased to announce that the crawl archive for June 2025 is now available.

Read More...

May/June 2025 Newsletter

June 24, 2025

We're happy to share our newsletter for May/June 2025 with updates from our team.

Read More...

Announcing the Whirlwind Tour of Common Crawl's Datasets using Python

June 12, 2025

Announcing a refreshed version of the Whirlwind Tour in Python. Get to know how to make the most of our crawl data.

Read More...

Host- and Domain-Level Web Graphs March, April, and May 2025

May 29, 2025

We are pleased to announce that the Web Graph for May 2025 is now available. The graph consists of 326.8 million nodes and 2.9 billion edges at the host level, and 156.1 million nodes and 2.1 billion edges at the domain level.

Read More...

May 2025 Crawl Archive Now Available

May 28, 2025

We are pleased to announce that the crawl archive for May 2025 is now available. The data was crawled between May 11th and May 25th, and contains 2.47 billion web pages, or 429 TiB of uncompressed content.

Read More...

Announcing the First Workshop on Multilingual Data Quality Signals

May 27, 2025

The first Workshop on Multilingual Data Quality Signals (WMDQS), hosted by Common Crawl with MLCommons, EleutherAI, and Johns Hopkins, will be held alongside COLM 2025 on 10 October 2025 in Montreal, Canada. It invites research papers on multilingual data quality and offers a shared task on language identification for web text.

Read More...

Host- and Domain-Level Web Graphs February, March, and April 2025

May 5, 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025. The graph consists of 309.2 million nodes and 2.9 billion edges at the host level, and 157.1 million nodes and 2.1 billion edges at the domain level.

Read More...

April 2025 Crawl Archive Now Available

May 4, 2025

Announcing the release of the April 2025 crawl archive. The data was crawled between April 17th and May 1st, and contains 2.74 billion web pages (or 468 TiB of uncompressed content). Page captures are from 47.5 million hosts or 38.8 million registered domains and include 838 million new URLs, not visited in any of our prior crawls.

Read More...

Introducing the Host Index

April 23, 2025

Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defence data. Queryable via AWS tools or downloadable.

Read More...

IIPC General Assembly & Web Archiving Conference 2025

April 16, 2025

The Common Crawl team attended the 2025 IIPC General Assembly and Web Archiving Conference in Oslo, presenting recent work and participating in discussions on web preservation.

Read More...

March/April 2025 Newsletter

April 14, 2025

We're excited to share our newsletter for March/April 2025 with updates from our team.

Read More...

Providing Authenticity & Data Provenance for Common Crawl Using Blockchain: Our Work with Constellation Network

April 9, 2025

In 2024, the Common Crawl Foundation and Constellation Network announced a groundbreaking partnership to enhance data integrity and transparency across the web. Here we recap some recent discussions with Constellation.

Read More...