November 6, 2025

Common Crawl Celebrates World Digital Preservation Day

Note: this post has been marked as obsolete.
Common Crawl Foundation

Common Crawl celebrates World Digital Preservation Day Nov. 6, which invites the community to unite in answering a powerful question: Why Preserve?

When Common Crawl began in 2007, we didn’t think of ourselves as an archive, or even as preserving the web. Our mission was simple: to make public web data openly accessible for anyone to analyze and explore. But our first engineer made a key technical decision that would shape our future: to store our crawled data using web archiving file formats.

That choice, made for practical reasons at the time, quietly set us on the path toward preservation. As the years passed and the crawls accumulated, Common Crawl evolved into something we hadn’t originally anticipated: an archive of the web with over 300 billion web pages spanning 17 years, and more than 100 crawl archives released to date.

Today, our data is used in countless ways, from linguistic-trend analysis and economic and sociopolitical research to the digital humanities and machine translation. Governments and organizations use our archive to make data-driven decisions and tackle global issues. Researchers studying culture, media, and history turn to Common Crawl to understand how the web itself has changed over time.

We didn’t expect that, but we’re very happy about it. Over 10,000 research papers now reference Common Crawl, a testament to the value of preserving and sharing web data freely. We’re also proud that our crawls are included in the Internet Archive’s Wayback Machine, helping ensure that this collective record of the web remains accessible for generations to come.


Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
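As a rough illustration of the limits described above, the truncation threshold for a given crawl can be derived from its crawl ID (crawl IDs take the form "CC-MAIN-&lt;year&gt;-&lt;week&gt;"). This is a minimal sketch, not an official API; the helper name and the parsing logic are ours:

```python
def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch-size truncation limit for a Common Crawl crawl.

    Crawls released before CC-MAIN-2025-13 (March 2025) were truncated
    at 1 MiB; from CC-MAIN-2025-13 onwards the limit is 5 MiB.
    """
    prefix = "CC-MAIN-"
    if not crawl_id.startswith(prefix):
        raise ValueError(f"unexpected crawl ID: {crawl_id!r}")
    # Crawl IDs encode the year and week number, e.g. "CC-MAIN-2025-13".
    year, week = (int(part) for part in crawl_id[len(prefix):].split("-"))
    one_mib = 1024 * 1024
    return 5 * one_mib if (year, week) >= (2025, 13) else one_mib


# Example: an older crawl vs. the March 2025 crawl onward.
print(truncation_limit_bytes("CC-MAIN-2024-51"))  # 1 MiB limit
print(truncation_limit_bytes("CC-MAIN-2025-13"))  # 5 MiB limit
```

In the WARC files themselves, records cut off at this limit carry a `WARC-Truncated` header, which is the reliable way to detect truncation when processing the archives.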

For more details, see our truncation analysis notebook.