November 7, 2011

Common Crawl Enters A New Phase

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible. More important than the fact that it could be built was his powerful belief that it should be built. The web is the largest collection of information in human history, and web crawl data provides an immensely rich corpus for scientific research, technological advancement, and business innovation. Gil started the Common Crawl Foundation to take action on the belief that it is crucial our information-based society that web crawl data be open and accessible to anyone who desires to utilize it.

That was the inspiration phase of Common Crawl – one person with a passion for openness forming a new foundation to work towards democratizing access to web information, thereby driving a new wave of innovation. Common Crawl quickly moved into the building phase, as Gil found others who shared his belief in the open web. In 2008, Carl Malamud and Nova Spivack joined Gil to form the Common Crawl board of directors. Talented engineer Ahad Rana began developing the technology for our crawler and processing pipeline. Today, thanks to the robust system that Ahad has built, we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.

Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.

Over the next several months, we will be expanding our website and using this blog to describe our technology and data, communicate our philosophy, share ideas, and report on the products of our collaborations. We will also be working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. Most important, we will be talking with the community of people who share our interests. Thinking about an application you’d like to see built on Common Crawl data? Have Hadoop scripts that could be adapted to find insightful information in the crawl data? Know of a stimulating meetup, conference or hackathon we should attend? We want to hear from you!

This is the phase where the original vision truly comes to life, and the ideas Gil Elbaz had years ago will be converted to new products and insights. To say it is an exciting time is a tremendous understatement.

This release was authored by:

No items found.

Erratum:

Content is truncated

Originally reported by:

Permalink

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.

Common Crawl Enters A New Phase

Erratum:

Content is truncated

The Data

Overview

CDXJ Index

Columnar Index

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

CCBot

Infra Status

Opt-Out Registry

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

Team

Jobs

Mission

Impact

Privacy Policy

Terms of Use