Get Started


Accessing the Data

Crawl data is free to access by anyone from anywhere.

The data is hosted by Amazon Web Services’ Open Data Sets Sponsorships program on the bucket s3://commoncrawl/, located in the US-East-1 (Northern Virginia) AWS Region.

You may process the data in the AWS cloud or download it for free over HTTP(S) with a good Internet connection.

The data can be accessed using the URL schemes s3://commoncrawl/[...], https://ds5q9oxwqwsfj.cloudfront.net/[...] and https://data.commoncrawl.org/[...].

To access the data from outside the Amazon cloud via HTTP(S), the URL prefix https://data.commoncrawl.org/ must be used.

For further detail on the data file formats listed below, please visit the ISO Website, which provides format standards, information and documentation. There are also helpful explanations and details regarding the file formats in various GitHub projects.
The status of our infrastructure can be monitored on our Infra Status page.
Accessing the data in the AWS Cloud

It’s best to access the data from the region where it is located (us-east-1).

The connection to S3 will be faster, and you avoid the small inter-region data transfer fees (the requests you send are charged as outgoing traffic).

Be careful when using an Elastic IP address or a load balancer, because you may be charged for the routed traffic.

You may use the AWS Command Line Interface, but many AWS services (e.g., EMR) support the s3:// protocol, so you can directly specify your input as s3://commoncrawl/path_to_file, sometimes even using wildcards.

On Hadoop (not EMR) it’s recommended to use the S3A protocol: just change the URL scheme to s3a://.
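
As a minimal sketch of what this looks like in practice, here is a PySpark snippet. PySpark, the hadoop-aws module and the chosen file are assumptions on our part, not requirements of this page; the path is taken from the listing shown in the AWS CLI section below.

# Requires the hadoop-aws module and valid AWS credentials on the cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commoncrawl-path-demo").getOrCreate()

# An s3a:// input path can be passed directly to Spark/Hadoop input methods.
warc_path = ("s3a://commoncrawl/crawl-data/CC-MAIN-2018-17/"
             "segments/1524125937193.1/warc/"
             "CC-MAIN-20180420081400-20180420101400-00000.warc.gz")

# Gzip is decompressed transparently; counting lines is only a sanity check,
# real WARC processing needs a WARC-aware reader.
print(spark.read.text(warc_path).count())

spark.stop()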

Accessing the data from outside the AWS Cloud

If you want to download the data to your local machine or local cluster, you can use any HTTP download agent, such as cURL or wget. The data is accessible via the https://data.commoncrawl.org/[...] URL scheme.

There is no need to create an AWS account in order to access the data using this method.

Using the AWS Command Line Interface

The AWS Command Line Interface can be used to access the data from anywhere (including EC2). It’s easy to install on most operating systems (Windows, macOS, Linux). Please follow the installation instructions.

Please note, access to data from the Amazon cloud using the S3 API is only allowed for authenticated users. Please see our blog announcement for more information.

Once the AWS CLI is installed, the command to copy a file to your local machine is:
aws s3 cp s3://commoncrawl/path_to_file <local_path>
You may first want to look at the data, e.g., to list all WARC files of a specific segment of the April 2018 crawl:
> aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/
2018-04-20 10:27:49 931210633 CC-MAIN-20180420081400-20180420101400-00000.warc.gz
2018-04-20 10:28:32 935833042 CC-MAIN-20180420081400-20180420101400-00001.warc.gz
2018-04-20 10:29:51 940140704 CC-MAIN-20180420081400-20180420101400-00002.warc.gz

The command to download the first file in the listing is:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz <local_path>

The AWS CLI supports recursive copying and allows for pattern-based inclusion/exclusion of files.

For more information check the AWS CLI user guide or call the command-line help (here for the cp command):
aws s3 cp help
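
If you prefer to do the same from Python, here is a minimal boto3 sketch. boto3 and the use of configured credentials are assumptions on our part; as noted above, unauthenticated S3 API requests are not accepted.

# A minimal boto3 sketch; valid AWS credentials (e.g. a configured profile) are needed.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# List a few WARC files of the segment used in the CLI example above.
prefix = "crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/"
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix=prefix, MaxKeys=3)
for obj in resp.get("Contents", []):
    print(obj["Size"], obj["Key"])

# Download the first file in the listing (equivalent to `aws s3 cp`).
key = resp["Contents"][0]["Key"]
s3.download_file("commoncrawl", key, key.rsplit("/", 1)[-1])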

Using HTTP download agents

To download a file using an HTTP download agent, append the full path to the prefix https://data.commoncrawl.org/, e.g.:
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz

Example Code

If you’re more interested in diving into code, we’ve provided introductory Examples that use the Hadoop or Spark frameworks to process the data, and many more examples can be found in our Tutorials Section and on our GitHub.

Here's an example of how to fetch a page with Python, using the Common Crawl Index to locate it:
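
One possible sketch follows. The crawl label CC-MAIN-2018-17 and the looked-up URL are illustrative, and the use of the requests library and the index endpoint at index.commoncrawl.org are assumptions on our part.

import gzip
import json
import requests

# 1. Query the URL index for a capture of the page (crawl label and URL are examples).
index_url = "https://index.commoncrawl.org/CC-MAIN-2018-17-index"
params = {"url": "commoncrawl.org", "output": "json", "limit": "1"}
response = requests.get(index_url, params=params, timeout=30)
capture = json.loads(response.text.splitlines()[0])  # filename, offset, length, ...

# 2. Fetch only that record from the WARC file with an HTTP Range request.
offset, length = int(capture["offset"]), int(capture["length"])
headers = {"Range": "bytes={}-{}".format(offset, offset + length - 1)}
warc_url = "https://data.commoncrawl.org/" + capture["filename"]
raw = requests.get(warc_url, headers=headers, timeout=30).content

# 3. Each record is a separate gzip member: decompress it to get the WARC headers,
#    the HTTP headers and the HTML payload.
print(gzip.decompress(raw).decode("utf-8", errors="replace")[:1000])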

Data Types

Common Crawl currently stores the crawl data using the Web ARChive (WARC) Format. Previously (prior to Summer 2013) the data was stored in the ARC Format.

The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

If you want all the nitty-gritty details, the best source is the IIPC document on the WARC Standard.

Here is an overview of the differences between:

WARC files which store the raw crawl data
WAT files which store computed metadata for the data stored in the WARC
WET files which store extracted plaintext from the data stored in the WARC

The WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process.

Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, (what you would get if you downloaded the file) but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://news.bbc.co.uk/2/hi/africa/3414345.stm and received HTML in response.

We can also see that the response was served by an Apache web server, sets caching details, and attempts to set a cookie (shortened for display here).

Here is a shortened extract of this WARC record:
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID:
WARC-Concurrent-To:
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=...; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>
BBC NEWS | Africa | Namibia braces for Nujoma exit
</title>
...
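
To work with these records programmatically, here is a minimal sketch using the third-party warcio library (an assumption; this page does not prescribe a particular reader). It iterates over the response records of a locally downloaded WARC file, such as the one fetched in the AWS CLI example above.

from warcio.archiveiterator import ArchiveIterator

# Hypothetical local copy of the file downloaded above.
warc_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.gz"

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        # WARC-Type: response records carry the raw HTTP response.
        if record.rec_type != "response":
            continue
        uri = record.rec_headers.get_header("WARC-Target-URI")
        status = record.http_headers.get_statuscode()
        content_type = record.http_headers.get_header("Content-Type")
        print(status, content_type, uri)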
The WAT Format

WAT files contain important metadata about the records stored in the WARC format. This metadata is computed for each of the three types of records (metadata, request, and response).

If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page. This information is stored as JSON.

To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the file yourself, you can use one of the many formatting tools available, such as JSONFormatter.io.

The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below:

Envelope
 WARC-Header-Metadata
   WARC-Target-URI [string]
   WARC-Type [string]
   WARC-Date [datetime string]
   ...
 Payload-Metadata
   HTTP-Response-Metadata
     Headers
       Content-Language
       Content-Encoding
       ...
     HTML-Metadata
       Head
         Title [string]
         Link [list]
         Metas [list]
       Links [list]
     Headers-Length [int]
     Entity-Length [int]
     ...
   ...
 ...
Container
 Gzip-Metadata [object]
 Compressed [boolean]
 Offset [int]
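
As a minimal sketch of walking this skeleton, here is a snippet that again uses the warcio library (an assumption) together with the standard json module to print the outgoing links of each page. The local WAT file name is hypothetical and follows the usual *.warc.wat.gz naming pattern.

import json
from warcio.archiveiterator import ArchiveIterator

# Hypothetical local WAT file.
wat_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.wat.gz"

with open(wat_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue  # WAT payloads are stored as WARC metadata records
        envelope = json.loads(record.content_stream().read())["Envelope"]
        uri = envelope["WARC-Header-Metadata"].get("WARC-Target-URI")
        response_meta = envelope["Payload-Metadata"].get("HTTP-Response-Metadata", {})
        links = response_meta.get("HTML-Metadata", {}).get("Links", [])
        print(uri, len(links), "links")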
The WET Format

As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext.

The way in which this textual data is stored in the WET format is quite simple: the WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

Here is a shortened extract of a WET record:
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
WARC-Refers-To:
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724

BBC NEWS | Africa | Namibia braces for Nujoma exit
...
President Sam Nujoma works in very pleasant surroundings in the small but beautiful old State House...
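
Reading the plaintext out of a WET file follows the same pattern; here is a minimal sketch, once more using the warcio library (an assumption). The local WET file name is hypothetical and follows the usual *.warc.wet.gz naming pattern.

from warcio.archiveiterator import ArchiveIterator

# Hypothetical local WET file.
wet_path = "CC-MAIN-20180420081400-20180420101400-00000.warc.wet.gz"

with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":
            continue  # WET plaintext is stored as WARC conversion records
        uri = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(uri, text[:80].replace("\n", " "))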