It’s best to access the data from the AWS region where it is located (us-east-1). The connection to S3 is faster, and you avoid the (albeit minimal) fees for inter-region data transfer: the requests you send are charged as outgoing traffic.
Be careful using an Elastic IP address or load balancer, because you may be charged for the routed traffic.
You may use the AWS Command Line Interface, but many AWS services (e.g. EMR) support the s3:// protocol natively, so you can specify your input directly as s3://commoncrawl/path_to_file, sometimes even using wildcards.
On Hadoop (but not EMR) it’s recommended to use the S3A protocol: simply change the scheme to s3a://.
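For example, assuming a Hadoop installation where the S3A connector is configured with valid AWS credentials, a crawl segment can be listed directly:
hadoop fs -ls s3a://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/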
The AWS Command Line Interface can be used to access the data from anywhere (including EC2). It’s easy to install on most operating systems (Windows, macOS, Linux). Please follow the installation instructions.
Please note, access to data from the Amazon cloud using the S3 API is only allowed for authenticated users. Please see our blog announcement for more information.
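Authenticated requests require AWS credentials to be available to the CLI. One common way to set them up (assuming you already have an access key pair) is the interactive configuration command, which stores the keys locally; alternatively, the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables can be used:
aws configure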
Once the AWS CLI is installed, the command to copy a file to your local machine is:
aws s3 cp s3://commoncrawl/path_to_file <local_path>
You may first want to look at the data, e.g., to list all WARC files of a specific segment of the April 2018 crawl:
> aws s3 ls s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/
2018-04-20 10:27:49 931210633 CC-MAIN-20180420081400-20180420101400-00000.warc.gz
2018-04-20 10:28:32 935833042 CC-MAIN-20180420081400-20180420101400-00001.warc.gz
2018-04-20 10:29:51 940140704 CC-MAIN-20180420081400-20180420101400-00002.warc.gz
The command to download the first file in the listing is:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz <local_path>
The AWS CLI supports recursive copying, and allows for pattern-based inclusion/exclusion of files.
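As an illustrative sketch (the pattern below assumes the file naming shown in the listing above), the following copies only the WARC files numbered 00000 through 00009 of that segment into the current directory:
aws s3 cp s3://commoncrawl/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/ . --recursive --exclude "*" --include "*-0000?.warc.gz"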
For more information check the AWS CLI user guide or call the command-line help (here for the cp command):
aws s3 cp help
Alternatively, you can download files over HTTPS with any HTTP download agent, such as wget or cURL; no AWS account is required for this method. Simply add the full path to the prefix https://data.commoncrawl.org/, e.g.:
wget https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz
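cURL works equally well; for instance, the -O option saves the file under its remote name:
curl -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2018-17/segments/1524125937193.1/warc/CC-MAIN-20180420081400-20180420101400-00000.warc.gz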
Common Crawl currently stores the crawl data using the Web ARChive (WARC) Format. Previously (prior to Summer 2013) the data was stored in the ARC Format.
The WARC format allows for more efficient storage and processing of Common Crawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.
If you want all the nitty-gritty details, the best source is the IIPC document on the WARC Standard.
Below is an overview of the differences between:
WARC files, which store the raw crawl data
WAT files, which store computed metadata for the data stored in the WARC
WET files, which store extracted plaintext from the data stored in the WARC
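To get a first impression of what a WARC record looks like, you can (assuming you have downloaded one of the WARC files listed above) decompress it on the fly and print the leading lines:
gzip -dc CC-MAIN-20180420081400-20180420101400-00000.warc.gz | head -n 20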