July 16, 2012

2012 Crawl Data Now Available

Note: this post has been marked as obsolete.

We are very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages.

New Crawl Data

The 2012 Common Crawl corpus has been released in ARC file format.

JSON Crawl Metadata

In addition to the raw crawl content, the latest release publishes an extensive set of crawl metadata for each document in the corpus.  This metadata includes crawl statistics, charset information, HTTP headers, HTML META tags, anchor tags, and more.

Our hope is that researchers will be able to take advantage of this small but powerful data set both to answer high-level questions and to drill into a specific subset of the data that interests them.

The crawl metadata is stored as JSON in Hadoop SequenceFiles on S3, colocated with ARC content files.  More information about Crawl Metadata can be found here, including a listing of all data points provided.
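As a minimal sketch of what consuming this metadata might look like, the Java snippet below iterates over a single metadata SequenceFile and pulls one field out of the per-document JSON. The key/value types (URL and JSON string as Text) and the field name used are assumptions for illustration, not the official schema.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

/**
 * Sketch: read one crawl-metadata SequenceFile and print a field from the
 * per-document JSON.  Assumes keys are URLs (Text) and values are JSON
 * strings (Text); both are illustrative assumptions.
 */
public class MetadataReaderSketch {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Path to a single metadata SequenceFile, passed on the command line.
    Path path = new Path(args[0]);

    Text url = new Text();
    Text json = new Text();

    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
    try {
      while (reader.next(url, json)) {
        JsonObject doc = JsonParser.parseString(json.toString()).getAsJsonObject();
        // "charset" is an example of the kind of metadata described above,
        // not a guaranteed field name.
        if (doc.has("charset")) {
          System.out.println(url + "\t" + doc.get("charset").getAsString());
        }
      }
    } finally {
      reader.close();
    }
  }
}
```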

Text-Only Content

This release also features a text-only version of the corpus.  This version contains the page title, meta description, and all visible text content without HTML markup.  We’ve seen dramatic reductions in CPU consumption for applications that use the text-only files instead of extracting text from HTML.

In addition, the text content has been re-encoded from the document’s original character set into UTF-8.  This saves users from having to handle multiple character sets in their application.
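To illustrate why this matters, here is a minimal mapper sketch over the text-only content. It assumes records arrive as (url, text) pairs of Text, already in UTF-8, so no HTML parsing or charset handling is needed; the record layout is an assumption for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch of a word-count mapper over the text-only content.  The (url, text)
 * key/value layout is assumed for illustration; the point is that the value
 * is plain UTF-8 text, so no HTML extraction or charset handling is needed.
 */
public class TextOnlyWordCountMapper
    extends Mapper<Text, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Text url, Text pageText, Context context)
      throws IOException, InterruptedException {
    // pageText holds the title, meta description and visible text with no
    // markup, so simple whitespace tokenization is enough here.
    for (String token : pageText.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token.toLowerCase());
        context.write(word, ONE);
      }
    }
  }
}
```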

More information about our Text-Only content can be found here.

Amazon AMI

Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly.  The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.

More information about our Amazon Machine Image can be found here.

We hope that everyone out there has an opportunity to try out the latest release.  If you have questions that aren’t answered in the Get Started page or FAQ, head over to our discussion group and share your question with the community.


Erratum: 

Charset Detection Bug in WET Records

Originally reported by: Javier de la Rosa

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records did not work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the CC fork of Apache Nutch). WET records in all subsequent crawls should contain significantly fewer character-encoding errors. Originally discussed here in Google Groups.
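For readers unfamiliar with this step, the sketch below illustrates the general detect-then-transcode logic a WET conversion needs for non-UTF-8 pages, using ICU4J's CharsetDetector. It is not the Web Archive Commons or Nutch implementation, just the idea.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

/**
 * Rough illustration of detecting the charset of raw HTML bytes and
 * re-encoding the text as UTF-8, as a WET conversion must do for
 * non-UTF-8 pages.
 */
public final class CharsetTranscodeSketch {

  public static String toUtf8(byte[] rawHtml) {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(rawHtml);
    CharsetMatch match = detector.detect();

    // Fall back to UTF-8 when detection fails or names an unsupported charset.
    Charset source = StandardCharsets.UTF_8;
    if (match != null && Charset.isSupported(match.getName())) {
      source = Charset.forName(match.getName());
    }
    return new String(rawHtml, source);
  }

  public static void main(String[] args) {
    // Detection on very short inputs like this may be unreliable; it is only
    // meant to show the call sequence.
    byte[] latin1 = "café served at the café".getBytes(Charset.forName("ISO-8859-1"));
    System.out.println(toUtf8(latin1));
  }
}
```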

Erratum: 

ARC Format (Legacy) Crawls


Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format.

The ARC format, which predates WARC, was the initial format used for storing web crawl data. It encapsulates multiple resources (web pages, images, etc.) into a single file, with each resource preceded by a header containing metadata such as the URL, MIME type, and length. While effective, the ARC format has limitations, particularly in terms of extensibility and the ability to store additional metadata.

In contrast, the WARC format, which is an extension of ARC, addresses these limitations: it allows for more comprehensive metadata, better handling of content types, and the capability to store additional information such as HTTP headers, which are crucial for a more accurate representation of the archived data.
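As a rough illustration of how lean the older format is, the sketch below splits the five space-separated fields of a version-1 ARC record header line (URL, IP address, archive date, content type, record length). Real ARC readers also handle the file's version block, line endings, and malformed records, which this skips.

```java
/**
 * Sketch: parse the five space-separated fields of a version-1 ARC record
 * header line.  Field order follows the ARC spec; everything else a real
 * reader must handle (version block, malformed records) is omitted.
 */
public final class ArcHeaderSketch {

  public record ArcHeader(String url, String ip, String date,
                          String contentType, long length) {}

  public static ArcHeader parse(String headerLine) {
    String[] fields = headerLine.trim().split(" ");
    if (fields.length != 5) {
      throw new IllegalArgumentException("Unexpected ARC header: " + headerLine);
    }
    return new ArcHeader(fields[0], fields[1], fields[2], fields[3],
                         Long.parseLong(fields[4]));
  }

  public static void main(String[] args) {
    ArcHeader h = parse(
        "http://example.com/ 192.0.2.1 20120716120000 text/html 2153");
    System.out.println(h.url() + " -> " + h.length() + " bytes");
  }
}
```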

More information about these formats can be found in our blog post Web Archiving Formats Explained.

Erratum: 

Missing Language Classification


Starting with crawl CC-MAIN-2018-39, we added a language classification field (‘content-languages’) to the columnar indexes, WAT files, and WARC metadata for all subsequent crawls. Languages are detected with the CLD2 classifier, and the field lists up to three languages per document using ISO-639-3 (three-character) language codes.
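As a minimal sketch of using this field, the snippet below checks whether a document's content-languages value includes a given ISO-639-3 code. It assumes the value is a comma-separated list of up to three codes (e.g. "eng" or "eng,fra"); the exact layout in WAT files and WARC metadata may differ.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch: filter documents by the content-languages field, assuming the
 * value is a comma-separated list of ISO-639-3 codes such as "eng,fra".
 */
public final class LanguageFilterSketch {

  public static boolean hasLanguage(String contentLanguages, String iso639_3) {
    if (contentLanguages == null || contentLanguages.isEmpty()) {
      return false;
    }
    List<String> codes = Arrays.asList(contentLanguages.split(","));
    return codes.contains(iso639_3);
  }

  public static void main(String[] args) {
    System.out.println(hasLanguage("eng,fra", "fra")); // true
    System.out.println(hasLanguage("deu", "eng"));     // false
  }
}
```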