2012 Crawl Data Now Available

I am very happy to announce that Common Crawl has released the 2012 crawl data, as well as a number of significant enhancements to our example library and help pages.

New Crawl Data
The 2012 Common Crawl corpus has been released in ARC file format.

JSON Crawl Metadata
In addition to the raw crawl content, the latest release publishes an extensive set of crawl metadata for each document in the corpus.  This metadata includes crawl statistics, charset information, HTTP headers, HTML META tags, anchor tags, and more.

Our hope is that researchers will be able to take advantage of this small-but-powerful data set both to answer high-level questions and to drill into a specific subset of the data they are interested in.

The crawl metadata is stored as JSON in Hadoop SequenceFiles on S3, colocated with ARC content files.  More information about Crawl Metadata can be found here, including a listing of all data points provided.
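To give a feel for what working with these files looks like, here is a minimal, hypothetical sketch of reading a crawl-metadata SequenceFile locally with the standard Hadoop API. The assumption that each record is a (Text URL, Text JSON) pair is ours; consult the Crawl Metadata page for the exact record layout and file naming.

    // Hypothetical sketch: read a crawl-metadata SequenceFile and print each record.
    // Assumes (Text key = URL, Text value = JSON string) records; verify against
    // the Crawl Metadata documentation before relying on this.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class MetadataReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);                 // a metadata file copied from S3
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
          Text url = new Text();
          Text json = new Text();
          while (reader.next(url, json)) {
            // Each value is a JSON document describing one crawled URL
            // (HTTP headers, HTML META tags, anchors, crawl statistics, ...).
            System.out.println(url + "\t" + json);
          }
        } finally {
          reader.close();
        }
      }
    }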

Text-Only Content
This release also features a text-only version of the corpus.  This version contains the page title, meta description, and all visible text content without HTML markup.  We’ve seen dramatic reductions in CPU consumption for applications that use the text-only files instead of extracting text from HTML.

In addition, the text content has been re-encoded from the document’s original character set into UTF-8.  This saves users from having to handle multiple character sets in their application.
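As an illustration of why this matters, the following sketch (not part of the official example library) shows a Hadoop mapper that tokenizes text-only records directly: because the values already hold plain UTF-8 text, no HTML parsing or charset detection is needed. The (Text, Text) input record types are again our assumption.

    // Hypothetical word-count mapper over text-only records.
    // Input record types (Text url, Text textContent) are assumed, not confirmed.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TextOnlyWordCountMapper extends Mapper<Text, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(Text url, Text textContent, Context context)
          throws IOException, InterruptedException {
        // The value already contains the page title, meta description, and
        // visible text as UTF-8, so we can tokenize it directly.
        StringTokenizer tok = new StringTokenizer(textContent.toString());
        while (tok.hasMoreTokens()) {
          word.set(tok.nextToken().toLowerCase());
          context.write(word, ONE);
        }
      }
    }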

More information about our Text-Only content can be found here.

Amazon AMI
Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly.  The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.
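For a rough idea of how such a job might be wired up, a driver for the word-count mapper sketched above could look like the following. This is illustrative only; the AMI's launch scripts and the example library contain the supported, working versions, and the input path argument is a placeholder for wherever you have staged the SequenceFiles, locally or on S3.

    // Hypothetical job driver for the text-only word-count sketch above.
    // Paths are supplied on the command line; nothing here reflects the
    // actual bucket layout, which is documented on the Get Started page.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

    public class TextOnlyWordCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "cc-textonly-wordcount");
        job.setJarByClass(TextOnlyWordCount.class);
        job.setMapperClass(TextOnlyWordCountMapper.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        SequenceFileInputFormat.addInputPath(job, new Path(args[0]));  // input SequenceFiles
        TextOutputFormat.setOutputPath(job, new Path(args[1]));        // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }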

More information about our Amazon Machine Image can be found here.

We hope that everyone out there has an opportunity to try out the latest release.  If you have questions that aren't answered on the Get Started page or in the FAQ, head over to our discussion group and share your question with the community.

19 thoughts on “2012 Crawl Data Now Available”

  1. Congratulations, guys, we've been impatiently waiting for this! Many thanks.
    It would be great if you could share an example of how to use the new metadata :)

    1. (Sorry – added a comment instead of replying!)

      Hi Amine,

      You can see some example code showing how to use the new metadata files here:

        https://github.com/commoncrawl

      Or, you can spin up an instance of our new Amazon AMI and run an example:

        ami-6ba30d02

      – Chris

      1.  Many thanks for answering me; I just couldn't figure it out on my own.
        PS: thanks also for your answer a few days ago on GGroups discussions :)

  2. Nice release. It's still difficult on my side to access S3 due to company policy, but anyway, good job for the lucky others ;-)

  3. Hi Everyone!

    Here are some quick stats on the 2012 Common Crawl corpus:

    Total # of Web Documents:         3.8 billion
    Total Uncompressed Content Size:  100 TB+
    # of Domains:                     61 million

    # of PDFs:        92.2 million
    # of Word Docs:    6.6 million
    # of Excel Docs:   1.3 million

    Top 20 TLDs:

    (note: these may contain HTTP 404 results.)

    Domain Name    Page Count   % of Top 20
    com         2,880,575,573        62.88%
    org           324,888,772         7.09%
    net           285,633,100         6.24%
    de            225,021,051         4.91%
    co.uk         157,660,729         3.44%
    ru             78,841,251         1.72%
    info           76,883,737         1.68%
    pl             68,825,576         1.50%
    nl             68,461,904         1.49%
    fr             62,542,019         1.37%
    it             59,027,654         1.29%
    com.au         41,032,777         0.90%
    edu            36,029,039         0.79%
    com.br         35,458,446         0.77%
    cz             34,635,725         0.76%
    ca             32,767,169         0.72%
    es             31,994,812         0.70%
    jp             28,502,740         0.62%
    ro             26,803,448         0.59%
    se             25,399,890         0.55%

    – Chris

  4. Sorry, guys – couldn’t get the formatting of the table to show up here.

    We’ll be adding a stats page to the wiki.

  5. How deep is the crawl? Does it include company job postings and job descriptions? Has anyone tried to extract that information for each domain, or for a subset of domains?
