Blog

The latest news, interviews, technologies, and resources.

Host- and Domain-Level Web Graphs January, February, and March 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026. The graphs consist of 270.2 million nodes and 9 billion edges at the host level, and 120 million nodes and 4.4 billion edges at the domain level.

Luca Foppiano

Luca Foppiano is a Senior Engineer at the Common Crawl Foundation.

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

Crawl Release

February 2017 Crawl Archive Now Available

The crawl archive for February 2017 is now available! The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

January 2017 Crawl Archive Now Available

The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

December 2016 Crawl Archive Now Available

The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

October 2016 Crawl Archive Now Available

The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

September 2016 Crawl Archive Now Available

The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

News Dataset Available

We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Analysis

Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016 we release two data sets containing robots.txt files and server responses with HTTP status code other than 200 (404s, redirects, etc.) The data may be useful to anyone interested in web science, with various applications in the field.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

August 2016 Crawl Archive Now Available

The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

News

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

Host- and Domain-Level Web Graphs January, February, and March 2026

February 2017 Crawl Archive Now Available

January 2017 Crawl Archive Now Available

December 2016 Crawl Archive Now Available

October 2016 Crawl Archive Now Available

September 2016 Crawl Archive Now Available

News Dataset Available

Data Sets Containing Robots.txt Files and Non-200 Responses

August 2016 Crawl Archive Now Available

July 2016 Crawl Archive Now Available

June 2016 Crawl Archive Now Available

May 2016 Crawl Archive Now Available

April 2016 Crawl Archive Now Available

Welcome, Sebastian!

February 2016 Crawl Archive Now Available

November 2015 Crawl Archive Now Available

September 2015 Crawl Archive Now Available

August 2015 Crawl Archive Available

Web Image Size Prediction for Efficient Focused Image Crawling

July 2015 Crawl Archive Available

June 2015 Crawl Archive Available

May 2015 Crawl Archive Available

April 2015 Crawl Archive Available

March 2015 Crawl Archive Available

Announcing the Common Crawl Index!

Evaluating graph computation systems

February 2015 Crawl Archive Available

5 Good Reads in Big Open Data: March 26 2015

5 Good Reads in Big Open Data: March 20 2015

5 Good Reads in Big Open Data: March 13 2015

5 Good Reads in Big Open Data: March 6 2015

January 2015 Crawl Archive Available

5 Good Reads in Big Open Data: February 27 2015

Analyzing a Web graph with 129 billion edges using FlashGraph

5 Good Reads in Big Open Data: Feb 20 2015

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

5 Good Reads in Big Open Data: Feb 13 2015

5 Good Reads in Big Open Data: Feb 6 2015

The Promise of Open Government Data & Where We Go Next

December 2014 Crawl Archive Available

November 2014 Crawl Archive Available

Please Donate To Common Crawl!

October 2014 Crawl Archive Available

September 2014 Crawl Archive Available

August 2014 Crawl Data Available

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

July 2014 Crawl Data Available

April 2014 Crawl Data Available

Navigating the WARC file format

March 2014 Crawl Data Now Available

Common Crawl's Move to Nutch

Lexalytics Text Analysis Work with Common Crawl Data

Winter 2013 Crawl Data Now Available

New Crawl Data Available!

Hyperlink Graph from Web Data Commons

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

A Look Inside Our 210TB 2012 Web Corpus

Professor Jim Hendler Joins the Common Crawl Advisory Board!

URL Search Tool!

The Winners of The Norvig Web Data Science Award

Analysis of the NCSU Library URLs in the Common Crawl Index

Common Crawl URL Index

blekko donates search data to Common Crawl

The Norvig Web Data Science Award

Towards Social Discovery - New Content Models; New Data; New Toolsets

Winners of the Code Contest!

Common Crawl Code Contest Extended Through the Holiday Weekend

TalentBin Adds Prizes To The Code Contest

Amazon Web Services sponsoring $50 in credit to all contest entrants!

Still time to participate in the Common Crawl code contest

Strata Conference + Hadoop World

Mat Kelcey Joins The Common Crawl Advisory Board

Common Crawl's Brand Spanking New Video and First Ever Code Contest!

2012 Crawl Data Now Available

The Open Cloud Consortium’s Open Science Data Cloud

OSCON 2012

Learn Hadoop and get a paper published

Big Data Week: meetups in SF and around the world

Data 2.0 Summit

Twelve steps to running your Ruby code across five billion web pages