Blog

The latest news, interviews, technologies, and resources.

IPv6 Adoption Across the Top 100K Web Hosts

We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November/December 2023, February/March 2024, and April 2024.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Crawl Release

April 2024 Crawl Archive Now Available

We are pleased to announce that the crawl archive for April 2024 is now available. The data was crawled between April 12th and April 25th, and contains 2.7 billion web pages (or 386 TiB of uncompressed content). Page captures are from 47.24 million hosts or 37.65 million registered domains and include 0.98 billion new URLs not visited in any of our prior crawls.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

News

March/April 2024 Newsletter

We're excited to share an update on some of our recent projects and initiatives in this newsletter!

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Web Graphs

Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of September/October 2023, November/December 2023, and February/March 2024.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Crawl Release

February/March 2024 Crawl Archive Now Available

The crawl archive for February/March 2024 is now available. The data was crawled between February 20th and March 5th, and contains 3.16 billion web pages (or 424.7 TiB of uncompressed content).

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Analysis

Web Archiving File Formats Explained

In the ever-evolving landscape of digital archiving and data analysis, it helps to understand the file formats used for web crawling. From the early ARC format to the more advanced WARC, and the specialised WET and WAT files, each plays an important role in web archiving. In this post, we explain these formats and explore their features, applications, and the enhancements they offer.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Analysis

A Further Look Into the Prevalence of Various ML Opt-Out Protocols

This post details some experiments we have run on Machine Learning opt-out protocols. We investigated the prevalence of several of these protocols by taking a closer look at our WARC files and measuring what proportion of domains use each one.

Alex Xue

Alex is a Computer Science graduate from the University of Waterloo, Canada, and emeritus member of the Common Crawl Foundation.

Analysis

Balancing Discovery and Privacy: A Look Into Opt-Out Protocols

What opt-out protocols are, why they matter, how you can use them, how we respect them, and which emerging initiatives surround them.

Alex Xue

Alex is a Computer Science graduate from the University of Waterloo, Canada, and emeritus member of the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs May/Sep/Nov 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Crawl Release

November/December 2023 Crawl Archive Now Available

The crawl archive for November/December 2023 is now available. The data was crawled between November 28th and December 12th, and contains 3.35 billion web pages (or 454 TiB of uncompressed content).

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

News

Oct/Nov 2023 Performance Issues

Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row. This post details some steps to take if you are impacted by performance issues.

Greg Lindahl

Greg is Chief Technology Officer at the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs Mar/May/Oct 2023

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023. The host-level graph consists of 378.7 million nodes and 2.6 billion edges, and the domain-level graph has 94.2 million nodes and 1.7 billion edges.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Crawl Release

September/October 2023 crawl archive now available

The crawl archive for September/October 2023 is now available! The data was crawled Sept 21 – October 5 and contains 3.4 billion web pages or 456 TiB of uncompressed content.

Julien Nioche

Julien is a member of the Apache Software Foundation and an emeritus member of the Common Crawl Foundation.

News

Bridging Digital Exploration and Scientific Frontiers

This month, Common Crawl Foundation members had the privilege of attending the 5th International Open Search Symposium at CERN in Geneva, Switzerland.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Crawl Release

May/June 2023 crawl archive now available

The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content. Page captures are from 44 million hosts or 35 million registered domains and include 1.0 billion new URLs not visited in any of our prior crawls.

March/April 2023 crawl archive now available

Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

January/February 2023 crawl archive now available

November/December 2022 crawl archive now available

September/October 2022 crawl archive now available

Host- and Domain-Level Web Graphs May, June/July and August 2022

August 2022 crawl archive now available

June/July 2022 crawl archive now available

May 2022 crawl archive now available

Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

January 2022 crawl archive now available

November/December 2021 crawl archive now available

October 2021 crawl archive now available

Host- and Domain-Level Web Graphs June, July/August and September 2021

September 2021 crawl archive now available

July/August 2021 crawl archive now available

June 2021 crawl archive now available

Host- and Domain-Level Web Graphs February/March, April and May 2021

May 2021 crawl archive now available

April 2021 crawl archive now available

February/March 2021 crawl archive now available

Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

January 2021 crawl archive now available

November/December 2020 crawl archive now available

October 2020 crawl archive now available

Interactive Webgraph Statistics Notebook Released

Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

September 2020 crawl archive now available

August 2020 crawl archive now available

July 2020 crawl archive now available

Host- and Domain-Level Web Graphs Feb/Mar/May 2020

May/June 2020 crawl archive now available

March/April 2020 crawl archive now available

February 2020 crawl archive now available

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

January 2020 crawl archive now available

December 2019 crawl archive now available

November 2019 crawl archive now available

Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

October 2019 crawl archive now available

September 2019 crawl archive now available

August 2019 crawl archive now available

Host- and Domain-Level Web Graphs May/June/July 2019

July 2019 crawl archive now available

June 2019 crawl archive now available

May 2019 crawl archive now available

Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

April 2019 crawl archive now available

March 2019 crawl archive now available

February 2019 crawl archive now available

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

January 2019 crawl archive now available

December 2018 crawl archive now available

November 2018 crawl archive now available

Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

October 2018 crawl archive now available

September 2018 crawl archive now available

August Crawl Archive Introduces Language Annotations

Host- and Domain-Level Web Graphs May/June/July 2018

3.25 Billion Pages Crawled in July 2018

June 2018 Crawl Archive Now Available

May 2018 Crawl Archive Now Available

Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018