April 23, 2025

Introducing the Host Index

Note: this post has been marked as obsolete.
Introducing the Host Index: a new dataset with one row per web host per crawl, combining crawl stats, status codes, languages, and bot defense data. Queryable via AWS tools or downloadable.
Greg Lindahl
Greg is the Chief Technology Officer at the Common Crawl Foundation.

We are pleased to announce a public test of a new web dataset, the Host Index. Common Crawl has long offered a dataset of crawled web data, along with two indexes that help find individual web pages in the 10-petabyte-sized dataset. We also have the Web Graph, which has information about web hosts and domains – their ranks and their connections to each other.


The new Host Index has one row for every web host we know about, in each individual crawl. It contains summary information from the crawl, indexes, the web graph, and our raw crawler logs. You can use it directly from AWS using SQL tools such as Amazon Athena or DuckDB, or you can download it to your own disk (24 crawls x 7 gigabytes each).
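
If you go the DuckDB route, a query over one crawl's downloaded files can be as simple as the sketch below. This is only a sketch: it assumes the files are Parquet and uses an illustrative local path, so check the README in the GitHub repository for the actual file layout and column list.

-- A minimal sketch: query one crawl's downloaded files with DuckDB.
-- The path is illustrative; see the cc-host-index README for the real layout.
SELECT crawl, surt_host_name, hcrank10
FROM read_parquet('host-index/CC-MAIN-2025-13/*.parquet')
ORDER BY hcrank10 DESC
LIMIT 10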

Here are some highlights from the schema:

  • Counts of pages crawled by status code, including successes, redirects, not-found, not-changed, etc.
  • Details about robots.txt fetches, which can highlight web hosts using “bot defenses” to refuse all traffic from CCBot
  • A summary of languages used on a web host, currently just the number (and percentage) of web pages that are in languages other than English

Here’s a graph of the recent crawl history of our own web host, commoncrawl.org. We revamped our website in 2023, moving many webpages to new places and leaving behind redirects. You can see a big spike in redirected web pages, which took a couple of months to fully work through. This time delay highlights that our crawl is never “complete”. It is a sample of the web.

A graph of the recent crawl history of our own web host, commoncrawl.org

Here’s an example of a more specific query, which returns a list of Vatican (*.va) web hosts that have more than 50% of their web pages in languages other than English:

SELECT
  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote
FROM cchost_index_testing_v2
WHERE crawl = 'CC-MAIN-2025-13'
  AND url_host_tld = 'va'
  AND fetch_200_lote_pct > 50
ORDER BY hcrank10 DESC
LIMIT 10
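
(This query is written for Amazon Athena, against the test table cchost_index_testing_v2; in DuckDB you would point the FROM clause at the Parquet files instead, as in the sketch above.) The fetch_200_lote and fetch_200_lote_pct columns are the language counts described in the schema highlights: the number and percentage of successfully fetched pages in languages other than English. Sorting by hcrank10, a host rank derived from the Web Graph data in the Host Index, puts the most prominent matching hosts first.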

For many more details, please see the README in the https://github.com/commoncrawl/cc-host-index GitHub repository.

We’re very interested in feedback. Please contact us at info@commoncrawl.org, on our Google Group, or on our Discord server.


Erratum: Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
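
One way to gauge how much of a crawl is affected is to group captures by the truncation flag in our columnar URL index. The sketch below assumes the index is set up in Amazon Athena as the ccindex table, as described in our columnar index documentation, and that the content_truncated column carries the flag:

-- Sketch: count captures by truncation flag for one crawl.
-- Assumes the columnar URL index is registered in Athena as "ccindex"."ccindex".
SELECT content_truncated, COUNT(*) AS captures
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-13'
  AND subset = 'warc'
GROUP BY content_truncated
ORDER BY captures DESC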

For more details, see our truncation analysis notebook.