We are pleased to announce a public test of a new web dataset, the Host Index. Common Crawl has long offered crawled web data, along with two indexes that help you find individual web pages in that 10-petabyte dataset. We also have the Web Graph, which has information about web hosts and domains – their ranks and their connections to each other.
https://github.com/commoncrawl/cc-host-index
The new Host Index has one row for every web host we know about, in each individual crawl. It contains summary information from the crawl, our indexes, the Web Graph, and our raw crawler logs. You can use it directly from AWS with SQL tools such as Amazon Athena or duckdb, or you can download it to your own disk (24 crawls × 7 gigabytes each).
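For example, once you’ve downloaded one crawl’s Parquet files, you can query them locally with duckdb’s read_parquet(). This is only a sketch: the local path below is a placeholder, and the columns are the same ones used in the query example later in this post; the full schema and actual file layout are described in the README. With duckdb’s httpfs extension you can also read the files in place without downloading them first.
-- duckdb sketch: top-ranked hosts in one crawl, read from a local copy.
-- The path is a placeholder; see the README for the real file layout.
SELECT surt_host_name, hcrank10
FROM read_parquet('host-index/CC-MAIN-2025-13/*.parquet')
ORDER BY hcrank10 DESC
LIMIT 10;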
Here are some highlights from the schema:
- Counts of pages crawled by status code, including successes, redirects, not-found, not-changed, etc.
- Details about robots.txt fetches, which can highlight web hosts using “bot defenses” to refuse all traffic from CCBot
- A summary of languages used on a web host, currently just the number (and percentage) of web pages that are in languages other than English
Here’s a graph of the recent crawl history of our own web host, commoncrawl.org. We revamped our website in 2023, moving many webpages to new places and leaving behind redirects. You can see a big spike in redirected web pages, which took a couple of months to taper off. This delay highlights that our crawl is never “complete”: it is a sample of the web.
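If you want to pull the numbers behind a chart like this yourself, you can select a single host’s rows across all crawls. In the sketch below, 'org,commoncrawl' assumes the SURT-style reversed form of commoncrawl.org, and fetch_redirect is a hypothetical stand-in for the actual per-status-code count columns; both are documented precisely in the README.
-- Sketch: one host’s history across crawls.
-- 'org,commoncrawl' assumes the SURT-style reversal of commoncrawl.org;
-- fetch_redirect is a hypothetical stand-in for the real
-- redirect-count column named in the schema.
SELECT crawl, fetch_redirect
FROM cchost_index_testing_v2
WHERE surt_host_name = 'org,commoncrawl'
ORDER BY crawl;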

Here’s an example of a fairly specific query, which returns a list of Vatican (*.va) web hosts that have more than 50% of their web pages in languages other than English:
SELECT
  crawl, surt_host_name, hcrank10, fetch_200_lote_pct, fetch_200_lote
FROM cchost_index_testing_v2
WHERE crawl = 'CC-MAIN-2025-13'
  AND url_host_tld = 'va'
  AND fetch_200_lote_pct > 50
ORDER BY hcrank10 DESC
LIMIT 10;
For many more details, please see the README in the https://github.com/commoncrawl/cc-host-index GitHub repository.
We’re very interested in feedback. Please contact us at info@commoncrawl.org, or on our Google Group, or our Discord server.

Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
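If you want to measure how much of a crawl is affected, one option is to query the (separate) columnar URL index for captures flagged as truncated. The sketch below assumes you have that index set up in Athena under the usual ccindex database and table names from our documentation; the content_truncated field records the reason a capture was cut short.
-- Sketch against the columnar URL index (not the Host Index):
-- count truncated captures in one crawl, grouped by truncation reason.
-- Assumes the standard ccindex Athena setup from our documentation.
SELECT content_truncated, COUNT(*) AS captures
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2025-13'
  AND subset = 'warc'
GROUP BY content_truncated
ORDER BY captures DESC;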