May 20, 2026

April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.

We are pleased to announce that the April 2026 crawl archive (CC-MAIN-2026-17) is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3. This is an early experiment in distributing crawl data through a new channel. AWS S3 remains the canonical distribution point for the Common Crawl corpus, generously hosted through Amazon Web Services' Open Data Sponsorship Program, and every crawl remains available from the Common Crawl bucket on S3 exactly as before. The Hugging Face Bucket is an additional way to reach the same data.


Why Hugging Face Buckets

Storage Buckets are a recent addition to the Hugging Face Hub: mutable, S3-like object storage that can be browsed in the browser, scripted from Python, or managed with the hf CLI. Unlike Models and Datasets repositories they are not version-controlled, which suits a continuously growing archive of crawl data.

Two properties make them interesting for a dataset of this size. They are backed by Xet, Hugging Face's chunk-based storage backend, which deduplicates content across files. And they support pre-warming: the Common Crawl Bucket currently has its pre-warmed CDN enabled for GCP US East, GCP EU West, and AWS EU West. For jobs running in or near one of those regions, this reduces read latency and improves throughput compared to pulling data from a distant region.

For teams already building on the Hugging Face ecosystem, the Bucket also makes integration simpler: the data is reachable with the same hf CLI, huggingface_hub client, and fsspec-compatible tooling used for Models and Datasets, with no separate S3 client or credentials.

Accessing the Common Crawl Bucket

The Common Crawl Bucket lives at huggingface.co/buckets/commoncrawl/commoncrawl. The April 2026 crawl sits under the crawl-data/CC-MAIN-2026-17/ prefix, mirroring the layout used on S3. It holds roughly 380 TiB of uncompressed content across some 2.2 billion web pages. WARC, WAT, and WET files, segment lists, robots.txt and non-200 response records, and the URL indexes are all present, just as in the S3 distribution.

The Bucket can be addressed with an hf:// handle. Files can be listed with the hf CLI:

hf buckets list commoncrawl/commoncrawl/crawl-data/CC-MAIN-2026-17

Because Buckets integrate with HfFileSystem, the fsspec-compatible filesystem in huggingface_hub, any fsspec-aware library can read the archive directly via hf:// paths:

from huggingface_hub import hffs

hffs.glob("buckets/commoncrawl/commoncrawl/crawl-data/CC-MAIN-2026-17/**/*.warc.gz")
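Once a file is open, its records can be walked with ordinary stream handling. The following is a minimal sketch, not Common Crawl's own tooling: iter_warc_headers is a hypothetical helper that accepts any binary file object and yields the header fields of each WARC record. The file object could come from something like hffs.open(path, "rb"), a call assumed here from the general fsspec interface rather than shown above. For production use, a dedicated WARC library such as warcio is likely the better choice.

```python
import gzip
from typing import BinaryIO, Dict, Iterator


def iter_warc_headers(raw: BinaryIO) -> Iterator[Dict[str, str]]:
    """Yield the header fields of each record in a gzip-compressed WARC stream.

    Hypothetical helper for illustration; it only reads headers and skips bodies.
    """
    # .warc.gz files are a series of concatenated gzip members, and
    # gzip.GzipFile reads across member boundaries transparently.
    stream = gzip.GzipFile(fileobj=raw)
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")

        # Collect "Name: value" header lines up to the blank line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # Skip the record body in chunks so large records are not held in memory.
        remaining = int(headers.get("Content-Length", "0"))
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                break
            remaining -= len(chunk)

        yield headers
```

Because the helper depends only on a readable binary stream, the same function works whether the bytes come from the Hugging Face Bucket, S3, HTTP, or a local copy.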

The archive remains available from the commoncrawl bucket on AWS S3 at s3://commoncrawl/crawl-data/CC-MAIN-2026-17/ and over HTTP at https://data.commoncrawl.org/crawl-data/CC-MAIN-2026-17/. See our Get Started page for details.
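Since the Bucket mirrors the S3 layout, one relative key should resolve on all three channels. The sketch below illustrates that mapping under stated assumptions: locations is a hypothetical helper, the hf:// base is inferred from the buckets/commoncrawl/commoncrawl path used with hffs, and warc.paths.gz is used as an example key on the assumption that the per-crawl path listing is mirrored like the rest of the crawl.

```python
from typing import Dict

CRAWL_ID = "CC-MAIN-2026-17"

# Base locations for the three channels. The S3 and HTTP bases are the ones
# given in the post; the hf:// base is an assumption derived from the
# "buckets/commoncrawl/commoncrawl" path used with hffs.
BASES = {
    "s3": "s3://commoncrawl/",
    "http": "https://data.commoncrawl.org/",
    "hf": "hf://buckets/commoncrawl/commoncrawl/",
}


def locations(relative_path: str) -> Dict[str, str]:
    """Return the address of one archive key on each distribution channel."""
    key = relative_path.lstrip("/")
    return {name: base + key for name, base in BASES.items()}


# Example key: the per-crawl listing of WARC file paths.
paths = locations(f"crawl-data/{CRAWL_ID}/warc.paths.gz")
for channel, url in paths.items():
    print(f"{channel}: {url}")
```

Keeping the relative key separate from the base makes it easy to switch a job between channels, for example to fall back to S3 if the Bucket is unavailable.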

We welcome feedback on the Common Crawl Hugging Face Bucket. Please contact us through our Discord or Google Group.

This release was authored by:
Malte Ostendorff
Malte is a Senior Research Engineer at Common Crawl.

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
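
The threshold can be derived from a crawl identifier when a pipeline needs to know which limit applied. This is a small sketch under the rule stated above; truncation_limit_bytes is a hypothetical helper, and it assumes identifiers of the form CC-MAIN-YYYY-WW, so other identifier formats are rejected rather than guessed at.

```python
import re

MIB = 1024 * 1024


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit, in bytes, that applied to a given crawl.

    Hypothetical helper: crawls before CC-MAIN-2025-13 used 1 MiB,
    and crawls from CC-MAIN-2025-13 onwards use 5 MiB.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders crawls by year first, then by week number.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


print(truncation_limit_bytes("CC-MAIN-2026-17"))
```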