We are pleased to announce the April 2026 crawl archive (CC-MAIN-2026-17) is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3. This is an early experiment in distributing crawl data through a new channel. AWS S3 remains the canonical distribution point for the Common Crawl corpus, generously hosted through Amazon Web Services' Open Data Sponsorship Program, and every crawl remains available from the Common Crawl bucket on S3 exactly as before. The Hugging Face Bucket is an additional way to reach the same data.

Why Hugging Face Buckets
Storage Buckets are a recent addition to the Hugging Face Hub: mutable, S3-like object storage that can be explored in the browser, scripted from Python, or managed with the hf CLI. Unlike Models and Datasets repositories, they are not version-controlled, which suits a continuously growing archive of crawl data.
Two properties make them interesting for a dataset of this size. First, they are backed by Xet, Hugging Face's chunk-based storage backend, which deduplicates content across files. Second, they support pre-warming: the Common Crawl Bucket currently has its pre-warmed CDN enabled for GCP US East, GCP EU West, and AWS EU West. For jobs running in or near one of those regions, this reduces read latency and improves throughput compared to pulling data from a distant region.
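To make the idea of chunk-level deduplication concrete, here is a toy sketch in Python. It is not Xet's implementation: it splits data into fixed-size chunks and keys each chunk by its SHA-256 digest, whereas Xet's actual chunking scheme and parameters are not described in this post. The ChunkStore class, its methods, and the 4-byte chunk size are hypothetical and chosen only for illustration.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical, not Xet's design)."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size   # tiny on purpose, for illustration only
        self.chunks = {}               # digest -> chunk bytes, stored once
        self.files = {}                # file name -> ordered list of digests

    def put(self, name: str, data: bytes) -> None:
        """Split data into fixed-size chunks and store each unique chunk once."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            digests.append(digest)
        self.files[name] = digests

    def get(self, name: str) -> bytes:
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[name])


store = ChunkStore()
store.put("a.bin", b"AAAABBBBCCCC")
store.put("b.bin", b"AAAABBBBDDDD")

# Six logical chunks across two files, but the shared prefix is stored once.
print(len(store.chunks))  # 4
```

The two files share their first two chunks, so the store holds four chunks rather than six. The same principle, applied at archive scale, is what lets a chunk-based backend avoid storing repeated content more than once.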
For teams already building on the Hugging Face ecosystem, the Bucket also makes integration simpler: the data is reachable with the same hf CLI, huggingface_hub client, and fsspec-compatible tooling used for Models and Datasets, with no separate S3 client or credentials.
Accessing the Common Crawl Bucket
The Common Crawl Bucket lives at huggingface.co/buckets/commoncrawl/commoncrawl. The April 2026 crawl, roughly 380 TiB of uncompressed content across some 2.2 billion web pages, sits under the crawl-data/CC-MAIN-2026-17/ prefix, mirroring the layout used on S3. WARC, WAT, and WET files, segment lists, robots.txt and non-200 response records, and the URL indexes are all present, just as in the S3 distribution.
The Bucket can be addressed with an hf:// handle. Files can be listed with the hf CLI:
hf buckets list commoncrawl/commoncrawl/crawl-data/CC-MAIN-2026-17
Because Buckets integrate with HfFileSystem, the fsspec-compatible filesystem in huggingface_hub, any fsspec-aware library can read the archive directly via hf:// paths:
from huggingface_hub import hffs
hffs.glob("buckets/commoncrawl/commoncrawl/crawl-data/CC-MAIN-2026-17/**/*.warc.gz")
The archive remains available from the commoncrawl bucket on AWS S3 at s3://commoncrawl/crawl-data/CC-MAIN-2026-17/ and over HTTP at https://data.commoncrawl.org/crawl-data/CC-MAIN-2026-17/. See our Get Started page for details.
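Because the layout is mirrored across channels, the three locations for a crawl can be derived from its crawl ID. The sketch below is plain string formatting and makes no network calls. The crawl_locations function and CrawlLocations class are hypothetical names, and the hf:// form is an assumption obtained by prefixing the buckets/ path used with hffs.glob with the hf:// scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlLocations:
    """Root locations of one crawl on each channel (hypothetical helper)."""
    hf: str
    s3: str
    https: str


def crawl_locations(crawl_id: str) -> CrawlLocations:
    """Build the per-channel root paths for a crawl ID such as CC-MAIN-2026-17."""
    prefix = f"crawl-data/{crawl_id}/"
    return CrawlLocations(
        # Assumed form: hf:// scheme + the buckets/ path used by hffs.glob.
        hf=f"hf://buckets/commoncrawl/commoncrawl/{prefix}",
        s3=f"s3://commoncrawl/{prefix}",
        https=f"https://data.commoncrawl.org/{prefix}",
    )


loc = crawl_locations("CC-MAIN-2026-17")
print(loc.hf)
print(loc.s3)
print(loc.https)
```

Only CC-MAIN-2026-17 has been announced for the Bucket so far, so for any other crawl ID the hf path may not exist even though the S3 and HTTP paths do.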
We welcome feedback on the Common Crawl Hugging Face Bucket. Please contact us through our Discord or Google Group.

