May 6, 2026

You can now build directly on Common Crawl from the browser

Browsers can now fetch Common Crawl data directly, no backend needed. Build SQL explorers, snapshot viewers and diff tools as static pages.
An animation showing a visualisation of an HTTP byte-range request.
Pulling a single page out of an 800 MiB WARC file: the browser asks for a specific byte range, gets back just those bytes.  This wasn't possible from a browser before.

We've updated the CORS policies on data.commoncrawl.org and index.commoncrawl.org so that browsers can now fetch from both directly.  No proxy, no backend, no CORS plugin jazz.  We're rather excited about what this enables, but we’d also like to be forthcoming about a trade-off that comes with this change.

What on earth is CORS anyway?

CORS is a browser-only safety check.  It stands for Cross-Origin Resource Sharing, and it’s an HTTP-header-based security mechanism which allows a server to indicate any origins (domain, scheme, or port) other than its own from which a browser should permit loading resources.

The data on our servers has always been reachable, but what CORS controls is basically just whether or not JavaScript running on someone else's web page is allowed to read the responses.  Without the correct headers, the browser fetches the data and then refuses to hand it over to the page's code.

An animation showing a visualisation of two HTTP requests, one blocked, and one allowed.
Same request, two outcomes. Without the right header the browser hides the response from your code. With it, the response goes through.

Again, things like curl, scripts, and servers don’t care about CORS.  It's a protection for end users whose browsers might otherwise be tricked into reading data from sites they're logged into.  Opening it up doesn’t expose anything new, it just lets browser code reach data that was always public.
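To make that concrete, the browser-side check can be modelled as a tiny function.  This is a deliberately simplified sketch of our own (real CORS also covers preflights, credentials, and allowed methods and headers), but it captures the basic `Access-Control-Allow-Origin` comparison:

```javascript
// Simplified model of the browser's CORS response check.
// Real CORS also involves preflights, credentials, and allowed
// methods/headers; this covers only the basic origin comparison.
function corsAllows(pageOrigin, allowOriginHeader) {
  if (allowOriginHeader === '*') return true;   // open to every origin
  return allowOriginHeader === pageOrigin;      // or an exact match
}

// With no header (the old policy), every cross-origin read failed:
corsAllows('https://example.dev', undefined);   // false
// With the new policy, any page may read the response:
corsAllows('https://example.dev', '*');         // true
```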

What changed?

Both hosts now send Access-Control-Allow-Origin: *.  On the data host, we also allow the Range request header and expose Content-Range, Content-Length, and ETag on responses.  That last part matters: it means byte-range requests now work from a browser.
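Concretely, here's what those headers look like in flight, using hypothetical offset and length values (the helper names below are ours, not part of any API):

```javascript
// Build the Range header for a record at a given offset/length.
// CDX offsets and lengths arrive as strings, hence the Number() calls.
function rangeHeader(offset, length) {
  const start = Number(offset);
  const end = start + Number(length) - 1;   // Range end positions are inclusive
  return `bytes=${start}-${end}`;
}

// Parse the Content-Range header the server now exposes, e.g.
// "bytes 12345-14392/838860800" -> { start, end, total }
function parseContentRange(header) {
  const m = /^bytes (\d+)-(\d+)\/(\d+)$/.exec(header);
  if (!m) throw new Error(`unexpected Content-Range: ${header}`);
  return { start: +m[1], end: +m[2], total: +m[3] };
}

rangeHeader('12345', '2048');                    // 'bytes=12345-14392'
parseContentRange('bytes 12345-14392/838860800');
// -> { start: 12345, end: 14392, total: 838860800 }
```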

What can you build?

A static HTML page can now do the full Common Crawl pipeline end-to-end:

// 1. Look up a URL in the CDX index
const cdx = await fetch(
  'https://index.commoncrawl.org/CC-MAIN-2026-17-index'
  + '?url=example.com&output=json&limit=1'
).then(r => r.text());

const { filename, offset, length } = JSON.parse(cdx.split('\n')[0]);

// 2. Range-fetch just that record from the WARC, a few KB instead of ~1 GB
const record = await fetch(`https://data.commoncrawl.org/${filename}`, {
  headers: { Range: `bytes=${offset}-${+offset + +length - 1}` }
}).then(r => r.blob());
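One practical wrinkle the snippet glosses over: Common Crawl WARC records are stored gzip-compressed, so the fetched bytes still need inflating.  The built-in DecompressionStream handles that without any library (a sketch; `gunzip` is our own helper name):

```javascript
// 3. WARC records are stored gzipped; inflate them in the browser
//    with the built-in DecompressionStream (no library needed).
async function gunzip(blob) {
  const stream = blob.stream().pipeThrough(new DecompressionStream('gzip'));
  return new Response(stream).text();
}

// const warcText = await gunzip(record);
// warcText should start with 'WARC/1.0' followed by the record headers.
```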

That's enough to build historical snapshot viewers, diff tools, link-graph explorers, teaching demos, and bookmarklets, all as static sites with no infrastructure requirements. The world is your proverbial oyster.

The Columnar Index is even more fun.  DuckDB-WASM does HTTP range reads over Apache Parquet, so you can run SQL against our index from a browser:

SELECT url, fetch_status
FROM read_parquet(
       'https://data.commoncrawl.org/cc-index/table/cc-main/warc/'
    || 'crawl=CC-MAIN-2026-17/subset=warc/part-00000-...c000.gz.parquet')
WHERE url_host_registered_domain = 'example.com'
LIMIT 100;

Everything above leans on two modern browser features: DecompressionStream (for inflating gzipped WARC records) and WebAssembly (which powers DuckDB-WASM).  Both work in pretty much any browser updated within the last two to three years: Chrome, Edge, Firefox, Safari, and all the Chromium-based browsers (Brave, Arc, Opera, Vivaldi, etc.) since around mid-2023.
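If you want to fail gracefully on older browsers, a runtime check along these lines is enough (our own helper, testing the two standard globals):

```javascript
// Feature-detect the two capabilities the demos above depend on:
// native gzip decoding and WebAssembly (for DuckDB-WASM).
function browserSupported() {
  return typeof DecompressionStream === 'function'
      && typeof WebAssembly === 'object';
}
```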

A note on the index server

Please be aware that index.commoncrawl.org is rate-limited, and those limits are strict.  With CORS open, you'll hit them more visibly than before, because every user's browser now counts as its own client.

The limits exist for good reason.  The CDX API is backed by finite resources serving the whole community, and a single enthusiastic frontend in a viral tweet can knock it over for everyone else.  Please cache aggressively, batch where you can, and for any workload heavier than interactive lookups, reach for the Columnar Index on data.commoncrawl.org instead (using something like Amazon Athena or DuckDB).  The Columnar Index scales far better and isn't subject to the same limits.
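By "cache aggressively" we mean something as simple as this in-memory sketch (our own helper; a real page might persist via the Cache API or localStorage instead):

```javascript
// Memoize CDX lookups so repeated queries don't hammer
// index.commoncrawl.org. In-memory only; swap the Map for the
// Cache API or localStorage to persist across page loads.
const cdxCache = new Map();

async function cachedFetchText(url, fetchFn = fetch) {
  if (cdxCache.has(url)) return cdxCache.get(url);
  const text = await fetchFn(url).then(r => r.text());
  cdxCache.set(url, text);
  return text;
}
```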

We're proud to make infrastructure on which people can build cool stuff.  Just remember to be kind to our poor ol' index server.

This release was authored by:
Thom Vaughan
Thom is Principal Engineer at the Common Crawl Foundation.

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.