We've updated the CORS policies on data.commoncrawl.org and index.commoncrawl.org so that browsers can now fetch from both directly. No proxy, no backend, no CORS plugin jazz. We're rather excited about what this enables, but we’d also like to be forthcoming about a trade-off that comes with this change.
What on earth is CORS anyway?
CORS is a browser-only safety check. It stands for Cross-Origin Resource Sharing, and it's an HTTP-header-based security mechanism that lets a server indicate which origins (domain, scheme, or port) other than its own a browser should permit to load its resources.
The data on our servers has always been reachable, but what CORS controls is basically just whether or not JavaScript running on someone else's web page is allowed to read the responses. Without the correct headers, the browser fetches the data and then refuses to hand it over to the page's code.
Again, things like curl, scripts, and servers don’t care about CORS. It's a protection for end users whose browsers might otherwise be tricked into reading data from sites they're logged into. Opening it up doesn’t expose anything new, it just lets browser code reach data that was always public.
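To make that concrete, here's the failure mode this change removes. Before, a cross-origin read from a page would reject even though the request reached the server just fine (a sketch; the exact error text varies by browser):
try {
  const res = await fetch('https://data.commoncrawl.org/crawl-data/...');  // any public path
  console.log(await res.text());
} catch (e) {
  console.log(e);  // TypeError: Failed to fetch — Chrome's wording for the CORS block
}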
What changed?
Both hosts now send Access-Control-Allow-Origin: *. On the data host, we also allow the Range request header and expose Content-Range, Content-Length, and ETag on responses. That last part matters: it means byte-range requests now work from a browser.
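In header terms, the CORS-relevant slice of a response from the data host now looks roughly like this (a sketch of just the CORS headers, not the full response; Access-Control-Allow-Headers shows up on the preflight response):
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: Range
Access-Control-Expose-Headers: Content-Range, Content-Length, ETag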
What can you build?
A static HTML page can now do the full Common Crawl pipeline end-to-end:
// 1. Look up a URL in the CDX index
const cdx = await fetch(
  'https://index.commoncrawl.org/CC-MAIN-2026-17-index'
  + '?url=example.com&output=json&limit=1'
).then(r => r.text());
const { filename, offset, length } = JSON.parse(cdx.split('\n')[0]);

// 2. Range-fetch just that record from the WARC, a few KB instead of ~1 GB
const record = await fetch(`https://data.commoncrawl.org/${filename}`, {
  headers: { Range: `bytes=${offset}-${+offset + +length - 1}` }
}).then(r => r.blob());
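The record comes back gzip-compressed, and each WARC record is its own gzip member, so the browser can inflate it in place. A minimal follow-on step (this is where the DecompressionStream mentioned below comes in):
// 3. Inflate the record in the browser; a range-fetched WARC record
// is a standalone gzip member, so a plain gzip stream decodes it
const warc = await new Response(
  record.stream().pipeThrough(new DecompressionStream('gzip'))
).text();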
That's enough to build historical snapshot viewers, diff tools, link-graph explorers, teaching demos, and bookmarklets, all as static sites with no infrastructure requirements. The world is your proverbial oyster.
The Columnar Index is even more fun. DuckDB-WASM does HTTP range reads over Apache Parquet, so you can run SQL against our index from a browser:
SELECT url, fetch_status
FROM read_parquet('https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl=CC-MAIN-2026-17/subset=warc/part-00000-...c000.gz.parquet')
WHERE url_host_registered_domain = 'example.com'
LIMIT 100;
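Wiring that up in a page takes a little boilerplate. Here's a minimal sketch using the jsDelivr bundle pattern from the duckdb-wasm README, assuming you load @duckdb/duckdb-wasm as an ES module (query string abbreviated; substitute the SQL above):
import * as duckdb from '@duckdb/duckdb-wasm';

// Pick the bundle suited to this browser and boot DuckDB in a worker
const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
const worker = new Worker(URL.createObjectURL(
  new Blob([`importScripts("${bundle.mainWorker}");`], { type: 'text/javascript' })
));
const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

// Run the query; DuckDB range-reads only the Parquet pages it needs
const conn = await db.connect();
const result = await conn.query(`SELECT url, fetch_status FROM read_parquet('...') ...`);
console.table(result.toArray());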
This DecompressionStream and DuckDB-WASM stuff works in pretty much any browser updated within the last 2 to 3 years: Chrome, Edge, Firefox, Safari, and all the Chromium-based browsers (Brave, Arc, Opera, Vivaldi, etc.) have shipped the necessary APIs since around mid-2023.
A note on the index server
Please be aware that index.commoncrawl.org is rate-limited, and those limits are strict. With CORS open, you'll hit them more visibly than before, because every user's browser now counts as its own client.
The limits exist for good reason. The CDX API is backed by finite resources serving the whole community, and a single enthusiastic frontend in a viral tweet can knock it over for everyone else. Please cache aggressively, batch where you can, and for any workload heavier than interactive lookups, reach for the Columnar Index on data.commoncrawl.org instead (using Amazon Athena, DuckDB, or similar). The Columnar Index scales far better and isn't subject to the same limits.
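Even something as simple as memoizing lookups client-side helps. A sketch of a hypothetical cdxLookup helper (the name and default index are ours, not an official API) that caches results in localStorage:
async function cdxLookup(url, index = 'CC-MAIN-2026-17-index') {
  const key = `cdx:${index}:${url}`;
  const cached = localStorage.getItem(key);
  if (cached !== null) return JSON.parse(cached);  // serve repeat lookups from cache

  const res = await fetch(
    `https://index.commoncrawl.org/${index}?url=${encodeURIComponent(url)}&output=json&limit=1`
  );
  const record = JSON.parse((await res.text()).split('\n')[0]);
  localStorage.setItem(key, JSON.stringify(record));  // crawl data is immutable, so cache forever
  return record;
}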
We're proud to run infrastructure that people can build cool stuff on. Just remember to be kind to our poor ol' index server.

