November 4, 2025

Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

Note: this post has been marked as obsolete.
Rich Skrenta
Rich is Executive Director of the Common Crawl Foundation, an experienced technologist and serial entrepreneur with a background in the search and social spaces.

A recent article in The Atlantic (“The Nonprofit Doing the AI Industry’s Dirty Work,” November 4, 2025) makes several false and misleading claims about the Common Crawl Foundation, including the accusation that our organization has “lied to publishers” about our activities.

This allegation is untrue. It misrepresents both how Common Crawl operates and the values that guide our work.

What Common Crawl Actually Does

Since 2007, Common Crawl has operated as a nonprofit foundation dedicated to one simple goal: to make a public, open archive of the web freely available to researchers, educators, journalists, and developers.

Our web crawler, known as CCBot, collects data from publicly accessible web pages. We do not go “behind paywalls,” do not log in to any websites, and do not employ any method designed to evade access restrictions.

Our approach has always been transparent:

  • We publish our crawling code and documentation publicly.
  • We identify ourselves clearly as “CCBot” in our user agent string.
  • We honor robots.txt exclusions, the standard web protocol used by site owners to control automated access.
  • We comply with takedown and removal requests sent to us in good faith.
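
These controls are straightforward to verify. As an illustrative sketch (using Python's standard `urllib.robotparser`; the robots.txt contents and URLs are hypothetical examples, not any real site's rules), this is how a site owner's robots.txt directives determine what a crawler identifying as “CCBot” may fetch:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site owner excludes CCBot from /private/.
robots_lines = [
    "User-agent: CCBot",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# A crawler honoring robots.txt checks each URL before fetching it.
print(rp.can_fetch("CCBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("CCBot", "https://example.com/"))              # True
```

A crawler that honors robots.txt performs exactly this check and skips any URL the site owner has disallowed for its user agent.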

These principles have not changed in over a decade.

On the False Claim of “Lying to Publishers”

The Atlantic article claims that Common Crawl “appears to be lying to publishers about its activities.” That is a serious accusation, and it is false.

Common Crawl communicates honestly with publishers who contact us.

When a publisher asks us to remove previously crawled material, we respond promptly and initiate a removal process that reflects the technical design of our dataset.

Because Common Crawl’s archives are stored in an immutable format (WARC files) used by libraries and archivists worldwide, we cannot “edit” those files after publication without breaking their integrity. Instead, we remove or filter affected URLs from subsequent crawls and make them inaccessible through our public tools and indices.

This is not concealment; it is standard practice in large-scale web archiving.

Our public “Index Server” and “CC Index” tools are designed for efficient search, not for legal confirmation of every URL’s removal status. A “no captures” result in a search interface does not mean deception; it reflects how indices are generated, not what is stored internally. Suggesting otherwise misrepresents the technical facts.
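
For readers who want to see how index lookups work in practice, here is a hedged sketch of building a query against the CC Index Server's CDX-style API and parsing one line of its JSON-per-line response. The collection name, URL pattern, and sample response line are illustrative assumptions for demonstration, not guaranteed values:

```python
import json
from urllib.parse import urlencode

# Illustrative collection name; real collections follow the CC-MAIN-YYYY-WW pattern.
INDEX_API = "https://index.commoncrawl.org/CC-MAIN-2025-13-index"

def build_query(url_pattern: str) -> str:
    """Build a CDX-style query URL asking for JSON output."""
    return f"{INDEX_API}?{urlencode({'url': url_pattern, 'output': 'json'})}"

def parse_capture(line: str) -> dict:
    """Each line of the response is a standalone JSON object for one capture."""
    return json.loads(line)

# Hypothetical sample response line (field values are made up for illustration).
sample = '{"url": "https://example.com/", "status": "200", "filename": "crawl-data/example.warc.gz"}'

print(build_query("example.com/*"))
print(parse_capture(sample)["status"])  # "200"
```

A URL absent from these index results simply was not captured in, or has been filtered from, that collection's index; the index is a search aid, not an authoritative inventory of everything ever stored.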

On Compliance and Good Faith

Common Crawl has always operated in good faith with publishers and rights-holders.

We have engaged directly with organizations such as The New York Times, the Danish Rights Alliance (DRA), and others that have requested data removal or clarification. In every case, we have responded, cooperated, and implemented the requested changes to the extent technically possible.

Our small team manages an archive of many petabytes, a scale that makes real-time deletion technically complex, yet we continue to work diligently to meet removal requests and communicate progress transparently.

No one at Common Crawl has ever claimed this work was instantaneous or complete; rather, we have been open about its complexity and ongoing nature.

On Independence and Funding

The article implies that Common Crawl has become “cozier with the AI industry.”

Common Crawl’s financial independence is a matter of public record.

For over fifteen years, we were supported almost entirely by the Elbaz Family Foundation Trust. In recent years, as public and research interest in large-scale text analysis has grown, several organizations, including some AI companies, have contributed donations to support the cost of running and maintaining our public archive. These donations represent a small fraction of our overall operating needs and are disclosed publicly in our financial statements.

No donor, corporate or otherwise, has any control over what we collect, publish, or remove.

Common Crawl is not, and has never been, “doing the AI industry’s dirty work.”

We provide open data for everyone, including researchers studying misinformation, linguistics, digital preservation, machine translation, and public health. Tens of thousands of academic papers and public-interest projects have relied on Common Crawl over the years, many entirely unrelated to artificial intelligence.

Transparency and the Public Record

Our mission has always been to democratize access to information that would otherwise remain siloed in corporate or institutional hands.

We believe the public should have access to a shared historical record of the web: one that enables accountability, research, and innovation.

Every dataset we publish is openly documented. Every update is logged and timestamped. Our code is public, and our operations are open for inspection.

To accuse a nonprofit built on openness of “masking its archives” or “lying to publishers” is not just inaccurate; it undermines a valuable public resource that exists precisely to promote transparency.

Moving Forward

Common Crawl welcomes honest dialogue about the ethics and responsibilities of web archiving.

We recognize that the digital landscape is changing and that publishers face real challenges in balancing openness with commercial sustainability. We share their interest in fair treatment and accurate representation.

We invite The Atlantic and all interested parties to engage with us directly: to verify claims, inspect our datasets, and better understand the realities of open web archiving at scale.

We will continue to listen, improve our tools, and uphold our commitment to public transparency.

Conclusion

Common Crawl has always operated in good faith, in public view, and in accordance with our mission to serve the common good.

We do not lie to publishers.

We do not scrape paywalled material.

We do not conceal our activities.

Our work is guided by a belief that open data and transparency strengthen society.

We remain dedicated to that principle, and we will continue to make the web’s public record available for the benefit of all.


Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
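
Truncation is recorded in the archive itself: the WARC format (ISO 28500) defines a `WARC-Truncated` header whose value names the reason, such as `length` when a fetch hit the size cap. As an illustrative sketch (the sample header block below is a simplified, hypothetical record, not real crawl data), a consumer can detect truncated records like this:

```python
# 5 MiB fetch cap in effect from the March 2025 crawl (CC-MAIN-2025-13) onwards;
# earlier crawls used a 1 MiB cap.
MAX_FETCH_BYTES = 5 * 1024 * 1024

def parse_warc_headers(header_block: str) -> dict:
    """Parse a WARC record's named-field block into a dict (first colon splits)."""
    headers = {}
    for line in header_block.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            headers[name.strip()] = value.strip()
    return headers

def is_truncated(headers: dict) -> bool:
    """A record was cut short if it carries a WARC-Truncated header."""
    return "WARC-Truncated" in headers

# Hypothetical, simplified header block for a response that hit the size cap.
sample = "WARC-Type: response\nContent-Length: 5242880\nWARC-Truncated: length"
print(is_truncated(parse_warc_headers(sample)))  # True
```

Checking for this header lets downstream users distinguish a genuinely short page from one that was cut off at the fetch limit.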

For more details, see our truncation analysis notebook.