Common Crawl is a non-profit organization founded to democratize access to web information by producing and maintaining an open repository of web crawl data that anyone can access and analyze.

Free access to web crawl data encourages collaboration and interdisciplinary research: companies, academic institutions, and non-profits can work together to address complex challenges. Collaborating through Open Data accelerates progress toward pressing global issues such as climate change, public health, and social equality.

By embracing Open Data, we promote an inclusive and thriving knowledge ecosystem, where the collective intelligence of the global community can lead to transformative discoveries and positive societal impact.

To prevent Common Crawl from crawling your website, add the following rules to your site's robots.txt file:

User-agent: CCBot
Disallow: /
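If you want to confirm that these rules behave as intended, a minimal sketch using Python's standard-library robots.txt parser can check them (the `example.com` URLs here are placeholders, not real endpoints):

from urllib import robotparser

# The same rules as above, supplied inline for testing; in practice the
# parser would fetch them from your site's /robots.txt URL.
rules = """\
User-agent: CCBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# CCBot is blocked from every path; other crawlers are unaffected.
print(rp.can_fetch("CCBot", "https://example.com/any/page"))      # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/any/page"))  # True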

Please see our FAQ for further information.
