August 6, 2024

The Increase of Common Crawl Citations in Academic Research

Common Crawl's impact on research has grown substantially since its beginning. Our crawls have become a vital resource for researchers in various fields, from natural language processing to red teaming.

Thom Vaughan

Thom is Principal Engineer at the Common Crawl Foundation.

Here's a look at how our presence in academic citations has evolved:

Cumulative Citations

Plot of Common Crawl citations in Google Scholar until January 2024

This graph shows our citation count in Google Scholar from 2012 to 2023. More information on how this is collected can be found in this GitHub repository.

The steady increase from 30 citations in 2012 to 1,777 in 2023 represents nearly 60-fold growth in just over a decade. This trend highlights the increasing relevance and utility of Common Crawl data in academic and industry research.
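The growth figure above can be checked with a few lines of arithmetic. The sketch below uses only the two counts quoted in this post; the function names are ours, and the annualised rate is a derived illustration (it assumes smooth compounding across the 11 year-to-year steps from 2012 to 2023), not a figure taken from the citations dataset.

```python
# Citation counts quoted in the post (Google Scholar).
CITATIONS_2012 = 30
CITATIONS_2023 = 1777


def fold_growth(start: float, end: float) -> float:
    """Return how many times larger `end` is than `start`."""
    return end / start


def annualised_growth(start: float, end: float, years: int) -> float:
    """Return the constant yearly growth rate that turns `start` into `end`.

    This assumes smooth compounding, which is a simplification for illustration.
    """
    return (end / start) ** (1 / years) - 1


fold = fold_growth(CITATIONS_2012, CITATIONS_2023)
rate = annualised_growth(CITATIONS_2012, CITATIONS_2023, 2023 - 2012)

print(f"{fold:.1f}-fold growth")  # prints "59.2-fold growth"
print(f"{rate:.1%} implied average annual growth")
```

The ratio comes out just under 60, which is where the "nearly 60-fold" wording comes from.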

To further support the research community, we're excited to announce that our citations dataset is now available on Hugging Face:

About This Dataset

This dataset contains citations referencing Common Crawl, sourced from Google Scholar. The citations are not curated, so some false positives may be included. An annotated subset with additional fields can be found in the bib directory of the cc-citations repository.

Citations for a specific year can be downloaded separately, and can be viewed by year using the 'Subset' dropdown on the dataset card.

This resource offers researchers a comprehensive view of how Common Crawl data is being used across different studies and disciplines. Check it out, and potentially discover new applications for Common Crawl in your own work!

We encourage you to explore this dataset, share your findings, and let us know how you're using Common Crawl in your research. Your insights not only help us improve but also inspire the broader research community.


Erratum:

Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
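As a rough illustration of the thresholds described above, the sketch below maps a crawl identifier to the fetch size limit that applied to it. The helper names (`truncation_limit_bytes`, `may_be_truncated`) are hypothetical and not part of any Common Crawl tooling. It assumes crawl identifiers of the form CC-MAIN-YYYY-WW, and it treats any payload at or above the limit as possibly truncated, which is a heuristic rather than a definitive check.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, per the erratum text (CC-MAIN-2025-13).
LIMIT_CHANGE = (2025, 13)

# Assumed identifier format: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes that applied to the given crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl identifier: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so (2025, 13) and later get 5 MiB.
    return 5 * MIB if (year, week) >= LIMIT_CHANGE else 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload that reaches the limit may have been cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)


print(truncation_limit_bytes("CC-MAIN-2024-33"))  # prints 1048576
print(truncation_limit_bytes("CC-MAIN-2025-13"))  # prints 5242880
```

A length check like this only flags candidates. A payload that happens to be exactly as large as the limit would be flagged even if it is complete.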
