Common Crawl's impact on research has grown substantially since its beginning. Our crawls have become a vital resource for researchers in various fields, from natural language processing to red teaming.
Here's a look at how our presence in academic citations has evolved:
This graph shows our citation count in Google Scholar from 2012 to 2023. More information on how this is collected can be found in this GitHub repository.
The steady increase from 30 citations in 2012 to 1,777 in 2023 represents nearly 60-fold growth in just over a decade. This trend highlights the increasing relevance and utility of Common Crawl data in academic and industry research.
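As a quick check of the arithmetic behind these figures, here is a minimal sketch that recomputes the growth factor from the two citation counts quoted above. The average annual growth rate it also prints is a derived illustration, assuming smooth compound growth between the two endpoints; it is not a figure reported from the underlying data.

```python
# Citation counts quoted in the text (Google Scholar, 2012 and 2023).
first_year, first_count = 2012, 30
last_year, last_count = 2023, 1777

# Overall growth factor between the two endpoints.
growth_factor = last_count / first_count

# Number of yearly steps between the first and last data point.
years_elapsed = last_year - first_year

# Assumption: smooth compound growth, used only to express the
# overall factor as an average yearly rate.
annual_growth = growth_factor ** (1 / years_elapsed) - 1

print(f"Growth factor: {growth_factor:.1f}x over {years_elapsed} years")
print(f"Average annual growth (compound): {annual_growth:.1%}")
```

The growth factor comes out just under 60, which matches the "nearly 60-fold" description.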
To further support the research community, we're excited to announce that our citations dataset is now available on Hugging Face:
About This Dataset
This dataset contains citations referencing Common Crawl, sourced from Google Scholar. The citations are not curated, meaning there may be some false positives included. We also have an annotated subset with additional fields, called "citations-annotated".
Citations for a specific year can be downloaded separately, and can be viewed by year using the 'Subset' dropdown on the dataset card.
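For readers who would rather load a single year programmatically, here is a minimal sketch using the Hugging Face `datasets` library. The repository ID `commoncrawl/citations` and the idea that each subset is named after its year (for example `"2023"`) are assumptions for illustration, so check the dataset card for the actual names. `load_year` is a hypothetical helper, not part of any library.

```python
# Assumption: illustrative repository ID; confirm it on the dataset card.
REPO_ID = "commoncrawl/citations"


def load_year(year, repo_id=REPO_ID, loader=None):
    """Load the subset for a single year (hypothetical helper).

    Assumes each entry in the 'Subset' dropdown is a dataset
    configuration named after its year, e.g. "2023".
    """
    if loader is None:
        # Imported lazily so the helper can be defined (and tested with a
        # stand-in loader) even where the library is not installed.
        from datasets import load_dataset
        loader = load_dataset
    # `name` selects the configuration (subset) within the repository.
    return loader(repo_id, name=str(year))


# Example usage (requires network access and `pip install datasets`):
# citations_2023 = load_year(2023)
# print(citations_2023)  # shows the available splits and columns
```

Because the dataset is not curated, it is worth inspecting the columns and a few rows before filtering or counting anything.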
This resource offers researchers a comprehensive view of how Common Crawl data is being used across different studies and disciplines. Check it out, and potentially discover new applications for Common Crawl in your own work!
We encourage you to explore this dataset, share your findings, and let us know how you're using Common Crawl in your research. Your insights not only help us improve but also inspire the broader research community.