September 10, 2024

August/September 2024 Newsletter

Note: this post has been marked as obsolete.
We're pleased to announce our newsletter for August and September 2024.
Jen English
Jen English is a seasoned professional with a core competency in web content curation, web crawling, taxonomies, and ontology creation.

Table of Contents

  • Common Crawl Citations in Academic Research
  • Common Crawl Statistics on Hugging Face
  • Monthly Crawl Updates
  • Updates on our Policy Efforts
  • Roadmap and Future Plans

Common Crawl Citations in Academic Research

Common Crawl's impact on research has grown substantially since its beginning.  Our crawls have become a vital resource for researchers in various fields, from natural language processing to red teaming.

Our data has so far been cited in over 7,000 academic publications, highlighting its value to the research community.  We recently published a blog post on this, and we plan to investigate the connections in this citation network further.

A chart showing the number of citations in Google Scholar mentioning Common Crawl, up to January 2024

Common Crawl Statistics on Hugging Face

We're excited to announce that Common Crawl’s statistics are now available on Hugging Face! The Common Crawl Statistics dataset includes metrics such as the number of URLs, domains, bytes, and content types crawled over specific periods. This dataset is important for users who need comprehensive, structured insights into the composition and trends within the web data collected by Common Crawl. For more details on the statistics see our recent blog post.


Monthly Crawl Updates

We now provide a refreshed crawl every month.  To date, we've delivered more than 100 crawl archives in total.  In August alone, we crawled more than 2.3 billion web pages, amounting to over 320 TiB of uncompressed content.  The total size of our corpus now exceeds 8 PiB, with WARC data alone exceeding 7 PiB, a growth of 10.87% in the past year.  We also now generate Web Graphs monthly (rather than once every three months), which has significantly improved the quality of our recent crawls by giving us fresher ranking data.

A chart showing the cumulative size of WARC data in Common Crawl web archives up to August 2024
Charted data includes WARCs up to August 2024 (CC-MAIN-2024-33)

Updates on our Policy Efforts

We're actively influencing and shaping policy discussions for a free and open Internet.  Earlier this year we hosted a conference in New York titled "AI & the Right to Learn on an Open Internet", co-hosted with Professor Jeff Jarvis at the Craig Newmark Graduate School of Journalism at CUNY.

We have been conducting experiments to evaluate the prevalence of the emerging ML and AI opt-out protocols, and we have been discussing with our users and collaborators how best to implement, apply, and standardize these protocols.

We're also taking part in a workshop hosted by the Internet Architecture Board in Washington DC in September.  The workshop will explore practical opt-out mechanisms for AI data collection, focusing on how content creators can control the use of their online content in training LLMs and other AI systems.

Roadmap and Future Plans

Based on user feedback, we're pursuing the following initiatives:

  • Significantly increasing the size of our monthly crawl, in both depth and breadth, while remaining a polite crawler and maintaining our data quality.
  • Reducing the carbon footprint of our crawl by modernizing and optimizing our current tools, while also encouraging companies to use our data instead of conducting their own crawls, further reducing the ecological impact of web crawling.
  • Exploring content-based ranking methods based on the latest Natural Language Processing technologies, which would let us filter out undesirable content more effectively and improve data quality as we increase the size of our monthly crawls.
  • Developing and releasing open-source, user-friendly tools that help our users explore and understand our dataset.
  • Making our crawl more representative of the true multilingual nature of the open web.  Currently, English represents more than 40% of the data we crawl each month.  As we increase the size of our crawl, we plan to significantly improve our language identification algorithms, which in turn will allow us to increase coverage for underrepresented communities.

This release was authored by:
Jen English
Thom Vaughan, Principal Technologist at the Common Crawl Foundation