February 3, 2025

January/February 2025 Newsletter

Note: this post has been marked as obsolete.
We’re happy to share our January/February 2025 newsletter with updates and insights from the world of open data and web archiving.
Jen English
Jen English is a seasoned professional with a core competency in web content curation, web crawling, taxonomies, and ontology creation.

Table of Contents

  • Annotation for Language Identification
  • cc-downloader Command Line Tool
  • Citations Updates
  • Common Crawl at SXSW 2025
  • Software Heritage Symposium at UNESCO
  • NeurIPS 2024 Social with Wikimedia

Annotation for Language Identification

In December we introduced an annotation campaign for Language Identification (LID or LangID), which we will conduct in collaboration with MLCommons. Participants will be asked to make simple LangID annotations on Common Crawl data. We want to collect as many annotations as possible, covering as many languages as possible, in order to create the first web-based LangID dataset. Our ultimate goal is to train a small language classifier that helps us make better decisions at crawl time, so that we crawl data for as many languages as possible. We hope this will make our dataset better reflect the vast cultural and linguistic diversity of the web.

If you would like to contribute and participate in our annotation campaign, please visit MLCommons' Dynabench Platform.  For more details about our efforts to expand language coverage in Common Crawl, including LangID and our Web Languages project, see our related blog post.

cc-downloader Command Line Tool

We recently introduced cc-downloader, an experimental command-line tool for downloading Common Crawl data via HTTPS.  cc-downloader is intended to be a user-friendly and polite downloader.  For more details, please visit the cc-downloader GitHub repository and the related blog post.

Citations Updates

We have recently updated our Common Crawl Citations to include 2024 research paper citations.  Please see our updated Research Papers Citations graph for a look at Common Crawl citations in research papers through 2024.

Plot of Common Crawl citations (cumulative) in Google Scholar until January 2025
Source: cc-citations

Common Crawl at SXSW 2025

Common Crawl will be at SXSW in March. If you will be in Austin that week, we would love to meet up with you. Please get in touch with us if you would like to arrange a coffee or a meet-up.

Software Heritage Symposium at UNESCO

Left to right: Thom Vaughan, Pedro Ortiz Suarez at UNESCO Headquarters, Paris

On 29 January 2025, members of the Common Crawl Foundation attended the Software Heritage Symposium at UNESCO Headquarters in Paris. The event brought together experts from academia, industry, and policy to discuss key topics such as cybersecurity, AI transparency, open science, and cultural preservation. Speakers highlighted the role of open infrastructures in building a secure and inclusive digital future.

NeurIPS 2024 Social with Wikimedia

The Common Crawl Foundation attended NeurIPS 2024, connecting with organizations, hosting a social event on tech and social impact, and showcasing contributions to AI research and data access.  For more details on our social with Wikimedia and additional conference highlights, please see our related blog post.
