Table of Contents
- Annotation for Language Identification
- cc-downloader Command Line Tool
- Citations Updates
- Common Crawl at SXSW 2025
- Software Heritage Symposium at UNESCO
- NeurIPS 2024 Social with Wikimedia
Annotation for Language Identification
In December we introduced an annotation campaign for Language Identification (LID or LangID), which we will conduct in collaboration with MLCommons. Participants will be asked to make simple LangID annotations on Common Crawl data. We would like to collect as many annotations as possible, covering as many languages as possible, in order to create the first web-based LangID dataset. Our ultimate goal is to train a small language classifier that helps us make better decisions at crawl time, so that we crawl data for as many languages as possible and our dataset better reflects the vast cultural and linguistic diversity of the web.
If you would like to contribute and participate in our annotation campaign, please visit MLCommons' Dynabench Platform. For more details about our efforts to expand language coverage in Common Crawl, including LangID and our Web Languages project, see our related blog post.
cc-downloader Command Line Tool
We recently introduced cc-downloader, an experimental command-line tool for downloading Common Crawl data via HTTPS. cc-downloader is intended to be a user-friendly and polite downloader. For more details, please visit the cc-downloader GitHub repository and the related blog post.
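For readers curious what "polite" downloading over HTTPS can look like in practice, here is a minimal Python sketch. It is not cc-downloader's implementation and does not reflect its options. The base URL, the function names (build_url, backoff_delays, polite_download), the user-agent string, and the retry policy are all assumptions made for illustration.

```python
import shutil
import time
import urllib.error
import urllib.request

# Assumption: Common Crawl data is served over HTTPS from this host.
BASE_URL = "https://data.commoncrawl.org/"

# Assumption: status codes treated as "slow down and try again later".
RETRYABLE_STATUS = (429, 500, 502, 503, 504)


def build_url(path: str, base: str = BASE_URL) -> str:
    """Join a relative data path onto the base URL with exactly one slash."""
    return base.rstrip("/") + "/" + path.lstrip("/")


def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Return exponentially growing wait times in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def polite_download(path: str, dest: str, retries: int = 5,
                    user_agent: str = "example-downloader/0.1") -> None:
    """Download one file, waiting longer after each retryable failure."""
    request = urllib.request.Request(build_url(path),
                                     headers={"User-Agent": user_agent})
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(request, timeout=30) as response, \
                    open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
            return
        except urllib.error.HTTPError as err:
            # Give up immediately on errors that a retry will not fix.
            if err.code not in RETRYABLE_STATUS:
                raise
        except urllib.error.URLError:
            # Network-level failure: fall through and retry after waiting.
            pass
        time.sleep(delay)
    raise RuntimeError(f"giving up on {path} after {retries} attempts")


# Hypothetical usage (the path is only an example of a listing file):
# polite_download("crawl-data/CC-MAIN-2025-05/warc.paths.gz", "warc.paths.gz")
```

The key design choice in this sketch is to identify the client with a User-Agent header and to back off exponentially when the server signals that it is busy, rather than retrying immediately.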
Citations Updates
We have recently updated our Common Crawl Citations to include research papers from 2024. Please see our updated Research Papers Citations graph for a look at how Common Crawl has been cited in research papers through 2024.
Common Crawl at SXSW 2025
Common Crawl will be at SXSW in March. If you will be in Austin that week, we would love to meet up with you. Please get in touch with us if you would like to arrange a coffee or meet-up.
Software Heritage Symposium at UNESCO
On 29 January 2025, members of the Common Crawl Foundation attended the Software Heritage Symposium at UNESCO Headquarters in Paris. The event brought together experts from academia, industry, and policy to discuss key topics such as cybersecurity, AI transparency, open science, and cultural preservation. Speakers highlighted the role of open infrastructures in building a secure and inclusive digital future.
NeurIPS 2024 Social with Wikimedia
The Common Crawl Foundation attended NeurIPS 2024, connecting with organizations, hosting a social event on tech and social impact, and showcasing contributions to AI research and data access. For more details on our social with Wikimedia and additional conference highlights, please see our related blog post.