February 10, 2026

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Laurie Burchell
Laurie is a Senior Research Engineer with Common Crawl.

We are proud to introduce CommonLID, a web-focused language identification benchmark spanning 109 languages, created in partnership with open source organizations and language communities. The CommonLID dataset is available through Hugging Face, and the preprint paper is available on arXiv.

At the Common Crawl Foundation, we want to make our open data as comprehensive and representative as possible. As part of this, one of our long-term research goals has been to improve the language coverage of our crawls. We have been exploring several approaches towards this aim, but the most relevant to this blog post are our efforts to improve automatic language identification.

An abstract image showing multiple documents in various languages with fingers pointing to each document

While detecting the language in which a text is written seems like a simple task, prior research has shown that current language identification models can struggle, particularly with underrepresented languages. This problem is even more challenging when detecting languages in web data. The way people write on the web is often very different from how ‘formal’ text is written, meaning we need models which are tested on web data specifically.

For instance, these are some examples of web text whose language is mislabelled by existing LID systems:

Example 1: Onyeakagbu, Adaobi. "See how all the 36 Nigerian states got their names". Pulse.ng. Retrieved 25 December 2021.
Gold label: English. LID system label: Dagbani.

Example 2: Propreté: Confort: Accueil du propriétaire: Rapport qualité/prix: Randonneurs
Gold label: French. LID system label: Maltese.

Example 3: Blog de titine807 - ~~~~~~~~~~ cOuCoU ToUs lE MoNdE BiEnVeNuE DaNs mOn tI BlOg ~~~~~~~~~~ -
Gold label: French. LID system label: Non-Linguistic.

Example 4: '^ 7.00 7.01 7.02 7.03 7.04 7.05 7.06 7.07 7.08 7.09 7.10 楊南郡、王素娥. 《玉山國家公園八通關越嶺古道西段調查研究報告》 (PDF). 玉山國家公園. 玉山國家公園管理處. 1987 [2014-03-01] (中文(臺灣)).'
Gold label: Chinese. LID system label: Hebrew.

However, there is a general lack of quality datasets available for testing language identification models, especially those which cover a wide range of languages. Those which do exist don’t include web data, making it hard to assess performance and limiting researchers’ ability to make progress towards better language detection models for the web.

To close this gap, we are pleased to announce the release of CommonLID, a language identification benchmark dataset for the web which covers 109 languages. This dataset was nearly two years in the making, involving significant collaboration with language communities and multiple research organisations. We created CommonLID as part of a shared task at the Workshop for Multilingual Data Quality Signals, hosted at COLM in 2025. The first step was building an annotation platform in partnership with MLCommons and Factored AI, so that participants could view and mark up data. In collaboration with EleutherAI, we then hosted multiple hackathons with language community organisations like Masakhane and SEACrowd, during which participants contributed language labels for a selection of Common Crawl’s web data. Finally, we curated the final dataset and used it to evaluate multiple existing language identification models, shedding light on the current state of the art.
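To give a concrete sense of what evaluating an LID model against a labelled benchmark involves, here is a minimal sketch of one standard scoring approach, macro-averaged F1, which weights every language equally regardless of how many examples it has. The function name and the toy labels in the usage note are illustrative assumptions, not taken from the CommonLID paper.

```python
from collections import defaultdict

def macro_f1(gold, pred):
    """Macro-averaged F1 over languages: compute one-vs-rest F1 per
    language, then average. Illustrative sketch, not the paper's code."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct label for this language
        else:
            fp[p] += 1          # predicted language wrongly claimed this text
            fn[g] += 1          # true language missed this text
    scores = []
    for lang in set(gold) | set(pred):
        prec = tp[lang] / (tp[lang] + fp[lang]) if tp[lang] + fp[lang] else 0.0
        rec = tp[lang] / (tp[lang] + fn[lang]) if tp[lang] + fn[lang] else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```

For example, scoring hypothetical predictions `["eng", "mlt", "eng", "heb"]` against gold labels `["eng", "fra", "eng", "zho"]` gives a macro F1 of 0.2: English is perfect, but the four other languages involved each score zero, which is exactly why macro averaging surfaces failures on underrepresented languages.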

Everyone who contributed to the project was invited to be on the paper, and we would like to thank all 97(!) co-authors for their hard work in making this project happen.  We hope to expand CommonLID in the future to include data for more languages through continued community-led work.  We also hope to use CommonLID, and its future versions, to help develop and maintain a new generation of open source LID models, allowing us and our community to better curate multilingual corpora from our crawl data.

This release was authored by:

Laurie Burchell
Laurie is a Senior Research Engineer with Common Crawl.

Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

Catherine Arnett
Catherine is an NLP Researcher at EleutherAI.

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
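The crawl-dependent limit above can be expressed as a small helper. The 1 MiB and 5 MiB thresholds and the CC-MAIN-2025-13 cutover come from the erratum text; the function name and the assumption that crawl labels always parse as `CC-MAIN-<year>-<week>` are illustrative.

```python
def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch-size truncation limit for a given crawl label,
    e.g. 'CC-MAIN-2025-13'. Sketch only; label parsing is an assumption."""
    _, _, year, week = crawl_id.split("-")
    # Crawls before CC-MAIN-2025-13 used a 1 MiB limit;
    # CC-MAIN-2025-13 and later use 5 MiB.
    if (int(year), int(week)) < (2025, 13):
        return 1 * 1024 * 1024
    return 5 * 1024 * 1024
```

So a record from CC-MAIN-2024-51 may be cut at 1,048,576 bytes, while one from CC-MAIN-2025-13 onwards may run to 5,242,880 bytes.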

For more details, see our truncation analysis notebook.