Since October 2024, we’ve been gathering URLs in languages other than English (or “LOTE” for short) and adding them to our “seed crawl”, with the aim of improving coverage of languages, communities, and cultures in our crawls. We’re doing this via our Web Languages Project (introduced in this blog post in December of last year). So far we’ve had 266 contributions from 67 people, thanks to whom we’ve added over 4,700 LOTE URLs to our seed list.
Since August 2018, we have used the Compact Language Detector 2 (CLD2) to annotate the language(s) in which a page is written. It can identify 160 different languages (up to 3 languages per document) and uses ISO 639-3 language codes.
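To make the annotation format concrete, here is a minimal Python sketch of how such an annotation could be read. It assumes the detected languages arrive as a single comma-separated string of ISO 639-3 codes, most prominent language first (for example `"cym,eng"`). That string format, the function names `parse_languages` and `is_lote`, and the rule "a page is LOTE if its first-listed language is not English" are illustrative assumptions, not a description of our actual pipeline.

```python
import re

# The post states that at most 3 languages are annotated per document.
MAX_LANGUAGES = 3

# ISO 639-3 codes are three lowercase ASCII letters (e.g. "eng", "cym").
# This checks only the shape of a code, not whether it is assigned.
ISO_639_3_SHAPE = re.compile(r"^[a-z]{3}$")


def parse_languages(field: str) -> list[str]:
    """Split a comma-separated annotation (assumed format) into codes."""
    codes = [part.strip().lower() for part in field.split(",") if part.strip()]
    if len(codes) > MAX_LANGUAGES:
        raise ValueError(f"expected at most {MAX_LANGUAGES} codes, got {len(codes)}")
    for code in codes:
        if not ISO_639_3_SHAPE.match(code):
            raise ValueError(f"not shaped like an ISO 639-3 code: {code!r}")
    return codes


def is_lote(field: str) -> bool:
    """Hypothetical rule: LOTE if the first-listed language is not English."""
    codes = parse_languages(field)
    return bool(codes) and codes[0] != "eng"


if __name__ == "__main__":
    for sample in ["eng", "cym,eng", "eng,fra,deu", ""]:
        print(repr(sample), parse_languages(sample), is_lote(sample))
```

An empty annotation is treated as "no language detected" rather than as LOTE, since a page CLD2 could not classify tells us nothing about its language.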
So far, there are 42 files in the Web Languages repository which need review by a native speaker (we’re counting Latin here, although lamentably there are no native speakers of Latin left). Among these are seven languages which CLD2 is not capable of recognising.
Language contributions which need a review by a native speaker
Among all of the contributors, we would like to thank Ethan Wenokur, Evan Pacini, Twan Goosen, and Swapnil Tripathi in particular. We’re very grateful to them for their substantial contributions to the Web Languages Project.

