Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation and analysis of open web data accessible to researchers.
CommonLID, a community-built language ID benchmark, has a new website and interactive leaderboard. Its paper was accepted to ACL 2026, with a poster session on 7 July. Source code, a PyPI package, and the dataset are now available.