January 5, 2026

Common Crawl at the Mozilla Festival 2025

Note: this post has been marked as obsolete.
From the 6th to the 10th of November 2025, Pedro Ortiz Suarez attended Mozfest in Barcelona, as well as some satellite events.
Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

Mozilla Festival Day 0

On the 6th of November Pedro attended Mozfest Day 0, an informal workshop organized by Mozilla where attendees had the opportunity to discuss how data sharing and access can be improved, in particular for builders of open source and public AI systems.

The workshop also considered the idea of public AI and how it can be brought “from the lab to the people and the market”. Discussion focused on data as a major bottleneck for open source AI development, owing both to the lack of useful and usable data and to legal uncertainty around using data for AI development.

There was also interest in how to govern data that is generated and used at the deployment phase (inference time). The objective of the workshop was to learn from the experiences of practitioners in this space, those who collect, process, and use data with open principles and the public interest in mind, and also to chart pathways to better data access for public AI builders.

BSC visit and ALIA Public AI Forum

After the initial workshop, attendees were invited on a tour of the Barcelona Supercomputing Center (BSC), where we visited its supercomputers and quantum computers, as well as the decommissioned historical infrastructure that has been used over the years.

Photo: the server racks of MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted at BSC where the ALIA models have been trained
Photo: a replica of a quantum computer hosted at BSC, showing mainly its cooling system

Pedro then attended the ALIA Public AI Forum, which featured talks from BSC and Public AI explaining the ALIA Project, an effort to develop AI models in Spain for all of the country's co-official languages. The talk from Public AI also covered the collaboration between the two organizations and Public AI's efforts in the public AI sector globally.

Common Crawl Foundation at The Mozilla Festival 2025

On the first day of the Mozilla Festival, Pedro had the pleasure of being part of The AI Data Real Talk Panel, which brought together panelists from public and private organizations as well as non-profits, with a wide range of backgrounds and from all over the world.

This panel, which was moderated by EM Lewis-Jong, explored data sovereignty, openness, and equity, allowing panelists to talk about actual case studies that represent those values and what it really takes to build datasets for fair, representative systems.

The panelists also explored questions regarding underrepresented and underserved languages in AI, such as the consequences for linguistic communities when their languages are underrepresented and sometimes misrepresented, what different communities care about when sharing their data, and how they can govern and manage the data they share with different actors.

Panelists shared the projects they have in this space. Pedro discussed Common Crawl Foundation’s Language Initiatives, took suggestions and addressed concerns from the other panelists and the audience, and learned about the other panelists’ projects.

Participating in such a diverse session at the Mozilla Festival, with panelists often expressing opposing ideas, was a great opportunity to learn and spark constructive and respectful conversations in a space where we’re still building frameworks to ethically and equitably answer the concerns of underrepresented and underserved linguistic communities in the context of emerging AI technologies.

