June 30, 2025

Common Crawl at the United Nations Open Source Week, June 2025

Note: this post has been marked as obsolete.
The Common Crawl Foundation team took part in the United Nations Open Source Week in New York City this June, meeting with global developers, researchers, and policymakers to discuss all things open source and AI.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.
Left-to-right: Pedro Ortiz Suarez and Sebastian Nagel at the United Nations in New York, attending the UN Open Source Week, NY.

From the 16th to the 20th of June, the Common Crawl Foundation team was in New York City for the United Nations Open Source Week and select industry side-events. Over the course of the week, we engaged with developers, researchers, and policymakers on all things related to Open Source and AI. We presented at IBM’s Thomas J. Watson Research Center and co-hosted the “AI Unconference” at IBM One Madison, a gathering designed for open discussion of what we see as some of the most important issues facing the industry today: transparency, safety, diversity, and the importance of ethical data pipelines.

UN Open Source Maintain-a-thon

The CCF team attended the United Nations for the “Maintain-a-thon”, as part of UN Open Source Week, NY.

Our team attended the United Nations for the Open Source Maintain-a-thon on Tuesday.  Attendees from numerous global organisations split into groups and produced “Today I Learned” takeaways, “Tomorrow I Will” actions, and “Gee, I Wish” ideas for application in various areas of the (AI) industry. This culminated in a collective playbook for maintainability which will be released at a later date via the United Nations website.

IBM Thomas J. Watson Research Center

Inside the IBM Thomas J. Watson Research Center, Yorktown Heights, NY.

The team travelled to IBM’s Thomas J. Watson Research Center in Yorktown Heights, where our distinguished engineer Sebastian Nagel gave a series of presentations on Common Crawl’s activities, goals, and partnerships. Our team then met with dozens of representatives from departments across IBM to discuss mutual goals and identify areas where collaboration might benefit the industry at large.

Sebastian Nagel (Distinguished Engineer, Common Crawl) presenting at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY.

Side-events at LinkedIn, Meta, and PwC

Common Crawl Foundation team members attended LinkedIn’s event “AI and the Future of Work: The ICT Sector in Transition” at their Empire State Building offices in midtown. This was a chance for the team to meet with more industry professionals and policymakers.

Pedro Ortiz Suarez (Senior Research Scientist, Common Crawl) and Laurie Burchell (Senior Research Engineer, Common Crawl) also attended two further side-events. The first took place at Meta’s NYC offices on Friday, where Mary Williamson of Meta presented on the Open Language Data Initiative, which Laurie is co-organising. The second was held at PwC, where Pedro gave a brief general presentation about Common Crawl and data-driven open source software. Pedro and Laurie also met with additional industry experts and policymakers. These engagements contributed to ongoing discussions around language data, openness, and cross-sector collaboration.

AI Unconference, IBM One Madison

Our main event was the AI Unconference, part of the official UN Open Source Week side-events, which Common Crawl co-hosted with our friends at IBM, the AI Alliance and BrightQuery. One attendee described it as ‘the most impactful AI event of the year’.

The event brought together over 100 attendees from around the world, including leading technologists and industry pioneers.  Highlights included talks from Rich Skrenta (Executive Director, Common Crawl), Jose Plehn-Dujowich (CEO, BrightQuery), Andrea Greco (Research Business Partnerships, IBM), Dean Wampler (Chief Technical Representative to the AI Alliance, IBM), and Thom Vaughan (Principal Technologist, Common Crawl).

Rich Skrenta opened the event with a welcome from Common Crawl, followed by an introduction to the AI Alliance by Andrea Greco, an introduction to BrightQuery from Jose Plehn-Dujowich, an introduction to the AI Alliance’s Open Trusted Data Initiative by Dean Wampler, and a detailed presentation by Thom Vaughan on Common Crawl’s mission. Roberto di Cosmo (Director, Software Heritage) also gave a presentation on their efforts in ethical data collection operations.

Andrea Greco introducing the AI Alliance at the AI Unconference at IBM One Madison.
Dean Wampler introducing the Open Trusted Data Initiative at the AI Unconference at IBM One Madison.
Thom Vaughan presenting on Common Crawl’s mission and accomplishments at the AI Unconference at IBM One Madison. Yes, we had over an exabyte downloaded from our S3 bucket in 2024!

This was followed by a dynamic and well-received panel discussion, featuring Jose Plehn-Dujowich, Dean Wampler, Lilith Bat-Leah (DMLR Working Group Co-chair, MLCommons), Dave Buckley (Senior Policy Manager, OpenMined), Greg Lindahl (CTO, Common Crawl), and Roberto di Cosmo. Thom Vaughan served as moderator.

The issues around transparency and accountability in AI discussed by the panel are critical to the industry. A recurring term was “full chain of transparency” (thanks, Dave Buckley!), something the panel broadly agreed is desperately needed across the industry. Another key theme was that attribution and provenance in training data should be systematic: embedded into data practices by design, rather than added as an afterthought. Several panellists also highlighted developers’ responsibility to uphold ethical standards, with repeated reference to Croissant, the community-developed metadata standard from MLCommons, as a promising tool for responsible data documentation.

Left-to-right: Thom Vaughan, Dave Buckley, Greg Lindahl, Jose Plehn-Dujowich, Lilith Bat-Leah, and Dean Wampler on the panel discussion at the AI Unconference at IBM One Madison.

As Dean Wampler noted during the discussions, “Constraints liberate, liberties constrain”, a saying at IBM that resonated with the panel. In the context of AI, the idea points to how well-designed boundaries, such as clear data documentation standards, transparent governance structures, and ethical constraints, can enable greater innovation, trust, and collaboration rather than limit progress.

Breakout sessions followed the panel, discussing the ethics of large-scale data collection, preserving authenticity in user preference signals, governance and transparency in AI training data usage, collaborative standards across the AI ecosystem, and building trust in public data pipelines.

We would like to thank our friends at the AI Alliance, BrightQuery, and IBM for co-hosting this special event with us. Thanks in particular to Tim Bonnemann, Community Lead at IBM, for his tireless efforts and thoughtful coordination throughout the event.

It was a full and productive week in New York.  We had meaningful conversations, made valued connections, and saw real interest in the work we’re doing at Common Crawl.

Our thanks to everyone who took part.  We’re looking forward to what comes next.


Erratum: 

Content is truncated


Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
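
As a rough illustration of the thresholds described above, the following Python sketch maps a crawl ID to the fetch size limit that applied to it. The function name `truncation_limit_bytes` is hypothetical and not part of any Common Crawl tooling, and the sketch assumes crawl IDs follow the `CC-MAIN-YYYY-WW` pattern shown in the erratum.

```python
import re

MIB = 1024 * 1024

# First crawl using the larger limit, as stated in the erratum (CC-MAIN-2025-13).
FIRST_5_MIB_CRAWL = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID.

    Hypothetical helper: assumes IDs look like 'CC-MAIN-YYYY-WW'.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element, so this orders crawls chronologically.
    if (year, week) >= FIRST_5_MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```

A record whose payload size reaches the returned limit is a candidate for having been truncated; this is only a heuristic, and the truncation analysis notebook remains the authoritative reference.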