Last week, members of the Common Crawl Foundation team (Chris, Greg, Jason, Rich, Sam, Stephen, and Wayne) attended the Neural Information Processing Systems (NeurIPS) Conference at the Vancouver Convention Centre in downtown Vancouver, BC. Set against the backdrop of Vancouver’s stunning waterfront, with its snow-capped mountains and vibrant cityscape, the conference drew over 7,000 attendees from around the world.
Meaningful Connections and Opportunities
We attended NeurIPS with the goal of exploring potential partnerships and learning from the AI research community. During the conference, we met with people from over 40 organizations, and each conversation offered insights into possible collaborations and ways we might support the broader AI ecosystem.
Common Crawl and Wikimedia Social: Bridging Tech and Social Impact
Our signature event at NeurIPS was a compelling social gathering titled "Nonprofits Bridging Tech and Social Impact." This two-hour event brought together over 60 participants from academia and industry, showcasing the critical work of nonprofit technology organizations.
Presentations
- An introduction to Wikimedia and Common Crawl, illuminating our respective missions
- An exploration of Common Crawl's dataset quality and the complexities of web crawling, presented by Greg Lindahl
- A deep dive into Wikipedia's editing landscape and community dynamics, presented by Chris Petrillo
- An interactive Q&A session that sparked robust discussion
The event then moved into roundtable discussions, which also served as a networking opportunity. Participants from a variety of backgrounds exchanged ideas about AI, technology, and social impact.
Additional Conference Highlights
We were excited to support our colleague Professor Ludwig Schmidt, who delivered a highly effective tutorial titled "Advancing Data Selection for Foundation Models: From Heuristics to Principled Methods." His presentation explored critical approaches to data selection in foundation model training, discussing everything from algorithmic foundations to practical data curation techniques. The session delved into attribution-based approaches, diversity-based methods, and emerging strategies for optimizing model performance through intelligent data selection.
We were also lucky to meet Dr. Fei-Fei Li, a key industry leader often referred to as the "Godmother of AI" and a co-founder of the Stanford Institute for Human-Centered Artificial Intelligence (HAI). She reiterated how critical Common Crawl’s work is to the industry. Dr. Li also delivered one of the Key Invited Talks at the conference and highlighted Common Crawl on Slide #104 of her presentation.
Team members participated in several standout events. Rich Skrenta and Greg Lindahl attended an MLCommons dinner sponsored by Tola Capital. The evening featured an impressive lineup of speakers, including Lora Aroyo from Google, Sara Hooker from Cohere, Rishi Bommasani from Stanford University, and Peter Mattson from MLCommons and Google.
Looking Forward
The NeurIPS conference was a resounding success, strengthening our connections and highlighting Common Crawl’s role in the AI research community. We look forward to building on these partnerships and continuing to provide high-quality, open-access web data to support innovation in AI.