Pedro Ortiz Suarez

Principal Research Scientist

Pedro is a French-Colombian mathematician, computer scientist and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro’s research has mainly focused on how data quality impacts ML models’ performance and how to improve these models through data-driven approaches. Pedro has been a main contributor to multiple open-source Large Language Model initiatives, including CamemBERT, BLOOM and OpenGPT-X. Before joining the Common Crawl Foundation, Pedro founded the open-source project OSCAR, which provides high-performance data pipelines to annotate Common Crawl’s data and make it more accessible to NLP and LLM researchers and practitioners. Pedro has also contributed to many projects in information extraction and other NLP applications for both the scientific domain and Digital Humanities.
