Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université. Pedro’s research has mainly focused on how data quality affects the performance of ML models and how to improve these models through data-driven approaches. Pedro has been a major contributor to multiple open-source Large Language Model initiatives such as CamemBERT, BLOOM, and OpenGPT-X. Prior to joining the Common Crawl Foundation, Pedro founded the open-source project OSCAR, which provides high-performance data pipelines to annotate Common Crawl’s data and make it more accessible to NLP and LLM researchers and practitioners. Pedro has also participated in many projects in information extraction and other NLP applications for both the scientific domain and Digital Humanities.