The Web Data Commons project extracts structured data from the Common Crawl corpora and offers the extracted data for public download. We have extracted one of the largest hyperlink graphs currently available to the public. We also extract and offer large corpora of Microdata, Microformats and RDFa annotations, as well as relational HTML tables. Why do we do this? Because we share the opinion that data should be available to everybody, and because we want to make it easier to exploit the wealth of information that is available on the Web.
To perform the extractions, we need to go through the hundreds of terabytes of crawl data offered by the Common Crawl Foundation. As a project without any direct funding or salaried staff, we needed a time-, resource- and cost-efficient way to process the Common Crawl corpora. We therefore developed a data extraction tool which allows us to process the Common Crawl corpora in a distributed fashion using Amazon Web Services (AWS).
The basic architectural idea of the extraction tool is to have a queue that takes care of the proper handling of all files to be processed. Each worker requests a new file from the queue whenever it is ready and reports the status of the processing (success or failure) back to the queue. Successfully processed files are removed from the queue; failed files are reassigned to another worker, or dropped once a fixed number of workers have been unable to process them.
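To make this idea concrete, the following is a minimal sketch of such a worker loop using Amazon SQS as the queue. The queue URL, the MAX_RETRIES behaviour (delegated here to SQS visibility timeouts and a dead-letter queue) and the processFile method are illustrative assumptions, not the framework's actual API.

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class ExtractionWorker {

    // Hypothetical queue holding one message per Common Crawl file to process.
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/cc-files";

    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        while (true) {
            // Ask the queue for the next file whenever the worker is ready (long polling).
            ReceiveMessageRequest request = new ReceiveMessageRequest(QUEUE_URL)
                    .withMaxNumberOfMessages(1)
                    .withWaitTimeSeconds(20);

            for (Message message : sqs.receiveMessage(request).getMessages()) {
                String filePath = message.getBody();
                try {
                    processFile(filePath); // user-supplied extraction logic for one file
                    // Success: remove the file from the queue so no other worker picks it up.
                    sqs.deleteMessage(QUEUE_URL, message.getReceiptHandle());
                } catch (Exception e) {
                    // Failure: leave the message alone; once its visibility timeout expires
                    // it is handed to another worker. A dead-letter queue configured with a
                    // maximum receive count eliminates files that keep failing.
                    System.err.println("Failed to process " + filePath + ": " + e.getMessage());
                }
            }
        }
    }

    private static void processFile(String filePath) {
        // Placeholder for the per-file extraction logic.
    }
}
```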
We used the extraction tool, for example, to extract a hyperlink graph covering over 3.5 billion pages and 126 billion hyperlinks from the 2012 Common Crawl corpus (over 100 TB uncompressed). Using our framework and 100 EC2 instances, the extraction took less than 12 hours and cost less than US$ 500. The extracted graph is less than 100 GB in compressed form.
With each new extraction, we improved the tool and gradually turned it into a flexible framework: we now simply plug in the processor needed for a single file, and the framework takes care of everything else.
This framework has now been officially released under the terms of the Apache license. The framework takes care of everything related to file handling, distribution, and scalability, and leaves to the user only the task of writing the code that extracts the desired information from a single Common Crawl file.
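As an illustration of what such a plug-in point could look like, here is a hypothetical per-file processor interface. The interface name and signature are assumptions made for this sketch; the framework's real extension point is documented in the guide linked below.

```java
import java.io.InputStream;

/**
 * Hypothetical plug-in interface: the only code a user would have to write.
 * The framework hands the processor one Common Crawl file at a time and
 * takes care of queueing, distribution, and error handling around it.
 */
public interface FileProcessor {

    /**
     * Extracts the desired information from a single Common Crawl file.
     *
     * @param fileName name of the file within the corpus
     * @param content  raw content of the file as a stream
     * @return number of records extracted, e.g. for run statistics
     */
    long process(String fileName, InputStream content) throws Exception;
}
```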
More information about the framework, a detailed guide on how to run it, and a tutorial showing how to customize it for your own extraction tasks can be found at http://webdatacommons.org/framework
We encourage all interested parties to make use of the framework. We will continuously improve it and welcome feedback from everybody about their experiences with it.