May 22, 2017

Common Crawl's First In-House Web Graph

Note: this post has been marked as obsolete.

Sebastian Nagel

Sebastian is a Distinguished Engineer with Common Crawl.

We are pleased to announce the release of a host-level web graph built from three recent monthly crawls (February, March, and April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Developing this graph produced the following:

a ranked list of hosts to expand the crawl frontier;
pages ranked by Harmonic Centrality, which, among other attributes, is less influenced by spam (for comparison we include PageRank);
the template/process for Common Crawl to produce graphs and page rankings at regular intervals.

We produced this graph, and intend to produce similar graphs going forward, because the Common Crawl community has expressed a strong interest in using Common Crawl data for graph processing, particularly with respect to:

web graph and page rankings produced by Common Search in 2016;
the Hyperlink Graph data set produced in 2013 by Web Data Commons (WDC);
the "WWW Ranking" from WDC, along with a second set of hyperlink graphs based on crawl data from April 2014.

Please note: the graph includes dangling nodes, i.e. hosts that have not been crawled but are pointed to by a link on a crawled page. Seventeen percent (65 million) of the hosts represented have been crawled in one of the three monthly crawls. Thus, the remaining 320 million hosts represented in the graph are known only from links. (Host names are not fully verified: host names that are obviously invalid are skipped, but the others are not resolved in DNS.)
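The distinction between crawled and dangling hosts can be illustrated with a minimal Python sketch. The host names and the helper `split_hosts` are hypothetical and only serve as a toy example; this is not the code used to build the graph.

```python
def split_hosts(crawled_hosts, host_links):
    """Return (all_hosts, dangling_hosts) for a list of host-level links.

    A host is "dangling" here if it appears in the graph only because
    some crawled page links to it, i.e. it was never crawled itself.
    """
    crawled = set(crawled_hosts)
    all_hosts = set(crawled)
    for source, target in host_links:
        all_hosts.add(source)
        all_hosts.add(target)
    dangling = all_hosts - crawled
    return all_hosts, dangling


# Toy data with hypothetical host names.
crawled = ["a.example", "b.example"]
links = [
    ("a.example", "b.example"),
    ("a.example", "c.example"),  # c.example was never crawled
    ("b.example", "d.example"),  # d.example was never crawled
]
all_hosts, dangling = split_hosts(crawled, links)
dangling_share = len(dangling) / len(all_hosts)
```

Applied to the figures above, the same subtraction gives 385 million − 65 million = 320 million hosts known only from links.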

Extraction of links and construction of the graph

Links are taken from the WAT extracts; we also included redirects from the WARC files of the redirect and 404 dataset. All types of links are kept, including purely "technical" ones pointing to JavaScript libraries, web fonts, etc.

The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain. Node IDs are assigned sequentially to the node list sorted by reversed host name. This keeps links between hosts of the same domain or in the same country-code top-level domain close together and allows for an efficient delta-compression of edges. The extraction is done in three steps:

links are extracted, reduced to host-level links, and stored as pairs ⟨reversed host from, reversed host to⟩;
host names are assigned IDs, and edges are represented as ⟨from id, to id⟩ pairs;
ranks are computed.
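The first two steps can be sketched in plain Python on a toy link list. This is a minimal illustration, not the cc-pyspark code: the function names are hypothetical, IDs are assumed to start at 0, and removing duplicate host pairs is an assumption made here to keep the example small.

```python
from urllib.parse import urlsplit


def reverse_host(host):
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def build_host_graph(url_links):
    """Turn (from_url, to_url) pairs into a sorted vertex list and ID edges."""
    # Step 1: reduce page-level links to pairs of reversed host names.
    host_pairs = set()
    for from_url, to_url in url_links:
        source = reverse_host(urlsplit(from_url).hostname)
        target = reverse_host(urlsplit(to_url).hostname)
        host_pairs.add((source, target))
    # Step 2: sort reversed host names and assign sequential IDs.
    vertices = sorted({host for pair in host_pairs for host in pair})
    ids = {host: node_id for node_id, host in enumerate(vertices)}
    edges = sorted((ids[s], ids[t]) for s, t in host_pairs)
    return vertices, edges


# Hypothetical page-level links.
links = [
    ("http://www.example.com/a", "https://blog.example.com/post"),
    ("https://blog.example.com/post", "http://example.org/"),
    ("http://www.example.com/b", "https://blog.example.com/other"),
]
vertices, edges = build_host_graph(links)
```

Because the names are sorted in reversed form, com.example and com.example.blog receive adjacent IDs, which is the locality property described above.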

The first two steps are done with Spark and Python; the code is part of the project cc-pyspark. To compute the rankings, the web graph is loaded into the WebGraph framework.
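To illustrate why sorting by reversed host name helps delta-compression, the following sketch encodes a node's list of target IDs as the first ID followed by the differences between consecutive IDs. It is a simplified illustration of gap encoding in general, not the BVGraph format itself, and the ID values are made up.

```python
def gap_encode(successors):
    """Encode target IDs as the first ID plus successive differences."""
    ordered = sorted(successors)
    if not ordered:
        return []
    gaps = [ordered[0]]
    for previous, current in zip(ordered, ordered[1:]):
        gaps.append(current - previous)
    return gaps


def gap_decode(gaps):
    """Invert gap_encode by accumulating the differences."""
    decoded, total = [], 0
    for gap in gaps:
        total += gap
        decoded.append(total)
    return decoded


# Targets within one domain get neighboring IDs, so the gaps are tiny.
clustered = gap_encode([1000, 1001, 1002, 1005])
# Targets spread over unrelated domains produce large gaps.
scattered = gap_encode([12, 48000, 910000, 3500000])
```

Small gaps need fewer bits under a variable-length integer code, so a node ordering that keeps related hosts close together tends to yield a smaller file.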

Hosts ranked by Harmonic Centrality and PageRank

We provide a list of nodes (host names) ranked by:

Harmonic Centrality (calculated by HyperBall)
PageRank (calculated by PageRankParallelGaussSeidel)
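For readers unfamiliar with the two measures, here is a small Python sketch on a toy graph. It deliberately swaps in simpler techniques than the tools named above: harmonic centrality is computed exactly with a breadth-first search per node (HyperBall approximates it with probabilistic counters), and PageRank uses plain power iteration rather than a parallel Gauss-Seidel solver. The damping factor of 0.85, the iteration count, and the uniform redistribution of rank from nodes without outlinks are assumptions.

```python
from collections import deque


def harmonic_centrality(num_nodes, edges):
    """For each node x, sum 1/d(y, x) over all nodes y that can reach x."""
    predecessors = [[] for _ in range(num_nodes)]
    for source, target in edges:
        predecessors[target].append(source)
    scores = []
    for node in range(num_nodes):
        distance = {node: 0}
        queue = deque([node])
        score = 0.0
        # Breadth-first search along incoming links.
        while queue:
            current = queue.popleft()
            for pred in predecessors[current]:
                if pred not in distance:
                    distance[pred] = distance[current] + 1
                    score += 1.0 / distance[pred]
                    queue.append(pred)
        scores.append(score)
    return scores


def pagerank(num_nodes, edges, damping=0.85, iterations=50):
    """Power iteration; rank of nodes without outlinks is spread uniformly."""
    successors = [[] for _ in range(num_nodes)]
    for source, target in edges:
        successors[source].append(target)
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        dangling = sum(rank[n] for n in range(num_nodes) if not successors[n])
        base = (1.0 - damping + damping * dangling) / num_nodes
        new_rank = [base] * num_nodes
        for n in range(num_nodes):
            if successors[n]:
                share = damping * rank[n] / len(successors[n])
                for target in successors[n]:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: 0 -> 1 -> 2 <- 3
toy_edges = [(0, 1), (1, 2), (3, 2)]
hc_scores = harmonic_centrality(4, toy_edges)
pr_scores = pagerank(4, toy_edges)
```

In the toy graph, node 2 is reached from nodes 1 and 3 at distance 1 and from node 0 at distance 2, so its harmonic centrality is 1 + 1 + 1/2 = 2.5.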

Data and download instructions

The host-level graph and the rankings are available on AWS S3 under the path:

s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

Alternatively, you can use:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

as a prefix to access the files from anywhere.

Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph

Size	File	Description
2.72 GB	vertices.txt.gz	nodes ⟨id, rev host⟩
9.42 GB	edges.txt.gz	edges ⟨from_id, to_id⟩

4.51 GB	bvgraph.graph	graph in BVGraph format
0.22 GB	bvgraph.offsets
1 kB	bvgraph.properties

5.06 GB	bvgraph-t.graph	transpose of the graph (outlinks mapped to inlinks)
0.47 GB	bvgraph-t.offsets
1 kB	bvgraph-t.properties

1 kB	bvgraph.stats	WebGraph statistics

6.26 GB	ranks.txt.gz	harmonic centrality and pagerank
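As a starting point for working with the text files, the sketch below builds a download URL from the HTTPS prefix and parses gzip-compressed vertex and edge lines. The column separator is an assumption (tab for vertices, any whitespace for edges); the table only states which fields each file contains. The function names are hypothetical, and the sample data is generated in memory rather than downloaded.

```python
import gzip
import io

BASE_URL = (
    "https://data.commoncrawl.org/projects/hyperlinkgraph/"
    "cc-main-2017-feb-mar-apr-hostgraph/"
)


def file_url(name):
    """Return the HTTPS URL of one file from the table above."""
    return BASE_URL + name


def read_vertices(binary_stream):
    """Yield (node_id, reversed_host) from a gzip-compressed vertex list."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            # Assumption: the ID and the host name are tab-separated.
            node_id, rev_host = line.rstrip("\n").split("\t", 1)
            yield int(node_id), rev_host


def read_edges(binary_stream):
    """Yield (from_id, to_id) from a gzip-compressed edge list."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            # Assumption: the two IDs are separated by whitespace.
            from_id, to_id = line.split()
            yield int(from_id), int(to_id)


# In-memory sample standing in for the real, much larger files.
sample_vertices = gzip.compress(b"0\tcom.example\n1\tcom.example.blog\n")
sample_edges = gzip.compress(b"0\t1\n1\t0\n")
vertices = list(read_vertices(io.BytesIO(sample_vertices)))
edges = list(read_edges(io.BytesIO(sample_edges)))
```

For the real data, a file downloaded from `file_url("vertices.txt.gz")` and opened with `open(path, "rb")` can be passed in place of the in-memory sample. Both readers are generators, so the multi-gigabyte files are processed line by line rather than loaded at once.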

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl's Google Group!

Credits

Thanks to:

Web Data Commons, for their web graph data set and everything related.
Common Search; we first used their web graph to expand the crawler frontier, and Common Search's cosr-back project was an important source of inspiration for how to process our data using PySpark.
The authors of the WebGraph framework, whose software simplifies the computation of rankings.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

For more details, see our truncation analysis notebook.
