May 22, 2017

Common Crawl's First In-House Web Graph

Note: this post has been marked as obsolete.
Sebastian Nagel
Sebastian is a Distinguished Engineer with Common Crawl.

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes* and 2.5 billion edges. The development of this graph yielded:

  • a ranked list of hosts to expand the crawl frontier;
  • hosts ranked by harmonic centrality, which is less influenced by link spam than other measures (PageRank is included for comparison);
  • the template/process for Common Crawl to produce graphs and page rankings at regular intervals.

We produced this graph, and intend to produce similar graphs going forward, because the Common Crawl community has expressed a strong interest in using Common Crawl data for graph processing.

* Please note: the graph includes dangling nodes, i.e. hosts that have not been crawled yet but are pointed to by a link on a crawled page. Only 17% (65 million) of the hosts represented have been crawled in one of the three monthly crawls; the remaining 320 million hosts are known only from links. (Host names are not fully verified: obviously invalid host names are skipped, but the rest are not resolved in DNS.)

Extraction of links and construction of the graph

Links are taken from the WAT extracts; we also include redirects from the WARC files of the redirect and 404 dataset. All types of links are covered, even purely "technical" ones pointing to JavaScript libraries, web fonts, etc.
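As an illustration of the first part, here is a minimal sketch of pulling links out of a WAT file with the warcio library (warcio, the function name, and the local file name are our assumptions for illustration; the actual pipeline is cc-pyspark, described below):

```python
import json

from warcio.archiveiterator import ArchiveIterator

def iter_wat_links(wat_path):
    """Yield (source URL, target URL) pairs from a WAT metadata file."""
    with open(wat_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            # WAT records carry the extracted metadata as JSON.
            envelope = json.loads(record.content_stream().read())
            payload = envelope.get("Envelope", {}).get("Payload-Metadata", {})
            html_meta = payload.get("HTTP-Response-Metadata", {}).get("HTML-Metadata", {})
            source = record.rec_headers.get_header("WARC-Target-URI")
            for link in html_meta.get("Links", []):
                if "url" in link:
                    yield source, link["url"]
```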

The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain (a minimal sketch of this normalization follows the list below). Node IDs are assigned sequentially to the node list sorted by reversed host name. This keeps links between hosts of the same domain or in the same country-code top-level domain close together and allows for efficient delta-compression of edges. The extraction is done in three steps:

  • links are extracted, reduced to host-level links, and stored as ⟨reversed host from, reversed host to⟩ pairs;
  • host names are assigned IDs and edges are represented as ⟨from id, to id⟩ pairs;
  • ranks are computed.
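The host-name normalization mentioned above boils down to a few lines of Python (the function name is ours, not taken from cc-pyspark):

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the dot-separated labels,
    e.g. 'www.subdomain.example.com' -> 'com.example.subdomain'."""
    host = host.lower().rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return ".".join(reversed(host.split(".")))

assert reverse_host("www.subdomain.example.com") == "com.example.subdomain"
```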

The first two steps are done with Spark and Python; the code is part of the cc-pyspark project. To compute the rankings, the web graph is loaded into the WebGraph framework.
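The actual implementation lives in cc-pyspark; the following is only a schematic PySpark sketch of the second step, assuming host-level link pairs are already available as an RDD (all names here are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hostgraph-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical input: host-level links as ⟨reversed host from, reversed host to⟩.
links = sc.parallelize([
    ("com.example", "org.wikipedia"),
    ("com.example.subdomain", "com.example"),
    ("org.wikipedia", "com.example"),
])

# Distinct node list, sorted by reversed host name, with sequential IDs.
nodes = links.flatMap(lambda pair: pair).distinct().sortBy(lambda h: h).zipWithIndex()

# Small enough to collect here; at scale this would be a join on the RDDs.
node_ids = dict(nodes.collect())

# Edges rewritten as ⟨from id, to id⟩ pairs.
edges = links.map(lambda pair: (node_ids[pair[0]], node_ids[pair[1]]))
print(edges.collect())
```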

Hosts ranked by Harmonic Centrality and PageRank

We provide a list of nodes (host names) ranked by harmonic centrality and by PageRank; see ranks.txt.gz below.
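For reference, the harmonic centrality of a node v sums the reciprocal shortest-path distances from all other nodes, counting unreachable nodes as zero, which is one reason it is less easily inflated by local link spam:

```latex
H(v) = \sum_{u \neq v} \frac{1}{d(u, v)},
\qquad \text{where } d(u, v) \text{ is the shortest-path distance and } \tfrac{1}{\infty} := 0
```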

Data and download instructions

The host-level graph as well as the rankings are available on AWS S3 at the path:

s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

Alternatively, you can use:

https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/

as a prefix to access the files from anywhere.
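For example, a single file can be fetched with boto3 (a minimal sketch assuming anonymous access to the public bucket; plain HTTPS against the prefix above works just as well):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client; the commoncrawl bucket is publicly readable.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
prefix = "projects/hyperlinkgraph/cc-main-2017-feb-mar-apr-hostgraph/"
s3.download_file("commoncrawl", prefix + "vertices.txt.gz", "vertices.txt.gz")
```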

Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph

Size     File                  Description
2.72 GB  vertices.txt.gz       nodes ⟨id, reversed host⟩
9.42 GB  edges.txt.gz          edges ⟨from_id, to_id⟩
4.51 GB  bvgraph.graph         graph in BVGraph format
0.22 GB  bvgraph.offsets       offsets into bvgraph.graph
1 kB     bvgraph.properties    properties of the BVGraph
5.06 GB  bvgraph-t.graph       transpose of the graph (outlinks mapped to inlinks)
0.47 GB  bvgraph-t.offsets     offsets into bvgraph-t.graph
1 kB     bvgraph-t.properties  properties of the transposed BVGraph
1 kB     bvgraph.stats         WebGraph statistics
6.26 GB  ranks.txt.gz          harmonic centrality and PageRank scores
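To get started with the downloaded files, here is a minimal sketch for scanning the gzipped node list (assuming tab-separated ⟨id, reversed host⟩ lines; check the actual files for the exact layout):

```python
import gzip

# Assumed format: one node per line, "<id>\t<reversed host>".
with gzip.open("vertices.txt.gz", "rt", encoding="utf-8") as f:
    for line in f:
        node_id, rev_host = line.rstrip("\n").split("\t")
        if rev_host.startswith("org.commoncrawl"):
            print(node_id, rev_host)
```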

We hope the data will be useful for research on ranking, graph analysis, link spam detection, and more. Let us know about your results via Common Crawl's Google Group!

Credits

Thanks to:

  • Web Data Commons, for their web graph data set and everything related.
  • Common Search; we first used their web graph to expand the crawler frontier, and their cosr-back project was an important source of inspiration for how to process our data with PySpark.
  • the authors of the WebGraph framework, whose software simplifies the computation of rankings.