We are pleased to announce a new release of host-level and domain-level web graphs based on the September/October and November/December 2022 and January/February 2023 crawls. For more information about the data formats and the processing pipeline, please see the announcements of previous webgraph releases. You may also visit the cc-webgraph and cc-pyspark projects, which contain all the scripts and tools needed to construct the graphs. Instructions for exploring the graphs in the webgraph format can be found in our collection of webgraph notebooks.
Host-level graph
The graph has 325 million nodes and 2.63 billion edges. Hyperlinks, HTTP redirects and link headers are all used as edges to span the graph. All types of links are included, even purely “technical” links pointing to images, JavaScript libraries, web fonts, etc. However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as host component are not taken into account for building the host-level graph.
There are 268 million dangling nodes (82.7%), and the largest strongly connected component contains 43.1 million nodes (13.3%). Dangling nodes come from
- hosts that are not crawled, but are referenced by a link on a crawled page
- hosts with no links pointing to another hostname
- or hosts that only returned an error page (e.g. HTTP 404).
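In graph terms, a dangling node is a node without any outgoing edge. The following is a minimal sketch of how such nodes could be counted from an edge list. It assumes nodes are identified by integer IDs and edges are given as (source, target) pairs; the function name and the toy data are illustrative only and are not taken from the cc-webgraph tools.

```python
def dangling_nodes(node_ids, edges):
    """Return the nodes that never appear as the source of an edge."""
    # Collect every node that has at least one outgoing edge.
    has_outgoing = {src for src, _dst in edges}
    # Everything else is dangling, kept in the original node order.
    return [n for n in node_ids if n not in has_outgoing]


# Toy graph (hypothetical data): nodes 2 and 3 are only ever link targets
# or isolated, so they have no outgoing edges.
toy_nodes = [0, 1, 2, 3]
toy_edges = [(0, 1), (1, 0), (1, 2)]
toy_dangling = dangling_nodes(toy_nodes, toy_edges)
```

For the full host-level graph, the WebGraph framework is the more practical choice, since an in-memory Python set of billions of edges would not scale.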
Hostnames in the graph are in reverse domain name notation with the leading www. removed: www.subdomain.example.com becomes com.example.subdomain.
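The conversion can be sketched in a few lines of Python. This is an illustration of the notation only, not the code used in the pipeline; the function names are hypothetical, and lowercasing the hostname is an assumption made here for convenience.

```python
def to_reverse_notation(hostname):
    """Convert a hostname to reverse domain name notation without leading 'www.'."""
    host = hostname.lower()  # assumption: hostnames are compared case-insensitively
    if host.startswith("www."):
        host = host[len("www."):]
    # Reverse the order of the dot-separated labels.
    return ".".join(reversed(host.split(".")))


def from_reverse_notation(rev_host):
    """Convert reverse notation back to a regular hostname (the 'www.' is not restored)."""
    return ".".join(reversed(rev_host.split(".")))


example = to_reverse_notation("www.subdomain.example.com")
```

Note that the mapping is not fully invertible: once the leading www. is removed, there is no way to tell from the graph whether the original hostname had it.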
You can download the graph and the ranks of all 325 million hosts from AWS S3 at s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ (this requires an account on AWS). Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/host/ as a prefix to access the files from anywhere.
Note that the text representation of the host-level graph is delivered in 10 gzip-compressed files listed in two path listings: one for the nodes (vertices), and one for the edges (arcs). First, download the path listing and decompress it with “gzip -d” or “gunzip”. Then add the prefix s3://commoncrawl/ or https://data.commoncrawl.org/ to each line in the path listing to get the list of URLs for downloading the entire graph.
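As a rough sketch, the same steps can be done in Python once a path listing has been downloaded. The function name is hypothetical, and the listing content used in the example is made-up placeholder data, not the real file names.

```python
import gzip


def listing_to_urls(listing_gz, prefix="https://data.commoncrawl.org/"):
    """Decompress a gzip-compressed path listing and prepend the prefix to each line."""
    text = gzip.decompress(listing_gz).decode("utf-8")
    # One relative path per line; skip empty lines.
    return [prefix + line.strip() for line in text.splitlines() if line.strip()]


# Placeholder listing with two made-up paths, compressed in memory for the demo.
fake_listing = gzip.compress(b"some/path/part-0.txt.gz\nsome/path/part-1.txt.gz\n")
urls = listing_to_urls(fake_listing)
s3_urls = listing_to_urls(fake_listing, prefix="s3://commoncrawl/")
```

In practice, `listing_gz` would be the bytes of the downloaded path listing file, for example read with `open(path, "rb").read()`.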
Download files of the Common Crawl Sep/Nov/Jan 2022-2023 host-level Webgraph
Domain-level graph
The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org. Version (commit) 0bbf864 of the public suffix list was used (commit date 2023-03-08).
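To illustrate the idea of the aggregation, here is a simplified sketch. It assumes hostnames in reverse domain name notation and a tiny hand-written set of public suffixes (also reversed) instead of the real public suffix list, which additionally has wildcard and exception rules that are ignored here. Dropping links between hosts of the same domain is also an assumption of this sketch, not a statement about the actual pipeline; all names are hypothetical.

```python
def pay_level_domain(rev_host, suffixes):
    """Map a reversed hostname to public suffix + one label, e.g. com.example.sub -> com.example."""
    labels = rev_host.split(".")
    # Try the longest candidate suffix first, leaving at least one label for the domain.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in suffixes:
            return ".".join(labels[:n + 1])
    return rev_host  # no known suffix matched: keep the name unchanged


def aggregate_edges(host_edges, suffixes):
    """Collapse host-level edges into a sorted list of unique domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s = pay_level_domain(src, suffixes)
        d = pay_level_domain(dst, suffixes)
        if s != d:  # assumption of this sketch: skip links within one domain
            domain_edges.add((s, d))
    return sorted(domain_edges)


# Toy data: suffixes "com" and "co.uk", written in reverse notation.
toy_suffixes = {"com", "uk.co"}
toy_host_edges = [
    ("com.example.blog", "uk.co.example.shop"),
    ("com.example.news", "uk.co.example"),
    ("com.example.blog", "com.example.news"),
]
toy_domain_edges = aggregate_edges(toy_host_edges, toy_suffixes)
```

Here the first two host-level edges collapse into a single domain-level edge, and the third one disappears because both hosts belong to the same domain.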
The domain-level graph has 88 million nodes and 1.68 billion edges. 46 million nodes (52%) are dangling nodes, and the largest strongly connected component covers 34 million nodes (39%).
All domain graph files are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/ or on https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2022-23-sep-nov-jan/domain/.
Download files of the Common Crawl Sep/Nov/Jan 2022-2023 domain-level Webgraph
Credits
Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible. We hope you find the data useful for any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl's Google Group!