Web Graphs
Common Crawl regularly releases host- and domain-level graphs for visualising the crawl data.
Hostnames in the graph are in reverse domain name notation and all types of links are listed, including purely “technical” links pointing to images, JavaScript libraries, web fonts, etc.
However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as host component are not taken into account for building the host-level graph.
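The naming and filtering rules above can be illustrated with a minimal Python sketch. It is not the code used to build the released graphs: the function names are hypothetical, and `KNOWN_TLDS` is a tiny stand-in for the full IANA TLD list.

```python
import ipaddress

# Assumption: a tiny stand-in for the IANA TLD list, for illustration only.
KNOWN_TLDS = {"com", "org", "uk"}


def reverse_host(hostname):
    """Reverse the labels of a hostname: 'www.example.com' -> 'com.example.www'."""
    return ".".join(reversed(hostname.lower().split(".")))


def is_ip_address(host):
    """Return True if the host component is an IPv4 or IPv6 literal."""
    try:
        # IPv6 literals appear in URLs wrapped in square brackets.
        ipaddress.ip_address(host.strip("[]"))
        return True
    except ValueError:
        return False


def host_graph_node(host):
    """Return the reversed hostname, or None if the host would be skipped."""
    if is_ip_address(host):
        return None
    tld = host.lower().rsplit(".", 1)[-1]
    if tld not in KNOWN_TLDS:
        return None
    return reverse_host(host)


for host in ["www.example.com", "192.168.0.1", "printer.local"]:
    print(host, "->", host_graph_node(host))
```

In this sketch a link pointing to an image or a web font is treated like any other link: only the host component of the target URL decides whether a node is kept.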
The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org.
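The aggregation step can be sketched as follows. This is an illustration under stated assumptions rather than the cc-webgraph implementation: `PUBLIC_SUFFIXES` is a hypothetical hand-picked excerpt written in reverse notation, the wildcard and exception rules of the real public suffix list are ignored, and dropping links that collapse into a self-loop is a choice made for this example only.

```python
# Assumption: a hand-picked excerpt of public suffixes in reverse notation.
# The real list on publicsuffix.org is far larger and has wildcard rules.
PUBLIC_SUFFIXES = {"com", "org", "uk", "uk.co"}


def pay_level_domain(rev_host):
    """Return the PLD of a reversed hostname: the longest matching
    public suffix plus one more label, or None if there is none."""
    if rev_host in PUBLIC_SUFFIXES:
        return None  # a bare public suffix has no pay-level domain
    labels = rev_host.split(".")
    # Try the longest candidate suffix first.
    for n in range(len(labels) - 1, 0, -1):
        if ".".join(labels[:n]) in PUBLIC_SUFFIXES:
            return ".".join(labels[:n + 1])
    return None


def aggregate_edges(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        src_pld = pay_level_domain(src)
        dst_pld = pay_level_domain(dst)
        # Assumption of this sketch: skip unknown suffixes and self-loops.
        if src_pld is None or dst_pld is None or src_pld == dst_pld:
            continue
        domain_edges.add((src_pld, dst_pld))
    return domain_edges


host_edges = [
    ("com.example.www", "org.example.blog"),
    ("com.example.shop", "org.example.www"),
    ("com.example.www", "com.example.cdn"),
    ("uk.co.bbc.news", "com.example.www"),
]
print(sorted(aggregate_edges(host_edges)))
```

Because many host pairs map to the same pair of pay-level domains, the domain graph has fewer nodes and edges than the host graph it is derived from.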
The list of graph releases is also available via graphinfo.json.
For more information please see cc-webgraph on GitHub.