February 18, 2015

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks, a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project and why he built it.

Ross Fairbanks

Ross Fairbanks is a software developer based in Barcelona.

What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The results produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links) followed by Spanish, Indonesian and German. In total there are results for 283 languages.

I first heard about Common Crawl in a blog post by Steve Salevan— MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency savings of a large web scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which is the spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, it is difficult to see with no interface to the data. It can be hard to visualize the possibilities and what can be done with the data. For this reason, my project runs an analysis over an entire crawl with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite it’s relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and they appear highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects and hopefully more analyses of these types will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on Github. I’ve written a more detailed post on my blog on the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.

[1] https://wikireverse.org/
[2] https://commoncrawl.org/blog/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html

This release was authored by:

No items found.

Erratum:

Content is truncated

Originally reported by:

More details

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

Erratum:

Content is truncated

The Data

Overview

CDXJ Index

URL Index

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

CCBot

Infra Status

Opt-Out Registry

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

About

Team

Jobs

Privacy Policy

Terms of Use