January 2015 Crawl Archive Available

The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-06/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.
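For example, here is a minimal sketch of the prefixing in Python (the path passed in below is only a placeholder for a line taken from one of the listing files):

    # Sketch: build the full S3 and HTTPS URLs from one line of a gzipped path listing.
    # The example path below is a placeholder, not an actual entry from the listing.
    S3_PREFIX = "s3://aws-publicdatasets/"
    HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/"

    def to_urls(line):
        path = line.strip()
        return S3_PREFIX + path, HTTP_PREFIX + path

    s3_url, http_url = to_urls("common-crawl/crawl-data/CC-MAIN-2015-06/example-file.gz")
    print(s3_url)
    print(http_url)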

Thanks again to blekko for their ongoing donation of URLs for our crawl!

We’re seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Please contact [email protected] for sponsorship packages.

5 Good Reads in Big Open Data: February 27 2015

  1. Hadoop is the Glue for Big Data – via StreetWise Journal: Startups trying to build a successful big data infrastructure should “welcome…and be protective” of open source software like Hadoop. The future and innovation of Big Data depend on it.

  2. Topic Models: Past, Present and Future – via the O’Reilly Data Show Podcast:

    You might analyze a bunch of New York Times articles for example, and there’ll be an article about sports and business, and you get a representation of that article that says this is an article and it’s about sports and business. Of course, the ideas of sports and business were also discovered by the algorithm, but that representation, it turns out, is also useful for prediction. My understanding when I speak to people at different startup companies and other more established companies is that a lot of technology companies are using topic modeling to generate this representation of documents in terms of the discovered topics, and then using that representation in other algorithms for things like classification or other things.


  3. Border disputes on Europe’s Right To Be Forgotten – via Slate: Is the angle of debate (disruptors vs. regulators) wrong? Should we be thinking of more custom solutions to this global issue?

  4. FlashGraph can analyze massive graphs to the proven tune of 129 billion edges – via the Common Crawl Blog (FlashGraph on GitHub):

    You may ask why we need another graph processing framework when we already have quite a few… FlashGraph was created to seek performance, capacity, flexibility and ease of programming. We hope FlashGraph can achieve performance comparable to state-of-the-art in-memory graph engines while scaling to graphs with hundreds of billions or even trillions of edges. We also hope that FlashGraph can express a variety of algorithms and hide the complexity of accessing data on SSDs and of parallelizing graph algorithms.


  5. The future of the internet is NOT all decided by net neutrality – via The Atlantic: A wonderfully curated net neutrality reading list, including one article where Justice Antonin Scalia tells us the Internet is a pizzeria (he’s right)

    Follow us @CommonCrawl on Twitter for the latest in Big Open Data

Analyzing a Web graph with 129 billion edges using FlashGraph

This is a guest blog post by Da Zheng
Da Zheng is the architect and main developer of the FlashGraph project. He is a PhD student in computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining.

FlashGraph is an SSD-based graph processing framework for analyzing massive graphs. We have demonstrated that FlashGraph is able to analyze the page-level Web graph constructed from the Common Crawl corpora by the Web Data Commons project. This Web graph has 3.5 billion vertices and 129 billion edges and is the largest publicly available graph in the world. Thanks to the hard work of Common Crawl and the Web Data Commons project, we are able to demonstrate the scalability and performance of FlashGraph as well as the graph algorithms designed for billion-node graphs.

You may ask why we need another graph processing framework when we already have quite a few, such as Pregel/Giraph, GraphLab/PowerGraph and GraphX. As pointed out by Frank McSherry in his blog posts 1 & 2, the current distributed graph processing frameworks have substantial overhead in order to scale out; we should seek both performance and capacity (the size of graph that can be processed). On top of the runtime overheads Frank McSherry mentions, these frameworks also have very large memory overhead. For example, as shown in the performance evaluation of the GraphX paper, Giraph cannot even process a graph with 106 million vertices and 3.8 billion edges in a cluster with an aggregate memory of 1088 GB. A similar problem exists in the other frameworks, as shown here. The large memory overhead prevents them from scaling to larger graphs or unnecessarily wastes resources.

FlashGraph was created to seek performance, capacity, flexibility and ease of programming. We hope FlashGraph can achieve performance comparable to state-of-the-art in-memory graph engines while scaling to graphs with hundreds of billions or even trillions of edges. We also hope that FlashGraph can express a variety of algorithms and hide the complexity of accessing data on SSDs and of parallelizing graph algorithms.

To scale graph analysis and achieve in-memory performance, FlashGraph uses the semi-external memory model, which stores algorithmic vertex state in memory and edge lists on SSDs. This model enables in-memory vertex communication while scaling to graphs that exceed memory capacity. Because vertex communication is the main source of computational overhead in many graph algorithms, keeping it in memory is essential for performance. To optimize data access on SSDs, FlashGraph deploys two I/O optimizations: it accesses only the edge lists required by the application, and it conservatively merges I/O requests to achieve higher I/O throughput and reduce the CPU overhead caused by I/O.
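To make the model concrete, here is a rough Python sketch of the two ideas: per-vertex state kept in memory while edge lists are read from a file on SSD, and nearby read requests conservatively merged into one. This is only an illustration under assumed data layouts; FlashGraph itself is written in C++ and its on-disk format and I/O stack are far more sophisticated.

    # Illustrative sketch of the semi-external memory model: vertex state lives in
    # RAM, edge lists stay on "SSD" (a file) and are fetched only when needed.
    # The file layout and 4-byte neighbour ids are assumptions for this sketch.
    import io
    import struct

    class SemiExternalGraph:
        def __init__(self, edge_file, index):
            self.edge_file = edge_file   # file-like object holding all edge lists
            self.index = index           # vertex id -> (byte offset, edge count), in RAM
            self.state = {}              # per-vertex algorithmic state, in RAM

        def edge_requests(self, vertex_ids):
            """Build one (offset, length) read request per requested vertex."""
            reqs = []
            for v in sorted(vertex_ids):
                off, n = self.index[v]
                reqs.append((off, n * 4))          # 4 bytes per neighbour id
            return reqs

        def merge_requests(self, reqs, gap=4096):
            """Conservatively merge requests that sit close together on disk."""
            merged = []
            for off, length in reqs:
                if merged and off - (merged[-1][0] + merged[-1][1]) <= gap:
                    prev_off, prev_len = merged[-1]
                    merged[-1] = (prev_off, off + length - prev_off)
                else:
                    merged.append((off, length))
            return merged

        def read_edges(self, off, length):
            """Issue one read and decode the neighbour ids it covers."""
            self.edge_file.seek(off)
            data = self.edge_file.read(length)
            return struct.unpack("%di" % (len(data) // 4), data)

    # Tiny usage example with an in-memory stand-in for the SSD: the two adjacent
    # edge lists are fetched with a single merged read.
    neighbours = struct.pack("6i", 1, 2, 3, 0, 2, 4)
    g = SemiExternalGraph(io.BytesIO(neighbours), index={0: (0, 3), 1: (12, 3)})
    reqs = g.merge_requests(g.edge_requests([0, 1]))
    print([g.read_edges(off, length) for off, length in reqs])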

The graph format used by FlashGraph is designed for both efficiency and flexibility. All graph algorithms in FlashGraph use the same graph format, so each graph only needs to be converted into the format once and loaded to SSDs once. FlashGraph stores both in-edges and out-edges in a graph image. In the Web graph, an out-edge is a hyperlink from one Web page to another, and an in-edge is the reverse of a hyperlink. It is necessary to store each edge twice for a directed graph because some graph algorithms require in-edges, some require out-edges and some require both. For efficiency, the in-edges and out-edges of a vertex are stored separately. This reduces data access from SSDs if an algorithm requires only one type of edge.
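As a toy illustration of this separation, the sketch below converts a directed edge list into two CSR-style arrays, one holding out-edges (sorted by source) and one holding in-edges (sorted by destination); the layout is made up for clarity and is not FlashGraph’s actual on-disk format.

    # Toy conversion sketch: keep out-edges and in-edges of a directed graph in two
    # separate CSR-style structures so an algorithm can read only the type it needs.
    from collections import defaultdict

    def build_image(edges, n_vertices):
        out_lists = defaultdict(list)   # source -> destinations (hyperlinks)
        in_lists = defaultdict(list)    # destination -> sources (reverse links)
        for src, dst in edges:
            out_lists[src].append(dst)
            in_lists[dst].append(src)

        def to_csr(adj):
            # Per-vertex offsets into one flat neighbour array.
            offsets, neighbours = [0], []
            for v in range(n_vertices):
                neighbours.extend(sorted(adj[v]))
                offsets.append(len(neighbours))
            return offsets, neighbours

        return to_csr(out_lists), to_csr(in_lists)

    out_csr, in_csr = build_image([(0, 1), (0, 2), (2, 1)], n_vertices=3)
    print(out_csr)   # ([0, 2, 2, 3], [1, 2, 1])
    print(in_csr)    # ([0, 0, 2, 3], [0, 2, 0])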

FlashGraph provides a very flexible vertex-centric programming interface and supports a variety of graph algorithms. The vertex-centric programming interface allows programmers to “think like a vertex”: each vertex maintains some algorithmic state and performs user-defined computation independently. In FlashGraph, a vertex can communicate with any other vertex through message passing and read the edge list of any vertex from SSDs. We have implemented a set of graph algorithms such as breadth-first search, PageRank, connected components and triangle counting. All of these graph algorithms implemented in FlashGraph can run on the page-level Web graph on a single commodity machine and complete at an unprecedented speed, as shown in the table below. The performance results also show that FlashGraph has a very short initialization time even on this massive graph.
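To give a flavour of the “think like a vertex” style, here is a minimal in-memory BFS sketch in which each active vertex sends distance messages to its out-neighbours; it is only an analogy, since FlashGraph’s real interface is a C++ API whose edge lists live on SSDs.

    # Minimal vertex-centric BFS sketch: every vertex keeps its own state and, when
    # active, sends messages to its neighbours; messages activate vertices for the
    # next round. Plain Python analogy, not FlashGraph's API.
    def vertex_centric_bfs(out_edges, source):
        distance = {v: None for v in out_edges}   # per-vertex algorithmic state
        distance[source] = 0
        active = {source}                          # vertices with work this round
        while active:
            messages = {}                          # vertex -> proposed distance
            for v in active:                       # user-defined compute step
                for u in out_edges[v]:
                    d = distance[v] + 1
                    if u not in messages or d < messages[u]:
                        messages[u] = d
            active = set()
            for u, d in messages.items():          # deliver messages
                if distance[u] is None or d < distance[u]:
                    distance[u] = d
                    active.add(u)
        return distance

    # Distances from vertex 0 in a tiny directed graph.
    print(vertex_centric_bfs({0: [1, 2], 1: [3], 2: [3], 3: []}, source=0))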

Algorithm                     Runtime (sec)   Init time (sec)   Memory (GB)
BFS                           298             30                22
Betweenness                   595             33                81
Triangle counting             7818            31                55
Weakly connected components   461             32                47
PageRank (30 iterations)      2041            33                46
Scan statistics               375             58                83


A more detailed description of FlashGraph’s design can be found in the paper published at FAST ’15.

We further explore community detection with FlashGraph on billion-node graphs. Here we detect communities among only the active vertices. The activity level of a vertex is measured by a locality statistic (the number of edges in the neighborhood of a vertex). Again, we use the large Web graph to demonstrate the scalability and accuracy of our procedure. The key is to quickly identify the most active vertices in the graph; having found them, we then cluster them into active communities. In the experiment in our paper, we identify the 2000 most active vertices in the Web graph and discover five communities. The sizes of communities 1 to 5 are n1 = 35, n2 = 1603, n3 = 199, n4 = 42 and n5 = 121, respectively. Community 1 is a collection of websites that are all developed, sold or to be sold by an Internet media company, networkmedia. Community 2 consists entirely of hyperlinks extracted from a single pay-level-domain adult website. In community 3, most links are to social media websites in everyday use, such as WordPress.org and Google. Community 4 consists of websites related to online shopping, such as the shopping giant Amazon and the bookseller AbeBooks. Community 5 is another collection of 121 adult web pages, where each web page in this cluster comes from a different pay-level domain. In summary, the top five active communities in the Web graph are grouped with high topical similarity.
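For readers who want to see the statistic itself, here is a small in-memory sketch of the locality (scan) statistic on a toy undirected graph; the numbers reported above come from running the real computation inside FlashGraph over the full 129-billion-edge Web graph.

    # Sketch of the locality (scan) statistic: for each vertex, count the edges in
    # the subgraph induced by the vertex and its neighbours. Toy in-memory version.
    def locality_statistic(adj):
        scores = {}
        for v, neigh in adj.items():
            ball = set(neigh) | {v}                 # the 1-hop neighbourhood of v
            edges = 0
            for u in ball:
                edges += sum(1 for w in adj[u] if w in ball)
            scores[v] = edges // 2                  # each undirected edge counted twice
        return scores

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    scores = locality_statistic(adj)
    most_active = sorted(scores, key=scores.get, reverse=True)
    print(scores)        # {0: 3, 1: 3, 2: 4, 3: 1}
    print(most_active)   # vertex 2 is the most active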

Active community detection is one application that demonstrates the power of FlashGraph. We look forward to seeing more cases where people use FlashGraph to mine massive graphs. We are happy to help others develop algorithms to explore the Web graph as well as other graphs of similar or even larger size.

5 Good Reads in Big Open Data: Feb 20 2015

  1. Why The Open Data Platform Is Such A Big Deal for Big Data – via Pivotal P.O.V:

    A thriving ecosystem is the key for real viability of any technology. With lots of eyes on the prize, the technology becomes more stable, offers more capabilities, and importantly, supports greater interoperability across technologies, making it easier to adopt and use, in a shorter amount of time. By creating a formal organization, the Open Data Platform will act as a forcing function to accelerate the maturation of an ecosystem around Big Data.


  2. Machine Learning Could Upend Local Search – via Streetfight: From the Chairman of Common Crawl’s Board of Directors (and Factual CEO) Gil Elbaz on the future of search

  3. On opening up libraries with linked data – via Library Journal: While the rest of the web is turning into the “Web of Data,” libraries and catalogs are (partly for reasons of a closed culture) struggling to keep up

  4. Interactive map: where are we driving, busing, cabbing, walking to work? – via Flowing Data:
    Image via Flowing Data

  5. On the ongoing debate over the possible dangers of Artificial Intelligence- via Scientific American:

    Current efforts in areas such as computational ‘deep-learning’ involve algorithms constructing their own probabilistic landscapes for sifting through vast amounts of information. The software is not necessarily hard-wired to ‘know’ the rules ahead of time, but rather to find the rules or to be amenable to being guided to the rules – for example in natural language processing. It’s incredible stuff, but it’s not clear that it is a path to AI that has equivalency to the way humans, or any sentient organisms, think. This has been hotly debated by the likes of Noam Chomsky (on the side of skepticism) and Peter Norvig (on the side of enthusiasm). At a deep level it is a face-off between science focused on underlying simplicity, and science that says nature may not swing that way at all.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

This is a guest blog post by Ross Fairbanks

Ross Fairbanks is a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing. This guest post describes his open data project wikireverse.org and why he built it.


What is WikiReverse?

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages. The analysis produced 36 million links to 4 million Wikipedia articles. Most of the results are from English Wikipedia (which had 32 million links), followed by Spanish, Indonesian and German. In total there are results for 283 languages.
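To show the core idea, here is a rough sketch that scans a single Common Crawl WAT metadata file and counts links pointing at Wikipedia articles. It uses the warcio library and assumes the usual WAT JSON layout; it is not the actual WikiReverse pipeline, which is described in the data pipeline post linked below [5].

    # Rough sketch: count links to Wikipedia articles in one Common Crawl WAT file.
    # Assumes the warcio library and the WAT JSON layout shown in the .get() chain;
    # the real WikiReverse pipeline is a separate implementation (see [5]).
    import json
    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    def count_wikipedia_links(wat_path):
        counts = Counter()
        with open(wat_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "metadata":
                    continue
                meta = json.loads(record.content_stream().read())
                links = (meta.get("Envelope", {})
                             .get("Payload-Metadata", {})
                             .get("HTTP-Response-Metadata", {})
                             .get("HTML-Metadata", {})
                             .get("Links", []))
                for link in links:
                    url = link.get("url", "")
                    if ".wikipedia.org/wiki/" in url:
                        counts[url] += 1
        return counts

    # Example (the path is a placeholder for a downloaded .wat.gz file):
    # print(count_wikipedia_links("example.wat.gz").most_common(10))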

I first heard about Common Crawl in a blog post by Steve Salevan, MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl [2]. Running Steve’s code deepened my interest in the project. What I like most is the efficiency saving of a large, web-scale crawl that anyone can access. Attempting to crawl the same volume of web pages myself would have been vastly more expensive and time consuming.

I found that the data can be processed relatively cheaply, as it cost just $64 to process the metadata for 3.6 billion pages. This was achieved by using spot instances, which is the spare server capacity that Amazon Web Services auctions off when demand is low. This saved $115 compared to using full price instances.

There is great value in the Common Crawl archive; however, that value is difficult to see without an interface to the data. It can be hard to visualize the possibilities and what can be done with it. For this reason, my project runs an analysis over an entire crawl, with a resulting site that allows the findings to be viewed and searched.

I chose to look at reverse links because, despite its relatively simple approach, it exposes interesting data that is normally deeply hidden. Wikipedia articles are often cited on the web and they rank highly in search results. I was interested in seeing how many links these articles have and what types of sites are linking to them.

A great benefit of working with an open dataset like Common Crawl’s is that WikiReverse results can be released very quickly to the public. Already, Gianluca Demartini from the University of Sheffield has released Who links to Wikipedia? [3] on the Wikimedia blog. This is an analysis of which top-level domains appear in the results. It is encouraging to see the interest in open data projects, and hopefully more analyses of this type will be done.

Choosing Wikipedia also means the project can continue to benefit from the wide range of open data they release. The DBpedia [4] project uses raw data dumps released by Wikipedia and creates structured datasets for many aspects of data, including categories, images and geographic locations. I plan on using DBpedia to categorize articles in WikiReverse.

The code developed to analyze the data is available on GitHub. I’ve written a more detailed post on my blog about the data pipeline [5] that was developed to generate the data. The full dataset can be downloaded using BitTorrent. The data is 1.1 GB when compressed and 5.4 GB when extracted. Hopefully this will help others build their own projects using the Common Crawl data.


[1] https://wikireverse.org/
[2] http://blog.commoncrawl.org/2011/12/mapreduce-for-the-masses/
[3] http://blog.wikimedia.org/2015/02/03/who-links-to-wikipedia/
[4] http://dbpedia.org/About
[5] https://rossfairbanks.com/2015/01/23/wikireverse-data-pipeline.html

5 Good Reads in Big Open Data: Feb 13 2015

  1. What does it mean for the Open Web if users don’t know they’re on the internet? – via QUARTZ:

    This is more than a matter of semantics. The expectations and behaviors of the next billion people to come online will have profound effects on how the internet evolves. If the majority of the world’s online population spends time on Facebook, then policymakers, businesses, startups, developers, nonprofits, publishers, and anyone else interested in communicating with them will also, if they are to be effective, go to Facebook. That means they, too, must then play by the rules of one company. And that has implications for us all.


  2. Hard Drive Data Sets – via Backblaze: Backblaze provides online backup services storing data on over 41,000 hard drives ranging from 1 terabyte to 6 terabytes in size. They have released an open, downloadable dataset on the reliability of these drives.

  3. The Open Source Question: critically important web infrastructure is woefully underfunded – via Slate: on the strange dichotomy of Silicon Valley: a “hypercapitalist steamship powered by its very antithesis”

  4. February 21st is Open Data Day – via Spatial Source: use this interactive map to find an Open Data event near you (or add your own)
    Image Source: opendataday.org/map

  5. Security is at the heart of the web – via O’Reilly Radar:

    …we want to be able to go to sleep without worrying that all of those great conversations on the open web will endanger the rest of what we do.

    Making the web work has always been a balancing act between enabling and forbidding, remembering and forgetting, and public and private. Managing identity, security, and privacy has always been complicated, both because of the challenges in each of those pieces and the tensions among them.

    Complicating things further, the web has succeeded in large part because people — myself included — have been willing to lock their paranoias away so long as nothing too terrible happened.

    Follow us @CommonCrawl on Twitter for the latest in Big Open Data

5 Good Reads in Big Open Data: Feb 6 2015

  1. The Dark Side of Open Data – via Forbes:

    There’s no reason to doubt that opening to the public of data previously unreleased by governments, if well managed, can be a boon for the economy and, ultimately, for the citizens themselves. It wouldn’t hurt, however, to strip out the grandiose rhetoric that sometimes surrounds them, and look, case by case, at the contexts and motivations that lead to their disclosure.


  2. Bigger Data; Same Laptop – via Frank McSherry: throwing more machines at a problem isn’t necessarily the best approach; a laptop can outperform clusters when used effectively. This post uses the Web Data Commons 128-billion-edge Hyperlink Graph, created using Common Crawl data, to showcase that.

  3. Fixing Verizon’s permacookie – via Slate: 9 lines of code could make Verizon’s controversial user-tracking system slightly less invasive and much less creepy.

  4. Interact with the Committee to Protect Journalists’ Data – via Reuters Graphics: an interactive map of journalists killed, over time and by location
    Source: Committee to Protect Journalists
    Graphic by Matthew Weber/Reuters Graphics

  5. The EU wants the rest of the world to forget too – via The New York Times:

    Countries have different standards for acceptable speech and for invasions of privacy. American libel laws, for example, are much more permissive than those in Britain. That’s why authors sometimes find it easier to have some books published in the United States than in Britain. There is no doubt that the Internet has made it harder for governments to enforce certain rules and laws because information is not easily contained within borders. But that does not justify restricting the information available to citizens of other countries.

    Follow us @CommonCrawl on Twitter for the latest in Big Open Data

The Promise of Open Government Data & Where We Go Next

One of the biggest boons for the Open Data movement in recent years has been the enthusiastic support from all levels of government for releasing more, and higher quality, datasets to the public. In May 2013, the White House released its Open Data Policy and announced the launch of Project Open Data, a repository of tools and information (which anyone is free to contribute to) that helps government agencies release data that is “available, discoverable, and usable.”

Since 2013, many enterprising government leaders across the United States at the federal, state, and local levels have responded to the President’s call to see just how far Open Data can take us in the 21st century. Following the White House’s groundbreaking appointment in 2009 of Aneesh Chopra as the country’s first Chief Technology Officer, many local and state governments have created similar positions. San Francisco last year named its first Chief Data Officer, Joy Bonaguro, and released a strategic plan to institutionalize Open Data in the city’s government. Los Angeles’ new Chief Data Officer, Abhi Nemani, was formerly at Code for America and hopes to make LA a model city for open government. His office recently launched an Open Data portal along with other programs aimed at fostering a vibrant data community in Los Angeles.

Open government data is powerful because of its potential to reveal information about major trends and to inform questions pertaining to the economic, demographic, and social makeup of the United States. A second, no less important, reason why open government data is powerful is its potential to help shift the culture of government toward one of greater collaboration, innovation, and transparency.

These gains are encouraging, but there is still room for growth. One pressing issue is for more government leaders to establish Open Data policies that specify the type, format, frequency, and availability of the data that their offices release. Open Data policy ensures that government entities not only release data to the public, but release it in useful and accessible formats.

Only nine states currently have formal Open Data policies, although at least two dozen have some form of informal policy and/or an Open Data portal. Agencies and state and local governments should not wait too long to standardize their policies on releasing Open Data; waiting will severely limit Open Data’s potential. There is not much that a data analyst can do with a PDF.

One area of great potential is for data whizzes to pair open government data with web crawl data. Government data makes for a natural complement to other big datasets, like Common Crawl’s corpus of web crawl data, that together allow for rich educational and research opportunities. Educators and researchers should find Common Crawl data a valuable complement to government datasets when teaching data science and analysis skills. There is also vast potential to pair web crawl data with government data to create innovative social, business, or civic ventures.

Innovative government leaders across the United States (and the world!) and enterprising organizations like Code for America have laid an impressive foundation that others can continue to build upon as more and more government data is released to the public in increasingly usable formats. Common Crawl is encouraged by the rapid growth of a relatively new movement and we are excited to see the collaborations to come as Open Government and Open Data grow together.


Allison Domicone was formerly a Program and Policy Consultant to Common Crawl and previously worked for Creative Commons. She is currently pursuing a master’s degree in public policy from the Goldman School of Public Policy at the University of California, Berkeley.

December 2014 Crawl Archive Available

The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-52/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!

November 2014 Crawl Archive Available

The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2014-49/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By prepending either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.

Thanks again to blekko for their ongoing donation of URLs for our crawl!