Web image size prediction for efficient focused image crawling

This is a guest blog post by Katerina Andreadou.
Katerina is a research assistant at CERTH, where she specializes in multimedia analysis and web crawling.


In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck is the fetching of image content, since for each web page a large number of HTTP requests must be issued to download all of its image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that the HTML img tag often carries no dimension information, an image crawler that wants to filter out small images still has to issue a GET request and download each file before deciding whether to index it.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and on information extracted from the surrounding HTML code. To do so, we needed a large number of images accompanied by their HTML metadata for training and testing our image size prediction system. To this end we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since, for technical and financial reasons, it was impractical and unnecessary to download the whole dataset, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages, together with metadata extracted from the surrounding HTML elements.
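For readers who want to reproduce this kind of extraction on a smaller scale, the sketch below shows only the core parsing step (not the EMR job itself): it walks one locally downloaded WARC file and emits every img element together with its surrounding HTML metadata. It assumes the warcio and beautifulsoup4 packages and a hypothetical local file name.

    # Minimal local sketch of the parsing step (not the actual EMR/MapReduce job).
    # Assumes: pip install warcio beautifulsoup4, and a WARC file downloaded locally.
    from bs4 import BeautifulSoup
    from warcio.archiveiterator import ArchiveIterator

    def extract_images(warc_path):
        with open(warc_path, 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != 'response':
                    continue
                page_url = record.rec_headers.get_header('WARC-Target-URI')
                soup = BeautifulSoup(record.content_stream().read(), 'html.parser')
                for img in soup.find_all('img'):
                    yield {
                        'page': page_url,
                        'src': img.get('src'),
                        'width': img.get('width'),    # frequently missing in practice
                        'height': img.get('height'),
                        'alt': img.get('alt'),
                    }

    # Example (hypothetical file name):
    # for item in extract_images('sample-2014-23.warc.gz'):
    #     print(item)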

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

  • 3.6 billion unique images
  • 78.5 million unique domains
  • ≈8% of the images are big (width and height bigger than 400 pixels)
  • ≈40% of the images are small (width and height smaller than 200 pixels)
  • ≈20% of the images have no dimension information

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

Textual Features approach

An n-gram in our case is a contiguous sequence of n characters from the given image URL. The main hypothesis we make is that URLs corresponding to small and big images differ substantially in terms of wording. For instance, URLs of small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. URLs of big images, on the other hand, tend to lack these words and typically contain others. If this assumption is correct, it should be possible for a supervised machine learning method to separate items from the two distinct classes.

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, the correlation of the n-grams with the two classes, BIG and SMALL, is not considered. For instance, if an n-gram is very frequent in both classes, it makes sense to discard it and not use it as a feature. On the other hand, if an n-gram is not very frequent but is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of each n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency.

In a variation of the aforementioned approach, we decided to replace n-grams with the tokens produced by splitting the image URLs by all non-alphanumeric characters. The employed regular expression in Java is \\W+ and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We will refer to this method as TOKENS-trf.
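As a rough illustration of the two textual feature sets, the sketch below extracts character n-grams and \W+ tokens from image URLs and keeps the n-grams or tokens whose relative document frequency differs most between the BIG and SMALL training URLs. The exact selection score used in the paper is not given in this post, so the scoring function here is an assumption.

    import re
    from collections import Counter

    def char_ngrams(url, n=4):
        # contiguous character n-grams of the image URL (n is a free parameter)
        return [url[i:i + n] for i in range(len(url) - n + 1)]

    def tokens(url):
        # TOKENS variant: split the URL on non-alphanumeric characters (regex \W+)
        return [t for t in re.split(r'\W+', url) if t]

    def select_features(big_urls, small_urls, extractor, top_k=1000):
        # Keep the n-grams/tokens whose relative document frequency differs most
        # between the two classes; the precise formula is an assumption.
        big_df = Counter(g for u in big_urls for g in set(extractor(u)))
        small_df = Counter(g for u in small_urls for g in set(extractor(u)))
        def score(g):
            return abs(big_df[g] / len(big_urls) - small_df[g] / len(small_urls))
        vocabulary = set(big_df) | set(small_df)
        return sorted(vocabulary, key=score, reverse=True)[:top_k]

    def feature_vector(url, selected, extractor):
        counts = Counter(extractor(url))
        return [counts[g] for g in selected]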

Non-textual features approach

Our alternative, non-textual approach does not rely on the image URL text but on metadata that can be extracted from the image HTML element. These features were chosen because they can reveal cues about the image dimensions. For instance, the first five features correspond to different image suffixes; they were selected because most real-world photos are in JPG or PNG format, whereas BMP and GIF files usually correspond to icons and graphics. Additionally, a real-world photo is more likely to have alternate (alt) text or surrounding parent text than a background graphic or a banner.
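A sketch of what such a feature extractor might look like is given below, for an img element parsed with BeautifulSoup. Only a handful of illustrative features are shown; the paper's full 23-feature set and its exact definitions are not reproduced here.

    from urllib.parse import urlparse

    def nontextual_features(img):
        # img is a BeautifulSoup <img> tag; these features are illustrative only.
        src = (img.get('src') or '').lower()
        path = urlparse(src).path
        return [
            int(path.endswith('.jpg') or path.endswith('.jpeg')),  # photo-like suffixes
            int(path.endswith('.png')),
            int(path.endswith('.gif')),                            # often icons/graphics
            int(path.endswith('.bmp')),
            int(path.endswith('.svg')),
            int(bool(img.get('alt'))),                             # alt text present
            int(img.parent is not None
                and bool(img.parent.get_text(strip=True))),        # parent text present
        ]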

Hybrid approach

The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when aggregating their results as they rely on different kinds of features: the n-gram classifier might be best at classifying a certain kind of images with specific image URL wording, while the non-textual features classifier might be best at classifying a different kind of images with more informative HTML metadata.
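The post does not specify how the two classifiers' outputs are combined, so the fusion rule below (averaging the predicted probabilities of the BIG class) is purely an illustrative assumption.

    def hybrid_predict(p_textual_big, p_nontextual_big):
        # p_*_big: each classifier's predicted probability of the BIG class.
        # Averaging is an assumption; the paper may use a different fusion rule.
        p_big = (p_textual_big + p_nontextual_big) / 2.0
        return 'BIG' if p_big >= 0.5 else 'SMALL'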

Evaluation

For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented on top of the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to strike the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Typical values are 10, 30 and 100, and very few problems would demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they also considerably increase the training time.
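The experiments in the post were run with Weka; purely as a hedged stand-in, the snippet below shows an equivalent experiment in scikit-learn, sweeping the number of trees as in Table 1. Variable names and the data-loading step are assumptions.

    # Hedged stand-in for the Weka experiments: train a Random Forest and
    # report per-class F-measures for several tree counts.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    def train_and_eval(X_train, y_train, X_test, y_test, n_trees):
        clf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        return {
            'F1_small': f1_score(y_test, pred, pos_label='SMALL'),
            'F1_big': f1_score(y_test, pred, pos_label='BIG'),
        }

    # for n_trees in (10, 30, 100):
    #     print(n_trees, train_and_eval(X_train, y_train, X_test, y_test, n_trees))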

The comparative results for different numbers of trees for the Random Forest algorithm are displayed in Table 1. The first column contains the method name, the second the number of trees used in the RF classifier, the third the number of features used, and the remaining columns contain the F-measures for the SMALL class, the BIG class and their average. The reported results lead to several interesting conclusions.

  • Doubling the number of n-gram features improves the performance in all cases.
  • So does adding more trees to the Random Forest classifier.
  • The hybrid method outperforms all standalone methods, its best F-score being 4% higher than the best textual features score.

Table 1: Comparative results (F-measure)

Method       | RF trees | Features | F1 small | F1 big | F1 avg
TOKENS-trf   | 10       | 1000     | 0.876    | 0.867  | 0.871
TOKENS-trf   | 30       | 1000     | 0.887    | 0.883  | 0.885
TOKENS-trf   | 100      | 1000     | 0.894    | 0.891  | 0.893
TOKENS-trf   | 10       | 2000     | 0.875    | 0.864  | 0.870
TOKENS-trf   | 30       | 2000     | 0.888    | 0.828  | 0.885
TOKENS-trf   | 100      | 2000     | 0.897    | 0.892  | 0.895
NG-tsrf-idf  | 10       | 1000     | 0.876    | 0.872  | 0.874
NG-tsrf-idf  | 30       | 1000     | 0.883    | 0.881  | 0.882
NG-tsrf-idf  | 100      | 1000     | 0.886    | 0.884  | 0.885
NG-tsrf-idf  | 10       | 2000     | 0.883    | 0.878  | 0.881
NG-tsrf-idf  | 30       | 2000     | 0.891    | 0.888  | 0.890
NG-tsrf-idf  | 100      | 2000     | 0.894    | 0.891  | 0.892
features     | 10       | 23       | 0.848    | 0.846  | 0.847
features     | 30       | 23       | 0.852    | 0.852  | 0.852
features     | 100      | 23       | 0.853    | 0.853  | 0.853
hybrid       | -        | -        | 0.935    | 0.935  | 0.935

Acknowledgement

This work was carried out at the Multimedia Knowledge and Social Media Analytics Lab in collaboration with Symeon Papadopoulos in the context of the REVEAL FP7 project.

July 2015 Crawl Archive Available

The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-32/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the July 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

June 2015 Crawl Archive Available

The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-27/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the June 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

May 2015 Crawl Archive Available

The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-22/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the May 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

April 2015 Crawl Archive Available

The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-18/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the April 2015 Common Crawl Index, introduced last month by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

March 2015 Crawl Archive Available

The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-14/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the March 2015 Common Crawl Index, introduced last month by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

Announcing the Common Crawl Index!

This is a guest post by Ilya Kreymer.
Ilya is a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl. He previously worked at the Internet Archive and led the Wayback Machine development, which included building large indexes of WARC files. Ilya is currently developing a new set of archive replay and access tools and an impressive new on-demand archiving service, webrecorder.io, that allows anyone to create a high-fidelity web archive of their own. Check out his exciting projects, including our new index and query api in the post below.


We are pleased to announce a new index and query api system for Common Crawl.

The raw index data is available, per crawl, at:
s3://aws-publicdatasets/common-crawl/cc-index/collections/CC-MAIN-YYYY-WW/indexes/

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

To make working with the index a bit simpler, an api and service for querying the index is available at: http://index.commoncrawl.org. The index is fully functional, but we are looking for feedback to improve its usefulness and usability in future updates.

Index Format
The index format is relatively simple: it consists of a plaintext index (one line per entry) compressed into gzipped chunks, plus a secondary index of the compressed chunks. This format is often called the ‘ZipNum’ CDX format and it is the same format that is used by the Wayback Machine at the Internet Archive.

Index Query API
To make working with the index a bit easier, the main index site (http://index.commoncrawl.org) contains a readily accessible api for querying the index.

The api is a variation of the ‘cdx server api’ or ‘capture index server api’ that was originally built for the wayback machine.

The site is built using pywb (https://github.com/ikreymer/pywb), a new collection of web archive replay and query tools, including the index query component.

The index can be queried by making a request to a specific collection.

For example, the following query looks up “wikipedia.org” in the CC-MAIN-2015-11 (Feb 2015) crawl:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org

The above query will only retrieve captures from the exact url “wikipedia.org/”, but a frequent use case may be to retrieve all urls from a path or all subdomains.

This can be done by using wildcard queries:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org/*
or
http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/

Pagination
For most prefix or domain-prefix queries such as these, it is not feasible to retrieve all the results at once, so only the first page of results (by default, up to 15000) is returned.

The total number of pages can be retrieved with the showNumPages query:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true

This query returns:

{"blocks": 4942, "pages": 989, "pageSize": 5}

This indicates that there are 989 total pages, at 5 compressed index blocks per page!

Thus, to get all of *.wikipedia.org, one would need to perform the query for each page:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=0

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&page=988

This allows for the query process to be performed in parallel. For example, it should be possible to run a MapReduce job which computes the number of pages, creates a list of urls, and then runs the query in parallel.
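As a concrete sketch of that workflow (using the Python requests library and the query parameters shown above), the following fetches the page count and then iterates over the pages sequentially; in a real job the per-page requests would be issued in parallel.

    import requests

    API = 'http://index.commoncrawl.org/CC-MAIN-2015-11-index'

    def fetch_all_pages(url_pattern):
        # 1) ask how many pages the query spans
        info = requests.get(API, params={'url': url_pattern,
                                         'showNumPages': 'true'}).json()
        # 2) fetch each page of index lines; parallelize this loop in practice
        for page in range(info['pages']):
            resp = requests.get(API, params={'url': url_pattern, 'page': page})
            for line in resp.text.splitlines():
                yield line

    # Example:
    # for line in fetch_all_pages('*.wikipedia.org/'):
    #     print(line)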

Command-Line Client
For smaller use cases, a simple client-side tool is available to simplify this process: https://github.com/ikreymer/cdx-index-client. It is a simple Python script which uses the pagination api to perform a parallel query on a local machine.

First, a good idea is to verify the number of pages:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org --show-num-pages
809

To perform the query, simply run:
./cdx-index-client.py -c CC-MAIN-2015-11 *.wikipedia.org -z -d ./wikipedia-index

This query will fetch all pages of the *.wikipedia.org index into a ./wikipedia-index directory and keep the data compressed (-z flag). For a full set of options, you may run
./cdx-index-client.py --help

The script will print out an update of the progress:

2015-04-07 08:35:18,686: [INFO]: Fetching 989 pages of *.wikipedia.org
2015-04-07 08:35:45,734: [INFO]: 1 page(s) of 989 finished
2015-04-07 08:35:46,577: [INFO]: 2 page(s) of 989 finished
2015-04-07 08:35:46,579: [INFO]: 3 page(s) of 989 finished

Adjusting Page Size
It is also possible to adjust the page size to increase or decrease the number of “blocks” in the page. (Each block is a compressed chunk and cannot be split further.)
The pageSize query param can be used to set the page size in blocks (the default is 5 blocks per page). For example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true
{"blocks": 4942, "pages": 989, "pageSize": 5}

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=*.wikipedia.org/&showNumPages=true&pageSize=1
{"blocks": 4942, "pages": 4942, "pageSize": 1}

In general, blocks / pageSize + 1 = pages. Adjusting the page size can help adjust the parallelization and load of the query as needed.

Capture Index JSON (CDXJ) Line Format
The raw index format (stored and returned from the query api) is as follows:

org,wikipedia)/ 20150227035757 {"url": "http://www.wikipedia.org/", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "length": "11996", "offset": "853671193", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz"}

This format consists of a ‘url<space>timestamp<space>’ header followed by a json dictionary. The header is used to ensure the lines are sorted by url key and timestamp.

Adding the output=json option to the query will ensure the full line is json. Example:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=wikipedia.org&output=json&limit=1
{"urlkey": "org,wikipedia)/", "timestamp": "20150227035757", "url": "http://www.wikipedia.org/", "length": "11996", "filename": "common-crawl/crawl-data/CC-MAIN-2015-11/segments/1424936460472.17/warc/CC-MAIN-20150226074100-00147-ip-10-28-5-156.ec2.internal.warc.gz", "digest": "PQE67QMKFGSZJU5SR2ESR7CMBKLSSBAJ", "offset": "853671193"}

Currently, the index contains the urlkey (a canonicalized, reverse-order form of the url), the timestamp, the url, and the WARC filename, offset and length, as well as a checksum (digest) of the content. The digest can be used to identify duplicate captures, but also adds significantly to the index size and may be removed in future versions. Other fields may be added to the json dictionary as needed also.
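A common follow-up step is to use the filename, offset and length fields to pull the referenced WARC record directly from S3 with an HTTP Range request. The sketch below (using the requests library and the aws-publicdatasets HTTP prefix mentioned elsewhere on this page) relies on the fact that each capture is stored as its own gzip member, so the returned byte range can be decompressed on its own.

    import gzip
    import json
    import requests

    PREFIX = 'https://aws-publicdatasets.s3.amazonaws.com/'

    def parse_cdxj(line):
        # 'urlkey<space>timestamp<space>{json...}'
        urlkey, timestamp, fields = line.split(' ', 2)
        return urlkey, timestamp, json.loads(fields)

    def fetch_warc_record(fields):
        # Request only the bytes of this capture; the slice is a complete
        # gzip member, so it decompresses into a standalone WARC record.
        offset, length = int(fields['offset']), int(fields['length'])
        headers = {'Range': 'bytes={}-{}'.format(offset, offset + length - 1)}
        resp = requests.get(PREFIX + fields['filename'], headers=headers)
        return gzip.decompress(resp.content)

    # Example:
    # _, _, fields = parse_cdxj(cdxj_line)
    # record_bytes = fetch_warc_record(fields)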

It is possible to only select certain fields from the query with the fl field. For example, the following query will return only the url:

http://index.commoncrawl.org/CC-MAIN-2015-11-index?url=http://wikipedia.org/&fl=url

http://wikipedia.org/

or via command-line tool:

./cdx-index-client -c CC-MAIN-2015-11 http://wikipedia.org --fl url

Multiple fields can also be specified, e.g. fl=url,length to return only the url and the WARC record length.

For a full reference of available query params, consult the latest CDX Server API reference.

Additional Java Tools
For Java users wishing to access the raw index, the IIPC webarchive-commons has support for reading the ZipNum format. Additionally, the openwayback-cdx-server provides the Java implementation of the original cdx server api. However, some modifications would be required to that codebase to support the cdx json format and it has not been tested with this index.

Building the Index / Running CDX Index Server
All the tools for building the index are also available at: https://github.com/ikreymer/webarchive-indexing

The index was built using EMR and the mrjob python library, together with the indexing tools from the pywb project. This should enable others to build the index in the future, or to create customized versions of the index as needed. Please refer to the project for additional reference and do not hesitate to get in touch with any specific questions.

The service running at http://index.commoncrawl.org is also available at:

https://github.com/ikreymer/cc-index-server

To run locally, the secondary index (for binary search) for each collection will need to be fetched locally, while most of the index will be read from S3.

Request for Feedback and Future Plans
Although the index format is pretty well-tested, there is lots of room for customization, especially of the index query api, as well as of which fields to include in the index. Feedback in the form of bug reports, feature requests, questions and suggestions about any aspect of the index is definitely welcome and will help make it even easier to use.

After some additional testing of the newly released indexes, we plan to build an index for older crawls as well. A cumulative index of all data ever crawled by Common Crawl is also under consideration if there is enough interest. We look forward to hearing about any use cases or other feedback that you may have about the indexing project.

Please help us continue our support of great efforts like this by making a donation to the Common Crawl Foundation and follow us @CommonCrawl on Twitter for the latest in Big Open Data.

Evaluating graph computation systems

This is a guest blog post by Frank McSherry.
Frank McSherry is a computer science researcher active in the area of large-scale data analysis. While at Microsoft Research he co-invented differential privacy and led the Naiad streaming dataflow project. His current interests involve understanding and improving performance in scalable data processing systems.

The computer science systems and database research communities are abuzz with graph processing systems: large computer systems specialized to process and analyze very large graphs such as global social networks, internet topologies, and the hyperlink structure of the world-wide web. Academic researchers working in these areas have made significant strides in understanding how to process graph-oriented data, but their ability to validate their techniques at scale has been limited by the difficulty of accessing very large graph datasets. The data made available by Common Crawl and Web Data Commons provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.

Background on graph processing

Graphs are an interesting source of data in which each record (an “edge”) references two distinct entities (called “nodes”). In the case of the web graph, for example, each hyperlink can be viewed as an edge with the source and destination web pages as the nodes it references. Many important social graph analyses are functions of the graph structure: the existence of edges between nodes imposes a constraint on the answer, and many computations can be viewed as maintaining or updating per-node state as a function of the incident edges (and the states of adjacent nodes).

Researchers have since built computational engines based around algorithms in which information flows only along graph edges. These systems distribute the large set of edges across multiple threads, processors, and even computers, and for each edge ensure that the information at one endpoint is shared with the other endpoint. To define a computation, a data analyst then supplies the code for what should happen with this information each time it is presented, for example updating the information maintained by each node to reflect what it has learned from its neighbors. The separation of computational logic from system specifics allows each system to provide scalable implementations of many popular graph algorithms, without requiring the programmer to worry about data distribution, network communication, or recovery in the presence of machine failures.
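As a toy illustration of this edge-centric model (and nothing more than that), the sketch below computes connected components on a single machine by repeatedly flowing labels along the edge list until no node's state changes; the distributed systems discussed here implement the same pattern across many machines.

    def connected_components(edges, num_nodes):
        # Per-node state: a label, initialized to the node's own id.
        labels = list(range(num_nodes))
        changed = True
        while changed:
            changed = False
            for src, dst in edges:          # information flows only along edges
                low = min(labels[src], labels[dst])
                if labels[src] != low or labels[dst] != low:
                    labels[src] = labels[dst] = low
                    changed = True
        return labels

    # Example: connected_components([(0, 1), (1, 2), (3, 4)], 5) -> [0, 0, 0, 3, 3]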

Evaluation at scale

As popular as graph processing systems have become, their evaluation has largely been either on small to medium size data sets, or behind the closed doors of corporate data centers. Evaluation on moderately sized data is not without its use, but it does fail to stress the systems in ways that could inform users, programmers, or administrators faced with truly large data. Worse, from a research perspective, we shouldn’t expect to make progress in improving graph processing systems without relevant evaluation of when they perform well and when they perform badly.

For example, Gonzalez et al. (OSDI 2014) evaluated four of the most popular recent systems used for graph processing (Spark, Giraph, GraphLab, and GraphX) on two graph datasets, each containing more than one billion edges. Their performance numbers on a cluster of 16 machines, comprising 128 processing cores, reveal that their proposed system improved on prior work with the same features, without losing the performance of specialized systems.

Although one billion edges is quite a lot, the datasets still fit comfortably on a modern laptop. To demonstrate the limitations of evaluations on this size of data, we compared the performance numbers reported by Gonzalez et al. with the performance of the same algorithms executed on my laptop, using about one hundred lines of code. Eliding the specific details of the comparison, no graph processing system strictly out-performed the laptop, and some of the systems were consistently slower than the laptop. With a few improvements to the code, the laptop was able to out-perform all systems on all computations on all datasets evaluated by Gonzalez et al.

These conclusions surprised many people, who had assumed that the scalable systems would obviously out-perform a single core on a laptop. Indeed, the main point of these systems is that one can accomplish more by using multiple computers than by using just a single computer. If the systems are not yet accomplishing this goal, doing something you couldn’t otherwise do, it is less clear when and why you would use them, especially given the additional resources (and complexity) they entail. Perhaps one simply should not use them, and should instead do graph analysis on my laptop.

Scaling up evaluation

One common (and fair) response to our initial evaluation was that the datasets we used (those used by the GraphX evaluation) were not sufficiently large to distinguish between good scalable systems and bad scalable systems (including my laptop, presumably). We completely agree (and this was part of the point of using only existing measurements), but researchers struggle to access realistic data at meaningful scales, and the graph datasets used by the GraphX evaluation were among the largest publicly available at the time. Given this difficulty, it is perhaps unfair to blame the researchers for the limits of their evaluation.

This situation has now changed with the data provided by Common Crawl and Web Data Commons (who have processed the former’s crawl data into a compact graph dataset). The graph data they provide is substantially larger, by almost two orders of magnitude, than the graph datasets used by Gonzalez et al. As the dataset is made freely available, researchers can use it to evaluate the improvement their systems provide over simpler single-computer alternatives. If the systems do not improve as much as expected, the researchers can now pinpoint what is slow, and can either improve their system’s implementation or improve the ideas underlying its design. In either case, the state of the art can now advance in a more principled manner.

To get the ball rolling on evaluation, we took the graph data supplied by Web Data Commons, 128 billion edges relating 4 billion nodes, approximately one terabyte uncompressed, and evaluated my laptop’s performance on this dataset. This collection is substantially larger than other datasets, and is near the limit of what can be processed by my laptop. The data required some further transformation and compression both to fit on my laptop’s drive, and to avoid exhausting memory when the computations were executed. But, the computations do run, and they take time that is only slightly longer than what one might expect based only on the increase in the number of edges.

The laptop’s performance is good news for people who want to run graph computations at scale, and a healthy challenge (and opportunity) for implementors of graph processing systems. These measurements provide an excellent baseline and proof of concept for the scalable systems to target. A laptop can perform the large-scale computation, admittedly under duress, and so the scalable systems should be expected both to perform the computation and to improve on my laptop’s performance. If a system’s performance is lacking, a comparative evaluation should reveal which components are limiting the system, and they can be improved.

Advancing graph processing systems

It should be mentioned that there are already several graph processing systems that do improve on laptop-scale performance. Ligra is a shared-memory (single machine) system whose performance appears to scale quite well from a single-core baseline when graphs fit into the computer’s random access memory. For larger graphs, FlashGraph uses an array of solid-state drives to provide performance without requiring nearly as much memory. Finally, Naiad distributes computation across multiple computers, but manages to avoid introducing much overhead when it does, scaling performance up from the appropriate single-machine baseline.

In my opinion, each of these systems represents progress in the state of the art, and we should understand what each has done well (and what each does badly). Given our access to these systems and their ideas, as well as sufficient data to evaluate and distinguish good ideas from bad, there is no obvious reason not to expect the state of the art in graph processing to advance significantly and swiftly. In fact, to the extent that we are serious about performance in graph processing systems, we should demand it.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.

February 2015 Crawl Archive Available

The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and holds over 1.9 billion webpages. The files are located in the aws-publicdatasets bucket at /common-crawl/crawl-data/CC-MAIN-2015-11/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://aws-publicdatasets/ or https://aws-publicdatasets.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

We’re also happy to introduce the new Common Crawl Index by Ilya Kreymer, creator of https://webrecorder.io/. The February 2015 and January 2015 indexes are already featured and the aim will be for indexes to be released alongside crawl archives, offering a new way to explore the dataset. Whilst full details will be released in an upcoming blog post, we’re telling you about it now as we’re interested in hearing feedback from the community!

Please donate to Common Crawl if you appreciate our free datasets! We’re seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

5 Good Reads in Big Open Data: March 26 2015

  1. Analyzing the Web For the Price of a Sandwich - via Yelp Engineering Blog: a Common Crawl use case from the December 2014 Dataset finds 748 million US phone numbers

    I wanted to explore the Common Crawl in more depth, so I came up with a (somewhat contrived) use case of helping consumers find the web pages for local businesses. Yelp has millions of businesses in its index and we like to provide links back to a business’s own web page wherever possible, but there are plenty of cases where we just don’t have that information.

    Let’s try to use mrjob and the Common Crawl to help match businesses from Yelp’s database to the possible web pages for those businesses on the Internet.


  2. Open Source does NOT mean a lack of security - via Information Age: Businesses are increasingly moving to Open Source platforms to reduce costs and improve efficiency; however, many mistakenly believe that Open Source means a tradeoff in security.


  3. Utility Companies should use Machine Learning - via Intelligent Utility: Machine learning can have a huge impact on energy efficiency, customer usage incentive programs and personalizing the customer experience around energy usage.
    Load Curve graph (via Intelligent Utility) demonstrates “Energy Personalities” of customers

  4. QVC loses lawsuit against Resultly in Web Crawl case - via Forbes: under the Computer Fraud & Abuse Act, the courts ruled that Resultly did not demonstrate any intent to damage QVC’s systems, but their overloading of QVC’s servers was a result of “wrinkles in its business operations.”

  5. Can Data Science actually predict the perfect March Madness bracket? - via Sport Techie: (Apparently not)

    Cukierski explains: “It is hard to say how well machine learning has improved forecasts prior to Kaggle; allow people to predict before the beginning of the tournament–make a prediction for every single game that could ever occur in the tournament. However, last year we had around ten teams who beat Vegas odds, which are considered to be state-of-the-art.”

    “So there is something there.”

    Still, they have plenty of people producing predictions, which, statistically, means some of these teams are bound to get lucky. The volume exceeds the propensity for the result to be actualized. Over a short interval of time, though, the execution doesn’t necessarily mean these data scientists should be deemed experts in any fashion.

    In the end, the odds of forecasting a perfect bracket are about as slim as it gets, predicated on as much luck as on data science.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.