Common Crawl - Blog - Web Image Size Prediction for Efficient Focused Image Crawling

‍In the context of using Web image content for analysis and retrieval, it is typically necessary to perform large-scale image crawling. In our web image crawler setup, we noticed that a serious bottleneck pertains to the fetching of image content, since for each web page a large number of HTTP requests need to be issued to download all included image elements. In practice, however, only the relatively big images (e.g., larger than 400 pixels in width and height) are potentially of interest, since most of the smaller ones are irrelevant to the main subject or correspond to decorative elements (e.g., icons, buttons). Given that there is often no dimension information in the HTML img tag of images, to filter out small images, an image crawler would still need to issue a GET request and download the respective files before deciding whether to index them.

To address this limitation, we decided to explore the challenge of predicting the size of images on the Web based only on their URL and information extracted from the surrounding HTML code. In order to do so, we needed a large amount of images accompanied by their HTML metadata with the purpose of training and testing our image size prediction system. To this end we decided to use a sample of the data from the July 2014 Common Crawl set, which is over 266TB in size and contains approximately 3.6 billion web pages. Since for technical and financial reasons, it was impractical and unnecessary to download the whole dataset, we created a MapReduce job to download and parse the necessary information using Amazon Elastic MapReduce (EMR). The setup is based on a blog post by Steve Salevan. The data of interest include all images and videos from all web pages and metadata extracted from the surrounding HTML elements.

To complete the task, we used 50 Amazon EMR medium instances, resulting in 951GB of data in gzip format. The following statistics were extracted from the corpus:

To predict the size of Web images, we came up with three different methodologies, which are analyzed in the rest of this post. This work is described in detail in a paper presented at CBMI 2015 (13th International Workshop on Content-Based Multimedia Indexing). The paper is available online (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7153609).

‍Textual Features Approach

‍An n-gram in our case is a continuous sequence of n characters from the given image URL. The main hypothesis we make is that URLs which correspond to small and big images differ substantially in terms of wording. For instance, URLs from small images tend to contain words such as logo, avatar, small, thumb, up, down, pixels. On the other hand, URLs from big images tend to lack these words and typically contain others. If the assumption is correct, it should be possible for a supervised machine learning method to separate items from the two distinct classes.

The disadvantage of this approach is that, although the frequencies of the n-grams are taken into account, what is not considered is the correlation of the n-grams to the two classes, BIG and SMALL. For instance, if an n-gram is very frequent in both classes, it makes sense to get rid of it and not consider it as feature. On the other hand, if an n-gram is not very frequent, but it is very characteristic of a specific class, we should include it in the feature vector. To this end, we performed feature selection by taking into account the relative frequency of occurrence of the n-gram in the two classes, BIG and SMALL. We refer to this method as NG-trf, standing for term relative frequency. In a variation of the aforementioned approach, we decided to replace n-grams with the tokens produced by splitting the image URLs by all non-alphanumeric characters. The employed regular expression in Java is W+ and the feature extraction process is the same as described above, but with the produced tokens instead of n-grams. We will refer to this method as TOKENS-trf.

‍Non-textual Features Approach

‍Our alternative non-textual approach does not rely on the image URL text, but rather on the metadata, that can be extracted from the image HTML element. The idea behind their choice is for them to reveal cues regarding the image dimensions. For instance, the first five features correspond to different image suffixes and they were selected due to the fact that most real-world photos are in JPG or PNG format, whereas BMP and GIF formats usually point to icons and graphics. Additionally, there is a greater likelihood that a real-world photo has an alternate or parent text than a background graphic or a banner.

‍Hybrid Approach

‍The goal of the hybrid approach is to achieve higher performance by taking into account both textual and non-textual features. Our hypothesis is that the two methods will complement each other when aggregating their results as they rely on different kinds of features: the n-gram classifier might be best at classifying a certain kind of images with specific image URL wording, while the non-textual features classifier might be best at classifying a different kind of images with more informative HTML metadata.

‍Evaluation

‍For training we used one million images (500K small and 500K big) and for testing 200 thousand (100K small and 100K big). The described supervised learning approaches were implemented based on the Weka library. We performed numerous preliminary experiments with different classifiers (LibSVM, Random Tree, Random Forest), and Random Forest (RF) was found to be the one striking the best trade-off between good performance and acceptable training times. The main parameter of RF is the number of trees. Some typical values for this are 10, 30 and 100, while very few problems would demand more than 300 trees. The rule of thumb is that more trees lead to better performance; however, they simultaneously increase considerably the training time. The comparative results for different number of trees for the Random Forest algorithm are displayed in Table 1.

The first column of contains the method name, the second one the number of trees used in the RF classifier, the third one the number of features used, and the remaining columns contain the F-measures for the SMALL class, the BIG class and the average. The reported results lead to several interesting conclusions.

Table 1: Comparative results (F-measure)

Method   RF  Trees  Features  F1small  F1big  F1avg
======   ==  =====  ========  =======  =====  =====
TOKENS -trf  10     1000      0.876    0.867  0.871
TOKENS -trf  30     1000      0.887    0.883  0.885
TOKENS -trf  100    1000      0.894    0.891  0.893
TOKENS -trf  10     2000      0.875    0.864  0.870
TOKENS -trf  30     2000      0.888    0.828  0.885
TOKENS -trf  100    2000      0.897    0.892  0.895
NG-tsrf-idf  10     1000      0.876    0.872  0.874
NG-tsrf-idf  30     1000      0.883    0.881  0.882
NG-tsrf-idf  100    1000      0.886    0.884  0.885
NG-tsrf-idf  10     2000      0.883    0.878  0.881
NG-tsrf-idf  30     2000      0.891    0.888  0.890
NG-tsrf-idf  100    2000      0.894    0.891  0.892
features     10     23        0.848    0.846  0.847
features     30     23        0.852    0.852  0.852
features     100    23        0.853    0.853  0.853
hybrid                        0.935    0.935  0.935

Web Image Size Prediction for Efficient Focused Image Crawling

‍Textual Features Approach

‍Non-textual Features Approach

‍Hybrid Approach

‍Evaluation

Table 1: Comparative results (F-measure)

Acknowledgement

Erratum:

Content is truncated

The Data

Overview

CDXJ Index

URL Index

Web Graphs

Latest Crawl

Crawl Stats

Graph Stats

Errata

Resources

Get Started

AI Agent

Blog

Examples

CCBot

Infra Status

Opt-Out Registry

FAQ

Community

Research Papers

Mailing List Archive

Hugging Face

Discord

Collaborators

About

About

Team

Jobs

Privacy Policy

Terms of Use