MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl data, the net total of 5 billion crawled pages, hosted in the Amazon cloud. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post. The example we’ll use to demonstrate it is a riff on the canonical Hadoop Hello World program, a simple word counter; the twist is that we’ll be running it against the Internet.
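
If you’ve never seen Hadoop’s word count before, the heart of it is a mapper that emits a (word, 1) pair for every word it sees and a reducer that sums those counts per word. The sketch below shows that pattern using the plain org.apache.hadoop.mapred API that Hadoop 0.20-era jobs like this one are written against; the class names are illustrative rather than the exact ones in our repository, and the real HelloWorld pulls its text out of Common Crawl ARC records rather than plain lines of text.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Emits (word, 1) for every whitespace-separated token in its input value.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    for (String token : value.toString().split("\\s+")) {
      if (token.length() > 0) {
        word.set(token);
        output.collect(word, ONE);
      }
    }
  }
}

// Sums the counts emitted for each word.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));
  }
}

That’s the whole trick: the map step can run on thousands of chunks of the crawl at once, and the reduce step folds the partial counts back together.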

Once you’ve got a taste of what’s possible when open source meets open data, we’d like you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Ready to get started?  Watch our screencast and follow along below:

http://www.youtube.com/watch?v=y4GZ0Ey9DVw

Step 1 – Install Git and Eclipse

We first need to install a few important tools to get started:

Eclipse (for writing Hadoop code)

How to install (Windows and OS X):

Download the “Eclipse IDE for Java developers” installer package located at:

http://www.eclipse.org/downloads/

How to install (Linux):

Run the following command in a terminal:

RHEL/Fedora

 # sudo yum install eclipse

Ubuntu/Debian

 # sudo apt-get install eclipse

Git (for retrieving our sample application)

How to install (Windows)

Install the latest .EXE from:

http://code.google.com/p/msysgit/downloads/list

How to install (OS X)

Install the appropriate .DMG from:

http://code.google.com/p/git-osx-installer/downloads/list

How to install (Linux)

Run the following command in a terminal:

RHEL/Fedora

# sudo yum install git

Ubuntu/Debian

# sudo apt-get install git

Step 2 – Check out the code and compile the HelloWorld JAR

Now that you’ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:

# git clone git://github.com/ssalevan/cc-helloworld.git

Next, start Eclipse. Open the File menu, then select “Project” from the “New” menu. Open the “Java” folder and select “Java Project from Existing Ant Buildfile”. Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file. Eclipse will find the right targets; tick the “Link to the buildfile in the file system” box, as this lets git pick up the edits you make to the buildfile in Eclipse.

We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears.  Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”.  Select “Ant Builder” from the dialog which appears, then click OK.

To configure our new Ant builder, we need to specify three pieces of information: where the buildfile is located, where the root directory of the project is, and which Ant build target we wish to execute. To set the buildfile, click the “Browse File System” button under the “Buildfile:” field and locate the build.xml file you selected earlier. To set the root directory, click the “Browse File System” button under the “Base Directory:” field and select the folder into which you checked out our code. To specify the target, enter “dist” (without the quotes) into the “Arguments” field. Click OK and close the Properties window.

Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.
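
If you’d rather not set up the Eclipse builder, the same JAR can also be produced from a terminal, assuming you have Apache Ant installed; run the following from the directory you checked the code out into:

# ant dist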

Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials

If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Once you’ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key

This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.

Step 4 – Upload the HelloWorld JAR to Amazon S3

Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:

https://console.aws.amazon.com/s3/home

Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:

<your checkout dir>/dist/lib/HelloWorld.jar
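
If you prefer the command line and already have a tool such as s3cmd configured with your AWS credentials, the equivalent upload looks roughly like this (the exact local path depends on where you checked out the code):

# s3cmd put <your checkout dir>/dist/lib/HelloWorld.jar s3://<your bucket name>/HelloWorld.jar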

Step 5 – Create an Elastic MapReduce job based on your new JAR

Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:

https://console.aws.amazon.com/elasticmapreduce/home

and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.

The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:

s3n://<your bucket name>/HelloWorld.jar

then add the following arguments to it:

org.commoncrawl.tutorial.HelloWorld <your aws access key id> <your aws secret access key> 2010/01/07/18/1262876244253_18.arc.gz s3n://<your bucket name>/helloworld-out

Common Crawl stores its crawl information as GZipped ARC-formatted files (http://www.archive.org/web/researcher/ArcFileFormat.php), and each one is stored under a path of the following form:

/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz

Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:

1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)

2. Log into Amazon S3 with your AWS access codes

3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010

4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)
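
For the curious, the driver that carries out those four steps is shaped roughly like the sketch below. It is a simplification rather than the repository’s exact code: the argument order matches the command line above, the mapper and reducer are the word-count classes sketched earlier, and the wiring of Common Crawl’s ARCInputFormat and its S3 source is deliberately left out.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HelloWorldDriver {
  public static void main(String[] args) throws IOException {
    String awsAccessKeyId = args[0];  // your AWS Access Key ID
    String awsSecretKey   = args[1];  // your AWS Secret Access Key
    String inputPrefix    = args[2];  // e.g. 2010/01/07/18/1262876244253_18.arc.gz
    String outputPath     = args[3];  // e.g. s3n://<your bucket name>/helloworld-out

    JobConf conf = new JobConf(HelloWorldDriver.class);
    conf.setJobName("commoncrawl-helloworld " + inputPrefix);

    // The real HelloWorld configures org.commoncrawl.hadoop.io.ARCInputFormat here,
    // handing it the crawl-file prefix and the AWS credentials so it can stream the
    // GZipped ARC files straight out of the Common Crawl S3 bucket; that wiring is
    // specific to the Common Crawl libraries and is omitted from this sketch.

    // Credentials for writing to the s3n:// output path (standard Hadoop properties).
    conf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
    conf.set("fs.s3n.awsSecretAccessKey", awsSecretKey);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));

    JobClient.runJob(conf);
  }
}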

Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)

If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:

2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)

2010/ – All crawl files from 2010
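
For example, to chew through everything the crawler produced during that hour rather than a single file, you could pass arguments like these (same format as before, with the prefix in place of the filename); bear in mind that a larger input means a longer, and somewhat more expensive, job:

org.commoncrawl.tutorial.HelloWorld <your aws access key id> <your aws secret access key> 2010/01/07/18/ s3n://<your bucket name>/helloworld-out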

Don’t worry about the remaining fields in the wizard for now; just accept the default values. If you’re offered the opportunity to enable debugging, I recommend doing so, as it lets you watch your job in action. Once you’ve clicked through the remaining screens, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.

Step 6 – Watch the show

Now just wait and watch as your job runs through the Hadoop flow; you can look for errors using the Debug button. Within about 10 minutes, your job will be complete. You can view the results in the S3 console, in the helloworld-out directory of your bucket. If you download these files and load them into a text editor, you can see what came out of the job. You can load this sort of data into a database, or create a new Hadoop OutputFormat to export XML that you render into HTML with an XSLT; the possibilities are pretty much endless.
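
If you’d rather pull the results down from a terminal than click through the S3 console, and you have s3cmd configured with your credentials, something along these lines will fetch the whole output directory:

# s3cmd get --recursive s3://<your bucket name>/helloworld-out/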

Step 7 – Start playing!

If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.

61 thoughts on “MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl”

  1. Pingback: Quora
  2. For those who haven’t used AWS – what are the approximate expected costs to run such processing there on the whole commoncrawl dataset?

  3. >> When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis.

    The MapReduce algorithm was already used by certain companies way before Google was even founded.
    It’s only when Google does something that it somehow automatically gets valued as a mystical technology that everyone should use in their business logic.

    1. It wasn’t the map part that was cool, it was the whole system. Probably Tesla vs Edison or something of a similar analogy. Hadoop didn’t exist until Google got it right and published it. No doubt lots of people would claim partitioning work to run at scale was their idea, but that isn’t the point. Almost the same as say Hubs and Authorities pre-dated PageRank, but we know what was successful.

  4. Hi,

    I tried to execute HelloWorld.jar on AWS Elastic MapReduce as described above but my job failed with the following exception:

    Exception in thread “main” java.io.IOException: No input to process
        at org.commoncrawl.hadoop.io.ARCInputFormat.getSplits(ARCInputFormat.java:171)
        at org.commoncrawl.tutorial.HelloWorld.main(HelloWorld.java:113)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

    The Jar file was: s3n://commoncrawl-002/HelloWorld.jar

    The Java arguments were as follows:
    org.commoncrawl.tutorial.HelloWorld
    2010/01/07/18/1262876244253_18.arc.gz s3n://commoncrawl-002/helloworld-out

    Then I tried with wildchar, too:
    org.commoncrawl.tutorial.HelloWorld
    2010/01/07/18/*.arc.gz s3n://commoncrawl-002/helloworld-out

    but got the same exception. Any idea what is wrong?

    By the way, first I was using Java SDK 1.7 to compile the code in Eclipse, but after I uploaded HelloWorld.jar into AWS S3 and created the MapReduce job, I got an exception, “Unsupported major.minor version 51.0”. Then I recompiled it with Java 1.6 and it worked. Apparently, the current version of AWS Elastic MapReduce does not support Java 1.7 yet.

    Istvan

    1. I get the exception “No input to process”
      with any wildcard, but 2010/01/07/18/1262876244253_18.arc.gz
      ran fine.

      E.g., 2010/01/07/18/*.arc.gz or
      2010/**/*.arc.gz
      would throw “No input to process” at
      org.commoncrawl.hadoop.io.ARCInputFormat.getSplits(ARCInputFormat.java:171)

      I wonder if that may be a Java version issue?
      java -version
      java version “1.6.0_23″
      OpenJDK Runtime Environment (IcedTea6 1.11pre) (6b23~pre11-0ubuntu1.11.10)
      OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)

      1. I also faced the same problem and after digging into org.commoncrawl.hadoop.io.JetS3tARCSource and subsequently
        org.jets3t.service.impl.rest.httpclient.RestS3Service, it turns out that your parameter is actually a prefix and not a glob wildcard.

        These are all valid:

        2010/01/07/18/1262876244253_18.arc.gz – 1 file
        2010/01/07/18/12628693 – 18 files
        2010/01/07/18/ – 1273 files
        2010/ – must be a lot of files :)

          1. It still does not work… I am trying to run the HelloWorld.jar MapReduce job from the Europe (Ireland) region. Can it be that the ARC input file (2010/01/07/18/1262876244253_18.arc.gz) is not accessible from there?

          Istvan

          1. I ran most of my jobs in the US-West (Northern California) datacenter; try running it over there and let us know if it works.  If it does, we’re probably looking at an S3 data duplication issue of some sort and we’ll look into why that’s happening.

          2. Eventually I got it working in the Ireland datacenter. I had screwed up the bucket name in HelloWorld.java (“commoncrawl-crawl-002”).

            Congrats on the great article!

            Istvan

          3. Hi, I tried running the Hello World example with the exact same params as the tutorial, and also with 2010/01/07/18/12628693, to no avail. I’m getting the same error:

            Exception in thread “main” java.io.IOException: No input to process
            at org.commoncrawl.hadoop.io.ARCInputFormat.getSplits(ARCInputFormat.java:171)
            at org.commoncrawl.tutorial.HelloWorld.main(HelloWorld.java:117)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

            I checked the src file and CC_BUCKET = “commoncrawl-crawl-002”; has anyone tried to run this recently with success?

            Thanks

          4.  I am encountering the same problem. In addition, there seems to be nothing up on the commoncrawl-crawl-002 bucket.
            The bucket contains two folders 2009 and 2010, but the folders are empty.
            For example, using a patched s3cmd:

            $ s3cmd --add-header="x-amz-request-payer:requester" ls s3://commoncrawl-crawl-002/
            2012-03-05 21:26         0   s3://commoncrawl-crawl-002/2009_$folder$
            2012-03-06 22:01         0   s3://commoncrawl-crawl-002/2010_$folder$
            $ s3cmd --add-header="x-amz-request-payer:requester" ls s3://commoncrawl-crawl-002/2009/
            $ s3cmd --add-header="x-amz-request-payer:requester" ls s3://commoncrawl-crawl-002/2010/
            $

          5. This is an interesting point, Steve.

            Data transfer between S3 and EC2 (and hence EMR) is free _only_ within the same region. Since the Common Crawl buckets are based in us-east-1, you would have incurred transfer costs by running the job in us-west-1.

    2. Hi Szegedi, what is the job flow id of your failing job? (The j-??? value)
      I’m an engineer on Elastic MapReduce, so I can troubleshoot from our end. Cheers, Mat

      1. Hi Mat,

        Thanks a lot for offering your help, I really appreciate it. Finally I got it working; the issue was related to the input bucket name, which I had modified accidentally.

        Thanks again,
        istvan

      2. Hi Mat, I am having the same issue as Szegedi. Can you see why it’s failing via the job id? I can send you the ids of the job flows that failed. Thanks in advance!

        1. Sorry, but I’m no longer at Amazon.

          If you post to the EMR forums, someone on the team might be able to help out: https://forums.aws.amazon.com/forum.jspa?forumID=52

          Mat

  5. I’m very grateful for this how-to, and at the same time I’m frustrated that it’s 5 minutes of intense clicking and prodding just to have a few Java files run on Amazon (albeit manipulating some amazing data sets).
    It would have been good to see a little coverage of the code and its output. If it were a Sikuli script or a shell script to set up, it would be easier too.
    Surely there are tools that allow command line deployment and show the output in real time too?
    I really thought this would be for the masses; it isn’t clear that it is yet. If I get around to playing with this, it would be interesting to compare how long it ACTUALLY takes to do the same thing as the video from start to finish, and to perhaps screen capture THAT more authentic process. There must be some web app alternative where code can just be edited and deployed without fuss? (Just given Amazon login data etc.) What about Jython also?

  6. I got this working a couple of weeks ago, with the only problem being that the bucket name needed to be all lower-case. The output data format is supposed to be CSV, but it looks like no CSV format I’ve ever seen: the record separator is a single quote and the field separator is a comma, so most CSV-compatible tools cannot process it. Is there a way to condition the output to be more easily digested by other tools?

    1. Like a lot of code in the Hadoop project, the CSV writer is an abomination. You could write your own output format, or try your luck with one of the others, depending on which formats are acceptable to the tools you intend to use.

  7. Great step by step. The hardest part was working through the constraints of case and spaces. Once I learned to use all lower case and no spaces, everything went smoothly. My process ran for 12 minutes on the example archive.

  8. If someone is interested in having direct access to the data, you can use aws (https://github.com/timkay/aws):

    % perl aws get "x-amz-request-payer:requester" commoncrawl-crawl-002/2010/01/07/18/1262876244253_18.arc.gz > 1262876244253_18.arc.gz

    The file is around 100 MB.

  9. Totally frustrated !!!
    After 8 attempts, the closest I’ve come was to error out with the following:
    2012-02-25T03:34:53.626Z INFO Fetching jar file.
    2012-02-25T03:35:03.697Z INFO Working dir /mnt/var/lib/hadoop/steps/2
    2012-02-25T03:35:03.698Z INFO Executing /usr/lib/jvm/java-6-sun/bin/java -cp /home/hadoop/conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/home/hadoop:/home/hadoop/hadoop-tools.jar:/home/hadoop/hadoop-core-0.20.205.jar:/home/hadoop/hadoop-core.jar:/home/hadoop/hadoop-tools-0.20.205.jar:/home/hadoop/lib/*:/home/hadoop/lib/jetty-ext/* -Xmx1000m -Dhadoop.log.dir=/mnt/var/log/hadoop/steps/2 -Dhadoop.log.file=syslog -Dhadoop.home.dir=/home/hadoop -Dhadoop.id.str=hadoop -Dhadoop.root.logger=INFO,DRFA -Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/2/tmp -Djava.library.path=/home/hadoop/native/Linux-i386-32 org.apache.hadoop.util.RunJar /mnt/var/lib/hadoop/steps/2/HelloWorld.jar org.commoncrawl.tutorial.HelloWorld AKIAIVET2WKOEUK3ZZBQ XXXXXXXXXXX 2010/01/07/18/12628693 s3n://crawljar/helloworld-out
    2012-02-25T03:35:15.740Z INFO Execution ended with ret val 1
    2012-02-25T03:35:15.741Z WARN Step failed with bad retval
    2012-02-25T03:35:21.092Z INFO Step created jobs:

    Why am I getting this!?!?

    1. I had the same issue on a word count tutorial. The problem was the output key/val from the map function. The Reduce function could not understand it since I had edited the format. So basically, check your map function output.

  10. I tried to run this but it failed after 15 minutes with this lovely stack-trace:

    Exception in thread “main” java.lang.UnsupportedClassVersionError: org/commoncrawl/tutorial/HelloWorld : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

    Can anybody explain what’s wrong here?

    1. I had the same error, and I got this to work by building against hadoop-core-0.20.2. EMR gave me a hint when it said my custom jar should be compiled against hadoop-0.20.

      1. I have cloned the latest from the git repository, which I checked is built with hadoop-core-0.20.2. I am getting the same error as above: “Exception in thread “main” java.lang.UnsupportedClassVersionError: org/commoncrawl/tutorial/HelloWorld : Unsupported major.minor version 51.0”.

        1. My problem was that it was compiling with JDK 7. I changed my settings to use JDK 6 and everything worked.

  11. For those who are getting the input error I posted below: I finally figured out what the problem was. The location of the dataset has moved from commoncrawl-crawl-002 to the location specified at http://aws.amazon.com/datasets/41740. Thus, to get this example to work, you need to change the bucket name in the src file as well as in the argument when running the example.

    1. Just to elaborate,
      CC_BUCKET is now “aws-publicdatasets”
      and you need to find a working file. I used http://www.s3fm.com/.
      Here is an example file to use in the arguments:
      common-crawl/crawl-002/2010/09/25/18/1285398033394_18.arc.gz

  12. I have seen “/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz” with an hour value as high as 46. But does that matter?

  13. I am trying to modify the helloworld example to read sequence files instead of arc files. Is it just a case of modifying the ArcSplitReader like so:

    public SequenceSplitReader(JobConf job, SequenceSplit split, SequenceSource source, int blockSize){
           
            this.split = split;
            job.set(SPLIT_DETAILS, split.toString());
            this.job = job;
            this.source = source;
            this.blockSize = blockSize;
            this.readers = new SequenceFile.Reader[split.getResources().length];
            try {
                this.readers[0] = new SequenceFile.Reader(null, null, job);
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
            this.readerIndex = 0;
            this.totalBytesRead = 0;
            this.error= null;
            new IOThread().start();
        }
       
        private class IOThread extends Thread {

            public void run() {

              for (int i = 0; i < readers.length; i++) {
                // ...
                      if (bytesRead > 0) {
                        streamPosition += bytesRead;
                        totalBytesRead += bytesRead;
                        buffer.flip();
                        //reader.;
                      } else if (bytesRead == -1) {
                     
                        if (i + 1 < readers.length) {
                          readers[i + 1] = new SequenceFile.Reader(null, null, null);
                        }
                        reader.close();
                        break;
                      }
                    }

                    break;

                  } catch (Throwable t1) {
                    lastError = t1;
                    failures++;
                  } finally {
                    try {
                      if (stream != null) {
                        stream.close();
                      }
                    } catch (Throwable t2) {
                    }
                    stream = null;
                  }
                }
              }
            }
           
           
          }
        @Override
        public Text createKey() {
            return new Text();
        }

        @Override
        public SequenceFileItem createValue() {
            return new SequenceFileItem();
        }

        @Override
        public long getPos() throws IOException {
            return totalBytesRead;
        }

        @Override
        public float getProgress() throws IOException {
            return totalBytesRead / (float) split.getLength();
        }

        public boolean next(SequenceFileItem item) throws IOException {
            while (readerIndex < readers.length) {
                Writable key = (Writable) ReflectionUtils.newInstance(readers[readerIndex].getKeyClass(), job);
                Writable value = (Writable) ReflectionUtils.newInstance(readers[readerIndex].getValueClass(), job);
               
                  if (readers[readerIndex].next(key, value)) {
                    // populate arc file path in item
                      item.setSequenceFileName(split.getResources()[readerIndex].getName());
                      item.setKey(key);
                      item.setValue(value);
                    try {
                      // and then delegate to reader instance
                      readers[readerIndex].next(item);
                    } catch (IOException e) {
                      LOG.error("IOException in SequenceSplitReader.next().SequenceFile:"
                          + item.getSequenceFileName() + "\nException:"
                          + StringUtils.stringifyException(e));
                      throw e;
                    } catch (Exception e) {
                      LOG
                          .error("Unknown Exception thrown in SequenceSplitReader.next().SequenceFile:"
                              + item.getSequenceFileName()
                              + "nException:"
                              + StringUtils.stringifyException(e));
                      throw new RuntimeException(e);
                    }
                    return true;
                  } else {
                    readers[readerIndex++] = null;
                  }
                }

                return false;
        }

        @Override
        public boolean next(Text key, SequenceFileItem value) throws IOException {
            if(next(value)){
                key.set(value.key.toString());
                return true;
            }
            return false;

    }

    Where the SequenceFileItem just holds Writable key, Writable Value, and a String for sequenceFilename.

    Would someone be able to let me know if I am on the right track?

  14. Hello,

    I’m trying to run the HelloWorld.jar file and each time it fails with “Shut down as step failed”. I am using Hadoop version 0.20.205.
    Any idea what the issue could be? Appreciate the help.

  15. Pingback: Quora
