Learn Hadoop and get a paper published

We’re looking for students who want to try out the Hadoop platform and get a technical report published.

(If you’re looking for inspiration, we have some paper ideas below. Keep reading.)

Hadoop’s version of MapReduce will undoubtedly come in handy in your future research, and Hadoop is a fun platform to get to know. Common Crawl, a nonprofit organization with a mission to build and maintain an open crawl of the web that is accessible to everyone, has a huge repository of open data – about 5 billion web pages – and documentation to help you learn these tools.

So why not knock out a quick technical report on Hadoop and Common Crawl? Every grad student could use an extra item in the Publications section of his or her CV.

As an added bonus, you would be helping us out. We’re trying to encourage researchers to use the Common Crawl corpus. Your technical report could inspire others and provide a citable paper for them to reference.

Leave a comment now if you’re interested! Then, once you’ve talked with your advisor, post a follow-up to your comment, and we’ll be available to help point you in the right direction technically.

Step 1: Learn Hadoop
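
To get a feel for the programming model before touching a cluster, here is a minimal word count written in the Hadoop Streaming style. This is just a sketch of the idea, not the corpus pipeline itself: with Streaming, any program that reads lines on stdin and writes tab-separated key/value lines on stdout can act as a mapper or reducer.

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit a tab-separated "word<TAB>1" pair for each word."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce step: sum the counts for each word. The input must arrive
    sorted by key, which Hadoop's shuffle-and-sort phase guarantees."""
    keyed = (pair.split("\t") for pair in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the whole pipeline locally (sort stands in for the shuffle):
counts = list(reducer(sorted(mapper(["the quick fox", "the lazy dog"]))))
# counts holds lines like "the\t2", "fox\t1"
```

On a real cluster you would hand scripts like these to the `hadoop-streaming` jar; the local `sorted(...)` call above is only there to mimic the shuffle between the two phases.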

Step 2:
Turn your new skills on the Common Crawl corpus, available on Amazon Web Services.

Step 3:
Reflect on the process and what you find. Compile these valuable insights into a publication. The possibilities are limitless; here are some fun titles we’d love to see come to life:

  • “Identifying the most used Wikipedia articles with Hadoop and the Common Crawl corpus”
  • “Six degrees of Kevin Bacon: an exploration of open web data”
  • “A Hip-Hop family tree: From Akon to Jay-Z with the Common Crawl data”

Here are some other interesting topics you could explore:

  • Using this data, can we answer “how many Jack Blacks are there in the world?”
  • What is the average price of a camera?
  • How much can you trust HTTP headers? It’s extremely common for the response headers served with a webpage to contradict the actual page — on things like what language it’s in or its byte encoding. Browsers use these headers as hints but examine the actual content before deciding what that content is. It would be interesting to measure how often the two disagree.
  • How much is enough? Some questions we ask of data — such as “what’s the most common word in the English language?” — actually don’t need much data at all to answer. So what is the point of a dataset of this size? What value can someone extract from the full dataset? How does that value change with a 50% sample, a 10% sample, a 1% sample? For a particular problem, how should the sampling be done?
  • Train a text classifier to identify topicality. Extract meta keywords from Common Crawl HTML data, then construct a training corpus of topically tagged documents to train a text classifier for a news application.
  • Identify political sites and their leanings. Cluster and visualize their networks of links (you could use Blekko’s /conservative and /liberal tag lists as a starting point).
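
To make the header-contradiction idea above concrete, here is a rough sketch in Python. The helper names and the regex-based parsing are our own illustration; a real run would pull the headers and HTML out of the crawl’s archive files and use a proper HTML parser.

```python
import re

def declared_charset(headers):
    """Charset claimed by the Content-Type response header, if any."""
    content_type = headers.get("Content-Type", "")
    match = re.search(r"charset=([\w.-]+)", content_type, re.I)
    return match.group(1).lower() if match else None

def meta_charset(html):
    """Charset the page claims about itself via a <meta> tag."""
    match = re.search(r'<meta[^>]+charset=["\']?([\w.-]+)', html, re.I)
    return match.group(1).lower() if match else None

def contradicts(headers, html):
    """True when both sides declare a charset and they disagree."""
    from_header, from_page = declared_charset(headers), meta_charset(html)
    return (from_header is not None and from_page is not None
            and from_header != from_page)
```

Running `contradicts` over a sample of crawl records and tallying the True results would give a first estimate of how often headers and content disagree.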
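
The meta-keyword classifier idea might start with something like this sketch. The function names are hypothetical, and real pages would need more robust HTML parsing than these regexes; the point is only to show how declared keywords can become labels for a training corpus.

```python
import re

def meta_keywords(html):
    """Keywords a page declares in its <meta name="keywords"> tag."""
    match = re.search(
        r'<meta\s+name=["\']keywords["\']\s+content=["\']([^"\']*)',
        html, re.I)
    if not match:
        return []
    return [k.strip().lower() for k in match.group(1).split(",") if k.strip()]

def label_documents(pages):
    """Pair each page's visible text with its declared keywords,
    yielding (text, labels) examples for training a classifier."""
    for html in pages:
        labels = meta_keywords(html)
        if labels:
            text = re.sub(r"<[^>]+>", " ", html)   # crude tag stripping
            yield re.sub(r"\s+", " ", text).strip(), labels
```

The resulting (text, labels) pairs could then feed any off-the-shelf text classifier for the news-application use case described above.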

So, again — if you think this might be fun, leave a comment now to mark your interest. Talk with your advisor, post a follow-up to your comment, and we’ll be in touch!

79 thoughts on “Learn Hadoop and get a paper published”

    1. I love the enthusiasm expressed by double exclamation marks :) Please email us if you want any advice – or to tell us what you decide to work on. 

    1. Email us if there is anything we can do to help. Looking forward to seeing what you come up with!

    1. Cute handle! But won’t you eventually have to change it? You are probably already well beyond noob :) 

    1. Excellent! Let us know if there is anything we can do to assist you. The discussion group can be very helpful. http://bit.ly/J1B06q

  1. I’d be interested in running some named entity recognition experiments on Common Crawl data if I can figure out how to filter out the pages in Latvian without racking up a huge Amazon bill.

    1. Great! You can access the datasets on AWS Public Data Sets http://aws.amazon.com/datasets/41740. If you have any questions, feel free to email us at info@commoncrawl.org


    1. AJ – we would love to see something with Python-Streaming! That would be inspiring to the many people who rate Python as their favorite language. Would you use Dumbo?

      1. Possibly. I was just at the LA HUG at Shopzilla, where one of their data scientists talked about straight-up Python-Streaming, which does not require Dumbo but rather uses NumPy. There is also PyDoop, which is a really interesting project. But more than anything, I would love to try doing Python-Streaming using PyPy.

        If I have questions, who should I email?

        1. You can email me! lisa@commoncrawl.org If you have seriously technical questions I can connect you with the right person.

    1. Awesome – let us know if you need any help from us along the way. We’re happy to point you to resources.

    1. That’s great! We hope you do. Don’t hesitate to get in touch if you need any help or guidance along the way. Excited to see what you come up with.

  2. Extremely interested! I am a .NET professional with good hands-on programming experience. Now I am looking to change my technology stack to Hadoop. Please keep me posted.

    Thanks a bunch!

      1. Hi Allison – At this point I am looking to get some good suggestions on what to do with Common Crawl data. Can you help out with that?


  3. I am interested in technical paper writing and even in doing a PhD on this topic in India. I have published some papers on Hadoop. Request you to keep me posted. Thanks in advance

  4. I am doing an M.Tech and have just completed my Operating Systems research project on integrating virtualization with Hadoop tools. I am quite interested in working more on Hadoop. Please keep me posted.

    Thanks in advance.

  5. I’m not currently a student at any academic institution, but I would like to explore new ideas as I cross Hadoop with my traditional data modeling and RDBMS knowledge. Let me know if you are limiting this to “students”. Thanks for the idea.

  6. I am a graduate student at NYU and would like to work on Hadoop and contribute as much as possible.
    I am planning to take a course on Hadoop next semester, but before that I would like to learn it and implement it on a 150 TB dataset.

  7. I am very much interested in a technical publication on Hadoop. I have around a year of experience with Hadoop/Pig and MapReduce. I would like to learn more and get involved.

  8. I am a Java/J2EE professional with 1.5+ years of experience, extremely interested in learning Hadoop.

    1. Read the book “Hadoop: The Definitive Guide”. It’s a very good book in which you can find all the details.

  9. Hi, I have done a project on the scalability of web applications with MySQL and Hadoop HBase backends for a social networking site. I am very much interested in continuing my involvement with Hadoop and am looking for good research topics to start with.

  10. Hi, I am interested in working with Hadoop and would like to publish a research paper on any of the topics you provide. Kindly help me with this. I have read about Hadoop on the internet; the topic is very interesting, and I would like to put all my effort into it.

  11. Hi, I’m a student at Paris Dauphine University, currently doing an internship at Data Publica. At the current stage of my work, I have to use Common Crawl to identify and classify French websites. I am very grateful to have this opportunity to work on Common Crawl.

  12. Hi, I work at a reputed product development company and I’m very much interested in learning and contributing something in this area. I have a Linux machine with a good configuration to try out stuff.

  13. Hi, I am a Digital Analytics professional and would love to pursue this opportunity to learn and contribute. 

  14. I am currently a student at IIITB doing research on Hadoop and search engines with my friend. I am looking forward to using this opportunity.

  15. Hello, I am Jaipal, currently working for a small-scale company. We want to implement a crawler on Hadoop. If you are familiar with the configuration side, please let me know; it would be a great help to me.

    Jaipal R

  16. The blog was absolutely fantastic! Lots of great information that can be helpful in one way or another. Keep updating the blog; looking forward to more content.

  17. I’m a postgrad looking for a doctoral research topic. I have implemented Hadoop on a single node and am going to implement the same on multiple nodes. I would like my doctoral topic to be related to Hadoop itself, because it is a very interesting subject for research and will be a thriving research area in the coming years.

  18. I am working on a Distributed Data Mining framework on Cloud for Mobile Business Intelligence. Any input in this research direction is much appreciated!

  19. I am a research student. I have set up a Hadoop cluster of 10 virtual machines, have analyzed a massive amount of smart grid data using Hadoop, and will soon submit the paper for publication. For the future I was thinking about load balancing in Hadoop, but I came to know that a lot of work has already been done in this area. Could you please advise me on what area of Hadoop I should work in to get a publication?

  20. Hi, is the opportunity still available? I’m very much interested. The article was very helpful, by the way.

  21. I am really interested in big data analysis. I have worked a bit with MapReduce in Hadoop and need some guidance on what direction to take next. I am a student at NIT Surathkal, India.

  22. I am a full-time research scholar at CUSAT. Could you please guide me on how to make use of this crawl dataset?

  23. I’m a student at Anna University, currently working on Hadoop. Would appreciate guidance!

Comments are closed.