URL Search Tool!

A couple of months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index. Today we are happy to announce a tool that makes it even easier for you to take advantage of the URL Index!

URL Search is a web application that allows you to search for any URL, URL prefix, subdomain or top-level domain. The results of your search show the number of files in the Common Crawl corpus that came from that URL and provide a downloadable JSON metadata file with the address and offset of the data for each URL. Once you download the JSON file, you can drop it into your code so that you only run your job against the subset of the corpus you specified. URL Search makes it much easier to find the files you are interested in and significantly reduces the time and money it takes to run your jobs, since you can now run them across only the files of interest instead of the entire corpus.
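As a rough illustration of what "dropping the JSON file into your code" could look like, here is a minimal Python sketch. The field names (`url`, `file`, `offset`, `length`) and the sample records are assumptions made for this example, not the documented schema of the URL Search output, so adjust the keys to match the file you actually download. The sketch groups records by the corpus file that holds them and builds an HTTP byte-range value for each record, so a job could fetch only those slices.

```python
import json
from collections import defaultdict

# Hypothetical sample of a URL Search result file. The real field names
# and layout may differ; change the keys below to match your download.
SAMPLE_JSON = """
[
  {"url": "http://example.com/a", "file": "segment-1/part-0001.arc.gz", "offset": 1024, "length": 512},
  {"url": "http://example.com/b", "file": "segment-1/part-0001.arc.gz", "offset": 4096, "length": 256},
  {"url": "http://example.com/c", "file": "segment-2/part-0007.arc.gz", "offset": 0, "length": 100}
]
"""


def group_by_file(records):
    """Group (offset, length) pairs by the corpus file that holds them."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["file"]].append((rec["offset"], rec["length"]))
    # Sort each file's spans by offset so they can be read in one pass.
    for spans in grouped.values():
        spans.sort()
    return dict(grouped)


def range_header(offset, length):
    """Build an HTTP Range header value for one record (end is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


records = json.loads(SAMPLE_JSON)
plan = group_by_file(records)

# Print the byte ranges a job would need to fetch, file by file.
for path, spans in sorted(plan.items()):
    for offset, length in spans:
        print(path, range_header(offset, length))
```

Grouping by file first means each corpus file is opened once, however many of your URLs it contains.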

 

 

URL Search

 

 

We are excited to see examples of URL Search in action. Are you working with Common Crawl data? Would you like to win $100 in AWS credit for sharing how URL Search makes your life easier? The first five people who share open source code on GitHub that incorporates a JSON file from URL Search will each get $100 in AWS credit!

Email a link to the GitHub repo to [email protected] for consideration. The code must be accompanied by a README file that explains it. If you would like to write a guest blog post about your work, we would be happy to publish it on the Common Crawl blog.

6 thoughts on “URL Search Tool!”

  1. I think I’m seeing some issues with the URL index.

    For example, when I do a search on “en.wikipedia.org” (http://urlsearch.commoncrawl.org/?q=en.wikipedia.org) the first URL listed is

    http://en.wikipedia.org/wiki/1525
    If I do a search on “en.wikipedia.org/wiki” (http://urlsearch.commoncrawl.org/?q=en.wikipedia.org%2Fwiki) the first URL listed is
    http://en.wikipedia.org/wiki/1647_in_literature
    If I do a search on “en.wikipedia.org/wiki/1” (http://urlsearch.commoncrawl.org/?q=en.wikipedia.org%2Fwiki%2F1) the first URL listed is
    http://en.wikipedia.org/wiki/1942:_Joint_Strike
    I would have expected the first URL returned from the index to be the same in each case, correct?

  2. Adding to John Wiseman’s comment, the search for ‘en.wikipedia.org/wiki/19’ returns no results at all, even though it’s both a common prefix and also a separate article on Wikipedia.
