News Crawl

News is a text genre that is often discussed on our user and developer mailing list.

Yet our monthly crawl and release schedule is poorly suited to this type of content, which is driven by current and developing events. By decoupling the news from the main dataset as a smaller sub-dataset, it becomes feasible to publish the WARC files shortly after they are written.
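WARC (Web ARChive, ISO 28500) is a simple container format: each record is a block of text headers, a blank line, and a payload of exactly Content-Length bytes. As a rough illustration of what the published files contain, here is a minimal sketch that parses a single hand-written record using only the standard library; the record contents and the parser are illustrative assumptions, not the crawler's actual output or a full WARC reader (real tooling such as the warcio library handles gzip members, chunking, and edge cases).

```python
from io import BytesIO

# A tiny, hand-written WARC record for illustration (hypothetical content).
record_bytes = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/news\r\n"
    b"WARC-Date: 2016-10-04T12:00:00Z\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, crawl!"
    b"\r\n\r\n"
)

def read_warc_record(stream):
    """Parse one WARC record: a version line, header lines, a blank line,
    then a payload of exactly Content-Length bytes."""
    version = stream.readline().strip().decode()
    headers = {}
    # Header block ends at the first empty (CRLF-only) line.
    for line in iter(stream.readline, b"\r\n"):
        name, _, value = line.decode().partition(":")
        headers[name.strip()] = value.strip()
    payload = stream.read(int(headers["Content-Length"]))
    return version, headers, payload

version, headers, payload = read_warc_record(BytesIO(record_bytes))
print(version)                     # WARC/1.0
print(headers["WARC-Target-URI"])  # http://example.com/news
print(payload.decode())            # Hello, crawl!
```

Because records are length-delimited rather than delimiter-terminated, a reader can skip a record it does not care about by seeking past Content-Length bytes, which keeps streaming over large files cheap.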

Using StormCrawler

While the main dataset is produced using Apache Nutch, the news crawler is based on StormCrawler, an open-source collection of resources for building low-latency, scalable web crawlers on Apache Storm. Using StormCrawler allows us to test and evaluate a different crawler architecture in pursuit of the following long-term objectives:
Continuously release freshly crawled data
Incorporate new seeds quickly and efficiently
Reduce computing costs with constant/ongoing use of hardware

How to report bugs

The source code of the news crawler is available on our GitHub account. Please report issues there and share your suggestions for improvements with us.

We are grateful to Julien Nioche (DigitalPebble Ltd) who, as the lead developer of StormCrawler, had the initial idea to start the news crawl project. Julien provided the first version of the news crawler for free and volunteered to support the initial crawler setup and testing.