Examples Using Our Data
Need More Help?
Take a look at our Getting Started page or connect with others on our Developer List.
Analyzing crime reported in the U.S. using data derived from Common Crawl, New York Times API and Twitter data
Sai Saket Regulapati
Hello, WARC: Common Crawl code samples
Colin Dellow
commoncrawl_downloader
Leo Gao
goCommonCrawl – Extraction of Web Archive data using Common Crawl index API
karust
CitizensFoundation/ac-keyword-scanner
Róbert Viðar Bjarnason
SportsDataAnalysis
Yash Chandra
Categorizing World Wide Web
Jay Pavagadhi
CCrawlDNS – CommonCrawl data set subdomain extracter
Laurent Gaffié
How to Retrieve Archived Pages of Specific Domain Using CommonCrawl Index
Liyan Xu
Index fun
Philippe Suter
A free version of Helium Scraper that scrapes data from the Common Crawl database.
Juan Soldi
mcn-source-ct – Scripts for downloading and extracting .no domains from the data of the commoncrawl.org project.
Anders Einar Hilden
cc.py – Extracting URLs of a specific target based on the results of commoncrawl.org
SI9INT
CommonCrawlScalaTools
Jeff Harwell
Source real estate prices from the Common Crawl
Colin Dellow
Extracting text from HTML in Python: a very fast approach
Artem Golubin
Defining Data Science Using the Common Crawl Web Corpus
Paavo Pohndorff
Large-scale Graph Mining with Spark
Win Suen
Paskto – Passive Web Scanner
Parsing Common Crawl in 2 plain scripts in python
Alexander Veysov
The prevalence of Web advertising
commecica.com
Of using Common Crawl to play Family Feud
Paul Masurel
Common Crawl Scala Example
Soner Altin
Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl
Janek Bevendorff, Martin Potthast, Bauhaus-Universität Weimar
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
Greg Lindahl
Using Python and Common-Crawl to find products from Amazon.com
David Cedar
Analyzing “Wait-Delay” Settings in Common Crawl robots.txt Data with R
hrbrmstr
Clustering communities on web crawl data
Oluwaseyi Talabi, M. Rafay Aleem, Prashanth Rao, Nandita Dwivedi
Virtual patent marking crawler
David Portabella
Analyzing 4 Billions of Tags with R and Spark
Javier Luraschi
newsplease/examples/commoncrawl.py – download WARC files from commoncrawl.org's news crawl
Felix Hamborg
cc-pyspark: process Common Crawl data with Python and Spark
Common Crawl
KeywordAnalysis: Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
CI-Research
Go Crawl
Chris Cates
go-warc: golang library to work with WARC files
Wolfgang Meyers
sparkwarc: Load WARC Files into Apache Spark
Javier Luraschi
Analysing Petabytes of Websites
Mark Litwintschik
CommonCrawlJob – Extract data from common crawl using elastic map reduce
Sang Han (Qadium)
Exploring the Common Crawl with Python
Derek Morgan
Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS
Jouke-Thiemo Waleson
Mining Common Crawl with PHP
Paulius Rimavičius
Crate.IO: How to import from custom data sources with a plugin
Claus Matzinger
Index 1,600,000,000 Keys with Automata and Rust
Andrew Gallant
Как погрепать интернет / How to grep the web
Aleksandr Kukushkin
How Many Websites Provide RSS / Web Syndication Feeds
Victor Felder (eXascale Infolab)
Analyzing the Common Crawl using Map-Reduce
Stefan Koch
Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch – AWS Big Data Blog
Hernan Vivani
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
Ilya Kreymer
Analyze Common Crawl index – http://index.commoncrawl.org/
Tom Morris
Common Crawl Document Download
Dominik Stadler