What is Molescrape?

Molescrape is a platform for combining and managing typical web crawling and scraping activities. It started as a way to manage and monitor multiple long-running, continuously repeating Scrapy spiders. Later I added more features that I considered useful.

Molescrape consists of several components, each with its own job. Skyscraper is the engine that runs Scrapy spiders and stores the result data in the supported output locations. The advantage is that Skyscraper ensures consistency across spiders: all spiders are defined in the same place, managed in the same way, and write their results to the same location. On top of that, Skyscraper checks whether the spiders are still running and working, or whether something might be wrong.
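To make the idea of checking whether spiders are still working more concrete, here is a minimal sketch of one possible check: flag every spider whose most recent result is older than some threshold. This is not Skyscraper's actual code. The function name find_stale_spiders, the spider names and the 24-hour threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def find_stale_spiders(last_result_at, max_age, now):
    """Return the names of spiders whose latest result is older than max_age.

    last_result_at maps a (hypothetical) spider name to the datetime of its
    most recent stored item, or to None if it has never produced one.
    """
    stale = []
    for name, seen in sorted(last_result_at.items()):
        # A spider with no results at all is treated as stale, too.
        if seen is None or now - seen > max_age:
            stale.append(name)
    return stale


# Example data, invented for this sketch.
now = datetime(2024, 1, 10, 12, 0)
last_result_at = {
    "news": datetime(2024, 1, 10, 11, 30),   # produced an item 30 minutes ago
    "prices": datetime(2024, 1, 8, 9, 0),    # silent for more than two days
    "jobs": None,                            # never produced anything
}
stale = find_stale_spiders(last_result_at, timedelta(hours=24), now)
print(stale)  # ['jobs', 'prices']
```

A check like this only tells you that a spider has gone quiet, not why. It could have crashed, been blocked, or the target site may have changed its layout.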

Mouldwarp is the generic crawling component. It takes a start URL and crawls pages from there by following links. It does not perform any data extraction.
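The general technique of starting at one URL and following links can be sketched as a breadth-first crawl. The code below is not Mouldwarp's implementation, only an illustration using the Python standard library. The names crawl, LinkParser and fetch are hypothetical, and restricting the crawl to the start URL's host is an assumption of this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url, staying on the same host.

    fetch is any callable that takes a URL and returns the page HTML as a
    string, or None if the page could not be loaded.
    Returns the visited URLs in the order they were fetched.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Tiny in-memory "site" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b#top">B</a> <a href="http://other.org/">X</a>',
    "http://example.com/b": '<a href="/">home</a>',
}
visited = crawl("http://example.com/", PAGES.get)
print(visited)
```

Passing fetch in as a parameter keeps the crawl logic separate from the HTTP layer. A real crawler would plug in a function that performs the request and would also need to handle things like robots.txt, rate limiting and retries, which this sketch leaves out.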

Molehill is the REST API of Molescrape. It is an optional component, but it is quite useful if you want to manage Molescrape from other applications via an API, or if Molescrape runs on a server and you want to manage it from your local PC.

CLIme is the command-line interface for Molescrape. It connects to Molehill and lets you manage Molescrape from your command line.