Architecture

Molescrape is structured into several independent components that can easily be combined. You can use a single component on its own or combine several of them.

Some components perform the actual work, while others exist for administration purposes. The task components are:

  • Skyscraper to run targeted spiders

  • Mouldwarp to run broad crawls

  • Calkx to perform post-processing on crawled data

The administrative components are:

  • Molehill to act as a REST API to manage all components

  • molescrape-database: the collection of SQL creation scripts to set up the database (central storage)

Architecture of a Combined System

The following image shows the architecture when Skyscraper (Scraping), Mouldwarp (Crawling), Molehill (API) and the SQL storage are combined.

Architecture overview of the different components of molescrape

Physical Structure of a Combined System

The individual components can run on the same server or on different servers.

In the following architecture diagram, one Skyscraper executor (i.e. the engine running a Skyscraper spider) runs on the same server as the database, while two others run on a second server. The Calkx analysis jobs run on the same server as those two Skyscraper executors, and the Mouldwarp executor (broad crawling) runs on yet another host.

Physical structure of a possible setup of molescrape
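
The placement described above can also be written down as a small data structure, which may help when planning a similar setup. The following is only an illustrative sketch: the host names (`host-a`, `host-b`, `host-c`) and the helper functions are hypothetical and are not part of molescrape; only the component names are taken from the description.

```python
from collections import defaultdict

# Hypothetical host names; the component placement mirrors the description:
# one Skyscraper executor next to the database, two more plus the Calkx
# analysis jobs on a second host, and the Mouldwarp executor on a third.
DEPLOYMENT = [
    ("host-a", "molescrape-database"),
    ("host-a", "Skyscraper executor"),
    ("host-b", "Skyscraper executor"),
    ("host-b", "Skyscraper executor"),
    ("host-b", "Calkx analysis jobs"),
    ("host-c", "Mouldwarp executor"),
]


def components_by_host(deployment):
    """Group component names by the host they run on."""
    grouped = defaultdict(list)
    for host, component in deployment:
        grouped[host].append(component)
    return dict(grouped)


def hosts_running(deployment, component):
    """Return the sorted, de-duplicated hosts that run the given component."""
    return sorted({host for host, name in deployment if name == component})


if __name__ == "__main__":
    for host, components in sorted(components_by_host(DEPLOYMENT).items()):
        print(host, "->", ", ".join(components))
```

Because every component only talks to the shared storage, moving an entry to a different host in such a table should not change the logical architecture, only the physical layout.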

Architecture for a Standalone Skyscraper

The following diagram, on the other hand, shows a setup in which a user runs Skyscraper alone, storing results in a folder instead of an SQL database and without any configuration store. Instead, the user starts the spiders either manually or periodically from a cron job.

Conceptual view of a Skyscraper setup without SQL storage
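
To make the idea of folder-based storage more concrete, here is a minimal sketch of how scraped items could be written to and read back from a plain directory. It is an assumption for illustration only: the functions `store_item` and `load_items`, the one-JSON-file-per-item layout, and the spider name `example_spider` are hypothetical and do not describe Skyscraper's actual storage format.

```python
import json
import tempfile
import uuid
from pathlib import Path


def store_item(folder, spider_name, item):
    """Write one scraped item as a JSON file under <folder>/<spider_name>/."""
    target_dir = Path(folder) / spider_name
    target_dir.mkdir(parents=True, exist_ok=True)
    # A random file name avoids collisions between items and between runs.
    path = target_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(item, ensure_ascii=False), encoding="utf-8")
    return path


def load_items(folder, spider_name):
    """Read back all items previously stored for the given spider."""
    target_dir = Path(folder) / spider_name
    return [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(target_dir.glob("*.json"))
    ]


if __name__ == "__main__":
    # A temporary directory stands in for the real output folder.
    with tempfile.TemporaryDirectory() as folder:
        store_item(folder, "example_spider", {"title": "first page"})
        store_item(folder, "example_spider", {"title": "second page"})
        print(len(load_items(folder, "example_spider")), "items stored")
```

In a setup like this, no database or configuration store is needed; a script that runs a spider and stores its items this way could be started by hand or scheduled from a cron job.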