Installing all Components with Docker

Before using molescrape you have to install the components you actually want to use. Molescrape is very modular and allows you to use most components on their own or in combination with other components. For example, it’s possible to run Skyscraper with or without the PostgreSQL database.

The main components of molescrape are:

  • Database: Central PostgreSQL database to store configuration and also result data

  • Skyscraper: targeted scraping system

  • Mouldwarp: generic crawling system

If you just want to get started with molescrape, we recommend that you create your first spider and do the installation step-by-step.

This section covers the installation of all components with Docker. If you prefer to install the components without Docker, please refer to the individual installation guides for Skyscraper, Mouldwarp, and Molehill.

Database

To install the database, first clone the latest version of the repository with:

git clone https://github.com/molescrape/molescrape-database.git

The repository contains a helper bash script that executes all SQL files needed to set up the latest version of the database schema. Before running it, you have to create a new database and a (possibly dedicated) user for molescrape:

CREATE DATABASE molescrape;
CREATE USER molescrape WITH ENCRYPTED PASSWORD 'some-password';
GRANT ALL PRIVILEGES ON DATABASE molescrape TO molescrape;

This will create a database called molescrape and a user called molescrape with the password some-password. You can choose different names and a different password if you prefer.
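As a rough sketch, the statements above could be run from the shell with psql. The function name create_molescrape_db, the admin user postgres, and the host localhost are assumptions for illustration and are not part of molescrape:

```shell
# Create the molescrape database and user through psql.
# Usage: create_molescrape_db PASSWORD [ADMIN_USER] [HOST]
# PASSWORD must not contain a single quote, since it is pasted into SQL.
create_molescrape_db() {
    password="$1"
    admin="${2:-postgres}"   # assumption: a superuser called "postgres" exists
    host="${3:-localhost}"   # assumption: the server runs on this machine
    # The statements are fed on stdin so that psql sends them one by one;
    # CREATE DATABASE cannot run inside a transaction block.
    psql -v ON_ERROR_STOP=1 -U "$admin" -h "$host" -d postgres <<EOF
CREATE DATABASE molescrape;
CREATE USER molescrape WITH ENCRYPTED PASSWORD '$password';
GRANT ALL PRIVILEGES ON DATABASE molescrape TO molescrape;
EOF
}
```

Calling create_molescrape_db some-password would then reproduce the setup described above.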

Next, you can set up all tables with:

bash postgres/create_tables.sh [username] [database] [host]

For username, database, and host, choose the values that match your installation. For the database setup above, username and database would both be molescrape.
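The clone and schema steps can be combined in a small helper, sketched here under some assumptions: the function name setup_schema is hypothetical, the repository is expected in a molescrape-database directory below the current directory, and the script is assumed to be run from the repository root.

```shell
# Clone the database repository if needed and run the schema helper.
# Usage: setup_schema USERNAME DATABASE HOST
setup_schema() {
    if [ "$#" -ne 3 ]; then
        echo "usage: setup_schema USERNAME DATABASE HOST" >&2
        return 1
    fi
    # Clone only when the checkout is not already present.
    if [ ! -d molescrape-database ]; then
        git clone https://github.com/molescrape/molescrape-database.git || return 1
    fi
    # Run inside a subshell so the caller's working directory is unchanged.
    ( cd molescrape-database && bash postgres/create_tables.sh "$1" "$2" "$3" )
}
```

For the example setup above, the call would be setup_schema molescrape molescrape followed by the host of your PostgreSQL server.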

API

The API (called Molehill) is available as a Docker container. When you run it, you have to provide the connection credentials for PostgreSQL. This can be done either through command-line arguments or in an environment list file.

Create a file molehill.env and insert the connection information there, using the host, database name, user, and password of your own database:

MOLEHILL_DB_HOST=postgres
MOLEHILL_DB_DATABASE=molescrape
MOLEHILL_DB_USERNAME=molescrape
MOLEHILL_DB_PASSWORD=my-secret
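Because this file holds a password, it may be worth creating it with restrictive permissions. The following is only a sketch; the function name write_molehill_env is hypothetical, while the variable names are the ones listed above:

```shell
# Write the Molehill connection settings to an env file.
# Usage: write_molehill_env FILE HOST DATABASE USERNAME PASSWORD
write_molehill_env() {
    if [ "$#" -ne 5 ]; then
        echo "usage: write_molehill_env FILE HOST DATABASE USERNAME PASSWORD" >&2
        return 1
    fi
    (
        # A newly created file gets mode 600; an existing file keeps its mode.
        umask 077
        printf 'MOLEHILL_DB_HOST=%s\nMOLEHILL_DB_DATABASE=%s\nMOLEHILL_DB_USERNAME=%s\nMOLEHILL_DB_PASSWORD=%s\n' \
            "$2" "$3" "$4" "$5" > "$1"
    )
}
```

For example, write_molehill_env molehill.env postgres molescrape molescrape my-secret produces the file shown above.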

Then, pull the container and run it with:

docker pull molescrape/molehill
docker run -d --env-file molehill.env molescrape/molehill

TODO: Do we need --net=host?
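One possible alternative to host networking is a user-defined Docker network, on which containers can reach each other by container name. The sketch below assumes that PostgreSQL itself runs in a container named postgres (matching MOLEHILL_DB_HOST above); the network name molescrape-net and the function name are hypothetical, and this has not been verified against Molehill:

```shell
# Put Molehill and an existing PostgreSQL container on one user-defined network.
# Usage: start_molehill_on_network [NETWORK] [POSTGRES_CONTAINER]
start_molehill_on_network() {
    network="${1:-molescrape-net}"   # hypothetical network name
    pg_container="${2:-postgres}"    # assumption: PostgreSQL runs in this container
    # Create the network only if it does not exist yet.
    docker network inspect "$network" >/dev/null 2>&1 \
        || docker network create "$network" || return 1
    # Attach the database container; this fails if it is already attached.
    docker network connect "$network" "$pg_container" || return 1
    docker run -d --network "$network" --env-file molehill.env molescrape/molehill
}
```

If PostgreSQL runs directly on the host instead, this approach does not apply and MOLEHILL_DB_HOST would need to point at an address reachable from inside the container.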

Skyscraper

Skyscraper is packaged as a Docker container. You can install it by pulling the image from Docker Hub (TODO: docker image not public yet):

docker pull molescrape/skyscraper:0.1.0

To configure Skyscraper, set environment variables in a Docker env file (skyscraper.env.list in the command below).

TODO: Should we describe the required configuration options here to get people started?

Then, you can execute scheduled scrapers with skyscraper:

docker run --rm --env-file skyscraper.env.list molescrape/skyscraper:0.1.0
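A small wrapper can catch a missing env file before Docker is started. This is a minimal sketch; the function name run_skyscraper is hypothetical, and the image tag is the one pulled above:

```shell
# Run Skyscraper once, refusing to start without its env file.
# Usage: run_skyscraper [ENV_FILE]
run_skyscraper() {
    env_file="${1:-skyscraper.env.list}"
    if [ ! -f "$env_file" ]; then
        echo "missing env file: $env_file" >&2
        return 1
    fi
    # --rm removes the container again after the run has finished.
    docker run --rm --env-file "$env_file" molescrape/skyscraper:0.1.0
}
```

Calling run_skyscraper without arguments uses skyscraper.env.list in the current directory.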

Mouldwarp

TODO