Installation

Installing Skyscraper with Docker

Installing with Docker is simple: pull the image from Docker Hub. It comes with all requirements already installed:

docker pull molescrape/skyscraper:0.1.0

Pin a specific version tag where possible so that environments are reproducible. The latest tag also works, but expect it to change frequently, especially while the project is young.

Skyscraper inside the Docker container is configured through environment variables, which you pass to the container with --env or --env-file.
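
As a rough sketch of the --env-file route, the snippet below writes a small environment file and wraps the docker run call in a shell function. The file name skyscraper.env, the function name run_skyscraper, and the variable EXAMPLE_SETTING are placeholders chosen for illustration, not names defined by Skyscraper; substitute the configuration variables your setup actually needs.

```shell
# Write the environment file, one KEY=value pair per line.
# EXAMPLE_SETTING is a hypothetical placeholder, not a real Skyscraper option.
cat > skyscraper.env <<'EOF'
EXAMPLE_SETTING=example-value
EOF

# Start the pinned image and hand it the variables from the file.
# Arguments given to the function are passed on as the container command.
run_skyscraper() {
    docker run --rm --env-file skyscraper.env molescrape/skyscraper:0.1.0 "$@"
}
```

Calling run_skyscraper then starts the container with the variables from the file. For a single value, passing --env EXAMPLE_SETTING=example-value to docker run has the same effect.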

Installing Skyscraper with Ansible (in a Docker environment)

An Ansible role is available that installs the whole Skyscraper environment, with Skyscraper running in a Docker container. You can install it with ansible-galaxy from our GitHub repository using the following requirements.yml:

- name: skyscraper-docker
  src: git+https://github.com/molescrape/ansible-skyscraper-docker.git

Note

The name must be defined manually because ansible-galaxy currently ignores role_name from the role definition when the source is a Git URL.

Install the role with ansible-galaxy install -r requirements.yml.

The role installs the Docker image together with a cron job that runs scheduled spiders.

To deploy the role to a server you need to create a playbook:

- hosts: molescrape-workers
  vars_files:
    - "vars/skyscraper.yml"
  roles:
    - skyscraper-docker

In the file vars/skyscraper.yml you set the configuration variables.
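
Put together, the Ansible steps might look like the sketch below. The file names inventory.ini and playbook.yml and the function name deploy_skyscraper are assumptions made for this example; use whatever inventory and playbook names your project already has.

```shell
# Install the role and apply the playbook to the hosts in the inventory.
# inventory.ini and playbook.yml are hypothetical file names.
deploy_skyscraper() {
    # Fetch the role listed in requirements.yml.
    ansible-galaxy install -r requirements.yml || return 1
    # Run the playbook against the hosts defined in the inventory.
    ansible-playbook -i inventory.ini playbook.yml
}
```

The inventory is expected to define the molescrape-workers group that the playbook targets.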

Installing Skyscraper from Sources

Skyscraper can also be run without Docker. It requires Python 3 and, optionally, Tor and Privoxy.

Clone the repository from GitHub and change into its root directory:

git clone https://github.com/molescrape/skyscraper.git
cd skyscraper

It’s recommended to install Skyscraper into a virtual environment:

virtualenv -p python3 env
source env/bin/activate
pip install ".[all]"

This will install the code from the current directory into the virtual environment. The current directory must contain Skyscraper’s setup.py file, which is located in the root directory of the git repository.

In this snippet we install all optional dependencies (called extras in Python) for Skyscraper with the keyword all in square brackets. You can also select a subset of the extras by replacing all with a comma-separated list of the ones you want. The quotes keep shells such as zsh from treating the square brackets as a glob pattern, e.g.:

pip install ".[aws,redis]"

Installation without any extras is also possible:

pip install .

The available extras are:

  • aws: Connection to AWS for data storage, duplicates, spiders etc.

  • mqtt: Data output to MQTT

  • redis: Data output and item count in Redis

If you want to use Tor, you need to set up Tor and an HTTP-to-SOCKS proxy such as Privoxy on your machine.

To configure Skyscraper, you can use either a .env file in the project directory or environment variables on the host.
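
The two options can be sketched as follows. EXAMPLE_SETTING is a hypothetical placeholder rather than a real Skyscraper option, and the second half is only one possible way to turn the same file into host environment variables.

```shell
# Option 1: a .env file in the project directory (overwrites an existing .env).
# EXAMPLE_SETTING is a hypothetical placeholder, not a real Skyscraper option.
cat > .env <<'EOF'
EXAMPLE_SETTING=example-value
EOF

# Option 2: environment variables on the host.
# While "set -a" is active, every variable assigned is exported,
# so sourcing the file exports each KEY=value pair it contains.
set -a
. ./.env
set +a
```

Sourcing only works for files that are valid shell syntax, so values containing spaces or special characters need quoting in the file.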