First Scraper

In this tutorial we will create our first scraper for Skyscraper. Every spider you create has to emit a BasicItem with a unique id field. Based on the id field Skyscraper decides whether an item has already been crawled and, if so, does not add it to your scraped items again. Thus, id can be used for deduplication.

Note

The deduplication based on item ID differs from scrapy’s default duplicate filter in two ways: First, scrapy’s duplicate filter checks requests for duplicates, not items. Second, scrapy’s filter only removes duplicate URLs within the current crawl, while Skyscraper’s duplicate ID filter is persistent across crawls.
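
As an illustration of how such a persistent filter can be used, the id of an item could be derived from something that identifies the content itself, for example the URL of an article. The helper below is only a hypothetical sketch and not part of Skyscraper:

import hashlib

def make_item_id(url):
    # The same URL always yields the same id, so once the item has been
    # stored, later crawls of the same article will be filtered out.
    return 'example-article-' + hashlib.sha1(url.encode('utf-8')).hexdigest()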

Running the Sample Spider

The Skyscraper docker image comes with a prepared sample spider and is configured to read spiders from a folder inside the container and to write all results to another folder. This means we can run our first spider without having to set up a database.

For this, we just need to mount a local folder into the container and run the existing spider. The sample spider is called example and is located in the namespace example. If you have not done so before, pull skyscraper from Docker Hub now:

docker pull molescrape/skyscraper:0.1.0
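
You can verify that the image is now available on your machine:

docker images molescrape/skyscraper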

Then, you can execute the sample spider:

mkdir $HOME/skyscraper-data
docker run --rm -v $HOME/skyscraper-data:/opt/skyscraper-data \
    molescrape/skyscraper:0.1.0 crawl-manual example example

This will map the local folder $HOME/skyscraper-data to the container folder /opt/skyscraper-data. The spider will write all item results to this folder. After the spider run has finished (it will only retrieve one result), you can see the file in your local folder.
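
The exact file name depends on the configured storage backend, so the easiest way to find the result is to list the folder recursively:

ls -R $HOME/skyscraper-data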

Creating your own Spider

Let’s create our own spider that retrieves the main page of Wikipedia and emits the HTML source of the page. We want to check the page once per day, so in this case we include the current date in the id field. This makes sure that the index page of Wikipedia is saved once per day.

Note

The default configuration of Skyscraper does not have duplicate checks on the id field enabled. Thus, in this case it would not matter which value you set for id. However, once you configure a data storage for Skyscraper, it will be important.

We also set the url field so that we can find the website an item came from later. In this case it is not of much use, because we know exactly which website we visited, but for larger projects it is very useful.

import scrapy
import datetime

from skyscraper.items import BasicItem


class WikipediaSpider(scrapy.Spider):
    name = 'wikipedia'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Main_Page']

    def parse(self, response):
        today = datetime.datetime.today().strftime('%Y-%m-%d')
        item = BasicItem()
        item['id'] = 'wikipedia-indexpage-{}'.format(today)
        item['url'] = response.url
        item['source'] = response.text
        return item

Save this code to $HOME/skyscraper-spiders/tutorial/wikipedia.py.
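
If the tutorial folder does not exist yet, create it first:

mkdir -p $HOME/skyscraper-spiders/tutorial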

The docker container in its current configuration reads all spiders from a folder. Each spider must be placed in a folder named after its namespace; in this example the namespace will be tutorial. This means we can again map a folder into the container and allow it to use our new spider:
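
With the spider from the previous step in place, the folder structure on the host looks like this:

$HOME/skyscraper-spiders/
└── tutorial/
    └── wikipedia.py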

docker run --rm -v $HOME/skyscraper-data:/opt/skyscraper-data \
    -v $HOME/skyscraper-spiders:/opt/skyscraper-spiders \
    molescrape/skyscraper:0.1.0 crawl-manual tutorial wikipedia

When this spider has finished you will see the result in $HOME/skyscraper-data.

Scheduled Spider with the Database

In the previous examples we had to start the spider manually with the command crawl-manual. However, usually you want your spiders to be executed automatically. Spider schedules can be managed once you have set up the PostgreSQL database.

We also have to tell Skyscraper to read spiders from the database instead of from the filesystem. This is done with an environment variable that we set when starting the docker container. We also have to specify the connection details for the PostgreSQL database:

SPIDER_LOADER_CLASS=skyscraper.spiderloader.PostgresSpiderLoader
POSTGRES_CONNSTRING="host=postgres user=molescrape dbname=molescrape password=my-secret"

Please adjust host, user, dbname and password to the correct settings for your database.
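
Later in this tutorial we will pass these settings to the Skyscraper container with --env-file, so it is convenient to collect them in a file, for example skyscraper.env.list (the values shown are placeholders for your own settings):

SPIDER_LOADER_CLASS=skyscraper.spiderloader.PostgresSpiderLoader
POSTGRES_CONNSTRING=host=postgres user=molescrape dbname=molescrape password=my-secret

Note that docker does not run a shell over env files, so the connection string is written without surrounding quotes.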

While this would already be enough to set up a scheduled spider in the database, you probably do not want to connect to PostgreSQL directly each time you want to make a change. Thus, we also set up molehill. molehill is the REST API frontend for molescrape and is available as a docker container. Just like for Skyscraper, you have to specify the PostgreSQL connection details for molehill:

docker pull molescrape/molehill
docker run -d --env MOLEHILL_DB_HOST=postgres \
    --env MOLEHILL_DB_DATABASE=molescrape \
    --env MOLEHILL_DB_USERNAME=molescrape \
    --env MOLEHILL_DB_PASSWORD=my-secret \
    -p 127.0.0.1:9000:8080 \
    molescrape/molehill

This starts the molehill container in the background. It listens on port 9000 of the host and forwards connections to port 8080 inside the container. Just as before, please specify the correct credentials for your SQL server.
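
If the container starts but requests to the API fail, a common cause is a wrong SQL connection setting. You can inspect the logs of the running container to investigate:

docker ps
docker logs <container-id>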

Most people will probably want to run their own web server (like nginx) in front of it and proxy all requests to the container. If you want to publish the port directly instead, omit the 127.0.0.1 prefix from the port binding; the API will then be reachable from anywhere.

You can then check your connection by visiting http://[your-host]:9000/v1/projects. If everything works, your browser should prompt you for login credentials. If you see an empty page and no Basic Auth dialog (and a status code 500), then something is wrong.
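
You can run the same check from the command line; myuser and mypassword are placeholders for your molehill login credentials:

curl -i -u myuser:mypassword http://[your-host]:9000/v1/projects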

Now we need to upload this spider to Skyscraper. For this, you can either use the command line utility Clime or send an HTTP request to the API from any programming language you like. For this tutorial we will use Clime. The command to create a new spider is clime spider create:

./clime spider create --project tutorial --name wikipedia \
    --recurrency-minutes 240 --code wikipedia.py

Now you can manually start the scraper with the docker image from before by running:

docker run --rm --env-file skyscraper.env.list \
    molescrape/skyscraper:0.1.0 crawl-manual tutorial wikipedia