First Calkx Job

Items collected by spiders usually have to be processed before they can be inserted into a database (e.g. the data has to be cleaned or restructured to fit a relational schema). Calkx handles this processing step.

Calkx jobs are defined in a git repository as a Python code file accompanied by a YAML configuration file. The Python file has to contain one class that extends calkx.jobs.Job and implements a method process_item.

Let’s create a Calkx job that reads the title from a web page and writes it to stdout:

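from lxml import etree

import calkx.jobs


class ExampleJob(calkx.jobs.Job):
    def process_item(self, item):
        # Web pages are rarely well-formed XML, so use lxml's lenient
        # HTML parser instead of the strict XML parser.
        tree = etree.HTML(item['source'])
        titles = tree.xpath('/html/head/title/text()')
        # Skip pages without a <title> instead of raising an IndexError.
        if titles:
            print(titles[0])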
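
To try the extraction logic outside of Calkx, without installing lxml first, the following sketch uses Python's standard-library html.parser as a stand-in for lxml. The item dictionary with a 'source' key mirrors the job above; the names TitleParser and extract_title are hypothetical and not part of Calkx.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        text = "".join(self._parts).strip()
        return text or None


def extract_title(source):
    """Return the page title, or None if the page has no <title>."""
    parser = TitleParser()
    parser.feed(source)
    parser.close()
    return parser.title


# A hand-written item standing in for what a spider would deliver (assumption).
item = {"source": "<html><head><title>Example page</title></head><body></body></html>"}
print(extract_title(item["source"]))
```

Keeping the extraction in a plain function like this makes it easy to check against a saved page before wiring it into process_item.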

Calkx does not ship with lxml. To use lxml in this job, we have to list it in the job’s YAML configuration file:

enabled: true
spiders:
  - wikipedia
requires_packages:
  - lxml

This configuration file tells Calkx that the job is enabled and that it should read the result data from the spider wikipedia. A Calkx job can only read data from Skyscraper spiders in its own project.

During setup of the Calkx job this configuration will also trigger installation of the lxml package from PyPI.

If you followed the Skyscraper tutorial before and created a spider in the namespace tutorial, create the following folder structure in a new git repository (e.g. in /tmp/calkx-jobs-git):

.
+- calkx-jobs/
|  +- tutorial/
|  |  +- pagetitles.py
|  |  +- pagetitles.yaml
With this repository structure we create a job pagetitles in the namespace tutorial (the same namespace that contains the wikipedia spider).

Next, create the temporary working folder and define the environment variables:

mkdir /tmp/calkx-jobs
export CALKX_GIT_REPOSITORY=/tmp/calkx-jobs-git
export CALKX_GIT_WORKDIR=/tmp/calkx-jobs
export CALKX_GIT_SUBFOLDER=''
export CALKX_GIT_BRANCH=master
export CALKX_PROMETHEUS_METRICS_PORT=8001
export CALKX_SPIDER_ITEMS_PATH=/tmp/skyscraper-items
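
Before starting Calkx it can help to confirm that the shell actually exported these variables. The following is a minimal sketch of such a check; the variable names are taken from the block above, while the helper missing_variables is hypothetical and the assumption that Calkx needs all six variables is not confirmed by this tutorial.

```python
import os

# Variable names as exported in this tutorial.
REQUIRED_VARIABLES = [
    "CALKX_GIT_REPOSITORY",
    "CALKX_GIT_WORKDIR",
    "CALKX_GIT_SUBFOLDER",
    "CALKX_GIT_BRANCH",
    "CALKX_PROMETHEUS_METRICS_PORT",
    "CALKX_SPIDER_ITEMS_PATH",
]


def missing_variables(environ=os.environ):
    """Return the names from REQUIRED_VARIABLES that are not defined at all."""
    # CALKX_GIT_SUBFOLDER is intentionally empty in this tutorial, so only
    # absence counts as missing, not an empty value.
    return [name for name in REQUIRED_VARIABLES if name not in environ]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("All tutorial variables are set.")
```

Run the script in the same shell session in which the variables were exported, since exports do not carry over to other terminals.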

With this configuration, Calkx will set up watchdogs for file changes in /tmp/skyscraper-items when it starts. Since our Calkx job belongs to the project tutorial and we have set spiders: ['wikipedia'] in the job configuration file, the watchdog will monitor the folder /tmp/skyscraper-items/tutorial/wikipedia for new files.
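
The folder layout described here, items path followed by project and spider name, can be sketched as below. This only illustrates how the monitored paths are composed and lets you check that they exist; it is not Calkx's actual implementation, and the helper watched_folders is a hypothetical name.

```python
import os
from pathlib import Path


def watched_folders(items_path, project, spiders):
    """Build <items_path>/<project>/<spider> for every configured spider."""
    base = Path(items_path)
    return [base / project / spider for spider in spiders]


# Fall back to the tutorial's path if the variable is not exported.
items_path = os.environ.get("CALKX_SPIDER_ITEMS_PATH", "/tmp/skyscraper-items")

for folder in watched_folders(items_path, "tutorial", ["wikipedia"]):
    status = "exists" if folder.is_dir() else "missing"
    print(f"{folder}: {status}")
```

If the folder is reported as missing, the wikipedia spider has probably not written any items yet.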

Now, start Calkx with:

calkx

TODO: Describe how the user can delete the duplicate filter, or create a wikipedia spider that crawls the Wikipedia page more often than once per day.