Ideas List

Declarative API

This is a working draft for rebuilding molescrape into a declarative scraping and post-processing platform. This concept might allow us to remove the whole PostgreSQL database, because all configuration would be provided along with the code in YAML files.

Skyscraper spiders will be deployed through a git repository and configured through YAML files next to the spiders. The folder structure will remain as it is now, with one folder per project. A permission system will not be implemented at first; instead we rely on one V-server per permission group. A sample repository might look like this:

my-project/
|-- myspider.py (code of the spider)
|-- myspider.yml (configuration of the spider)
|-- some_external_lib/ (optional helper scripts that a spider can include with this new concept)
|   |-- some_complex_algorithm.py
|-- otherspider.py
|-- otherspider.yml

second-project/
|-- myspider.py
|-- myspider.yml

This structure allows the user to manage spiders and spider configurations in a git repository and to deploy them with a git commit, either to the master branch or to a specific branch.
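
For illustration, a spider configuration file could look roughly like the following sketch. All field names here are assumptions for this draft; nothing about the schema is final.

    # myspider.yml -- hypothetical configuration, field names are not final
    spider: myspider
    schedule: "0 3 * * *"        # e.g. run the crawl once per day at 03:00
    start_urls:
      - https://example.com/
    options:
      requests_per_day: 1000     # see the request limit idea further below
      drop_percentage: 90        # see the drop pipeline idea further below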

The current state of the scraping system will be kept in the RAM of skyscraper (which will become an always-on service running in the background, e.g. through systemd or in a docker container). This service can create subprocesses to execute the actual crawls (calling an entrypoint like skyscraper-executor or similar).
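
A minimal sketch of how such a service could work is shown below; the skyscraper-executor entrypoint, its arguments and the scheduling interval are placeholders for this draft, not a fixed design.

    # Sketch: an always-on scheduler that keeps its state in memory and
    # runs each crawl in a separate subprocess. All names are placeholders.
    import subprocess
    import time

    class SchedulerState:
        """In-memory state of the scraping system (no PostgreSQL involved)."""
        def __init__(self):
            self.last_run = {}  # spider name -> timestamp of the last crawl

    def run_due_spiders(state, spiders, interval_seconds=86400):
        for name in spiders:
            if time.time() - state.last_run.get(name, 0) >= interval_seconds:
                # hypothetical entrypoint, e.g. installed as a console script
                subprocess.Popen(["skyscraper-executor", "--spider", name])
                state.last_run[name] = time.time()

    if __name__ == "__main__":
        state = SchedulerState()
        while True:
            run_due_spiders(state, ["my-project/myspider", "my-project/otherspider"])
            time.sleep(60)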

If there is a need for scalability in the future, communication between skyscraper instances could be handled through ZeroMQ or something similar, but for now we will focus on single-instance deployments.

Calkx will also be configured through YAML files, and code will be committed through git. This allows better structuring of code than a Jupyter notebook (which makes it difficult to split code into multiple files and also difficult to install additional requirements).

Skyscraper

If you want to contribute, we currently have the following ideas that need to be implemented for Skyscraper:

  • logging support for crawled URLs, including the referer, to allow debugging of potential crawling loops

  • drop x percent pipeline: A pipeline that drops a certain percentage of items but still marks the URLs as crawled (can be used to comply with European database protection law by only storing a small percentage of the data); a minimal sketch follows after this list

  • number of requests per day limitation: Limit a spider to a maximum number of requests per day
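
The following is a minimal sketch of the drop-x-percent pipeline idea. It is written against a Scrapy-style pipeline interface as an assumption; the final Skyscraper pipeline API may look different, and the percentage value is only an example.

    # Sketch of a drop-x-percent pipeline, assuming a Scrapy-style
    # pipeline interface; the real Skyscraper interface may differ.
    import random

    from scrapy.exceptions import DropItem

    class DropPercentagePipeline:
        """Drops a configurable percentage of items. Because the drop
        happens after the page was fetched, the URL still counts as
        crawled and will not be requested again."""

        def __init__(self, drop_percentage=90):
            self.drop_percentage = drop_percentage

        def process_item(self, item, spider):
            if random.uniform(0, 100) < self.drop_percentage:
                raise DropItem("randomly dropped to store only a sample")
            return item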

Calkx

We also need to think about what Calkx should become and which features it should implement. At the moment Skyscraper works pretty well, but the following processes are still quite undefined. Currently we can see that we need the following solutions (ideally all inside Calkx, because we should not develop too many components):

  • further extraction of features from Skyscraper results (especially if Skyscraper outputs HTML source, but maybe also in other situations); see the sketch after this list

  • visual data exploration to get an insight into what we crawled, what the data looks like etc.

  • chart generation for our presentations (both blog posts and project websites)
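
As an illustration of the first point, feature extraction from stored HTML results could look roughly like the sketch below. The input format (JSON lines with url and html fields) and the extracted features are assumptions for this draft.

    # Sketch: extracting simple features from crawled HTML results.
    # The input format (JSON lines with "url" and "html" fields) is an
    # assumption; the real Skyscraper output format may differ.
    import json

    from bs4 import BeautifulSoup

    def extract_features(path):
        with open(path) as f:
            for line in f:
                item = json.loads(line)
                soup = BeautifulSoup(item["html"], "html.parser")
                yield {
                    "url": item.get("url"),
                    "title": soup.title.string if soup.title else None,
                    "num_links": len(soup.find_all("a")),
                }

    if __name__ == "__main__":
        for features in extract_features("results.jsonl"):
            print(features)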