Configuration

PostgreSQL

All PostgreSQL components require the connection string:

  • POSTGRES_CONNSTRING: The connection string used to connect to the PostgreSQL database, e.g. host=localhost user=username password=secret dbname=molescrape
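
As an illustration, a component can use this variable roughly as follows. This is only a sketch; it assumes the psycopg2 driver, which accepts libpq-style connection strings like the one above.

    import os
    import psycopg2

    # POSTGRES_CONNSTRING holds a libpq-style connection string, e.g.
    # "host=localhost user=username password=secret dbname=molescrape".
    conn = psycopg2.connect(os.environ["POSTGRES_CONNSTRING"])
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        print(cur.fetchone())
    conn.close()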

AWS

If you use AWS components, you have to set the AWS credentials (see the sketch after this list):

  • AWS_ACCESS_KEY: The AWS access key ID.

  • AWS_SECRET_ACCESS_KEY: The AWS secret access key.
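
For illustration only, a sketch of how these credentials could be handed to boto3. Note that boto3 itself looks for AWS_ACCESS_KEY_ID in the environment, so the values are passed explicitly here.

    import os
    import boto3

    # Pass the credentials from Skyscraper's environment variables
    # explicitly, since boto3 expects the name AWS_ACCESS_KEY_ID.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=os.environ["AWS_ACCESS_KEY"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )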

Pipelines

It’s possible to enable or disable many of the included pipeline steps (see the sketch after this list):

  • PIPELINE_USE_DUPLICATESFILTER_DYNAMODB: Enable the AWS DynamoDB-based duplicate filter.

  • PIPELINE_USE_OUTPUT_S3: Write scraped items to AWS S3.

  • PIPELINE_USE_OUTPUT_POSTGRES: Write scraped items to the PostgreSQL database.

  • PIPELINE_USE_OUTPUT_FOLDER: Write scraped items to a folder.

  • PIPELINE_USE_ITEMCOUNT_POSTGRES: Write the number of scraped items to PostgreSQL. This is used for plausibility checks on the number of scraped items.
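
How Skyscraper parses these flags internally is not documented here; the following sketch shows a common pattern for boolean environment variables, where the set of accepted truthy values is an assumption.

    import os

    def env_flag(name):
        # Hypothetical helper: treat "1", "true" and "yes" as enabled;
        # the accepted values are an assumption, not Skyscraper's parser.
        return os.environ.get(name, "").strip().lower() in ("1", "true", "yes")

    if env_flag("PIPELINE_USE_OUTPUT_FOLDER"):
        print("folder output enabled")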

If you use folder output, you also have to set the target folder for your scraped data:

  • SKYSCRAPER_STORAGE_FOLDER_PATH: Target folder for scraped data when folder output is used.
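
A minimal sketch of what folder output amounts to; the file layout (one JSON file per item, named by a hypothetical item id) is an assumption for illustration.

    import json
    import os

    def store_item(item, item_id):
        # Hypothetical layout: one JSON file per item inside the
        # configured storage folder.
        folder = os.environ["SKYSCRAPER_STORAGE_FOLDER_PATH"]
        path = os.path.join(folder, "%s.json" % item_id)
        with open(path, "w") as f:
            json.dump(item, f)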

If you use S3 output, you also have to set the target bucket:

  • S3_DATA_BUCKET: The AWS S3 bucket to which all scraped data will be written.
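
Analogously, a sketch of S3 output with boto3; the object key scheme is again an assumption.

    import json
    import os
    import boto3

    def store_item_s3(item, item_id):
        # Hypothetical key scheme: one JSON object per item.
        s3 = boto3.client("s3")
        s3.put_object(
            Bucket=os.environ["S3_DATA_BUCKET"],
            Key="%s.json" % item_id,
            Body=json.dumps(item).encode("utf-8"),
        )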

Spider Loaders

There are different spider loaders to choose from, each with its own options.

  • SPIDER_LOADER_CLASS: Can be set to either skyscraper.spiderloader.FolderSpiderLoader or skyscraper.spiderloader.PostgresSpiderLoader (see the sketch after this list).

  • SPIDERS_FOLDER: If you use the folder spider loader, this option defines the base folder from which spiders should be loaded.
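
SPIDER_LOADER_CLASS is a dotted Python path. Such settings are commonly resolved by importing the module and looking up the class, roughly as in this sketch; the fallback default shown here is an assumption for illustration.

    import importlib
    import os

    # The fallback default is an assumption, not a documented default.
    path = os.environ.get(
        "SPIDER_LOADER_CLASS", "skyscraper.spiderloader.FolderSpiderLoader"
    )
    # Split "package.module.ClassName" into module and class, then import.
    module_name, class_name = path.rsplit(".", 1)
    loader_cls = getattr(importlib.import_module(module_name), class_name)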

Scheduling

For the custom PostgreSQL scheduler the following configuration variables exist:

  • SCHEDULER: Must be set to skyscraper.scheduler.PostgresScheduler to enable the PostgreSQL scheduler. Leave it empty to use Scrapy’s default scheduler.

  • SCHEDULER_POSTGRES_BATCH_SIZE: The number of requests that are kept in memory and crawled in one run of Skyscraper. All other requests are stored in the PostgreSQL backlog and crawled later (see the sketch below).
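
To illustrate the batching behavior, a rough sketch; the default batch size used here is an assumption.

    import os

    # The default of 100 is an assumption for illustration.
    batch_size = int(os.environ.get("SCHEDULER_POSTGRES_BATCH_SIZE", "100"))

    def split_requests(requests):
        # Keep the first batch_size requests in memory for this run;
        # everything else would go to the PostgreSQL backlog.
        return requests[:batch_size], requests[batch_size:]

    in_memory, backlog = split_requests(list(range(250)))
    print(len(in_memory), len(backlog))  # -> 100 150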

Anonymization

To enable general TOR support, set this flag:

  • TOR_ENABLED: Enables TOR support for this container. This does not mean that the spiders will actually use TOR; TOR usage has to be enabled for each spider individually.

Mail

The plausibility check on the number of scraped items sends e-mail notifications. The following options configure e-mail (see the sketch after this list):

  • MAIL_SERVER: The SMTP server used to send the mails.

  • MAIL_USER: The username for logging in to the SMTP server.

  • MAIL_PASSWORD: The password for logging in to the SMTP server.

  • MAIL_FROM: The sender address for the notification mails.
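
Conceptually, the notification mail is sent over SMTP with these settings; the recipient, subject, and body in this sketch are placeholders.

    import os
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["From"] = os.environ["MAIL_FROM"]
    msg["To"] = "admin@example.com"  # placeholder recipient
    msg["Subject"] = "Skyscraper: implausible item count"  # placeholder
    msg.set_content("The number of scraped items looks implausible.")

    # Real deployments may additionally need STARTTLS or SMTPS.
    with smtplib.SMTP(os.environ["MAIL_SERVER"]) as smtp:
        smtp.login(os.environ["MAIL_USER"], os.environ["MAIL_PASSWORD"])
        smtp.send_message(msg)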

Logging and Monitoring

  • SKYSCRAPER_LOGLEVEL: The minimum log level that Skyscraper outputs on stdout (default: ERROR).

  • STATS_CLASS: Set this option to skyscraper.statscollectors.PostgresStatsCollector if you use PostgreSQL and want to collect crawling statistics in the SQL database.

Skyscraper can monitor the execution of spiders with pidar.eu. For this, the base URL for your pidar aliases has to be defined (including your username).

  • PIDAR_URL: The base URL at pidar.eu for monitoring. Skyscraper will append the namespace and spider name to the URL. If you create an alias with this name at pidar, you will receive notifications from Skyscraper.
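
Since Skyscraper appends the namespace and spider name to PIDAR_URL, the monitoring request presumably looks like the following sketch; the exact path format is an assumption.

    import os
    import requests

    namespace = "mynamespace"    # example values
    spider_name = "myspider"

    # Assumed format: base URL + "/" + namespace + "/" + spider name.
    url = "%s/%s/%s" % (
        os.environ["PIDAR_URL"].rstrip("/"), namespace, spider_name
    )
    requests.get(url, timeout=10)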