Scheduling

Skyscraper spiders are executed either a single time or at fixed, recurring intervals. We call spiders that are executed only once one-time spiders and spiders that run at fixed intervals recurrent spiders.

The most common use case for a one-time spider is fetching a specific piece of information; once you have it, the job is done. Another use case is fetching all existing items from a website.

A recurrent spider would then be used to keep your index up to date, i.e. to fetch all newly created items. How often the recurrent spider has to run depends on the website, its update frequency and how many pages you are willing to fetch in one go. For some websites a recurrence interval of one day is enough, for others you need ten minutes.

Splitting Long Scrapes into Chunks

Sometimes you will have scrapes that span hundreds of thousands of URLs. In these cases you might not want to let the crawler run for days or weeks, but instead split the work into smaller tasks. This is possible with a scheduler that ingests only a limited number of URLs into memory and offloads all remaining URLs to PostgreSQL.

That way, small crawl projects still work as before. For big crawls, however, the crawler is stopped after some time and can be continued later from the backlog of stored requests.
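Conceptually, the batching behaviour can be sketched as follows. This is an illustration only, not Skyscraper's actual scheduler code; the backlog table, its columns and all names in the snippet are assumptions, and conn stands for any DB-API connection to PostgreSQL (e.g. one obtained from psycopg2.connect()).

# Conceptual sketch only -- not Skyscraper's actual scheduler implementation.
# Assumes a PostgreSQL table backlog(id serial primary key, spider text, url text).
class BatchingBacklog:
    """Keep at most `batch_size` URLs in memory, offload the rest to PostgreSQL."""

    def __init__(self, conn, spider_name, batch_size=1000):
        self.conn = conn
        self.spider_name = spider_name
        self.batch_size = batch_size
        self.pending = []  # requests processed in the current run

    def add(self, url):
        if len(self.pending) < self.batch_size:
            self.pending.append(url)  # stays in memory for this run
        else:
            # Everything beyond the batch size goes into the backlog table.
            with self.conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO backlog (spider, url) VALUES (%s, %s)",
                    (self.spider_name, url),
                )
            self.conn.commit()

    def next_batch(self):
        """Pop up to `batch_size` stored URLs for a later backlog run."""
        with self.conn.cursor() as cur:
            cur.execute(
                "DELETE FROM backlog WHERE id IN ("
                " SELECT id FROM backlog WHERE spider = %s LIMIT %s"
                ") RETURNING url",
                (self.spider_name, self.batch_size),
            )
            urls = [row[0] for row in cur.fetchall()]
        self.conn.commit()
        return urls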

To enable this scheduler set the following configuration:

SCHEDULER='skyscraper.scheduler.PostgresScheduler'
SCHEDULER_POSTGRES_BATCH_SIZE=1000

This allows your spiders to process at most 1000 requests in one run. All other requests are stored in PostgreSQL and can be crawled later.

TODO: Describe which cli commands should/can be used

Warning

If you implement a constructor for your custom spider, it is important that you call the parent constructor and pass *args and **kwargs to it. Otherwise, we will not be able to schedule a backlog crawl with empty start_urls. This would mean that your crawler always starts with its regular start_urls again and thus never crawls the backlog.
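A minimal sketch of such a constructor, assuming your spider subclasses scrapy.Spider; the spider name and the category argument are placeholders:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com/']

    def __init__(self, category=None, *args, **kwargs):
        # Forward *args and **kwargs so the scheduler can construct the
        # spider with an overridden (empty) start_urls for backlog runs.
        super().__init__(*args, **kwargs)
        self.category = category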

Warning

All callbacks and errbacks of your spider must be methods of your spider class. Otherwise, the backlog crawling mechanism of the scheduler will not work correctly. This limitation may be lifted in the future.
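For example (illustrative names only), both the callback and the errback below are methods referenced via self, rather than module-level functions:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Callback and errback are methods of the spider class.
        yield scrapy.Request(
            response.urljoin('/items'),
            callback=self.parse_item,
            errback=self.handle_error,
        )

    def parse_item(self, response):
        yield {'url': response.url}

    def handle_error(self, failure):
        self.logger.error(repr(failure))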