Scraping Engines (Scrapy or Chrome Headless)

Skyscraper supports two scraping engines: you can scrape websites either with Scrapy spiders or with Chrome headless spiders.

Scrapy spiders have the advantage that they are portable to other Scrapy-based scraping systems (e.g. your local PC running Scrapy, or Scrapinghub). Chrome headless, on the other hand, executes JavaScript and can therefore be used in situations that Scrapy cannot handle.

To select the engine you must:

  • set the engine setting in your spider’s configuration file to the engine you want to use

  • implement the correct class in your spider’s code file (either a scrapy.Spider or a skyscraper.spiders.ChromeSpider).
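The pairing between the engine setting and the spider base class can be summarised as a small lookup. This is only an illustrative sketch: the helper expected_base_class is hypothetical and not part of Skyscraper, and the setting value scrapy is an assumption (this document only shows engine: chrome explicitly).

```python
# Hypothetical helper, not part of Skyscraper: maps the `engine` setting
# to the base class that the spider code is expected to subclass.
EXPECTED_BASE_CLASS = {
    "scrapy": "scrapy.Spider",  # assumed setting value
    "chrome": "skyscraper.spiders.ChromeSpider",
}


def expected_base_class(config):
    """Return the dotted name of the base class for a spider configuration."""
    engine = config.get("engine")
    if engine not in EXPECTED_BASE_CLASS:
        raise ValueError(f"unknown or missing engine setting: {engine!r}")
    return EXPECTED_BASE_CLASS[engine]
```

If the configuration and the class in the code file disagree, the spider is written for a different engine than the one selected, so both places need to be changed together.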

Scrapy Spiders

Scrapy spiders are standard Scrapy spiders, i.e. subclasses of scrapy.Spider.

Chrome Headless Spiders

A Chrome headless spider is a subclass of skyscraper.spiders.ChromeSpider. It must include:

  • a name attribute with the spider’s name

  • a start_urls attribute with a list of URLs to be used to start the crawl

  • an async method parse that implements the actual scraping logic
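The three requirements above can be checked mechanically before a spider is deployed. The following is a rough sketch using only the standard library. The function chrome_spider_problems is hypothetical and not part of Skyscraper, and StandInSpider is a plain class used in place of a real skyscraper.spiders.ChromeSpider subclass so that the snippet runs without Skyscraper installed.

```python
import inspect


def chrome_spider_problems(spider_cls):
    """Return a list of unmet requirements (empty if all three are met)."""
    problems = []
    if not isinstance(getattr(spider_cls, "name", None), str):
        problems.append("missing name attribute")
    if not isinstance(getattr(spider_cls, "start_urls", None), (list, tuple)):
        problems.append("missing start_urls list")
    if not inspect.iscoroutinefunction(getattr(spider_cls, "parse", None)):
        problems.append("parse must be declared with 'async def'")
    return problems


class StandInSpider:  # stands in for a ChromeSpider subclass
    name = "example"
    start_urls = ["http://example.com/"]

    async def parse(self, page, response):
        return None


problems = chrome_spider_problems(StandInSpider)
```

A common mistake when porting a Scrapy spider is to keep parse as a regular method. The check on inspect.iscoroutinefunction catches that case.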

To mark that your spider should be executed with the Chrome headless execution engine, add the setting engine: chrome to your YAML configuration file, as in the following example:

engine: chrome
enabled: true
recurrence_minutes: 15

The following code is an example of a spider that extracts the source code from example.com:

import skyscraper.spiders
from skyscraper.items import BasicItem  # assumed import path for BasicItem


class ExampleSpider(skyscraper.spiders.ChromeSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # page is the pyppeteer page object, already navigated to the URL.
    # response is the result of the call to page.goto().
    # Use page to query data from the page; use response to get the
    # URL and the source code of the page.
    async def parse(self, page, response):
        item = BasicItem()
        item['id'] = 'example.com-indexpage'
        item['url'] = response.url
        item['source'] = await response.text()
        return item
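Because parse is a coroutine, it has to be awaited rather than called directly. The sketch below shows one way to exercise the same parse logic offline, without Chrome or Skyscraper. FakeResponse, OfflineExampleSpider, and run_parse are hypothetical stand-ins, and a plain dict replaces BasicItem. How Skyscraper itself schedules the coroutine is not shown here.

```python
import asyncio


class FakeResponse:
    """Stand-in exposing only what the example reads: url and text()."""

    def __init__(self, url, body):
        self.url = url
        self._body = body

    async def text(self):
        return self._body


class OfflineExampleSpider:
    # Same parse logic as the example above, with a dict instead of BasicItem.
    async def parse(self, page, response):
        item = {}
        item["id"] = "example.com-indexpage"
        item["url"] = response.url
        item["source"] = await response.text()
        return item


def run_parse(spider, response):
    # This parse method never touches page, so None is passed for it.
    return asyncio.run(spider.parse(None, response))


item = run_parse(
    OfflineExampleSpider(),
    FakeResponse("http://example.com/", "<html></html>"),
)
```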