Scraping Engines (Scrapy or Chrome Headless)

Skyscraper supports two scraping engines: you can scrape websites either with Scrapy spiders or with Chrome headless spiders.

Scrapy spiders have the advantage of being portable to other Scrapy-based scraping systems (e.g. Scrapy running on your local PC, or Scrapinghub). Chrome headless, on the other hand, executes JavaScript and can therefore be used in situations that Scrapy cannot handle.

To select the engine, you must:

  • set the engine setting in your spider’s configuration file to the matching value

  • implement the matching class in your spider’s code file (either a scrapy.Spider or a skyscraper.spiders.ChromeSpider)

The following is an example YAML configuration for a Chrome headless spider:

engine: chrome
enabled: true
recurrence_minutes: 15
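
The engine setting and the spider's base class have to agree. The pairing can be sketched as a small lookup, shown here in Python. The helper name expected_base_class is hypothetical and not part of Skyscraper, and the value "scrapy" for the Scrapy engine is an assumption, since only "chrome" appears in the configuration above.

```python
# Hypothetical helper for illustration; not part of Skyscraper.
EXPECTED_BASE_CLASS = {
    "scrapy": "scrapy.Spider",  # assumption: "scrapy" selects the Scrapy engine
    "chrome": "skyscraper.spiders.ChromeSpider",
}


def expected_base_class(config):
    """Return the dotted name of the base class matching config['engine']."""
    engine = config.get("engine")
    if engine not in EXPECTED_BASE_CLASS:
        raise ValueError("unknown engine: {!r}".format(engine))
    return EXPECTED_BASE_CLASS[engine]


# The YAML configuration above, written as the dict a YAML parser would produce.
config = {"engine": "chrome", "enabled": True, "recurrence_minutes": 15}
print(expected_base_class(config))
```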

To implement a Chrome headless spider, subclass skyscraper.spiders.ChromeSpider. Your spider must have:

  • a name attribute with the spider’s name

  • a start_urls attribute with a list of URLs to be used to start the crawl

  • an async method parse that implements the actual scraping logic
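
The three requirements above can be checked mechanically. The following is a minimal sketch using only the Python standard library. The function missing_requirements and the two demo classes are hypothetical and are not part of Skyscraper. The demo classes deliberately do not inherit from ChromeSpider so that the snippet runs without Skyscraper installed.

```python
import inspect


def missing_requirements(spider_cls):
    """Return the names of required spider members that are absent or wrong."""
    problems = []
    if not isinstance(getattr(spider_cls, 'name', None), str):
        problems.append('name')
    if not isinstance(getattr(spider_cls, 'start_urls', None), (list, tuple)):
        problems.append('start_urls')
    # parse must be declared with "async def", not a plain "def".
    if not inspect.iscoroutinefunction(getattr(spider_cls, 'parse', None)):
        problems.append('parse')
    return problems


class CompleteDemo:
    name = 'example'
    start_urls = ['http://example.com/']

    async def parse(self, page, response):
        return None


class IncompleteDemo:
    name = 'broken'

    def parse(self, page, response):  # not async, and start_urls is missing
        return None


print(missing_requirements(CompleteDemo))
print(missing_requirements(IncompleteDemo))
```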

The following code is an example of a spider that extracts the source code from example.com:

import skyscraper.spiders
from skyscraper.items import BasicItem  # assumed import path for BasicItem


class ExampleSpider(skyscraper.spiders.ChromeSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    # page is the pyppeteer page object, already navigated to the current URL.
    # response is the return value of the call to page.goto().
    # Use page to query data from the rendered page; use response to get
    # the URL and the source code of the page.
    async def parse(self, page, response):
        item = BasicItem()
        item['id'] = 'example.com-indexpage'
        item['url'] = response.url
        item['source'] = await response.text()
        return item
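
Because parse in the spider above only reads response.url and awaits response.text(), its logic can be tried out without launching a browser. The following is a rough sketch under stated assumptions: FakeResponse, ParseLogic and run_parse are hypothetical stand-ins, a plain dict replaces BasicItem, and the class does not inherit from ChromeSpider, so nothing here reflects how Skyscraper itself invokes a spider.

```python
import asyncio


class FakeResponse:
    """Stand-in for the response object; mimics only .url and .text()."""

    def __init__(self, url, body):
        self.url = url
        self._body = body

    async def text(self):
        return self._body


class ParseLogic:
    """Same parse body as the spider above, with a dict instead of BasicItem."""

    async def parse(self, page, response):
        item = {}
        item['id'] = 'example.com-indexpage'
        item['url'] = response.url
        item['source'] = await response.text()
        return item


def run_parse(spider, response, page=None):
    """Run the parse coroutine to completion and return its result."""
    return asyncio.run(spider.parse(page, response))


item = run_parse(ParseLogic(), FakeResponse('http://example.com/', '<html></html>'))
print(item['url'])
```

The page argument is passed as None here because this parse method never touches it; a spider that queries the rendered page would need a stand-in for page as well.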