Emitting Items

To store the information you scraped from a website, you have to emit it as an item. Skyscraper provides the class skyscraper.items.BasicItem, whose fields you fill in with the relevant data.

BasicItem has the following fields:

  • id: Must be set to a unique ID of this item in order for deduplication to work correctly

  • url: Can be set to the URL from which you scraped the item (optional)

  • source: Save the full source code of the website to this field if you might want to perform additional data extraction later (optional)

  • data: Store extracted data in this field; the data must be JSON-serializable (optional)
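
The following is a minimal sketch of filling the four fields listed above. It does not import Skyscraper: the BasicItem class here is a plain-dict stand-in that only mirrors the field names, and the dictionary-style field access, the build_item helper, and the choice of a SHA-256 hash of the URL as the unique ID are all assumptions made for illustration. In a real spider you would use skyscraper.items.BasicItem instead.

```python
import hashlib
import json


class BasicItem(dict):
    """Stand-in for skyscraper.items.BasicItem (assumption: dict-style fields)."""


def build_item(url, html, title):
    """Hypothetical helper that fills the user-facing fields of an item."""
    item = BasicItem()
    # A stable, unique ID lets deduplication recognize repeated items.
    # Hashing the URL is one possible scheme, not a Skyscraper requirement.
    item['id'] = hashlib.sha256(url.encode('utf-8')).hexdigest()
    item['url'] = url          # optional
    item['source'] = html      # optional: full page source for later extraction
    data = {'title': title}
    json.dumps(data)           # raises TypeError if data is not JSON-serializable
    item['data'] = data        # optional
    return item


item = build_item('https://example.com/a', '<html></html>', 'Example')
```

Deriving the ID from the URL means that scraping the same page twice yields the same ID; if one page contains several items, a different scheme is needed.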

Skyscraper's pipelines automatically add further information to other fields, so you do not need to set these yourself. The following fields are written by Skyscraper:

  • crawl_time

  • spider: The name of the spider that emitted this item

  • namespace: The namespace of the spider
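
To make the effect of these automatic fields concrete, here is a rough sketch of what such a stamping step could look like. This is not Skyscraper's actual pipeline code: the add_automatic_fields function, its parameters, and the ISO 8601 format used for crawl_time are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def add_automatic_fields(item, spider_name, namespace, now=None):
    """Hypothetical helper: return a copy of item with the automatic fields set."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(item)                      # leave the original item untouched
    stamped['crawl_time'] = now.isoformat()   # assumption: ISO 8601 timestamp
    stamped['spider'] = spider_name
    stamped['namespace'] = namespace
    return stamped


stamped = add_automatic_fields({'id': 'abc'}, 'example-spider', 'example-namespace')
```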

TODO: Also explain all the different possible item storage pipelines in this section