Archiving Data

Warning

This feature is not implemented yet. It is documented here in advance to define the intended functionality.

Skyscraper result data (i.e. the items stored in your data storage) can be archived to another location.

Archiving works for both folder storage and PostgreSQL storage. The archived items are stored in a compressed file and are sent to one of the supported archive storage locations. The following archive locations are supported:

  • Folder storage (i.e. store the files into a local folder)

  • Amazon S3 (Glacier Deep Archive)
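As a rough illustration of what "stored in a compressed file" could look like, the sketch below writes items as bzip2-compressed JSON lines, which matches the .jl.bz2 file names used in the examples in this section. Since the feature is not implemented, the function names write_archive and read_archive are hypothetical and only assume that items are JSON-serializable dictionaries.

```python
import bz2
import json
from pathlib import Path


def write_archive(items, path):
    """Write an iterable of dicts as bzip2-compressed JSON lines.

    Hypothetical helper, not part of Skyscraper's actual API.
    """
    path = Path(path)
    # Create the project/spider folders if they do not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with bz2.open(path, "wt", encoding="utf-8") as fh:
        for item in items:
            # One JSON document per line (the "jl" in .jl.bz2).
            fh.write(json.dumps(item, sort_keys=True) + "\n")
    return path


def read_archive(path):
    """Read a .jl.bz2 file back into a list of dicts."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A file written this way can be inspected with standard tools such as bzcat, because each line is an independent JSON document.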

Warning

It is not yet clear how archived items will be restored. There might be an inverse path that puts the data back into hot storage, or something similar.

To archive data, call the archive command with --target set to the target location. Use --start-date for the earliest date you want to archive (default: everything) and --end-date for the latest date you want to archive (default: until now). You can specify a name for the archive with --name. For example, to archive all data from 2018 and store it in S3 under the name 2018, use:

skyscraper archive --target s3://your-project-skyscraper-archive/ \
    --start-date 2018-01-01 --end-date 2018-12-31 --name 2018
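The date defaults described above can be sketched as a small predicate. This is only an assumption about the eventual behaviour: the helper in_archive_range is hypothetical, and it treats both --start-date and --end-date as inclusive, which the documentation does not state.

```python
from datetime import date


def in_archive_range(crawl_date, start_date=None, end_date=None, today=None):
    """Return True if an item crawled on crawl_date should be archived.

    Hypothetical helper. Assumes both bounds are inclusive.
    """
    # No --end-date means "until now".
    upper = end_date if end_date is not None else (today or date.today())
    # No --start-date means "everything", so there is no lower bound.
    if start_date is not None and crawl_date < start_date:
        return False
    return crawl_date <= upper
```

With the bounds from the command above, an item crawled on 2018-06-01 would be selected, while one crawled on 2019-01-01 would stay in hot storage.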

Archiving can also be limited to a single project. This can be useful if you want to keep some projects in the hot data set and move other projects into an archive. To archive a specific project, provide the --project argument:

skyscraper archive --target file:///data/skyscraper-archive \
    --project my-project --end-date 2018-12-31 --name until-2018
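The two examples use different URL schemes for --target (s3:// and file://). One possible way to tell the archive locations apart is to parse the target with the standard library, as in the sketch below. The function parse_target and the returned dictionary keys are hypothetical and not part of Skyscraper.

```python
from urllib.parse import urlparse


def parse_target(target):
    """Split a --target value into a storage kind and its location.

    Hypothetical helper that only covers the two schemes shown in the examples.
    """
    parsed = urlparse(target)
    if parsed.scheme == "file":
        # file:///data/skyscraper-archive -> local folder /data/skyscraper-archive
        return {"kind": "folder", "path": parsed.path}
    if parsed.scheme == "s3":
        # s3://bucket/prefix -> bucket name plus optional key prefix
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "prefix": parsed.path.lstrip("/"),
        }
    raise ValueError("unsupported archive target: {}".format(target))
```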

The archives are stored in a hierarchy based on the project name and the spider name. For example, if you have one project my-project with two spiders example and tutorial, an archiving job would create the following files:

  • s3://your-project-skyscraper-archive/my-project/example/2018.jl.bz2

  • s3://your-project-skyscraper-archive/my-project/tutorial/2018.jl.bz2
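The naming scheme above can be expressed as a small function. This is a sketch derived only from the two example paths. The helper archive_location is hypothetical, and the fixed .jl.bz2 suffix is taken from those examples.

```python
def archive_location(target, project, spider, name):
    """Build the archive file location for one spider of one project.

    Hypothetical helper following the <target>/<project>/<spider>/<name>.jl.bz2
    layout shown in the examples.
    """
    # Drop a trailing slash so the result never contains "//" after the target.
    base = target.rstrip("/")
    return "{}/{}/{}/{}.jl.bz2".format(base, project, spider, name)


# One archive file per spider of the project.
locations = [
    archive_location("s3://your-project-skyscraper-archive/", "my-project", spider, "2018")
    for spider in ("example", "tutorial")
]
```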

Note

We are considering support for encrypting data before it is archived. This might be useful when saving data to AWS S3.