Ideas List


If you want to contribute, we currently have the following ideas that need to be implemented for Skyscraper:

  • logging support for crawled URLs, including the referer, to help debug potential crawling loops

  • drop x percent pipeline: A pipeline that drops a given percentage of items but still marks them as crawled (can be used to comply with the European database protection law by crawling only a small percentage of the data)

  • number of requests per day limitation: Limit a spider to a maximum number of requests per day
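
As a starting point for discussion, the last two ideas could be sketched roughly as follows. This is only an illustration, not existing Skyscraper code: it assumes a pipeline interface with a `process_item(item, spider)` method, and the names `DropItem`, `DropPercentagePipeline` and `DailyRequestLimit` are hypothetical. Marking dropped items as crawled is left out, because it depends on how Skyscraper tracks crawl state.

```python
import datetime
import random


class DropItem(Exception):
    """Hypothetical stand-in for a framework-specific 'discard item' exception."""


class DropPercentagePipeline:
    """Discards roughly `drop_percent` percent of the items passed through it."""

    def __init__(self, drop_percent, rng=None):
        if not 0 <= drop_percent <= 100:
            raise ValueError("drop_percent must be between 0 and 100")
        self.drop_percent = drop_percent
        # A seeded random.Random can be injected to make runs reproducible.
        self.rng = rng if rng is not None else random.Random()

    def process_item(self, item, spider=None):
        # random() lies in [0.0, 1.0), so 0 never drops and 100 always drops.
        if self.rng.random() * 100 < self.drop_percent:
            raise DropItem("dropped by percentage filter")
        return item


class DailyRequestLimit:
    """Counts requests per calendar day and refuses them once the limit is hit."""

    def __init__(self, max_per_day, today=datetime.date.today):
        self.max_per_day = max_per_day
        self.today = today  # injectable clock, mainly for testing
        self.day = today()
        self.count = 0

    def allow(self):
        current = self.today()
        if current != self.day:
            # A new day has started, so the request budget resets.
            self.day = current
            self.count = 0
        if self.count >= self.max_per_day:
            return False
        self.count += 1
        return True
```

A spider would call `allow()` before each request and stop or pause once it returns `False`. The counter here lives in memory only, so a real implementation would probably need to persist it across spider runs.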


We also need to decide what Calkx should become and which features it should implement. Skyscraper currently works quite well, but the following processes are still largely undefined. At the moment we see a need for the following solutions (ideally all inside Calkx, because we should not develop too many components):

  • further extraction of features from Skyscraper results (especially when Skyscraper outputs HTML source, but possibly in other situations as well)

  • visual data exploration to gain insight into what we crawled, what the data looks like, etc.

  • chart generation for our presentations (both blog posts and project websites)