Anonymization

Crawling with TOR

It is possible to run all requests of a container through TOR. TOR is started separately for each container, which means that different containers can have different exit nodes.

Note: It has not been verified whether each container really gets an independent route through the TOR network, or whether requests originating from the same IP tend to be routed similarly.

There are two different concepts regarding TOR and Skyscraper containers:

  • TOR can be enabled for a container

  • TOR can be used for a spider in a container

It is important to remember that these two are separate. To use TOR for a spider, you have to have it enabled inside the container. However, you can also run a container with TOR enabled and not route the spider's traffic through TOR. Enabling TOR for a container only means that the required TOR daemons will be started in that container.

Warning

This feature is experimental at the moment. Do not rely on it: something might fail during the setup of the container, in which case your requests would not be routed through TOR.

Warning

It is not uncommon for websites to block TOR exit nodes, so you should be prepared for crawling errors.
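If an exit node is blocked, requests usually fail with a connection error or a status code such as 403 or 429. In plain Scrapy you can make such failures visible with an errback; the following spider is only an illustrative sketch (the spider name and URL are made up), not part of Skyscraper:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class BlockcheckSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = 'blockcheck'

    def start_requests(self):
        # The errback fires for connection failures and non-2xx responses
        yield scrapy.Request('https://example.com/',
                             callback=self.parse, errback=self.on_error)

    def parse(self, response):
        pass  # normal extraction logic goes here

    def on_error(self, failure):
        if failure.check(HttpError):
            response = failure.value.response
            # Sites often answer TOR exit nodes with 403 or 429
            self.logger.warning('Got %s for %s, exit node may be blocked',
                                response.status, response.url)
        else:
            self.logger.error('Request failed: %r', failure)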

To enable TOR for a container, start the container with the environment variable TOR_ENABLED set. Let's test this with an IP check: the following spider requests api.ipify.org, which returns the caller's public IP address, and stores it as an item:

import datetime

import scrapy

from skyscraper.items import BasicItem


class IpcheckSpider(scrapy.Spider):
    name = 'ipcheck'
    allowed_domains = ['ipify.org']
    start_urls = ['https://api.ipify.org/']

    def parse(self, response):
        # api.ipify.org returns the caller's public IP as plain text,
        # so the item's source field will contain the IP we appear to have
        today = datetime.datetime.today().strftime('%Y-%m-%d')
        item = BasicItem()
        item['id'] = 'ipcheck-{}'.format(today)
        item['url'] = response.url
        item['source'] = response.text
        return item

Generally, it is safe to enable TOR (TOR_ENABLED) for all containers. However, if you never plan to use TOR for your spiders, it does not make sense to activate it and run additional unneeded services inside your container. You also have to set the hostname and port of your HTTP proxy; for Privoxy this is 127.0.0.1:8118 by default:

TOR_ENABLED=1
SKYSCRAPER_TOR_PROXY=127.0.0.1:8118
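Under the hood, this is an ordinary HTTP proxy in front of TOR. In plain Scrapy, routing a request through such a proxy uses the standard request.meta['proxy'] key, which is handled by Scrapy's built-in HttpProxyMiddleware. The following sketch only illustrates that mechanism; it is not Skyscraper's actual implementation:

import os

import scrapy


class TorIpcheckSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only
    name = 'tor-ipcheck'

    def start_requests(self):
        # Assumes SKYSCRAPER_TOR_PROXY holds host:port, e.g. 127.0.0.1:8118
        proxy = os.environ.get('SKYSCRAPER_TOR_PROXY', '127.0.0.1:8118')
        yield scrapy.Request('https://api.ipify.org/',
                             meta={'proxy': 'http://' + proxy})

    def parse(self, response):
        self.logger.info('Exit IP: %s', response.text)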

Store the spider code in your spider storage (e.g. a folder or PostgreSQL) and run it (you might have to adjust the namespace tutorial to your own namespace). To actually use TOR with a spider, pass the flag --use-tor to the crawl-manual command:

docker run --rm --env TOR_ENABLED=1 \
    --env SKYSCRAPER_TOR_PROXY=127.0.0.1:8118 \
    molescrape/skyscraper crawl-manual tutorial ipcheck --use-tor
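If everything works, the stored item's source field should contain the exit node's IP rather than your machine's public IP. Running the same command without --use-tor should report your real IP, which is an easy way to verify the setup.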

For scheduled spider runs from PostgreSQL, you can set a field in the database to enable or disable the usage of TOR for individual spiders.
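The exact table and column names depend on your Skyscraper database schema; assuming a hypothetical spiders table with a boolean use_tor column, toggling TOR for a single spider could look like this:

import psycopg2

# Table and column names are hypothetical; check your actual schema
conn = psycopg2.connect('dbname=skyscraper')
with conn, conn.cursor() as cur:
    cur.execute(
        'UPDATE spiders SET use_tor = %s WHERE namespace = %s AND name = %s',
        (True, 'tutorial', 'ipcheck'))
conn.close()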