Parsing robots.txt

First, we need to interpret robots.txt to avoid downloading blocked URLs. Python urllib comes with the robotparser module, which makes this straightforward, as follows:

    >>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url('http://example.webscraping.com/robots.txt')
>>> rp.read()
>>> url = 'http://example.webscraping.com'
>>> user_agent = 'BadCrawler'
>>> rp.can_fetch(user_agent, url)
False
>>> user_agent = 'GoodCrawler'
>>> rp.can_fetch(user_agent, url)
True

The robotparser module loads a robots.txt file and then provides a can_fetch()function, which tells you whether a particular user agent is allowed to access a web page or not. Here, when the user agent is set to 'BadCrawler', the robotparser module says that this web page can not be fetched, as we saw in the definition in the example site's robots.txt.

To integrate robotparser into the link crawler, we first want to create a new function to return the  robotparser object:

def get_robots_parser(robots_url):
" Return the robots parser object using the robots_url "
rp = robotparser.RobotFileParser()
rp.set_url(robots_url)
rp.read()
return rp

We need to reliably set the robots_url; we can do so by passing an extra keyword argument to our function. We can also set a default value catch in case the user does not pass the variable. Assuming the crawl will start at the root of the site, we can simply add robots.txt to the end of the URL. We also need to define the user_agent:

def link_crawler(start_url, link_regex, robots_url=None, user_agent='wswp'):
...
if not robots_url:
robots_url = '{}/robots.txt'.format(start_url)
rp = get_robots_parser(robots_url)

Finally, we add the parser check in the crawl loop:

... 
while crawl_queue:
url = crawl_queue.pop()
# check url passes robots.txt restrictions
if rp.can_fetch(user_agent, url):
html = download(url, user_agent=user_agent)
...
else:
print('Blocked by robots.txt:', url)

We can test our advanced link crawler and its use of robotparser by using the bad user agent string.

>>> link_crawler('http://example.webscraping.com', '/(index|view)/', user_agent='BadCrawler')
Blocked by robots.txt: http://example.webscraping.com
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset