Scraping with the shell command

Now that Scrapy can crawl the countries, we can define which data to scrape. To help test how to extract data from a web page, Scrapy includes a handy command called shell, which presents the Scrapy API via a Python or IPython interpreter.

We can call the command using the URL we would like to start with, like so: 

$ scrapy shell http://example.webscraping.com/view/United-Kingdom-239
...
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fd18a669cc0>
[s] item {}
[s] request <GET http://example.webscraping.com/view/United-Kingdom-239>
[s] response <200 http://example.webscraping.com/view/United-Kingdom-239>
[s] settings <scrapy.settings.Settings object at 0x7fd189655940>
[s] spider <CountrySpider 'country' at 0x7fd1893dd320>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

We can now query the response object to check what data is available.

In [1]: response.url
Out[1]: 'http://example.webscraping.com/view/United-Kingdom-239'

In [2]: response.status
Out[2]: 200

Scrapy uses lxml to scrape data, so we can use the same CSS selectors as those in Chapter 2, Scraping the Data:

In [3]: response.css('tr#places_country__row td.w2p_fw::text')
Out[3]: [<Selector xpath=u"descendant-or-self::
tr[@id = 'places_country__row']/descendant-or-self::
*/td[@class and contains(
concat(' ', normalize-space(@class), ' '),
' w2p_fw ')]/text()" data=u'United Kingdom'>]

The method returns a list containing a single lxml selector. You may also recognize some of the XPath syntax Scrapy and lxml use to select the item. As we learned in Chapter 2, Scraping the Data, lxml converts all CSS selectors to XPath before extracting content.

In order to actually get the text from this country row, we must call the extract() method:

In [4]: name_css = 'tr#places_country__row td.w2p_fw::text' 

In [5]: response.css(name_css).extract()
Out[5]: [u'United Kingdom']

In [6]: pop_xpath = '//tr[@id="places_population__row"]/td[@class="w2p_fw"]/text()'

In [7]: response.xpath(pop_xpath).extract()
Out[7]: [u'62,348,447']

As we can see from the output above, the Scrapy response object can be queried with both CSS and XPath selectors, making it versatile for extracting both obvious and harder-to-reach content.
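Because Scrapy's selectors are built on lxml, the same XPath expression can also be tried offline against a static snippet, which is handy for quick experiments without a network request. The following sketch uses a hypothetical HTML fragment shaped like the table rows on the country page:

```python
from lxml import html

# Hypothetical fragment mimicking the country page's table rows
fragment = """
<table>
  <tr id="places_country__row">
    <td class="w2p_fl">Country: </td><td class="w2p_fw">United Kingdom</td>
  </tr>
  <tr id="places_population__row">
    <td class="w2p_fl">Population: </td><td class="w2p_fw">62,348,447</td>
  </tr>
</table>
"""

tree = html.fromstring(fragment)
# Same XPath as used in the shell session above
pop_xpath = '//tr[@id="places_population__row"]/td[@class="w2p_fw"]/text()'
print(tree.xpath(pop_xpath))  # ['62,348,447']
```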

These selectors can then be used in the parse_item() method we generated earlier in example/spiders/country.py. Note that we set attributes of the scrapy.Item object using dictionary syntax:

def parse_item(self, response):
    item = CountryItem()
    name_css = 'tr#places_country__row td.w2p_fw::text'
    item['name'] = response.css(name_css).extract()
    pop_xpath = '//tr[@id="places_population__row"]/td[@class="w2p_fw"]/text()'
    item['population'] = response.xpath(pop_xpath).extract()
    return item