Google search engine

To put our knowledge of CSS selectors to use, we will scrape Google search results. According to the Alexa data used in Chapter 4, Concurrent Downloading, google.com is the world's most popular website and, conveniently, its structure is simple and straightforward to scrape.


Depending on your location, Google may redirect you to a country-specific version. In these examples, Google is set to the Romanian version, so your results may look slightly different.

Here is the Google search homepage loaded with browser tools to inspect the form:

We can see here that the search query is stored in an input with the name q, and the form is submitted to the path /search, set by the action attribute. We can test this by submitting a test search, after which we are redirected to a URL such as https://www.google.ro/?gws_rd=cr,ssl&ei=TuXYWJXqBsGsswHO8YiQAQ#q=test&*. The exact URL will depend on your browser and location. Also, if you have Google Instant enabled, AJAX is used to load the search results dynamically rather than submitting the form. This URL has many parameters, but the only one required is q, which holds the query.
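Because q is the only required parameter, we can build a valid search URL ourselves. Here is a minimal sketch using urlencode from the standard library, which also takes care of escaping any special characters in the query:

>>> from urllib.parse import urlencode
>>> 'https://www.google.com/search?' + urlencode({'q': 'test'})
'https://www.google.com/search?q=test'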

This means we can use the URL https://www.google.com/search?q=test to produce a search result, as shown in this screenshot:

The structure of the search results can be examined with your browser tools, as shown here:

Here, we see that the search results are structured as links whose parent element is an <h3> tag with the class r.

To scrape the search results, we will use a CSS selector, which was introduced in Chapter 2, Scraping the Data:

>>> from lxml.html import fromstring
>>> import requests
>>> html = requests.get('https://www.google.com/search?q=test')
>>> tree = fromstring(html.content)
>>> results = tree.cssselect('h3.r a')
>>> results
[<Element a at 0x7f3d9affeaf8>,
<Element a at 0x7f3d9affe890>,
<Element a at 0x7f3d9affe8e8>,
<Element a at 0x7f3d9affeaa0>,
<Element a at 0x7f3d9b1a9e68>,
<Element a at 0x7f3d9b1a9c58>,
<Element a at 0x7f3d9b1a9ec0>,
<Element a at 0x7f3d9b1a9f18>,
<Element a at 0x7f3d9b1a9f70>,
<Element a at 0x7f3d9b1a9fc8>]
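The markup Google returns can vary with the request headers, and requests identifies itself as a script by default, which Google may treat differently from a browser. If the selector returns an empty list for you, one thing to try (an assumption on our part, not documented behavior) is sending a browser-like User-Agent header:

>>> headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser-like UA string
>>> html = requests.get('https://www.google.com/search?q=test', headers=headers)
>>> results = fromstring(html.content).cssselect('h3.r a')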

So far, we have downloaded the Google search results and used lxml to extract the links. As the preceding screenshot shows, each link includes a number of extra parameters alongside the actual website URL, which are used for tracking clicks.

Here is the first link we find on the page:

>>> link = results[0].get('href') 
>>> link
'/url?q=http://www.speedtest.net/&sa=U&ved=0ahUKEwiCqMHNuvbSAhXD6gTMAA&usg=AFQjCNGXsvN-v4izEgZFzfkIvg'

The content we want here is http://www.speedtest.net/, which can be parsed from the query string using the urllib.parse module:

>>> from urllib.parse import parse_qs, urlparse 
>>> qs = urlparse(link).query
>>> parsed_qs = parse_qs(qs)
>>> parsed_qs
{'q': ['http://www.speedtest.net/'],
'sa': ['U'],
'ved': ['0ahUKEwiCqMHNuvbSAhXD6gTMAA'],
'usg': ['AFQjCNGXsvN-v4izEgZFzfkIvg']}
>>> parsed_qs.get('q', [])
['http://www.speedtest.net/']

This query string parsing can be applied to extract all the links:

>>> links = []
>>> for result in results:
...     link = result.get('href')
...     qs = urlparse(link).query
...     links.extend(parse_qs(qs).get('q', []))
...
>>> links
['http://www.speedtest.net/',
'test',
'https://www.test.com/',
'https://ro.wikipedia.org/wiki/Test',
'https://en.wikipedia.org/wiki/Test',
'https://www.sri.ro/verificati-va-aptitudinile-1',
'https://www.sie.ro/AgentiaDeSpionaj/test-inteligenta.html',
'http://www.hindustantimes.com/cricket/india-vs-australia-live-cricket-score-4th-test-dharamsala-day-3/story-8K124GMEBoiKOgiAaaB5bN.html',
'https://sports.ndtv.com/india-vs-australia-2017/live-cricket-score-india-vs-australia-4th-test-day-3-dharamsala-1673771',
'http://pearsonpte.com/test-format/']

Success! The links from the first page of this Google search have been scraped. The full source for this example is available at https://github.com/kjam/wswp/blob/master/code/chp9/scrape_google.py.
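To make these steps reusable, we can wrap them in a single function. The following is a minimal sketch; the name google_search is our own, and it assumes the h3.r a structure observed above still holds:

import requests
from lxml.html import fromstring
from urllib.parse import parse_qs, urlparse

def google_search(query):
    # Download the first page of results; requests encodes the query for us
    html = requests.get('https://www.google.com/search', params={'q': query})
    tree = fromstring(html.content)
    links = []
    for result in tree.cssselect('h3.r a'):
        # Each href is a /url?q=... redirect; extract the real URL from q
        qs = urlparse(result.get('href')).query
        links.extend(parse_qs(qs).get('q', []))
    return links

Calling google_search('test') should then return the same list of links we extracted step by step above.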

One difficulty with Google is that a CAPTCHA image will be shown if your IP appears suspicious, for example, when downloading too fast:

This CAPTCHA image could be solved using the techniques covered in Chapter 7, Solving CAPTCHA, though it is preferable to avoid suspicion in the first place by downloading slowly, or by using proxies if a faster download rate is required. Overloading Google can get your IP, or even your entire range of IPs, banned from its domains for several hours or days; be courteous in your use of the site (for others' sake as well as your own) so your home or office doesn't get blacklisted.
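A minimal sketch of both precautions is shown below; the delay value and the proxy address are placeholders you would tune or replace for your own setup:

import time
import requests

DELAY = 5  # seconds to wait between requests
proxies = {
    'http': 'http://10.10.1.10:3128',   # placeholder proxy addresses
    'https': 'http://10.10.1.10:3128',
}

for query in ['test', 'python', 'scraping']:
    html = requests.get('https://www.google.com/search',
                        params={'q': query}, proxies=proxies)
    # ... parse html.content as shown earlier ...
    time.sleep(DELAY)  # throttle to reduce the chance of a CAPTCHA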
