XPath Selectors

There are times when CSS selectors will not work, especially with badly broken HTML or improperly formatted elements. Libraries like BeautifulSoup and lxml do their best to parse and clean up such code, but they will not always succeed. In these cases, XPath can help you build very specific selectors based on the hierarchical relationships of elements on the page.

XPath is a way of describing relationships as a hierarchy in XML documents. Because HTML is built from XML-like elements, we can also use XPath to navigate and select elements from an HTML document.

To read more about XPath, check out the Mozilla developer documentation at https://developer.mozilla.org/en-US/docs/Web/XPath.

XPath follows some basic syntax rules and has some similarities with CSS selectors. The following table gives a quick comparison of the two.

Selector description                              | XPath selector                       | CSS selector
Select all links                                  | '//a'                                | 'a'
Select div with class "main"                      | '//div[@class="main"]'               | 'div.main'
Select ul with ID "list"                          | '//ul[@id="list"]'                   | 'ul#list'
Select text from all paragraphs                   | '//p/text()'                         | 'p'*
Select all divs which contain "test" in the class | '//div[contains(@class, "test")]'    | 'div[class*="test"]'
Select all divs with links or lists in them       | '//div[a|ul]'                        | 'div a, div ul'*
Select a link with google.com in the href         | '//a[contains(@href, "google.com")]' | 'a[href*="google.com"]'

As the table shows, the two syntaxes have much in common. However, some of the CSS selectors are marked with a *. The mark indicates that CSS cannot select exactly these elements, so we have given the closest alternative. In these cases, if you are using cssselect, you will need to do further manipulation or iteration within Python and/or lxml. Hopefully this comparison has given you a useful introduction to XPath and shown that it can be more exacting and specific than CSS alone.
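To make the comparison concrete, here is a minimal sketch that runs the XPath expressions from the table with lxml. The HTML document and the variable names are made up for illustration and are not part of the example website.

```python
from lxml.html import fromstring

# Hypothetical HTML document, invented only to exercise the expressions.
SAMPLE = """
<html><body>
  <div class="main">
    <p>First paragraph</p>
    <a href="https://www.google.com/search">Search</a>
  </div>
  <div class="test-box">
    <ul id="list"><li>One</li><li>Two</li></ul>
  </div>
  <div class="empty"><p>Second paragraph</p></div>
</body></html>
"""

tree = fromstring(SAMPLE)

# Each call returns a list: elements for node tests, strings for
# text() and attribute (@...) steps.
links = tree.xpath('//a')
main_divs = tree.xpath('//div[@class="main"]')
lists = tree.xpath('//ul[@id="list"]')
paragraph_text = tree.xpath('//p/text()')
test_divs = tree.xpath('//div[contains(@class, "test")]')
divs_with_links_or_lists = tree.xpath('//div[a|ul]')
google_hrefs = tree.xpath('//a[contains(@href, "google.com")]/@href')

print(paragraph_text)
print([div.get('class') for div in divs_with_links_or_lists])
print(google_hrefs)
```

Note that '//div[a|ul]' returns the div elements themselves, whereas the CSS alternative in the table returns the links and lists inside them.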

Now that we have a basic introduction to the XPath syntax, let's see how we can use it for our example website:

>>> from lxml.html import fromstring
>>> tree = fromstring(html)
>>> area = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')[0]
>>> print(area)
244,820 square kilometres
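Indexing the result with [0] raises an IndexError when the expression matches nothing. The following is a minimal sketch of a more defensive variant. The first_text helper and the SAMPLE_ROW markup are hypothetical, written only to mirror the row structure used by the selector above.

```python
from lxml.html import fromstring

# Made-up stand-in for the downloaded page; only the id and class
# names mirror the selector used in the example above.
SAMPLE_ROW = """
<html><body><table>
  <tr id="places_area__row">
    <td class="w2p_fl">Area: </td>
    <td class="w2p_fw">244,820 square kilometres</td>
  </tr>
</table></body></html>
"""


def first_text(tree, xpath, default=None):
    """Return the first string matched by xpath, or default if none match."""
    matches = tree.xpath(xpath)
    return matches[0].strip() if matches else default


tree = fromstring(SAMPLE_ROW)
area = first_text(
    tree, '//tr[@id="places_area__row"]/td[@class="w2p_fw"]/text()')
# This row id does not exist in the sample, so the default is returned.
missing = first_text(
    tree, '//tr[@id="no_such_row"]/td[@class="w2p_fw"]/text()')

print(area)
print(missing)
```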

Similar to CSS selectors, you can also test XPath selectors in your browser console. To do so, simply call $x('pattern_here'); in the console of the page you want to query. Alternatively, you can use the document object from plain JavaScript and call its evaluate method.

The Mozilla Developer Network has a useful introduction to using XPath with JavaScript here: https://developer.mozilla.org/en-US/docs/Introduction_to_using_XPath_in_JavaScript

If we wanted to test looking for td elements with images in them to get the flag data from the country pages, we could test our XPath pattern in our browser first:

Here we can see that we can use attributes to specify the data we want to extract (such as @src). By testing in the browser, we save debugging time by getting immediate and easy-to-read results.
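Once a pattern behaves as expected in the browser, the same kind of expression can be run from Python. The sketch below assumes a pattern of the form '//td[img]/img/@src', which is one plausible way to select td elements that contain images and read their src attribute. Both the pattern and the sample markup are assumptions made for illustration, not taken from the example website.

```python
from lxml.html import fromstring

# Hypothetical markup: one cell holds an image, the other does not.
SAMPLE_FLAG = """
<html><body><table>
  <tr><td class="w2p_fw"><img src="/flags/example.png"></td></tr>
  <tr><td class="w2p_fw">No image in this cell</td></tr>
</table></body></html>
"""

tree = fromstring(SAMPLE_FLAG)

# td[img] keeps only cells that have an img child; /img/@src then
# returns the attribute values as strings rather than elements.
flag_sources = tree.xpath('//td[img]/img/@src')

print(flag_sources)
```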

We will use both XPath and CSS selectors throughout this chapter and later chapters, so you can become more familiar with them and feel confident using them as you advance your web scraping skills.
