Parsing HTML with lxml

The lxml library (https://lxml.de) is a Python binding for the libxml2 and libxslt C libraries, and provides a fast, feature-rich parser for XML and HTML documents.

The main module features are as follows:

  • Support for XML and HTML
  • An API based on ElementTree
  • Support for selecting elements of the document through XPath expressions

The parser can be installed from the Python Package Index (PyPI):

pip install lxml

lxml.etree is a submodule within the lxml library that provides methods such as XPath(), which compiles expressions written in XPath selector syntax. The following example uses the parser to read an HTML file and extract the text of the title tag through an XPath expression:

from lxml import etree

# Read the HTML file and parse it into an element tree
with open('data/simple.html') as f:
    simple_page = f.read()
parser = etree.HTML(simple_page)
result = etree.tostring(parser, pretty_print=True, method="html")

# Compile an XPath expression once, then apply it to the tree
find_text = etree.XPath("//title/text()", smart_strings=False)
text = find_text(parser)[0]
print(text)
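If you don't have a data/simple.html file to hand, the same pattern can be exercised on an inline string; the markup below is a made-up stand-in for that file:

```python
from lxml import etree

# A minimal, invented stand-in for the contents of data/simple.html
page = '<html><head><title>Simple page</title></head><body></body></html>'
tree = etree.HTML(page)

# smart_strings=False makes the expression return plain str objects
# instead of lxml's "smart" strings, which keep a reference back to
# their parent element
find_text = etree.XPath("//title/text()", smart_strings=False)
print(find_text(tree)[0])
```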

Before we start parsing HTML, we need something to parse. We can obtain the version and codename of the latest stable Debian release from the Debian website. Information about the current stable release can be found at https://www.debian.org/releases/stable/index.en.html. The information that we want is displayed in the page title and in the first sentence.

Let's open a Python shell and get to parsing. First, we'll download the page with the requests package:

>>> import requests
>>> response = requests.get('https://www.debian.org/releases/stable/index.en.html')

Next, we parse the source into an ElementTree tree. This is the same as parsing XML with the standard library's ElementTree, except that here we use lxml's specialized HTMLParser:

>>> from lxml.etree import HTML
>>> root = HTML(response.content)

The HTML() function is a shortcut that reads the HTML that is passed to it, and then it produces an XML tree. Notice that we're passing response.content and not response.text. The lxml library produces better results when it is given the raw bytes rather than the decoded Unicode text, because it can then use the encoding declared in the document itself.
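Under the hood, HTML() amounts to calling fromstring() with an HTMLParser instance; a minimal sketch (the markup here is invented for illustration):

```python
from lxml.etree import fromstring, HTMLParser

# Parse raw bytes with an explicit HTMLParser, as HTML() does internally
parser = HTMLParser()
root = fromstring(b'<html><head><title>t</title></head><body></body></html>',
                  parser)
print(root.tag)  # the root of the parsed tree is the <html> element
```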

The lxml library's ElementTree implementation has been designed to be 100% compatible with the standard library's, so we can start exploring the document in the same way as we did with XML:

>>> [e.tag for e in root]
['head', 'body']
>>> root.find('head').find('title').text
'Debian -- Debian “stretch” Release Information '

In the preceding code, we have printed out the text content of the document's <title> element. We can already see it contains the codename that we want.

Let's inspect the HTML source of the page and see what we're dealing with. For this, either use View source in a web browser, or save the HTML to a file and open it in a text editor. The page's source code is also included in the source code download for this book. Search for the text Debian 9.6 in the source, so that we are taken straight to the information we want.

Looking at the relevant block of the source, we can see that we want the contents of the <p> tag that is a child of the <div> element holding the page content. If we navigated to this element by using the ElementTree functions that we have used before, then we'd end up with something like the following:

>>> root.find('body').findall('div')[1].find('p').text
'Debian 9.6 was released November 10th, 2018. Debian 9.0 was initially released on June 17th, 2017. The release included many major changes, described in our '

The main problem with this approach is that it depends quite heavily on the HTML structure. A change, such as a <div> tag being inserted before the one that we needed, would break it. Also, in more complex documents, this can lead to horrendous chains of method calls, which are hard to maintain.

Our use of the <title> tag in the previous section to get the codename is an example of a good technique, because there is always only one <head> tag and one <title> tag in a document. A better approach to finding our <div> tag would be to make use of the id="content" attribute it contains.

It's a common web page design pattern to break a page into a few top-level <div> tags for the major page sections, such as the header, footer, and content, and to give these <div> tags id attributes that identify them as such.
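Following that pattern, locating our <div> by its id attribute with an XPath expression is robust against changes in the surrounding structure; a sketch against a simplified, invented stand-in for the Debian page:

```python
from lxml.etree import HTML

# A simplified, made-up snippet mimicking the Debian page structure
doc = HTML(b'<html><body>'
           b'<div id="header"></div>'
           b'<div id="content"><p>Debian 9.6 was released.</p></div>'
           b'</body></html>')

# Select the <p> child of the div with id="content", regardless of
# how many sibling <div> elements precede it
[p] = doc.xpath('//div[@id="content"]/p')
print(p.text)
```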

Since version 2, lxml has shipped with a dedicated Python submodule for working with HTML, lxml.html (http://lxml.de/lxmlhtml.html).
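For instance, lxml.html's fromstring() parses HTML and its elements gain HTML-specific helpers, such as text_content(); the fragment below is invented for illustration:

```python
from lxml.html import fromstring

# Parse an HTML fragment with the HTML-aware API
doc = fromstring('<html><body><p>Hello <b>world</b>!</p></body></html>')

# text_content() returns all text under the element, with markup stripped
print(doc.text_content())
```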

In this example, we make a request to the DuckDuckGo search engine and obtain the form that is used to perform the searches. To do this, we access the forms attribute of the parsed response.

You can find the following code in the duckduckgo.py file inside the lxml folder:

import requests
from lxml.html import fromstring, tostring, parse, submit_form

# Download the DuckDuckGo home page and parse it
response = requests.get('https://duckduckgo.com')
form_page = fromstring(response.text)

# The first form on the page is the search form
form = form_page.forms[0]
print(tostring(form))

# Fill in the search field and submit the form
page = parse('http://duckduckgo.com').getroot()
page.forms[0].fields['q'] = 'python'
result = parse(submit_form(page.forms[0])).getroot()
print(tostring(result))

This is the output of the first part of the script, where we can see the form object from DuckDuckGo:

b'<form id="search_form_homepage" class="search search--home js-search-form" name="x" method="POST" action="/html">
			<input id="search_form_input_homepage" class="search__input js-search-input" type="text" autocomplete="off" name="q" tabindex="1" value="">
			<input id="search_button_homepage" class="search__button js-search-button" type="submit" tabindex="2" value="S">
			<input id="search_form_input_clear" class="search__clear empty js-search-clear" type="button" tabindex="3" value="X">
			<div id="search_elements_hidden" class="search__hidden js-search-hidden"></div>
		</form>'
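The fields attribute behaves like a dictionary keyed by each input's name attribute, which is what lets us set the query above. This can be seen without a network request on an invented form standing in for the DuckDuckGo search form:

```python
from lxml.html import fromstring

# A made-up form, standing in for the DuckDuckGo search form
page = fromstring('<html><body>'
                  '<form action="/search" method="GET">'
                  '<input type="text" name="q" value="">'
                  '</form></body></html>')

form = page.forms[0]
form.fields['q'] = 'python'   # assign a value to the input named "q"
print(dict(form.fields))
```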