Beautiful Soup

Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate content. If you do not already have this module, the latest version can be installed using this command:

    pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Many web pages do not contain perfectly valid HTML, so Beautiful Soup must correct improper opening and closing tags. For example, consider this simple web page containing a list with missing attribute quotes and closing tags:

    <ul class=country>
        <li>Area
        <li>Population
    </ul>

If the Population item is interpreted as a child of the Area item instead of as a direct child of the list, we could get unexpected results when scraping. Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML with the default parser
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)

<ul class="country">
 <li>
  Area
  <li>
   Population
  </li>
 </li>
</ul>

We can see that the default html.parser did not result in properly parsed HTML: the second li element has been nested inside the first, which might make the list difficult to navigate. Luckily, there are more options for parsers: we can install lxml (as described in the next section), or we can use html5lib. To install html5lib, simply use pip:

    pip install html5lib

Now, we can repeat this code, changing only the parser like so:

>>> soup = BeautifulSoup(broken_html, 'html5lib')
>>> fixed_html = soup.prettify()
>>> print(fixed_html)
<html>
 <head>
 </head>
 <body>
  <ul class="country">
   <li>
    Area
   </li>
   <li>
    Population
   </li>
  </ul>
 </body>
</html>

Here, Beautiful Soup using html5lib was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. You should see similar results if you used lxml.
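
The parser choice is not just cosmetic; it changes what navigation returns. As a quick sanity check, compare how the two parse trees answer the same query on the broken_html snippet from earlier:

>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> # html.parser nested the second <li> inside the first, so the
>>> # outer item's text swallows both labels
>>> [li.get_text(strip=True) for li in soup.find_all('li')]
['AreaPopulation', 'Population']
>>> soup = BeautifulSoup(broken_html, 'html5lib')
>>> # html5lib produced two sibling <li> elements, as intended
>>> [li.get_text(strip=True) for li in soup.find_all('li')]
['Area', 'Population']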

Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'}) 
>>> ul.find('li') # returns just the first match
<li>Area</li>
>>> ul.find_all('li') # returns all matches
[<li>Area</li>, <li>Population</li>]

For a full list of available methods and parameters, see the official Beautiful Soup documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
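
Beautiful Soup also supports CSS selectors through its select() and select_one() methods, which can be more concise than chained find() calls. Continuing with the same soup object:

>>> # a CSS selector equivalent of the find/find_all calls above
>>> soup.select('ul.country li')
[<li>Area</li>, <li>Population</li>]
>>> soup.select_one('ul.country li')
<li>Area</li>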

Now, using these techniques, here is a full example to extract the country area from our example website:

>>> from bs4 import BeautifulSoup
>>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
>>> html = download(url)  # the download function defined earlier in this chapter
>>> soup = BeautifulSoup(html, 'html5lib')
>>> # locate the area row
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the data element
>>> area = td.text # extract the text from the data element
>>> print(area)
244,820 square kilometres

This code is more verbose than the regular expression approach but easier to construct and understand. We also no longer need to worry about minor layout changes, such as extra whitespace or changed tag attributes. Finally, we know that even if a page contains broken HTML, Beautiful Soup can clean it up and still allow us to extract data from badly written markup.
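
To make this robustness concrete, the extraction logic can be wrapped in a small reusable function. The following is a minimal sketch; it assumes the download() function from earlier in the chapter and that other rows on the page follow the same places_<field>__row id convention as the area row:

>>> def scrape_field(html, field):
...     # each row's id follows the places_<field>__row convention,
...     # and the value sits in an element with class 'w2p_fw'
...     soup = BeautifulSoup(html, 'html5lib')
...     tr = soup.find(attrs={'id': 'places_%s__row' % field})
...     return tr.find(attrs={'class': 'w2p_fw'}).text
...
>>> scrape_field(download(url), 'area')
'244,820 square kilometres'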
