Web scraping using BeautifulSoup

BeautifulSoup is a Python library (https://www.crummy.com/software/BeautifulSoup/) for pulling data out of HTML and XML files. It provides ways of navigating, searching, accessing, and modifying the parsed content of a web page. A basic understanding of HTML is important for scraping a page successfully. To parse the content, the first thing we need to do is figure out where, inside the many levels of nested HTML tags, the links to the files we want to download are located. Simply put, there is a lot of code on a web page, and we want to find the relevant pieces that contain our data.

On the website, right-click anywhere on the page and select Inspect. This lets you see the raw code behind the site. Once you have clicked Inspect, you should see the following console pop up:

Inspect menu of a browser

Notice that the table we are referring to is wrapped in a <table> tag. Each row sits between <tr> tags and, similarly, each cell sits between <td> tags. Understanding this basic structure makes it easier to extract the data.
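To make the structure concrete, here is a minimal sketch that parses a small, hypothetical table (the names and values are illustrative, not from the book's dataset) and walks its rows and cells:

```python
from bs4 import BeautifulSoup

# A tiny, made-up example of the <table>/<tr>/<td> structure described above.
html = """
<table>
  <tr><td>Alice</td><td>30</td></tr>
  <tr><td>Bob</td><td>25</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Each <tr> is one row; each <td> inside it is one cell.
for row in soup.find_all("tr"):
    cells = [td.get_text() for td in row.find_all("td")]
    print(cells)
```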

We start by importing the following libraries:

Importing libraries
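The imports shown in the figure typically look like the following (a sketch; the exact set of imports in your edition may differ):

```python
import requests                    # to fetch the page over HTTP
from bs4 import BeautifulSoup      # to parse the HTML content
import pandas as pd                # to hold the extracted data as a DataFrame
```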

Next, we request the URL with the requests library. If the request was successful, you should see the following output:

Successful response from a website
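The request step can be sketched as follows; the URL here is a placeholder, not the site used in the book:

```python
import requests

# Hypothetical URL -- substitute the page you actually want to scrape.
url = "https://example.com/"
response = requests.get(url)

# Printing the response object shows something like <Response [200]>;
# a status code of 200 means the request succeeded.
print(response)
print(response.status_code)
```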

We then parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure. With a little knowledge of HTML tags, the parsed content can easily be converted into a DataFrame using a for loop and the pandas DataFrame constructor. The biggest advantage of using BeautifulSoup is that it can extract data even from loosely structured sources, which can then be molded into a table, whereas the read_html function of pandas only works with well-formed <table> elements. Hence, based on our requirement, we have used BeautifulSoup:

Extracted DataFrame using BeautifulSoup
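The parse-and-convert step can be sketched like this, using an inline HTML table in place of a live page (the column names and values are illustrative assumptions):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for response.text from the earlier requests.get call.
html = """
<table>
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td>91</td></tr>
  <tr><td>Bob</td><td>84</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect every row; header cells are <th>, data cells are <td>.
rows = []
for tr in soup.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
    rows.append(cells)

# The first row holds the headers; the remaining rows are the data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
```

On a real page you would pass `response.text` to BeautifulSoup instead of the inline string; the loop itself stays the same.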