Reading HTML data from the web

pandas supports reading data from HTML files or from HTML served at a URL. Under the covers, pandas uses the lxml, html5lib, and BeautifulSoup4 packages, which provide robust capabilities for parsing HTML tables.

Your default installation of Anaconda may not include these packages. If pd.read_html() fails with an error about a missing package, install the library named in the error message using Anaconda Navigator:

Otherwise, you can use pip, for example: pip install lxml html5lib beautifulsoup4

The pd.read_html() function reads HTML from a file or URL and parses every HTML table it finds into a pandas DataFrame. The function always returns a list of DataFrame objects, one per table found, even when the content holds only a single table. If the content contains no tables at all, pandas raises a ValueError rather than returning an empty list.
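The list return value can be illustrated with a short sketch. The HTML snippet and its values are made up for illustration, and the example assumes one of the parser packages mentioned above is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML containing two separate tables (made-up values).
html = """
<html><body>
<table>
  <thead><tr><th>a</th><th>b</th></tr></thead>
  <tbody><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></tbody>
</table>
<table>
  <thead><tr><th>x</th><th>y</th></tr></thead>
  <tbody><tr><td>5</td><td>6</td></tr></tbody>
</table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element.
tables = pd.read_html(StringIO(html))
print(len(tables))
for table in tables:
    print(table)

# Content without any <table> element raises ValueError.
try:
    pd.read_html(StringIO("<html><body><p>No tables here.</p></body></html>"))
    raised = False
except ValueError:
    raised = True
print(raised)
```

Wrapping the markup in StringIO passes it to pandas as a file-like object, which is how recent pandas versions expect literal HTML to be supplied.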

To demonstrate, we will read table data from the FDIC failed bank list, located at https://www.fdic.gov/bank/individual/failed/banklist.html. Viewing the page, you can see there is a list of quite a few failed banks.

This data is actually very simple to read with pandas and its pd.read_html() function:
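A minimal sketch of such a read follows. So that it runs without network access, it parses a small inline table whose rows and column names are made up as a stand-in for the FDIC page; the commented-out line shows the equivalent call against the URL given above, assuming the page is still served there.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the FDIC page: made-up banks and columns.
html = """
<table>
  <thead><tr><th>Bank Name</th><th>City</th><th>ST</th></tr></thead>
  <tbody>
    <tr><td>Example Bank A</td><td>Springfield</td><td>IL</td></tr>
    <tr><td>Example Bank B</td><td>Riverton</td><td>UT</td></tr>
  </tbody>
</table>
"""

# Parse every table in the content, then take the first one.
tables = pd.read_html(StringIO(html))
banks = tables[0]
print(banks.head())

# Against the live page (requires network access):
# banks = pd.read_html(
#     "https://www.fdic.gov/bank/individual/failed/banklist.html")[0]
```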

Again, that was almost too easy!

A DataFrame can be written to an HTML file with the .to_html() method. The method emits only the <table> element for the data (not an entire HTML document); given a path or file object it writes the markup there, and otherwise it returns the markup as a string. The following writes the stock data we read earlier to an HTML file:
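As a rough sketch of that step, the example below writes a small DataFrame to a temporary file and reads the fragment back. The DataFrame is a made-up stand-in for the stock data, and the file name stocks.html is an assumption chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the stock data (made-up values).
df = pd.DataFrame({
    "Date": ["2016-01-04", "2016-01-05"],
    "Close": [100.5, 101.25],
})

# Write the table markup to a file; index=False omits the row labels.
out_path = os.path.join(tempfile.mkdtemp(), "stocks.html")
df.to_html(out_path, index=False)

# The file holds only a <table> element, not a full HTML document.
with open(out_path) as f:
    fragment = f.read()
print(fragment)
```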

When viewed in a browser, the file looks like what is shown in the following screenshot:

This is useful, as you can use pandas to write HTML fragments to be included in websites, update them when needed, and thereby have the new data available to the site statically instead of through a more complicated data query or service call.
