HTML is not as structured as data from a database query or a pandas DataFrame. You may be tempted to manipulate HTML with regular expressions or string functions. However, this approach works only in a limited number of cases. You are better off using specialized Python libraries to process HTML. In this recipe, we will use the clean_html() function of the lxml library. This function strips all JavaScript and CSS from an HTML page.
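To see why pattern matching only covers a limited number of cases, consider the following minimal sketch. The NAIVE_SCRIPT pattern and the sample snippets are made up for illustration; they are not part of the recipe.

```python
import re

# A naive pattern that only matches the simplest spelling of a script tag.
NAIVE_SCRIPT = re.compile(r'<script>.*?</script>', re.DOTALL)

# Hypothetical snippets: all four are script elements a browser would accept.
snippets = [
    '<p>ok</p><script>track()</script>',
    '<p>ok</p><SCRIPT>track()</SCRIPT>',
    '<p>ok</p><script type="text/javascript">track()</script>',
    '<p>ok</p><script >track()</script >',
]

for snippet in snippets:
    print(NAIVE_SCRIPT.sub('', snippet))

# Snippets where the JavaScript is still present after the substitution.
survivors = [s for s in snippets if 'track()' in NAIVE_SCRIPT.sub('', s)]
print(len(survivors), 'of', len(snippets), 'snippets kept their script')
```

Only the first snippet is cleaned. You could keep extending the pattern for uppercase tags, attributes, and stray whitespace, but a library that actually parses the markup handles these variations for you.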
American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, when UTF-8 (8-bit Unicode) took over first place. ASCII is limited to the English alphabet and has no support for the alphabets of other languages. Unicode supports a much broader range of alphabets. However, we sometimes need to limit ourselves to ASCII, so this recipe gives you an example of how to ignore non-ASCII characters.
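As a small illustration of the ASCII limitation, the following sketch uses only built-in string methods to show what happens when text containing Cyrillic letters is encoded as ASCII. The variable names are arbitrary.

```python
text = 'Питон is Bulgarian for Python'

# Strict encoding (the default) refuses characters outside ASCII.
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print('strict encoding failed:', err.reason)

# errors='ignore' silently drops whatever cannot be encoded.
ignored = text.encode('ascii', errors='ignore')
print(ignored)

# errors='replace' substitutes a question mark for each such character.
replaced = text.encode('ascii', errors='replace')
print(replaced)
```

Which error handler is appropriate depends on whether you would rather lose characters quietly or leave a visible marker where they used to be.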
Install lxml with pip or conda, as follows:
$ pip install lxml
$ conda install lxml
I tested the code with lxml 3.4.2 from Anaconda.
The code is in the processing_html.py file in this book's code bundle and is broken up into the following steps:
from lxml.html.clean import clean_html
from difflib import Differ
import unicodedata
import dautil as dl

PRINT = dl.log_api.Printer()

def diff_files(text, cleaned):
    # Compare the original and cleaned HTML line by line and print the delta
    d = Differ()
    diff = list(d.compare(text.splitlines(keepends=True),
                          cleaned.splitlines(keepends=True)))
    PRINT.print(diff)

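If you have not used Differ before, the following standalone sketch shows what its output looks like on two tiny made-up strings. It mirrors the compare() call used in diff_files(), but prints with the built-in print() instead of the recipe's PRINT object.

```python
from difflib import Differ

before = 'alpha\nbeta\ngamma\n'
after = 'alpha\ngamma\ndelta\n'

differ = Differ()
# keepends=True keeps the newline on every line, which Differ expects.
delta = list(differ.compare(before.splitlines(keepends=True),
                            after.splitlines(keepends=True)))

# Each entry starts with a two-character code:
# '  ' unchanged, '- ' only in the first input, '+ ' only in the second.
# Differ may also emit '? ' lines that point at changes within a line.
print(''.join(delta), end='')

removed = [line[2:] for line in delta if line.startswith('- ')]
added = [line[2:] for line in delta if line.startswith('+ ')]
print('removed:', removed)
print('added:', added)
```

In the recipe, the lines marked with '- ' are the parts of the page that clean_html() took out.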
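with open('460_cc_phantomjs.html') as html_file:
    text = html_file.read()
    # Strip the JavaScript and CSS, then show what changed
    cleaned = clean_html(text)
    diff_files(text, cleaned)
    # List the links that remain in the cleaned page
    PRINT.print(dl.web.find_hrefs(cleaned))
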
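The implementation of dl.web.find_hrefs() is not shown here. As a rough idea of what collecting links from a page involves, the following sketch uses the standard library's html.parser module instead. HrefCollector and collect_hrefs() are hypothetical names, and this is not necessarily how dautil does it.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Record the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def collect_hrefs(html):
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# A made-up fragment with two links and one anchor without an href.
page = ('<p><a href="https://example.com/a">A</a> and '
        '<a name="x">no link</a> <a href="/b">B</a></p>')
print(collect_hrefs(page))
```

Because the parser hands over attributes as name and value pairs, quoting style and attribute order in the markup do not matter.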
bulgarian = 'Питон is Bulgarian for Python'
PRINT.print('Bulgarian', bulgarian)
# Decompose the characters, then drop everything that is not ASCII
PRINT.print('Bulgarian ignored',
            unicodedata.normalize('NFKD', bulgarian).encode('ascii', 'ignore'))
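The NFKD normalization does not help with Cyrillic, but it matters for accented Latin letters. The following sketch wraps the same two calls in a hypothetical to_ascii() helper and decodes the result back to a string, which the recipe itself does not do.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters, then drop whatever ASCII cannot represent."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Accented Latin letters split into a base letter plus a combining mark,
# so the base letter survives.
print(to_ascii('caf\u00e9 na\u00efve'))

# The Cyrillic letters in this sentence have no ASCII base letter,
# so they disappear.
print(to_ascii('Питон is Bulgarian for Python'))

# Without normalization, a precomposed accented letter is dropped as a whole.
print('caf\u00e9'.encode('ascii', 'ignore').decode('ascii'))
```

The first call yields cafe naive, whereas skipping the normalization turns café into caf. Note that the Bulgarian sentence keeps its leading space, since only the Cyrillic word itself is removed.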
Refer to the following screenshot for the end result (I omitted some of the output for brevity):