Dealing with non-ASCII text and HTML entities

HTML is not as structured as data from a database query or a pandas DataFrame. You may be tempted to manipulate HTML with regular expressions or string functions, but this approach works only in a limited number of cases. You are better off using specialized Python libraries to process HTML. In this recipe, we will use the clean_html() function of the lxml library. This function strips all JavaScript and CSS from an HTML page.
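To illustrate why a parser is a safer choice than regular expressions, here is a minimal sketch that uses only the standard library's html.parser module instead of lxml. It is not how clean_html() works internally. The ScriptStyleStripper class, the visible_text() helper, and the sample markup are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class ScriptStyleStripper(HTMLParser):
    """Collect text content while skipping <script> and <style> bodies.

    Hypothetical helper for illustration; not part of lxml or this recipe.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script and style elements
        if self.skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of an HTML string without script or style content."""
    parser = ScriptStyleStripper()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace into single spaces
    return ' '.join(''.join(parser.chunks).split())


sample = ('<html><head><style>p {color: red}</style>'
          '<script>alert("hi");</script></head>'
          '<body><p>Caf&eacute; &amp; bar</p></body></html>')
print(visible_text(sample))
```

The parser tracks which element it is in, so it also decodes entities such as &eacute; along the way. A regular expression has no such notion of nesting.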

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, when UTF-8 (an 8-bit Unicode encoding) took over first place. ASCII is limited to the English alphabet and has no support for the alphabets of other languages. Unicode supports a much broader range of alphabets. However, we sometimes need to limit ourselves to ASCII, so this recipe gives you an example of how to ignore non-ASCII characters.
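The difference between the two encodings can be seen with plain Python strings. The following short sketch uses only built-in string methods. The variable names are mine, and the sample sentence is the one used later in this recipe.

```python
text = 'Питон is Bulgarian for Python'

# UTF-8 can represent every character; each Cyrillic letter here takes two bytes
utf8_bytes = text.encode('utf-8')
print(len(text), len(utf8_bytes))

# Strict ASCII encoding raises an error on the first Cyrillic letter
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print('ASCII encoding failed at position', err.start)

# errors='ignore' silently drops whatever ASCII cannot represent
ascii_only = text.encode('ascii', errors='ignore')
print(ascii_only)
```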

Getting ready

Install lxml with pip or conda, as follows:

$ pip install lxml
$ conda install lxml

I tested the code with lxml 3.4.2 from Anaconda.

How to do it…

The code is in the file in this book's code bundle and is broken up into the following steps:

  1. The imports are as follows:
    from lxml.html.clean import clean_html
    from difflib import Differ
    import unicodedata
    import dautil as dl
    PRINT = dl.log_api.Printer()
  2. Define the following function to diff two files:
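    def diff_files(text, cleaned):
        d = Differ()
        # Assumed completion of the truncated call: compare line by line
        diff = list(d.compare(text.splitlines(keepends=True),
                              cleaned.splitlines(keepends=True)))
        PRINT.print(diff)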
  3. The following code block opens an HTML file, cleans it, and compares the cleaned file with the original:
    with open('460_cc_phantomjs.html') as html_file:
        text = html_file.read()
        cleaned = clean_html(text)
        diff_files(text, cleaned)
  4. The following snippet demonstrates handling of non-ASCII text:
    bulgarian = 'Питон is Bulgarian for Python'
    PRINT.print('Bulgarian', bulgarian)
    PRINT.print('Bulgarian ignored', unicodedata.normalize('NFKD', bulgarian).encode('ascii', 'ignore'))
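To see what a line-by-line diff looks like without the HTML file from the code bundle, here is a small sketch using Differ.compare() from difflib. The two strings are made-up stand-ins: the cleaned value is written by hand and is not actual clean_html() output.

```python
from difflib import Differ

# Hypothetical input; 'cleaned' is hand-written, not produced by lxml
original = '<p>Hello</p>\n<script>alert(1)</script>\n<p>Bye</p>\n'
cleaned = '<p>Hello</p>\n<p>Bye</p>\n'

d = Differ()
# compare() expects sequences of lines, so split and keep the line endings
diff = list(d.compare(original.splitlines(keepends=True),
                      cleaned.splitlines(keepends=True)))

# Lines present only in the original are prefixed with '- '
removed = [line[2:] for line in diff if line.startswith('- ')]

print(''.join(diff), end='')
print('Removed:', removed)
```

Lines that appear in both inputs start with two spaces, lines only in the first input start with '- ', and lines only in the second start with '+ '.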

Refer to the following screenshot for the end result (I omitted some of the output for brevity):
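The non-ASCII step is easy to reproduce with the standard library alone. The sketch below wraps the same normalize-then-encode idea in a hypothetical to_ascii() helper (my name, not part of the recipe) and decodes the result back to a string. Note that the expression in step 4 returns a bytes object in Python 3.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters with NFKD, then drop everything outside ASCII.

    Hypothetical helper for illustration only.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Accented Latin letters decompose into a base letter plus a combining mark,
# so the base letter survives
print(to_ascii('Café naïve'))  # Cafe naive

# The Cyrillic letters in this sentence have no ASCII fallback and are dropped
print(repr(to_ascii('Питон is Bulgarian for Python')))
```

Normalizing first matters for text such as 'Café': without it, the accented letter would be dropped entirely instead of being reduced to a plain 'e'.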

See also

