Dealing with non-ASCII text and HTML entities

HTML is not as structured as data from a database query or a pandas DataFrame. You may be tempted to manipulate HTML with regular expressions or string functions, but this approach works only in a limited number of cases. You are better off using specialized Python libraries to process HTML. In this recipe, we will use the clean_html() function of the lxml library. This function strips all JavaScript and CSS from an HTML page.
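
To illustrate why regular expressions are fragile here, the following is a minimal sketch that compares a naive pattern with a real parser on a small snippet. It uses the standard library's html.parser rather than lxml so that it runs without extra installs; the HrefCollector class, the find_links() function, and the sample markup are hypothetical names made up for this illustration and are not part of the recipe's code bundle.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def find_links(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Made-up markup: single quotes, an attribute split over two lines,
# an escaped ampersand, and an uppercase tag.
sample = '''<p>See <a class="ext"
   href='https://example.com/a?x=1&amp;y=2'>this</a> and
<A HREF="/local">that</A>.</p>'''

# A naive pattern that only expects lowercase href with double quotes.
naive = re.findall(r'href="(.*?)"', sample)
links = find_links(sample)

print('regex :', naive)
print('parser:', links)
```

The pattern misses both links because neither one matches its assumptions about quoting and case, while the parser normalizes those details and also unescapes the entity in the first URL.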

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, when UTF-8 (8-bit Unicode) took over first place. ASCII is limited to the English alphabet and has no support for the alphabets of other languages. Unicode supports a much broader range of alphabets. However, we sometimes need to limit ourselves to ASCII, so this recipe gives you an example of how to ignore non-ASCII characters.
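
As a rough sketch of what limiting text to ASCII means in practice, the snippet below encodes two sample strings with different error handlers. It uses only the standard library; the variable names and the sample word 'café' are assumptions chosen for illustration.

```python
import unicodedata

word = 'Питон'     # Cyrillic letters have no ASCII equivalent
accented = 'café'  # 'é' can be decomposed into 'e' plus a combining accent

# Strict encoding (the default) refuses non-ASCII characters.
try:
    word.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops what it cannot encode; 'replace' substitutes '?'.
ignored = word.encode('ascii', 'ignore')
replaced = word.encode('ascii', 'replace')

# UTF-8 can represent the word, using two bytes per Cyrillic letter.
utf8_len = len(word.encode('utf-8'))

# NFKD normalization splits 'é' into 'e' and an accent, so the base
# letter survives when the accent is ignored.
decomposed = unicodedata.normalize('NFKD', accented).encode('ascii', 'ignore')

print(strict_failed, ignored, replaced, utf8_len, decomposed)
```

Normalization only helps for characters that decompose into an ASCII base letter; Cyrillic text is still dropped entirely.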

Getting ready

Install lxml with pip or conda, as follows:

$ pip install lxml
$ conda install lxml

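I tested the code with lxml 3.4.2 from Anaconda. Note that since lxml 5.2, the lxml.html.clean module is distributed as the separate lxml_html_clean package, so recent installations also need pip install lxml_html_clean.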

How to do it…

The code is in the processing_html.py file in this book's code bundle and is broken up into the following steps:

  1. The imports are as follows:
    from lxml.html.clean import clean_html
    from difflib import Differ
    import unicodedata
    import dautil as dl
    
    PRINT = dl.log_api.Printer()
  2. Define the following function to diff two files:
    def diff_files(text, cleaned):
        d = Differ()
        diff = list(d.compare(text.splitlines(keepends=True),
                              cleaned.splitlines(keepends=True)))
        PRINT.print(diff)
  3. The following code block opens an HTML file, cleans it, and compares the cleaned file with the original:
    with open('460_cc_phantomjs.html') as html_file:
        text = html_file.read()
        cleaned = clean_html(text)
        diff_files(text, cleaned)
        PRINT.print(dl.web.find_hrefs(cleaned))
  4. The following snippet demonstrates handling of non-ASCII text:
    bulgarian = 'Питон is Bulgarian for Python'
    PRINT.print('Bulgarian', bulgarian)
    PRINT.print('Bulgarian ignored', unicodedata.normalize('NFKD', bulgarian).encode('ascii', 'ignore'))
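
To make the diff output from step 2 easier to read, here is a small sketch of the line markers that Differ produces. The two strings are made-up stand-ins for an original and a cleaned page (clean_html() is not called here), so only the behavior of difflib is shown.

```python
from difflib import Differ

# Hypothetical "before" and "after" documents; the second one simply
# lacks the script line, as a cleaned page might.
text = '<p>Hello</p>\n<script>alert(1)</script>\n<p>Bye</p>\n'
cleaned = '<p>Hello</p>\n<p>Bye</p>\n'

diff = list(Differ().compare(text.splitlines(keepends=True),
                             cleaned.splitlines(keepends=True)))

# Lines starting with '  ' are unchanged, '- ' appear only in the
# first input, and '+ ' appear only in the second input.
print(''.join(diff), end='')
```

With real pages, Differ may also emit lines starting with '? ' that point at the characters that changed within a line.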

Refer to the following screenshot for the end result (I omitted some of the output for brevity):

See also
