Dealing with non-ASCII text and HTML entities

HTML is not as structured as data from a database query or a pandas DataFrame. You may be tempted to manipulate HTML with regular expressions or string functions, but this approach works only in a limited number of cases. You are better off using specialized Python libraries to process HTML. In this recipe, we will use the clean_html() function of the lxml library. This function strips all JavaScript and CSS from an HTML page.

American Standard Code for Information Interchange (ASCII) was the dominant encoding standard on the Internet until the end of 2007, when UTF-8 (a variable-width encoding of Unicode that is backward compatible with ASCII) took over first place. ASCII is limited to the English alphabet and has no support for the alphabets of other languages. Unicode supports a much broader range of alphabets. However, we sometimes need to limit ourselves to ASCII, so this recipe gives you an example of how to ignore non-ASCII characters.
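To make the difference between the two encodings concrete, here is a minimal sketch using only built-in string methods. The sample strings and variable names are illustrative and are not part of the recipe's code.

```python
text = 'Питон'

# UTF-8 can represent any Unicode character; each of these
# Cyrillic letters takes two bytes.
utf8_bytes = text.encode('utf-8')
print(len(text), len(utf8_bytes))

# Strict ASCII encoding fails on the first character outside
# the 128-character ASCII range.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as err:
    ascii_failed = True
    print('Cannot encode as ASCII:', err.reason)

# errors='ignore' silently drops whatever ASCII cannot represent.
ascii_bytes = 'Питон is Python'.encode('ascii', 'ignore')
print(ascii_bytes)
```

Note that 'ignore' discards information without warning, so it is only appropriate when losing the non-ASCII characters is acceptable.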

Getting ready

Install lxml with either pip or conda, as follows:

$ pip install lxml
$ conda install lxml

I tested the code with lxml 3.4.2 from Anaconda.

How to do it…

The code is in the processing_html.py file in this book's code bundle and is broken up into the following steps:

The imports are as follows:

from lxml.html.clean import clean_html
from difflib import Differ
import unicodedata
import dautil as dl

PRINT = dl.log_api.Printer()

Define the following function to diff two files:

def diff_files(text, cleaned):
    d = Differ()
    diff = list(d.compare(text.splitlines(keepends=True),
                          cleaned.splitlines(keepends=True)))
    PRINT.print(diff)
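To see what Differ produces without depending on dautil, the following standalone sketch returns the diff as a list instead of printing it. The helper name diff_lines and the two sample strings are hypothetical and only serve as an illustration.

```python
from difflib import Differ


def diff_lines(text, cleaned):
    """Return the Differ output for two multi-line strings."""
    differ = Differ()
    return list(differ.compare(text.splitlines(keepends=True),
                               cleaned.splitlines(keepends=True)))


original = '<p>Hello</p>\n<script>alert(1)</script>\n'
stripped = '<p>Hello</p>\n'

for line in diff_lines(original, stripped):
    # Prefixes: '  ' unchanged, '- ' only in the first input,
    # '+ ' only in the second input.
    print(line, end='')
```

Each output line keeps its original newline because splitlines() is called with keepends=True, which is why print() is called with end=''.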

The following code block opens an HTML file, cleans it, and compares the cleaned file with the original:

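# Read the whole page; the file can be closed before processing
with open('460_cc_phantomjs.html') as html_file:
    text = html_file.read()

# Strip scripts and styles, then show what changed
cleaned = clean_html(text)
diff_files(text, cleaned)
PRINT.print(dl.web.find_hrefs(cleaned))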
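The find_hrefs() function comes from the dautil package, and its implementation is not shown here. As a rough standard-library stand-in, assuming it collects the href values of links, the sketch below uses html.parser. The class name HrefCollector, the function collect_hrefs, and the sample markup are hypothetical. The sketch also shows HTML entities being decoded: the parser unescapes attribute values, and html.unescape() does the same for plain strings.

```python
import html
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                # Anchors without an href (or with an empty one) are skipped
                if name == 'href' and value:
                    self.hrefs.append(value)


def collect_hrefs(html_text):
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


sample = ('<p><a href="/search?q=caf&eacute;&amp;n=2">search</a> '
          '<a name="top">anchor</a> '
          '<a href="https://lxml.de/">lxml</a></p>')

# Entities inside attribute values arrive already decoded
print(collect_hrefs(sample))

# html.unescape() decodes named and numeric entities in a string
print(html.unescape('&lt;b&gt; &amp; &#1055;'))
```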

The following snippet demonstrates handling of non-ASCII text:

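bulgarian = 'Питон is Bulgarian for Python'
PRINT.print('Bulgarian', bulgarian)
# Cyrillic letters have no ASCII equivalent, so encode() drops them
# and returns a bytes object
ascii_only = unicodedata.normalize('NFKD', bulgarian).encode('ascii', 'ignore')
PRINT.print('Bulgarian ignored', ascii_only)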
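The NFKD normalization step matters mostly for accented Latin text: it splits a character such as 'è' into a base letter plus a combining accent, so the base letter survives the ASCII encoding. The Cyrillic letters in the example above have no such decomposition and are simply dropped. A minimal sketch of the difference follows; the helper name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    # encode() returns bytes; decode back to str for convenience
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Accents are removed, base letters are kept
accented = to_ascii('Crème brûlée')
print(accented)

# Cyrillic letters disappear entirely, leaving the leading space
cyrillic = to_ascii('Питон is Bulgarian for Python')
print(repr(cyrillic))
```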

Refer to the following screenshot for the end result (I omitted some of the output for brevity):
