Chapter 8. Processing HTML

Several modules included with Python provide virtually all the tools necessary to parse and process HTML documents without needing a web server or web browser. Parsing HTML files is becoming commonplace in applications such as search engines, document indexing, document conversion, data retrieval, and site backup or migration.

Because there is no way to cover the full range of options Python provides for HTML processing, the first two phrases in this chapter focus on specific Python modules that simplify opening HTML documents locally and on the Web. The rest of the phrases discuss how to use Python modules to quickly parse the data in HTML files to process specific items, such as links, images, and cookies. The final phrase in this chapter uses the example of fixing HTML files that do not have properly formatted tag data to demonstrate how to process the entire contents of an HTML file.

Parsing URLs

Example . 

import urlparse
parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
unparsedURL = urlparse.urlunparse((URLscheme, 
        URLlocation, URLpath, '', '', ''))
newURL = urlparse.urljoin(unparsedURL,
"/module-urllib2/request-objects.html")

The urlparse module included with Python makes it easy to break down URLs into specific components and reassemble them. This is very useful for a number of purposes when processing HTML documents.

The urlparse(urlstring [, default_scheme [, allow_fragments]]) function takes the URL provided in urlstring and returns the tuple (scheme, netloc, path, parameters, query, fragment). The tuple can then be used to determine things such as location scheme (HTTP, FTP, and so on), server address, file path, and so on.

The urlunparse(tuple) function accepts the tuple (scheme, netloc, path, parameters, query, fragment) and reassembles it into a properly formatted URL that can be used by the other HTML parsing modules included with Python.
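As a quick sketch of the parse/unparse round trip, the following uses Python 3's urllib.parse, where these functions now live (the code in this phrase uses the older Python 2 urlparse module):

```python
from urllib.parse import urlparse, urlunparse

# Break a URL into its six components: scheme, netloc, path,
# parameters, query, and fragment.
parts = urlparse("http://www.google.com/search?hl=en&q=urlparse")
print(parts.scheme)   # http
print(parts.netloc)   # www.google.com
print(parts.path)     # /search
print(parts.query)    # hl=en&q=urlparse

# Reassemble the same components into a properly formatted URL.
rebuilt = urlunparse(parts)
print(rebuilt)        # http://www.google.com/search?hl=en&q=urlparse
```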

The urljoin(base, url [, allow_fragments]) function accepts a base URL as the first argument and then joins whatever relative URL is specified in the second argument. The urljoin function is extremely useful in processing several files in the same location by joining new filenames to the existing base URL location.

Note

If the relative path does not start with the root (/) character, the rightmost location in the base URL path will be replaced with the relative path. For example, a base URL of http://www.testpage.com/pub and a relative URL of test.html would join to form http://www.testpage.com/test.html, not http://www.testpage.com/pub/test.html. If you want to keep the end directory in the path, make sure to end the base URL string with a / character.
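The behavior described in this note can be checked directly; this sketch uses Python 3's urllib.parse.urljoin, which follows the same joining rules:

```python
from urllib.parse import urljoin

# Without a trailing slash, the last path segment is replaced.
joined = urljoin("http://www.testpage.com/pub", "test.html")
print(joined)   # http://www.testpage.com/test.html

# With a trailing slash, the end directory is kept.
kept = urljoin("http://www.testpage.com/pub/", "test.html")
print(kept)     # http://www.testpage.com/pub/test.html

# A root-relative URL replaces the entire path.
rooted = urljoin("http://www.testpage.com/pub/", "/other/test.html")
print(rooted)   # http://www.testpage.com/other/test.html
```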

import urlparse

URLscheme = "http"
URLlocation = "www.python.org"
URLpath = "lib/module-urlparse.html"

modList = ("urllib", "urllib2", 
           "httplib", "cgilib")

#Parse address into tuple
print "Parsed Google search for urlparse"
parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
print parsedTuple

#Unparse list into URL
print "\nUnparsed python document page"
unparsedURL = urlparse.urlunparse(
    (URLscheme, URLlocation, URLpath, '', '', ''))
print "\t" + unparsedURL

#Join path to new file to create new URL
print "\nAdditional python document pages using join"
for mod in modList:
    newURL = urlparse.urljoin(unparsedURL,
                              "module-%s.html" % (mod))
    print "\t" + newURL

#Join path to subpath to create new URL
print "\nPython document pages using join of sub-path"
newURL = urlparse.urljoin(unparsedURL,
                          "module-urllib2/request-objects.html")
print "\t" + newURL

URL_parse.py

Parsed Google search for urlparse
('http', 'www.google.com', '/search', '',
'hl=en&q=urlparse&btnG=Google+Search', '')

Unparsed python document page
       http://www.python.org/lib/module-urlparse.html

Additional python document pages using join
       http://www.python.org/lib/module-urllib.html
       http://www.python.org/lib/module-urllib2.html
       http://www.python.org/lib/module-httplib.html
       http://www.python.org/lib/module-cgilib.html

Python document pages using join of sub-path
       http://www.python.org/lib/module-urllib2/
request-objects.html

Output from URL_parse.py code

Opening HTML Documents

Example . 

import urllib
u = urllib.urlopen(webURL)
u = urllib.urlopen(localURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s.\n" % \
      (len(buffer), u.geturl())

The urllib and urllib2 modules included with Python provide the functionality to open and fetch data from URLs, including HTML documents.

To use the urllib module to open an HTML document, specify the URL location of the document, including the filename, in the urlopen(url [,data]) function. The urlopen function will open the document, whether local or remote, and return a file-like object that can be used to read data from the HTML document.

Once you have opened the HTML document, you can read the file using the read([nbytes]), readline(), and readlines() functions similar to normal files. To read the entire contents of the HTML document, use the read() function to return the file contents as a string.

After you open a location, you can retrieve the location of the file using the geturl() function. The geturl function returns the URL in string format, taking into account any redirection that might have taken place when accessing the HTML file.
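The open/read/geturl pattern can be exercised without network access by opening a local file through a file:// URL; this sketch uses Python 3's urllib.request (the code in this phrase uses the older Python 2 urllib module), and the temporary file is created on the fly purely for illustration:

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Create a small local HTML file to open.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("<html><body>test</body></html>")

# Open it through a file:// URL, just as with a web URL.
url = Path(path).as_uri()
u = urllib.request.urlopen(url)
buffer = u.read()
print(u.info())   # metadata: content type, length, and so on
print("Read %d bytes from %s." % (len(buffer), u.geturl()))
os.remove(path)
```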

Note

Another helpful function included in the file-like object returned from urlopen is the info() function. The info() function returns the available metadata about the URL location, including content length, content type, and so on.

import urllib

webURL = "http://www.python.org"
localURL = "/books/python/CH8/code/test.html"

#Open web-based URL
u = urllib.urlopen(webURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s.\n" % \
      (len(buffer), u.geturl())

#Open local-based URL
u = urllib.urlopen(localURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s." % \
      (len(buffer), u.geturl())

html_open.py

Date: Tue, 18 Jul 2006 18:28:19 GMT
Server: Apache/2.0.54 (Debian GNU/Linux)
DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5
mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Mon, 17 Jul 2006 23:06:04 GMT
ETag: "601f6-351c-1310af00"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html

Web-Based URL
Read 13596 bytes from http://www.python.org.
Content-Type: text/html
Content-Length: 433
Last-modified: Thu, 13 Jul 2006 22:07:53 GMT

Local-Based URL
Read 433 bytes from
file:///books/python/CH8/code/test.html.

Output from html_open.py code

Retrieving Links from HTML Documents

Example . 

import HTMLParser
import urllib
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

lParser = parseLinks()
lParser.feed(urllib.urlopen( 
    "http://www.python.org/index.html").read())

The Python language comes with a very useful HTMLParser module that enables simple, efficient parsing of HTML documents based on the tags inside the HTML document. The HTMLParser module is one of the most important when processing HTML documents.

A common task when processing HTML documents is to pull all the links out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_starttag() method to print the href attribute value of all a tags.

Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(url) and read the contents of the HTML file.

To parse the HTML file contents and print the links contained inside, feed the data to the HTMLParser object using the feed(data) function. The feed function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.

Note

If the data passed to the feed() function of the HTMLParser is not complete, the incomplete tag is kept and then parsed the next time the feed() function is called. This can be useful when working with large HTML files that need to be fed to the parser in chunks.
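The buffering described in this note is easy to demonstrate with Python 3's html.parser module (the successor to HTMLParser) by splitting a tag across two feed() calls:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkCollector()
# The <a> tag is split across two chunks; the parser holds the
# incomplete tag until the second feed() completes it.
parser.feed('<html><body><a hre')
parser.feed('f="page.html">link</a></body></html>')
parser.close()
print(parser.links)   # ['page.html']
```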

import HTMLParser
import urllib
import sys

#Define HTML Parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

#Create instance of HTML parser
lParser = parseLinks()

#Open the HTML file
lParser.feed(urllib.urlopen( 
    "http://www.python.org/index.html").read())

lParser.close()

html_links.py

<a href="psf" class=""
title="Python Software Foundation">
links
<a href="links" class="" title="">
dev
<a href="dev" class=""
title="Python Core Language Development">
download/releases/2.4.3
<a href="download/releases/2.4.3">
http://docs.python.org
<a href="http://docs.python.org">
ftp/python/2.4.3/python-2.4.3.msi
<a href="ftp/python/2.4.3/python-2.4.3.msi">
ftp/python/2.4.3/Python-2.4.3.tar.bz2
<a href="ftp/python/2.4.3/Python-2.4.3.tar.bz2">
pypi

Output from html_links.py code

Retrieving Images from HTML Documents

Example . 

import HTMLParser
import urllib

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

lParser = parseImages()
u = urllib.urlopen(urlString)
lParser.feed(u.read())

A common task when processing HTML documents is to pull all the images out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_starttag() method to find the img tags and save the files pointed to by the src attribute values.

Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(url) and read the contents of the HTML file.

To parse the HTML file contents and save the images displayed inside, feed the data to the HTMLParser object using the feed(data) function. The feed function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
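One caveat: the code below builds each image address by simple string concatenation, which only works for src values relative to the site root. A more robust sketch resolves each src against the page URL with urljoin; shown here with Python 3's html.parser and urllib.parse, with the download step left out:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageLister(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    # urljoin handles relative, root-relative,
                    # and absolute src values correctly.
                    self.images.append(urljoin(self.base_url, value))

lister = ImageLister("http://www.python.org/index.html")
lister.feed('<img src="images/logo.gif">'
            '<img src="/images/trans.gif">'
            '<img src="http://other.example.com/nasa.jpg">')
lister.close()
print(lister.images)
```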

import HTMLParser
import urllib
import sys

urlString = "http://www.python.org"

#Save image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName

    f = open(fName, 'wb')
    f.write(data)
    f.close()

#Define HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name,value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

#Create instance of HTML parser
lParser = parseImages()

#Open the HTML file
u = urllib.urlopen(urlString)
print "Opening URL\n===================="
print u.info()

#Feed HTML file into parser
lParser.feed(u.read())

lParser.close()

html_images.py

Opening URL
====================
Date: Wed, 19 Jul 2006 18:47:27 GMT
Server: Apache/2.0.54 (Debian GNU/Linux)
DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5
mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Wed, 19 Jul 2006 16:08:34 GMT
ETag: "601f6-351c-79a6c480"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html

Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving nasa.jpg

Output from html_images.py code

Retrieving Text from HTML Documents

Example . 

import HTMLParser
import urllib

class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

lParser = parseText()
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html"
    ).read())

A common task when processing HTML documents is to pull all the text out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_data() method to parse and print the text data.

Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(url) and read the contents of the HTML file.

To parse the HTML file contents and print the text contained inside, feed the HTML file contents to the HTMLParser object using the feed(data) function. The feed function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
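The handle_data() approach can be generalized slightly: instead of filtering only chunks that are exactly a newline, skip any whitespace-only chunk. A minimal sketch using Python 3's html.parser:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        # Keep only chunks that contain non-whitespace text.
        if data.strip():
            self.text.append(data.strip())

collector = TextCollector()
collector.feed('<html><body>\n<h1>Title</h1>\n'
               '<p>Some text.</p>\n</body></html>')
collector.close()
print(collector.text)   # ['Title', 'Some text.']
```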

Note

If the data passed to the feed() function of the HTMLParser is not complete, the incomplete tag is kept and then parsed the next time the feed() function is called. This can be useful when working with large HTML files that need to be fed to the parser in chunks.

import HTMLParser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)


#Create instance of HTML parser
lParser = parseText()

#Feed HTML file into parser
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html"
    ).read())
lParser.close()
for item in urlText:
    print item

html_text.py

13.1 HTMLParser - Simple HTML and XHTML parser
Python Library Reference
Previous:
13. Structured Markup Processing
Up:
13. Structured Markup Processing
Next:
13.1.1 Example HTML Parser

13.1
HTMLParser
 -
         Simple HTML and XHTML parser
. . .

Output from html_text.py code

Retrieving Cookies in HTML Documents

Example . 

import urllib2
import cookielib
from urllib2 import urlopen, Request

cJar = cookielib.LWPCookieJar()
opener=urllib2.build_opener( 
    urllib2.HTTPCookieProcessor(cJar))
urllib2.install_opener(opener)
r = Request(testURL)
h = urlopen(r)
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)
cJar.save(cookieFile)

The Python language includes a cookielib module that provides classes for automatic handling of HTTP cookies in HTML documents. This can be absolutely necessary when dealing with HTML documents that require cookies to be set on the client.

To retrieve the cookies from an HTML document, first create an instance of a cookie jar using the LWPCookieJar() function of the cookielib module. The LWPCookieJar() function returns an object that can load from and save cookies to disk.

Next, create an opener, using the build_opener([handler, . . .]) function of the urllib2 module, which will handle the cookies when the HTML file is opened. The build_opener function accepts zero or more handlers that will be chained together in the order in which they are specified and returns an opener object.

Note

If you want urlopen() to use the opener object to open HTML files, call the install_opener(opener) function and pass in the opener object. Otherwise, use the open(url) function of the opener object to open the HTML files.

Once the opener has been created and installed, create a Request object using the Request(url) function of the urllib2 module, and then open the HTML file using the urlopen(request) function.

Once the HTML page has been opened, any cookies set by the page will be stored in the LWPCookieJar object. You can then use the save(filename) function of the LWPCookieJar object to save the cookies to disk.
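The save/load cycle of the cookie jar can be sketched without touching the network by inserting a cookie by hand; this uses Python 3's http.cookiejar (the successor to cookielib), and the cookie's name, value, and domain are made up for illustration:

```python
import http.cookiejar
import os
import tempfile

# Build a jar and insert a made-up cookie manually.
jar = http.cookiejar.LWPCookieJar()
cookie = http.cookiejar.Cookie(
    version=0, name='PREF', value='ID=example',
    port=None, port_specified=False,
    domain='.google.com', domain_specified=True, domain_initial_dot=True,
    path='/', path_specified=True,
    secure=False, expires=2000000000, discard=False,
    comment=None, comment_url=None, rest={})
jar.set_cookie(cookie)

for ind, c in enumerate(jar):
    print("%d - %s" % (ind, c))

# Round-trip the jar through a file on disk.
path = os.path.join(tempfile.mkdtemp(), "cookies.dat")
jar.save(path)
jar2 = http.cookiejar.LWPCookieJar()
jar2.load(path)
print(len(jar2))   # 1
```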

import os
import urllib2
import cookielib
from urllib2 import urlopen, Request

cookieFile = "cookies.dat"
testURL = 'http://maps.google.com/'

#Create instance of cookie jar
cJar = cookielib.LWPCookieJar()

#Create HTTPCookieProcessor opener object
opener = urllib2.build_opener( 
    urllib2.HTTPCookieProcessor(cJar))

#Install the HTTPCookieProcessor opener
urllib2.install_opener(opener)

#Create a Request object
r = Request(testURL)

#Open the HTML file
h = urlopen(r)
print "Page Header\n======================"
print h.info()

print "Page Cookies\n======================"
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)

#Save the cookies
cJar.save(cookieFile)

html_cookie.py

Page Header
======================
Cache-Control: private
Set-Cookie: PREF=ID=fac1f1fcb33dae16:TM=1153336398:
LM=1153336398:S=CpIvoPKTNq6KhCx1; expires=Sun,
17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html; charset=ISO-8859-1
Server: mfe
Content-Length: 28271
Date: Wed, 19 Jul 2006 19:13:18 GMT

Page Cookies
======================
0 - <Cookie PREF=ID=fac1f1fcb33dae16:TM=1153336398:
LM=1153336398:S=CpIvoPKTNq6KhCx1 for .google.com/>

Output from html_cookie.py code

Adding Quotes to Attribute Values in HTML Documents

Example . 

import HTMLParser
import urllib

class parseAttrs(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        . . .

attrParser = parseAttrs()
attrParser.init_parser()
attrParser.feed(urllib.urlopen("test2.html").read())

Earlier in this chapter, we discussed parsing HTML files based on specific handlers in the HTML parser. There are times when you need to use all the handlers to process an HTML document. Using the HTMLParser module to parse all entities in the HTML file is not much more complex than handling the links or images.

This phrase discusses how to use the HTMLParser module to parse an HTML file and fix attribute values that are missing quotes. The first step is to define a new HTMLParser class that overrides all the following handlers so that the quotes can be added to the attribute values.

handle_starttag(tag, attrs)
handle_charref(name)
handle_endtag(tag)
handle_entityref(ref)
handle_data(text)
handle_comment(text)
handle_pi(text)
handle_decl(text)
handle_startendtag(tag, attrs)

You will also need to define a function inside the parser class to initialize the variables used to store the parsed data and another function to return the parsed data.

Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Use the init function you created to initialize the parser; then open the HTML document using urllib.urlopen(url) and read the contents of the HTML file.

To parse the HTML file contents and add the quotes to the attribute values, feed the data to the HTMLParser object using the feed(data) function. The feed function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.

import HTMLParser
import urllib
import sys

#Define the HTML parser
class parseAttrs(HTMLParser.HTMLParser):
    def init_parser(self):
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixedAttrs = ""
        for name, value in attrs:
            fixedAttrs += '%s="%s" ' % (name, value)
        self.pieces.append("<%s %s>" % (tag, fixedAttrs))

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % (name))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % (tag))

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % (ref))

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % (text))

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % (text))

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % (text))

    def parsed(self):
        return "".join(self.pieces)

#Create instance of HTML parser
attrParser = parseAttrs()

#Initialize the parser data
attrParser.init_parser()

#Feed HTML file into parser
attrParser.feed(urllib.urlopen("test2.html").read())

#Display original file contents
print "Original File\n========================"
print open("test2.html").read()

#Display the parsed file
print "Parsed File\n========================"
print attrParser.parsed()

attrParser.close()

html_quotes.py

Original File
========================
<html lang="en" xml:lang="en">
<head>
<meta content="text/html; charset=utf-8"
 http-equiv="content-type"/>
<title>Web Page</title>
</head>
<body>
<H1>Web Listings</H1>
<a href=http://www.python.org>Python Web Site</a>
<a href=test.html>local page</a>
<img SRC=test.jpg>
</body>
</html>

Parsed File
========================
<html lang="en" xml:lang="en" >
<head >
<meta content="text/html; charset=utf-8"
 http-equiv="content-type" ></meta>
<title >Web Page</title>
</head>
<body >
<h1 >Web Listings</h1>
<a href="http://www.python.org" >Python Web Site</a>
<a href="test.html" >local page</a>
<img src="test.jpg" >
</body>
</html>

Output from html_quotes.py code
