Several modules included with Python provide virtually all the tools necessary to parse and process HTML documents without needing a web server or web browser. Parsing HTML files has become commonplace in applications such as search engines, document indexing, document conversion, data retrieval, and site backup or migration.
Because there is no way to cover the full range of options Python provides for HTML processing, the first two phrases in this chapter focus on specific Python modules that simplify opening HTML documents locally and on the Web. The remaining phrases discuss how to use Python modules to quickly parse the data in HTML files to process specific items, such as links, images, and cookies. The final phrase in this chapter uses the example of fixing HTML files that do not have properly quoted tag attributes to demonstrate how to easily process the entire contents of an HTML file.
Example .
import urlparse

parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")

unparsedURL = urlparse.urlunparse(
    (URLscheme, URLlocation, URLpath, '', '', ''))

newURL = urlparse.urljoin(unparsedURL,
    "module-urllib2/request-objects.html")
The urlparse module included with Python makes it easy to break down URLs into specific components and reassemble them. This is very useful for a number of purposes when processing HTML documents.
The urlparse(urlstring [, default_scheme [, allow_fragments]])
function takes the URL provided in urlstring
and returns the tuple (scheme, netloc, path, parameters, query, fragment
). The tuple can then be used to determine things such as location scheme (HTTP, FTP, and so on), server address, file path, and so on.
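As a quick illustration of the tuple's contents, the following sketch uses urllib.parse, which is where these functions live in Python 3 (in Python 2 the module is named urlparse); the search URL is shortened from the example above:

```python
# Python 3 renamed the urlparse module to urllib.parse
from urllib.parse import urlparse

# Break a search URL down into its components
parts = urlparse("http://www.google.com/search?hl=en&q=urlparse")
print(parts.scheme)   # location scheme, e.g. "http"
print(parts.netloc)   # server address: "www.google.com"
print(parts.path)     # file path: "/search"
print(parts.query)    # query string: "hl=en&q=urlparse"
```

The result behaves both as a plain tuple and as a named tuple, so components can be read by index or by attribute name.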
The urlunparse(tuple)
function accepts the tuple (scheme, netloc, path, parameters, query, fragment
) and reassembles it into a properly formatted URL that can be used by the other HTML parsing modules included with Python.
The urljoin(base, url [, allow_fragments])
function accepts a base URL as the first argument and then joins whatever relative URL is specified in the second argument. The urljoin
function is extremely useful in processing several files in the same location by joining new filenames to the existing base URL location.
If the relative path does not start with the root (/
) character, the rightmost component of the base URL path will be replaced with the relative path. For example, a base URL of http://www.testpage.com/pub and a relative URL of test.html would join to form the URL http://www.testpage.com/test.html, not http://www.testpage.com/pub/test.html. If you want to keep the end directory in the path, make sure to end the base URL string with a /
character.
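This trailing-slash behavior is easy to verify. The sketch below uses urllib.parse, the Python 3 home of urljoin (the behavior matches the Python 2 urlparse module), with the hypothetical testpage.com URLs from the paragraph above:

```python
# Python 3 location of urljoin; behavior matches Python 2's urlparse.urljoin
from urllib.parse import urljoin

# Without a trailing slash, "pub" is treated as a file and replaced
print(urljoin("http://www.testpage.com/pub", "test.html"))
# http://www.testpage.com/test.html

# With a trailing slash, "pub" is kept as a directory
print(urljoin("http://www.testpage.com/pub/", "test.html"))
# http://www.testpage.com/pub/test.html
```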
import urlparse

URLscheme = "http"
URLlocation = "www.python.org"
URLpath = "lib/module-urlparse.html"

modList = ("urllib", "urllib2", "httplib", "cgilib")

#Parse address into tuple
print "Parsed Google search for urlparse"
parsedTuple = urlparse.urlparse(
    "http://www.google.com/search?hl=en&q=urlparse&btnG=Google+Search")
print parsedTuple

#Unparse list into URL
print "\nUnparsed python document page"
unparsedURL = urlparse.urlunparse(
    (URLscheme, URLlocation, URLpath, '', '', ''))
print "\t" + unparsedURL

#Join path to new file to create new URL
print "\nAdditional python document pages using join"
for mod in modList:
    newURL = urlparse.urljoin(unparsedURL,
        "module-%s.html" % (mod))
    print "\t" + newURL

#Join path to subpath to create new URL
print "\nPython document pages using join of sub-path"
newURL = urlparse.urljoin(unparsedURL,
    "module-urllib2/request-objects.html")
print "\t" + newURL
URL_parse.py
Parsed Google search for urlparse
('http', 'www.google.com', '/search', '', 'hl=en&q=urlparse&btnG=Google+Search', '')

Unparsed python document page
    http://www.python.org/lib/module-urlparse.html

Additional python document pages using join
    http://www.python.org/lib/module-urllib.html
    http://www.python.org/lib/module-urllib2.html
    http://www.python.org/lib/module-httplib.html
    http://www.python.org/lib/module-cgilib.html

Python document pages using join of sub-path
    http://www.python.org/lib/module-urllib2/request-objects.html
Output from URL_parse.py code
Example .
import urllib

u = urllib.urlopen(webURL)
u = urllib.urlopen(localURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s." % (len(buffer), u.geturl())
The urllib and urllib2 modules included with Python provide the functionality to open and fetch data from URLs, including HTML documents.
To use the urllib module to open an HTML document, specify the URL location of the document, including the filename in the urlopen(url [,data])
function. The urlopen
function will open a local file and return a file-like object that can be used to read data from the HTML document.
Once you have opened the HTML document, you can read the file using the read([nbytes])
, readline()
, and readlines()
functions similar to normal files. To read the entire contents of the HTML document, use the read()
function to return the file contents as a string.
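In Python 3, urlopen moved to urllib.request; the behavior of the returned file-like object is the same. The sketch below avoids the network entirely by opening a data: URL (supported by the default opener since Python 3.4), which embeds the document in the URL itself:

```python
from urllib.request import urlopen

# A data: URL embeds the document directly, so no server or local file is needed
u = urlopen("data:text/html,<html><body>Hello</body></html>")

buffer = u.read()     # read() returns the entire body (as bytes in Python 3)
print(len(buffer))
print(u.geturl())     # the URL that was actually opened
```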
After you open a location, you can retrieve the location of the file using the geturl()
function. The geturl
function returns the URL in string format, taking into account any redirection that might have taken place when accessing the HTML file.
Another helpful function included in the file-like object returned from urlopen
is the info()
function. The info()
function returns the available metadata about the URL location, including content length, content type, and so on.
import urllib

webURL = "http://www.python.org"
localURL = "/books/python/CH8/code/test.html"

#Open web-based URL
u = urllib.urlopen(webURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s.\n" % (len(buffer), u.geturl())

#Open local-based URL
u = urllib.urlopen(localURL)
buffer = u.read()
print u.info()
print "Read %d bytes from %s." % (len(buffer), u.geturl())
Date: Tue, 18 Jul 2006 18:28:19 GMT
Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Mon, 17 Jul 2006 23:06:04 GMT
ETag: "601f6-351c-1310af00"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html

Web-Based URL
Read 13596 bytes from http://www.python.org.

Content-Type: text/html
Content-Length: 433
Last-modified: Thu, 13 Jul 2006 22:07:53 GMT

Local-Based URL
Read 433 bytes from file:///books/python/CH8/code/test.html.
Example .
import HTMLParser
import urllib

class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name,value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

lParser = parseLinks()
lParser.feed(urllib.urlopen(
    "http://www.python.org/index.html").read())
The Python language comes with a very useful HTMLParser module that enables simple, efficient parsing of HTML documents based on the tags inside the HTML document. The HTMLParser module is one of the most important tools for processing HTML documents.
A common task when processing HTML documents is to pull all the links out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_starttag()
method to print the href
attribute value of all a
tags.
Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(
url
)
and read the contents of the HTML file.
To parse the HTML file contents and print the links contained inside, feed the data to the HTMLParser object using the feed(
data
)
function. The feed
function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
If the data passed to the feed()
function of the HTMLParser is not complete, the incomplete tag is kept and then parsed the next time the feed()
function is called. This can be useful when working with large HTML files that need to be fed to the parser in chunks.
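This chunked behavior can be demonstrated with a small self-contained parser. The sketch below is written against Python 3's html.parser (the renamed HTMLParser module) and feeds a hypothetical document in two pieces, split in the middle of the <a> tag; the link is still found:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href value of every <a> tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkParser()
# Feed the document in two chunks, splitting inside the <a ...> tag;
# the parser buffers the incomplete tag until more data arrives
parser.feed('<html><body><a hr')
parser.feed('ef="index.html">home</a></body></html>')
parser.close()
print(parser.links)   # ['index.html']
```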
import HTMLParser
import urllib
import sys

#Define HTML Parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name,value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

#Create instance of HTML parser
lParser = parseLinks()

#Open the HTML file
lParser.feed(urllib.urlopen(
    "http://www.python.org/index.html").read())
lParser.close()
psf
<a href="psf" class="" title="Python Software Foundation">
links
<a href="links" class="" title="">
dev
<a href="dev" class="" title="Python Core Language Development">
download/releases/2.4.3
<a href="download/releases/2.4.3">
http://docs.python.org
<a href="http://docs.python.org">
ftp/python/2.4.3/python-2.4.3.msi
<a href="ftp/python/2.4.3/python-2.4.3.msi">
ftp/python/2.4.3/Python-2.4.3.tar.bz2
<a href="ftp/python/2.4.3/Python-2.4.3.tar.bz2">
pypi
Output from html_links.py code
Example .
import HTMLParser
import urllib

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name,value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

lParser = parseImages()
u = urllib.urlopen(urlString)
lParser.feed(u.read())
A common task when processing HTML documents is to pull all the images out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_starttag()
method to find the img
tags and saves the file pointed to by the src
attribute value.
Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(
url
)
and read the contents of the HTML file.
To parse the HTML file contents and save the images displayed inside, feed the data to the HTMLParser object using the feed(
data
)
function. The feed
function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
import HTMLParser
import urllib
import sys

urlString = "http://www.python.org"

#Save image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

#Define HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name,value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

#Create instance of HTML parser
lParser = parseImages()

#Open the HTML file
u = urllib.urlopen(urlString)
print "Opening URL ===================="
print u.info()

#Feed HTML file into parser
lParser.feed(u.read())
lParser.close()
html_images.py
Opening URL ====================
Date: Wed, 19 Jul 2006 18:47:27 GMT
Server: Apache/2.0.54 (Debian GNU/Linux) DAV/2 SVN/1.1.4 mod_python/3.1.3 Python/2.3.5 mod_ssl/2.0.54 OpenSSL/0.9.7e
Last-Modified: Wed, 19 Jul 2006 16:08:34 GMT
ETag: "601f6-351c-79a6c480"
Accept-Ranges: bytes
Content-Length: 13596
Connection: close
Content-Type: text/html

Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving nasa.jpg
Example .
import HTMLParser
import urllib

class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != ' ':
            urlText.append(data)

lParser = parseText()
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html").read())
A common task when processing HTML documents is to pull all the text out of the document. Using the HTMLParser module, this task is fairly simple. The first step is to define a new HTMLParser class that overrides the handle_data()
method to parse and print the text data.
Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Then open the HTML document using urllib.urlopen(
url
)
and read the contents of the HTML file.
To parse the HTML file contents and print the text contained inside, feed the HTML file contents to the HTMLParser object using the feed(
data
)
function. The feed
function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
If the data passed to the feed()
function of the HTMLParser is not complete, the incomplete tag is kept and then parsed the next time the feed()
function is called. This can be useful when working with large HTML files that need to be fed to the parser in chunks.
import HTMLParser
import urllib

urlText = []

#Define HTML Parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != ' ':
            urlText.append(data)

#Create instance of HTML parser
lParser = parseText()

#Feed HTML file into parser
lParser.feed(urllib.urlopen(
    "http://docs.python.org/lib/module-HTMLParser.html").read())
lParser.close()

for item in urlText:
    print item
html_text.py
13.1 HTMLParser - Simple HTML and XHTML parser
Python Library Reference
Previous: 13. Structured Markup Processing
Up: 13. Structured Markup Processing
Next: 13.1.1 Example HTML Parser
13.1 HTMLParser - Simple HTML and XHTML parser
. . .
Output from html_text.py code
Example .
import urllib2
import cookielib
from urllib2 import urlopen, Request

cJar = cookielib.LWPCookieJar()
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cJar))
urllib2.install_opener(opener)
r = Request(testURL)
h = urlopen(r)
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)
cJar.save(cookieFile)
The Python language includes a cookielib module that provides classes for automatic handling of HTTP cookies in HTML documents. This handling is often necessary when accessing HTML documents that require cookies to be set on the client.
To retrieve the cookies from an HTML document, first create an instance of a cookie jar using the LWPCookieJar()
function of the cookielib module. The LWPCookieJar()
function returns an object that can load cookies from and save cookies to disk.
Next, create an opener, using the build_opener([handler, . . .])
function of the urllib2 module, which will handle the cookies when the HTML file is opened. The build_opener
function accepts zero or more handlers that will be chained together in the order in which they are specified and returns an opener object.
If you want urlopen()
to use the opener object to open HTML files, call the install_opener(opener)
function and pass in the opener object. Otherwise, use the open(url)
function of the opener object to open the HTML files.
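The opener wiring can be exercised without touching the network. The sketch below uses the Python 3 names (http.cookiejar and urllib.request replaced cookielib and urllib2) and simply builds and installs a cookie-aware opener:

```python
import http.cookiejar
import urllib.request

# Create a cookie jar that can also load from and save to disk
cJar = http.cookiejar.LWPCookieJar()

# Chain a cookie-handling processor into a new opener object
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(cJar))

# After install_opener, plain urlopen() calls use this opener;
# alternatively, call opener.open(url) directly
urllib.request.install_opener(opener)

print(type(opener).__name__)   # OpenerDirector
```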
Once the opener has been created and installed, create a Request object using the Request(
url
)
function of the urllib2 module, and then open the HTML file using the urlopen(
request
)
function.
Once the HTML page has been opened, any cookies in the page will be stored in the LWPCookieJar object. You can then use the save(
filename
)
function of the LWPCookieJar object to save the cookies to disk.
import os
import urllib2
import cookielib
from urllib2 import urlopen, Request

cookieFile = "cookies.dat"
testURL = 'http://maps.google.com/'

#Create instance of cookie jar
cJar = cookielib.LWPCookieJar()

#Create HTTPCookieProcessor opener object
opener = urllib2.build_opener(
    urllib2.HTTPCookieProcessor(cJar))

#Install the HTTPCookieProcessor opener
urllib2.install_opener(opener)

#Create a Request object
r = Request(testURL)

#Open the HTML file
h = urlopen(r)
print "Page Header ======================"
print h.info()

print "Page Cookies ======================"
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)

#Save the cookies
cJar.save(cookieFile)
html_cookie.py
Page Header ======================
Cache-Control: private
Set-Cookie: PREF=ID=fac1f1fcb33dae16:TM=1153336398:LM=1153336398:S=CpIvoPKTNq6KhCx1; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html; charset=ISO-8859-1
Server: mfe
Content-Length: 28271
Date: Wed, 19 Jul 2006 19:13:18 GMT

Page Cookies ======================
0 - <Cookie PREF=ID=fac1f1fcb33dae16:TM=1153336398:LM=1153336398:S=CpIvoPKTNq6KhCx1 for .google.com/>
Output from html_cookie.py code
Example .
import HTMLParser
import urllib

class parseAttrs(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        . . .

attrParser = parseAttrs()
attrParser.init_parser()
attrParser.feed(urllib.urlopen("test2.html").read())
Earlier in this chapter, we discussed parsing HTML files based on specific handlers in the HTML parser. There are times when you need to use all the handlers to process an HTML document. Using the HTMLParser module to parse all entities in the HTML file is not much more complex than handling the links or images.
This phrase discusses how to use the HTMLParser module to fix an HTML file whose attribute values are not enclosed in quotes. The first step is to define a new HTMLParser class that overrides all the following handlers so that the quotes can be added to the attribute values.
handle_starttag(tag, attrs)
handle_charref(name)
handle_endtag(tag)
handle_entityref(ref)
handle_data(text)
handle_comment(text)
handle_pi(text)
handle_decl(text)
handle_startendtag(tag, attrs)
You will also need to define a function inside the parser class to initialize the variables used to store the parsed data and another function to return the parsed data.
Once the new HTMLParser class has been defined, create an instance of the class to return an HTMLParser object. Use the init_parser
function you created to initialize the parser; then open the HTML document using urllib.urlopen(
url
)
and read the contents of the HTML file.
To parse the HTML file contents and add the quotes to the attribute values, feed the data to the HTMLParser object using the feed(
data
)
function. The feed
function of the HTMLParser object will accept the data and parse it based on the defined HTMLParser object.
import HTMLParser
import urllib
import sys

#Define the HTML parser
class parseAttrs(HTMLParser.HTMLParser):
    def init_parser(self):
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixedAttrs = ""
        for name, value in attrs:
            fixedAttrs += '%s="%s" ' % (name, value)
        self.pieces.append("<%s %s>" % (tag, fixedAttrs))

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % (name))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % (tag))

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % (ref))

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % (text))

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % (text))

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % (text))

    def parsed(self):
        return "".join(self.pieces)

#Create instance of HTML parser
attrParser = parseAttrs()

#Initialize the parser data
attrParser.init_parser()

#Feed HTML file into parser
attrParser.feed(urllib.urlopen("test2.html").read())

#Display original file contents
print "Original File ========================"
print open("test2.html").read()

#Display the parsed file
print "Parsed File ========================"
print attrParser.parsed()

attrParser.close()
html_quotes.py
Original File ========================
<html lang="en" xml:lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>Web Page</title>
</head>
<body>
<H1>Web Listings</H1>
<a href=http://www.python.org>Python Web Site</a>
<a href=test.html>local page</a>
<img SRC=test.jpg>
</body>
</html>

Parsed File ========================
<html lang="en" xml:lang="en" >
<head >
<meta content="text/html; charset=utf-8" http-equiv="content-type" ></meta>
<title >Web Page</title>
</head>
<body >
<h1 >Web Listings</h1>
<a href="http://www.python.org" >Python Web Site</a>
<a href="test.html" >local page</a>
<img src="test.jpg" >
</body>
</html>
Output from html_quotes.py code