I mentioned XML briefly in Project 1. Now it’s time to examine it in more detail. In this project, you see how XML can be used to represent many kinds of data, and how XML files can be processed with the Simple API for XML, or SAX. The goal of this project is to generate a full web site from a single XML file that describes the various web pages and directories.
In this chapter, I assume that you know what XML is and how to write it. If you know some HTML, you’re already familiar with the basics. XML isn’t really a specific language (such as HTML); it’s more like a set of rules that define a class of languages. Basically, you still write tags the same way as in HTML, but in XML you can invent tag names yourself. Such specific sets of tag names and their structural relationships can be described in Document Type Definitions or XML Schema—I won’t be discussing those here.
For a concise description of what XML is, see the World Wide Web Consortium’s (W3C’s) “XML in 10 points” (http://www.w3.org/XML/1999/XML-in-10-points
). A more thorough tutorial can be found on the W3Schools web site (http://www.w3schools.com/xml
). For more information about SAX, see the official SAX web site (http://www.saxproject.org
).
The general problem you’ll be attacking in this project is to parse (read and process) XML files. Because you can use XML to represent practically anything, and you can do whatever you want with the data when you parse it, the applications are boundless (as the title of this chapter indicates).
The specific problem tackled in this chapter is to generate a complete web site from a single XML file that contains the structure of the site and the basic contents of each page.
Before you proceed with this project, I suggest that you take a few moments to read a bit about XML and to check out its applications. That might give you a better understanding of when it might be a useful file format and when it would just be overkill. (After all, plain-text files can be just fine when they’re all you need.)
Let’s define the specific goals for the project:
This last point is perhaps enough to make it all worthwhile, but there are other benefits. By placing all your contents in a single XML file, you could easily write other programs that use the same XML processing techniques to extract various kinds of information, such as tables of contents, indices for custom search engines, and so on. And even if you don’t use this for your web site, you could use it to create HTML-based slide shows (or, by using something like ReportLab, discussed in the previous chapter, you could even create PDF slide shows).
Python has some built-in XML support, but if you’re using an old version, you may need to install some extras yourself. In this project, you’ll need a functioning SAX parser. To see if you have a usable SAX parser, try to execute the following:
>>> from xml.sax import make_parser
>>> parser = make_parser()
In all likelihood, no exceptions will be raised when you do this. In that case, you’re all set and can continue to the “Preparations” section.
Tip Plenty of XML tools for Python are out there. One very interesting alternative to the “standard” PyXML framework is Fredrik Lundh’s ElementTree (and the C implementation, cElementTree), which is also included in recent versions of the Python standard library, in the package xml.etree
. If you have an older Python version, you can get ElementTree from http://effbot.org/zone
. It’s quite powerful and easy to use, and may well be worth a look if you’re serious about using XML in Python.
If you do get an exception (which may be the case for older Python versions), you must install PyXML. First, download the PyXML package from http://sf.net/projects/pyxml
. There you can find RPM packages for Linux, binary installers for Windows, and source distributions for other platforms. The RPMs are installed with rpm --install
, and the binary Windows distribution is installed simply by executing it. The source distribution is installed through the standard Python installation mechanism, Distutils. Simply unpack the tar.gz
file, change to the unpacked directory, and execute the following:
$ python setup.py install
You should now be able to use the XML tools.
Before you can write the program that processes your XML files, you must design your XML format. What tags do you need, what attributes should they have, and which tags should go where? To find out, let’s first consider what it is you want your XML to describe.
The main concepts are web site, directory, page, name, title, and contents:
In short, your document will consist of a single website
element, containing several directory
and page
elements, each of the directory elements optionally containing more pages and directories. The directory
and page
elements will have an attribute called name
, which will contain their name. In addition, the page
tag has a title
attribute. The page
element contains XHTML code (of the type found inside the XHTML body
tag). A sample file is shown in Listing 22-1.
<website>
<page name="index" title="Home Page">
<h1>Welcome to My Home Page</h1>
<p>Hi, there. My name is Mr. Gumby, and this is my home page. Here
are some of my interests:</p>
<ul>
<li><a href="interests/shouting.html">Shouting</a></li>
<li><a href="interests/sleeping.html">Sleeping</a></li>
<li><a href="interests/eating.html">Eating</a></li>
</ul>
</page>
<directory name="interests">
<page name="shouting" title="Shouting">
<h1>Mr. Gumby's Shouting Page</h1>
<p>...</p>
</page>
<page name="sleeping" title="Sleeping">
<h1>Mr. Gumby's Sleeping Page</h1>
<p>...</p>
</page>
<page name="eating" title="Eating">
<h1>Mr. Gumby's Eating Page</h1>
<p>...</p>
</page>
</directory>
</website>
At this point, we haven’t yet looked at how XML parsing works. The approach we are using here (called SAX) consists of writing a set of event handlers (just as in GUI programming) and then letting an existing XML parser call these handlers as it reads the XML document.
Several event types are available when parsing with SAX, but let’s restrict ourselves to three: the beginning of an element (the occurrence of an opening tag), the end of an element (the occur-rence of a closing tag), and plain text (characters). To parse the XML file, let’s use the parse
function from the xml.sax
module. This function takes care of reading the file and generating the events, but as it generates these events, it needs some event handlers to call. These event handlers will be implemented as methods of a content handler object. You’ll subclass the ContentHandler
class from xml.sax.handler
because it implements all the necessary event handlers (as dummy operations that have no effect), and you can override only the ones you need.
Let’s begin with a minimal XML parser (assuming that your XML file is called website.xml
):
from xml.sax.handler import ContentHandler
from xml.sax import parse
class TestHandler(ContentHandler): pass
parse('website.xml', TestHandler())
If you execute this program, seemingly nothing happens, but you shouldn’t get any error messages either. Behind the scenes, the XML file is parsed, and the default event handlers are called, but because they don’t do anything, you won’t see any output.
Let’s try a simple extension. Add the following method to the TestHandler
class:
def startElement(self, name, attrs):
print name, attrs.keys()
This overrides the default startElement
event handler. The parameters are the relevant tag name and its attributes (kept in a dictionary-like object). If you run the program again (using website.xml
from Listing 22-1), you see the following output:
website []
page [u'name', u'title']
h1 []
p []
ul []
li []
a [u'href']
li []
a [u'href']
li []
a [u'href']
directory [u'name']
page [u'name', u'title']
h1 []
p []
page [u'name', u'title']
h1 []
p []
page [u'name', u'title']
h1 []
p []
How this works should be pretty clear. In addition to startElement
, you’ll use endElement
(which takes only a tag name as its argument) and characters
(which takes a string as its argument).
The following is an example that uses all these three methods to build a list of the headlines (the h1
elements) of the web site file:
from xml.sax.handler import ContentHandler
from xml.sax import parse
class HeadlineHandler(ContentHandler):
in_headline = False
def __init__(self, headlines):
ContentHandler.__init__(self)
self.headlines = headlines
self.data = []
def startElement(self, name, attrs):
if name == 'h1':
self.in_headline = True
def endElement(self, name):
if name == 'h1':
text = ''.join(self.data)
self.data = []
self.headlines.append(text)
self.in_headline = False
def characters(self, string):
if self.in_headline:
self.data.append(string)
headlines = []
parse('website.xml', HeadlineHandler(headlines))
print 'The following <h1> elements were found:'
for h in headlines:
print h
Note that the HeadlineHandler
keeps track of whether it’s currently parsing text that is inside a pair of h1
tags. This is done by setting self.in_headline
to True
when startElement
finds an h1
tag, and setting self.in_headline
to False
when endElement
finds an h1
tag. The characters
method is automatically called when the parser finds some text. As long as the parser is between two h1
tags (self.in_headline
is True
), characters
will append the string (which may be just a part of the text between the tags) to self.data
, which is a list of strings. The task of joining these text fragments, appending them to self.headlines
(as a single string), and resetting self.data
to an empty list also befalls endElement
. This general approach (of using Boolean variables to indicate whether you are currently “inside” a given tag type) is quite common in SAX programming.
Running this program (again, with the website.xml
file from Listing 22-1), you get the following output:
The following <h1> elements were found:
Welcome to My Home Page
Mr. Gumby's Shouting Page
Mr. Gumby's Sleeping Page
Mr. Gumby's Eating Page
Now you’re ready to make the prototype. For now, let’s ignore the directories and concentrate on creating HTML pages. You need to create a slightly embellished event handler that does the following:
page
element, opens a new file with the given name, and writes a suitable HTML header to it, including the given titlepage
element, writes a suitable HTML footer to the file, and closes itpage
element, passes through all tags and characters without modifying them (writes them to the file as they are)page
element, ignores all tags (such as website
and directory
)Most of this is pretty straightforward (at least if you know a bit about how HTML documents are constructed). There are two problems, however, which may not be completely obvious:
page
element. You must keep track of that sort of thing yourself (as you did in the HeadlineHandler
example). For this project, you’re interested only in whether or not to pass through tags and characters, so you’ll use a Boolean variable called passthrough
, which you’ll update as you enter and leave the pages.See Listing 22-2 for the code for the simple program.
from xml.sax.handler import ContentHandler
from xml.sax import parse
class PageMaker(ContentHandler):
passthrough = False
def startElement(self, name, attrs):
if name == 'page':
self.passthrough = True
self.out = open(attrs['name'] + '.html', 'w')
self.out.write('<html><head>
')
self.out.write('<title>%s</title>
' % attrs['title'])
self.out.write('</head><body>
')
elif self.passthrough:
self.out.write('<' + name)
for key, val in attrs.items():
self.out.write(' %s="%s"' % (key, val))
self.out.write('>')
def endElement(self, name):
if name == 'page':
self.passthrough = False
self.out.write('
</body></html>
')
self.out.close()
elif self.passthrough:
self.out.write('</%s>' % name)
def characters(self, chars):
if self.passthrough: self.out.write(chars)
parse('website.xml', PageMaker ())
You should execute this in the directory in which you want your files to appear. Note that even if two pages are in two different directory elements, they will end up in the same real directory. (That will be fixed in our second implementation.)
Again, using the file website.xml
from Listing 22-1, you get four HTML files. The file called index.html
contains the following:
<html><head>
<title>Home Page</title>
</head><body>
<h1>Welcome to My Home Page</h1>
<p>Hi, there. My name is Mr. Gumby, and this is my home page. Here
are some of my interests:</p>
<ul>
<li><a href="interests/shouting.html">Shouting</a></li>
<li><a href="interests/sleeping.html">Sleeping</a></li>
<li><a href="interests/eating.html">Eating</a></li>
</ul>
</body></html>
Figure 22-1 shows how this page looks when viewed in a browser.
Looking at the code, two main weaknesses should be obvious:
if
statements to handle the various event types. If you need to handle many such event types, your if
statements will get large and unreadable.Both of these weaknesses will be addressed in the second implementation.
Because the SAX mechanism is so low level and basic, you may often find it useful to write a mix-in class that handles some administrative details such as gathering character data, managing Boolean state variables (such as passthrough
), or dispatching the events to your own custom event handlers. The state and data handling are pretty simple in this project, so let’s focus on the handler dispatch.
Rather than needing to write large if
statements in the standard generic event handlers (such as startElement
), it would be nice to just write your own specific ones (such as startPage
) and have them called automatically. You can implement that functionality in a mix-in class, and then subclass the mix-in along with ContentHandler
.
Note As mentioned in Chapter 7, a mix-in is a class with limited functionality that is meant to be subclassed along with some other more substantial class.
You want the following functionality in your program:
startElement
is called with a name such as 'foo'
, it should attempt to find an event handler called startFoo
and call it with the given attributes.endElement
is called with 'foo'
, it should try to call endFoo
.defaultStart
(or defaultEnd
, respectively) will be called, if present. If the default handler isn’t present either, nothing should be done.In addition, some care should be taken with the parameters. The custom handlers (for example, startFoo
) do not need the tag name as a parameter, while the custom default handlers (for example, defaultStart
) do. Also, only the start handlers need the attributes.
Confused? Let’s begin by writing the simplest parts of the class:
class Dispatcher:
# ...
def startElement(self, name, attrs):
self.dispatch('start', name, attrs)
def endElement(self, name):
self.dispatch('end', name)
Here, the basic event handlers are implemented, and they simply call a method called dispatch
, which takes care of finding the appropriate handler, constructing the argument tuple, and then calling the handler with those arguments. Here is the code for the dispatch
method:
def dispatch(self, prefix, name, attrs=None):
mname = prefix + name.capitalize()
dname = 'default' + prefix.capitalize()
method = getattr(self, mname, None)
if callable(method): args = ()
else:
method = getattr(self, dname, None)
args = name,
if prefix == 'start': args += attrs,
if callable(method): method(*args)
The following is what happens:
'start'
or 'end'
) and a tag name (for example, 'page'
), construct the method name of the handler (for example, 'startPage'
).'defaultStart'
).getattr
, using None
as the default value.args
.getattr
, again using None
as the default value. Also, set args
to a tuple containing only the tag name (because the default handler needs that).args
).Got that? This basically means that you can now write content handlers like this:
class TestHandler(Dispatcher, ContentHandler):
def startPage(self, attrs):
print 'Beginning page', attrs['name']
def endPage(self):
print 'Ending page'
Because the dispatcher mix-in takes care of most of the plumbing, the content handler is fairly simple and readable. (Of course, you’ll add more functionality in a little while.)
This section is much easier than the previous one. Instead of doing the calls to self.out.write
directly in the event handler, you’ll create separate methods for writing the header and footer. That way, you can easily override these methods by subclassing the event handler. Let’s make the default header and footer really simple:
def writeHeader(self, title):
self.out.write("<html>
<head>
<title>")
self.out.write(title)
self.out.write("</title>
</head>
<body>
")
def writeFooter(self):
self.out.write("
</body>
</html>
")
Handling of the XHTML contents was also linked a bit too intimately with the original handlers. The XHTML will now be handled by defaultStart
and defaultEnd
:
def defaultStart(self, name, attrs):
if self.passthrough:
self.out.write('<' + name)
for key, val in attrs.items():
self.out.write(' %s="%s"' % (key, val))
self.out.write('>')
def defaultEnd(self, name):
if self.passthrough:
self.out.write('</%s>' % name)
This works just like before, except that I’ve moved the code to separate methods (which is usually a good thing). Now, on to the last piece of the puzzle.
To create the necessary directories, you need a couple of useful functions from the os
and os.path
modules. One of these functions is os.makedirs
, which makes all the necessary directories in a given path. For example, os.makedirs('foo/bar/baz')
creates the directory foo
in the current directory, then creates bar
in foo
, and finally, baz
in bar
. If foo
already exists, only bar
and baz
are created, and similarly, if bar
also exists, only baz
is created. However, if baz
exists as well, an exception is raised.
To avoid this exception, you need the function os.path.isdir
, which checks whether a given path is a directory (that is, whether it exists already). Another useful function is os.path.join
, which joins several paths with the correct separator (for example, /
in UNIX and so forth).
At all times during the processing, keep the current directory path as a list of directory names, referenced by the variable directory
. When you enter a directory, append its name; when you leave it, pop the name off. Assuming that directory
is set up properly, you can define a function for ensuring that the current directory exists:
def ensureDirectory(self):
path = os.path.join(*self.directory)
if not os.path.isdir(path): os.makedirs(path)
Notice how I’ve used argument splicing (with the star operator, *
) on the directory list when supplying it to os.path.join
.
The base directory of our web site (for example, public_html
) can be given as an argument to the constructor, which then looks like this:
def __init__(self, directory):
self.directory = [directory]
self.ensureDirectory()
Finally we’ve come to the event handlers. You need four of them: two for dealing with directories, and two for pages. The directory handlers simply use the directory
list and the ensureDirectory
method:
def startDirectory(self, attrs):
self.directory.append(attrs['name'])
self.ensureDirectory()
def endDirectory(self):
self.directory.pop()
The page handlers use the writeHeader
and writeFooter
methods. In addition, they set the passthrough
variable (to pass through the XHTML), and—perhaps most important—they open and close the file associated with the page:
def startPage(self, attrs):
filename = os.path.join(*self.directory+[attrs['name']+'.html'])
self.out = open(filename, 'w')
self.writeHeader(attrs['title'])
self.passthrough = True
def endPage(self):
self.passthrough = False
self.writeFooter()
self.out.close()
The first line of startPage
may look a little intimidating, but it is more or less the same as the first line of ensureDirectory
, except that you add the file name (and give it an .html
suffix).
The full source code of the program is shown in Listing 22-3.
from xml.sax.handler import ContentHandler
from xml.sax import parse
import os
class Dispatcher:
def dispatch(self, prefix, name, attrs=None):
mname = prefix + name.capitalize()
dname = 'default' + prefix.capitalize()
method = getattr(self, mname, None)
if callable(method): args = ()
else:
method = getattr(self, dname, None)
args = name,
if prefix == 'start': args += attrs,
if callable(method): method(*args)
def startElement(self, name, attrs):
self.dispatch('start', name, attrs)
def endElement(self, name):
self.dispatch('end', name)
class WebsiteConstructor(Dispatcher, ContentHandler):
passthrough = False
def __init__(self, directory):
self.directory = [directory]
self.ensureDirectory()
def ensureDirectory(self):
path = os.path.join(*self.directory)
if not os.path.isdir(path): os.makedirs(path)
def characters(self, chars):
if self.passthrough: self.out.write(chars)
def defaultStart(self, name, attrs):
if self.passthrough:
self.out.write('<' + name)
for key, val in attrs.items():
self.out.write(' %s="%s"' % (key, val))
self.out.write('>')
def defaultEnd(self, name):
if self.passthrough:
self.out.write('</%s>' % name)
def startDirectory(self, attrs):
self.directory.append(attrs['name'])
self.ensureDirectory()
def endDirectory(self):
self.directory.pop()
def startPage(self, attrs):
filename = os.path.join(*self.directory+[attrs['name']+'.html'])
self.out = open(filename, 'w')
self.writeHeader(attrs['title'])
self.passthrough = True
def endPage(self):
self.passthrough = False
self.writeFooter()
self.out.close()
def writeHeader(self, title):
self.out.write('<html>
<head>
<title>')
self.out.write(title)
self.out.write('</title>
</head>
<body>
')
def writeFooter(self):
self.out.write('
</body>
</html>
')
parse('website.xml', WebsiteConstructor('public_html'))
Listing 22-3 generates the following files and directories:
public_html/
public_html/index.html
public_html/interests
public_html/interests/shouting.html
public_html/interests/sleeping.html
public_html/interests/eating.html
ENCODING BLUES
Now you have the basic program. What can you do with it? Here are some suggestions:
ContentHandler
for generating a table of contents or a menu (with links) for the web site.WebsiteConstructor
that overrides writeHeader
and writeFooter
to provide customized design.ContentHandler
that constructs a single web page from the XML file.ContentHandler
that summarizes your web site somehow, for example in RSS.http://www.reportlab.org
).After this foray into the world of XML parsing, let’s do some more network programming. In the next chapter, you create a program that can gather news items from various network sources (such as web pages and Usenet groups) and generate custom news reports for you.