Chapter 9. Processing XML

Python includes several modules that provide most of the tools necessary to parse and process XML documents. Parsing XML files has become increasingly important as applications adopt XML as a standard way to transfer data between applications and systems.

Because no single chapter can cover the full range of XML processing options that Python provides, I’ve chosen to present phrases that demonstrate some common tasks. To provide as broad a coverage as possible, these phrases use the xml.dom, xml.sax, and xml.parsers.expat modules.

The phrases in this chapter cover concepts of basic XML processing such as loading, navigating, and checking for well-formed documents. They also cover more advanced XML processing such as searches, tag processing, and extracting text.

Note

Many XML processing tasks could be accomplished differently by using different modules. Don’t get locked into a specific module for processing the XML data; another module may perform the same task better.

Note

Most of the phrases in this chapter process the same XML file, emails.xml. The contents of that file are listed in the output section of the “Loading an XML Document” phrase.

Loading an XML Document

Example . 

from xml.dom import minidom
DOMTree = minidom.parse('emails.xml')
print DOMTree.toxml()

The easiest way to quickly load an XML document is to use the minidom module from the xml.dom package. The minidom module provides a simple parse function that quickly creates a DOM tree from the XML file.

The sample phrase calls the parse(file [,parser]) function of the minidom module to parse the XML file designated by file into a DOM tree object. The optional parser argument allows you to specify a custom parser object to use when parsing the XML file.

Note

The DOM tree object can be converted back into XML by calling the toxml() function of the object, which returns a string containing the full contents of the XML file.

from xml.dom import minidom

#Open XML document using minidom parser
DOMTree = minidom.parse('emails.xml')

#Print XML contents
print DOMTree.toxml()

xml_open.py

<?xml version="1.0" ?><!DOCTYPE emails [
        <!ELEMENT email (to, from, subject, date, body)>
        <!ELEMENT to (addr+)>
        <!ELEMENT from (addr)>
        <!ELEMENT subject (#PCDATA)>
        <!ELEMENT date (#PCDATA)>
        <!ELEMENT body (#PCDATA)>
        <!ELEMENT addr (#PCDATA)>
        <!ATTLIST addr type (FROM | TO | CC | BC) "none">
    ]><emails>
    <email>
        <to>
            <addr type="TO">[email protected]</addr>
            <addr type="CC">[email protected]</addr>
        </to>
        <from>
            <addr type="FROM">[email protected]</addr>
        </from>
        <subject>
        Update List
        </subject>
        <body>
        Please add me to the list.
        </body>
    </email>
    <email>
        <to>
            <addr type="TO">[email protected]</addr>
            <addr type="BC">[email protected]</addr>
        </to>
        <from>
            <addr type="FROM">[email protected]</addr>
        </from>
        <subject>
        More Updated List
        </subject>
        <body>
        Please add me to the list also.
        </body>
    </email>
</emails>

Output from xml_open.py code.
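
The same load-and-serialize round trip can be tried without a file on disk. The following is a minimal sketch, not part of the original listings: it uses Python 3 syntax (print as a function), and the inline SAMPLE string is a made-up stand-in for emails.xml. It relies on minidom.parseString(), which builds a DOM tree from a string the way parse() does from a file.

```python
from xml.dom import minidom

# Hypothetical inline sample; the chapter's emails.xml is not reproduced here.
SAMPLE = "<emails><email><subject>Update List</subject></email></emails>"

# parseString() builds the same kind of DOM tree that parse() builds from a file.
dom_tree = minidom.parseString(SAMPLE)

# documentElement is the root element; toxml() serializes a node back to text.
root_xml = dom_tree.documentElement.toxml()
print(root_xml)
```

Serializing only the root element leaves out the XML declaration that toxml() adds when it is called on the whole document.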

Checking for Well-Formed XML Documents

Example . 

from xml.sax.handler import ContentHandler
import xml.sax
xmlparser = xml.sax.make_parser()
xmlparser.setContentHandler(ContentHandler())
xmlparser.parse(fName)

One of the most common tasks when processing XML documents is checking to see whether a document is well formed. The best way to determine whether a document is well formed is to use the xml.sax module to parse inside a try statement that will handle an exception if the document is not well formed.

First, create an xml.sax parser object using the make_parser() function. The make_parser function will return a parser object that can be used to parse the XML file.

After you have created the parser object, add a content handler to the object using its setContentHandler(handler) function. In this phrase, a generic content handler is passed to the object by creating an instance of the xml.sax.handler.ContentHandler class.

Once the content handler has been added to the parser object, the XML files can be parsed inside a try block. If the parser encounters an error in the XML document, an exception will be thrown; otherwise, the document is well formed.

import sys
from xml.sax.handler import ContentHandler
import xml.sax

fileList = ["emails.xml", "bad.xml"]

#Create a parser object
xmlparser = xml.sax.make_parser()

#Attach a generic content handler to parser
xmlparser.setContentHandler(ContentHandler())

#Parse the files and handle exceptions
#on malformed XML files
for fName in fileList:
    try:
        xmlparser.parse(fName)
        print "%s is a well-formed file." % fName
    except Exception, err:
        print "ERROR %s:\n\t %s is not a well-formed file." \
            % (err, fName)

xml_wellformed.py

emails.xml is a well-formed file.
ERROR bad.xml:5:12: not well-formed (invalid token):
        bad.xml is not a well-formed file.

Output from xml_wellformed.py code.
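
The try-and-parse check described above can be wrapped in a small helper. This is a sketch under stated assumptions rather than the phrase's own code: it uses Python 3 syntax, the helper name is_well_formed is hypothetical, and it parses in-memory bytes with xml.sax.parseString() instead of reading files.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses without a SAX parse error."""
    try:
        # A generic ContentHandler ignores every event; only errors matter here.
        xml.sax.parseString(xml_bytes, ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


print(is_well_formed(b"<emails><email/></emails>"))   # balanced tags
print(is_well_formed(b"<emails><email></emails>"))    # <email> is never closed
```

Catching SAXParseException rather than the broader Exception keeps unrelated problems, such as a missing file, from being reported as malformed XML.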

Accessing Child Nodes

Example . 

from xml.dom import minidom
xmldoc = minidom.parse('emails.xml')
cNodes = xmldoc.childNodes
#Direct Node Access
print cNodes[0].toxml()
#Find node by name
nList = cNodes[1].getElementsByTagName("to")
#Walk node tree
for node in nList:
    eList = node.getElementsByTagName("addr")
. . .
def printNodes (nList, level):
    for node in nList:
        print ("  ")*level, node.nodeName, \
              node.nodeValue
        printNodes(node.childNodes, level+1)

printNodes(xmldoc.childNodes, 0)

Accessing child nodes in a parsed DOM tree can be managed in several different ways. This phrase discusses how to access them using a direct reference, by looking up nodes by tag name, and by walking the DOM tree.

The first step is to parse the XML document using the minidom.parse(file) function to create a DOM tree object. The child nodes of the DOM tree can be accessed directly using the childNodes attribute, which is a list of the child nodes at the root of the tree.

Because the childNodes attribute is a list, nodes can be accessed directly using the following syntax: childNodes[index].

Note

Because emails.xml begins with a DOCTYPE declaration, the first node in the childNodes list of the DOM tree object will be the DTD node. The root element follows it at index 1.

To search for nodes by their tag name, use the getElementsByTagName(tag) method of the node object. The getElementsByTagName method accepts the tag name as a string and returns a list of all elements below that node, at any depth, that have that tag.

You can also walk the DOM tree by defining a recursive function that accepts a node list and calls itself with the childNodes attribute of each node in that list, as shown in the sample phrase. Start the walk by passing the childNodes attribute of the DOM tree object to the function.

from xml.dom import minidom

#Parse XML file to DOM tree
xmldoc = minidom.parse('emails.xml')

#Get nodes at root of tree
cNodes = xmldoc.childNodes

#Direct Node Access
print "DTD Node\n================="
print cNodes[0].toxml()

#Find node by name
print "\nTo Addresses\n==================="
nList = cNodes[1].getElementsByTagName("to")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        print e.toxml()

print "\nFrom Addresses\n==================="
nList = cNodes[1].getElementsByTagName("from")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        print e.toxml()

#Walk node tree
def printNodes (nList, level):
    for node in nList:
        print ("  ")*level, node.nodeName, \
              node.nodeValue
        printNodes(node.childNodes, level+1)

print "\nNodes\n==================="
printNodes(xmldoc.childNodes, 0)

xml_child.py

DTD Node
=================
<!DOCTYPE emails [
        <!ELEMENT email (to, from, subject, date, body)>
        <!ELEMENT to (addr+)>
        <!ELEMENT from (addr)>
        <!ELEMENT subject (#PCDATA)>
        <!ELEMENT date (#PCDATA)>
        <!ELEMENT body (#PCDATA)>
        <!ELEMENT addr (#PCDATA)>
        <!ATTLIST addr type (FROM | TO | CC | BC) "none">
    ]>

To Addresses
===================
<addr type="TO">[email protected]</addr>
<addr type="CC">[email protected]</addr>
<addr type="TO">[email protected]</addr>
<addr type="BC">[email protected]</addr>

From Addresses
===================
<addr type="FROM">[email protected]</addr>
<addr type="FROM">[email protected]</addr>

Nodes
===================
 emails None
 emails None
   #text
   email None
     #text
     to None
       #text
       addr None
         #text [email protected]
       #text
       addr None
         #text [email protected]
       #text
     #text
     from None
       #text
       addr None
         #text [email protected]
       #text
     #text
     subject None
       #text
        Update List
     #text
     body None
       #text
        Please add me to the list.
     #text
   #text
. . .

Output from xml_child.py code.
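
The tag-name lookup and the recursive walk can also be tried on a small in-memory document. This is a minimal sketch in Python 3 syntax, not the book's listing: SAMPLE is a made-up fragment without a DOCTYPE, and walk() is a hypothetical variant of printNodes that returns its results instead of printing them.

```python
from xml.dom import minidom

# Hypothetical sample without whitespace, so no empty #text nodes appear.
SAMPLE = "<emails><email><to><addr>a</addr><addr>b</addr></to></email></emails>"
doc = minidom.parseString(SAMPLE)

# Look up elements by tag name anywhere below the root element.
addr_nodes = doc.documentElement.getElementsByTagName("addr")
addr_count = len(addr_nodes)


def walk(nodes, level=0, out=None):
    """Collect (depth, nodeName, nodeValue) for every node, depth first."""
    if out is None:
        out = []
    for node in nodes:
        out.append((level, node.nodeName, node.nodeValue))
        walk(node.childNodes, level + 1, out)
    return out


for level, name, value in walk(doc.childNodes):
    print("  " * level + name, value)
```

Element nodes report None for nodeValue; the text itself lives in the #text child nodes, which is why the walk has to descend one level further to reach it.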

Accessing Element Attributes

Example . 

from xml.dom import minidom
xmldoc = minidom.parse('emails.xml')
cNodes = xmldoc.childNodes
print "\nTo Addresses\n==================="
nList = cNodes[1].getElementsByTagName("to")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        if e.hasAttribute("type"):
            if e.getAttribute("type") == "TO":
                print e.toxml()

The first step to accessing element attributes in an XML file is to parse the XML document using the minidom.parse(file) function to create a DOM tree object. The child nodes of the DOM tree can be accessed directly using the childNodes attribute, which is a list of the child nodes at the root of the tree.

Use the childNodes attribute to navigate the DOM tree, or search for the elements by their tag name, as described in the previous task, to find the nodes you are looking for.

Once you have found the node, determine whether it has the attribute by calling the hasAttribute(name) method of the node object, which returns True if the node contains the attribute specified by name. If the node has the attribute, use the getAttribute(name) method to retrieve the attribute value as a string.

from xml.dom import minidom

#Parse XML file to DOM tree
xmldoc = minidom.parse('emails.xml')

#Get nodes at root of tree
cNodes = xmldoc.childNodes

#Find attributes by name
print "\nTo Addresses\n==================="
nList = cNodes[1].getElementsByTagName("to")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        if e.hasAttribute("type"):
            if e.getAttribute("type") == "TO":
                print e.toxml()

print "\nCC Addresses\n==================="
nList = cNodes[1].getElementsByTagName("to")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        if e.hasAttribute("type"):
            if e.getAttribute("type") == "CC":
                print e.toxml()

print "\nBC Addresses\n==================="
nList = cNodes[1].getElementsByTagName("to")
for node in nList:
    eList = node.getElementsByTagName("addr")
    for e in eList:
        if e.hasAttribute("type"):
            if e.getAttribute("type") == "BC":
                print e.toxml()

xml_attribute.py

To Addresses
===================
<addr type="TO">[email protected]</addr>
<addr type="TO">[email protected]</addr>

CC Addresses
===================
<addr type="CC">[email protected]</addr>

BC Addresses
===================
<addr type="BC">[email protected]</addr>

Output from xml_attribute.py code.
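
The three nearly identical loops in xml_attribute.py suggest a reusable helper. The following is one possible sketch in Python 3 syntax: the function name addrs_of_type and the SAMPLE fragment are assumptions made for illustration, not part of the original phrase.

```python
from xml.dom import minidom

# Hypothetical fragment; the third <addr> deliberately has no type attribute.
SAMPLE = ('<to><addr type="TO">first</addr>'
          '<addr type="CC">second</addr><addr>third</addr></to>')


def addrs_of_type(doc, wanted):
    """Return the text of each <addr> whose type attribute equals wanted."""
    found = []
    for node in doc.getElementsByTagName("addr"):
        if node.hasAttribute("type") and node.getAttribute("type") == wanted:
            found.append(node.firstChild.data)
    return found


doc = minidom.parseString(SAMPLE)
print(addrs_of_type(doc, "TO"))
print(addrs_of_type(doc, "CC"))
```

In minidom, getAttribute() returns an empty string when the attribute is absent, so the hasAttribute() check is what distinguishes a missing attribute from one that is present but empty.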

Adding a Node to a DOM Tree

Example . 

from xml.dom import minidom
DOMimpl = minidom.getDOMImplementation()
xmldoc = DOMimpl.createDocument(None,
    "Workstations", None)
doc_root = xmldoc.documentElement
node = xmldoc.createElement("Computer")
doc_root.appendChild(node)

Adding child nodes to a DOM tree can be managed in several different ways. This phrase discusses using the xml.dom.minidom module provided with Python to create a DOM tree and add nodes to it.

The first step is to create a DOM implementation object by calling the minidom.getDOMImplementation() function, which returns a DOMImplementation object. Then call the createDocument(namespaceURI, qualifiedName, doctype) function of the DOMImplementation object to create the XML document. The createDocument function returns a Document object whose root element is named by qualifiedName.

Once you have created the Document object, create nodes using the createElement(tagName) function of the Document object. The createElement function of the Document object returns a node object.

After you have created child nodes, the DOM tree can be constructed using the appendChild(node) function to add node objects as child nodes of other node objects. Once a subtree has been constructed, attach its topmost node to the document by calling the appendChild(node) function of the document’s root element, which is available through the documentElement attribute of the Document object.

from xml.dom import minidom

Station1 = ['Pentium M', '512MB']
Station2 = ['Pentium Core 2', '1024MB']
Station3 = ['Pentium Core Duo', '1024MB']
StationList = [Station1, Station2, Station3]

#Create DOM object
DOMimpl = minidom.getDOMImplementation()

#Create Document
xmldoc = DOMimpl.createDocument(None,
    "Workstations", None)
doc_root = xmldoc.documentElement

#Add Nodes
for station in StationList:
    #Create Node
    node = xmldoc.createElement("Computer")

    element = xmldoc.createElement('Processor')
    element.appendChild(
        xmldoc.createTextNode(station[0]))
    node.appendChild(element)

    element = xmldoc.createElement('Memory')
    element.appendChild(
        xmldoc.createTextNode(station[1]))
    node.appendChild(element)

    #Add Node
    doc_root.appendChild(node)

print "\nNodes\n==================="
nodeList = doc_root.childNodes
for node in nodeList:
    print node.toprettyxml()

#Write the document and close the file
outFile = open("stations.xml", 'w')
outFile.write(xmldoc.toxml())
outFile.close()

xml_addnode.py

Nodes
===================
<Computer>
    <Processor>
        Pentium M
    </Processor>
    <Memory>
        512MB
    </Memory>
</Computer>

<Computer>
    <Processor>
        Pentium Core 2
    </Processor>
    <Memory>
        1024MB
    </Memory>
</Computer>

<Computer>
    <Processor>
        Pentium Core Duo
    </Processor>
    <Memory>
        1024MB
    </Memory>
</Computer>

Output from xml_addnode.py code.
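
The create-and-append steps can be gathered into one function. This is a minimal sketch in Python 3 syntax under stated assumptions: the function name build_stations and its list-of-pairs argument are hypothetical, and the document is serialized to a string rather than written to stations.xml.

```python
from xml.dom import minidom


def build_stations(stations):
    """Build a <Workstations> document from (processor, memory) pairs."""
    impl = minidom.getDOMImplementation()
    # Arguments: namespace URI, name of the root element, document type.
    doc = impl.createDocument(None, "Workstations", None)
    root = doc.documentElement
    for processor, memory in stations:
        computer = doc.createElement("Computer")
        for tag, text in (("Processor", processor), ("Memory", memory)):
            child = doc.createElement(tag)
            child.appendChild(doc.createTextNode(text))
            computer.appendChild(child)
        # Attach the finished subtree to the root element.
        root.appendChild(computer)
    return doc


doc = build_stations([("Pentium M", "512MB")])
print(doc.documentElement.toxml())
```

Building each Computer subtree completely before attaching it mirrors the order used in xml_addnode.py.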

Removing a Node from a DOM Tree

Example . 

from xml.dom import minidom
xmldoc = minidom.parse('stations.xml')
doc_root = xmldoc.documentElement

doc_root.removeChild(doc_root.childNodes[0])

The simplest way to remove a node from a DOM tree is to delete it using a direct reference. The first step is to parse the XML document using the minidom.parse(file) function to create a DOM tree document object.

After you have created the document object, retrieve the root element of the document by accessing the documentElement attribute of the document object. To remove a node from the root of the document, use the removeChild(node) method of the root element. The removeChild method removes the node and all of its child nodes from the document.

The child nodes can be referenced directly by using the childNodes attribute of the root or node object. The childNodes attribute is a list, so individual elements can be accessed by their index number as shown in xml_removenode.py.

from xml.dom import minidom

#Parse XML file to DOM tree
xmldoc = minidom.parse('stations.xml')
doc_root = xmldoc.documentElement

print "\nNodes\n==================="
nodeList = xmldoc.childNodes
for node in nodeList:
    print node.toprettyxml()

#Delete first node
doc_root.removeChild(doc_root.childNodes[0])

print "\nNodes\n==================="
nodeList = xmldoc.childNodes
for node in nodeList:
    print node.toprettyxml()

xml_removenode.py

Nodes
===================
<Workstations>
    <Computer>
        <Processor>
            Pentium M
        </Processor>
        <Memory>
            512MB
        </Memory>
    </Computer>
    <Computer>
        <Processor>
            Pentium Core 2
        </Processor>
        <Memory>
            1024MB
        </Memory>
    </Computer>
    <Computer>
        <Processor>
            Pentium Core Duo
        </Processor>
        <Memory>
            1024MB
        </Memory>
    </Computer>
</Workstations>
Nodes
===================
<Workstations>
    <Computer>
        <Processor>
            Pentium Core 2
        </Processor>
        <Memory>
            1024MB
        </Memory>
    </Computer>
    <Computer>
        <Processor>
            Pentium Core Duo
        </Processor>
        <Memory>
            1024MB
        </Memory>
    </Computer>
</Workstations>

Output from xml_removenode.py code.
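
Removing childNodes[0] works for stations.xml because that file was written without whitespace between elements. For an indented file, the first child of the root is usually a whitespace text node. The sketch below, in Python 3 syntax with a made-up SAMPLE string, shows one way to pick the first element node before removing it.

```python
from xml.dom import minidom

# Hypothetical indented sample; the newlines create #text nodes under the root.
SAMPLE = """<Workstations>
  <Computer><Processor>Pentium M</Processor></Computer>
  <Computer><Processor>Pentium Core 2</Processor></Computer>
</Workstations>"""

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# Keep only element nodes, skipping the whitespace text nodes.
computers = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]

# removeChild() detaches the node together with its subtree and returns it.
removed = root.removeChild(computers[0])

remaining = root.getElementsByTagName("Processor")
print([p.firstChild.data for p in remaining])
```

The removed node is only detached from the tree, so it can still be inspected or appended elsewhere afterward.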

Searching XML Documents

Example . 

from xml.parsers import expat
class xmlSearch(object):
    def __init__ (self, cStr, nodeName):
        self.nodeName = nodeName
        self.curNode = 0
        self.nodeActive = 0
        self.hits = []
        self.cStr = cStr
    #Handler bodies are shown in xml_search.py
    def StartElement(self, name, attributes):
        pass
    def EndElement(self, name):
        pass
    def CharacterData(self, data):
        pass
    def Parse(self, fName):
        xmlParser = expat.ParserCreate()
        xmlParser.StartElementHandler = \
            self.StartElement
        xmlParser.EndElementHandler = \
            self.EndElement
        xmlParser.CharacterDataHandler = \
            self.CharacterData
        xmlParser.Parse(open(fName).read(), 1)

search = xmlSearch(searchString, searchElement)
search.Parse(xmlFile)
print search.hits

Another extremely useful Python module for XML processing is the xml.parsers.expat module. The expat module provides an interface to the expat nonvalidating XML parser. The expat XML parser is a fast parser that quickly parses XML files and uses handlers to process character data and markup.

To use the expat parser to quickly search through an XML document and find specific data, define a search class that derives from the basic object class.

When the search class is defined, add StartElement, EndElement, and CharacterData methods that can be used to override the handlers in the expat parser later.

After you have defined the handler methods of the search object, define a parse routine that creates the expat parser by calling the ParserCreate() function of the expat module. The ParserCreate() function returns an expat parser object.

After the expat parser object is created in the search object’s parse routine, override the StartElementHandler, EndElementHandler, and CharacterDataHandler attributes of the parser object by assigning the corresponding methods of your search object to them.

After you have overridden the handler functions of the expat parser object, the parse routine will need to invoke the Parse(buffer [, isFinal]) function of the expat parser object. The Parse function accepts a string buffer and parses it using the overridden handler methods.

Note

The isFinal argument is set to 1 if this is the last data to be parsed or 0 if there is more data to be parsed.
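
The effect of the isFinal argument is easiest to see when the data arrives in pieces. The following is a small sketch in Python 3 syntax; the chunks list is a made-up stand-in for data read from a socket or a large file.

```python
from xml.parsers import expat

names = []
parser = expat.ParserCreate()
# Record each start tag as the parser reports it.
parser.StartElementHandler = lambda name, attrs: names.append(name)

# Hypothetical input split at arbitrary points, even in the middle of a tag.
chunks = [b"<emails><em", b"ail><subject>Hi</sub", b"ject></email></emails>"]
for chunk in chunks:
    parser.Parse(chunk, False)   # more data will follow
parser.Parse(b"", True)          # signal the end of the input

print(names)
```

The parser keeps its state between calls, so a tag that is split across two chunks is still reported once, after it is complete.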

After you have defined the search class, create an instance of the class and use the Parse method you defined to parse the XML file and search for data.

from xml.parsers import expat

searchStringList = ["[email protected]", "also"]
searchElement = "email"
xmlFile = "emails.xml"

#Define a search class that will handle
#elements and search character data
class xmlSearch(object):
    def __init__ (self, cStr, nodeName):
        self.nodeName = nodeName
        self.curNode = 0
        self.nodeActive = 0
        self.hits = []
        self.cStr = cStr
    def StartElement(self, name, attributes):
        if name == self.nodeName:
            self.nodeActive = 1
            self.curNode += 1
    def EndElement(self, name):
        if name == self.nodeName:
            self.nodeActive = 0
    def CharacterData(self, data):
        if data.strip():
            data = data.encode('ascii')
            if self.nodeActive:
                if data.find(self.cStr) != -1:
                    if self.curNode not in self.hits:
                        self.hits.append(self.curNode)
                        print "\tFound %s..." % self.cStr
    def Parse(self, fName):
        #Create the expat parser object
        xmlParser = expat.ParserCreate()
        #Override the handler methods
        xmlParser.StartElementHandler = \
            self.StartElement
        xmlParser.EndElementHandler = \
            self.EndElement
        xmlParser.CharacterDataHandler = \
            self.CharacterData
        #Parse the XML file
        xmlParser.Parse(open(fName).read(), 1)

for searchString in searchStringList:
    #Create search class
    search = xmlSearch(searchString, searchElement)

    #Invoke the search object's Parse method
    print "\nSearching <%s> nodes . . ." % \
          searchElement
    search.Parse(xmlFile)

    #Display parsed results
    print "Found '%s' in the following nodes:" % \
          searchString
    print search.hits

xml_search.py

Searching <email> nodes . . .
    Found [email protected]...
    Found [email protected]...
Found '[email protected]' in the following nodes:
[1, 2]

Searching <email> nodes . . .
    Found also...
Found 'also' in the following nodes:
[2]

Output from xml_search.py code.
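
The search technique can be sketched in a form that runs on in-memory data. This variant is an illustration in Python 3 syntax, not the listing above: the class name ElementSearch and the SAMPLE bytes are assumptions, and it collects the text of each element until the end tag because expat may deliver one run of text in several handler calls.

```python
from xml.parsers import expat


class ElementSearch(object):
    """Record which occurrences of an element contain a search string."""

    def __init__(self, text, node_name):
        self.text = text
        self.node_name = node_name
        self.index = 0        # running count of node_name elements seen
        self.buffer = None    # text pieces collected inside the current element
        self.hits = []

    def start(self, name, attrs):
        if name == self.node_name:
            self.index += 1
            self.buffer = []

    def chars(self, data):
        if self.buffer is not None:
            self.buffer.append(data)

    def end(self, name):
        if name == self.node_name:
            if self.text in "".join(self.buffer):
                self.hits.append(self.index)
            self.buffer = None

    def parse(self, xml_bytes):
        parser = expat.ParserCreate()
        parser.StartElementHandler = self.start
        parser.EndElementHandler = self.end
        parser.CharacterDataHandler = self.chars
        parser.Parse(xml_bytes, True)
        return self.hits


# Hypothetical sample with two <email> elements.
SAMPLE = (b"<emails><email><body>Please add me.</body></email>"
          b"<email><body>Please add me also.</body></email></emails>")
print(ElementSearch("also", "email").parse(SAMPLE))
print(ElementSearch("Please", "email").parse(SAMPLE))
```

Like the original, this sketch assumes the searched element is not nested inside another element of the same name.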

Extracting Text from XML Documents

Example . 

from xml.parsers import expat
#Define a class that will store the character data
class xmlText(object):
    def __init__ (self):
        self.textBuff = ""
    def CharacterData(self, data):
        data = data.strip()
        if data:
            data = data.encode('ascii')
            self.textBuff += data + "\n"
    def Parse(self, fName):
        xmlParser = expat.ParserCreate()
        xmlParser.CharacterDataHandler = \
            self.CharacterData
        xmlParser.Parse(open(fName).read(), 1)

xText = xmlText()
xText.Parse(xmlFile)
print xText.textBuff

A common task when parsing XML documents is to quickly retrieve the text from them without the markup tags and attribute data. The expat parser provided with Python offers a simple interface for doing just that. To use the expat parser to quickly parse an XML document and store only the text, define a simple text parser class that derives from the basic object class.

When the text parser class is defined, add a CharacterData() method that can be assigned to the CharacterDataHandler attribute of the expat parser. This method will store the text data passed to the handler when the document is parsed.

After you have defined the handler method of the text parser object, define a parse routine that creates the expat parser by calling the ParserCreate() function of the expat module. The ParserCreate() function returns an expat parser object.

After the expat parser object is created in the text parser object’s parse routine, override the CharacterDataHandler attribute of the parser object by assigning the CharacterData() method of your text parser object to it.

After you have overridden the handler function of the expat parser object, the parse routine will need to invoke the Parse(buffer [, isFinal]) function of the expat parser object. The Parse function accepts a string buffer and parses it using the overridden handler methods.

After you have defined the text parser class, create an instance of the class and use the Parse(fName) method you defined to parse the XML file and retrieve the text.

from xml.parsers import expat

xmlFile = "emails.xml"

#Define a class that will store the character data
class xmlText(object):
    def __init__ (self):
        self.textBuff = ""
    def CharacterData(self, data):
        data = data.strip()
        if data:
            data = data.encode('ascii')
            self.textBuff += data + "\n"

    def Parse(self, fName):
        #Create the expat parser object
        xmlParser = expat.ParserCreate()
        #Override the handler methods
        xmlParser.CharacterDataHandler = \
            self.CharacterData
        #Parse the XML file
        xmlParser.Parse(open(fName).read(), 1)

#Create the text parser object
xText = xmlText()

#Invoke the text parser object's Parse method
xText.Parse(xmlFile)

#Display parsed results
print "Text from %s\n====================" % xmlFile
print xText.textBuff

xml_text.py

Text from emails.xml
==========================
[email protected]
[email protected]
[email protected]
Update List
Please add me to the list.
[email protected]
[email protected]
[email protected]
More Updated List
Please add me to the list also.

Output from xml_text.py code.

Parsing XML Tags

Example . 

import xml.sax
class tagHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.tags = {}
    def startElement(self,name, attr):
        name = name.encode('ascii')
        self.tags[name] = self.tags.get(name, 0) + 1
        print "Tag %s = %d" % \
              (name, self.tags.get(name))

xmlparser = xml.sax.make_parser()
tHandler = tagHandler()
xmlparser.setContentHandler(tHandler)
xmlparser.parse(xmlFile)

Another fairly common task when processing XML files is to process the XML tags themselves. The xml.sax module provides a quick, clean interface to the XML tags by defining a custom content handler to deal with the tags.

This phrase demonstrates how to override the content handler of a SAX XML parser to determine how many instances of a specific tag there are in the XML document.

First, define a tag handler class that inherits from the xml.sax.handler.ContentHandler class. Then override the startElement() method of the class to keep track of each encounter with specific tags.

After you have defined the tag handler class, create an xml.sax parser object using the make_parser() function. The make_parser() function will return a parser object that can be used to parse the XML file. Next, create an instance of the tag handler object.

After you have created the parser and tag handler objects, add the custom tag handler object to the parser object using the setContentHandler(handler) function.

After the content handler has been added to the parser object, parse the XML file using the parse(file) method of the parser object.

import xml.sax

xmlFile = "emails.xml"
xmlTag = "email"

#Define handler to scan XML file and parse tags
class tagHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.tags = {}
    def startElement(self,name, attr):
        name = name.encode('ascii')
        self.tags[name] = self.tags.get(name, 0) + 1
        print "Tag %s = %d" % \
              (name, self.tags.get(name))

#Create a parser object
xmlparser = xml.sax.make_parser()

#Create a content handler object
tHandler = tagHandler()

#Attach the content handler to the parser
xmlparser.setContentHandler(tHandler)

#Parse the XML file
xmlparser.parse(xmlFile)
tags = tHandler.tags
if xmlTag in tags:
    print "%s has %d <%s> nodes." % \
          (xmlFile, tags[xmlTag], xmlTag)

xml_tags.py

Tag emails = 1
Tag email = 1
Tag to = 1
Tag addr = 1
Tag addr = 2
Tag from = 1
Tag addr = 3
Tag subject = 1
Tag body = 1
Tag email = 2
Tag to = 2
Tag addr = 4
Tag addr = 5
Tag from = 2
Tag addr = 6
Tag subject = 2
Tag body = 2
emails.xml has 2 <email> nodes.

Output from xml_tags.py code
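
The tag-counting handler can be written more compactly with collections.Counter. The following is a sketch in Python 3 syntax rather than the phrase's own code: the class name TagCounter and the SAMPLE bytes are assumptions, and it parses in-memory data with xml.sax.parseString().

```python
import xml.sax
from collections import Counter
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Count how many times each start tag appears."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = Counter()

    def startElement(self, name, attrs):
        self.tags[name] += 1


# Hypothetical sample with two <email> elements.
SAMPLE = b"<emails><email><to/></email><email><to/></email></emails>"
handler = TagCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.tags["email"], handler.tags["emails"])
```

A Counter returns 0 for tags that never appeared, so no membership test is needed before reading a count.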
