Python includes several modules that provide most of the tools necessary to parse and process XML documents. Parsing XML files is becoming much more critical as applications adopt the XML standard as the best way to transfer data between applications and systems.
Because there is no way to cover the extent of options Python provides in XML processing, I’ve chosen to present phrases that demonstrate some common tasks. To provide as broad of coverage as possible, these phrases will use the xml.dom, xml.sax, and xml.parsers.expat modules.
The phrases in this chapter cover concepts of basic XML processing such as loading, navigating, and checking for well-formed documents. They also cover more advanced XML processing such as searches, tag processing, and extracting text.
The easiest way to quickly load an XML document is to create a minidom object using the xml.dom module. The minidom object provides a simple parser method that will quickly create a DOM tree from the XML file.
The sample phrase calls the parse(
file [,parser]
) function of the minidom object to parse the XML file designated by file
into a DOM tree object. The optional parser
argument allows you to specify a custom parser object to use when parsing the XML file.
The DOM tree object can be converted back into XML by calling the toxml()
function of the object, which returns a string containing the full contents of the XML file.
from xml.dom import minidom #Open XML document using minidom parser DOMTree = minidom.parse('emails.xml') #Print XML contents print DOMTree.toxml()
xml_open.py
<?xml version="1.0" ?><!DOCTYPE emails [ <!ELEMENT email (to, from, subject, date, body)> <!ELEMENT to (addr+)> <!ELEMENT from (addr)> <!ELEMENT subject (#PCDATA)> <!ELEMENT date (#PCDATA)> <!ELEMENT body (#PCDATA)> <!ELEMENT addr (#PCDATA)> <!ATTLIST addr type (FROM | TO | CC | BC) "none"> ]><emails> <email> <to> <addr type="TO">[email protected]</addr> <addr type="CC">[email protected]</addr> </to> <from> <addr type="FROM">[email protected]</addr> </from> <subject> Update List </subject> <body> Please add me to the list. </body> </email> <email> <to> <addr type="TO">[email protected]</addr> <addr type="BC">[email protected]</addr> </to> <from> <addr type="FROM">[email protected]</addr> </from> <subject> More Updated List </subject> <body> Please add me to the list also. </body> </email> </emails>
Example .
from xml.sax.handler import ContentHandler import xml.sax xmlparser = xml.sax.make_parser() xmlparser.setContentHandler(ContentHandler()) xmlparser.parse(fName)
One of the most common tasks when processing XML documents is checking to see whether a document is well formed. The best way to determine whether a document is well formed is to use the xml.sax module to parse inside a try statement that will handle an exception if the document is not well formed.
First, create an xml.sax
parser object using the make_parser()
function. The make_parser
function will return a parser object that can be used to parse the XML file.
After you have created the parser object, add a content handler to the object using its setContentHandler
(handler)
function. In this phrase, a generic content handler is passed to the object by calling the xml.sax.handler.ContentHandler()
function.
Once the content handler has been added to the parser object, the XML files can be parsed inside a try block. If the parser encounters an error in the XML document, an exception will be thrown; otherwise, the document is well formed.
import sys from xml.sax.handler import ContentHandler import xml.sax fileList = ["emails.xml", "bad.xml"] #Create a parser object xmlparser = xml.sax.make_parser() #Attach a generic content handler to parser xmlparser.setContentHandler(ContentHandler()) #Parse the files and handle exceptions #on bad-formed XML files for fName in fileList: try: xmlparser.parse(fName) print "%s is a well-formed file." % fName except Exception, err: print "ERROR %s: %s is not a well-formed file." % (err, fName)
xml_wellformed.py
emails.xml is a well-formed file. ERROR bad.xml:5:12: not well-formed (invalid token): bad.xml is not a well-formed file.
Output from xml_wellformed.py code.
Example .
from xml.dom import minidom xmldoc = minidom.parse('emails.xml') cNodes = xmldoc.childNodes #Direct Node Access print cNodes[0].toxml() #Find node by name nList = cNodes[1].getElementsByTagName("to") #Walk node tree for node in nList: eList = node.getElementsByTagName("addr") . . . def printNodes (nList, level): for node in nList: print (" ")*level, node.nodeName, node.nodeValue printNodes(node.childNodes, level+1) printNodes(xmldoc.childNodes, 0)
Accessing child nodes in a parsed DOM tree can be managed in several different ways. This phrase discusses how to access them using a direct reference, looking up the object by tag name and simply walking the DOM tree.
The first step is to parse the XML document using the minidom.parse
(file)
function to create a DOM tree object. The child nodes of the DOM tree can be accessed directly using the childNodes
attribute, which is a list of the child nodes at the root of the tree.
Because the childNodes
attribute is a list, nodes can be accessed directly using the following syntax: childNodes
[index]
.
To search for nodes by their tag name, use the getElementsByTagName(tag)
of the node object. The getElementsByTagName
function accepts a string representation of the tag name for child nodes and returns a list of all child nodes with that tag.
You can also walk the DOM tree recursively by defining a recursive function that will accept a node list; then, call that function and pass the childNodes
attribute of the DOM tree object. Finally, recursively call the function again with the childNodes
attribute of each child node in the node list, as shown in the sample phrase.
from xml.dom import minidom #Parse XML file to DOM tree xmldoc = minidom.parse('emails.xml') #Get nodes at root of tree cNodes = xmldoc.childNodes #Direct Node Access print "DTD Node =================" print cNodes[0].toxml() #Find node by name print " To Addresses ===================" nList = cNodes[1].getElementsByTagName("to") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: print e.toxml() print " From Addresses ===================" nList = cNodes[1].getElementsByTagName("from") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: print e.toxml() #Walk node tree def printNodes (nList, level): for node in nList: print (" ")*level, node.nodeName, node.nodeValue printNodes(node.childNodes, level+1) print " Nodes ===================" printNodes(xmldoc.childNodes, 0)
xml_child.py
DTD Node ================= <!DOCTYPE emails [ <!ELEMENT email (to, from, subject, date, body)> <!ELEMENT to (addr+)> <!ELEMENT from (addr)> <!ELEMENT subject (#PCDATA)> <!ELEMENT date (#PCDATA)> <!ELEMENT body (#PCDATA)> <!ELEMENT addr (#PCDATA)> <!ATTLIST addr type (FROM | TO | CC | BC) "none"> ]> To Addresses =================== <addr type="TO">[email protected]</addr> <addr type="CC">[email protected]</addr> <addr type="TO">[email protected]</addr> <addr type="BC">[email protected]</addr> From Addresses =================== <addr type="FROM">[email protected]</addr> <addr type="FROM">[email protected]</addr> Nodes =================== emails None emails None #text email None #text to None #text addr None #text [email protected] #text addr None #text [email protected] #text #text from None #text addr None #text [email protected] #text #text subject None #text Update List #text body None #text Please add me to the list. #text #text . . .
Output from xml_child.py code.
Example .
from xml.dom import minidom xmldoc = minidom.parse('emails.xml') cNodes = xmldoc.childNodes print " To Addresses ===================" nList = cNodes[1].getElementsByTagName("to") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: if e.hasAttribute("type"): if e.getAttribute("type") == "TO": print e.toxml()
The first step to accessing element attributes in a XML file is to parse the XML document using the minidom.parse
(file)
function to create a DOM tree object. The child nodes of the DOM tree can be accessed directly using the childNodes
attribute, which is a list of the child nodes at the root of the tree.
Use the childNodes
attribute to navigate the DOM tree, or search for the elements by their tag name, as described in the previous task, to find the nodes you are looking for.
Once you have found the node, determine whether the node does have the attribute by calling the hasAttribute
(name)
function of the node object, which returns true if the node does contain the attribute specified by name. If the node does have the attribute, then you can use the getAttribute
(name)
function to retrieve a string representation of the attribute value.
from xml.dom import minidom #Parse XML file to DOM tree xmldoc = minidom.parse('emails.xml') #Get nodes at root of tree cNodes = xmldoc.childNodes #Find attributes by name print " To Addresses ===================" nList = cNodes[1].getElementsByTagName("to") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: if e.hasAttribute("type"): if e.getAttribute("type") == "TO": print e.toxml() print " CC Addresses ===================" nList = cNodes[1].getElementsByTagName("to") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: if e.hasAttribute("type"): if e.getAttribute("type") == "CC": print e.toxml() print " BC Addresses ===================" nList = cNodes[1].getElementsByTagName("to") for node in nList: eList = node.getElementsByTagName("addr") for e in eList: if e.hasAttribute("type"): if e.getAttribute("type") == "BC": print e.toxml()
xml_attribute.py
To Addresses =================== <addr type="TO">[email protected]</addr> <addr type="TO">[email protected]</addr> CC Addresses =================== <addr type="CC">[email protected]</addr> BC Addresses =================== <addr type="BC">[email protected]</addr>
Example .
from xml.dom import minidom DOMimpl = minidom.getDOMImplementation() xmldoc = DOMimpl.createDocument(None, "Workstations", None) doc_root = xmldoc.documentElement node = xmldoc.createElement("Computer") doc_root.appendChild(node)
Adding child nodes to a DOM tree can be managed in several different ways. This phrase discusses using the xml.dom.minidom module provided with Python to create a DOM tree and add nodes to it.
The first step is to create a DOM object by calling the minidom.getDOMImplementation()
function, which returns a DOMImplementation object. Then call the createDocument(qualifiedName, publicId, systemId)
function of the DOMImplementation object to create the XML document. The createDocument
function returns a Document object.
Once you have created the Document object, create nodes using the createElement(tagName)
function of the Document object. The createElement
function of the Docmuent object returns a node object.
After you have created child nodes, the DOM tree can be constructed using the appendChild
(node)
function to add node objects as child nodes of other node objects. Once the tree has been constructed, add the tree to the Document object using the appendChild(node)
function of the Document object to attach the topmost level of the tree.
from xml.dom import minidom Station1 = ['Pentium M', '512MB'] Station2 = ['Pentium Core 2', '1024MB'] Station3 = ['Pentium Core Duo', '1024MB'] StationList = [Station1, Station2, Station3] #Create DOM object DOMimpl = minidom.getDOMImplementation() #Create Document xmldoc = DOMimpl.createDocument(None, "Workstations", None) doc_root = xmldoc.documentElement #Add Nodes for station in StationList: #Create Node node = xmldoc.createElement("Computer") element = xmldoc.createElement('Processor') element.appendChild(xmldoc.createTextNode (station[0])) node.appendChild(element) element = xmldoc.createElement('Memory') element.appendChild(xmldoc.createTextNode (station[1])) node.appendChild(element) #Add Node doc_root.appendChild(node) print " Nodes ===================" nodeList = doc_root.childNodes for node in nodeList: print node.toprettyxml() #Write the document file = open("stations.xml", 'w') file.write(xmldoc.toxml())
xml_addnode.py
Nodes =================== <Computer> <Processor> Pentium M </Processor> <Memory> 512MB </Memory> </Computer> <Computer> <Processor> Pentium Core 2 </Processor> <Memory> 1024MB </Memory> </Computer> <Computer> <Processor> Pentium Core Duo </Processor> <Memory> 1024MB </Memory> </Computer>
Example .
from xml.dom import minidom xmldoc = minidom.parse('stations.xml') doc_root = xmldoc.documentElement doc_root.removeChild(doc_root.childNodes[0])
The simplest way to remove a node from a DOM tree is to delete it using a direct reference. The first step is to parse the XML document using the minidom.parse
(file)
function to create a DOM tree document object.
After you have created the document objects, you retrieve the root of the document elements by accessing the documentElement
attribute of the document object. To remove an object from the root of the document, use the removeChild
(node)
. The removeChild
function removes the nodes and any child nodes from the document.
The child nodes can be referenced directly by using the childNodes
attribute of the root or node object. The childNodes
attribute is a list, so individual elements can be accessed by their index number as shown in xml_removenode.py.
from xml.dom import minidom #Parse XML file to DOM tree xmldoc = minidom.parse('stations.xml') doc_root = xmldoc.documentElement print " Nodes ===================" nodeList = xmldoc.childNodes for node in nodeList: print node.toprettyxml() #Delete first node doc_root.removeChild(doc_root.childNodes[0]) print " Nodes ===================" nodeList = xmldoc.childNodes for node in nodeList: print node.toprettyxml()
Nodes =================== <Workstations> <Computer> <Processor> Pentium M </Processor> <Memory> 512MB </Memory> </Computer> <Computer> <Processor> Pentium Core 2 </Processor> <Memory> 1024MB </Memory> </Computer> <Computer> <Processor> Pentium Core Duo </Processor> <Memory> 1024MB </Memory> </Computer> </Workstations> Nodes =================== <Workstations> <Computer> <Processor> Pentium Core 2 </Processor> <Memory> 1024MB </Memory> </Computer> <Computer> <Processor> Pentium Core Duo </Processor> <Memory> 1024MB </Memory> </Computer> </Workstations>
Output from xml_removenode.py code.
Example .
from xml.parsers import expat class xmlSearch(object): def __init__ (self, cStr, nodeName): self.nodeName = nodeName self.curNode = 0 self.nodeActive = 0 self.hits = [] self.cStr = cStr def StartElement(self, name, attributes): def EndElement(self, name): def CharacterData(self, data): def Parse(self, fName): xmlParser = expat.ParserCreate() xmlParser.StartElementHandler = self.StartElement xmlParser.EndElementHandler = self.EndElement xmlParser.CharacterDataHandler = self.CharacterData xmlParser.Parse(open(fName).read(), 1) search = xmlSearch(searchString, searchElement) search.Parse(xmlFile) print search.hits
Another extremely useful Python module for XML processing is the xml.parsers.expat module. The expat module provides an interface to the expat
nonvalidating XML parser. The expat
XML parser is a fast parser that quickly parses XML files and uses handlers to process character data and markup.
To use the expat
parser to quickly search through an XML document and find specific data, define a search class that derived from the basic object class.
When the search class is defined, add a startElement
, endElement
, and CharacterData
method that can be used to override the handlers in the expat
parser later.
After you have defined the handler methods of the search object, define a parse routine that creates the expat parser by calling the ParserCreate()
function of the expat module. The ParserCreate()
function returns an expat parser object.
After the expat parser object is created in the search object’s parse routine, override the StartElementHandler
, EndElementHandler
, and CharacterDataHandler
attributes of the parser object by assigning them to the corresponding methods in your search object.
After you have overridden the handler functions of the expat parser object, the parse routine will need to invoke the Parse(buffer
[, isFinal]
)
function of the expat parser object. The Parse
function accepts a string buffer
and parses it using the overridden handler methods.
The isFinal
argument is set to 1 if this is the last data to be parsed or 0 if there is more data to be parsed.
After you have defined the search class, create an instance of the class and use the parse
function you defined to parse the XML file and search for data.
from xml.parsers import expat searchStringList = ["[email protected]", "also"] searchElement = "email" xmlFile = "emails.xml" #Define a search class that will handle #elements and search character data class xmlSearch(object): def __init__ (self, cStr, nodeName): self.nodeName = nodeName self.curNode = 0 self.nodeActive = 0 self.hits = [] self.cStr = cStr def StartElement(self, name, attributes): if name == self.nodeName: self.nodeActive = 1 self.curNode += 1 def EndElement(self, name): if name == self.nodeName: self.nodeActive = 0 def CharacterData(self, data): if data.strip(): data = data.encode('ascii') if self.nodeActive: if data.find(self.cStr) != -1: if not self.hits.count(self.curNode): self.hits.append(self.curNode) print " Found %s..." % self.cStr def Parse(self, fName): #Create the expat parser object xmlParser = expat.ParserCreate() #Override the handler methods xmlParser.StartElementHandler = self.StartElement xmlParser.EndElementHandler = self.EndElement xmlParser.CharacterDataHandler = self.CharacterData #Parse the XML file xmlParser.Parse(open(fName).read(), 1) for searchString in searchStringList: #Create search class search = xmlSearch(searchString, searchElement) #Invoke the search objects Parse method print " Searching <%s> nodes . . ." % searchElement search.Parse(xmlFile) #Display parsed results print "Found '%s' in the following nodes:" % searchString print search.hits
xml_search.py
Searching <email> nodes . . . Found [email protected]... Found [email protected]... Found '[email protected]' in the following nodes: [1, 2] Searching <email> nodes . . . Found also... Found 'also' in the following nodes: [2]
Output from xml_search.py code.
Example .
from xml.parsers import expat #Define a class that will store the character data class xmlText(object): def __init__ (self): self.textBuff = "" def CharacterData(self, data): data = data.strip() if data: data = data.encode('ascii') self.textBuff += data + " " def Parse(self, fName): xmlParser = expat.ParserCreate() xmlParser.CharacterDataHandler = self.CharacterData xmlParser.Parse(open(fName).read(), 1) xText = xmlText() xText.Parse(xmlFile) print xText.textBuff
A common task when parsing XML documents is to quickly retrieve the text from them without the markup tags and attribute data. The expat parser provided with Python provides a simple interface to manage just that. To use the expat parser to quickly parse through an XML document and store only the text, define a simple text parser class that derived from the basic object class.
When the text parser class is defined, add a CharacterData()
method that can be used to override the CharacterDataHandlers()
method of the expat parser. This method will store the text data passed to the handler when the document is parsed.
After you have defined the handler method of the text parser object, define a parse routine that creates the expat parser by calling the ParserCreate()
function of the expat module. The ParserCreate()
function returns an expat parser object.
After the expat parser object is created in the text parser objects’ parse routine, override the CharacterDataHandler
attribute of the parser object by assigning it to the CharacterData()
method in your search object.
After you have overridden the handler function of the expat parser object, the parse routine will need to invoke the Parse(buffer
[, isFinal]
)
function of the expat parser object. The Parse
function accepts a string buffer
and parses it using the overridden handler methods.
After you have defined the text parser class, create an instance of the class and use the Parse
(file)
function you defined to parse the XML file and retrieve the text.
from xml.parsers import expat xmlFile = "emails.xml" #Define a class that will store the character data class xmlText(object): def __init__ (self): self.textBuff = "" def CharacterData(self, data): data = data.strip() if data: data = data.encode('ascii') self.textBuff += data + " " def Parse(self, fName): #Create the expat parser object xmlParser = expat.ParserCreate() #Override the handler methods xmlParser.CharacterDataHandler = self.CharacterData #Parse the XML file xmlParser.Parse(open(fName).read(), 1) #Create the text parser object xText = xmlText() #Invoke the text parser objects Parse method xText.Parse(xmlFile) #Display parsed results print "Text from %s ====================" % xmlFile print xText.textBuff
xml_text.py
Text from emails.xml ========================== [email protected] [email protected] [email protected] Update List Please add me to the list. [email protected] [email protected] [email protected] More Updated List Please add me to the list also.
Example .
import xml.sax class tagHandler(xml.sax.handler.ContentHandler): def __init__(self): self.tags = {} def startElement(self,name, attr): name = name.encode('ascii') self.tags[name] = self.tags.get(name, 0) + 1 print "Tag %s = %d" % (name, self.tags.get(name)) xmlparser = xml.sax.make_parser() tHandler = tagHandler() xmlparser.setContentHandler(tHandler) xmlparser.parse(xmlFile)
Another fairly common task when processing XML files is to process the XML tags themselves. The xml.sax module provides a quick, clean interface to the XML tags by defining a custom content handler to deal with the tags.
This phrase demonstrates how to override the content handler of a sax XML parser to determine how many instances of a specific tag there are in the XML document.
First, define a tag handler class that inherits from the xml.sax.handler.ContentHandler
class. Then override the startElement()
method of the class to keep track of each encounter with specific tags.
After you have defined the tag handler class, create an xml.sax
parser object using the make_parser()
function. The make_parser()
function will return a parser object that can be used to parse the XML file. Next, create an instance of the tag handler object.
After you have created the parser and tag handler objects, add the custom tag handler object to the parser object using the setContentHandler(handler)
function.
After the content handler has been added to the parser object, parse the XML file using the parse
(file)
command of the parser object.
import xml.sax xmlFile = "emails.xml" xmlTag = "email" #Define handler to scan XML file and parse tags class tagHandler(xml.sax.handler.ContentHandler): def __init__(self): self.tags = {} def startElement(self,name, attr): name = name.encode('ascii') self.tags[name] = self.tags.get(name, 0) + 1 print "Tag %s = %d" % (name, self.tags.get(name)) #Create a parser object xmlparser = xml.sax.make_parser() #Create a content handler object tHandler = tagHandler() #Attach the content handler to the parser xmlparser.setContentHandler(tHandler) #Parse the XML file xmlparser.parse(xmlFile) tags = tHandler.tags if tags.has_key(xmlTag): print "%s has %d <%s> nodes." % (xmlFile, tags[xmlTag], xmlTag)
xml_tags.py
Tag emails = 1 Tag email = 1 Tag to = 1 Tag addr = 1 Tag addr = 2 Tag from = 1 Tag addr = 3 Tag subject = 1 Tag body = 1 Tag email = 2 Tag to = 2 Tag addr = 4 Tag addr = 5 Tag from = 2 Tag addr = 6 Tag subject = 2 Tag body = 2 emails.xml has 2 <email> nodes.
Output from xml_tags.py code