XML, the eXtensible Markup Language, is a widely used data interchange format. On top of XML itself, the XML community (in good part within the World Wide Web Consortium [W3C]) has standardized many other technologies, such as schema languages, namespaces, XPath, XLink, XPointer, and XSLT.
Industry consortia have defined industry-specific markup languages on top of XML for data exchange among applications in their respective fields. XML, XML-based markup languages, and other XML-related technologies are often used for inter-application, cross-language, cross-platform data interchange in specific industries.
Python’s standard library, for historical reasons, has multiple modules supporting XML under the xml package, with overlapping functionality; this book does not cover them all, so see the online documentation.
This book (and, specifically, this chapter) covers only the most Pythonic approach to XML processing: ElementTree, whose elegance, speed, generality, multiple implementations, and Pythonic architecture make it the package of choice for Python XML applications. For complete tutorials and all details on the xml.etree.ElementTree module, see the online docs and the website of ElementTree’s creator, Fredrik Lundh, best known as “the effbot.”1
This book takes for granted some elementary knowledge of XML itself; if you need to learn more about XML, we recommend the book XML in a Nutshell (O’Reilly).
Parsing XML from untrusted sources puts your application at risk for many possible attacks; this book does not cover this issue specifically, so see the online documentation, which recommends third-party modules to help safeguard your application if you do have to parse XML from sources you can’t fully trust. In particular, if you need an ElementTree implementation with safeguards against parsing untrusted sources, consider defusedxml.ElementTree and its C-coded counterpart defusedxml.cElementTree within the third-party package defusedxml.
Python and third-party add-ons offer several alternative implementations of the ElementTree functionality; the one you can always rely on in the standard library is the module xml.etree.ElementTree. In most circumstances, in v2, you can use the faster C-coded implementation xml.etree.cElementTree; in v3, just importing xml.etree.ElementTree gets you the fastest implementation available. The third-party package defusedxml, mentioned in the previous section of this chapter, offers slightly slower but safer implementations if you ever need to parse XML from untrusted sources; another third-party package, lxml, gets you faster performance, and some extra functionality, via lxml.etree.
Traditionally, you get whatever available implementation of ElementTree you prefer, by a from...import...as statement such as:
    from xml.etree import cElementTree as et
(or more than one such statement, with try...except ImportError: guards to discover what’s the best implementation available), then use et (some prefer the uppercase variant, ET) as the module’s name in the rest of your code.
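The guarded-import idiom just described can be sketched as follows, restricted to the standard library. This is only an illustration: it assumes that xml.etree.cElementTree may be missing (it was removed in Python 3.9), in which case the plain module, which is always available, is used instead.

```python
# Prefer the C-coded module where it still exists (v2, and v3 before 3.9);
# otherwise fall back to the module that is always available.
try:
    from xml.etree import cElementTree as et
except ImportError:
    from xml.etree import ElementTree as et

# Whichever module was imported, the rest of the code only uses the name et.
root = et.fromstring('<greeting>hello</greeting>')
print(root.tag, root.text)
```

The same pattern extends to third-party implementations such as lxml.etree by adding one more try level in front.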
ElementTree supplies one fundamental class representing a node within the tree that naturally maps an XML document, the class Element. ElementTree also supplies other important classes, chiefly the one representing the whole tree, with methods for input and output and many convenience ones equivalent to ones on its Element root: that’s the class ElementTree. In addition, the ElementTree module supplies several utility functions, and auxiliary classes of lesser importance.
The Element class represents a node in the tree that maps an XML document, and it’s the core of the whole ElementTree ecosystem. Each element is a bit like a mapping, with attributes that are a mapping from string keys to string values, and a bit like a sequence, with children that are other elements (sometimes referred to as the element’s “subelements”). In addition, each element offers a few extra attributes and methods. Each Element instance e has four data attributes, or properties:
attrib |
A Avoid accessing attrib on Element instances, if feasibleIt’s normally best to avoid accessing |
tag |
The XML tag of the node, a string, sometimes also known as “the element’s type.” For example, parsing the XML fragment |
tail |
Arbitrary data (a string) immediately “following” the element. For example, parsing the XML fragment |
text |
Arbitrary data (a string) directly “within” the element. For example, parsing the XML fragment |
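As a small illustration of the four data attributes, the sketch below parses a made-up fragment (chosen here only to exercise each attribute; it is not taken from the chapter's examples) and inspects the results.

```python
from xml.etree import ElementTree as et

# Hypothetical fragment: one attribute, some text, one empty child, a tail.
e = et.fromstring('<a x="y">b<c/>d</a>')
c = e[0]  # the first (and only) child, <c/>

print(e.tag)     # a
print(e.attrib)  # {'x': 'y'}
print(e.text)    # b     (text directly within <a>, before its first child)
print(c.tail)    # d     (text immediately following <c/>)
print(c.text)    # None  (the empty element <c/> holds no text)
```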
e has some methods that are mapping-like and avoid the need to explicitly ask for the e.attrib dict:
clear |
|
get |
Like |
items |
Returns the list of |
keys |
Returns the list of all attribute names, in arbitrary order. |
set |
Sets the value of attribute named |
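The mapping-like methods can be tried out on a tiny element. This is a sketch only; the attribute names and values below are invented for the example.

```python
from xml.etree import ElementTree as et

e = et.fromstring('<food name="toast" calories="180">plain</food>')

print(e.get('name'))          # toast
print(e.get('price', 'n/a'))  # n/a  (the default, since 'price' is absent)

e.set('price', '2.50')
print(sorted(e.keys()))       # ['calories', 'name', 'price']
print(sorted(e.items()))      # (name, value) pairs, sorted here for stability

# clear removes all attributes and children, and resets text and tail.
e.clear()
print(list(e.keys()), e.text, len(e))
```

Sorting the results of keys and items makes the output independent of the order in which the attributes happen to be stored.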
The other methods of e (including indexing with the e[i] syntax, and length as in len(e)) deal with all e’s children as a sequence, or in some cases (indicated in the rest of this section) with all descendants (elements in the subtree rooted at e, also known as subelements of e).
In all versions up to Python 3.6, an Element instance e tests as false if e has no children, following the normal rule for Python containers’ implicit bool conversion. However, it’s documented that this behavior may change in some future version of v3. For future compatibility, if you want to check whether e has no children, explicitly check if len(e) == 0: and don’t use the normal Python idiom if not e:.
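The explicit checks recommended above might look like the following sketch. The helper name has_children is hypothetical, introduced only for this example; the second half shows the related habit of comparing the result of find against None rather than relying on its truth value.

```python
from xml.etree import ElementTree as et

menu = et.fromstring('<menu><food/></menu>')
leaf = menu[0]

def has_children(e):
    """Explicit length test, independent of how bool(e) behaves."""
    return len(e) != 0

print(has_children(menu))  # True
print(has_children(leaf))  # False

# find returns None when nothing matches, and a childless element when the
# match is a leaf, so test the result with `is None`, not by truth value.
found = menu.find('food')
missing = menu.find('drink')
print(found is not None)   # True
print(missing is None)     # True
```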
The named methods of e dealing with children or descendants are the following (we do not cover XPath in this book: see the online docs):
append |
Adds subelement |
extend |
Adds each item of iterable |
find |
Returns the first descendant matching |
findall |
Returns the list of all descendants matching |
findtext |
Returns the |
insert |
Adds subelement |
iter |
Returns an iterator walking in depth-first order over all of |
iterfind |
Returns an iterator over all descendants, in depth-first order, matching |
itertext |
Returns an iterator over the |
remove |
Removes the descendant that |
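A short sketch exercising several of these methods on an invented two-item menu follows. One point worth noting as you read it: when find, findall, and findtext are given a bare tag name, they match among the direct children of the element, while iter walks the whole subtree; remove likewise expects a direct child.

```python
from xml.etree import ElementTree as et

menu = et.fromstring(
    '<menu>'
    '<food><name>Toast</name></food>'
    '<food><name>Waffles</name></food>'
    '</menu>')

# A bare tag name is matched against direct children.
first = menu.find('food')
print(first.findtext('name'))                    # Toast
names = [f.findtext('name') for f in menu.findall('food')]
print(names)                                     # ['Toast', 'Waffles']

# iter walks all descendants, depth first.
deep_names = [n.text for n in menu.iter('name')]
print(deep_names)                                # ['Toast', 'Waffles']

# append, insert, and extend treat the children as a sequence.
juice = et.Element('drink')
juice.text = 'Juice'
menu.append(juice)
menu.insert(0, et.Element('special'))
menu.extend([et.Element('side'), et.Element('side')])
tags = [child.tag for child in menu]
print(tags)

# remove takes the child element itself, not a tag name.
menu.remove(juice)
all_text = ''.join(menu.itertext())
print(all_text)                                  # ToastWaffles
```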
The ElementTree class represents a tree that maps an XML document. The core added value of an instance et of ElementTree is to have methods for wholesale parsing (input) and writing (output) of a whole tree, namely:
parse |
|
write |
You can pass You can optionally pass You can pass In v3 only, you can optionally (only by name, not positionally) pass |
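To make the round trip concrete, here is a sketch that writes a tiny tree to an in-memory bytes buffer and parses it back. The buffer stands in for a real file, and the element names are invented for the example; the encoding and xml_declaration arguments to write are passed by name.

```python
import io
from xml.etree import ElementTree as et

# Build a one-item tree by hand.
root = et.Element('menu')
et.SubElement(root, 'food').text = 'Toast'
tree = et.ElementTree(root)

# Write it to a bytes buffer instead of a file on disk.
buf = io.BytesIO()
tree.write(buf, encoding='utf-8', xml_declaration=True)

# Parse it back with the parse method of a fresh ElementTree instance.
buf.seek(0)
tree2 = et.ElementTree()
tree2.parse(buf)
print(tree2.getroot().tag)     # menu
print(tree2.findtext('food'))  # Toast
```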
In addition, an instance et of ElementTree supplies the method getroot (et.getroot() returns the root of the tree) and the convenience methods find, findall, findtext, iter, and iterfind, each exactly equivalent to calling the same method on the root of the tree, that is, on the result of et.getroot().
The ElementTree module also supplies several functions, described in Table 23-2.
Comment |
Returns an |
ProcessingInstruction |
Returns an |
SubElement |
Creates an |
XML |
Parses XML from the |
XMLID |
Parses XML from the |
dump |
Writes |
fromstring |
Parses XML from the |
fromstringlist |
Just like |
iselement |
Returns |
iterparse |
The purpose of |
parse |
Just like the |
register_namespace |
Registers the string |
tostring |
Returns a string with the XML representation of the subtree rooted at |
tostringlist |
Returns a list of strings with the XML representation of the subtree rooted at |
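A few of these functions working together, as a sketch with invented element names and v3 behavior assumed (in v3, tostring returns bytes unless you ask for another encoding):

```python
from xml.etree import ElementTree as et

# Build a small tree with Element, Comment, and SubElement.
menu = et.Element('menu')
menu.append(et.Comment('breakfast only'))
food = et.SubElement(menu, 'food', name='Toast')

print(et.iselement(food))    # True
print(et.iselement('food'))  # False: a string is not an Element

# Serialize, then parse the result back.
data = et.tostring(menu)
copy = et.fromstring(data)
print(copy.find('food').get('name'))  # Toast

# tostringlist returns the same serialization as a list of fragments.
pieces = et.tostringlist(menu)
print(b''.join(pieces) == data)       # True
```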
The ElementTree module also supplies the classes QName, TreeBuilder, and XMLParser, which we do not cover in this book. In v3 only, it also supplies the class XMLPullParser, covered in “Parsing XML Iteratively”.
In everyday use, the most common way to make an ElementTree instance is by parsing it from a file or file-like object, usually with the module function parse or with the method parse of instances of the class ElementTree.
For the examples in this chapter, we use the simple XML file found at http://www.w3schools.com/xml/simple.xml; its root tag is 'breakfast_menu', and the root’s children are elements with the tag 'food'. Each 'food' element has a child with the tag 'name', whose text is the food’s name, and a child with the tag 'calories', whose text is the string representation of the integer number of calories in a portion of that food. In other words, a simplified representation of that XML file’s content of interest to the examples is:
    <breakfast_menu>
      <food>
        <name>Belgian Waffles</name>
        <calories>650</calories>
      </food>
      <food>
        <name>Strawberry Belgian Waffles</name>
        <calories>900</calories>
      </food>
      <food>
        <name>Berry-Berry Belgian Waffles</name>
        <calories>900</calories>
      </food>
      <food>
        <name>French Toast</name>
        <calories>600</calories>
      </food>
      <food>
        <name>Homestyle Breakfast</name>
        <calories>950</calories>
      </food>
    </breakfast_menu>
Since the XML document lives at a WWW URL, you start by obtaining a file-like object with that content, and passing it to parse; in v2, the simplest way is:
    import urllib
    from xml.etree import ElementTree as et

    content = urllib.urlopen('http://www.w3schools.com/xml/simple.xml')
    tree = et.parse(content)
and similarly, in v3, the simplest way uses the request module:
    from urllib import request
    from xml.etree import ElementTree as et

    content = request.urlopen('http://www.w3schools.com/xml/simple.xml')
    tree = et.parse(content)
Let’s say that we want to print on standard output the calories and names of the various foods, in order of increasing calories, with ties broken alphabetically. The code for this task is the same in v2 and v3:
    def bycal_and_name(e):
        # sort key: calories as a number first, then the name
        return int(e.find('calories').text), e.find('name').text

    for e in sorted(tree.findall('food'), key=bycal_and_name):
        print('{} {}'.format(e.find('calories').text, e.find('name').text))
When run, this prints:
600 French Toast
650 Belgian Waffles
900 Berry-Berry Belgian Waffles
900 Strawberry Belgian Waffles
950 Homestyle Breakfast
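If the URL is not reachable from where you run the code, the same selection can be tried on an in-memory copy of the simplified document. The sketch below assumes v3 and uses fromstring on a string literal (the name XML is introduced only for this example) in place of parse on a downloaded file.

```python
from xml.etree import ElementTree as et

# In-memory stand-in for the simplified breakfast menu shown earlier.
XML = """<breakfast_menu>
  <food><name>Belgian Waffles</name><calories>650</calories></food>
  <food><name>Strawberry Belgian Waffles</name><calories>900</calories></food>
  <food><name>Berry-Berry Belgian Waffles</name><calories>900</calories></food>
  <food><name>French Toast</name><calories>600</calories></food>
  <food><name>Homestyle Breakfast</name><calories>950</calories></food>
</breakfast_menu>"""

root = et.fromstring(XML)

def bycal_and_name(e):
    # Sort key: calories as an integer, then name to break ties.
    return int(e.find('calories').text), e.find('name').text

lines = ['{} {}'.format(e.findtext('calories'), e.findtext('name'))
         for e in sorted(root.findall('food'), key=bycal_and_name)]
print('\n'.join(lines))
```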
Once an ElementTree is built (be that via parsing, or otherwise), it can be “edited” (inserting, deleting, and/or altering nodes, that is, elements) via the various methods of ElementTree and Element classes, and module functions. For example, suppose our program is reliably informed that a new food has been added to the menu (buttered toast, two slices of white bread toasted and buttered, 180 calories) while any food whose name contains “berry,” case-insensitive, has been removed. The “editing the tree” part for these specs can be coded as follows:
    # add Buttered Toast to the menu
    menu = tree.getroot()
    toast = et.SubElement(menu, 'food')
    tcals = et.SubElement(toast, 'calories')
    tcals.text = '180'
    tname = et.SubElement(toast, 'name')
    tname.text = 'Buttered Toast'

    # remove anything related to 'berry' from the menu
    for e in menu.findall('food'):
        name = e.find('name').text
        if 'berry' in name.lower():
            menu.remove(e)
Once we insert these “editing” steps between the code parsing the tree and the code selectively printing from it, the latter prints:
180 Buttered Toast
600 French Toast
650 Belgian Waffles
950 Homestyle Breakfast
The ease of “editing” an ElementTree can sometimes be a crucial consideration, making it worth your while to keep it all in memory.
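One detail of the removal loop deserves a note: findall returns a list, so the loop walks a snapshot of the children while remove changes the element itself. Looping directly over the element while removing from it can skip items, much as with a plain Python list. The sketch below illustrates the snapshot approach; the helper name drop_berries and the sample menu are invented for the example.

```python
from xml.etree import ElementTree as et

def drop_berries(menu):
    """Remove every food whose name contains 'berry', case-insensitively."""
    # findall returns a list, so removing children here does not
    # disturb the iteration.
    for food in menu.findall('food'):
        if 'berry' in food.findtext('name').lower():
            menu.remove(food)

menu = et.fromstring(
    '<menu>'
    '<food><name>Strawberry Waffles</name></food>'
    '<food><name>Berry-Berry Waffles</name></food>'
    '<food><name>French Toast</name></food>'
    '</menu>')
drop_berries(menu)
remaining = [f.findtext('name') for f in menu]
print(remaining)  # ['French Toast']
```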
Sometimes, your task doesn’t start from an existing XML document: rather, you need to make an XML document from data your code gets from a different source, such as a CSV document or some kind of database.
The code for such tasks is similar to the one we showed for editing an existing ElementTree: just add a little snippet to build an initially empty tree.
For example, suppose you have a CSV file, menu.csv, whose two comma-separated columns are the calories and name of various foods, one food per row. Your task is to build an XML file, menu.xml, similar to the one we parsed in previous examples. Here’s one way you could do that:
    import csv
    from xml.etree import ElementTree as et

    menu = et.Element('menu')
    tree = et.ElementTree(menu)
    with open('menu.csv') as f:
        r = csv.reader(f)
        for calories, namestr in r:
            food = et.SubElement(menu, 'food')
            cals = et.SubElement(food, 'calories')
            cals.text = calories
            name = et.SubElement(food, 'name')
            name.text = namestr

    tree.write('menu.xml')
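To experiment without creating files, the same construction can be wrapped in a function and fed from memory. This is a sketch assuming v3; the function name csv_to_menu and the two sample rows are invented for the example.

```python
import csv
import io
from xml.etree import ElementTree as et

def csv_to_menu(lines):
    """Build an ElementTree from rows of the form calories,name."""
    menu = et.Element('menu')
    for calories, namestr in csv.reader(lines):
        food = et.SubElement(menu, 'food')
        et.SubElement(food, 'calories').text = calories
        et.SubElement(food, 'name').text = namestr
    return et.ElementTree(menu)

# An in-memory stand-in for menu.csv.
sample = io.StringIO('180,Buttered Toast\n600,French Toast\n')
tree = csv_to_menu(sample)

# encoding='unicode' asks tostring for a text string rather than bytes.
print(et.tostring(tree.getroot(), encoding='unicode'))
```

Separating the tree construction from the file handling also makes it easy to write the result elsewhere, for example with tree.write('menu.xml').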
For tasks focused on selecting elements from an existing XML document, sometimes you don’t need to build the whole ElementTree in memory, a consideration that’s particularly important if the XML document is very large (not the case for the tiny example document we’ve been dealing with, but stretch your imagination and visualize a similar menu-focused document that lists millions of different foods).
So, again, what we want to do is print on standard output the calories and names of foods, this time only the 10 lowest-calorie foods, in order of increasing calories, with ties broken alphabetically; and menu.xml, which for simplicity’s sake we now suppose is a local file, lists millions of foods, so we’d rather not keep it all in memory at once, since obviously we don’t need complete access to all of it at once.
Here’s some code that one might think would let us ace this task:
    import heapq
    from xml.etree import ElementTree as et

    # the heap collects (calories, name) pairs as foods are parsed
    heap = []

    for _, elem in et.iterparse('menu.xml'):
        if elem.tag != 'food':
            continue
        # just finished parsing a food, get calories and name
        cals = int(elem.find('calories').text)
        name = elem.find('name').text
        heapq.heappush(heap, (cals, name))

    # print the 10 lowest-calorie foods, ties broken alphabetically
    for cals, name in heapq.nsmallest(10, heap):
        print('{} {}'.format(cals, name))
This approach does indeed work, but it consumes just about as much memory as an approach based on a full et.parse would!
Why does the simple approach still eat memory? Because iterparse, as it runs, builds up a whole ElementTree in memory, incrementally, even though it only communicates back events such as (by default) just 'end', meaning “I just finished parsing this element.”
To actually save memory, we can at least toss all the contents of each element as soon as we’re done processing it: right after the call to heapq.heappush, add elem.clear() to make the just-processed element empty.
This approach would indeed save some memory, but not all of it, because the tree’s root would end up with a huge list of empty children nodes. To be really frugal in memory consumption, we need to get 'start' events as well, so we can get hold of the root of the ElementTree being built; that is, change the start of the loop to:
    root = None
    for event, elem in et.iterparse('menu.xml', events=('start', 'end')):
        if event == 'start':
            # the first 'start' event is for the root element: remember it
            if root is None:
                root = elem
            continue
        if elem.tag != 'food':
            continue
        # etc. as before
and then, right after the call to heapq.heappush, add root.remove(elem). This approach saves as much memory as feasible, and still gets the task done!
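Putting the pieces together, the memory-frugal loop might look like the following sketch. The function name lowest_calorie_foods, its n parameter, and the in-memory sample are invented for the example; the function asks iterparse for both 'start' and 'end' events, remembers the root from the first 'start' event, and detaches each food from the root once it has been processed.

```python
import heapq
import io
from xml.etree import ElementTree as et

def lowest_calorie_foods(source, n=10):
    """Return the n lowest (calories, name) pairs without keeping the tree."""
    heap = []
    root = None
    for event, elem in et.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if root is None:
                root = elem          # the first element started is the root
            continue
        if elem.tag != 'food':
            continue
        cals = int(elem.find('calories').text)
        name = elem.find('name').text
        heapq.heappush(heap, (cals, name))
        root.remove(elem)            # detach the processed food from the root
    return heapq.nsmallest(n, heap)

# A tiny in-memory stand-in for a very large menu.xml.
sample = io.BytesIO(
    b'<menu>'
    b'<food><name>Belgian Waffles</name><calories>650</calories></food>'
    b'<food><name>Strawberry Belgian Waffles</name><calories>900</calories></food>'
    b'<food><name>Berry-Berry Belgian Waffles</name><calories>900</calories></food>'
    b'<food><name>French Toast</name><calories>600</calories></food>'
    b'</menu>')
result = lowest_calorie_foods(sample, n=3)
for cals, name in result:
    print(cals, name)
```

The sketch relies on every food being a direct child of the root, as in the example document; a food nested more deeply would need to be removed from its own parent instead.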
While iterparse, used correctly, can save memory, it’s still not good enough to use within an asynchronous (async) loop, as covered in Chapter 18. That’s because iterparse makes blocking read calls to the file object passed as its first argument: such blocking calls are a no-no in async processing.
v2’s ElementTree has no solution to offer to this conundrum. v3 does: specifically, it offers the class XMLPullParser. (In v2, you can get this functionality if you use the third-party package lxml, thanks to lxml.etree.)
In an async arrangement, as covered in Chapter 18, a typical task is to write a “filter” component, which is fed chunks of bytes as they happen to come from some upstream source, and yields events downstream as they get fully parsed. Here’s how XMLPullParser lets you write such a “filter” component:
    from xml.etree import ElementTree as et

    def filter(events=None):
        pullparser = et.XMLPullParser(events)
        data = yield
        while data:
            pullparser.feed(data)
            for tup in pullparser.read_events():
                data = yield tup
        pullparser.close()
        for tup in pullparser.read_events():
            data = yield tup
This assumes that filter is used via .send(chunk) calls to its result (passing new chunks of bytes as they are received), and yields (event, element) tuples for the caller to loop on and process. So, essentially, filter turns an async stream of chunks of raw bytes into an async stream of (event, element) pairs, to be consumed by iteration, which is a typical design pattern in modern Python’s async programming.
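To see the underlying feed and read_events calls in isolation, here is a plain synchronous sketch, assuming v3. The chunk boundaries are chosen arbitrarily to split the document mid-tag, and the list named seen is introduced only to collect results; exactly when each event becomes available may vary with the underlying parser, so the sketch inspects only the total after close.

```python
from xml.etree import ElementTree as et

# Ask only for 'end' events, as iterparse does by default.
parser = et.XMLPullParser(events=('end',))

# Arbitrary chunks, deliberately split in the middle of a tag.
chunks = [b'<menu><food><na', b'me>Toast</name>', b'</food></menu>']

seen = []
for chunk in chunks:
    parser.feed(chunk)                     # never blocks: just consumes bytes
    for event, elem in parser.read_events():
        seen.append((event, elem.tag))

parser.close()                             # signal the end of the input
for event, elem in parser.read_events():   # collect anything still pending
    seen.append((event, elem.tag))

print(seen)
```

Because feed only consumes the bytes it is handed, the caller decides when and where data arrives, which is what makes this class suitable for async arrangements.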
1 Alex is far too modest to mention it, but from around 1995 to 2005 both he and Fredrik were, along with Tim Peters, the Python bots. Known as such for their encyclopedic and detailed knowledge of the language, the effbot, the martellibot, and the timbot have created software of immense value to millions of people.