Tag-based data, particularly the different XML dialects, has become a very popular way to distribute geospatial data. Formats that are both machine and human readable are generally easy to work with, though they sacrifice storage efficiency for usability. These formats can become unmanageable for very large datasets but work very well in most cases.
While most formats are some form of XML (such as KML or GML), there is a notable exception. The well-known text (WKT) format is fairly common but surrounds data with keywords and square brackets ([ ]) rather than the angled-bracket tags that XML uses.
Python has standard library support for XML as well as some excellent third-party libraries. Proper XML formats all follow the same structure, so you can use a generic XML library to read any of them. Because XML is text-based, it is often easier to write it out as a string than to use an XML library, and the vast majority of applications that output XML do exactly that. The primary advantage of using an XML library for writing is that your output is validated. It is very easy to make a mistake when building your own XML by hand: a single missing quotation mark can derail an XML parser and throw an error for anybody trying to read your data, effectively rendering the dataset useless. You will find that this problem is very common among XML-based geospatial data. You'll also discover that some parsers are more forgiving of incorrect XML than others. Often, reliability matters more than speed or memory efficiency. The analysis available at http://lxml.de/performance.html provides speed and memory benchmarks for the different Python XML parsers.
The Python minidom module is a venerable, simple-to-use XML parser. It is part of Python's built-in set of XML tools in the xml package. It can parse XML from files or from strings. The minidom module is best for small to medium-sized XML documents; beyond roughly 20 megabytes, its speed begins to decrease.
To demonstrate the minidom
module, we'll use a sample KML file that is part of Google's KML documentation and that you can download. The data available at the following link represents time-stamped point locations transferred from a GPS device:
https://github.com/GeospatialPython/Learn/raw/master/time-stamp-point.kml
First, we'll parse this data by reading it in from the file and creating a minidom
parser object. The file contains a series of <Placemark>
tags, which contain a point and a timestamp at which that point was collected. So, we'll get a list of all of the Placemarks in the file, and we can count them by checking the length of that list, as shown in the following lines of code:
>>> from xml.dom import minidom
>>> kml = minidom.parse("time-stamp-point.kml")
>>> Placemarks = kml.getElementsByTagName("Placemark")
>>> len(Placemarks)
361
As you can see, we retrieved all Placemarks
which totaled 361
. Now, let's take a look at the first Placemark
element in the list:
>>> Placemarks[0]
<DOM Element: Placemark at 0x2045a30>
Each <Placemark>
tag is now a DOM Element
data type. To really see what that element is, we call the toxml()
method as you can see in the following lines of code:
>>> Placemarks[0].toxml()
u'<Placemark> <TimeStamp> <when>2007-01-14T21:05:02Z</when> </TimeStamp> <styleUrl>#paddle-a</styleUrl> <Point> <coordinates>-122.536226,37.86047,0</coordinates> </Point> </Placemark>'
The toxml()
function outputs everything contained in the Placemark
tag as a string object. If we want to print this information to a text file, we can call the toprettyxml()
method, which would add additional indentation to make the XML more readable.
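To see the difference, here is a minimal sketch using a small stand-in document (this inline XML string is illustrative, not the downloaded KML file):

```python
from xml.dom import minidom

# A tiny stand-in document for demonstration purposes
doc = minidom.parseString(
    "<Placemark><name>Office</name>"
    "<Point><coordinates>-122.087461,37.422069</coordinates></Point>"
    "</Placemark>")

# toxml() returns one unindented string; toprettyxml() inserts
# newlines and indentation so the XML is easier to read
print(doc.documentElement.toprettyxml(indent="  "))
```

The exact layout of the pretty-printed text nodes varies slightly between Python versions, but the child tags are always indented one level per depth.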
Now what if we want to grab just the coordinates from this Placemark? The coordinates are buried inside the coordinates
tag, which is contained in the Point
tag and nested inside the Placemark
tag. Each element of a minidom
object is called a node. Nested nodes are called children or child nodes. The child nodes include more than just tags. They can also include whitespace separating tags as well as the data inside tags. So, we can drill down to the coordinates
tag using the tag name, but then we'll need to access the data node. All minidom elements have a childNodes list as well as a firstChild attribute that returns the first child node. We'll combine these to get to the data attribute of the first coordinates tag's text node, which we reference using index 0 in the list of coordinates tags:
>>> coordinates = Placemarks[0].getElementsByTagName("coordinates")
>>> point = coordinates[0].firstChild.data
>>> point
u'-122.536226,37.86047,0'
If you're new to Python, you'll notice that the text output in these examples is prefixed with the letter u. This prefix is how Python 2 denotes Unicode strings, which support internationalization to multiple languages with different character sets. Python 3 changes this convention: strings are always Unicode and appear unmarked, while bytes objects are marked with a b.
We can go a little further and convert this point string into usable data by splitting the string and converting the resulting pieces to Python float types, as shown here:
>>> x,y,z = point.split(",")
>>> x
u'-122.536226'
>>> y
u'37.86047'
>>> z
u'0'
>>> x = float(x)
>>> y = float(y)
>>> z = float(z)
>>> x,y,z
(-122.536226, 37.86047, 0.0)
Using a Python list comprehension, we can perform this operation in a single step, as you can see in the following lines of code:
>>> x,y,z = [float(c) for c in point.split(",")]
>>> x,y,z
(-122.536226, 37.86047, 0.0)
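The same pattern scales to every point in a document. Here is a sketch that combines getElementsByTagName() with a list comprehension to convert all coordinates at once; the two-placemark inline KML string is a stand-in for the downloaded file:

```python
from xml.dom import minidom

# A two-placemark stand-in for the downloaded KML file
kml_text = """<kml>
  <Placemark><Point><coordinates>-122.536226,37.86047,0</coordinates></Point></Placemark>
  <Placemark><Point><coordinates>-122.536422,37.860303,0</coordinates></Point></Placemark>
</kml>"""

kml = minidom.parseString(kml_text)

# One (x, y, z) tuple of floats per Placemark, in document order
points = [tuple(float(c) for c in node.firstChild.data.split(","))
          for node in kml.getElementsByTagName("coordinates")]
print(points)  # [(-122.536226, 37.86047, 0.0), (-122.536422, 37.860303, 0.0)]
```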
This example scratches the surface of what the minidom
library can do. For a great tutorial on this library, have a look at the 9.3. Parsing XML section of the excellent book Dive Into Python, by Mark Pilgrim, which is available in print or online at http://www.diveintopython.net/xml_processing/parsing_xml.html.
The minidom
module is pure Python, easy to work with, and has been around since Python 2.0. However, Python 2.5 added a more efficient yet high-level XML parser to the standard library called ElementTree
. ElementTree
is interesting because it has been implemented in multiple versions. There is a pure Python version and a faster version written in C called cElementTree
. You should use cElementTree wherever possible, but you may be on a platform that doesn't include the C-based version. At import time, you can test whether cElementTree is available and fall back to the pure Python version if necessary (in Python 3.3 and later, ElementTree automatically uses the C accelerator when available, so this fallback mainly matters for older versions):
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
One of the great features of ElementTree
is its implementation of a subset of the XPath query language. XPath is short for XML Path and allows you to search an XML document using a path-style syntax. If you work with XML frequently, learning XPath is essential. You can learn more about XPath at the following link:
http://www.w3schools.com/xsl/xpath_intro.asp
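ElementTree supports only a subset of XPath, but that subset covers most day-to-day queries. The following sketch, on a small illustrative document, shows the three expressions you will use most often:

```python
import xml.etree.ElementTree as ET

# A small illustrative document (no namespace, for brevity)
root = ET.fromstring(
    "<kml><Document>"
    "<Placemark id='a'><name>A</name></Placemark>"
    "<Placemark id='b'><name>B</name></Placemark>"
    "</Document></kml>")

# ".//tag" searches all descendants; "./child/tag" follows an explicit
# path; "[@attr='value']" filters elements on an attribute value
first = root.find(".//Placemark")
by_path = root.find("./Document/Placemark")
by_attr = root.find(".//Placemark[@id='b']")
print(first.get("id"), by_path.get("id"), by_attr.find("name").text)  # a a B
```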
One catch with this feature is that, if the document specifies a namespace, as most XML documents do, you must insert that namespace into your queries. ElementTree does not handle the namespace for you automatically. Your options are to specify it manually or to extract it from the root element's tag name using string parsing.
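The string-parsing approach works because ElementTree renders qualified tag names in the form {namespace-uri}localname. A minimal sketch of extracting the namespace from the root tag:

```python
import xml.etree.ElementTree as ET

# ElementTree stores qualified names as "{namespace-uri}localname"
root = ET.fromstring(
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Placemark/></kml>')

# Slice the "{...}" prefix off the root tag to recover the namespace
ns = root.tag[:root.tag.index("}") + 1] if root.tag.startswith("{") else ""
print(ns)

# The extracted prefix can now be reused in queries
placemark = root.find(ns + "Placemark")
```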
We'll repeat the minidom
XML parsing example using ElementTree
. First, we'll parse the document and then we'll manually define the KML namespace; later, we'll use an XPath expression and the find()
method to find the first Placemark
element. Finally, we'll find the coordinates child node and grab its text containing the latitude and longitude. In both cases, we could have searched directly for the coordinates tag, but grabbing the Placemark element first gives us the option of retrieving the corresponding timestamp child element later if we choose, as shown in the following lines of code:
>>> tree = ET.ElementTree(file="time-stamp-point.kml")
>>> ns = "{http://www.opengis.net/kml/2.2}"
>>> placemark = tree.find(".//%sPlacemark" % ns)
>>> coordinates = placemark.find("./{}Point/{}coordinates".format(ns, ns))
>>> coordinates.text
'-122.536226,37.86047,0'
In this example, notice that we used two of Python's string formatting styles. When we defined the XPath expression for the placemark variable, we used the old C-style %s placeholder to mark the insertion point, then the % operator followed by the ns variable to fill it in. For the coordinates variable, we used the newer str.format() method instead: the ns variable is needed twice, so we wrote two {} placeholders and passed ns twice as arguments to format().
Most of the time, XML can be built by concatenating strings, as you can see in the following command:
xml = '<?xml version="1.0" encoding="utf-8"?>'
xml += '<kml xmlns="http://www.opengis.net/kml/2.2">'
xml += '  <Placemark>'
xml += '    <name>Office</name>'
xml += '    <description>Office Building</description>'
xml += '    <Point>'
xml += '      <coordinates>'
xml += '        -122.087461,37.422069'
xml += '      </coordinates>'
xml += '    </Point>'
xml += '  </Placemark>'
xml += '</kml>'
But, this method can be quite prone to typos, which creates invalid XML documents. A safer way is to use an XML library. Let's build this simple KML document using ElementTree
We'll define the root KML element and assign it a namespace. Then, we'll systematically append subelements to the root, wrap the elements in an ElementTree object, declare the XML encoding, and write it out to a file called placemark.kml, as shown in the following lines of code:
>>> root = ET.Element("kml")
>>> root.attrib["xmlns"] = "http://www.opengis.net/kml/2.2"
>>> placemark = ET.SubElement(root, "Placemark")
>>> office = ET.SubElement(placemark, "name")
>>> office.text = "Office"
>>> desc = ET.SubElement(placemark, "description")
>>> desc.text = "Office Building"
>>> point = ET.SubElement(placemark, "Point")
>>> coordinates = ET.SubElement(point, "coordinates")
>>> coordinates.text = "-122.087461,37.422069"
>>> tree = ET.ElementTree(root)
>>> tree.write("placemark.kml", xml_declaration=True, encoding='utf-8', method="xml")
The output is identical to the previous string-building example, except that ElementTree does not indent the tags but writes the document out as one long string. The minidom module has a similar writing interface, which is documented in the book Dive Into Python, by Mark Pilgrim, referenced in the minidom example that we just saw.
XML parsers such as minidom
and ElementTree
work very well on perfectly formatted XML documents. Unfortunately, the vast majority of XML documents out there don't follow the rules and contain formatting errors or invalid characters. You'll find that you are often forced to work with this data and must resort to extraordinary string parsing techniques to get the small subset of data you actually need. But thanks to Python and BeautifulSoup
, you can elegantly work with bad and even terrible, tag-based data.
BeautifulSoup
is a module specifically designed to robustly handle broken XML. It is oriented towards HTML, which is notorious for incorrect formatting, but it works with other XML dialects too. BeautifulSoup
is available on PyPI, so use either easy_install
or pip
to install it, as you can see in the following command:
easy_install beautifulsoup4
Or you can execute the following command:
pip install beautifulsoup4
Then, to use it, you simply import it:
>>> from bs4 import BeautifulSoup
To try it out, we'll use a GPS Exchange Format (GPX) tracking file from a smartphone application, which has a glitch and exports slightly broken data. You can download this sample file which is available at the following link:
https://raw.githubusercontent.com/GeospatialPython/Learn/master/broken_data.gpx
This 2,347 line data file is in pristine condition except that it is missing a closing </trkseg>
tag, which should be located at the very end of the file just before the closing </trk>
tag. This error was caused by a data export function in the source program. This defect is most likely a result of the original developer manually generating the GPX XML on export and forgetting the line of code that adds this closing tag. Watch what happens if we try to parse this file with minidom
:
>>> gpx = minidom.parse("broken_data.gpx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\dom\minidom.py", line 1914, in parse
    return expatbuilder.parse(file)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 2346, column 2
As you can see from the last line of the error message, the underlying XML parser in minidom knows exactly what the problem is: a mismatched tag right at the end of the file. But it refuses to do anything more than report the error. With minidom, you get perfectly formed XML or nothing at all.
Now, let's try the more sophisticated and efficient ElementTree
module with the same data:
>>> ET.ElementTree(file="broken_data.gpx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 611, in __init__
    self.parse(file)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 653, in parse
    parser.feed(data)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 2346, column 2
As you can see, a different parser runs into the same wall. Poorly formed XML is an all-too-common reality in geospatial analysis, and nearly every XML parser assumes that all the XML in the world is perfect. One exception is BeautifulSoup. This library shreds bad XML into usable data without a second thought, and it can handle far worse defects than missing tags: it will press on despite missing punctuation or other syntax errors and give you the best data it can. It was originally developed for parsing HTML, which is notorious for being poorly formed, but it works fairly well with XML too, as shown here:
>>> from bs4 import BeautifulSoup
>>> gpx = open("broken_data.gpx")
>>> soup = BeautifulSoup(gpx.read(), features="xml")
>>>
No complaints from BeautifulSoup
! Just to make sure the data is actually usable, let's try to access some of it. One of the fantastic features of BeautifulSoup
is that it turns tags into attributes of the parse tree. If there are multiple tags with the same name, it grabs the first one. Our sample data file has hundreds of <trkpt>
tags. Let's access the first one:
>>> soup.trkpt
<trkpt lat="30.307267000" lon="-89.332444000"><ele>10.7</ele><time>2013-05-16T04:39:46Z</time></trkpt>
We're now certain that the data has been parsed correctly and we can access it. If we want to access all of the <trkpt>
tags, we can use the findAll()
method to grab them and then use the built-in Python len()
function to count them, as shown here:
>>> tracks = soup.findAll("trkpt")
>>> len(tracks)
2321
If we write the parsed data back out to a file, BeautifulSoup
outputs the corrected version. We'll save the fixed data as a new GPX file, using BeautifulSoup's prettify() method to format the XML with nice indentation, as you can see in the following lines of code:
>>> fixed = open("fixed_data.gpx", "w")
>>> fixed.write(soup.prettify())
>>> fixed.close()
BeautifulSoup
is a very rich library with many more features. To explore it further, visit the BeautifulSoup
documentation online at www.crummy.com/software/BeautifulSoup/bs4/doc/.
While minidom
, ElementTree
, and cElementTree
come with the Python standard library, there is an even more powerful and popular XML library for Python called lxml
. The lxml
module provides a Pythonic interface to the libxml2
and libxslt
C libraries using the ElementTree
API. Even better, lxml
also works with BeautifulSoup
to parse bad tag-based data. On some installations, BeautifulSoup4
may require lxml
. The lxml
module is available via PyPI but requires some additional steps to install the underlying C libraries. More information is available on the lxml homepage at http://lxml.de/.
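Because lxml follows the ElementTree API, it can act as a drop-in replacement for the standard library parser. A minimal sketch, using the same try/except fallback pattern shown earlier in case lxml isn't installed on your platform:

```python
# Prefer lxml when installed; its etree module mirrors the
# standard library ElementTree API, so the rest of the code
# is unchanged either way
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

root = ET.fromstring("<kml><Placemark><name>Office</name></Placemark></kml>")
print(root.find(".//Placemark/name").text)  # Office
```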
The WKT format has been around for years and is a simple text-based format for representing geometries and spatial reference systems. It is primarily used as a data exchange format by systems that implement the OGC Simple Features for SQL specification. Take a look at the following sample WKT representation of a polygon:
POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))
Currently, the best way to read and write WKT is the Shapely library. Shapely provides a very Python-oriented or Pythonic interface to the Geometry Engine – Open Source (GEOS) library described in Chapter 3, The Geospatial Technology Landscape.
You can install Shapely using either easy_install
or pip
. You can also use the wheel from the site mentioned in the previous section. Shapely has a WKT module which can load and export this data. Let's use Shapely to load the previous polygon sample and then verify that it has been loaded as a polygon object by calculating its area:
>>> import shapely.wkt
>>> wktPoly = "POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))"
>>> poly = shapely.wkt.loads(wktPoly)
>>> poly.area
15.0
We can convert any Shapely geometry back to a WKT by simply calling its wkt
attribute, as shown here:
>>> poly.wkt
'POLYGON ((0.0 0.0, 4.0 0.0, 4.0 4.0, 0.0 4.0, 0.0 0.0), (1.0 1.0, 2.0 1.0, 2.0 2.0, 1.0 2.0, 1.0 1.0))'
Shapely can also handle WKT's binary counterpart, well-known binary (WKB), which is used to store geometries as compact binary objects in databases. Shapely loads WKB using its wkb module in the same way as the wkt module, and any geometry can be exported to binary by accessing its wkb attribute.
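Under the hood, WKB is a simple binary layout: a one-byte byte-order flag, a 32-bit geometry type, then the raw coordinate doubles. The following sketch uses only the standard library struct module (no Shapely required) to pack and unpack a point, illustrating the layout that the wkb module handles for you:

```python
import struct

# WKB point layout: 1 byte order flag (1 = little-endian),
# uint32 geometry type (1 = Point), then two float64 coordinates
wkb = struct.pack("<BIdd", 1, 1, -122.087461, 37.422069)

byte_order, geom_type, x, y = struct.unpack("<BIdd", wkb)
print(geom_type, x, y)  # 1 -122.087461 37.422069
```

A WKB point is therefore exactly 21 bytes: 1 + 4 + 8 + 8.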
Shapely is the most Pythonic way to work with WKT data, but you can also use the Python bindings to the OGR library, which we installed earlier in this chapter.
For this example, we'll use a shapefile with one simple polygon, which can be downloaded as a ZIP file available at the following link: https://github.com/GeospatialPython/Learn/raw/master/polygon.zip
In the following example, we'll open the polygon.shp
file from the shapefile
dataset, call the required GetLayer()
method, get the first (and only) feature, and then export it to WKT:
>>> from osgeo import ogr
>>> shape = ogr.Open("polygon.shp")
>>> layer = shape.GetLayer()
>>> feature = layer.GetNextFeature()
>>> geom = feature.GetGeometryRef()
>>> wkt = geom.ExportToWkt()
>>> wkt
'POLYGON ((-99.904679362176353 51.698147686745074,-75.010398603076666 46.56036851832075,-75.010398603076666 46.56036851832075,-75.010398603076666 46.56036851832075,-76.975736557742451 23.246272688996914,-76.975736557742451 23.246272688996914,-76.975736557742451 23.246272688996914,-114.31715769639194 26.220870210283724,-114.31715769639194 26.220870210283724,-99.904679362176353 51.698147686745074))'
Note that with OGR, you have to access each feature and export it individually, as the ExportToWkt() method operates at the feature level. We can now turn around and read the WKT string back in using the wkt variable containing the export. We'll import it back into ogr and get the bounding box, also known as the envelope, of the polygon, as you can see here:
>>> poly = ogr.CreateGeometryFromWkt(wkt)
>>> poly.GetEnvelope()
(-114.31715769639194, -75.01039860307667, 23.246272688996914, 51.698147686745074)
Shapely and OGR are used for reading and writing valid WKT strings. Of course, because WKT is plain text just like XML, you can also manipulate small amounts of it with ordinary string operations in a pinch.
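For example, the outer ring of the sample polygon can be pulled apart with nothing but string methods. This is a quick-and-dirty sketch, not a substitute for a real parser:

```python
wkt = "POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))"

# Grab the first ring between the double parentheses and the first
# closing parenthesis, then split into coordinate pairs
outer = wkt[wkt.index("((") + 2:wkt.index(")")]
ring = [tuple(float(n) for n in pair.split()) for pair in outer.split(",")]
print(ring)  # [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (0.0, 0.0)]
```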