Python markup and tag-based parsers

Tag-based data, particularly the various XML dialects, has become a very popular way to distribute geospatial data. Formats that are both machine and human readable are generally easy to work with, though they sacrifice storage efficiency for usability. These formats can become unmanageable for very large datasets but work very well in most cases.

While most of these formats are some form of XML (such as KML or GML), there is a notable exception. The well-known text (WKT) format is fairly common but delimits data with keywords and brackets rather than the angled-bracket tags that XML uses.

Python has standard library support for XML as well as some excellent third-party libraries. Proper XML formats all follow the same structure, so you can use a generic XML library to read them. Because XML is text-based, it is often easy to write it as a string instead of using an XML library, and the vast majority of applications that output XML do so in this way. The primary advantage of using an XML library for writing XML is that your output is usually validated. It is very easy to introduce an error while creating your own XML output: a single missing quotation mark can derail an XML parser and throw an error for somebody trying to read your data, virtually rendering the dataset useless. You will find that this problem is very common among XML-based geospatial data, and that some parsers are more forgiving of incorrect XML than others. Often, reliability is more important than speed or memory efficiency. The analysis available at http://lxml.de/performance.html provides benchmarks for memory and speed among the different Python XML parsers.

The minidom module

The Python minidom module is a very old, simple-to-use XML parser. It is part of Python's built-in set of XML tools in the xml package. It can parse XML files or XML fed in as a string. The minidom module is best for small to medium-sized XML documents of less than about 20 megabytes; beyond that, parsing speed begins to drop.

To demonstrate the minidom module, we'll use a sample KML file from Google's KML documentation, which you can download. The data available at the following link represents time-stamped point locations transferred from a GPS device:

https://github.com/GeospatialPython/Learn/raw/master/time-stamp-point.kml

First, we'll parse this data by reading it in from the file and creating a minidom parser object. The file contains a series of <Placemark> tags, which contain a point and a timestamp at which that point was collected. So, we'll get a list of all of the Placemarks in the file, and we can count them by checking the length of that list, as shown in the following lines of code:

>>> from xml.dom import minidom
>>> kml = minidom.parse("time-stamp-point.kml")
>>> Placemarks = kml.getElementsByTagName("Placemark")
>>> len(Placemarks)
361

As you can see, we retrieved all of the Placemark elements, which total 361. Now, let's take a look at the first Placemark element in the list:

>>> Placemarks[0]
<DOM Element: Placemark at 0x2045a30>

Each <Placemark> tag is now a DOM Element data type. To really see what that element is, we call the toxml() method as you can see in the following lines of code:

>>> Placemarks[0].toxml()
u'<Placemark>\n  <TimeStamp>\n    <when>2007-01-14T21:05:02Z</when>\n  </TimeStamp>\n  <styleUrl>#paddle-a</styleUrl>\n  <Point>\n    <coordinates>-122.536226,37.86047,0</coordinates>\n  </Point>\n</Placemark>'

The toxml() method outputs everything contained in the Placemark tag as a string object. If we wanted to print this information to a text file, we could call the toprettyxml() method instead, which adds additional indentation to make the XML more readable.

Now, what if we want to grab just the coordinates from this Placemark? The coordinates are buried inside the coordinates tag, which is contained in the Point tag and nested inside the Placemark tag. Each element of a minidom object is called a node. Nested nodes are called children or child nodes. The child nodes include more than just tags; they also include the whitespace separating tags as well as the data inside tags. So, we can drill down to the coordinates tag using the tag name, but then we'll need to access its text node. All minidom elements have a childNodes list as well as a firstChild attribute for accessing the first node. We'll combine these to reach the data attribute of the text node inside the first coordinates tag, which we reference using index 0 in the list of coordinates tags:

>>> coordinates = Placemarks[0].getElementsByTagName("coordinates")
>>> point = coordinates[0].firstChild.data
>>> point
u'-122.536226,37.86047,0'

If you're new to Python, you'll notice that the text output in these examples is tagged with the letter u. This is how Python 2 denotes Unicode strings, which support internationalization to multiple languages with different character sets. Python 3 changes this convention slightly: Unicode strings are left unmarked, while byte strings are marked with a b.

We can go a little further and convert this point string into usable data by splitting the string and converting the resulting values into Python float types, as shown here:

>>> x,y,z = point.split(",")
>>> x
u'-122.536226'
>>> y
u'37.86047'
>>> z
u'0'
>>> x = float(x)
>>> y = float(y)
>>> z = float(z)
>>> x,y,z
(-122.536226, 37.86047, 0.0)

Using a Python list comprehension, we can perform this operation in a single step, as you can see in the following lines of code:

>>> x,y,z = [float(c) for c in point.split(",")]
>>> x,y,z
(-122.536226, 37.86047, 0.0)
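To process the whole file rather than a single Placemark, the same calls can be wrapped in a loop. Here is a minimal, self-contained sketch that applies the pattern to a tiny inline KML sample shaped like time-stamp-point.kml (the inline document and the extract_points() helper are our own illustrations, not part of the downloaded file):

```python
from xml.dom import minidom

# A tiny inline sample in the same shape as time-stamp-point.kml
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <TimeStamp><when>2007-01-14T21:05:02Z</when></TimeStamp>
    <Point><coordinates>-122.536226,37.86047,0</coordinates></Point>
  </Placemark>
  <Placemark>
    <TimeStamp><when>2007-01-14T21:05:20Z</when></TimeStamp>
    <Point><coordinates>-122.536422,37.860303,0</coordinates></Point>
  </Placemark>
</kml>"""

def extract_points(doc):
    """Return a list of (timestamp, x, y, z) tuples from a parsed KML DOM."""
    points = []
    for pm in doc.getElementsByTagName("Placemark"):
        when = pm.getElementsByTagName("when")[0].firstChild.data
        coords = pm.getElementsByTagName("coordinates")[0].firstChild.data
        x, y, z = [float(c) for c in coords.split(",")]
        points.append((when, x, y, z))
    return points

points = extract_points(minidom.parseString(KML))
print(points[0])
```

The same extract_points() function works unchanged on the full 361-placemark file parsed with minidom.parse().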

This example scratches the surface of what the minidom library can do. For a great tutorial on this library, take a look at Section 9.3, Parsing XML, of the excellent book Dive Into Python by Mark Pilgrim, which is available in print or online at http://www.diveintopython.net/xml_processing/parsing_xml.html.

ElementTree

The minidom module is pure Python, easy to work with, and has been around since Python 2.0. However, Python 2.5 added a more efficient, equally high-level XML parser to the standard library called ElementTree. ElementTree is interesting because it has been implemented in multiple versions. There is a pure Python version and a faster version written in C called cElementTree. You should use cElementTree wherever possible, but you may be on a platform that doesn't include the C-based version. When you import cElementTree, you can test whether it's available and fall back to the pure Python version if necessary (since Python 3.3, the plain import uses the C implementation automatically when available, so this idiom mostly matters on older versions):

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET  

One of the great features of ElementTree is its implementation of a subset of the XPath query language. XPath is short for XML Path and allows you to search an XML document using a path-style syntax. If you work with XML frequently, learning XPath is essential. You can learn more about XPath at the following link:

http://www.w3schools.com/xsl/xpath_intro.asp

One catch with this feature is if the document specifies a namespace, as most XML documents do, you must insert that namespace into queries. ElementTree does not automatically handle the namespace for you. Your options are to manually specify it or try to extract it using string parsing from the root element's tag name.
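One common trick for the second option is to slice the namespace out of the root element's tag, since ElementTree prefixes every tag with its namespace in {uri} form. A short sketch (the inline KML string is just a stand-in for a parsed document):

```python
import xml.etree.ElementTree as ET

KML = ('<kml xmlns="http://www.opengis.net/kml/2.2">'
       '<Placemark><name>Office</name></Placemark></kml>')

root = ET.fromstring(KML)
# root.tag is '{http://www.opengis.net/kml/2.2}kml', so the namespace
# is everything from the opening brace through the closing brace.
ns = root.tag[root.tag.find("{"): root.tag.find("}") + 1]
print(ns)  # {http://www.opengis.net/kml/2.2}
```

The extracted ns string can then be dropped into XPath expressions exactly as the manually defined namespace is in the next example.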

We'll repeat the minidom XML parsing example using ElementTree. First, we'll parse the document and manually define the KML namespace; then, we'll use an XPath expression with the find() method to locate the first Placemark element. Finally, we'll find the nested coordinates child node and grab its text, which contains the longitude and latitude. In both cases, we could have searched directly for the coordinates tag; but by grabbing the Placemark element, we keep the option of grabbing the corresponding timestamp child element later, if we choose, as shown in the following lines of code:

>>> tree = ET.ElementTree(file="time-stamp-point.kml")
>>> ns = "{http://www.opengis.net/kml/2.2}"
>>> placemark = tree.find(".//%sPlacemark" % ns)
>>> coordinates = placemark.find("./{}Point/{}coordinates".format(ns, ns))
>>> coordinates.text
'-122.536226,37.86047,0'

In this example, notice that we used two flavors of Python string formatting. When we defined the XPath expression for the placemark variable, we used the older %-style formatting, which is based on the string formatting concept found in C: the %s placeholder marks where a string will be inserted, and the % operator after the string supplies the ns namespace variable. For the coordinates variable, we used the newer str.format() method instead: each {} placeholder is filled by the corresponding argument, so we pass ns twice to fill both placeholders.
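Both styles produce the same result, so the choice is largely a matter of taste. A quick sketch comparing them with the KML namespace:

```python
ns = "{http://www.opengis.net/kml/2.2}"

# Older C-style %-formatting with a %s placeholder
old_style = ".//%sPlacemark" % ns
# Newer str.format() with a {} placeholder
new_style = ".//{}Placemark".format(ns)

print(old_style == new_style)  # True
print(old_style)
```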

Tip

String formatting is a simple yet extremely powerful and useful tool in Python, which is worth learning. You can find more information in Python's documentation online at the following link:

https://docs.python.org/3.4/library/string.html

Building XML

Most of the time, XML can be built by concatenating strings, as shown in the following code:

xml = '<?xml version="1.0" encoding="utf-8"?>'
xml += '<kml xmlns="http://www.opengis.net/kml/2.2">'
xml += "  <Placemark>"
xml += "    <name>Office</name>"
xml += "    <description>Office Building</description>"
xml += "    <Point>"
xml += "      <coordinates>"
xml += "        -122.087461,37.422069"
xml += "      </coordinates>"
xml += "    </Point>"
xml += "  </Placemark>"
xml += "</kml>"

But this method is quite prone to typos, which create invalid XML documents. A safer way is to use an XML library. Let's build this simple KML document using ElementTree. We'll define the root kml element and assign it a namespace. Then, we'll systematically append subelements to the root, and finally wrap the elements in an ElementTree object, declare the XML encoding, and write it out to a file called placemark.kml, as shown in the following lines of code:

>>> root = ET.Element("kml")
>>> root.attrib["xmlns"] = "http://www.opengis.net/kml/2.2"
>>> placemark = ET.SubElement(root, "Placemark")
>>> office = ET.SubElement(placemark, "name")
>>> office.text = "Office"
>>> desc = ET.SubElement(placemark, "description")
>>> desc.text = "Office Building"
>>> point = ET.SubElement(placemark, "Point")
>>> coordinates = ET.SubElement(point, "coordinates")
>>> coordinates.text = "-122.087461,37.422069"
>>> tree = ET.ElementTree(root)
>>> tree.write("placemark.kml", xml_declaration=True, encoding='utf-8', method="xml")

The output is essentially identical to the previous string-building example, except that ElementTree does not indent the tags but rather writes the document as one long string. The minidom module has a similar interface, which is documented in the book Dive Into Python, by Mark Pilgrim, referenced in the minidom example that we just saw.
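One way to convince yourself that the library really produced well-formed XML is to round-trip it: serialize the tree to a string and parse it straight back. This sketch uses ET.tostring() and ET.fromstring() on an in-memory copy of the same document rather than the file on disk:

```python
import xml.etree.ElementTree as ET

# Rebuild the same simple KML document in memory
root = ET.Element("kml")
root.attrib["xmlns"] = "http://www.opengis.net/kml/2.2"
placemark = ET.SubElement(root, "Placemark")
name = ET.SubElement(placemark, "name")
name.text = "Office"
point = ET.SubElement(placemark, "Point")
coordinates = ET.SubElement(point, "coordinates")
coordinates.text = "-122.087461,37.422069"

# Serialize to bytes and parse right back -- if the parse succeeds,
# the library produced well-formed XML.
data = ET.tostring(root, encoding="utf-8")
reparsed = ET.fromstring(data)
coords = reparsed.find(".//{http://www.opengis.net/kml/2.2}coordinates")
print(coords.text)  # -122.087461,37.422069
```

Note that on re-parsing, the xmlns attribute becomes a real namespace declaration, so the find() call must include the namespace prefix just as in the earlier query examples.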

XML parsers such as minidom and ElementTree work very well on perfectly formatted XML documents. Unfortunately, the vast majority of XML documents out there don't follow the rules and contain formatting errors or invalid characters. You'll find that you are often forced to work with this data and must resort to extraordinary string parsing techniques to get the small subset of data you actually need. But thanks to Python and BeautifulSoup, you can elegantly work with bad and even terrible, tag-based data.

BeautifulSoup is a module specifically designed to robustly handle broken XML. It is oriented towards HTML, which is notorious for incorrect formatting but works with other XML dialects too. BeautifulSoup is available on PyPI, so use either easy_install or pip to install it, as you can see in the following command:

easy_install beautifulsoup4

Or you can execute the following command:

pip install beautifulsoup4

Then, to use it, you simply import it:

>>> from bs4 import BeautifulSoup

To try it out, we'll use a GPS Exchange Format (GPX) tracking file from a smartphone application, which has a glitch and exports slightly broken data. You can download this sample file which is available at the following link:

https://raw.githubusercontent.com/GeospatialPython/Learn/master/broken_data.gpx

This 2,347-line data file is in pristine condition except that it is missing a closing </trkseg> tag, which should be located at the very end of the file just before the closing </trk> tag. This error was caused by a data export function in the source program, most likely because the original developer generated the GPX XML by hand on export and forgot the line of code that adds this closing tag. Watch what happens if we try to parse this file with minidom:

>>> gpx = minidom.parse("broken_data.gpx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\dom\minidom.py", line 1914, in parse
    return expatbuilder.parse(file)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "C:\Python34\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 2346, column 2

As you can see from the last line of the error message, the underlying XML parser in minidom knows exactly what the problem is: a mismatched tag right at the end of the file. But it refuses to do anything more than report the error. It demands perfectly formed XML or nothing at all.

Now, let's try the more sophisticated and efficient ElementTree module with the same data:

>>> ET.ElementTree(file="broken_data.gpx")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 611, in __init__
    self.parse(file)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 653, in parse
    parser.feed(data)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python34\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: mismatched tag: line 2346, column 2

As you can see, different parsers hit the same wall. Poorly formed XML is an all-too-common reality in geospatial analysis, and nearly every XML parser assumes that the XML it receives is perfect. Enter BeautifulSoup. This library shreds bad XML into usable data without a second thought, and it can handle far worse defects than missing tags. It will carry on despite missing punctuation or other syntax errors and give you the best data it can. It was originally developed for parsing HTML, which is notorious for being poorly formed, but it works fairly well with XML too, as shown here:

>>> from bs4 import BeautifulSoup
>>> gpx = open("broken_data.gpx")
>>> soup = BeautifulSoup(gpx.read(), features="xml")
>>>

No complaints from BeautifulSoup! Just to make sure the data is actually usable, let's try to access some of it. One of the fantastic features of BeautifulSoup is that it turns tags into attributes of the parse tree. If there are multiple tags with the same name, it grabs the first one. Our sample data file has hundreds of <trkpt> tags. Let's access the first one:

>>> soup.trkpt
<trkpt lat="30.307267000" lon="-89.332444000"><ele>10.7</ele><time>2013-05-16T04:39:46Z</time></trkpt>

We're now certain that the data has been parsed correctly and we can access it. If we want to access all of the <trkpt> tags, we can use the findAll() method to grab them and then use the built-in Python len() function to count them, as shown here:

>>> tracks = soup.findAll("trkpt")
>>> len(tracks)
2321

If we write the parsed data back out to a file, BeautifulSoup outputs the corrected version. We'll save the fixed data as a new GPX file using BeautifulSoup module's prettify() method to format the XML with nice indentation, as you can see in the following lines of code:

>>> fixed = open("fixed_data.gpx", "w")
>>> fixed.write(soup.prettify())
>>> fixed.close()

BeautifulSoup is a very rich library with many more features. To explore it further, visit the BeautifulSoup documentation online at www.crummy.com/software/BeautifulSoup/bs4/doc/.

Tip

While minidom, ElementTree, and cElementTree come with the Python standard library, there is an even more powerful and popular XML library for Python called lxml. The lxml module provides a Pythonic interface to the libxml2 and libxslt C libraries using the ElementTree API. Even better, lxml also works with BeautifulSoup to parse bad tag-based data; on some installations, BeautifulSoup4 may require lxml. The lxml module is available via PyPI but requires some additional steps for the C libraries. More information is available on the lxml homepage at the following link:

http://lxml.de/

Well-known text (WKT)

The WKT format has been around for years and is a simple, text-based format for representing geometries and spatial reference systems. It is primarily used as a data-exchange format by systems that implement the OGC Simple Features for SQL specification. Take a look at the following sample WKT representation of a polygon:

POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))

Currently, the best way to read and write WKT is the Shapely library. Shapely provides a very Python-oriented or Pythonic interface to the Geometry Engine – Open Source (GEOS) library described in Chapter 3, The Geospatial Technology Landscape.

You can install Shapely using either easy_install or pip. You can also use the wheel from the site mentioned in the previous section. Shapely has a wkt module that can load and export this data. Let's use Shapely to load the previous polygon sample and then verify that it has been loaded as a polygon object by calculating its area:

>>> import shapely.wkt
>>> wktPoly = "POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1, 2 1, 2 2, 1 2,1 1))"
>>> poly = shapely.wkt.loads(wktPoly)
>>> poly.area
15.0
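The value 15.0 checks out by hand: the outer ring is a 4x4 square (area 16) and the inner ring is a 1x1 hole (area 1). A stdlib-only sketch using the shoelace formula (the ring_area() helper is our own, not part of Shapely):

```python
def ring_area(coords):
    """Shoelace formula for a closed ring of (x, y) tuples."""
    area = 0.0
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

outer = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
hole = [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]
print(ring_area(outer) - ring_area(hole))  # 15.0
```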

We can convert any Shapely geometry back to a WKT by simply calling its wkt attribute, as shown here:

>>> poly.wkt
'POLYGON ((0.0 0.0, 4.0 0.0, 4.0 4.0, 0.0 4.0, 0.0 0.0), (1.0 1.0, 2.0 1.0, 2.0 2.0, 1.0 2.0, 1.0 1.0))'   

Shapely can also handle WKT's binary counterpart, called well-known binary (WKB), which is used to store geometries as compact binary objects in databases. Shapely loads WKB using its wkb module, in the same way as the wkt module, and it can export geometries by calling that object's wkb attribute.
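Under the hood, WKB has a very simple layout: a byte-order flag, an unsigned 32-bit geometry type code, and then the coordinates as IEEE 754 doubles. A stdlib-only sketch that hand-builds the WKB for a 2D point using the struct module (no Shapely required):

```python
import struct

# WKB for POINT(-122.087461 37.422069):
# byte-order flag (1 = little-endian), geometry type (1 = Point),
# then x and y as 8-byte doubles -- 21 bytes in total.
wkb = struct.pack("<BIdd", 1, 1, -122.087461, 37.422069)
print(wkb.hex())

# Decode it back
order, gtype, x, y = struct.unpack("<BIdd", wkb)
print(gtype, x, y)  # 1 -122.087461 37.422069
```

Real libraries like Shapely handle the many other geometry types and nesting rules for you, but seeing the layout makes it clear why WKB is so much more compact than its text counterpart.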

Shapely is the most Pythonic way to work with WKT data, but you can also use the Python bindings to the OGR library, which we installed earlier in this chapter.

For this example, we'll use a shapefile with one simple polygon, which can be downloaded as a ZIP file available at the following link: https://github.com/GeospatialPython/Learn/raw/master/polygon.zip

In the following example, we'll open the polygon.shp file from the shapefile dataset, call the required GetLayer() method, get the first (and only) feature, and then export it to WKT:

>>> from osgeo import ogr
>>> shape = ogr.Open("polygon.shp")
>>> layer = shape.GetLayer()
>>> feature = layer.GetNextFeature()
>>> geom = feature.GetGeometryRef()
>>> wkt = geom.ExportToWkt()
>>> wkt
'POLYGON ((-99.904679362176353 51.698147686745074,-75.010398603076666 46.56036851832075,-75.010398603076666 46.56036851832075,-75.010398603076666 46.56036851832075,-76.975736557742451 23.246272688996914,-76.975736557742451 23.246272688996914,-76.975736557742451 23.246272688996914,-114.31715769639194 26.220870210283724,-114.31715769639194 26.220870210283724,-99.904679362176353 51.698147686745074))'

Note that with OGR you have to access each feature and export it individually, because the ExportToWkt() method exists at the feature level. We can now turn around and read a WKT string back in using the wkt variable containing the export. We'll import it into ogr and get the bounding box, also known as the envelope, of the polygon, as you can see here:

>>> poly = ogr.CreateGeometryFromWkt(wkt)
>>> poly.GetEnvelope()
(-114.31715769639194, -75.01039860307667, 23.246272688996914, 51.698147686745074)

Shapely and OGR are used for reading and writing valid WKT strings. Of course, just like XML, which is also text, you could manipulate small amounts of WKT as strings in a pinch.
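For example, a quick-and-dirty bounding box can be pulled straight out of a WKT string with nothing but string methods. This is a throwaway sketch for small snippets; Shapely and OGR remain the right tools for real work:

```python
# Strip the keyword and brackets, leaving bare coordinate pairs
wkt = "POLYGON((0 0,4 0,4 4,0 4,0 0),(1 1,2 1,2 2,1 2,1 1))"
nums = wkt.replace("POLYGON", "").replace("(", " ").replace(")", " ")
coords = [tuple(map(float, pair.split())) for pair in nums.split(",")]
xs = [x for x, y in coords]
ys = [y for x, y in coords]
print(min(xs), max(xs), min(ys), max(ys))  # 0.0 4.0 0.0 4.0
```

This works only because the sample is tiny and well formed; any nesting beyond simple rings, or extra whitespace conventions, would break the naive parse, which is exactly why the libraries exist.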
