Parsing a file at a higher level

Once we've parsed the low-level syntax to transform XML to Python, we can restructure the raw data into something usable in our Python program. This kind of structuring applies to XML, JavaScript Object Notation (JSON), CSV, and any of the wide variety of physical formats in which data is serialized.

We'll aim to write a small suite of generator functions that transform the parsed data into a form our application can use. These generator functions apply some simple transformations to the text found by the row_iter_kml() function, which are as follows:

  • Discarding altitude can also be stated as keeping only latitude and longitude
  • Changing the order from (longitude, latitude) to (latitude, longitude)

We can make these two transformations have more syntactic uniformity by defining a utility function, as follows:

def pick_lat_lon(
        lon: Text, lat: Text, alt: Text) -> Tuple[Text, Text]:
    return lat, lon

We've created a function to take three argument values and created a tuple from two of them. The type hints are more complex than the function itself.
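For example, applying this function to a single (longitude, latitude, altitude) triple reverses the first two values and drops the third. The sample coordinate values here are illustrative:

```python
from typing import Text, Tuple

def pick_lat_lon(
        lon: Text, lat: Text, alt: Text) -> Tuple[Text, Text]:
    return lat, lon

# One KML coordinate triple: (longitude, latitude, altitude).
point = ("-76.33029518659048", "37.54901619777347", "0")

# The altitude is discarded; latitude and longitude swap positions.
print(pick_lat_lon(*point))
```

The `*point` syntax unpacks the three-element tuple into the three positional parameters.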

We can use this function as follows:

from typing import Iterable, List, Text, Tuple

Rows = Iterable[List[Text]]
LL_Text = Tuple[Text, Text]

def lat_lon_kml(row_iter: Rows) -> Iterable[LL_Text]:
    return (pick_lat_lon(*row) for row in row_iter)

This function will apply the pick_lat_lon() function to each row from a source iterator. We've used *row to assign each element of the three-element row to a separate parameter of the pick_lat_lon() function. The function can then extract and reorder the two relevant values from each three-tuple.

To simplify the function definition, we've defined two type aliases: Rows and LL_Text. These type aliases can simplify a function definition. They can also be reused to ensure that several related functions are all working with the same types of objects.
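Putting the two functions together, we can see the reordering applied across a whole sequence of rows. The rows here are sample data standing in for what row_iter_kml() would produce; that function isn't shown here:

```python
from typing import Iterable, List, Text, Tuple

def pick_lat_lon(
        lon: Text, lat: Text, alt: Text) -> Tuple[Text, Text]:
    return lat, lon

Rows = Iterable[List[Text]]
LL_Text = Tuple[Text, Text]

def lat_lon_kml(row_iter: Rows) -> Iterable[LL_Text]:
    return (pick_lat_lon(*row) for row in row_iter)

# Sample rows in the (longitude, latitude, altitude) layout
# that row_iter_kml() would yield from the KML source.
rows = [
    ["-76.33029518659048", "37.54901619777347", "0"],
    ["-76.27383399999999", "37.840832", "0"],
]

print(tuple(lat_lon_kml(rows)))
```

Because lat_lon_kml() returns a generator expression, nothing is computed until the tuple() function consumes the results.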

This kind of functional design allows us to freely replace any function with its equivalent, which makes refactoring quite simple. We tried to achieve this goal when we provided alternative implementations of the various functions. In principle, a clever functional language compiler may do some replacements as a part of an optimization pass.

These functions can be combined to parse the file and build a structure we can use. Here's an example of some code that could be used for this purpose:

import urllib.request

url = "file:./Winter%202012-2013.kml"
with urllib.request.urlopen(url) as source:
    v1 = tuple(lat_lon_kml(row_iter_kml(source)))
print(v1)

This script uses the urllib.request module to open a source. In this case, it's a local file. However, we can also open a KML file on a remote server. Our objective in using this kind of file opening is to ensure that our processing is uniform no matter what the source of the data is.

The script is built around the two functions that do low-level parsing of the KML source. The row_iter_kml(source) expression produces a sequence of text columns. The lat_lon_kml() function will extract and reorder the latitude and longitude values. This creates an intermediate result that sets the stage for further processing. The subsequent processing is independent of the original format.

When we run this, we see results such as the following:

(('37.54901619777347', '-76.33029518659048'), 
('37.840832', '-76.27383399999999'),
('38.331501', '-76.459503'),
('38.330166', '-76.458504'),
('38.976334', '-76.47350299999999'))

We've extracted just the latitude and longitude values from a complex XML file using an almost purely functional approach. As the result is iterable, we can continue to use functional programming techniques to process each point that we retrieve from the file.

We've explicitly separated low-level XML parsing from higher-level reorganization of the data. The XML parsing produced a generic tuple of string structure. This is compatible with the output from the CSV parser. When working with SQL databases, we'll have a similar iterable of tuple structures. This allows us to write code for higher-level processing that can work with data from a variety of sources.
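To illustrate this compatibility, here's a sketch of the same lat_lon_kml() function applied to CSV input instead of KML. The CSV data is hypothetical, assuming the same (longitude, latitude, altitude) column layout:

```python
import csv
import io
from typing import Iterable, List, Text, Tuple

def pick_lat_lon(
        lon: Text, lat: Text, alt: Text) -> Tuple[Text, Text]:
    return lat, lon

def lat_lon_kml(
        row_iter: Iterable[List[Text]]
) -> Iterable[Tuple[Text, Text]]:
    return (pick_lat_lon(*row) for row in row_iter)

# Hypothetical CSV data with the same (lon, lat, alt) columns
# that the KML parser produces.
data = io.StringIO(
    "-76.273834,37.840832,0\n"
    "-76.459503,38.331501,0\n"
)

# csv.reader() yields lists of strings, just like row_iter_kml().
print(tuple(lat_lon_kml(csv.reader(data))))
```

The higher-level function is unchanged; only the low-level parser differs between the two sources.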

We'll show you a series of transformations to re-arrange this data from a collection of strings to a collection of waypoints along a route. This will involve a number of transformations. We'll need to restructure the data as well as convert from strings to floating-point values. We'll also look at a few ways to simplify and clarify the subsequent processing steps. We'll use this dataset in later chapters because it's quite complex.
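As a first taste of those transformations, a minimal sketch of the string-to-float conversion step might look like the following. The function name float_lat_lon() is illustrative, not part of the code shown above:

```python
from typing import Iterable, Text, Tuple

def float_lat_lon(
        row_iter: Iterable[Tuple[Text, Text]]
) -> Iterable[Tuple[float, float]]:
    # Convert each (latitude, longitude) pair of strings to floats.
    return (
        (float(lat), float(lon)) for lat, lon in row_iter
    )

points = [("37.54901619777347", "-76.33029518659048")]
print(list(float_lat_lon(points)))
```

Like lat_lon_kml(), this is a lazy generator, so it composes cleanly with the earlier steps of the pipeline.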
