Writing file parsers

We can often consider a file parser to be a kind of reduction. Many languages have two levels of definition: the lower-level tokens in the language and the higher-level structures built from those tokens. When looking at an XML file, the tags, tag names, and attribute names form this lower-level syntax; the structures which are described by XML form a higher-level syntax.

The lower-level lexical scanning is a kind of reduction that takes individual characters and groups them into tokens. This fits well with Python's generator function design pattern. We can often write functions that look as follows:

def lexical_scan(some_source):
    for char in some_source:
        if some pattern completed: yield token
        else: accumulate token
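
As a concrete (and deliberately simple) illustration of this pattern, the following sketch yields whitespace-separated tokens from any iterable of characters. It isn't one of the parsers we'll build below; here the "pattern" that completes a token is simply the end of a run of non-whitespace characters:

from typing import Iterable, Iterator

def simple_scan(some_source: Iterable[str]) -> Iterator[str]:
    """Yield whitespace-separated tokens from an iterable of characters."""
    token = ""
    for char in some_source:
        if char.isspace():
            if token:
                yield token  # a pattern is completed: yield the token
            token = ""
        else:
            token += char  # otherwise, accumulate the token
    if token:
        yield token  # don't lose a trailing token at end-of-input

For example, list(simple_scan("a lazy dog")) produces ['a', 'lazy', 'dog'].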

For our purposes, we'll rely on lower-level file parsers to handle this for us. We'll use the CSV, JSON, and XML packages to manage these details. We'll write higher-level parsers based on these packages.

We'll still rely on a two-level design pattern. A lower-level parser will produce a useful canonical representation of the raw data. It will be an iterator over tuples of text. This is compatible with many kinds of data files. The higher-level parser will produce objects useful for our specific application. These might be tuples of numbers, or namedtuples, or perhaps some other class of immutable Python objects.

We provided one example of a lower-level parser in Chapter 4, Working with Collections. The input was a KML file; KML is an XML representation of geographic information. The essential features of the parser look similar to the following code snippet:

import xml.etree.ElementTree as XML
from typing import cast, Iterator, List, TextIO

def comma_split(text: str) -> List[str]:
    return text.split(",")

def row_iter_kml(file_obj: TextIO) -> Iterator[List[str]]:
    ns_map = {
        "ns0": "http://www.opengis.net/kml/2.2",
        "ns1": "http://www.google.com/kml/ext/2.2"}
    xpath = (
        "./ns0:Document/ns0:Folder/"
        "ns0:Placemark/ns0:Point/ns0:coordinates")
    doc = XML.parse(file_obj)
    return (
        comma_split(cast(str, coordinates.text))
        for coordinates in doc.findall(xpath, ns_map)
    )

The bulk of the row_iter_kml() function is the XML parsing that allows us to use the doc.findall() method to iterate through the <ns0:coordinates> tags in the document. We've used a function named comma_split() to split the text of each of these tags into its three comma-separated values.

The cast() function is only present to provide evidence to mypy that the value of coordinates.text is a str object. The default definition of the text attribute is Union[str, bytes]; in this application, the data will be str exclusively. The cast() function doesn't do any run-time processing; it's a type hint that's used by mypy.
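
To see what the row_iter_kml() function produces, we can feed it a small in-memory document. The following fragment is a hypothetical, hand-written KML sample with invented coordinate values; it is not the actual file used elsewhere in this chapter:

import io

# A minimal, hand-written KML fragment; the coordinate values are invented.
sample_kml = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document><Folder>
    <Placemark><Point>
      <coordinates>-76.330,37.549,0</coordinates>
    </Point></Placemark>
    <Placemark><Point>
      <coordinates>-76.274,37.840,0</coordinates>
    </Point></Placemark>
  </Folder></Document>
</kml>"""

print(list(row_iter_kml(io.StringIO(sample_kml))))
# [['-76.330', '37.549', '0'], ['-76.274', '37.840', '0']]

Each row is a small list of text values; the conversion to numbers is left to the higher-level parser.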

This function focuses on working with the normalized XML structure. The document is close to the database designer's definition of First Normal Form: each attribute is atomic (a single value), and each row in the XML data has the same columns with data of a consistent type. The data values aren't fully atomic, however: we have to split each point's text on the "," character to separate longitude, latitude, and altitude into atomic string values. Because the text is completely consistent, though, it remains a close fit with first normal form.

A large volume of data—XML tags, attributes, and other punctuation—is reduced to a somewhat smaller volume, including just floating-point latitude and longitude values. For this reason, we can think of parsers as a kind of reduction.

We'll need a higher-level set of conversions to map the tuples of text into floating-point numbers. We'd also like to discard the altitude and reorder longitude and latitude. This will produce the application-specific tuple we need. We can use the following functions for this conversion:

from typing import Any, Iterator, Tuple

def pick_lat_lon(
        lon: Any, lat: Any, alt: Any) -> Tuple[Any, Any]:
    return lat, lon

def float_lat_lon(
        row_iter: Iterator[Tuple[str, ...]]
) -> Iterator[Tuple[float, ...]]:
    return (
        tuple(
            map(float, pick_lat_lon(*row))
        )
        for row in row_iter
    )

The essential tool is the float_lat_lon() function. This is a higher-order function that returns a generator expression. The generator uses the map() function to apply the float() conversion to the results of the pick_lat_lon() function. The *row argument assigns each member of the row tuple to a different parameter of the pick_lat_lon() function; this only works when each row is a three-tuple. The pick_lat_lon() function then returns a tuple of the selected items in the required order.
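
As a quick check with a made-up row (not data from the KML file), we can see the reordering and the discarded altitude:

row = ("-76.330", "37.549", "0")  # hypothetical (longitude, latitude, altitude) text
print(list(float_lat_lon(iter([row]))))
# [(37.549, -76.33)]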

We can use this parser as follows:

import urllib.request

name = "file:./Winter%202012-2013.kml"
with urllib.request.urlopen(name) as source:
    trip = tuple(float_lat_lon(row_iter_kml(source)))

This will build a tuple-of-tuples representation of the waypoints along the path in the original KML file. It uses the low-level parser to extract rows of text data from the original representation, and the high-level parser to transform the text items into more useful tuples of floating-point values. In this case, we have not implemented any validation.
