Parsing log lines into namedtuples

Once we have access to all of the lines of each log file, we can extract details of the access that's described. We'll use a regular expression to decompose the line. From there, we can build a namedtuple object.

Here is a regular expression to parse lines in a CLF file:

import re
format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+' # [SIC]
    r'"(?P<user_agent>.+?)"\s*'
)

We can use this regular expression to break each row into a dictionary of nine individual data elements. The use of [] and " to delimit complex fields such as the time, request, referer, and user_agent parameters can be handled elegantly by transforming the text into a NamedTuple object.
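
As a quick check of the pattern, the following sketch applies it to a single made-up log line and prints the resulting dictionary. The sample line (host, timestamp, and user agent) is an illustrative assumption, not taken from a real log, and the pattern is repeated so that the block runs on its own:

```python
import re

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# A made-up line carrying the nine fields the pattern expects.
sample_line = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"-" "Mozilla/5.0 (X11; Linux x86_64)"'
)

match = format_pat.match(sample_line)
# groupdict() maps each (?P<name>...) group to the text it captured.
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name:>10}: {value}")
```

Every value in the dictionary is a string, including status and bytes; any conversion to numbers or dates would be a separate step.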

Each individual access can be summarized as a subclass of NamedTuple, as follows:

from typing import NamedTuple
class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

We've taken pains to ensure that the NamedTuple field names match the regular expression group names in the (?P<name>) constructs for each portion of the record. By making sure the names match, we can very easily transform the parsed dictionary into a tuple for further processing.
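
One way to confirm that the names line up is to compare the pattern's group names with the tuple's field names. The following is a minimal sketch of such a check; it is our own addition, and it repeats the pattern and the class so that it runs on its own. If the names differed, a call such as Access(**match.groupdict()) would raise a TypeError because of an unexpected or missing keyword argument.

```python
import re
from typing import NamedTuple

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# Pattern.groupindex maps each named group to its group number.
group_names = set(format_pat.groupindex)
field_names = set(Access._fields)

# Both differences should be empty when the names line up.
print("fields without a group:", field_names - group_names)
print("groups without a field:", group_names - field_names)
```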

Here is the access_iter() function that requires each file to be represented as an iterator over the lines of the file:

from typing import Iterator
def access_iter(
        source_iter: Iterator[Iterator[str]]
    ) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

The output from the local_gzip() function is a sequence of sequences. The outer sequence is based on the individual log files. For each file, there is a nested iterable sequence of lines. If the line matches the given pattern, it's a file access of some kind. We can create an Access namedtuple from the match dictionary. Non-matching lines are quietly discarded.
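
To see the discarding behavior, here is a small sketch that feeds access_iter() two in-memory "files" instead of the real output of local_gzip(). The sample lines and the fake_logs name are assumptions made for illustration; the pattern, class, and function are repeated so that the block is self-contained:

```python
import re
from typing import Iterator, NamedTuple

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_iter(
        source_iter: Iterator[Iterator[str]]
    ) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

# Made-up lines, for illustration only.
good_1 = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)
good_2 = (
    '192.0.2.11 - alice [10/Oct/2023:13:56:01 +0000] '
    '"GET /about.html HTTP/1.1" 404 512 '
    '"http://example.com/" "curl/8.0"'
)

# Two stand-in "files"; each contains one line that does not match.
fake_logs = iter([
    iter([good_1, "not a log line"]),
    iter(["", good_2]),
])

accesses = list(access_iter(fake_logs))
for access in accesses:
    print(access.host, access.status, access.request)
```

Only the two well-formed lines produce Access objects; the empty line and the free-text line are skipped without any error.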

The essential design pattern here is to build an immutable object from the results of a parsing function. In this case, the parsing function is a regular expression matcher. Other kinds of parsing will fit this design pattern.

There are some alternative ways to do this. For example, we can use the map() function as follows:

from typing import Optional
def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None

The preceding alternative function embodies just the essential parsing and construction of an Access object. It will either return an Access or a None object. Here is how we can use this function to flatten log files into a single stream of Access objects:

filter(
    None,
    map(
        access_builder,
        (line for log in source_iter for line in log)
    )
)

This shows how we can transform the output from the local_gzip() function into a sequence of Access instances. In this case, we apply the access_builder() function to the nested iterator of the iterable structure that results from reading a collection of files. The filter() function removes None objects from the result of the map() function.
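
The following sketch puts the map() and filter() pieces together and runs them on in-memory data. The sample lines and the names access_stream and fake_logs are hypothetical, introduced only for this illustration. Note that filter(None, ...) drops every falsy item; this is safe here because an Access tuple with nine fields is never empty and so is always truthy.

```python
import re
from typing import Iterator, NamedTuple, Optional

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None

def access_stream(
        source_iter: Iterator[Iterator[str]]
    ) -> Iterator[Access]:
    # Hypothetical wrapper: flatten the files, parse, drop the None results.
    lines = (line for log in source_iter for line in log)
    return filter(None, map(access_builder, lines))

# Made-up lines, for illustration only.
good_1 = (
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)
good_2 = (
    '192.0.2.11 - - [10/Oct/2023:13:56:01 +0000] '
    '"POST /login HTTP/1.1" 302 0 "-" "curl/8.0"'
)
fake_logs = iter([iter([good_1, "garbage"]), iter([good_2])])

results = list(access_stream(fake_logs))
for access in results:
    print(access.host, access.request)
```

Both map() and filter() are lazy, so no line is parsed until the result is consumed, here by list().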

Our point here is to show that we have a number of functional styles for parsing files. In Chapter 4, Working with Collections, we looked at very simple parsing. Here, we're performing more complex parsing, using a variety of techniques.
