Parsing log lines into namedtuples

Once we have access to all of the lines of each log file, we can extract details of the access that's described. We'll use a regular expression to decompose the line. From there, we can build a namedtuple object.

Here is a regular expression to parse lines in a CLF file:

import re
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'  # [SIC]
    r'"(?P<user_agent>.+?)"\s*'
)

We can use this regular expression to break each row into a dictionary of nine individual data elements. The use of [] and " to delimit complex fields such as the time, request, referrer, and user_agent parameters can be handled elegantly by transforming the text into a NamedTuple object.
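Here is a small sketch of the pattern in action on a single line; the CLF line itself is invented for illustration:

```python
import re

# The format_pat pattern from the text, repeated so this
# sketch is self-contained.
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# A made-up sample line in Common Log Format.
line = (
    '99.49.32.134 - - [01/Jun/2012:22:17:54 -0400] '
    '"GET /favicon.ico HTTP/1.1" 200 894 "-" "Mozilla/5.0"'
)

match = format_pat.match(line)
if match:
    details = match.groupdict()
    # details is a dictionary with nine keys: host, identity,
    # user, time, request, status, bytes, referer, user_agent.
    print(details["host"], details["status"], details["bytes"])
```

The groupdict() method returns a dictionary keyed by the (?P<name>) group names, which is exactly the shape we need to build a NamedTuple later.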

Each individual access can be summarized as a subclass of NamedTuple, as follows:

from typing import NamedTuple
class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

We've taken pains to ensure that the NamedTuple field names match the regular expression group names in the (?P<name>) constructs for each portion of the record. By making sure the names match, we can very easily transform the parsed dictionary into a tuple for further processing.

Here is the access_iter() function that requires each file to be represented as an iterator over the lines of the file:

from typing import Iterator
def access_iter(
    source_iter: Iterator[Iterator[str]]
) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

The output from the local_gzip() function is a sequence of sequences. The outer sequence is based on the individual log files. For each file, there is a nested iterable sequence of lines. If the line matches the given pattern, it's a file access of some kind. We can create an Access namedtuple from the match dictionary. Non-matching lines are quietly discarded.
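As a sketch of this behavior, we can feed access_iter() a list of line lists standing in for the output of local_gzip(); the log lines here are invented, and the pattern and Access class from the text are repeated so the example is self-contained:

```python
import re
from typing import Iterator, NamedTuple

# Pattern from the text.
format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# NamedTuple from the text.
class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_iter(
    source_iter: Iterator[Iterator[str]]
) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

# Two "files": the second holds a non-matching line,
# which is quietly discarded.
log_1 = [
    '99.49.32.134 - - [01/Jun/2012:22:17:54 -0400] '
    '"GET / HTTP/1.1" 200 894 "-" "Mozilla/5.0"'
]
log_2 = ["this line is not in Common Log Format"]

accesses = list(access_iter([log_1, log_2]))
print(len(accesses))        # 1
print(accesses[0].request)  # GET / HTTP/1.1
```

Note that the nested sequence of sequences is flattened into a single stream of Access objects, exactly as the prose describes.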

The essential design pattern here is to build an immutable object from the results of a parsing function. In this case, the parsing function is a regular expression matcher. Other kinds of parsing will fit this design pattern.

There are some alternative ways to do this. For example, we can use the map() function as follows:

from typing import Optional

def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None

The preceding alternative function embodies just the essential parsing and construction of an Access object. It will either return an Access or a None object. Here is how we can use this function to flatten log files into a single stream of Access objects:

filter(
    None,
    map(
        access_builder,
        (line for log in source_iter for line in log)
    )
)

This shows how we can transform the output from the local_gzip() function into a sequence of Access instances. In this case, we apply the access_builder() function to the nested iterator of the iterable structure that results from reading a collection of files. The filter() function removes None objects from the result of the map() function.
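A self-contained sketch of this map()-and-filter() pipeline, again with invented log lines and repeating the pattern and Access definitions from the text:

```python
import re
from typing import NamedTuple, Optional

format_pat = re.compile(
    r"(?P<host>[\d\.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None

# Nested structure standing in for the local_gzip() output.
source_iter = [
    [
        '99.49.32.134 - - [01/Jun/2012:22:17:54 -0400] '
        '"GET / HTTP/1.1" 200 894 "-" "Mozilla/5.0"',
        "a line of garbage",
    ],
]

# map() produces Access-or-None; filter(None, ...) drops the Nones.
accesses = list(
    filter(
        None,
        map(
            access_builder,
            (line for log in source_iter for line in log)
        )
    )
)
print([a.status for a in accesses])  # ['200']
```

The generator expression flattens the files into a single stream of lines, so the pipeline produces one flat sequence of Access objects regardless of how many files were read.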

Our point here is to show that we have a number of functional styles for parsing files. In Chapter 4, Working with Collections, we looked at very simple parsing. Here, we're performing more complex parsing, using a variety of techniques.
