Once we have access to all of the lines of each log file, we can extract details of the access that's described. We'll use a regular expression to decompose the line. From there, we can build a namedtuple object.
Here is a regular expression to parse lines in a CLF file:
import re
format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'  # [SIC]
    r'"(?P<user_agent>.+?)"\s*'
)
We can use this regular expression to break each row into a dictionary of nine individual data elements. The [] and " characters that delimit complex fields such as the time, request, referer, and user_agent parameters are consumed by the pattern, so the remaining text can be transformed elegantly into a NamedTuple object.
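As a quick check of the nine-element dictionary, here is a minimal sketch that applies the pattern to one line. The pattern is repeated so the example is self-contained, and the sample line is an assumption made up for illustration, not data from a real log.

```python
import re

# The pattern from the text, repeated here so the sketch runs on its own.
format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

# Hypothetical sample line, invented for this illustration.
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

match = format_pat.match(sample)
# groupdict() maps each named group to the text it captured.
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(f"{name:>10}: {value}")
```

Note that the brackets around the time and the quotes around the request, referer, and user agent are matched by the pattern but sit outside the named groups, so they do not appear in the captured values.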
Each individual access can be summarized as a subclass of NamedTuple, as follows:
from typing import NamedTuple
class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str
Here is the access_iter() function that requires each file to be represented as an iterator over the lines of the file:
from typing import Iterator
def access_iter(
    source_iter: Iterator[Iterator[str]]
) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())
The output from the local_gzip() function is a sequence of sequences. The outer sequence is based on the individual log files. For each file, there is a nested iterable sequence of lines. If the line matches the given pattern, it's a file access of some kind. We can create an Access namedtuple from the match dictionary. Non-matching lines are quietly discarded.
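To see the nested iteration and the silent discard of non-matching lines in action, the following sketch feeds access_iter() two in-memory "files". The lists log_1 and log_2 are hypothetical stand-ins for what local_gzip() would produce; the log lines themselves are invented for illustration.

```python
import re
from typing import Iterator, NamedTuple

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_iter(
    source_iter: Iterator[Iterator[str]]
) -> Iterator[Access]:
    for log in source_iter:
        for line in log:
            match = format_pat.match(line)
            if match:
                yield Access(**match.groupdict())

# Hypothetical stand-ins for two log files (assumed data, not real logs).
log_1 = [
    '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] '
    '"GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    'this line is not a log entry',  # does not match; silently skipped
]
log_2 = [
    '192.0.2.2 - alice [01/Jan/2024:00:00:02 +0000] '
    '"GET /a HTTP/1.1" 404 - "" "curl/8.0"',
]

source = iter([iter(log_1), iter(log_2)])
accesses = list(access_iter(source))

for access in accesses:
    print(access.host, access.status, access.request)
```

Three lines go in and two Access objects come out. Every field is still a string at this stage, including status and bytes; any conversion to numbers or dates would be a separate step.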
The essential design pattern here is to build an immutable object from the results of a parsing function. In this case, the parsing function is a regular expression matcher. Other kinds of parsing will fit this design pattern.
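As an illustration of another parser fitting the same pattern, here is a small sketch that builds an immutable object from name=value lines using str.partition() instead of a regular expression. The Setting class, the setting_builder() function, and the sample lines are all hypothetical names introduced only for this example.

```python
from typing import List, NamedTuple, Optional

class Setting(NamedTuple):
    """Hypothetical immutable result of parsing one 'name=value' line."""
    name: str
    value: str

def setting_builder(line: str) -> Optional[Setting]:
    # str.partition() plays the role of the parsing function here.
    name, sep, value = line.partition("=")
    if sep and name.strip():
        return Setting(name.strip(), value.strip())
    # No '=' separator or an empty name: not a setting line.
    return None

# Assumed sample input for illustration.
lines = ["host = example.com", "# comment", "port=8080"]

settings: List[Setting] = [
    s for s in map(setting_builder, lines) if s is not None
]
print(settings)
```

The shape is the same as the log parser: a parsing step that may fail, followed by construction of a NamedTuple only when the parse succeeds.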
There are some alternative ways to do this. For example, we can define a function that parses a single line, suitable for use with the map() function, as follows:
from typing import Optional
def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None
The preceding alternative function embodies just the essential parsing and construction of an Access object. It will either return an Access or a None object. Here is how we can use this function to flatten log files into a single stream of Access objects:
filter(
    None,
    map(
        access_builder,
        (line for log in source_iter for line in log)
    )
)
This shows how we can transform the output from the local_gzip() function into a sequence of Access instances. In this case, we apply the access_builder() function to the nested iterator of the iterable structure that results from reading a collection of files. The filter() function removes None objects from the result of the map() function.
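The following sketch runs the map() and filter() combination on a small in-memory collection so the intermediate None values are visible. The source_iter data is an assumption invented for illustration, standing in for the output of local_gzip().

```python
import re
from typing import NamedTuple, Optional

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s+"
    r"(?P<identity>\S+)\s+"
    r"(?P<user>\S+)\s+"
    r"\[(?P<time>.+?)\]\s+"
    r'"(?P<request>.+?)"\s+'
    r"(?P<status>\d+)\s+"
    r"(?P<bytes>\S+)\s+"
    r'"(?P<referer>.*?)"\s+'
    r'"(?P<user_agent>.+?)"\s*'
)

class Access(NamedTuple):
    host: str
    identity: str
    user: str
    time: str
    request: str
    status: str
    bytes: str
    referer: str
    user_agent: str

def access_builder(line: str) -> Optional[Access]:
    match = format_pat.match(line)
    if match:
        return Access(**match.groupdict())
    return None

# Hypothetical stand-in for local_gzip() output (assumed data).
source_iter = [
    [
        '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] '
        '"GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
        'not a log line',
    ],
    [
        '192.0.2.2 - - [01/Jan/2024:00:00:02 +0000] '
        '"GET /a HTTP/1.1" 404 - "-" "curl/8.0"',
    ],
]

# Flatten the files into one stream of lines, then parse each line.
flat_lines = (line for log in source_iter for line in log)
built = list(map(access_builder, flat_lines))  # contains a None entry

# filter(None, ...) keeps only truthy items.
kept = list(filter(None, built))
print(len(built), len(kept))
```

One detail worth keeping in mind: filter(None, ...) drops every falsy value, not only None. That is safe here because an Access instance is a non-empty tuple and is therefore always truthy. For a builder that could return legitimate falsy results, such as 0 or an empty string, an explicit `is not None` test would be the safer choice.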
Our point here is to show that we have a number of functional styles for parsing files. In Chapter 4, Working with Collections, we looked at very simple parsing. Here, we're performing more complex parsing, using a variety of techniques.