Parsing log files – gathering the rows

Here is the first stage in parsing a large number of files: reading each file and producing a simple sequence of lines. As the log files are compressed with gzip, we need to open each file with the gzip.open() function instead of the io.open() function or the __builtins__.open() function.

The local_gzip() function reads lines from locally cached files, as shown in the following code snippet:

import glob
import gzip
from typing import Iterator

def local_gzip(pattern: str) -> Iterator[Iterator[str]]:
    zip_logs = glob.glob(pattern)
    for zip_file in zip_logs:
        with gzip.open(zip_file, "rb") as log:
            yield (
                line.decode('us-ascii').rstrip()
                for line in log
            )

The preceding function iterates through all files that match the given pattern. For each file, the yielded value is a generator expression that will iterate through all lines within that file. We've encapsulated several things, including wildcard file matching, the details of opening a gzip-compressed log file, and breaking a file into a sequence of lines without any trailing newline (\n) characters.

The essential design pattern here is to yield a generator expression for each file. The preceding function can be restated as a function and a mapping that applies that function to each file, which can be useful in the rare case when individual files need to be identified. In some cases, this can be optimized to use yield from to make all of the various log files appear to be a single stream of lines.
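As a minimal sketch of that optimization (the flatten_gzip() name is an assumption, not part of the original design), yield from can flatten the per-file generators into one iterator of lines:

import glob
import gzip
from typing import Iterator

def flatten_gzip(pattern: str) -> Iterator[str]:
    # Hypothetical variant: yield from merges each file's lines
    # into a single flat stream, at the cost of losing track of
    # which file a given line came from.
    for zip_file in glob.glob(pattern):
        with gzip.open(zip_file, "rb") as log:
            yield from (
                line.decode('us-ascii').rstrip()
                for line in log
            )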

There are several other ways to produce similar output. For example, here is an alternative version of the inner for loop in the preceding example. The line_iter() function will also emit lines of a given file:

def line_iter(zip_file: str) -> Iterator[str]:
    log = gzip.open(zip_file, "rb")
    return (line.decode('us-ascii').rstrip() for line in log)

The line_iter() function applies the gzip.open() function and some line cleanup. We can use the map() function to apply the line_iter() function to all files that match a pattern, as follows:

map(line_iter, glob.glob(pattern))  

While this alternative mapping is succinct, it has the disadvantage of leaving open file objects lying around waiting to be properly garbage-collected when there are no more references. When processing a large number of files, this seems like a needless bit of overhead. For this reason, we'll focus on the local_gzip() function shown previously.

The previous alternative mapping has the distinct advantage of fitting in well with the way the multiprocessing module works. We can create a worker pool and map tasks (such as file reading) to the pool of processes. If we do this, we can read these files in parallel; the open file objects will be part of separate processes.
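Here is a minimal sketch of that idea, assuming a hypothetical count_lines() task in place of real analysis and an assumed *.gz file pattern:

import glob
import gzip
from multiprocessing import Pool

def count_lines(zip_file: str) -> int:
    # Hypothetical per-file task: the file is opened, read, and
    # summarized entirely within one worker process.
    with gzip.open(zip_file, "rb") as log:
        return sum(1 for line in log)

if __name__ == "__main__":
    with Pool() as workers:
        # Each matching file becomes one task in the pool.
        counts = workers.map(count_lines, glob.glob("*.gz"))
        print(sum(counts))

Note that each worker must return a picklable result; a generator of lines cannot be sent back across process boundaries, so the worker summarizes its file rather than returning the raw lines.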

An extension to this design will include a second function to transfer files from the web host using FTP. As the files are collected from the web server, they can be analyzed using the local_gzip() function.

The results of the local_gzip() function are used by the access_iter() function to create a named tuple for each row in the source file that describes a file access.
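As a rough illustration of that idea (the Access field names, the simplified log pattern, and the access_iter_sketch() name here are assumptions, not the actual access_iter() definition), each line can be matched against a log-format pattern and packaged into a named tuple:

import re
from typing import Iterator, NamedTuple

class Access(NamedTuple):
    # Hypothetical fields for a simplified access-log entry.
    host: str
    time: str
    request: str

# Assumed pattern: a simplified common-log-format line.
format_pat = re.compile(
    r'(?P<host>\S+) .* \[(?P<time>.+?)\] "(?P<request>.+?)"'
)

def access_iter_sketch(
    source: Iterator[Iterator[str]]
) -> Iterator[Access]:
    for log_lines in source:
        for line in log_lines:
            match = format_pat.search(line)
            if match:
                yield Access(**match.groupdict())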
