Parsing a properties file

There's no built-in properties parser in the Python standard library. We can download a properties file parser from the Python Package Index (https://pypi.python.org/pypi). However, it's not a very complex class, and it's a good exercise in advanced object-oriented programming.

We'll break the class down into the top-level API functions and the lower-level parsing functions. Here are some of the overall API methods:

import re
from pathlib import Path
from typing import IO, Iterator, Tuple

class PropertyParser:

    def read_string(self, data: str) -> Iterator[Tuple[str, str]]:
        return self._parse(data)

    def read_file(self, file: IO[str]) -> Iterator[Tuple[str, str]]:
        data = file.read()
        return self.read_string(data)

    def read(self, path: Path) -> Iterator[Tuple[str, str]]:
        with path.open("r") as file:
            return self.read_file(file)

The essential feature here is that it will parse a filename, a file, or a block of text. This follows the design pattern from configparser. A common alternative is to have fewer methods and use isinstance() to determine the type of the argument and to also determine what processing to perform on it.

Filenames are given as Path objects. A file is generally an instance of io.TextIOBase, for which the typing module provides the IO[str] hint. A block of text is a string, but a filename can also be given as a string, so isinstance() alone cannot tell the two apart. For this reason, many libraries use load() to work with files or filenames and use loads() to work with a simple string. Something like this would echo the design pattern of json:

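# Additional imports needed by these PropertyParser methods.
import io
from typing import TextIO, Union, cast

def load(self, file_name_or_path: Union[TextIO, str, Path]) -> Iterator[Tuple[str, str]]: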
    if isinstance(file_name_or_path, io.TextIOBase):
        return self.loads(file_name_or_path.read())
    else:
        name_or_path = cast(Union[str, Path], file_name_or_path)
        with Path(name_or_path).open("r") as file:
            return self.loads(file.read())

def loads(self, data: str) -> Iterator[Tuple[str, str]]:
    return self._parse(data)

These methods will also handle a file, filename, or block of text. When a file is provided, it can be read and parsed. When a path or a string is provided, it's used to open a file with the given name. These extra methods give us an alternative API that might be easier to work with. The deciding factor is achieving a coherent design among the various libraries, packages, and modules. Here's the _parse() method:

key_element_pat = re.compile(r"(.*?)\s*(?<!\\)[:=\s]\s*(.*)")

def _parse(self, data: str) -> Iterator[Tuple[str, str]]:
    logical_lines = (
        line.strip() for line in re.sub(r"\\\n\s*", "", data).splitlines()
    )
    non_empty = (line for line in logical_lines if len(line) != 0)
    non_comment = (
        line
        for line in non_empty
        if not (line.startswith("#") or line.startswith("!"))
    )
    for line in non_comment:
        ke_match = self.key_element_pat.match(line)
        if ke_match:
            key, element = ke_match.group(1), ke_match.group(2)
        else:
            key, element = line, ""
        key = self._escape(key)
        element = self._escape(element)
        yield key, element

This method starts with three generator expressions to handle some overall features of the physical lines and logical lines within a properties file. The generator expressions separate three syntax rules. Generator expressions have the advantage of being executed lazily; this means that no intermediate results are created from these expressions until they're evaluated by the for line in non_comment statement.
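The lazy behavior can be seen in a small sketch. The physical_lines() helper and the sample lines are hypothetical; they exist only to record when each line is actually consumed:

```python
consumed = []  # records each physical line at the moment it is read


def physical_lines(lines):
    """Hypothetical line source that logs every line it hands out."""
    for line in lines:
        consumed.append(line)
        yield line


stripped = (line.strip() for line in physical_lines(["  a = 1  ", "", " b = 2 "]))
non_empty = (line for line in stripped if len(line) != 0)

consumed_before_loop = list(consumed)  # still empty: no line has been read yet
result = list(non_empty)               # iterating drives the whole pipeline
print(consumed_before_loop, result)
```

Building the two generator expressions reads nothing; the lines are only pulled through the pipeline when the final iteration asks for them.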

The first expression, assigned to logical_lines, merges physical lines that end with \ to create longer logical lines. The leading (and trailing) spaces are stripped away, leaving just the line content. The regular expression (RE) r"\\\n\s*" locates continuations: it matches a \ at the end of a line, the newline that follows it, and all of the leading spaces on the next line.
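As a minimal sketch of the continuation rule, assuming a made-up two-property sample, the substitution can be tried on its own:

```python
import re

# Sample text (hypothetical): the first property spans two physical lines.
data = "fruits = apple, \\\n    banana\ncolor = red\n"

# Remove each backslash-newline pair and the next line's leading whitespace.
merged = re.sub(r"\\\n\s*", "", data)
logical_lines = [line.strip() for line in merged.splitlines()]
print(logical_lines)
```

The two physical lines of the fruits property come out as one logical line, while the color line is untouched.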

The second expression, assigned to non_empty, will only iterate over lines with a nonzero length; note that blank lines will be rejected by this filter.

The third expression, non_comment, will only iterate over lines that do not start with # or !. Lines that start with # or ! will be rejected by this filter; this eliminates comment lines.

Because of these three generator expressions, the for line in non_comment loop only iterates through non-comment, non-blank, and logical lines that are properly stripped of extra spaces. The body of the loop picks apart each remaining line to separate the key and element and then apply the self._escape() function to expand any escape sequences.

The key-element pattern, key_element_pat, looks for an explicit, non-escaped separator: :, =, or a space. It uses a negative lookbehind assertion, (?<!\\), to require that the separator is non-escaped; that is, the character class that follows must not be preceded by \. The (?<!\\)[:=\s] subpattern therefore matches a non-escaped :, =, or whitespace character. It permits a strange-looking property line such as a\:b: value; the property is a:b and the element is value. The : in the key must be escaped with a preceding \.

The parentheses in the RE capture the property and the element associated with it. If a two-part key-element pattern can't be found, there's no separator and the line is just the property name, with an element of "".
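A short sketch shows how the pattern splits a few sample lines. The split_line() helper and the samples are hypothetical; the helper mirrors the match-or-fallback logic in the body of the _parse() loop, before any escapes are expanded:

```python
import re

key_element_pat = re.compile(r"(.*?)\s*(?<!\\)[:=\s]\s*(.*)")


def split_line(line: str) -> tuple:
    """Hypothetical helper: return (key, element) for one logical line."""
    ke_match = key_element_pat.match(line)
    if ke_match:
        return ke_match.group(1), ke_match.group(2)
    return line, ""  # no separator: the whole line is the key


samples = ["name = value", "name: value", "name value", r"a\:b: value", "flag"]
pairs = [split_line(s) for s in samples]
for pair in pairs:
    print(pair)
```

The first three samples all split into name and value. The fourth keeps the escaped \: inside the key, which is only turned into a plain : later by _escape(). The last sample has no separator, so it falls back to a key with an empty element.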

The properties and elements form a sequence of two-tuples. The sequence can easily be turned into a dictionary to provide a configuration map, similar to other configuration representation schemes that we've seen. The two-tuples can also be left as a sequence to show the content of the file in its original order. The final part is a small method to transform any escape sequences in a key or an element to their final Unicode characters:

def _escape(self, data: str) -> str:
    d1 = re.sub(r"\\([:#!=\s])", lambda x: x.group(1), data)
    d2 = re.sub(r"\\u([0-9A-Fa-f]+)", lambda x: chr(int(x.group(1), 16)), d1)
    return d2

This _escape() method performs two substitution passes. The first pass replaces the escaped punctuation marks with their plain-text versions: \:, \#, \!, \=, and \ followed by a space all have the \ removed. The second pass handles the Unicode escapes: the string of hex digits is turned into an integer, which is turned into the Unicode character that replaces the \uxxxx sequence.
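The two passes can be tried in isolation. This is a sketch only: unescape() is a hypothetical standalone copy of the method body, and the sample strings are invented for illustration:

```python
import re


def unescape(data: str) -> str:
    """Hypothetical standalone version of the two-pass _escape() logic."""
    # Pass 1: drop the backslash in front of :, #, !, =, or whitespace.
    d1 = re.sub(r"\\([:#!=\s])", lambda x: x.group(1), data)
    # Pass 2: replace \u followed by hex digits with the character it names.
    d2 = re.sub(r"\\u([0-9A-Fa-f]+)", lambda x: chr(int(x.group(1), 16)), d1)
    return d2


print(unescape(r"a\:b"))        # escaped punctuation
print(unescape(r"caf\u00e9"))   # Unicode escape
print(unescape(r"\u0041BC") == chr(0x41BC))  # the + takes all six hex digits
```

One consequence of the + quantifier is visible in the last line: every hex digit after \u is consumed, so a hex-looking letter that directly follows an escape becomes part of the code point. A stricter variant could require exactly four digits with {4}; that would be a change from the pattern shown here.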

The two substitutions can be combined into a single operation to save creating an intermediate string that will only get discarded. This will improve the performance; it should look like the following code:

d2 = re.sub(
    r"\\([:#!=\s])|\\u([0-9A-Fa-f]+)",
    lambda x: x.group(1) if x.group(1) else chr(int(x.group(2), 16)),
    data,
)

The benefit of better performance might be outweighed by the complexity of the RE and the replacement function.
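Before choosing the combined form, it's worth checking that it agrees with the two-pass form. The following sketch compares them on a handful of invented samples; the function names are hypothetical, and agreement on these samples is not a proof of equivalence for every input:

```python
import re


def escape_two_pass(data: str) -> str:
    d1 = re.sub(r"\\([:#!=\s])", lambda x: x.group(1), data)
    return re.sub(r"\\u([0-9A-Fa-f]+)", lambda x: chr(int(x.group(1), 16)), d1)


def escape_one_pass(data: str) -> str:
    return re.sub(
        r"\\([:#!=\s])|\\u([0-9A-Fa-f]+)",
        # group(1) is set for punctuation escapes, group(2) for \u escapes.
        lambda x: x.group(1) if x.group(1) else chr(int(x.group(2), 16)),
        data,
    )


samples = [r"a\:b", r"x\=y", r"caf\u00e9", r"\#not a comment", "plain"]
results = [(escape_two_pass(s), escape_one_pass(s)) for s in samples]
print(all(two == one for two, one in results))
```

A comparison like this is also a reasonable starting point for a timing experiment, which would show whether the single pass is actually faster for realistic property files.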

Once we have parsed the property values, we need to use them in an application. In the next section, we'll examine ways of using a properties file.
