Automation of Unix-like systems often involves text manipulation. Many programs are configured with textual configuration files. Text is the output format and the input format of many systems.
Because of that, many automation tasks end up centering around text manipulation. While tools like sed, grep, and awk have their place, Python is a powerful tool for sophisticated text manipulation.
6.1 Bytes, Strings, and Unicode
When manipulating text or text-like streams, it is easy to write code that fails in funny ways when encountering a foreign name or emoji. These are no longer merely theoretical concerns; you have users from the entire world who insist on their usernames reflecting how they spell their names. You have people who write git commits with emojis in them. To write robust code that does not fail in ways that, to be fair, seem a lot less funny when they cause a 3 a.m. page, it is important to understand that text is a subtle thing.
You can understand the distinction, or you can wake up at 3 a.m. when someone tries to log in with an emoji username.
Python 3 has two distinct types that represent the kind of things that are often in text files: bytes and strings. Bytes correspond to what RFCs usually refer to as octet-stream. This is a sequence of values that fit into 8 bits, or in other words, a sequence of numbers that are in the range 0 to 256 (including 0 and not including 256). When all these values are below 128, you call the sequence ASCII (American Standard Code for Information Interchange) and assign the meaning ASCII has assigned them to the numbers. When all these values are between 32 and 127 (including 32 and not including 127), you call the sequence printable ASCII or ASCII text. The first 32 characters are sometimes called control characters. The Ctrl key on keyboards is a reference to that; its original purpose was to be able to input those characters.
ASCII only encompasses the English alphabet used in America. To represent text in (almost) any language, there is Unicode. Unicode code points are (some of the) numbers between 0 and 0x110000 (including 0 and not including 0x110000). Each Unicode code point is assigned a meaning. Successive versions of the standard leave assigned meanings as is but add meanings to more numbers.
An example is the addition of more emojis. The International Organization for Standardization, ISO, ratifies versions of Unicode in its 10646 standard. For this reason, Unicode is sometimes called ISO-10646.
Unicode code points that are also ASCII have the same meaning; if ASCII assigns a number to uppercase A, then so does Unicode.
Properly speaking, only Unicode is text, which Python strings represent. Converting bytes to strings or vice versa is done with an encoding. The most popular encoding these days is UTF-8. Confusingly, turning the bytes to text is decoding. Turning the text to bytes is encoding.
Remembering the difference between encoding and decoding is crucial for manipulating textual data. A way to remember it is that since UTF-8 is an encoding, moving from strings to UTF-8 encoded data is encoding while moving from UTF-8 encoded data to strings is decoding.
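The two directions can be sketched as follows. The sample strings are illustrative choices, not taken from the text, and the UTF-16 line is included only to contrast how a different encoding treats ASCII-only text.

```python
# Text (str) -> bytes is encoding; bytes -> text (str) is decoding.
name = "Zo\u00eb \U0001F642"  # illustrative text: a non-ASCII letter and an emoji

encoded = name.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(type(encoded).__name__, len(encoded))  # length counts bytes
print(type(decoded).__name__, len(decoded))  # length counts code points

# ASCII-only text encodes to the same bytes under ASCII and UTF-8,
# but not under UTF-16.
ascii_text = "hello"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
utf16_bytes = ascii_text.encode("utf-16-le")
print(same_as_ascii, utf16_bytes)
```

Note that the byte length and the string length differ as soon as the text contains non-ASCII characters.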
UTF-8 is compatible with ASCII: bytes that are valid ASCII decode to the same text under UTF-8. UTF-16, which encodes even ASCII characters as two bytes each, shows that this is not a trivial property of encodings. Another property of UTF-8 is that if the bytes are not ASCII, and UTF-8 decoding of the bytes succeeds, it is unlikely that they were encoded with a different encoding. UTF-8 was designed to be self-synchronizing; starting at a random byte, it is possible to synchronize with the string by checking a limited number of bytes. Self-synchronization was designed to allow recovery from truncation and corruption, but as a side benefit, it allows detecting invalid characters reliably and thus detecting whether the string was UTF-8.
This means trying decoding with UTF-8 is a safe operation. It does the right thing for ASCII-only texts. It works for UTF-8 encoded texts and fails cleanly for things that are neither ASCII nor UTF-8 encoded, either text in a different encoding or a binary format such as JPEG.
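A minimal sketch of this behavior follows. The helper name try_decode is hypothetical, and the use of os.urandom to produce random bytes is an assumption made for illustration.

```python
import os


def try_decode(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# ASCII-only bytes and UTF-8 encoded text both decode.
print(try_decode(b"hello"))
print(try_decode("caf\u00e9".encode("utf-8")))

# The same text in a different encoding fails cleanly.
print(try_decode("caf\u00e9".encode("latin-1")))

# Random bytes are rarely valid UTF-8; the count varies from run to run.
attempts = 1000
successes = sum(
    try_decode(os.urandom(16)) is not None for _ in range(attempts)
)
print(f"{successes} of {attempts} random 16-byte sequences decoded")
```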
It is a good exercise to try decoding random bytes as UTF-8 a few times; it rarely succeeds.
6.2 Strings
A string such as "hello" has five elements, each of which is a string of length 1. Since the string is a sequence, the usual sequence operations work on it.
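The following sketch illustrates a few of those sequence operations on a five-character string; the variable name is chosen for this example.

```python
hello = "hello"

length = len(hello)        # number of elements
first = hello[0]           # indexing gives a string of length 1
middle = hello[1:3]        # slicing gives a substring
has_ell = "ell" in hello   # membership tests look for substrings
elements = list(hello)     # iteration yields one-character strings

print(length, first, middle, has_ell, elements)
```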
However, strings also have quite a few methods that are not part of the general sequence interface and are useful when analyzing text.
You can easily test whether a file has either of the common suffixes for a gzipped tarball: the tgz or tar.gz suffix.
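One way to write such a test, as a sketch, relies on endswith accepting a tuple of suffixes. The function name and the sample file names are hypothetical.

```python
def is_gzipped_tarball(filename: str) -> bool:
    # endswith accepts a tuple and returns True if any suffix matches.
    return filename.endswith((".tgz", ".tar.gz"))


results = {
    name: is_gzipped_tarball(name)
    for name in ["backup.tgz", "source.tar.gz", "notes.txt"]
}
print(results)
```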
A typical loop that parses a file and prints a summary starts by stripping the newline from each line. The rstrip method strips from the right (the end) of the string.
Note that rstrip, as well as strip, accepts a sequence of characters to remove. This means that passing a string to rstrip removes any of the characters in that string from the end; it does not remove occurrences of the string itself. This does not affect one-character arguments to rstrip but does mean that longer strings are almost always a mistaken use.
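The pitfall can be seen in a short sketch. The file name is illustrative, and the slicing alternative is one possible way to drop an exact suffix, not the only one.

```python
filename = "test.txt"

# rstrip treats ".txt" as the set of characters ".", "t", and "x",
# so it also removes the final "t" of "test".
stripped = filename.rstrip(".txt")
print(stripped)

# To drop an exact suffix, check for it and slice instead.
suffix = ".txt"
if filename.endswith(suffix):
    base = filename[: -len(suffix)]
else:
    base = filename
print(base)
```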
You then remove comments, if any. You skip empty lines. For any line that is not empty, you use split with no argument to split on any run of whitespace. Conveniently, this convention is common to several formats, and the correct handling is built into the specification of split.
Finally, let’s use a format string to format the output for easy consumption.
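The steps described above could look like the following sketch. The input format (name and value pairs with "#" comments), the sample data, and the function name summarize are all assumptions made for illustration; the original file format is not specified here.

```python
# Hypothetical input: "name value" pairs with "#" comments.
SAMPLE = """\
# service limits
workers   4   # per host

timeout\t30
"""


def summarize(lines):
    summary = []
    for line in lines:
        line = line.rstrip("\n")       # strip the newline
        line = line.split("#", 1)[0]   # remove the comment, if any
        if not line.strip():           # skip empty lines
            continue
        name, value = line.split()     # split on any run of whitespace
        summary.append(f"{name:<10}{value:>5}")  # format the output
    return summary


for row in summarize(SAMPLE.splitlines(keepends=True)):
    print(row)
```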
This is a typical usage of string parsing and is the kind of code that replaces long pipelines in the shell. Another useful method is join: called on a string, it glues together an iterable of strings.
Since iterating on a dictionary object yields the list of keys, passing it to join means you get a string with the list of keys joined together.
This allows calculating sequences on the fly and joining them without having intermediate storage for the sequence.
The usual question about join is why it is a method on the glue string rather than a method on sequences. The reason is exactly this. You can pass in any iterable, and the glue string joins the pieces it yields.
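A short sketch of join with three kinds of iterables follows; the dictionary contents are hypothetical.

```python
# join glues together a list of strings...
from_list = ", ".join(["a", "b", "c"])

# ...the keys of a dictionary (hypothetical data)...
ports = {"http": 80, "https": 443}
from_dict = " ".join(ports)

# ...or a generator expression, with no intermediate list.
from_generator = "-".join(str(n * n) for n in range(4))

print(from_list)
print(from_dict)
print(from_generator)
```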
6.3 Regular Expressions
Regular expressions are a special DSL for specifying properties of strings, also called patterns. They are common in many utilities, although each implementation has its own idiosyncrasies. In Python, regular expressions are implemented by the re module. It fundamentally allows two modes of interaction, one where regular expressions are auto-parsed at the time of text analysis and one where they are parsed in advance.
In general, the latter style is preferred. Parsing the regular expression automatically is suited only to an interactive loop, where expressions are used quickly and forgotten. For this reason, only the precompiled style is covered here.
To compile a regular expression, you use re.compile. This function returns a regular expression object that looks for strings that match the expression. The object can do several things; for example, find one match, find all matches, or even replace the matches.
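As a minimal sketch, the three operations mentioned above map to the search, findall, and sub methods of the compiled object. The pattern and the sample text are illustrative.

```python
import re

# Compile once, then reuse the resulting pattern object.
pattern = re.compile(r"error")
text = "error: disk full; error: retry"  # illustrative input

first = pattern.search(text)                # find one match
all_matches = pattern.findall(text)         # find all matches
replaced = pattern.sub("warning", text)     # replace the matches

print(first.group(), first.start())
print(all_matches)
print(replaced)
```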
The regular expression mini-language has a lot of subtlety. Here, only the basics needed to illustrate how to use regular expressions effectively are covered.
Most characters stand for themselves. The regular expression hello, for example, matches exactly hello. The . stands for any character. So hell. matches hello and hella, but not hell since the latter does not have any character corresponding to the .. Square brackets delimit character classes; for example, wom[ae]n matches both women and woman. Character classes can also have ranges in them: [0-9] matches any digit, [a-z] matches any lowercase character, and [0-9a-fA-F] matches any hexadecimal digit (hexadecimal digits and numbers pop up a lot in many places since two hexadecimal digits correspond exactly to a standard byte).
There are also repeat modifiers that modify the expression that precedes them. For example, ba?b matches both bb and bab; the ? stands for zero or one. The * stands for any number, so ba*b matches bb, bab, baab, baaab, and so on. If you want at least one, ba+b matches everything that ba*b matches, except for bb. Finally, there are the exact counters: ba{3}b matches baaab while ba{1,2}b matches bab and baab and nothing else.
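These rules can be checked with a small sketch. The helper name matches and the table of checks are hypothetical; fullmatch is used so that the whole candidate string has to match the pattern.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Report whether the whole of text matches pattern."""
    return re.compile(pattern).fullmatch(text) is not None


# (pattern, candidate, whether the whole candidate should match)
CHECKS = [
    (r"hell.", "hello", True),
    (r"hell.", "hell", False),
    (r"wom[ae]n", "woman", True),
    (r"[0-9a-fA-F]", "g", False),
    (r"ba?b", "bab", True),
    (r"ba*b", "bb", True),
    (r"ba+b", "bb", False),
    (r"ba{3}b", "baaab", True),
    (r"ba{1,2}b", "baaab", False),
]

for pattern, candidate, expected in CHECKS:
    print(f"{pattern!r:16} {candidate!r:10} {matches(pattern, candidate)}")
```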
To make a special character (like . or *) match itself, prefix it with a backslash. Since the backslash has other meanings in Python strings, Python supports raw strings. While you can use any string to denote a regular expression, raw strings are often easier.
For example, if you want a DOS-like file name regular expression: r"[^.]{1,8}\.[^.]{0,3}". This matches, say, readme.txt but not archive.tar.gz. Note that to match a literal ., you must escape it with a backslash. Also note an interesting character class, [^.], which means anything except a dot. A ^ at the start of a character class means to exclude the characters inside it.
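A quick sketch of the DOS-like pattern with the literal dot escaped, together with an unescaped variant to show what the backslash changes. The variable names and the third test string are illustrative.

```python
import re

dos_name = re.compile(r"[^.]{1,8}\.[^.]{0,3}")

readme_ok = dos_name.fullmatch("readme.txt") is not None
archive_ok = dos_name.fullmatch("archive.tar.gz") is not None
print(readme_ok, archive_ok)

# Without the backslash, "." matches any character at all,
# so a name with no dot in it is accepted.
unescaped = re.compile(r"[^.]{1,8}.[^.]{0,3}")
no_dot_escaped = dos_name.fullmatch("readmeXtxt") is not None
no_dot_unescaped = unescaped.fullmatch("readmeXtxt") is not None
print(no_dot_escaped, no_dot_unescaped)
```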
Regular expressions also support grouping. Grouping does two things. It allows addressing parts of the expression, and it allows treating a part of the expression as a single object to apply one of the repeat operations to it. If only the latter is needed, this is a non-capture group, denoted by (?:...).
For example, (?:[a-z]{2,5}-){1,4}[0-9] matches hello-3 or hello-world-5 but not a-hello-2 (since the first part is not two characters long) or hello-world-this-is-too-long-7 since it is made up of six repetitions of the inner pattern, and you specified a maximum of 4.
This allows arbitrary nesting. To illustrate, (?:(?:[a-z]{2,5}-){1,4}[0-9];)+ allows any sequence of the previous pattern in which each item is terminated by a semicolon; for example, az-2;hello-world-5; matches but this-is-3;not-good-match-6 does not since it is missing the ; at the end.
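The grouping examples can be exercised with the sketch below, again using fullmatch so that the whole string must match. The capturing pattern at the end (pair) and its input are hypothetical additions that illustrate addressing parts of a match.

```python
import re

# Non-capture group with a repeat count applied to the whole group.
item = re.compile(r"(?:[a-z]{2,5}-){1,4}[0-9]")
item_results = {
    text: item.fullmatch(text) is not None
    for text in [
        "hello-3",
        "hello-world-5",
        "a-hello-2",
        "hello-world-this-is-too-long-7",
    ]
}
print(item_results)

# Nested groups: one or more semicolon-terminated items.
sequence = re.compile(r"(?:(?:[a-z]{2,5}-){1,4}[0-9];)+")
sequence_results = {
    text: sequence.fullmatch(text) is not None
    for text in ["az-2;hello-world-5;", "this-is-3;not-good-match-6"]
}
print(sequence_results)

# Capturing groups let you address parts of the match by number.
pair = re.compile(r"([a-z]+)=([0-9]+)")
match = pair.fullmatch("retries=3")
print(match.group(1), match.group(2))
```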
This is a good example of how complex regular expressions can get. It is easy to use this dense mini-language inside Python to specify constraints on strings in a way that is hard to understand.
Compiling a regular expression with the re.VERBOSE flag allows making it more readable. Whitespace inside the regular expression is ignored. Additionally, Python-like comments (from “#” to the end of the line) are ignored. In order to match a space or #, those characters need to be escaped with a backslash.
This allows you to write long regular expressions while still making them easier to understand with judicious line breaks, spaces, and comments.
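As a sketch, here is a DOS-like file name pattern written in verbose style, followed by a small pattern that matches an escaped space. The variable names and comments are illustrative.

```python
import re

dos_name = re.compile(
    r"""
    [^.]{1,8}   # base name: one to eight non-dot characters
    \.          # a literal dot
    [^.]{0,3}   # extension: up to three non-dot characters
    """,
    re.VERBOSE,
)
print(dos_name.fullmatch("readme.txt") is not None)

# An escaped space is matched; unescaped whitespace and comments are ignored.
greeting = re.compile(r"hello\ world  # the space is escaped", re.VERBOSE)
print(greeting.fullmatch("hello world") is not None)
```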
Regular expressions are loosely based on the mathematical theory of finite automata. While they do go beyond the constraints of what finite automata can match, they are not fully general. They are poorly suited for nested patterns; matching balanced parentheses or HTML elements is not a good fit for regular expressions.
6.4 JSON
JSON is a hierarchical file format that is simple to parse and reasonably easy to read and write by hand. It has its origins on the web and stands for JavaScript Object Notation. Indeed, it is still popular on the Internet; one reason to care about JSON is that many web APIs use JSON as a transfer format.
It is also useful, however, in other places. For example, in JavaScript projects, package.json includes the dependencies of this project. Parsing this is often useful to determine third-party dependencies for security or compliance audits.
In theory, JSON is a format defined in Unicode, not bytes. When serializing, it takes a data structure and transforms it into a Unicode string, and when deserializing, it takes a Unicode string and returns a data structure. However, the standard was recently amended to specify a preferred encoding: UTF-8. With this addition, the format is also defined as a byte stream.
However, note that the encoding is still separate from the format in some use cases. In particular, when sending or receiving JSON over HTTP, the HTTP encoding is the ultimate truth. Even then, though, when no encoding is explicitly specified, UTF-8 should be assumed.
JSON supports only a few types:
Strings
Numbers
Booleans
A null type
Arrays of JSON values
Objects: dictionaries mapping strings to JSON values
Note that JSON does not fully specify numerical ranges or precision. If precise integers are required, the range -2**53 to 2**53 can usually be assumed to be representable precisely.
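The boundary can be illustrated with a short sketch. It assumes a consumer that stores numbers as 64-bit floats, which is imitated here with Python's float; Python's own json module keeps integers exact.

```python
import json

big = 2**53 + 1

# Python's json module keeps integers exact in both directions...
round_tripped = json.loads(json.dumps(big))
print(round_tripped == big)

# ...but a consumer that stores numbers as 64-bit floats cannot
# represent this value exactly.
as_float = float(big)
print(as_float == big)
print(int(as_float) == 2**53)
```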
Although the Python JSON library can read/write directly to files, you almost always separate the tasks; you read as much data as you need and pass the string directly to JSON.
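A minimal sketch of keeping the I/O separate from the JSON handling follows. The data and the use of a temporary directory are assumptions made so that the example is self-contained.

```python
import json
import pathlib
import tempfile

# Hypothetical data, loosely shaped like a package.json file.
data = {"name": "example", "dependencies": {"left-pad": "1.3.0"}}

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "package.json"

    # Serialize to a string, then write the string.
    path.write_text(json.dumps(data), encoding="utf-8")

    # Read the string, then parse it.
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == data)
print(sorted(loaded["dependencies"]))
```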
It is important to remember not all JSON loading libraries make the same decision, and in some cases, this can lead to interoperability problems.
Note that some code also adds sort_keys=True, which was a good idea on versions of Python prior to 3.7. Since then, Python dictionaries are guaranteed to preserve insertion order, which means that json.loads and json.dumps preserve the original order of the keys in the source JSON.
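The difference between the default behavior and sort_keys=True can be seen in a small sketch; the source document is illustrative.

```python
import json

source = '{"zebra": 1, "apple": 2, "mango": 3}'  # illustrative document

parsed = json.loads(source)
key_order = list(parsed)               # keys in source order
default_dump = json.dumps(parsed)      # order is preserved
sorted_dump = json.dumps(parsed, sort_keys=True)  # order is alphabetical

print(key_order)
print(default_dump)
print(sorted_dump)
```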
This is an easy way to scan through dumped JSON and look for interesting information.
One type frequently missed in JSON is a date-time type. Usually, date-times are represented as strings, and this is the most common reason to parse JSON against a schema: the schema tells you which strings to convert to datetime objects.
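As a sketch of the idea, the "schema" below is nothing more than a hypothetical set of field names known to hold ISO 8601 timestamps. The field names, the function name parse_event, and the sample document are all assumptions for illustration.

```python
import datetime
import json

# Hypothetical schema: names of fields that hold ISO 8601 timestamps.
DATETIME_FIELDS = {"created_at"}


def parse_event(text: str) -> dict:
    raw = json.loads(text)
    return {
        key: (
            datetime.datetime.fromisoformat(value)
            if key in DATETIME_FIELDS
            else value
        )
        for key, value in raw.items()
    }


event = parse_event('{"id": "42", "created_at": "2024-01-15T03:00:00+00:00"}')
print(event["id"], event["created_at"].isoformat())
```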
6.5 CSV
The CSV format has a few advantages. It is constrained; it always represents scalar types in a two-dimensional array. For this reason, there are not a lot of surprises that can go in. In addition, it is a format that imports natively into spreadsheet applications like Microsoft Excel or Google Sheets, which comes in handy when preparing reports.
Examples of such reports are breaking down expenses for paying for third-party services for the financial department or a report on incidents managed and time to recovery for management. In all these cases, having a format that is easy to produce and import into spreadsheet applications allows for easy automation of the task.
Note that by convention, the first row should be a title row. Though the Python API does not enforce it, it is highly recommended to follow this convention. In this example, you first wrote a title row with the names of the fields.
Then you looped through the attempts. Note that CSV can only represent strings and numbers, so instead of relying on thinly documented standards on how a boolean is written, you do so explicitly.
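Writing such a report could look like the following sketch. The attempts data, the field names, and the choice of "true"/"false" as the serialized form are assumptions; an in-memory buffer stands in for a file.

```python
import csv
import io

# Hypothetical data: login attempts as (user, succeeded) pairs.
attempts = [("alice", True), ("bob", False)]

# When writing to a real file, open it with newline="" instead.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["user", "success"])  # title row
for user, succeeded in attempts:
    # Serialize the boolean explicitly rather than relying on str(bool).
    writer.writerow([user, "true" if succeeded else "false"])

print(output.getvalue())
```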
This way, if the auditor asks for that field to be yes/no, you can change the explicit serialization step to match. When it comes to reading CSV files, there are two main approaches.
Reading the same CSV you wrote in the previous example yields reasonable results. The dictionary maps the field names to the values. It is important to note that the types have all been forgotten, and everything is returned as a string. Unfortunately, CSV does not keep type information.
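The sketch below assumes the two approaches are csv.reader, which yields lists, and csv.DictReader, which yields dictionaries keyed by the title row; the text here does not name them. The CSV content is illustrative.

```python
import csv
import io

CSV_TEXT = "user,attempts\r\nalice,3\r\nbob,1\r\n"  # illustrative content

# csv.reader yields each row as a list of strings, title row included.
rows = list(csv.reader(io.StringIO(CSV_TEXT)))
print(rows)

# csv.DictReader uses the title row as keys for the remaining rows.
records = list(csv.DictReader(io.StringIO(CSV_TEXT)))
print(records[0]["user"], records[0]["attempts"])

# Every value is a string, so numbers must be converted explicitly.
total = sum(int(record["attempts"]) for record in records)
print(total)
```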
It is sometimes tempting to just improvise parsing CSV files with .split. However, CSV has quite a few corner cases that are not readily apparent.
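One such corner case, sketched with an illustrative row, is a quoted field that contains a comma.

```python
import csv
import io

line = 'widget,"Acme, Inc.",42'  # illustrative row with a quoted comma

# Splitting on commas breaks the quoted field in two.
naive = line.split(",")
print(naive)

# The csv module understands the quoting and keeps the field intact.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```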
For the same reason, it is a good idea to avoid writing CSV files using anything other than csv.writer.
6.6 Summary
Much of the content needed for many DevOps tasks arrives as text: logs, JSON dumps of data structures, or a CSV file of paid licenses. Understanding what text is and how to manipulate it in Python allows much of the automation that is the cornerstone of DevOps, be it through build automation, monitoring result analysis, or just preparing summaries for easy consumption by others.