
6. Text Manipulation

Moshe Zadka¹

(1)

Belmont, CA, USA

Automation of Unix-like systems often involves text manipulation. Many programs are configured with textual configuration files. Text is the output format and the input format of many systems.

Because of that, many automation tasks end up centering around text manipulation. While tools like sed, grep, and awk have their place, Python is a powerful tool for sophisticated text manipulation.

6.1 Bytes, Strings, and Unicode

When manipulating text or text-like streams, it is easy to write code that fails in funny ways when encountering a foreign name or emoji. These are no longer merely theoretical concerns; you have users from the entire world who insist on their usernames reflecting how they spell their names. You have people who write git commits with emojis in them. To write robust code that does not fail in ways that, to be fair, seem a lot less funny when they cause a 3 a.m. page, it is important to understand that text is a subtle thing.

You can understand the distinction, or you can wake up at 3 a.m. when someone tries to log in with an emoji username.

Python 3 has two distinct types that represent the kind of things that are often in text files: bytes and strings. Bytes correspond to what RFCs usually refer to as an octet stream. This is a sequence of values that fit into 8 bits, or in other words, a sequence of numbers in the range 0 to 256 (including 0 and not including 256). When all these values are below 128, you call the sequence ASCII (American Standard Code for Information Interchange) and give the numbers the meanings ASCII has assigned them. When all these values are between 32 and 127 (including 32 and not including 127), you call the sequence printable ASCII or ASCII text. The first 32 characters are sometimes called control characters, as is the last one, 127 (delete). The Ctrl key on keyboards is a reference to that; its original purpose was to be able to input those characters.

ASCII only encompasses the English alphabet used in America. To represent text in (almost) any language, there is Unicode. Unicode code points are (some of the) numbers between 0 and 0x110000 (including 0 and not including 0x110000). Each assigned code point has a meaning. Successive versions of the standard leave assigned meanings as is but add meanings to more numbers.

An example is the addition of more emojis. The International Organization for Standardization, ISO, ratifies versions of Unicode in its 10646 standard. For this reason, Unicode is sometimes called ISO-10646.

Unicode code points that are also ASCII have the same meaning; if ASCII assigns a number to uppercase A, then Unicode assigns the same number to it.
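A small sketch can make the overlap between ASCII and Unicode concrete. The variable names here are illustrative only; the snowman is just a convenient example of a code point well outside ASCII.

```python
# ASCII characters keep their ASCII numbers as Unicode code points.
ascii_a = "A".encode("ascii")[0]      # the byte value ASCII assigns to A
code_point_a = ord("A")               # the Unicode code point of A
snowman_point = ord("\N{SNOWMAN}")    # a code point far outside ASCII

print(ascii_a, code_point_a)          # 65 65
print(hex(snowman_point))             # 0x2603
print(chr(0x2603) == "\N{SNOWMAN}")   # True
```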

Properly speaking, only Unicode is text, which Python strings represent. Converting bytes to strings or vice versa is done with an encoding. The most popular encoding these days is UTF-8. Confusingly, turning the bytes to text is decoding. Turning the text to bytes is encoding.

Remembering the difference between encoding and decoding is crucial when manipulating textual data. A way to remember it is that since UTF-8 is an encoding, moving from strings to UTF-8 encoded data is encoding, while moving from UTF-8 encoded data to strings is decoding.
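As a minimal sketch of the two directions, the following round trip uses an arbitrary sample string with two non-ASCII characters. Note that the number of characters and the number of bytes differ.

```python
# str -> bytes is encoding; bytes -> str is decoding.
text = "na\u00efve caf\u00e9"        # a str with two non-ASCII characters
data = text.encode("utf-8")          # encoding: text to bytes
roundtrip = data.decode("utf-8")     # decoding: bytes back to text

print(type(text).__name__, len(text))   # str 10
print(type(data).__name__, len(data))   # bytes 12
print(roundtrip == text)                # True
```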

UTF-8 has an interesting property. When given a Unicode string that happens to be ASCII, it produces bytes with the values of the code points. This means that visually, the encoded and decoded form look the same.

>>> "hello".encode("utf-8")

b'hello'

>>> "hello".encode("utf-16")

b'\xff\xfeh\x00e\x00l\x00l\x00o\x00'

The example with UTF-16 shows that this is not a trivial property of encodings. Another property of UTF-8 is that if the bytes are not ASCII, and UTF-8 decoding of the bytes succeeds, it is unlikely that they were encoded with a different encoding. UTF-8 was designed to be self-synchronizing; starting at a random byte, it is possible to synchronize with the string with a limited number of bytes being checked. Self-synchronization was designed to allow recovery from truncation and corruption, but as a side benefit, it allows detecting invalid characters reliably and thus detects if the string was UTF-8.

This means trying decoding with UTF-8 is a safe operation. It does the right thing for ASCII-only texts. It works for UTF-8 encoded texts and fails cleanly for things that are neither ASCII nor UTF-8 encoded, either text in a different encoding or a binary format such as JPEG.

For Python, “fails cleanly” means throws an exception.

>>> snowman = '\N{snowman}'
>>> snowman.encode('utf-16').decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

For random data, this also tends to fail.

>>> import random, struct
>>> struct.pack('B'*12,
...             *(random.randrange(0, 256)
...               for i in range(12))
... ).decode('utf-8')

The errors are random since the inputs are random. The following are some example errors.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 4: invalid continuation byte

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x98 in position 2: invalid start byte

It is a good exercise to try and run this a few times; it rarely succeeds.
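Because the failure is an exception, the "try UTF-8 first" approach is easy to wrap in a helper. The following is a minimal sketch; the name decode_if_utf8 is hypothetical, and returning None is only one possible way to signal "not text".

```python
def decode_if_utf8(data: bytes):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

plain = decode_if_utf8(b"hello")                           # ASCII is valid UTF-8
snow = decode_if_utf8("\N{SNOWMAN}".encode("utf-8"))       # real UTF-8
wrong = decode_if_utf8("\N{SNOWMAN}".encode("utf-16"))     # a different encoding
binary = decode_if_utf8(b"\xff\xd8\xff\xe0")               # starts like a JPEG file

print(plain, snow, wrong, binary)
```

The caller can then decide whether None means trying another encoding or treating the data as binary.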

6.2 Strings

The Python string object is subtle. From one perspective, it appears to be a sequence of characters, and a character is a string of length 1.

>>> a="hello"

>>> for i, x in enumerate(a):
...     print(i, x, len(x))
...

0 h 1

1 e 1

2 l 1

3 l 1

4 o 1

The hello string has five elements, each of which is a string of length 1. Since the string is a sequence, the usual sequence operations work on it.

You can create a slice by specifying both endpoints.

>>> a[2:4]

'll'

Or, you can create a slice by specifying only the end.

>>> a[:2]

'he'

Or, you can create a slice by specifying only the beginning.

>>> a[3:]

'lo'

You can also use negative indices to count from the end.

>>> a[:-3]

'he'

You can reverse a string by specifying an extended slice with a negative step.

>>> a[::-1]

'olleh'

However, strings also have quite a few methods that are not part of the general sequence interface and are useful when analyzing text.

The startswith and endswith methods are useful since text analysis is often around the ends.

>>> "hello world".endswith("world")

True

A little-known feature is that endswith allows a tuple of strings and checks if it ends with any of these strings.

>>> "hello world".endswith(("universe", "world"))

True

An example where it comes in useful is testing for a few common endings.

>>> filename.endswith((".tgz", ".tar.gz"))

You can easily test whether a file has either of the common suffixes for a gzipped tarball: the tgz or tar.gz suffix.

The strip and split methods are useful for parsing the ad hoc formats that many Unix files or utilities come in. For example, the /etc/fstab file contains static mounts on most Unix systems (though not on macOS-based ones).

with open("/etc/fstab") as fpin:
    for line in fpin:
        line = line.rstrip('\n')       # strip the trailing newline
        line = line.split('#', 1)[0]   # drop a trailing comment, if any
        if not line.strip():           # skip empty or whitespace-only lines
            continue
        device, path, fstype, options, freq, passno = line.split()
        print(f"Mounting {device} on {path}")

This parses the file and prints a summary. The first line in the loop strips out the newline. The rstrip method strips from the right (the end) of the string.

Note that rstrip, as well as strip, accepts a set of characters to remove. Passing a string to rstrip removes any trailing characters that appear in that string; it does not remove occurrences of the string as a whole. This makes no difference for one-character arguments to rstrip, but it does mean that passing a longer string is almost always a mistake.
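The difference is easiest to see with a name whose ending happens to contain the same characters as the suffix. This sketch uses a made-up file name; str.removesuffix is available in Python 3.9 and later.

```python
name = "jazz.gz"

# rstrip treats ".gz" as the set of characters {'.', 'g', 'z'}
# and keeps removing trailing characters while they are in that set.
stripped = name.rstrip(".gz")

# removesuffix (Python 3.9+) removes the exact string once, if present.
trimmed = name.removesuffix(".gz")

print(stripped)   # ja
print(trimmed)    # jazz
```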

You then remove comments, if any. You skip empty lines. For any line that is not empty, you use split with no argument to split on any run of whitespace. Conveniently, this convention is common to several formats, and the correct handling is built into the specification of split.

Finally, you use a format string to format the output for easy consumption.

This is a typical usage of string parsing and is the kind of code that replaces long pipelines in the shell. The last method to cover is join, which glues together an iterable of strings.

A simple example, ' '.join(["hello", "world"]), returns "hello world", but this only scratches the surface of join. Since it accepts an iterable, you can pass it anything that supports iteration.

>>> names=dict(hello=1,world=2)

>>> ' '.join(names)

'hello world'

Since iterating on a dictionary object yields the list of keys, passing it to join means you get a string with the list of keys joined together.

You can also pass in a generator.

>>> '-*-'.join(str(x) for x in range(3))

'0-*-1-*-2'

This allows calculating sequences on the fly and joining them without having intermediate storage for the sequence.

The usual question about join is why it is a method on the glue string rather than a method on sequences. The reason is exactly this: you can pass in any iterable, and the glue string glues together the bits in it.

Note that join adds no glue for single-element iterables; it returns the single element unchanged.

>>> '-*-'.join(str(x) for x in range(1))

'0'

6.3 Regular Expressions

Regular expressions are a special domain-specific language (DSL) for specifying properties of strings, also called patterns. They are common in many utilities, although each implementation has its own idiosyncrasies. In Python, regular expressions are implemented by the re module. It fundamentally allows two modes of interaction: one where regular expressions are parsed automatically at the time of text analysis and one where they are parsed in advance.

In general, the latter style is preferred. Auto-parsing the regular expression is suited only to an interactive loop, where expressions are used quickly and forgotten. For this reason, only the parse-in-advance usage is covered here.

To compile a regular expression, you use re.compile. This function returns a regular expression object that looks for strings that match the expression. The object can do several things; for example, find one match, find all matches, or even replace the matches.
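As a rough sketch of those three operations, the following compiles a pattern made only of literal characters and uses the search, findall, and sub methods of the resulting object. The sample text is made up for illustration.

```python
import re

reobj = re.compile("error")           # parsed once, in advance
text = "error: disk error"

first = reobj.search(text)            # find one match
every = reobj.findall(text)           # find all matches
replaced = reobj.sub("warning", text) # replace the matches

print(first.span())   # (0, 5)
print(every)          # ['error', 'error']
print(replaced)       # warning: disk warning
```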

The regular expression mini-language has a lot of subtlety. Here, only the basics needed to illustrate how to use regular expressions effectively are covered.

Most characters stand for themselves. The regular expression hello, for example, matches exactly hello. The . stands for any character. So hell. matches hello and hella, but not hell, since the latter does not have any character corresponding to the dot. Square brackets delimit character classes; for example, wom[ae]n matches both women and woman. Character classes can also have ranges in them: [0-9] matches any digit, [a-z] matches any lowercase ASCII letter, and [0-9a-fA-F] matches any hexadecimal digit (hexadecimal digits and numbers pop up a lot in many places since two hexadecimal digits correspond exactly to a standard byte).

There are also repeat modifiers that modify the expression that precedes them. For example, ba?b matches both bb and bab; the ? stands for zero or one. The * stands for any number, so ba*b matches bb, bab, baab, baaab, and so on. If you want at least one, ba+b matches everything that ba*b matches, except for bb. Finally, there are the exact counters: ba{3}b matches baaab while ba{1,2}b matches bab and baab and nothing else.
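One way to check these claims is to test each pattern against the same handful of candidate strings. This sketch uses fullmatch, which requires the entire string to match the pattern; the candidate list is chosen only for illustration.

```python
import re

candidates = ["bb", "bab", "baab", "baaab"]
patterns = ["ba?b", "ba*b", "ba+b", "ba{3}b", "ba{1,2}b"]

# For each pattern, collect the candidates it matches in full.
matches = {
    pattern: [s for s in candidates if re.compile(pattern).fullmatch(s)]
    for pattern in patterns
}

for pattern, matched in matches.items():
    print(pattern, matched)
```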

To make a special character (like . or *) match itself, prefix it with a backslash. Since the backslash has other meanings in Python strings, Python supports raw strings. While you can use any string to denote a regular expression, raw strings are often easier.

For example, suppose you want a regular expression for DOS-like file names: r"[^.]{1,8}\.[^.]{0,3}". Matched against a whole name, this accepts, say, readme.txt but not archive.tar.gz. Note that to match a literal ., you must escape it with a backslash. Also note an interesting character class, [^.], which means anything except a dot. A ^ at the start of a character class excludes the characters listed inside it.
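The effect of escaping the dot can be checked directly. This is a small sketch, assuming the whole file name must match (hence fullmatch); the name readmeXtxt is a made-up counterexample that contains no dot at all.

```python
import re

dos_name = re.compile(r"[^.]{1,8}\.[^.]{0,3}")   # the dot is escaped
loose = re.compile(r"[^.]{1,8}.[^.]{0,3}")       # the dot matches anything

readme_ok = bool(dos_name.fullmatch("readme.txt"))
tarball_ok = bool(dos_name.fullmatch("archive.tar.gz"))
# Without the backslash, a name with no dot at all is accepted.
loose_ok = bool(loose.fullmatch("readmeXtxt"))
strict_ok = bool(dos_name.fullmatch("readmeXtxt"))

print(readme_ok, tarball_ok, loose_ok, strict_ok)   # True False True False
```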

Regular expressions also support grouping. Grouping does two things. It allows addressing parts of the expression, and it allows treating a part of the expression as a single object so that one of the repeat operations can be applied to it. If only the latter is needed, you can use a non-capture group, denoted by (?:...).

For example, (?:[a-z]{2,5}-){1,4}[0-9] matches hello-3 or hello-world-5 but not a-hello-2 (since the first part is shorter than two characters) or hello-world-this-is-too-long-7 (since it is made up of six repetitions of the inner pattern, and you specified a maximum of four).

This allows arbitrary nesting. To illustrate, (?:(?:[a-z]{2,5}-){1,4}[0-9];)+ allows any sequence of the previous pattern in which every item is terminated by a semicolon; for example, az-2;hello-world-5; matches but this-is-3; not-good-match-6 does not since it is missing the ; at the end.
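These examples can be verified with a short script. The sketch below again uses fullmatch so that the whole string has to fit the pattern; the names item and sequence are illustrative.

```python
import re

item = re.compile(r"(?:[a-z]{2,5}-){1,4}[0-9]")
sequence = re.compile(r"(?:(?:[a-z]{2,5}-){1,4}[0-9];)+")

item_results = {
    text: bool(item.fullmatch(text))
    for text in ["hello-3", "hello-world-5", "a-hello-2",
                 "hello-world-this-is-too-long-7"]
}
sequence_results = {
    text: bool(sequence.fullmatch(text))
    for text in ["az-2;hello-world-5;", "this-is-3;not-good-match-6"]
}

for text, ok in {**item_results, **sequence_results}.items():
    print(text, ok)
```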

This is a good example of how complex regular expressions can get. It is easy to use this dense mini-language inside Python to specify constraints on strings that are hard to understand.

Once you have a regular expression object, there are two main methods in it: match and search. The match method looks for matches at the beginning of the string, while search looks for the first match, wherever it may start. When they find a match, they return a match object.
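The difference between the two methods shows up as soon as the pattern does not occur at the start of the string. A minimal sketch, with made-up sample strings:

```python
import re

reobj = re.compile("world")

at_start = reobj.match("hello world")      # pattern is not at the beginning
anywhere = reobj.search("hello world")     # pattern found later in the string
at_start_ok = reobj.match("world peace")   # pattern is at the beginning

print(at_start)             # None
print(anywhere.span())      # (6, 11)
print(at_start_ok.span())   # (0, 5)
```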

>>> import re
>>> reobj = re.compile('ab+a')
>>> m = reobj.search('hello abba world')
>>> m
<re.Match object; span=(6, 10), match='abba'>

>>> m.group()

'abba'

The first method that is often used is .group(), which returns the part of the string matched. This method can get a part of the match if the regular expression contains capturing groups. A capturing group is usually marked with ().

>>> reobj = re.compile('(a)(b+)(a)')

>>> m = reobj.search('hello abba world')

>>> m.group()

'abba'

>>> m.group(1)

'a'

>>> m.group(2)

'bb'

>>> m.group(3)

'a'

When the number of groups is significant, or when modifying the group, managing the indices to the group can prove to be a challenge. If analysis of the groups is needed, you can also name the groups.

>>> reobj = re.compile('(?P<prefix>a)(?P<body>b+)(?P<suffix>a)')

>>> m = reobj.search('hello abba world')

>>> m.group('prefix')

'a'

>>> m.group('body')

'bb'

>>> m.group('suffix')

'a'

Since regular expressions can get dense, there is a way to make them a bit easier to read: the verbose mode.

>>> reobj = re.compile(r"""
... (?P<prefix>a)   # The beginning -- always an a
... (?P<body>b+)    # The middle -- any number of b, for emphasis
... (?P<suffix>a)   # An a at the end to properly anchor
... """, re.VERBOSE)

>>> m = reobj.search("hello abba world")

>>> m.groups()

('a', 'bb', 'a')

>>> m.group('prefix'), m.group('body'), m.group('suffix')

('a', 'bb', 'a')

Compiling regular expressions with the re.VERBOSE flag makes them more readable. Whitespace inside the regular expression is ignored. Additionally, Python-like comments (from “#” to the end of the line) are ignored. To match a space or #, those characters need to be escaped with a backslash.

This allows you to write long regular expressions while still making them easier to understand with judicious line breaks, spaces, and comments.
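The escaping rule is worth seeing once. In this sketch, the pattern names (spaced, issue_ref) and the sample text are made up for illustration.

```python
import re

# Unescaped whitespace is dropped in verbose mode: this pattern is just "ab".
spaced = re.compile(r"a b", re.VERBOSE)

# Escaping keeps the space and the # as literal characters to match.
issue_ref = re.compile(r"""
    issue\ \#             # the literal text: issue, a space, a hash sign
    (?P<number>[0-9]+)    # one or more digits
""", re.VERBOSE)

matches_ab = bool(spaced.fullmatch("ab"))
matches_a_b = bool(spaced.fullmatch("a b"))
number = issue_ref.search("see issue #42 for details").group("number")

print(matches_ab, matches_a_b)   # True False
print(number)                    # 42
```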

Regular expressions are loosely based on the mathematical theory of finite automata. While they do go beyond the constraints of what finite automata can match, they are not fully general. They are poorly suited for nested patterns; matching balanced parentheses or nested HTML elements is not a good fit for regular expressions.

6.4 JSON

JSON is a hierarchical file format that is simple to parse and reasonably easy to read and write by hand. It has its origins on the web and stands for JavaScript Object Notation. Indeed, it is still popular on the Internet; one reason to care about JSON is that many web APIs use JSON as a transfer format.

It is also useful, however, in other places. For example, in JavaScript projects, package.json includes the dependencies of this project. Parsing this is often useful to determine third-party dependencies for security or compliance audits.

In theory, JSON is a format defined in terms of Unicode, not bytes. When serializing, it takes a data structure and transforms it into a Unicode string, and when deserializing, it takes a Unicode string and returns a data structure. However, the standard was recently amended to specify a preferred encoding: UTF-8. With this addition, the format is also defined as a byte stream.

However, note that the encoding is still separate from the format in some use cases. In particular, when sending or receiving JSON over HTTP, the HTTP encoding is the ultimate truth. Even then, though, when no encoding is explicitly specified, UTF-8 should be assumed.

JSON is a simple serialization format, only supporting a few types.

Strings
Numbers
Booleans
A null type
Arrays of JSON values
Objects: dictionaries mapping strings to JSON values

Note that JSON does not fully specify numerical ranges or precision. If precise integers are required, the range -2**53 to 2**53 can usually be assumed to be representable precisely.
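The 2**53 boundary comes from parsers that store every number as a 64-bit float. The following sketch shows where the precision runs out; it is an illustration of float behavior, not a statement about any particular JSON library other than Python's own.

```python
import json

limit = 2**53

# A 64-bit float cannot tell limit and limit + 1 apart.
collides = float(limit) == float(limit + 1)
# Below the limit, neighboring integers stay distinct.
distinct = float(limit - 1) != float(limit)
# Python's own json module parses integers exactly; other parsers may not.
exact_in_python = json.loads(json.dumps(limit + 1)) == limit + 1

print(collides, distinct, exact_in_python)   # True True True
```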

Although the Python JSON library can read/write directly to files, you almost always separate the tasks; you read as much data as you need and pass the string directly to JSON.
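A minimal sketch of that separation follows: the file is read and decoded in one step, and the resulting string is parsed in another. The file contents and the package name are invented for the example, and a temporary directory stands in for a real project.

```python
import json
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "package.json"   # hypothetical file
    path.write_text('{"dependencies": {"left-pad": "1.3.0"}}',
                    encoding="utf-8")

    raw = path.read_text(encoding="utf-8")      # step 1: read and decode
    package = json.loads(raw)                   # step 2: parse the string

dependencies = sorted(package.get("dependencies", {}))
print(dependencies)   # ['left-pad']
```

Keeping the steps apart makes it easy to swap the source of the string, for example an HTTP response body instead of a file.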

The most important functions in the json module are loads and dumps. The s at the end stands for string, which is what those functions accept and return.

>>> import json

>>> thing = [{"hello": 1, "world": 2}, None, True]

>>> json.dumps(thing)

'[{"hello": 1, "world": 2}, null, true]'

>>> json.loads(_)

[{'hello': 1, 'world': 2}, None, True]

The None object in Python maps to the JSON null object, booleans in Python map to booleans in JSON, and numbers and strings map to numbers and strings. Note that the Python JSON parsing libraries make ad hoc decisions about whether a number should map to an integer or a float based on its notation.

>>> json.loads("1")
1
>>> json.loads("1.0")
1.0

It is important to remember not all JSON loading libraries make the same decision, and in some cases, this can lead to interoperability problems.

For debugging reasons, it is often useful to be able to print JSON. The dumps function can do that with some extra arguments. The following is the usual set of arguments for pretty printing.

json.dumps(thing, indent=4)

If you want to round trip into an equivalent but prettier version, you can do so.

>>> encoded_string='{"b":1,"a":2}'

>>> print(json.dumps(json.loads(encoded_string), indent=4))

{
    "b": 1,
    "a": 2
}

Note that some code also adds sort_keys=True, which was a good idea on versions of Python prior to 3.7. Since then, Python dictionaries are guaranteed to preserve insertion order, which means that json.loads and json.dumps preserve the original order of the keys in the source JSON.
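A quick way to see what sort_keys changes is to dump the same decoded object with and without it. This sketch reuses the small sample string from the pretty-printing example.

```python
import json

encoded_string = '{"b":1,"a":2}'
decoded = json.loads(encoded_string)

as_loaded = json.dumps(decoded)                     # keeps the source order
sorted_keys = json.dumps(decoded, sort_keys=True)   # alphabetical order

print(as_loaded)     # {"b": 1, "a": 2}
print(sorted_keys)   # {"a": 2, "b": 1}
```

Sorting is still useful when the output needs to be stable regardless of the input order, for example when comparing two dumps.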

Finally, at the command line, the json.tool module does this automatically.

$ python -m json.tool < somefile.json | less

This is an easy way to scan through dumped JSON and look for interesting information.

One type frequently missed in JSON is a date-time type. Usually, dates and times are represented as strings. This is the most common reason to need a schema when parsing JSON: the schema tells you which strings to convert to datetime objects.
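In its simplest form, such a schema can be a set of field names. The sketch below assumes timestamps are written in ISO 8601 format; the field name created_at, the helper parse_event, and the sample event are all hypothetical.

```python
import json
from datetime import datetime

# Hypothetical schema: names of the fields that hold ISO 8601 timestamps.
DATETIME_FIELDS = {"created_at"}

def parse_event(text):
    """Parse a JSON object and convert the known timestamp fields."""
    event = json.loads(text)
    for field in event.keys() & DATETIME_FIELDS:
        event[field] = datetime.fromisoformat(event[field])
    return event

event = parse_event(
    '{"name": "deploy", "created_at": "2018-10-10T07:00:00+00:00"}'
)
print(type(event["created_at"]).__name__, event["created_at"].year)
```

Fields that are not listed in the schema, such as name here, stay plain strings.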

6.5 CSV

The CSV format has a few advantages. It is constrained; it always represents scalar types in a two-dimensional array. For this reason, there are not a lot of surprises that can go in. In addition, it is a format that imports natively into spreadsheet applications like Microsoft Excel or Google Sheets, which comes in handy when preparing reports.

Examples of such reports are breaking down expenses for paying for third-party services for the financial department or a report on incidents managed and time to recovery for management. In all these cases, having a format that is easy to produce and import into spreadsheet applications allows for easy automation of the task.

Writing CSV files is done with csv.writer. A typical example involves serializing a homogeneous array: an array of things with the same type.

import attr

@attr.s(frozen=True, auto_attribs=True)
class LoginAttempt:
    username: str
    time_stamp: int
    success: bool

This class represents a login attempt by some user at a given time and with a record of the attempt’s success. For a security audit, you need to send the auditors an Excel file of the login attempts.

import csv

def write_attempts(attempts, fname):
    # newline='' lets the csv module control the line endings itself
    with open(fname, 'w', newline='') as fpout:
        writer = csv.writer(fpout)
        writer.writerow(['Username', 'Timestamp', 'Success'])
        for attempt in attempts:
            writer.writerow([
                attempt.username,
                attempt.time_stamp,
                str(attempt.success),
            ])

Note that by convention, the first row should be a title row. Though the Python API does not enforce it, it is highly recommended to follow this convention. In this example, you first wrote a title row with the names of the fields.

Then you looped through the attempts. Note that CSV can only represent strings and numbers, so instead of relying on thinly documented standards on how a boolean is written, you do so explicitly.

This way, if the auditor asks for that field to be yes/no, you can change the explicit serialization step to match. When it comes to reading CSV files, there are two main approaches.

Using csv.reader returns an iterator that yields each parsed row as a list. However, assuming the convention about the first row being the names of fields has been followed, csv.DictReader consumes the first row as the field names and yields a dictionary for every subsequent row, using the field names as keys. This enables more robust parsing in the face of end users adding fields or changing their order.

>>> fileobj = open("data.csv")
>>> reader = csv.DictReader(fileobj)
>>> list(reader)
[OrderedDict([('Username', 'alice'),
              ('Timestamp', '1514793600.0'),
              ('Success', 'False')]),
 OrderedDict([('Username', 'bob'),
              ('Timestamp', '1539154800.0'),
              ('Success', 'True')])]

Reading the same CSV you wrote in the previous example yields reasonable results. The dictionary maps the field names to the values. It is important to note that the types have all been forgotten, and everything is returned as a string. Unfortunately, CSV does not keep type information.
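The robustness that csv.DictReader gives against added or reordered columns can be sketched with two in-memory files. The data and the Comment column are invented for the example, and io.StringIO stands in for a real file object.

```python
import csv
import io

original = "Username,Timestamp,Success\nalice,1514793600,False\n"
# Same record after someone reordered the columns and added a new one.
reordered = ("Success,Comment,Username,Timestamp\n"
             "False,checked,alice,1514793600\n")

def usernames(text):
    """Look fields up by name, so column positions do not matter."""
    return [row["Username"] for row in csv.DictReader(io.StringIO(text))]

before = usernames(original)
after = usernames(reordered)
print(before, after)   # ['alice'] ['alice']
```

Code based on csv.reader and list positions would have read the Success column as the username in the second file.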

It is sometimes tempting to just improvise parsing CSV files with .split. However, CSV has quite a few corner cases that are not readily apparent.

For example,

1,"Miami, FL","he""llo"

is properly parsed as

('1', 'Miami, FL', 'he"llo')

For the same reason, it is a good idea to avoid writing CSV files using anything other than csv.writer.
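To make the corner case concrete, the following sketch runs the same line through a naive split and through the csv module, and then writes the parsed row back out with csv.writer.

```python
import csv
import io

line = '1,"Miami, FL","he""llo"'

naive = line.split(",")              # breaks inside the quoted field
parsed = next(csv.reader([line]))    # understands quoting and doubled quotes

# Writing goes through the same quoting rules in the other direction.
buffer = io.StringIO()
csv.writer(buffer).writerow(parsed)
written = buffer.getvalue()

print(naive)     # ['1', '"Miami', ' FL"', '"he""llo"']
print(parsed)    # ['1', 'Miami, FL', 'he"llo']
print(written == line + "\r\n")   # True
```

The writer adds quotes only where they are needed and ends the row with its default line terminator.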

6.6 Summary

Much of the content needed for many DevOps tasks arrives as text: logs, JSON dumps of data structures, or a CSV file of paid licenses. Understanding what text is and how to manipulate it in Python allows much of the automation that is the cornerstone of DevOps, be it through build automation, monitoring result analysis, or just preparing summaries for easy consumption by others.
