19. Text and Language

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 19. Text and Language

“See Jack Hack. Hack, Jack, Hack”

In one form or another, processing text-based information is one of the more common tasks that applications need to perform. This can include anything from scanning a text file by columns to analyzing statements in a language defined by a formal grammar. Such processing usually is called parsing—analyzing the structure of a text string. In this chapter, we’ll explore ways to handle language and text-based information and summarize some Python development concepts in sidebars along the way. In the process, we’ll meet string methods, text pattern matching, XML and HTML parsers, and other tools.

Some of this material is advanced, but the examples are small to keep this chapter short. For instance, recursive descent parsing is illustrated with a simple example to show how it can be implemented in Python. We’ll also see that it’s often unnecessary to write custom parsers for each language processing task in Python. They can usually be replaced by exporting APIs for use in Python programs, and sometimes by a single built-in function call. Finally, this chapter closes by presenting PyCalc—a calculator GUI written in Python, and the last major Python coding example in this text. As we’ll see, writing calculators isn’t much more difficult than juggling stacks while scanning text.

Strategies for Processing Text in Python

In the grand scheme of things, there are a variety of ways to handle text processing and language analysis in Python:

Expressions: Built-in string object expressions
Methods: Built-in string object method calls
Patterns: Regular expression pattern matching
Parsers: markup: XML and HTML text parsing
Parsers: grammars: Custom language parsers, both handcoded and generated
Embedding: Running Python code with eval and exec built-ins
And more: Natural language processing

For simpler tasks, Python’s built-in string object is often all we really need. Python strings can be indexed, concatenated, sliced, and processed with both string method calls and built-in functions. Our main emphasis in this chapter is mostly on higher-level tools and techniques for analyzing textual information and language, but we’ll briefly explore each of these techniques in turn. Let’s get started.

Note

Some readers may have come to this chapter seeking coverage of Unicode text, too, but this topic is not presented here. For a look at Python’s Unicode support, see Chapter 2’s discussion of string tools, Chapter 4’s discussion of text and binary file distinctions and encodings, and Chapter 9’s coverage of text in tkinter GUIs. Unicode also appears in various Internet and database topics throughout this book (e.g., email encodings).

Because Unicode is a core language topic, all these chapters will also refer you to the fuller coverage of Unicode in Learning Python, Fourth Edition. Most of the topics in this chapter, including string methods and pattern matching, apply to Unicode automatically simply because the Python 3.X str string type is Unicode, whether ASCII or wider.

String Method Utilities

The first stop on our text and language tour is the most basic: Python’s string objects come with an array of text processing tools, and serve as your first line of defense in this domain. As you undoubtedly know by now, concatenation, slicing, formatting, and other string expressions are workhorses of most programs (I’m including the newer format method in this category, as it’s really just an alternative to the % expression):

>>> 'spam eggs ham'[5:10]                     # slicing: substring extraction
'eggs '
>>> 'spam ' + 'eggs ham'                      # concatenation (and *, len(), [ix])
'spam eggs ham'
>>> 'spam %s %s' % ('eggs',  'ham')           # formatting expression: substitution
'spam eggs ham'
>>> 'spam {} {}'.format('eggs',  'ham')       # formatting method: % alternative
'spam eggs ham'

>>> 'spam = "%-5s", %+06d' % ('ham',  99)     # more complex formatting
'spam = "ham  ", +00099'
>>> 'spam = "{0:<5}", {1:+06}'.format('ham',  99)
'spam = "ham  ", +00099'

These operations are covered in core language resources such as Learning Python. For the purposes of this chapter, though, we’re interested in more powerful tools: Python’s string object methods include a wide variety of text-processing utilities that go above and beyond string expression operators. We saw some of these in action early on in Chapter 2, and have been using them ever since. For instance, given an instance str of the built-in string object type, operations like the following are provided as object method calls:

str.find(substr): Performs substring searches
str.replace(old, new): Performs substring substitutions
str.split(delimiter): Chops up a string around a delimiter or whitespace
str.join(iterable): Puts substrings together with delimiters between
str.strip(): Removes leading and trailing whitespace
str.rstrip(): Removes trailing whitespace only, if any
str.rjust(width): Right-justifies a string in a fixed-width field
str.upper(): Converts to uppercase
str.isupper(): Tests whether the string is uppercase
str.isdigit(): Tests whether the string is all digit characters
str.endswith(substr-or-tuple): Tests for a substring (or a tuple of alternatives) at the end
str.startswith(substr-or-tuple): Tests for a substring (or a tuple of alternatives) at the front

This list is representative but partial, and some of these methods take additional optional arguments. For the full list of string methods, run a dir(str) call at the Python interactive prompt and run help(str.method) on any method for some quick documentation. The Python library manual and reference books such as Python Pocket Reference also include an exhaustive list.

Moreover, in Python today all normal string methods apply to both bytes and str strings. The latter makes them applicable to arbitrarily encoded Unicode text, simply because the str type is Unicode text, even if it’s only ASCII. These methods originally appeared as function in the string module, but are only object methods today; the string module is still present because it contains predefined constants (e.g., string.ascii_uppercase), as well as the Template substitution interface in 2.4 and later—one of the techniques discussed in the next section.

Templating with Replacements and Formats

By way of review, let’s take a quick look at string methods in the context of some of their most common use cases. As we saw when generating HTML forwarding pages in Chapter 6, the string replace method is often adequate by itself as a string templating tool—we can compute values and insert them at fixed positions in a string with simple replacement calls:

>>> template = '---$target1---$target2---'
>>> val1 = 'Spam'
>>> val2 = 'shrubbery'
>>> template = template.replace('$target1', val1)
>>> template = template.replace('$target2', val2)
>>> template
'---Spam---shrubbery---'

As we also saw when generating HTML reply pages in the CGI scripts of Chapters 15 and 16, the string % formatting operator is also a powerful templating tool, especially when combined with dictionaries—simply fill out a dictionary with values and apply multiple substitutions to the HTML string all at once:

>>> template = """
... ---
... ---%(key1)s---
... ---%(key2)s---
... """
>>>
>>> vals = {}
>>> vals['key1'] = 'Spam'
>>> vals['key2'] = 'shrubbery'
>>> print(template % vals)

---
---Spam---
---shrubbery---

Beginning with Python 2.4, the string module’s Template feature is essentially a simplified and limited variation of the dictionary-based format scheme just shown, but it allows some additional call patterns which some may consider simpler:

>>> vals
{'key2': 'shrubbery', 'key1': 'Spam'}

>>> import string
>>> template = string.Template('---$key1---$key2---')
>>> template.substitute(vals)
'---Spam---shrubbery---'

>>> template.substitute(key1='Brian', key2='Loretta')
'---Brian---Loretta---'

See the library manual for more on this extension. Although the string datatype does not itself support the pattern-directed text processing that we’ll meet later in this chapter, its tools are powerful enough for many tasks.

Parsing with Splits and Joins

In terms of this chapter’s main focus, Python’s built-in tools for splitting and joining strings around tokens turn out to be especially useful when it comes to parsing text:

str.split(delimiter?, maxsplits?): Splits a string into a list of substrings, using either whitespace (tabs, spaces, newlines) or an explicitly passed string as a delimiter. maxsplits limits the number of splits performed, if passed.
delimiter.join(iterable): Concatenates a sequence or other iterable of substrings (e.g., list, tuple, generator), adding the subject separator string between each.

These two are among the most powerful of string methods. As we saw in Chapter 2, split chops a string into a list of substrings and join puts them back together:

>>> 'A B C D'.split()
['A', 'B', 'C', 'D']
>>> 'A+B+C+D'.split('+')
['A', 'B', 'C', 'D']
>>> '--'.join(['a', 'b', 'c'])
'a--b--c'

Despite their simplicity, they can handle surprisingly complex text-parsing tasks. Moreover, string method calls are very fast because they are implemented in C language code. For instance, to quickly replace all tabs in a file with four periods, pipe the file into a script that looks like this:

from sys import *
stdout.write(('.' * 4).join(stdin.read().split('	')))

The split call here divides input around tabs, and the join puts it back together with periods where tabs had been. In this case, the combination of the two calls is equivalent to using the simpler global replacement string method call as follows:

stdout.write(stdin.read().replace('	', '.' * 4))

As we’ll see in the next section, splitting strings is sufficient for many text-parsing goals.

Summing Columns in a File

Let’s look next at some practical applications of string splits and joins. In many domains, scanning files by columns is a fairly common task. For instance, suppose you have a file containing columns of numbers output by another system, and you need to sum each column’s numbers. In Python, string splitting is the core operation behind solving this problem, as demonstrated by Example 19-1. As an added bonus, it’s easy to make the solution a reusable tool in Python by packaging it as an importable function.

Example 19-1. PP4ELangsummer.py

#!/usr/local/bin/python

def summer(numCols, fileName):
    sums = [0] * numCols                             # make list of zeros
    for line in open(fileName):                      # scan file's lines
        cols = line.split()                          # split up columns
        for i in range(numCols):                     # around blanks/tabs
            sums[i] += eval(cols[i])                 # add numbers to sums
    return sums

if __name__ == '__main__':
    import sys
    print(summer(eval(sys.argv[1]), sys.argv[2]))    # '% summer.py cols file'

Notice that we use file iterators here to read line by line, instead of calling the file readlines method explicitly (recall from Chapter 4 that iterators avoid loading the entire file into memory all at once). The file itself is a temporary object, which will be automatically closed when garbage collected.

As usual for properly architected scripts, you can both import this module and call its function, and run it as a shell tool from the command line. The summer.py script calls split to make a list of strings representing the line’s columns, and eval to convert column strings to numbers. Here’s an input file that uses both blanks and tabs to separate columns, and the result of turning our script loose on it:

C:...PP4ELang> type table1.txt
1       5       10    2   1.0
2       10      20    4   2.0
3       15      30    8    3
4       20      40   16   4.0

C:...PP4ELang> python summer.py 5 table1.txt
[10, 50, 100, 30, 10.0]

Also notice that because the summer script uses eval to convert file text to numbers, you could really store arbitrary Python expressions in the file. Here, for example, it’s run on a file of Python code snippets:

C:...PP4ELang> type table2.txt
2     1+1          1<<1           eval("2")
16    2*2*2*2      pow(2,4)       16.0
3     len('abc')   [1,2,3][2]     {'spam':3}['spam']

C:...PP4ELang> python summer.py 4 table2.txt
[21, 21, 21, 21.0]

Summing with zips and comprehensions

We’ll revisit eval later in this chapter, when we explore expression evaluators. Sometimes this is more than we want—if we can’t be sure that the strings that we run this way won’t contain malicious code, for instance, it may be necessary to run them with limited machine access or use more restrictive conversion tools. Consider the following recoding of the summer function (this is in file summer2.py in the examples package if you care to experiment with it):

def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName):                     # use file iterators
        cols = line.split(',')                      # assume comma-delimited
        nums = [int(x) for x in cols]               # use limited converter
        both = zip(sums, nums)                      # avoid nested for loop
        sums = [x + y for (x, y) in both]           # 3.X: zip is an iterable
    return sums

This version uses int for its conversions from strings to support only numbers, and not arbitrary and possibly unsafe expressions. Although the first four lines of this coding are similar to the original, for variety this version also assumes the data is separated by commas rather than whitespace, and runs list comprehensions and zip to avoid the nested for loop statement. This version is also substantially trickier than the original and so might be less desirable from a maintenance perspective. If its code is confusing, try adding print call statements after each step to trace the results of each operation. Here is its handiwork:

C:...PP4ELang> type table3.txt
1,5,10,2,1
2,10,20,4,2
3,15,30,8,3
4,20,40,16,4

C:...PP4ELang> python summer2.py 5 table3.txt
[10, 50, 100, 30, 10]

Summing with dictionaries

The summer logic so far works, but it can be even more general— by making the column numbers a key of a dictionary rather than an offset in a list, we can remove the need to pass in a number-columns value altogether. Besides allowing us to associate meaningful labels with data rather than numeric positions, dictionaries are often more flexible than lists in general, especially when there isn’t a fixed size to our problem. For instance, suppose you need to sum up columns of data stored in a text file where the number of columns is not known or fixed:

C:...PP4ELang> python
>>> print(open('table4.txt').read())
001.1 002.2 003.3
010.1 020.2 030.3 040.4
100.1 200.2 300.3

Here, we cannot preallocate a fixed-length list of sums because the number of columns may vary. Splitting on whitespace extracts the columns, and float converts to numbers, but a fixed-size list won’t easily accommodate a set of sums (at least, not without extra code to manage its size). Dictionaries are more convenient here because we can use column positions as keys instead of using absolute offsets. The following code demonstrates interactively (it’s also in file summer3.py in the examples package):

>>> sums = {}
>>> for line in open('table4.txt'):
...     cols = [float(col) for col in line.split()]
...     for pos, val in enumerate(cols):
...        sums[pos] = sums.get(pos, 0.0) + val
...
>>> for key in sorted(sums):
...     print(key, '=', sums[key])
...
0 = 111.3
1 = 222.6
2 = 333.9
3 = 40.4

>>> sums
{0: 111.3, 1: 222.6, 2: 333.90000000000003, 3: 40.4}

Interestingly, most of this code uses tools added to Python over the years—file and dictionary iterators, comprehensions, dict.get, and the enumerate and sorted built-ins were not yet formed when Python was new. For related examples, also see the tkinter grid examples in Chapter 9 for another case of eval table magic at work. That chapter’s table sums logic is a variation on this theme, which obtains the number of columns from the first line of a data file and tailors its summations for display in a GUI.

Parsing and Unparsing Rule Strings

Splitting comes in handy for dividing text into columns, but it can also be used as a more general parsing tool—by splitting more than once on different delimiters, we can pick apart more complex text. Although such parsing can also be achieved with more powerful tools, such as the regular expressions we’ll meet later in this chapter, split-based parsing is simpler to code in quick prototypes, and may run faster.

For instance, Example 19-2 demonstrates one way that splitting and joining strings can be used to parse sentences in a simple language. It is taken from a rule-based expert system shell (holmes) that is written in Python and included in this book’s examples distribution (more on holmes in a moment). Rule strings in holmes take the form:

"rule <id> if <test1>, <test2>... then <conclusion1>, <conclusion2>..."

Tests and conclusions are conjunctions of terms (“,” means “and”). Each term is a list of words or variables separated by spaces; variables start with ?. To use a rule, it is translated to an internal form—a dictionary with nested lists. To display a rule, it is translated back to the string form. For instance, given the call:

rules.internal_rule('rule x if a ?x, b then c, d ?x')

the conversion in function internal_rule proceeds as follows:

string = 'rule x if a ?x, b then c, d ?x'
i = ['rule x', 'a ?x, b then c, d ?x']
t = ['a ?x, b', 'c, d ?x']
r = ['', 'x']
result = {'rule':'x', 'if':[['a','?x'], ['b']], 'then':[['c'], ['d','?x']]}

We first split around the if, then around the then, and finally around rule. The result is the three substrings that were separated by the keywords. Test and conclusion substrings are split around “,” first and spaces last. join is used later to convert back (unparse) to the original string for display. Example 19-2 is the concrete implementation of this scheme.

Example 19-2. PP4ELang ules.py

def internal_rule(string):
    i = string.split(' if ')
    t = i[1].split(' then ')
    r = i[0].split('rule ')
    return {'rule': r[1].strip(), 'if':internal(t[0]), 'then':internal(t[1])}

def external_rule(rule):
    return ('rule '    + rule['rule']           +
            ' if '     + external(rule['if'])   +
            ' then '   + external(rule['then']) + '.')

def internal(conjunct):
    res = []                                    # 'a b, c d'
    for clause in conjunct.split(','):          # -> ['a b', ' c d']
        res.append(clause.split())              # -> [['a','b'], ['c','d']]
    return res

def external(conjunct):
    strs = []
    for clause in conjunct:                     # [['a','b'], ['c','d']]
        strs.append(' '.join(clause))           # -> ['a b', 'c d']
    return ', '.join(strs)                      # -> 'a b, c d'

Today we could use list comprehensions and generator expressions to gain some conciseness here. The internal and external functions, for instance, could be recoded to simply (see file rules2.py):

def internal(conjunct):
    return [clause.split() for clause in conjunct.split(',')]

def external(conjunct):
    return ', '.join(' '.join(clause) for clause in conjunct)

to produce the desired nested lists and string by combining two steps into one. This form might run faster; we’ll leave it to the reader to decide whether it is more difficult to understand. As usual, we can test components of this module interactively:

>>> import rules
>>> rules.internal('a ?x, b')
[['a', '?x'], ['b']]

>>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
{'then': [['c'], ['d', '?x']], 'rule': 'x', 'if': [['a', '?x'], ['b']]}

>>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
>>> rules.external_rule(r)
'rule x if a ?x, b then c, d ?x.'

Parsing by splitting strings around tokens like this takes you only so far. There is no direct support for recursive nesting of components, and syntax errors are not handled very gracefully. But for simple language tasks like this, string splitting might be enough, at least for prototyping or experimental systems. You can always add a more robust rule parser later or reimplement rules as Python code or classes.

More on the holmes expert system shell

So how are these rules actually used? As mentioned, the rule parser we just met is part of the Python-coded holmes expert system shell. Holmes is an old system written around 1993 and before Python 1.0. It implemented both forward and backward chaining inference over rule sets. For example, the rule:

rule pylike if ?X likes coding, ?X likes spam then ?X likes Python

can be used both to prove whether someone likes Python (chaining backward, from “then” to “if”), and to deduce that someone likes Python from a set of known facts (chaining forward, from “if” to “then”). Deductions may span multiple rules, multiple clauses represent conjunctions, and rules that name the same conclusion represent alternatives. holmes also performs simple pattern-matching along the way to assign the variables that appear in rules (e.g., ?X), and it is able to explain both its deductions and questions.

Holmes also served as proof of concept for Python in general at a time when such evidence was still sometimes necessary, and at last check it still worked unchanged on Python 2.X platforms. However, its code no longer reflects modern Python best practice. Because of this, I no longer maintain this system today. In fact, it has suffered from bit rot so much and for so long that I’ve opted not to revisit it for this edition at all. Its original Python 0.X code is included in the book examples package, but it has not been ported to Python 3.X form, and does not accurately reflect Python as it exists today.

That is, holmes is an ex-system. It has ceased to be. And it won’t be discussed further here. For more modern AI tools and support for Python, search the Web. This is a fun field to explore, but holmes is probably best left to the foggy mists of Python prehistory (and souls brave enough to try the port).

If you care about performance, try to use the string object’s methods rather than things such as regular expressions whenever you can. Although this is a broad rule of thumb and can vary from release to release, some string methods may be faster because they have less work to do.

In fact, we can learn something from Python’s own history here. Today’s string object methods began life as Python-coded functions in the original string module. Due to their prevalence, they were later optimized by moving their implementation to the C language. When you imported string, it internally replaced most of its content with functions imported from the strop C extension module; strop methods were reportedly 100 to 1,000 times faster than their Python-coded equivalents at the time.

The result was dramatically faster performance for string client programs without impacting the interface. That is, string module clients became instantly faster without having to be modified for the new C-based module. Of course, these operations evolved further and were finally moved to string object methods, their only form today. But this reflects a common pattern in Python work. A similar migration path was applied to the pickle module we met in Chapter 17—the later cPickle recoding in Python 2.X and _pickle in 3.X are compatible but much faster.

This is a great lesson about Python development: modules can be coded quickly in Python at first and translated to C later for efficiency if required. Because the interface to Python and C extension modules is identical (both are imported modules with callable function attributes), C translations of modules are backward compatible with their Python prototypes. The only impact of the translation of such modules on clients usually is an improvement in performance.

There is normally no need to move every module to C for delivery of an application: you can pick and choose performance-critical modules (such as string and pickle) for translation and leave others coded in Python. Use the timing and profiling techniques discussed in Chapter 18 to isolate which modules will give the most improvement when translated to C. Once you do, the next chapter shows how to go about writing C-based extension modules.

Regular Expression Pattern Matching

Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks where the structure of your data is not so rigidly defined, Python provides regular expression matching utilities. Especially for the kinds of text associated with domains such as the Internet and databases today, this flexibility can be a powerful ally.

Regular expressions are simply strings that define patterns to be matched against other strings. Supply a pattern and a string and ask whether the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, matches not only give a yes/no answer, but also can pick out substrings as well.

Regular expression pattern strings can be complicated (let’s be honest—they can be downright gross to look at). But once you get the hang of them, they can replace larger handcoded string search routines—a single pattern string generally does the work of dozens of lines of manual string scanning code and may run much faster. They are a concise way to encode the expected structure of text and extract portions of it.

The re Module

In Python, regular expressions are not part of the syntax of the Python language itself, but they are supported by the re standard library module that you must import to use. The module defines functions for running matches immediately, compiling pattern strings into pattern objects, matching these objects against strings, and fetching matched substrings after a match. It also provides tools for pattern-based splitting, replacing, and so on.

The re module implements a rich regular expression pattern syntax that tries to be close to that used to code patterns in the Perl language (regular expressions are a feature of Perl worth emulating). For instance, re supports the notions of named groups; character classes; and nongreedy matches—regular expression pattern operators that match as few characters as possible (other operators always match the longest possible substring). The re module has also been optimized repeatedly, and in Python 3.X supports matching for both bytes byte-strings and str Unicode strings. The net effect is that Python’s pattern support uses Perl-like patterns, but is invoked with a different top-level module interface.

I need to point out up front that regular expressions are complex tools that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions, written by Jeffrey E. F. Friedl (O’Reilly), is a good next step to take. We won’t be able to cover pattern construction itself well enough here to turn you into an expert. Once you learn how to code patterns, though, the top-level interface for performing matches is straightforward. In fact, they are so easy to use that we’ll jump right into some live examples before getting into more details.

First Examples

There are two basic ways to kick off matches: through top-level function calls and via methods of precompiled pattern objects. The latter precompiled form is quicker if you will be applying the same pattern more than once—to all lines in a text file, for instance. To demonstrate, let’s do some matching on the following strings (see file re-interactive.txt for all the interactive code run in this section):

>>> text1 = 'Hello spam...World'
>>> text2 = 'Hello spam...other'

The match performed in the following code does not precompile: it executes an immediate match to look for all the characters between the words Hello and World in our text strings:

>>> import re
>>> matchobj = re.match('Hello(.*)World', text2)
>>> print(matchobj)
None

When a match fails as it does here (the text2 string doesn’t end in World), we get back the None object, which is Boolean false if tested in an if statement.

In the pattern string we’re using here in the first argument to re.match, the words Hello and World match themselves, and (.*) means any character (.) repeated zero or more times (*). The fact that it is enclosed in parentheses tells Python to save away the part of the string matched by that part of the pattern as a group—a matched substring available after the match. To see how, we need to make a match work:

>>> matchobj = re.match('Hello(.*)World', text1)
>>> print(matchobj)
<_sre.SRE_Match object at 0x009D6520>

>>> matchobj.group(1)
' spam...'

When a match succeeds, we get back a match object, which has interfaces for extracting matched substrings—the group(1) call returns the portion of the string matched by the first, leftmost, parenthesized portion of the pattern (our (.*)). As mentioned, matching is not just a yes/no answer; by enclosing parts of the pattern in parentheses, it is also a way to extract matched substrings. In this case, we’ve parsed out the text between Hello and World. Group number 0 is the entire string matched by the pattern—useful if you want to be sure your pattern is consuming all the text you think it is.

The interface for precompiling is similar, but the pattern is implied in the pattern object we get back from the compile call:

>>> pattobj  = re.compile('Hello(.*)World')
>>> matchobj = pattobj.match(text1)
>>> matchobj.group(1)
' spam...'

Again, you should precompile for speed if you will run the pattern multiple times, and you normally will when scanning files line by line. Here’s something a bit more complex that hints at the generality of patterns. This one allows for zero or more blanks or tabs at the front ([ ]*), skips one or more after the word Hello ([ ]+), captures characters in the middle ((.*)), and allows the final word to begin with an upper- or lowercase letter ([Ww]); as you can see, patterns can handle wide variations in data:

>>> patt = '[ 	]*Hello[ 	]+(.*)[Ww]orld'
>>> line = ' Hello   spamworld'
>>> mobj = re.match(patt, line)
>>> mobj.group(1)
'spam'

Notice that we matched a str pattern to a str string in the last listing. We can also match bytes to bytes in order to handle data such as encoded text, but we cannot mix the two string types (a constraint which is true in Python in general—Python wouldn’t have the encoding information needed to know how to convert between the raw bytes and the Unicode text):

>>> patt = b'[ 	]*Hello[ 	]+(.*)[Ww]orld'      # both as bytes works too
>>> line = b' Hello   spamworld'                 # and returns bytes groups
>>> re.match(patt, line).group(1)                # but cannot mix str/bytes
b'spam'

>>> re.match(patt, ' Hello spamworld')
TypeError: can't use a bytes pattern on a string-like object

>>> re.match('[ 	]*Hello[ 	]+(.*)[Ww]orld', line)
TypeError: can't use a string pattern on a bytes-like object

In addition to the tools these examples demonstrate, there are methods for scanning ahead to find a match (search), scanning to find all matches (findall), splitting and replacing on patterns, and so on. All have analogous module and precompiled call forms. The next section turns to a few examples to demonstrate more of the basics.

String Operations Versus Patterns

Notice how the preceding example skips optional whitespace and allows for uppercase or lowercase letters. This underscores why you may want to use patterns in the first place—they support more general kinds of text than string object methods can. Here’s another case in point: we’ve seen that string methods can split on and replace a substring, but they don’t suffice if the delimiter might be more than one alternative:

>>> 'aaa--bbb--ccc'.split('--')
['aaa', 'bbb', 'ccc']
>>> 'aaa--bbb--ccc'.replace('--', '...')        # string methods use fixed strings
'aaa...bbb...ccc'

>>> 'aaa--bbb==ccc'.split(['--', '=='])
TypeError: Can't convert 'list' object to str implicitly
>>> 'aaa--bbb==ccc'.replace(['--', '=='], '...')
TypeError: Can't convert 'list' object to str implicitly

Patterns can do similar work, but also can handle alternatives directly, by virtue of their pattern matching syntax. In the following, the syntax --|== matches either string -- or string ==; the syntax [-=] matches either the character - or = (a character set); and the form (?:) can be used to group nested parts of a pattern without forming a saved substring group (split treats groups specially):

>>> import re
>>> re.split('--', 'aaa--bbb--ccc')
['aaa', 'bbb', 'ccc']
>>> re.sub('--', '...', 'aaa--bbb--ccc')            # single string case
'aaa...bbb...ccc'

>>> re.split('--|==', 'aaa--bbb==ccc')              # split on -- or ==
['aaa', 'bbb', 'ccc']
>>> re.sub('--|==', '...', 'aaa--bbb==ccc')         # replace -- or ==
'aaa...bbb...ccc'

>>> re.split('[-=]', 'aaa-bbb=ccc')                 # single char alternative
['aaa', 'bbb', 'ccc']

>>> re.split('(--)|(==)', 'aaa--bbb==ccc')          # split includes groups
['aaa', '--', None, 'bbb', None, '==', 'ccc']
>>> re.split('(?:--)|(?:==)', 'aaa--bbb==ccc')      # expr part, not group
['aaa', 'bbb', 'ccc']

Similarly, splits can extract simple substrings for fixed delimiters, but patterns can also handle surrounding context like brackets, mark parts as optional, ignore whitespace, and more. In the next tests s* means zero or more whitespace characters (a character class); s+ means one or more of the same; /? matches an optional slash; [a-z] is any lowercase letter (a range);(.*?) means a saved substring of zero or more of any character again—but only as many as needed to match the rest of the pattern (nongreedily); and the groups method is used to fetch the substrings matched by the parenthesized subpatterns all at once:

>>> 'spam/ham/eggs'.split('/')
['spam', 'ham', 'eggs']

>>> re.match('(.*)/(.*)/(.*)', 'spam/ham/eggs').groups()
('spam', 'ham', 'eggs')

>>> re.match('<(.*)>/<(.*)>/<(.*)>', '<spam>/<ham>/<eggs>').groups()
('spam', 'ham', 'eggs')

>>> re.match('s*<(.*)>/?<(.*)>/?<(.*)>', '  <spam>/<ham><eggs>').groups()
('spam', 'ham', 'eggs')

>>> 'Hello pattern world!'.split()
['Hello', 'pattern', 'world!']
>>> re.match('Hellos*([a-z]*)s+(.*?)s*!', 'Hellopattern   world !').groups()
('pattern', 'world')

In fact, there’s more than one way to match. The findall method provides generality that leaves string objects in the dust—it locates all occurrences of a pattern and returns all the substrings it matched (or a list of tuples for multiple groups). The search method is similar but stops at the first match—it’s like match plus an initial forward scan. In the following, string object finds locate just one specific string, but patterns can be used to locate and extract bracketed text anywhere in a string, even pairs with optional text between:

>>> '<spam>/<ham>/<eggs>'.find('ham')               # find substring offset
8
>>> re.findall('<(.*?)>', '<spam>/<ham>/<eggs>')    # find all matches/groups
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>', '<spam> /  <ham><eggs>')
['spam', 'ham', 'eggs']

>>> re.findall('<(.*?)>/?<(.*?)>', '<spam>/<ham> ... <eggs><cheese>')
[('spam', 'ham'), ('eggs', 'cheese')]
>>> re.search('<(.*?)>/?<(.*?)>', 'todays menu: <spam>/<ham>...<eggs><s>').groups()
('spam', 'ham')

Especially when using findall, the (?s) operator comes in handy to force . to match end-of-line characters in multiline text; without it . matches everything except lines ends. The following searches look for two adjacent bracketed strings with arbitrary text between, with and without skipping line breaks:

>>> re.findall('<(.*?)>.*<(.*?)>', '<spam> 
  <ham>
<eggs>')        # stop at 

[]
>>> re.findall('(?s)<(.*?)>.*<(.*?)>', '<spam> 
  <ham>
<eggs>')    # greedy
[('spam', 'eggs')]
>>> re.findall('(?s)<(.*?)>.*?<(.*?)>', '<spam> 
  <ham>
<eggs>')   # nongreedy
[('spam', 'ham')]

To make larger patterns more mnemonic, we can even associate names with matched substring groups in using the <?P<name>) pattern syntax and fetch them by name after matches, though this is of limited utility for findall. The next tests look for strings of “word” characters (w) separated by a /—this isn’t much more than a string split, but parts are named, and search and findall both scan ahead:

>>> re.search('(?P<part1>w*)/(?P<part2>w*)', '...aaa/bbb/ccc]').groups()
('aaa', 'bbb')
>>> re.search('(?P<part1>w*)/(?P<part2>w*)', '...aaa/bbb/ccc]').groupdict()
{'part1': 'aaa', 'part2': 'bbb'}

>>> re.search('(?P<part1>w*)/(?P<part2>w*)', '...aaa/bbb/ccc]').group(2)
'bbb'
>>> re.search('(?P<part1>w*)/(?P<part2>w*)', '...aaa/bbb/ccc]').group('part2')
'bbb'

>>> re.findall('(?P<part1>w*)/(?P<part2>w*)', '...aaa/bbb ccc/ddd]')
[('aaa', 'bbb'), ('ccc', 'ddd')]

Finally, although basic string operations such as slicing and splits are sometimes enough, patterns are much more flexible. The following uses [^ ] to match any character not following the ^, and escapes a dash within a [] alternative set using - so it’s not taken to be a character set range separator. It runs equivalent slices, splits, and matches, along with a more general match that the other two cannot approach:

>>> line = 'aaa bbb ccc'
>>> line[:3], line[4:7], line[8:11]          # slice data at fixed offsets
('aaa', 'bbb', 'ccc')
>>> line.split()                             # split data with fixed delimiters
['aaa', 'bbb', 'ccc']

>>> re.split(' +', line)                     # split on general delimiters
['aaa', 'bbb', 'ccc']
>>> re.findall('[^ ]+', line)                # find non-delimiter runs
['aaa', 'bbb', 'ccc']

>>> line = 'aaa...bbb-ccc / ddd.-/e&e*e'     # handle generalized text
>>> re.findall('[^ .-/]+', line)
['aaa', 'bbb', 'ccc', 'ddd', 'e&e*e']

At this point, if you’ve never used pattern syntax in the past your head may very well be spinning (or have blown off entirely!). Before we go into any further examples, let’s dig into a few of the details underlying the re module and its patterns.

Using the re Module

The Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. For reference, the next few sections describe the module’s interfaces and some of the operators you can use to code patterns.

Module functions

The top level of the module provides functions for matching, substitution, precompiling, and so on:

compile(pattern [, flags]): Compile a regular expression pattern string into a regular expression pattern object, for later matching. See the reference manual or Python Pocket Reference for the flags argument’s meaning.
match(pattern, string [, flags]): If zero or more characters at the start of string match the pattern string, return a corresponding match object, or None if no match is found. Roughly like a search for a pattern that begins with the ^ operator.
search(pattern, string [, flags]): Scan through string for a location matching pattern, and return a corresponding match object, or None if no match is found.
findall(pattern, string [, flags]): Return a list of strings giving all nonoverlapping matches of pattern in string. If there are any groups in patterns, returns a list of groups, and a list of tuples if the pattern has more than one group.
finditer(pattern, string [, flags]): Return iterator over all nonoverlapping matches of pattern in string.
split(pattern, string [, maxsplit, flags]): Split string by occurrences of pattern. If capturing parentheses (()) are used in the pattern, the text of all groups in the pattern are also returned in the resulting list.
sub(pattern, repl, string [, count, flags]): Return the string obtained by replacing the (first count) leftmost nonoverlapping occurrences of pattern (a string or a pattern object) in string by repl (which may be a string with backslash escapes that may back-reference a matched group, or a function that is passed a single match object and returns the replacement string).
subn(pattern, repl, string [, count, flags]): Same as sub, but returns a tuple: (new-string, number-of-substitutions-made).
escape(string): Return string with all nonalphanumeric characters backslashed, such that they can be compiled as a string literal.

Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile have these sorts of attributes:

match(string [, pos] [,
                endpos])

search(string [, pos] [,
                endpos])

findall(string [, pos [,
                endpos]])

finditer(string [, pos [,
                endpos]])

split(string [,
                maxsplit])

sub(repl, string [,
                count])

subn(repl, string [,
                count])

These are the same as the re module functions, but the pattern is implied, and pos and endpos give start/end string indexes for the match.

Match objects

Finally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including:

group(g) group(g1, g2, ...): Return the substring that matched a parenthesized group (or groups) in the pattern. Accept group numbers or names. Group numbers start at 1; group 0 is the entire string matched by the pattern. Returns a tuple when passed multiple group numbers, and group number defaults to 0 if omitted.
groups(): Returns a tuple of all groups’ substrings of the match (for group numbers 1 and higher).
groupdict(): Returns a dictionary containing all named groups of the match (see (?P<name>R) syntax ahead).
start([group]) end([group]): Indices of the start and end of the substring matched by group (or the entire matched string, if no group is passed).
span([group]): Returns the two-item tuple: (start(group), end(group)).
expand([template]): Performs backslash group substitutions; see the Python library manual.

Regular expression patterns

Regular expression strings are built up by concatenating single-character regular expression forms, shown in Table 19-1. The longest-matching string is usually matched by each form, except for the nongreedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.

Table 19-1. re pattern syntax

Operator	Interpretation
`.`	Matches any character (including newline if `DOTALL` flag is specified or `(?s)` at pattern front)
`^`	Matches start of the string (of every line in `MULTILINE` mode)
`$`	Matches end of the string (of every line in `MULTILINE` mode)
`C`	Any nonspecial (or backslash-escaped) character matches itself
`R*`	Zero or more of preceding regular expression `R` (as many as possible)
`R+`	One or more of preceding regular expression `R` (as many as possible)
`R?`	Zero or one occurrence of preceding regular expression `R` (optional)
`R{m}`	Matches exactly `m` copies preceding `R`: `a{5}` matches `'aaaaa'`
`R{m,n}`	Matches from `m` to `n` repetitions of preceding regular expression `R`
`R*?, R+?, R??, R{m,n}?`	Same as ``, `+`, and `?` but matches as few characters/times as possible; these are known as nongreedy* match operators (unlike others, they match and consume as few characters as possible)
`[...]`	Defines character set: e.g., `[a-zA-Z]` to match all letters (alternatives, with `-` for ranges)
`[^...]`	Defines complemented character set: matches if `char` is not in set
	Escapes special `char`s (e.g., `*?+\|()`) and introduces special sequences in Table 19-2
`\`	Matches a literal (write as `\\` in pattern, or use `r''`)
`N`	Matches the contents of the group of the same number N: `'(.+) 1'` matches “42 42”
`R\|R`	Alternative: matches left or right `R`
`RR`	Concatenation: match both `R`s
`(R)`	Matches any regular expression inside `()`, and delimits a group (retains matched substring)
`(?:R)`	Same as `(R)` but simply delimits part R and does not denote a saved group
`(?=R)`	Look-ahead assertion: matches if `R` matches next, but doesn’t consume any of the string (e.g., `X (?=Y)` matches `X` only if followed by `Y`)
`(?!R)`	Matches if `R` doesn’t match next; negative of `(?=R)`
`(?P<name>R)`	Matches any regular expression inside `()`, and delimits a named group
`(?P=name)`	Matches whatever text was matched by the earlier group named `name`
`(?#...)`	A comment; ignored
`(?letter)`	Set mode flag; `letter` is one of `aiLmsux` (see the library manual)
`(?<=R)`	Look-behind assertion: matches if the current position in the string is preceded by a match of `R` that ends at the current position
`(?<!R)`	Matches if the current position in the string is not preceded by a match for `R`; negative of `(?<= R)`
`(?(id/name)yespattern\|nopattern)`	Will try to match with `yespattern` if the group with given `id` or `name` exists, else with optional `nopattern`

Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [ ]* matches zero or more tabs and spaces (i.e., it skips such whitespace).

The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It’s available through the group method of a match object after a successful match.

In addition to the entries in this table, special sequences in Table 19-2 can be used in patterns, too. Because of Python string rules, you sometimes must double up on backslashes (\) or use Python raw strings (r'...') to retain backslashes in the pattern verbatim. Python ignores backslashes in normal strings if the letter following the backslash is not recognized as an escape code. Some of the entries in Table 19-2 are affected by Unicode when matching str instead of bytes, and an ASCII flag may be set to emulate the behavior for bytes; see Python’s manuals for more details.

Table 19-2. re special sequences

Sequence	Interpretation
`number`	Matches text of group `number` (numbered from 1)
`A`	Matches only at the start of the string
	Empty string at word boundaries
`B`	Empty string not at word boundaries
`d`	Any decimal digit character (`[0-9]` for ASCII)
`D`	Any nondecimal digit character (`[^O-9]` for ASCII)
`s`	Any whitespace character (`[ fv]` for ASCII)
`S`	Any nonwhitespace character (`[^ fv]` for ASCII)
`w`	Any alphanumeric character (`[a-zA-Z0-9_]` for ASCII)
`W`	Any nonalphanumeric character (`[^a-zA-Z0-9_]` for ASCII )
	Matches only at the end of the string

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser: a, , f, , , , v, x, and \. The Python library manual gives these escapes’ interpretation and additional details on pattern syntax in general. But to further demonstrate how the re pattern syntax is typically used in scripts, let’s go back to writing some code.

More Pattern Examples

For more context, the next few examples present short test files that match simple but representative pattern forms. Comments in Example 19-3 describe the operations exercised; check Table 19-1 to see which operators are used in these patterns. If they are still confusing, try running these tests interactively, and call group(0) instead of start() to see which strings are being matched by the patterns.

Example 19-3. PP4ELang e-basics.py

"""
literals, sets, ranges, alternatives, and escapes
all tests here print 2: offset where pattern found
"""

import re                                  # the one to use today

pattern, string = "A.C.", "xxABCDxx"       # nonspecial chars match themselves
matchobj = re.search(pattern, string)      # '.' means any one char
if matchobj:                               # search returns match object or None
    print(matchobj.start())                # start is index where matched

pattobj  = re.compile("A.*C.*")            # 'R*' means zero or more Rs
matchobj = pattobj.search("xxABCDxx")      # compile returns pattern obj
if matchobj:                               # patt.search returns match obj
    print(matchobj.start())

# selection sets
print(re.search(" *A.C[DE][D-F][^G-ZE]G	+ ?", "..ABCDEFG	..").start())

# alternatives: R1|R2 means R1 or R2
print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").start())       # test each char
print(re.search("(?:A|X)(?:B|Y)(?:C|Z)D", "..AYCD..").start()) # same, not saved
print(re.search("A|XB|YC|ZD", "..AYCD..").start())             # matches just A!
print(re.search("(A|XB|YC|ZD)YCD", "..AYCD..").start())        # just first char

# word boundaries
print(re.search(r"ABCD", "..ABCD ").start())     #  means word boundary
print(re.search(r"ABCD", "..ABCD ").start())     # use r'...' to escape ''

Notice again that there are different ways to kick off a match with re: by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print call statements in this script show a result of 2—the offset where the pattern was found in the string. In the first test, for example, A.C. matches the ABCD at offset 2 in the search string (i.e., after the first xx):

C:...PP4ELang> python re-basic.py
2
...8 more 2s omitted...

Next, in Example 19-4, parts of the pattern strings enclosed in parentheses delimit groups; the parts of the string they matched are available after the match.

Example 19-4. PP4ELang e-groups.py

"""
groups: extract substrings matched by REs in '()' parts
groups are denoted by position, but (?P<name>R) can also name them
"""

import re

patt = re.compile("A(.)B(.)C(.)")                  # saves 3 substrings
mobj = patt.match("A0B1C2")                        # each '()' is a group, 1..n
print(mobj.group(1), mobj.group(2), mobj.group(3)) # group() gives substring

patt = re.compile("A(.*)B(.*)C(.*)")               # saves 3 substrings
mobj = patt.match("A000B111C222")                  # groups() gives all groups
print(mobj.groups())

print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups())
print(re.search("(?P<a>A|X)(?P<b>B|Y)(?P<c>C|Z)D", "..AYCD..").groupdict())

patt = re.compile(r"[	 ]*#s*defines*([a-z0-9_]*)s*(.*)")
mobj = patt.search(" # define  spam  1 + 2 + 3")            # parts of C #define
print(mobj.groups())                                        # s is whitespace

In the first test here, for instance, the three (.) groups each match a single character, but they retain the character matched; calling group pulls out the character matched. The second test’s (.*) groups match and retain any number of characters. The third and fourth tests shows how alternatives can be grouped by both position and name, and the last test matches C #define lines—more on this pattern in a moment:

C:...PP4ELang> python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
{'a': 'A', 'c': 'C', 'b': 'Y'}
('spam', '1 + 2 + 3')

Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 19-5).

Example 19-5. PP4ELang e-subst.py

"substitutions: replace occurrences of pattern in string"

import re
print(re.sub('[ABC]', '*', 'XAXAXBXBXCXC'))
print(re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_'))        # alternatives char + _

print(re.sub('(.) spam', 'spam\1',  'x spam, y spam'))   # group back ref (or r'')

def mapper(matchobj):
    return 'spam' + matchobj.group(1)

print(re.sub('(.) spam', mapper, 'x spam, y spam'))       # mapping function

In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore. The last two tests illustrate more advanced group back-references and mapping functions in the replacement. Note the \1 required to escape 1 for Python’s string rules; r'spam1' would work just as well. See also the earlier interactive tests in the section for additional substitution and splitting examples:

C:...PP4ELang> python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*
spamx, spamy
spamx, spamy

Scanning C Header Files for Patterns

To wrap up, let’s turn to a more realistic example: the script in Example 19-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match.

Example 19-6. PP4ELangcheader.py

"Scan C header files to extract parts of #define and #include lines"

import sys, re
pattDefine = re.compile(                             # compile to pattobj
    '^#[	 ]*define[	 ]+(w+)[	 ]*(.*)')           # "# define xxx yyy..."
                                                     # w like [a-zA-Z0-9_]
pattInclude = re.compile(
    '^#[	 ]*include[	 ]+[<"]([w./]+)')           # "# include <xxx>..."

def scan(fileobj):
    count = 0
    for line in fileobj:                             # scan by lines: iterator
        count += 1
        matchobj = pattDefine.match(line)            # None if match fails
        if matchobj:
            name = matchobj.group(1)                 # substrings for (...) parts
            body = matchobj.group(2)
            print(count, 'defined', name, '=', body.strip())
            continue
        matchobj = pattInclude.match(line)
        if matchobj:
            start, stop = matchobj.span(1)           # start/stop indexes of (...)
            filename = line[start:stop]              # slice out of line
            print(count, 'include', filename)        # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))       # arg: input filename

To test, let’s run this script on the text file in Example 19-7.

Example 19-7. PP4ELang est.h

#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#include <lib/spam.h>
#  include   "Python.h"

#define DEBUG
#define HELLO 'hello regex world'
#  define SPAM    1234

#define EGGS sunny + side + up
#define  ADDER(arg) 123 + arg
#endif

Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work; picking out #include and #define lines and their parts. For each matched line, it prints the line number, the line type, and any matched substrings:

C:...PP4ELang> python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg

For an additional example of regular expressions at work, see the file pygrep1.py in the book examples package; it implements a simple pattern-based “grep” file search utility, but was cut here for space. As we’ll see, we can also sometimes use regular expressions to parse information from XML and HTML text—the topics of the next section.

XML and HTML Parsing

Beyond string objects and regular expressions, Python ships with support for parsing some specific and commonly used types of formatted text. In particular, it provides precoded parsers for XML and HTML which we can deploy and customize for our text processing goals.

In the XML department, Python includes parsing support in its standard library and plays host to a prolific XML special-interest group. XML (for eXtensible Markup Language) is a tag-based markup language for describing many kinds of structured data. Among other things, it has been adopted in roles such as a standard database and Internet content representation in many contexts. As an object-oriented scripting language, Python mixes remarkably well with XML’s core notion of structured document interchange.

XML is based upon a tag syntax familiar to web page writers, used to describe and package data. The xml module package in Python’s standard library includes tools for parsing this data from XML text, with both the SAX and the DOM standard parsing models, as well as the Python-specific ElementTree package. Although regular expressions can sometimes extract information from XML documents, too, they can be easily misled by unexpected text, and don’t directly support the notion of arbitrarily nested XML constructs (more on this limitation later when we explore languages in general).

In short, SAX parsers provide a subclass with methods called during the parsing operation, and DOM parsers are given access to an object tree representing the (usually) already parsed document. SAX parsers are essentially state machines and must record (and possibly stack) page details as the parse progresses; DOM parsers walk object trees using loops, attributes, and methods defined by the DOM standard. ElementTree is roughly a Python-specific analog of DOM, and as such can often yield simpler code; it can also be used to generate XML text from their object-based representations.

Beyond these parsing tools, Python also ships with an xmlrpc package to support the client and server sides of the XML-RPC protocol (remote procedure calls that transmit objects encoded as XML over HTTP), as well as a standard HTML parser, html.parser, that works on similar principles and is presented later in this chapter. The third-party domain has even more XML-related tools; most of these are maintained separately from Python to allow for more flexible release schedules. Beginning with Python 2.3, the Expat parser is also included as the underlying engine that drives the parsing process.

XML Parsing in Action

XML processing is a large, evolving topic, and it is mostly beyond the scope of this book. For an example of a simple XML parsing task, though, consider the XML file in Example 19-8. This file defines a handful of O’Reilly Python books—ISBN numbers as attributes, and titles, publication dates, and authors as nested tags (with apologies to Python books not listed in this completely random sample—there are many!).

Example 19-8. PP4ELangXmlooks.xml

<catalog>
    <book isbn="0-596-00128-2">
        <title>Python &amp; XML</title>
        <date>December 2001</date>
        <author>Jones, Drake</author>
    </book>
    <book isbn="0-596-15810-6">
        <title>Programming Python, 4th Edition</title>
        <date>October 2010</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15806-8">
        <title>Learning Python, 4th Edition</title>
        <date>September 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15808-4">
        <title>Python Pocket Reference, 4th Edition</title>
        <date>October 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-00797-3">
        <title>Python Cookbook, 2nd Edition</title>
        <date>March 2005</date>
        <author>Martelli, Ravenscroft, Ascher</author>
    </book>
    <book isbn="0-596-10046-9">
        <title>Python in a Nutshell, 2nd Edition</title>
        <date>July 2006</date>
        <author>Martelli</author>
    </book>
    <!-- plus many more Python books that should appear here -->
</catalog>

Let’s quickly explore ways to extract this file’s book ISBN numbers and corresponding titles by example, using each of the four primary Python tools at our disposal—patterns, SAX, DOM, and ElementTree.

Regular expression parsing

In some contexts, the regular expressions we met earlier can be used to parse information from XML files. They are not complete parsers, and are not very robust or accurate in the presence of arbitrary text (text in tag attributes can especially throw them off). Where applicable, though, they offer a simple option. Example 19-9 shows how we might go about parsing the XML file in Example 19-8 with the prior section’s re module. Like all four examples in this section, it scans the XML file looking at ISBN numbers and associated titles, and stores the two as keys and values in a Python dictionary.

Example 19-9. PP4ELangXml ebook.py

"""
XML parsing: regular expressions (no robust or general)
"""

import re, pprint
text    = open('books.xml').read()                        # str if str pattern
pattern = '(?s)isbn="(.*?)".*?<title>(.*?)</title>'       # *?=nongreedy
found   = re.findall(pattern, text)                       # (?s)=dot matches /n
mapping = {isbn: title for (isbn, title) in found}        # dict from tuple list
pprint.pprint(mapping)

When run, the re.findall method locates all the nested tags we’re interested in, extracts their content, and returns a list of tuples representing the two parenthesized groups in the pattern. Python’s pprint module displays the dictionary created by the comprehension nicely. The extract works, but only as long as the text doesn’t deviate from the expected pattern in ways that would invalidate our script. Moreover, the XML entity for “&” in the first book’s title is not un-escaped automatically:

C:...PP4ELangXml> python rebook.py
{'0-596-00128-2': 'Python &amp; XML',
 '0-596-00797-3': 'Python Cookbook, 2nd Edition',
 '0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
 '0-596-15806-8': 'Learning Python, 4th Edition',
 '0-596-15808-4': 'Python Pocket Reference, 4th Edition',
 '0-596-15810-6': 'Programming Python, 4th Edition'}

SAX parsing

To do better, Python’s full-blown XML parsing tools let us perform this data extraction in a more accurate and robust way. Example 19-10, for instance, defines a SAX-based parsing procedure: its class implements callback methods that will be called during the parse, and its top-level code creates and runs a parser.

Example 19-10. PP4ELangXmlsaxbook.py

"""
XML parsing: SAX is a callback-based API for intercepting parser events
"""

import xml.sax, xml.sax.handler, pprint

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = False                        # handle XML parser events
        self.mapping = {}                           # a state machine model

    def startElement(self, name, attributes):
        if name == "book":                          # on start book tag
            self.buffer = ""                        # save ISBN for dict key
            self.isbn = attributes["isbn"]
        elif name == "title":                       # on start title tag
            self.inTitle = True                     # save title text to follow

    def characters(self, data):
        if self.inTitle:                            # on text within tag
            self.buffer += data                     # save text if in title

    def endElement(self, name):
      if name == "title":
          self.inTitle = False                      # on end title tag
          self.mapping[self.isbn] = self.buffer     # store title text in dict

parser  = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse('books.xml')
pprint.pprint(handler.mapping)

The SAX model is efficient, but it is potentially confusing at first glance, because the class must keep track of where the parse currently is using state information. For example, when the title tag is first detected, we set a state flag and initialize a buffer; as each character within the title tag is parsed, we append it to the buffer until the ending portion of the title tag is encountered. The net effect saves the title tag’s content as a string. This model is simple, but can be complex to manage; in cases of potentially arbitrary nesting, for instance, state information may need to be stacked as the class receives callbacks for nested tags.

To kick off the parse, we make a parser object, set its handler to an instance of our class, and start the parse; as Python scans the XML file, our class’s methods are called automatically as components are encountered. When the parse is complete, we use the Python pprint module to display the result again—the mapping dictionary object attached to our handler. The result is the mostly the same this time, but notice that the “&” escape sequence is properly un-escaped now—SAX performs XML parsing, not text matching:

C:...PP4ELangXml> python saxbook.py
{'0-596-00128-2': 'Python & XML',
 '0-596-00797-3': 'Python Cookbook, 2nd Edition',
 '0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
 '0-596-15806-8': 'Learning Python, 4th Edition',
 '0-596-15808-4': 'Python Pocket Reference, 4th Edition',
 '0-596-15810-6': 'Programming Python, 4th Edition'}

DOM parsing

The DOM parsing model for XML is perhaps simpler to understand—we simply traverse a tree of objects after the parse—but it might be less efficient for large documents, if the document is parsed all at once ahead of time and stored in memory. DOM also supports random access to document parts via tree fetches, nested loops for known structures, and recursive traversals for arbitrary nesting; in SAX, we are limited to a single linear parse. Example 19-11 is a DOM-based equivalent to the SAX parser of the preceding section.

Example 19-11. PP4ELangXmldombook.py

"""
XML parsing: DOM gives whole document to the application as a traversable object
"""

import pprint
import xml.dom.minidom
from xml.dom.minidom import Node

doc = xml.dom.minidom.parse("books.xml")          # load doc into object
                                                  # usually parsed up front
mapping = {}
for node in doc.getElementsByTagName("book"):     # traverse DOM object
    isbn = node.getAttribute("isbn")              # via DOM object API
    L = node.getElementsByTagName("title")
    for node2 in L:
        title = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data
        mapping[isbn] = title

# mapping now has the same value as in the SAX example
pprint.pprint(mapping)

The output of this script is the same as what we generated interactively for the SAX parser; here, though, it is built up by walking the document object tree after the parse has finished using method calls and attributes defined by the cross-language DOM standard specification. This is both a strength and potential weakness of DOM—its API is language neutral, but it may seem a bit nonintuitive and verbose to some Python programmers accustomed to simpler models:

C:...PP4ELangXml> python dombook.py
{'0-596-00128-2': 'Python & XML',
 '0-596-00797-3': 'Python Cookbook, 2nd Edition',
 '0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
 '0-596-15806-8': 'Learning Python, 4th Edition',
 '0-596-15808-4': 'Python Pocket Reference, 4th Edition',
 '0-596-15810-6': 'Programming Python, 4th Edition'}

ElementTree parsing

As a fourth option, the popular ElementTree package is a standard library tool for both parsing and generating XML. As a parser, it’s essentially a more Pythonic type of DOM—it parses documents into a tree of objects again, but the API for navigating the tree is more lightweight, because it’s Python-specific.

ElementTree provides easy-to-use tools for parsing, changing, and generating XML documents. For both parsing and generating, it represents documents as a tree of Python “element” objects. Each element in the tree has a tag name, attribute dictionary, text value, and sequence of child elements. The element object produced by a parse can be navigating with normal Python loops for a known structures, and with recursion where arbitrary nesting is possible.

The ElementTree system began its life as a third-party extension, but it was largely incorporated into Python’s standard library as the package xml.etree. Example 19-12 shows how to use it to parse our book catalog file one last time.

Example 19-12. PP4ELangXmletreebook.py

"""
XML parsing: ElementTree (etree) provides a Python-based API for parsing/generating
"""

import pprint
from xml.etree.ElementTree import parse

mapping = {}
tree = parse('books.xml')
for B in tree.findall('book'):
    isbn = B.attrib['isbn']
    for T in B.findall('title'):
        mapping[isbn] = T.text
pprint.pprint(mapping)

When run we get the exact same results as for SAX and DOM again, but the code required to extract the file’s details seems noticeably simpler this time around:

C:...PP4ELangXml> python etreebook.py
{'0-596-00128-2': 'Python & XML',
 '0-596-00797-3': 'Python Cookbook, 2nd Edition',
 '0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
 '0-596-15806-8': 'Learning Python, 4th Edition',
 '0-596-15808-4': 'Python Pocket Reference, 4th Edition',
 '0-596-15810-6': 'Programming Python, 4th Edition'}

HTML Parsing in Action

Although more limited in scope, Python’s html.parser standard library module also supports HTML-specific parsing, useful in “screen scraping” roles to extract information from web pages. Among other things, this parser can be used to process Web replies fetched with the urllib.request module we met in the Internet part of this book, to extract plain text from HTML email messages, and more.

The html.parser module has an API reminiscent of the XML SAX model of the prior section: it provides a parser which we subclass to intercept tags and their data during a parse. Unlike SAX, we don’t provide a handler class, but extend the parser class directly. Here’s a quick interactive example to demonstrate the basics (I copied all of this section’s code into file htmlparser.py in the examples package if you wish to experiment with it yourself):

>>> from html.parser import HTMLParser
>>> class ParsePage(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('Tag start:', tag, attrs)
...     def handle_endtag(self, tag):
...         print('tag end:  ', tag)
...     def handle_data(self, data):
...         print('data......', data.rstrip())
...

Now, create a web page’s HTML text string; we hardcode one here, but it might also be loaded from a file, or fetched from a website with urllib.request:

>>> page = """
... <html>
... <h1>Spam!</h1>
... <p>Click this <a href="http://www.python.org">python</a> link</p>
... </html>"""

Finally, kick off the parse by feeding text to a parser instance—tags in the HTML text trigger class method callbacks, with tag names and attribute sequences passed in as arguments:

>>> parser = ParsePage()
>>> parser.feed(page)
data......
Tag start: html []
data......
Tag start: h1 []
data...... Spam!
tag end:   h1
data......
Tag start: p []
data...... Click this
Tag start: a [('href', 'http://www.python.org')]
data...... python
tag end:   a
data......  link
tag end:   p
data......
tag end:   html

As you can see, the parser’s methods receive callbacks for events during the parse. Much like SAX XML parsing, your parser class will need to keep track of its state in attributes as it goes if it wishes to do something more specific than print tag names, attributes, and content. Watching for specific tags’ content, though, might be as simple as checking names and setting state flags. Moreover, building object trees to reflect the page’s structure during the parse would be straightforward.

Handling HTML entity references (revisited)

Here’s another HTML parsing example: in Chapter 15, we used a simple method exported by this module to unquote HTML escape sequences (a.k.a. entities) in strings embedded in an HTML reply page:

>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'

This works for undoing HTML escapes, but that’s all. When we saw this solution, I implied that there was a more general approach; now that you know about the method callback model of the HTML parser class, the more idiomatic way to handle entities during a parse should make sense—simply catch entity callbacks in a parser subclass, and translate as needed:

>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         map = dict(lt='<', gt='>')
...         print(map[name], end='')
...
>>> p = Parse()
>>> p.feed(s); print()
1<2 <b>hello</b>

Better still, we can use Python’s related html.entities module to avoid hardcoding entity-to-character mappings for HTML entities. This module defines many more entity names than the simple dictionary in the prior example and includes all those you’ll likely encounter when parsing HTML text in the wild:

>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> from html.entities import entitydefs
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         print(entitydefs[name], end='')
...
>>> P = Parse()
>>> P.feed(s); print()
1<2 <b>hello</b>

Strictly speaking, the html.entities module is able to map entity name to Unicode code point and vice versa; its table used here simply converts code point integers to characters with chr. See this module’s documentation, as well as its source code in the Python standard library for more details.

Extracting plain text from HTML (revisited)

Now that you understand the basic principles of the HTML parser class in Python’s standard library, the plain text extraction module used by Chapter 14’s PyMailGUI (Example 14-8) will also probably make significantly more sense (this was an unavoidable forward reference which we’re finally able to close).

Rather than repeating its code here, I’ll simply refer you back to that example, as well as its self-test and test input files, for another example of HTML parsing in Python to study on your own. It’s essentially a minor elaboration on the examples here, which detects more types of tags in its parser callback methods.

Because of space concerns, we have to cut short our treatment of HTML parsing here; as usual, knowing that it exists is enough to get started. For more details on the API, consult the Python library manual. And for additional HTML support, check the Web for the 3.X status of third-party HTML parser packages like those mentioned in Chapter 14.

Advanced Language Tools

If you have a background in parsing theory, you may know that neither regular expressions nor string splitting is powerful enough to handle more complex language grammars. Roughly, regular expressions don’t have the stack “memory” required by true language grammars, and so cannot support arbitrary nesting of language constructs—nested if statements in a programming language, for instance. In fact, this is why the XML and HTML parsers of the prior section are required at all: both are languages of potentially arbitrary nesting, which are beyond the scope of regular expressions in general.

From a theoretical perspective, regular expressions are really intended to handle just the first stage of parsing—separating text into components, otherwise known as lexical analysis. Though patterns can often be used to extract data from text, true language parsing requires more. There are a number of ways to fill this gap with Python:

Python as language tool

In most applications, the Python language itself can replace custom languages and parsers—user-entered code can be passed to Python for evaluation with tools such as eval and exec. By augmenting the system with custom modules, user code in this scenario has access to both the full Python language and any application-specific extensions required. In a sense, such systems embed Python in Python. Since this is a common Python role, we’ll revisit this approach later in this chapter.

Custom language parsers: manual or toolkit

For some sophisticated language analysis tasks, though, a full-blown parser may still be required. Such parsers can always be written by hand, but since Python is built for integrating C tools, we can write integrations to traditional parser generator systems such as yacc and bison, tools that create parsers from language grammar definitions. Better yet, we could use an integration that already exists—interfaces to such common parser generators are freely available in the open source domain (run a web search for up-to-date details and links).

In addition, a number of Python-specific parsing systems are available on the Web. Among them: PLY is an implementation of lex and yacc parsing tools in and for Python; the kwParsing system is a parser generator written in Python; PyParsing is a pure-Python class library that makes it easy to build recursive-descent parsers quickly; and the SPARK toolkit is a lightweight system that employs the Earley algorithm to work around technical problems with LALR parser generation (if you don’t know what that means, you probably don’t need to care).

Of special interest to this chapter, YAPPS (Yet Another Python Parser System) is a parser generator written in Python. It uses supplied grammar rules to generate human-readable Python code that implements a recursive descent parser; that is, it’s Python code that generates Python code. The parsers generated by YAPPS look much like (and were inspired by) the handcoded custom expression parsers shown in the next section. YAPPS creates LL(1) parsers, which are not as powerful as LALR parsers but are sufficient for many language tasks. For more on YAPPS, see http://theory.stanford.edu/~amitp/Yapps or search the Web at large.

Natural language processing

Even more demanding language analysis tasks require techniques developed in artificial intelligence research, such as semantic analysis and machine learning. For instance, the Natural Language Toolkit, or NLTK, is an open source suite of Python libraries and programs for symbolic and statistical natural language processing. It applies linguistic techniques to textual data, and it can be used in the development of natural language recognition software and systems. For much more on this subject, be sure to also see the O’Reilly book Natural Language Processing with Python, which explores, among other things, ways to use NLTK in Python. Not every system’s users will pose questions in a natural language, of course, but there are many applications which can make good use of such utility.

Though widely useful, parser generator systems and natural language analysis toolkits are too complex for us to cover in any sort of useful detail in this text. Consult http://python.org/ or search the Web for more information on language analysis tools available for use in Python programs. For the purposes of this chapter, let’s move on to explore a more basic and manual approach that illustrates concepts underlying the domain—recursive descent parsing.

Speaking of parser generators, to use some of these tools in Python programs, you’ll need an extension module that integrates them. The first step in such scenarios should always be to see whether the extension already exists in the public domain. Especially for common tools like these, chances are that someone else has already implemented an integration that you can use off-the-shelf instead of writing one from scratch.

Of course, not everyone can donate all their extension modules to the public domain, but there’s a growing library of available components that you can pick up for free and a community of experts to query. Visit the PyPI site at http://www.python.org for links to Python software resources, or search the Web at large. With at least one million Python users out there as I write this book, much can be found in the prior-art department.

Unless, of course, the wheel does not work: We’ve also seen a handful of cases in this book where standard libraries were either not adequate or were broken altogether. For instance, the Python 3.1 email package issues we explored in Chapter 13 required us to code workarounds of our own. In such cases, you may still ultimately need to code your own infrastructure support. The “not invented here” syndrome can still claim victory when software dependencies break down.

Still, you’re generally better off trying to use the standard support provided by Python in most cases, even if doing so requires manually coded fixes. In the email package example, fixing its problems seems much easier than coding an email parser and generator from scratch—a task far too large to have even attempted in this book. Python’s batteries included approach to development can be amazingly productive—even when some of those batteries require a charge.

Custom Language Parsers

Although toolkits abound in this domain, Python’s status as a general-purpose programming language also makes it a reasonable vehicle for writing hand-coded parsers for custom language analysis tasks. For instance , recursive descent parsing is a fairly well-known technique for analyzing language-based information. Though not as powerful as some language tools, recursive descent parses are sufficient for a wide variety of language-related goals.

To illustrate, this section develops a custom parser for a simple grammar—it parses and evaluates arithmetic expression strings. Though language analysis is the main topic here, this example also demonstrates the utility of Python as a general-purpose programming language. Although Python is often used as a frontend or rapid development language in tactical modes, it’s also often useful for the kinds of strategic work you may have formerly done in a systems development language such as C or C++.

The Expression Grammar

The grammar that our parser will recognize can be described as follows:

goal -> <expr> END                       [number, variable, ( ]
goal -> <assign> END                     [set]

assign -> 'set' <variable> <expr>        [set]

expr -> <factor> <expr-tail>             [number, variable, ( ]

expr-tail -> ^                           [END, ) ]
expr-tail -> '+' <factor> <expr-tail>    [+]
expr-tail -> '-' <factor> <expr-tail>    [-]

factor -> <term> <factor-tail>           [number, variable, ( ]

factor-tail -> ^                         [+, -, END, ) ]
factor-tail -> '*' <term> <factor-tail>  [*]
factor-tail -> '/' <term> <factor-tail>  [/]

term -> <number>                         [number]
term -> <variable>                       [variable]
term -> '(' <expr> ')'                   [(]

tokens: (, ), num, var, -, +, /, *, set, end

This is a fairly typical grammar for a simple expression language, and it allows for arbitrary expression nesting (some example expressions appear at the end of the testparser module listing in Example 19-15). Strings to be parsed are either an expression or an assignment to a variable name (set). Expressions involve numbers, variables, and the operators +, −, *, and /. Because factor is nested in expr in the grammar, * and / have higher precedence (i.e., they bind tighter) than + and −. Expressions can be enclosed in parentheses to override precedence, and all operators are left associative—that is, they group on the left (e.g., 1-2-3 is treated the same as (1-2)-3).

Tokens are just the most primitive components of the expression language. Each grammar rule listed earlier is followed in square brackets by a list of tokens used to select it. In recursive descent parsing, we determine the set of tokens that can possibly start a rule’s substring, and we use that information to predict which rule will work ahead of time. For rules that iterate (the -tail rules), we use the set of possibly following tokens to know when to stop. Typically, tokens are recognized by a string processor (a scanner), and a higher-level processor (a parser) uses the token stream to predict and step through grammar rules and substrings.

The Parser’s Code

The system is structured as two modules, holding two classes:

The scanner handles low-level character-by-character analysis.
The parser embeds a scanner and handles higher-level grammar analysis.

The parser is also responsible for computing the expression’s value and testing the system. In this version, the parser evaluates the expression while it is being parsed. To use the system, we create a parser with an input string and call its parse method. We can also call parse again later with a new expression string.

There’s a deliberate division of labor here. The scanner extracts tokens from the string, but it knows nothing about the grammar. The parser handles the grammar, but it is naive about the string itself. This modular structure keeps the code relatively simple. And it’s another example of the object-oriented programming (OOP) composition relationship at work: parsers embed and delegate to scanners.

The module in Example 19-13 implements the lexical analysis task—detecting the expression’s basic tokens by scanning the text string left to right on demand. Notice that this is all straightforward logic; such analysis can sometimes be performed with regular expressions instead (described earlier), but the pattern needed to detect and extract tokens in this example would be too complex and fragile for my tastes. If your tastes vary, try recoding this module with re.

Example 19-13. PP4ELangParserscanner.py

"""
###############################################################################
the scanner (lexical analyser)
###############################################################################
"""

import string
class SyntaxError(Exception): pass           # local errors
class LexicalError(Exception): pass          # used to be strings

class Scanner:
    def __init__(self, text):
        self.next = 0
        self.text = text + ''

    def newtext(self, text):
        Scanner.__init__(self, text)

    def showerror(self):
        print('=> ', self.text)
        print('=> ', (' ' * self.start) + '^')

    def match(self, token):
        if self.token != token:
            raise SyntaxError(token)
        else:
            value = self.value
            if self.token != '':
                self.scan()                  # next token/value
            return value                     # return prior value

    def scan(self):
        self.value = None
        ix = self.next
        while self.text[ix] in string.whitespace:
            ix += 1
        self.start = ix

        if self.text[ix] in ['(', ')', '-', '+', '/', '*', '']:
            self.token = self.text[ix]
            ix += 1

        elif self.text[ix] in string.digits:
            str = ''
            while self.text[ix] in string.digits:
                str += self.text[ix]
                ix += 1
            if self.text[ix] == '.':
                str += '.'
                ix += 1
                while self.text[ix] in string.digits:
                    str += self.text[ix]
                    ix += 1
                self.token = 'num'
                self.value = float(str)
            else:
                self.token = 'num'
                self.value = int(str)           # subsumes long() in 3.x

        elif self.text[ix] in string.ascii_letters:
            str = ''
            while self.text[ix] in (string.digits + string.ascii_letters):
                str += self.text[ix]
                ix += 1
            if str.lower() == 'set':
                self.token = 'set'
            else:
                self.token = 'var'
                self.value = str

        else:
            raise LexicalError()
        self.next = ix

The parser module’s class creates and embeds a scanner for its lexical chores and handles interpretation of the expression grammar’s rules and evaluation of the expression’s result, as shown in Example 19-14.

Example 19-14. PP4ELangParserparser1.py

"""
################################################################################
the parser (syntax analyser, evaluates during parse)
################################################################################
"""

class UndefinedError(Exception): pass
from scanner import Scanner, LexicalError, SyntaxError

class Parser:
    def __init__(self, text=''):
        self.lex  = Scanner(text)              # embed a scanner
        self.vars = {'pi': 3.14159}            # add a variable

    def parse(self, *text):
        if text:                               # main entry-point
            self.lex.newtext(text[0])          # reuse this parser?
        try:
            self.lex.scan()                    # get first token
            self.Goal()                        # parse a sentence
        except SyntaxError:
            print('Syntax Error at column:', self.lex.start)
            self.lex.showerror()
        except LexicalError:
            print('Lexical Error at column:', self.lex.start)
            self.lex.showerror()
        except UndefinedError as E:
            name = E.args[0]
            print("'%s' is undefined at column:" % name, self.lex.start)
            self.lex.showerror()

    def Goal(self):
        if self.lex.token in ['num', 'var', '(']:
            val = self.Expr()
            self.lex.match('')                    # expression?
            print(val)
        elif self.lex.token == 'set':               # set command?
            self.Assign()
            self.lex.match('')
        else:
            raise SyntaxError()

    def Assign(self):
        self.lex.match('set')
        var = self.lex.match('var')
        val = self.Expr()
        self.vars[var] = val           # assign name in dict

    def Expr(self):
        left = self.Factor()
        while True:
            if self.lex.token in ['', ')']:
                return left
            elif self.lex.token == '+':
                self.lex.scan()
                left = left + self.Factor()
            elif self.lex.token == '-':
                self.lex.scan()
                left = left - self.Factor()
            else:
                raise SyntaxError()

    def Factor(self):
        left = self.Term()
        while True:
            if self.lex.token in ['+', '-', '', ')']:
                return left
            elif self.lex.token == '*':
                self.lex.scan()
                left = left * self.Term()
            elif self.lex.token == '/':
                self.lex.scan()
                left = left / self.Term()
            else:
                raise SyntaxError()

    def Term(self):
        if self.lex.token == 'num':
            val = self.lex.match('num')               # numbers
            return val
        elif self.lex.token == 'var':
            if self.lex.value in self.vars.keys():    # keys(): EIBTI!
                val = self.vars[self.lex.value]       # look up name's value
                self.lex.scan()
                return val
            else:
                raise UndefinedError(self.lex.value)
        elif self.lex.token == '(':
            self.lex.scan()
            val = self.Expr()                         # sub-expression
            self.lex.match(')')
            return val
        else:
            raise SyntaxError()

if __name__ == '__main__':
    import testparser                       # self-test code
    testparser.test(Parser, 'parser1')      # test local Parser

If you study this code closely, you’ll notice that the parser keeps a dictionary (self.vars) to manage variable names: they’re stored in the dictionary on a set command and are fetched from it when they appear in an expression. Tokens are represented as strings, with an optional associated value (a numeric value for numbers and a string for variable names).

The parser uses iteration (while loops) rather than recursion for the expr-tail and factor-tail rules. Other than this optimization, the rules of the grammar map directly onto parser methods: tokens become calls to the scanner, and nested rule references become calls to other methods.

When the file parser1.py is run as a top-level program, its self-test code is executed, which in turn simply runs a canned test in the module shown in Example 19-15. Notice how the scanner converts numbers to strings with int; this ensures that all integer math invoked by the parser supports unlimited precision, simply because it uses Python integers which always provide the extra precision if needed (the separate Python 2.X long type and syntax is no more).

Also notice that mixed integer/floating-point operations cast up to floating point since Python operators are used to do the actual calculations along the way. The expression language’s / division operator also inherits Python 3.X’s true division model which retains remainders and returns floating point results regardless of operand types. We could simply run a // in our evaluation logic to retain prior behavior (or allow for both / and // in our grammar), but we’ll follow Python 3.X’s lead here.

Example 19-15. PP4ELangParser estparser.py

"""
############################################################################
parser test code
############################################################################
"""

def test(ParserClass, msg):
    print(msg, ParserClass)
    x = ParserClass('4 / 2 + 3')            # allow different Parsers
    x.parse()

    x.parse('3 + 4 / 2')                    # like eval('3 + 4 / 2')...
    x.parse('(3 + 4) / 2')                  # 3.X: / is now true div
    x.parse('4 / (2 + 3)')                  # // is not supported (yet)
    x.parse('4.0 / (2 + 3)')
    x.parse('4 / (2.0 + 3)')
    x.parse('4.0 / 2 * 3')
    x.parse('(4.0 / 2) * 3')
    x.parse('4.0 / (2 * 3)')
    x.parse('(((3))) + 1')

    y = ParserClass()
    y.parse('set a 4 / 2 + 1')
    y.parse('a * 3')
    y.parse('set b 12 / a')
    y.parse('b')

    z = ParserClass()
    z.parse('set a 99')
    z.parse('set a a + 1')
    z.parse('a')

    z = ParserClass()
    z.parse('pi')
    z.parse('2 * pi')
    z.parse('1.234 + 2.1')

def interact(ParserClass):                     # command-line entry
    print(ParserClass)
    x = ParserClass()
    while True:
        cmd = input('Enter=> ')
        if cmd == 'stop':
            break
        x.parse(cmd)

Correlate the following results to print call statements in the self-test module:

C:...PP4ELangParser> python parser1.py
parser1 <class '__main__.Parser'>
5.0
5.0
3.5
0.8
0.8
0.8
6.0
6.0
0.666666666667
4
9.0
4.0
100
3.14159
6.28318
3.334

As usual, we can also test and use the system interactively to work through more of its utility:

C:...PP4ELangParser> python
>>> import parser1
>>> x = parser1.Parser()
>>> x.parse('1 + 2')
3

Error cases are trapped and reported in a fairly friendly fashion (assuming users think in zero-based terms):

>>> x.parse('1 + a')
'a' is undefined at column: 4
=>  1 + a
=>      ^
>>> x.parse('1+a+2')
'a' is undefined at column: 2
=>  1+a+2
=>    ^
>>> x.parse('1 * 2 $')
Lexical Error at column: 6
=>  1 * 2 $
=>        ^
>>> x.parse('1 * - 1')
Syntax Error at column: 4
=>  1 * - 1
=>      ^
>>> x.parse('1 * (9')
Syntax Error at column: 6
=>  1 * (9
=>        ^

>>> x.parse('1 + 2 / 3')           # 3.X division change
1.66666666667
>>> x.parse('1 + 2 // 3')
Syntax Error at column: 7
=>  1 + 2 // 3
=>         ^

As the custom parser system demonstrates, modular program design is almost always a major win. By using Python’s program structuring tools (functions, modules, classes, and so on), big tasks can be broken down into small, manageable parts that can be coded and tested independently.

For instance, the scanner can be tested without the parser by making an instance with an input string and calling its scan or match methods repeatedly. We can even test it like this interactively, from Python’s command line. When we separate programs into logical components, they become easier to understand and modify. Imagine what the parser would look like if the scanner’s logic was embedded rather than called.

Pathologically big numbers are handled well, too, because Python’s built-in objects and operators are used along the way:

>>> x.parse('888888888888888888888888888888888888888888888.9999999')
8.88888888889e+44
>>> x.parse('99999999999999999999999999999999999999999 + 2')
100000000000000000000000000000000000000001
>>> x.parse('999999999999999999999999999999.88888888888 + 1.1')
1e+30

In addition, there is an interactive loop interface in the testparser module if you want to use the parser as a simple command-line calculator (or if you get tired of typing parser method calls). Pass the Parser class, so testparser can make one of its own:

>>> import testparser
>>> testparser.interact(parser1.Parser)
<class 'parser1.Parser'>
Enter=> 4 * 3 + 5
17
Enter=> 5 + 4 * 3
17
Enter=> (5 + 4) * 3
27
Enter=> set a 99
Enter=> set b 66
Enter=> a + b
165
Enter=> # + 1
Lexical Error at column: 0
=>  # + 1
=>  ^
Enter=> a * b + c
'c' is undefined at column: 8
=>  a * b + c
=>          ^
Enter=> a * b * + c
Syntax Error at column: 8
=>  a * b * + c
=>          ^
Enter=> a
99
Enter=> a * a * a
970299
Enter=> stop
>>>

Adding a Parse Tree Interpreter

One weakness in the parser1 program is that it embeds expression evaluation logic in the parsing logic: the result is computed while the string is being parsed. This makes evaluation quick, but it can also make it difficult to modify the code, especially in larger systems. To simplify, we could restructure the program to keep expression parsing and evaluation separate. Instead of evaluating the string, the parser can build up an intermediate representation of it that can be evaluated later. As an added incentive, building the representation separately makes it available to other analysis tools (e.g., optimizers, viewers, and so on)—they can be run as separate passes over the tree.

Example 19-16 shows a variant of parser1 that implements this idea. The parser analyzes the string and builds up a parse tree—that is, a tree of class instances that represents the expression and that may be evaluated in a separate step. The parse tree is built from classes that “know” how to evaluate themselves: to compute the expression, we just ask the tree to evaluate itself. Root nodes in the tree ask their children to evaluate themselves, and then combine the results by applying a single operator. In effect, evaluation in this version is simply a recursive traversal of a tree of embedded class instances constructed by the parser.

Example 19-16. PP4ELangParserparser2.py

"""
Separate expression parsing from evaluation by building an explicit parse tree
"""

TraceDefault = False
class UndefinedError(Exception): pass

if __name__ == '__main__':
    from scanner import Scanner, SyntaxError, LexicalError      # if run here
else:
    from .scanner import Scanner, SyntaxError, LexicalError     # from PyTree

#################################################################################
# the interpreter (a smart objects tree)
#################################################################################

class TreeNode:
    def validate(self, dict):           # default error check
        pass
    def apply(self, dict):              # default evaluator
        pass
    def trace(self, level):             # default unparser
        print('.' * level + '<empty>')

# ROOTS

class BinaryNode(TreeNode):
    def __init__(self, left, right):            # inherited methods
        self.left, self.right = left, right     # left/right branches
    def validate(self, dict):
        self.left.validate(dict)                # recurse down branches
        self.right.validate(dict)
    def trace(self, level):
        print('.' * level + '[' + self.label + ']')
        self.left.trace(level+3)
        self.right.trace(level+3)

class TimesNode(BinaryNode):
    label = '*'
    def apply(self, dict):
        return self.left.apply(dict) * self.right.apply(dict)

class DivideNode(BinaryNode):
    label = '/'
    def apply(self, dict):
        return self.left.apply(dict) / self.right.apply(dict)

class PlusNode(BinaryNode):
    label = '+'
    def apply(self, dict):
        return self.left.apply(dict) + self.right.apply(dict)

class MinusNode(BinaryNode):
    label = '-'
    def apply(self, dict):
        return self.left.apply(dict) - self.right.apply(dict)

# LEAVES

class NumNode(TreeNode):
    def __init__(self, num):
        self.num = num                 # already numeric
    def apply(self, dict):             # use default validate
        return self.num
    def trace(self, level):
        print('.' * level + repr(self.num))   # as code, was 'self.num'

class VarNode(TreeNode):
    def __init__(self, text, start):
        self.name   = text                    # variable name
        self.column = start                   # column for errors
    def validate(self, dict):
        if not self.name in dict.keys():
            raise UndefinedError(self.name, self.column)
    def apply(self, dict):
        return dict[self.name]                # validate before apply
    def assign(self, value, dict):
        dict[self.name] = value               # local extension
    def trace(self, level):
        print('.' * level + self.name)

# COMPOSITES

class AssignNode(TreeNode):
    def __init__(self, var, val):
        self.var, self.val = var, val
    def validate(self, dict):
        self.val.validate(dict)               # don't validate var
    def apply(self, dict):
        self.var.assign( self.val.apply(dict), dict )
    def trace(self, level):
        print('.' * level + 'set ')
        self.var.trace(level + 3)
        self.val.trace(level + 3)

#################################################################################
# the parser (syntax analyser, tree builder)
#################################################################################

class Parser:
    def __init__(self, text=''):
        self.lex     = Scanner(text)           # make a scanner
        self.vars    = {'pi':3.14159}          # add constants
        self.traceme = TraceDefault

    def parse(self, *text):                    # external interface
        if text:
            self.lex.newtext(text[0])          # reuse with new text
        tree = self.analyse()                  # parse string
        if tree:
            if self.traceme:                   # dump parse-tree?
                print(); tree.trace(0)
            if self.errorCheck(tree):          # check names
                self.interpret(tree)           # evaluate tree

    def analyse(self):
        try:
            self.lex.scan()                    # get first token
            return self.Goal()                 # build a parse-tree
        except SyntaxError:
            print('Syntax Error at column:', self.lex.start)
            self.lex.showerror()
        except LexicalError:
            print('Lexical Error at column:', self.lex.start)
            self.lex.showerror()

    def errorCheck(self, tree):
        try:
            tree.validate(self.vars)           # error checker
            return 'ok'
        except UndefinedError as instance:     # args is a tuple
            varinfo = instance.args
            print("'%s' is undefined at column: %d" % varinfo)
            self.lex.start = varinfo[1]
            self.lex.showerror()               # returns None

    def interpret(self, tree):
        result = tree.apply(self.vars)         # tree evals itself
        if result != None:                     # ignore 'set' result
            print(result)                      # ignores errors

    def Goal(self):
        if self.lex.token in ['num', 'var', '(']:
            tree = self.Expr()
            self.lex.match('')
            return tree
        elif self.lex.token == 'set':
            tree = self.Assign()
            self.lex.match('')
            return tree
        else:
            raise SyntaxError()

    def Assign(self):
        self.lex.match('set')
        vartree = VarNode(self.lex.value, self.lex.start)
        self.lex.match('var')
        valtree = self.Expr()
        return AssignNode(vartree, valtree)               # two subtrees

    def Expr(self):
        left = self.Factor()                              # left subtree
        while True:
            if self.lex.token in ['', ')']:
                return left
            elif self.lex.token == '+':
                self.lex.scan()
                left = PlusNode(left, self.Factor())      # add root-node
            elif self.lex.token == '-':
                self.lex.scan()
                left = MinusNode(left, self.Factor())     # grows up/right
            else:
                raise SyntaxError()

    def Factor(self):
        left = self.Term()
        while True:
            if self.lex.token in ['+', '-', '', ')']:
                return left
            elif self.lex.token == '*':
                self.lex.scan()
                left = TimesNode(left, self.Term())
            elif self.lex.token == '/':
                self.lex.scan()
                left = DivideNode(left, self.Term())
            else:
                raise SyntaxError()

    def Term(self):
        if self.lex.token == 'num':
            leaf = NumNode(self.lex.match('num'))
            return leaf
        elif self.lex.token == 'var':
            leaf = VarNode(self.lex.value, self.lex.start)
            self.lex.scan()
            return leaf
        elif self.lex.token == '(':
            self.lex.scan()
            tree = self.Expr()
            self.lex.match(')')
            return tree
        else:
            raise SyntaxError()

#################################################################################
# self-test code: use my parser, parser1's tester
#################################################################################

if __name__ == '__main__':
    import testparser
    testparser.test(Parser, 'parser2')    #  run with Parser class here

Notice the way we handle undefined name exceptions in errorCheck. When exceptions are derived from the built-in Exception class, their instances automatically return the arguments passed to the exception constructor call as a tuple in their args attribute—convenient for use in string formatting here.

Also notice that the new parser reuses the same scanner module as well. To catch errors raised by the scanner, it also imports the specific classes that identify the scanner’s exceptions. Both the scanner and the parser can raise exceptions on errors (lexical errors, syntax errors, and undefined name errors). They’re caught at the top level of the parser, and they end the current parse. There’s no need to set and check status flags to terminate the recursion. Since math is done using integers, floating-point numbers, and Python’s operators, there’s usually no need to trap numeric overflow or underflow errors. But as is, the parser doesn’t handle errors such as division by zero—such Python exceptions make the parser system exit with a Python stack trace and message. Uncovering the cause and fix for this is left as suggested exercise.

When parser2 is run as a top-level program, we get the same test code output as for parser1. In fact, it reuses the very same test code—both parsers pass in their parser class object to testparser.test. And since classes are also objects, we can also pass this version of the parser to testparser’s interactive loop: testparser.interact(parser2.Parser).

The new parser’s external behavior is identical to that of the original, so I won’t repeat all its output here (run this live for a firsthand look). Of note, though, this parser supports both use as a top-level script, and package imports from other directories, such as the PyTree viewer we’ll use in a moment. Python 3.X no longer searches a module’s own directory on the import search path, though, so we have to use package-relative import syntax for the latter case, and import from another directory when testing interactively:

C:...PP4ELangParser> parser2.py
parser2 <class '__main__.Parser'>
5.0
...rest is same as for parser1...

C:...PP4ELangParser> python
>>> import parser2
    from .scanner import Scanner, SyntaxError, LexicalError     # from PyTree
ValueError: Attempted relative import in non-package

C:...PP4ELangParser> cd ..
C:...PP4ELang> Parserparser2.py
parser2 <class '__main__.Parser'>
5.0
...rest is same as for parser1...

C:...PP4ELang> python
>>> from Parser import parser2
>>> x = parser2.Parser()
>>> x.parse('1 + 2 * 3 + 4')
11
>>> import Parser.testparser
>>> Parser.testparser.interact(parser2.Parser)
<class 'Parser.parser2.Parser'>
Enter=> 4 * 3 + 5
17
Enter=> stop
>>>

Using full package import paths in parser2 instead of either package-relative or unqualified imports:

from PP4E.Lang.Parser import scanner

would suffice for all three use cases—script, and both same and other directory imports—but requires the path to be set properly, and seems overkill for importing a file in the same directory as the importer.

Parse Tree Structure

Really, the only tangible difference with this latest parser is that it builds and uses trees to evaluate an expression internally instead of evaluating as it parses. The intermediate representation of an expression is a tree of class instances, whose shape reflects the order of operator evaluation. This parser also has logic to print an indented listing of the constructed parse tree if the traceme attribute is set to True (or 1). Indentation gives the nesting of subtrees, and binary operators list left subtrees first. For example:

C:...PP4ELang>
>>> from Parser import parser2
>>> p = parser2.Parser()
>>> p.traceme = True
>>> p.parse('5 + 4 * 2')

[+]
...5
...[*]
......4
......2
13

When this tree is evaluated, the apply method recursively evaluates subtrees and applies root operators to their results. Here, * is evaluated before +, since it’s lower in the tree. The Factor method consumes the * substring before returning a right subtree to Expr. The next tree takes a different shape:

>>> p.parse('5 * 4 - 2')

[-]
...[*]
......5
......4
...2
18

In this example, * is evaluated before -. The Factor method loops through a substring of * and / expressions before returning the resulting left subtree to Expr. The next example is more complex, but follows the same rules:

>>> p.parse('1 + 3 * (2 * 3 + 4)')

[+]
...1
...[*]
......3
......[+]
.........[*]
............2
............3
.........4
31

Trees are made of nested class instances. From an OOP perspective, it’s another way to use composition. Since tree nodes are just class instances, this tree could be created and evaluated manually, too:

PlusNode( NumNode(1),
          TimesNode( NumNode(3),
                     PlusNode( TimesNode(NumNode(2), NumNode(3)),
                               NumNode(4) ))).apply({})

But we might as well let the parser build it for us (Python is not that much like Lisp, despite what you may have heard).

Exploring Parse Trees with the PyTree GUI

But wait—there is a better way to explore parse tree structures. Figure 19-1 shows the parse tree generated for the string 1 + 3 * (2 * 3 + 4), displayed in PyTree, the tree visualization GUI described at the end of Chapter 18. This works only because the parser2 module builds the parse tree explicitly (parser1 evaluates during a parse instead) and because PyTree’s code is generic and reusable.

Figure 19-1. PyTree view of parse tree built for 1 + 3 * (2 * 3 + 4)

If you read the last chapter, you’ll recall that PyTree can draw most any tree data structure, but it is preconfigured to handle binary search trees and the expression parse trees we’re studying in this chapter. For parse trees, clicking on nodes in a displayed parse tree evaluates the subtree rooted there.

PyTree makes it easy to learn about and experiment with the parser. To determine the tree shape produced for a given expression, start PyTree, click on its Parser radio button, type the expression in the input field at the bottom right, and press “input” (or your Enter key). The parser class is run to generate a tree from your input, and the GUI displays the result. Depending on the operators used within an expression, some very differently shaped trees yield the same result when evaluated.

Try running PyTree on your computer to get a better feel for the parsing process. (I’d like to show more example trees, but I ran out of page real estate at this point in the book.)

Parsers Versus Python

The handcoded custom parser programs we’ve met in this section illustrate some interesting concepts and underscore the power of Python for general-purpose programming. Depending on your job description, they may also be typical of the sort of thing you’d write regularly in a traditional language such as C. Parsers are an important component in a wide variety of applications, but in some cases, they’re not as necessary as you might think. Let me explain why.

So far, we started with an expression parser and added a parse tree interpreter to make the code easier to modify. As is, the parser works, but it may be slow compared to a C implementation. If the parser is used frequently, we could speed it up by moving parts to C extension modules. For instance, the scanner might be moved to C initially, since it’s often called from the parser. Ultimately, we might add components to the grammar that allow expressions to access application-specific variables and functions.

All of these steps constitute good engineering. But depending on your application, this approach may not be the best one in Python. Often the easiest way to evaluate input expressions in Python is to let Python do it for us, by calling its eval built-in function. In fact, we can usually replace the entire expression evaluation program with this one function call. The next section will show how this can be used to simplify language-based systems in general.

More important, the next section underscores a core idea behind the language: if you already have an extensible, embeddable, high-level language system, why invent another? Python itself can often satisfy language-based component needs.

PyCalc: A Calculator Program/Object

To wrap up this chapter, I’m going to show you a practical application for some of the parsing technology introduced in the preceding section. This section presents PyCalc, a Python calculator program with a graphical interface, similar to the calculator programs available on most window systems. Like most of the GUI examples in this book, though, PyCalc offers a few advantages over existing calculators. Because PyCalc is written in Python, it is both easily customized and widely portable across window platforms. And because it is implemented with classes, it is both a standalone program and a reusable object library.

A Simple Calculator GUI

Before I show you how to write a full-blown calculator, though, the module shown in Example 19-17 starts this discussion in simpler terms. It implements a limited calculator GUI, whose buttons just add text to the input field at the top in order to compose a Python expression string. Fetching and running the string all at once produces results. Figure 19-2 shows the window this module makes when run as a top-level script.

Example 19-17. PP4ELangCalculatorcalc0.py

"a simplistic calculator GUI: expressions run all at once with eval/exec"

from tkinter import *
from PP4E.Gui.Tools.widgets import frame, button, entry

class CalcGui(Frame):
    def __init__(self, parent=None):                   # an extended frame
        Frame.__init__(self, parent)                   # on default top-level
        self.pack(expand=YES, fill=BOTH)               # all parts expandable
        self.master.title('Python Calculator 0.1')     # 6 frames plus entry
        self.master.iconname("pcalc1")

        self.names = {}                                # namespace for variables
        text = StringVar()
        entry(self, TOP, text)

        rows = ["abcd", "0123", "4567", "89()"]
        for row in rows:
            frm = frame(self, TOP)
            for char in row:
                button(frm, LEFT, char,
                            lambda char=char: text.set(text.get() + char))

        frm = frame(self, TOP)
        for char in "+-*/=":
            button(frm, LEFT, char,
                        lambda char=char: text.set(text.get()+ ' ' + char + ' '))

        frm = frame(self, BOTTOM)
        button(frm, LEFT, 'eval',  lambda: self.eval(text) )
        button(frm, LEFT, 'clear', lambda: text.set('') )

    def eval(self, text):
        try:
            text.set(str(eval(text.get(), self.names, self.names)))    # was 'x'
        except SyntaxError:
            try:
                exec(text.get(), self.names, self.names)
            except:
                text.set("ERROR")         # bad as statement too?
            else:
                text.set('')              # worked as a statement
        except:
            text.set("ERROR")             # other eval expression errors

if __name__ == '__main__': CalcGui().mainloop()

Figure 19-2. The calc0 script in action on Windows 7 (result=160.283)

Building the GUI

Now, this is about as simple as a calculator can be, but it demonstrates the basics. This window comes up with buttons for entry of numbers, variable names, and operators. It is built by attaching buttons to frames: each row of buttons is a nested Frame, and the GUI itself is a Frame subclass with an attached Entry and six embedded row frames (grids would work here, too). The calculator’s frame, entry field, and buttons are made expandable in the imported widgets utility module we coded earlier in Example 10-1.

This calculator builds up a string to pass to the Python interpreter all at once on “eval” button presses. Because you can type any Python expression or statement in the entry field, the buttons are really just a convenience. In fact, the entry field isn’t much more than a command line. Try typing import sys, and then dir(sys) to display sys module attributes in the input field at the top—it’s not what you normally do with a calculator, but it is demonstrative nevertheless.^[71]

In CalcGui’s constructor, buttons are coded as lists of strings; each string represents a row and each character in the string represents a button. Lambdas are used to save extra callback data for each button. The callback functions retain the button’s character and the linked text entry variable so that the character can be added to the end of the entry widget’s current string on a press.

Notice how we must pass in the loop variable as a default argument to some lambdas in this code. Recall from Chapter 7 how references within a lambda (or nested def) to names in an enclosing scope are evaluated when the nested function is called, not when it is created. When the generated function is called, enclosing scope references inside the lambda reflect their latest setting in the enclosing scope, which is not necessarily the values they held when the lambda expression ran. By contrast, defaults are evaluated at function creation time instead and so can remember the current values of loop variables. Without the defaults, each button would reflect the last iteration of the loop.

The calculator uses eval and exec to call Python’s parser and interpreter at runtime instead of analyzing and evaluating expressions manually. In effect, the calculator runs embedded Python code from a Python program. This works because Python’s development environment (the parser and bytecode compiler) is always a part of systems that use Python. Because there is no difference between the development and the delivery environments, Python’s parser can be used by Python programs.

The net effect here is that the entire expression evaluator has been replaced with a single call to eval or exec. In broader terms, this is a powerful technique to remember: the Python language itself can replace many small, custom languages. Besides saving development time, clients have to learn just one language, one that’s potentially simple enough for end-user coding.

Furthermore, Python can take on the flavor of any application. If a language interface requires application-specific extensions, just add Python classes, or export an API for use in embedded Python code as a C extension. By evaluating Python code that uses application-specific extensions, custom parsers become almost completely unnecessary.

There’s also a critical added benefit to this approach: embedded Python code has access to all the tools and features of a powerful, full-blown programming language. It can use lists, functions, classes, external modules, and even larger Python tools like tkinter GUIs, shelve storage, multiple threads, network sockets, and web page fetches. You’d probably spend years trying to provide similar functionality in a custom language parser. Just ask Guido.

Running code strings

This module implements a GUI calculator in some 45 lines of code (counting comments and blank lines). But truthfully, it “cheats.” Expression evaluation is delegated entirely to Python. In fact, the built-in eval and exec tools do most of the work here:

eval: Parses, evaluates, and returns the result of a Python expression represented as a string.
exec: Runs an arbitrary Python statement represented as a string, and has no return value.

Both accept optional dictionaries to be used as global and local namespaces for assigning and evaluating names used in the code strings. In the calculator, self.names becomes a symbol table for running calculator expressions. A related Python function, compile, can be used to precompile code strings to code objects before passing them to eval and exec (use it if you need to run the same string many times).

By default, a code string’s namespace defaults to the caller’s namespaces. If we didn’t pass in dictionaries here, the strings would run in the eval method’s namespace. Since the method’s local namespace goes away after the method call returns, there would be no way to retain names assigned in the string. Notice the use of nested exception handlers in the class’s eval method:

It first assumes the string is an expression and tries the built-in eval function.
If that fails because of a syntax error, it tries evaluating the string as a statement using exec.
Finally, if both attempts fail, it reports an error in the string (a syntax error, undefined name, and so on).

Statements and invalid expressions might be parsed twice, but the overhead doesn’t matter here, and you can’t tell whether a string is an expression or a statement without parsing it manually. Note that the “eval” button evaluates expressions, but = sets Python variables by running an assignment statement. Variable names are combinations of the letter keys “abcd” (or any name typed directly). They are assigned and evaluated in a dictionary used to represent the calculator’s namespace and retained for the session.

Extending and attaching

Clients that reuse this calculator are as simple as the calculator itself. Like most class-based tkinter GUIs, this one can be extended in subclasses—Example 19-18 customizes the simple calculator’s constructor to add extra widgets.

Example 19-18. PP4ELangCalculatorcalc0ext.py

from tkinter import *
from calc0 import CalcGui

class Inner(CalcGui):                                          # extend GUI
    def __init__(self):
        CalcGui.__init__(self)
        Label(self,  text='Calc Subclass').pack()              # add after
        Button(self, text='Quit', command=self.quit).pack()    # top implied

Inner().mainloop()

It can also be embedded in a container class—Example 19-19 attaches the simple calculator’s widget package, along with extras, to a common parent.

Example 19-19. PP4ELangCalculatorcalc0emb.py

from tkinter import *
from calc0 import CalcGui                       # add parent, no master calls

class Outer:
    def __init__(self, parent):                               # embed GUI
        Label(parent, text='Calc Attachment').pack()          # side=top
        CalcGui(parent)                                       # add calc frame
        Button(parent, text='Quit', command=parent.quit).pack()

root = Tk()
Outer(root)
root.mainloop()

Figure 19-3 shows the result of running both of these scripts from different command lines. Both have a distinct input field at the top. This works, but to see a more practical application of such reuse techniques, we need to make the underlying calculator more practical, too.

Figure 19-3. The calc0 script’s object attached and extended

PyCalc—A “Real” Calculator GUI

Of course, real calculators don’t usually work by building up expression strings and evaluating them all at once; that approach is really little more than a glorified Python command line. Traditionally, expressions are evaluated in piecemeal fashion as they are entered, and temporary results are displayed as soon as they are computed. Implementing this behavior requires a bit more work: expressions must be evaluated manually and in parts, instead of calling the eval function only once. But the end result is much more useful and intuitive.

Though simple, attaching and subclassing the calculator graphically, as shown in Figure 19-3, illustrates the power of Python as a tool for writing reusable software. By coding programs with modules and classes, components written in isolation almost automatically become general-purpose tools. Python’s program organization features promote reusable code.

In fact, code reuse is one of Python’s major strengths and has been one of the main themes of this book. Good object-oriented design takes some practice and forethought, and the benefits of code reuse aren’t apparent immediately. And sometimes we have good cause to be more interested in a quick fix rather than a future use for the code.

But coding with some reusability in mind can save development time in the long run. For instance, the handcoded custom parsers shared a scanner, the calculator GUI uses the widgets module from Chapter 10 we discussed earlier, and the next section will reuse the GuiMixin class from Chapter 10 as well. Sometimes we’re able to finish part of a job before we start.

This section presents the implementation of PyCalc, a more realistic Python/tkinter program that implements such a traditional calculator GUI. It touches on the subject of text and languages in two ways: it parses and evaluates expressions, and it implements a kind of stack-based language to perform the evaluation. Although its evaluation logic is more complex than the simpler calculator shown earlier, it demonstrates advanced programming techniques and serves as an interesting finale for this chapter.

Running PyCalc

As usual, let’s look at the GUI before the code. You can run PyCalc from the PyGadgets and PyDemos launcher bars at the top of the examples tree, or by directly running the file calculator.py listed shortly (e.g., click it in a file explorer, or type it in a shell command line). Figure 19-4 shows PyCalc’s main window. By default, it shows operand buttons in black-on-blue (and opposite for operator buttons), but font and color options can be passed into the GUI class’s constructor method. Of course, that means gray-on-gray in this book, so you’ll have to run PyCalc yourself to see what I mean.

Figure 19-4. PyCalc calculator at work on Windows 7

If you do run this, you’ll notice that PyCalc implements a normal calculator model—expressions are evaluated as entered, not all at once at the end. That is, parts of an expression are computed and displayed as soon as operator precedence and manually typed parentheses allow. The result in Figure 19-4, for instance, reflects pressing “2”, and then repeatedly pressing “*” to display successive powers of 2. I’ll explain how this evaluation works in a moment.

PyCalc’s CalcGui class builds the GUI interface as frames of buttons much like the simple calculator of the previous section, but PyCalc adds a host of new features. Among them are another row of action buttons, inherited methods from GuiMixin (presented in Chapter 10), a new “cmd” button that pops up nonmodal dialogs for entry of arbitrary Python code, and a recent calculations history pop up. Figure 19-5 captures some of PyCalc’s pop-up windows.

Figure 19-5. PyCalc calculator with some of its pop ups

You may enter expressions in PyCalc by clicking buttons in the GUI, typing full expressions in command-line pop ups, or typing keys on your keyboard. PyCalc intercepts key press events and interprets them the same as corresponding button presses; typing + is like pressing the + button, the Space bar key is “clear,” Enter is “eval,” backspace erases a character, and ? is like pressing “help.”

The command-line pop-up windows are nonmodal (you can pop up as many as you like). They accept any Python code—press the Run button or your Enter key to evaluate text in their input fields. The result of evaluating this code in the calculator’s namespace dictionary is thrown up in the main window for use in larger expressions. You can use this as an escape mechanism to employ external tools in your calculations. For instance, you can import and use functions coded in Python or C within these pop ups. The current value in the main calculator window is stored in newly opened command-line pop ups, too, for use in typed expressions.

PyCalc supports integers (unlimited precision), negatives, and floating-point numbers just because Python does. Individual operands and expressions are still evaluated with the eval built-in, which calls the Python parser/interpreter at runtime. Variable names can be assigned and referenced in the main window with the letter, =, and “eval” keys; they are assigned in the calculator’s namespace dictionary (more complex variable names may be typed in command-line pop ups). Note the use of pi in the history window: PyCalc preimports names in the math and random modules into the namespace where expressions are evaluated.

Evaluating expressions with stacks

Now that you have the general idea of what PyCalc does, I need to say a little bit about how it does what it does. Most of the changes in this calculator involve managing the expression display and evaluating expressions. PyCalc is structured as two classes:

The CalcGui class: Manages the GUI itself. It controls input events and is in charge of the main window’s display field at the top. It doesn’t evaluate expressions, though; for that, it sends operators and operands entered in the GUI to an embedded instance of the Evaluator class.
The Evaluator class: Manages two stacks. One stack records pending operators (e.g., +), and one records pending operands (e.g., 3.141). Temporary results are computed as new operators are sent from CalcGui and pushed onto the operands stack.

As you can see from this, the magic of expression evaluation boils down to juggling the operator and operand stacks. In a sense, the calculator implements a little stack-based language, to evaluate the expressions being entered. While scanning expression strings from left to right as they are entered, operands are pushed along the way, but operators delimit operands and may trigger temporary results before they are pushed. Because it records states and performs transitions, some might use the term state machine to describe this calculator language implementation.

Here’s the general scenario:

When a new operator is seen (i.e., when an operator button or key is pressed), the prior operand in the entry field is pushed onto the operands stack.
The operator is then added to the operators stack, but only after all pending operators of higher precedence have been popped and applied to pending operands (e.g., pressing + makes any pending * operators on the stack fire).
When “eval” is pressed, all remaining operators are popped and applied to all remaining operands, and the result is the last remaining value on the operands stack.

In the end, the last value on the operands stack is displayed in the calculator’s entry field, ready for use in another operation. This evaluation algorithm is probably best described by working through examples. Let’s step through the entry of a few expressions and watch the evaluation stacks grow.

PyCalc stack tracing is enabled with the debugme flag in the module; if true, the operator and operand stacks are displayed on stdout each time the Evaluator class is about to apply an operator and reduce (pop) the stacks. Run PyCalc with a console window to see the traces. A tuple holding the stack lists (operators , operands) is printed on each stack reduction; tops of stacks are at the ends of the lists. For instance, here is the console output after typing and evaluating a simple string:

1) Entered keys: "5 * 3 + 4 <eval>" [result = 19]

(['*'], ['5', '3'])    [on '+' press: displays "15"]
(['+'], ['15', '4'])   [on 'eval' press: displays "19"]

Note that the pending (stacked) * subexpression is evaluated when the + is pressed: * operators bind tighter than +, so the code is evaluated immediately before the + operator is pushed. When the + button is pressed, the entry field contains 3; we push 3 onto the operands stack, reduce the * subexpression (5 * 3), push its result onto operands, push + onto operators, and continue scanning user inputs. When “eval” is pressed at the end, 4 is pushed onto operands, and the final + on operators is applied to stacked operands.

The text input and display field at the top of the GUI’s main window plays a part in this algorithm, too. The text input field and expression stacks are integrated by the calculator class. In general, the text input field always holds the prior operand when an operator button is pressed (e.g., on 5 *); the text in the input field is pushed onto the operands stack before the operator is resolved. Because of this, we have to pop results before displaying them after “eval” or ) is pressed; otherwise the results are pushed onto the stack twice—they would be both on the stack and in the display field, from which they would be immediately pushed again when the next operator is input.

For both usability and accuracy, when an operator is seen, we also have to arrange to erase the input field’s prior value when the next operand’s entry is started (e.g., on both 3 and 4 in 5 * 3 + 4). This erasure of the prior values is also arranged when “eval” or ) is applied, on the assumption that a subsequent operand key or button replaces the prior result—for a new expression after “eval,” and for an operand following a new operator after ); e.g., to erase the parenthesized 12 result on 2 in 5 + (3 * 4) * 2. Without this erasure, operand buttons and keys simply concatenate to the currently displayed value. This model also allows user to change temporary result operands after a ) by entry of operand instead of operator.

Expression stacks also defer operations of lower precedence as the input is scanned. In the next trace, the pending + isn’t evaluated when the * button is pressed: since * binds tighter, we need to postpone the + until the * can be evaluated. The * operator isn’t popped until its right operand 4 has been seen. There are two operators to pop and apply to operand stack entries on the “eval” press—the * at the top of operators is applied to the 3 and 4 at the top of operands, and then + is run on 5 and the 12 pushed for *:

2) Entered keys: "5 + 3 * 4 <eval>" [result = 17]

(['+', '*'], ['5', '3', '4'])   [on 'eval' press]
(['+'], ['5', '12'])            [displays "17"]

For strings of same-precedence operators such as the following, we pop and evaluate immediately as we scan left to right, instead of postponing evaluation. This results in a left-associative evaluation, in the absence of parentheses: 5+3+4 is evaluated as ((5+3)+4). For + and * operations this is irrelevant because order doesn’t matter:

3) Entered keys: "5 + 3 + 4 <eval>" [result = 12]

(['+'], ['5', '3'])     [on the second '+']
(['+'], ['8', '4'])     [on 'eval']

The following trace is more complex. In this case, all the operators and operands are stacked (postponed) until we press the ) button at the end. To make parentheses work, ( is given a higher precedence than any operator and is pushed onto the operators stack to seal off lower stack reductions until the ) is seen. When the ) button is pressed, the parenthesized subexpression is popped and evaluated ((3 * 4), then (1 + 12)), and 13 is displayed in the entry field. On pressing “eval,” the rest is evaluated ((3 * 13), (1 +39)), and the final result (40) is shown. This result in the entry field itself becomes the left operand of a future operator.

4) Entered keys: "1 + 3 * ( 1 + 3 * 4 ) <eval>" [result = 40]

(['+', '*', '(', '+', '*'], ['1', '3', '1', '3', '4'])    [on ')']
(['+', '*', '(', '+'], ['1', '3', '1', '12'])             [displays "13"]
(['+', '*'], ['1', '3', '13'])                            [on 'eval']
(['+'], ['1', '39'])

In fact, any temporary result can be used again: if we keep pressing an operator button without typing new operands, it’s reapplied to the result of the prior press—the value in the entry field is pushed twice and applied to itself each time. Press * many times after entering 2 to see how this works (e.g., 2***). On the first *, it pushes 2 and the *. On the next *, it pushes 2 again from the entry field, pops and evaluates the stacked (2 * 2), pushes back and displays the result, and pushes the new *. And on each following *, it pushes the currently displayed result and evaluates again, computing successive squares.

Figure 19-6 shows how the two stacks look at their highest level while scanning the expression in the prior example trace. On each reduction, the top operator is applied to the top two operands and the result is pushed back for the operator below. Because of the way the two stacks are used, the effect is similar to converting the expression to a string of the form +1*3(+1*34 and evaluating it right to left. In other cases, though, parts of the expression are evaluated and displayed as temporary results along the way, so it’s not simply a string conversion process.

Figure 19-6. Evaluation stacks: 1 + 3 * (1 + 3 * 4)

Finally, the next example’s string triggers an error. PyCalc is casual about error handling. Many errors are made impossible by the algorithm itself, but things such as unmatched parentheses still trip up the evaluator. Instead of trying to detect all possible error cases explicitly, a general try statement in the reduce method is used to catch them all: expression errors, numeric errors, undefined name errors, syntax errors, and so on.

Operands and temporary results are always stacked as strings, and each operator is applied by calling eval. When an error occurs inside an expression, a result operand of *ERROR* is pushed, which makes all remaining operators fail in eval too. *ERROR* essentially percolates to the top of the expression. At the end, it’s the last operand and is displayed in the text entry field to alert you of the mistake:

5) Entered keys: "1 + 3 * ( 1 + 3 * 4 <eval>" [result = *ERROR*]

(['+', '*', '(', '+', '*'], ['1', '3', '1', '3', '4'])      [on eval]
(['+', '*', '(', '+'], ['1', '3', '1', '12'])
(['+', '*', '('], ['1', '3', '13'])
(['+', '*'], ['1', '*ERROR*'])
(['+'], ['*ERROR*'])
(['+'], ['*ERROR*', '*ERROR*'])

Try tracing through these and other examples in the calculator’s code to get a feel for the stack-based evaluation that occurs. Once you understand the general shift/reduce (push/pop) mechanism, expression evaluation is straightforward.

PyCalc source code

Example 19-20 contains the PyCalc source module that puts these ideas to work in the context of a GUI. It’s a single-file implementation (not counting utilities imported and reused). Study the source for more details; as usual, there’s no substitute for interacting with the program on your own to get a better feel for its functionality.

Also see the opening comment’s “to do” list for suggested areas for improvement. Like all software systems, this calculator is prone to evolve over time (and in fact it has, with each new edition of this book). Since it is written in Python, such future mutations will be easy to apply.

Example 19-20. PP4ELangCalculatorcalculator.py

#!/usr/local/bin/python
"""
################################################################################
PyCalc 3.0+: a Python/tkinter calculator program and GUI component.

Evaluates expressions as they are entered, catches keyboard keys for
expression entry; 2.0 added integrated command-line popups, a recent
calculations history display popup, fonts and colors configuration,
help and about popups, preimported math/random constants, and more;

3.0+ (PP4E, version number retained):
-port to run under Python 3.X (only)
-drop 'L' keypress (the long type is now dead in earnest)

3.0 changes (PP3E):
-use 'readonly' entry state, not 'disabled', else field is greyed
 out (fix for 2.3 Tkinter change);
-avoid extended display precision for floats by using str(), instead
 of `x`/repr() (fix for Python change);
-apply font to input field to make it larger;
-use justify=right for input field so it displays on right, not left;
-add 'E+' and 'E-' buttons (and 'E' keypress) for float exponents;
 'E' keypress must generally be followed digits, not + or - optr key;
-remove 'L' button (but still allow 'L' keypress): superfluous now,
 because Python auto converts up if too big ('L' forced this in past);
-use smaller font size overall;
-auto scroll to the end in the history window

to do: add a commas-insertion mode (see str.format and LP4E example);
allow '**' as an operator key; allow '+' and 'J' inputs for complex
Numbers; use new decimal type for fixed precision floats; as is, can
use 'cmd' popup windows to input and evaluate things like complex, but
can't be input via main window; caveat: PyCalc's precision, accuracy,
and some of its behaviour, is currently bound by result of str() call;
################################################################################
"""

from tkinter import *                                            # widgets, consts
from PP4E.Gui.Tools.guimixin import GuiMixin                     # quit method
from PP4E.Gui.Tools.widgets import label, entry, button, frame   # widget builders
Fg, Bg, Font = 'black', 'skyblue', ('courier', 14, 'bold')       # default config

debugme = True
def trace(*args):
    if debugme: print(args)


################################################################################
# the main class - handles user interface;
# an extended Frame, on new Toplevel, or embedded in another container widget
################################################################################

class CalcGui(GuiMixin, Frame):
    Operators = "+-*/="                              # button lists
    Operands  = ["abcd", "0123", "4567", "89()"]     # customizable

    def __init__(self, parent=None, fg=Fg, bg=Bg, font=Font):
        Frame.__init__(self, parent)
        self.pack(expand=YES, fill=BOTH)             # all parts expandable
        self.eval = Evaluator()                      # embed a stack handler
        self.text = StringVar()                      # make a linked variable
        self.text.set("0")
        self.erase = 1                               # clear "0" text next
        self.makeWidgets(fg, bg, font)               # build the GUI itself
        if not parent or not isinstance(parent, Frame):
            self.master.title('PyCalc 3.0')          # title iff owns window
            self.master.iconname("PyCalc")           # ditto for key bindings
            self.master.bind('<KeyPress>', self.onKeyboard)
            self.entry.config(state='readonly')      # 3.0: not 'disabled'=grey
        else:
            self.entry.config(state='normal')
            self.entry.focus()

    def makeWidgets(self, fg, bg, font):             # 7 frames plus text-entry
        self.entry = entry(self, TOP, self.text)     # font, color configurable
        self.entry.config(font=font)                 # 3.0: make display larger
        self.entry.config(justify=RIGHT)             # 3.0: on right, not left
        for row in self.Operands:
            frm = frame(self, TOP)
            for char in row:
                button(frm, LEFT, char,
                            lambda op=char: self.onOperand(op),
                            fg=fg, bg=bg, font=font)

        frm = frame(self, TOP)
        for char in self.Operators:
            button(frm, LEFT, char,
                        lambda op=char: self.onOperator(op),
                        fg=bg, bg=fg, font=font)

        frm = frame(self, TOP)
        button(frm, LEFT, 'dot ', lambda: self.onOperand('.'))
        button(frm, LEFT, ' E+ ', lambda: self.text.set(self.text.get()+'E+'))
        button(frm, LEFT, ' E- ', lambda: self.text.set(self.text.get()+'E-'))
        button(frm, LEFT, 'cmd ', self.onMakeCmdline)
        button(frm, LEFT, 'help', self.help)
        button(frm, LEFT, 'quit', self.quit)       # from guimixin

        frm = frame(self, BOTTOM)
        button(frm, LEFT, 'eval ', self.onEval)
        button(frm, LEFT, 'hist ', self.onHist)
        button(frm, LEFT, 'clear', self.onClear)

    def onClear(self):
        self.eval.clear()
        self.text.set('0')
        self.erase = 1

    def onEval(self):
        self.eval.shiftOpnd(self.text.get())     # last or only opnd
        self.eval.closeall()                     # apply all optrs left
        self.text.set(self.eval.popOpnd())       # need to pop: optr next?
        self.erase = 1

    def onOperand(self, char):
        if char == '(':
            self.eval.open()
            self.text.set('(')                      # clear text next
            self.erase = 1
        elif char == ')':
            self.eval.shiftOpnd(self.text.get())    # last or only nested opnd
            self.eval.close()                       # pop here too: optr next?
            self.text.set(self.eval.popOpnd())
            self.erase = 1
        else:
            if self.erase:
                self.text.set(char)                     # clears last value
            else:
                self.text.set(self.text.get() + char)   # else append to opnd
            self.erase = 0

    def onOperator(self, char):
        self.eval.shiftOpnd(self.text.get())    # push opnd on left
        self.eval.shiftOptr(char)               # eval exprs to left?
        self.text.set(self.eval.topOpnd())      # push optr, show opnd|result
        self.erase = 1                          # erased on next opnd|'('

    def onMakeCmdline(self):
        new = Toplevel()                            # new top-level window
        new.title('PyCalc command line')            # arbitrary Python code
        frm = frame(new, TOP)                       # only the Entry expands
        label(frm, LEFT, '>>>').pack(expand=NO)
        var = StringVar()
        ent = entry(frm, LEFT, var, width=40)
        onButton = (lambda: self.onCmdline(var, ent))
        onReturn = (lambda event: self.onCmdline(var, ent))
        button(frm, RIGHT, 'Run', onButton).pack(expand=NO)
        ent.bind('<Return>', onReturn)
        var.set(self.text.get())

    def onCmdline(self, var, ent):            # eval cmdline pop-up input
        try:
            value = self.eval.runstring(var.get())
            var.set('OKAY')
            if value != None:                 # run in eval namespace dict
                self.text.set(value)          # expression or statement
                self.erase = 1
                var.set('OKAY => '+ value)
        except:                               # result in calc field
            var.set('ERROR')                  # status in pop-up field
        ent.icursor(END)                      # insert point after text
        ent.select_range(0, END)              # select msg so next key deletes

    def onKeyboard(self, event):
        pressed = event.char                  # on keyboard press event
        if pressed != '':                     # pretend button was pressed
            if pressed in self.Operators:
                self.onOperator(pressed)
            else:
                for row in self.Operands:
                    if pressed in row:
                        self.onOperand(pressed)
                        break
                else:                                          # 4E: drop 'Ll'
                    if pressed == '.':
                        self.onOperand(pressed)                # can start opnd
                    if pressed in 'Ee':  # 2e10, no +/-
                        self.text.set(self.text.get()+pressed) # can't: no erase
                    elif pressed == '
':
                        self.onEval()                          # enter key=eval
                    elif pressed == ' ':
                        self.onClear()                         # spacebar=clear
                    elif pressed == '':
                        self.text.set(self.text.get()[:-1])    # backspace
                    elif pressed == '?':
                        self.help()

    def onHist(self):
        # show recent calcs log popup
        from tkinter.scrolledtext import ScrolledText     # or PP4E.Gui.Tour
        new = Toplevel()                                  # make new window
        ok = Button(new, text="OK", command=new.destroy)
        ok.pack(pady=1, side=BOTTOM)                      # pack first=clip last
        text = ScrolledText(new, bg='beige')              # add Text + scrollbar
        text.insert('0.0', self.eval.getHist())           # get Evaluator text
        text.see(END)                                     # 3.0: scroll to end
        text.pack(expand=YES, fill=BOTH)

        # new window goes away on ok press or enter key
        new.title("PyCalc History")
        new.bind("<Return>", (lambda event: new.destroy()))
        ok.focus_set()                      # make new window modal:
        new.grab_set()                      # get keyboard focus, grab app
        new.wait_window()                   # don't return till new.destroy

    def help(self):
        self.infobox('PyCalc', 'PyCalc 3.0+
'
                               'A Python/tkinter calculator
'
                               'Programming Python 4E
'
                               'May, 2010
'
                               '(3.0 2005, 2.0 1999, 1.0 1996)

'
                               'Use mouse or keyboard to
'
                               'input numbers and operators,
'
                               'or type code in cmd popup')


################################################################################
# the expression evaluator class
# embedded in and used by a CalcGui instance, to perform calculations
################################################################################

class Evaluator:
    def __init__(self):
        self.names = {}                         # a names-space for my vars
        self.opnd, self.optr = [], []           # two empty stacks
        self.hist = []                          # my prev calcs history log
        self.runstring("from math import *")    # preimport math modules
        self.runstring("from random import *")  # into calc's namespace

    def clear(self):
        self.opnd, self.optr = [], []           # leave names intact
        if len(self.hist) > 64:                 # don't let hist get too big
            self.hist = ['clear']
        else:
            self.hist.append('--clear--')

    def popOpnd(self):
        value = self.opnd[-1]                   # pop/return top|last opnd
        self.opnd[-1:] = []                     # to display and shift next
        return value                            # or x.pop(), or del x[-1]

    def topOpnd(self):
        return self.opnd[-1]                    # top operand (end of list)

    def open(self):
        self.optr.append('(')                   # treat '(' like an operator

    def close(self):                            # on ')' pop downto highest '('
        self.shiftOptr(')')                     # ok if empty: stays empty
        self.optr[-2:] = []                     # pop, or added again by optr

    def closeall(self):
        while self.optr:                        # force rest on 'eval'
            self.reduce()                       # last may be a var name
        try:
            self.opnd[0] = self.runstring(self.opnd[0])
        except:
            self.opnd[0] = '*ERROR*'            # pop else added again next:

    afterMe = {'*': ['+', '-', '(', '='],       # class member
               '/': ['+', '-', '(', '='],       # optrs to not pop for key
               '+': ['(', '='],                 # if prior optr is this: push
               '-': ['(', '='],                 # else: pop/eval prior optr
               ')': ['(', '='],                 # all left-associative as is
               '=': ['('] }

    def shiftOpnd(self, newopnd):               # push opnd at optr, ')', eval
        self.opnd.append(newopnd)

    def shiftOptr(self, newoptr):               # apply ops with <= priority
        while (self.optr and
               self.optr[-1] not in self.afterMe[newoptr]):
            self.reduce()
        self.optr.append(newoptr)               # push this op above result
                                                # optrs assume next opnd erases
    def reduce(self):
        trace(self.optr, self.opnd)
        try:                                    # collapse the top expr
            operator       = self.optr[-1]      # pop top optr (at end)
            [left, right]  = self.opnd[-2:]     # pop top 2 opnds (at end)
            self.optr[-1:] = []                 # delete slice in-place
            self.opnd[-2:] = []
            result = self.runstring(left + operator + right)
            if result == None:
                result = left                   # assignment? key var name
            self.opnd.append(result)            # push result string back
        except:
            self.opnd.append('*ERROR*')         # stack/number/name error

    def runstring(self, code):
        try:                                                  # 3.0: not `x`/repr
            result = str(eval(code, self.names, self.names))  # try expr: string
            self.hist.append(code + ' => ' + result)          # add to hist log
        except:
            exec(code, self.names, self.names)                # try stmt: None
            self.hist.append(code)
            result = None
        return result

    def getHist(self):
        return '
'.join(self.hist)

def getCalcArgs():
    from sys import argv                   # get cmdline args in a dict
    config = {}                            # ex: -bg black -fg red
    for arg in argv[1:]:                   # font not yet supported
        if arg in ['-bg', '-fg']:          # -bg red' -> {'bg':'red'}
            try:
                config[arg[1:]] = argv[argv.index(arg) + 1]
            except:
                pass
    return config

if __name__ == '__main__':
    CalcGui(**getCalcArgs()).mainloop()    # in default toplevel window

Using PyCalc as a component

PyCalc serves a standalone program on my desktop, but it’s also useful in the context of other GUIs. Like most of the GUI classes in this book, PyCalc can be customized with subclass extensions or embedded in a larger GUI with attachments. The module in Example 19-21 demonstrates one way to reuse PyCalc’s CalcGui class by extending and embedding, similar to what was done for the simple calculator earlier.

Example 19-21. PP4ELangCalculatorcalculator_test.py

"""
test calculator: use as an extended and embedded GUI component
"""

from tkinter import *
from calculator import CalcGui

def calcContainer(parent=None):
    frm = Frame(parent)
    frm.pack(expand=YES, fill=BOTH)
    Label(frm, text='Calc Container').pack(side=TOP)
    CalcGui(frm)
    Label(frm, text='Calc Container').pack(side=BOTTOM)
    return frm

class calcSubclass(CalcGui):
    def makeWidgets(self, fg, bg, font):
        Label(self, text='Calc Subclass').pack(side=TOP)
        Label(self, text='Calc Subclass').pack(side=BOTTOM)
        CalcGui.makeWidgets(self, fg, bg, font)
        #Label(self, text='Calc Subclass').pack(side=BOTTOM)

if __name__ == '__main__':
    import sys
    if len(sys.argv) == 1:              # % calculator_test.py
        root = Tk()                     # run 3 calcs in same process
        CalcGui(Toplevel())             # each in a new toplevel window
        calcContainer(Toplevel())
        calcSubclass(Toplevel())
        Button(root, text='quit', command=root.quit).pack()
        root.mainloop()
    if len(sys.argv) == 2:              # % calculator_testl.py -
        CalcGui().mainloop()            # as a standalone window (default root)
    elif len(sys.argv) == 3:            # % calculator_test.py - -
        calcContainer().mainloop()      # as an embedded component
    elif len(sys.argv) == 4:            # % calculator_test.py - - -
        calcSubclass().mainloop()       # as a customized superclass

Figure 19-7 shows the result of running this script with no command-line arguments. We get instances of the original calculator class, plus the container and subclass classes defined in this script, all attached to new top-level windows.

Figure 19-7. The calculator_test script: attaching and extending

These two windows on the right reuse the core PyCalc code running in the window on the left. All of these windows run in the same process (e.g., quitting one quits them all), but they all function as independent windows. Note that when running three calculators in the same process like this, each has its own distinct expression evaluation namespace because it’s a class instance attribute, not a global module-level variable. Because of that, variables set in one calculator are set in that calculator only, and they don’t overwrite settings made in other windows. Similarly, each calculator has its own evaluation stack manager object, such that calculations in one window don’t appear in or impact other windows at all.

The two extensions in this script are artificial, of course—they simply add labels at the top and bottom of the window—but the concept is widely applicable. You could reuse the calculator’s class by attaching it to any GUI that needs a calculator and customize it with subclasses arbitrarily. It’s a reusable widget.

Adding new buttons in new components

One obvious way to reuse the calculator is to add additional expression feature buttons—square roots, inverses, cubes, and the like. You can type such operations in the command-line pop ups, but buttons are a bit more convenient. Such features could also be added to the main calculator implementation itself, but since the set of features that will be useful may vary per user and application, a better approach may be to add them in separate extensions. For instance, the class in Example 19-22 adds a few extra buttons to PyCalc by embedding (i.e., attaching) it in a container.

Example 19-22. PP4ELangCalculatorcalculator_plus_emb.py

"""
#############################################################################
a container with an extra row of buttons for common operations;
a more useful customization: adds buttons for more operations (sqrt,
1/x, etc.) by embedding/composition, not subclassing; new buttons are
added after entire CalGui frame because of the packing order/options;
#############################################################################
"""

from tkinter import *
from calculator import CalcGui, getCalcArgs
from PP4E.Gui.Tools.widgets import frame, button, label

class CalcGuiPlus(Toplevel):
    def __init__(self, **args):
        Toplevel.__init__(self)
        label(self, TOP, 'PyCalc Plus - Container')
        self.calc = CalcGui(self, **args)
        frm = frame(self, BOTTOM)
        extras = [('sqrt', 'sqrt(%s)'),
                  ('x^2 ',  '(%s)**2'),
                  ('x^3 ',  '(%s)**3'),
                  ('1/x ',  '1.0/(%s)')]
        for (lab, expr) in extras:
            button(frm, LEFT, lab, (lambda expr=expr: self.onExtra(expr)))
        button(frm, LEFT, ' pi ', self.onPi)

    def onExtra(self, expr):
        text = self.calc.text
        eval = self.calc.eval
        try:
            text.set(eval.runstring(expr % text.get()))
        except:
            text.set('ERROR')

    def onPi(self):
        self.calc.text.set(self.calc.eval.runstring('pi'))

if __name__ == '__main__':
    root = Tk()
    button(root, TOP, 'Quit', root.quit)
    CalcGuiPlus(**getCalcArgs()).mainloop()       # -bg,-fg to calcgui

Because PyCalc is coded as a Python class, you can always achieve a similar effect by extending PyCalc in a new subclass instead of embedding it, as shown in Example 19-23.

Example 19-23. PP4ELangCalculatorcalculator_plus_ext.py

"""
#############################################################################
a customization with an extra row of buttons for common operations;
a more useful customization: adds buttons for more operations (sqrt,
1/x, etc.) by subclassing to extend the original class, not embedding;
new buttons show up before frame attached to bottom by calcgui class;
#############################################################################
"""

from tkinter import *
from calculator import CalcGui, getCalcArgs
from PP4E.Gui.Tools.widgets import label, frame, button

class CalcGuiPlus(CalcGui):
    def makeWidgets(self, *args):
        label(self, TOP, 'PyCalc Plus - Subclass')
        CalcGui.makeWidgets(self, *args)
        frm = frame(self, BOTTOM)
        extras = [('sqrt', 'sqrt(%s)'),
                  ('x^2 ', '(%s)**2'),
                  ('x^3 ', '(%s)**3'),
                  ('1/x ', '1.0/(%s)')]
        for (lab, expr) in extras:
            button(frm, LEFT, lab, (lambda expr=expr: self.onExtra(expr)))
        button(frm, LEFT, ' pi ', self.onPi)

    def onExtra(self, expr):
        try:
            self.text.set(self.eval.runstring(expr % self.text.get()))
        except:
            self.text.set('ERROR')

    def onPi(self):
        self.text.set(self.eval.runstring('pi'))

if __name__ == '__main__':
    CalcGuiPlus(**getCalcArgs()).mainloop()       # passes -bg, -fg on

Notice that these buttons’ callbacks force floating-point division to be used for inverses just because that’s how / operates in Python 3.X (// for integers truncates remainders instead); the buttons also wrap entry field values in parentheses to sidestep precedence issues. They could instead convert the entry’s text to a number and do real math, but Python does all the work automatically when expression strings are run raw.

Also note that the buttons added by these scripts simply operate on the current value in the entry field, immediately. That’s not quite the same as expression operators applied with the stacks evaluator (additional customizations are needed to make them true operators). Still, these buttons prove the point these scripts are out to make—they use PyCalc as a component, both from the outside and from below.

Finally, to test both of the extended calculator classes, as well as PyCalc configuration options, the script in Example 19-24 puts up four distinct calculator windows (this is the script run by PyDemos).

Example 19-24. PP4ELangCalculatorcalculator_plusplus.py

#!/usr/local/bin/python
"""
demo all 3 calculator flavors at once
each is a distinct calculator object and window
"""

from tkinter import Tk, Button, Toplevel
import calculator, calculator_plus_ext, calculator_plus_emb

root=Tk()
calculator.CalcGui(Toplevel())
calculator.CalcGui(Toplevel(), fg='white', bg='purple')
calculator_plus_ext.CalcGuiPlus(Toplevel(), fg='gold', bg='black')
calculator_plus_emb.CalcGuiPlus(fg='black', bg='red')
Button(root, text='Quit Calcs', command=root.quit).pack()
root.mainloop()

Figure 19-8 shows the result—four independent calculators in top-level windows within the same process. The two windows on the right represent specialized reuses of PyCalc as a component, and the Help dialog appears in the lower right. Although it may not be obvious in this book, all four use different color schemes; calculator classes accept color and font configuration options and pass them down the call chain as needed.

As we learned earlier, these calculators could also be run as independent processes by spawning command lines with the launchmodes module we met in Chapter 5. In fact, that’s how the PyGadgets and PyDemos launcher bars run calculators, so see their code for more details. And as always, read the code and experiment on your own for further enlightenment; this is Python, after all.

Figure 19-8. calculator_plusplus: extend, embed, and configure!

This chapter concludes our Python language material in this book. The next and final technical chapter of the text takes us on a tour of techniques for integrating Python with programs written in compiled languages like C and C++. Not everyone needs to know how to do this, so some readers may wish to skip ahead to the book’s conclusion in Chapter 21 at this point. Since most Python programmers use wrapped C libraries (even if they don’t wrap them themselves), though, I recommend a quick pass over the next chapter before you close the book on this book.

In closing, here’s a less tangible but important aspect of Python programming. A common remark among new users is that it’s easy to “say what you mean” in Python without getting bogged down in complex syntax or obscure rules. It’s a programmer-friendly language. In fact, it’s not too uncommon for Python programs to run on the first attempt.

As we’ve seen in this book, there are a number of factors behind this distinction—lack of declarations, no compile steps, simple syntax, useful built-in objects, powerful libraries, and so on. Python is specifically designed to optimize speed of development (an idea we’ll expand on in Chapter 21). For many users, the end result is a remarkably expressive and responsive language, which can actually be fun to use for real work.

For instance, the calculator programs of this chapter were initially thrown together in one afternoon, starting from vague, incomplete goals. There was no analysis phase, no formal design, and no official coding stage. I typed up some ideas and they worked. Moreover, Python’s interactive nature allowed me to experiment with new ideas and get immediate feedback. Since its initial development, the calculator has been polished and expanded much, of course, but the core implementation remains unchanged.

Naturally, such a laid-back programming mode doesn’t work for every project. Sometimes more upfront design is warranted. For more demanding tasks, Python has modular constructs and fosters systems that can be extended in either Python or C. And a simple calculator GUI may not be what some would call “serious” software development. But maybe that’s part of the point, too.

^[71]Once again, I need to warn you about running code strings like this if you can’t be sure they won’t cause damage. If these strings can be entered by users you cannot trust, they will have access to anything on the computer that the Python process has access to. See Chapters 9 and 15 for more on security issues related to code run in GUI, Web, and other contexts.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 19. Text and Language

Create new playlist

Sign In

Sign Up