In one form or another, processing text-based information is one of the more common tasks that applications need to perform. This can include anything from scanning a text file by columns to analyzing statements in a language defined by a formal grammar. Such processing usually is called parsing—analyzing the structure of a text string. In this chapter, we’ll explore ways to handle language and text-based information and summarize some Python development concepts in sidebars along the way. In the process, we’ll meet string methods, text pattern matching, XML and HTML parsers, and other tools.
Some of this material is advanced, but the examples are small to keep this chapter short. For instance, recursive descent parsing is illustrated with a simple example to show how it can be implemented in Python. We’ll also see that it’s often unnecessary to write custom parsers for each language processing task in Python. They can usually be replaced by exporting APIs for use in Python programs, and sometimes by a single built-in function call. Finally, this chapter closes by presenting PyCalc—a calculator GUI written in Python, and the last major Python coding example in this text. As we’ll see, writing calculators isn’t much more difficult than juggling stacks while scanning text.
In the grand scheme of things, there are a variety of ways to handle text processing and language analysis in Python:
Built-in string object expressions
Built-in string object method calls
Regular expression pattern matching
XML and HTML text parsing
Custom language parsers, both handcoded and generated
Running Python code with eval and exec built-ins
Natural language processing
For simpler tasks, Python’s built-in string object is often all we really need. Python strings can be indexed, concatenated, sliced, and processed with both string method calls and built-in functions. Our main emphasis in this chapter is on higher-level tools and techniques for analyzing textual information and language, but we’ll briefly explore each of these techniques in turn. Let’s get started.
Some readers may have come to this chapter seeking coverage of Unicode text, too, but this topic is not presented here. For a look at Python’s Unicode support, see Chapter 2’s discussion of string tools, Chapter 4’s discussion of text and binary file distinctions and encodings, and Chapter 9’s coverage of text in tkinter GUIs. Unicode also appears in various Internet and database topics throughout this book (e.g., email encodings).
Because Unicode is a core language topic, all these chapters
will also refer you to the fuller coverage of Unicode in
Learning
Python, Fourth Edition. Most of the topics in
this chapter, including string methods and pattern matching, apply
to Unicode automatically simply because the Python 3.X str
string type is
Unicode, whether ASCII or wider.
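As a quick preview, the same method calls used throughout this chapter work on non-ASCII text without any special handling (the sample string here is my own, not from the book's examples):

```python
# str methods apply to non-ASCII text unchanged, because the
# Python 3.X str type is Unicode whether the content is ASCII or wider
s = 'späm eggs'
print(s.upper())        # 'SPÄM EGGS'
print(s.split())        # ['späm', 'eggs']
print(s.find('eggs'))   # 5
```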
The first stop on our text and language tour is the most basic: Python’s
string objects come with an array of text processing tools, and serve
as your first line of defense in this domain. As you undoubtedly know
by now, concatenation, slicing, formatting, and other string
expressions are workhorses of most programs (I’m
including the newer format
method
in this category, as it’s really just an alternative to
the %
expression):
>>> 'spam eggs ham'[5:10]                         # slicing: substring extraction
'eggs '
>>> 'spam ' + 'eggs ham'                          # concatenation (and *, len(), [ix])
'spam eggs ham'
>>> 'spam %s %s' % ('eggs', 'ham')                # formatting expression: substitution
'spam eggs ham'
>>> 'spam {} {}'.format('eggs', 'ham')            # formatting method: % alternative
'spam eggs ham'
>>> 'spam = "%-5s", %+06d' % ('ham', 99)          # more complex formatting
'spam = "ham  ", +00099'
>>> 'spam = "{0:<5}", {1:+06}'.format('ham', 99)
'spam = "ham  ", +00099'
These operations are covered in core language resources such as
Learning
Python. For the purposes of this chapter, though,
we’re interested in more powerful tools: Python’s string object
methods include a wide variety of text-processing
utilities that go above and beyond string expression operators. We saw
some of these in action early on in Chapter 2,
and have been using them ever since. For instance, given an instance
str
of the built-in string object
type, operations like the following are provided as object method
calls:
str.find(substr)
str.replace(old, new)
str.split(delimiter)
str.join(iterable)
str.strip(), str.rstrip()
str.rjust(width)
str.upper(), str.isupper(), str.isdigit()

str.endswith(substr-or-tuple)
Tests for a substring (or a tuple of alternatives) at the end

str.startswith(substr-or-tuple)
Tests for a substring (or a tuple of alternatives) at the front
This list is representative but partial, and some of these
methods take additional optional arguments. For the full list of
string methods, run a dir(str) call at the Python interactive prompt
and run help(str.method) on any method for some quick
documentation. The Python library manual and reference books such as
Python Pocket
Reference also include an exhaustive list.
Moreover, in Python today all normal string methods apply to
both bytes
and str
strings. The latter makes them
applicable to arbitrarily encoded Unicode text, simply because the
str
type is Unicode text, even if
it’s only ASCII. These methods originally appeared as functions in
the string
module, but
are only object methods today; the string
module is still present because it
contains predefined constants (e.g., string.ascii_uppercase
), as well as the Template
substitution interface in 2.4 and
later—one of the techniques
discussed in the next section.
By way of review, let’s take a quick look at string methods in
the context of some of their most common use cases. As we saw when
generating HTML forwarding pages in Chapter 6, the string replace
method is
often adequate by itself as a string templating
tool—we can compute values and insert them at fixed positions in a
string with simple replacement calls:
>>> template = '---$target1---$target2---'
>>> val1 = 'Spam'
>>> val2 = 'shrubbery'
>>> template = template.replace('$target1', val1)
>>> template = template.replace('$target2', val2)
>>> template
'---Spam---shrubbery---'
As we also saw when generating HTML reply pages in the CGI
scripts of Chapters 15 and 16,
the string %
formatting operator
is also a powerful templating tool, especially when combined with
dictionaries—simply fill out a dictionary with values and apply
multiple substitutions to the HTML string all at once:
>>>template = """
...---
...---%(key1)s---
...---%(key2)s---
..."""
>>> >>>vals = {}
>>>vals['key1'] = 'Spam'
>>>vals['key2'] = 'shrubbery'
>>>print(template % vals)
--- ---Spam--- ---shrubbery---
Beginning with Python 2.4, the string
module’s Template
feature is essentially a simplified and limited variation of
the dictionary-based format scheme just shown, but it allows some
additional call patterns which some may consider simpler:
>>> vals
{'key2': 'shrubbery', 'key1': 'Spam'}
>>> import string
>>> template = string.Template('---$key1---$key2---')
>>> template.substitute(vals)
'---Spam---shrubbery---'
>>> template.substitute(key1='Brian', key2='Loretta')
'---Brian---Loretta---'
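The Template object also provides a safe_substitute method, which leaves unknown placeholders intact instead of raising an exception for missing names; a quick sketch:

```python
import string

template = string.Template('---$key1---$key2---')

# substitute raises KeyError for a missing name; safe_substitute
# leaves the unmatched placeholder in the result instead
print(template.safe_substitute(key1='Spam'))    # '---Spam---$key2---'
```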
See the library manual for more on this extension. Although the string datatype does not itself support the pattern-directed text processing that we’ll meet later in this chapter, its tools are powerful enough for many tasks.
In terms of this chapter’s main focus, Python’s built-in tools for splitting and joining strings around tokens turn out to be especially useful when it comes to parsing text:
str.split(delimiter?, maxsplits?)
Splits a string into a list of substrings, using either whitespace (tabs, spaces, newlines) or an explicitly passed string as a delimiter. maxsplits limits the number of splits performed, if passed.
delimiter.join(iterable)
Concatenates a sequence or other iterable of substrings (e.g., list, tuple, generator), adding the subject separator string between each.
These two are among the most powerful of string methods. As we
saw in Chapter 2, split
chops a string into a list of
substrings and join
puts them
back together:
>>> 'A B C D'.split()
['A', 'B', 'C', 'D']
>>> 'A+B+C+D'.split('+')
['A', 'B', 'C', 'D']
>>> '--'.join(['a', 'b', 'c'])
'a--b--c'
Despite their simplicity, they can handle surprisingly complex text-parsing tasks. Moreover, string method calls are very fast because they are implemented in C language code. For instance, to quickly replace all tabs in a file with four periods, pipe the file into a script that looks like this:
from sys import *
stdout.write(('.' * 4).join(stdin.read().split('\t')))
The split
call here divides
input around tabs, and the join
puts it back together with periods where tabs had been. In this
case, the combination of the two calls is equivalent to using the
simpler global replacement string method call as follows:
stdout.write(stdin.read().replace('\t', '.' * 4))
As we’ll see in the next section, splitting strings is sufficient for many text-parsing goals.
Let’s look next at some practical applications of string splits and joins. In many domains, scanning files by columns is a fairly common task. For instance, suppose you have a file containing columns of numbers output by another system, and you need to sum each column’s numbers. In Python, string splitting is the core operation behind solving this problem, as demonstrated by Example 19-1. As an added bonus, it’s easy to make the solution a reusable tool in Python by packaging it as an importable function.
#!/usr/local/bin/python

def summer(numCols, fileName):
    sums = [0] * numCols                     # make list of zeros
    for line in open(fileName):              # scan file's lines
        cols = line.split()                  # split up columns
        for i in range(numCols):             # around blanks/tabs
            sums[i] += eval(cols[i])         # add numbers to sums
    return sums

if __name__ == '__main__':
    import sys
    print(summer(eval(sys.argv[1]), sys.argv[2]))    # '% summer.py cols file'
Notice that we use file iterators here to read line by line,
instead of calling the file readlines
method
explicitly (recall from Chapter 4
that iterators avoid loading the entire file into memory all at
once). The file itself is a temporary object, which will be
automatically closed when garbage collected.
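Incidentally, if deterministic closing matters (for example, on non-CPython implementations where garbage collection may be delayed), a with statement closes the file explicitly. The following is a sketch of my own recoding along these lines, not in the book's examples package; it also swaps eval for the safer float converter:

```python
def summer(numCols, fileName):
    sums = [0] * numCols
    with open(fileName) as infile:         # closed on block exit, not by GC
        for line in infile:                # still iterates line by line
            cols = line.split()
            for i in range(numCols):
                sums[i] += float(cols[i])  # float, not eval: numbers only
    return sums
```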
As usual for properly architected scripts, you can both import this module and
call its function, and run it as a shell tool
from the command line. The summer.py script
calls split
to make a list of
strings representing the line’s columns, and eval
to convert column strings to numbers.
Here’s an input file that uses both blanks and tabs to separate
columns, and the result of turning our script loose on it:
C:\...\PP4E\Lang> type table1.txt
1       5       10      2       1.0
2       10      20      4       2.0
3       15      30      8       3
4       20      40      16      4.0

C:\...\PP4E\Lang> python summer.py 5 table1.txt
[10, 50, 100, 30, 10.0]
Also notice that because the summer script uses eval
to convert file text to numbers, you
could really store arbitrary Python expressions in the file. Here,
for example, it’s run on a file of Python code snippets:
C:\...\PP4E\Lang> type table2.txt
2        1+1          1<<1          eval("2")
16       2*2*2*2      pow(2,4)      16.0
3        len('abc')   [1,2,3][2]    {'spam':3}['spam']

C:\...\PP4E\Lang> python summer.py 4 table2.txt
[21, 21, 21, 21.0]
We’ll revisit eval
later
in this chapter, when we explore expression
evaluators. Sometimes this is more than we want—if we can’t be
sure that the strings that we run this way won’t contain malicious
code, for instance, it may be necessary to run them with limited
machine access or use more restrictive conversion tools. Consider
the following recoding of the summer
function (this is in file
summer2.py in the examples
package if you care to experiment with it):
def summer(numCols, fileName):
    sums = [0] * numCols
    for line in open(fileName):              # use file iterators
        cols = line.split(',')               # assume comma-delimited
        nums = [int(x) for x in cols]        # use limited converter
        both = zip(sums, nums)               # avoid nested for loop
        sums = [x + y for (x, y) in both]    # 3.X: zip is an iterable
    return sums
This version uses int
for
its conversions from strings to support only numbers, and not
arbitrary and possibly unsafe expressions. Although the first four
lines of this coding are similar to the original, for variety this
version also assumes the data is separated by commas rather than
whitespace, and runs list comprehensions and zip
to avoid the nested for
loop statement. This version is also
substantially trickier than the original and so might be less
desirable from a maintenance perspective. If its code is
confusing, try adding print
call statements after each step to trace the results of each
operation. Here is its handiwork:
C:\...\PP4E\Lang> type table3.txt
1,5,10,2,1
2,10,20,4,2
3,15,30,8,3
4,20,40,16,4

C:\...\PP4E\Lang> python summer2.py 5 table3.txt
[10, 50, 100, 30, 10]
The summer logic so far works, but it can be even more general—by making the column numbers a key of a dictionary rather than an offset in a list, we can remove the need to pass in a number-columns value altogether. Besides allowing us to associate meaningful labels with data rather than numeric positions, dictionaries are often more flexible than lists in general, especially when there isn’t a fixed size to our problem. For instance, suppose you need to sum up columns of data stored in a text file where the number of columns is not known or fixed:
C:\...\PP4E\Lang> python
>>> print(open('table4.txt').read())
001.1   002.2   003.3
010.1   020.2   030.3   040.4
100.1   200.2   300.3
Here, we cannot preallocate a fixed-length list of sums
because the number of columns may vary. Splitting on whitespace
extracts the columns, and float
converts to numbers, but a fixed-size list won’t easily
accommodate a set of sums (at least, not without extra code to
manage its size). Dictionaries are more convenient here because we
can use column positions as keys instead of using absolute
offsets. The following code demonstrates interactively (it’s also
in file summer3.py in the
examples package):
>>> sums = {}
>>> for line in open('table4.txt'):
...     cols = [float(col) for col in line.split()]
...     for pos, val in enumerate(cols):
...         sums[pos] = sums.get(pos, 0.0) + val
...
>>> for key in sorted(sums):
...     print(key, '=', sums[key])
...
0 = 111.3
1 = 222.6
2 = 333.9
3 = 40.4
>>> sums
{0: 111.3, 1: 222.6, 2: 333.90000000000003, 3: 40.4}
Interestingly, most of this code uses tools added to Python
over the years—file and dictionary iterators, comprehensions,
dict.get
, and the enumerate
and sorted
built-ins were not yet formed
when Python was new. For related examples, also see the tkinter
grid examples in Chapter 9
for another case of eval
table
magic at work. That chapter’s table sums logic is a variation on
this theme, which obtains the number of columns from the first
line of a data file and tailors its summations for display in a
GUI.
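As a related footnote, the collections module's defaultdict can shorten the dictionary-based version a bit more by supplying the 0.0 default automatically; the following is a sketch of my own, not part of the book's examples package:

```python
from collections import defaultdict

def summer(fileName):
    sums = defaultdict(float)              # missing keys start at 0.0
    for line in open(fileName):
        for pos, val in enumerate(line.split()):
            sums[pos] += float(val)        # no get() or preallocation needed
    return dict(sums)                      # return a plain dict for display
```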
Splitting comes in handy for dividing text into columns, but it can also be used as a more general parsing tool—by splitting more than once on different delimiters, we can pick apart more complex text. Although such parsing can also be achieved with more powerful tools, such as the regular expressions we’ll meet later in this chapter, split-based parsing is simpler to code in quick prototypes, and may run faster.
For instance, Example 19-2 demonstrates one way that splitting and joining strings can be used to parse sentences in a simple language. It is taken from a rule-based expert system shell (holmes) that is written in Python and included in this book’s examples distribution (more on holmes in a moment). Rule strings in holmes take the form:
"rule <id> if <test1>, <test2>... then <conclusion1>, <conclusion2>..."
Tests and conclusions are conjunctions of terms (“,” means
“and”). Each term is a list of
words or variables separated by spaces; variables start
with ?
. To use a rule, it is
translated to an internal form—a dictionary with nested lists. To
display a rule, it is translated back to the string form. For
instance, given the call:
rules.internal_rule('rule x if a ?x, b then c, d ?x')
the conversion in function internal_rule
proceeds as follows:
string = 'rule x if a ?x, b then c, d ?x'
i      = ['rule x', 'a ?x, b then c, d ?x']
t      = ['a ?x, b', 'c, d ?x']
r      = ['', 'x']
result = {'rule': 'x', 'if': [['a','?x'], ['b']], 'then': [['c'], ['d','?x']]}
We first split around the if
, then around the then
, and finally around rule
. The result is the three substrings
that were separated by the keywords. Test and conclusion substrings
are split around “,” first and spaces last. join
is used later to convert back (unparse) to the
original string for display. Example 19-2 is the concrete
implementation of this scheme.
def internal_rule(string):
    i = string.split(' if ')
    t = i[1].split(' then ')
    r = i[0].split('rule ')
    return {'rule': r[1].strip(), 'if': internal(t[0]), 'then': internal(t[1])}

def external_rule(rule):
    return ('rule '  + rule['rule'] +
            ' if '   + external(rule['if']) +
            ' then ' + external(rule['then']) + '.')

def internal(conjunct):
    res = []                                 # 'a b, c d'
    for clause in conjunct.split(','):       # -> ['a b', ' c d']
        res.append(clause.split())           # -> [['a','b'], ['c','d']]
    return res

def external(conjunct):
    strs = []
    for clause in conjunct:                  # [['a','b'], ['c','d']]
        strs.append(' '.join(clause))        # -> ['a b', 'c d']
    return ', '.join(strs)                   # -> 'a b, c d'
Today we could use list comprehensions and generator
expressions to gain some conciseness here. The internal
and external
functions, for instance, could be
recoded to simply (see file rules2.py):
def internal(conjunct):
    return [clause.split() for clause in conjunct.split(',')]

def external(conjunct):
    return ', '.join(' '.join(clause) for clause in conjunct)
to produce the desired nested lists and string by combining two steps into one. This form might run faster; we’ll leave it to the reader to decide whether it is more difficult to understand. As usual, we can test components of this module interactively:
>>> import rules
>>> rules.internal('a ?x, b')
[['a', '?x'], ['b']]
>>> rules.internal_rule('rule x if a ?x, b then c, d ?x')
{'then': [['c'], ['d', '?x']], 'rule': 'x', 'if': [['a', '?x'], ['b']]}
>>> r = rules.internal_rule('rule x if a ?x, b then c, d ?x')
>>> rules.external_rule(r)
'rule x if a ?x, b then c, d ?x.'
Parsing by splitting strings around tokens like this takes you only so far. There is no direct support for recursive nesting of components, and syntax errors are not handled very gracefully. But for simple language tasks like this, string splitting might be enough, at least for prototyping or experimental systems. You can always add a more robust rule parser later or reimplement rules as Python code or classes.
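For instance, a few guard checks make the failure mode explicit when a rule string is malformed. This sketch of my own (not part of holmes) inlines the comprehension-based converters and raises a descriptive exception instead of a bare IndexError:

```python
def internal_rule(string):
    # validate the expected keywords before splitting blindly
    if ' if ' not in string or ' then ' not in string:
        raise ValueError('malformed rule: %r' % string)
    i = string.split(' if ')
    t = i[1].split(' then ')
    r = i[0].split('rule ')
    if len(r) < 2:
        raise ValueError('missing rule name: %r' % string)
    return {'rule': r[1].strip(),
            'if':   [clause.split() for clause in t[0].split(',')],
            'then': [clause.split() for clause in t[1].split(',')]}
```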
So how are these rules actually used? As mentioned, the rule parser we just met is part of the Python-coded holmes expert system shell. Holmes is an old system, written around 1993, before Python 1.0. It implemented both forward and backward chaining inference over rule sets. For example, the rule:
rule pylike if ?X likes coding, ?X likes spam then ?X likes Python
can be used both to prove whether someone likes Python
(chaining backward, from “then” to “if”), and
to deduce that someone likes Python from a set of known facts
(chaining forward, from “if” to “then”).
Deductions may span multiple rules, multiple clauses represent
conjunctions, and rules that name the same conclusion represent
alternatives. holmes also performs simple pattern-matching along
the way to assign the variables that appear in rules (e.g.,
?X
), and it is able to explain
both its deductions and questions.
Holmes also served as proof of concept for Python in general at a time when such evidence was still sometimes necessary, and at last check it still worked unchanged on Python 2.X platforms. However, its code no longer reflects modern Python best practice. Because of this, I no longer maintain this system today. In fact, it has suffered from bit rot so much and for so long that I’ve opted not to revisit it for this edition at all. Its original Python 0.X code is included in the book examples package, but it has not been ported to Python 3.X form, and does not accurately reflect Python as it exists today.
That is, holmes is an ex-system. It has ceased to be. And it won’t be discussed further here. For more modern AI tools and support for Python, search the Web. This is a fun field to explore, but holmes is probably best left to the foggy mists of Python prehistory (and souls brave enough to try the port).
Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks where the structure of your data is not so rigidly defined, Python provides regular expression matching utilities. Especially for the kinds of text associated with domains such as the Internet and databases today, this flexibility can be a powerful ally.
Regular expressions are simply strings that define patterns to be matched against other strings. Supply a pattern and a string and ask whether the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, matches not only give a yes/no answer, but also can pick out substrings as well.
Regular expression pattern strings can be complicated (let’s be honest—they can be downright gross to look at). But once you get the hang of them, they can replace larger handcoded string search routines—a single pattern string generally does the work of dozens of lines of manual string scanning code and may run much faster. They are a concise way to encode the expected structure of text and extract portions of it.
In Python, regular expressions are not part of the syntax of the
Python language itself, but they are supported by the re
standard library module that you must
import to use. The module defines functions for running matches
immediately, compiling pattern strings into pattern objects,
matching these objects against strings, and fetching matched
substrings after a match. It also provides tools for pattern-based
splitting, replacing, and so on.
The re
module implements a
rich regular expression pattern syntax that tries to be close to
that used to code patterns in the Perl language (regular expressions
are a feature of Perl worth emulating). For instance, re
supports the notions of named groups;
character classes; and nongreedy
matches—regular expression pattern operators that match as few
characters as possible (other operators always match the longest
possible substring). The re
module has also been optimized repeatedly, and in Python 3.X
supports matching for both bytes
byte-strings and str
Unicode
strings. The net effect is that Python’s pattern support uses
Perl-like patterns, but is invoked with a different top-level module
interface.
I need to point out up front that regular expressions are complex tools that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions, written by Jeffrey E. F. Friedl (O’Reilly), is a good next step to take. We won’t be able to cover pattern construction itself well enough here to turn you into an expert. Once you learn how to code patterns, though, the top-level interface for performing matches is straightforward. In fact, they are so easy to use that we’ll jump right into some live examples before getting into more details.
There are two basic ways to kick off matches: through top-level function calls and via methods of precompiled pattern objects. The latter precompiled form is quicker if you will be applying the same pattern more than once—to all lines in a text file, for instance. To demonstrate, let’s do some matching on the following strings (see file re-interactive.txt for all the interactive code run in this section):
>>> text1 = 'Hello spam...World'
>>> text2 = 'Hello spam...other'
The match performed in the following code does not precompile: it executes an immediate match to look for all the characters between the words Hello and World in our text strings:
>>> import re
>>> matchobj = re.match('Hello(.*)World', text2)
>>> print(matchobj)
None
When a match fails as it does here (the text2
string doesn’t end in
World), we get back the None
object, which is Boolean false if
tested in an if
statement.
In the pattern string we’re using here in the first argument
to re.match
, the words Hello and
World match themselves, and (.*)
means any character (.
) repeated zero or more times (*
). The fact that it is enclosed in
parentheses tells Python to save away the part of the string matched
by that part of the pattern as a group—a
matched substring available after the match. To see how, we need to
make a match work:
>>> matchobj = re.match('Hello(.*)World', text1)
>>> print(matchobj)
<_sre.SRE_Match object at 0x009D6520>
>>> matchobj.group(1)
' spam...'
When a match succeeds, we get back a match
object, which has interfaces for extracting matched
substrings—the group(1)
call
returns the portion of the string matched by the first, leftmost,
parenthesized portion of the pattern (our (.*)
). As mentioned, matching is not just
a yes/no answer; by enclosing parts of the pattern in parentheses,
it is also a way to extract matched substrings. In this case, we’ve
parsed out the text between Hello and
World. Group number 0
is the entire string matched by the
pattern—useful if you want to be sure your pattern is consuming all
the text you think it is.
The interface for precompiling is similar, but the pattern is
implied in the pattern object we get back from
the compile
call:
>>> pattobj = re.compile('Hello(.*)World')
>>> matchobj = pattobj.match(text1)
>>> matchobj.group(1)
' spam...'
Again, you should precompile for speed if you will run the
pattern multiple times, and you normally will when scanning files
line by line. Here’s something a bit more complex that hints at the
generality of patterns. This one allows for zero or more blanks or
tabs at the front ([ ]*
), skips
one or more after the word Hello ([ ]+
), captures characters in the middle
((.*)
), and allows the final word
to begin with an upper- or lowercase letter ([Ww]
); as you can see, patterns can handle
wide variations in data:
>>> patt = '[ ]*Hello[ ]+(.*)[Ww]orld'
>>> line = ' Hello spamworld'
>>> mobj = re.match(patt, line)
>>> mobj.group(1)
'spam'
Notice that we matched a str
pattern to a str
string in the last listing. We can
also match bytes
to bytes
in order to handle data such as
encoded text, but we cannot mix the two string types (a constraint
which is true in Python in general—Python wouldn’t have the encoding
information needed to know how to convert between the raw bytes and
the Unicode text):
>>> patt = b'[ ]*Hello[ ]+(.*)[Ww]orld'       # both as bytes works too
>>> line = b' Hello spamworld'                # and returns bytes groups
>>> re.match(patt, line).group(1)             # but cannot mix str/bytes
b'spam'
>>> re.match(patt, ' Hello spamworld')
TypeError: can't use a bytes pattern on a string-like object
>>> re.match('[ ]*Hello[ ]+(.*)[Ww]orld', line)
TypeError: can't use a string pattern on a bytes-like object
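When the two types must meet, decode the bytes to str first (or encode the pattern to bytes). This sketch assumes the data is ASCII-encoded text:

```python
import re

data = b' Hello spamworld'                 # raw bytes from a file or socket
text = data.decode('ascii')                # decode before matching a str pattern
print(re.match('[ ]*Hello[ ]+(.*)[Ww]orld', text).group(1))    # 'spam'
```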
In addition to the tools these examples demonstrate, there are
methods for scanning ahead to find a match (search
), scanning
to find all matches (findall
),
splitting and replacing on patterns, and so on. All have analogous
module and precompiled call forms. The next section turns to a few
examples to demonstrate more of the basics.
Notice how the preceding example skips optional whitespace and allows for uppercase or lowercase letters. This underscores why you may want to use patterns in the first place—they support more general kinds of text than string object methods can. Here’s another case in point: we’ve seen that string methods can split on and replace a substring, but they don’t suffice if the delimiter might be more than one alternative:
>>> 'aaa--bbb--ccc'.split('--')
['aaa', 'bbb', 'ccc']
>>> 'aaa--bbb--ccc'.replace('--', '...')          # string methods use fixed strings
'aaa...bbb...ccc'
>>> 'aaa--bbb==ccc'.split(['--', '=='])
TypeError: Can't convert 'list' object to str implicitly
>>> 'aaa--bbb==ccc'.replace(['--', '=='], '...')
TypeError: Can't convert 'list' object to str implicitly
Patterns can do similar work, but also can handle alternatives
directly, by virtue of their pattern matching syntax. In the
following, the syntax --|==
matches either string --
or string ==
; the syntax [-=]
matches either the character -
or =
(a character set); and the form (?:)
can be used to group nested parts of
a pattern without forming a saved substring group (split treats
groups specially):
>>> import re
>>> re.split('--', 'aaa--bbb--ccc')
['aaa', 'bbb', 'ccc']
>>> re.sub('--', '...', 'aaa--bbb--ccc')          # single string case
'aaa...bbb...ccc'
>>> re.split('--|==', 'aaa--bbb==ccc')            # split on -- or ==
['aaa', 'bbb', 'ccc']
>>> re.sub('--|==', '...', 'aaa--bbb==ccc')       # replace -- or ==
'aaa...bbb...ccc'
>>> re.split('[-=]', 'aaa-bbb=ccc')               # single char alternative
['aaa', 'bbb', 'ccc']
>>> re.split('(--)|(==)', 'aaa--bbb==ccc')        # split includes groups
['aaa', '--', None, 'bbb', None, '==', 'ccc']
>>> re.split('(?:--)|(?:==)', 'aaa--bbb==ccc')    # expr part, not group
['aaa', 'bbb', 'ccc']
Similarly, splits can extract simple substrings for fixed
delimiters, but patterns can also handle surrounding context like
brackets, mark parts as optional, ignore whitespace, and more. In
the next tests \s* means zero or more whitespace characters (a
character class); \s+ means one or more of the same; /? matches an
optional slash; [a-z] is any lowercase letter (a range); (.*?) means
a saved substring of zero or more of any character again—but only as
many as needed to match the rest of the pattern (nongreedily); and
the groups method is used to fetch the substrings matched by the
parenthesized subpatterns all at once:
>>> 'spam/ham/eggs'.split('/')
['spam', 'ham', 'eggs']
>>> re.match('(.*)/(.*)/(.*)', 'spam/ham/eggs').groups()
('spam', 'ham', 'eggs')
>>> re.match('<(.*)>/<(.*)>/<(.*)>', '<spam>/<ham>/<eggs>').groups()
('spam', 'ham', 'eggs')
>>> re.match('\s*<(.*)>/?<(.*)>/?<(.*)>', ' <spam>/<ham><eggs>').groups()
('spam', 'ham', 'eggs')
>>> 'Hello pattern world!'.split()
['Hello', 'pattern', 'world!']
>>> re.match('Hello\s*([a-z]*)\s+(.*?)\s*!', 'Hellopattern world !').groups()
('pattern', 'world')
In fact, there’s more than one way to match. The findall
method provides generality that leaves string objects
in the dust—it locates all occurrences of a pattern and returns all
the substrings it matched (or a list of tuples for multiple groups).
The search
method is similar but stops at the first match—it’s
like match
plus an initial forward scan. In the following, string
object’s find method locates just one specific string, but patterns can be
used to locate and extract bracketed text anywhere in a string, even
pairs with optional text between:
>>> '<spam>/<ham>/<eggs>'.find('ham')                    # find substring offset
8
>>> re.findall('<(.*?)>', '<spam>/<ham>/<eggs>')         # find all matches/groups
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>', '<spam> / <ham><eggs>')
['spam', 'ham', 'eggs']
>>> re.findall('<(.*?)>/?<(.*?)>', '<spam>/<ham> ... <eggs><cheese>')
[('spam', 'ham'), ('eggs', 'cheese')]
>>> re.search('<(.*?)>/?<(.*?)>', 'todays menu: <spam>/<ham>...<eggs><s>').groups()
('spam', 'ham')
Especially when using findall
, the (?s)
operator comes in handy to force
.
to match
end-of-line characters in multiline text;
without it .
matches everything
except line ends. The following searches look
for two adjacent bracketed strings with arbitrary text between, with
and without skipping line breaks:
>>> re.findall('<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>')       # stop at \n
[]
>>> re.findall('(?s)<(.*?)>.*<(.*?)>', '<spam> \n <ham>\n<eggs>')   # greedy
[('spam', 'eggs')]
>>> re.findall('(?s)<(.*?)>.*?<(.*?)>', '<spam> \n <ham>\n<eggs>')  # nongreedy
[('spam', 'ham')]
To make larger patterns more mnemonic, we can even associate names
with matched substring groups using the (?P<name>) pattern syntax and
fetch them by name after matches, though this is of limited utility
for findall. The next tests look for strings of “word” characters
(\w) separated by a /—this isn’t much more than a string split, but
parts are named, and search and findall both scan ahead:
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groups()
('aaa', 'bbb')
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').groupdict()
{'part1': 'aaa', 'part2': 'bbb'}
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group(2)
'bbb'
>>> re.search('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb/ccc]').group('part2')
'bbb'
>>> re.findall('(?P<part1>\w*)/(?P<part2>\w*)', '...aaa/bbb ccc/ddd]')
[('aaa', 'bbb'), ('ccc', 'ddd')]
Finally, although basic string operations such as slicing and
splits are sometimes enough, patterns are much more flexible. The
following uses [^ ]
to match any
character not in the set following the ^
, and escapes a dash within a []
set using \-
so it’s not taken to be a character-set
range separator. It runs equivalent slices, splits, and matches,
along with a more general match that the other two cannot
approach:
>>> line = 'aaa bbb ccc'
>>> line[:3], line[4:7], line[8:11]        # slice data at fixed offsets
('aaa', 'bbb', 'ccc')
>>> line.split()                           # split data with fixed delimiters
['aaa', 'bbb', 'ccc']
>>> re.split(' +', line)                   # split on general delimiters
['aaa', 'bbb', 'ccc']
>>> re.findall('[^ ]+', line)              # find non-delimiter runs
['aaa', 'bbb', 'ccc']
>>> line = 'aaa...bbb-ccc / ddd.-/e&e*e'   # handle generalized text
>>> re.findall('[^ .\-/]+', line)
['aaa', 'bbb', 'ccc', 'ddd', 'e&e*e']
At this point, if you’ve never used pattern syntax in the past,
your head may very well be spinning (or have blown off entirely!).
Before we go into any further examples, let’s dig into a few of the
details underlying the re
module
and its patterns.
The Python re
module
comes with functions that can search for patterns
right away or make compiled pattern objects for running matches
later. Pattern objects (and module search calls) in turn generate
match objects, which contain information about successful matches
and matched substrings. For reference, the next few sections
describe the module’s interfaces and some of the operators you can
use to code patterns.
The top level of the module provides functions for matching, substitution, precompiling, and so on:
compile(pattern [, flags])
    Compile a regular expression pattern string into a regular expression pattern object, for later matching. See the reference manual or Python Pocket Reference for the flags argument’s meaning.

match(pattern, string [, flags])
    If zero or more characters at the start of string match the pattern string, return a corresponding match object, or None if no match is found. Roughly like a search for a pattern that begins with the ^ operator.

search(pattern, string [, flags])
    Scan through string for a location matching pattern, and return a corresponding match object, or None if no match is found.

findall(pattern, string [, flags])
    Return a list of strings giving all nonoverlapping matches of pattern in string. If there are any groups in the pattern, return a list of groups, and a list of tuples if the pattern has more than one group.

finditer(pattern, string [, flags])
    Return an iterator over all nonoverlapping matches of pattern in string.

split(pattern, string [, maxsplit, flags])
    Split string by occurrences of pattern. If capturing parentheses (()) are used in the pattern, the text of all groups in the pattern is also returned in the resulting list.

sub(pattern, repl, string [, count, flags])
    Return the string obtained by replacing the (first count) leftmost nonoverlapping occurrences of pattern (a string or a pattern object) in string by repl (which may be a string with backslash escapes that may back-reference a matched group, or a function that is passed a single match object and returns the replacement string).

subn(pattern, repl, string [, count, flags])
    Same as sub, but returns a tuple: (new-string, number-of-substitutions-made).

escape(string)
    Return string with all nonalphanumeric characters backslashed, such that they can be compiled as a string literal.
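As a quick, generic illustration of the split, sub, subn, and escape calls just listed (the sample strings here are invented for demonstration, not taken from the book’s examples):

```python
import re

# split: if the pattern has capturing groups, their text is kept in the result
parts = re.split(r',\s*', 'spam, ham,eggs')       # ['spam', 'ham', 'eggs']
kept  = re.split(r'(,)\s*', 'spam, ham,eggs')     # delimiters retained too

# sub replaces matches; subn also reports how many replacements were made
text  = re.sub(r'\d+', '#', 'a1 b22 c333')        # 'a# b# c#'
both  = re.subn(r'\d+', '#', 'a1 b22 c333')       # ('a# b# c#', 3)

# escape neutralizes special characters so literal text can be matched
assert re.match(re.escape('1+1'), '1+1') is not None
```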
At the next level, pattern objects provide similar attributes, but the
pattern string is implied. The re.compile
function in the previous
section is useful to optimize patterns that may be matched more
than once (compiled patterns match faster). Pattern objects
returned by re.compile
have
these sorts of attributes:
match(string [, pos [, endpos]])
search(string [, pos [, endpos]])
findall(string [, pos [, endpos]])
finditer(string [, pos [, endpos]])
split(string [, maxsplit])
sub(repl, string [, count])
subn(repl, string [, count])
These are the same as the re
module functions, but the pattern is
implied, and pos
and endpos
give start/end string indexes for
the match.
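For instance, pos and endpos can confine a compiled pattern’s scan to a substring of the target; a small sketch, with a made-up pattern:

```python
import re

patt = re.compile('spam')            # hypothetical pattern for illustration

m1 = patt.search('xxspamxx')         # scans the whole string
m2 = patt.search('xxspamxx', 3)      # pos=3: starts scanning inside 'spam'
m3 = patt.match('xxspamxx', 2, 7)    # match at pos=2, as if string ended at 7
```

Here m1 finds the pattern at offset 2, m2 comes back None because the scan starts too late, and m3 succeeds because match is anchored at pos.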
Finally, when a match
or
search
function or method is
successful, you get back a match object (None
comes back on failed matches).
Match objects export a set of attributes of their own,
including:
group([g1, g2, ...])
    Return the substring that matched a parenthesized group (or groups) in the pattern. Accepts group numbers or names. Group numbers start at 1; group 0 is the entire string matched by the pattern. Returns a tuple when passed multiple group numbers, and the group number defaults to 0 if omitted.

groups()
    Returns a tuple of all groups’ substrings of the match (for group numbers 1 and higher).

groupdict()
    Returns a dictionary containing all named groups of the match (see the (?P<name>R) syntax ahead).

start([group]), end([group])
    Indices of the start and end of the substring matched by group (or the entire matched string, if no group is passed).

span([group])
    Returns the two-item tuple (start(group), end(group)).

expand(template)
    Performs backslash group substitutions; see the Python library manual.
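A brief session ties several of these match object attributes together (the email-like string here is invented for illustration):

```python
import re

mobj = re.search(r'(\w+)@(\w+)', 'mail: lutz@oreilly, ...')

whole = mobj.group(0)                 # entire matched substring
pair  = mobj.group(1, 2)              # tuple of both groups' substrings
where = mobj.span(1)                  # (start, end) offsets of group 1
swap  = mobj.expand(r'\2/\1')         # backslash substitution on a template
```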
Regular expression strings are built up by concatenating single-character
regular expression forms, shown in Table 19-1. The longest-matching string is
usually matched by each form, except for the nongreedy operators.
In the table, R
means any
regular expression form, C
is a
character, and N
denotes a
digit.
| Operator | Interpretation |
| --- | --- |
| . | Matches any character (including newline if the DOTALL flag is specified) |
| ^ | Matches start of the string (of every line in MULTILINE mode) |
| $ | Matches end of the string (of every line in MULTILINE mode) |
| C | Any nonspecial (or backslash-escaped) character matches itself |
| R* | Zero or more of preceding regular expression R (as many as possible) |
| R+ | One or more of preceding regular expression R (as many as possible) |
| R? | Zero or one occurrence of preceding regular expression R |
| R{m} | Matches exactly m copies of preceding regular expression R |
| R{m,n} | Matches from m to n copies of preceding regular expression R |
| R*?, R+?, R??, R{m,n}? | Same as *, +, ?, and {m,n}, but matches as few characters as possible; nongreedy |
| [...] | Defines character set: e.g., [a-zA-Z] matches all letters |
| [^...] | Defines complemented character set: matches if character is not in set |
| \ | Escapes special characters (e.g., \* matches a literal *) |
| \\ | Matches a literal \ |
| \N | Matches the contents of the group of the same number N |
| R1\|R2 | Alternative: matches left or right R |
| R1R2 | Concatenation: match both R1 and R2 |
| (R) | Matches any regular expression inside (), and delimits a group (retains matched substring) |
| (?:R) | Same as (R) but doesn’t delimit a group |
| (?=R) | Look-ahead assertion: matches if R matches next, but doesn’t consume any of the string |
| (?!R) | Matches if R doesn’t match next; negative of (?=R) |
| (?P<name>R) | Matches any regular expression inside (), and delimits a named group |
| (?P=name) | Matches whatever text was matched by the earlier group named name |
| (?#...) | A comment; ignored |
| (?letter) | Set mode flag; letter is one of a, i, L, m, s, u, x (see the library manual) |
| (?<=R) | Look-behind assertion: matches if the current position in the string is preceded by a match of R that ends at the current position |
| (?<!R) | Matches if the current position in the string is not preceded by a match for R; negative of (?<=R) |
| (?(id/name)yespattern\|nopattern) | Will try to match with yespattern if the group with the given id or name exists, else with the optional nopattern |
Within patterns, ranges and selections can be combined. For
instance, [a-zA-Z0-9_]+
matches
the longest possible string of one or more letters, digits, or
underscores. Special characters can be escaped as usual in Python
strings: [\t ]*
matches zero or
more tabs and spaces (i.e., it skips such whitespace).
The parenthesized grouping construct, (R)
, lets you extract matched substrings
after a successful match. The portion of the string matched by the
expression in parentheses is retained in a numbered register. It’s
available through the group
method of a match object after a successful match.
In addition to the entries in this table, special sequences
in Table 19-2 can be used in
patterns, too. Because of Python string rules, you sometimes must
double up on backslashes (\\)
or use Python raw strings (r'...'
) to retain backslashes in the
pattern verbatim. Python ignores backslashes in normal strings if
the letter following the backslash is not recognized as an escape
code. Some of the entries in Table 19-2 are affected by Unicode when
matching str
instead of
bytes
, and an ASCII flag may be
set to emulate the behavior for bytes
; see Python’s manuals for more
details.
| Sequence | Interpretation |
| --- | --- |
| \number | Matches text of group number (numbered starting at 1) |
| \A | Matches only at the start of the string |
| \b | Empty string at word boundaries |
| \B | Empty string not at word boundaries |
| \d | Any decimal digit character (like [0-9]) |
| \D | Any nondecimal digit character (like [^0-9]) |
| \s | Any whitespace character (like [ \t\n\r\f\v]) |
| \S | Any nonwhitespace character (like [^ \t\n\r\f\v]) |
| \w | Any alphanumeric character (like [a-zA-Z0-9_]) |
| \W | Any nonalphanumeric character (like [^a-zA-Z0-9_]) |
| \Z | Matches only at the end of the string |
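To see the backslash-doubling issue mentioned earlier in concrete form, here is a short generic sketch; both spellings reach the re module as the same two-character pattern:

```python
import re

# a doubled backslash in a normal string equals a raw-string backslash
assert '\\d' == r'\d'
assert re.match('\\d+', '42').group(0) == '42'
assert re.match(r'\d+', '42').group(0) == '42'

# without doubling, '\b' is a backspace character, not a word boundary
assert '\b' != r'\b'
hit = re.search(r'\bspam\b', 'eat spam now')   # word-boundary match
```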
Most of the standard escapes supported by Python string
literals are also accepted by the regular expression parser:
\a, \b, \f, \n, \r, \t, \v, \x, and \\. The Python library manual gives
these escapes’ interpretation and additional details on pattern
syntax in general. But to further demonstrate how the re
pattern syntax is typically used in
scripts, let’s go back to writing some code.
For more context, the next few examples present short test files that
match simple but representative pattern forms. Comments in Example 19-3 describe the
operations exercised; check Table 19-1 to
see which operators are used in these patterns. If they are still
confusing, try running these tests interactively, and call group(0)
instead of start()
to see which strings are being
matched by the patterns.
""" literals, sets, ranges, alternatives, and escapes all tests here print 2: offset where pattern found """ import re # the one to use today pattern, string = "A.C.", "xxABCDxx" # nonspecial chars match themselves matchobj = re.search(pattern, string) # '.' means any one char if matchobj: # search returns match object or None print(matchobj.start()) # start is index where matched pattobj = re.compile("A.*C.*") # 'R*' means zero or more Rs matchobj = pattobj.search("xxABCDxx") # compile returns pattern obj if matchobj: # patt.search returns match obj print(matchobj.start()) # selection sets print(re.search(" *A.C[DE][D-F][^G-ZE]G + ?", "..ABCDEFG ..").start()) # alternatives: R1|R2 means R1 or R2 print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").start()) # test each char print(re.search("(?:A|X)(?:B|Y)(?:C|Z)D", "..AYCD..").start()) # same, not saved print(re.search("A|XB|YC|ZD", "..AYCD..").start()) # matches just A! print(re.search("(A|XB|YC|ZD)YCD", "..AYCD..").start()) # just first char # word boundaries print(re.search(r"ABCD", "..ABCD ").start()) # means word boundary print(re.search(r"ABCD", "..ABCD ").start()) # use r'...' to escape ''
Notice again that there are different ways to kick off a match
with re
: by calling module search
functions and by making compiled pattern objects. In either event,
you can hang on to the resulting match object or not. All the print
call statements in this script show a result of 2
—the offset where the pattern was found
in the string. In the first test, for example, A.C.
matches the ABCD
at offset 2
in the search string (i.e., after the
first xx
):
C:...PP4ELang> python re-basic.py
2
...8 more 2s omitted...
Next, in Example 19-4, parts of the pattern strings enclosed in parentheses delimit groups; the parts of the string they matched are available after the match.
""" groups: extract substrings matched by REs in '()' parts groups are denoted by position, but (?P<name>R) can also name them """ import re patt = re.compile("A(.)B(.)C(.)") # saves 3 substrings mobj = patt.match("A0B1C2") # each '()' is a group, 1..n print(mobj.group(1), mobj.group(2), mobj.group(3)) # group() gives substring patt = re.compile("A(.*)B(.*)C(.*)") # saves 3 substrings mobj = patt.match("A000B111C222") # groups() gives all groups print(mobj.groups()) print(re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups()) print(re.search("(?P<a>A|X)(?P<b>B|Y)(?P<c>C|Z)D", "..AYCD..").groupdict()) patt = re.compile(r"[ ]*#s*defines*([a-z0-9_]*)s*(.*)") mobj = patt.search(" # define spam 1 + 2 + 3") # parts of C #define print(mobj.groups()) # s is whitespace
In the first test here, for instance, the three (.)
groups each match a single character,
but they retain the character matched; calling group
pulls out the character matched. The
second test’s (.*)
groups match
and retain any number of characters. The third and fourth tests
show how alternatives can be grouped by both position and name, and
the last test matches C #define
lines—more on this pattern in a moment:
C:...PP4ELang> python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
{'a': 'A', 'c': 'C', 'b': 'Y'}
('spam', '1 + 2 + 3')
Finally, besides matches and substring extraction, re
also includes tools for string
replacement or substitution (see Example 19-5).
"substitutions: replace occurrences of pattern in string" import re print(re.sub('[ABC]', '*', 'XAXAXBXBXCXC')) print(re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')) # alternatives char + _ print(re.sub('(.) spam', 'spam\1', 'x spam, y spam')) # group back ref (or r'') def mapper(matchobj): return 'spam' + matchobj.group(1) print(re.sub('(.) spam', mapper, 'x spam, y spam')) # mapping function
In the first test, all characters in the set are replaced; in
the second, they must be followed by an underscore. The last two
tests illustrate more advanced group back-references and mapping
functions in the replacement. Note the \\1
required to escape the backslash
under Python’s string rules; r'spam\1'
would work just as well. See
also the earlier interactive tests in the section for additional
substitution and splitting examples:
C:...PP4ELang> python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*
spamx, spamy
spamx, spamy
To wrap up, let’s turn to a more realistic example: the script in
Example 19-6 puts these
pattern operators to more practical use. It uses regular expressions
to find #define
and #include
lines in C header files and
extract their components. The generality of the patterns makes them
detect a variety of line formats; pattern groups (the parts in
parentheses) are used to extract matched substrings from a line
after a match.
"Scan C header files to extract parts of #define and #include lines" import sys, re pattDefine = re.compile( # compile to pattobj '^#[ ]*define[ ]+(w+)[ ]*(.*)') # "# define xxx yyy..." # w like [a-zA-Z0-9_] pattInclude = re.compile( '^#[ ]*include[ ]+[<"]([w./]+)') # "# include <xxx>..." def scan(fileobj): count = 0 for line in fileobj: # scan by lines: iterator count += 1 matchobj = pattDefine.match(line) # None if match fails if matchobj: name = matchobj.group(1) # substrings for (...) parts body = matchobj.group(2) print(count, 'defined', name, '=', body.strip()) continue matchobj = pattInclude.match(line) if matchobj: start, stop = matchobj.span(1) # start/stop indexes of (...) filename = line[start:stop] # slice out of line print(count, 'include', filename) # same as matchobj.group(1) if len(sys.argv) == 1: scan(sys.stdin) # no args: read stdin else: scan(open(sys.argv[1], 'r')) # arg: input filename
To test, let’s run this script on the text file in Example 19-7.
#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#include <lib/spam.h>
# include "Python.h"

#define DEBUG
#define HELLO 'hello regex world'
# define SPAM 1234

#define EGGS sunny + side + up
#define ADDER(arg) 123 + arg

#endif
Notice the spaces after #
in some of these lines; regular expressions are flexible enough to
account for such departures from the norm. Here is the script at
work, picking out #include
and
#define
lines and their parts.
For each matched line, it prints the line number, the line type, and
any matched substrings:
C:...PP4ELang> python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg
For an additional example of regular expressions at work, see the file pygrep1.py in the book examples package; it implements a simple pattern-based “grep” file search utility, but was cut here for space. As we’ll see, we can also sometimes use regular expressions to parse information from XML and HTML text—the topics of the next section.
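The pygrep1.py script itself isn’t listed here, but its core idea can be sketched in a few lines; this is a generic stand-in for illustration, not the book’s version:

```python
import re, sys

def grep(pattern, lines):
    # yield (line-number, line) pairs for lines the pattern matches
    compiled = re.compile(pattern)
    for number, line in enumerate(lines, 1):
        if compiled.search(line):
            yield number, line.rstrip()

if __name__ == '__main__' and len(sys.argv) > 2:     # e.g., grep.py patt file
    for num, text in grep(sys.argv[1], open(sys.argv[2])):
        print('%d: %s' % (num, text))
```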
Beyond string objects and regular expressions, Python ships with support for parsing some specific and commonly used types of formatted text. In particular, it provides precoded parsers for XML and HTML which we can deploy and customize for our text processing goals.
In the XML department, Python includes parsing support in its standard library and plays host to a prolific XML special-interest group. XML (for eXtensible Markup Language) is a tag-based markup language for describing many kinds of structured data. Among other things, it has been adopted in roles such as a standard database and Internet content representation in many contexts. As an object-oriented scripting language, Python mixes remarkably well with XML’s core notion of structured document interchange.
XML is based upon a tag syntax familiar to web page writers,
used to describe and package data. The xml
module package in Python’s standard library includes
tools for parsing this data from XML text, with
both the SAX and the DOM standard parsing models, as well as the
Python-specific ElementTree package. Although regular expressions can
sometimes extract information from XML documents, too, they can be
easily misled by unexpected text, and don’t directly support the
notion of arbitrarily nested XML constructs (more on this limitation
later when we explore languages in general).
In short, SAX parsers call methods of a handler class we provide as the parsing operation progresses, and DOM parsers are given access to an object tree representing the (usually) already parsed document. SAX parsers are essentially state machines and must record (and possibly stack) document details as the parse progresses; DOM parsers walk object trees using loops, attributes, and methods defined by the DOM standard. ElementTree is roughly a Python-specific analog of DOM, and as such can often yield simpler code; it can also be used to generate XML text from its object-based document representation.
Beyond these parsing tools, Python also ships with an xmlrpc
package to
support the client and server sides of the XML-RPC protocol (remote
procedure calls that transmit objects encoded as XML over HTTP), as
well as a standard HTML parser, html.parser
, that works on similar principles and is presented later
in this chapter. The third-party domain has even more XML-related
tools; most of these are maintained separately from Python to allow
for more flexible release schedules. Beginning with Python 2.3, the
Expat
parser is also included as the underlying engine that drives
the parsing process.
XML processing is a large, evolving topic, and it is mostly beyond the scope of this book. For an example of a simple XML parsing task, though, consider the XML file in Example 19-8. This file defines a handful of O’Reilly Python books—ISBN numbers as attributes, and titles, publication dates, and authors as nested tags (with apologies to Python books not listed in this completely random sample—there are many!).
<catalog>
    <book isbn="0-596-00128-2">
        <title>Python &amp; XML</title>
        <date>December 2001</date>
        <author>Jones, Drake</author>
    </book>
    <book isbn="0-596-15810-6">
        <title>Programming Python, 4th Edition</title>
        <date>October 2010</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15806-8">
        <title>Learning Python, 4th Edition</title>
        <date>September 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-15808-4">
        <title>Python Pocket Reference, 4th Edition</title>
        <date>October 2009</date>
        <author>Lutz</author>
    </book>
    <book isbn="0-596-00797-3">
        <title>Python Cookbook, 2nd Edition</title>
        <date>March 2005</date>
        <author>Martelli, Ravenscroft, Ascher</author>
    </book>
    <book isbn="0-596-10046-9">
        <title>Python in a Nutshell, 2nd Edition</title>
        <date>July 2006</date>
        <author>Martelli</author>
    </book>
    <!-- plus many more Python books that should appear here -->
</catalog>
Let’s quickly explore ways to extract this file’s book ISBN numbers and corresponding titles by example, using each of the four primary Python tools at our disposal—patterns, SAX, DOM, and ElementTree.
In some contexts, the regular expressions we met earlier can be used
to parse information from XML files. They are not complete
parsers, and are not very robust or accurate in the presence of
arbitrary text (text in tag attributes can especially throw them
off). Where applicable, though, they offer a simple option. Example 19-9 shows how we
might go about parsing the XML file in Example 19-8 with the prior
section’s re
module. Like all
four examples in this section, it scans the XML file looking at
ISBN numbers and associated titles, and stores the two as keys and
values in a Python dictionary.
""" XML parsing: regular expressions (no robust or general) """ import re, pprint text = open('books.xml').read() # str if str pattern pattern = '(?s)isbn="(.*?)".*?<title>(.*?)</title>' # *?=nongreedy found = re.findall(pattern, text) # (?s)=dot matches /n mapping = {isbn: title for (isbn, title) in found} # dict from tuple list pprint.pprint(mapping)
When run, the re.findall
method locates all the nested tags we’re interested
in, extracts their content, and returns a list of tuples
representing the two parenthesized groups in the pattern.
Python’s pprint
module displays the dictionary created by the comprehension
nicely. The extract works, but only as long as the text doesn’t
deviate from the expected pattern in ways that would invalidate
our script. Moreover, the XML entity for “&” in the first
book’s title is not un-escaped automatically:
C:...PP4ELangXml> python rebook.py
{'0-596-00128-2': 'Python &amp; XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
To do better, Python’s full-blown XML parsing tools let us perform this data extraction in a more accurate and robust way. Example 19-10, for instance, defines a SAX-based parsing procedure: its class implements callback methods that will be called during the parse, and its top-level code creates and runs a parser.
""" XML parsing: SAX is a callback-based API for intercepting parser events """ import xml.sax, xml.sax.handler, pprint class BookHandler(xml.sax.handler.ContentHandler): def __init__(self): self.inTitle = False # handle XML parser events self.mapping = {} # a state machine model def startElement(self, name, attributes): if name == "book": # on start book tag self.buffer = "" # save ISBN for dict key self.isbn = attributes["isbn"] elif name == "title": # on start title tag self.inTitle = True # save title text to follow def characters(self, data): if self.inTitle: # on text within tag self.buffer += data # save text if in title def endElement(self, name): if name == "title": self.inTitle = False # on end title tag self.mapping[self.isbn] = self.buffer # store title text in dict parser = xml.sax.make_parser() handler = BookHandler() parser.setContentHandler(handler) parser.parse('books.xml') pprint.pprint(handler.mapping)
The SAX model is efficient, but it is potentially confusing at first glance, because the class must keep track of where the parse currently is using state information. For example, when the title tag is first detected, we set a state flag and initialize a buffer; as each character within the title tag is parsed, we append it to the buffer until the ending portion of the title tag is encountered. The net effect saves the title tag’s content as a string. This model is simple, but can be complex to manage; in cases of potentially arbitrary nesting, for instance, state information may need to be stacked as the class receives callbacks for nested tags.
To kick off the parse, we make a parser object, set its
handler to an instance of our class, and start the parse; as
Python scans the XML file, our class’s methods are called
automatically as components are encountered. When the parse is
complete, we use the Python pprint
module to display the result again—the mapping
dictionary object attached to
our handler. The result is mostly the same this time, but
notice that the “&” escape sequence is properly un-escaped
now—SAX performs XML parsing, not text matching:
C:...PP4ELangXml> python saxbook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
The DOM parsing model for XML is perhaps simpler to understand—we simply traverse a tree of objects after the parse—but it might be less efficient for large documents, if the document is parsed all at once ahead of time and stored in memory. DOM also supports random access to document parts via tree fetches, nested loops for known structures, and recursive traversals for arbitrary nesting; in SAX, we are limited to a single linear parse. Example 19-11 is a DOM-based equivalent to the SAX parser of the preceding section.
""" XML parsing: DOM gives whole document to the application as a traversable object """ import pprint import xml.dom.minidom from xml.dom.minidom import Node doc = xml.dom.minidom.parse("books.xml") # load doc into object # usually parsed up front mapping = {} for node in doc.getElementsByTagName("book"): # traverse DOM object isbn = node.getAttribute("isbn") # via DOM object API L = node.getElementsByTagName("title") for node2 in L: title = "" for node3 in node2.childNodes: if node3.nodeType == Node.TEXT_NODE: title += node3.data mapping[isbn] = title # mapping now has the same value as in the SAX example pprint.pprint(mapping)
The output of this script is the same as what we generated interactively for the SAX parser; here, though, it is built up by walking the document object tree after the parse has finished using method calls and attributes defined by the cross-language DOM standard specification. This is both a strength and potential weakness of DOM—its API is language neutral, but it may seem a bit nonintuitive and verbose to some Python programmers accustomed to simpler models:
C:...PP4ELangXml> python dombook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
As a fourth option, the popular ElementTree package is a standard library tool for both parsing and generating XML. As a parser, it’s essentially a more Pythonic type of DOM—it parses documents into a tree of objects again, but the API for navigating the tree is more lightweight, because it’s Python-specific.
ElementTree provides easy-to-use tools for parsing, changing, and generating XML documents. For both parsing and generating, it represents documents as a tree of Python “element” objects. Each element in the tree has a tag name, attribute dictionary, text value, and sequence of child elements. The element object produced by a parse can be navigated with normal Python loops for known structures, and with recursion where arbitrary nesting is possible.
The ElementTree system began its life as a third-party
extension, but it was largely incorporated into Python’s standard
library as the package xml.etree
. Example 19-12 shows how to
use it to parse our book catalog file one last time.
""" XML parsing: ElementTree (etree) provides a Python-based API for parsing/generating """ import pprint from xml.etree.ElementTree import parse mapping = {} tree = parse('books.xml') for B in tree.findall('book'): isbn = B.attrib['isbn'] for T in B.findall('title'): mapping[isbn] = T.text pprint.pprint(mapping)
When run we get the exact same results as for SAX and DOM again, but the code required to extract the file’s details seems noticeably simpler this time around:
C:...PP4ELangXml> python etreebook.py
{'0-596-00128-2': 'Python & XML',
'0-596-00797-3': 'Python Cookbook, 2nd Edition',
'0-596-10046-9': 'Python in a Nutshell, 2nd Edition',
'0-596-15806-8': 'Learning Python, 4th Edition',
'0-596-15808-4': 'Python Pocket Reference, 4th Edition',
'0-596-15810-6': 'Programming Python, 4th Edition'}
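ElementTree also builds and writes XML, and recursion handles arbitrarily nested elements; here is a minimal sketch with invented tag data (not the book catalog):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# build a small document in memory, then serialize it to XML text
root = Element('catalog')
book = SubElement(root, 'book', isbn='0-000-00000-0')
SubElement(book, 'title').text = 'An Example'
xml = tostring(root, encoding='unicode')

def walk(element, depth=0):
    # recursive traversal copes with any nesting depth
    found = [(depth, element.tag)]
    for child in element:
        found += walk(child, depth + 1)
    return found
```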
Naturally, there is much more to Python’s XML support than these simple examples imply. In deference to space, though, here are pointers to XML resources in lieu of additional examples:
First, be sure to consult the Python library manual
for more on the standard library’s XML support tools. See
the entries for re,
xml.sax
, xml.dom
, and xml.etree
for more on this
section’s examples.
You can also find Python XML tools and documentation at the XML Special Interest Group (SIG) web page at http://www.python.org. This SIG is dedicated to wedding XML technologies with Python, and it publishes free XML tools independent of Python itself. Much of the standard library’s XML support originated with this group’s work.
You can also find free, third-party Python support tools for XML on the Web by following links at the XML SIGs web page. Of special interest, the 4Suite open source package provides integrated tools for XML processing, including open technologies such as DOM, SAX, RDF, XSLT, XInclude, XPointer, XLink, and XPath.
A variety of books have been published which specifically address XML and text processing in Python. O’Reilly offers a book dedicated to the subject of XML processing in Python, Python & XML, written by Christopher A. Jones and Fred L. Drake, Jr.
As usual, be sure to also see your favorite web search engine for more recent developments on this front.
Although more limited in scope, Python’s html.parser
standard library module also supports HTML-specific
parsing, useful in “screen scraping” roles to extract information
from web pages. Among other things, this parser can be used to
process Web replies fetched with the urllib.request
module we met in the Internet part of this book, to extract
plain text from HTML email messages, and more.
The html.parser
module has
an API reminiscent of the XML SAX model of the prior section: it
provides a parser which we subclass to intercept tags and their data
during a parse. Unlike SAX, we don’t provide a handler class, but
extend the parser class directly. Here’s a quick interactive example
to demonstrate the basics (I copied all of this section’s code into
file htmlparser.py in the
examples package if you wish to experiment with it yourself):
>>> from html.parser import HTMLParser
>>> class ParsePage(HTMLParser):
...     def handle_starttag(self, tag, attrs):
...         print('Tag start:', tag, attrs)
...     def handle_endtag(self, tag):
...         print('tag end: ', tag)
...     def handle_data(self, data):
...         print('data......', data.rstrip())
...
Now, create a web page’s HTML text string; we hardcode one
here, but it might also be loaded from a file, or fetched from a
website with urllib.request
:
>>>page = """
...<html>
...<h1>Spam!</h1>
...<p>Click this <a href="http://www.python.org">python</a> link</p>
...</html>"""
Finally, kick off the parse by feeding text to a parser instance—tags in the HTML text trigger class method callbacks, with tag names and attribute sequences passed in as arguments:
>>> parser = ParsePage()
>>> parser.feed(page)
data......
Tag start: html []
data......
Tag start: h1 []
data...... Spam!
tag end:  h1
data......
Tag start: p []
data...... Click this
Tag start: a [('href', 'http://www.python.org')]
data...... python
tag end:  a
data...... link
tag end:  p
data......
tag end:  html
As you can see, the parser’s methods receive callbacks for events during the parse. Much like SAX XML parsing, your parser class will need to keep track of its state in attributes as it goes if it wishes to do something more specific than print tag names, attributes, and content. Watching for specific tags’ content, though, might be as simple as checking names and setting state flags. Moreover, building object trees to reflect the page’s structure during the parse would be straightforward.
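For instance, here is a minimal sketch of that stateful style (the LinkGrabber class and its attribute names are invented for illustration): it records anchor href values, and uses a simple flag to collect text only while inside an h1 tag:

```python
from html.parser import HTMLParser

class LinkGrabber(HTMLParser):
    """Collect hrefs and <h1> text; a state flag tracks tag context."""
    def __init__(self):
        super().__init__()
        self.links = []                 # href values seen so far
        self.headings = []              # text inside <h1>...</h1>
        self.in_h1 = False              # state: currently inside <h1>?

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links += [value for (name, value) in attrs if name == 'href']
        elif tag == 'h1':
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:                  # ignore text outside <h1>
            self.headings.append(data)

page = """
<html>
<h1>Spam!</h1>
<p>Click this <a href="http://www.python.org">python</a> link</p>
</html>"""

grabber = LinkGrabber()
grabber.feed(page)
print(grabber.links)        # ['http://www.python.org']
print(grabber.headings)     # ['Spam!']
```

The same pattern scales up: add one flag or list per construct you care about, and reset it in the matching end-tag callback.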
Here’s another HTML parsing example: in Chapter 15, we used a simple method exported by this module to unquote HTML escape sequences (a.k.a. entities) in strings embedded in an HTML reply page:
>>> import cgi, html.parser
>>> s = cgi.escape("1<2 <b>hello</b>")
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> html.parser.HTMLParser().unescape(s)
'1<2 <b>hello</b>'
This works for undoing HTML escapes, but that’s all. When we saw this solution, I implied that there was a more general approach; now that you know about the method callback model of the HTML parser class, the more idiomatic way to handle entities during a parse should make sense—simply catch entity callbacks in a parser subclass, and translate as needed:
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         map = dict(lt='<', gt='>')
...         print(map[name], end='')
...
>>> p = Parse()
>>> p.feed(s); print()
1<2 <b>hello</b>
Better still, we can use Python’s related html.entities
module to avoid hardcoding
entity-to-character mappings for HTML entities. This module
defines many more entity names than the simple dictionary in the
prior example and includes all those you’ll likely encounter when
parsing HTML text in the wild:
>>> s
'1&lt;2 &lt;b&gt;hello&lt;/b&gt;'
>>>
>>> from html.entities import entitydefs
>>> class Parse(html.parser.HTMLParser):
...     def handle_data(self, data):
...         print(data, end='')
...     def handle_entityref(self, name):
...         print(entitydefs[name], end='')
...
>>> P = Parse()
>>> P.feed(s); print()
1<2 <b>hello</b>
Strictly speaking, the html.entities
module is able to map
entity name to Unicode code point and vice versa; its table used
here simply converts code point integers to characters with
chr
. See this module’s
documentation, as well as its source code in the Python standard
library for more details.
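For reference, a short interactive sketch of these tables (using only names that html.entities actually defines) shows how entity names, characters, and code points relate:

```python
from html.entities import entitydefs, name2codepoint, codepoint2name

# entitydefs maps an entity name directly to its character...
print(entitydefs['lt'], entitydefs['amp'])      # < &

# ...while name2codepoint gives the Unicode code point, which chr converts
print(name2codepoint['lt'])                     # 60
print(chr(name2codepoint['lt']))                # <

# and codepoint2name runs the mapping in reverse
print(codepoint2name[ord('<')])                 # lt
```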
Now that you understand the basic principles of the HTML parser class in Python’s standard library, the plain text extraction module used by Chapter 14’s PyMailGUI (Example 14-8) will also probably make significantly more sense (this was an unavoidable forward reference which we’re finally able to close).
Rather than repeating its code here, I’ll simply refer you back to that example, as well as its self-test and test input files, for another example of HTML parsing in Python to study on your own. It’s essentially a minor elaboration on the examples here, which detects more types of tags in its parser callback methods.
Because of space concerns, we have to cut short our treatment of HTML parsing here; as usual, knowing that it exists is enough to get started. For more details on the API, consult the Python library manual. And for additional HTML support, check the Web for the 3.X status of third-party HTML parser packages like those mentioned in Chapter 14.
If you have a background in parsing theory, you may know that neither
regular expressions nor string splitting is powerful enough to handle
more complex language grammars. Roughly, regular expressions don’t have the stack “memory”
required by true language grammars, and so cannot support arbitrary
nesting of language constructs—nested if
statements in a programming language, for
instance. In fact, this is why the XML and HTML parsers of the prior
section are required at all: both are languages of potentially
arbitrary nesting, which are beyond the scope of regular expressions
in general.
From a theoretical perspective, regular expressions are really intended to handle just the first stage of parsing—separating text into components, otherwise known as lexical analysis. Though patterns can often be used to extract data from text, true language parsing requires more. There are a number of ways to fill this gap with Python:
In most applications, the Python language itself can
replace custom languages and parsers—user-entered code can be
passed to Python for evaluation with tools such as eval
and exec
. By augmenting the
system with custom modules, user code in this scenario has
access to both the full Python language and any application-specific extensions required. In a
sense, such systems embed Python in Python.
Since this is a common Python role, we’ll revisit this approach
later in this chapter.
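To make this concrete, here is a tiny sketch of the idea (the usercode string and the names in namespace are made up for illustration): application-specific names are published in a dictionary, and user code runs against it with exec or eval:

```python
# user-entered code, as it might arrive from a GUI field or config file
usercode = "result = spam * 2 + len(greeting)"

# application-specific names the code may use; also collects its outputs
namespace = {'spam': 20, 'greeting': 'hello'}

exec(usercode, namespace)            # run statements in the namespace
print(namespace['result'])           # 45

# eval handles a single expression and returns its value directly
print(eval('spam + 1', namespace))   # 21
```

In real systems, the namespace would typically be stocked with custom modules and functions, effectively extending Python with the application's own vocabulary.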
For some sophisticated language analysis tasks, though, a full-blown parser may still be required. Such parsers can always be written by hand, but since Python is built for integrating C tools, we can write integrations to traditional parser generator systems such as yacc and bison, tools that create parsers from language grammar definitions. Better yet, we could use an integration that already exists—interfaces to such common parser generators are freely available in the open source domain (run a web search for up-to-date details and links).
In addition, a number of Python-specific parsing systems are available on the Web. Among them: PLY is an implementation of lex and yacc parsing tools in and for Python; the kwParsing system is a parser generator written in Python; PyParsing is a pure-Python class library that makes it easy to build recursive-descent parsers quickly; and the SPARK toolkit is a lightweight system that employs the Earley algorithm to work around technical problems with LALR parser generation (if you don’t know what that means, you probably don’t need to care).
Of special interest to this chapter, YAPPS (Yet Another Python Parser System) is a parser generator written in Python. It uses supplied grammar rules to generate human-readable Python code that implements a recursive descent parser; that is, it’s Python code that generates Python code. The parsers generated by YAPPS look much like (and were inspired by) the handcoded custom expression parsers shown in the next section. YAPPS creates LL(1) parsers, which are not as powerful as LALR parsers but are sufficient for many language tasks. For more on YAPPS, see http://theory.stanford.edu/~amitp/Yapps or search the Web at large.
Even more demanding language analysis tasks require techniques developed in artificial intelligence research, such as semantic analysis and machine learning. For instance, the Natural Language Toolkit, or NLTK, is an open source suite of Python libraries and programs for symbolic and statistical natural language processing. It applies linguistic techniques to textual data, and it can be used in the development of natural language recognition software and systems. For much more on this subject, be sure to also see the O’Reilly book Natural Language Processing with Python, which explores, among other things, ways to use NLTK in Python. Not every system’s users will pose questions in a natural language, of course, but there are many applications which can make good use of such utility.
Though widely useful, parser generator systems and natural language analysis toolkits are too complex for us to cover in any sort of useful detail in this text. Consult http://python.org/ or search the Web for more information on language analysis tools available for use in Python programs. For the purposes of this chapter, let’s move on to explore a more basic and manual approach that illustrates concepts underlying the domain—recursive descent parsing.
Although toolkits abound in this domain, Python’s status as a general-purpose programming language also makes it a reasonable vehicle for writing hand-coded parsers for custom language analysis tasks. For instance, recursive descent parsing is a fairly well-known technique for analyzing language-based information. Though not as powerful as some language tools, recursive descent parsers are sufficient for a wide variety of language-related goals.
To illustrate, this section develops a custom parser for a simple grammar—it parses and evaluates arithmetic expression strings. Though language analysis is the main topic here, this example also demonstrates the utility of Python as a general-purpose programming language. Although Python is often used as a frontend or rapid development language in tactical modes, it’s also often useful for the kinds of strategic work you may have formerly done in a systems development language such as C or C++.
The grammar that our parser will recognize can be described as follows:
goal -> <expr> END                        [number, variable, ( ]
goal -> <assign> END                      [set]

assign -> 'set' <variable> <expr>         [set]

expr -> <factor> <expr-tail>              [number, variable, ( ]
expr-tail -> ^                            [END, ) ]
expr-tail -> '+' <factor> <expr-tail>     [+]
expr-tail -> '-' <factor> <expr-tail>     [-]

factor -> <term> <factor-tail>            [number, variable, ( ]
factor-tail -> ^                          [+, -, END, ) ]
factor-tail -> '*' <term> <factor-tail>   [*]
factor-tail -> '/' <term> <factor-tail>   [/]

term -> <number>                          [number]
term -> <variable>                        [variable]
term -> '(' <expr> ')'                    [(]

tokens: (, ), num, var, -, +, /, *, set, end
This is a fairly typical grammar for a simple expression
language, and it allows for arbitrary expression nesting (some
example expressions appear at the end of the testparser
module
listing in Example 19-15). Strings to be
parsed are either an expression or an assignment to a variable name
(set
). Expressions involve
numbers, variables, and the operators +
, −
,
*
, and /
. Because factor
is nested in expr
in the grammar, *
and /
have higher precedence (i.e., they bind tighter) than +
and −
. Expressions can be enclosed in
parentheses to override precedence, and all operators are left
associative—that is, they group on the left (e.g., 1-2-3
is treated the same as (1-2)-3
).
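Because these rules deliberately mirror Python's own operators, plain Python arithmetic serves as a quick sanity check of the intended groupings:

```python
# left associativity: equal-precedence operators group on the left
assert 1 - 2 - 3 == (1 - 2) - 3 == -4    # not 1 - (2 - 3), which is 2

# precedence: * and / bind tighter than + and -
assert 5 + 4 * 3 == 5 + (4 * 3) == 17

# parentheses override the default grouping
assert (5 + 4) * 3 == 27
```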
Tokens are just the most primitive
components of the expression language. Each grammar rule listed
earlier is followed in square brackets by a list of tokens used to
select it. In recursive descent parsing, we determine the set of
tokens that can possibly start a rule’s substring, and we use that
information to predict which rule will work ahead of time. For rules
that iterate (the -tail
rules),
we use the set of possibly following tokens to know when to stop.
Typically, tokens are recognized by a string processor (a
scanner), and a higher-level processor (a
parser) uses the token stream to predict and
step through grammar rules and substrings.
The system is structured as two modules, holding two classes:
The scanner handles low-level character-by-character analysis.
The parser embeds a scanner and handles higher-level grammar analysis.
The parser is also responsible for computing the expression’s
value and testing the system. In this version, the parser evaluates
the expression while it is being parsed. To use the system, we
create a parser with an input string and call its parse
method. We can also call parse
again later with a new expression
string.
There’s a deliberate division of labor here. The scanner extracts tokens from the string, but it knows nothing about the grammar. The parser handles the grammar, but it is naive about the string itself. This modular structure keeps the code relatively simple. And it’s another example of the object-oriented programming (OOP) composition relationship at work: parsers embed and delegate to scanners.
The module in Example 19-13 implements the
lexical analysis task—detecting the expression’s basic tokens by
scanning the text string left to right on demand. Notice that this
is all straightforward logic; such analysis can sometimes be
performed with regular expressions instead (described earlier), but
the pattern needed to detect and extract tokens in this example
would be too complex and fragile for my tastes. If your tastes vary,
try recoding this module with re
.
""" ############################################################################### the scanner (lexical analyser) ############################################################################### """ import string class SyntaxError(Exception): pass # local errors class LexicalError(Exception): pass # used to be strings class Scanner: def __init__(self, text): self.next = 0 self.text = text + ' ' def newtext(self, text): Scanner.__init__(self, text) def showerror(self): print('=> ', self.text) print('=> ', (' ' * self.start) + '^') def match(self, token): if self.token != token: raise SyntaxError(token) else: value = self.value if self.token != ' ': self.scan() # next token/value return value # return prior value def scan(self): self.value = None ix = self.next while self.text[ix] in string.whitespace: ix += 1 self.start = ix if self.text[ix] in ['(', ')', '-', '+', '/', '*', ' ']: self.token = self.text[ix] ix += 1 elif self.text[ix] in string.digits: str = '' while self.text[ix] in string.digits: str += self.text[ix] ix += 1 if self.text[ix] == '.': str += '.' ix += 1 while self.text[ix] in string.digits: str += self.text[ix] ix += 1 self.token = 'num' self.value = float(str) else: self.token = 'num' self.value = int(str) # subsumes long() in 3.x elif self.text[ix] in string.ascii_letters: str = '' while self.text[ix] in (string.digits + string.ascii_letters): str += self.text[ix] ix += 1 if str.lower() == 'set': self.token = 'set' else: self.token = 'var' self.value = str else: raise LexicalError() self.next = ix
The parser module’s class creates and embeds a scanner for its lexical chores and handles interpretation of the expression grammar’s rules and evaluation of the expression’s result, as shown in Example 19-14.
""" ################################################################################ the parser (syntax analyser, evaluates during parse) ################################################################################ """ class UndefinedError(Exception): pass from scanner import Scanner, LexicalError, SyntaxError class Parser: def __init__(self, text=''): self.lex = Scanner(text) # embed a scanner self.vars = {'pi': 3.14159} # add a variable def parse(self, *text): if text: # main entry-point self.lex.newtext(text[0]) # reuse this parser? try: self.lex.scan() # get first token self.Goal() # parse a sentence except SyntaxError: print('Syntax Error at column:', self.lex.start) self.lex.showerror() except LexicalError: print('Lexical Error at column:', self.lex.start) self.lex.showerror() except UndefinedError as E: name = E.args[0] print("'%s' is undefined at column:" % name, self.lex.start) self.lex.showerror() def Goal(self): if self.lex.token in ['num', 'var', '(']: val = self.Expr() self.lex.match(' ') # expression? print(val) elif self.lex.token == 'set': # set command? self.Assign() self.lex.match(' ') else: raise SyntaxError() def Assign(self): self.lex.match('set') var = self.lex.match('var') val = self.Expr() self.vars[var] = val # assign name in dict def Expr(self): left = self.Factor() while True: if self.lex.token in [' ', ')']: return left elif self.lex.token == '+': self.lex.scan() left = left + self.Factor() elif self.lex.token == '-': self.lex.scan() left = left - self.Factor() else: raise SyntaxError() def Factor(self): left = self.Term() while True: if self.lex.token in ['+', '-', ' ', ')']: return left elif self.lex.token == '*': self.lex.scan() left = left * self.Term() elif self.lex.token == '/': self.lex.scan() left = left / self.Term() else: raise SyntaxError() def Term(self): if self.lex.token == 'num': val = self.lex.match('num') # numbers return val elif self.lex.token == 'var': if self.lex.value in self.vars.keys(): # keys(): EIBTI! 
val = self.vars[self.lex.value] # look up name's value self.lex.scan() return val else: raise UndefinedError(self.lex.value) elif self.lex.token == '(': self.lex.scan() val = self.Expr() # sub-expression self.lex.match(')') return val else: raise SyntaxError() if __name__ == '__main__': import testparser # self-test code testparser.test(Parser, 'parser1') # test local Parser
If you study this code closely, you’ll notice that the parser
keeps a dictionary (self.vars
) to
manage variable names: they’re stored in the dictionary on a
set
command and are fetched from
it when they appear in an expression. Tokens are represented as
strings, with an optional associated value (a numeric value for
numbers and a string for variable names).
The parser uses iteration (while
loops) rather than recursion for the
expr-tail
and factor-tail
rules. Other than this
optimization, the rules of the grammar map directly onto parser
methods: tokens become calls to the scanner, and nested rule
references become calls to other methods.
When the file parser1.py is run as a
top-level program, its self-test code is executed, which in turn
simply runs a canned test in the module shown in Example 19-15. Notice how the
scanner converts digit strings to numbers with int
; this ensures that all integer math
invoked by the parser supports unlimited precision, simply because
it uses Python integers which always provide the extra precision if
needed (the separate Python 2.X long
type and syntax is no more).
Also notice that mixed integer/floating-point operations cast
up to floating point since Python operators are used to do the
actual calculations along the way. The expression language’s
/
division operator also inherits
Python 3.X’s true division model which retains remainders and
returns floating point results regardless of operand types. We could
simply run a //
in our evaluation
logic to retain prior behavior (or allow for both /
and //
in our grammar), but we’ll follow
Python 3.X’s lead here.
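The following quick comparison (ordinary interactive Python, not the parser) summarizes the division model that the expression language inherits:

```python
print(4 / 2)      # true division: always returns a float, here 2.0
print(7 / 2)      # remainders are retained: 3.5
print(4 // 2)     # floor division: stays an int for int operands, here 2
print(7 // 2)     # result is floored, remainder discarded: 3
print(7.0 // 2)   # still floored, but a float operand gives a float: 3.0
```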
""" ############################################################################ parser test code ############################################################################ """ def test(ParserClass, msg): print(msg, ParserClass) x = ParserClass('4 / 2 + 3') # allow different Parsers x.parse() x.parse('3 + 4 / 2') # like eval('3 + 4 / 2')... x.parse('(3 + 4) / 2') # 3.X: / is now true div x.parse('4 / (2 + 3)') # // is not supported (yet) x.parse('4.0 / (2 + 3)') x.parse('4 / (2.0 + 3)') x.parse('4.0 / 2 * 3') x.parse('(4.0 / 2) * 3') x.parse('4.0 / (2 * 3)') x.parse('(((3))) + 1') y = ParserClass() y.parse('set a 4 / 2 + 1') y.parse('a * 3') y.parse('set b 12 / a') y.parse('b') z = ParserClass() z.parse('set a 99') z.parse('set a a + 1') z.parse('a') z = ParserClass() z.parse('pi') z.parse('2 * pi') z.parse('1.234 + 2.1') def interact(ParserClass): # command-line entry print(ParserClass) x = ParserClass() while True: cmd = input('Enter=> ') if cmd == 'stop': break x.parse(cmd)
Correlate the following results to print call statements in the self-test module:
C:\...\PP4E\Lang\Parser> python parser1.py
parser1 <class '__main__.Parser'>
5.0
5.0
3.5
0.8
0.8
0.8
6.0
6.0
0.666666666667
4
9.0
4.0
100
3.14159
6.28318
3.334
As usual, we can also test and use the system interactively to work through more of its utility:
C:\...\PP4E\Lang\Parser> python
>>> import parser1
>>> x = parser1.Parser()
>>> x.parse('1 + 2')
3
Error cases are trapped and reported in a fairly friendly fashion (assuming users think in zero-based terms):
>>> x.parse('1 + a')
'a' is undefined at column: 4
=>  1 + a
=>      ^
>>> x.parse('1+a+2')
'a' is undefined at column: 2
=>  1+a+2
=>    ^
>>> x.parse('1 * 2 $')
Lexical Error at column: 6
=>  1 * 2 $
=>        ^
>>> x.parse('1 * - 1')
Syntax Error at column: 4
=>  1 * - 1
=>      ^
>>> x.parse('1 * (9')
Syntax Error at column: 6
=>  1 * (9
=>        ^
>>> x.parse('1 + 2 / 3')                  # 3.X division change
1.66666666667
>>> x.parse('1 + 2 // 3')
Syntax Error at column: 7
=>  1 + 2 // 3
=>         ^
Pathologically big numbers are handled well, too, because Python’s built-in objects and operators are used along the way:
>>> x.parse('888888888888888888888888888888888888888888888.9999999')
8.88888888889e+44
>>> x.parse('99999999999999999999999999999999999999999 + 2')
100000000000000000000000000000000000000001
>>> x.parse('999999999999999999999999999999.88888888888 + 1.1')
1e+30
In addition, there is an interactive loop interface in the
testparser
module if you want to
use the parser as a simple command-line calculator (or if you get
tired of typing parser method calls). Pass the Parser
class, so testparser
can make one of its own:
>>> import testparser
>>> testparser.interact(parser1.Parser)
<class 'parser1.Parser'>
Enter=> 4 * 3 + 5
17
Enter=> 5 + 4 * 3
17
Enter=> (5 + 4) * 3
27
Enter=> set a 99
Enter=> set b 66
Enter=> a + b
165
Enter=> # + 1
Lexical Error at column: 0
=>  # + 1
=>  ^
Enter=> a * b + c
'c' is undefined at column: 8
=>  a * b + c
=>          ^
Enter=> a * b * + c
Syntax Error at column: 8
=>  a * b * + c
=>          ^
Enter=> a
99
Enter=> a * a * a
970299
Enter=> stop
>>>
One weakness in the parser1
program is that it embeds
expression evaluation logic in the parsing logic: the result is
computed while the string is being parsed. This makes evaluation
quick, but it can also make it difficult to modify the code,
especially in larger systems. To simplify, we could restructure the
program to keep expression parsing and evaluation separate. Instead
of evaluating the string, the parser can build up an intermediate
representation of it that can be evaluated later. As an added
incentive, building the representation separately makes it available
to other analysis tools (e.g., optimizers, viewers, and so on)—they
can be run as separate passes over the tree.
Example 19-16
shows a variant of parser1
that
implements this idea. The parser analyzes the string and builds up a
parse tree—that is, a tree of class instances
that represents the expression and that may be evaluated in a
separate step. The parse tree is built from classes that “know” how
to evaluate themselves: to compute the expression, we just ask the
tree to evaluate itself. Root nodes in the tree ask their children
to evaluate themselves, and then combine the results by applying a
single operator. In effect, evaluation in this version is simply a
recursive traversal of a tree of embedded class instances
constructed by the parser.
""" Separate expression parsing from evaluation by building an explicit parse tree """ TraceDefault = False class UndefinedError(Exception): pass if __name__ == '__main__': from scanner import Scanner, SyntaxError, LexicalError # if run here else: from .scanner import Scanner, SyntaxError, LexicalError # from PyTree ################################################################################# # the interpreter (a smart objects tree) ################################################################################# class TreeNode: def validate(self, dict): # default error check pass def apply(self, dict): # default evaluator pass def trace(self, level): # default unparser print('.' * level + '<empty>') # ROOTS class BinaryNode(TreeNode): def __init__(self, left, right): # inherited methods self.left, self.right = left, right # left/right branches def validate(self, dict): self.left.validate(dict) # recurse down branches self.right.validate(dict) def trace(self, level): print('.' * level + '[' + self.label + ']') self.left.trace(level+3) self.right.trace(level+3) class TimesNode(BinaryNode): label = '*' def apply(self, dict): return self.left.apply(dict) * self.right.apply(dict) class DivideNode(BinaryNode): label = '/' def apply(self, dict): return self.left.apply(dict) / self.right.apply(dict) class PlusNode(BinaryNode): label = '+' def apply(self, dict): return self.left.apply(dict) + self.right.apply(dict) class MinusNode(BinaryNode): label = '-' def apply(self, dict): return self.left.apply(dict) - self.right.apply(dict) # LEAVES class NumNode(TreeNode): def __init__(self, num): self.num = num # already numeric def apply(self, dict): # use default validate return self.num def trace(self, level): print('.' 
* level + repr(self.num)) # as code, was 'self.num' class VarNode(TreeNode): def __init__(self, text, start): self.name = text # variable name self.column = start # column for errors def validate(self, dict): if not self.name in dict.keys(): raise UndefinedError(self.name, self.column) def apply(self, dict): return dict[self.name] # validate before apply def assign(self, value, dict): dict[self.name] = value # local extension def trace(self, level): print('.' * level + self.name) # COMPOSITES class AssignNode(TreeNode): def __init__(self, var, val): self.var, self.val = var, val def validate(self, dict): self.val.validate(dict) # don't validate var def apply(self, dict): self.var.assign( self.val.apply(dict), dict ) def trace(self, level): print('.' * level + 'set ') self.var.trace(level + 3) self.val.trace(level + 3) ################################################################################# # the parser (syntax analyser, tree builder) ################################################################################# class Parser: def __init__(self, text=''): self.lex = Scanner(text) # make a scanner self.vars = {'pi':3.14159} # add constants self.traceme = TraceDefault def parse(self, *text): # external interface if text: self.lex.newtext(text[0]) # reuse with new text tree = self.analyse() # parse string if tree: if self.traceme: # dump parse-tree? 
print(); tree.trace(0) if self.errorCheck(tree): # check names self.interpret(tree) # evaluate tree def analyse(self): try: self.lex.scan() # get first token return self.Goal() # build a parse-tree except SyntaxError: print('Syntax Error at column:', self.lex.start) self.lex.showerror() except LexicalError: print('Lexical Error at column:', self.lex.start) self.lex.showerror() def errorCheck(self, tree): try: tree.validate(self.vars) # error checker return 'ok' except UndefinedError as instance: # args is a tuple varinfo = instance.args print("'%s' is undefined at column: %d" % varinfo) self.lex.start = varinfo[1] self.lex.showerror() # returns None def interpret(self, tree): result = tree.apply(self.vars) # tree evals itself if result != None: # ignore 'set' result print(result) # ignores errors def Goal(self): if self.lex.token in ['num', 'var', '(']: tree = self.Expr() self.lex.match(' ') return tree elif self.lex.token == 'set': tree = self.Assign() self.lex.match(' ') return tree else: raise SyntaxError() def Assign(self): self.lex.match('set') vartree = VarNode(self.lex.value, self.lex.start) self.lex.match('var') valtree = self.Expr() return AssignNode(vartree, valtree) # two subtrees def Expr(self): left = self.Factor() # left subtree while True: if self.lex.token in [' ', ')']: return left elif self.lex.token == '+': self.lex.scan() left = PlusNode(left, self.Factor()) # add root-node elif self.lex.token == '-': self.lex.scan() left = MinusNode(left, self.Factor()) # grows up/right else: raise SyntaxError() def Factor(self): left = self.Term() while True: if self.lex.token in ['+', '-', ' ', ')']: return left elif self.lex.token == '*': self.lex.scan() left = TimesNode(left, self.Term()) elif self.lex.token == '/': self.lex.scan() left = DivideNode(left, self.Term()) else: raise SyntaxError() def Term(self): if self.lex.token == 'num': leaf = NumNode(self.lex.match('num')) return leaf elif self.lex.token == 'var': leaf = VarNode(self.lex.value, 
self.lex.start) self.lex.scan() return leaf elif self.lex.token == '(': self.lex.scan() tree = self.Expr() self.lex.match(')') return tree else: raise SyntaxError() ################################################################################# # self-test code: use my parser, parser1's tester ################################################################################# if __name__ == '__main__': import testparser testparser.test(Parser, 'parser2') # run with Parser class here
Notice the way we handle undefined name exceptions in errorCheck
. When exceptions are derived
from the built-in Exception
class, their instances automatically return the arguments passed to
the exception constructor call as a tuple in their args
attribute—convenient for use in
string formatting here.
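In isolation, the protocol that errorCheck relies on looks like this (the argument values here are hypothetical):

```python
class UndefinedError(Exception): pass    # same kind of subclass as the parser's

try:
    raise UndefinedError('spam', 6)      # constructor arguments are saved...
except UndefinedError as instance:
    varinfo = instance.args              # ...as a tuple in the args attribute
    print(varinfo)                       # ('spam', 6)
    print("'%s' is undefined at column: %d" % varinfo)
```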
Also notice that the new parser reuses the same scanner module as well. To catch errors raised by the scanner, it also imports the specific classes that identify the scanner’s exceptions. Both the scanner and the parser can raise exceptions on errors (lexical errors, syntax errors, and undefined name errors). They’re caught at the top level of the parser, and they end the current parse. There’s no need to set and check status flags to terminate the recursion. Since math is done using integers, floating-point numbers, and Python’s operators, there’s usually no need to trap numeric overflow or underflow errors. But as is, the parser doesn’t handle errors such as division by zero—such Python exceptions make the parser system exit with a Python stack trace and message. Uncovering the cause and fix for this is left as a suggested exercise.
When parser2
is run as a
top-level program, we get the same test code output as for parser1
. In fact, it reuses the
very same test code—both parsers pass in their parser class object
to testparser.test
. And since
classes are also objects, we can also pass this version of the parser
to testparser’s
interactive loop: testparser.interact(parser2.Parser).
The new parser’s external behavior is identical to that of the original, so I won’t repeat all its output here (run this live for a firsthand look). Of note, though, this parser supports both use as a top-level script, and package imports from other directories, such as the PyTree viewer we’ll use in a moment. Python 3.X no longer searches a module’s own directory on the import search path, though, so we have to use package-relative import syntax for the latter case, and import from another directory when testing interactively:
C:\...\PP4E\Lang\Parser> parser2.py
parser2 <class '__main__.Parser'>
5.0
...rest is same as for parser1...

C:\...\PP4E\Lang\Parser> python
>>> import parser2
from .scanner import Scanner, SyntaxError, LexicalError     # from PyTree
ValueError: Attempted relative import in non-package

C:\...\PP4E\Lang\Parser> cd ..
C:\...\PP4E\Lang> Parser\parser2.py
parser2 <class '__main__.Parser'>
5.0
...rest is same as for parser1...

C:\...\PP4E\Lang> python
>>> from Parser import parser2
>>> x = parser2.Parser()
>>> x.parse('1 + 2 * 3 + 4')
11
>>> import Parser.testparser
>>> Parser.testparser.interact(parser2.Parser)
<class 'Parser.parser2.Parser'>
Enter=> 4 * 3 + 5
17
Enter=> stop
>>>
Using full package import paths in parser2 instead of either package-relative or unqualified imports:
from PP4E.Lang.Parser import scanner
would suffice for all three use cases—script, and both same and other directory imports—but requires the path to be set properly, and seems overkill for importing a file in the same directory as the importer.
Really, the only tangible difference with this latest parser is
that it builds and uses trees to evaluate an expression internally
instead of evaluating as it parses. The intermediate representation
of an expression is a tree of class instances, whose shape reflects
the order of operator evaluation. This parser also has logic to
print an indented listing of the constructed parse tree if the
traceme
attribute is set to
True
(or 1
). Indentation gives the nesting of
subtrees, and binary operators list left subtrees first. For
example:
C:\...\PP4E\Lang> python
>>> from Parser import parser2
>>> p = parser2.Parser()
>>> p.traceme = True
>>> p.parse('5 + 4 * 2')

[+]
...5
...[*]
......4
......2
13
When this tree is evaluated, the apply
method recursively evaluates
subtrees and applies root operators to their results. Here, *
is evaluated before +
, since it’s lower in the tree. The
Factor
method consumes the
*
substring before returning a
right subtree to Expr
. The next
tree takes a different shape:
>>> p.parse('5 * 4 - 2')
[-]
...[*]
......5
......4
...2
18
In this example, * is evaluated before -. The Factor method loops through a substring of * and / expressions before returning the resulting left subtree to Expr.
The next example is more complex, but follows the same rules:
>>> p.parse('1 + 3 * (2 * 3 + 4)')
[+]
...1
...[*]
......3
......[+]
.........[*]
............2
............3
.........4
31
Trees are made of nested class instances. From an OOP perspective, it’s another way to use composition. Since tree nodes are just class instances, this tree could be created and evaluated manually, too:
PlusNode(NumNode(1),
         TimesNode(NumNode(3),
                   PlusNode(TimesNode(NumNode(2), NumNode(3)),
                            NumNode(4)))).apply({})
But we might as well let the parser build it for us (Python is not that much like Lisp, despite what you may have heard).
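To make the composition concrete, here is a minimal, self-contained sketch of such tree node classes. The names NumNode, PlusNode, and TimesNode follow the manual-construction snippet above, but the class bodies here are illustrative, not the parser2 module's actual code:

```python
# A sketch of parse-tree nodes built by composition: each node holds its
# subtrees, and apply evaluates the tree recursively (illustrative only).

class NumNode:
    def __init__(self, value):
        self.value = value
    def apply(self, names):                  # leaf: just return the number
        return self.value

class BinaryNode:
    def __init__(self, left, right):
        self.left, self.right = left, right  # composition: nested subtrees

class PlusNode(BinaryNode):
    def apply(self, names):                  # evaluate subtrees, then add
        return self.left.apply(names) + self.right.apply(names)

class TimesNode(BinaryNode):
    def apply(self, names):                  # evaluate subtrees, then multiply
        return self.left.apply(names) * self.right.apply(names)

# 1 + 3 * (2 * 3 + 4), built by hand
tree = PlusNode(NumNode(1),
                TimesNode(NumNode(3),
                          PlusNode(TimesNode(NumNode(2), NumNode(3)),
                                   NumNode(4))))
print(tree.apply({}))                        # prints 31
```

Because operators lower in the tree are applied first, the shape of the tree alone encodes precedence; no precedence table is consulted at evaluation time.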
But wait—there is a better way to explore parse tree structures. Figure 19-1 shows the parse tree generated for the string 1 + 3 * (2 * 3 + 4), displayed in PyTree, the tree visualization GUI described at the end of Chapter 18. This works only because the parser2 module builds the parse tree explicitly (parser1 evaluates during a parse instead) and because PyTree's code is generic and reusable.
If you read the last chapter, you’ll recall that PyTree can draw most any tree data structure, but it is preconfigured to handle binary search trees and the expression parse trees we’re studying in this chapter. For parse trees, clicking on nodes in a displayed parse tree evaluates the subtree rooted there.
PyTree makes it easy to learn about and experiment with the parser. To determine the tree shape produced for a given expression, start PyTree, click on its Parser radio button, type the expression in the input field at the bottom right, and press “input” (or your Enter key). The parser class is run to generate a tree from your input, and the GUI displays the result. Depending on the operators used within an expression, some very differently shaped trees yield the same result when evaluated.
Try running PyTree on your computer to get a better feel for the parsing process. (I’d like to show more example trees, but I ran out of page real estate at this point in the book.)
The handcoded custom parser programs we’ve met in this section illustrate some interesting concepts and underscore the power of Python for general-purpose programming. Depending on your job description, they may also be typical of the sort of thing you’d write regularly in a traditional language such as C. Parsers are an important component in a wide variety of applications, but in some cases, they’re not as necessary as you might think. Let me explain why.
So far, we started with an expression parser and added a parse tree interpreter to make the code easier to modify. As is, the parser works, but it may be slow compared to a C implementation. If the parser is used frequently, we could speed it up by moving parts to C extension modules. For instance, the scanner might be moved to C initially, since it’s often called from the parser. Ultimately, we might add components to the grammar that allow expressions to access application-specific variables and functions.
All of these steps constitute good engineering. But depending on your application, this approach may not be the best one in Python. Often the easiest way to evaluate input expressions in Python is to let Python do it for us, by calling its eval built-in function. In fact, we can usually replace the entire expression evaluation program with this one function call. The next section will show how this can be used to simplify language-based systems in general.
More important, the next section underscores a core idea behind the language: if you already have an extensible, embeddable, high-level language system, why invent another? Python itself can often satisfy language-based component needs.
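The replacement can be as simple as the following sketch, which evaluates the same expression the handcoded parsers handled; the names dictionary here is an illustrative symbol table, not code from the book's examples:

```python
# Letting Python evaluate expressions for us: eval with an explicit
# namespace dictionary replaces an entire custom parser (sketch only).

names = {}                                   # expression variables live here
names['x'] = 5

print(eval('1 + 2 * 3 + 4', names, names))   # prints 11, same as the parser
print(eval('x * 2', names, names))           # prints 10: names resolve in dict
```

Passing the same dictionary as both the global and local namespace keeps all assignments and lookups in one place, just as the calculator examples later in this chapter do.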
To wrap up this chapter, I’m going to show you a practical application for some of the parsing technology introduced in the preceding section. This section presents PyCalc, a Python calculator program with a graphical interface, similar to the calculator programs available on most window systems. Like most of the GUI examples in this book, though, PyCalc offers a few advantages over existing calculators. Because PyCalc is written in Python, it is both easily customized and widely portable across window platforms. And because it is implemented with classes, it is both a standalone program and a reusable object library.
Before I show you how to write a full-blown calculator, though, the module shown in Example 19-17 starts this discussion in simpler terms. It implements a limited calculator GUI, whose buttons just add text to the input field at the top in order to compose a Python expression string. Fetching and running the string all at once produces results. Figure 19-2 shows the window this module makes when run as a top-level script.
"a simplistic calculator GUI: expressions run all at once with eval/exec"

from tkinter import *
from PP4E.Gui.Tools.widgets import frame, button, entry

class CalcGui(Frame):
    def __init__(self, parent=None):                  # an extended frame
        Frame.__init__(self, parent)                  # on default top-level
        self.pack(expand=YES, fill=BOTH)              # all parts expandable
        self.master.title('Python Calculator 0.1')    # 6 frames plus entry
        self.master.iconname("pcalc1")

        self.names = {}                               # namespace for variables
        text = StringVar()
        entry(self, TOP, text)

        rows = ["abcd", "0123", "4567", "89()"]
        for row in rows:
            frm = frame(self, TOP)
            for char in row:
                button(frm, LEFT, char,
                       lambda char=char: text.set(text.get() + char))

        frm = frame(self, TOP)
        for char in "+-*/=":
            button(frm, LEFT, char,
                   lambda char=char: text.set(text.get() + ' ' + char + ' '))

        frm = frame(self, BOTTOM)
        button(frm, LEFT, 'eval',  lambda: self.eval(text))
        button(frm, LEFT, 'clear', lambda: text.set(''))

    def eval(self, text):
        try:
            text.set(str(eval(text.get(), self.names, self.names)))  # was 'x'
        except SyntaxError:
            try:
                exec(text.get(), self.names, self.names)
            except:
                text.set("ERROR")             # bad as statement too?
            else:
                text.set('')                  # worked as a statement
        except:
            text.set("ERROR")                 # other eval expression errors

if __name__ == '__main__':
    CalcGui().mainloop()
Now, this is about as simple as a calculator can be, but it demonstrates the basics. This window comes up with buttons for entry of numbers, variable names, and operators. It is built by attaching buttons to frames: each row of buttons is a nested Frame, and the GUI itself is a Frame subclass with an attached Entry and six embedded row frames (grids would work here, too). The calculator's frame, entry field, and buttons are made expandable in the imported widgets utility module we coded earlier in Example 10-1.
This calculator builds up a string to pass to the Python interpreter all at once on “eval” button presses. Because you can type any Python expression or statement in the entry field, the buttons are really just a convenience. In fact, the entry field isn't much more than a command line. Try typing import sys, and then dir(sys) to display sys module attributes in the input field at the top—it's not what you normally do with a calculator, but it is demonstrative nevertheless.[71]
In CalcGui's constructor, buttons are coded as lists of strings; each string represents a row, and each character in the string represents a button. Lambdas are used to save extra callback data for each button. The callback functions retain the button's character and the linked text entry variable so that the character can be added to the end of the entry widget's current string on a press.
Notice how we must pass in the loop variable as a default argument to some lambdas in this code. Recall from Chapter 7 how references within a lambda (or nested def) to names in an enclosing scope are evaluated when the nested function is called, not when it is created. When the generated function is called, enclosing scope references inside the lambda reflect their latest setting in the enclosing scope, which is not necessarily the values they held when the lambda expression ran. By contrast, defaults are evaluated at function creation time instead and so can remember the current values of loop variables. Without the defaults, each button would reflect the last iteration of the loop.
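The scoping rule is easy to demonstrate outside the GUI. This small sketch (not part of the calculator) contrasts lambdas built with and without the default-argument trick:

```python
# Late binding versus default-argument capture in loop-built lambdas.

funcs_bad  = [lambda: char for char in 'abcd']            # free variable
funcs_good = [lambda char=char: char for char in 'abcd']  # value saved now

print([f() for f in funcs_bad])    # ['d', 'd', 'd', 'd']: all see last value
print([f() for f in funcs_good])   # ['a', 'b', 'c', 'd']: each keeps its own
```

The first list's lambdas all share the loop variable and see whatever it holds at call time; the default in the second list snapshots the value at creation time, which is exactly what a per-button callback needs.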
This module implements a GUI calculator in some 45 lines of code (counting comments and blank lines). But truthfully, it “cheats”: expression evaluation is delegated entirely to Python. In fact, the built-in eval and exec tools do most of the work here:
Both accept optional dictionaries to be used as global and local namespaces for assigning and evaluating names used in the code strings. In the calculator, self.names becomes a symbol table for running calculator expressions. A related Python function, compile, can be used to precompile code strings to code objects before passing them to eval and exec (use it if you need to run the same string many times).
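These behaviors can be seen in a few lines. The sketch below is illustrative, not calculator code; it passes one dictionary as both namespaces, and precompiles a string with compile for repeated runs:

```python
# Namespace dictionaries with exec/eval, plus compile for reuse.

ns = {}                                          # shared symbol table
exec('result = 2 ** 10', ns, ns)                 # statement: assigns in ns
print(ns['result'])                              # prints 1024: name retained

code = compile('result + 1', '<calc>', 'eval')   # precompile once...
for i in range(3):
    print(eval(code, ns, ns))                    # ...run many times: 1025
```

Because the assignment went into ns rather than the caller's local scope, the name survives for later expressions, which is why the calculator passes self.names explicitly.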
By default, a code string's namespace defaults to the caller's namespaces. If we didn't pass in dictionaries here, the strings would run in the eval method's namespace. Since the method's local namespace goes away after the method call returns, there would be no way to retain names assigned in the string. Notice the use of nested exception handlers in the class's eval method:
It first assumes the string is an expression and tries the built-in eval function. If that fails because of a syntax error, it tries evaluating the string as a statement using exec. Finally, if both attempts fail, it reports an error in the string (a syntax error, undefined name, and so on).
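That nested-handler logic can be isolated in a small sketch. The run function here is hypothetical (the calculator's method works on a tkinter StringVar instead), but it follows the same try-eval, fall-back-to-exec pattern:

```python
# Expression-or-statement dispatch: try eval first, fall back to exec.

def run(text, names):
    try:
        return str(eval(text, names, names))     # expression: show result
    except SyntaxError:
        try:
            exec(text, names, names)             # statement: assigns names
            return ''                            # worked, but no result
        except Exception:
            return 'ERROR'                       # bad as statement too
    except Exception:
        return 'ERROR'                           # undefined names, etc.

ns = {}
print(run('x = 3', ns))    # '': ran as a statement
print(run('x * 4', ns))    # '12': expression using the retained name
print(run('y +', ns))      # 'ERROR': fails both ways
```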
Statements and invalid expressions might be parsed twice, but the overhead doesn't matter here, and you can't tell whether a string is an expression or a statement without parsing it manually. Note that the “eval” button evaluates expressions, but = sets Python variables by running an assignment statement. Variable names are combinations of the letter keys “abcd” (or any name typed directly). They are assigned and evaluated in a dictionary used to represent the calculator's namespace and retained for the session.
Clients that reuse this calculator are as simple as the calculator itself. Like most class-based tkinter GUIs, this one can be extended in subclasses—Example 19-18 customizes the simple calculator’s constructor to add extra widgets.
from tkinter import *
from calc0 import CalcGui

class Inner(CalcGui):                                        # extend GUI
    def __init__(self):
        CalcGui.__init__(self)
        Label(self,  text='Calc Subclass').pack()            # add after
        Button(self, text='Quit', command=self.quit).pack()  # top implied

Inner().mainloop()
It can also be embedded in a container class—Example 19-19 attaches the simple calculator’s widget package, along with extras, to a common parent.
from tkinter import *
from calc0 import CalcGui                    # add parent, no master calls

class Outer:
    def __init__(self, parent):                              # embed GUI
        Label(parent, text='Calc Attachment').pack()         # side=top
        CalcGui(parent)                                      # add calc frame
        Button(parent, text='Quit', command=parent.quit).pack()

root = Tk()
Outer(root)
root.mainloop()
Figure 19-3 shows the result of running both of these scripts from different command lines. Both have a distinct input field at the top. This works, but to see a more practical application of such reuse techniques, we need to make the underlying calculator more practical, too.
Of course, real calculators don't usually work by building up expression strings and evaluating them all at once; that approach is really little more than a glorified Python command line. Traditionally, expressions are evaluated in piecemeal fashion as they are entered, and temporary results are displayed as soon as they are computed. Implementing this behavior requires a bit more work: expressions must be evaluated manually and in parts, instead of calling the eval function only once. But the end result is much more useful and intuitive.
This section presents the implementation of PyCalc, a more realistic Python/tkinter program that implements such a traditional calculator GUI. It touches on the subject of text and languages in two ways: it parses and evaluates expressions, and it implements a kind of stack-based language to perform the evaluation. Although its evaluation logic is more complex than the simpler calculator shown earlier, it demonstrates advanced programming techniques and serves as an interesting finale for this chapter.
As usual, let’s look at the GUI before the code. You can run PyCalc from the PyGadgets and PyDemos launcher bars at the top of the examples tree, or by directly running the file calculator.py listed shortly (e.g., click it in a file explorer, or type it in a shell command line). Figure 19-4 shows PyCalc’s main window. By default, it shows operand buttons in black-on-blue (and opposite for operator buttons), but font and color options can be passed into the GUI class’s constructor method. Of course, that means gray-on-gray in this book, so you’ll have to run PyCalc yourself to see what I mean.
If you do run this, you’ll notice that PyCalc implements a normal calculator model—expressions are evaluated as entered, not all at once at the end. That is, parts of an expression are computed and displayed as soon as operator precedence and manually typed parentheses allow. The result in Figure 19-4, for instance, reflects pressing “2”, and then repeatedly pressing “*” to display successive powers of 2. I’ll explain how this evaluation works in a moment.
PyCalc's CalcGui class builds the GUI interface as frames of buttons much like the simple calculator of the previous section, but PyCalc adds a host of new features. Among them are another row of action buttons, inherited methods from GuiMixin (presented in Chapter 10), a new “cmd” button that pops up nonmodal dialogs for entry of arbitrary Python code, and a recent calculations history pop-up. Figure 19-5 captures some of PyCalc's pop-up windows.
You may enter expressions in PyCalc by clicking buttons in the GUI, typing full expressions in command-line pop ups, or typing keys on your keyboard. PyCalc intercepts key press events and interprets them the same as corresponding button presses; typing + is like pressing the + button, the Space bar key is “clear,” Enter is “eval,” backspace erases a character, and ? is like pressing “help.”
The command-line pop-up windows are nonmodal (you can pop up as many as you like). They accept any Python code—press the Run button or your Enter key to evaluate text in their input fields. The result of evaluating this code in the calculator’s namespace dictionary is thrown up in the main window for use in larger expressions. You can use this as an escape mechanism to employ external tools in your calculations. For instance, you can import and use functions coded in Python or C within these pop ups. The current value in the main calculator window is stored in newly opened command-line pop ups, too, for use in typed expressions.
PyCalc supports integers (unlimited precision), negatives, and floating-point numbers just because Python does. Individual operands and expressions are still evaluated with the eval built-in, which calls the Python parser/interpreter at runtime. Variable names can be assigned and referenced in the main window with the letter, =, and “eval” keys; they are assigned in the calculator's namespace dictionary (more complex variable names may be typed in command-line pop ups). Note the use of pi in the history window: PyCalc preimports names in the math and random modules into the namespace where expressions are evaluated.
Now that you have the general idea of what PyCalc does, I need to say a little bit about how it does what it does. Most of the changes in this calculator involve managing the expression display and evaluating expressions. PyCalc is structured as two classes:
CalcGui class
    Manages the GUI itself. It controls input events and is in charge of the main window's display field at the top. It doesn't evaluate expressions, though; for that, it sends operators and operands entered in the GUI to an embedded instance of the Evaluator class.

Evaluator class
    Manages two stacks. One stack records pending operators (e.g., +), and one records pending operands (e.g., 3.141). Temporary results are computed as new operators are sent from CalcGui and pushed onto the operands stack.
As you can see from this, the magic of expression evaluation boils down to juggling the operator and operand stacks. In a sense, the calculator implements a little stack-based language, to evaluate the expressions being entered. While scanning expression strings from left to right as they are entered, operands are pushed along the way, but operators delimit operands and may trigger temporary results before they are pushed. Because it records states and performs transitions, some might use the term state machine to describe this calculator language implementation.
Here’s the general scenario:
When a new operator is seen (i.e., when an operator button or key is pressed), the prior operand in the entry field is pushed onto the operands stack.
The operator is then added to the operators stack, but only after all pending operators of higher precedence have been popped and applied to pending operands (e.g., pressing + makes any pending * operators on the stack fire).
When “eval” is pressed, all remaining operators are popped and applied to all remaining operands, and the result is the last remaining value on the operands stack.
In the end, the last value on the operands stack is displayed in the calculator’s entry field, ready for use in another operation. This evaluation algorithm is probably best described by working through examples. Let’s step through the entry of a few expressions and watch the evaluation stacks grow.
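The scenario above can be sketched as a tiny two-stack evaluator. This is an illustrative reduction of the idea to binary +, -, *, and / only, not PyCalc's Evaluator class; like the calculator, it stacks operands as strings and applies each operator with eval:

```python
# A bare-bones two-stack (shift/reduce) expression evaluator sketch.

prec = {'+': 1, '-': 1, '*': 2, '/': 2}      # operator precedence levels

def calc(tokens):
    opnds, optrs = [], []                     # operand and operator stacks
    def reduce():                             # apply top optr to top 2 opnds
        right, left = opnds.pop(), opnds.pop()
        opnds.append(str(eval(left + optrs.pop() + right)))
    for tok in tokens.split():
        if tok in prec:
            while optrs and prec[optrs[-1]] >= prec[tok]:
                reduce()                      # pop higher/equal precedence
            optrs.append(tok)                 # then push the new operator
        else:
            opnds.append(tok)                 # operands push immediately
    while optrs:                              # the "eval" press: finish off
        reduce()
    return opnds[-1]                          # last operand is the result

print(calc('5 + 3 * 4'))    # prints '17': * deferred past +
print(calc('5 - 3 - 1'))    # prints '1': equal precedence, left-associative
```

The >= test in the inner loop is what makes same-precedence operators reduce immediately as the input is scanned, yielding the left-associative behavior described in the traces that follow.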
PyCalc stack tracing is enabled with the debugme flag in the module; if true, the operator and operand stacks are displayed on stdout each time the Evaluator class is about to apply an operator and reduce (pop) the stacks. Run PyCalc with a console window to see the traces. A tuple holding the stack lists (operators, operands) is printed on each stack reduction; tops of stacks are at the ends of the lists. For instance, here is the console output after typing and evaluating a simple string:
1) Entered keys: "5 * 3 + 4 &lt;eval&gt;"            [result = 19]
(['*'], ['5', '3'])         [on '+' press: displays "15"]
(['+'], ['15', '4'])        [on 'eval' press: displays "19"]
Note that the pending (stacked) * subexpression is evaluated when the + is pressed: * operators bind tighter than +, so the code is evaluated immediately before the + operator is pushed. When the + button is pressed, the entry field contains 3; we push 3 onto the operands stack, reduce the * subexpression (5 * 3), push its result onto operands, push + onto operators, and continue scanning user inputs. When “eval” is pressed at the end, 4 is pushed onto operands, and the final + on operators is applied to stacked operands.
The text input and display field at the top of the GUI's main window plays a part in this algorithm, too. The text input field and expression stacks are integrated by the calculator class. In general, the text input field always holds the prior operand when an operator button is pressed (e.g., on 5 *); the text in the input field is pushed onto the operands stack before the operator is resolved. Because of this, we have to pop results before displaying them after “eval” or ) is pressed; otherwise the results are pushed onto the stack twice—they would be both on the stack and in the display field, from which they would be immediately pushed again when the next operator is input.
For both usability and accuracy, when an operator is seen, we also have to arrange to erase the input field's prior value when the next operand's entry is started (e.g., on both 3 and 4 in 5 * 3 + 4). This erasure of prior values is also arranged when “eval” or ) is applied, on the assumption that a subsequent operand key or button replaces the prior result—for a new expression after “eval,” and for an operand following a new operator after ); e.g., to erase the parenthesized 12 result on 2 in 5 + (3 * 4) * 2. Without this erasure, operand buttons and keys simply concatenate to the currently displayed value. This model also allows the user to change a temporary result operand after a ) by entering an operand instead of an operator.
Expression stacks also defer operations of lower precedence as the input is scanned. In the next trace, the pending + isn't evaluated when the * button is pressed: since * binds tighter, we need to postpone the + until the * can be evaluated. The * operator isn't popped until its right operand 4 has been seen. There are two operators to pop and apply to operand stack entries on the “eval” press—the * at the top of operators is applied to the 3 and 4 at the top of operands, and then + is run on 5 and the 12 pushed for *:
2) Entered keys: "5 + 3 * 4 &lt;eval&gt;"            [result = 17]
(['+', '*'], ['5', '3', '4'])      [on 'eval' press]
(['+'], ['5', '12'])               [displays "17"]
For strings of same-precedence operators such as the following, we pop and evaluate immediately as we scan left to right, instead of postponing evaluation. This results in a left-associative evaluation, in the absence of parentheses: 5+3+4 is evaluated as ((5+3)+4). For + and * operations this is irrelevant because order doesn't matter:
3) Entered keys: "5 + 3 + 4 &lt;eval&gt;"            [result = 12]
(['+'], ['5', '3'])                [on the second '+']
(['+'], ['8', '4'])                [on 'eval']
The following trace is more complex. In this case, all the operators and operands are stacked (postponed) until we press the ) button at the end. To make parentheses work, ( is given a higher precedence than any operator and is pushed onto the operators stack to seal off lower stack reductions until the ) is seen. When the ) button is pressed, the parenthesized subexpression is popped and evaluated ((3 * 4), then (1 + 12)), and 13 is displayed in the entry field. On pressing “eval,” the rest is evaluated ((3 * 13), (1 + 39)), and the final result (40) is shown. This result in the entry field itself becomes the left operand of a future operator.
4) Entered keys: "1 + 3 * ( 1 + 3 * 4 ) &lt;eval&gt;"          [result = 40]
(['+', '*', '(', '+', '*'], ['1', '3', '1', '3', '4'])   [on ')']
(['+', '*', '(', '+'], ['1', '3', '1', '12'])            [displays "13"]
(['+', '*'], ['1', '3', '13'])                           [on 'eval']
(['+'], ['1', '39'])
In fact, any temporary result can be used again: if we keep pressing an operator button without typing new operands, it's reapplied to the result of the prior press—the value in the entry field is pushed twice and applied to itself each time. Press * many times after entering 2 to see how this works (e.g., 2***). On the first *, it pushes 2 and the *. On the next *, it pushes 2 again from the entry field, pops and evaluates the stacked (2 * 2), pushes back and displays the result, and pushes the new *. And on each following *, it pushes the currently displayed result and evaluates again, computing successive squares.
Figure 19-6 shows how the two stacks look at their highest level while scanning the expression in the prior example trace. On each reduction, the top operator is applied to the top two operands, and the result is pushed back for the operator below. Because of the way the two stacks are used, the effect is similar to converting the expression to a string of the form +1*3(+1*34 and evaluating it right to left. In other cases, though, parts of the expression are evaluated and displayed as temporary results along the way, so it's not simply a string conversion process.
Finally, the next example's string triggers an error. PyCalc is casual about error handling. Many errors are made impossible by the algorithm itself, but things such as unmatched parentheses still trip up the evaluator. Instead of trying to detect all possible error cases explicitly, a general try statement in the reduce method is used to catch them all: expression errors, numeric errors, undefined name errors, syntax errors, and so on.
Operands and temporary results are always stacked as strings, and each operator is applied by calling eval. When an error occurs inside an expression, a result operand of *ERROR* is pushed, which makes all remaining operators fail in eval too. *ERROR* essentially percolates to the top of the expression. At the end, it's the last operand and is displayed in the text entry field to alert you of the mistake:
5) Entered keys: "1 + 3 * ( 1 + 3 * 4 &lt;eval&gt;"            [result = *ERROR*]
(['+', '*', '(', '+', '*'], ['1', '3', '1', '3', '4'])   [on eval]
(['+', '*', '(', '+'], ['1', '3', '1', '12'])
(['+', '*', '('], ['1', '3', '13'])
(['+', '*'], ['1', '*ERROR*'])
(['+'], ['*ERROR*'])
(['+'], ['*ERROR*', '*ERROR*'])
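The percolation idea can be seen in a reduced sketch; this standalone reduce function is illustrative, not PyCalc's method:

```python
# Error percolation: when a reduction fails, push a sentinel operand so
# every later reduction fails the same way and the sentinel surfaces.

def reduce(optrs, opnds):
    try:
        right, left = opnds.pop(), opnds.pop()           # top two operands
        opnds.append(str(eval(left + optrs.pop() + right)))
    except Exception:
        opnds.append('*ERROR*')                          # percolates upward

optrs, opnds = ['+'], ['1', '*ERROR*']
reduce(optrs, opnds)
print(opnds)     # ['*ERROR*']: eval('1+*ERROR*') failed, sentinel pushed back
```

Because *ERROR* is itself invalid Python, any operator applied to it fails in eval too; no explicit error-state bookkeeping is needed.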
Try tracing through these and other examples in the calculator’s code to get a feel for the stack-based evaluation that occurs. Once you understand the general shift/reduce (push/pop) mechanism, expression evaluation is straightforward.
Example 19-20 contains the PyCalc source module that puts these ideas to work in the context of a GUI. It’s a single-file implementation (not counting utilities imported and reused). Study the source for more details; as usual, there’s no substitute for interacting with the program on your own to get a better feel for its functionality.
Also see the opening comment’s “to do” list for suggested areas for improvement. Like all software systems, this calculator is prone to evolve over time (and in fact it has, with each new edition of this book). Since it is written in Python, such future mutations will be easy to apply.
#!/usr/local/bin/python """ ################################################################################ PyCalc 3.0+: a Python/tkinter calculator program and GUI component. Evaluates expressions as they are entered, catches keyboard keys for expression entry; 2.0 added integrated command-line popups, a recent calculations history display popup, fonts and colors configuration, help and about popups, preimported math/random constants, and more; 3.0+ (PP4E, version number retained): -port to run under Python 3.X (only) -drop 'L' keypress (the long type is now dead in earnest) 3.0 changes (PP3E): -use 'readonly' entry state, not 'disabled', else field is greyed out (fix for 2.3 Tkinter change); -avoid extended display precision for floats by using str(), instead of `x`/repr() (fix for Python change); -apply font to input field to make it larger; -use justify=right for input field so it displays on right, not left; -add 'E+' and 'E-' buttons (and 'E' keypress) for float exponents; 'E' keypress must generally be followed digits, not + or - optr key; -remove 'L' button (but still allow 'L' keypress): superfluous now, because Python auto converts up if too big ('L' forced this in past); -use smaller font size overall; -auto scroll to the end in the history window to do: add a commas-insertion mode (see str.format and LP4E example); allow '**' as an operator key; allow '+' and 'J' inputs for complex Numbers; use new decimal type for fixed precision floats; as is, can use 'cmd' popup windows to input and evaluate things like complex, but can't be input via main window; caveat: PyCalc's precision, accuracy, and some of its behaviour, is currently bound by result of str() call; ################################################################################ """ from tkinter import * # widgets, consts from PP4E.Gui.Tools.guimixin import GuiMixin # quit method from PP4E.Gui.Tools.widgets import label, entry, button, frame # widget builders Fg, Bg, Font = 'black', 'skyblue', 
('courier', 14, 'bold') # default config debugme = True def trace(*args): if debugme: print(args) ################################################################################ # the main class - handles user interface; # an extended Frame, on new Toplevel, or embedded in another container widget ################################################################################ class CalcGui(GuiMixin, Frame): Operators = "+-*/=" # button lists Operands = ["abcd", "0123", "4567", "89()"] # customizable def __init__(self, parent=None, fg=Fg, bg=Bg, font=Font): Frame.__init__(self, parent) self.pack(expand=YES, fill=BOTH) # all parts expandable self.eval = Evaluator() # embed a stack handler self.text = StringVar() # make a linked variable self.text.set("0") self.erase = 1 # clear "0" text next self.makeWidgets(fg, bg, font) # build the GUI itself if not parent or not isinstance(parent, Frame): self.master.title('PyCalc 3.0') # title iff owns window self.master.iconname("PyCalc") # ditto for key bindings self.master.bind('<KeyPress>', self.onKeyboard) self.entry.config(state='readonly') # 3.0: not 'disabled'=grey else: self.entry.config(state='normal') self.entry.focus() def makeWidgets(self, fg, bg, font): # 7 frames plus text-entry self.entry = entry(self, TOP, self.text) # font, color configurable self.entry.config(font=font) # 3.0: make display larger self.entry.config(justify=RIGHT) # 3.0: on right, not left for row in self.Operands: frm = frame(self, TOP) for char in row: button(frm, LEFT, char, lambda op=char: self.onOperand(op), fg=fg, bg=bg, font=font) frm = frame(self, TOP) for char in self.Operators: button(frm, LEFT, char, lambda op=char: self.onOperator(op), fg=bg, bg=fg, font=font) frm = frame(self, TOP) button(frm, LEFT, 'dot ', lambda: self.onOperand('.')) button(frm, LEFT, ' E+ ', lambda: self.text.set(self.text.get()+'E+')) button(frm, LEFT, ' E- ', lambda: self.text.set(self.text.get()+'E-')) button(frm, LEFT, 'cmd ', self.onMakeCmdline) button(frm, 
LEFT, 'help', self.help) button(frm, LEFT, 'quit', self.quit) # from guimixin frm = frame(self, BOTTOM) button(frm, LEFT, 'eval ', self.onEval) button(frm, LEFT, 'hist ', self.onHist) button(frm, LEFT, 'clear', self.onClear) def onClear(self): self.eval.clear() self.text.set('0') self.erase = 1 def onEval(self): self.eval.shiftOpnd(self.text.get()) # last or only opnd self.eval.closeall() # apply all optrs left self.text.set(self.eval.popOpnd()) # need to pop: optr next? self.erase = 1 def onOperand(self, char): if char == '(': self.eval.open() self.text.set('(') # clear text next self.erase = 1 elif char == ')': self.eval.shiftOpnd(self.text.get()) # last or only nested opnd self.eval.close() # pop here too: optr next? self.text.set(self.eval.popOpnd()) self.erase = 1 else: if self.erase: self.text.set(char) # clears last value else: self.text.set(self.text.get() + char) # else append to opnd self.erase = 0 def onOperator(self, char): self.eval.shiftOpnd(self.text.get()) # push opnd on left self.eval.shiftOptr(char) # eval exprs to left? 
        self.text.set(self.eval.topOpnd())          # push optr, show opnd|result
        self.erase = 1                               # erased on next opnd|'('

    def onMakeCmdline(self):
        new = Toplevel()                             # new top-level window
        new.title('PyCalc command line')             # arbitrary Python code
        frm = frame(new, TOP)                        # only the Entry expands
        label(frm, LEFT, '>>>').pack(expand=NO)
        var = StringVar()
        ent = entry(frm, LEFT, var, width=40)
        onButton = (lambda: self.onCmdline(var, ent))
        onReturn = (lambda event: self.onCmdline(var, ent))
        button(frm, RIGHT, 'Run', onButton).pack(expand=NO)
        ent.bind('<Return>', onReturn)
        var.set(self.text.get())

    def onCmdline(self, var, ent):                   # eval cmdline pop-up input
        try:
            value = self.eval.runstring(var.get())
            var.set('OKAY')
            if value != None:                        # run in eval namespace dict
                self.text.set(value)                 # expression or statement
                self.erase = 1
                var.set('OKAY => ' + value)
        except:                                      # result in calc field
            var.set('ERROR')                         # status in pop-up field
        ent.icursor(END)                             # insert point after text
        ent.select_range(0, END)                     # select msg so next key deletes

    def onKeyboard(self, event):
        pressed = event.char                         # on keyboard press event
        if pressed != '':                            # pretend button was pressed
            if pressed in self.Operators:
                self.onOperator(pressed)
            else:
                for row in self.Operands:
                    if pressed in row:
                        self.onOperand(pressed)
                        break
                else:                                           # 4E: drop 'Ll'
                    if pressed == '.':
                        self.onOperand(pressed)                 # can start opnd
                    if pressed in 'Ee':                         # 2e10, no +/-
                        self.text.set(self.text.get()+pressed)  # can't: no erase
                    elif pressed == '\r':
                        self.onEval()                           # enter key=eval
                    elif pressed == ' ':
                        self.onClear()                          # spacebar=clear
                    elif pressed == '\b':
                        self.text.set(self.text.get()[:-1])     # backspace
                    elif pressed == '?':
                        self.help()

    def onHist(self):
        # show recent calcs log popup
        from tkinter.scrolledtext import ScrolledText      # or PP4E.Gui.Tour
        new = Toplevel()                                   # make new window
        ok = Button(new, text="OK", command=new.destroy)
        ok.pack(pady=1, side=BOTTOM)                       # pack first=clip last
        text = ScrolledText(new, bg='beige')               # add Text + scrollbar
        text.insert('0.0', self.eval.getHist())            # get Evaluator text
        text.see(END)                                      # 3.0: scroll to end
        text.pack(expand=YES, fill=BOTH)

        # new window goes away on ok press or enter key
        new.title("PyCalc History")
        new.bind("<Return>", (lambda event: new.destroy()))
        ok.focus_set()                     # make new window modal:
        new.grab_set()                     # get keyboard focus, grab app
        new.wait_window()                  # don't return till new.destroy

    def help(self):
        self.infobox('PyCalc', 'PyCalc 3.0+\n'
                               'A Python/tkinter calculator\n'
                               'Programming Python 4E\n'
                               'May, 2010\n'
                               '(3.0 2005, 2.0 1999, 1.0 1996)\n\n'
                               'Use mouse or keyboard to\n'
                               'input numbers and operators,\n'
                               'or type code in cmd popup')

################################################################################
# the expression evaluator class
# embedded in and used by a CalcGui instance, to perform calculations
################################################################################

class Evaluator:
    def __init__(self):
        self.names = {}                        # a names-space for my vars
        self.opnd, self.optr = [], []          # two empty stacks
        self.hist = []                         # my prev calcs history log
        self.runstring("from math import *")   # preimport math modules
        self.runstring("from random import *") # into calc's namespace

    def clear(self):
        self.opnd, self.optr = [], []          # leave names intact
        if len(self.hist) > 64:                # don't let hist get too big
            self.hist = ['clear']
        else:
            self.hist.append('--clear--')

    def popOpnd(self):
        value = self.opnd[-1]                  # pop/return top|last opnd
        self.opnd[-1:] = []                    # to display and shift next
        return value                           # or x.pop(), or del x[-1]

    def topOpnd(self):
        return self.opnd[-1]                   # top operand (end of list)

    def open(self):
        self.optr.append('(')                  # treat '(' like an operator

    def close(self):                           # on ')' pop downto highest '('
        self.shiftOptr(')')                    # ok if empty: stays empty
        self.optr[-2:] = []                    # pop, or added again by optr

    def closeall(self):
        while self.optr:                       # force rest on 'eval'
            self.reduce()                      # last may be a var name
        try:
            self.opnd[0] = self.runstring(self.opnd[0])
        except:
            self.opnd[0] = '*ERROR*'           # pop else added again next:

    afterMe = {'*': ['+', '-', '(', '='],      # class member
               '/': ['+', '-', '(', '='],      # optrs to not pop for key
               '+': ['(', '='],                # if prior optr is this: push
               '-': ['(', '='],                # else: pop/eval prior optr
               ')': ['(', '='],                # all left-associative as is
               '=': ['(']}

    def shiftOpnd(self, newopnd):              # push opnd at optr, ')', eval
        self.opnd.append(newopnd)

    def shiftOptr(self, newoptr):              # apply ops with <= priority
        while (self.optr and
               self.optr[-1] not in self.afterMe[newoptr]):
            self.reduce()
        self.optr.append(newoptr)              # push this op above result
                                               # optrs assume next opnd erases
    def reduce(self):
        trace(self.optr, self.opnd)
        try:                                   # collapse the top expr
            operator = self.optr[-1]           # pop top optr (at end)
            [left, right] = self.opnd[-2:]     # pop top 2 opnds (at end)
            self.optr[-1:] = []                # delete slice in-place
            self.opnd[-2:] = []
            result = self.runstring(left + operator + right)
            if result == None:
                result = left                  # assignment? key var name
            self.opnd.append(result)           # push result string back
        except:
            self.opnd.append('*ERROR*')        # stack/number/name error

    def runstring(self, code):
        try:                                             # 3.0: not `x`/repr
            result = str(eval(code, self.names, self.names))  # try expr: string
            self.hist.append(code + ' => ' + result)          # add to hist log
        except:
            exec(code, self.names, self.names)           # try stmt: None
            self.hist.append(code)
            result = None
        return result

    def getHist(self):
        return '\n'.join(self.hist)

def getCalcArgs():
    from sys import argv                       # get cmdline args in a dict
    config = {}                                # ex: -bg black -fg red
    for arg in argv[1:]:                       # font not yet supported
        if arg in ['-bg', '-fg']:              # '-bg red' -> {'bg':'red'}
            try:
                config[arg[1:]] = argv[argv.index(arg) + 1]
            except:
                pass
    return config

if __name__ == '__main__':
    CalcGui(**getCalcArgs()).mainloop()        # in default toplevel window
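To see the Evaluator's shift-reduce technique in isolation, here is a minimal, GUI-free sketch of the same idea; the names and the trimmed-down operator table are illustrative, not PyCalc's actual code. Operands are shifted onto one stack, and before a new operator is pushed, any pending operators that should apply first are popped and folded (reduced) into a result, driven by an afterMe-style table just as in the class above:

```python
# operators NOT to pop when found atop the stack: lower precedence than tok
afterMe = {'*': ['+', '-', '('], '/': ['+', '-', '('],
           '+': ['('],           '-': ['(']}

def evaluate(tokens):
    opnd, optr = [], []                    # operand and operator stacks

    def reduce():
        op = optr.pop()                    # pop top operator
        right, left = opnd.pop(), opnd.pop()
        opnd.append(str(eval(left + op + right)))   # fold top expression

    for tok in tokens:
        if tok == '(':
            optr.append(tok)               # '(' acts like an operator
        elif tok == ')':
            while optr[-1] != '(':         # collapse back to nearest '('
                reduce()
            optr.pop()                     # discard the '(' itself
        elif tok in afterMe:
            while optr and optr[-1] not in afterMe[tok]:
                reduce()                   # apply higher-precedence ops first
            optr.append(tok)
        else:
            opnd.append(tok)               # operand: number text
    while optr:
        reduce()                           # force the rest on 'eval'
    return opnd[0]

print(evaluate(['3', '+', '4', '*', '2']))          # '11', not '14'
print(evaluate(['(', '3', '+', '4', ')', '*', '2'])) # '14'
```

Like PyCalc, this sketch keeps operands as strings and lets eval do the actual arithmetic, so precedence and grouping fall out of the stack discipline rather than any numeric code.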
PyCalc serves as a standalone program on my desktop, but it’s
also useful in the context of other GUIs. Like most of the GUI
classes in this book, PyCalc can be customized with subclass
extensions or embedded in a larger GUI with attachments. The
module in Example 19-21 demonstrates
one way to reuse PyCalc’s CalcGui
class by extending and
embedding, similar to what was done for the simple calculator
earlier.
""" test calculator: use as an extended and embedded GUI component """ from tkinter import * from calculator import CalcGui def calcContainer(parent=None): frm = Frame(parent) frm.pack(expand=YES, fill=BOTH) Label(frm, text='Calc Container').pack(side=TOP) CalcGui(frm) Label(frm, text='Calc Container').pack(side=BOTTOM) return frm class calcSubclass(CalcGui): def makeWidgets(self, fg, bg, font): Label(self, text='Calc Subclass').pack(side=TOP) Label(self, text='Calc Subclass').pack(side=BOTTOM) CalcGui.makeWidgets(self, fg, bg, font) #Label(self, text='Calc Subclass').pack(side=BOTTOM) if __name__ == '__main__': import sys if len(sys.argv) == 1: # % calculator_test.py root = Tk() # run 3 calcs in same process CalcGui(Toplevel()) # each in a new toplevel window calcContainer(Toplevel()) calcSubclass(Toplevel()) Button(root, text='quit', command=root.quit).pack() root.mainloop() if len(sys.argv) == 2: # % calculator_testl.py - CalcGui().mainloop() # as a standalone window (default root) elif len(sys.argv) == 3: # % calculator_test.py - - calcContainer().mainloop() # as an embedded component elif len(sys.argv) == 4: # % calculator_test.py - - - calcSubclass().mainloop() # as a customized superclass
Figure 19-7 shows the result of running this script with no command-line arguments. We get instances of the original calculator class, plus the container and subclass classes defined in this script, all attached to new top-level windows.
The two windows on the right reuse the core PyCalc code running in the window on the left. All of these windows run in the same process (e.g., quitting one quits them all), but they all function as independent windows. Note that when running three calculators in the same process like this, each has its own distinct expression evaluation namespace because it’s a class instance attribute, not a global module-level variable. Because of that, variables set in one calculator are set in that calculator only, and they don’t overwrite settings made in other windows. Similarly, each calculator has its own evaluation stack manager object, such that calculations in one window don’t appear in or impact other windows at all.
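The namespace isolation at work here can be distilled into a few lines; the class below is an illustrative stand-in, not PyCalc's actual code, and it simplifies Evaluator.runstring's error handling to a single SyntaxError fallback:

```python
class Calc:
    def __init__(self):
        self.names = {}                    # per-instance namespace dict

    def run(self, code):
        try:
            return eval(code, self.names, self.names)   # expression?
        except SyntaxError:
            exec(code, self.names, self.names)          # else statement

a, b = Calc(), Calc()
a.run('x = 99')        # assignment lands in a's namespace only
print(a.run('x'))      # 99
# b.run('x') would raise a NameError: b's namespace is untouched
```

Had names been a module-level global instead of self.names, all three calculators would share one dictionary, and a variable set in one window would silently change the others.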
The two extensions in this script are artificial, of course—they simply add labels at the top and bottom of the window—but the concept is widely applicable. You could reuse the calculator’s class by attaching it to any GUI that needs a calculator and customize it with subclasses arbitrarily. It’s a reusable widget.
One obvious way to reuse the calculator is to add additional expression feature buttons—square roots, inverses, cubes, and the like. You can type such operations in the command-line pop-ups, but buttons are a bit more convenient. Such features could also be added to the main calculator implementation itself, but since the set of features that will be useful may vary per user and application, a better approach may be to add them in separate extensions. For instance, the class in Example 19-22 adds a few extra buttons to PyCalc by embedding (i.e., attaching) it in a container.
""" ############################################################################# a container with an extra row of buttons for common operations; a more useful customization: adds buttons for more operations (sqrt, 1/x, etc.) by embedding/composition, not subclassing; new buttons are added after entire CalGui frame because of the packing order/options; ############################################################################# """ from tkinter import * from calculator import CalcGui, getCalcArgs from PP4E.Gui.Tools.widgets import frame, button, label class CalcGuiPlus(Toplevel): def __init__(self, **args): Toplevel.__init__(self) label(self, TOP, 'PyCalc Plus - Container') self.calc = CalcGui(self, **args) frm = frame(self, BOTTOM) extras = [('sqrt', 'sqrt(%s)'), ('x^2 ', '(%s)**2'), ('x^3 ', '(%s)**3'), ('1/x ', '1.0/(%s)')] for (lab, expr) in extras: button(frm, LEFT, lab, (lambda expr=expr: self.onExtra(expr))) button(frm, LEFT, ' pi ', self.onPi) def onExtra(self, expr): text = self.calc.text eval = self.calc.eval try: text.set(eval.runstring(expr % text.get())) except: text.set('ERROR') def onPi(self): self.calc.text.set(self.calc.eval.runstring('pi')) if __name__ == '__main__': root = Tk() button(root, TOP, 'Quit', root.quit) CalcGuiPlus(**getCalcArgs()).mainloop() # -bg,-fg to calcgui
Because PyCalc is coded as a Python class, you can always achieve a similar effect by extending PyCalc in a new subclass instead of embedding it, as shown in Example 19-23.
""" ############################################################################# a customization with an extra row of buttons for common operations; a more useful customization: adds buttons for more operations (sqrt, 1/x, etc.) by subclassing to extend the original class, not embedding; new buttons show up before frame attached to bottom by calcgui class; ############################################################################# """ from tkinter import * from calculator import CalcGui, getCalcArgs from PP4E.Gui.Tools.widgets import label, frame, button class CalcGuiPlus(CalcGui): def makeWidgets(self, *args): label(self, TOP, 'PyCalc Plus - Subclass') CalcGui.makeWidgets(self, *args) frm = frame(self, BOTTOM) extras = [('sqrt', 'sqrt(%s)'), ('x^2 ', '(%s)**2'), ('x^3 ', '(%s)**3'), ('1/x ', '1.0/(%s)')] for (lab, expr) in extras: button(frm, LEFT, lab, (lambda expr=expr: self.onExtra(expr))) button(frm, LEFT, ' pi ', self.onPi) def onExtra(self, expr): try: self.text.set(self.eval.runstring(expr % self.text.get())) except: self.text.set('ERROR') def onPi(self): self.text.set(self.eval.runstring('pi')) if __name__ == '__main__': CalcGuiPlus(**getCalcArgs()).mainloop() # passes -bg, -fg on
Notice that these buttons’ callbacks use floating-point
division for inverses because that’s how /
operates in Python 3.X (//
truncates remainders for integers
instead); the buttons also wrap entry field values in parentheses
to sidestep precedence issues. They could instead convert the
entry’s text to a number and do real math, but Python does all the
work automatically when expression strings are run raw.
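The parentheses in templates like '1.0/(%s)' are not cosmetic. Because the entry's text is substituted into an expression string and run raw, an unparenthesized template can bind operators in the wrong order, as this small sketch shows (the entry value here is just an example):

```python
entry = '1+2'                       # text currently in the display field

unsafe = eval('1.0/%s' % entry)     # builds '1.0/1+2': division binds first
safe   = eval('1.0/(%s)' % entry)   # builds '1.0/(1+2)': the intended inverse

print(unsafe)   # 3.0
print(safe)     # 0.3333333333333333
```

Wrapping the substituted text in parentheses guarantees that whatever expression is on display is evaluated as a unit before the new operation is applied to it.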
Also note that the buttons added by these scripts simply operate on the current value in the entry field, immediately. That’s not quite the same as expression operators applied with the stacks evaluator (additional customizations are needed to make them true operators). Still, these buttons prove the point these scripts are out to make—they use PyCalc as a component, both from the outside and from below.
Finally, to test both of the extended calculator classes, as well as PyCalc configuration options, the script in Example 19-24 puts up four distinct calculator windows (this is the script run by PyDemos).
#!/usr/local/bin/python
"""
demo all 3 calculator flavors at once
each is a distinct calculator object and window
"""

from tkinter import Tk, Button, Toplevel
import calculator, calculator_plus_ext, calculator_plus_emb

root = Tk()
calculator.CalcGui(Toplevel())
calculator.CalcGui(Toplevel(), fg='white', bg='purple')
calculator_plus_ext.CalcGuiPlus(Toplevel(), fg='gold', bg='black')
calculator_plus_emb.CalcGuiPlus(fg='black', bg='red')
Button(root, text='Quit Calcs', command=root.quit).pack()
root.mainloop()
Figure 19-8 shows the result—four independent calculators in top-level windows within the same process. The two windows on the right represent specialized reuses of PyCalc as a component, and the Help dialog appears in the lower right. Although it may not be obvious in this book, all four use different color schemes; calculator classes accept color and font configuration options and pass them down the call chain as needed.
As we learned earlier, these calculators could also be run
as independent processes by spawning command lines with the
launchmodes
module we met in
Chapter 5. In fact, that’s how the
PyGadgets and PyDemos launcher bars run calculators, so see their
code for more details. And as always, read the code and experiment
on your own for further enlightenment; this is Python, after
all.
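For readers who prefer the standard library to the book's launchmodes module, the same independent-process effect can be sketched with subprocess; the calculator.py path here is an assumption to be adjusted for your install:

```python
import subprocess
import sys

def launch(script, *args):
    # each call starts a separate interpreter, so each calculator runs
    # as a fully independent process with its own namespace and stacks
    return subprocess.Popen([sys.executable, script, *args])

# assumed path; point this at wherever calculator.py lives:
# launch('calculator.py', '-bg', 'black', '-fg', 'red')
```

Unlike the same-process windows above, calculators launched this way do not share a quit button: closing one process leaves the others running.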
This chapter concludes our Python language material in this book. The next and final technical chapter of the text takes us on a tour of techniques for integrating Python with programs written in compiled languages like C and C++. Not everyone needs to know how to do this, so some readers may wish to skip ahead to the book’s conclusion in Chapter 21 at this point. Since most Python programmers use wrapped C libraries (even if they don’t wrap them themselves), though, I recommend a quick pass over the next chapter before you close this book for good.
[71] Once again, I need to warn you about running code strings like this if you can’t be sure they won’t cause damage. If these strings can be entered by users you cannot trust, they will have access to anything on the computer that the Python process has access to. See Chapters 9 and 15 for more on security issues related to code run in GUI, Web, and other contexts.