This chapter wraps up our look at the system interfaces domain in Python by presenting a collection of larger Python scripts that do real systems work—comparing and copying directory trees, splitting files, searching files and directories, testing other programs, configuring launched programs’ shell environments, and so on. The examples here are Python system utility programs that illustrate typical tasks and techniques in this domain and focus on applying built-in tools, such as file and directory tree processing.
Although the main point of this case-study chapter is to give you a feel for realistic scripts in action, the size of these examples also gives us an opportunity to see Python’s support for development paradigms like object-oriented programming (OOP) and reuse at work. It’s really only in the context of nontrivial programs such as the ones we’ll meet here that such tools begin to bear tangible fruit. This chapter also emphasizes the “why” of system tools, not just the “how”; along the way, I’ll point out real-world needs met by the examples we’ll study, to help you put the details in context.
One note up front: this chapter moves quickly, and a few of its examples are largely listed just for independent study. Because all the scripts here are heavily documented and use Python system tools described in the preceding chapters, I won’t go through all the code in exhaustive detail. You should read the source code listings and experiment with these programs on your own computer to get a better feel for how to combine system interfaces to accomplish realistic tasks. All are available in source code form in the book’s examples distribution and most work on all major platforms.
I should also mention that most of these are programs I have really used, not examples written just for this book. They were coded over a period of years and perform widely differing tasks, so there is no obvious common thread to connect the dots here other than need. On the other hand, they help explain why system tools are useful in the first place, demonstrate larger development concepts that simpler examples cannot, and bear collective witness to the simplicity and portability of automating system tasks with Python. Once you’ve mastered the basics, you’ll wish you had done so sooner.
Quick: what’s the biggest Python source file on your computer? This was the query innocently posed by a student in one of my Python classes. Because I didn’t know either, it became an official exercise in subsequent classes, and it provides a good example of ways to apply Python system tools for a realistic purpose in this book. Really, the query is a bit vague, because its scope is unclear. Do we mean the largest Python file in a directory, in a full directory tree, in the standard library, on the module import search path, or on your entire hard drive? Different scopes imply different solutions.
For instance, Example 6-1 is a first-cut solution that looks for the biggest Python file in one directory—a limited scope, but enough to get started.
""" Find the largest Python source file in a single directory. Search Windows Python source lib, unless dir command-line arg. """ import os, glob, sys dirname = r'C:Python31Lib' if len(sys.argv) == 1 else sys.argv[1] allsizes = [] allpy = glob.glob(dirname + os.sep + '*.py') for filename in allpy: filesize = os.path.getsize(filename) allsizes.append((filesize, filename)) allsizes.sort() print(allsizes[:2]) print(allsizes[-2:])
This script uses the glob module to run through a directory’s files
and detect the largest by storing sizes and names on a list that is
sorted at the end: because size appears first in each tuple, it
dominates the ascending sort, and the largest file percolates to the
end of the list. We could instead keep track of the currently
largest as we go, but the list scheme is more flexible. When run,
this script scans the Python standard library’s source directory on
Windows, unless you pass a different directory on the command line,
and it prints both the two smallest and largest files it
finds:
C:\...\PP4E\System\Filetools> bigpy-dir.py
[(0, 'C:\\Python31\\Lib\\build_class.py'), (56, 'C:\\Python31\\Lib\\struct.py')]
[(147086, 'C:\\Python31\\Lib\\turtle.py'), (211238, 'C:\\Python31\\Lib\\decimal.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py .
[(21, '.\\__init__.py'), (461, '.\\bigpy-dir.py')]
[(1940, '.\\bigext-tree.py'), (2547, '.\\split.py')]

C:\...\PP4E\System\Filetools> bigpy-dir.py ..
[(21, '..\\__init__.py'), (29, '..\\testargv.py')]
[(541, '..\\testargv2.py'), (549, '..\\more.py')]
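As an aside, the running-max alternative mentioned above is easy to sketch. The following is illustrative code only (the `biggest_py` helper is not part of the book's examples); it trades the sorted list's flexibility for constant memory:

```python
import glob
import os

def biggest_py(dirname):
    """Return (size, name) of the largest .py file in one directory,
    tracking the current maximum as we go instead of sorting a list;
    returns None if the directory has no .py files."""
    biggest = None
    for filename in glob.glob(os.path.join(dirname, '*.py')):
        filesize = os.path.getsize(filename)
        if biggest is None or filesize > biggest[0]:
            biggest = (filesize, filename)
    return biggest

if __name__ == '__main__':
    import sys
    print(biggest_py(sys.argv[1] if len(sys.argv) > 1 else os.curdir))
```

Unlike Example 6-1, though, this version cannot also report the smallest files, which is one reason the sorted-list scheme is more flexible.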
The prior section’s solution works, but it’s obviously a partial
answer—Python files are usually located in more than one directory.
Even within the standard library, there are many subdirectories for
module packages, and they may be arbitrarily nested. We really need
to traverse an entire directory tree. Moreover, the first output
above is difficult to read; Python’s pprint
(for
“pretty print”) module can help here. Example 6-2 puts these extensions
into code.
""" Find the largest Python source file in an entire directory tree. Search the Python source lib, use pprint to display results nicely. """ import sys, os, pprint trace = False if sys.platform.startswith('win'): dirname = r'C:Python31Lib' # Windows else: dirname = '/usr/lib/python' # Unix, Linux, Cygwin allsizes = [] for (thisDir, subsHere, filesHere) in os.walk(dirname): if trace: print(thisDir) for filename in filesHere: if filename.endswith('.py'): if trace: print('...', filename) fullname = os.path.join(thisDir, filename) fullsize = os.path.getsize(fullname) allsizes.append((fullsize, fullname)) allsizes.sort() pprint.pprint(allsizes[:2]) pprint.pprint(allsizes[-2:])
When run, this new version uses os.walk
to search
an entire tree of directories for the largest Python source file.
Change this script’s trace
variable if you want to track its progress through the tree. As
coded, it searches the Python standard library’s source tree,
tailored for Windows and Unix-like locations:
C:\...\PP4E\System\Filetools> bigpy-tree.py
[(0, 'C:\\Python31\\Lib\\build_class.py'),
 (0, 'C:\\Python31\\Lib\\email\\mime\\__init__.py')]
[(211238, 'C:\\Python31\\Lib\\decimal.py'),
 (380582, 'C:\\Python31\\Lib\\pydoc_data\\topics.py')]
Sure enough—the prior section’s script found smallest and largest files in subdirectories. While searching Python’s entire standard library tree this way is more inclusive, it’s still incomplete: there may be additional modules installed elsewhere on your computer, which are accessible from the module import search path but outside Python’s source tree. To be more exhaustive, we could instead essentially perform the same tree search, but for every directory on the module import search path. Example 6-3 adds this extension to include every importable Python-coded module on your computer—located both on the path directly and nested in package directory trees.
""" Find the largest Python source file on the module import search path. Skip already-visited directories, normalize path and case so they will match properly, and include line counts in pprinted result. It's not enough to use os.environ['PYTHONPATH']: this is a subset of sys.path. """ import sys, os, pprint trace = 0 # 1=dirs, 2=+files visited = {} allsizes = [] for srcdir in sys.path: for (thisDir, subsHere, filesHere) in os.walk(srcdir): if trace > 0: print(thisDir) thisDir = os.path.normpath(thisDir) fixcase = os.path.normcase(thisDir) if fixcase in visited: continue else: visited[fixcase] = True for filename in filesHere: if filename.endswith('.py'): if trace > 1: print('...', filename) pypath = os.path.join(thisDir, filename) try: pysize = os.path.getsize(pypath) except os.error: print('skipping', pypath, sys.exc_info()[0]) else: pylines = len(open(pypath, 'rb').readlines()) allsizes.append((pysize, pylines, pypath)) print('By size...') allsizes.sort() pprint.pprint(allsizes[:3]) pprint.pprint(allsizes[-3:]) print('By lines...') allsizes.sort(key=lambda x: x[1]) pprint.pprint(allsizes[:3]) pprint.pprint(allsizes[-3:])
When run, this script marches down the module import path and, for each valid directory it contains, attempts to search the entire tree rooted there. In fact, it nests loops three deep—for items on the path, directories in the item’s tree, and files in the directory. Because the module path may contain directories named in arbitrary ways, along the way this script must take care to:
Normalize directory paths—fixing up slashes and dots to map directories to a common form.
Normalize directory name case—converting to lowercase on case-insensitive Windows, so that same names match by string equality, but leaving case unchanged on Unix, where it matters.
Detect repeats to avoid visiting the same directory twice
(the same directory might be reached from more than one entry on
sys.path
).
Skip any file-like item in the tree for which os.path.getsize
fails (by default
os.walk
itself silently
ignores things it cannot treat as directories, both at the top
of and within the tree).
Avoid potential Unicode decoding
errors in file content by opening files in binary
mode in order to count their lines. Text mode requires decodable
content, and some files in Python 3.1’s library tree cannot be
decoded properly on Windows. Catching Unicode exceptions with a
try
statement would avoid
program exits, too, but might skip candidate files.
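The first three of these steps can be seen in isolation. The following is a minimal sketch with illustrative names (`canonical`, `first_visit`) that do not appear in the book's scripts:

```python
import os

def canonical(path):
    """Map a directory path to a single comparable key: collapse
    redundant separators and '.' / '..' components, then fold case
    on case-insensitive platforms (normcase is a no-op on Unix)."""
    return os.path.normcase(os.path.normpath(path))

visited = set()

def first_visit(path):
    """Return True only the first time a directory is seen, no
    matter how its path happens to be spelled."""
    key = canonical(path)
    if key in visited:
        return False
    visited.add(key)
    return True
```

Because sys.path entries may overlap (for example, a package directory reachable from two different path entries), skipping repeats this way keeps the same file from being counted twice.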
This version also adds line counts; this might add significant
run time to this script too, but it’s a useful metric to report. In
fact, this version uses this value as a sort key to report the three
largest and smallest files by line counts too—this may differ from
results based upon raw file size. Here’s the script in action in
Python 3.1 on my Windows 7 machine; since these results depend on
platform, installed extensions, and path settings, your sys.path
and largest and smallest files
may vary:
C:\...\PP4E\System\Filetools> bigpy-path.py
By size...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(161613, 3754, 'C:\\Python31\\lib\\tkinter\\__init__.py'),
 (211238, 5768, 'C:\\Python31\\lib\\decimal.py'),
 (380582, 78, 'C:\\Python31\\lib\\pydoc_data\\topics.py')]
By lines...
[(0, 0, 'C:\\Python31\\lib\\build_class.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\Python31\\lib\\email\\test\\__init__.py')]
[(147086, 4132, 'C:\\Python31\\lib\\turtle.py'),
 (150069, 4268, 'C:\\Python31\\lib\\test\\test_descr.py'),
 (211238, 5768, 'C:\\Python31\\lib\\decimal.py')]
Again, change this script’s trace
variable if you want to track its
progress through the tree. As you can see, the results for largest
files differ when viewed by size and lines—a disparity which we’ll
probably have to hash out in our next requirements meeting.
Finally, although searching trees rooted in the module import
path normally includes every Python source file you can import on
your computer, it’s still not complete. Technically, this approach
checks only modules; Python source files which are top-level scripts
run directly do not need to be included in the module path.
Moreover, the module search path may be manually changed by some
scripts dynamically at runtime (for example, by direct sys.path
updates in scripts that run on
web servers) to include additional directories that Example 6-3 won’t catch.
Ultimately, finding the largest source file on your computer requires searching your entire drive—a feat which our tree searcher in Example 6-2 almost supports, if we generalize it to accept the root directory name as an argument and add some of the bells and whistles of the path searcher version (we really want to avoid visiting the same directory twice if we’re scanning an entire machine, and we might as well skip errors and check line-based sizes if we’re investing the time). Example 6-4 implements such general tree scans, outfitted for the heavier lifting required for scanning drives.
""" Find the largest file of a given type in an arbitrary directory tree. Avoid repeat paths, catch errors, add tracing and line count size. Also uses sets, file iterators and generator to avoid loading entire file, and attempts to work around undecodable dir/file name prints. """ import os, pprint from sys import argv, exc_info trace = 1 # 0=off, 1=dirs, 2=+files dirname, extname = os.curdir, '.py' # default is .py files in cwd if len(argv) > 1: dirname = argv[1] # ex: C:, C:Python31Lib if len(argv) > 2: extname = argv[2] # ex: .pyw, .txt if len(argv) > 3: trace = int(argv[3]) # ex: ". .py 2" def tryprint(arg): try: print(arg) # unprintable filename? except UnicodeEncodeError: print(arg.encode()) # try raw byte string visited = set() allsizes = [] for (thisDir, subsHere, filesHere) in os.walk(dirname): if trace: tryprint(thisDir) thisDir = os.path.normpath(thisDir) fixname = os.path.normcase(thisDir) if fixname in visited: if trace: tryprint('skipping ' + thisDir) else: visited.add(fixname) for filename in filesHere: if filename.endswith(extname): if trace > 1: tryprint('+++' + filename) fullname = os.path.join(thisDir, filename) try: bytesize = os.path.getsize(fullname) linesize = sum(+1 for line in open(fullname, 'rb')) except Exception: print('error', exc_info()[0]) else: allsizes.append((bytesize, linesize, fullname)) for (title, key) in [('bytes', 0), ('lines', 1)]: print(' By %s...' % title) allsizes.sort(key=lambda x: x[key]) pprint.pprint(allsizes[:3]) pprint.pprint(allsizes[-3:])
Unlike the prior tree version, this one allows us to search in specific directories, and for specific extensions. The default is to simply search the current working directory for Python files:
C:\...\PP4E\System\Filetools> bigext-tree.py
.

By bytes...
[(21, 1, '.\\__init__.py'),
 (461, 17, '.\\bigpy-dir.py'),
 (818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
 (1940, 49, '.\\bigext-tree.py'),
 (2547, 57, '.\\split.py')]

By lines...
[(21, 1, '.\\__init__.py'),
 (461, 17, '.\\bigpy-dir.py'),
 (818, 25, '.\\bigpy-tree.py')]
[(1696, 48, '.\\join.py'),
 (1940, 49, '.\\bigext-tree.py'),
 (2547, 57, '.\\split.py')]
For more custom work, we can now pass in a directory name, extension type, and trace level on the command line (trace level 0 disables tracing; 1, the default, shows directories visited along the way):
C:\...\PP4E\System\Filetools> bigext-tree.py .. .py 0

By bytes...
[(21, 1, '..\\__init__.py'),
 (21, 1, '..\\Filetools\\__init__.py'),
 (28, 1, '..\\Streams\\hello-out.py')]
[(2278, 67, '..\\Processes\\multi2.py'),
 (2547, 57, '..\\Filetools\\split.py'),
 (4361, 105, '..\\Tester\\tester.py')]

By lines...
[(21, 1, '..\\__init__.py'),
 (21, 1, '..\\Filetools\\__init__.py'),
 (28, 1, '..\\Streams\\hello-out.py')]
[(2547, 57, '..\\Filetools\\split.py'),
 (2278, 67, '..\\Processes\\multi2.py'),
 (4361, 105, '..\\Tester\\tester.py')]
This script also lets us scan for different file types; here it is picking out the smallest and largest text file from one level up (at the time I ran this script, at least):
C:\...\PP4E\System\Filetools> bigext-tree.py .. .txt 1
..
..\Environment
..\Filetools
..\Processes
..\Streams
..\Tester
..\Tester\Args
..\Tester\Errors
..\Tester\Inputs
..\Tester\Outputs
..\Tester\Scripts
..\Tester\xxold
..\Threads

By bytes...
[(4, 2, '..\\Streams\\input.txt'),
 (13, 1, '..\\Streams\\hello-in.txt'),
 (20, 4, '..\\Streams\\data.txt')]
[(104, 4, '..\\Streams\\output.txt'),
 (172, 3, '..\\Tester\\xxold\\README.txt.txt'),
 (435, 4, '..\\Filetools\\temp.txt')]

By lines...
[(13, 1, '..\\Streams\\hello-in.txt'),
 (22, 1, '..\\spam.txt'),
 (4, 2, '..\\Streams\\input.txt')]
[(20, 4, '..\\Streams\\data.txt'),
 (104, 4, '..\\Streams\\output.txt'),
 (435, 4, '..\\Filetools\\temp.txt')]
And now, to search your entire system, simply pass in your
machine’s root directory name (use /
instead of C:
on Unix-like machines), along with an
optional file extension type (.py is just the default now). The winner
is…(please, no wagering):
C:\...\PP4E\dev\Examples\PP4E\System\Filetools> bigext-tree.py C:\
C:\
C:\$Recycle.Bin
C:\$Recycle.Bin\S-1-5-21-3951091421-2436271001-910485044-1004
C:\cygwin
C:\cygwin\bin
C:\cygwin\cygdrive
C:\cygwin\dev
C:\cygwin\dev\mqueue
C:\cygwin\dev\shm
C:\cygwin\etc
...MANY more lines omitted...

By bytes...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(380582, 78, 'C:\\Python31\\Lib\\pydoc_data\\topics.py'),
 (398157, 83, 'C:\\...\\Install\\Source\\Python-2.6\\Lib\\pydoc_topics.py'),
 (412434, 83, 'C:\\Python26\\Lib\\pydoc_topics.py')]

By lines...
[(0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\build_class.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\mime\\__init__.py'),
 (0, 0, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\email\\test\\__init__.py')]
[(204107, 5589, 'C:\\...\\Install\\Source\\Python-3.0\\Lib\\decimal.py'),
 (205470, 5768, 'C:\\cygwin\\...\\python31\\Python-3.1.1\\Lib\\decimal.py'),
 (211238, 5768, 'C:\\Python31\\Lib\\decimal.py')]
The script’s trace logic is preset to allow you to monitor its directory progress. I’ve shortened some directory names to protect the innocent here (and to fit on this page). This command may take a long time to finish on your computer—on my sadly underpowered Windows 7 netbook, it took 11 minutes to scan a solid state drive with some 59G of data, 200K files, and 25K directories when the system was lightly loaded (8 minutes when not tracing directory names, but half an hour when many other applications were running). Nevertheless, it provides the most exhaustive solution to the original query of all our attempts.
This is also as complete a solution as we have space for in
this book. For more fun, consider that you may need to scan more
than one drive, and some Python source files may also appear in zip
archives, whether on the module path or not (os.walk
silently ignores zip files in
Example 6-3). They might
also be named in other ways: with .pyw extensions to suppress shell pop-up windows
on Windows, and with arbitrary extensions for some top-level
scripts.
at all, even though they are Python source files. And while they’re
generally not Python files, some importable modules may also appear
in frozen binaries or be statically linked into the Python
executable. In the interest of space, we’ll leave such higher
resolution (and potentially intractable!) search extensions as
suggested exercises.
One fine point before we move on: notice the seemingly superfluous exception
handling in Example 6-4’s
tryprint
function. When I first
tried to scan an entire drive as shown in the preceding section,
this script died on a Unicode encoding error while trying to print a
directory name of a saved web page. Adding the exception handler
skips the error entirely.
This demonstrates a subtle but pragmatically important issue:
Python 3.X’s Unicode orientation extends to filenames, even if they
are just printed. As we learned in Chapter 4, because filenames may contain
arbitrary text, os.listdir
returns filenames in two different ways—we get back
decoded Unicode strings when we pass in a normal str
argument, and still-encoded byte
strings when we send a bytes argument:
>>> import os
>>> os.listdir('.')[:4]
['bigext-tree.py', 'bigpy-dir.py', 'bigpy-path.py', 'bigpy-tree.py']
>>> os.listdir(b'.')[:4]
[b'bigext-tree.py', b'bigpy-dir.py', b'bigpy-path.py', b'bigpy-tree.py']
Both os.walk
(used in the
Example 6-4 script) and
glob.glob
inherit this behavior
for the directory and file names they return, because they work by
calling os.listdir
internally at
each directory level. For all these calls, passing in a byte string
argument suppresses Unicode decoding of file and directory names.
Passing a normal string assumes that filenames are decodable per the
file system’s Unicode scheme.
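A quick check shows glob following the same convention; this snippet (run in any directory containing .py files) simply reports the result types, which mirror the pattern's type:

```python
import glob

# The result type follows the argument type: str patterns yield decoded
# str names, and bytes patterns yield still-encoded bytes names.
for pattern in ('*.py', b'*.py'):
    names = glob.glob(pattern)
    print(type(pattern).__name__, '->', [type(n).__name__ for n in names[:2]])
```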
The reason this potentially mattered to this section’s example
is that running the tree search version over an entire hard drive
eventually reached an undecodable filename (an old saved web page
with an odd name), which generated an exception when the print
function tried to display it. Here’s
a simplified recreation of the error, run in a shell window (Command
Prompt) on Windows:
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in
position 45: character maps to <undefined>
One way out of this dilemma is to use bytes
strings for the directory root
name—this suppresses filename decoding in the os.listdir
calls run by os.walk
, and effectively limits the scope
of later printing to raw bytes. Since printing does not have to deal
with encodings, it works without error. Manually encoding to bytes
prior to printing works too, but the results are slightly
different:
>>> root.encode()
b'C:\\py3000'
>>> for (dir, subs, files) in os.walk(root.encode()): print(dir)
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com Code \xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\x92s New in Python 3_0 \x97 Python Documentation'

>>> for (dir, subs, files) in os.walk(root): print(dir.encode())
...
b'C:\\py3000'
b'C:\\py3000\\FutureProofPython - PythonInfo Wiki_files'
b'C:\\py3000\\Oakwinter_com Code \xc2\xbb Porting setuptools to py3k_files'
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'
Unfortunately, either approach means that all the directory names printed during the walk display as cryptic byte strings. To maintain the better readability of normal strings, I instead opted for the exception handler approach used in the script’s code. This avoids the issues entirely:
>>> for (dir, subs, files) in os.walk(root):
...     try:
...         print(dir)
...     except UnicodeEncodeError:
...         print(dir.encode())         # or simply punt if encode may fail too
...
C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
b'C:\\py3000\\What\xe2\x80\x99s New in Python 3_0 \xe2\x80\x94 Python Documentation'
Oddly, though, the error seems more related to printing than
to Unicode encodings of filenames—because the filename did not fail
until printed, it must have been decodable when its string was
created initially. That’s why wrapping up the print
in a try
suffices; otherwise, the error would
occur earlier.
Moreover, this error does not occur if the script’s output is
redirected to a file, either at the shell level (bigext-tree.py c:\ > out
), or by the
print call itself (print(dir,
file=F)
). In the latter case the output file must later be
read back in binary mode, as text mode triggers the same error when
printing the file’s content to the shell window (but again, not
until printed). In fact, the exact same code that fails when run in
a system shell Command Prompt on Windows works without error when
run in the IDLE GUI on the same platform—the tkinter GUI used by
IDLE handles display of characters that printing to standard output
connected to a shell terminal window does not:
>>> import os                  # run in IDLE (a tkinter GUI), not system shell
>>> root = r'C:\py3000'
>>> for (dir, subs, files) in os.walk(root): print(dir)

C:\py3000
C:\py3000\FutureProofPython - PythonInfo Wiki_files
C:\py3000\Oakwinter_com Code » Porting setuptools to py3k_files
C:\py3000\What’s New in Python 3_0 — Python Documentation_files
In other words, the exception occurs only when printing to a
shell window, and long after the file name string is created. This
reflects an artifact of extra translations performed by the Python printer, not
of Unicode file names in general. Because we have no room for
further exploration here, though, we’ll have to be satisfied with
the fact that our exception handler sidesteps the printing problem
altogether. You should still be aware of the implications of Unicode
filename decoding, though; on some platforms you may need to pass
byte strings to os.walk
in this
script to prevent decoding errors as filenames are created.[18]
Since Unicode is still relatively new in 3.1, be sure to test for such errors on your computer and your Python. Also see Python’s manuals for more on the treatment of Unicode filenames, and the book Learning Python for more on Unicode in general. As noted earlier, our scripts also had to open text files in binary mode because some might contain undecodable content too. It might seem surprising that Unicode issues can crop up in basic printing like this, but such is life in the brave new Unicode world. Many real-world scripts don’t need to care much about Unicode, of course, including those we’ll explore in the next section.
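For reference, the binary-mode line count used by this section's scripts boils down to a single pattern; `count_lines` here is an illustrative wrapper, not a function from the examples:

```python
def count_lines(filepath):
    """Count lines without decoding file content: reading in binary
    mode sidesteps UnicodeDecodeError for files whose bytes do not
    match the platform's default encoding, and the generator avoids
    loading the whole file into memory at once."""
    with open(filepath, 'rb') as fileobj:
        return sum(1 for line in fileobj)
```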
Like most kids, mine spent a lot of time on the Internet when they were growing up. As far as I could tell, it was the thing to do. Among their generation, computer geeks and gurus seem to have been held in the same sort of esteem that my generation once held rock stars. When kids disappeared into their rooms, chances were good that they were hacking on computers, not mastering guitar riffs (well, real ones, at least). It may or may not be healthier than some of the diversions of my own misspent youth, but that’s a topic for another kind of book.
Despite the rhetoric of techno-pundits about the Web’s potential to empower an upcoming generation in ways unimaginable by their predecessors, my kids seemed to spend most of their time playing games. To fetch new ones in my house at the time, they had to download to a shared computer which had Internet access and transfer those games to their own computers to install. (Their own machines did not have Internet access until later, for reasons that most parents in the crowd could probably expand upon.)
The problem with this scheme is that game files are not small.
They were usually much too big to fit on a floppy or memory stick of
the time, and burning a CD or DVD took away valuable game-playing
time. If all the machines in my house ran Linux, this would have been
a nonissue. There are standard command-line programs on Unix for
chopping a file into pieces small enough to fit on a transfer device
(split
), and others for putting the
pieces back together to re-create the original file (cat
). Because we had all sorts of different
machines in the house, though, we needed a more portable
solution.[19]
Since all the computers in my house ran Python, a simple portable Python script came to the rescue. The Python program in Example 6-5 distributes a single file’s contents among a set of part files and stores those part files in a directory.
#!/usr/bin/python
"""
################################################################################
split a file into a set of parts; join.py puts them back together;
this is a customizable version of the standard Unix split command-line
utility; because it is written in Python, it also works on Windows and
can be easily modified; because it exports a function, its logic can
also be imported and reused in other applications;
################################################################################
"""

import sys, os
kilobytes = 1024
megabytes = kilobytes * 1000
chunksize = int(1.4 * megabytes)                   # default: roughly a floppy

def split(fromfile, todir, chunksize=chunksize):
    if not os.path.exists(todir):                  # caller handles errors
        os.mkdir(todir)                            # make dir, read/write parts
    else:
        for fname in os.listdir(todir):            # delete any existing files
            os.remove(os.path.join(todir, fname))
    partnum = 0
    input = open(fromfile, 'rb')                   # binary: no decode, endline
    while True:                                    # eof=empty string from read
        chunk = input.read(chunksize)              # get next part <= chunksize
        if not chunk: break
        partnum += 1
        filename = os.path.join(todir, ('part%04d' % partnum))
        fileobj  = open(filename, 'wb')
        fileobj.write(chunk)
        fileobj.close()                            # or simply open().write()
    input.close()
    assert partnum <= 9999                         # join sort fails if 5 digits
    return partnum

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: split.py [file-to-split target-dir [chunksize]]')
    else:
        if len(sys.argv) < 3:
            interactive = True
            fromfile = input('File to be split? ')           # input if clicked
            todir    = input('Directory to store part files? ')
        else:
            interactive = False
            fromfile, todir = sys.argv[1:3]                  # args in cmdline
            if len(sys.argv) == 4: chunksize = int(sys.argv[3])
        absfrom, absto = map(os.path.abspath, [fromfile, todir])
        print('Splitting', absfrom, 'to', absto, 'by', chunksize)

        try:
            parts = split(fromfile, todir, chunksize)
        except:
            print('Error during split:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Split finished:', parts, 'parts are in', absto)
        if interactive: input('Press Enter key')             # pause if clicked
By default, this script splits the input file into chunks that
are roughly the size of a floppy disk—perfect for moving big files
between the electronically isolated machines of the time. Most
importantly, because this is all portable Python code, this script
will run on just about any machine, even ones without their own file
splitter. All it requires is an installed Python. Here it is at work
splitting a Python 3.1 self-installer executable located in the
current working directory on Windows (I’ve omitted a few dir
output lines to save space here; use
ls -l
on Unix):
C:\temp> cd C:\temp

C:\temp> dir python-3.1.msi
...more...
06/27/2009  04:53 PM        13,814,272 python-3.1.msi
               1 File(s)     13,814,272 bytes
               0 Dir(s)  188,826,189,824 bytes free

C:\temp> python C:\...\PP4E\System\Filetools\split.py -help
Use: split.py [file-to-split target-dir [chunksize]]

C:\temp> python C:\...\PP4E\System\Filetools\split.py python-3.1.msi pysplit
Splitting C:\temp\python-3.1.msi to C:\temp\pysplit by 1433600
Split finished: 10 parts are in C:\temp\pysplit

C:\temp> dir pysplit
...more...
02/21/2010  11:13 AM    <DIR>          .
02/21/2010  11:13 AM    <DIR>          ..
02/21/2010  11:13 AM         1,433,600 part0001
02/21/2010  11:13 AM         1,433,600 part0002
02/21/2010  11:13 AM         1,433,600 part0003
02/21/2010  11:13 AM         1,433,600 part0004
02/21/2010  11:13 AM         1,433,600 part0005
02/21/2010  11:13 AM         1,433,600 part0006
02/21/2010  11:13 AM         1,433,600 part0007
02/21/2010  11:13 AM         1,433,600 part0008
02/21/2010  11:13 AM         1,433,600 part0009
02/21/2010  11:13 AM           911,872 part0010
              10 File(s)     13,814,272 bytes
               2 Dir(s)  188,812,328,960 bytes free
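Totaling the part sizes by hand isn't necessary, of course; a small checker can confirm that no bytes were lost in transit. This is a hypothetical helper (`verify_split`), not part of Example 6-5:

```python
import os

def verify_split(fromfile, todir):
    """Return True if the sizes of all part files in todir add up
    to exactly the size of the original file."""
    total = sum(os.path.getsize(os.path.join(todir, fname))
                for fname in os.listdir(todir))
    return total == os.path.getsize(fromfile)
```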
Each of these generated part files represents one binary chunk
of the file python-3.1.msi—a
chunk small enough to fit comfortably on a floppy disk of the time.
In fact, if you add the sizes of the generated part files given by
the ls
command, you’ll come up
with exactly the same number of bytes as the original file’s size.
Before we see how to put these files back together again, here are a
few points to ponder as you study this script’s code:
This script is designed to input its parameters in either interactive or command-line mode; it checks the number of command-line arguments to find out the mode in which it is being used. In command-line mode, you list the file to be split and the output directory on the command line, and you can optionally override the default part file size with a third command-line argument.
In interactive mode, the script asks for a filename and
output directory at the console window with input
and pauses for a key press at
the end before exiting. This mode is nice when the program
file is started by clicking on its icon; on Windows,
parameters are typed into a pop-up DOS box that doesn’t
automatically disappear. The script also shows the absolute
paths of its parameters (by running them through os.path.abspath
) because they may
not be obvious in interactive mode.
This code is careful to open both input and output files
in binary mode (rb
,
wb
), because it needs to
portably handle things like executables and audio files, not
just text. In Chapter 4, we
learned that on Windows, text-mode files automatically map \r\n
end-of-line sequences to \n on input and map \n to \r\n on output.
For true binary data, we really don't want any \r characters in the
data to go away when read, and we don't want any superfluous \r
characters to be added on output. Binary-mode files suppress this
\r mapping when the script is run on Windows and so avoid data
corruption.
In Python 3.X, binary mode also means that file data is
bytes
objects in our
script, not encoded str
text, though we don’t need to do anything special—this
script’s file processing code runs the same on Python 3.X as
it did on 2.X. In fact, binary mode is required in 3.X for
this program, because the target file’s data may not be
encoded text at all; text mode requires that file content must
be decodable in 3.X, and that might fail both for truly binary
data and text files obtained from other platforms. On output,
binary mode accepts bytes
and suppresses Unicode encoding and line-end
translations.
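The portability point is easy to demonstrate with a small sketch (demo.bin is a hypothetical filename): bytes written in 'wb' mode come back unchanged in 'rb' mode on any platform, with no line-end mapping or Unicode decoding applied:

```python
# raw bytes, deliberately including a \r\n pair that text mode would alter
data = b'\x00\x01\r\n\x02'

with open('demo.bin', 'wb') as f:    # 'wb': no \n -> \r\n mapping on output
    f.write(data)

with open('demo.bin', 'rb') as f:    # 'rb': no \r\n -> \n mapping on input
    result = f.read()

print(result == data)                # True: byte-for-byte round trip
```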
This script also goes out of its way to manually close
its files. As we also saw in
Chapter 4, we can
often get by with a single line: open(partname, 'wb').write(chunk)
.
This shorter form relies on the fact that the current Python
implementation automatically closes files for you when file
objects are reclaimed (i.e., when they are garbage collected,
because there are no more references to the file object). In
this one-liner, the file object would be reclaimed
immediately, because the open
result is temporary in an
expression and is never referenced by a longer-lived name.
Similarly, the input
file
is reclaimed when the split
function exits.
However, it's not impossible that this automatic-close
behavior may go away in the future. Moreover, the Jython
Java-based Python implementation does not reclaim unreferenced
objects as immediately as the standard Python. You should
close manually if you care about the Java port, if your script
may create many files in a short amount of time, or if it
may run on a machine that limits the number of
open files per program. Because the split
function in this module is
intended to be a general-purpose tool, it accommodates such
worst-case scenarios.
Also see Chapter 4’s mention
of the file context manager and the with
statement; this provides an
alternative way to guarantee file closes.
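For instance, the one-liner shown above could be made close-safe with a with statement instead; a sketch using hypothetical names:

```python
# Close-guaranteed variant of open(partname, 'wb').write(chunk): the with
# statement closes the file on block exit, even if an exception is raised,
# without relying on immediate garbage collection of the file object.
chunk = b'spam' * 10          # stand-in for one chunk of split data
partname = 'part0001'         # hypothetical part filename

with open(partname, 'wb') as part:
    part.write(chunk)         # file is closed when the block exits

print(open(partname, 'rb').read() == chunk)   # True
```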
Back to moving big files around the house: after downloading a big game program file, you can run the previous splitter script by clicking on its name in Windows Explorer and typing filenames. After a split, simply copy each part file onto its own floppy (or other more modern medium), walk the files to the destination machine, and re-create the split output directory on the target computer by copying the part files. Finally, the script in Example 6-6 is clicked or otherwise run to put the parts back together.
#!/usr/bin/python
"""
################################################################################
join all part files in a dir created by split.py, to re-create file.  This is
roughly like a 'cat fromdir/* > tofile' command on unix, but is more portable
and configurable, and exports the join operation as a reusable function.
Relies on sort order of filenames: must be same length.  Could extend split/
join to pop up Tkinter file selectors.
################################################################################
"""

import os, sys
readsize = 1024

def join(fromdir, tofile):
    output = open(tofile, 'wb')
    parts  = os.listdir(fromdir)
    parts.sort()
    for filename in parts:
        filepath = os.path.join(fromdir, filename)
        fileobj  = open(filepath, 'rb')
        while True:
            filebytes = fileobj.read(readsize)
            if not filebytes: break
            output.write(filebytes)
        fileobj.close()
    output.close()

if __name__ == '__main__':
    if len(sys.argv) == 2 and sys.argv[1] == '-help':
        print('Use: join.py [from-dir-name to-file-name]')
    else:
        if len(sys.argv) != 3:
            interactive = True
            fromdir = input('Directory containing part files? ')
            tofile  = input('Name of file to be recreated? ')
        else:
            interactive = False
            fromdir, tofile = sys.argv[1:]
        absfrom, absto = map(os.path.abspath, [fromdir, tofile])
        print('Joining', absfrom, 'to make', absto)
        try:
            join(fromdir, tofile)
        except:
            print('Error joining files:')
            print(sys.exc_info()[0], sys.exc_info()[1])
        else:
            print('Join complete: see', absto)
        if interactive: input('Press Enter key')    # pause if clicked
Here is a join in progress on Windows, combining the split
files we made a moment ago; after running the join
script, you still may need to run
something like zip
, gzip
, or tar
to unpack an archive file unless it’s
shipped as an executable, but at least the original downloaded file
is set to go[20]:
C:\temp> python C:\...\PP4E\System\Filetools\join.py -help
Use: join.py [from-dir-name to-file-name]

C:\temp> python C:\...\PP4E\System\Filetools\join.py pysplit mypy31.msi
Joining C:\temp\pysplit to make C:\temp\mypy31.msi
Join complete: see C:\temp\mypy31.msi

C:\temp> dir *.msi
...more...
02/21/2010  11:21 AM        13,814,272 mypy31.msi
06/27/2009  04:53 PM        13,814,272 python-3.1.msi
               2 File(s)     27,628,544 bytes
               0 Dir(s)  188,798,611,456 bytes free

C:\temp> fc /b mypy31.msi python-3.1.msi
Comparing files mypy31.msi and PYTHON-3.1.MSI
FC: no differences encountered
The join script simply uses os.listdir
to
collect all the part files in a directory created by split
, and sorts the filename list to put
the parts back together in the correct order. We get back an exact
byte-for-byte copy of the original file (proved by the DOS fc
command in the code; use cmp
on Unix).
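If you'd rather verify the copy from Python itself, the standard library's filecmp module can stand in for fc and cmp; the filenames below are demo stand-ins created on the spot, not the .msi files from the session:

```python
import filecmp

# create two small demo files (stand-ins for the original and rejoined file)
open('orig.bin', 'wb').write(b'\x00\x01\x02' * 100)
open('copy.bin', 'wb').write(b'\x00\x01\x02' * 100)

# shallow=False forces a full byte-for-byte comparison,
# not just a quick os.stat signature check
print(filecmp.cmp('orig.bin', 'copy.bin', shallow=False))   # True
```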
Some of this process is still manual, of course (I never did
figure out how to script the “walk the floppies to your bedroom”
step), but the split
and join
scripts make it both quick and simple
to move big files around. Because this script is also portable
Python code, it runs on any platform to which we cared to move split
files. For instance, my home computers ran both Windows and Linux at
the time; since this script runs on either platform, the gamers were
covered. Before we move on, here are a couple of implementation
details worth underscoring in the join
script’s code:
First of all, notice that this script deals with files
in binary mode but also reads each part file in blocks of 1 KB
each. In fact, the readsize
setting here (the size of each block read from an input part
file) has no relation to chunksize
in
split.py (the total size of each output
part file). As we learned in Chapter 4, this script could
instead read each part file all at once: output.write(open(filepath,
'rb').read())
. The downside to this scheme is that
it really does load all of a file into memory at once. For
example, reading a 1.4 MB part file into memory all at once
with the file object read
method generates a 1.4 MB string in memory to hold the file’s
bytes. Since split
allows
users to specify even larger chunk sizes, the join
script plans for the worst and
reads in terms of limited-size blocks. To be completely
robust, the split
script
could read its input data in smaller chunks too, but this
hasn’t become a concern in practice (recall that as your
program runs, Python automatically reclaims strings that are
no longer referenced, so this isn’t as wasteful as it might
seem).
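The bounded-memory reading loop at the heart of join can be isolated as a sketch; copyfile and the filenames here are hypothetical:

```python
readsize = 1024

def copyfile(frompath, topath):
    # copy one file in fixed-size binary blocks, so at most readsize
    # bytes of file data are held in memory at any point
    with open(frompath, 'rb') as src, open(topath, 'wb') as dst:
        while True:
            block = src.read(readsize)   # next block, or b'' at end-of-file
            if not block:
                break
            dst.write(block)

# demo: a file larger than one block still copies byte for byte
open('big.bin', 'wb').write(b'x' * 5000)
copyfile('big.bin', 'big.copy')
print(open('big.copy', 'rb').read() == b'x' * 5000)   # True
```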
If you study this script’s code closely, you may also
notice that the join
scheme
it uses relies completely on the sort order of filenames in
the parts directory. Because it simply calls the list sort
method on the filenames list
returned by os.listdir
, it
implicitly requires that filenames have the same length and
format when created by split
. To satisfy this requirement,
the splitter uses zero-padding notation in a string formatting
expression ('part%04d'
) to
make sure that filenames all have the same number of digits at
the end (four). When sorted, the leading zero characters in
small numbers guarantee that part files are ordered for
joining correctly.
Alternatively, we could strip off digits in filenames,
convert them with int, and
sort numerically, by using the list sort
method's key argument, but that would still
imply that all filenames must start with the same type of
substring, and so doesn't quite remove the file-naming
dependency between the split
and join
scripts. Because these scripts
are designed to be two steps of the same process, though, some
dependencies between them seem reasonable.
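For reference, the numeric-sort alternative described here might look like the following sketch, which assumes every name carries a fixed 'part' prefix before its digits:

```python
# Sort part filenames by the numeric value of their trailing digits,
# so 'part2' orders before 'part10' even without zero padding.
parts = ['part10', 'part2', 'part1']

parts.sort(key=lambda name: int(name[4:]))   # assumes a fixed 'part' prefix
print(parts)                                 # ['part1', 'part2', 'part10']
```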
Finally, let’s run a few more experiments with these Python system utilities to
demonstrate other usage modes. When run without full command-line
arguments, both split
and
join
are smart enough to input
their parameters interactively. Here they are
chopping and gluing the Python self-installer file on Windows again,
with parameters typed in the DOS console window:
C:\temp> python C:\...\PP4E\System\Filetools\split.py
File to be split? python-3.1.msi
Directory to store part files? splitout
Splitting C:\temp\python-3.1.msi to C:\temp\splitout by 1433600
Split finished: 10 parts are in C:\temp\splitout
Press Enter key

C:\temp> python C:\...\PP4E\System\Filetools\join.py
Directory containing part files? splitout
Name of file to be recreated? newpy31.msi
Joining C:\temp\splitout to make C:\temp\newpy31.msi
Join complete: see C:\temp\newpy31.msi
Press Enter key

C:\temp> fc /B python-3.1.msi newpy31.msi
Comparing files python-3.1.msi and NEWPY31.MSI
FC: no differences encountered
When these program files are double-clicked in a Windows file explorer GUI, they work the same way (there are usually no command-line arguments when they are launched this way). In this mode, absolute path displays help clarify where files really are. Remember, the current working directory is the script’s home directory when clicked like this, so a simple name actually maps to a source code directory; type a full path to make the split files show up somewhere else:
[in a pop-up DOS console box when split.py is clicked]
File to be split? c:\temp\python-3.1.msi
Directory to store part files? c:\temp\parts
Splitting c:\temp\python-3.1.msi to c:\temp\parts by 1433600
Split finished: 10 parts are in c:\temp\parts
Press Enter key

[in a pop-up DOS console box when join.py is clicked]
Directory containing part files? c:\temp\parts
Name of file to be recreated? c:\temp\morepy31.msi
Joining c:\temp\parts to make c:\temp\morepy31.msi
Join complete: see c:\temp\morepy31.msi
Press Enter key
Because these scripts package their core logic in functions, though, it’s just as easy to reuse their code by importing and calling from another Python component (make sure your module import search path includes the directory containing the PP4E root first; the first abbreviated line here is one way to do so):
C:\temp> set PYTHONPATH=C:\...\dev\Examples
C:\temp> python
>>> from PP4E.System.Filetools.split import split
>>> from PP4E.System.Filetools.join import join
>>>
>>> numparts = split('python-3.1.msi', 'calldir')
>>> numparts
10
>>> join('calldir', 'callpy31.msi')
>>>
>>> import os
>>> os.system('fc /B python-3.1.msi callpy31.msi')
Comparing files python-3.1.msi and CALLPY31.msi
FC: no differences encountered
0
A word about performance: all the split
and join
tests shown so far process a 13 MB
file, but they take less than one second of real wall-clock time to
finish on my Windows 7 2GHz Atom processor laptop computer—plenty
fast for just about any use I could imagine. Both scripts run just
as fast for other reasonable part file sizes,
too; here is the splitter chopping up the file into 4MB and 500KB
parts:
C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 4000000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 4000000
Split finished: 4 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
 Directory of C:\temp\tempsplit
02/21/2010  01:27 PM    <DIR>          .
02/21/2010  01:27 PM    <DIR>          ..
02/21/2010  01:27 PM         4,000,000 part0001
02/21/2010  01:27 PM         4,000,000 part0002
02/21/2010  01:27 PM         4,000,000 part0003
02/21/2010  01:27 PM         1,814,272 part0004
               4 File(s)     13,814,272 bytes
               2 Dir(s)  188,671,983,616 bytes free

C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 500000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 500000
Split finished: 28 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
 Directory of C:\temp\tempsplit
02/21/2010  01:27 PM    <DIR>          .
02/21/2010  01:27 PM    <DIR>          ..
02/21/2010  01:27 PM           500,000 part0001
02/21/2010  01:27 PM           500,000 part0002
02/21/2010  01:27 PM           500,000 part0003
02/21/2010  01:27 PM           500,000 part0004
02/21/2010  01:27 PM           500,000 part0005
...more lines omitted...
02/21/2010  01:27 PM           500,000 part0024
02/21/2010  01:27 PM           500,000 part0025
02/21/2010  01:27 PM           500,000 part0026
02/21/2010  01:27 PM           500,000 part0027
02/21/2010  01:27 PM           314,272 part0028
              28 File(s)     13,814,272 bytes
               2 Dir(s)  188,671,946,752 bytes free
The split can take noticeably longer to finish, but only if the part file’s size is set small enough to generate thousands of part files—splitting into 1,382 parts works but runs slower (though some machines today are quick enough that you might not notice):
C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 10000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 10000
Split finished: 1382 parts are in C:\temp\tempsplit

C:\temp> C:\...\PP4E\System\Filetools\join.py tempsplit manypy31.msi
Joining C:\temp\tempsplit to make C:\temp\manypy31.msi
Join complete: see C:\temp\manypy31.msi

C:\temp> fc /B python-3.1.msi manypy31.msi
Comparing files python-3.1.msi and MANYPY31.MSI
FC: no differences encountered

C:\temp> dir tempsplit
...more...
 Directory of C:\temp\tempsplit
02/21/2010  01:40 PM    <DIR>          .
02/21/2010  01:40 PM    <DIR>          ..
02/21/2010  01:39 PM            10,000 part0001
02/21/2010  01:39 PM            10,000 part0002
02/21/2010  01:39 PM            10,000 part0003
02/21/2010  01:39 PM            10,000 part0004
02/21/2010  01:39 PM            10,000 part0005
...over 1,000 lines deleted...
02/21/2010  01:40 PM            10,000 part1378
02/21/2010  01:40 PM            10,000 part1379
02/21/2010  01:40 PM            10,000 part1380
02/21/2010  01:40 PM            10,000 part1381
02/21/2010  01:40 PM             4,272 part1382
            1382 File(s)     13,814,272 bytes
               2 Dir(s)  188,651,008,000 bytes free
Finally, the splitter is also smart enough to create the output directory if it doesn't yet exist and to clear out any old files there if it does exist—the following, for example, leaves only new files in the output directory. Because the joiner combines whatever files exist in the output directory, this is a nice ergonomic touch. If the output directory were not cleared before each split, it would be too easy to forget that a prior run's files are still there. Given the target audience for these scripts, they needed to be as forgiving as possible; your user base may vary (though you often shouldn't assume so).
C:\temp> C:\...\PP4E\System\Filetools\split.py python-3.1.msi tempsplit 5000000
Splitting C:\temp\python-3.1.msi to C:\temp\tempsplit by 5000000
Split finished: 3 parts are in C:\temp\tempsplit

C:\temp> dir tempsplit
...more...
 Directory of C:\temp\tempsplit
02/21/2010  01:47 PM    <DIR>          .
02/21/2010  01:47 PM    <DIR>          ..
02/21/2010  01:47 PM         5,000,000 part0001
02/21/2010  01:47 PM         5,000,000 part0002
02/21/2010  01:47 PM         3,814,272 part0003
               3 File(s)     13,814,272 bytes
               2 Dir(s)  188,654,452,736 bytes free
Of course, the dilemma that these scripts address might today be more easily addressed by simply buying a bigger memory stick or giving kids their own Internet access. Still, once you catch the scripting bug, you’ll find the ease and flexibility of Python to be powerful and enabling tools, especially for writing custom automation scripts like these. When used well, Python may well become your Swiss Army knife of computing.
Moving is rarely painless, even in cyberspace. Changing your website’s Internet address can lead to all sorts of confusion. You need to ask known contacts to use the new address and hope that others will eventually stumble onto it themselves. But if you rely on the Internet, moves are bound to generate at least as much confusion as an address change in the real world.
Unfortunately, such site relocations are often unavoidable. Both Internet Service Providers (ISPs) and server machines can come and go over the years. Moreover, some ISPs let their service fall to intolerably low levels; if you are unlucky enough to have signed up with such an ISP, there is not much recourse but to change providers, and that often implies a change of web addresses.[21]
Imagine, though, that you are an O’Reilly author and have published your website’s address in multiple books sold widely all over the world. What do you do when your ISP’s service level requires a site change? Notifying each of the hundreds of thousands of readers out there isn’t exactly a practical solution.
Probably the best you can do is to leave forwarding instructions at the old site for some reasonably long period of time—the virtual equivalent of a “We’ve Moved” sign in a storefront window. On the Web, such a sign can also send visitors to the new site automatically: simply leave a page at the old site containing a hyperlink to the page’s address at the new site, along with timed auto-relocation specifications. With such forward-link files in place, visitors to the old addresses will be only one click or a few seconds away from reaching the new ones.
That sounds simple enough. But because visitors might try to directly access the address of any file at your old site, you generally need to leave one forward-link file for every old file—HTML pages, images, and so on. Unless your prior server supports auto-redirection (and mine did not), this represents a dilemma. If you happen to enjoy doing lots of mindless typing, you could create each forward-link file by hand. But given that my home site contained over 100 HTML files at the time I wrote this paragraph, the prospect of running one editor session per file was more than enough motivation for an automated solution.
Here’s what I came up with. First of all, I create a general page template text file, shown in Example 6-7, to describe how all the forward-link files should look, with parts to be filled in later.
<HTML>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="10; URL=http://$server$/$home$/$file$">
<title>Site Redirection Page: $file$</title>
</head>
<BODY>
<H1>This page has moved</H1>

<P>This page now lives at this address:

<P><A HREF="http://$server$/$home$/$file$">
http://$server$/$home$/$file$</A>

<P>Please click on the new address to jump to this page, and
update any links accordingly.  You will be redirected shortly.
</P>
<HR>
</BODY></HTML>
To fully understand this template, you have to know something
about HTML, a web page description language that we’ll explore in
Part IV. But for the purposes of
this example, you can ignore most of this file and focus on just the
parts surrounded by dollar signs: the strings $server$
, $home$
, and $file$
are targets to be replaced with
real values by global text substitutions. They represent items that
vary per site relocation and file.
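The substitution itself is just a chain of str.replace calls; as a sketch with sample values:

```python
# Global text substitution of $-delimited targets, the same scheme the
# generator script applies to the full template file's text.
template = '<A HREF="http://$server$/$home$/$file$">moved</A>'

text = template.replace('$server$', 'learning-python.com')
text = text.replace('$home$', 'books')
text = text.replace('$file$', 'index.html')
print(text)   # <A HREF="http://learning-python.com/books/index.html">moved</A>
```

The standard library's string.Template class offers a similar $name substitution scheme, though this script's blind replace calls need no imports at all.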
Now, given a page template file, the Python script in Example 6-8 generates all the required forward-link files automatically.
""" ################################################################################ Create forward-link pages for relocating a web site. Generates one page for every existing site html file; upload the generated files to your old web site. See ftplib later in the book for ways to run uploads in scripts either after or during page file creation. ################################################################################ """ import os servername = 'learning-python.com' # where site is relocating to homedir = 'books' # where site will be rooted sitefilesdir = r'C: emppublic_html' # where site files live locally uploaddir = r'C: empisp-forward' # where to store forward files templatename = 'template.html' # template for generated pages try: os.mkdir(uploaddir) # make upload dir if needed except OSError: pass template = open(templatename).read() # load or import template text sitefiles = os.listdir(sitefilesdir) # filenames, no directory prefix count = 0 for filename in sitefiles: if filename.endswith('.html') or filename.endswith('.htm'): fwdname = os.path.join(uploaddir, filename) print('creating', filename, 'as', fwdname) filetext = template.replace('$server$', servername) # insert text filetext = filetext.replace('$home$', homedir) # and write filetext = filetext.replace('$file$', filename) # file varies open(fwdname, 'w').write(filetext) count += 1 print('Last file => ', filetext, sep='') print('Done:', count, 'forward files created.')
Notice that the template’s text is loaded by reading a file; it would work just as well to code it as an imported Python string variable (e.g., a triple-quoted string in a module file). Also observe that all configuration options are assignments at the top of the script, not command-line arguments; since they change so seldom, it’s convenient to type them just once in the script itself.
But the main thing worth noticing here is that this script doesn’t care what the template file looks like at all; it simply performs global substitutions blindly in its text, with a different filename value for each generated file. In fact, we can change the template file any way we like without having to touch the script. Though a fairly simple technique, such a division of labor can be used in all sorts of contexts—generating “makefiles,” form letters, HTML replies from CGI scripts on web servers, and so on. In terms of library tools, the generator script:
Uses os.listdir
to step
through all the filenames in the site’s directory (glob.glob
would work too, but may
require stripping directory prefixes from file names)
Uses the string object’s replace
method to perform global
search-and-replace operations that fill in the $
-delimited targets in the template
file’s text, and endswith
to
skip non-HTML files (e.g., images—most browsers won’t know what
to do with HTML text in a “.jpg” file)
Uses os.path.join
and
built-in file objects to write the resulting text out to a
forward-link file of the
same name in an output directory
The end result is a mirror image of the original website directory, containing only forward-link files generated from the page template. As an added bonus, the generator script can be run on just about any Python platform—I can run it on my Windows laptop (where I’m writing this book), as well as on a Linux server (where my http://learning-python.com domain is hosted). Here it is in action on Windows:
C:\...\PP4E\System\Filetools> python site-forward.py
creating about-lp.html as C:\temp\isp-forward\about-lp.html
creating about-lp1e.html as C:\temp\isp-forward\about-lp1e.html
creating about-lp2e.html as C:\temp\isp-forward\about-lp2e.html
creating about-lp3e.html as C:\temp\isp-forward\about-lp3e.html
creating about-lp4e.html as C:\temp\isp-forward\about-lp4e.html
...many more lines deleted...
creating training.html as C:\temp\isp-forward\training.html
creating whatsnew.html as C:\temp\isp-forward\whatsnew.html
creating whatsold.html as C:\temp\isp-forward\whatsold.html
creating xlate-lp.html as C:\temp\isp-forward\xlate-lp.html
creating zopeoutline.htm as C:\temp\isp-forward\zopeoutline.htm
Last file =>
<HTML>
<head>
<META HTTP-EQUIV="Refresh" CONTENT="10; URL=http://learning-python.com/books/zopeoutline.htm">
<title>Site Redirection Page: zopeoutline.htm</title>
</head>
<BODY>
<H1>This page has moved</H1>
<P>This page now lives at this address:
<P><A HREF="http://learning-python.com/books/zopeoutline.htm">
http://learning-python.com/books/zopeoutline.htm</A>
<P>Please click on the new address to jump to this page, and
update any links accordingly.  You will be redirected shortly.
</P>
<HR>
</BODY></HTML>
Done: 124 forward files created.
To verify this script's output, double-click on any of the
output files to see what they look like in a web browser (or run a
start command in a DOS console on
Windows—e.g., start isp-forward\about-lp4e.html). Figure 6-1 shows what one generated
page looks like on my machine.
To complete the process, you still need to install the forward links: upload all the generated files in the output directory to your old site's web directory. If that's too much to do by hand, too, be sure to see the FTP site upload scripts in Chapter 13 for an automatic way to do that step with Python as well (PP4E\Internet\Ftp\uploadflat.py will do the job). Once you've started scripting in earnest, you'll be amazed at how much manual labor Python can automate. The next section provides another prime example.
Mistakes happen. As we've seen, Python provides interfaces to a variety of system services, along with tools for adding others. Example 6-9 shows some of the more commonly used system tools in action. It implements a simple regression test system for Python scripts—it runs each script in a directory with provided input and command-line arguments, and compares the output of each run to the prior run's results. As such, this script can be used as an automated testing system to catch errors introduced by changes in program source files; in a big system, you might not know when a fix is really a bug in disguise.
""" ################################################################################ Test a directory of Python scripts, passing command-line arguments, piping in stdin, and capturing stdout, stderr, and exit status to detect failures and regressions from prior run outputs. The subprocess module spawns and controls streams (much like os.popen3 in Python 2.X), and is cross-platform. Streams are always binary bytes in subprocess. Test inputs, args, outputs, and errors map to files in subdirectories. This is a command-line script, using command-line arguments for optional test directory name, and force-generation flag. While we could package it as a callable function, the fact that its results are messages and output files makes a call/return model less useful. Suggested enhancement: could be extended to allow multiple sets of command-line arguments and/or inputs per test script, to run a script multiple times (glob for multiple ".in*" files in Inputs?). Might also seem simpler to store all test files in same directory with different extensions, but this could grow large over time. Could also save both stderr and stdout to Errors on failures, but I prefer to have expected/actual output in Outputs on regressions. 
################################################################################ """ import os, sys, glob, time from subprocess import Popen, PIPE # configuration args testdir = sys.argv[1] if len(sys.argv) > 1 else os.curdir forcegen = len(sys.argv) > 2 print('Start tester:', time.asctime()) print('in', os.path.abspath(testdir)) def verbose(*args): print('-'*80) for arg in args: print(arg) def quiet(*args): pass trace = quiet # glob scripts to be tested testpatt = os.path.join(testdir, 'Scripts', '*.py') testfiles = glob.glob(testpatt) testfiles.sort() trace(os.getcwd(), *testfiles) numfail = 0 for testpath in testfiles: # run all tests in dir testname = os.path.basename(testpath) # strip directory path # get input and args infile = testname.replace('.py', '.in') inpath = os.path.join(testdir, 'Inputs', infile) indata = open(inpath, 'rb').read() if os.path.exists(inpath) else b'' argfile = testname.replace('.py', '.args') argpath = os.path.join(testdir, 'Args', argfile) argdata = open(argpath).read() if os.path.exists(argpath) else '' # locate output and error, scrub prior results outfile = testname.replace('.py', '.out') outpath = os.path.join(testdir, 'Outputs', outfile) outpathbad = outpath + '.bad' if os.path.exists(outpathbad): os.remove(outpathbad) errfile = testname.replace('.py', '.err') errpath = os.path.join(testdir, 'Errors', errfile) if os.path.exists(errpath): os.remove(errpath) # run test with redirected streams pypath = sys.executable command = '%s %s %s' % (pypath, testpath, argdata) trace(command, indata) process = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE) process.stdin.write(indata) process.stdin.close() outdata = process.stdout.read() errdata = process.stderr.read() # data are bytes exitstatus = process.wait() # requires binary files trace(outdata, errdata, exitstatus) # analyze results if exitstatus != 0: print('ERROR status:', testname, exitstatus) # status and/or stderr if errdata: print('ERROR stream:', testname, 
errpath) # save error text open(errpath, 'wb').write(errdata) if exitstatus or errdata: # consider both failure numfail += 1 # can get status+stderr open(outpathbad, 'wb').write(outdata) # save output to view elif not os.path.exists(outpath) or forcegen: print('generating:', outpath) # create first output open(outpath, 'wb').write(outdata) else: priorout = open(outpath, 'rb').read() # or compare to prior if priorout == outdata: print('passed:', testname) else: numfail += 1 print('FAILED output:', testname, outpathbad) open(outpathbad, 'wb').write(outdata) print('Finished:', time.asctime()) print('%s tests were run, %s tests failed.' % (len(testfiles), numfail))
We’ve seen the tools used by this script earlier in this part of
the book—subprocess
, os.path
, glob
, files, and the like. This example
largely just pulls these tools together to solve a useful purpose. Its
core operation is comparing new outputs to old, in order to spot
changes (“regressions”). Along the way, it also manages command-line
arguments, error messages, status codes, and files.
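As a minimal, hypothetical sketch of that core operation (not the tester itself), the following spawns a trivial command with subprocess, pipes in canned input, and collects both streams plus the exit status; the quoting around sys.executable is a defensive assumption for interpreter paths containing spaces:

```python
import sys
from subprocess import Popen, PIPE

# a stand-in test script: echo one line of piped-in stdin back to stdout
command = '"%s" -c "print(input())"' % sys.executable

process = Popen(command, shell=True, stdin=PIPE, stdout=PIPE, stderr=PIPE)
process.stdin.write(b'spam\n')      # canned input: streams are binary bytes
process.stdin.close()
outdata = process.stdout.read()     # captured stdout, as bytes
errdata = process.stderr.read()     # captured stderr, as bytes
exitstatus = process.wait()         # exit status code

print(outdata, errdata, exitstatus)
```

A regression check then reduces to comparing outdata against a saved prior run, and flagging any nonzero exitstatus or nonempty errdata as a failure.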
This script is also larger than most we’ve seen so far, but it’s a realistic and representative system administration tool (in fact, it’s derived from a similar tool I actually used in the past to detect changes in a compiler). Probably the best way to understand how it works is to demonstrate what it does. The next section steps through a testing session to be read in conjunction with studying the test script’s code.
Much of the magic behind the test driver script in Example 6-9 has to do with its directory structure. When you run it for the first time in a test directory (or force it to start from scratch there by passing a second command-line argument), it:
Collects scripts to be run in the Scripts
subdirectory
Fetches any associated script input and command-line
arguments from the Inputs
and
Args
subdirectories
Generates initial stdout output files for tests that
exit normally in the Outputs
subdirectory
Reports tests that fail either by exit status code or by error messages appearing in stderr
On all failures, the script also saves any stderr error message text, as well as any
stdout data generated up to the
point of failure; standard error text is saved to a file in the
Errors
subdirectory, and standard
output of failed tests is saved with a special “.bad” filename
extension in Outputs
(saving this
normally in the Outputs
subdirectory would trigger a failure when the test is later fixed!).
Here’s a first run:
C:\...\PP4E\System\Tester> python tester.py . 1
Start tester: Mon Feb 22 22:13:38 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
generating: .\Outputs\test-basic-args.out
generating: .\Outputs\test-basic-stdout.out
generating: .\Outputs\test-basic-streams.out
generating: .\Outputs\test-basic-this.out
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
generating: .\Outputs\test-status-good.out
Finished: Mon Feb 22 22:13:41 2010
8 tests were run, 3 tests failed.
To run each script, the tester configures any preset
command-line arguments provided, pipes in fetched canned input (if
any), and captures the script’s standard output and error streams,
along with its exit status code. When I ran this example, there were
8 test scripts, along with a variety of inputs and outputs. Since
the directory and file naming structures are the key to this
example, here is a listing of the test directory I used—the Scripts
directory is primary, because
that’s where tests to be run are collected:
C:\...\PP4E\System\Tester> dir /B
Args
Errors
Inputs
Outputs
Scripts
tester.py
xxold

C:\...\PP4E\System\Tester> dir /B Scripts
test-basic-args.py
test-basic-stdout.py
test-basic-streams.py
test-basic-this.py
test-errors-runtime.py
test-errors-syntax.py
test-status-bad.py
test-status-good.py
The other subdirectories contain any required inputs and any generated outputs associated with scripts to be tested:
C:...PP4ESystemTester>dir /B Args
test-basic-args.args test-status-good.args C:...PP4ESystemTester>dir /B Inputs
test-basic-args.in test-basic-streams.in C:...PP4ESystemTester>dir /B Outputs
test-basic-args.out test-basic-stdout.out test-basic-streams.out test-basic-this.out test-errors-runtime.out.bad test-errors-syntax.out.bad test-status-bad.out.bad test-status-good.out C:...PP4ESystemTester>dir /B Errors
test-errors-runtime.err test-errors-syntax.err
I won’t list all these files here (as you can see, there are many, and all are available in the book examples distribution package), but to give you the general flavor, here are the files associated with the test script test-basic-args.py:
C:\...\PP4E\System\Tester> type Scripts\test-basic-args.py
# test args, streams
import sys, os
print(os.getcwd())                  # to Outputs
print(sys.path[0])
print('[argv]')
for arg in sys.argv:                # from Args
    print(arg)                      # to Outputs
print('[interaction]')              # to Outputs
text = input('Enter text:')         # from Inputs
rept = sys.stdin.readline()         # from Inputs
sys.stdout.write(text * int(rept))  # to Outputs

C:\...\PP4E\System\Tester> type Args\test-basic-args.args
-command -line --stuff

C:\...\PP4E\System\Tester> type Inputs\test-basic-args.in
Eggs
10

C:\...\PP4E\System\Tester> type Outputs\test-basic-args.out
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester\Scripts
[argv]
.\Scripts\test-basic-args.py
-command
-line
--stuff
[interaction]
Enter text:EggsEggsEggsEggsEggsEggsEggsEggsEggsEggs
And here are two files related to one of the detected errors—the first is its captured stderr, and the second is its stdout generated up to the point where the error occurred; these are for human (or other tools) inspection, and are automatically removed the next time the tester script runs:
C:\...\PP4E\System\Tester> type Errors\test-errors-runtime.err
Traceback (most recent call last):
  File ".\Scripts\test-errors-runtime.py", line 3, in <module>
    print(1 / 0)
ZeroDivisionError: int division or modulo by zero

C:\...\PP4E\System\Tester> type Outputs\test-errors-runtime.out.bad
starting
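The save-on-failure policy these two files illustrate can be sketched in isolation like this; note that the function name, signature, and directory arguments here are hypothetical illustrations, not the tester's actual code:

```python
# Hedged sketch of the save-on-failure policy described above: stderr text
# goes to the Errors subdirectory, and the failed test's stdout goes to
# Outputs under a ".bad" suffix so it is never confused with the saved
# expected output for the test. All names here are illustrative only.
import os

def save_failure(testname, stdout_bytes, stderr_bytes,
                 errors='Errors', outputs='Outputs'):
    if stderr_bytes:
        with open(os.path.join(errors, testname + '.err'), 'wb') as f:
            f.write(stderr_bytes)                 # save error stream text
    with open(os.path.join(outputs, testname + '.out.bad'), 'wb') as f:
        f.write(stdout_bytes)                     # ".bad": not expected output
```

Because the expected output lives at `Outputs\name.out` and the failure record at `Outputs\name.out.bad`, fixing the test later and rerunning compares against the original expected file, untouched by the failure.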
Now, when run again without making any changes to the tests, the test driver script compares saved prior outputs to new ones and detects no regressions; failures designated by exit status and stderr messages are still reported as before, but there are no deviations from other tests’ saved expected output:
C:\...\PP4E\System\Tester> python tester.py
Start tester: Mon Feb 22 22:26:41 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
passed: test-basic-args.py
passed: test-basic-stdout.py
passed: test-basic-streams.py
passed: test-basic-this.py
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
passed: test-status-good.py
Finished: Mon Feb 22 22:26:43 2010
8 tests were run, 3 tests failed.
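The expected-output logic behind the “generating” and “passed” messages can be sketched like this; again, this is a hypothetical illustration with made-up names, not the tester's actual code:

```python
# Hedged sketch of the regression check described above: on a test's first
# run its stdout is saved as the expected result; later runs compare new
# stdout to the saved file. Names and return values are illustrative only.
import os

def check_regression(testname, new_output, outputs='Outputs'):
    expected = os.path.join(outputs, testname + '.out')
    if not os.path.exists(expected):            # first run: record output
        with open(expected, 'wb') as f:
            f.write(new_output)
        return 'generating'
    with open(expected, 'rb') as f:
        prior = f.read()
    return 'passed' if prior == new_output else 'failed'
```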
But when I make a change in one of the test scripts that will produce different output (I changed a loop counter to print fewer lines), the regression is caught and reported; the new and different output of the script is reported as a failure and saved in Outputs as a “.bad” file for later viewing:
C:\...\PP4E\System\Tester> python tester.py
Start tester: Mon Feb 22 22:28:35 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
passed: test-basic-args.py
FAILED output: test-basic-stdout.py .\Outputs\test-basic-stdout.out.bad
passed: test-basic-streams.py
passed: test-basic-this.py
ERROR status: test-errors-runtime.py 1
ERROR stream: test-errors-runtime.py .\Errors\test-errors-runtime.err
ERROR status: test-errors-syntax.py 1
ERROR stream: test-errors-syntax.py .\Errors\test-errors-syntax.err
ERROR status: test-status-bad.py 42
passed: test-status-good.py
Finished: Mon Feb 22 22:28:38 2010
8 tests were run, 4 tests failed.

C:\...\PP4E\System\Tester> type Outputs\test-basic-stdout.out.bad
begin
Spam!
Spam!Spam!
Spam!Spam!Spam!
Spam!Spam!Spam!Spam!
end
One last usage note: if you change the trace variable in this script to be verbose, you’ll get much more output designed to help you trace the program’s operation (but probably too much for real testing runs):
C:\...\PP4E\System\Tester> tester.py
Start tester: Mon Feb 22 22:34:51 2010
in C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
--------------------------------------------------------------------------------
C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Tester
.\Scripts\test-basic-args.py
.\Scripts\test-basic-stdout.py
.\Scripts\test-basic-streams.py
.\Scripts\test-basic-this.py
.\Scripts\test-errors-runtime.py
.\Scripts\test-errors-syntax.py
.\Scripts\test-status-bad.py
.\Scripts\test-status-good.py
--------------------------------------------------------------------------------
C:\Python31\python.exe .\Scripts\test-basic-args.py -command -line --stuff
b'Eggs\n10\n'
--------------------------------------------------------------------------------
b'C:\\Users\\mark\\Stuff\\Books\\4E\\PP4E\\dev\\Examples\\PP4E\\System\\Tester\r\n
C:\\Users\\mark\\Stuff\\Books\\4E\\PP4E\\dev\\Examples\\PP4E\\System\\Tester\\Scripts\r\n
[argv]\r\n.\\Scripts\\test-basic-args.py\r\n-command\r\n-line\r\n--stuff\r\n
[interaction]\r\nEnter text:EggsEggsEggsEggsEggsEggsEggsEggsEggsEggs'
b''
0
passed: test-basic-args.py
...more lines deleted...
Study the test driver’s code for more details. Naturally, there is much more to the general testing story than we have space for here. For example, in-process tests don’t need to spawn programs and can generally make do with importing modules and testing them in try statements with exception handlers.
There is also ample room for expansion and customization in our
testing script (see its docstring for starters). Moreover, Python
comes with two testing frameworks, doctest
and unittest
(a.k.a. PyUnit), which provide
techniques and structures for coding regression and unit
tests:
unittest
An object-oriented framework that specifies test cases, expected results, and test suites. Subclasses provide test methods and use inherited assertion calls to specify expected results.
doctest
Parses out and reruns tests from an interactive session log that is pasted into a module’s docstrings. The logs give test calls and expected results; doctest essentially reruns the interactive session.
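To give a small, generic taste of both frameworks, here is a file that carries doctest examples in a docstring and a unittest test case side by side; these samples are mine, not part of the book's Tester directory:

```python
# Generic illustrations of the two standard-library test frameworks; the
# square function, its docstring examples, and SquareTests are all made up
# for this demo. Run via "python -m doctest file.py" or "python -m unittest".
import doctest
import unittest

def square(x):
    """Return x squared.

    >>> square(3)
    9
    >>> square(-2)
    4
    """
    return x * x

class SquareTests(unittest.TestCase):       # unittest: subclass TestCase
    def test_positive(self):
        self.assertEqual(square(3), 9)      # inherited assertion calls
    def test_negative(self):
        self.assertEqual(square(-2), 4)
```

doctest literally reruns the `>>>` lines and compares printed results; unittest discovers the `test_*` methods and runs each as a test case.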
See the Python library manual, the PyPI website, and your favorite Web search engine for additional testing toolkits in both Python itself and the third-party domain.
For automated testing of Python command-line scripts that run as independent programs and tap into standard script execution context, though, our tester does the job. Because the test driver is fully independent of the scripts it tests, we can drop in new test cases without having to update the driver’s code. And because it is written in Python, it’s quick and easy to change as our testing needs evolve. As we’ll see again in the next section, this “scriptability” that Python provides can be a decided advantage for real tasks.
My CD writer sometimes does weird things. In fact, copies of files with odd names can be totally botched on the CD, even though other files show up in one piece. That’s not necessarily a showstopper; if just a few files are trashed in a big CD backup copy, I can always copy the offending files elsewhere one at a time. Unfortunately, drag-and-drop copies on some versions of Windows don’t play nicely with such a CD: the copy operation stops and exits the moment the first bad file is encountered. You get only as many files as were copied up to the error, but no more.
In fact, this is not limited to CD copies. I’ve run into similar problems when trying to back up my laptop’s hard drive to another drive—the drag-and-drop copy stops with an error as soon as it reaches a file with a name that is too long or odd to copy (common in saved web pages). The last 30 minutes spent copying is wasted time; frustrating, to say the least!
There may be some magical Windows setting to work around this feature, but I gave up hunting for one as soon as I realized that it would be easier to code a copier in Python. The cpall.py script in Example 6-10 is one way to do it. With this script, I control what happens when bad files are found—I can skip over them with Python exception handlers, for instance. Moreover, this tool works with the same interface and effect on other platforms. It seems to me, at least, that a few minutes spent writing a portable and reusable Python script to meet a need is a better investment than looking for solutions that work on only one platform (if at all).
""" ################################################################################ Usage: "python cpall.py dirFrom dirTo". Recursive copy of a directory tree. Works like a "cp -r dirFrom/* dirTo" Unix command, and assumes that dirFrom and dirTo are both directories. Was written to get around fatal error messages under Windows drag-and-drop copies (the first bad file ends the entire copy operation immediately), but also allows for coding more customized copy operations in Python. ################################################################################ """ import os, sys maxfileload = 1000000 blksize = 1024 * 500 def copyfile(pathFrom, pathTo, maxfileload=maxfileload): """ Copy one file pathFrom to pathTo, byte for byte; uses binary file modes to supress Unicde decode and endline transform """ if os.path.getsize(pathFrom) <= maxfileload: bytesFrom = open(pathFrom, 'rb').read() # read small file all at once open(pathTo, 'wb').write(bytesFrom) else: fileFrom = open(pathFrom, 'rb') # read big files in chunks fileTo = open(pathTo, 'wb') # need b mode for both while True: bytesFrom = fileFrom.read(blksize) # get one block, less at end if not bytesFrom: break # empty after last chunk fileTo.write(bytesFrom) def copytree(dirFrom, dirTo, verbose=0): """ Copy contents of dirFrom and below to dirTo, return (files, dirs) counts; may need to use bytes for dirnames if undecodable on other platforms; may need to do more file type checking on Unix: skip links, fifos, etc. 
""" fcount = dcount = 0 for filename in os.listdir(dirFrom): # for files/dirs here pathFrom = os.path.join(dirFrom, filename) pathTo = os.path.join(dirTo, filename) # extend both paths if not os.path.isdir(pathFrom): # copy simple files try: if verbose > 1: print('copying', pathFrom, 'to', pathTo) copyfile(pathFrom, pathTo) fcount += 1 except: print('Error copying', pathFrom, 'to', pathTo, '--skipped') print(sys.exc_info()[0], sys.exc_info()[1]) else: if verbose: print('copying dir', pathFrom, 'to', pathTo) try: os.mkdir(pathTo) # make new subdir below = copytree(pathFrom, pathTo) # recur into subdirs fcount += below[0] # add subdir counts dcount += below[1] dcount += 1 except: print('Error creating', pathTo, '--skipped') print(sys.exc_info()[0], sys.exc_info()[1]) return (fcount, dcount) def getargs(): """ Get and verify directory name arguments, returns default None on errors """ try: dirFrom, dirTo = sys.argv[1:] except: print('Usage error: cpall.py dirFrom dirTo') else: if not os.path.isdir(dirFrom): print('Error: dirFrom is not a directory') elif not os.path.exists(dirTo): os.mkdir(dirTo) print('Note: dirTo was created') return (dirFrom, dirTo) else: print('Warning: dirTo already exists') if hasattr(os.path, 'samefile'): same = os.path.samefile(dirFrom, dirTo) else: same = os.path.abspath(dirFrom) == os.path.abspath(dirTo) if same: print('Error: dirFrom same as dirTo') else: return (dirFrom, dirTo) if __name__ == '__main__': import time dirstuple = getargs() if dirstuple: print('Copying...') start = time.clock() fcount, dcount = copytree(*dirstuple) print('Copied', fcount, 'files,', dcount, 'directories', end=' ') print('in', time.clock() - start, 'seconds')
This script implements its own recursive tree traversal logic
and keeps track of both the “from” and “to” directory paths as it
goes. At every level, it copies over simple files, creates directories
in the “to” path, and recurs into subdirectories with “from” and “to”
paths extended by one level. There are other ways to code this task (e.g., we might change the working directory along the way with os.chdir calls, or use an os.walk solution that replaces “from” and “to” path prefixes as it walks), but extending paths on recursive descent works well in this script.
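To make the os.walk alternative concrete, here is a hedged sketch of that prefix-replacing approach; the function name is hypothetical, and the copier is passed in rather than fixed, so this is an illustration rather than a drop-in replacement for the script's copytree:

```python
# Sketch of an os.walk-based tree copy: instead of recurring explicitly,
# walk the "from" tree top-down and swap the "from" prefix for the "to"
# prefix at each level. copyfile is any two-argument file copier (e.g.,
# cpall's copyfile). Name and interface are illustrative assumptions.
import os

def copytree_walk(dirFrom, dirTo, copyfile):
    for dirpath, subshere, fileshere in os.walk(dirFrom):
        # compute the matching "to" directory by replacing the path prefix
        todir = os.path.normpath(
            os.path.join(dirTo, os.path.relpath(dirpath, dirFrom)))
        if not os.path.isdir(todir):
            os.mkdir(todir)                       # top-down: parent exists
        for name in fileshere:
            try:
                copyfile(os.path.join(dirpath, name),
                         os.path.join(todir, name))
            except Exception as why:              # skip bad files, keep going
                print('Error copying', name, '--skipped', why)
```

Because os.walk visits directories top-down by default, each target subdirectory's parent has already been created by the time it is needed.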
Notice this script’s reusable copyfile
function—just in case there are
multigigabyte files in the tree to be copied, it uses a file’s size to
decide whether it should be read all at once or in chunks (remember,
the file read
method without
arguments actually loads the entire file into an in-memory string). We
choose fairly large file and block sizes, because the more we read at
once in Python, the faster our scripts will typically run. This is
more efficient than it may sound; strings left behind by prior reads
will be garbage collected and reused as we go. We’re using binary file
modes here again, too, to suppress the Unicode encodings and
end-of-line translations of text files—trees may contain arbitrary
kinds of files.
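As an aside, the standard library's shutil module can perform the same chunked binary copy for us; a copyfile variant built on it might look like the following sketch (an alternative for comparison, not the book's code):

```python
# A copyfile alternative built on the standard library: shutil.copyfileobj
# copies from one open file object to another in blocks of a given size.
# Binary modes again suppress Unicode decoding and endline translations.
import shutil

def copyfile_shutil(pathFrom, pathTo, blksize=1024 * 500):
    with open(pathFrom, 'rb') as fileFrom, open(pathTo, 'wb') as fileTo:
        shutil.copyfileobj(fileFrom, fileTo, blksize)
```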
Also notice that this script creates the “to” directory if
needed, but it assumes that the directory is empty when a copy starts
up; for accuracy, be sure to remove the target directory before
copying a new tree to its name, or old files may linger in the target
tree (we could automatically remove the target first, but this may not
always be desired). This script also tries to determine if the source
and target are the same; on Unix-like platforms with oddities such as
links, os.path.samefile
does a more
accurate job than comparing absolute file names (different file names
may be the same file).
Here is a copy of a big book examples tree (I use the tree from
the prior edition throughout this chapter) in action on Windows; pass
in the name of the “from” and “to” directories to kick off the
process, redirect the output to a file if there are too many error
messages to read all at once (e.g., >
output.txt
), and run an rm -r or rmdir /S shell
command (or similar platform-specific tool) to delete the target
directory first if needed:
C:\...\PP4E\System\Filetools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\System\Filetools> cpall.py C:\temp\PP3E\Examples copytemp
Note: dirTo was created
Copying...
Copied 1430 files, 185 directories in 10.4470980971 seconds

C:\...\PP4E\System\Filetools> fc /B copytemp\PP3E\Launcher.py C:\temp\PP3E\Examples\PP3E\Launcher.py
Comparing files COPYTEMP\PP3E\Launcher.py and C:\TEMP\PP3E\EXAMPLES\PP3E\LAUNCHER.PY
FC: no differences encountered
You can use the copy function’s verbose
argument to trace the process if you
wish. At the time I wrote this edition in 2010, this test run copied a
tree of 1,430 files and 185 directories in 10 seconds on my woefully
underpowered netbook machine (the built-in time.clock
call is used to query the system
time in seconds); it may run arbitrarily faster or slower for you.
Still, this is at least as fast as the best drag-and-drop I’ve timed
on this machine.
So how does this script work around bad files on a CD backup? The secret is that it catches and ignores file exceptions, and it keeps walking. To copy all the files that are good on a CD, I simply run a command line such as this one:
C:\...\PP4E\System\Filetools> python cpall.py G:\Examples C:\PP3E\Examples
Because the CD is addressed as “G:” on my Windows machine, this is the command-line equivalent of drag-and-drop copying from an item in the CD’s top-level folder, except that the Python script will recover from errors on the CD and get the rest. On copy errors, it prints a message to standard output and continues; for big copies, you’ll probably want to redirect the script’s output to a file for later inspection.
In general, cpall
can be
passed any absolute directory path on your machine, even those that
indicate devices such as CDs. To make this go on Linux, try a root
directory such as /dev/cdrom or something similar
to address your CD drive. Once you’ve copied a tree this way, you
still might want to verify; to see how, let’s move on to the next
example.
Engineers can be a paranoid sort (but you didn’t hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there’s still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it’s tough to have a lot of faith in tools that occasionally trash files and seem to crash my Windows machine every third Tuesday of the month. When push comes to shove, it’s nice to be able to verify that data copied to a backup CD is the same as the original—or at least to spot deviations from the original—as soon as possible. If a backup is ever needed, it will be really needed.
Because data CDs are accessible as simple directory trees in the
file system, we are once again in the realm of tree walkers—to verify
a backup CD, we simply need to walk its top-level directory. If our
script is general enough, we will also be able to use it to verify
other copy operations as well—e.g., downloaded tar files, hard-drive
backups, and so on. In fact, the combination of the cpall
script of the prior section and a
general tree comparison would provide a portable and scriptable way to
copy and verify data sets.
We’ve already studied generic directory tree walkers, but they won’t help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won’t allow us to spot files and directories that exist only in the other. Something more custom and recursive seems in order here.
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depths, and so on.
More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare with; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level’s directory.
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested and simpler operation, let’s start by coding and debugging a single-directory comparison of filenames in Example 6-11.
""" ################################################################################ Usage: python dirdiff.py dir1-path dir2-path Compare two directories to find files that exist in one but not the other. This version uses the os.listdir function and list difference. Note that this script checks only filenames, not file contents--see diffall.py for an extension that does the latter by comparing .read() results. ################################################################################ """ import os, sys def reportdiffs(unique1, unique2, dir1, dir2): """ Generate diffs report for one dir: part of comparedirs output """ if not (unique1 or unique2): print('Directory lists are identical') else: if unique1: print('Files unique to', dir1) for file in unique1: print('...', file) if unique2: print('Files unique to', dir2) for file in unique2: print('...', file) def difference(seq1, seq2): """ Return all items in seq1 only; a set(seq1) - set(seq2) would work too, but sets are randomly ordered, so any platform-dependent directory order would be lost """ return [item for item in seq1 if item not in seq2] def comparedirs(dir1, dir2, files1=None, files2=None): """ Compare directory contents, but not actual files; may need bytes listdir arg for undecodable filenames on some platforms """ print('Comparing', dir1, 'to', dir2) files1 = os.listdir(dir1) if files1 is None else files1 files2 = os.listdir(dir2) if files2 is None else files2 unique1 = difference(files1, files2) unique2 = difference(files2, files1) reportdiffs(unique1, unique2, dir1, dir2) return not (unique1 or unique2) # true if no diffs def getargs(): "Args for command-line mode" try: dir1, dir2 = sys.argv[1:] # 2 command-line args except: print('Usage: dirdiff.py dir1 dir2') sys.exit(1) else: return (dir1, dir2) if __name__ == '__main__': dir1, dir2 = getargs() comparedirs(dir1, dir2)
Given listings of names in two directories, this script simply
picks out unique names in the first and unique names in the second,
and reports any unique names found as differences (that is, files in
one directory but not the other). Its comparedirs
function returns a true result if no differences were found,
which is useful for detecting differences in callers.
Let’s run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we’ll add the latter in a moment):
C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples copytemp
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical

C:\...\PP4E\System\Filetools> dirdiff.py C:\temp\PP3E\Examples\PP3E\System ..
Comparing C:\temp\PP3E\Examples\PP3E\System to ..
Files unique to C:\temp\PP3E\Examples\PP3E\System
... App
... Exits
... Media
... moreplus.py
Files unique to ..
... more.pyc
... spam.txt
... Tester
... __init__.pyc
The difference function is the heart of this script: it performs a simple list difference operation. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversals. In fact, in Python 2.4 and later, we could also use the built-in set object type if we don’t care about the order in the results; because sets are not sequences, they would not maintain any original and possibly platform-specific left-to-right order of the directory listings provided by os.listdir. For that reason (and to avoid requiring users to upgrade), we’ll keep using our own comprehension-based function instead of sets.
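A quick demonstration of the ordering point, using hypothetical directory listings; the comprehension keeps the first listing's order, while a set difference yields the same members in an arbitrary order:

```python
# The script's list-difference approach versus sets: same members, but only
# the comprehension preserves the first listing's original order. The file
# names here are made-up sample data.
def difference(seq1, seq2):
    return [item for item in seq1 if item not in seq2]

files1 = ['setup.py', 'docs', 'README.txt', 'src']   # hypothetical listing
files2 = ['docs', 'src']

assert difference(files1, files2) == ['setup.py', 'README.txt']   # order kept
assert set(files1) - set(files2) == {'setup.py', 'README.txt'}    # order lost
```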
We’ve just coded a directory comparison tool that picks out unique files and
directories. Now all we need is a tree walker that applies dirdiff
at each
level to report unique items, explicitly compares the contents of
files in common, and descends through directories in common. Example 6-12 fits the bill.
""" ################################################################################ Usage: "python diffall.py dir1 dir2". Recursive directory tree comparison: report unique files that exist in only dir1 or dir2, report files of the same name in dir1 and dir2 with differing contents, report instances of same name but different type in dir1 and dir2, and do the same for all subdirectories of the same names in and below dir1 and dir2. A summary of diffs appears at end of output, but search redirected output for "DIFF" and "unique" strings for further details. New: (3E) limit reads to 1M for large files, (3E) catch same name=file/dir, (4E) avoid extra os.listdir() calls in dirdiff.comparedirs() by passing results here along. ################################################################################ """ import os, dirdiff blocksize = 1024 * 1024 # up to 1M per read def intersect(seq1, seq2): """ Return all items in both seq1 and seq2; a set(seq1) & set(seq2) woud work too, but sets are randomly ordered, so any platform-dependent directory order would be lost """ return [item for item in seq1 if item in seq2] def comparetrees(dir1, dir2, diffs, verbose=False): """ Compare all subdirectories and files in two directory trees; uses binary files to prevent Unicode decoding and endline transforms, as trees might contain arbitrary binary files as well as arbitrary text; may need bytes listdir arg for undecodable filenames on some platforms """ # compare file name lists print('-' * 20) names1 = os.listdir(dir1) names2 = os.listdir(dir2) if not dirdiff.comparedirs(dir1, dir2, names1, names2): diffs.append('unique files at %s - %s' % (dir1, dir2)) print('Comparing contents') common = intersect(names1, names2) missed = common[:] # compare contents of files in common for name in common: path1 = os.path.join(dir1, name) path2 = os.path.join(dir2, name) if os.path.isfile(path1) and os.path.isfile(path2): missed.remove(name) file1 = open(path1, 'rb') file2 = open(path2, 'rb') 
while True: bytes1 = file1.read(blocksize) bytes2 = file2.read(blocksize) if (not bytes1) and (not bytes2): if verbose: print(name, 'matches') break if bytes1 != bytes2: diffs.append('files differ at %s - %s' % (path1, path2)) print(name, 'DIFFERS') break # recur to compare directories in common for name in common: path1 = os.path.join(dir1, name) path2 = os.path.join(dir2, name) if os.path.isdir(path1) and os.path.isdir(path2): missed.remove(name) comparetrees(path1, path2, diffs, verbose) # same name but not both files or dirs? for name in missed: diffs.append('files missed at %s - %s: %s' % (dir1, dir2, name)) print(name, 'DIFFERS') if __name__ == '__main__': dir1, dir2 = dirdiff.getargs() diffs = [] comparetrees(dir1, dir2, diffs, True) # changes diffs in-place print('=' * 40) # walk, report diffs list if not diffs: print('No diffs found.') else: print('Diffs found:', len(diffs)) for diff in diffs: print('-', diff)
At each directory in the tree, this script simply runs the
dirdiff
tool to detect unique
names, and then compares names in common by intersecting directory
lists. It uses recursive function calls to traverse the tree and
visits subdirectories only after comparing all the files at each
level so that the output is more coherent to read (the trace output
for subdirectories appears after that for files; it is not
intermixed).
Notice the missed list,
added in the third edition of this book; it’s very unlikely, but not
impossible, that the same name might be a file in one directory and
a subdirectory in the other. Also notice the blocksize
variable; much like the tree
copy script we saw earlier, instead of blindly reading entire files
into memory all at once, we limit each read to grab up to 1 MB at a
time, just in case any files in the directories are too big to be
loaded into available memory. Without this limit, I ran into
MemoryError
exceptions on some
machines with a prior version of this script that read both files
all at once, like this:
bytes1 = open(path1, 'rb').read()
bytes2 = open(path2, 'rb').read()
if bytes1 == bytes2: ...
This code was simpler, but is less practical for very large files that can’t fit into your available memory space (consider CD and DVD image files, for example). In the new version’s loop, the file reads return what is left when there is less than 1 MB present or remaining and return empty strings at end-of-file. Files match if all blocks read are the same, and they reach end-of-file at the same time.
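The same block-at-a-time logic can be isolated in a small reusable helper, sketched here; the book's diffall.py inlines it in the tree walker instead, and the function name here is an illustrative assumption:

```python
# Chunked file comparison: True only if both files contain identical bytes.
# Reads at most one block of each file into memory at a time, so arbitrarily
# large files can be compared without MemoryError.
def samefile_contents(path1, path2, blocksize=1024 * 1024):
    with open(path1, 'rb') as file1, open(path2, 'rb') as file2:
        while True:
            bytes1 = file1.read(blocksize)
            bytes2 = file2.read(blocksize)
            if not bytes1 and not bytes2:     # both at EOF together: match
                return True
            if bytes1 != bytes2:              # mismatch, or unequal lengths
                return False
```

A shorter file fails the final-block comparison, because its last read returns fewer bytes (or an empty string) than the other file's.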
We’re also dealing in binary files and byte strings again to
suppress Unicode decoding and end-line translations for file
content, because trees may contain arbitrary binary and text files.
The usual note about changing this to pass byte strings to os.listdir
on platforms where filenames
may generate Unicode decoding errors applies here as well (e.g. pass
dir1.encode()
). On some
platforms, you may also want to detect and skip certain kinds of
special files in order to be fully general, but these were not in my
trees, so they are not in my script.
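The bytes workaround mentioned above can be seen in isolation with a quick sketch: passing a bytes directory name to os.listdir yields undecoded bytes filenames, which sidesteps Unicode decoding issues for names that won't decode on your platform.

```python
# os.listdir return types depend on the argument type: str in, str names
# out (decoded); bytes in, bytes names out (raw, never decoded).
import os

names_str   = os.listdir('.')      # str results: subject to decoding
names_bytes = os.listdir(b'.')     # bytes results: raw filename bytes
```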
One minor change for the fourth edition of this book: os.listdir
results are now gathered just
once per subdirectory and passed along, to avoid extra calls in
dirdiff
—not a huge win, but every
cycle counts on the pitifully underpowered netbook I used when
writing this edition.
Since we’ve already studied the tree-walking tools this script
employs, let’s jump right into a few example runs. When run on
identical trees, status messages scroll during the traversal, and a
No diffs found.
message appears
at the end:
C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diffs.txt

C:\...\PP4E\System\Filetools> type diffs.txt | more
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Directory lists are identical
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.py matches
Launcher.pyc matches
...over 2,000 more lines omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
No diffs found.
I usually run this with the verbose
flag passed in as True
, and redirect output to a file (for
big trees, it produces too much output to scroll through
comfortably); use False
to watch
fewer status messages fly by. To show how differences are reported,
we need to generate a few; for simplicity, I’ll manually change a
few files scattered about one of the trees, but you could also run a
global search-and-replace script like the one we’ll write later in
this chapter. While we’re at it, let’s remove a few common files so
that directory uniqueness differences show up on the scope, too; the
last two removal commands in the following will generate one
difference in the same directory in different trees:
C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\README-PP3E.txt
C:\...\PP4E\System\Filetools> notepad copytemp\PP3E\System\Filetools\commands.py
C:\...\PP4E\System\Filetools> notepad C:\temp\PP3E\Examples\PP3E\__init__.py

C:\...\PP4E\System\Filetools> del copytemp\PP3E\System\Filetools\cpall_visitor.py
C:\...\PP4E\System\Filetools> del copytemp\PP3E\Launcher.py
C:\...\PP4E\System\Filetools> del C:\temp\PP3E\Examples\PP3E\PyGadgets.py
Now, rerun the comparison walker to pick out differences and
redirect its output report to a file for easy inspection. The
following lists just the parts of the output report that identify
differences. In typical use, I inspect the summary at the bottom of
the report first, and then search for the strings "DIFF"
and "unique"
in the report’s text if I need
more information about the differences summarized; this interface
could be much more user-friendly, of course, but it does the job for
me:
C:\...\PP4E\System\Filetools> diffall.py C:\temp\PP3E\Examples copytemp > diff2.txt

C:\...\PP4E\System\Filetools> notepad diff2.txt
--------------------
Comparing C:\temp\PP3E\Examples to copytemp
Directory lists are identical
Comparing contents
README-root.txt matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E to copytemp\PP3E
Files unique to C:\temp\PP3E\Examples\PP3E
... Launcher.py
Files unique to copytemp\PP3E
... PyGadgets.py
Comparing contents
echoEnvironment.pyw matches
LaunchBrowser.pyw matches
Launcher.pyc matches
...more omitted...
PyGadgets_bar.pyw matches
README-PP3E.txt DIFFERS
todos.py matches
tounix.py matches
__init__.py DIFFERS
__init__.pyc matches
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\System\Filetools to copytemp\PP3E\System\Fil...
Files unique to C:\temp\PP3E\Examples\PP3E\System\Filetools
... cpall_visitor.py
Comparing contents
commands.py DIFFERS
cpall.py matches
...more omitted...
--------------------
Comparing C:\temp\PP3E\Examples\PP3E\TempParts to copytemp\PP3E\TempParts
Directory lists are identical
Comparing contents
109_0237.JPG matches
lawnlake1-jan-03.jpg matches
part-001.txt matches
part-002.html matches
========================================
Diffs found: 5
- unique files at C:\temp\PP3E\Examples\PP3E - copytemp\PP3E
- files differ at C:\temp\PP3E\Examples\PP3E\README-PP3E.txt -
    copytemp\PP3E\README-PP3E.txt
- files differ at C:\temp\PP3E\Examples\PP3E\__init__.py -
    copytemp\PP3E\__init__.py
- unique files at C:\temp\PP3E\Examples\PP3E\System\Filetools -
    copytemp\PP3E\System\Filetools
- files differ at C:\temp\PP3E\Examples\PP3E\System\Filetools\commands.py -
    copytemp\PP3E\System\Filetools\commands.py
I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. In a tree with 1,430 files and 185 directories, we found five differences—the three files we changed by edits, and the two directories we threw out of sync with the three removal commands.
So how does this script placate CD backup paranoia? To double-check my CD writer’s work, I run a command such as the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is “G:” on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom or /mnt/cdrom on Linux:
C:\...\PP4E\System\Filetools> python diffall.py Examples g:\PP3E\Examples > diff0226

C:\...\PP4E\System\Filetools> more diff0226
...output omitted...
The CD spins, the script compares, and a summary of differences appears at the end of the report. For an example of a full difference report, see the diff*.txt files in the book’s examples distribution package. And to be really sure, I run the following global comparison command to verify the entire book development tree backed up to a memory stick (which works just like a CD in terms of the filesystem):
C:\...\PP4E\System\Filetools> diffall.py F:\writing-backups\feb-26-10\dev C:\Users\mark\Stuff\Books\4E\PP4E\dev > diff3.txt

C:\...\PP4E\System\Filetools> more diff3.txt

--------------------
Comparing F:\writing-backups\feb-26-10\dev to C:\Users\mark\Stuff\Books\4E\PP4E\dev
Directory lists are identical
Comparing contents
ch00.doc DIFFERS
ch01.doc matches
ch02.doc DIFFERS
ch03.doc matches
ch04.doc DIFFERS
ch05.doc matches
ch06.doc DIFFERS
...more output omitted...

--------------------
Comparing F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools to C:\...
Files unique to C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
... copytemp
... cpall.py
... diff2.txt
... diff3.txt
... diffall.py
... diffs.txt
... dirdiff.py
... dirdiff.pyc
Comparing contents
bigext-tree.py matches
bigpy-dir.py matches
...more output omitted...

========================================
Diffs found: 7
- files differ at F:\writing-backups\feb-26-10\dev\ch00.doc - C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch00.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch02.doc - C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch02.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch04.doc - C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch04.doc
- files differ at F:\writing-backups\feb-26-10\dev\ch06.doc - C:\Users\mark\Stuff\Books\4E\PP4E\dev\ch06.doc
- files differ at F:\writing-backups\feb-26-10\dev\TOC.txt - C:\Users\mark\Stuff\Books\4E\PP4E\dev\TOC.txt
- unique files at F:\writing-backups\feb-26-10\dev\Examples\PP4E\System\Filetools - C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\System\Filetools
- files differ at F:\writing-backups\feb-26-10\dev\Examples\PP4E\Tools\visitor.py - C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Tools\visitor.py
This particular run indicates that I’ve added a few examples
and changed some chapter files since the last backup; if run
immediately after a backup, nothing should show up on diffall
radar except for any files that
cannot be copied in general. This global comparison can take a few
minutes. It performs byte-for-byte comparisons of all chapter files
and screenshots, the examples tree, and more, but it’s an accurate
and complete verification. Given that this book development tree
contained many files, a more manual verification procedure without
Python’s help would be utterly impossible.
After writing this script, I also started using it to verify
full automated backups of my laptops onto an external hard-drive
device. To do so, I run the cpall
copy script we wrote earlier in the preceding section of this
chapter, and then the comparison script developed here to check
results and get a list of files that didn’t copy correctly. The last
time I did this, this procedure copied and compared 225,000 files
and 15,000 directories in 20 GB of space—not the sort of task that
lends itself to manual labor!
Here are the magic incantations on my Windows laptop. f:
is a partition on my external hard
drive, and you shouldn’t be surprised if each of these commands runs
for half an hour or more on currently common hardware. A
drag-and-drop copy takes at least as long (assuming it works at
all!):
C:\...\System\Filetools> cpall.py c:\ f:\ > f:\copy-log.txt

C:\...\System\Filetools> diffall.py f:\ c:\ > f:\diff-log.txt
Finally, it’s worth noting that this script detects differences in the tree, but it does not give any further details about how individual files differ. In fact, it simply loads and compares the binary contents of corresponding files with string comparisons; the outcome is a simple yes/no result.
If and when I need more details about how two reported files
actually differ, I either edit the files or run the file-comparison
command on the host platform (e.g., fc
on Windows/DOS, diff
or cmp
on Unix and Linux). That’s not a
portable solution for this last step; but for my purposes, just
finding the differences in a 1,400-file tree was much more critical
than reporting which lines differ in files flagged in the
report.
Of course, since we can always run shell commands in Python,
this last step could be automated by spawning a diff
or fc
command with os.popen
as differences are encountered
(or after the traversal, by scanning the report summary). The output
of these system calls could be displayed verbatim, or parsed for
relevant parts.
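The spawn-a-diff idea just described can be sketched in a few lines. The following is an illustrative helper, not part of the book’s diffall script; it assumes only that the real fc and diff shell commands are available on their respective platforms, and the detail_diff name and the sample files are made up for demonstration:

```python
import os, sys

def detail_diff(path1, path2):
    """Run the platform's line-comparison command, return its report text."""
    if sys.platform.startswith('win'):
        cmd = 'fc "%s" "%s"' % (path1, path2)      # Windows/DOS
    else:
        cmd = 'diff "%s" "%s"' % (path1, path2)    # Unix, Linux, Cygwin
    return os.popen(cmd).read()                    # collect command output

if __name__ == '__main__':
    # two small files made on the fly, just to demonstrate
    open('a.txt', 'w').write('spam\neggs\n')
    open('b.txt', 'w').write('spam\nham\n')
    print(detail_diff('a.txt', 'b.txt'))
```

A caller could invoke such a helper for each file flagged DIFFERS, either as the traversal runs or after parsing the report summary, and display or further parse the text it returns.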
We also might try to do a bit better here by opening true text
files in text mode to ignore line-terminator differences caused by
transferring across platforms, but it’s not clear that such
differences should be ignored (what if the caller wants to know
whether line-end markers have been changed?). For example, after
downloading a website with an FTP script we’ll meet in Chapter 13, the diffall
script detected a discrepancy
between the local copy of a file and the one at the remote server.
To probe further, I simply ran some interactive Python code:
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> a == b
False
This verifies that there really is a binary difference in the
downloaded and local versions of the file; to see whether it’s
because a Unix or DOS line end snuck into the file, try again in
text mode so that line ends are all mapped to the standard
\n character:
>>> a = open('lp2e-updates.html', 'r').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read()
>>> a == b
True
Sure enough; now, to find where the difference is, the following code checks character by character until the first mismatch is found (in binary mode, so we retain the difference):
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         break
...
37966 '\r' '\n'
This means that at byte offset 37,966, there is a
\r in the downloaded file, but a
\n in the local copy; this line has a DOS
line end in one and a Unix line end in the other. To see more, print
text around the mismatch:
>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(ac), repr(bc))
...         print(repr(a[i-20:i+20]))
...         print(repr(b[i-20:i+20]))
...         break
...
37966 '\r' '\n'
're>\ndef min(*args):\r\n    tmp = list(arg'
're>\ndef min(*args):\n    tmp = list(args'
Apparently, I wound up with a Unix line end at one point in
the local copy and a DOS line end in the version I downloaded—the
combined effect of the text mode used by the download script itself
(which translated \n to \r\n) and years of edits on both Linux and
Windows PDAs and laptops (I probably coded this change on Linux and
copied it to my local Windows copy in binary mode). Code such as
this could be integrated into the diffall
script to make it more intelligent
about text files and difference reporting.
Because Python excels at processing files and strings, it’s
even possible to go one step further and code a Python equivalent of
the fc
and diff
commands. In fact, much of the work
has already been done; the standard library module difflib
could make this task simple. See
the Python library manual for details and usage examples.
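To give a taste of what difflib provides, the following uses its documented unified_diff call to produce a Unix-style report in pure Python; the sample lines here are invented for illustration:

```python
import difflib

old = ['spam\n', 'eggs\n', 'ham\n']
new = ['spam\n', 'toast\n', 'ham\n']

# unified_diff yields report lines much like "diff -u" on Unix
for line in difflib.unified_diff(old, new, fromfile='old.txt', tofile='new.txt'):
    print(line, end='')
```

Fed the line lists of two files flagged by diffall, a loop like this could report exactly which lines changed, portably.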
We could also be smarter by avoiding the load and compare steps for files that differ in size, and we might use a smaller block size to reduce the script’s memory requirements. For most trees, such optimizations are unnecessary; reading multimegabyte files into strings is very fast in Python, and garbage collection reclaims the space as you go.
Since such extensions are beyond both this script’s scope and this chapter’s size limits, though, they will have to await the attention of a curious reader (this book doesn’t have formal exercises, but that almost sounds like one, doesn’t it?). For now, let’s move on to explore ways to code one more common directory task: search.
Engineers love to change things. As I was writing this book, I found it almost irresistible to move and rename directories, variables, and shared modules in the book examples tree whenever I thought I’d stumbled onto a more coherent structure. That was fine early on, but as the tree became more intertwined, this became a maintenance nightmare. Things such as program directory paths and module names were hardcoded all over the place—in package import statements, program startup calls, text notes, configuration files, and more.
One way to repair these references, of course, is to edit every file in the directory by hand, searching each for information that has changed. That’s so tedious as to be utterly impossible in this book’s examples tree, though; the examples of the prior edition contained 186 directories and 1,429 files! Clearly, I needed a way to automate updates after changes. There are a variety of solutions to such goals—from shell commands, to find operations, to custom tree walkers, to general-purpose frameworks. In this and the next section, we’ll explore each option in turn, just as I did while refining solutions to this real-world dilemma.
If you work on Unix-like systems, you probably already know
that there is a standard way to search files for strings on such
platforms—the command-line program grep
and its relatives list all lines in one or more files
containing a string or string pattern.[22] Given that shells expand (i.e., “glob”) filename
patterns automatically, a command such as the following will search
a single directory’s Python files for a string named on the command
line (this uses the grep
command
installed with the Cygwin Unix-like system for Windows that I
described in the prior chapter):
C:\...\PP4E\System\Filetools> c:\cygwin\bin\grep.exe walk *.py
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py: for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
As we’ve seen, we can often accomplish the same within a
Python script by running such a shell command with os.system
or os.popen
. And if we search its results
manually, we can also achieve similar results with the
Python glob
module we met
in Chapter 4; it expands a
filename pattern into a list of matching filename strings much like
a shell:
C:\...\PP4E\System\Filetools> python
>>> import os
>>> for line in os.popen(r'c:\cygwin\bin\grep.exe walk *.py'):
...     print(line, end='')
...
bigext-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):
bigpy-path.py:    for (thisDir, subsHere, filesHere) in os.walk(srcdir):
bigpy-tree.py:for (thisDir, subsHere, filesHere) in os.walk(dirname):

>>> from glob import glob
>>> for filename in glob('*.py'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
bigext-tree.py
bigpy-path.py
bigpy-tree.py
Unfortunately, these tools are generally limited to a single
directory. glob
can visit
multiple directories given the right sort of pattern string, but
it’s not a general directory walker of the sort I need to maintain a
large examples tree. On Unix-like systems, a find
shell command can go the extra mile to traverse an entire directory
tree. For instance, the
following Unix command line would pinpoint lines and files at and
below the current directory that mention the string popen
:
find . -name "*.py" -print -exec fgrep popen {} \;
If you happen to have a Unix-like find
command on every machine you will
ever use, this is one way to process directories.
But if you don’t happen to have a Unix find
on all your computers, not to
worry—it’s easy to code a portable one in Python. Python itself used
to have a find
module in its
standard library, which I used frequently in the past. Although that
module was removed between the second and third editions of this
book, the newer os.walk
makes
writing your own simple. Rather than lamenting the demise of a
module, I decided to spend 10 minutes coding a custom
equivalent.
Example 6-13
implements a find utility in Python, which collects all matching
filenames in a directory tree. Unlike glob.glob
, its find.find
automatically matches through an
entire tree. And unlike the tree walk structure of os.walk
, we can treat find.find
results as a simple linear
group.
#!/usr/bin/python
"""
################################################################################
Return all files matching a filename pattern at and below a root directory;

custom version of the now deprecated find module in the standard library:
import as "PP4E.Tools.find"; like original, but uses os.walk loop, has no
support for pruning subdirs, and is runnable as a top-level script;

find() is a generator that uses the os.walk() generator to yield just
matching filenames: use findlist() to force results list generation;
################################################################################
"""

import fnmatch, os

def find(pattern, startdir=os.curdir):
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in subsHere + filesHere:
            if fnmatch.fnmatch(name, pattern):
                fullpath = os.path.join(thisDir, name)
                yield fullpath

def findlist(pattern, startdir=os.curdir, dosort=False):
    matches = list(find(pattern, startdir))
    if dosort: matches.sort()
    return matches

if __name__ == '__main__':
    import sys
    namepattern, startdir = sys.argv[1], sys.argv[2]
    for name in find(namepattern, startdir): print(name)
There’s not much to this file—it’s largely just a minor
extension to os.walk
—but
calling its find
function
provides the same utility as both the deprecated find
standard library module and the Unix
utility of the same name. It’s also much more portable, and
noticeably easier than repeating all of this file’s code every time
you need to perform a find-type search. Because this file is
instrumented to be both a script and a library, it can either be
run as a command-line tool or called from other programs.
For instance, to process every Python file in the directory
tree rooted one level up from the current working directory, I
simply run the following command line from a system console window.
Run this yourself to watch its progress; the script’s standard
output is piped into the more
command to page it here, but it can be piped into any processing
program that reads its input from the standard input stream
(remember to quote the “*.py” on Unix and Linux shells only, to
avoid premature pattern expansion):
C:\...\PP4E\Tools> python find.py *.py .. | more
..\LaunchBrowser.py
..\Launcher.py
..\__init__.py
..\Preview\attachgui.py
..\Preview\customizegui.py
...more lines omitted...
For more control, run the following sort of Python code from a script or interactive prompt. In this mode, you can apply any operation to the found files that the Python language provides:
C:\...\PP4E\System\Filetools> python
>>> from PP4E.Tools import find            # or just import find if in cwd
>>> for filename in find.find('*.py', '..'):
...     if 'walk' in open(filename).read():
...         print(filename)
...
..\Launcher.py
..\System\Filetools\bigext-tree.py
..\System\Filetools\bigpy-path.py
..\System\Filetools\bigpy-tree.py
..\Tools\cleanpyc.py
..\Tools\find.py
..\Tools\visitor.py
Notice how this avoids having to recode the nested loop
structure required for os.walk
every time you want a list of matching file names; for many use
cases, this seems conceptually simpler. Also note that because this
finder is a generator function, your script doesn’t have to wait
until all matching files have been found and collected; os.walk
yields results as it goes, and
find.find
yields matching files
among that set.
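One practical payoff of this generator structure is early termination: a client that needs only a few matches can stop the walk as soon as it has them. The following stand-alone sketch inlines a minimal find() of the same shape as Example 6-13’s so it runs on its own; itertools.islice fetches just the first few results without exhausting the traversal:

```python
import fnmatch, os, itertools

def find(pattern, startdir=os.curdir):
    """Yield matching names one at a time, as the tree is walked."""
    for (thisDir, subsHere, filesHere) in os.walk(startdir):
        for name in subsHere + filesHere:
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(thisDir, name)

# fetch just the first three matches; no full-tree result list is built
first3 = list(itertools.islice(find('*.py', '.'), 3))
print(first3)
```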
Here’s a more complex example of our find
module at work: the following system
command line lists all Python files in directory
C:\temp\PP3E whose names begin with the letter
q or x. Note how find
returns full directory paths that
begin with the start directory specification:
C:\...\PP4E\Tools> find.py [qx]*.py C:\temp\PP3E
C:\temp\PP3E\Examples\PP3E\Database\SQLscripts\querydb.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui-class.py
C:\temp\PP3E\Examples\PP3E\Gui\Tools\queuetest-gui.py
C:\temp\PP3E\Examples\PP3E\Gui\Tour\quitter.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\Grail\Question.py
C:\temp\PP3E\Examples\PP3E\Internet\Other\XML\xmlrpc.py
C:\temp\PP3E\Examples\PP3E\System\Threads\queuetest.py
And here’s some Python code that does the same find but also extracts base names and file sizes for each file found:
C:\...\PP4E\Tools> python
>>> import os
>>> from find import find
>>> for name in find('[qx]*.py', r'C:\temp\PP3E'):
...     print(os.path.basename(name), os.path.getsize(name))
...
querydb.py 635
queuetest-gui-class.py 1152
queuetest-gui.py 963
quitter.py 801
Question.py 817
xmlrpc.py 705
queuetest.py 1273
To achieve such code economy, the find
module calls os.walk
to walk the tree and simply
yields matching filenames along the way. New here, though, is the
fnmatch
module—yet another
Python standard library module that performs Unix-like pattern
matching against filenames. This module supports common operators
in name pattern strings: *
to
match any number of characters, ?
to match any single character, and
[...]
and [!...]
to match any character inside the
bracket pairs or not; other characters match themselves. Unlike
the re
module, fnmatch
supports only common Unix shell
matching operators, not full-blown regular expression patterns;
we’ll see why this distinction matters in Chapter 19.
Interestingly, Python’s glob.glob
function also uses the
fnmatch
module to match names:
it combines os.listdir
and
fnmatch
to match in directories
in much the same way our find.find
combines os.walk
and fnmatch
to match in trees (though
os.walk
ultimately uses
os.listdir
as well). One
ramification of all this is that you can pass byte strings for
both pattern and start-directory to find.find
if you need to suppress
Unicode filename decoding, just as you can for os.walk
and glob.glob
; you’ll receive byte strings
for filenames in the result. See Chapter 4 for more details on Unicode
filenames.
By comparison, find.find
with just “*” for its name pattern is also roughly equivalent to
platform-specific directory tree listing shell commands such as
dir /B /S
on DOS and Windows.
Since all files match “*”, this just exhaustively generates all
the file names in a tree with a single traversal. Because we can
usually run such shell commands in a Python script with os.popen
, the following do the same
work, but the first is inherently nonportable and must start up a
separate program along the way:
>>> import os
>>> for line in os.popen('dir /B /S'): print(line, end='')

>>> from PP4E.Tools.find import find
>>> for name in find(pattern='*', startdir='.'): print(name)
Watch for this utility to show up in action later in this
chapter and book, including an arguably strong showing in the next
section and a cameo appearance in the Grep dialog of Chapter 11’s PyEdit text editor GUI, where
it will serve a central role in a threaded external files search
tool. The standard library’s find
module may be gone, but it need not
be forgotten.
In fact, you must pass a bytes
pattern string for a bytes
filename to fnmatch
(or pass both as str
), because the re
pattern matching module it uses
does not allow the string types of subject and pattern to be
mixed. This rule is inherited by our find.find
for directory and pattern.
See Chapter 19 for more on re
.
Curiously, the fnmatch
module in Python 3.1 also converts a bytes
pattern string to and from
Unicode str
in order to
perform internal text processing, using the Latin-1 encoding.
This suffices for many contexts, but may not be entirely sound
for some encodings which do not map to Latin-1 cleanly. sys.getfilesystemencoding
might be a
better encoding choice in such contexts, as this reflects the
underlying file system’s constraints (as we learned in Chapter 4, sys.getdefaultencoding
reflects file
content, not names).
In the absence of bytes
, os.walk
assumes filenames follow the
platform’s convention and does not ignore decoding errors
triggered by os.listdir
. In
the “grep” utility of Chapter 11’s
PyEdit, this picture is further clouded by the fact that a
str
pattern string from a GUI
would have to be encoded to bytes
using a potentially
inappropriate encoding for some files present. See fnmatch.py and os.py in Python’s library and the
Python library manual for more details. Unicode can be a very
subtle affair.
The find
module of
the prior section isn’t quite the general string
searcher we’re after, but it’s an important first step—it collects
files that we can then search in an automated script. In fact, the
act of collecting matching files in a tree is enough by itself to
support a wide variety of day-to-day system tasks.
For example, one of the other common tasks I perform on a
regular basis is removing all the bytecode files in a tree. Because
these are not always portable across major Python releases, it’s
usually a good idea to ship programs without them and let Python
create new ones on first imports. Now that we’re expert os.walk
users, we could cut out the
middleman and use it directly. Example 6-14 codes a portable
and general command-line tool, with support for arguments, exception
processing, tracing, and list-only mode.
""" delete all .pyc bytecode files in a directory tree: use the command line arg as root if given, else current working dir """ import os, sys findonly = False rootdir = os.getcwd() if len(sys.argv) == 1 else sys.argv[1] found = removed = 0 for (thisDirLevel, subsHere, filesHere) in os.walk(rootdir): for filename in filesHere: if filename.endswith('.pyc'): fullname = os.path.join(thisDirLevel, filename) print('=>', fullname) if not findonly: try: os.remove(fullname) removed += 1 except: type, inst = sys.exc_info()[:2] print('*'*4, 'Failed:', filename, type, inst) found += 1 print('Found', found, 'files, removed', removed)
When run, this script walks a directory tree (the CWD by default, or else one passed in on the command line), deleting any and all bytecode files along the way:
C:\...\Examples\PP4E> Tools\cleanpyc.py
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\__init__.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\initdata.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\make_db_file.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\manager.pyc
=> C:\Users\mark\Stuff\Books\4E\PP4E\dev\Examples\PP4E\Preview\person.pyc
...more lines here...
Found 24 files, removed 24

C:\...\PP4E\Tools> cleanpyc.py .
=> .\find.pyc
=> .\visitor.pyc
=> .\__init__.pyc
Found 3 files, removed 3
This script works, but it’s a bit more manual and code-y than it needs to be. In fact, now that we also know about find operations, writing scripts based upon them is almost trivial when we just need to match filenames. Example 6-15, for instance, falls back on spawning shell find commands if you have them.
""" find and delete all "*.pyc" bytecode files at and below the directory named on the command-line; assumes a nonportable Unix-like find command """ import os, sys rundir = sys.argv[1] if sys.platform[:3] == 'win': findcmd = r'c:cygwininfind %s -name "*.pyc" -print' % rundir else: findcmd = 'find %s -name "*.pyc" -print' % rundir print(findcmd) count = 0 for fileline in os.popen(findcmd): # for all result lines count += 1 # have at the end print(fileline, end='') os.remove(fileline.rstrip()) print('Removed %d .pyc files' % count)
When run, files returned by the shell command are removed:
C:\...\PP4E\Tools> cleanpyc-find-shell.py .
c:\cygwin\bin\find . -name "*.pyc" -print
./find.pyc
./visitor.pyc
./__init__.pyc
Removed 3 .pyc files
This script uses os.popen
to collect the output of a Cygwin find
program installed on one of my
Windows computers, or else the standard find
tool on the Linux side. It’s also
completely nonportable to Windows machines that
don’t have the Unix-like find
program installed, and that includes other computers of my own (not
to mention those throughout most of the world at large). As we’ve
seen, spawning shell commands also incurs performance penalties for
starting a new program.
We can do much better on the portability and performance fronts and still retain code simplicity, by applying the find tool we wrote in Python in the prior section. The new script is shown in Example 6-16.
""" find and delete all "*.pyc" bytecode files at and below the directory named on the command-line; this uses a Python-coded find utility, and so is portable; run this to delete .pyc's from an old Python release; """ import os, sys, find # here, gets Tools.find count = 0 for filename in find.find('*.pyc', sys.argv[1]): count += 1 print(filename) os.remove(filename) print('Removed %d .pyc files' % count)
When run, all bytecode files in the tree rooted at the passed-in directory name are removed as before; this time, though, our script works just about everywhere Python does:
C:\...\PP4E\Tools> cleanpyc-find-py.py .
.\find.pyc
.\visitor.pyc
.\__init__.pyc
Removed 3 .pyc files
This works portably, and it avoids external program startup
costs. But find
is really just
half the story—it collects files matching a name pattern but doesn’t
search their content. Although extra code can add such searching to
a find’s result, a more manual approach can allow us to tap into the
search process more directly. The next section shows how.
After experimenting with greps and globs and finds, in the
end, to help ease the task of performing global searches on all
platforms I might ever use, I wound up coding a task-specific Python
script to do most of the work for me. Example 6-17 employs the
following standard Python tools that we met in the preceding
chapters: os.walk
to visit files
in a directory, os.path.splitext
to skip over files with binary-type extensions, and os.path.join
to portably combine a
directory path and filename.
Because it’s pure Python code, it can be run the same way on both Linux and Windows. In fact, it should work on any computer where Python has been installed. Moreover, because it uses direct system calls, it will likely be faster than approaches that rely on underlying shell commands.
""" ################################################################################ Use: "python ...Toolssearch_all.py dir string". Search all files at and below a named directory for a string; uses the os.walk interface, rather than doing a find.find to collect names first; similar to calling visitfile for each find.find result for "*" pattern; ################################################################################ """ import os, sys listonly = False textexts = ['.py', '.pyw', '.txt', '.c', '.h'] # ignore binary files def searcher(startdir, searchkey): global fcount, vcount fcount = vcount = 0 for (thisDir, dirsHere, filesHere) in os.walk(startdir): for fname in filesHere: # do non-dir files here fpath = os.path.join(thisDir, fname) # fnames have no dirpath visitfile(fpath, searchkey) def visitfile(fpath, searchkey): # for each non-dir file global fcount, vcount # search for string print(vcount+1, '=>', fpath) # skip protected files try: if not listonly: if os.path.splitext(fpath)[1] not in textexts: print('Skipping', fpath) elif searchkey in open(fpath).read(): input('%s has %s' % (fpath, searchkey)) fcount += 1 except: print('Failed:', fpath, sys.exc_info()[0]) vcount += 1 if __name__ == '__main__': searcher(sys.argv[1], sys.argv[2]) print('Found in %d files, visited %d' % (fcount, vcount))
Operationally, this script works roughly the same as calling
its visitfile
function for every
result generated by our find.find
tool with a pattern of “*”; but because this version is specific to
searching content, it can be better tailored to its goal. Really, this
equivalence holds only because a “*” pattern invokes an exhaustive
traversal in find.find
, and
that’s all that this new script’s searcher
function
does. The finder is good at selecting specific file types, but this
script benefits from a more custom single traversal.
When run standalone, the search key is passed on the command
line; when imported, clients call this module’s searcher
function directly. For example,
to search (that is, grep) for all appearances of a string in the
book examples tree, I run a command line like this in a DOS or Unix
shell:
C:\...\PP4E> Tools\search_all.py . mimetypes
1 => .\LaunchBrowser.py
2 => .\Launcher.py
3 => .\Launch_PyDemos.pyw
4 => .\Launch_PyGadgets_bar.pyw
5 => .\__init__.py
6 => .\__init__.pyc
Skipping .\__init__.pyc
7 => .\Preview\attachgui.py
8 => .\Preview\bob.pkl
Skipping .\Preview\bob.pkl
...more lines omitted: pauses for Enter key press at matches...
Found in 2 files, visited 184
The script lists each file it checks as it goes, tells you
which files it is skipping (names that end in extensions not listed
in the variable textexts
that
imply binary data), and pauses for an Enter key press each time it
announces a file containing the search string. The search_all
script works the same way when
it is imported rather than run, but there is no
final statistics output line (fcount
and vcount
live in the module and so would
have to be imported to be inspected here):
C:\...\PP4E\dev\Examples\PP4E> python
>>> from Tools import search_all
>>> search_all.searcher(r'C:\temp\PP3E\Examples', 'mimetypes')
...more lines omitted: 8 pauses for Enter key press along the way...
>>> search_all.fcount, search_all.vcount       # matches, files
(8, 1429)
However launched, this script tracks down all references to a string in an entire directory tree: a name of a changed book examples file, object, or directory, for instance. It’s exactly what I was looking for—or at least I thought so, until further deliberation drove me to seek more complete and better structured solutions, the topic of the next section.
Be sure to also see the coverage of regular expressions in
Chapter 19. The search_all
script here searches for a
simple string in each file with the in
string membership expression, but it
would be trivial to extend it to search for a regular expression
pattern match instead (roughly, just replace in
with a call to a regular expression
object’s search method). Of course, such a mutation will be much
more trivial after we’ve learned how.
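To preview the mutation just described, here is a hedged sketch rather than the book’s code: the in membership test becomes a compiled pattern’s search call; the visitmatch name and sample text are illustrative only:

```python
import re

def visitmatch(text, patternstr):
    """Return a match object if the pattern occurs in the text, else None."""
    pattern = re.compile(patternstr)
    return pattern.search(text)          # replaces: patternstr in text

sample = 'import os, sys\nfor x in os.walk("."): pass'
match = visitmatch(sample, r'os\.\w+')   # matches os.walk, not just "os"
print(match.group() if match else None)
```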
Also notice the textexts
list in Example 6-17,
which attempts to list all possible text file types: it would be
more general and robust to use the mimetypes
logic we will meet near the
end of this chapter in order to guess file content type from its
name, but the skips list provides more control and sufficed for
the trees I used this script against.
Finally note that for simplicity many of the directory searches in this chapter assume that text is encoded per the underlying platform’s Unicode default. They could open text in binary mode to avoid decoding errors, but searches might then be inaccurate because of encoding scheme differences in the raw encoded bytes. To see how to do better, watch for the “grep” utility in Chapter 11’s PyEdit GUI, which will apply an encoding name to all the files in a searched tree and ignore those text or binary files that fail to decode.
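The skip-undecodable-files policy described for Chapter 11’s grep can be sketched in isolation; this helper and its names are hypothetical, assuming only that the caller supplies an encoding name to try for each file:

```python
def searchtext(fpath, searchkey, encoding='utf-8'):
    """Return True/False if the key is found, or None if the file won't decode."""
    try:
        text = open(fpath, encoding=encoding).read()
    except (UnicodeDecodeError, IOError):
        return None                    # undecodable or unreadable: skip, don't crash

    return searchkey in text

open('good.txt', 'w', encoding='utf-8').write('spam and eggs')
open('bad.bin', 'wb').write(b'\xff\xfe\x00spam')   # not valid UTF-8
print(searchtext('good.txt', 'spam'))              # True
print(searchtext('bad.bin', 'spam'))               # None: skipped
```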
Laziness is the mother of many a framework. Armed with the
portable search_all
script
from Example 6-17, I was
able to better pinpoint files to be edited every time I changed the
book examples tree content or structure. At least initially, in one
window I ran search_all
to pick out
suspicious files and edited each along the way by hand in another
window.
Pretty soon, though, this became tedious, too. Manually typing filenames into editor commands is no fun, especially when the number of files to edit is large. Since I occasionally have better things to do than manually start dozens of text editor sessions, I started looking for a way to automatically run an editor on each suspicious file.
Unfortunately, search_all
simply prints results to the screen. Although that text could be
intercepted with os.popen
and
parsed by another program, a more direct approach that spawns edit
sessions during the search may be simpler. That would require major
changes to the tree search script as currently coded, though, and make
it useful for just one specific purpose. At this point, three thoughts
came to mind:
After writing a few directory walking utilities, it became
clear that I was rewriting the same sort of code over and over
again. Traversals could be even further simplified by wrapping
common details for reuse. Although the os.walk
tool avoids having to write
recursive functions, its model tends to foster redundant
operations and code (e.g., directory name joins, tracing
prints).
Past experience informed me that it would be better in the
long run to add features to a general directory searcher as
external components, rather than changing the original script
itself. Because editing files was just one possible extension
(what about automating text replacements, too?), a more general,
customizable, and reusable approach seemed the way to go.
Although os.walk
is
straightforward to use, its nested loop-based structure doesn’t
quite lend itself to customization the way a class can.
Based on past experience, I also knew that it’s a
generally good idea to insulate programs from implementation
details as much as possible. While os.walk
hides the details of recursive
traversal, it still imposes a very specific interface on its
clients, which is prone to change over time. Indeed it has—as
I’ll explain further at the end of this section, one of Python’s
tree walkers was removed altogether in 3.X, instantly breaking
code that relied upon it. It would be better to hide such
dependencies behind a more neutral interface, so that clients
won’t break as our needs change.
Of course, if you’ve studied Python in any depth, you know that
all these goals point to using an object-oriented
framework for traversals and searching. Example 6-18 is a concrete
realization of these goals. It exports a general FileVisitor
class
that mostly just wraps os.walk
for
easier use and extension, as well as a generic SearchVisitor
class
that generalizes the notion of directory searches.
By itself, SearchVisitor
simply does what search_all
did,
but it also opens up the search process to customization—bits of its
behavior can be modified by overloading its methods in subclasses.
Moreover, its core search logic can be reused everywhere we need to
search. Simply define a subclass that adds extensions for a specific
task. The same goes for FileVisitor
—by redefining its methods and
using its attributes, we can tap into tree search using OOP coding
techniques. As is usual in programming, once you repeat
tactical tasks often enough, they tend to inspire
this kind of strategic thinking.
""" #################################################################################### Test: "python ...Toolsvisitor.py dir testmask [string]". Uses classes and subclasses to wrap some of the details of os.walk call usage to walk and search; testmask is an integer bitmask with 1 bit per available self-test; see also: visitor_*/.py subclasses use cases; frameworks should generally use__X pseudo private names, but all names here are exported for use in subclasses and clients; redefine reset to support multiple independent walks that require subclass updates; #################################################################################### """ import os, sys class FileVisitor: """ Visits all nondirectory files below startDir (default '.'), override visit* methods to provide custom file/dir handlers; context arg/attribute is optional subclass-specific state; trace switch: 0 is silent, 1 is directories, 2 adds files """ def __init__(self, context=None, trace=2): self.fcount = 0 self.dcount = 0 self.context = context self.trace = trace def run(self, startDir=os.curdir, reset=True): if reset: self.reset() for (thisDir, dirsHere, filesHere) in os.walk(startDir): self.visitdir(thisDir) for fname in filesHere: # for non-dir files fpath = os.path.join(thisDir, fname) # fnames have no path self.visitfile(fpath) def reset(self): # to reuse walker self.fcount = self.dcount = 0 # for independent walks def visitdir(self, dirpath): # called for each dir self.dcount += 1 # override or extend me if self.trace > 0: print(dirpath, '...') def visitfile(self, filepath): # called for each file self.fcount += 1 # override or extend me if self.trace > 1: print(self.fcount, '=>', filepath) class SearchVisitor(FileVisitor): """ Search files at and below startDir for a string; subclass: redefine visitmatch, extension lists, candidate as needed; subclasses can use testexts to specify file types to search (but can also redefine candidate to use mimetypes for text content: see ahead) """ 
skipexts = [] testexts = ['.txt', '.py', '.pyw', '.html', '.c', '.h'] # search these exts #skipexts = ['.gif', '.jpg', '.pyc', '.o', '.a', '.exe'] # or skip these exts def __init__(self, searchkey, trace=2): FileVisitor.__init__(self, searchkey, trace) self.scount = 0 def reset(self): # on independent walks self.scount = 0 def candidate(self, fname): # redef for mimetypes ext = os.path.splitext(fname)[1] if self.testexts: return ext in self.testexts # in test list else: # or not in skip list return ext not in self.skipexts def visitfile(self, fname): # test for a match FileVisitor.visitfile(self, fname) if not self.candidate(fname): if self.trace > 0: print('Skipping', fname) else: text = open(fname).read() # 'rb' if undecodable if self.context in text: # or text.find() != −1 self.visitmatch(fname, text) self.scount += 1 def visitmatch(self, fname, text): # process a match print('%s has %s' % (fname, self.context)) # override me lower if __name__ == '__main__': # self-test logic dolist = 1 dosearch = 2 # 3=do list and search donext = 4 # when next test added def selftest(testmask): if testmask & dolist: visitor = FileVisitor(trace=2) visitor.run(sys.argv[2]) print('Visited %d files and %d dirs' % (visitor.fcount, visitor.dcount)) if testmask & dosearch: visitor = SearchVisitor(sys.argv[3], trace=0) visitor.run(sys.argv[2]) print('Found in %d files, visited %d' % (visitor.scount, visitor.fcount)) selftest(int(sys.argv[1])) # e.g., 3 = dolist | dosearch
This module primarily serves to export classes for external use,
but it does something useful when run standalone, too. If you invoke
it as a script with a test mask of 1
and a root directory name, it makes and
runs a FileVisitor
object and
prints an exhaustive listing of every file and directory at and below
the root:
C:\...\PP4E\Tools> visitor.py 1 C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted (pipe into more or a file)...
1424 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread-count.py
1425 => C:\temp\PP3E\Examples\PP3E\System\Threads\thread1.py
C:\temp\PP3E\Examples\PP3E\TempParts ...
1426 => C:\temp\PP3E\Examples\PP3E\TempParts\109_0237.JPG
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Visited 1429 files and 186 dirs
If you instead invoke this script with a 2
as its first command-line argument, it
makes and runs a SearchVisitor
object using the third argument as the search key. This form is
similar to running the search_all.py script we
met earlier, but it simply reports each matching file without
pausing:
C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples mimetypes
C:\temp\PP3E\Examples\PP3E\extras\LosAlamosAdvancedClass\day1-system\data.txt has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py has mimetypes
Found in 8 files, visited 1429
Technically, passing this script a first argument of 3
runs both a FileVisitor
and a SearchVisitor
(two separate traversals are
performed). The first argument is really used as a bit mask to select
one or more supported self-tests; if a test’s bit is on in the binary
value of the argument, the test will be run. Because 3 is 011 in
binary, it selects both a search (010) and a listing (001). In a more
user-friendly system, we might want to be more symbolic about that
(e.g., check for -search
and
-list
arguments), but bit masks
work just as well for this script’s scope.
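To make the mask arithmetic concrete, here is the selection test in isolation, a standalone sketch mirroring the script's dolist and dosearch names:

```python
# self-test selection bits, mirroring the script's names
dolist   = 1        # 001: listing test
dosearch = 2        # 010: search test

testmask = 3        # 011: both bits on
print(bool(testmask & dolist))      # True
print(bool(testmask & dosearch))    # True
print(bool(2 & dolist))             # False: 010 selects the search only
```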
As usual, this module can also be used interactively. The following is one way to determine how many files and directories you have in specific directories; the last command walks over your entire drive (after a generally noticeable delay!). See also the “biggest file” example at the start of this chapter for issues such as potential repeat visits not handled by this walker:
C:\...\PP4E\Tools> python
>>> from visitor import FileVisitor
>>> V = FileVisitor(trace=0)
>>> V.run(r'C:\temp\PP3E\Examples')
>>> V.dcount, V.fcount
(186, 1429)

>>> V.run('..')                        # independent walk (reset counts)
>>> V.dcount, V.fcount
(19, 181)

>>> V.run('..', reset=False)           # accumulative walk (keep counts)
>>> V.dcount, V.fcount
(38, 362)

>>> V = FileVisitor(trace=0)           # new independent walker (own counts)
>>> V.run('C:\\')                      # entire drive: try '/' on Unix-en
>>> V.dcount, V.fcount
(24992, 198585)
Although the visitor module is useful by itself for listing and searching trees, it was really designed to be extended. In the rest of this section, let’s quickly step through a handful of visitor clients which add more specific tree operations, using normal OO customization techniques.
After genericizing tree traversals and searches, it’s easy to add automatic
file editing in a brand-new, separate component. Example 6-19 defines a
new EditVisitor
class
that simply customizes the visitmatch
method of the SearchVisitor
class to open a text editor
on the matched file. Yes, this is the complete program—it needs to
do something special only when visiting matched files, and so it
needs to provide only that behavior. The rest of the traversal and
search logic is unchanged and inherited.
""" Use: "python ...Toolsvisitor_edit.py string rootdir?". Add auto-editor startup to SearchVisitor in an external subclass component; Automatically pops up an editor on each file containing string as it traverses; can also use editor='edit' or 'notepad' on Windows; to use texteditor from later in the book, try r'python GuiTextEditor extEditor.py'; could also send a search command to go to the first match on start in some editors; """ import os, sys from visitor import SearchVisitor class EditVisitor(SearchVisitor): """ edit files at and below startDir having string """ editor = r'C:cygwininvim-nox.exe' # ymmv! def visitmatch(self, fpathname, text): os.system('%s %s' % (self.editor, fpathname)) if __name__ == '__main__': visitor = EditVisitor(sys.argv[1]) visitor.run('.' if len(sys.argv) < 3 else sys.argv[2]) print('Edited %d files, visited %d' % (visitor.scount, visitor.fcount))
When we make and run an EditVisitor
, a text editor is started with
the os.system
command-line spawn
call, which usually blocks its caller until the spawned program
finishes. As coded, when run on my machines, each time this script
finds a matched file during the traversal, it starts up the vi text
editor within the console window where the script was started;
exiting the editor resumes the tree walk.
Let’s find and edit some files. When run as a script, we pass
this program the search string as a command argument (here, the
string mimetypes
is the search
key). The root directory passed to the run
method is either the second argument
or “.” (the current run directory) by default. Traversal status
messages show up in the console, but each matched file now
automatically pops up in a text editor along the way. In the
following, the editor is started eight times—try this with an editor
and tree of your own to get a better feel for how it works:
C:\...\PP4E\Tools> visitor_edit.py mimetypes C:\temp\PP3E\Examples
C:\temp\PP3E\Examples ...
1 => C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E ...
2 => C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
3 => C:\temp\PP3E\Examples\PP3E\LaunchBrowser.pyw
4 => C:\temp\PP3E\Examples\PP3E\Launcher.py
5 => C:\temp\PP3E\Examples\PP3E\Launcher.pyc
Skipping C:\temp\PP3E\Examples\PP3E\Launcher.pyc
...more output omitted...
1427 => C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
Skipping C:\temp\PP3E\Examples\PP3E\TempParts\lawnlake1-jan-03.jpg
1428 => C:\temp\PP3E\Examples\PP3E\TempParts\part-001.txt
1429 => C:\temp\PP3E\Examples\PP3E\TempParts\part-002.html
Edited 8 files, visited 1429
This, finally, is the exact tool I was looking for to simplify global book examples tree maintenance. After major changes to things such as shared modules and file and directory names, I run this script on the examples root directory with an appropriate search string and edit any files it pops up as needed. I still need to change files by hand in the editor, but that’s often safer than blind global replacements.
But since I brought it up: given a general tree traversal class, it’s easy to
code a global search-and-replace subclass, too. The ReplaceVisitor
class in Example 6-20
is a SearchVisitor
subclass that
customizes the visitfile
method
to globally replace any appearances of one string with another, in
all text files at and below a root directory. It also collects the
names of all files that were changed in a list just in case you wish
to go through and verify the automatic edits applied (a text editor
could be automatically popped up on each changed file, for
instance).
""" Use: "python ...Toolsvisitor_replace.py rootdir fromStr toStr". Does global search-and-replace in all files in a directory tree: replaces fromStr with toStr in all text files; this is powerful but dangerous!! visitor_edit.py runs an editor for you to verify and make changes, and so is safer; use visitor_collect.py to simply collect matched files list; listonly mode here is similar to both SearchVisitor and CollectVisitor; """ import sys from visitor import SearchVisitor class ReplaceVisitor(SearchVisitor): """ Change fromStr to toStr in files at and below startDir; files changed available in obj.changed list after a run """ def __init__(self, fromStr, toStr, listOnly=False, trace=0): self.changed = [] self.toStr = toStr self.listOnly = listOnly SearchVisitor.__init__(self, fromStr, trace) def visitmatch(self, fname, text): self.changed.append(fname) if not self.listOnly: fromStr, toStr = self.context, self.toStr text = text.replace(fromStr, toStr) open(fname, 'w').write(text) if __name__ == '__main__': listonly = input('List only?') == 'y' visitor = ReplaceVisitor(sys.argv[2], sys.argv[3], listonly) if listonly or input('Proceed with changes?') == 'y': visitor.run(startDir=sys.argv[1]) action = 'Changed' if not listonly else 'Found' print('Visited %d files' % visitor.fcount) print(action, '%d files:' % len(visitor.changed)) for fname in visitor.changed: print(fname)
To run this script over a directory tree, run the following sort of command line with appropriate “from” and “to” strings. On my shockingly underpowered netbook machine, doing this on a 1429-file tree and changing 101 files along the way takes roughly three seconds of real clock time when the system isn’t particularly busy.
C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?y
Visited 1429 files
Found 101 files:
C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
C:\temp\PP3E\Examples\PP3E\Launcher.py
...more matching filenames omitted...

C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?n
Proceed with changes?y
Visited 1429 files
Changed 101 files:
C:\temp\PP3E\Examples\README-root.txt
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw
C:\temp\PP3E\Examples\PP3E\Launcher.py
...more changed filenames omitted...

C:\...\PP4E\Tools> visitor_replace.py C:\temp\PP3E\Examples PP3E PP4E
List only?n
Proceed with changes?y
Visited 1429 files
Changed 0 files:
Naturally, we can also check our work by running the visitor script (and its SearchVisitor superclass):

C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples PP3E
Found in 0 files, visited 1429

C:\...\PP4E\Tools> visitor.py 2 C:\temp\PP3E\Examples PP4E
C:\temp\PP3E\Examples\README-root.txt has PP4E
C:\temp\PP3E\Examples\PP3E\echoEnvironment.pyw has PP4E
C:\temp\PP3E\Examples\PP3E\Launcher.py has PP4E
...more matching filenames omitted...
Found in 101 files, visited 1429
This is both wildly powerful and dangerous. If the string to
be replaced can show up in places you didn’t anticipate, you might
just ruin an entire tree of files by running the ReplaceVisitor
object defined here. On the
other hand, if the string is something very specific, this object
can obviate the need to manually edit suspicious files. For
instance, website addresses in HTML files are likely too specific to
show up in other places by chance.
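One way to hedge your bets before committing to a replacement is to preview how many occurrences would change in a given text; the following standalone sketch (my own helper, not part of the book's visitor_replace.py) pairs the count with the would-be result:

```python
def previewreplace(text, fromStr, toStr):
    # report how many occurrences a global replacement would change,
    # along with the new text itself, so a caller can inspect first
    count = text.count(fromStr)
    return count, text.replace(fromStr, toStr)

count, newtext = previewreplace('import PP3E.tools  # PP3E demo', 'PP3E', 'PP4E')
print(count)       # 2 occurrences would change
print(newtext)
```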
The two preceding visitor
module clients were both search-oriented, but it’s just as easy to
extend the basic walker class for more specific goals. Example 6-21, for instance,
extends FileVisitor
to count the
number of lines in program source code files of various types
throughout an entire tree. The effect is much like calling the
visitfile
method of this class
for each filename returned by the find
tool we wrote earlier in this
chapter, but the OO structure here is arguably more flexible and
extensible.
""" Count lines among all program source files in a tree named on the command line, and report totals grouped by file types (extension). A simple SLOC (source lines of code) metric: skip blank and comment lines if desired. """ import sys, pprint, os from visitor import FileVisitor class LinesByType(FileVisitor): srcExts = [] # define in subclass def __init__(self, trace=1): FileVisitor.__init__(self, trace=trace) self.srcLines = self.srcFiles = 0 self.extSums = {ext: dict(files=0, lines=0) for ext in self.srcExts} def visitsource(self, fpath, ext): if self.trace > 0: print(os.path.basename(fpath)) lines = len(open(fpath, 'rb').readlines()) self.srcFiles += 1 self.srcLines += lines self.extSums[ext]['files'] += 1 self.extSums[ext]['lines'] += lines def visitfile(self, filepath): FileVisitor.visitfile(self, filepath) for ext in self.srcExts: if filepath.endswith(ext): self.visitsource(filepath, ext) break class PyLines(LinesByType): srcExts = ['.py', '.pyw'] # just python files class SourceLines(LinesByType): srcExts = ['.py', '.pyw', '.cgi', '.html', '.c', '.cxx', '.h', '.i'] if __name__ == '__main__': walker = SourceLines() walker.run(sys.argv[1]) print('Visited %d files and %d dirs' % (walker.fcount, walker.dcount)) print('-'*80) print('Source files=>%d, lines=>%d' % (walker.srcFiles, walker.srcLines)) print('By Types:') pprint.pprint(walker.extSums) print(' Check sums:', end=' ') print(sum(x['lines'] for x in walker.extSums.values()), end=' ') print(sum(x['files'] for x in walker.extSums.values())) print(' Python only walk:') walker = PyLines(trace=0) walker.run(sys.argv[1]) pprint.pprint(walker.extSums)
When run as a script, we get trace messages during the walk (omitted here to save space), and a report with line counts grouped by file type. Run this on trees of your own to watch its progress; my tree has 907 source files and 48K source lines, including 783 files and 34K lines of “.py” Python code:
C:\...\PP4E\Tools> visitor_sloc.py C:\temp\PP3E\Examples
Visited 1429 files and 186 dirs
--------------------------------------------------------------------------------
Source files=>907, lines=>48047
By Types:
{'.c': {'files': 45, 'lines': 7370},
'.cgi': {'files': 5, 'lines': 122},
'.cxx': {'files': 4, 'lines': 2278},
'.h': {'files': 7, 'lines': 297},
'.html': {'files': 48, 'lines': 2830},
'.i': {'files': 4, 'lines': 49},
'.py': {'files': 783, 'lines': 34601},
'.pyw': {'files': 11, 'lines': 500}}
Check sums: 48047 907
Python only walk:
{'.py': {'files': 783, 'lines': 34601}, '.pyw': {'files': 11, 'lines': 500}}
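The docstring above mentions skipping blank and comment lines if desired; a hedged sketch of such a filter for Python sources follows (the sloc function is my own illustration, not part of Example 6-21, and it ignores only full-line '#' comments):

```python
def sloc(lines):
    # count Python source lines, skipping blanks and full-line '#' comments
    return sum(1 for line in lines
               if line.strip() and not line.strip().startswith('#'))

sample = ['# header comment\n', '\n', 'import os\n', 'x = 1  # inline kept\n']
print(sloc(sample))      # 2: the import and the assignment
```

A subclass could apply this in visitsource in place of the raw readlines length, at the cost of decoding each file as text.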
Let’s peek at one more visitor use case. When I first wrote the
cpall.py
script earlier in this
chapter, I couldn’t see a way that the visitor
class hierarchy we met earlier
would help. Two directories needed to be
traversed in parallel (the original and the copy), and visitor
is based on walking just one tree
with os.walk
. There seemed no
easy way to keep track of where the script was in the copy
directory.
The trick I eventually stumbled onto is not to keep track at
all. Instead, the script in Example 6-22 simply replaces
the “from” directory path string with the “to” directory path
string, at the front of all directory names and pathnames passed in
from os.walk
. The results of the
string replacements are the paths to which the original files and
directories are to be copied.
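The path arithmetic behind the trick is just slicing off the "from" prefix and joining the remainder to the "to" root; a standalone sketch with invented directory names:

```python
import os

fromDir = os.path.join('temp', 'Examples')     # illustrative roots
toDir   = 'copytemp'
fromDirLen = len(fromDir) + 1                  # +1 drops the path separator

srcfile = os.path.join(fromDir, 'PP3E', 'Launcher.py')   # as passed by os.walk
dstfile = os.path.join(toDir, srcfile[fromDirLen:])      # prefix swap
print(dstfile)
```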
""" Use: "python ...Toolsvisitor_cpall.py fromDir toDir trace?" Like SystemFiletoolscpall.py, but with the visitor classes and os.walk; does string replacement of fromDir with toDir at the front of all the names that the walker passes in; assumes that the toDir does not exist initially; """ import os from visitor import FileVisitor # visitor is in '.' from PP4E.System.Filetools.cpall import copyfile # PP4E is in a dir on path class CpallVisitor(FileVisitor): def __init__(self, fromDir, toDir, trace=True): self.fromDirLen = len(fromDir) + 1 self.toDir = toDir FileVisitor.__init__(self, trace=trace) def visitdir(self, dirpath): toPath = os.path.join(self.toDir, dirpath[self.fromDirLen:]) if self.trace: print('d', dirpath, '=>', toPath) os.mkdir(toPath) self.dcount += 1 def visitfile(self, filepath): toPath = os.path.join(self.toDir, filepath[self.fromDirLen:]) if self.trace: print('f', filepath, '=>', toPath) copyfile(filepath, toPath) self.fcount += 1 if __name__ == '__main__': import sys, time fromDir, toDir = sys.argv[1:3] trace = len(sys.argv) > 3 print('Copying...') start = time.clock() walker = CpallVisitor(fromDir, toDir, trace) walker.run(startDir=fromDir) print('Copied', walker.fcount, 'files,', walker.dcount, 'directories', end=' ') print('in', time.clock() - start, 'seconds')
This version accomplishes roughly the same goal as the original, but it has made a few assumptions to keep the code simple. The “to” directory is assumed not to exist initially, and exceptions are not ignored along the way. Here it is copying the book examples tree from the prior edition again on Windows:
C:\...\PP4E\Tools> set PYTHONPATH
PYTHONPATH=C:\Users\Mark\Stuff\Books\4E\PP4E\dev\Examples

C:\...\PP4E\Tools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\Tools> visitor_cpall.py C:\temp\PP3E\Examples copytemp
Copying...
Copied 1429 files, 186 directories in 11.1722033777 seconds

C:\...\PP4E\Tools> fc /B copytemp\PP3E\Launcher.py C:\temp\PP3E\Examples\PP3E\Launcher.py
Comparing files COPYTEMP\PP3E\Launcher.py and C:\TEMP\PP3E\EXAMPLES\PP3E\LAUNCHER.PY
FC: no differences encountered
Despite the extra string slicing going on, this version seems to run just as fast as the original (the actual difference can be chalked up to system load variations). For tracing purposes, this version also prints all the “from” and “to” copy paths during the traversal if you pass in a third argument on the command line:
C:\...\PP4E\Tools> rmdir /S copytemp
copytemp, Are you sure (Y/N)? y

C:\...\PP4E\Tools> visitor_cpall.py C:\temp\PP3E\Examples copytemp 1
Copying...
d C:\temp\PP3E\Examples => copytemp\
f C:\temp\PP3E\Examples\README-root.txt => copytemp\README-root.txt
d C:\temp\PP3E\Examples\PP3E => copytemp\PP3E
...more lines omitted: try this on your own for the full output...
Although the visitor is widely applicable, we don't have space to explore additional subclasses in this book. For more example clients and use cases, see the following examples in the book's examples distribution package described in the Preface:

Tools\visitor_collect.py collects and/or prints files containing a search string

Tools\visitor_poundbang.py replaces directory paths in "#!" lines at the top of Unix scripts

Tools\visitor_cleanpyc.py is a visitor-based recoding of our earlier bytecode cleanup scripts

Tools\visitor_bigpy.py is a visitor-based version of the "biggest file" example at the start of this chapter
Most of these are almost as trivial as the visitor_edit.py code in Example 6-19, because the
visitor framework handles walking details automatically. The
collector, for instance, simply appends to a list as a search
visitor detects matched files and allows the default list of text
filename extensions in the search visitor to be overridden per
instance—it’s roughly like a
combination of find
and grep
on Unix:
>>> from visitor_collect import CollectVisitor
>>> V = CollectVisitor('mimetypes', testexts=['.py', '.pyw'], trace=0)
>>> V.run(r'C:\temp\PP3E\Examples')
>>> for name in V.matches: print(name)       # .py and .pyw files with 'mimetypes'
...
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py

C:\...\PP4E\Tools> visitor_collect.py mimetypes C:\temp\PP3E\Examples       # as script
The core logic of the biggest-file visitor is similarly straightforward, and harkens back to the start of this chapter:
class BigPy(FileVisitor):
    def __init__(self, trace=0):
        FileVisitor.__init__(self, context=[], trace=trace)

    def visitfile(self, filepath):
        FileVisitor.visitfile(self, filepath)
        if filepath.endswith('.py'):
            self.context.append((os.path.getsize(filepath), filepath))
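Because the collected context entries are (size, path) tuples, Python's tuple comparison makes finding the winner after a run a one-liner; a quick standalone sketch of that final step, with invented data values:

```python
# (size, path) tuples as BigPy's visitfile collects them; values invented here
context = [(1024, 'a.py'), (8192, 'b.py'), (512, 'c.py')]

biggest = max(context)       # tuples compare by size first, then path
print(biggest)               # (8192, 'b.py')
```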
And the bytecode-removal visitor brings us back full circle,
showing an additional alternative to those we met earlier in this
chapter. It’s essentially the same code, but it runs os.remove
on “.pyc” file visits.
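The essential shape of that recoding can be sketched as follows; since the real base class lives in Tools\visitor.py, this standalone version uses a minimal stand-in so it can run on its own (the distribution's visitor_cleanpyc.py subclasses the actual FileVisitor instead):

```python
import os, tempfile

class FileVisitor:                         # stand-in for Tools\visitor.py's class
    def run(self, startDir=os.curdir):
        for thisDir, dirsHere, filesHere in os.walk(startDir):
            for fname in filesHere:
                self.visitfile(os.path.join(thisDir, fname))

class CleanPyc(FileVisitor):
    def __init__(self):
        self.removed = 0
    def visitfile(self, filepath):         # delete bytecode files as visited
        if filepath.endswith('.pyc'):
            os.remove(filepath)
            self.removed += 1

# usage sketch on a throwaway tree
root = tempfile.mkdtemp()
open(os.path.join(root, 'mod.pyc'), 'w').close()
open(os.path.join(root, 'mod.py'), 'w').close()
walker = CleanPyc()
walker.run(root)
print(walker.removed, sorted(os.listdir(root)))
```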
In the end, while the visitor classes are really just simple
wrappers for os.walk
, they
further automate walking chores and provide a general framework and
alternative class-based structure which may seem more natural to
some than simple unstructured loops. They’re also representative of
how Python’s OOP support maps well to real-world structures like
file systems. Although os.walk
works well for one-off scripts, the better extensibility, reduced
redundancy, and greater encapsulation possible with OOP can be a
major asset in real work as our needs change and evolve over
time.
In fact, those needs have changed over
time. Between the third and fourth editions of this book, the
original os.path.walk
call was
removed in Python 3.X, and os.walk
became the only automated way to
perform tree walks in the standard library. Examples from the
prior edition that used os.path.walk
were effectively broken. By
contrast, although the visitor classes used this call, too, its
clients did not. Because updating the visitor classes to use
os.walk
internally did not
alter those classes’ interfaces, visitor-based tools continued to
work unchanged.
This seems a prime example of the benefits of OOP’s support
for encapsulation. Although the future is never completely
predictable, in practice, user-defined tools like visitor tend to
give you more control over changes than standard library tools
like os.walk
. Trust me on that;
as someone who has had to update three Python books over the last
15 years, I can say with some certainty that Python change is a
constant!
We have space for just one last, quick example in this chapter,
so we’ll close with a bit of fun. Did you notice how the file
extensions for text and binary file types were hard-coded in the
directory search scripts of the prior two sections? That approach
works for the trees they were applied to, but it’s not necessarily
complete or portable. It would be better if we could deduce file type
from file name automatically. That’s exactly what Python’s mimetypes
module can
do for us. In this section, we’ll use it to build a script that
attempts to launch a file based upon its media type, and in the
process develop general tools for opening media portably with specific
or generic players.
As we’ve seen, on Windows this task is trivial—the os.startfile
call opens files per the
Windows registry, a system-wide mapping of file extension types to
handler programs. On other platforms, we can either run specific media
handlers per media type, or fall back on a resident web browser to
open the file generically using Python’s webbrowser
module.
Example 6-23 puts these ideas
into code.
#!/usr/local/bin/python
"""
##################################################################################
Try to play an arbitrary media file.  Allows for specific players instead of
always using general web browser scheme.  May not work on your system as is;
audio files use filters and command lines on Unix, and filename associations
on Windows via the start command (i.e., whatever you have on your machine to
run .au files--an audio player, or perhaps a web browser).  Configure and
extend as needed.  playknownfile assumes you know what sort of media you wish
to open, and playfile tries to determine media type automatically using the
Python mimetypes module; both try to launch a web browser with the Python
webbrowser module as a last resort when mimetype or platform unknown.
##################################################################################
"""

import os, sys, mimetypes, webbrowser

helpmsg = """
Sorry: can't find a media player for '%s' on your system!
Add an entry for your system to the media player dictionary
for this type of file in playfile.py, or play the file manually.
"""

def trace(*args): print(*args)                  # with spaces between

##################################################################################
# player techniques: generic and otherwise: extend me
##################################################################################

class MediaTool:
    def __init__(self, runtext=''):
        self.runtext = runtext
    def run(self, mediafile, **options):        # most ignore options
        fullpath = os.path.abspath(mediafile)   # cwd may be anything
        self.open(fullpath, **options)

class Filter(MediaTool):
    def open(self, mediafile, **ignored):
        media  = open(mediafile, 'rb')
        player = os.popen(self.runtext, 'w')    # spawn shell tool
        player.write(media.read())              # send to its stdin

class Cmdline(MediaTool):
    def open(self, mediafile, **ignored):
        cmdline = self.runtext % mediafile      # run any cmd line
        os.system(cmdline)                      # use %s for filename

class Winstart(MediaTool):                      # use Windows registry
    def open(self, mediafile, wait=False, **other):  # or os.system('start file')
        if not wait:                            # allow wait for curr media
            os.startfile(mediafile)
        else:
            os.system('start /WAIT ' + mediafile)

class Webbrowser(MediaTool):
    def open(self, mediafile, **options):       # file:// requires abs path
        webbrowser.open_new('file://%s' % mediafile, **options)

##################################################################################
# media- and platform-specific policies: change me, or pass one in
##################################################################################

# map platform to player: change me!

audiotools = {
    'sunos5':  Filter('/usr/bin/audioplay'),    # os.popen().write()
    'linux2':  Cmdline('cat %s > /dev/audio'),  # on zaurus, at least
    'sunos4':  Filter('/usr/demo/SOUND/play'),  # yes, this is that old!
    'win32':   Winstart()                       # startfile or system
   #'win32':   Cmdline('start %s')
    }

videotools = {
    'linux2':  Cmdline('tkcVideo_c700 %s'),     # zaurus pda
    'win32':   Winstart(),                      # avoid DOS pop up
    }

imagetools = {
    'linux2':  Cmdline('zimager %s'),           # zaurus pda
    'win32':   Winstart(),
    }

texttools = {
    'linux2':  Cmdline('vi %s'),                # zaurus pda
    'win32':   Cmdline('notepad %s')            # or try PyEdit?
    }

apptools = {
    'win32':   Winstart()                       # doc, xls, etc: use at your own risk!
    }

# map mimetype of filenames to player tables

mimetable = {'audio':       audiotools,
             'video':       videotools,
             'image':       imagetools,
             'text':        texttools,          # not html text: browser
             'application': apptools}

##################################################################################
# top-level interfaces
##################################################################################

def trywebbrowser(filename, helpmsg=helpmsg, **options):
    """
    try to open a file in a web browser; last resort if unknown
    mimetype or platform, and for text/html
    """
    trace('trying browser', filename)
    try:
        player = Webbrowser()                   # open in local browser
        player.run(filename, **options)
    except:
        print(helpmsg % filename)               # else nothing worked

def playknownfile(filename, playertable={}, **options):
    """
    play media file of known type: uses platform-specific
    player objects, or spawns a web browser if nothing for
    this platform; accepts a media-specific player table
    """
    if sys.platform in playertable:
        playertable[sys.platform].run(filename, **options)   # specific tool
    else:
        trywebbrowser(filename, **options)                   # general scheme

def playfile(filename, mimetable=mimetable, **options):
    """
    play media file of any type: uses mimetypes to guess media
    type and map to platform-specific player tables; spawn web
    browser if text/html, media type unknown, or has no table
    """
    contenttype, encoding = mimetypes.guess_type(filename)   # check name
    if contenttype == None or encoding is not None:          # can't guess,
        contenttype = '?/?'                                  # poss .txt.gz
    maintype, subtype = contenttype.split('/', 1)            # 'image/jpeg'
    if maintype == 'text' and subtype == 'html':
        trywebbrowser(filename, **options)                   # special case
    elif maintype in mimetable:
        playknownfile(filename, mimetable[maintype], **options)  # try table
    else:
        trywebbrowser(filename, **options)                   # other types

###############################################################################
# self-test code
###############################################################################

if __name__ == '__main__':
    # media type known
    playknownfile('sousa.au', audiotools, wait=True)
    playknownfile('ora-pp3e.gif', imagetools, wait=True)
    playknownfile('ora-lp4e.jpg', imagetools)

    # media type guessed
    input('Stop players and press Enter')
    playfile('ora-lp4e.jpg')                    # image/jpeg
    playfile('ora-pp3e.gif')                    # image/gif
    playfile('priorcalendar.html')              # text/html
    playfile('lp4e-preface-preview.html')       # text/html
    playfile('lp-code-readme.txt')              # text/plain
    playfile('spam.doc')                        # app
    playfile('spreadsheet.xls')                 # app
    playfile('sousa.au', wait=True)             # audio/basic
    input('Done')                               # stay open if clicked
Although it’s generally possible to open most media files by passing their names to a web browser these days, this module provides a simple framework for launching media files with more specific tools, tailored by both media type and platform. A web browser is used only as a fallback option, if more specific tools are not available. The net result is an extendable media file player, which is as specific and portable as the customizations you provide for its tables.
We’ve seen the program launch tools employed by this script in
prior chapters. The script’s main new concepts have to do with the
modules it uses: the webbrowser
module to open some files in a local web browser, as well as the
Python mimetypes
module to
determine media type from a file's name. Since these modules are at the
heart of this script's operation, let's explore them briefly before we run
the script.
The standard library webbrowser
module used by this example provides a portable interface for
launching web browsers from Python scripts. It attempts to locate a
suitable web browser on your local machine to open a given URL (file
or web address) for display. Its interface is
straightforward:
>>> import webbrowser
>>> webbrowser.open_new('file://' + fullfilename)       # use os.path.abspath()
This code will open the named file in a new web browser window
using whatever browser is found on the underlying computer, or raise
an exception if it cannot. You can tailor the browsers used on your
platform, and the order in which they are attempted, by using the
BROWSER
environment variable and
register
function. By default,
webbrowser
attempts to be
automatically portable across platforms.
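For instance, register can install a browser controller object of your own under a chosen name, which a later get call retrieves. The following is a hypothetical sketch only: the 'my-browser' command name is made up for illustration, and nothing is actually launched here.

```python
import webbrowser

# Wrap a made-up command line in a GenericBrowser controller; GenericBrowser
# objects run their command with the URL appended when open() is called,
# but we never call open() in this sketch, so no program is spawned.
custom = webbrowser.GenericBrowser('my-browser')    # 'my-browser' is illustrative

# register the controller under a name of our choosing
webbrowser.register('mybrowser', None, custom)

# get() returns the controller registered under that name
handler = webbrowser.get('mybrowser')
print(handler is custom)                            # prints True
```

Setting the BROWSER environment variable to a command line achieves a similar effect without code changes, by prepending that command to the list of browsers the module tries.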
Use an argument string of the form “file://...” or “http://...” to open a file on the local computer or web server, respectively. In fact, you can pass in any URL that the browser understands. The following pops up Python’s home page in a new locally-running browser window, for example:
>>> webbrowser.open_new('http://www.python.org')
Among other things, this is an easy way to display HTML
documents as well as media files, as demonstrated by this section’s
example. For broader applicability, this module can be used as both
command-line script (Python’s -m
module search path flag helps here) and as importable tool:
C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser about-pp.html
C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser -n about-pp.html
C:\Users\mark\Stuff\Websites\public_html> python -m webbrowser -t about-pp.html

C:\Users\mark\Stuff\Websites\public_html> python
>>> import webbrowser
>>> webbrowser.open('about-pp.html')            # reuse, new window, new tab
True
>>> webbrowser.open_new('about-pp.html')        # file:// optional on Windows
True
>>> webbrowser.open_new_tab('about-pp.html')
True
In both modes, the difference between the three usage forms is that the first tries to reuse an already-open browser window if possible, the second tries to open a new window, and the third tries to open a new tab. In practice, though, their behavior is totally dependent on what the browser selected on your platform supports, and even on the platform in general. All three forms may behave the same.
On Windows, for example, all three simply run os.startfile
by default and thus create a
new tab in an existing window under Internet Explorer 8. This is
also why I didn’t need the “file://” full URL prefix in the
preceding listing. Technically, Internet Explorer is only run if
this is what is registered on your computer for the file type being
opened; if not, that file type’s handler is opened instead. Some
images, for example, may open in a photo viewer instead. On other
platforms, such as Unix and Mac OS X, browser behavior differs, and
non-URL file names might not be opened; use “file://” for portability.
We’ll use this module again later in this book. For example,
the PyMailGUI program in Chapter 14
will employ it as a way to display HTML-formatted email messages and
attachments, as well as program help. See the Python library manual
for more details. In Chapters 13 and 15,
we’ll also meet a related call, urllib.request.urlopen
, which fetches a
web page’s text given a URL, but does not open it in a browser; it
may be parsed, saved, or otherwise used.
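As a quick preview of that call, the following sketch fetches a page's text with urlopen. To keep the example runnable without network access, it builds a file:// URL from a temporary local file; a real http:// address is handled by the very same call.

```python
import pathlib, tempfile, urllib.request

# write a small local "page", then build a file:// URL for it
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write('<html><body>spam</body></html>')
    url = pathlib.Path(f.name).as_uri()         # e.g. file:///tmp/xyz.html

# urlopen fetches the content but does not open a browser
with urllib.request.urlopen(url) as reply:
    text = reply.read().decode('utf-8')         # read() returns bytes in 3.X

print('spam' in text)                           # prints True
```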
To make this media player module even more useful, we also use the
Python mimetypes
standard library module to
automatically determine the media type from the filename. We get
back a type/subtype
MIME
content-type string if the type can be determined or None
if the guess failed:
>>> import mimetypes
>>> mimetypes.guess_type('spam.jpg')
('image/jpeg', None)
>>> mimetypes.guess_type('TheBrightSideOfLife.mp3')
('audio/mpeg', None)
>>> mimetypes.guess_type('lifeofbrian.mpg')
('video/mpeg', None)
>>> mimetypes.guess_type('lifeofbrian.xyz')     # unknown type
(None, None)
Stripping off the first part of the content-type string gives the file’s general media type, which we can use to select a generic player; the second part (subtype) can tell us if text is plain or HTML:
>>> contype, encoding = mimetypes.guess_type('spam.jpg')
>>> contype.split('/')[0]
'image'

>>> mimetypes.guess_type('spam.txt')            # subtype is 'plain'
('text/plain', None)
>>> mimetypes.guess_type('spam.html')
('text/html', None)
>>> mimetypes.guess_type('spam.html')[0].split('/')[1]
'html'
A subtle thing: the second item in the tuple returned from the
mimetypes
guess is an encoding
type we won’t use here for opening purposes. We still have to pay
attention to it, though—if it is not None
, it means the file is compressed
(gzip
or compress
), even if we receive a media
content type. For example, if the filename is something like
spam.gif.gz, it’s a compressed image that we
don’t want to try to open directly:
>>> mimetypes.guess_type('spam.gz')             # content unknown
(None, 'gzip')
>>> mimetypes.guess_type('spam.gif.gz')         # don't play me!
('image/gif', 'gzip')
>>> mimetypes.guess_type('spam.zip')            # archives
('application/zip', None)
>>> mimetypes.guess_type('spam.doc')            # office app files
('application/msword', None)
If the filename you pass in contains a directory path, the path portion is ignored (only the extension is used). This module is even smart enough to give us a filename extension for a type—useful if we need to go the other way, and create a file name from a content type:
>>> mimetypes.guess_type(r'C:\songs\sousa.au')
('audio/basic', None)
>>> mimetypes.guess_extension('audio/basic')
'.au'
Try more calls on your own for more details. We’ll use the
mimetypes
module again in FTP
examples in Chapter 13 to determine
transfer type (text or binary), and in our email examples in
Chapters 13, 14,
and 16 to send, save, and open mail
attachments.
In Example 6-23, we use
mimetypes
to select a table of
platform-specific player commands for the media type of the file to
be played. That is, we pick a player table for the file’s media
type, and then pick a command from the player table for the
platform. At both steps, we give up and run a web browser if there
is nothing more specific to be done.
To use this module to direct the text file search scripts we wrote earlier in this chapter, simply extract the first item in the content type returned for a file’s name. For instance, all the extensions in the following list are considered text (except “.pyw”, which we may have to special-case if we care):
>>> for ext in ['.txt', '.py', '.pyw', '.html', '.c', '.h', '.xml']:
...     print(ext, mimetypes.guess_type('spam' + ext))
...
.txt ('text/plain', None)
.py ('text/x-python', None)
.pyw (None, None)
.html ('text/html', None)
.c ('text/plain', None)
.h ('text/plain', None)
.xml ('text/xml', None)
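If the missing “.pyw” entry matters for your searches, one simple fix is to register the extension yourself with mimetypes.add_type, which updates the module's process-wide mapping tables. Mapping it to 'text/x-python' is our own assumption here, mirroring the standard '.py' entry.

```python
import mimetypes

# teach mimetypes about '.pyw' so later guesses treat it like '.py'
# source code; note this modifies the module's global tables
mimetypes.add_type('text/x-python', '.pyw')

print(mimetypes.guess_type('spam.pyw'))     # prints ('text/x-python', None)
```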
We can add this technique to our earlier SearchVisitor
class by redefining its
candidate selection method, in order to replace its default
extension lists with mimetypes
guesses—yet more evidence of
the power of OOP customization at work:
C:\...\PP4E\Tools> python
>>> import mimetypes
>>> from visitor import SearchVisitor       # or PP4E.Tools.visitor if not .
>>>
>>> class SearchMimeVisitor(SearchVisitor):
...     def candidate(self, fname):
...         contype, encoding = mimetypes.guess_type(fname)
...         return (contype and
...                 contype.split('/')[0] == 'text' and
...                 encoding == None)
...
>>> V = SearchMimeVisitor('mimetypes', trace=0)             # search key
>>> V.run(r'C:\temp\PP3E\Examples')                         # root dir
C:\temp\PP3E\Examples\PP3E\extras\LosAlamosAdvancedClass\day1-system\data.txt has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailParser.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Email\mailtools\mailSender.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\downloadflat_modular.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\ftptools.py has mimetypes
C:\temp\PP3E\Examples\PP3E\Internet\Ftp\mirror\uploadflat.py has mimetypes
C:\temp\PP3E\Examples\PP3E\System\Media\playfile.py has mimetypes
>>> V.scount, V.fcount, V.dcount
(8, 1429, 186)
Because this is not completely accurate, though (you may need to add logic to include extensions such as “.pyw” missed by the guess), and because it’s not even appropriate for all search clients (some may want to search only specific kinds of text), this scheme was not used for the original class. Using and tailoring it for your own searches is left as an optional exercise.
Now, when Example 6-23 is run from the command line, if all goes well its canned self-test code at the end opens a number of audio, image, text, and other file types located in the script’s directory, using either platform-specific players or a general web browser. On my Windows 7 laptop, GIF and HTML files open in new IE browser tabs; JPEG files in Windows Photo Viewer; plain text files in Notepad; DOC and XLS files in Microsoft Word and Excel; and audio files in Windows Media Player.
Because the programs used and their behavior may vary widely from machine to machine, though, you’re best off studying this script’s code and running it on your own computer and with your own test files to see what happens. As usual, you can also test it interactively (use the package path like this one to import from a different directory, assuming your module search path includes the PP4E root):
>>> from PP4E.System.Media.playfile import playfile
>>> playfile(r'C:\movies\mov10428.mpg')         # video/mpeg
We’ll use the playfile
module again as an imported library like this in Chapter 13 to open media files downloaded by
FTP. Again, you may want to tweak this script’s tables for your
players. This script also assumes the media file is located on the
local machine (even though the webbrowser
module supports remote files
with “http://” names), and it does not currently allow different
players for most different MIME subtypes (it special-cases text to
handle “plain” and “html” differently, but no others). In fact, this
script is really just something of a simple framework that was
designed to be extended. As always, hack on; this is Python,
after all.
Finally, some optional reading—in the examples distribution package for this book (available at sites listed in the Preface) you can find additional system-related scripts we do not have space to cover here:
PP4E\Launcher.py—contains tools used by
some GUI programs later in the book to start Python programs
without any environment configuration. Roughly, it sets up both
the system path and module import search paths as needed to run
book examples, which are inherited by spawned programs. By using
this module to search for files and configure environments
automatically, users can avoid (or at least postpone) having to
learn the intricacies of manual environment configuration before
running programs. Though there is not much new in this example
from a system interfaces perspective, we’ll refer back to it
later, when we explore GUI programs that use its tools, as well as
those of its launchmodes
cousin, which we wrote in Chapter 5.
PP4E\Launch_PyDemos.pyw
and PP4E\Launch_PyGadgets_bar.pyw—use
Launcher.py to start major
GUI book examples without any environment configuration. Because
all spawned processes inherit configurations performed by the
launcher, they all run with proper search path settings. When run
directly, the underlying PyDemos2.pyw and
PyGadgets_bar.pyw scripts (which we’ll
explore briefly at the end of Chapter 10) instead rely on the
configuration settings on the underlying machine. In other words,
Launcher
effectively hides
configuration details from the GUI interfaces by enclosing them in
a configuration program layer.
PP4E\LaunchBrowser.pyw—portably locates
and starts an Internet web browser program on the host machine in
order to view a local file or remote web page. In prior versions,
it used tools in Launcher.py
to search for a reasonable browser to run. The original version of
this example has now been largely superseded by the standard
library’s webbrowser
module,
which arose after this example had been developed (reptilian minds
think alike!). In this edition, LaunchBrowser
simply parses command-line
arguments for backward compatibility and invokes the open
function in webbrowser
. See this module’s help text,
or PyGadgets and PyDemos in Chapter 10, for example command-line
usage.
That’s the end of our system tools exploration. In the next part of this book we leave the realm of the system shell and move on to explore ways to add graphical user interfaces to our programs. Later, we’ll do the same using web-based approaches. As we continue, keep in mind that the system tools we’ve studied in this part of the book see action in a wide variety of programs. For instance, we’ll put threads to work to spawn long-running tasks in the GUI part, use both threads and processes when exploring server implementations in the Internet part, and use files and file-related system calls throughout the remainder of the book.
Whether your interfaces are command lines, multiwindow GUIs, or distributed client/server websites, Python’s system interfaces toolbox is sure to play an important part in your Python programming future.
[18] For a related print
issue, see Chapter 14’s workaround
for program aborts when printing stack tracebacks to standard
output from spawned programs. Unlike the problem described here,
that issue does not appear to be related to Unicode characters
that may be unprintable in shell windows but reflects another
regression for standard output prints in general in Python 3.1,
which may or may not be repaired by the time you read this text.
See also the Python environment variable PYTHONIOENCODING, which
can override the default encoding used for standard
streams.
[19] I should note that this background story stems from the second edition of this book, written in 2000. Some ten years later, floppies have largely gone the way of the parallel port and the dinosaur. Moreover, burning a CD or DVD is no longer as painful as it once was; there are new options today such as large flash memory cards, wireless home networks, and simple email; and naturally, my home computer’s configuration isn’t what it once was. For that matter, some of my kids are no longer kids (though they’ve retained some backward compatibility with their former selves).
[20] It turns out that the zip
, gzip
, and tar
commands can all be replaced with
pure Python code today, too. The gzip
module in the Python standard
library provides tools for reading and writing compressed
gzip
files, usually named
with a .gz filename extension. It can serve
as an all-Python equivalent of the standard gzip
and gunzip
command-line utility programs.
This built-in module uses another module called zlib
that implements gzip
-compatible data compressions. In
recent Python releases, the zipfile
module can be imported to make
and use ZIP format archives (zip
is an archive and compression
format, gzip
is a compression
scheme), and the tarfile
module allows scripts to read and write tar archives. See the
Python library manual for details.
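As a minimal sketch of these all-Python replacements, the following writes and reads back both a gzip file and a ZIP archive without shelling out to any external command; all file and member names here are invented for the demo.

```python
import gzip, os, tempfile, zipfile

with tempfile.TemporaryDirectory() as tmpdir:       # scratch dir, auto-removed

    # gzip: read/write .gz compressed files without gzip/gunzip commands
    gzname = os.path.join(tmpdir, 'data.txt.gz')
    with gzip.open(gzname, 'wt') as f:              # 'wt' = compressed text mode
        f.write('spam and eggs')
    with gzip.open(gzname, 'rt') as f:
        data = f.read()

    # zipfile: create and read a ZIP archive without the zip command
    zipname = os.path.join(tmpdir, 'data.zip')
    with zipfile.ZipFile(zipname, 'w') as archive:
        archive.writestr('member.txt', 'ham')       # add a member from a string
    with zipfile.ZipFile(zipname) as archive:
        member = archive.read('member.txt').decode()

print(data, member)                                 # prints spam and eggs ham
```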
[21] It happens. In fact, most people who spend any substantial amount of time in cyberspace could probably tell a horror story or two. Mine goes like this: a number of years ago, I had an account with an ISP that went completely offline for a few weeks in response to a security breach by an ex-employee. Worse, not only was personal email disabled, but queued up messages were permanently lost. If your livelihood depends on email and the Web as much as mine does, you’ll appreciate the havoc such an outage can wreak.
[22] In fact, the act of searching files often goes by the colloquial name “grepping” among developers who have spent any substantial time in the Unix ghetto.