2
Scripting with Python

WHAT YOU WILL LEARN IN THIS CHAPTER:    

  • Accessing and managing computer resources via the operating system
  • Handling common file formats such as CSV and XML
  • Working with dates and times
  • Automating applications and accessing their APIs
  • Using third-party modules to extend automation beyond the standard library capabilities

WROX.COM DOWNLOADS FOR THIS CHAPTER

For this chapter the wrox.com code downloads are found at www.wrox.com/go/pythonprojects on the Download Code tab. The code is in the Chapter 2 download, called Chapter2.zip, and individually named according to the names throughout the chapter.

Often, you may find yourself undertaking tasks that involve many repetitive operations. To combat this repetition of work, it may be possible to write a macro to automate those operations within a single application but, if the operations span several applications, macros are rarely effective. For example, if you back up and archive a large multimedia web application, you may have to deal with content produced by one or more media tools, code from an IDE, and probably some database files, too. Instead of macros, you need an external programming tool to drive each application, or utility, to perform its part of the whole. Python is well suited to this kind of orchestration role.

In this chapter you learn how to use Python modules to check user settings as well as directory and file access levels; set up the correct environment for an operation; and launch and control external programs from your script. You also discover how Python modules help you access data in common file formats, how to handle dates and times and, finally, how to directly access the low-level programming interfaces of external applications using the very powerful ctypes module and, for Windows, the pywin32 package.

Accessing the Operating System

Most of the tasks that a typical programmer needs to undertake using the operating system—for example, collecting user information or navigating the file system—can be done in a generic way using Python’s standard library of modules. (Recall that modules are reusable pieces of code that can be shared across multiple programs.) The key modules have been written in such a way that the peculiarities of individual operating system behaviors have been hidden behind a higher level set of objects and operations. The modules that you consider in this section are: os/path, pwd, glob, shutil, and subprocess. The material here focuses on how to use these modules in common scenarios; it does not try to cover every possible permutation or available option.

The os module, as the name suggests, provides access to many operating system features. It is, in fact, a package with a submodule, os.path, that deals with managing file paths, names, and types. The os module is supported by a number of other modules that you meet as you work through the various topics in this chapter. These myriad modules are collectively referred to as the OS modules (uppercase) and the actual os module as os (lowercase). If you are familiar with systems programming on a UNIX system, or even with using a UNIX shell such as Bash, many of these operations will be familiar to you.

The OS is primarily there to manage access to the computer’s hardware in the form of CPU, memory, storage, and networking. It regulates access to these resources and manages the creation, scheduling, and removal of processes. The OS module functions provide insight and control over these OS activities. In the next few sections, you look at these common tasks:

  • Collecting user and system information
  • Managing processes
  • Determining file information
  • Manipulating files
  • Navigating folders

Obtaining Information About Users and Their Computer

One of the first things you can do when exploring the OS modules is to find out what they can tell you about users. Specifically, you can find out the user’s ID, login name, and some of their default settings.

Like most new things in Python, the best way to get familiar is via the interactive prompt, so fire up the Python interpreter and try it out.
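A minimal sketch of the kind of session you might try, assuming a UNIX-like system (the IDs and names shown in the comments will differ on your machine; getpass.getuser() also works on Windows):

```python
import os
import getpass
import pwd  # UNIX only; not available on Windows

# The numeric user ID and the login name
print(os.getuid())        # e.g. 1000 on a typical Linux desktop
print(getpass.getuser())  # the login name, e.g. 'agauld'

# On UNIX the pwd module reveals the full password-database entry
entry = pwd.getpwuid(os.getuid())
print(entry.pw_name, entry.pw_dir, entry.pw_shell)
```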

Next you find out what kind of permissions the users have on files they create. This is significant because it affects any files your code produces. It may be that you need to temporarily alter the permissions—for example, if you need to create a file that you execute later in the program, it needs to have execute privileges. In UNIX, these settings are stored in something known as a umask or user mask. It is a bitmask, like the ones you used at the end of Chapter 1, where each bit represents a user-access data point, as described next.

Python lets you look at the umask value, even on Windows, using the os.umask() function. The os.umask() function has a slight quirk in its usage, however. It expects you to pass a new value to the function; it then sets that value and returns the old one. There is no read-only option, so if you only want to find out the current value, you need to set the umask to a temporary new value, capture the old value from the return result, and then restore the original. The format of the mask is very compact, consisting of 3 groups of 3 bits, 1 group for each of Owner, Group, and World permissions, respectively.

Within a group the 3 bits each represent one type of access—read, write, or execute. These are most conveniently written using explicit binary notation. Table 2.1 shows how each 3-bit binary value maps onto permissions.

Table 2.1 Umask Binary Mappings

UMASK BINARY VALUE   READ, WRITE, EXECUTE VALUES
000                  Read = True,  Write = True,  Execute = True
001                  Read = True,  Write = True,  Execute = False
010                  Read = True,  Write = False, Execute = True
011                  Read = True,  Write = False, Execute = False
100                  Read = False, Write = True,  Execute = True
101                  Read = False, Write = True,  Execute = False
110                  Read = False, Write = False, Execute = True
111                  Read = False, Write = False, Execute = False

Now that you understand what you are trying to do, it’s time to try it out.
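A short sketch of the set-read-restore dance just described (the printed value depends entirely on your system’s settings):

```python
import os

# os.umask() sets a new mask and returns the previous one, so to
# merely inspect it we set a dummy value and then put it straight back
old = os.umask(0o022)   # set a temporary value, capturing the old one
os.umask(old)           # immediately restore the original
print(oct(old))         # e.g. '0o22' on many UNIX systems
```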

Sometimes you want to know what kind of computer system the user is running, in particular the details of the OS itself. Python has several ways of doing this, but the one you look at first is the os.name property. At the time of writing, this property returns one of the following values: posix, nt, mac, os2, ce, or java.

Another place to look for the system the user is running is in the sys module and, in particular, the sys.platform attribute. This attribute often returns slightly different information than that found using os.name. For example, Windows is reported as win32 rather than nt or ce. On UNIX another function in os called os.uname() provides slightly more detail. If you have several different OSes available to you, it can be interesting to compare the results from these different techniques. It is recommended that you use the os.name option simply because it is universally available and returns a well-defined set of results.
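You can compare the techniques side by side; the values in the comments are typical for Linux and are assumptions about your particular machine:

```python
import os
import sys

print(os.name)        # 'posix' on Linux and macOS, 'nt' on Windows
print(sys.platform)   # 'linux', 'darwin', 'win32', and so on

if os.name == 'posix':
    # os.uname() only exists on UNIX-like systems
    print(os.uname().sysname)
```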

One other snippet of information that is often useful to collect is the size of the user’s terminal in terms of its lines and columns. You can use this information to modify the display of messages from your scripts. The shutil module provides a function for this called shutil.get_terminal_size(), and it is used like this:

  >>> import shutil
  >>> cols, lines = shutil.get_terminal_size()
  >>> cols
  80
  >>> lines
  49

If the terminal size cannot be ascertained, the default return value is 80 × 24. A different default can be specified as an optional argument, but 80 × 24 is usually a sensible option because it’s the traditional size for terminal emulators.

Obtaining Information About the Current Process

It can be useful for a program to know something about its current status and runtime environment. For example, you might want to know the process identity or if the process has a preferred folder in which to write its data files or read configuration data. The OS modules provide functions for determining these values.

One such source of process information is the process environment, as defined by environment variables. The os module provides a dictionary called os.environ that holds all the environment variables for the current process.

The disadvantage of environment variables is that they are highly volatile. Users can create them and remove them. Applications can do likewise, so it is dangerous to rely on the existence of an environment variable; you should always have a default value that you can fall back on. Fortunately, some values are fairly reliable and usually present. Three of these are particularly useful for Windows users because the pwd.getpwuid() and os.uname() functions discussed earlier are not available. These are HOME, OS, and PROCESSOR_ARCHITECTURE.

If you do try to access a variable that is not defined, you get the usual Python dictionary KeyError. On most, but not all, operating systems, a program can set, or modify, environment variables. If this feature is supported for your OS, then Python reflects any changes to the os.environ dictionary back into the OS environment. In addition to using environment variables as a source of user information, it is quite common to use them to define user-specific configuration details about a program—for example, the location of a database. This practice is slightly frowned upon nowadays, and it’s considered better to use a configuration file for such details. But if you are working with older applications, you may need to refer to the environment for such things.
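In line with the advice above, always supply a fallback when reading such variables. A brief sketch, where DBHOME is a hypothetical variable used only for illustration:

```python
import os

# Dict-style access (os.environ['DBHOME']) raises KeyError if the
# variable is unset, so prefer .get() with a sensible default
db_location = os.environ.get('DBHOME', '/var/data/mydb')
print(db_location)

# Writing to os.environ is reflected back into the process environment
os.environ['MYAPP_MODE'] = 'test'
print(os.environ['MYAPP_MODE'])
```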

Managing Other Programs

It is often useful to be able to run other programs from within a script, and the subprocess module is the preferred tool for this. The subprocess module contains a class called Popen that provides a very powerful and flexible interface to external programs. The module also has several convenience functions that you can use when a simpler approach is preferred. The documentation describes how to use all of these features; in this section you use only the simplest function, subprocess.call(), and the Popen class.

The most basic use of the subprocess module is to call an external OS command and simply allow it to run its course. The output is usually displayed on screen or stored in a data file somewhere. After the program completes, you can ask the user to make some kind of selection based on what was displayed or you can access the data file directly from your code. You can force many OS tools, especially on UNIX-based systems, into producing a data file as output by providing suitable command-line options or by using OS file redirection. This technique is a very powerful way to harness the power of OS utilities in a way that Python can use for further processing.

This basic mechanism for calling a program is wrapped up in the subprocess.call() function. This function has a list of strings as its first parameter, followed by several optional keyword parameters that are used to control the input and output locations and a few other things.

The easiest way to see how it works is to try it out.
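A minimal sketch, assuming a UNIX-like system (Windows users can try ['cmd', '/C', 'dir'] instead):

```python
import subprocess

# Run the command, let its output go straight to the terminal,
# and collect the exit status when it finishes
rc = subprocess.call(['ls', '-l'])
print('exit status:', rc)   # 0 conventionally means success
```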

One problem that can occur when running external programs is that the OS cannot find the command. You generally get an error message when this happens, and you need to explicitly provide the full path to the program file, assuming it does actually exist.

Finally, consider how to stop a running process. For interactive programs, the simplest way is for the user to close the external program in the normal way, or to issue an interrupt signal using Ctrl+C or Ctrl+Z, or whatever is the norm on the user’s OS. But for non-interactive programs, you may need to intervene from the OS, usually by examining the list of running processes and explicitly terminating the errant process.
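If your script started the process itself via Popen, you can also stop it programmatically. A sketch, assuming a UNIX-like system with the sleep command available:

```python
import subprocess
import time

proc = subprocess.Popen(['sleep', '60'])   # a long-running dummy process
time.sleep(0.2)                            # give it a moment to start
proc.terminate()                           # politely ask it to stop (SIGTERM on UNIX)
proc.wait()                                # reap it and collect the status
print(proc.returncode)                     # negative signal number on UNIX
```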

You have just seen how easy it is to use subprocess.call() to start an external process. You now learn how the subprocess module gives you much more control over processes and, in particular, how it enables your program to interact with them while they are running, especially how to read the process output directly from your script.

Managing Subprocesses More Effectively

You can use the Popen class to create an instance of a process, or command. Unfortunately, the documentation can appear rather daunting because the Popen constructor has quite a few parameters. The good news is that nearly all of those parameters have useful default values and can be ignored in the simplest cases. Thus, to simply run an OS command from within a script, you only need to do this (Windows users should substitute the dir command from the previous example):

  >>> import subprocess as sub
  >>> sub.Popen('ls *.*', shell=True)
  <subprocess.Popen object at 0x7fd3edec>
  >>> book tmp

Notice the shell=True argument, and that the command is now passed as a single string rather than a list of words. shell=True hands that string to the OS command processor, or shell, for interpretation. Doing so ensures that the wildcard characters ('*.*'), as well as any string quotes and the like, are all interpreted the way you expect. If you do not use the shell parameter, and pass the command as a list instead, this happens:

  >>> sub.Popen(['ls', '*.*'])
  <subprocess.Popen object at 0x7fcd328c>
  >>> ls: cannot access *.*: No such file or directory

Without shell=True, no wildcard expansion takes place; the ls command is asked to list a file with the literal name '*.*', which doesn’t exist.

The problem with using shell=True is that it also creates security issues in the form of a potential injection attack, so never use this if the commands are formulated from dynamically created strings, such as those read from a file or from a user.

To access the output of the command being run, you can add a couple of extra features to the call, like so:

  >>> lsout = sub.Popen('ls *.*', shell=True, stdout=sub.PIPE).stdout
  >>> for line in lsout:
  ...     print(line)

Here you specify that stdout should be a sub.PIPE and then assign the stdout attribute of the Popen instance to lsout. (A pipe is just a data connection to another process, in this case between your program and the command that you are executing.) Having done so, you can then treat the lsout variable just like a normal Python file and read from it—and so on.

You can send data into the process in much the same way by specifying that stdin is a pipe to which you can then write. The valid values that you can assign to the various streams include open files, file descriptors, or other streams (so that stderr can be made to appear on stdout, for example). Note that it’s possible to chain external commands together by setting, for example, the input of the second program to be the output of the first. That produces a similar effect to using the OS pipe character (|) on a command line.
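As a sketch of that chaining idea, here is the equivalent of the shell pipeline ls | sort -r, assuming a UNIX-like system:

```python
import subprocess as sub

p1 = sub.Popen(['ls'], stdout=sub.PIPE)
p2 = sub.Popen(['sort', '-r'], stdin=p1.stdout, stdout=sub.PIPE)
p1.stdout.close()              # lets p1 receive SIGPIPE if p2 exits early
listing = p2.communicate()[0]  # the sorted directory listing, as bytes
print(listing.decode())
```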

In the Try It Out examples, you accessed stdin and stdout directly; however, this can sometimes cause problems, especially when running processes concurrently or within threads, because the pipes can fill up and block the process. To avoid these issues, it’s recommended that you use the Popen.communicate() method and index the resulting tuple for the appropriate stream. This is slightly more complex to use, but it avoids the problems just mentioned. Popen.communicate() takes an input string (equivalent to stdin) and returns a tuple whose first element is the content of stdout and whose second is the content of stderr. So, repeating the file listing example using Popen.communicate() looks like this:

  >>> ls = sub.Popen(['ls'], stdout=sub.PIPE)
  >>> lsout = ls.communicate()[0]
  >>> print(lsout)
  b'fileA.txt\nfileB.txt\nls.txt\n'
  >>>

To conclude this section, it is worth pointing out that, for simplicity, you have been using fairly basic commands, such as ls, in the examples. Many of these commands can be performed equivalently from within Python itself (as you see shortly). The real value in mechanisms like subprocess.call() and Popen() is in running much more complex programs such as file conversion utilities and image-processing batch tools. Writing the equivalent functionality of these tools in Python would be a major project, so calling the external program is a more sensible alternative. You use Python where it is strongest, in orchestrating and validating the inputs and outputs, but leave the “heavy lifting” to the more specialized applications.

Obtaining Information About Files (and Devices)

The os module is heavily biased to the UNIX way of doing things. As such it treats devices and files similarly. So finding out about devices such as the current terminal session looks a lot like finding out about files. In this section you now look at how you can determine file status and permissions and even how to change some of their properties from within your programs. Consider the following code:

  >>> import os
  >>> os.listdir('.')
  ['fileA.txt', 'fileB.txt', 'ls.txt', 'test.txt']
  >>> os.stat('fileA.txt')
  posix.stat_result(st_mode=33204, st_ino=1125899907117103,
  st_dev=1491519654, st_nlink=1, st_uid=1001, st_gid=513,
  st_size=257, st_atime=1388676837, st_mtime=1388677418,
  st_ctime=1388677418)

Here you checked the current directory ('.') listing with os.listdir(). (Now that you’ve seen os.listdir(), you hopefully realize that your use of ls or dir in subprocess was rather artificial because os.listdir() does the same job directly from Python, and does it more efficiently.) You then used the os.stat() function to get some information about one of the files. This function returns a named tuple object that contains 10 items of interest. Perhaps the most useful of these are st_uid, st_size, and st_mtime. These values represent the file owner’s user ID, the size, and the last modification date/time. The times are integers that must be decoded using the time module, like so:

  >>> import time
  >>> time.localtime(1388677418)
  time.struct_time(tm_year=2014, tm_mon=1, tm_mday=2, tm_hour=15,
  tm_min=43, tm_sec=38, tm_wday=3, tm_yday=2, tm_isdst=0)
  >>> time.strftime("%Y-%m-%d", time.localtime(1388677418))
  '2014-01-02'

Here you used the time module’s localtime() function to convert the integer st_mtime value into a time tuple showing the local time values and from there into a readable date string using the time.strftime() function with a suitable format string. (You look more closely at the time module in the “Using the Time Module” section later in this chapter.)

The simple 10-value tuple returned from os.stat() is generally convenient, but more details are available via os.stat() than the tuple provides directly. Some of these additional values are OS dependent, such as the st_obtype attribute found on RiscOS systems. You need to do a little bit more work to dig these out. You can access the details by using object attribute dot notation.

Perhaps the most interesting field that you can access from os.stat() is the st_mode value, which tells you about the access permissions of the file. You use it like this:

  >>> import os
  >>> stats = os.stat('fileA.txt')
  >>> stats.st_mode
  33204

But that’s not too helpful; it’s just an apparently random number! The secret lies in the individual bits making up the number; it’s another bitmask. You may recall the umask bitmask that you looked at earlier in the chapter. The st_mode is conceptually similar to the umask, but with the bit meanings reversed. You can see how the access details are encoded by looking at the last 9 bits, like this:

  >>> bin(stats.st_mode)[-9:]
  '110110100'

By using the bin() function in combination with a slice, you have extracted the binary representation of the last 9 bits. Reading those as 3 groups of 3, you can see the read/write/execute values for Owner, Group, and World, respectively. Thus, in this example, Owner and Group each have the read and write bits set to 1 (True) but the execute bit set to 0 (False), while World has only the read bit set to 1 (True). (Note that here a 1 grants access—the direct inverse of the meanings of the umask bits; do not confuse the two!)

The higher order bits also have meanings, and the stat module contains a set of bitmasks that can be used to extract the details on a bit-by-bit basis. For most purposes the preceding access bits are sufficient, and helper functions exist in the os.path module that enable you to access that information. You’ll revisit this theme when you look at os.path later in the chapter.
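As a sketch of that bit-by-bit approach, here is how the stat module’s masks pick out individual permission bits from st_mode (the file is a throwaway created just for the demonstration):

```python
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()   # mkstemp creates the file owner-read/write only
os.close(fd)

mode = os.stat(path).st_mode
print(bool(mode & stat.S_IRUSR))   # owner-read bit
print(bool(mode & stat.S_IWUSR))   # owner-write bit
print(bool(mode & stat.S_IWOTH))   # world-write bit (False for mkstemp files)

os.remove(path)
```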

You have several other ways to determine access rights to a file in Python. In particular, the os module provides a convenience function—os.access()—that takes a filename and a flag (one of os.F_OK, os.R_OK, os.W_OK, or os.X_OK) and returns a boolean result indicating whether the file exists, or is readable, writable, or executable, respectively. This is easier to use than the underlying os.stat() and bitmask approach, but it’s useful to know where the function gets its data.
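A quick sketch of os.access() against another throwaway file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

print(os.access(path, os.F_OK))   # True: the file exists
print(os.access(path, os.R_OK))   # True: it is readable
print(os.access(path, os.X_OK))   # False: mkstemp files are not executable

os.remove(path)
print(os.access(path, os.F_OK))   # False: it is gone now
```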

Finally, the os documentation points out a potential issue when checking for access before opening a file. There is a very short period between the two operations when the file could change either its access level or its content. So, as is usual in Python, it’s better to use try/except to open the file and deal with failure if it happens. You can then use the access checks to determine the cause of failure if necessary. The recommended pattern looks like this:

  try:
      myfile = open('myfile.txt')
  except PermissionError:
      pass    # test/modify the permissions here
  else:
      pass    # process the file here
  finally:
      pass    # close the file here

Having seen how to explore the properties of individual files, you now look at the mechanisms available for traversing the file system, reading folders, copying, moving, and deleting files, and so on.

Navigating and Manipulating the File System

Python provides built-in functions for opening, reading, and writing individual files. The os module adds functions to manipulate files as complete entities—for example, renaming, deleting, and creating links are all catered for. However, the os module itself provides only half of the story when it comes to working with files. You look at the other half when you explore the shutil module and other utility modules that work alongside os.

You start with reading and navigating the file system. You’ve already seen how you can use os.listdir() to get a directory listing and os.getcwd() to tell you the name of the current working directory. You can use os.mkdir() to create a new directory and os.chdir() to navigate into a different directory.

One problem with the os.mkdir() function used here is that it can only create a directory in an existing directory. If you try creating a directory in a place that doesn’t exist, it fails. Python provides an alternative function called os.makedirs()—note the difference in spelling—that creates all the intermediate folders in a path if they do not already exist.

You can see how that works with the following commands:

  >>> os.mkdir('test2/newtestdir')
  Traceback (most recent call last):
  	File "<stdin>", line 1, in <module>
  OSError: [Errno 2] No such file or directory: 'test2/newtestdir'
  >>> os.makedirs('test2/newtestdir')
  >>> os.chdir('test2/newtestdir')
  >>> print( os.getcwd() )
  /home/agauld/book/root/test2/newtestdir

Here the original os.mkdir() call produced an error because the intermediate folder test2 did not exist. The call to os.makedirs() succeeded, however, creating both the test2 and newtestdir folders, and you were able to change into newtestdir to prove the point. Note that os.makedirs() raises an error if the target folder already exists. You can use a couple of additional parameters to further tune the behavior, but the default values are usually what you need.

Another module, shutil, provides a set of higher level file manipulation commands. These include the ability to copy individual files, copy whole directory trees, delete directory trees, and move files or whole directory trees. One anomaly is the ability to delete a single file or group of files. That is actually found in the os module in the form of the os.remove() function for files (and os.rmdir() for empty directories, although shutil.rmtree() is more powerful and usually what you want).

Another useful module is glob. This module provides filename wildcard handling. You are probably familiar with the ? and * wildcards used to specify groups of files in the OS commands. For example, *.exe specifies all files ending in .exe. glob.glob() does the same thing in your code by returning a list of the matching filenames for a given pattern.
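A short sketch, working in a throwaway directory so the result is predictable:

```python
import glob
import os
import tempfile

# Build a scratch directory with a few known files
os.chdir(tempfile.mkdtemp())
for name in ('a.txt', 'b.txt', 'notes.log'):
    open(name, 'w').close()

print(sorted(glob.glob('*.txt')))   # ['a.txt', 'b.txt']
```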

If you look at the shutil documentation, you’ll see several variations on the copy functions with subtly different behaviors. In most cases the standard shutil.copy() function does what you want. Other features of shutil include the ability to create archived or compressed files in either zip or tar formats. Also, you can extend the functionality of several of the functions using optional arguments. One of the most interesting is the shutil.copytree() function, which has an ignore parameter. You can set this to a function that takes two arguments: a root folder and a list of files. (The function must accept two parameters even if they are not actually used by it.) The function then returns another list of filenames that shutil.copytree() ignores. This function is then called by shutil.copytree() for each folder of the tree being copied, with the arguments being the current folder within the tree and the list of files produced by os.listdir() acting on that folder. This is useful for ignoring temporary or archive files, or files that can be re-created later. Here is a short example that copies a project directory tree but ignores any compiled Python files (i.e., those with an extension of .pyc).

  >>> import shutil as sh
  >>> def ignore_pyc(root, names):
  ...     return [name for name in names if name.endswith('pyc')]
  ...
  >>> # now test that it works
  >>> ignore_pyc('fred', ['1.py', '2.py', '2.pyc', '4.py', '5.pyc'])
  ['2.pyc', '5.pyc']
  >>> sh.copytree('projdir', 'projbak', ignore=ignore_pyc)

In this case you used a list comprehension to build the ignore list, but you could equally just return a hard-coded filename (for example RCS to avoid copying version control files across) or you could have a much more complex piece of logic involving database lookups or other complex processing. The scenario of testing for a standard pattern ('*.pyc' in your case) is so common that shutil has a helper function called shutil.ignore_patterns(), which takes a list of glob-style patterns and returns a function that can be used in shutil.copytree(). Here is the previous example again, but this time using shutil.ignore_patterns():

  >>> sh.copytree('projdir', 'projbak', ignore=sh.ignore_patterns('*.pyc') )

Remember that the ignore function is called for every folder being copied, so if it is very complex, the copytree() operation could become quite resource-intensive and slow.

Finally, consider a submodule of os called os.path. The os.path module contains several helpful tests and utility functions that can help you when using the higher-level functions already discussed. The most useful functions are for creating paths, deconstructing paths, expanding user details, testing for path existence, and obtaining some information about file properties.

You start your exploration of os.path by looking at some helpful test functions. You saw earlier how you can use os.stat() to extract information about a file. os.path provides some helper functions that get the more common features more easily. You can, for instance, determine the size of a file using os.path.getsize(), the modification time using os.path.getmtime(), and the creation time with os.path.getctime(). You can also tell whether a name, returned by os.listdir() for example, is a file or a directory using os.path.isfile() or os.path.isdir(). (You can even test for mount points and links if that is important to you.) All of these functions take a name as an argument and return True or False. That’s a bit easier than calling os.stat() and then using a combination of indexing and bitmasking to extract the details.

The next thing that os.path helps with is processing paths. You can find the full path to your file using os.path.abspath() and, if it’s a link, the path to the real file with os.path.realpath(). Having obtained that path, you can break it into its constituent parts. Python considers a full file path to look like this:

  [<drive>]<path to folder><filename><extension>

Using os.path.splitdrive(), you can read the drive letter (if you are on Windows, otherwise it is empty). os.path.dirname() finds the folder, and os.path.basename() gets the filename (including the extension). You can even get the folder path and filename in one go with os.path.split(). Usually that’s sufficient but, if necessary, you can further split the filename into its extension and core name with os.path.splitext(), in which case the extension includes the period, for example, myfile.exe returns myfile and .exe.
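Using a made-up path, the deconstruction functions behave like this (the drive portion is empty on UNIX):

```python
import os.path

path = '/home/agauld/book/myfile.exe'   # a hypothetical path

print(os.path.dirname(path))            # /home/agauld/book
print(os.path.basename(path))           # myfile.exe
print(os.path.split(path))              # ('/home/agauld/book', 'myfile.exe')
print(os.path.splitext('myfile.exe'))   # ('myfile', '.exe')
```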

Often, after having inspected and worked with the various path components, you want to reassemble the path or even create one from scratch. os.path provides another convenience function for this called os.path.join(), which takes the various elements and combines them into a single string using the current OS path separator, as defined in the constant os.sep. This is very important because path format is one area that varies considerably across operating systems. Since Mac OS X appeared, based on a UNIX kernel, things have been a little easier, and Windows usually accepts the UNIX-style / separator in addition to its native \ style. But it is still safer to use os.path.join() to create file paths if you plan on running your script on multiple computer types.

You can see this operation in action on your test files and folders.
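For example, a short sketch of building a path portably (the separator in the printed result depends on your OS):

```python
import os

# Build the path from its parts; os.path.join inserts os.sep as needed
p = os.path.join('TreeRoot', 'D3', 'D3-1', 'target.txt')
print(p)        # TreeRoot/D3/D3-1/target.txt on UNIX
print(os.sep)   # '/' on UNIX, '\' on Windows
```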

Plumbing the Directory Tree Depths

One common automation operation is to start at a given location and apply a particular action to every file (or type of file) in the file system below that location. This is often called “walking the directory tree,” and the os module contains a powerful and flexible function called os.walk() that helps you do just that. It is not the most straightforward function to use, so you spend this section looking at its key features.

You consider an example of os.walk() being used to find a specific file located somewhere within a given directory tree or subtree. You then create a new module with a findfile() function that you can use in your programs. That foundation can go on to form the basis for a whole group of functions that you can use to process directory trees.

First you need to create a test environment consisting of a hierarchy of folders under a root directory. (You can generate this structure by extracting the file TreeRoot.zip from the Chapter2.zip master file on the download site and then extracting the files within TreeRoot.zip, or you can use the OS tools to generate it manually.) Each folder contains some files, and one of the folders contains the file you want to find, namely target.txt. You can see this structure here:

  TreeRoot
 	FA.txt
 	FB.txt
 	D1
 		FC.txt
 		D1-1
 			FF.txt
 	D2
 		FD.txt
 	D3
 		FE.txt
 		D3-1
 			target.txt

The os.walk() function takes a starting point as an argument and returns a generator yielding tuples with 3 members (sometimes called a 3-tuple or triplet): the root, a list of directories in the current root, and a list of the current files in that root. If you look at the hierarchy you have created, you would expect the top-level tuple to look like this:

  ( 'TreeRoot', ['D1','D2','D3'], ['FA.txt','FB.txt'])

You can check that easily by writing a for loop at the interactive prompt:

  >>> import os
  >>> for t in os.walk('TreeRoot'):
  ...     print(t)
  ...
  ('TreeRoot', ['D1', 'D2', 'D3'], ['FA.txt', 'FB.txt'])
  ('TreeRoot/D1', ['D1-1'], ['FC.txt'])
  ('TreeRoot/D1/D1-1', [], ['FF.txt'])
  ('TreeRoot/D2', [], ['FD.txt'])
  ('TreeRoot/D3', ['D3-1'], ['FE.txt'])
  ('TreeRoot/D3/D3-1', [], ['target.txt'])

This clearly shows the path taken by os.walk(), starting with the first directory at the top level and drilling down before moving on to the next directory, and so on. It also shows how you can take a file from the files list and construct its full path by combining the name with the root value of the containing tuple.

By writing your function to use regular expressions and to return a list, you can make it much more powerful (but also slower!) than the simple glob.glob() that you saw earlier.
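As a sketch of the idea (the find_files() name and its regular-expression interface are invented for illustration, not part of the chapter's download code):

```python
import os
import re

def find_files(pattern, base='.'):
    """Return the full path of every file below base whose
    name matches the regular expression pattern."""
    regex = re.compile(pattern)
    matches = []
    for root, dirs, files in os.walk(base):
        for name in files:
            if regex.match(name):
                matches.append(os.path.join(root, name))
    return matches
```

Called as find_files(r'.*\.txt$', 'TreeRoot'), it would return every .txt file in the test hierarchy, however deeply nested.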

You’ve seen how Python helps you work with the OS. In the next section, you see how Python enables you to work with dates and times.

Working with Dates and Times

One of the most common features of scripting tasks is the use of dates and times. This could be to identify files older than a certain date or between a certain range or it might be to set a process to run at a certain time or interval. You might need to compare dates and times in data to select an appropriate subset of a file’s content. Reading dates and times and comparing their values is necessary in many scenarios.

Unfortunately, dates and times are not clearly defined values like integers or floats. They tend to be stored as strings in a multitude of formats. For example, 2016-02-07, 02/07/2016, and 07/02/2016 are all possible representations for the 7th day of February, 2016. The situation is further complicated by the possibility of rendering months and days using name abbreviations such as Jan, Feb, or Mon, Tue, and so on. Add the fact that years may be abbreviated to two digits and that the separators can be any of a number of characters, and you start to see the complexity. How can you reliably read a date value from a given string? Time values are almost as complex, especially if you have to consider time zones and daylight saving rules. Fortunately, Python offers several modules to help you do just that. The most basic is the time module, augmented by the datetime module and, for some tasks, the calendar module.

Using the time Module

The time module stores times (including dates) in two different formats. The first is the number of seconds since the epoch, which is simply a fixed date in history; for UNIX-based systems that's 1st January 1970. (Did you notice that's yet another date representation?) The other representation is a tuple of fields representing the various parts of a date/time: year, month, day, hour, minute, second, and so on. The details are all found in the time module documentation, but you need to remember which underlying format you are using. The time module contains various conversion functions to switch between them.
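A short sketch shows the two representations and a round trip between them:

```python
import time

now = time.time()            # seconds since the epoch, as a float
parts = time.localtime(now)  # the same instant as a struct_time tuple

# struct_time fields can be accessed by name or by index
print(parts.tm_year, parts.tm_mon, parts.tm_mday)

# mktime() converts a local-time tuple back to epoch seconds
# (fractions of a second are lost in the round trip)
print(int(time.mktime(parts)) == int(now))   # True
```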

Two very important functions for reading and writing times as strings take into account most of the issues just discussed. These functions are called strptime() (the “p” stands for parse) and strftime() (the “f” stands for format). The secret to using these functions lies in a format string. This string tells the function how to map string values to/from time values. The format string uses % markers to indicate a field and a set of character codes to indicate what the field should contain. For example, %Y indicates a four-digit year whereas %y indicates a two-digit year. %m indicates a two-digit month, and %B indicates the full month name (taking into account the local language settings). A table in the time module documentation for strftime() provides the definitive list.

The easiest way to come to grips with these functions is to play with them at the Python prompt. You can try it out now.
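For example, a short session like this parses a string with strptime() and writes it back out in a different style with strftime():

```python
import time

# Parse a date string into a struct_time using explicit format codes
t = time.strptime('2016-02-07', '%Y-%m-%d')
print(t.tm_year, t.tm_mon, t.tm_mday)    # 2016 2 7

# Format the same value back out, day first with a two-digit year
print(time.strftime('%d/%m/%y', t))      # 07/02/16
```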

The time module includes several other functions for managing time zones and for getting information about the system clocks. You can also tell if daylight savings time is in effect on the computer.

Finally, and far from least, the time module contains a sleep() function that pauses your program for the specified number of seconds. This is often useful in scripting when you are using background processes to perform a task that may require some time. It is also useful when polling a resource such as a network connection while waiting for data to arrive. You can use fractions of a second, but you should realize that the timing is only approximate because of OS process scheduling overheads and the like.
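For instance, a polling loop built on sleep() might be sketched like this (wait_for() is an invented helper, not a standard library function):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.1):
    """Call condition() every interval seconds until it returns
    True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)   # only approximate, as noted above
    return False
```

You might use it as wait_for(lambda: os.path.exists('results.csv'), timeout=30) while a background process produces a file.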

Introducing the datetime Module

The datetime module includes several objects and methods that represent both absolute dates and times as well as relative dates and times. Relative values are used for computing differences between times and save you having to do messy calculations on second-based values, dividing by 60 and 24, and so on. Some overlap exists between the time functions and the datetime objects. In general, if you are doing comparisons or time-based calculations, you should use the datetime module rather than time. If you are using both in the same code, use the plain import module style (rather than from module import *) to ensure no name collisions occur.

The main classes exposed by the datetime module are date, time, and datetime, whose names are indicative of their scope. datetime and time objects can have a timezone attribute set to a timezone object to take account of time zone effects. If you have complex time processing to do, you may need to subclass the timezone class to provide any non-trivial algorithms required. In this book you only use the basic objects from the module. The other, and perhaps most useful, object type exposed is the timedelta class, which handles time durations such as the result of a time computation or a relative period such as a year or a month. The datetime module supports many time-based calculations using timedelta objects, including addition, subtraction, multiplication of a delta by a number, and even various forms of division.
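A brief sketch of timedelta arithmetic shows the kind of calculation that would otherwise mean dividing by 60 and 24 by hand:

```python
from datetime import datetime, timedelta

start = datetime(2013, 12, 5)
loan = timedelta(days=21)

print(start + loan)                          # add a delta to a datetime
print((datetime(2013, 12, 19) - start).days) # subtracting gives a delta: 14
print(loan * 2)                              # multiply a delta by a number
print(loan / timedelta(days=7))              # divide one delta by another: 3.0
```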

You can initialize the date object by passing year, month, and day values, all of which are mandatory. You can initialize the time object passing hour, minute, and second values, all of which are optional and default to zero. The datetime object, you will not be surprised to learn, uses the full gamut of year, month, day, hour, minute, and second. Some helpful class methods return instances based on object arguments. An example is the date.today() method that returns today’s date or the date.fromtimestamp() method that takes a time value in seconds as its argument. Various attributes and methods exist for extracting data about the date after it has been created. The date class includes a strftime() method similar to the one in the time module (but has no corresponding strptime(); for that, you must look to the datetime object).

The time object is conceptually similar but, as mentioned earlier, includes the capability to take account of time zone data, including daylight savings information. time objects, like date objects, support strftime() but not strptime().

datetime objects are a combination of both date and time objects and support a combination of both objects’ methods. datetime also adds a few extra methods of its own, including a now() class method for initialization to the current date and time, and combine() class method that takes date and time objects as arguments and returns a combined datetime object with the same values. You can do basic arithmetic using a combination of datetime and timedelta objects, the latter being either an argument or result as appropriate. datetime objects also support both strftime() and strptime() methods, which work in the same way as those in the time module described earlier.
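The following sketch pulls these pieces together:

```python
from datetime import date, time, datetime

d = date(2016, 2, 7)          # year, month, and day are mandatory
t = time(14, 30)              # hour and minute; second defaults to 0

dt = datetime.combine(d, t)   # build a datetime from the two parts
print(dt)                     # 2016-02-07 14:30:00

# strptime() parses a string; strftime() formats one
parsed = datetime.strptime('7/2/2016', '%d/%m/%Y')
print(parsed.strftime('%Y-%m-%d'))   # 2016-02-07

print(datetime.now() > dt)    # datetime objects compare directly
```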

You use the datetime objects as part of a larger example in the Try It Out “Parsing XML with ElementTree” later in the chapter.

Introducing the calendar Module

The calendar module is the simplest of the time-based modules in Python’s standard library. Essentially, it generates a calendar for a given year. The calendar is a calendar.Calendar class instance that has several support methods that allow you to, for example, iterate over the days in a given month or produce various formatted text strings that can be useful in presenting user messages in a script. Calendars can be formatted as plaintext or in HTML.

calendar is probably the least used of the three modules discussed, but it has some useful features that are not available elsewhere and would be time consuming to reproduce. Among these are some utility functions such as isleap(), which reports whether or not the specified year is a leap year, and timegm(), which converts a time.gmtime() tuple into seconds (why it is located in the calendar module instead of time is something of a mystery).

Finally, a couple of printing functions, prcal() and prmonth(), take a year and a year/month combination, respectively, as arguments and display their output on stdout. These can be useful when you want to prompt your user to choose a date.
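A minimal sketch of those utilities in action:

```python
import calendar
import time

print(calendar.isleap(2016))          # True: 2016 is a leap year

# monthrange() returns the weekday of the 1st and the number of days
print(calendar.monthrange(2016, 2))   # (0, 29): a Monday, 29 days

# timegm() is the inverse of time.gmtime()
print(calendar.timegm(time.gmtime(0)))   # 0

calendar.prmonth(2016, 2)             # print February 2016 on stdout
```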

There are some third-party modules available that try to simplify date and time handling in Python by combining all the functions from all of the standard modules into a single more user-friendly module. Some examples include arrow and delorean, but an Internet search will reveal several others.

In the next section, you see how Python assists in reading and writing several common data file formats.

Handling Common File Formats

When writing scripts to control several applications or utilities, it's common to use files as the data transfer mechanism between applications. Unfortunately, the output format of one application may not be in exactly the right format for the next application to read. At this point the script itself must convert the output file into the appropriate form. Most applications produce and consume variants of a few standard formats such as CSV (comma-separated values), HTML (HyperText Markup Language), XML (eXtensible Markup Language), Windows INI (named after the file extension) and, more recently, JSON (JavaScript Object Notation). You now look at how Python’s standard library supports these various formats. (JSON is covered in Chapter 5, “Python on the Web,” because it is most commonly associated with web applications.) These modules make it easier to read and write data than if you tried to do it using the standard Python text-processing tools such as string methods or regular expressions.

Using Comma-Separated Values

The comma-separated value (CSV) format has been around for many years. Its name is something of a misnomer because, though commas are the most common separator, the term CSV is often applied to files using tabs or pipes (|) or, indeed, just about any other kind of character, as a separator. At first glance it might seem easy to parse data from such a file using the built-in string split() method. The problem is that the format is not absolutely standardized, and different files have different ways of representing fields that contain the separator within them. Also, lines of data can sometimes be split over multiple physical lines in the file. To make dealing with this diversity easier, Python includes the csv module in its standard library.

The csv module provides two mechanisms for reading CSV files. The simplest just reads each line into a tuple, and the programmer has to keep track of what each position in the tuple represents. The second method reads the data into a dictionary, often using the first line of the file as the keys of the dictionary. This is a particularly flexible mechanism because it accommodates changes in the file format (such as adding new keys) without breaking existing code.

The module defaults to the CSV format used by Microsoft Excel, but you can define your own formats too; it just takes a bit of extra work. In this chapter you are dealing with the Excel format only.

The examples that follow are based on a simple spreadsheet, toolhire.xlsx, as shown in Figure 2.1. (All of the data files discussed are included in the ToolhireData folder of the Chapter2.zip download file in case you don’t have access to Excel.) The spreadsheet describes a small tool hire facility set up by some friends to keep track of who is borrowing what from whom.


Figure 2.1 The toolhire spreadsheet

The data was saved to CSV format in the file toolhire.csv. The raw data in that file looks like this:

  ItemID,Name,Description,Owner,Borrower,DateLent,DateReturned
  1,LawnMower,Small Hover mower,Fred,Joe,4/1/2012,4/26/2012
  2,LawnMower,Ride-on mower,Mike,Anne,9/5/2012,1/5/2013
  3,Bike,BMX bike,Joe,Rob,7/3/2013,7/22/2013
  4,Drill,Heavy duty hammer,Rob,Fred,11/19/2013,11/29/2013
  5,Scarifier,"Quality, stainless steel",Anne,Mike,12/5/2013,
  6,Sprinkler,Cheap but effective,Fred,,,

This is a fairly simple file, but does include one of the complexities described earlier. Notice that Anne’s scarifier description is surrounded by double quotes because it contains a comma.

After importing the module, you can read the file into a list of tuples like so:

  >>> import csv
  >>> with open('toolhire.csv') as th:
  ...	toolreader = csv.reader(th)
  ...	print(list(toolreader))
  ...
  [['ItemID', 'Name', 'Description', 'Owner', 'Borrower',
  'DateLent', 'DateReturned'],
  ['1', 'LawnMower', 'Small Hover mower', 'Fred', 'Joe', '4/1/2012', '4/26/2012'],
  ['2', 'LawnMower', 'Ride-on mower', 'Mike', 'Anne', '9/5/2012', '1/5/2013'],
  ['3', 'Bike', 'BMX bike', 'Joe', 'Rob', '7/3/2013', '7/22/2013'], ['4', 'Drill',
  'Heavy duty hammer', 'Rob', 'Fred', '11/19/2013', '11/29/2013'], ['5',
  'Scarifier', 'Quality, stainless steel', 'Anne', 'Mike', '12/5/2013', ''],
  ['6', 'Sprinkler', 'Cheap but effective', 'Fred', '', '', '']]
  >>>

Notice that Anne’s scarifier description no longer has double quotes, but does still contain the original comma. Figuring out how to do that is the value that the csv module adds to your programs. You can apply lots of options both to the file in the call to open() and in the creation of the csv.reader object. The example shows the minimal set.

Writing to a CSV file is just as easy. In this example, you create a new page of data for the toolhire.xlsx spreadsheet that lists the various tools available, along with some details about when they were made available, their condition, and original price. You save the data as a CSV file called tooldesc.csv that you can load into Excel as a new worksheet.

Here is the code:

  >>> import csv
  >>> items = [
  ...     ['1','Lawnmower', 'Small Hover mower', 'Fred','$150','Excellent','2012-01-05'],
  ...     ['2','Lawnmower','Ride-on mower','Mike','$370','Fair','2012-04-01'],
  ...     ['3','Bike','BMX bike','Joe','$200','Good','2013-03-22'],
  ...     ['4','Drill','Heavy duty hammer','Rob','$100','Good','2013-10-28'],
  ...     ['5','Scarifier','Quality, stainless steel','Anne','$200','2013-09-14'],
  ...     ['6','Sprinkler','Cheap but effective','Fred','$80','2014-01-06']
  ...     ]
  >>> with open('tooldesc.csv','w', newline='') as tooldata:
  ...     toolwriter = csv.writer(tooldata)
  ...     for item in items:
  ...         toolwriter.writerow(item)
  ...
  44
  39
  33
  34
  34
  33
  >>>

As you can see, the writer.writerow() method returns the number of characters written to the file. Mostly you just ignore that! The output file looks like this:

  1,Lawnmower,Small Hover mower,Fred,$150,Excellent,2012-01-05
  2,Lawnmower,Ride-on mower,Mike,$370,Fair,2012-04-01
  3,Bike,BMX bike,Joe,$200,Good,2013-03-22
  4,Drill,Heavy duty hammer,Rob,$100,Good,2013-10-28
  5,Scarifier,"Quality, stainless steel",Anne,$200,2013-09-14
  6,Sprinkler,Cheap but effective,Fred,$80,2014-01-06

Notice that the scarifier description once again has quotes around it, and the date fields are written exactly as is. If you want the dates in the same format as Excel produced in the original CSV file, you need to do that manipulation before you write the data. This is very typical of the kinds of inconsistencies you find when using CSV files as a transport between different applications. You can use the datetime module to convert the date formats: the datetime.strptime() class method parses an input string into a datetime object, and the strftime() method writes that object out in the format you want. Try that out now.
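As a sketch of that conversion (the reformat_date() helper is invented for illustration), you might write:

```python
from datetime import datetime

def reformat_date(text):
    """Convert an Excel-style m/d/yyyy date string to yyyy-mm-dd.
    Empty fields are passed through unchanged."""
    if not text:
        return text
    return datetime.strptime(text, '%m/%d/%Y').strftime('%Y-%m-%d')

print(reformat_date('4/1/2012'))   # 2012-04-01
print(reformat_date(''))           # empty string passes through
```

You would apply this to the date columns of each row before calling writerow().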

So far you have been using the basic reader and writer components of the csv module that work with lists of data items. You may recall from earlier that csv also supports a dictionary-based approach. You now use that to access the original toolhire.csv file. If you look again at the content of the CSV file, you notice that the first line is a list of headings that describe the columns. The csv module can exploit that by using the headings as keys in a dictionary. This makes accessing individual fields much more reliable because you no longer need to rely on the numeric position of the field in the file.

The way it works is very similar to the previous code, but instead of using a csv.reader object, you use a csv.DictReader. It looks like this:

  >>> with open('toolhire.csv') as th:
  ...	rdr = csv.DictReader(th)
  ...	for item in rdr:
  ...		print(item)
  ...
  {'DateReturned': '4/26/2012', 'Description': 'Small Hover mower',
  'Owner': 'Fred', 'ItemID': '1', 'DateLent': '4/1/2012',
  'Name': 'LawnMower', 'Borrower': 'Joe'}
  {'DateReturned': '1/5/2013', 'Description': 'Ride-on mower',
  'Owner': 'Mike', 'ItemID': '2', 'DateLent': '9/5/2012',
  'Name': 'LawnMower', 'Borrower': 'Anne'}
  {'DateReturned': '7/22/2013', 'Description': 'BMX bike',
  'Owner': 'Joe', 'ItemID': '3', 'DateLent': '7/3/2013',
  'Name': 'Bike', 'Borrower': 'Rob'}
  {'DateReturned': '11/29/2013', 'Description': 'Heavy duty hammer',
  'Owner': 'Rob', 'ItemID': '4', 'DateLent': '11/19/2013',
  'Name': 'Drill', 'Borrower': 'Fred'}
  {'DateReturned': '', 'Description': 'Quality, stainless steel',
  'Owner': 'Anne', 'ItemID': '5', 'DateLent': '12/5/2013',
  'Name': 'Scarifier', 'Borrower': 'Mike'}
  {'DateReturned': '', 'Description': 'Cheap but effective',
  'Owner': 'Fred', 'ItemID': '6', 'DateLent': '',
  'Name': 'Sprinkler', 'Borrower': ''}
  >>>

Notice that, as is normal with a dictionary, the fields are not in the original order, and they are keyed using the labels from the first line. You can see that, as before, the scarifier description has lost the quotes but retained its comma.

If, instead of printing the items, you store them in a variable, you can do some interesting analysis of the data using list comprehensions. For example, to see all of the items owned by Fred, you can do this:

  >>> with open('toolhire.csv') as th:
  ...	rdr = csv.DictReader(th)
  ...	items = [item for item in rdr]
  ...
  >>> [item['Name'] for item in items if item['Owner'] == 'Fred']
  ['LawnMower', 'Sprinkler']
  >>>

You could do the same thing using the basic reader and its lists, but you’d need to use numeric indices, which are much less readable. For example, the list comprehension using the earlier list would look like this:

  >>> [item[1] for item in toolList if item[3] == 'Fred']
  ['LawnMower', 'Sprinkler']

It isn’t nearly so obvious what you are returning or what the selection criteria are. Also, if the file format ever changed, you would need to change the indices everywhere in your code.

There is a matching DictWriter object that can write a dictionary out to a CSV file. You use it in the next Try It Out exercise.

You can use the DictReader even if your CSV file contains no labels. For example, the tooldesc2.csv file that you created in the previous Try It Out had no label line. You can remedy that by reading it into a DictReader and then writing it out with a DictWriter. The trick is to provide the labels as an argument to the DictReader constructor. Try it out now.
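A sketch of that round trip might look like this (the add_header() helper, the output file name, and the Price/Condition/DateAcquired labels are all illustrative; check the labels against your own data):

```python
import csv

FIELDS = ['ItemID', 'Name', 'Description', 'Owner',
          'Price', 'Condition', 'DateAcquired']

def add_header(src_name, dst_name, fieldnames=FIELDS):
    """Read a CSV file that has no heading line, supplying the labels
    ourselves, then write it back out with a header line added."""
    with open(src_name) as src:
        rows = list(csv.DictReader(src, fieldnames=fieldnames))
    with open(dst_name, 'w', newline='') as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()      # DictWriter can emit the label line
        writer.writerows(rows)
    return rows
```

Calling add_header('tooldesc2.csv', 'tooldesc3.csv') would produce a labeled copy of the unlabeled file.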

You’ve seen how to use the csv reader and writer objects to convert between the CSV file format and Python lists as well as the DictReader and DictWriter objects to do the same with dictionaries. You’ve also seen two examples of modifying a CSV file format to make it more suitable for subsequent processing. The csv module contains a few other features for dealing with non-Excel based CSV files, but you can read about those in the documentation if you need them.

Working with Config Files

Config files or, as they are often called, Windows “INI” files, have a very readable format that is also easy to work with programmatically. They have fallen out of favor in recent years because Microsoft now advocates the Windows Registry and non-Microsoft applications are moving to XML-based storage. However, there are plenty of legacy applications around that use this format. (A search for *.ini on a relatively clean installation of Windows 8.1 found several hundred files, so it is far from dead!)

The format is very good at storing multiple instances of similar data, such as per-node settings on a network, or for multiple categories of options, such as various screen sizes, or online versus offline operational parameters. The disadvantage of the Config format is that it can sometimes be too simple with the result that complex data is harder to fit into the format. Python provides the configparser module for reading and writing Config format data.

The basic structure of a Config file is as shown here:

  [DEFAULT]
  Option1=value1

  [SECTION1]
  Option2=value2
  Option3=value3

  [SECTION2]
  Option4=value4
  etc.

The DEFAULT section is noteworthy because options defined there apply to all following sections. The format has a lot of flexibility, with spaces and indentation optional, embedded sections, and various other variants, including the ability to interpolate a value from one option into another option. The configparser module can handle all of these and much more. It converts the data into, or from, a dictionary format similar to the kind used for the CSV files described in the previous section.

Basic usage is shown very clearly in the documentation with examples of creating a file and reading from it. Because there is little point in repeating that here, you can browse it at your leisure and then try it out in the following example.
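A minimal sketch (the section and option names are invented) of writing a config out and reading it back:

```python
import configparser
import io

config = configparser.ConfigParser()
config['DEFAULT'] = {'Timeout': '30'}
config['server'] = {'Host': 'localhost', 'Port': '8080'}

# write() accepts any file-like object; a StringIO keeps the
# sketch self-contained, but an open file works the same way
buf = io.StringIO()
config.write(buf)

config2 = configparser.ConfigParser()
config2.read_string(buf.getvalue())
print(config2['server']['Host'])      # localhost
print(config2['server']['Timeout'])   # 30, inherited from DEFAULT
```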

Working with XML and HTML files

You are probably familiar with HTML as the language of web pages. XML is also widely used as a so-called self-describing data format. XML and HTML are closely related formats. XML is much more rigidly defined, and that makes it easier to process using a computer. HTML is very forgiving of malformed content and, although that makes it easy to create by hand as well as with specialized editors, it makes HTML much more hit-or-miss to process accurately. HTML also has many variations because of proprietary web browser extensions. All of this means that HTML parsers have a trickier job and often yield less than perfect results when faced with badly formatted files. Because XML is easier to handle programmatically, you look at parsing it first and then extend the technique to cover HTML.

Parsing XML Files

Many different parsers are available for parsing XML. The Python standard library contains no fewer than five (dom, minidom, expat, ElementTree, and sax). These all fall into two categories: those that read the entire file into a tree-like data structure called a document object model (DOM), and those that scan the file looking for items of interest (an “event”) and trigger a response as the items are found. The former are more flexible for complex, or multiple, queries on the same set of data. The latter tend to be faster and slightly simpler to use. In this book you look at only two of the parsers, each representing one of these two approaches.

The first parser you consider is sax, which is an example of an event-based parser. To understand how event-based parsers work, consider the following example that parses some plaintext:

  >>> text = """mary had a little lamb
  ... its fleece was white as snow
  ... and everywhere that mary went
  ... the lamb was sure to go"""

  >>> def has_mary(aLine):
  ...	print( "We found: ", aLine)
  ...
  >>> def parse_text(theText, aPattern, function):
  ...	for line in theText.split('\n'):
  ...		if aPattern in line:
  ...			function(line)
  ...
  >>> parse_text(text,'mary',has_mary)
  We found: mary had a little lamb
  We found: and everywhere that mary went
  >>>

Here you create some text that you want to parse. You then define a function, has_mary(), that you want to be called every time mary is found in the text.

Next you create your event-driven parsing function, parse_text(). This function iterates over the input text line by line. If the search string, in this case mary, is found, then it calls the function that has been passed in.

When you execute parse_text() with your text string and the has_mary() function as arguments, it prints out the two lines containing mary.

The sax module works in a similar way to your parse_text() function; however, it uses events, such as detecting the start of an XML element, rather than plaintext patterns. It takes in an XML source text and a collection of events and associated event-handler functions. It then processes the XML text section by section, and if it finds a match to a given event, it calls the associated handler to deal with it. The parser does not store the XML data, it simply iterates over it. If you need to go back to access earlier data, you need to re-parse the entire file.

To investigate the sax parser, you need an XML file. You can find one, called toolhire.xml, in the ToolhireData folder of the Chapter2.zip file. This is simply an XML export of the toolhire.xlsx spreadsheet that you used earlier. A fragment of that file, including the parts you will be extracting, slightly edited for readability, is shown here:

  <?xml version="1.0"?>
  <?mso-application progid="Excel.Sheet"?>
  <Workbook
  ...
  <Worksheet ss:Name="Sheet1">
 	<Table ss:ExpandedColumnCount="1025" ss:ExpandedRowCount="7" x:FullColumns="1"
 	x:FullRows="1" ss:DefaultRowHeight="15">
 	<Column ss:AutoFitWidth="0" ss:Width="36"/>
  ...
 	<Row ss:StyleID="s36">
 		<Cell><Data ss:Type="String">ItemID</Data></Cell>
 		<Cell><Data ss:Type="String">Name</Data></Cell>
 		<Cell><Data ss:Type="String">Description</Data></Cell>
 		<Cell><Data ss:Type="String">Owner</Data></Cell>
 		<Cell><Data ss:Type="String">Borrower</Data></Cell>
 		<Cell><Data ss:Type="String">DateLent</Data></Cell>
 		<Cell><Data ss:Type="String">DateReturned</Data></Cell>
 	</Row>
 	<Row>
 		<Cell><Data ss:Type="Number">1</Data></Cell>
 		<Cell><Data ss:Type="String">LawnMower</Data></Cell>
 		<Cell><Data ss:Type="String">Small Hover mower</Data></Cell>
 		<Cell><Data ss:Type="String">Fred</Data></Cell>
 		<Cell><Data ss:Type="String">Joe</Data></Cell>
 		<Cell ss:StyleID="s37"><Data ss:Type="DateTime">
  2012-04-01T00:00:00.000</Data></Cell>
 		<Cell ss:StyleID="s37"><Data ss:Type="DateTime">
  2012-04-26T00:00:00.000</Data></Cell>
 		</Row>
  ...
  </Worksheet>
  </Workbook>

Assume you want to find the average length of loan. You use sax to extract just the DateLent and DateReturned fields for each item and store them as a tuple in a dates list. You can then later process those dates to find the duration for each lent item.

To initialize the parser, you need to create your handler and specify the events that you are interested in. sax actually uses a handler object, an instance of the xml.sax.handler.ContentHandler class, or more specifically, a subclass of it, to combine the event and function. Several predefined handler subclasses exist, including one for dealing with errors. The advantage of this approach is that many default methods are already defined and others can be easily overridden, such as startDocument(), which is called at the very beginning of parsing and is useful for setting up state variables and the like. For simple XML parsing tasks, you normally create a custom subclass of ContentHandler and then write your own versions of the startElement(), endElement(), and, possibly, the characters() methods.

By inspecting the XML file, you can see that the data you want is contained in a <Data> element and is identified by the ss:Type attribute being set to DateTime. The actual data is character data that sits between the start and end <Data> tags, so the expected event sequence is startElement(), followed by characters(), followed by endElement().

The code for your ToolHireHandler class looks like this (and is in the ToolHire folder of Chapter2.zip as toolhiresax.py):

import xml.sax
import xml.sax.handler

class ToolHireHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.dates = []
        self.dateLent = ''
        self.dateCounter = 0
        self.isDate = False

    def startElement(self, name, attributes):
        if name == "Data":
            data = attributes.get('ss:Type', None)
            if data == 'DateTime':
                self.isDate = True
                self.dateCounter += 1
            else:
                self.dateCounter = 0

    def endElement(self, name):
        self.isDate = False

    def characters(self, data):
        if self.isDate:
            if self.dateCounter == 1:
                self.dateLent = data
            else:
                self.dates.append((self.dateLent, data))

if __name__ == '__main__':
    handler = ToolHireHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse('toolhire.xml')
    print(handler.dates)

The initializer calls the superclass initializer and then sets up various data attributes that you use in the parsing and need to use across methods. It also creates an empty dates list to hold the results.

The main parsing method is the startElement() method, which looks out for Data elements and, when one is found, refines the search by selecting only those with an ss:Type attribute of DateTime. (You have to identify these values by inspecting the XML file manually.) Because you can have up to two dates in a single row, you use self.dateCounter to keep track of which date within the row you are handling. You use the self.isDate value to indicate to the characters() method that it is inside a date element. If the data is not a DateTime type, you reset self.dateCounter to 0.

The endElement() method ensures the self.isDate flag is reset to False ready for the next startElement() event to come along.

The characters() method is called whenever text content between tags is encountered. You are only interested in the date information so, if the self.isDate flag is not set, you simply ignore the character data. If the data is a date, you check whether it’s the first date, in which case you store it in the self.dateLent attribute; if it’s the second date, you store both dates in the self.dates list. If only one date is found, the characters() handler is not called a second time, and the date is not added to the dates list, thus ensuring you store only pairs of dates, which is what you need for the duration calculations.

Finally, the driver code at the bottom creates the handler and parser instances. It then sets the handler within the parser to your ToolHireHandler instance and executes the parse() operation on your XML file. After parsing is complete, it prints out the collected dates from the handler.

You repeat this exercise using the ElementTree DOM-based parser in the Try It Out at the end of this section. There you can compare and contrast the two techniques. First, though, you look at parsing HTML because the standard library HTML parser is very similar in style to the sax XML parser.
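As a preview of the DOM-based style, a sketch of the same date extraction using ElementTree might look like this (the namespace URI is the one SpreadsheetML normally declares; verify it against the xmlns attributes in your own toolhire.xml):

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'

def extract_date_pairs(xml_text):
    """Collect (DateLent, DateReturned) pairs from a SpreadsheetML
    document, mirroring what the sax handler gathers."""
    root = ET.fromstring(xml_text)
    pairs = []
    for row in root.iter('{%s}Row' % SS):
        dates = [data.text
                 for data in row.iter('{%s}Data' % SS)
                 if data.get('{%s}Type' % SS) == 'DateTime']
        if len(dates) == 2:            # keep only complete loan records
            pairs.append(tuple(dates))
    return pairs
```

Because the whole tree is held in memory, you could re-query it for other fields without re-parsing the file.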

Parsing HTML Files

The standard library provides the html.parser module for parsing HTML. It works in a similar way to the sax parser, in that it is event driven. It is slightly simpler to use because it only has a single class with the handler methods defined within it. To show how it works, you once again extract the dates from the toolhire.xlsx spreadsheet, but this time from the HTML export. You can find this file in the zip file under the ToolhireData/toolhire_files folder as sheet001.htm.

The file looks, in part, like this:

  <html xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:x="urn:schemas-microsoft-com:office:excel"
  xmlns="http://www.w3.org/TR/REC-html40">

  <head>
  <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
  <meta name=ProgId content=Excel.Sheet>
  ...
  <body link=blue vlink=purple>

  <table border=0 cellpadding=0 cellspacing=0 width=752 style='border-collapse:
  	collapse;table-layout:fixed;width:564pt'>
  <col width=64 style='width:48pt'>
  <col width=115 style='mso-width-source:userset;mso-width-alt:4205;width:86pt'>
  ...
  <tr class=xl66 height=21 style='height:15.75pt'>
  		<td height=21 class=xl66 width=64 style='height:15.75pt;width:48pt'>ItemID</td>
  		<td class=xl66 width=115 style='width:86pt'>Name</td>
  		<td class=xl66 width=153 style='width:115pt'>Description</td>
  		<td class=xl66 width=80 style='width:60pt'>Owner</td>
  		<td class=xl66 width=120 style='width:90pt'>Borrower</td>
  		<td class=xl66 width=99 style='width:74pt'>DateLent</td>
  		<td class=xl66 width=121 style='width:91pt'>DateReturned</td>
  </tr>
  <tr height=20 style='height:15.0pt'>
  		<td height=20 align=right style='height:15.0pt'>1</td>
  		<td>LawnMower</td>
  		<td>Small Hover mower</td>
  		<td>Fred</td>
  		<td>Joe</td>
  		<td class=xl65 align=right>4/1/2012</td>
  		<td class=xl65 align=right>4/26/2012</td>
  </tr>
  ...
  </table>
  </body>
  </html>

You can see that the dates have a unique class, namely xl65. This means you can look for <td> tags with that class attribute value, much as you did in the earlier XML example.

The HTMLParser class works very much like the sax ContentHandler class in that it has methods corresponding to HTML document elements. In the example you override the handle_starttag(), handle_endtag(), and handle_data() methods, which are directly analogous to the startElement(), endElement(), and characters() methods for XML.

You can find the code for this example in the zip file ToolHire folder as toolhirehtml.py. It looks like this:

  import html.parser

  class ToolHireParser(html.parser.HTMLParser):
      def __init__(self):
          super().__init__()
          self.dates = []
          self.dateLent = ''
          self.isDate = False
          self.dateCounter = 0

      def handle_starttag(self, name, attributes):
          if name == 'td':
              for key, value in attributes:
                  if key == 'class' and value == 'xl65':
                      self.isDate = True
                      self.dateCounter += 1
                      break
              else:
                  self.dateCounter = 0

      def handle_endtag(self, name):
          self.isDate = False

      def handle_data(self, data):
          if self.isDate:
              if self.dateCounter == 1:
                  self.dateLent = data
              else:
                  self.dates.append((self.dateLent, data))

  if __name__ == '__main__':
      htm = open('sheet001.htm').read()
      parser = ToolHireParser()
      parser.feed(htm)
      print(parser.dates)

If you compare that with the sax example, you see that the code inside the methods is nearly identical. The HTMLParser presents its attributes as a list of tuples. You iterate over that list looking for a class attribute of value xl65 to identify a date field. (Note that's an x-ELL not x-ONE; remember that this is an export from Microsoft Excel, hence the class name.) The parser conveniently takes care of mixed-case HTML tags by converting tag names to lowercase, so you don't need to worry about that. It also does its best to make sense of badly formed HTML, although it's not perfect and really bad markup can trip it up.
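You can see the attribute handling in isolation with a minimal sketch (the AttrEcho class and the sample tag here are invented for illustration; they are not part of the toolhire example). It shows that HTMLParser delivers attributes as (name, value) tuples and lowercases tag and attribute names for you:

```python
from html.parser import HTMLParser

# A tiny subclass that just records what the parser reports for each start tag.
class AttrEcho(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) tuples,
        # with the names lowercased but the values left untouched.
        self.seen.append((tag, attrs))

echo = AttrEcho()
echo.feed('<TD Class=xl65 ALIGN=right>')
print(echo.seen)
```

Note that the mixed-case input tag and attribute names come back normalized, which is exactly why the toolhire code can test for 'td' and 'class' without worrying about the case used in the exported file.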

You conclude this section on reading data files with a look at another of Python’s XML parsers. This time it’s ElementTree, and you investigate it in the following Try It Out.

Some applications do not lend themselves to generating data files. In these cases, you may need to interact with the program via an application programmer’s interface (API). The next section shows you how.

Accessing Native APIs with ctypes and pywin32

Some applications or OS functions are not easily accessed from regular Python code because no Python API exists or no user-friendly operations are exposed that you can call from Python. The ctypes module can provide an alternative means of access by exposing to Python the C code libraries from which the application is built. In Windows these libraries are typically a set of DLL files or, in UNIX, a set of shared object libraries. ctypes enables you to load those libraries into your application and call their functions directly from Python. This only works, of course, if you know what functions are in the library, what arguments are required, and the return values. This may not be published, and you then have to resort to trial and error, or reverse engineering, which may, in turn, be prohibited by the manufacturer or vendor. However, if the library has a published interface, ctypes provides an effective, although non-trivial, method of access.

Another package, installed by default in the ActiveState distribution of Python for Windows, or available for download on other distributions, is pywin32. This package provides access to the Windows native libraries and, in particular, to any Microsoft Component Object Model (COM) interfaces. Being Windows specific, it is usually easier to use than ctypes, which works generically on any operating system. The same caveats apply when using pywin32 as apply to ctypes.

Accessing the Operating System Libraries

One area that is usually well documented is the OS application programming interface (API) that is exposed in standard system libraries. In this section you use the OS libraries to perform some fairly simple tasks that are nonetheless not available via Python's os module. This method is particularly useful for Windows users because many of the UNIX-like features in the os module do not work, or only work partially, under Windows. Accessing the Win32 API directly via ctypes (or pywin32) is often the only option.

The following sections show ctypes being used on Windows and Linux systems, but the principles are identical, apart from getting the initial reference to the C library.

Using ctypes with Windows

On Windows systems the basic C library is found in the msvcrt library. Some functions in msvcrt.dll, mainly concerned with console input/output operations, are exposed in the Python msvcrt module, but many more are not available by that route. You can easily access the native msvcrt library from ctypes using the following code:

  >>> import ctypes as ct
  >>> libc = ct.cdll.msvcrt 	# Windows only

Once you have a reference to the standard library, you can call the familiar C functions. The only complication is that you need to ensure the arguments are type-compatible with C. In general, integer arguments work just fine, but strings usually need to be explicitly marked as byte strings, and floats need a special ctypes type conversion. Many type conversion functions are included in ctypes; you can find a full list in the module documentation. Here are two examples:

  >>> libc.printf(b"%d %s %s hanging on a wall\n", 6, b"green", b"bottles")
  6 green bottles hanging on a wall
  34
  >>> libc.printf(b"Pi is: %f\n", ct.c_double(3.14159))
  Pi is: 3.141590
  16

Notice the use of b to indicate a byte string and, in the second example, the use of the ctypes.c_double() conversion function. Also, note that the return value of printf(), which is the number of characters printed, is displayed after the message is printed.

Many C functions require pointers to data (effectively memory addresses) as arguments. ctypes enables you to do this using the byref() function. You can create an object of a given type and then pass that object using byref() into the ctypes function call you want to perform. Here is an example of using sscanf() that reads an integer value from a string into a Python variable:

  >>> d = ct.c_int()
  >>> print(d.value)
  0
  >>> libc.sscanf(b"6", b"%d", ct.byref(d))
  1
  >>> print(d.value)
  6

Next you look at a slightly more practical function in the Windows library: msvcrt._getdrives(). This returns the set of available drives on a Windows system, something not easily done using Python's standard os module. The only complication is that the return value is a bitmask, so you need to write a loop to test each bit to find out which bits are set and map the bit position into a drive letter. Here is the code:

  >>> drives = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  >>> drivelist = libc._getdrives()
  >>> for n in range(26):
  ...	mask = 1 << n 	# use left bit shifting to build a mask
  ...	if drivelist & mask: print (drives[n], 'is available')
  ...
  C is available
  D is available
  E is available
  P is available

The Microsoft Developers Network (msdn.microsoft.com) has full documentation for the standard Windows library functions.

Using ctypes on Linux

You can use ctypes on non-Windows systems, too. Here is an example using printf() on a Linux system accessed via the standard C library libc.so.6. (You can also use other UNIX-like OSes if you can find out the name of the library that implements the standard C library functions.)

  >>> import ctypes as ct
  >>> libc = ct.CDLL('libc.so.6')
  >>> libc.printf(b"My name is %s\n", b"Fred")
  My name is Fred
  16

The printf() and sscanf() examples in the previous section should also work using the Linux libc, as will the byref() function and the various type conversion functions.
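As a further sketch along the same lines (the use of ctypes.util.find_library and the strlen/abs calls are additions for illustration, not from the chapter), you can locate the C library by name rather than hard-coding the libc.so.6 filename, which makes the code a little more portable across UNIX-like systems:

```python
import ctypes
import ctypes.util

# find_library searches for the C library by its conventional short name;
# on a glibc Linux system this typically resolves to "libc.so.6".
name = ctypes.util.find_library("c")
libc = ctypes.CDLL(name)

# Calls with integer and byte-string arguments work without conversion.
print(libc.strlen(b"hello"))   # C strlen counts bytes, excluding the terminator
print(libc.abs(-42))
```

As with the printf() examples, integer return values come back as Python ints by default; for other return types you would set the function's restype attribute first.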

Accessing a Windows Application Using COM

Accessing an application library is almost as easy as accessing an OS system library, provided you can get documentation for the contents of the library. However, that is not always readily available. Another option on Windows is to use the OS functions to access the COM objects and then manipulate the COM objects from Python. Unfortunately, COM is a complex technology and has been extended over time to include features such as distribution over a network as well as various data access mechanisms. Compounding the difficulty is the fact that documentation for COM objects is often sparse and hard to find. Nonetheless, COM is often the most effective option for automating Windows applications.

The easiest way to use COM objects in Python is to use the pywin32 package, written by Mark Hammond and available for download from the SourceForge website or included as standard in the ActiveState distribution of Python. The following Try It Out demonstrates the use of pywin32 to open Excel preloaded with the toolhire.xlsx file you used in the earlier sections of this chapter.

You have now seen many techniques for integrating different applications in a scripting program. The next section gives you some advice on how to bring these techniques together to complete a scripting project.

Automating Tasks Involving Multiple Applications

Scripting was defined at the start of this chapter as “coordinating the actions of other programs or applications to perform a task.” So far, you have seen several enabling modules that can help you to interface with these external programs, but the bigger picture of how to automate a full workflow has not been discussed.

Normally, when you approach a workflow automation project, you look at what the human process is. You identify the systems used and the actions taken. You look at the input and output data. You then try to replicate that using whatever automation options are available for each system and process. You should take one other step before jumping in too quickly and that is to eliminate any steps that are done purely for the human user’s convenience—for example, formatting data into a more readable layout when the data is only an intermediate result. If the computer can read the data without that formatting, it’s an unnecessary step. Once you have identified the necessary steps, along with the systems and tools to be used, you can look at the automation options.

This section considers some guidelines that should minimize the pain in developing such multi-application scripts. As a general rule, use the following techniques in the order discussed.

Using Python First

Python comes with many support modules that enable you to replicate the OS functions and commands directly from your code. Other modules provide access to different file formats and network protocols. For example, Python has modules for directly manipulating the Windows registry and the UNIX password file that avoid calling external programs. Using Python directly provides an efficient and flexible solution that will be easier to maintain in the future. This should always be the first choice if possible.
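As a minimal sketch of this principle (the file names and temporary directory here are invented for illustration), copying a file with shutil replaces what might otherwise be a call out to cp or copy:

```python
import os
import shutil
import tempfile

# Create a scratch file to stand in for real application data.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'data.txt')
with open(src, 'w') as f:
    f.write('payload')

# shutil.copyfile does the copy natively; no external command is launched.
dst = os.path.join(workdir, 'data.bak')
shutil.copyfile(src, dst)
print(open(dst).read())
```

Because no subprocess is involved, errors surface as ordinary Python exceptions, which is much easier to handle than parsing a command's error output.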

Using Operating System Utilities

The OS provides many tools and commands for performing system administration. Many of these tools have command line interfaces (CLIs) that make them easy to call from Python code using the subprocess module. Tools that operate without interaction are the easiest to work with, even if this means using data files as an intermediate step because the files can be used as a recovery point should the process fail: You simply restart with the last successful step.
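A minimal sketch of calling a non-interactive CLI tool looks like this (echo stands in here for a real administration command; subprocess.run is the modern convenience interface, available from Python 3.5, whereas older code would use subprocess.check_output):

```python
import subprocess

# Run the command, capture its output as text, and inspect the result.
result = subprocess.run(['echo', 'hello'],
                        capture_output=True, text=True)
print(result.returncode, result.stdout.strip())
```

Checking the return code before trusting the output gives you the recovery point described above: if a step fails, you can stop and restart from the last command that succeeded.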

Using Data Files

Many tools and OS commands use configuration files to control how they function. By creating or modifying these configuration files prior to running the command, you can often control the behavior without the complexity of interacting with the processes in real time. In addition, you can usually drive such tools by using input files and generating output files rather than interactively providing data at prompts. You can build such files (or read them) using Python code, and you have seen how Python modules can assist in parsing many common data formats.
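A minimal sketch of building and re-reading such a configuration file in code with configparser (the section and option names are invented for illustration) might look like this:

```python
import configparser
import io

# Build a configuration in code, as you might before launching a tool.
config = configparser.ConfigParser()
config['backup'] = {'source': '/var/www',
                    'dest': '/mnt/archive',
                    'compress': 'yes'}

# Serialize to INI text (a StringIO stands in for a real file here).
buf = io.StringIO()
config.write(buf)

# Read it back, as you would to inspect a tool's existing configuration.
reread = configparser.ConfigParser()
reread.read_string(buf.getvalue())
print(reread['backup']['dest'])
```

With a real tool you would write to its actual configuration file path instead of a StringIO, then launch the command with subprocess.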

Using a Third-Party Module

Many popular applications have third-party modules that facilitate interacting with the application or direct manipulation of their data files. Microsoft Excel is a good example, with several modules available to assist in manipulating spreadsheets. You can manipulate many other proprietary file formats using third-party modules. Use your favorite search engine to find such modules. Include keywords like the application name, “python”, and “module”, and you should find what you are looking for fairly quickly.

The main caveat with this approach is that third-party modules often work only with older Python versions and may not be updated to the latest build. Most such modules are open source, with generous license conditions, so you usually have the option of updating the code yourself or, if that is too big a project, perhaps copying just the code that you need for your project. Due credit to the original authors should, of course, be given.

Interacting with Subprocesses via a CLI

If a tool has a CLI but cannot be driven using a data file, you can still use the subprocess module and interact with the process using stdin and stdout as was demonstrated with the ex editor earlier in the “Managing Subprocesses” section of this chapter. This is a potentially complex strategy because you have to anticipate every possible response or input request that the application may make. Similarly, error handling can be difficult to control and often, if an application deviates from the expected interaction, you may have no choice but to abort your script and try to recover manually. This is why using data files is preferable if at all possible.
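A minimal sketch of the pipe-driven approach follows (sort stands in here for a batch-mode tool; a genuinely interactive program is harder because you must read and answer each prompt in turn):

```python
import subprocess

# Launch the tool with its stdin and stdout connected to pipes.
proc = subprocess.Popen(['sort'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        text=True)

# Send all the input at once and collect the output; communicate()
# also waits for the process to finish, avoiding deadlocks.
out, _ = proc.communicate('pear\napple\nbanana\n')
print(out.splitlines())
```

For tools that demand a genuine back-and-forth exchange, communicate() is not enough and you must write to stdin and read from stdout incrementally, anticipating each prompt.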

There is a third-party module called pexpect that makes interacting with an external console-based program easier. It works by looking for expected (hence the name) prompt strings from the target application and then responding by allowing the programmer to send responses. This works well for login dialogs and similar interactions.

Using Web Services for Server-Based Applications

Some applications provide web services as an interface option. This is often an attractive alternative to using a third-party module, although the trade-off is often slower performance and the added complexity of parsing the XML or JSON data format used by such services. Web services are discussed in more detail in Chapter 5.
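As a minimal sketch of the parsing side of such a service (the payload shown is an invented response body, not from any real API), the json module turns the text into ordinary Python objects:

```python
import json

# A hypothetical response body as a web service might return it.
payload = '{"status": "ok", "items": [1, 2, 3]}'

# json.loads converts the text into dicts, lists, strings, and numbers.
data = json.loads(payload)
print(data['status'], sum(data['items']))
```

In a real script the payload would come from an HTTP request (for example via urllib.request); the parsing step is the same either way.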

Using a Native Code API

If the application you need to control offers a C library as an API, you can use ctypes to access it from Python. The biggest problem you are likely to face with this approach is finding good documentation for the API. If documentation exists, this can be a very effective technique, but if not, it can involve a lot of painful trial and error. The Python interactive prompt is an invaluable tool in these scenarios.

For Windows applications you can often find a COM interface and access that via the pywin32 package. As with using ctypes, the lack of documentation is often the biggest obstacle.

Using GUI Robotics

The final option for GUI applications with no API is to interact with the GUI itself by sending user event messages into the application. Such events could include key-presses, mouse-clicks, and so forth. This technique is known as robotics because you are simulating a human user from your Python program. It is really an extension of the native code access described in the previous section, but operating at a much lower level.

This is a frustrating technique that is very error prone and also very vulnerable to changes in the application being controlled—for example, if an upgrade changes the screen layout, your code will likely break. Because of the difficulty of writing the code, as well as the fragility of the solution, you should avoid this unless every other possibility has failed.

Summary

This chapter looked at how to automate tasks involving several different applications or OS utilities. You saw that Python’s standard library contains several powerful modules to assist in this. The os, os.path, shutil, and glob modules, for example, can provide much information about computer resources and help you manage files directly from within Python.

The subprocess module provides a mechanism to launch and interact with command line programs from within your scripts.

The time, datetime, and calendar modules can assist with time-related tasks and calculations. The time.sleep() function can introduce a pause to your script’s execution while waiting for other processes to complete.

You also saw that common data files that can be generated, or used as input by applications, can be created or read by Python using modules such as csv, configparser, html.parser, and xml.etree.

If no other form of access is available, it may be possible to use ctypes to access C functions exposed by dynamic libraries. On Windows similar functions exposed as a COM interface may be available, and the pywin32 modules simplify access somewhat. These techniques are usually more complex than using data files or calling subprocess functions.

Finally, you reviewed the options available for scripting with their pros and cons, including the last resort option for GUIs of sending OS events to the application windows. This last option is fraught with difficulty and should only ever be used when all other means have been explored and exhausted.

EXERCISES

  1. Explore the os module to see what else you can discover about your computer. Be sure to read the relevant parts of the Python documentation for the os and stat modules.

  2. Try adding a new function to the file_tree module called find_dirs() that searches for directories matching a given regular expression. Combine both to create a third function, find_all(), that searches both files and directories.

  3. Create another function, apply_to_files(), that applies a function parameter to all files matching the input pattern. You could, for example, use this function to remove all files matching a pattern, such as *.tmp , like this:

      findfiles.apply_to_files('.*.tmp', os.remove, 'TreeRoot')
  4. Write a program that loops over the first 128 characters and displays a message indicating whether or not the value is a control character (characters with ordinal values between 0x00 and 0x1F, plus 0x7F). Use ctypes to access the standard C library and call the iscntrl() function to determine if a given character is a control character. Note this is not one of the built-in test methods of the string type in Python.

WHAT YOU LEARNED IN THIS CHAPTER

TOPIC DESCRIPTION
Scripting Automation of a task involving multiple tools or applications. Python is used as the glue that binds these tools together, converting data formats to compatible forms, synchronizing the activities, and if necessary driving the functionality as a pseudo user.
OS environment When the OS runs a process, it creates an environment consisting of certain configuration details. These include things like the process priority, its home directory, file permissions, and formats. Scripts often need to customize the environment prior to launching a program to ensure that it performs in the correct way.
Process and subprocesses Programs run by the OS are known as processes. A single application may consist of a process hierarchy with a top-level process spawning multiple child or subprocesses. Subprocesses, by default, inherit their parent’s environment. Scripts frequently launch other programs as subprocesses.
Tree walking The file system exists as a tree structure with a root node and subtrees attached to the root. It is possible to recursively descend through this structure to the leaf nodes, which are the individual files. Scripts frequently need to process multiple files within a given subtree of the file system.
Absolute dates and times A fixed date and time in history. A date such as July 4th, 1776 is an absolute date.
Relative dates and times A date or time relative to another date or time. Usually expressed as a period such as three hours or as a repeating date or time, such as the third hour of every day or first day of every month.
Parser A function that breaks down structured data into its component parts. Parsers can be based on several different algorithms and the most common types are either event based or tree based. Python supports both styles for XML parsing.
Libraries Programming languages make reusable code available in code libraries. These are conceptually like Python modules, but in compiled languages are generated with special tools and can be either static or dynamically linked into an application. ctypes can access dynamically linked C libraries.
COM The Windows Component Object Model (COM) mechanism enables external applications (or frequently an internal macro language) to manipulate the functionality of a program. The pywin32 package simplifies Python access to COM objects.