2
Scripting with Python

WHAT YOU WILL LEARN IN THIS CHAPTER:    

  • Accessing and managing computer resources via the operating system
  • Handling common file formats such as CSV and XML
  • Working with dates and times
  • Automating applications and accessing their APIs
  • Using third-party modules to extend automation beyond the standard library capabilities

WROX.COM DOWNLOADS FOR THIS CHAPTER

For this chapter the wrox.com code downloads are found at www.wrox.com/go/pythonprojects on the Download Code tab. The code is in the Chapter 2 download, called Chapter2.zip, and individually named according to the names throughout the chapter.

Often, you may find yourself undertaking tasks that involve many repetitive operations. To combat this repetition of work, it may be possible to write a macro to automate those operations within a single application but, if the operations span several applications, macros are rarely effective. For example, if you back up and archive a large multimedia web application, you may have to deal with content produced by one or more media tools, code from an IDE, and probably some database files, too. Instead of macros, you need an external programming tool to drive each application, or utility, to perform its part of the whole. Python is well suited to this kind of orchestration role.

In this chapter you learn how to use Python modules to check user settings as well as directory and file access levels; set up the correct environment for an operation; and launch and control external programs from your script. You also discover how Python modules help you access data in common file formats, how to handle dates and times and, finally, how to directly access the low-level programming interfaces of external applications using the very powerful ctypes module and, for Windows, the pywin32 package.

Accessing the Operating System

Most of the tasks that a typical programmer needs to undertake using the operating system—for example, collecting user information or navigating the file system—can be done in a generic way using Python’s standard library of modules. (Recall that modules are reusable pieces of code that can be shared across multiple programs.) The key modules have been written in such a way that the peculiarities of individual operating system behaviors have been hidden behind a higher level set of objects and operations. The modules that you consider in this section are: os/path, pwd, glob, shutil, and subprocess. The material here focuses on how to use these modules in common scenarios; it does not try to cover every possible permutation or available option.

The os module, as the name suggests, provides access to many operating system features. It is, in fact, a package with a submodule, os.path, that deals with managing file paths, names, and types. The os module is supported by a number of other modules that you meet as you work through the various topics in this chapter. These myriad modules are collectively referred to as the OS modules (uppercase) and the actual os module as os (lowercase). If you are familiar with systems programming on a UNIX system, or even with using a UNIX shell such as Bash, many of these operations will be familiar to you.

The OS is primarily there to manage access to the computer’s hardware in the form of CPU, memory, storage, and networking. It regulates access to these resources and manages the creation, scheduling, and removal of processes. The OS module functions provide insight and control over these OS activities. In the next few sections, you look at these common tasks:

  • Collecting user and system information
  • Managing processes
  • Determining file information
  • Manipulating files
  • Navigating folders

Obtaining Information About Users and Their Computer

One of the first things you can do when exploring the OS modules is to find out what they can tell you about users. Specifically, you can find out the user’s ID, login name, and some of their default settings.

Like most new things in Python, the best way to get familiar is via the interactive prompt, so fire up the Python interpreter and try it out.
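A minimal sketch of the kind of session you might try, assuming a UNIX-like system (the IDs and names shown in the comments will differ on your machine; getpass.getuser() also works on Windows):

```python
import os
import getpass
import pwd  # UNIX only; not available on Windows

# The numeric user ID and the login name
print(os.getuid())        # e.g. 1000 on a typical Linux desktop
print(getpass.getuser())  # the login name, e.g. 'agauld'

# On UNIX the pwd module reveals the full password-database entry
entry = pwd.getpwuid(os.getuid())
print(entry.pw_name, entry.pw_dir, entry.pw_shell)
```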

Next you find out what kind of permissions the users have on files they create. This is significant because it affects any files your code produces. It may be that you need to temporarily alter the permissions—for example, if you need to create a file that you execute later in the program, it needs to have execute privileges. In UNIX, these settings are stored in something known as a umask or user mask. It is a bitmask, like the ones you used at the end of Chapter 1, where each bit represents a user-access data point, as described next.

Python lets you look at the umask value, even on Windows, using the os.umask() function. The os.umask() function has a slight quirk in its usage, however. It expects you to pass a new value to the function; it then sets that value and returns the old one. There is no read-only option, so if you only want to find out the current value, you need to set the umask to a temporary new value, capture the old value from the return result, and then restore the original. The format of the mask is very compact, consisting of 3 groups of 3 bits, 1 group for each of Owner, Group, and World permissions, respectively.

Within a group the 3 bits each represent one type of access—read, write, or execute. These are most conveniently written using explicit binary notation. Table 2.1 shows how each 3-bit binary value maps onto permissions.

Table 2.1 Umask Binary Mappings

UMASK BINARY VALUE   READ, WRITE, EXECUTE VALUES
000                  Read = True,  Write = True,  Execute = True
001                  Read = True,  Write = True,  Execute = False
010                  Read = True,  Write = False, Execute = True
011                  Read = True,  Write = False, Execute = False
100                  Read = False, Write = True,  Execute = True
101                  Read = False, Write = True,  Execute = False
110                  Read = False, Write = False, Execute = True
111                  Read = False, Write = False, Execute = False

Now that you understand what you are trying to do, it’s time to try it out.
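A short sketch of the set-read-restore dance just described (the printed value depends entirely on your system’s settings):

```python
import os

# os.umask() sets a new mask and returns the previous one, so to
# merely inspect it we set a dummy value and then put it straight back
old = os.umask(0o022)   # set a temporary value, capturing the old one
os.umask(old)           # immediately restore the original
print(oct(old))         # e.g. '0o22' on many UNIX systems
```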

Sometimes you want to know what kind of computer system the user is running, in particular the details of the OS itself. Python has several ways of doing this, but the one you look at first is the os.name property. At the time of writing, this property returns one of the following values: posix, nt, mac, os2, ce, or java.

Another place to look for the system the user is running is in the sys module and, in particular, the sys.platform attribute. This attribute often returns slightly different information than that found using os.name. For example, Windows is reported as win32 rather than nt or ce. On UNIX another function in os called os.uname() provides slightly more detail. If you have several different OSes available to you, it can be interesting to compare the results from these different techniques. It is recommended that you use the os.name option simply because it is universally available and returns a well-defined set of results.
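You can compare the techniques side by side; the values in the comments are typical for Linux and are assumptions about your particular machine:

```python
import os
import sys

print(os.name)        # 'posix' on Linux and macOS, 'nt' on Windows
print(sys.platform)   # 'linux', 'darwin', 'win32', and so on

if os.name == 'posix':
    # os.uname() only exists on UNIX-like systems
    print(os.uname().sysname)
```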

One other snippet of information that is often useful to collect is the size of the user’s terminal in terms of its lines and columns. You can use this information to modify the display of messages from your scripts. The shutil module provides a function for this called shutil.get_terminal_size(), and it is used like this:

  >>> import shutil
  >>> cols, lines = shutil.get_terminal_size()
  >>> cols
  80
  >>> lines
  49

If the terminal size cannot be ascertained, the default return value is 80 × 24. A different default can be specified as an optional argument, but 80 × 24 is usually a sensible option because it’s the traditional size for terminal emulators.

Obtaining Information About the Current Process

It can be useful for a program to know something about its current status and runtime environment. For example, you might want to know the process identity or if the process has a preferred folder in which to write its data files or read configuration data. The OS modules provide functions for determining these values.

One such source of process information is the process environment, as defined by environment variables. The os module provides a dictionary called os.environ that holds all the environment variables for the current process.

The disadvantage of environment variables is that they are highly volatile. Users can create them and remove them. Applications can do likewise, so it is dangerous to rely on the existence of an environment variable; you should always have a default value that you can fall back on. Fortunately, some values are fairly reliable and usually present. Three of these are particularly useful for Windows users because the pwd.getpwuid() and os.uname() functions discussed earlier are not available. These are HOME, OS, and PROCESSOR_ARCHITECTURE.

If you do try to access a variable that is not defined, you get the usual Python dictionary KeyError. On most, but not all, operating systems, a program can set, or modify, environment variables. If this feature is supported for your OS, then Python reflects any changes to the os.environ dictionary back into the OS environment. In addition to using environment variables as a source of user information, it is quite common to use them to define user-specific configuration details about a program—for example, the location of a database. This practice is slightly frowned upon nowadays, and it’s considered better to use a configuration file for such details. But if you are working with older applications, you may need to refer to the environment for such things.
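In line with the advice above, always supply a fallback when reading such variables. A brief sketch, where DBHOME is a hypothetical variable used only for illustration:

```python
import os

# Dict-style access (os.environ['DBHOME']) raises KeyError if the
# variable is unset, so prefer .get() with a sensible default
db_location = os.environ.get('DBHOME', '/var/data/mydb')
print(db_location)

# Writing to os.environ is reflected back into the process environment
os.environ['MYAPP_MODE'] = 'test'
print(os.environ['MYAPP_MODE'])
```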

Managing Other Programs

It is often useful to be able to run other programs from within a script, and the subprocess module is the preferred tool for this. The subprocess module contains a class called Popen that provides a very powerful and flexible interface to external programs. The module also has several convenience functions that you can use when a simpler approach is preferred. The documentation describes how to use all of these features; in this section you use only the simplest function, subprocess.call(), and the Popen class.

The most basic use of the subprocess module is to call an external OS command and simply allow it to run its course. The output is usually displayed on screen or stored in a data file somewhere. After the program completes, you can ask the user to make some kind of selection based on what was displayed or you can access the data file directly from your code. You can force many OS tools, especially on UNIX-based systems, into producing a data file as output by providing suitable command-line options or by using OS file redirection. This technique is a very powerful way to harness the power of OS utilities in a way that Python can use for further processing.

This basic mechanism for calling a program is wrapped up in the subprocess.call() function. This function has a list of strings as its first parameter, followed by several optional keyword parameters that are used to control the input and output locations and a few other things.

The easiest way to see how it works is to try it out.
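A minimal sketch, assuming a UNIX-like system (Windows users can try ['cmd', '/C', 'dir'] instead):

```python
import subprocess

# Run the command, let its output go straight to the terminal,
# and collect the exit status when it finishes
rc = subprocess.call(['ls', '-l'])
print('exit status:', rc)   # 0 conventionally means success
```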

One problem that can occur when running external programs is that the OS cannot find the command. You generally get an error message when this happens, and you need to explicitly provide the full path to the program file, assuming it does actually exist.

Finally, consider how to stop a running process. For interactive programs, the simplest way is for the user to close the external program in the normal way, or to issue an interrupt signal using Ctrl+C or Ctrl+Z, or whatever is the norm on the user’s OS. But for non-interactive programs, you may need to intervene from the OS, usually by examining the list of running processes and explicitly terminating the errant process.
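If your script started the process itself via Popen, you can also stop it programmatically. A sketch, assuming a UNIX-like system with the sleep command available:

```python
import subprocess
import time

proc = subprocess.Popen(['sleep', '60'])   # a long-running dummy process
time.sleep(0.2)                            # give it a moment to start
proc.terminate()                           # politely ask it to stop (SIGTERM on UNIX)
proc.wait()                                # reap it and collect the status
print(proc.returncode)                     # negative signal number on UNIX
```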

You have just seen how easy it is to use subprocess.call() to start an external process. You now learn how the subprocess module gives you much more control over processes and, in particular, how it enables your program to interact with them while they are running, especially how to read the process output directly from your script.

Managing Subprocesses More Effectively

You can use the Popen class to create an instance of a process, or command. Unfortunately, the documentation can appear rather daunting because the Popen constructor has quite a few parameters. The good news is that nearly all of those parameters have useful default values and can be ignored in the simplest cases. Thus, to simply run an OS command from within a script, you only need to do this (Windows users should substitute the dir command from the previous example):

  >>> import subprocess as sub
  >>> sub.Popen('ls *.*', shell=True)
  <subprocess.Popen object at 0x7fd3edec>
  >>> book tmp

Notice the shell=True argument, and that the command is now passed as a single string rather than a list of words. shell=True hands that string to the OS command processor, or shell, for interpretation. Doing so ensures that the wildcard characters ('*.*'), as well as any string quotes and the like, are all interpreted the way you expect. If you do not use the shell parameter, and pass the command as a list instead, this happens:

  >>> sub.Popen(['ls', '*.*'])
  <subprocess.Popen object at 0x7fcd328c>
  >>> ls: cannot access *.*: No such file or directory

Without shell=True, no wildcard expansion takes place; the ls command is asked to list a file with the literal name '*.*', which doesn’t exist.

The problem with using shell=True is that it also creates security issues in the form of a potential injection attack, so never use this if the commands are formulated from dynamically created strings, such as those read from a file or from a user.

To access the output of the command being run, you can add a couple of extra features to the call, like so:

  >>> lsout = sub.Popen('ls *.*', shell=True, stdout=sub.PIPE).stdout
  >>> for line in lsout:
  ...     print(line)

Here you specify that stdout should be a sub.PIPE and then assign the stdout attribute of the Popen instance to lsout. (A pipe is just a data connection to another process, in this case between your program and the command that you are executing.) Having done so, you can then treat the lsout variable just like a normal Python file and read from it—and so on.

You can send data into the process in much the same way by specifying that stdin is a pipe to which you can then write. The valid values that you can assign to the various streams include open files, file descriptors, or other streams (so that stderr can be made to appear on stdout, for example). Note that it’s possible to chain external commands together by setting, for example, the input of the second program to be the output of the first. That produces a similar effect to using the OS pipe character (|) on a command line.
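As a sketch of that chaining idea, here is the equivalent of the shell pipeline ls | sort -r, assuming a UNIX-like system:

```python
import subprocess as sub

p1 = sub.Popen(['ls'], stdout=sub.PIPE)
p2 = sub.Popen(['sort', '-r'], stdin=p1.stdout, stdout=sub.PIPE)
p1.stdout.close()              # lets p1 receive SIGPIPE if p2 exits early
listing = p2.communicate()[0]  # the sorted directory listing, as bytes
print(listing.decode())
```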

In the Try It Out examples, you accessed stdin and stdout directly; however, this can sometimes cause problems, especially when running processes concurrently or within threads, because the pipes can fill up and block the process. To avoid these issues, it’s recommended that you use the Popen.communicate() method and index the resulting tuple for the appropriate stream. This is slightly more complex to use, but it avoids the problems just mentioned. Popen.communicate() takes an input string (equivalent to stdin) and returns a tuple whose first element is the content of stdout and whose second is the content of stderr. So, repeating the file listing example using Popen.communicate() looks like this:

  >>> ls = sub.Popen(['ls'], stdout=sub.PIPE)
  >>> lsout = ls.communicate()[0]
  >>> print(lsout)
  b'fileA.txt\nfileB.txt\nls.txt\n'
  >>>

To conclude this section, it is worth pointing out that, for simplicity, you have been using fairly basic commands, such as ls, in the examples. Many of these commands can be performed equivalently from within Python itself (as you see shortly). The real value in mechanisms like subprocess.call() and Popen() is in running much more complex programs such as file conversion utilities and image-processing batch tools. Writing the equivalent functionality of these tools in Python would be a major project, so calling the external program is a more sensible alternative. You use Python where it is strongest, in orchestrating and validating the inputs and outputs, but leave the “heavy lifting” to the more specialized applications.

Obtaining Information About Files (and Devices)

The os module is heavily biased to the UNIX way of doing things. As such it treats devices and files similarly. So finding out about devices such as the current terminal session looks a lot like finding out about files. In this section you now look at how you can determine file status and permissions and even how to change some of their properties from within your programs. Consider the following code:

  >>> import os
  >>> os.listdir('.')
  ['fileA.txt', 'fileB.txt', 'ls.txt', 'test.txt']
  >>> os.stat('fileA.txt')
  posix.stat_result(st_mode=33204, st_ino=1125899907117103,
  st_dev=1491519654, st_nlink=1, st_uid=1001, st_gid=513,
  st_size=257, st_atime=1388676837, st_mtime=1388677418,
  st_ctime=1388677418)

Here you checked the current directory ('.') listing with os.listdir(). (Now that you’ve seen os.listdir(), you hopefully realize that your use of ls or dir in subprocess was rather artificial because os.listdir() does the same job directly from Python, and does it more efficiently.) You then used the os.stat() function to get some information about one of the files. This function returns a named tuple object that contains 10 items of interest. Perhaps the most useful of these are st_uid, st_size, and st_mtime. These values represent the file owner’s user ID, the size, and the last modification date/time. The times are integers that must be decoded using the time module, like so:

  >>> import time
  >>> time.localtime(1388677418)
  time.struct_time(tm_year=2014, tm_mon=1, tm_mday=2, tm_hour=15,
  tm_min=43, tm_sec=38, tm_wday=3, tm_yday=2, tm_isdst=0)
  >>> time.strftime("%Y-%m-%d", time.localtime(1388677418))
  '2014-01-02'

Here you used the time module’s localtime() function to convert the integer st_mtime value into a time tuple showing the local time values and from there into a readable date string using the time.strftime() function with a suitable format string. (You look more closely at the time module in the “Using the Time Module” section later in this chapter.)

The simple 10-value tuple returned from os.stat() is generally convenient, but more details are available via os.stat() than the tuple provides directly. Some of these additional values are OS dependent, such as the st_obtype attribute found on RiscOS systems. You need to do a little bit more work to dig these out. You can access the details by using object attribute dot notation.

Perhaps the most interesting field that you can access from os.stat() is the st_mode value, which tells you about the access permissions of the file. You use it like this:

  >>> import os
  >>> stats = os.stat('fileA.txt')
  >>> stats.st_mode
  33204

But that’s not too helpful; it’s just an apparently random number! The secret lies in the individual bits making up the number; it’s another bitmask. You may recall the umask bitmask that you looked at earlier in the chapter. The st_mode is conceptually similar to the umask, but with the bit meanings reversed. You can see how the access details are encoded by looking at the last 9 bits, like this:

  >>> bin(stats.st_mode)[-9:]
  '110110100'

By using the bin() function in combination with a slice, you have extracted the binary representation of the last 9 bits. Reading those as 3 groups of 3, you can see the read/write/execute values for Owner, Group, and World, respectively. Thus, in this example, Owner and Group each have the read and write bits set to 1 (True) but the execute bit set to 0 (False), while World has only the read bit set to 1 (True). (Note that here a 1 grants access—the direct inverse of the meanings of the umask bits; do not confuse the two!)

The higher order bits also have meanings, and the stat module contains a set of bitmasks that can be used to extract the details on a bit-by-bit basis. For most purposes the preceding access bits are sufficient, and helper functions exist in the os.path module that enable you to access that information. You’ll revisit this theme when you look at os.path later in the chapter.
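As a sketch of that bit-by-bit approach, here is how the stat module’s masks pick out individual permission bits from st_mode (the file is a throwaway created just for the demonstration):

```python
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()   # mkstemp creates the file owner-read/write only
os.close(fd)

mode = os.stat(path).st_mode
print(bool(mode & stat.S_IRUSR))   # owner-read bit
print(bool(mode & stat.S_IWUSR))   # owner-write bit
print(bool(mode & stat.S_IWOTH))   # world-write bit (False for mkstemp files)

os.remove(path)
```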

You have several other ways to determine access rights to a file in Python. In particular, the os module provides a convenience function—os.access()—that takes a filename and a flag (one of os.F_OK, os.R_OK, os.W_OK, or os.X_OK) and returns a boolean result indicating whether the file exists, or is readable, writable, or executable, respectively. This is easier to use than the underlying os.stat() and bitmask approach, but it’s useful to know where the function gets its data.
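A quick sketch of os.access() against another throwaway file:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

print(os.access(path, os.F_OK))   # True: the file exists
print(os.access(path, os.R_OK))   # True: it is readable
print(os.access(path, os.X_OK))   # False: mkstemp files are not executable

os.remove(path)
print(os.access(path, os.F_OK))   # False: it is gone now
```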

Finally, the os documentation points out a potential issue when checking for access before opening a file. There is a very short period between the two operations when the file could change either its access level or its content. So, as is usual in Python, it’s better to use try/except to open the file and deal with failure if it happens. You can then use the access checks to determine the cause of failure if necessary. The recommended pattern looks like this:

  try:
      myfile = open('myfile.txt')
  except PermissionError:
      pass    # test/modify the permissions here
  else:
      pass    # process the file here
  finally:
      pass    # close the file here

Having seen how to explore the properties of individual files, you now look at the mechanisms available for traversing the file system, reading folders, copying, moving, and deleting files, and so on.

Navigating and Manipulating the File System

Python provides built-in functions for opening, reading, and writing individual files. The os module adds functions to manipulate files as complete entities—for example, renaming, deleting, and creating links are all catered for. However, the os module itself provides only half of the story when it comes to working with files. You look at the other half when you explore the shutil module and other utility modules that work alongside os.

You start with reading and navigating the file system. You’ve already seen how you can use os.listdir() to get a directory listing and os.getcwd() to tell you the name of the current working directory. You can use os.mkdir() to create a new directory and os.chdir() to navigate into a different directory.

One problem with the os.mkdir() function used here is that it can only create a directory in an existing directory. If you try creating a directory in a place that doesn’t exist, it fails. Python provides an alternative function called os.makedirs()—note the difference in spelling—that creates all the intermediate folders in a path if they do not already exist.

You can see how that works with the following commands:

  >>> os.mkdir('test2/newtestdir')
  Traceback (most recent call last):
  	File "<stdin>", line 1, in <module>
  OSError: [Errno 2] No such file or directory: 'test2/newtestdir'
  >>> os.makedirs('test2/newtestdir')
  >>> os.chdir('test2/newtestdir')
  >>> print( os.getcwd() )
  /home/agauld/book/root/test2/newtestdir

Here the original os.mkdir() call produced an error because the intermediate folder test2 did not exist. The call to os.makedirs() succeeded, however, creating both the test2 and newtestdir folders, and you were able to change into newtestdir to prove the point. Note that os.makedirs() raises an error if the target folder already exists. You can use a couple of additional parameters to further tune the behavior, but the default values are usually what you need.

Another module, shutil, provides a set of higher level file manipulation commands. These include the ability to copy individual files, copy whole directory trees, delete directory trees, and move files or whole directory trees. One anomaly is the ability to delete a single file or group of files. That is actually found in the os module in the form of the os.remove() function for files (and os.rmdir() for empty directories, although shutil.rmtree() is more powerful and usually what you want).

Another useful module is glob. This module provides filename wildcard handling. You are probably familiar with the ? and * wildcards used to specify groups of files in the OS commands. For example, *.exe specifies all files ending in .exe. glob.glob() does the same thing in your code by returning a list of the matching filenames for a given pattern.
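A short sketch, working in a throwaway directory so the result is predictable:

```python
import glob
import os
import tempfile

# Build a scratch directory with a few known files
os.chdir(tempfile.mkdtemp())
for name in ('a.txt', 'b.txt', 'notes.log'):
    open(name, 'w').close()

print(sorted(glob.glob('*.txt')))   # ['a.txt', 'b.txt']
```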

If you look at the shutil documentation, you’ll see several variations on the copy functions with subtly different behaviors. In most cases the standard shutil.copy() function does what you want. Other features of shutil include the ability to create archived or compressed files in either zip or tar formats. Also, you can extend the functionality of several of the functions using optional arguments. One of the most interesting is the shutil.copytree() function, which has an ignore parameter. You can set this to a function that takes two arguments: a root folder and a list of files. (The function must accept two parameters even if they are not actually used by it.) The function then returns another list of filenames that shutil.copytree() ignores. This function is then called by shutil.copytree() for each folder of the tree being copied, with the arguments being the current folder within the tree and the list of files produced by os.listdir() acting on that folder. This is useful for ignoring temporary or archive files, or files that can be re-created later. Here is a short example that copies a project directory tree but ignores any compiled Python files (i.e., those with an extension of .pyc).

  >>> import shutil as sh
  >>> def ignore_pyc(root, names):
  ...     return [name for name in names if name.endswith('pyc')]
  ...
  >>> # now test that it works
  >>> ignore_pyc('fred', ['1.py', '2.py', '2.pyc', '4.py', '5.pyc'])
  ['2.pyc', '5.pyc']
  >>> sh.copytree('projdir', 'projbak', ignore=ignore_pyc)

In this case you used a list comprehension to build the ignore list, but you could equally just return a hard-coded filename (for example RCS to avoid copying version control files across) or you could have a much more complex piece of logic involving database lookups or other complex processing. The scenario of testing for a standard pattern ('*.pyc' in your case) is so common that shutil has a helper function called shutil.ignore_patterns(), which takes a list of glob-style patterns and returns a function that can be used in shutil.copytree(). Here is the previous example again, but this time using shutil.ignore_patterns():

  >>> sh.copytree('projdir', 'projbak', ignore=sh.ignore_patterns('*.pyc') )

Remember that the ignore function is called for every folder being copied, so if it is very complex, the copytree() operation could become quite resource-intensive and slow.

Finally, consider a submodule of os called os.path. The os.path module contains several helpful tests and utility functions that can help you when using the higher-level functions already discussed. The most useful functions are for creating paths, deconstructing paths, expanding user details, testing for path existence, and obtaining some information about file properties.

You start your exploration of os.path by looking at some helpful test functions. You saw earlier how you can use os.stat() to extract information about a file. os.path provides some helper functions that get the more common features more easily. You can, for instance, determine the size of a file using os.path.getsize(), the modification time using os.path.getmtime(), and the creation time with os.path.getctime(). You can also tell whether a name, returned by os.listdir() for example, is a file or a directory using os.path.isfile() or os.path.isdir(). (You can even test for mount points and links if that is important to you.) All of these functions take a name as an argument and return True or False. That’s a bit easier than calling os.stat() and then using a combination of indexing and bitmasking to extract the details.

The next thing that os.path helps with is processing paths. You can find the full path to your file using os.path.abspath() and, if it’s a link, the path to the real file with os.path.realpath(). Having obtained that path, you can break it into its constituent parts. Python considers a full file path to look like this:

  [<drive>]<path to folder><filename><extension>

Using os.path.splitdrive(), you can read the drive letter (if you are on Windows, otherwise it is empty). os.path.dirname() finds the folder, and os.path.basename() gets the filename (including the extension). You can even get the folder path and filename in one go with os.path.split(). Usually that’s sufficient but, if necessary, you can further split the filename into its extension and core name with os.path.splitext(), in which case the extension includes the period, for example, myfile.exe returns myfile and .exe.
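Using a made-up path, the deconstruction functions behave like this (the drive portion is empty on UNIX):

```python
import os.path

path = '/home/agauld/book/myfile.exe'   # a hypothetical path

print(os.path.dirname(path))            # /home/agauld/book
print(os.path.basename(path))           # myfile.exe
print(os.path.split(path))              # ('/home/agauld/book', 'myfile.exe')
print(os.path.splitext('myfile.exe'))   # ('myfile', '.exe')
```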

Often, after having inspected and worked with the various path components, you want to reassemble the path or even create one from scratch. os.path provides another convenience function for this called os.path.join(), which takes the various elements and combines them into a single string using the current OS path separator, as defined in the constant os.sep. This is very important because path format is one area that varies considerably across operating systems. Since Mac OS X appeared, based on a UNIX kernel, things have been a little easier, and Windows usually accepts the UNIX-style / separator in addition to its native \ style. But it is still safer to use os.path.join() to create file paths if you plan on running your script on multiple computer types.

You can see this operation in action on your test files and folders.
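For example, a short sketch of building a path portably (the separator in the printed result depends on your OS):

```python
import os

# Build the path from its parts; os.path.join inserts os.sep as needed
p = os.path.join('TreeRoot', 'D3', 'D3-1', 'target.txt')
print(p)        # TreeRoot/D3/D3-1/target.txt on UNIX
print(os.sep)   # '/' on UNIX, '\' on Windows
```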

Plumbing the Directory Tree Depths

One common automation operation is to start at a given location and apply a particular action to every file (or type of file) in the file system below that location. This is often called “walking the directory tree,” and the os module contains a powerful and flexible function called os.walk() that helps you do just that. It is not the most straightforward function to use, so you spend this section looking at its key features.

You consider an example of os.walk() being used to find a specific file located somewhere within a given directory tree or subtree. You then create a new module with a findfile() function that you can use in your programs. That foundation can go on to form the basis for a whole group of functions that you can use to process directory trees.

First you need to create a test environment consisting of a hierarchy of folders under a root directory. (You can generate this structure by extracting the file TreeRoot.zip from the Chapter2.zip master file on the download site and then extracting the files within TreeRoot.zip, or you can use the OS tools to generate it manually.) Each folder contains some files, and one of the folders contains the file you want to find, namely target.txt. You can see this structure here:

  TreeRoot
 	FA.txt
 	FB.txt
 	D1
 		FC.txt
 		D1-1
 			FF.txt
 	D2
 		FD.txt
 	D3
 		FE.txt
 		D3-1
 			target.txt

The os.walk() function takes a starting point as an argument and returns a generator yielding tuples with 3 members (sometimes called a 3-tuple or triplet): the root, a list of directories in the current root, and a list of the current files in that root. If you look at the hierarchy you have created, you would expect the top-level tuple to look like this:

  ( 'TreeRoot', ['D1','D2','D3'], ['FA.txt','FB.txt'])

You can check that easily by writing a for loop at the interactive prompt:

  >>> import os
  >>> for t in os.walk('TreeRoot'):
  ...     print(t)
  ...
  ('TreeRoot', ['D1', 'D2', 'D3'], ['FA.txt', 'FB.txt'])
  ('TreeRoot/D1', ['D1-1'], ['FC.txt'])
  ('TreeRoot/D1/D1-1', [], ['FF.txt'])
  ('TreeRoot/D2', [], ['FD.txt'])
  ('TreeRoot/D3', ['D3-1'], ['FE.txt'])
  ('TreeRoot/D3/D3-1', [], ['target.txt'])

This clearly shows the path taken by os.walk(), starting with the first directory at the top level and drilling down before moving on to the next directory, and so on. It also shows how you can take a file from the files list and construct its full path by combining the name with the root value of the containing tuple.

By writing your function to use regular expressions and to return a list, you can make it much more powerful (but also slower!) than the simple glob.glob() that you saw earlier.
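As a sketch of the idea (the find_files() name and its regular-expression interface are invented for illustration, not part of the chapter's download code):

```python
import os
import re

def find_files(pattern, base='.'):
    """Return the full path of every file below base whose
    name matches the regular expression pattern."""
    regex = re.compile(pattern)
    matches = []
    for root, dirs, files in os.walk(base):
        for name in files:
            if regex.match(name):
                matches.append(os.path.join(root, name))
    return matches
```

Called as find_files(r'.*\.txt$', 'TreeRoot'), it would return every .txt file in the test hierarchy, however deeply nested.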

You’ve seen how Python helps you work with the OS. In the next section, you see how Python enables you to work with dates and times.

Working with Dates and Times

One of the most common features of scripting tasks is the use of dates and times. This could be to identify files older than a certain date or between a certain range or it might be to set a process to run at a certain time or interval. You might need to compare dates and times in data to select an appropriate subset of a file’s content. Reading dates and times and comparing their values is necessary in many scenarios.

Unfortunately, dates and times are not clearly defined values like integers or floats. They tend to be stored as strings in a multitude of formats. For example, 2016-02-07, 02/07/2016, and 07/02/2016 are all possible representations for the 7th day of February, 2016. The situation is further complicated by the possibility of rendering months and days using name abbreviations such as Jan, Feb, or Mon, Tue, and so on. Add the fact that years may be abbreviated to two digits and that the separators can be any of a number of characters, and you start to see the complexity. How can you reliably read a date value from a given string? Time values are almost as complex, especially if you have to consider time zones and daylight saving rules. Fortunately, Python offers several modules to help you do just that. The most basic is the time module, augmented by the datetime module and, for some tasks, the calendar module.

Using the time Module

The time module stores times (including dates) in two different formats. The first is the number of seconds since the epoch, which is simply a fixed date in history; for UNIX-based systems that's 1st January 1970. (Did you notice that's yet another date representation?) The other representation is a tuple of fields representing the various parts of a date/time: year, month, day, hour, minute, second, and so on. The details are all found in the time module documentation, but you need to remember which underlying format you are using. The time module contains various conversion functions to switch between them.
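A short sketch shows the two representations and a round trip between them:

```python
import time

now = time.time()            # seconds since the epoch, as a float
parts = time.localtime(now)  # the same instant as a struct_time tuple

# struct_time fields can be accessed by name or by index
print(parts.tm_year, parts.tm_mon, parts.tm_mday)

# mktime() converts a local-time tuple back to epoch seconds
# (fractions of a second are lost in the round trip)
print(int(time.mktime(parts)) == int(now))   # True
```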

Two very important functions for reading and writing times as strings take into account most of the issues just discussed. These functions are called strptime() (the “p” stands for parse) and strftime() (the “f” stands for format). The secret to using these functions lies in a format string. This string tells the function how to map string values to/from time values. The format string uses % markers to indicate a field and a set of character codes to indicate what the field should contain. For example, %Y indicates a four-digit year whereas %y indicates a two-digit year. %m indicates a two-digit month, and %B indicates the full month name (taking into account the local language settings). A table in the time module documentation for strftime() provides the definitive list.

The easiest way to come to grips with these functions is to play with them at the Python prompt. You can try it out now.
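For example, a short session like this parses a string with strptime() and writes it back out in a different style with strftime():

```python
import time

# Parse a date string into a struct_time using explicit format codes
t = time.strptime('2016-02-07', '%Y-%m-%d')
print(t.tm_year, t.tm_mon, t.tm_mday)    # 2016 2 7

# Format the same value back out, day first with a two-digit year
print(time.strftime('%d/%m/%y', t))      # 07/02/16
```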

The time module includes several other functions for managing time zones and for getting information about the system clocks. You can also tell if daylight savings time is in effect on the computer.

Finally, and far from least, the time module contains a sleep() function that pauses your program for the specified number of seconds. This is often useful in scripting when you are using background processes to perform a task that may require some time. It is also useful when polling a resource such as a network connection while waiting for data to arrive. You can use fractions of a second, but you should realize that the timing is only approximate because of OS process scheduling overheads and the like.
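For instance, a polling loop built on sleep() might be sketched like this (wait_for() is an invented helper, not a standard library function):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.1):
    """Call condition() every interval seconds until it returns
    True or timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)   # only approximate, as noted above
    return False
```

You might use it as wait_for(lambda: os.path.exists('results.csv'), timeout=30) while a background process produces a file.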

Introducing the datetime Module

The datetime module includes several objects and methods that represent both absolute dates and times as well as relative dates and times. Relative values are used for computing differences between times and save you having to do messy calculations on second-based values, dividing by 60 and 24, and so on. Some overlap exists between the time functions and the datetime objects. In general, if you are doing comparisons or time-based calculations, you should use the datetime module rather than time. If you are using both in the same code, use the plain import module style (rather than from module import *) to ensure no name collisions occur.

The main classes exposed by the datetime module are date, time, and datetime, whose names are indicative of their scope. datetime and time objects can have a timezone attribute set to a timezone object to take account of time zone effects. If you have complex time processing to do, you may need to subclass the timezone class to provide any non-trivial algorithms required. In this book you only use the basic objects from the module. The other, and perhaps most useful, object type exposed is the timedelta class, which handles time durations such as the result of a time computation or a relative period such as a year or a month. The datetime module supports many time-based calculations using timedelta objects, including addition, subtraction, multiplication of a delta by a number, and even various forms of division.
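A brief sketch of timedelta arithmetic shows the kind of calculation that would otherwise mean dividing by 60 and 24 by hand:

```python
from datetime import datetime, timedelta

start = datetime(2013, 12, 5)
loan = timedelta(days=21)

print(start + loan)                          # add a delta to a datetime
print((datetime(2013, 12, 19) - start).days) # subtracting gives a delta: 14
print(loan * 2)                              # multiply a delta by a number
print(loan / timedelta(days=7))              # divide one delta by another: 3.0
```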

You can initialize the date object by passing year, month, and day values, all of which are mandatory. You can initialize the time object passing hour, minute, and second values, all of which are optional and default to zero. The datetime object, you will not be surprised to learn, uses the full gamut of year, month, day, hour, minute, and second. Some helpful class methods return instances based on object arguments. An example is the date.today() method that returns today’s date or the date.fromtimestamp() method that takes a time value in seconds as its argument. Various attributes and methods exist for extracting data about the date after it has been created. The date class includes a strftime() method similar to the one in the time module (but has no corresponding strptime(); for that, you must look to the datetime object).

The time object is conceptually similar but, as mentioned earlier, includes the capability to take account of time zone data, including daylight savings information. time objects, like date objects, support strftime() but not strptime().

datetime objects are a combination of both date and time objects and support a combination of both objects’ methods. datetime also adds a few extra methods of its own, including a now() class method for initialization to the current date and time, and combine() class method that takes date and time objects as arguments and returns a combined datetime object with the same values. You can do basic arithmetic using a combination of datetime and timedelta objects, the latter being either an argument or result as appropriate. datetime objects also support both strftime() and strptime() methods, which work in the same way as those in the time module described earlier.
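The following sketch pulls these pieces together:

```python
from datetime import date, time, datetime

d = date(2016, 2, 7)          # year, month, and day are mandatory
t = time(14, 30)              # hour and minute; second defaults to 0

dt = datetime.combine(d, t)   # build a datetime from the two parts
print(dt)                     # 2016-02-07 14:30:00

# strptime() parses a string; strftime() formats one
parsed = datetime.strptime('7/2/2016', '%d/%m/%Y')
print(parsed.strftime('%Y-%m-%d'))   # 2016-02-07

print(datetime.now() > dt)    # datetime objects compare directly
```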

You use the datetime objects as part of a larger example in the Try It Out “Parsing XML with ElementTree” later in the chapter.

Introducing the calendar Module

The calendar module is the simplest of the time-based modules in Python’s standard library. Essentially, it generates a calendar for a given year. The calendar is a calendar.Calendar class instance that has several support methods that allow you to, for example, iterate over the days in a given month or produce various formatted text strings that can be useful in presenting user messages in a script. Calendars can be formatted as plaintext or in HTML.

calendar is probably the least used of the three modules discussed, but it has some useful features that are not available elsewhere and would be time consuming to reproduce. Among these are some utility functions such as isleap(), which reports whether or not the specified year is a leap year, and timegm(), which converts a time.gmtime() tuple into seconds (why it is located in the calendar module instead of time is something of a mystery).

Finally, a couple of printing functions, prcal() and prmonth(), take a year and a year/month combination, respectively, as arguments and display their output on stdout. These can be useful when you want to prompt your user to choose a date.
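A minimal sketch of those utilities in action:

```python
import calendar
import time

print(calendar.isleap(2016))          # True: 2016 is a leap year

# monthrange() returns the weekday of the 1st and the number of days
print(calendar.monthrange(2016, 2))   # (0, 29): a Monday, 29 days

# timegm() is the inverse of time.gmtime()
print(calendar.timegm(time.gmtime(0)))   # 0

calendar.prmonth(2016, 2)             # print February 2016 on stdout
```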

There are some third-party modules available that try to simplify date and time handling in Python by combining all the functions from all of the standard modules into a single more user-friendly module. Some examples include arrow and delorean, but an Internet search will reveal several others.

In the next section, you see how Python assists in reading and writing several common data file formats.

Handling Common File Formats

When writing scripts to control several applications or utilities, it's common to use files as the data transfer mechanism between applications. Unfortunately, the output format of one application may not be in exactly the right format for the next application to read. At this point the script itself must convert the output file into the appropriate form. Most applications produce and consume variants of a few standard formats such as CSV (comma-separated values), HTML (HyperText Markup Language), XML (eXtensible Markup Language), Windows INI (named after the file extension) and, more recently, JSON (JavaScript Object Notation). You now look at how Python’s standard library supports these various formats. (JSON is covered in Chapter 5, “Python on the Web,” because it is most commonly associated with web applications.) These modules make it easier to read and write data than if you tried to do it using the standard Python text-processing tools such as string methods or regular expressions.

Using Comma-Separated Values

The comma-separated value (CSV) format has been around for many years. Its name is something of a misnomer because, though commas are the most common separator, the term CSV is often applied to files using tabs or pipes (|) or, indeed, just about any other kind of character, as a separator. At first glance it might seem easy to parse data from such a file using the built-in string split() method. The problem is that the format is not absolutely standardized, and different files have different ways of representing fields that contain the separator within them. Also, lines of data can sometimes be split over multiple physical lines in the file. To make dealing with this diversity easier, Python includes the csv module in its standard library.

The csv module provides two mechanisms for reading CSV files. The simplest just reads each line into a tuple, and the programmer has to keep track of what each position in the tuple represents. The second method reads the data into a dictionary, often using the first line of the file as the keys of the dictionary. This is a particularly flexible mechanism because it accommodates changes in the file format (such as adding new keys) without breaking existing code.

The module defaults to the CSV format used by Microsoft Excel, but you can define your own formats too; it just takes a bit of extra work. In this chapter you are dealing with the Excel format only.

The examples that follow are based on a simple spreadsheet, toolhire.xlsx, as shown in Figure 2.1. (All of the data files discussed are included in the ToolhireData folder of the Chapter2.zip download file in case you don’t have access to Excel.) The spreadsheet describes a small tool hire facility set up by some friends to keep track of who is borrowing what from whom.


Figure 2.1 The toolhire spreadsheet

The data was saved to CSV format in the file toolhire.csv. The raw data in that file looks like this:

  ItemID,Name,Description,Owner,Borrower,DateLent,DateReturned
  1,LawnMower,Small Hover mower,Fred,Joe,4/1/2012,4/26/2012
  2,LawnMower,Ride-on mower,Mike,Anne,9/5/2012,1/5/2013
  3,Bike,BMX bike,Joe,Rob,7/3/2013,7/22/2013
  4,Drill,Heavy duty hammer,Rob,Fred,11/19/2013,11/29/2013
  5,Scarifier,"Quality, stainless steel",Anne,Mike,12/5/2013,
  6,Sprinkler,Cheap but effective,Fred,,,

This is a fairly simple file, but does include one of the complexities described earlier. Notice that Anne’s scarifier description is surrounded by double quotes because it contains a comma.

After importing the module, you can read the file into a list of tuples like so:

  >>> import csv
  >>> with open('toolhire.csv') as th:
  ...	toolreader = csv.reader(th)
  ...	print(list(toolreader))
  ...
  [['ItemID', 'Name', 'Description', 'Owner', 'Borrower',
  'DateLent', 'DateReturned'],
  ['1', 'LawnMower', 'Small Hover mower', 'Fred', 'Joe', '4/1/2012', '4/26/2012'],
  ['2', 'LawnMower', 'Ride-on mower', 'Mike', 'Anne', '9/5/2012', '1/5/2013'],
  ['3', 'Bike', 'BMX bike', 'Joe', 'Rob', '7/3/2013', '7/22/2013'], ['4', 'Drill',
  'Heavy duty hammer', 'Rob', 'Fred', '11/19/2013', '11/29/2013'], ['5',
  'Scarifier', 'Quality, stainless steel', 'Anne', 'Mike', '12/5/2013', ''],
  ['6', 'Sprinkler', 'Cheap but effective', 'Fred', '', '', '']]
  >>>

Notice that Anne’s scarifier description no longer has double quotes, but does still contain the original comma. Figuring out how to do that is the value that the csv module adds to your programs. You can apply lots of options both to the file in the call to open() and in the creation of the csv.reader object. The example shows the minimal set.

Writing to a CSV file is just as easy. In this example, you create a new page of data for the toolhire.xlsx spreadsheet that lists the various tools available, along with some details about when they were made available, their condition, and original price. You save the data as a CSV file called tooldesc.csv that you can load into Excel as a new worksheet.

Here is the code:

  >>> import csv
  >>> items = [
  ...     ['1','Lawnmower', 'Small Hover mower', 'Fred','$150','Excellent','2012-01-05'],
  ...     ['2','Lawnmower','Ride-on mower','Mike','$370','Fair','2012-04-01'],
  ...     ['3','Bike','BMX bike','Joe','$200','Good','2013-03-22'],
  ...     ['4','Drill','Heavy duty hammer','Rob','$100','Good','2013-10-28'],
  ...     ['5','Scarifier','Quality, stainless steel','Anne','$200','2013-09-14'],
  ...     ['6','Sprinkler','Cheap but effective','Fred','$80','2014-01-06']
  ...     ]
  >>> with open('tooldesc.csv','w', newline='') as tooldata:
  ...     toolwriter = csv.writer(tooldata)
  ...     for item in items:
  ...         toolwriter.writerow(item)
  ...
  44
  39
  33
  34
  34
  33
  >>>

As you can see, the writer.writerow() method returns the number of characters written to the file. Mostly you just ignore that! The output file looks like this:

  1,Lawnmower,Small Hover mower,Fred,$150,Excellent,2012-01-05
  2,Lawnmower,Ride-on mower,Mike,$370,Fair,2012-04-01
  3,Bike,BMX bike,Joe,$200,Good,2013-03-22
  4,Drill,Heavy duty hammer,Rob,$100,Good,2013-10-28
  5,Scarifier,"Quality, stainless steel",Anne,$200,2013-09-14
  6,Sprinkler,Cheap but effective,Fred,$80,2014-01-06

Notice that the scarifier description once again has quotes around it, and the date fields are written exactly as is. If you want the dates in the same format as Excel produced in the original CSV file, you need to do that manipulation before you write the data. This is very typical of the kinds of inconsistencies you find when using CSV files as a transport between different applications. You can use the datetime module to convert the date formats: the datetime.strptime() class method parses an input string into a datetime object, and the strftime() method writes that object out in the format you want. Try that out now.
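As a sketch of that conversion (the reformat_date() helper is invented for illustration), you might write:

```python
from datetime import datetime

def reformat_date(text):
    """Convert an Excel-style m/d/yyyy date string to yyyy-mm-dd.
    Empty fields are passed through unchanged."""
    if not text:
        return text
    return datetime.strptime(text, '%m/%d/%Y').strftime('%Y-%m-%d')

print(reformat_date('4/1/2012'))   # 2012-04-01
print(reformat_date(''))           # empty string passes through
```

You would apply this to the date columns of each row before calling writerow().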

So far you have been using the basic reader and writer components of the csv module that work with lists of data items. You may recall from earlier that csv also supports a dictionary-based approach. You now use that to access the original toolhire.csv file. If you look again at the content of the CSV file, you notice that the first line is a list of headings that describe the columns. The csv module can exploit that by using the headings as keys in a dictionary. This makes accessing individual fields much more reliable because you no longer need to rely on the numeric position of the field in the file.

The way it works is very similar to the previous code, but instead of using a csv.reader object, you use a csv.DictReader. It looks like this:

  >>> with open('toolhire.csv') as th:
  ...	rdr = csv.DictReader(th)
  ...	for item in rdr:
  ...		print(item)
  ...
  {'DateReturned': '4/26/2012', 'Description': 'Small Hover mower',
  'Owner': 'Fred', 'ItemID': '1', 'DateLent': '4/1/2012',
  'Name': 'LawnMower', 'Borrower': 'Joe'}
  {'DateReturned': '1/5/2013', 'Description': 'Ride-on mower',
  'Owner': 'Mike', 'ItemID': '2', 'DateLent': '9/5/2012',
  'Name': 'LawnMower', 'Borrower': 'Anne'}
  {'DateReturned': '7/22/2013', 'Description': 'BMX bike',
  'Owner': 'Joe', 'ItemID': '3', 'DateLent': '7/3/2013',
  'Name': 'Bike', 'Borrower': 'Rob'}
  {'DateReturned': '11/29/2013', 'Description': 'Heavy duty hammer',
  'Owner': 'Rob', 'ItemID': '4', 'DateLent': '11/19/2013',
  'Name': 'Drill', 'Borrower': 'Fred'}
  {'DateReturned': '', 'Description': 'Quality, stainless steel',
  'Owner': 'Anne', 'ItemID': '5', 'DateLent': '12/5/2013',
  'Name': 'Scarifier', 'Borrower': 'Mike'}
  {'DateReturned': '', 'Description': 'Cheap but effective',
  'Owner': 'Fred', 'ItemID': '6', 'DateLent': '',
  'Name': 'Sprinkler', 'Borrower': ''}
  >>>

Notice that, as is normal with a dictionary, the fields are not in the original order, and they are keyed using the labels from the first line. You can see that, as before, the scarifier description has lost the quotes but retained its comma.

If, instead of printing the items, you store them in a variable, you can do some interesting analysis of the data using list comprehensions. For example, to see all of the items owned by Fred, you can do this:

  >>> with open('toolhire.csv') as th:
  ...	rdr = csv.DictReader(th)
  ...	items = [item for item in rdr]
  ...
  >>> [item['Name'] for item in items if item['Owner'] == 'Fred']
  ['LawnMower', 'Sprinkler']
  >>>

You could do the same thing using the basic reader and its lists, but you’d need to use numeric indices, which are much less readable. For example, the list comprehension using the earlier list would look like this:

  >>> [item[1] for item in toolList if item[3] == 'Fred']
  ['LawnMower', 'Sprinkler']

It isn’t nearly so obvious what you are returning or what the selection criteria are. Also, if the file format ever changed, you would need to change the indices everywhere in your code.

There is a matching DictWriter object that can write a dictionary out to a CSV file. You use it in the next Try It Out exercise.

You can use the DictReader even if your CSV file contains no labels. For example, the tooldesc2.csv file that you created in the previous Try It Out had no label line. You can remedy that by reading it into a DictReader and then writing it out with a DictWriter. The trick is to provide the labels as an argument to the DictReader constructor. Try it out now.
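A sketch of that round trip might look like this (the add_header() helper, the output file name, and the Price/Condition/DateAcquired labels are all illustrative; check the labels against your own data):

```python
import csv

FIELDS = ['ItemID', 'Name', 'Description', 'Owner',
          'Price', 'Condition', 'DateAcquired']

def add_header(src_name, dst_name, fieldnames=FIELDS):
    """Read a CSV file that has no heading line, supplying the labels
    ourselves, then write it back out with a header line added."""
    with open(src_name) as src:
        rows = list(csv.DictReader(src, fieldnames=fieldnames))
    with open(dst_name, 'w', newline='') as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()      # DictWriter can emit the label line
        writer.writerows(rows)
    return rows
```

Calling add_header('tooldesc2.csv', 'tooldesc3.csv') would produce a labeled copy of the unlabeled file.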

You’ve seen how to use the csv reader and writer objects to convert between the CSV file format and Python lists as well as the DictReader and DictWriter objects to do the same with dictionaries. You’ve also seen two examples of modifying a CSV file format to make it more suitable for subsequent processing. The csv module contains a few other features for dealing with non-Excel based CSV files, but you can read about those in the documentation if you need them.

Working with Config Files

Config files or, as they are often called, Windows “INI” files, have a very readable format that is also easy to work with programmatically. They have fallen out of favor in recent years because Microsoft now advocates the Windows Registry and non-Microsoft applications are moving to XML-based storage. However, there are plenty of legacy applications around that use this format. (A search for *.ini on a relatively clean installation of Windows 8.1 found several hundred files, so it is far from dead!)

The format is very good at storing multiple instances of similar data, such as per-node settings on a network, or for multiple categories of options, such as various screen sizes, or online versus offline operational parameters. The disadvantage of the Config format is that it can sometimes be too simple with the result that complex data is harder to fit into the format. Python provides the configparser module for reading and writing Config format data.

The basic structure of a Config file is as shown here:

  [DEFAULT]
  Option1=value1

  [SECTION1]
  Option2=value2
  Option3=value3

  [SECTION2]
  Option4=value4
  etc.

The DEFAULT section is noteworthy because options defined there apply to all following sections. The format has a lot of flexibility, with spaces and indentation optional, embedded sections, and various other variants, including the ability to interpolate a value from one option into another option. The configparser module can handle all of these and much more. It converts the data into, or from, a dictionary format similar to the kind used for the CSV files described in the previous section.

Basic usage is shown very clearly in the documentation with examples of creating a file and reading from it. Because there is little point in repeating that here, you can browse it at your leisure and then try it out in the following example.
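A minimal sketch (the section and option names are invented) of writing a config out and reading it back:

```python
import configparser
import io

config = configparser.ConfigParser()
config['DEFAULT'] = {'Timeout': '30'}
config['server'] = {'Host': 'localhost', 'Port': '8080'}

# write() accepts any file-like object; a StringIO keeps the
# sketch self-contained, but an open file works the same way
buf = io.StringIO()
config.write(buf)

config2 = configparser.ConfigParser()
config2.read_string(buf.getvalue())
print(config2['server']['Host'])      # localhost
print(config2['server']['Timeout'])   # 30, inherited from DEFAULT
```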

Working with XML and HTML files

You are probably familiar with HTML as the language of web pages. XML is also widely used as a so-called self-describing data format. XML and HTML are closely related formats. XML is much more rigidly defined, and that makes it easier to process using a computer. HTML is very forgiving of malformed content and, although that makes it easy to create by hand as well as with specialized editors, it makes HTML much more hit-or-miss to process accurately. HTML also has many variations because of proprietary web browser extensions. All of this means that HTML parsers have a trickier job and often yield less than perfect results when faced with badly formatted files. Because XML is easier to handle programmatically, you look at parsing it first and then extend the technique to cover HTML.

Parsing XML Files

Many different parsers are available for parsing XML. The Python standard library contains no fewer than five (dom, minidom, expat, ElementTree, and sax). These all fall into two categories: those that read the entire file into a tree-like data structure called a document object model (DOM), and those that scan the file looking for items of interest (an “event”) and trigger a response as the items are found. The former are more flexible for complex, or multiple, queries on the same set of data. The latter tend to be faster and slightly simpler to use. In this book you look at only two of the parsers, each representing one of these two approaches.

The first parser you consider is sax, which is an example of an event-based parser. To understand how event-based parsers work, consider the following example that parses some plaintext:

  >>> text = """mary had a little lamb
  ... its fleece was white as snow
  ... and everywhere that mary went
  ... the lamb was sure to go"""

  >>> def has_mary(aLine):
  ...	print( "We found: ", aLine)
  ...
  >>> def parse_text(theText, aPattern, function):
  ...	for line in theText.split('\n'):
  ...		if aPattern in line:
  ...			function(line)
  ...
  >>> parse_text(text,'mary',has_mary)
  We found: mary had a little lamb
  We found: and everywhere that mary went
  >>>

Here you create some text that you want to parse. You then define a function, has_mary(), that you want to be called every time mary is found in the text.

Next you create your event-driven parsing function, parse_text(). This function iterates over the input text line by line. If the search string, in this case mary, is found, then it calls the function that has been passed in.

When you execute parse_text() with your text string and the has_mary() function as arguments, it prints out the two lines containing mary.

The sax module works in a similar way to your parse_text() function; however, it uses events, such as detecting the start of an XML element, rather than plaintext patterns. It takes in an XML source text and a collection of events and associated event-handler functions. It then processes the XML text section by section, and if it finds a match to a given event, it calls the associated handler to deal with it. The parser does not store the XML data, it simply iterates over it. If you need to go back to access earlier data, you need to re-parse the entire file.

To investigate the sax parser, you need an XML file. You can find one, called toolhire.xml, in the ToolhireData folder of the Chapter2.zip file. This is simply an XML export of the toolhire.xlsx spreadsheet that you used earlier. A fragment of that file, including the parts you will be extracting, slightly edited for readability, is shown here:

  <?xml version="1.0"?>
  <?mso-application progid="Excel.Sheet"?>
  <Workbook
  ...
  <Worksheet ss:Name="Sheet1">
 	<Table ss:ExpandedColumnCount="1025" ss:ExpandedRowCount="7" x:FullColumns="1"
 	x:FullRows="1" ss:DefaultRowHeight="15">
 	<Column ss:AutoFitWidth="0" ss:Width="36"/>
  ...
 	<Row ss:StyleID="s36">
 		<Cell><Data ss:Type="String">ItemID</Data></Cell>
 		<Cell><Data ss:Type="String">Name</Data></Cell>
 		<Cell><Data ss:Type="String">Description</Data></Cell>
 		<Cell><Data ss:Type="String">Owner</Data></Cell>
 		<Cell><Data ss:Type="String">Borrower</Data></Cell>
 		<Cell><Data ss:Type="String">DateLent</Data></Cell>
 		<Cell><Data ss:Type="String">DateReturned</Data></Cell>
 	</Row>
 	<Row>
 		<Cell><Data ss:Type="Number">1</Data></Cell>
 		<Cell><Data ss:Type="String">LawnMower</Data></Cell>
 		<Cell><Data ss:Type="String">Small Hover mower</Data></Cell>
 		<Cell><Data ss:Type="String">Fred</Data></Cell>
 		<Cell><Data ss:Type="String">Joe</Data></Cell>
 		<Cell ss:StyleID="s37"><Data ss:Type="DateTime">
  2012-04-01T00:00:00.000</Data></Cell>
 		<Cell ss:StyleID="s37"><Data ss:Type="DateTime">
  2012-04-26T00:00:00.000</Data></Cell>
 		</Row>
  ...
  </Worksheet>
  </Workbook>

Assume you want to find the average length of loan. You use sax to extract just the DateLent and DateReturned fields for each item and store them as a tuple in a dates list. You can then later process those dates to find the duration for each lent item.

To initialize the parser, you need to create your handler and specify the events that you are interested in. sax actually uses a handler object, an instance of the xml.sax.handler.ContentHandler class, or more specifically, a subclass of it, to combine the event and function. Several predefined handler subclasses exist, including one for dealing with errors. The advantage of this approach is that many default methods are already defined and others can be easily overridden, such as startDocument(), which is called at the very beginning of parsing and is useful for setting up state variables and the like. For simple XML parsing tasks, you normally create a custom subclass of ContentHandler and then write your own versions of the startElement(), endElement(), and, possibly, the characters() methods.

By inspecting the XML file, you can see that the data you want is contained in a <Data> element and is identified by the ss:Type attribute being set to DateTime. The actual data is character data that sits between the start and end <Data> tags, so the expected event sequence is startElement(), followed by characters(), followed by endElement().

The code for your ToolHireHandler class looks like this (and is in the ToolHire folder of Chapter2.zip as toolhiresax.py):

import xml.sax
import xml.sax.handler

class ToolHireHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()
        self.dates = []
        self.dateLent = ''
        self.dateCounter = 0
        self.isDate = False

    def startElement(self, name, attributes):
        if name == "Data":
            data = attributes.get('ss:Type', None)
            if data == 'DateTime':
                self.isDate = True
                self.dateCounter += 1
            else:
                self.dateCounter = 0

    def endElement(self, name):
        self.isDate = False

    def characters(self, data):
        if self.isDate:
            if self.dateCounter == 1:
                self.dateLent = data
            else:
                self.dates.append((self.dateLent, data))

if __name__ == '__main__':
    handler = ToolHireHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse('toolhire.xml')
    print(handler.dates)

The initializer calls the superclass initializer and then sets up various data attributes that you use in the parsing and need to use across methods. It also creates an empty dates list to hold the results.

The main parsing method is the startElement() method, which looks out for Data elements and, when one is found, refines the search by selecting only those with an ss:Type attribute of DateTime. (You have to identify these values by inspecting the XML file manually.) Because you can have up to two dates in a single row, you use self.dateCounter to keep track of which date within the row you are handling. You use the self.isDate value to indicate to the characters() method that it is inside a date element. If the data is not a DateTime type, you reset self.dateCounter to 0.

The endElement() method ensures the self.isDate flag is reset to False ready for the next startElement() event to come along.

The characters() method is called whenever text content between tags is encountered. You are only interested in the date information so, if the self.isDate flag is not set, you simply ignore the character data. If the data is a date, you check whether it’s the first date, in which case you store it in the self.dateLent attribute; if it’s the second date, you store both dates in the self.dates list. If only one date is found, the characters() handler is not called a second time, and the date is not added to the dates list, thus ensuring you store only pairs of dates, which is what you need for the duration calculations.

Finally, the driver code at the bottom creates the handler and parser instances. It then sets the handler within the parser to your ToolHireHandler instance and executes the parse() operation on your XML file. After parsing is complete, it prints out the collected dates from the handler.

You repeat this exercise using the ElementTree DOM-based parser in the Try It Out at the end of this section. There you can compare and contrast the two techniques. First, though, you look at parsing HTML because the standard library HTML parser is very similar in style to the sax XML parser.
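As a preview of the DOM-based style, a sketch of the same date extraction using ElementTree might look like this (the namespace URI is the one SpreadsheetML normally declares; verify it against the xmlns attributes in your own toolhire.xml):

```python
import xml.etree.ElementTree as ET

SS = 'urn:schemas-microsoft-com:office:spreadsheet'

def extract_date_pairs(xml_text):
    """Collect (DateLent, DateReturned) pairs from a SpreadsheetML
    document, mirroring what the sax handler gathers."""
    root = ET.fromstring(xml_text)
    pairs = []
    for row in root.iter('{%s}Row' % SS):
        dates = [data.text
                 for data in row.iter('{%s}Data' % SS)
                 if data.get('{%s}Type' % SS) == 'DateTime']
        if len(dates) == 2:            # keep only complete loan records
            pairs.append(tuple(dates))
    return pairs
```

Because the whole tree is held in memory, you could re-query it for other fields without re-parsing the file.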

Parsing HTML Files

The standard library provides the html.parser module for parsing HTML. It works in a similar way to the sax parser, in that it is event driven. It is slightly simpler to use because it only has a single class with the handler methods defined within it. To show how it works, you once again extract the dates from the toolhire.xlsx spreadsheet, but this time from the HTML export. You can find this file in the zip file under the ToolhireData/toolhire_files folder as sheet001.htm.

The file looks, in part, like this:

  <html xmlns:v="urn:schemas-microsoft-com:vml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  xmlns:x="urn:schemas-microsoft-com:office:excel"
  xmlns="http://www.w3.org/TR/REC-html40">

  <head>
  <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
  <meta name=ProgId content=Excel.Sheet>
  ...
  <body link=blue vlink=purple>

  <table border=0 cellpadding=0 cellspacing=0 width=752 style='border-collapse:
  	collapse;table-layout:fixed;width:564pt'>
  <col width=64 style='width:48pt'>
  <col width=115 style='mso-width-source:userset;mso-width-alt:4205;width:86pt'>
  ...
  <tr class=xl66 height=21 style='height:15.75pt'>
  		<td height=21 class=xl66 width=64 style='height:15.75pt;width:48pt'>ItemID</td>
  		<td class=xl66 width=115 style='width:86pt'>Name</td>
  		<td class=xl66 width=153 style='width:115pt'>Description</td>
  		<td class=xl66 width=80 style='width:60pt'>Owner</td>
  		<td class=xl66 width=120 style='width:90pt'>Borrower</td>
  		<td class=xl66 width=99 style='width:74pt'>DateLent</td>
  		<td class=xl66 width=121 style='width:91pt'>DateReturned</td>
  </tr>
  <tr height=20 style='height:15.0pt'>
  		<td height=20 align=right style='height:15.0pt'>1</td>
  		<td>LawnMower</td>
  		<td>Small Hover mower</td>
  		<td>Fred</td>
  		<td>Joe</td>
  		<td class=xl65 align=right>4/1/2012</td>
  		<td class=xl65 align=right>4/26/2012</td>
  </tr>
  ...
  </table>
  </body>
  </html>

You can see that the dates have a unique class, namely xl65. This means you can look for <td> tags with that class attribute value, much as you did in the earlier XML example.

The HTMLParser class works very much like the sax ContentHandler class in that it has methods corresponding to HTML document elements. In the example you override the handle_starttag(), handle_endtag(), and handle_data() methods, which are directly analogous to the startElement(), endElement(), and characters() methods for XML.

You can find the code for this example in the zip file ToolHire folder as toolhirehtml.py. It looks like this:

  import html.parser

  class ToolHireParser(html.parser.HTMLParser):
      def __init__(self):
          super().__init__()
          self.dates = []
          self.dateLent = ''
          self.isDate = False
          self.dateCounter = 0

      def handle_starttag(self, name, attributes):
          if name == 'td':
              for key, value in attributes:
                  if key == 'class' and value == 'xl65':
                      self.isDate = True
                      self.dateCounter += 1
                      break
              else:
                  self.dateCounter = 0

      def handle_endtag(self, name):
          self.isDate = False

      def handle_data(self, data):
          if self.isDate:
              if self.dateCounter == 1:
                  self.dateLent = data
              else:
                  self.dates.append((self.dateLent, data))

  if __name__ == '__main__':
      htm = open('sheet001.htm').read()
      parser = ToolHireParser()
      parser.feed(htm)
      print(parser.dates)

If you compare that with the sax example, you see that the code inside the methods is nearly identical. The HTMLParser presents its attributes as a list of tuples. You iterate over that list looking for a class attribute of value xl65 to identify a date field. (Note that's an x-ELL not x-ONE; remember that this is an export from Microsoft Excel, hence the class name.) The parser conveniently takes care of mixed-case HTML tags by converting tag names to lowercase, so you don't need to worry about that. It also does its best to make sense of badly formed HTML, although it's not perfect and really bad markup can trip it up.
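You can see the attribute handling in isolation with a minimal sketch (the AttrEcho class and the sample tag here are invented for illustration; they are not part of the toolhire example). It shows that HTMLParser delivers attributes as (name, value) tuples and lowercases tag and attribute names for you:

```python
from html.parser import HTMLParser

# A tiny subclass that just records what the parser reports for each start tag.
class AttrEcho(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs is a list of (name, value) tuples,
        # with the names lowercased but the values left untouched.
        self.seen.append((tag, attrs))

echo = AttrEcho()
echo.feed('<TD Class=xl65 ALIGN=right>')
print(echo.seen)
```

Note that the mixed-case input tag and attribute names come back normalized, which is exactly why the toolhire code can test for 'td' and 'class' without worrying about the case used in the exported file.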

You conclude this section on reading data files with a look at another of Python’s XML parsers. This time it’s ElementTree, and you investigate it in the following Try It Out.

Some applications do not lend themselves to generating data files. In these cases, you may need to interact with the program via an application programmer’s interface (API). The next section shows you how.

Accessing Native APIs with ctypes and pywin32

Some applications or OS functions are not easily accessed from regular Python code because no Python API exists or no user-friendly operations are exposed that you can call from Python. The ctypes module can provide an alternative means of access by exposing to Python the C code libraries from which the application is built. In Windows these libraries are typically a set of DLL files or, in UNIX, a set of shared object libraries. ctypes enables you to load those libraries into your application and call their functions directly from Python. This only works, of course, if you know what functions are in the library, what arguments are required, and the return values. This may not be published, and you then have to resort to trial and error, or reverse engineering, which may, in turn, be prohibited by the manufacturer or vendor. However, if the library has a published interface, ctypes provides an effective, although non-trivial, method of access.

Another package, installed by default in the ActiveState distribution of Python for Windows, or available for download on other distributions, is pywin32. This package provides access to the Windows native libraries and, in particular, to any Microsoft Component Object Model (COM) interfaces. Being Windows specific, it is usually easier to use than ctypes, which works generically on any operating system. The same caveats apply when using pywin32 as apply to ctypes.

Accessing the Operating System Libraries

One area that is usually well documented is the OS application programming interface (API) that is exposed in standard system libraries. In this section you use the OS libraries to perform some fairly simple tasks that are nonetheless not available via Python's os module. This method is particularly useful for Windows users because many of the UNIX-like features in the os module do not work, or only work partially, under Windows. Accessing the Win32 API directly via ctypes (or pywin32) is often the only option.

The following sections show ctypes being used on Windows and Linux systems, but the principles are identical, apart from getting the initial reference to the C library.

Using ctypes with Windows

On Windows systems the basic C library is found in the msvcrt library. Some functions in msvcrt.dll, mainly concerned with console input/output operations, are exposed in the Python msvcrt module, but many more are not available by that route. You can easily access the native msvcrt library from ctypes using the following code:

  >>> import ctypes as ct
  >>> libc = ct.cdll.msvcrt 	# Windows only

Once you have a reference to the standard library, you can call the familiar C functions. The only complication is that you need to ensure the arguments are type-compatible with C. In general, integer arguments work just fine, but strings usually need to be explicitly marked as byte strings, and floats need a special ctypes type conversion. Many type conversion functions are included in ctypes; you can find a full list in the module documentation. Here are two examples:

  >>> libc.printf(b"%d %s %s hanging on a wall\n", 6, b"green", b"bottles")
  6 green bottles hanging on a wall
  34
  >>> libc.printf(b"Pi is: %f\n", ct.c_double(3.14159))
  Pi is: 3.141590
  16

Notice the use of b to indicate a byte string and, in the second example, the use of the ctypes.c_double() conversion function. Also, note that the return value of printf(), which is the number of characters printed, is displayed after the message is printed.

Many C functions require pointers to data (effectively memory addresses) as arguments. ctypes enables you to do this using the byref() function. You can create an object of a given type and then pass that object using byref() into the ctypes function call you want to perform. Here is an example of using sscanf() that reads an integer value from a string into a Python variable:

  >>> d = ct.c_int()
  >>> print(d.value)
  0
  >>> libc.sscanf(b"6", b"%d", ct.byref(d))
  1
  >>> print(d.value)
  6

Next you look at a slightly more practical function in the Windows library: msvcrt._getdrives(). This returns the set of available drives on a Windows system, something not easily done using Python's standard os module. The only complication is that the return value is a bitmask, so you need to write a loop to test each bit to find out which bits are set and map the bit position into a drive letter. Here is the code:

  >>> drives = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  >>> drivelist = libc._getdrives()
  >>> for n in range(26):
  ...	mask = 1 << n 	# use left bit shifting to build a mask
  ...	if drivelist & mask: print (drives[n], 'is available')
  ...
  C is available
  D is available
  E is available
  P is available

The Microsoft Developers Network (msdn.microsoft.com) has full documentation for the standard Windows library functions.

Using ctypes on Linux

You can use ctypes on non-Windows systems, too. Here is an example using printf() on a Linux system accessed via the standard C library libc.so.6. (You can also use other UNIX-like OSes if you can find out the name of the library that implements the standard C library functions.)

  >>> import ctypes as ct
  >>> libc = ct.CDLL('libc.so.6')
  >>> libc.printf(b"My name is %s\n", b"Fred")
  My name is Fred
  16

The printf() and sscanf() examples in the previous section should also work using the Linux libc, as will the byref() function and the various type conversion functions.
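As a further sketch along the same lines (the use of ctypes.util.find_library and the strlen/abs calls are additions for illustration, not from the chapter), you can locate the C library by name rather than hard-coding the libc.so.6 filename, which makes the code a little more portable across UNIX-like systems:

```python
import ctypes
import ctypes.util

# find_library searches for the C library by its conventional short name;
# on a glibc Linux system this typically resolves to "libc.so.6".
name = ctypes.util.find_library("c")
libc = ctypes.CDLL(name)

# Calls with integer and byte-string arguments work without conversion.
print(libc.strlen(b"hello"))   # C strlen counts bytes, excluding the terminator
print(libc.abs(-42))
```

As with the printf() examples, integer return values come back as Python ints by default; for other return types you would set the function's restype attribute first.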

Accessing a Windows Application Using COM

Accessing an application library is almost as easy as accessing an OS system library, provided you can get documentation for the contents of the library. However, that is not always readily available. Another option on Windows is to use the OS functions to access the COM objects and then manipulate the COM objects from Python. Unfortunately, COM is a complex technology and has been extended over time to include features such as distribution over a network as well as various data access mechanisms. Compounding the difficulty is the fact that documentation for COM objects is often sparse and hard to find. Nonetheless, COM is often the most effective option for automating Windows applications.

The easiest way to use COM objects in Python is to use the pywin32 package, written by Mark Hammond and available for download from the SourceForge website or included as standard in the ActiveState distribution of Python. The following Try It Out demonstrates the use of pywin32 to open Excel preloaded with the toolhire.xlsx file you used in the earlier sections of this chapter.

You have now seen many techniques for integrating different applications in a scripting program. The next section gives you some advice on how to bring these techniques together to complete a scripting project.

Automating Tasks Involving Multiple Applications

Scripting was defined at the start of this chapter as “coordinating the actions of other programs or applications to perform a task.” So far, you have seen several enabling modules that can help you to interface with these external programs, but the bigger picture of how to automate a full workflow has not been discussed.

Normally, when you approach a workflow automation project, you look at what the human process is. You identify the systems used and the actions taken. You look at the input and output data. You then try to replicate that using whatever automation options are available for each system and process. You should take one other step before jumping in too quickly and that is to eliminate any steps that are done purely for the human user’s convenience—for example, formatting data into a more readable layout when the data is only an intermediate result. If the computer can read the data without that formatting, it’s an unnecessary step. Once you have identified the necessary steps, along with the systems and tools to be used, you can look at the automation options.

This section considers some guidelines that should minimize the pain in developing such multi-application scripts. As a general rule, use the following techniques in the order discussed.

Using Python First

Python comes with many support modules that enable you to replicate the OS functions and commands directly from your code. Other modules provide access to different file formats and network protocols. For example, Python has modules for directly manipulating the Windows registry and the UNIX password file that avoid calling external programs. Using Python directly provides an efficient and flexible solution that will be easier to maintain in the future. This should always be the first choice if possible.
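As a minimal sketch of this principle (the file names and temporary directory here are invented for illustration), copying a file with shutil replaces what might otherwise be a call out to cp or copy:

```python
import os
import shutil
import tempfile

# Create a scratch file to stand in for real application data.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'data.txt')
with open(src, 'w') as f:
    f.write('payload')

# shutil.copyfile does the copy natively; no external command is launched.
dst = os.path.join(workdir, 'data.bak')
shutil.copyfile(src, dst)
print(open(dst).read())
```

Because no subprocess is involved, errors surface as ordinary Python exceptions, which is much easier to handle than parsing a command's error output.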

Using Operating System Utilities

The OS provides many tools and commands for performing system administration. Many of these tools have command line interfaces (CLIs) that make them easy to call from Python code using the subprocess module. Tools that operate without interaction are the easiest to work with, even if this means using data files as an intermediate step because the files can be used as a recovery point should the process fail: You simply restart with the last successful step.
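A minimal sketch of calling a non-interactive CLI tool looks like this (echo stands in here for a real administration command; subprocess.run is the modern convenience interface, available from Python 3.5, whereas older code would use subprocess.check_output):

```python
import subprocess

# Run the command, capture its output as text, and inspect the result.
result = subprocess.run(['echo', 'hello'],
                        capture_output=True, text=True)
print(result.returncode, result.stdout.strip())
```

Checking the return code before trusting the output gives you the recovery point described above: if a step fails, you can stop and restart from the last command that succeeded.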

Using Data Files

Many tools and OS commands use configuration files to control how they function. By creating or modifying these configuration files prior to running the command, you can often control the behavior without the complexity of interacting with the processes in real time. In addition, you can usually drive such tools by using input files and generating output files rather than interactively providing data at prompts. You can build such files (or read them) using Python code, and you have seen how Python modules can assist in parsing many common data formats.
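A minimal sketch of building and re-reading such a configuration file in code with configparser (the section and option names are invented for illustration) might look like this:

```python
import configparser
import io

# Build a configuration in code, as you might before launching a tool.
config = configparser.ConfigParser()
config['backup'] = {'source': '/var/www',
                    'dest': '/mnt/archive',
                    'compress': 'yes'}

# Serialize to INI text (a StringIO stands in for a real file here).
buf = io.StringIO()
config.write(buf)

# Read it back, as you would to inspect a tool's existing configuration.
reread = configparser.ConfigParser()
reread.read_string(buf.getvalue())
print(reread['backup']['dest'])
```

With a real tool you would write to its actual configuration file path instead of a StringIO, then launch the command with subprocess.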

Using a Third-Party Module

Many popular applications have third-party modules that facilitate interacting with the application or direct manipulation of their data files. Microsoft Excel is a good example, with several modules available to assist in manipulating spreadsheets. You can manipulate many other proprietary file formats using third-party modules. Use your favorite search engine to find such modules. Include keywords like the application name, “python”, and “module”, and you should find what you are looking for fairly quickly.

The main caveat with this approach is that third-party modules often work only with older Python versions and may not be updated to the latest build. Most such modules are open source, with generous license conditions, so you usually have the option of updating the code yourself or, if that is too big a project, perhaps copying just the code that you need for your project. Due credit to the original authors should, of course, be given.

Interacting with Subprocesses via a CLI

If a tool has a CLI but cannot be driven using a data file, you can still use the subprocess module and interact with the process using stdin and stdout as was demonstrated with the ex editor earlier in the “Managing Subprocesses” section of this chapter. This is a potentially complex strategy because you have to anticipate every possible response or input request that the application may make. Similarly, error handling can be difficult to control and often, if an application deviates from the expected interaction, you may have no choice but to abort your script and try to recover manually. This is why using data files is preferable if at all possible.
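A minimal sketch of the pipe-driven approach follows (sort stands in here for a batch-mode tool; a genuinely interactive program is harder because you must read and answer each prompt in turn):

```python
import subprocess

# Launch the tool with its stdin and stdout connected to pipes.
proc = subprocess.Popen(['sort'],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE,
                        text=True)

# Send all the input at once and collect the output; communicate()
# also waits for the process to finish, avoiding deadlocks.
out, _ = proc.communicate('pear\napple\nbanana\n')
print(out.splitlines())
```

For tools that demand a genuine back-and-forth exchange, communicate() is not enough and you must write to stdin and read from stdout incrementally, anticipating each prompt.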

There is a third-party module called pexpect that makes interacting with an external console-based program easier. It works by looking for expected (hence the name) prompt strings from the target application and then responding by allowing the programmer to send responses. This works well for login dialogs and similar interactions.

Using Web Services for Server-Based Applications

Some applications provide web services as an interface option. This is often an attractive alternative to using a third-party module, although the trade-off is often slower performance and the added complexity of parsing the XML or JSON data format used by such services. Web services are discussed in more detail in Chapter 5.
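As a minimal sketch of the parsing side of such a service (the payload shown is an invented response body, not from any real API), the json module turns the text into ordinary Python objects:

```python
import json

# A hypothetical response body as a web service might return it.
payload = '{"status": "ok", "items": [1, 2, 3]}'

# json.loads converts the text into dicts, lists, strings, and numbers.
data = json.loads(payload)
print(data['status'], sum(data['items']))
```

In a real script the payload would come from an HTTP request (for example via urllib.request); the parsing step is the same either way.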

Using a Native Code API

If the application you need to control offers a C library as an API, you can use ctypes to access it from Python. The biggest problem you are likely to face with this approach is finding good documentation for the API. If documentation exists, this can be a very effective technique, but if not, it can involve a lot of painful trial and error. The Python interactive prompt is an invaluable tool in these scenarios.

For Windows applications you can often find a COM interface and access that via the pywin32 package. As with using ctypes, the lack of documentation is often the biggest obstacle.

Using GUI Robotics

The final option for GUI applications with no API is to interact with the GUI itself by sending user event messages into the application. Such events could include key-presses, mouse-clicks, and so forth. This technique is known as robotics because you are simulating a human user from your Python program. It is really an extension of the native code access described in the previous section, but operating at a much lower level.

This is a frustrating technique that is very error prone and also very vulnerable to changes in the application being controlled—for example, if an upgrade changes the screen layout, your code will likely break. Because of the difficulty of writing the code, as well as the fragility of the solution, you should avoid this unless every other possibility has failed.

Summary

This chapter looked at how to automate tasks involving several different applications or OS utilities. You saw that Python’s standard library contains several powerful modules to assist in this. The os, os.path, shutil, and glob modules, for example, can provide much information about computer resources and help you manage files directly from within Python.

The subprocess module provides a mechanism to launch and interact with command line programs from within your scripts.

The time, datetime, and calendar modules can assist with time-related tasks and calculations. The time.sleep() function can introduce a pause to your script’s execution while waiting for other processes to complete.

You also saw that common data files that can be generated, or used as input by applications, can be created or read by Python using modules such as csv, configparser, html.parser, and xml.etree.

If no other form of access is available, it may be possible to use ctypes to access C functions exposed by dynamic libraries. On Windows similar functions exposed as a COM interface may be available, and the pywin32 modules simplify access somewhat. These techniques are usually more complex than using data files or calling subprocess functions.

Finally, you reviewed the options available for scripting with their pros and cons, including the last resort option for GUIs of sending OS events to the application windows. This last option is fraught with difficulty and should only ever be used when all other means have been explored and exhausted.

EXERCISES

  1. Explore the os module to see what else you can discover about your computer. Be sure to read the relevant parts of the Python documentation for the os and stat modules.

  2. Try adding a new function to the file_tree module called find_dirs() that searches for directories matching a given regular expression. Combine both to create a third function, find_all(), that searches both files and directories.

  3. Create another function, apply_to_files(), that applies a function parameter to all files matching the input pattern. You could, for example, use this function to remove all files matching a pattern, such as *.tmp , like this:

      findfiles.apply_to_files('.*.tmp', os.remove, 'TreeRoot')
  4. Write a program that loops over the first 128 characters and displays a message indicating whether or not the value is a control character (characters with ordinal values between 0x00 and 0x1F, plus 0x7F). Use ctypes to access the standard C library and call the iscntrl() function to determine if a given character is a control character. Note this is not one of the built-in test methods of the string type in Python.

WHAT YOU LEARNED IN THIS CHAPTER

TOPIC DESCRIPTION
Scripting Automation of a task involving multiple tools or applications. Python is used as the glue that binds these tools together, converting data formats to compatible forms, synchronizing the activities, and if necessary driving the functionality as a pseudo user.
OS environment When the OS runs a process, it creates an environment consisting of certain configuration details. These include things like the process priority, its home directory, file permissions, and formats. Scripts often need to customize the environment prior to launching a program to ensure that it performs in the correct way.
Process and subprocesses Programs run by the OS are known as processes. A single application may consist of a process hierarchy with a top-level process spawning multiple child or subprocesses. Subprocesses, by default, inherit their parent’s environment. Scripts frequently launch other programs as subprocesses.
Tree walking The file system exists as a tree structure with a root node and subtrees attached to the root. It is possible to recursively descend through this structure to the leaf nodes, which are the individual files. Scripts frequently need to process multiple files within a given subtree of the file system.
Absolute dates and times A fixed date and time in history. A date such as July 4th, 1776 is an absolute date.
Relative dates and times A date or time relative to another date or time. Usually expressed as a period such as three hours or as a repeating date or time, such as the third hour of every day or first day of every month.
Parser A function that breaks down structured data into its component parts. Parsers can be based on several different algorithms and the most common types are either event based or tree based. Python supports both styles for XML parsing.
Libraries Programming languages make reusable code available in code libraries. These are conceptually like Python modules, but in compiled languages are generated with special tools and can be either static or dynamically linked into an application. ctypes can access dynamically linked C libraries.
COM The Windows Component Object Model (COM) mechanism enables external applications (or frequently an internal macro language) to manipulate the functionality of a program. The pywin32 package simplifies Python access to COM objects.