13. Client-Side Scripting

Chapter 13. Client-Side Scripting

“Socket to Me!”

The preceding chapter introduced Internet fundamentals and explored sockets—the underlying communications mechanism over which bytes flow on the Net. In this chapter, we climb the encapsulation hierarchy one level and shift our focus to Python tools that support the client-side interfaces of common Internet protocols.

We talked about the Internet’s higher-level protocols in the abstract at the start of the preceding chapter, and you should probably review that material if you skipped over it the first time around. In short, protocols define the structure of the conversations that take place to accomplish most of the Internet tasks we’re all familiar with—reading email, transferring files by FTP, fetching web pages, and so on.

At the most basic level, all of these protocol dialogs happen over sockets using fixed and standard message structures and ports, so in some sense this chapter builds upon the last. But as we’ll see, Python’s protocol modules hide most of the underlying details—scripts generally need to deal only with simple objects and methods, and Python automates the socket and messaging logic required by the protocol.

In this chapter, we’ll concentrate on the FTP and email protocol modules in Python, and we’ll peek at a few others along the way (NNTP news, HTTP web pages, and so on). Because it is so prevalent, we will especially focus on email in much of this chapter, as well as in the two to follow—we’ll use tools and techniques introduced here in the larger PyMailGUI and PyMailCGI client and server-side programs of Chapters 14 and 16.

All of the tools employed in examples here are in the standard Python library and come with the Python system. All of the examples here are also designed to run on the client side of a network connection—these scripts connect to an already running server to request interaction and can be run from a basic PC or other client device (they require only a server to converse with). And as usual, all the code here is also designed to teach us something about Python programming in general—we’ll refactor FTP examples and package email code to show object-oriented programming (OOP) in action.

In the next chapter, we’ll look at a complete client-side program example before moving on to explore scripts designed to be run on the server side instead. Python programs can also produce pages on a web server, and there is support in the Python world for implementing the server side of things like HTTP, email, and FTP. For now, let’s focus on the client.^[48]

FTP: Transferring Files over the Net

As we saw in the preceding chapter, sockets see plenty of action on the Net. For instance, the last chapter’s getfile example allowed us to transfer entire files between machines. In practice, though, higher-level protocols are behind much of what happens on the Net. Protocols run on top of sockets, but they hide much of the complexity of the network scripting examples of the prior chapter.

FTP—the File Transfer Protocol—is one of the more commonly used Internet protocols. It defines a higher-level conversation model that is based on exchanging command strings and file contents over sockets. By using FTP, we can accomplish the same task as the prior chapter’s getfile script, but the interface is simpler, standard and more general—FTP lets us ask for files from any server machine that supports FTP, without requiring that it run our custom getfile script. FTP also supports more advanced operations such as uploading files to the server, getting remote directory listings, and more.

Really, FTP runs on top of two sockets: one for passing control commands between client and server (port 21), and another for transferring bytes. By using a two-socket model, FTP avoids the possibility of deadlocks (i.e., transfers on the data socket do not block dialogs on the control socket). Ultimately, though, Python’s ftplib support module allows us to upload and download files at a remote server machine by FTP, without dealing in raw socket calls or FTP protocol details.

Transferring Files with ftplib

Because the Python FTP interface is so easy to use, let’s jump right into a realistic example. The script in Example 13-1 automatically fetches (a.k.a. “downloads”) and opens a remote file with Python. More specifically, this Python script does the following:

Downloads an image file (by default) from a remote FTP site
Opens the downloaded file with a utility we wrote in Example 6-23, in Chapter 6

The download portion will run on any machine with Python and an Internet connection, though you’ll probably want to change the script’s settings so it accesses a server and file of your own. The opening part works if your playfile.py supports your platform; see Chapter 6 for details, and change as needed.

Example 13-1. PP4E\Internet\Ftp\getone.py

#!/usr/local/bin/python
"""
A Python script to download and play a media file by FTP.  Uses ftplib, the ftp
protocol handler which uses sockets.  Ftp runs on 2 sockets (one for data, one
for control--on ports 20 and 21) and imposes message text formats, but Python's
ftplib module hides most of this protocol's details.  Change for your site/file.
"""

import os, sys
from getpass import getpass                   # hidden password input
from ftplib import FTP                        # socket-based FTP tools

nonpassive  = False                           # force active mode FTP for server?
filename    = 'monkeys.jpg'                   # file to be downloaded
dirname     = '.'                             # remote directory to fetch from
sitename    = 'ftp.rmi.net'                   # FTP site to contact
userinfo    = ('lutz', getpass('Pswd?'))      # use () for anonymous
if len(sys.argv) > 1: filename = sys.argv[1]  # filename on command line?

print('Connecting...')
connection = FTP(sitename)                  # connect to FTP site
connection.login(*userinfo)                 # default is anonymous login
connection.cwd(dirname)                     # change to the remote directory
if nonpassive:                              # force active FTP if server requires
    connection.set_pasv(False)

print('Downloading...')
localfile = open(filename, 'wb')            # local file to store download
connection.retrbinary('RETR ' + filename, localfile.write, 1024)
connection.quit()
localfile.close()

if input('Open file?') in ['Y', 'y']:
    from PP4E.System.Media.playfile import playfile
    playfile(filename)

Most of the FTP protocol details are encapsulated by the Python ftplib module imported here. This script uses some of the simplest interfaces in ftplib (we’ll see others later in this chapter), but they are representative of the module in general.

To open a connection to a remote (or local) FTP server, create an instance of the ftplib.FTP object, passing in the string name (domain or IP style) of the machine you wish to connect to:

connection = FTP(sitename)                  # connect to ftp site

Assuming this call doesn’t throw an exception, the resulting FTP object exports methods that correspond to the usual FTP operations. In fact, Python scripts act much like typical FTP client programs—just replace commands you would normally type or select with method calls:

connection.login(*userinfo)                 # default is anonymous login
connection.cwd(dirname)                     # change to the remote directory

Once connected, we log in and change to the remote directory from which we want to fetch a file. The login method allows us to pass in a username and password as additional optional arguments to specify an account login; by default, it performs anonymous FTP. Notice the use of the nonpassive flag in this script:

if nonpassive:                              # force active FTP if server requires
    connection.set_pasv(False)

If this flag is set to True, the script will transfer the file in active FTP mode rather than the default passive mode. We’ll finesse the details of the difference here (it has to do with which end of the dialog chooses port numbers for the transfer), but if you have trouble doing transfers with any of the FTP scripts in this chapter, try using active mode as a first step. In Python 2.1 and later, passive FTP mode is on by default. Now, open a local file to receive the file’s content, and fetch the file:

localfile = open(filename, 'wb')
connection.retrbinary('RETR ' + filename, localfile.write, 1024)

Once we’re in the target remote directory, we simply call the retrbinary method to download the target server file in binary mode. The retrbinary call will take a while to complete, since it must download a big file. It gets three arguments:

An FTP command string; here, the string RETR filename, which is the standard format for FTP retrievals.
A function or method to which Python passes each chunk of the downloaded file’s bytes; here, the write method of a newly created and opened local file.
A size for those chunks of bytes; here, 1,024 bytes are downloaded at a time, but the default is reasonable if this argument is omitted.

Because this script opens a local file (the localfile object) with the same name as the remote file being fetched, and passes its write method to the FTP retrieval method, the remote file’s contents will automatically appear in a local, client-side file after the download is finished.
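Since retrbinary only needs something it can call with each chunk, the callback does not have to be a file's write method. The following is a minimal sketch of that idea, not code from the examples tree: FakeFTP is a hypothetical stand-in that imitates the chunking behavior of a connected FTP object so the code can be tried without a server, and fetch_bytes and fetch_digest are illustrative names of our own.

```python
import hashlib

class FakeFTP:
    """Hypothetical stand-in for a connected ftplib.FTP object (no network)."""
    def __init__(self, payload):
        self.payload = payload

    def retrbinary(self, cmd, callback, blocksize=8192):
        # Imitate ftplib: pass the payload to callback one block at a time
        for start in range(0, len(self.payload), blocksize):
            callback(self.payload[start:start + blocksize])

def fetch_bytes(connection, filename, blocksize=1024):
    """Collect a download in memory by appending each chunk to a list."""
    chunks = []
    connection.retrbinary('RETR ' + filename, chunks.append, blocksize)
    return b''.join(chunks)

def fetch_digest(connection, filename, blocksize=1024):
    """Feed each chunk to a hash object; the file itself is never stored."""
    digest = hashlib.sha256()
    connection.retrbinary('RETR ' + filename, digest.update, blocksize)
    return digest.hexdigest()

fake = FakeFTP(b'x' * 2500)                  # 2,500 bytes arrive as 3 chunks
data = fetch_bytes(fake, 'monkeys.jpg')
checksum = fetch_digest(fake, 'monkeys.jpg')
```

With a real connection object in place of the stand-in, the same two functions should work unchanged, because they rely only on the retrbinary call shown in Example 13-1.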

Observe how this file is opened in wb binary output mode. If this script is run on Windows, we want to avoid automatically expanding any \n bytes into \r\n byte sequences; as we saw in Chapter 4, this happens automatically on Windows when writing files opened in w text mode. We also want to avoid Unicode issues in Python 3.X: as we also saw in Chapter 4, strings are encoded when written in text mode, and this isn’t appropriate for binary data such as images. A text-mode file would also not allow for the bytes strings passed to write by the FTP library’s retrbinary in any event, so wb is effectively required here (more on output file modes later).
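To see the end-of-line expansion described above without needing a Windows machine, the sketch below forces Windows-style translation by passing newline='\r\n' to open. The sample data, file names, and the use of the latin-1 encoding (chosen only because it maps every byte value to one character) are assumptions made for illustration.

```python
import os, tempfile

# Sample "binary" data that happens to contain three \n bytes (value 10)
data = b'\x89PNG\r\n\x1a\n' + bytes(range(16))

folder  = tempfile.mkdtemp()
binpath = os.path.join(folder, 'binary.out')
txtpath = os.path.join(folder, 'text.out')

with open(binpath, 'wb') as f:               # binary mode: bytes stored verbatim
    f.write(data)

# Text mode; newline='\r\n' imitates the Windows default on any platform
with open(txtpath, 'w', encoding='latin-1', newline='\r\n') as f:
    f.write(data.decode('latin-1'))

with open(binpath, 'rb') as f:
    binary_copy = f.read()
with open(txtpath, 'rb') as f:
    text_copy = f.read()

print(len(data), len(binary_copy), len(text_copy))
```

The binary-mode copy matches the original byte for byte, while the text-mode copy grows by one byte for every \n in the data, which is enough to corrupt an image file.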

Finally, we call the FTP quit method to break the connection with the server and manually close the local file to force it to be complete before it is further processed (it’s not impossible that parts of the file are still held in buffers before the close call):

connection.quit()
localfile.close()
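The manual quit and close calls can also be expressed with a with statement. This is a hedged sketch rather than the script's own code: it assumes a Python release in which FTP objects support the with statement (the library manual lists this as added in 3.2), and the fetch function and its ftp_class parameter are hypothetical names introduced here so that the logic can be exercised with a stand-in class instead of a live server.

```python
from ftplib import FTP

def fetch(site, filename, user=(), ftp_class=FTP):
    """
    Download filename from site in binary mode.
    ftp_class is a hypothetical hook: pass a stand-in class to try this offline.
    """
    # On exit, the local file is closed first, then the FTP session is ended
    with ftp_class(site) as connection, open(filename, 'wb') as localfile:
        connection.login(*user)              # anonymous=() or (name, pswd)
        connection.retrbinary('RETR ' + filename, localfile.write, 1024)
```

Both resources are released even if the transfer raises an exception, which is the same guarantee the explicit calls give only when nothing goes wrong.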

And that’s all there is to it—all the FTP, socket, and networking details are hidden behind the ftplib interface module. Here is this script in action on a Windows 7 machine; after the download, the image file pops up in a Windows picture viewer on my laptop, as captured in Figure 13-1. Change the server and file assignments in this script to test on your own, and be sure your PYTHONPATH environment variable includes the PP4E root’s container, as we’re importing across directories on the examples tree here:

C:\...\PP4E\Internet\Ftp> python getone.py
Pswd?
Connecting...
Downloading...
Open file?y

Figure 13-1. Image file downloaded by FTP and opened locally

Notice how the standard Python getpass.getpass is used to ask for an FTP password. Like the input built-in function, this call prompts for and reads a line of text from the console user; unlike input, getpass does not echo typed characters on the screen at all (see the moreplus stream redirection example of Chapter 3 for related tools). This is handy for protecting things like passwords from potentially prying eyes. Be careful, though—after issuing a warning, the IDLE GUI echoes the password anyhow!

The main thing to notice is that this otherwise typical Python script fetches information from an arbitrarily remote FTP site and machine. Given an Internet link, any information published by an FTP server on the Net can be fetched by and incorporated into Python scripts using interfaces such as these.

Using urllib to Download Files

In fact, FTP is just one way to transfer information across the Net, and there are more general tools in the Python library to accomplish the prior script’s download. Perhaps the most straightforward is the Python urllib.request module: given an Internet address string—a URL, or Universal Resource Locator—this module opens a connection to the specified server and returns a file-like object ready to be read with normal file object method calls (e.g., read, readline).

We can use such a higher-level interface to download anything with an address on the Web: files published by FTP sites (using URLs that start with ftp://); web pages and output of scripts that live on remote servers (using http:// URLs); and even local files (using file:// URLs). For instance, the script in Example 13-2 does the same as the one in Example 13-1, but it uses the general urllib.request module to fetch the image file, instead of the protocol-specific ftplib.

Example 13-2. PP4E\Internet\Ftp\getone-urllib.py

#!/usr/local/bin/python
"""
A Python script to download a file by FTP by its URL string; use higher-level
urllib instead of ftplib to fetch file;  urllib supports FTP, HTTP, client-side
HTTPS, and local files, and handles proxies, redirects, cookies, and more;
urllib also allows downloads of html pages, images, text, etc.;  see also
Python html/xml parsers for web pages fetched by urllib in Chapter 19;
"""

import os, getpass
from urllib.request import urlopen       # socket-based web tools
filename = 'monkeys.jpg'                 # remote/local filename
password = getpass.getpass('Pswd?')

remoteaddr = 'ftp://lutz:%s@ftp.rmi.net/%s;type=i' % (password, filename)
print('Downloading', remoteaddr)

# this works too:
# urllib.request.urlretrieve(remoteaddr, filename)

remotefile = urlopen(remoteaddr)                 # returns input file-like object
localfile  = open(filename, 'wb')                # where to store data locally
localfile.write(remotefile.read())
localfile.close()
remotefile.close()

Note how we use a binary mode output file again; urllib fetches return byte strings, even for HTTP web pages. Don’t sweat the details of the URL string used here; it is fairly complex, and we’ll explain its structure and that of URLs in general in Chapter 15. We’ll also use urllib again in this and later chapters to fetch web pages, format generated URL strings, and get the output of remote scripts on the Web.

Technically speaking, urllib.request supports a variety of Internet protocols (HTTP, FTP, and local files). Unlike ftplib, urllib.request is generally used for reading remote objects, not for writing or uploading them (though the HTTP and FTP protocols support file uploads too). As with ftplib, retrievals must generally be run in threads if blocking is a concern. But the basic interface shown in this script is straightforward. The call:

remotefile = urllib.request.urlopen(remoteaddr)  # returns input file-like object

contacts the server named in the remoteaddr URL string and returns a file-like object connected to its download stream (here, an FTP-based socket). Calling this file’s read method pulls down the file’s contents, which are written to a local client-side file. An even simpler interface:

urllib.request.urlretrieve(remoteaddr, filename)

also does the work of opening a local file and writing the downloaded bytes into it—things we do manually in the script as coded. This comes in handy if we want to download a file, but it is less useful if we want to process its data immediately.

Either way, the end result is the same: the desired server file shows up on the client machine. The output is similar to the original version, but we don’t try to automatically open this time (I’ve changed the password in the URL here to protect the innocent):

C:\...\PP4E\Internet\Ftp> getone-urllib.py
Pswd?
Downloading ftp://lutz:XXX@ftp.rmi.net/monkeys.jpg;type=i

C:\...\PP4E\Internet\Ftp> fc monkeys.jpg testmonkeys.jpg
FC: no differences encountered

C:\...\PP4E\Internet\Ftp> start monkeys.jpg

For more urllib download examples, see the section on HTTP later in this chapter, and the server-side examples in Chapter 15. As we’ll see in Chapter 15, in bigger terms, tools like the urllib.request urlopen function allow scripts to both download remote files and invoke programs that are located on a remote server machine, and so serve as useful tools for testing and using web sites in Python scripts. In Chapter 15, we’ll also see that urllib.parse includes tools for formatting (escaping) URL strings for safe transmission.
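Because urlopen accepts file:// URLs as well, its interface can be tried with no server or account at all. The sketch below is one way to do so, under stated assumptions: copy_url is a name of our own, the sample file is created on the fly, and shutil.copyfileobj is swapped in for the single read call of Example 13-2 so that large downloads are copied in blocks rather than loaded into memory at once.

```python
import os, shutil, tempfile
from pathlib import Path
from urllib.request import urlopen

def copy_url(url, localname, blocksize=64 * 1024):
    """Copy whatever url names into a local binary file, one block at a time."""
    with urlopen(url) as remotefile, open(localname, 'wb') as localfile:
        shutil.copyfileobj(remotefile, localfile, blocksize)

# Make a sample file and fetch it back through a file:// URL
folder = tempfile.mkdtemp()
source = Path(folder, 'source.bin')
source.write_bytes(b'\x00\x01spam\n' * 1000)

target = os.path.join(folder, 'copy.bin')
copy_url(source.as_uri(), target)            # as_uri builds the file:// URL
```

Passing an ftp:// or http:// address to the same function should behave the same way from the caller's perspective, though those cases naturally require a reachable server.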

FTP get and put Utilities

When I present the ftplib interfaces in Python classes, students often ask why programmers need to supply the RETR string in the retrieval method. It’s a good question—the RETR string is the name of the download command in the FTP protocol, but ftplib is supposed to encapsulate that protocol. As we’ll see in a moment, we have to supply an arguably odd STOR string for uploads as well. It’s boilerplate code that you accept on faith once you see it, but that begs the question. You could propose a patch to ftplib, but that’s not really a good answer for beginning Python students, and it may break existing code (the interface is as it is for a reason).

Perhaps a better answer is that Python makes it easy to extend the standard library modules with higher-level interfaces of our own—with just a few lines of reusable code, we can make the FTP interface look any way we want in Python. For instance, we could, once and for all, write utility modules that wrap the ftplib interfaces to hide the RETR string. If we place these utility modules in a directory on PYTHONPATH, they become just as accessible as ftplib itself, automatically reusable in any Python script we write in the future. Besides removing the RETR string requirement, a wrapper module could also make assumptions that simplify FTP operations into single function calls.

For instance, given a module that encapsulates and simplifies ftplib, our Python fetch-and-play script could be further reduced to the script shown in Example 13-3—essentially just two function calls plus a password prompt, but with a net effect exactly like Example 13-1 when run.

Example 13-3. PP4E\Internet\Ftp\getone-modular.py

#!/usr/local/bin/python
"""
A Python script to download and play a media file by FTP.
Uses getfile.py, a utility module which encapsulates FTP step.
"""

import getfile
from getpass import getpass
filename = 'monkeys.jpg'

# fetch with utility
getfile.getfile(file=filename,
                site='ftp.rmi.net',
                dir ='.',
                user=('lutz', getpass('Pswd?')),
                refetch=True)

# rest is the same
if input('Open file?') in ['Y', 'y']:
    from PP4E.System.Media.playfile import playfile
    playfile(filename)

Besides having a much smaller line count, the meat of this script has been split off into a file for reuse elsewhere. If you ever need to download a file again, simply import an existing function instead of copying code with cut-and-paste editing. Changes in download operations would need to be made in only one file, not everywhere we’ve copied boilerplate code; getfile.getfile could even be changed to use urllib rather than ftplib without affecting any of its clients. It’s good engineering.

Download utility

So just how would we go about writing such an FTP interface wrapper (he asks, rhetorically)? Given the ftplib library module, wrapping downloads of a particular file in a particular directory is straightforward. Connected FTP objects support two download methods:

retrbinary: This method downloads the requested file in binary mode, sending its bytes in chunks to a supplied function, without line-feed mapping. Typically, the supplied function is a write method of an open local file object, such that the bytes are placed in the local file on the client.
retrlines: This method downloads the requested file in ASCII text mode, sending each line of text to a supplied function with all end-of-line characters stripped. Typically, the supplied function adds a newline (mapped appropriately for the client machine), and writes the line to a local file.

We will meet the retrlines method in a later example; the getfile utility module in Example 13-4 always transfers in binary mode with retrbinary. That is, files are downloaded exactly as they were on the server, byte for byte, with the server’s line-feed conventions in text files (you may need to convert line feeds after downloads if they look odd in your text editor—see your editor or system shell commands for pointers, or write a Python script that opens and writes the text as needed).

Example 13-4. PP4E\Internet\Ftp\getfile.py

#!/usr/local/bin/python
"""
Fetch an arbitrary file by FTP.  Anonymous FTP unless you pass a
user=(name, pswd) tuple. Self-test FTPs a test file and site.
"""

from ftplib  import FTP          # socket-based FTP tools
from os.path import exists       # file existence test

def getfile(file, site, dir, user=(), *, verbose=True, refetch=False):
    """
    fetch a file by ftp from a site/directory
    anonymous or real login, binary transfer
    """
    if exists(file) and not refetch:
        if verbose: print(file, 'already fetched')
    else:
        if verbose: print('Downloading', file)
        local = open(file, 'wb')                # local file of same name
        try:
            remote = FTP(site)                  # connect to FTP site
            remote.login(*user)                 # anonymous=() or (name, pswd)
            remote.cwd(dir)
            remote.retrbinary('RETR ' + file, local.write, 1024)
            remote.quit()
        finally:
            local.close()                       # close file no matter what
        if verbose: print('Download done.')     # caller handles exceptions

if __name__ == '__main__':
    from getpass import getpass
    file = 'monkeys.jpg'
    dir  = '.'
    site = 'ftp.rmi.net'
    user = ('lutz', getpass('Pswd?'))
    getfile(file, site, dir, user)

This module is mostly just a repackaging of the FTP code we used to fetch the image file earlier, to make it simpler and reusable. Because it is a callable function, the exported getfile.getfile here tries to be as robust and generally useful as possible, but even a function this small implies some design decisions. Here are a few usage notes:

FTP mode

The getfile function in this script runs in anonymous FTP mode by default, but a two-item tuple containing a username and password string may be passed to the user argument in order to log in to the remote server in nonanonymous mode. To use anonymous FTP, either don’t pass the user argument or pass it an empty tuple, (). The FTP object login method allows two optional arguments to denote a username and password, and the function(*args) call syntax in Example 13-4 sends it whatever argument tuple you pass to user as individual arguments.
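The effect of the function(*args) call syntax is easy to check in isolation. In this sketch, login is a hypothetical stand-in that only imitates the anonymous default described above; it is not the ftplib method itself.

```python
def login(user='anonymous', passwd=''):
    """Hypothetical stand-in: report which account a login would use."""
    return (user, passwd)

anonymous = login(*())                       # same as login()
account   = login(*('lutz', 'secret'))       # same as login('lutz', 'secret')

print(anonymous, account)
```

An empty tuple unpacks to no arguments at all, so the defaults apply; a two-item tuple supplies both arguments by position.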

Processing modes

If passed, the last two arguments (verbose, refetch) allow us to turn off status messages printed to the stdout stream (perhaps undesirable in a GUI context) and to force downloads to happen even if the file already exists locally (the download overwrites the existing local file).

These two arguments are coded as Python 3.X default keyword-only arguments, so if used they must be passed by name, not position. The user argument instead can be passed either way, if it is passed at all. Keyword-only arguments here prevent passed verbose or refetch values from being incorrectly matched against the user argument if the user value is actually omitted in a call.
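The keyword-only rule can be demonstrated with the function header alone. The sketch below copies the signature from Example 13-4 but replaces the body with a simple report of what was received, so nothing is downloaded.

```python
def getfile(file, site, dir, user=(), *, verbose=True, refetch=False):
    """Signature as in Example 13-4; the body here only echoes its arguments."""
    return dict(user=user, verbose=verbose, refetch=refetch)

# Passing refetch by name works, and user keeps its default
by_name = getfile('monkeys.jpg', 'ftp.rmi.net', '.', refetch=True)

# Passing verbose and refetch by position is rejected by Python itself
error = None
try:
    getfile('monkeys.jpg', 'ftp.rmi.net', '.', (), False, True)
except TypeError as exc:
    error = str(exc)

print(by_name)
print(error)
```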

Exception protocol

The caller is expected to handle exceptions; this function wraps downloads in a try/finally statement to guarantee that the local output file is closed, but it lets exceptions propagate. If used in a GUI or run from a thread, for instance, exceptions may require special handling unknown in this file.
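On the calling side, one reasonable approach is to catch ftplib.all_errors, a tuple of the exception classes that ftplib operations may raise. The following is a sketch under assumptions: try_transfer is a helper name of our own, and refused is a hypothetical stand-in for a transfer function such as getfile that fails with a rejected login.

```python
import ftplib

def try_transfer(action, *args, **kwargs):
    """Run a transfer function and turn FTP failures into a status string."""
    try:
        action(*args, **kwargs)
    except ftplib.all_errors as exc:         # protocol, socket, and EOF errors
        return 'transfer failed: %s' % exc
    return 'ok'

def refused(*args, **kwargs):
    """Hypothetical transfer that fails the way a rejected login might."""
    raise ftplib.error_perm('530 Login incorrect')

status = try_transfer(refused, 'monkeys.jpg', 'ftp.rmi.net', '.')
print(status)
```

A GUI could display the returned string in a pop up instead of printing it, which is the sort of context-specific handling that the utility function cannot anticipate.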

Self-test

If run standalone, this file downloads an image file again from my website as a self-test (configure for your server and file as desired), but the function will normally be passed FTP filenames, site names, and directory names as well.

File mode

As in earlier examples, this script is careful to open the local output file in wb binary mode to suppress end-line mapping and conform to Python 3.X’s Unicode string model. As we learned in Chapter 4, it’s not impossible that true binary datafiles may have bytes whose value is equal to a line-feed character; opening in w text mode instead would make these bytes automatically expand to a two-byte sequence when written locally on Windows. This is only an issue when run on Windows; mode w doesn’t change end-lines elsewhere.

As we also learned in Chapter 4, though, binary mode is required to suppress the automatic Unicode translations performed for text in Python 3.X. Without binary mode, Python would attempt to encode fetched data when written per a default or passed Unicode encoding scheme, which might fail for some types of fetched text and would normally fail for truly binary data such as images and audio.

Because retrbinary writes bytes strings in 3.X, we really cannot open the output file in text mode anyhow, or write will raise exceptions. Recall that in 3.X text-mode files require str strings, and binary mode files expect bytes. Since retrbinary writes bytes and retrlines writes str, they implicitly require binary and text-mode output files, respectively. This constraint is irrespective of end-line or Unicode issues, but it effectively accomplishes those goals as well.

As we’ll see in later examples, text-mode retrievals have additional encoding requirements; in fact, ftplib will turn out to be a good example of the impacts of Python 3.X’s Unicode string model on real-world code. By always using binary mode in the script here, we sidestep the issue altogether.
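The str versus bytes constraint on file modes can be verified without any FTP traffic. This short sketch writes the wrong string type to each kind of file and records the exception raised; the file names are arbitrary.

```python
import os, tempfile

folder  = tempfile.mkdtemp()
results = {}

with open(os.path.join(folder, 'text.out'), 'w') as textfile:
    try:
        textfile.write(b'spam')              # bytes to a text-mode file
    except TypeError as exc:
        results['bytes to text file'] = type(exc).__name__

with open(os.path.join(folder, 'binary.out'), 'wb') as binfile:
    try:
        binfile.write('spam')                # str to a binary-mode file
    except TypeError as exc:
        results['str to binary file'] = type(exc).__name__

print(results)
```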

Directory model

This function currently uses the same filename to identify both the remote file and the local file where the download should be stored. As such, it should be run in the directory where you want the file to show up; use os.chdir to move to directories if needed. (We could instead assume filename is the local file’s name, and strip the local directory with os.path.split to get the remote name, or accept two distinct filename arguments—local and remote.)
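The first alternative mentioned above, deriving the remote name from a local path, takes only a few lines. This is a sketch of that idea rather than a change to Example 13-4; split_names is a hypothetical helper name.

```python
import os

def split_names(localpath):
    """Return (local directory, remote filename) for a local path."""
    localdir, remotename = os.path.split(localpath)
    return localdir or '.', remotename       # no directory part means "here"

localdir, remotename = split_names(
    os.path.join('downloads', 'images', 'monkeys.jpg'))
print(localdir, remotename)
```

A download function could then open the full local path for writing while sending only the remote name in its RETR command.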

Also notice that, despite its name, this module is very different from the getfile.py script we studied at the end of the sockets material in the preceding chapter. The socket-based getfile implemented custom client and server-side logic to download a server file to a client machine over raw sockets.

The new getfile here is a client-side tool only. Instead of raw sockets, it uses the standard FTP protocol to request a file from a server; all socket-level details are hidden in the simpler ftplib module’s implementation of the FTP client protocol. Furthermore, the server here is a perpetually running program on the server machine, which listens for and responds to FTP requests on a socket, on the dedicated FTP port (number 21). The net functional effect is that this script requires an FTP server to be running on the machine where the desired file lives, but such a server is much more likely to be available.

Upload utility

While we’re at it, let’s write a script to upload a single file by FTP to a remote machine. The upload interfaces in the FTP module are symmetric with the download interfaces. Given a connected FTP object, its:

storbinary method can be used to upload bytes from an open local file object
storlines method can be used to upload text in ASCII mode from an open local file object

Unlike the download interfaces, both of these methods are passed a file object as a whole, not a file object method (or other function). We will meet the storlines method in a later example. The utility module in Example 13-5 uses storbinary such that the file whose name is passed in is always uploaded verbatim—in binary mode, without Unicode encodings or line-feed translations for the target machine’s conventions. If this script uploads a text file, it will arrive exactly as stored on the machine it came from, with client line-feed markers and existing Unicode encoding.

Example 13-5. PP4E\Internet\Ftp\putfile.py

#!/usr/local/bin/python
"""
Store an arbitrary file by FTP in binary mode.  Uses anonymous
ftp unless you pass in a user=(name, pswd) tuple of arguments.
"""

import ftplib                    # socket-based FTP tools

def putfile(file, site, dir, user=(), *, verbose=True):
    """
    store a file by ftp to a site/directory
    anonymous or real login, binary transfer
    """
    if verbose: print('Uploading', file)
    local  = open(file, 'rb')               # local file of same name
    remote = ftplib.FTP(site)               # connect to FTP site
    remote.login(*user)                     # anonymous or real login
    remote.cwd(dir)
    remote.storbinary('STOR ' + file, local, 1024)
    remote.quit()
    local.close()
    if verbose: print('Upload done.')

if __name__ == '__main__':
    site = 'ftp.rmi.net'
    dir  = '.'
    import sys, getpass
    pswd = getpass.getpass(site + ' pswd?')                # filename on cmdline
    putfile(sys.argv[1], site, dir, user=('lutz', pswd))   # nonanonymous login

Notice that for portability, the local file is opened in rb binary input mode this time to suppress automatic line-feed character conversions. If this is binary information, we don’t want any bytes that happen to have the value of the carriage-return character to mysteriously go away during the transfer when run on a Windows client. We also want to suppress Unicode encodings for nontext files, and we want reads to produce the bytes strings expected by the storbinary upload operation (more on input file modes later).

This script uploads a file you name on the command line as a self-test, but you will normally pass in real remote filename, site name, and directory name strings. Also like the download utility, you may pass a (username, password) tuple to the user argument to trigger nonanonymous FTP mode (anonymous FTP is the default).
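Because storbinary is handed a file object rather than a callback, any object with a suitable read method should serve, including an in-memory io.BytesIO. The sketch below illustrates this under assumptions: FakeFTP is a hypothetical stand-in that imitates how a connected FTP object reads the file in blocks, and put_bytes is a name of our own.

```python
import io

class FakeFTP:
    """Hypothetical stand-in for a connected ftplib.FTP object (no network)."""
    def __init__(self):
        self.stored = {}

    def storbinary(self, cmd, fp, blocksize=8192):
        # Imitate ftplib: read the file object in blocks until it is exhausted
        name = cmd.split(' ', 1)[1]
        blocks = []
        while True:
            block = fp.read(blocksize)
            if not block:
                break
            blocks.append(block)
        self.stored[name] = b''.join(blocks)

def put_bytes(connection, remotename, data, blocksize=1024):
    """Upload a bytes string by wrapping it in a file-like object."""
    connection.storbinary('STOR ' + remotename, io.BytesIO(data), blocksize)

fake = FakeFTP()
put_bytes(fake, 'notes.txt', b'spam\r\neggs\r\n')
```

The bytes arrive exactly as given, line-feed markers included, which matches the verbatim behavior of binary uploads described earlier.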

Playing the Monty Python theme song

It’s time for a bit of fun. To test, let’s use these scripts to transfer a copy of the Monty Python theme song audio file I have at my website. First, let’s write a module that downloads and plays the sample file, as shown in Example 13-6.

Example 13-6. PP4E\Internet\Ftp\sousa.py

#!/usr/local/bin/python
"""
Usage: sousa.py.  Fetch and play the Monty Python theme song.
This will not work on your system as is: it requires a machine with Internet access
and an FTP server account you can access, and uses audio filters on Unix and your
.au player on Windows.  Configure this and playfile.py as needed for your platform.
"""

from getpass import getpass
from PP4E.Internet.Ftp.getfile  import getfile
from PP4E.System.Media.playfile import playfile

file = 'sousa.au'                      # default file coordinates
site = 'ftp.rmi.net'                   # Monty Python theme song
dir  = '.'
user = ('lutz', getpass('Pswd?'))

getfile(file, site, dir, user)         # fetch audio file by FTP
playfile(file)                         # send it to audio player

# import os
# os.system('getone.py sousa.au')      # equivalent command line

There’s not much to this script, because it really just combines two tools we’ve already coded. We’re reusing Example 13-4’s getfile to download, and Chapter 6’s playfile module (Example 6-23) to play the audio sample after it is downloaded (turn back to that example for more details on the player part of the task). Also notice the last two lines in this file—we can achieve the same effect by passing in the audio filename as a command-line argument to our original script, but it’s less direct.

As is, this script assumes my FTP server account; configure as desired (alas, this file used to be at the ftp.python.org anonymous FTP site, but that site went dark for security reasons between editions of this book). Once configured, this script will run on any machine with Python, an Internet link, and a recognizable audio player; it works on my Windows laptop with a broadband Internet connection, and it plays the music clip in Windows Media Player (and if I could insert an audio file hyperlink here to show what it sounds like, I would…):

C:\...\PP4E\Internet\Ftp> sousa.py
Pswd?
Downloading sousa.au
Download done.

C:\...\PP4E\Internet\Ftp> sousa.py
Pswd?
sousa.au already fetched

The getfile and putfile modules themselves can be used to move the sample file around too. Both can either be imported by clients that wish to use their functions, or run as top-level programs to trigger self-tests and command-line usage. For variety, let’s run these scripts from a command line and the interactive prompt to see how they work. When run standalone, the filename is passed in the command line to putfile and both use password input and default site settings:

C:\...\PP4E\Internet\Ftp> putfile.py sousa.py
ftp.rmi.net pswd?
Uploading sousa.py
Upload done.

When imported, parameters are passed explicitly to functions:

C:\...\PP4E\Internet\Ftp> python
>>> from getfile import getfile
>>> getfile(file='sousa.au', site='ftp.rmi.net', dir='.', user=('lutz', 'XXX'))
sousa.au already fetched

C:\...\PP4E\Internet\Ftp> del sousa.au

C:\...\PP4E\Internet\Ftp> python
>>> from getfile import getfile
>>> getfile(file='sousa.au', site='ftp.rmi.net', dir='.', user=('lutz', 'XXX'))
Downloading sousa.au
Download done.

>>> from PP4E.System.Media.playfile import playfile
>>> playfile('sousa.au')

Although Python’s ftplib already automates the underlying socket and message formatting chores of FTP, tools of our own like these can make the process even simpler.

Adding a User Interface

If you read the preceding chapter, you’ll recall that it concluded with a quick look at scripts that added a user interface to a socket-based getfile script—one that transferred files over a proprietary socket dialog, instead of over FTP. At the end of that presentation, I mentioned that FTP is a much more generally useful way to move files around because FTP servers are so widely available on the Net. For illustration purposes, Example 13-7 shows a simple mutation of the prior chapter’s user interface, implemented as a new subclass of the preceding chapter’s general form builder, form.py of Example 12-20.

Example 13-7. PP4E\Internet\Ftp\getfilegui.py

"""
#################################################################################
launch FTP getfile function with a reusable form GUI class;  uses os.chdir to
goto target local dir (getfile currently assumes that filename has no local
directory path prefix);  runs getfile.getfile in thread to allow more than one
to be running at once and avoid blocking GUI during downloads;  this differs
from socket-based getfilegui, but reuses Form GUI builder tool;  supports both
user and anonymous FTP as currently coded;

caveats: the password field is not displayed as stars here, errors are printed
to the console instead of shown in the GUI (threads can't generally update the
GUI on Windows), this isn't 100% thread safe (there is a slight delay between
os.chdir here and opening the local output file in getfile) and we could
display both a save-as popup for picking the local dir, and a remote directory
listing for picking the file to get;  suggested exercises: improve me;
#################################################################################
"""

from tkinter import Tk, mainloop
from tkinter.messagebox import showinfo
import getfile, os, sys, _thread                # FTP getfile here, not socket
from PP4E.Internet.Sockets.form import Form     # reuse form tool in socket dir

class FtpForm(Form):
    def __init__(self):
        root = Tk()
        root.title(self.title)
        labels = ['Server Name', 'Remote Dir', 'File Name',
                  'Local Dir',   'User Name?', 'Password?']
        Form.__init__(self, labels, root)
        self.mutex = _thread.allocate_lock()
        self.threads = 0

    def transfer(self, filename, servername, remotedir, userinfo):
        try:
            self.do_transfer(filename, servername, remotedir, userinfo)
            print('%s of "%s" successful'  % (self.mode, filename))
        except:
            print('%s of "%s" has failed:' % (self.mode, filename), end=' ')
            print(sys.exc_info()[0], sys.exc_info()[1])
        self.mutex.acquire()
        self.threads -= 1
        self.mutex.release()

    def onSubmit(self):
        Form.onSubmit(self)
        localdir   = self.content['Local Dir'].get()
        remotedir  = self.content['Remote Dir'].get()
        servername = self.content['Server Name'].get()
        filename   = self.content['File Name'].get()
        username   = self.content['User Name?'].get()
        password   = self.content['Password?'].get()
        userinfo   = ()
        if username and password:
            userinfo = (username, password)
        if localdir:
            os.chdir(localdir)
        self.mutex.acquire()
        self.threads += 1
        self.mutex.release()
        ftpargs = (filename, servername, remotedir, userinfo)
        _thread.start_new_thread(self.transfer, ftpargs)
        showinfo(self.title, '%s of "%s" started' % (self.mode, filename))

    def onCancel(self):
        if self.threads == 0:
            Tk().quit()
        else:
            showinfo(self.title,
                     'Cannot exit: %d threads running' % self.threads)

class FtpGetfileForm(FtpForm):
    title = 'FtpGetfileGui'
    mode  = 'Download'
    def do_transfer(self, filename, servername, remotedir, userinfo):
        getfile.getfile(
            filename, servername, remotedir, userinfo, verbose=False, refetch=True)

if __name__ == '__main__':
    FtpGetfileForm()
    mainloop()

If you flip back to the end of the preceding chapter, you’ll find that this version is similar in structure to its counterpart there; in fact, it has the same name (and is distinct only because it lives in a different directory). The class here, though, knows how to use the FTP-based getfile module from earlier in this chapter instead of the socket-based getfile module we met a chapter ago. When run, this version also implements more input fields, as in Figure 13-2, shown on Windows 7.

Figure 13-2. FTP getfile input form

Notice that a full absolute file path can be entered for the local directory here. If not, the script assumes the current working directory, which changes after each download and can vary depending on where the GUI is launched (e.g., the current directory differs when this script is run by the PyDemos program at the top of the examples tree). When we click this GUI’s Submit button (or press the Enter key), the script simply passes the form’s input field values as arguments to the getfile.getfile FTP utility function of Example 13-4 earlier in this section. It also posts a pop up to tell us the download has begun (Figure 13-3).

Figure 13-3. FTP getfile info pop up

As currently coded, further download status messages, including any FTP error messages, show up in the console window; here are the messages for successful downloads as well as one that fails (with added blank lines for readability):

C:\...\PP4E\Internet\Ftp> getfilegui.py
Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       test
File Name       =>       about-pp.html
Password?       =>       xxxxxxxx
Remote Dir      =>       .
Download of "about-pp.html" successful

Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       C:\temp
File Name       =>       ora-lp4e-big.jpg
Password?       =>       xxxxxxxx
Remote Dir      =>       .
Download of "ora-lp4e-big.jpg" successful

Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       C:\temp
File Name       =>       ora-lp4e.jpg
Password?       =>       xxxxxxxx
Remote Dir      =>       .
Download of "ora-lp4e.jpg" has failed: <class 'ftplib.error_perm'>
550 ora-lp4e.jpg: No such file or directory

Given a username and password, the downloader logs into the specified account. To do anonymous FTP instead, leave the username and password fields blank.

Now, to illustrate the threading capabilities of this GUI, start a download of a large file, then start another download while this one is in progress. The GUI stays active while downloads are underway, so we simply change the input fields and press Submit again.

This second download starts and runs in parallel with the first, because each download is run in a thread, and more than one Internet connection can be active at once. In fact, the GUI itself stays active during downloads only because downloads are run in threads; if they were not, even screen redraws wouldn’t happen until a download finished.

We discussed threads in Chapter 5, and their application to GUIs in Chapters 9 and 10, but this script illustrates some practical thread concerns:

This program takes care to not do anything GUI-related in a download thread. As we’ve learned, only the thread that makes GUIs can generally process them.
To avoid killing spawned download threads on some platforms, the GUI must also be careful not to exit while any downloads are in progress. It keeps track of the number of in-progress threads, and just displays a pop up if we try to kill the GUI by pressing the Cancel button while both of these downloads are in progress.

We learned about ways to work around the no-GUI rule for threads in Chapter 10, and we will apply such techniques when we explore the PyMailGUI example in the next chapter. To be portable, though, we can’t really close the GUI until the active-thread count falls to zero; the exit model of the threading module of Chapter 5 can be used to achieve the same effect. Here is the sort of output that appears in the console window when two downloads overlap in time:

C:\...\PP4E\Internet\Ftp> python getfilegui.py
Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       C:\temp
File Name       =>       spain08.JPG
Password?       =>       xxxxxxxx
Remote Dir      =>       .

Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       C:\temp
File Name       =>       index.html
Password?       =>       xxxxxxxx
Remote Dir      =>       .

Download of "index.html" successful
Download of "spain08.JPG" successful
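The thread-counting scheme described above can be sketched without a GUI or an FTP server. This is an illustration rather than the book's code: it uses the threading module's Thread and Lock in place of _thread, the class and function names are hypothetical, and an Event stands in for the network delay of a real download. The counter is decremented in a finally clause so that a failed transfer cannot leave the count stuck above zero.

```python
import threading

class TransferTracker:
    """Count in-progress worker threads under a lock (illustrative only)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.active = 0
        self.workers = []

    def start(self, action, *args):
        with self.lock:                       # count before the thread runs
            self.active += 1
        worker = threading.Thread(target=self._run, args=(action,) + args)
        self.workers.append(worker)
        worker.start()

    def _run(self, action, *args):
        try:
            action(*args)
        except Exception as exc:              # report, as the GUI does on console
            print('transfer failed:', exc)
        finally:
            with self.lock:                   # always uncount, even on errors
                self.active -= 1

    def can_exit(self):
        with self.lock:
            return self.active == 0

release = threading.Event()                   # stand-in for network I/O time

def fake_download(name):
    release.wait()                            # "download" blocks until released
    print('Download of "%s" successful' % name)

tracker = TransferTracker()
tracker.start(fake_download, 'spain08.JPG')
tracker.start(fake_download, 'index.html')
busy_while_running = not tracker.can_exit()   # both workers still pending
release.set()                                 # let both "downloads" finish
for worker in tracker.workers:
    worker.join()
idle_after_join = tracker.can_exit()
print('busy during transfers:', busy_while_running)
print('safe to exit afterward:', idle_after_join)
```

A GUI's Cancel handler could consult can_exit in the same way the onCancel method above checks its thread count before quitting.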

This example isn’t much more useful than a command line-based tool, of course, but it can be easily modified by changing its Python code, and it provides enough of a GUI to qualify as a simple, first-cut FTP user interface. Moreover, because this GUI runs downloads in Python threads, more than one can be run at the same time from this GUI without having to start or restart a different FTP client tool.

While we’re in a GUI mood, let’s add a simple interface to the putfile utility, too. The script in Example 13-8 creates a dialog that starts uploads in threads, using core FTP logic imported from Example 13-5. It’s almost the same as the getfile GUI we just wrote, so there’s not much new to say. In fact, because get and put operations are so similar from an interface perspective, most of the get form’s logic was deliberately factored out into a single generic class (FtpForm), so changes need be made in only a single place. That is, the put GUI here is mostly just a reuse of the get GUI, with distinct output labels and transfer methods. It’s in a file by itself, though, to make it easy to launch as a standalone program.

Example 13-8. PP4E\Internet\Ftp\putfilegui.py

"""
###############################################################
launch FTP putfile function with a reusable form GUI class;
see getfilegui for notes: most of the same caveats apply;
the get and put forms have been factored into a single
class such that changes need be made in only one place;
###############################################################
"""

from tkinter import mainloop
import putfile, getfilegui

class FtpPutfileForm(getfilegui.FtpForm):
    title = 'FtpPutfileGui'
    mode  = 'Upload'
    def do_transfer(self, filename, servername, remotedir, userinfo):
        putfile.putfile(filename, servername, remotedir, userinfo, verbose=False)

if __name__ == '__main__':
    FtpPutfileForm()
    mainloop()

Running this script looks much like running the download GUI, because it’s almost entirely the same code at work. Let’s upload some files from the client machine to the server; Figure 13-4 shows the state of the GUI while starting one.

Figure 13-4. FTP putfile input form

And here is the console window output we get when uploading two files in serial fashion; here again, uploads run in parallel threads, so if we start a new upload before one in progress is finished, they overlap in time:

C:\...\PP4E\Internet\Ftp\test> ..\putfilegui.py
Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       .
File Name       =>       sousa.au
Password?       =>       xxxxxxxx
Remote Dir      =>       .
Upload of "sousa.au" successful

Server Name     =>       ftp.rmi.net
User Name?      =>       lutz
Local Dir       =>       .
File Name       =>       about-pp.html
Password?       =>       xxxxxxxx
Remote Dir      =>       .
Upload of "about-pp.html" successful

Finally, we can bundle up both GUIs in a single launcher script that knows how to start the get and put interfaces, regardless of which directory we are in when the script is started, and independent of the platform on which it runs. Example 13-9 shows this process.

Example 13-9. PP4E\Internet\Ftp\PyFtpGui.pyw

"""
spawn FTP get and put GUIs no matter what directory I'm run from; os.getcwd is not
necessarily the place this script lives;  could also hardcode path from $PP4EHOME,
or guessLocation;  could also do:  [from PP4E.launchmodes import PortableLauncher,
PortableLauncher('getfilegui', '%s/getfilegui.py' % mydir)()], but need the DOS
console pop up on Windows to view status messages which describe transfers made;
"""

import os, sys
print('Running in: ', os.getcwd())

# PP3E
# from PP4E.Launcher import findFirst
# mydir = os.path.split(findFirst(os.curdir, 'PyFtpGui.pyw'))[0]

# PP4E
from PP4E.Tools.find import findlist
mydir = os.path.dirname(findlist('PyFtpGui.pyw', startdir=os.curdir)[0])

if sys.platform[:3] == 'win':
    os.system('start %s\\getfilegui.py' % mydir)
    os.system('start %s\\putfilegui.py' % mydir)
else:
    os.system('python %s/getfilegui.py &' % mydir)
    os.system('python %s/putfilegui.py &' % mydir)

Notice that we’re reusing the find utility from Chapter 6’s Example 6-13 again here—this time to locate the home directory of the script in order to build command lines. When run by launchers in the examples root directory or command lines elsewhere in general, the current working directory may not always be this script’s container. In the prior edition, this script used a tool in the Launcher module instead to search for its own directory (see the examples distribution for that equivalent).
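As a rough alternative to searching the directory tree, a script can often derive its own directory from the path it was started with. The sketch below is not the book's approach; script_dir and launch_commands are hypothetical helpers, and the second one only builds the command strings that a launcher like this would pass to os.system, so that the platform-specific logic can be inspected without starting any programs.

```python
import os
import sys

def script_dir(script_path):
    """Directory containing script_path, independent of the current directory."""
    return os.path.dirname(os.path.abspath(script_path))

def launch_commands(mydir, platform=sys.platform):
    """Build the shell commands a launcher might run (hypothetical helper)."""
    commands = []
    for script in ('getfilegui.py', 'putfilegui.py'):
        if platform[:3] == 'win':
            commands.append('start %s\\%s' % (mydir, script))   # new console
        else:
            commands.append('python %s/%s &' % (mydir, script)) # background job
    return commands

mydir = script_dir(sys.argv[0])               # where this script itself lives
for command in launch_commands(mydir):
    print(command)                            # a launcher would os.system() these
```

This relies on the script's start path being available, which may not hold in every launching context; the find-based search in the example above does not share that assumption.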

When this script is started, both the get and put GUIs appear as distinct, independently run programs; alternatively, we might attach both forms to a single interface. We could get much fancier than these two interfaces, of course. For instance, we could pop up local file selection dialogs, and we could display widgets that give the status of downloads and uploads in progress. We could even list files available at the remote site in a selectable listbox by requesting remote directory listings over the FTP connection. To learn how to add features like that, though, we need to move on to the next section.

Transferring Directories with ftplib

Once upon a time, I used Telnet to manage my website at my Internet Service Provider (ISP). I logged in to the web server in a shell window, and performed all my edits directly on the remote machine. There was only one copy of a site’s files—on the machine that hosted it. Moreover, content updates could be performed from any machine that ran a Telnet client—ideal for people with travel-based careers.^[49]

Of course, times have changed. Like most personal websites, today mine are maintained on my laptop and I transfer their files to and from my ISP as needed. Often, this is a simple matter of one or two files, and it can be accomplished with a command-line FTP client. Sometimes, though, I need an easy way to transfer the entire site. Maybe I need to download to detect files that have become out of sync. Occasionally, the changes are so involved that it’s easier to upload the entire site in a single step.

Although there are a variety of ways to approach this task (including options in site-builder tools), Python can help here, too: writing Python scripts to automate the upload and download tasks associated with maintaining my website on my laptop provides a portable and mobile solution. Because Python FTP scripts will work on any machine with sockets, they can be run on my laptop and on nearly any other computer where Python is installed. Furthermore, the same scripts used to transfer page files to and from my PC can be used to copy my site to another web server as a backup copy, should my ISP experience an outage. The effect is sometimes called a mirror—a copy of a remote site.

Downloading Site Directories

The following two scripts address these needs. The first, downloadflat.py, automatically downloads (i.e., copies) by FTP all the files in a directory at a remote site to a directory on the local machine. I keep the main copy of my website files on my PC these days, but I use this script in two ways:

To download my website to client machines where I want to make edits, I fetch the contents of my web directory of my account on my ISP’s machine.
To mirror my site to my account on another server, I run this script periodically on the target machine if it supports Telnet or SSH secure shell; if it does not, I simply download to one machine and upload from there to the target server.

More generally, this script (shown in Example 13-10) will download a directory full of files to any machine with Python and sockets, from any machine running an FTP server.

Example 13-10. PP4E\Internet\Ftp\Mirror\downloadflat.py

#!/bin/env python
"""
###############################################################################
use FTP to copy (download) all files from a single directory at a remote
site to a directory on the local machine; run me periodically to mirror
a flat FTP site directory to your ISP account;  set user to 'anonymous'
to do anonymous FTP;  we could use try to skip file failures, but the FTP
connection is likely closed if any files fail;  we could also try to
reconnect with a new FTP instance before each transfer: connects once now;
if failures, try setting nonpassive for active FTP, or disable firewalls;
this also depends on a working FTP server, and possibly its load policies.
###############################################################################
"""

import os, sys, ftplib
from getpass   import getpass
from mimetypes import guess_type

nonpassive = False                        # passive FTP on by default in 2.1+
remotesite = 'home.rmi.net'               # download from this site
remotedir  = '.'                          # and this dir (e.g., public_html)
remoteuser = 'lutz'
remotepass = getpass('Password for %s on %s: ' % (remoteuser, remotesite))
localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
cleanall   = input('Clean local directory first? ')[:1] in ['y', 'Y']

print('connecting...')
connection = ftplib.FTP(remotesite)                 # connect to FTP site
connection.login(remoteuser, remotepass)            # login as user/password
connection.cwd(remotedir)                           # cd to directory to copy
if nonpassive:                                      # force active mode FTP
    connection.set_pasv(False)                      # most servers do passive

if cleanall:
    for localname in os.listdir(localdir):          # try to delete all locals
        try:                                        # first, to remove old files
            print('deleting local', localname)      # os.listdir omits . and ..
            os.remove(os.path.join(localdir, localname))
        except:
            print('cannot delete local', localname)

count = 0                                           # download all remote files
remotefiles = connection.nlst()                     # nlst() gives files list
                                                    # dir()  gives full details
for remotename in remotefiles:
    if remotename in ('.', '..'): continue          # some servers include . and ..
    mimetype, encoding = guess_type(remotename)     # e.g., ('text/plain', 'gzip')
    mimetype  = mimetype or '?/?'                   # may be (None, None)
    maintype  = mimetype.split('/')[0]              # .jpg ('image/jpeg', None)

    localpath = os.path.join(localdir, remotename)
    print('downloading', remotename, 'to', localpath, end=' ')
    print('as', maintype, encoding or '')

    if maintype == 'text' and encoding == None:
        # use ascii mode xfer and text file
        # use encoding compatible with ftplib's
        localfile = open(localpath, 'w', encoding=connection.encoding)
        callback  = lambda line: localfile.write(line + '\n')
        connection.retrlines('RETR ' + remotename, callback)

    else:
        # use binary mode xfer and bytes file
        localfile = open(localpath, 'wb')
        connection.retrbinary('RETR ' + remotename, localfile.write)

    localfile.close()
    count += 1

connection.quit()
print('Done:', count, 'files downloaded.')

There’s not much that is new to speak of in this script, compared to other FTP examples we’ve seen thus far. We open a connection with the remote FTP server, log in with a username and password for the desired account (this script never uses anonymous FTP), and go to the desired remote directory. New here, though, are loops to iterate over all the files in local and remote directories, text-based retrievals, and file deletions:

Deleting all local files

This script has a cleanall option, enabled by an interactive prompt. If selected, the script first deletes all the files in the local directory before downloading, to make sure there are no extra files that aren’t also on the server (there may be junk here from a prior download). To delete local files, the script calls os.listdir to get a list of filenames in the directory, and os.remove to delete each; see Chapter 4 (or the Python library manual) for more details if you’ve forgotten what these calls do.

Notice the use of os.path.join to concatenate a directory path and filename according to the host platform’s conventions; os.listdir returns filenames without their directory paths, and this script is not necessarily run in the local directory where downloads will be placed. The local directory defaults to the current directory (“.”), but can be set differently with a command-line argument to the script.
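The deletion step can be tried safely in a throwaway directory. The following sketch assumes a hypothetical clean_directory helper and made-up file names; it mirrors the os.listdir, os.path.join, and os.remove combination described above, and shows that an entry that cannot be removed with os.remove (here, a subdirectory) is reported rather than ending the loop.

```python
import os
import tempfile

def clean_directory(localdir):
    """Try to delete every entry in localdir; return (deleted, failed) names."""
    deleted, failed = [], []
    for localname in os.listdir(localdir):            # names only, no dir prefix
        localpath = os.path.join(localdir, localname) # add the directory back
        try:
            os.remove(localpath)                      # fails for subdirectories
            deleted.append(localname)
        except OSError:
            failed.append(localname)
    return deleted, failed

with tempfile.TemporaryDirectory() as tmpdir:
    for name in ('index.html', 'logo.gif'):           # made-up sample files
        open(os.path.join(tmpdir, name), 'w').close()
    os.mkdir(os.path.join(tmpdir, 'subdir'))
    deleted, failed = clean_directory(tmpdir)
    remaining = os.listdir(tmpdir)

print(sorted(deleted), failed, remaining)
```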

Fetching all remote files

To grab all the files in a remote directory, we first need a list of their names. The FTP object’s nlst method is the remote equivalent of os.listdir: nlst returns a list of the string names of all files in the current remote directory. Once we have this list, we simply step through it in a loop, running FTP retrieval commands for each filename in turn (more on this in a minute).

The nlst method is, more or less, like requesting a directory listing with an ls command in typical interactive FTP programs, but Python automatically splits up the listing’s text into a list of filenames. We can pass it a remote directory to be listed; by default it lists the current server directory. A related FTP method, dir, produces the line strings of an FTP LIST command, printing them by default or passing each line to a callback function we supply; its result is like typing a dir command in an FTP session, and its lines contain complete file information, unlike nlst. If you need to know more about all the remote files, collect and parse the lines produced by a dir method call (we’ll see how in a later example).

Notice how we skip “.” and “..” current and parent directory indicators if present in remote directory listings; unlike os.listdir, some (but not all) servers include these, so we need to either skip these or catch the exceptions they may trigger (more on this later when we start using dir, too).
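To give a feel for what parsing a dir listing involves, here is a small sketch that runs on made-up listing lines rather than a live server. It assumes the common Unix "ls -l" layout, with the filename in the ninth field; LIST output is not standardized, so real servers may format lines differently. The parse_list_line helper is hypothetical. With a real connection, the lines could be collected with a callback, as in connection.dir(lines.append).

```python
def parse_list_line(line):
    """Split one Unix-style LIST line into (is_directory, name).

    Assumes an 'ls -l' layout with the name in the ninth field;
    other servers may need different parsing.
    """
    fields = line.split(None, 8)              # at most 9 fields: keeps spaces in names
    is_directory = fields[0].startswith('d')  # permissions field: 'd' marks a dir
    return is_directory, fields[8]

# Made-up listing lines, shaped like what a dir() callback might receive.
sample_listing = [
    'drwxr-xr-x   2 lutz  users     4096 Mar 10 09:15 .',
    'drwxr-xr-x  11 lutz  users     4096 Mar 10 09:15 ..',
    '-rw-r--r--   1 lutz  users    12345 Mar 10 09:15 index.html',
    'drwxr-xr-x   3 lutz  users     4096 Mar 10 09:15 public html',
]

files, subdirs = [], []
for line in sample_listing:
    is_directory, name = parse_list_line(line)
    if name in ('.', '..'):                   # skip current and parent entries
        continue
    (subdirs if is_directory else files).append(name)

print('files:', files)
print('subdirs:', subdirs)
```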

Selecting transfer modes with mimetypes

We discussed output file modes for FTP earlier, but now that we’ve started transferring text, too, I can fill in the rest of this story. To handle Unicode encodings and to keep line-ends in sync with the machines that my web files live on, this script distinguishes between binary and text file transfers. It uses the Python mimetypes module to choose between text and binary transfer modes for each file.

We met mimetypes in Chapter 6 near Example 6-23, where we used it to play media files (see the examples and description there for an introduction). Here, mimetypes is used to decide whether a file is text or binary by guessing from its filename extension. For instance, HTML web pages and simple text files are transferred as text with automatic line-end mappings, and images and tar archives are transferred in raw binary mode.
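The decision rule can be isolated in a few lines. In this sketch, transfer_mode is a hypothetical helper that applies the same test the download script uses: a file is treated as text only if its guessed main type is text and it has no compression encoding. The filenames are examples only, and guesses depend on the extension tables available to mimetypes on the host.

```python
from mimetypes import guess_type

def transfer_mode(filename):
    """Return 'text' or 'binary' for filename, guessed from its extension."""
    mimetype, encoding = guess_type(filename)       # may be (None, None)
    maintype = (mimetype or '?/?').split('/')[0]    # 'text', 'image', '?', ...
    if maintype == 'text' and encoding is None:
        return 'text'                               # uncompressed text only
    return 'binary'                                 # everything else, to be safe

for name in ('index.html', 'notes.txt', 'photo.jpg', 'notes.txt.gz', 'README'):
    print(name, '=>', transfer_mode(name))
```

Names with no recognizable extension fall through to binary mode, which copies bytes unchanged and so is the safer default when the guess fails.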

Downloading: text versus binary

For binary files data is pulled down with the retrbinary method we met earlier, and stored in a local file with binary open mode of wb. This file open mode is required to allow for the bytes strings passed to the write method by retrbinary, but it also suppresses line-end byte mapping and Unicode encodings in the process. Again, text mode requires encodable text in Python 3.X, and this fails for binary data like images. This script may also be run on Windows or Unix-like platforms, and we don’t want a \n byte embedded in an image to get expanded to \r\n on Windows. We don’t use a chunk-size third argument for binary transfers here, though; it defaults to a reasonable size if omitted.

For text files, the script instead uses the retrlines method, passing in a function to be called for each line in the text file downloaded. The text line handler function receives lines in str string form, and mostly just writes the line to a local text file. But notice that the handler function created by the lambda here also adds a \n line-end character to the end of the line it is passed. Python’s retrlines method strips all line-feed characters from lines to sidestep platform differences. By adding a \n, the script ensures the proper line-end marker character sequence for the local platform on which this script runs when written to the file (\n or \r\n).

For this auto-mapping of the \n in the script to work, of course, we must also open text output files in w text mode, not in wb: the mapping from \n to \r\n on Windows happens when data is written to the file. As discussed earlier, text mode also means that the file’s write method will allow for the str string passed in by retrlines, and that text will be encoded per Unicode when written.
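The line-end mapping is easy to observe on any platform by asking open for it explicitly. This sketch is for demonstration only: it passes newline='\r\n' to force the Windows-style translation, whereas the download script relies on the platform default. The file name and line contents are made up; the lines carry no line-end characters, in the way retrlines would pass them to its callback.

```python
import os
import tempfile

lines = ['first line', 'second line']     # no line ends, as from retrlines

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'page.txt')          # made-up file name

    # newline='\r\n' forces the Windows mapping here, purely to show the effect;
    # with the default, '\n' is mapped to the local platform's line end.
    with open(path, 'w', encoding='latin-1', newline='\r\n') as f:
        for line in lines:
            f.write(line + '\n')                     # same step as the lambda

    with open(path, 'rb') as f:                      # read back the raw bytes
        raw = f.read()

print(raw)
```

Reading the file back in binary mode shows the bytes actually stored, with each \n written by the code expanded to the two-byte sequence.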

Subtly, though, we also explicitly use the FTP connection object’s Unicode encoding scheme for our text output file in open, instead of the default. Without this encoding option, the script aborted with a UnicodeEncodeError exception for some files in my site. In retrlines, the FTP object itself reads the remote file data over a socket with a text-mode file wrapper and an explicit encoding scheme for decoding; since the FTP object can do no better than this encoding anyhow, we use its encoding for our output file as well.

By default, FTP objects use the latin1 scheme for decoding text fetched (as well as for encoding text sent), but this can be specialized by assigning to their encoding attribute. Our script’s local text output file will inherit whatever encoding ftplib uses and so be compatible with the encoded text data that it produces and passes.

We could try to also catch Unicode exceptions for files outside the Unicode encoding used by the FTP object, but exceptions leave the FTP object in an unrecoverable state in tests I’ve run in Python 3.1. Alternatively, we could use wb binary mode for the local text output file and manually encode line strings with line.encode, or simply use retrbinary and binary mode files in all cases, but both of these would fail to map end-lines portably—the whole point of making text distinct in this context.
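The encoding mismatch behind the UnicodeEncodeError can be reproduced in miniature. In the sketch below, the byte string is made up and ASCII merely stands in for a narrower local default encoding. Text decoded with latin1 round-trips when written with the same scheme, but fails under an encoding that cannot represent its characters. Note that this illustrates the latin1 default described above; in Python 3.9 and later, ftplib's default encoding is UTF-8, though the reasoning for matching the output file to connection.encoding is the same.

```python
raw = b'caf\xe9 men\xfc'                # made-up bytes, as a server might send

text = raw.decode('latin-1')            # latin1 maps every byte to a character
round_trip = text.encode('latin-1')     # same scheme on output: bytes preserved

try:
    text.encode('ascii')                # a narrower scheme cannot represent them
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print('ascii cannot encode this text:', exc.reason)

print(text, round_trip == raw, ascii_ok)
```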

All of this is simpler in action than in words. Here is the command I use to download my entire book support website from my ISP server account to my Windows laptop PC, in a single step:

C:\...\PP4E\Internet\Ftp\Mirror> downloadflat.py test
Password for lutz on home.rmi.net:
Clean local directory first? y
connecting...
deleting local 2004-longmont-classes.html
deleting local 2005-longmont-classes.html
deleting local 2006-longmont-classes.html
deleting local about-hopl.html
deleting local about-lp.html
deleting local about-lp2e.html
deleting local about-pp-japan.html

...lines omitted...

downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
downloading 2005-longmont-classes.html to test\2005-longmont-classes.html as text
downloading 2006-longmont-classes.html to test\2006-longmont-classes.html as text
downloading about-hopl.html to test\about-hopl.html as text
downloading about-lp.html to test\about-lp.html as text
downloading about-lp2e.html to test\about-lp2e.html as text
downloading about-pp-japan.html to test\about-pp-japan.html as text

...lines omitted...

downloading ora-pyref4e.gif to test\ora-pyref4e.gif as image
downloading ora-lp4e-big.jpg to test\ora-lp4e-big.jpg as image
downloading ora-lp4e.gif to test\ora-lp4e.gif as image
downloading pyref4e-updates.html to test\pyref4e-updates.html as text
downloading lp4e-updates.html to test\lp4e-updates.html as text
downloading lp4e-examples.html to test\lp4e-examples.html as text
downloading LP4E-examples.zip to test\LP4E-examples.zip as application
Done: 297 files downloaded.

This may take a few moments to complete, depending on your site’s size and your connection speed (it’s bound by network speed constraints, and it usually takes roughly two to three minutes for my site on my current laptop and wireless broadband connection). It is much more accurate and easier than downloading files by hand, though. The script simply iterates over all the remote files returned by the nlst method, and downloads each with the FTP protocol (i.e., over sockets) in turn. It uses text transfer mode for names that imply text data, and binary mode for others.

With the script running this way, I make sure the initial assignments in it reflect the machines involved, and then run the script from the local directory where I want the site copy to be stored. Because the target download directory is often not where the script lives, I may need to give Python the full path to the script file. When run on a server in a Telnet or SSH session window, for instance, the execution and script directory paths are different, but the script works the same way.

If you elect to delete local files in the download directory, you may also see a batch of “deleting local…” messages scroll by on the screen before any “downloading…” lines appear: this automatically cleans out any garbage lingering from a prior download. And if you botch the input of the remote site password, a Python exception is raised; I sometimes need to run it again (and type more slowly):

C:\...\PP4E\Internet\Ftp\Mirror> downloadflat.py test
Password for lutz on home.rmi.net:
Clean local directory first?
connecting...
Traceback (most recent call last):
  File "C:\...\PP4E\Internet\Ftp\Mirror\downloadflat.py", line 29, in <module>
    connection.login(remoteuser, remotepass)            # login as user/password
  File "C:\Python31\lib\ftplib.py", line 375, in login
    if resp[0] == '3': resp = self.sendcmd('PASS ' + passwd)
  File "C:\Python31\lib\ftplib.py", line 245, in sendcmd
    return self.getresp()
  File "C:\Python31\lib\ftplib.py", line 220, in getresp
    raise error_perm(resp)
ftplib.error_perm: 530 Login incorrect.
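
One way to soften a mistyped password is to catch the ftplib.error_perm exception and prompt again instead of letting the script die. The following is only a sketch and is not part of downloadflat.py: the loginWithRetry name and its maxtries and ask parameters are hypothetical, and it assumes the server reports a bad password with a permanent 5xx reply, as in the traceback above.

```python
import ftplib
from getpass import getpass

def loginWithRetry(connection, user, prompt, maxtries=3, ask=getpass):
    """
    hypothetical helper: try to log in up to maxtries times,
    prompting for the password again after each failed login
    """
    for _ in range(maxtries):
        password = ask(prompt)                   # getpass hides typed input
        try:
            connection.login(user, password)
            return True                          # logged in
        except ftplib.error_perm as exc:         # e.g., 530 Login incorrect.
            print('login failed:', exc)
    return False                                 # caller decides what to do
```

In the script, this might replace the direct login call with something like loginWithRetry(connection, remoteuser, 'Password for %s on %s: ' % (remoteuser, remotesite)). Some servers may close the connection after a failed login, in which case the script would also need to reconnect before trying again.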

It’s worth noting that this script is at least partially configured by assignments near the top of the file. In addition, the password and deletion options are given by interactive inputs, and one command-line argument is allowed—the local directory name to store the downloaded files (it defaults to “.”, the directory where the script is run). Command-line arguments could be employed to universally configure all the other download parameters and options, too, but because of Python’s simplicity and lack of compile/link steps, changing settings in the text of Python scripts is usually just as easy as typing words on a command line.
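
For comparison, here is a rough sketch of what a fully command-line-driven configuration might look like. It is not part of the book's scripts: the parseArgs function and its option names are hypothetical, the defaults simply repeat the download settings used in this section, and it relies on the argparse module, which is in the standard library in Python 3.2 and later.

```python
import argparse

def parseArgs(argv=None):
    """
    hypothetical command line for the download script;
    argv=None means parse sys.argv[1:] as usual
    """
    parser = argparse.ArgumentParser(description='flat FTP site download')
    parser.add_argument('localdir', nargs='?', default='.',
                        help='directory to store downloaded files')
    parser.add_argument('--site', default='home.rmi.net', help='FTP server')
    parser.add_argument('--rdir', default='.', help='remote directory')
    parser.add_argument('--user', default='lutz', help='remote login name')
    parser.add_argument('--clean', action='store_true',
                        help='delete local files before downloading')
    return parser.parse_args(argv)

args = parseArgs(['test', '--clean'])            # as if typed on the command line
print(args.localdir, args.site, args.clean)      # test home.rmi.net True
```

The password is deliberately left out of the options here, since command lines tend to be visible in shell histories and process listings; prompting with getpass remains the safer choice.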

Note

To check for version skew after a batch of downloads and uploads, you can run the diffall script we wrote in Chapter 6, Example 6-12. For instance, I find files that have diverged over time due to updates on multiple platforms by comparing the download to a local copy of my website using a shell command line such as C:\...\PP4E\Internet\Ftp> ..\..\System\Filetools\diffall.py Mirror\test C:\...\Websites\public_html. See Chapter 6 for more details on this tool, and file diffall.out.txt in the diffs subdirectory of the examples distribution for a sample run; its text file differences stem from either final line newline characters or newline differences reflecting binary transfers that Windows fc commands and FTP servers do not notice.

Uploading Site Directories

Uploading a full directory is symmetric to downloading: it’s mostly a matter of swapping the local and remote machines and operations in the program we just met. The script in Example 13-11 uses FTP to copy all files in a directory on the local machine on which it runs up to a directory on a remote machine.

I really use this script, too, most often to upload all of the files maintained on my laptop PC to my ISP account in one fell swoop. I also sometimes use it to copy my site from my PC to a mirror machine or from the mirror machine back to my ISP. Because this script runs on any computer with Python and sockets, it happily transfers a directory from any machine on the Net to any machine running an FTP server. Simply change the initial settings in this module as appropriate for the transfer you have in mind.

Example 13-11. PP4E\Internet\Ftp\Mirror\uploadflat.py

#!/bin/env python
"""
##############################################################################
use FTP to upload all files from one local dir to a remote site/directory;
e.g., run me to copy a web/FTP site's files from your PC to your ISP;
assumes a flat directory upload: uploadall.py does nested directories.
see downloadflat.py comments for more notes: this script is symmetric.
##############################################################################
"""

import os, sys, ftplib
from getpass import getpass
from mimetypes import guess_type

nonpassive = False                                  # passive FTP by default
remotesite = 'learning-python.com'                  # upload to this site
remotedir  = 'books'                                # from machine running on
remoteuser = 'lutz'
remotepass = getpass('Password for %s on %s: ' % (remoteuser, remotesite))
localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
cleanall   = input('Clean remote directory first? ')[:1] in ['y', 'Y']

print('connecting...')
connection = ftplib.FTP(remotesite)                 # connect to FTP site
connection.login(remoteuser, remotepass)            # log in as user/password
connection.cwd(remotedir)                           # cd to directory to copy
if nonpassive:                                      # force active mode FTP
    connection.set_pasv(False)                      # most servers do passive

if cleanall:
    for remotename in connection.nlst():            # try to delete all remotes
        try:                                        # first, to remove old files
            print('deleting remote', remotename)
            connection.delete(remotename)           # skips . and .. if attempted
        except:
            print('cannot delete remote', remotename)

count = 0                                           # upload all local files
localfiles = os.listdir(localdir)                   # listdir() strips dir path
                                                    # any failure ends script
for localname in localfiles:
    mimetype, encoding = guess_type(localname)      # e.g., ('text/plain', 'gzip')
    mimetype  = mimetype or '?/?'                   # may be (None, None)
    maintype  = mimetype.split('/')[0]              # .jpg ('image/jpeg', None)

    localpath = os.path.join(localdir, localname)
    print('uploading', localpath, 'to', localname, end=' ')
    print('as', maintype, encoding or '')

    if maintype == 'text' and encoding == None:
        # use ascii mode xfer and bytes file
        # need rb mode for ftplib's crlf logic
        localfile = open(localpath, 'rb')
        connection.storlines('STOR ' + localname, localfile)

    else:
        # use binary mode xfer and bytes file
        localfile = open(localpath, 'rb')
        connection.storbinary('STOR ' + localname, localfile)

    localfile.close()
    count += 1

connection.quit()
print('Done:', count, 'files uploaded.')

Similar to the mirror download script, this program illustrates a handful of new FTP interfaces and a set of FTP scripting techniques:

Deleting all remote files

Just like the mirror script, the upload begins by asking whether we want to delete all the files in the remote target directory before copying any files there. This cleanall option is useful if we’ve deleted files in the local copy of the directory in the client—the deleted files would remain on the server-side copy unless we delete all files there first.

To implement the remote cleanup, this script simply gets a listing of all the files in the remote directory with the FTP nlst method, and deletes each in turn with the FTP delete method. Assuming we have delete permission, the directory will be emptied (file permissions depend on the account we logged into when connecting to the server). We’ve already moved to the target remote directory when deletions occur, so no directory paths need to be prepended to filenames here. Note that nlst may raise an exception for some servers if the remote directory is empty; we don’t catch the exception here, but you can simply decline the cleaning option if it fails for you. We do catch deletion exceptions, because directory names like “.” and “..” may be returned in the listing by some servers.
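
If the empty-directory failure is a problem on your server, the listing can be wrapped so that the cleanup loop simply sees no names. The sketch below is hypothetical (listRemote is not a function in the book's scripts), and it assumes the failure surfaces as ftplib.error_perm with a 550 reply, which is common but not guaranteed for every server.

```python
import ftplib

def listRemote(connection):
    """
    hypothetical wrapper around nlst: return the remote names,
    or [] if the server reports an error for an empty directory
    """
    try:
        names = connection.nlst()
    except ftplib.error_perm as exc:
        if not str(exc).startswith('550'):       # assumption: 550 means no files
            raise                                # anything else is a real error
        names = []
    return [name for name in names if name not in ('.', '..')]
```

Because this version also drops the “.” and “..” entries, a cleanup loop built on it would no longer print “cannot delete” messages for them.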

Storing all local files

To apply the upload operation to each file in the local directory, we get a list of local filenames with the standard os.listdir call, and take care to prepend the local source directory path to each filename with the os.path.join call. Recall that os.listdir returns filenames without directory paths, and the source directory may not be the same as the script’s execution directory if passed on the command line.
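
The difference between the bare names returned by os.listdir and the full paths needed to open files is easy to see in isolation. The following sketch uses a temporary directory as a stand-in for a local site copy; the localPairs helper and the file names are made up for illustration.

```python
import os, tempfile

def localPairs(localdir):
    """
    hypothetical helper: pair each bare name from os.listdir with
    the full path needed to open it from any working directory
    """
    return [(name, os.path.join(localdir, name))
            for name in sorted(os.listdir(localdir))]

with tempfile.TemporaryDirectory() as sitedir:       # stand-in for a site copy
    for name in ('index.html', 'logo.jpg'):
        open(os.path.join(sitedir, name), 'w').close()

    for localname, localpath in localPairs(sitedir):
        # the bare name is relative to the current directory; the joined path is not
        print(localname, os.path.isfile(localname), os.path.isfile(localpath))
```

Unless the current directory happens to contain files with the same names, the bare names are not found, while the joined paths always are. This is why the upload script opens localpath but sends the file to the server under localname.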

Uploading: Text versus binary

This script may also be run on both Windows and Unix-like clients, so we need to handle text files specially. Like the mirror download, this script picks text or binary transfer modes by using Python’s mimetypes module to guess a file’s type from its filename extension; HTML and text files are moved in FTP text mode, for instance. We already met the storbinary FTP object method used to upload files in binary mode—an exact, byte-for-byte copy appears at the remote site.
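
To get a feel for how the filename guesses turn into transfer modes, the test used by the script can be tried on a few sample names. This is a small sketch: the transferMode function and the sample filenames are made up for illustration, and exact MIME types may vary slightly with the type tables installed on your machine.

```python
from mimetypes import guess_type

def transferMode(filename):
    """
    return 'text' or 'binary' for a filename, using the same test
    as the upload script: main type is text, and not compressed
    """
    mimetype, encoding = guess_type(filename)    # may be (None, None)
    maintype = (mimetype or '?/?').split('/')[0]
    return 'text' if maintype == 'text' and encoding is None else 'binary'

for name in ('index.html', 'notes.txt', 'photo.jpg', 'notes.txt.gz', 'README'):
    print(name, '=>', transferMode(name))
```

Names with unknown or missing extensions fall through to binary mode, which is the safe default because a binary transfer never alters the bytes it copies.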

Text-mode transfers work almost identically: the storlines method accepts an FTP command string and a local file (or file-like) object, and simply copies each line read from the local file to a same-named file on the remote machine.

Notice, though, that the local text input file must be opened in rb binary mode in Python 3.X. Text input files are normally opened in r text mode to perform Unicode decoding and to convert any \r\n end-of-line sequences on Windows to the \n platform-neutral character as lines are read. However, ftplib in Python 3.1 requires that the text file be opened in rb binary mode, because it converts all end-lines to the \r\n sequence for transmission; to do so, it must read lines as raw bytes with readlines and perform bytes string processing, which implies binary mode files.

This ftplib string processing worked with text-mode files in Python 2.X, but only because there was no separate bytes type; \n was expanded to \r\n. Opening the local file in binary mode for ftplib to read also means no Unicode decoding will occur: the text is sent over sockets as a byte string in already encoded form. All of which is, of course, a prime lesson on the impacts of Unicode encodings; consult the module ftplib.py in the Python source library directory for more details.
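
The kind of bytes processing involved can be imitated in a few lines. The following is an illustration only, not ftplib's actual code: the crlfLines function is hypothetical, and an in-memory io.BytesIO object stands in for a file opened in rb mode.

```python
import io

def crlfLines(fileobj):
    """
    illustration only (not ftplib's code): yield each line of a
    binary-mode file with its line end normalized to CRLF
    """
    for line in fileobj:                         # bytes lines, line ends intact
        yield line.rstrip(b'\r\n') + b'\r\n'     # bytes methods, bytes literals

unixtext = io.BytesIO(b'spam\neggs\n')           # stand-in for open(path, 'rb')
print(list(crlfLines(unixtext)))                 # [b'spam\r\n', b'eggs\r\n']
```

Had the file been opened in text mode, each line would be a str, and combining it with bytes literals like these would raise a TypeError in Python 3.X; this is the mismatch that binary mode avoids.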

For binary mode transfers, things are simpler: we open the local file in rb binary mode to suppress Unicode decoding and automatic end-of-line mapping everywhere, and return the bytes strings expected by ftplib on read. Binary data is not Unicode text, and we don’t want bytes in an audio file that happen to have the same value as \r to magically disappear when read on Windows.

As with the mirror download script, this program simply iterates over all files to be transferred (files in the local directory listing this time), and transfers each in turn, in either text or binary mode, depending on the files’ names. Here is the command I use to upload my entire website from my laptop Windows PC to a remote Linux server at my ISP, in a single step:

C:\...\PP4E\Internet\Ftp\Mirror> uploadflat.py test
Password for lutz on learning-python.com:
Clean remote directory first? y
connecting...
deleting remote .
cannot delete remote .
deleting remote ..
cannot delete remote ..
deleting remote 2004-longmont-classes.html
deleting remote 2005-longmont-classes.html
deleting remote 2006-longmont-classes.html
deleting remote about-lp1e.html
deleting remote about-lp2e.html
deleting remote about-lp3e.html
deleting remote about-lp4e.html

...lines omitted...

uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
uploading test\2005-longmont-classes.html to 2005-longmont-classes.html as text
uploading test\2006-longmont-classes.html to 2006-longmont-classes.html as text
uploading test\about-lp1e.html to about-lp1e.html as text
uploading test\about-lp2e.html to about-lp2e.html as text
uploading test\about-lp3e.html to about-lp3e.html as text
uploading test\about-lp4e.html to about-lp4e.html as text
uploading test\about-pp-japan.html to about-pp-japan.html as text

...lines omitted...

uploading test\whatsnew.html to whatsnew.html as text
uploading test\whatsold.html to whatsold.html as text
uploading test\wxPython.doc.tgz to wxPython.doc.tgz as application gzip
uploading test\xlate-lp.html to xlate-lp.html as text
uploading test\zaurus0.jpg to zaurus0.jpg as image
uploading test\zaurus1.jpg to zaurus1.jpg as image
uploading test\zaurus2.jpg to zaurus2.jpg as image
uploading test\zoo-jan-03.jpg to zoo-jan-03.jpg as image
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.

For my site and on my current laptop and wireless broadband connection, this process typically takes six minutes, depending on server load. As with the download script, I often run this command from the local directory where my web files are kept, and I pass Python the full path to the script. When I run this on a Linux server, it works in the same way, but the paths to the script and my web files directory differ.^[50]

Refactoring Uploads and Downloads for Reuse

The directory upload and download scripts of the prior two sections work as advertised and, apart from the mimetypes logic, were the only FTP examples that were included in the second edition of this book. If you look at these two scripts long enough, though, their similarities will pop out at you eventually. In fact, they are largely the same—they use identical code to configure transfer parameters, connect to the FTP server, and determine file type. The exact details have been lost to time, but some of this code was certainly copied from one file to the other.

Although such redundancy isn’t a cause for alarm if we never plan on changing these scripts, it can be a killer in software projects in general. When you have two copies of identical bits of code, not only is there a danger of them becoming out of sync over time (you’ll lose uniformity in user interface and behavior), but you also effectively double your effort when it comes time to change code that appears in both places. Unless you’re a big fan of extra work, it pays to avoid redundancy wherever possible.

This redundancy is especially glaring when we look at the complex code that uses mimetypes to determine file types. Repeating magic like this in more than one place is almost always a bad idea—not only do we have to remember how it works every time we need the same utility, but it is a recipe for errors.

Refactoring with functions

As originally coded, our download and upload scripts comprise top-level script code that relies on global variables. Such a structure is difficult to reuse—code runs immediately on imports, and it’s difficult to generalize for varying contexts. Worse, it’s difficult to maintain—when you program by cut-and-paste of existing code, you increase the cost of future changes every time you click the Paste button.

To demonstrate how we might do better, Example 13-12 shows one way to refactor (reorganize) the download script. By wrapping its parts in functions, they become reusable in other modules, including our upload program.

Example 13-12. PP4E\Internet\Ftp\Mirror\downloadflat_modular.py

#!/bin/env python
"""
##############################################################################
use FTP to copy (download) all files from a remote site and directory
to a directory on the local machine; this version works the same, but has
been refactored to wrap up its code in functions that can be reused by the
uploader, and possibly other programs in the future - else code redundancy,
which may make the two diverge over time, and can double maintenance costs.
##############################################################################
"""

import os, sys, ftplib
from getpass   import getpass
from mimetypes import guess_type, add_type

defaultSite = 'home.rmi.net'
defaultRdir = '.'
defaultUser = 'lutz'

def configTransfer(site=defaultSite, rdir=defaultRdir, user=defaultUser):
    """
    get upload or download parameters
    uses a class due to the large number
    """
    class cf: pass
    cf.nonpassive = False                 # passive FTP on by default in 2.1+
    cf.remotesite = site                  # transfer to/from this site
    cf.remotedir  = rdir                  # and this dir ('.' means acct root)
    cf.remoteuser = user
    cf.localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
    cf.cleanall   = input('Clean target directory first? ')[:1] in ['y','Y']
    cf.remotepass = getpass(
                    'Password for %s on %s:' % (cf.remoteuser, cf.remotesite))
    return cf

def isTextKind(remotename, trace=True):
    """
    use mimetype to guess if filename means text or binary
    for 'f.html',  guess is ('text/html', None): text
    for 'f.jpeg'   guess is ('image/jpeg', None): binary
    for 'f.txt.gz' guess is ('text/plain', 'gzip'): binary
    for unknowns,  guess may be (None, None): binary
    mimetype can also guess name from type: see PyMailGUI
    """
    add_type('text/x-python-win', '.pyw')                       # not in tables
    mimetype, encoding = guess_type(remotename, strict=False)   # allow extras
    mimetype  = mimetype or '?/?'                               # type unknown?
    maintype  = mimetype.split('/')[0]                          # get first part
    if trace: print(maintype, encoding or '')
    return maintype == 'text' and encoding == None              # not compressed

def connectFtp(cf):
    print('connecting...')
    connection = ftplib.FTP(cf.remotesite)           # connect to FTP site
    connection.login(cf.remoteuser, cf.remotepass)   # log in as user/password
    connection.cwd(cf.remotedir)                     # cd to directory to xfer
    if cf.nonpassive:                                # force active mode FTP
        connection.set_pasv(False)                   # most servers do passive
    return connection

def cleanLocals(cf):
    """
    try to delete all locals files first to remove garbage
    """
    if cf.cleanall:
        for localname in os.listdir(cf.localdir):    # local dirlisting
            try:                                     # local file delete
                print('deleting local', localname)
                os.remove(os.path.join(cf.localdir, localname))
            except:
                print('cannot delete local', localname)

def downloadAll(cf, connection):
    """
    download all files from remote site/dir per cf config
    ftp nlst() gives files list, dir() gives full details
    """
    remotefiles = connection.nlst()                  # nlst is remote listing
    for remotename in remotefiles:
        if remotename in ('.', '..'): continue
        localpath = os.path.join(cf.localdir, remotename)
        print('downloading', remotename, 'to', localpath, 'as', end=' ')
        if isTextKind(remotename):
            # use text mode xfer
            localfile = open(localpath, 'w', encoding=connection.encoding)
            def callback(line): localfile.write(line + '\n')
            connection.retrlines('RETR ' + remotename, callback)
        else:
            # use binary mode xfer
            localfile = open(localpath, 'wb')
            connection.retrbinary('RETR ' + remotename, localfile.write)
        localfile.close()
    connection.quit()
    print('Done:', len(remotefiles), 'files downloaded.')

if __name__ == '__main__':
    cf = configTransfer()
    conn = connectFtp(cf)
    cleanLocals(cf)          # don't delete if can't connect
    downloadAll(cf, conn)

Compare this version with the original. This script, and every other in this section, runs the same as the original flat download and upload programs. Although we haven’t changed its behavior, we’ve modified the script’s software structure radically: its code is now a set of tools that can be imported and reused in other programs.

The refactored upload program in Example 13-13, for instance, is now noticeably simpler, and the code it shares with the download script only needs to be changed in one place if it ever requires improvement.

Example 13-13. PP4E\Internet\Ftp\Mirror\uploadflat_modular.py

#!/bin/env python
"""
##############################################################################
use FTP to upload all files from a local dir to a remote site/directory;
this version reuses downloader's functions, to avoid code redundancy;
##############################################################################
"""

import os
from downloadflat_modular import configTransfer, connectFtp, isTextKind

def cleanRemotes(cf, connection):
    """
    try to delete all remote files first to remove garbage
    """
    if cf.cleanall:
        for remotename in connection.nlst():            # remote dir listing
            try:                                        # remote file delete
                print('deleting remote', remotename)    # skips . and .. exc
                connection.delete(remotename)
            except:
                print('cannot delete remote', remotename)

def uploadAll(cf, connection):
    """
    upload all files to remote site/dir per cf config
    listdir() strips dir path, any failure ends script
    """
    localfiles = os.listdir(cf.localdir)            # listdir is local listing
    for localname in localfiles:
        localpath = os.path.join(cf.localdir, localname)
        print('uploading', localpath, 'to', localname, 'as', end=' ')
        if isTextKind(localname):
            # use text mode xfer
            localfile = open(localpath, 'rb')
            connection.storlines('STOR ' + localname, localfile)
        else:
            # use binary mode xfer
            localfile = open(localpath, 'rb')
            connection.storbinary('STOR ' + localname, localfile)
        localfile.close()
    connection.quit()
    print('Done:', len(localfiles), 'files uploaded.')

if __name__ == '__main__':
    cf = configTransfer(site='learning-python.com', rdir='books', user='lutz')
    conn = connectFtp(cf)
    cleanRemotes(cf, conn)
    uploadAll(cf, conn)

Not only is the upload script simpler now because it reuses common code, but it will also inherit any changes made in the download module. For instance, the isTextKind function was later augmented with code that adds the .pyw extension to mimetypes tables (this file type is not recognized by default); because it is a shared function, the change is automatically picked up in the upload program, too.

This script and the one it imports achieve the same goals as the originals, but changing them for easier code maintenance is a big deal in the real world of software development. The following, for example, downloads the site from one server and uploads to another:

C:\...\PP4E\Internet\Ftp\Mirror> python downloadflat_modular.py test
Clean target directory first?
Password for lutz on home.rmi.net:
connecting...
downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
...lines omitted...
downloading relo-feb010-index.html to test\relo-feb010-index.html as text
Done: 297 files downloaded.

C:\...\PP4E\Internet\Ftp\Mirror> python uploadflat_modular.py test
Clean target directory first?
Password for lutz on learning-python.com:
connecting...
uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
...lines omitted...
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.

Refactoring with classes

The function-based approach of the last two examples addresses the redundancy issue, but it is perhaps clumsier than it needs to be. For instance, their cf configuration options object provides a namespace that replaces global variables and breaks cross-file dependencies. Once we start making objects to model namespaces, though, Python’s OOP support tends to be a more natural structure for our code. As one last twist, Example 13-14 refactors the FTP code one more time in order to leverage Python’s class feature.

Example 13-14. PP4E\Internet\Ftp\Mirror\ftptools.py

#!/bin/env python
"""
##############################################################################
use FTP to download or upload all files in a single directory from/to a
remote site and directory;  this version has been refactored to use classes
and OOP for namespace and a natural structure;  we could also structure this
as a download superclass, and an upload subclass which redefines the clean
and transfer methods, but then there is no easy way for another client to
invoke both an upload and download;  for the uploadall variant and possibly
others, also make single file upload/download code in orig loops methods;
##############################################################################
"""

import os, sys, ftplib
from getpass   import getpass
from mimetypes import guess_type, add_type

# defaults for all clients
dfltSite = 'home.rmi.net'
dfltRdir = '.'
dfltUser = 'lutz'

class FtpTools:

    # allow these 3 to be redefined
    def getlocaldir(self):
        return (len(sys.argv) > 1 and sys.argv[1]) or '.'

    def getcleanall(self):
        return input('Clean target dir first?')[:1] in ['y','Y']

    def getpassword(self):
        return getpass(
               'Password for %s on %s:' % (self.remoteuser, self.remotesite))

    def configTransfer(self, site=dfltSite, rdir=dfltRdir, user=dfltUser):
        """
        get upload or download parameters
        from module defaults, args, inputs, cmdline
        anonymous ftp: user='anonymous' pass=emailaddr
        """
        self.nonpassive = False             # passive FTP on by default in 2.1+
        self.remotesite = site              # transfer to/from this site
        self.remotedir  = rdir              # and this dir ('.' means acct root)
        self.remoteuser = user
        self.localdir   = self.getlocaldir()
        self.cleanall   = self.getcleanall()
        self.remotepass = self.getpassword()

    def isTextKind(self, remotename, trace=True):
        """
        use mimetypes to guess if filename means text or binary
        for 'f.html',  guess is ('text/html', None): text
        for 'f.jpeg'   guess is ('image/jpeg', None): binary
        for 'f.txt.gz' guess is ('text/plain', 'gzip'): binary
        for unknowns,  guess may be (None, None): binary
        mimetypes can also guess name from type: see PyMailGUI
        """
        add_type('text/x-python-win', '.pyw')                    # not in tables
        mimetype, encoding = guess_type(remotename, strict=False)# allow extras
        mimetype  = mimetype or '?/?'                            # type unknown?
        maintype  = mimetype.split('/')[0]                       # get 1st part
        if trace: print(maintype, encoding or '')
        return maintype == 'text' and encoding == None           # not compressed

    def connectFtp(self):
        print('connecting...')
        connection = ftplib.FTP(self.remotesite)           # connect to FTP site
        connection.login(self.remoteuser, self.remotepass) # log in as user/pswd
        connection.cwd(self.remotedir)                     # cd to dir to xfer
        if self.nonpassive:                                # force active mode
            connection.set_pasv(False)                     # most do passive
        self.connection = connection

    def cleanLocals(self):
        """
        try to delete all local files first to remove garbage
        """
        if self.cleanall:
            for localname in os.listdir(self.localdir):    # local dirlisting
                try:                                       # local file delete
                    print('deleting local', localname)
                    os.remove(os.path.join(self.localdir, localname))
                except:
                    print('cannot delete local', localname)

    def cleanRemotes(self):
        """
        try to delete all remote files first to remove garbage
        """
        if self.cleanall:
            for remotename in self.connection.nlst():       # remote dir listing
                try:                                        # remote file delete
                    print('deleting remote', remotename)
                    self.connection.delete(remotename)
                except:
                    print('cannot delete remote', remotename)

    def downloadOne(self, remotename, localpath):
        """
        download one file by FTP in text or binary mode
        local name need not be same as remote name
        """
        if self.isTextKind(remotename):
            localfile = open(localpath, 'w', encoding=self.connection.encoding)
            def callback(line): localfile.write(line + '\n')
            self.connection.retrlines('RETR ' + remotename, callback)
        else:
            localfile = open(localpath, 'wb')
            self.connection.retrbinary('RETR ' + remotename, localfile.write)
        localfile.close()

    def uploadOne(self, localname, localpath, remotename):
        """
        upload one file by FTP in text or binary mode
        remote name need not be same as local name
        """
        if self.isTextKind(localname):
            localfile = open(localpath, 'rb')
            self.connection.storlines('STOR ' + remotename, localfile)
        else:
            localfile = open(localpath, 'rb')
            self.connection.storbinary('STOR ' + remotename, localfile)
        localfile.close()

    def downloadDir(self):
        """
        download all files from remote site/dir per config
        ftp nlst() gives files list, dir() gives full details
        """
        remotefiles = self.connection.nlst()         # nlst is remote listing
        for remotename in remotefiles:
            if remotename in ('.', '..'): continue
            localpath = os.path.join(self.localdir, remotename)
            print('downloading', remotename, 'to', localpath, 'as', end=' ')
            self.downloadOne(remotename, localpath)
        print('Done:', len(remotefiles), 'files downloaded.')

    def uploadDir(self):
        """
        upload all files to remote site/dir per config
        listdir() strips dir path, any failure ends script
        """
        localfiles = os.listdir(self.localdir)       # listdir is local listing
        for localname in localfiles:
            localpath = os.path.join(self.localdir, localname)
            print('uploading', localpath, 'to', localname, 'as', end=' ')
            self.uploadOne(localname, localpath, localname)
        print('Done:', len(localfiles), 'files uploaded.')

    def run(self, cleanTarget=lambda:None, transferAct=lambda:None):
        """
        run a complete FTP session
        default clean and transfer are no-ops
        don't delete if can't connect to server
        """
        self.connectFtp()
        cleanTarget()
        transferAct()
        self.connection.quit()

if __name__ == '__main__':
    ftp = FtpTools()
    xfermode = 'download'
    if len(sys.argv) > 1:
        xfermode = sys.argv.pop(1)   # get+del 2nd arg
    if xfermode == 'download':
        ftp.configTransfer()
        ftp.run(cleanTarget=ftp.cleanLocals,  transferAct=ftp.downloadDir)
    elif xfermode == 'upload':
        ftp.configTransfer(site='learning-python.com', rdir='books', user='lutz')
        ftp.run(cleanTarget=ftp.cleanRemotes, transferAct=ftp.uploadDir)
    else:
        print('Usage: ftptools.py ["download" | "upload"] [localdir]')

In fact, this last mutation combines uploads and downloads into a single file, because they are so closely related. As before, common code is factored into methods to avoid redundancy. New here, the instance object itself becomes a natural namespace for storing configuration options (they become self attributes). Study this example’s code for more details of the restructuring applied.

Again, this revision runs the same as our original site download and upload scripts; see its self-test code at the end for usage details, and pass in a command-line argument to specify “download” or “upload.” We haven’t changed what it does; we’ve refactored it for maintainability and reuse:

C:\...\PP4E\Internet\Ftp\Mirror> ftptools.py download test
Clean target dir first?
Password for lutz on home.rmi.net:
connecting...
downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
...lines omitted...
downloading relo-feb010-index.html to test\relo-feb010-index.html as text
Done: 297 files downloaded.

C:\...\PP4E\Internet\Ftp\Mirror> ftptools.py upload test
Clean target dir first?
Password for lutz on learning-python.com:
connecting...
uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
...lines omitted...
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.

Although this file can still be run as a command-line script like this, its class is really now a package of FTP tools that can be mixed into other programs and reused. By wrapping its code in a class, it can be easily customized by redefining its methods—its configuration calls, such as getlocaldir, for example, may be redefined in subclasses for custom scenarios.
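
As a rough illustration of this sort of customization, the sketch below redefines the three configuration methods so that nothing is read from the console or command line. To keep it self-contained, it subclasses a cut-down stand-in that copies only the configuration logic of FtpTools; in real use you would import FtpTools from ftptools.py and subclass that instead. The BatchFtp class and the values passed to it are hypothetical.

```python
from getpass import getpass

class FtpToolsStandIn:
    """
    cut-down stand-in for FtpTools: same three configuration
    methods and configTransfer logic, minus all the FTP calls
    """
    def getlocaldir(self):  return '.'
    def getcleanall(self):  return input('Clean target dir first?')[:1] in ['y', 'Y']
    def getpassword(self):  return getpass('Password:')

    def configTransfer(self, site='home.rmi.net', rdir='.', user='lutz'):
        self.remotesite, self.remotedir, self.remoteuser = site, rdir, user
        self.localdir   = self.getlocaldir()
        self.cleanall   = self.getcleanall()
        self.remotepass = self.getpassword()

class BatchFtp(FtpToolsStandIn):
    """
    hypothetical noninteractive client: redefine the configuration
    methods so settings come from constructor arguments instead
    """
    def __init__(self, localdir, password):
        self.fixeddir, self.fixedpass = localdir, password
    def getlocaldir(self):  return self.fixeddir
    def getcleanall(self):  return False         # never delete unprompted
    def getpassword(self):  return self.fixedpass

ftp = BatchFtp('test', 'secret')
ftp.configTransfer()                             # no prompts: methods are redefined
print(ftp.localdir, ftp.cleanall, ftp.remotesite)    # test False home.rmi.net
```

Because configTransfer calls the three methods through self, the subclass versions are picked up automatically; the superclass code does not need to change.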

Perhaps most importantly, using classes optimizes code reusability. Clients of this file can both upload and download directories by simply subclassing or embedding an instance of this class and calling its methods. To see one example of how, let’s move on to the next section.

Transferring Directory Trees with ftplib

Perhaps the biggest limitation of the website download and upload scripts we just met is that they assume the site directory is flat (hence their names). That is, the preceding scripts transfer simple files only, and none of them handle nested subdirectories within the web directory to be transferred.

For my purposes, that’s often a reasonable constraint. I avoid nested subdirectories to keep things simple, and I store my book support home website as a simple directory of files. For other sites, though, including one I keep at another machine, site transfer scripts are easier to use if they also automatically transfer subdirectories along the way.

Uploading Local Trees

It turns out that supporting directories on uploads is fairly simple—we need to add only a bit of recursion and remote directory creation calls. The upload script in Example 13-15 extends the class-based version we just saw in Example 13-14, to handle uploading all subdirectories nested within the transferred directory. Furthermore, it recursively transfers subdirectories within subdirectories—the entire directory tree contained within the top-level transfer directory is uploaded to the target directory at the remote server.

In terms of its code structure, Example 13-15 is just a customization of the FtpTools class of the prior section—really, we’re just adding a method for recursive uploads, by subclassing. As one consequence, we get tools such as parameter configuration, content type testing, and connection and upload code for free here; with OOP, some of the work is done before we start.

Example 13-15. PP4E\Internet\Ftp\Mirror\uploadall.py

#!/bin/env python
"""
############################################################################
extend the FtpTools class to upload all files and subdirectories from a
local dir tree to a remote site/dir; supports nested dirs too, but not
the cleanall option (that requires parsing FTP listings to detect remote
dirs: see cleanall.py); to upload subdirectories, uses os.path.isdir(path)
to see if a local file is really a directory, FTP().mkd(path) to make dirs
on the remote machine (wrapped in a try in case it already exists there),
and recursion to upload all files/dirs inside the nested subdirectory.
############################################################################
"""

import os, ftptools

class UploadAll(ftptools.FtpTools):
    """
    upload an entire tree of subdirectories
    assumes top remote directory exists
    """
    def __init__(self):
        self.fcount = self.dcount = 0

    def getcleanall(self):
        return False  # don't even ask

    def uploadDir(self, localdir):
        """
        for each directory in an entire tree
        upload simple files, recur into subdirectories
        """
        localfiles = os.listdir(localdir)
        for localname in localfiles:
            localpath = os.path.join(localdir, localname)
            print('uploading', localpath, 'to', localname, end=' ')
            if not os.path.isdir(localpath):
                self.uploadOne(localname, localpath, localname)
                self.fcount += 1
            else:
                try:
                    self.connection.mkd(localname)
                    print('directory created')
                except Exception:                          # e.g., already exists on server
                    print('directory not created')
                self.connection.cwd(localname)             # change remote dir
                self.uploadDir(localpath)                  # upload local subdir
                self.connection.cwd('..')                  # change back up
                self.dcount += 1
                print('directory exited')

if __name__ == '__main__':
    ftp = UploadAll()
    ftp.configTransfer(site='learning-python.com', rdir='training', user='lutz')
    ftp.run(transferAct = lambda: ftp.uploadDir(ftp.localdir))
    print('Done:', ftp.fcount, 'files and', ftp.dcount, 'directories uploaded.')

Like the flat upload script, this one can be run on any machine with Python and sockets and upload to any machine running an FTP server; I run it both on my laptop PC and on other servers by Telnet or SSH to upload sites to my ISP.

The crux of the matter in this script is the os.path.isdir test near the top; if this test detects a directory in the current local directory, we create an identically named directory on the remote machine with connection.mkd and descend into it with connection.cwd, and recur into the subdirectory on the local machine (we have to use recursive calls here, because the shape and depth of the tree are arbitrary). Like all FTP object methods, mkd and cwd methods issue FTP commands to the remote server. When we exit a local subdirectory, we run a remote cwd('..') to climb to the remote parent directory and continue; the recursive call level’s return restores the prior directory on the local machine. The rest of the script is roughly the same as the original.

In the interest of space, I’ll leave studying this variant in more depth as a suggested exercise. For more context, try changing this script so as not to assume that the top-level remote directory already exists. As usual in software, there are a variety of implementation and operation options here.
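
As one possible starting point for that exercise, the following is a minimal sketch, not part of the book's examples. The function name ensure_remote_dir is hypothetical, and the code assumes that a failed cwd means the directory is missing, which may not hold on every server. It uses only the standard cwd and mkd methods of an ftplib.FTP object, and treats the path as relative to the current remote directory:

```python
import ftplib

def ensure_remote_dir(connection, remotedir):
    """
    cd into remotedir on the server, creating any missing levels first;
    connection is a logged-in ftplib.FTP object (or any object with
    compatible cwd and mkd methods); remotedir is a '/'-separated path
    taken relative to the current remote directory (an assumption here)
    """
    for part in remotedir.strip('/').split('/'):
        if not part:
            continue                          # skip empty path components
        try:
            connection.cwd(part)              # already there: just descend
        except ftplib.error_perm:             # 5xx reply: assume it is missing
            connection.mkd(part)              # make it on the server
            connection.cwd(part)              # then descend into it
```

A subclass of UploadAll might call this once after connecting and before starting the recursive upload, instead of relying on the remote directory having been made in advance.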

Here is the sort of output displayed on the console when the upload-all script is run, uploading a site with multiple subdirectory levels which I maintain with site builder tools. It’s similar to the flat upload (which you might expect, given that it is reusing much of the same code by inheritance), but notice that it traverses and uploads nested subdirectories along the way:

C:\...\PP4E\Internet\Ftp\Mirror> uploadall.py Website-Training
Password for lutz on learning-python.com:
connecting...
uploading Website-Training\2009-public-classes.htm to 2009-public-classes.htm text
uploading Website-Training\2010-public-classes.html to 2010-public-classes.html text
uploading Website-Training\about.html to about.html text
uploading Website-Training\books to books directory created
uploading Website-Training\books\index.htm to index.htm text
uploading Website-Training\books\index.html to index.html text
uploading Website-Training\books\_vti_cnf to _vti_cnf directory created
uploading Website-Training\books\_vti_cnf\index.htm to index.htm text
uploading Website-Training\books\_vti_cnf\index.html to index.html text
directory exited
directory exited
uploading Website-Training\calendar.html to calendar.html text
uploading Website-Training\contacts.html to contacts.html text
uploading Website-Training\estes-nov06.htm to estes-nov06.htm text
uploading Website-Training\formalbio.html to formalbio.html text
uploading Website-Training\fulloutline.html to fulloutline.html text

...lines omitted...

uploading Website-Training\_vti_pvt\writeto.cnf to writeto.cnf ?
uploading Website-Training\_vti_pvt\_vti_cnf to _vti_cnf directory created
uploading Website-Training\_vti_pvt\_vti_cnf\_x_todo.htm to _x_todo.htm text
uploading Website-Training\_vti_pvt\_vti_cnf\_x_todoh.htm to _x_todoh.htm text
directory exited
uploading Website-Training\_vti_pvt\_x_todo.htm to _x_todo.htm text
uploading Website-Training\_vti_pvt\_x_todoh.htm to _x_todoh.htm text
directory exited
Done: 366 files and 18 directories uploaded.

As is, the script of Example 13-15 handles only directory tree uploads; recursive uploads are generally more useful than recursive downloads if you maintain your websites on your local PC and upload to a server periodically, as I do. To also download (mirror) a website that has subdirectories, a script must parse the output of a remote listing command to detect remote directories. For the same reason, the recursive upload script was not coded to support the remote directory tree cleanup option of the original—such a feature would require parsing remote listings as well. The next section shows how.

Deleting Remote Trees

One last example of code reuse at work: when I initially tested the prior section’s upload-all script, it contained a bug that caused it to fall into an infinite recursion loop, and keep copying the full site into new subdirectories, over and over, until the FTP server kicked me off (not an intended feature of the program!). In fact, the upload got 13 levels deep before being killed by the server; it effectively locked my site until the mess could be repaired.

To get rid of all the files accidentally uploaded, I quickly wrote the script in Example 13-16 in emergency (really, panic) mode; it deletes all files and nested subdirectories in an entire remote tree. Luckily, this was very easy to do given all the reuse that Example 13-16 inherits from the FtpTools superclass. Here, we just have to define the extension for recursive remote deletions. Even in tactical mode like this, OOP can be a decided advantage.

Example 13-16. PP4E\Internet\Ftp\Mirror\cleanall.py

#!/bin/env python
"""
##############################################################################
extend the FtpTools class to delete files and subdirectories from a remote
directory tree; supports nested directories too;  depends on the dir()
command output format, which may vary on some servers! - see Python's
Tools\Scripts\ftpmirror.py for hints;  extend me for remote tree downloads;
##############################################################################
"""

from ftptools import FtpTools

class CleanAll(FtpTools):
    """
    delete an entire remote tree of subdirectories
    """
    def __init__(self):
        self.fcount = self.dcount = 0

    def getlocaldir(self):
        return None  # irrelevant here

    def getcleanall(self):
        return True  # implied here

    def cleanDir(self):
        """
        for each item in current remote directory,
        del simple files, recur into and then del subdirectories
        the dir() ftp call passes each line to a func or method
        """
        lines = []                                   # each level has own lines
        self.connection.dir(lines.append)            # list current remote dir
        for line in lines:
            parsed  = line.split()                   # split on whitespace
            permiss = parsed[0]                      # assume 'drw... ... filename'
            fname   = parsed[-1]
            if fname in ('.', '..'):                 # some include cwd and parent
                continue
            elif permiss[0] != 'd':                  # simple file: delete
                print('file', fname)
                self.connection.delete(fname)
                self.fcount += 1
            else:                                    # directory: recur, del
                print('directory', fname)
                self.connection.cwd(fname)           # chdir into remote dir
                self.cleanDir()                      # clean subdirectory
                self.connection.cwd('..')            # chdir remote back up
                self.connection.rmd(fname)           # delete empty remote dir
                self.dcount += 1
                print('directory exited')

if __name__ == '__main__':
    ftp = CleanAll()
    ftp.configTransfer(site='learning-python.com', rdir='training', user='lutz')
    ftp.run(cleanTarget=ftp.cleanDir)
    print('Done:', ftp.fcount, 'files and', ftp.dcount, 'directories cleaned.')

Besides again being recursive in order to handle arbitrarily shaped trees, the main trick employed here is to parse the output of a remote directory listing. The FTP nlst call used earlier gives us a simple list of filenames; here, we use dir to also get file detail lines like these:

C:\...\PP4E\Internet\Ftp> ftp learning-python.com
ftp> cd training
ftp> dir
drwxr-xr-x   11 5693094  450          4096 May  4 11:06 .
drwx---r-x   19 5693094  450          8192 May  4 10:59 ..
-rw----r--    1 5693094  450         15825 May  4 11:02 2009-public-classes.htm
-rw----r--    1 5693094  450         18084 May  4 11:02 2010-public-classes.html
drwx---r-x    3 5693094  450          4096 May  4 11:02 books
-rw----r--    1 5693094  450          3783 May  4 11:02 calendar-save-aug09.html
-rw----r--    1 5693094  450          3923 May  4 11:02 calendar.html
drwx---r-x    2 5693094  450          4096 May  4 11:02 images
-rw----r--    1 5693094  450          6143 May  4 11:02 index.html
...lines omitted...

This output format is potentially server-specific, so check it on your own server before relying on this script. For this Unix ISP, if the first character of the first item on the line is the character “d”, the filename at the end of the line names a remote directory. To parse, the script simply splits on whitespace to extract parts of a line.
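
To make the parsing step concrete, here is a small standalone sketch. It is not the book's code: the function name parse_listing_line is hypothetical, and it assumes the nine-column Unix-style format shown above. Unlike a plain split, it limits the number of splits so that a filename containing spaces stays in one piece; symbolic link lines and non-Unix formats are not handled:

```python
def parse_listing_line(line):
    """
    return (isdir, filename) for one Unix-style FTP listing line,
    or None if the line does not have the expected nine columns;
    assumes 'perms links owner group size month day time-or-year name'
    """
    fields = line.rstrip().split(None, 8)     # at most 9 fields: name keeps its spaces
    if len(fields) < 9:
        return None                           # e.g., a 'total N' summary line
    permissions, filename = fields[0], fields[8]
    return permissions.startswith('d'), filename
```

For the books line in the listing above, this returns (True, 'books'); for the index.html line, it returns (False, 'index.html').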

Notice how this script, like others before it, must skip the symbolic “.” and “..” current and parent directory names in listings to work properly for this server. Oddly, this can vary per server as well; one of the servers I used for this book’s examples, for instance, does not include these special names in listings. We can verify by running ftplib at the interactive prompt, as though it were a portable FTP client interface:

C:\...\PP4E\Internet\Ftp> python
>>> from ftplib import FTP
>>> f = FTP('ftp.rmi.net')
>>> f.login('lutz', 'xxxxxxxx')         # output lines omitted
>>> for x in f.nlst()[:3]: print(x)     # no . or .. in listings
...
2004-longmont-classes.html
2005-longmont-classes.html
2006-longmont-classes.html

>>> L = []
>>> f.dir(L.append)                     # ditto for detailed list
>>> for x in L[:3]: print(x)
...
-rw-r--r--   1 ftp      ftp          8173 Mar 19  2006 2004-longmont-classes.html
-rw-r--r--   1 ftp      ftp          9739 Mar 19  2006 2005-longmont-classes.html
-rw-r--r--   1 ftp      ftp           805 Jul  8  2006 2006-longmont-classes.html

On the other hand, the server I’m using in this section does include the special dot names; to be robust, our scripts must skip over these names in remote directory listings just in case they’re run against a server that includes them (here, the test is required to avoid falling into an infinite recursive loop!). We don’t need to care about local directory listings because Python’s os.listdir never includes “.” or “..” in its result, but things are not quite so consistent in the “Wild West” that is the Internet today:

>>> f = FTP('learning-python.com')
>>> f.login('lutz', 'xxxxxxxx')         # output lines omitted
>>> for x in f.nlst()[:5]: print(x)     # includes . and .. here
...
.
..
.hcc.thumbs
2009-public-classes.htm
2010-public-classes.html

>>> L = []
>>> f.dir(L.append)                     # ditto for detailed list
>>> for x in L[:5]: print(x)
...
drwx---r-x   19 5693094  450          8192 May  4 10:59 .
drwx---r-x   19 5693094  450          8192 May  4 10:59 ..
drwx------    2 5693094  450          4096 Feb 18 05:38 .hcc.thumbs
-rw----r--    1 5693094  450         15824 May  1 14:39 2009-public-classes.htm
-rw----r--    1 5693094  450         18083 May  4 09:05 2010-public-classes.html

The output of our clean-all script in action follows; it shows up in the system console window where the script is run. You might be able to achieve the same effect with an “rm -rf” Unix shell command in an SSH or Telnet window on some servers, but the Python script runs on the client and requires no remote access other than basic FTP:

C:\PP4E\Internet\Ftp\Mirror> cleanall.py
Password for lutz on learning-python.com:
connecting...
file 2009-public-classes.htm
file 2010-public-classes.html
file Learning-Python-interview.doc
file Python-registration-form-010.pdf
file PythonPoweredSmall.gif
directory _derived
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn.gif
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_deepblue100_vbtn.gif
directory _vti_cnf
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn.gif
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_deepblue100_vbtn.gif
directory exited
directory exited

...lines omitted...

file priorclients.html
file public_classes.htm
file python_conf_ora.gif
file topics.html
Done: 366 files and 18 directories cleaned.

Downloading Remote Trees

It is possible to extend this remote tree-cleaner to also download a remote tree with subdirectories: rather than deleting, as you walk the remote tree simply create a local directory to match a remote one, and download nondirectory files. We’ll leave this final step as a suggested exercise, though, partly because its dependence on the format produced by server directory listings makes it complex to be robust and partly because this use case is less common for me—in practice, I am more likely to maintain a site on my PC and upload to the server than to download a tree.
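
For readers who want a head start on that exercise, the following is one rough sketch of the idea, written as a standalone function rather than as an FtpTools subclass. The name download_tree is hypothetical, every file is fetched in binary mode for simplicity, and the listing parse assumes the same nine-column Unix format used by the clean-all script, so it is likely to need changes for other servers:

```python
import os

def download_tree(connection, localdir):
    """
    copy the current remote directory and everything below it into
    localdir; connection is a logged-in ftplib.FTP object (or any object
    with compatible dir, cwd, and retrbinary methods); returns a tuple
    (files downloaded, directories visited below the top)
    """
    os.makedirs(localdir, exist_ok=True)           # make local dir to match remote
    lines = []                                     # each level has own lines
    connection.dir(lines.append)                   # list current remote dir
    fcount = dcount = 0
    for line in lines:
        fields = line.rstrip().split(None, 8)      # assume 9-column Unix format
        if len(fields) < 9:
            continue                               # skip 'total N' and the like
        permiss, fname = fields[0], fields[8]
        if fname in ('.', '..'):                   # some servers include these
            continue
        localpath = os.path.join(localdir, fname)
        if permiss[0] == 'd':                      # directory: recur into it
            connection.cwd(fname)
            subfiles, subdirs = download_tree(connection, localpath)
            connection.cwd('..')                   # climb back up on the server
            fcount += subfiles
            dcount += subdirs + 1
        elif permiss[0] == '-':                    # simple file: fetch as bytes
            with open(localpath, 'wb') as localfile:
                connection.retrbinary('RETR ' + fname, localfile.write)
            fcount += 1                            # links and others are skipped
    return fcount, dcount
```

Given a connected and logged-in FTP object that has already changed into the remote directory to be copied, a call such as download_tree(connection, 'sitecopy') would mirror it into a local sitecopy folder.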

If you do wish to experiment with a recursive download, though, be sure to consult the script Tools\Scripts\ftpmirror.py in Python’s install or source tree for hints. That script attempts to download a remote directory tree by FTP, and allows for various directory listing formats which we’ll skip here in the interest of space. For our purposes, it’s time to move on to the next protocol on our tour: Internet email.

Processing Internet Email

Some of the other most common, higher-level Internet protocols have to do with reading and sending email messages: POP and IMAP for fetching email from servers, SMTP for sending new messages, and other formalisms such as RFC822 for specifying email message content and format. You don’t normally need to know about such acronyms when using common email tools, but internally, programs like Microsoft Outlook and webmail systems generally talk to POP and SMTP servers to do your bidding.

Like FTP, email ultimately consists of formatted commands and byte streams shipped over sockets and ports (port 110 for POP; 25 for SMTP). Regardless of the nature of its content and attachments, an email message is little more than a string of bytes sent and received through sockets. But also like FTP, Python has standard library modules to simplify all aspects of email processing:

poplib and imaplib for fetching email
smtplib for sending email
The email module package for parsing email and constructing email

These modules are related: for nontrivial messages, we typically use email to parse mail text which has been fetched with poplib and use email to compose mail text to be sent with smtplib. The email package also handles tasks such as address parsing, date and time formatting, attachment formatting and extraction, and encoding and decoding of email content (e.g., uuencode, Base64). Additional modules handle more specific tasks (e.g., mimetypes to map filenames to and from content types).
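
As a quick preview of how these pieces fit together, here is a minimal sketch that needs no server at all. The function names compose_text and summarize_text, and the addresses used with them, are made up for illustration; the library calls are standard ones from the email package:

```python
import email
from email.message import Message
from email.utils import formatdate, parseaddr

def compose_text(sender, recipient, subject, body):
    """build full mail text (headers, blank line, body) as a str"""
    msg = Message()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg['Date'] = formatdate(localtime=True)     # standard email date string
    msg.set_payload(body)                        # simple text body, no attachments
    return msg.as_string()                       # text suitable for sending

def summarize_text(mailtext):
    """parse full mail text into a message object and pick out parts"""
    msg = email.message_from_string(mailtext)
    realname, address = parseaddr(msg['From'])   # split 'Name <addr>' pairs
    return address, msg['Subject'], msg.get_payload()
```

Text produced by compose_text is the sort of string we would pass to smtplib for sending, and summarize_text does to it roughly what a mail reader does to text fetched with poplib. Similarly, mimetypes.guess_type('index.html') maps a filename to a content type such as text/html.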

In the next few sections, we explore the POP and SMTP interfaces for fetching and sending email from and to servers, and the email package interfaces for parsing and composing email message text. Other email interfaces in Python are analogous and are documented in the Python library reference manual.^[51]

Unicode in Python 3.X and Email Tools

In the prior sections of this chapter, we studied how Unicode encodings can impact scripts using Python’s ftplib FTP tools in some depth, because it illustrates the implications of Python 3.X’s Unicode string model for real-world programming. In short:

All binary mode transfers should open local output and input files in binary mode (modes wb and rb).
Text-mode downloads should open local output files in text mode with explicit encoding names (mode w, with an encoding argument that defaults to latin1 within ftplib itself).
Text-mode uploads should open local input files in binary mode (mode rb).
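
The three rules above might be summarized in code as follows. This is only a sketch with hypothetical function names; it assumes connection is a logged-in ftplib.FTP object, and it reads the connection's own encoding attribute rather than hardcoding a name, because the default may differ between Python releases:

```python
def download_binary(connection, remotename, localpath):
    with open(localpath, 'wb') as localfile:             # binary output file
        connection.retrbinary('RETR ' + remotename, localfile.write)

def upload_binary(connection, localpath, remotename):
    with open(localpath, 'rb') as localfile:             # binary input file
        connection.storbinary('STOR ' + remotename, localfile)

def download_text(connection, remotename, localpath):
    encoding = connection.encoding                       # what ftplib uses to decode lines
    with open(localpath, 'w', encoding=encoding) as localfile:
        # retrlines strips line ends and passes each line as a str
        connection.retrlines('RETR ' + remotename,
                             lambda line: localfile.write(line + '\n'))

def upload_text(connection, localpath, remotename):
    with open(localpath, 'rb') as localfile:             # binary mode, even for text
        connection.storlines('STOR ' + remotename, localfile)
```

Using the same encoding to write the local file as ftplib used to decode the transferred lines is one simple way to keep the downloaded text consistent.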

The prior sections describe why these rules are in force. The last two points here differ for scripts written originally for Python 2.X. As you might expect, given that the underlying sockets transfer byte strings today, the email story is somewhat convoluted for Unicode in Python 3.X as well. As a brief preview:

Fetching: The poplib module returns fetched email text in bytes string form. Command text sent to the server is encoded per UTF8 internally, but replies are returned as raw binary bytes and not decoded into str text.
Sending: The smtplib module accepts email content to send as str strings. Internally, message text passed in str form is encoded to binary bytes for transmission using the ascii encoding scheme. Passing an already encoded bytes string to the send call may allow more explicit control.
Composing: The email package produces Unicode str strings containing plain text when generating full email text for sending with smtplib and accepts optional encoding specifications for messages and their parts, which it applies according to email standard rules. Message headers may also be encoded per email, MIME, and Unicode conventions.
Parsing: The email package in 3.1 currently requires raw email byte strings of the type fetched with poplib to be decoded into Unicode str strings as appropriate before it can be passed in to be parsed into a message object. This pre-parse decoding might be done by a default, user preference, mail headers inspection, or intelligent guess. Because this requirement raises difficult issues for package clients, it may be dropped in a future version of email and Python.
Navigating: The email package returns most message components as str strings, though parts content decoded by Base64 and other email encoding schemes may be returned as bytes strings, parts fetched without such decoding may be str or bytes, and some str string parts are internally encoded to bytes with scheme raw-unicode-escape before processing. Message headers may be decoded by the package on request as well.

If you’re migrating email scripts (or your mindset) from 2.X, you’ll need to treat email text fetched from a server as byte strings, and decode it to str before passing it along for parsing; scripts that send or compose email are generally unaffected (and this may be the majority of Python email-aware scripts), though content may have to be treated specially if it may be returned as byte strings.
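
The fetch-then-parse constraint might look something like the following in code. This is a sketch under stated assumptions, not the approach used by this book's later examples: the name parse_fetched is hypothetical, the lines argument is assumed to be a list of bytes lines like the one poplib's retr call returns, and falling back on latin1 is just one possible policy for text that does not decode under the preferred encoding:

```python
import email

def parse_fetched(lines, fetch_encoding='utf8'):
    """
    decode a fetched message's list of bytes lines to str, then parse;
    tries fetch_encoding first, and falls back on latin1 (an assumption:
    latin1 maps every byte to some character, so it cannot fail)
    """
    rawbytes = b'\n'.join(lines)
    try:
        text = rawbytes.decode(fetch_encoding)
    except UnicodeDecodeError:
        text = rawbytes.decode('latin1')
    return email.message_from_string(text)       # parser wants str here
```

Later Python releases added an email.message_from_bytes call that accepts bytes directly, which may make this manual decoding step unnecessary where it is available.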

This is the story in Python 3.1, which is of course prone to change over time. We’ll see how these email constraints translate into code as we move along in this section. Suffice it to say, the text on the Internet is not as simple as it used to be, though it probably shouldn’t have been anyhow.

POP: Fetching Email

I confess: up until just before 2000, I took a lowest-common-denominator approach to email. I preferred to check my messages by Telnetting to my ISP and using a simple command-line email interface. Of course, that’s not ideal for mail with attachments, pictures, and the like, but its portability was staggering—because Telnet runs on almost any machine with a network link, I was able to check my mail quickly and easily from anywhere on the planet. Given that I make my living traveling around the world teaching Python classes, this wild accessibility was a big win.

As with website maintenance, times have changed on this front. Somewhere along the way, most ISPs began offering web-based email access with similar portability and dropped Telnet altogether. When my ISP took away Telnet access, however, they also took away one of my main email access methods. Luckily, Python came to the rescue again—by writing email access scripts in Python, I could still read and send email from any machine in the world that has Python and an Internet connection. Python can be as portable a solution as Telnet, but much more powerful.

Moreover, I can still use these scripts as an alternative to tools suggested by the ISP. Besides my not being fond of delegating control to commercial products of large companies, closed email tools impose choices on users that are not always ideal and may sometimes fail altogether. In many ways, the motivation for coding Python email scripts is the same as it was for the larger GUIs in Chapter 11: the scriptability of Python programs can be a decided advantage.

For example, Microsoft Outlook historically and by default has preferred to download mail to your PC and delete it from the mail server as soon as you access it. This keeps your email box small (and your ISP happy), but it isn’t exactly friendly to people who travel and use multiple machines along the way—once accessed, you cannot get to a prior email from any machine except the one to which it was initially downloaded. Worse, the web-based email interfaces offered by my ISPs have at times gone offline completely, leaving me cut off from email (and usually at the worst possible time).

The next two scripts represent one first-cut solution to such portability and reliability constraints (we’ll see others in this and later chapters). The first, popmail.py, is a simple mail reader tool, which downloads and prints the contents of each email in an email account. This script is admittedly primitive, but it lets you read your email on any machine with Python and sockets; moreover, it leaves your email intact on the server, and isn’t susceptible to webmail outages. The second, smtpmail.py, is a one-shot script for writing and sending a new email message that is as portable as Python itself.

Later in this chapter, we’ll implement an interactive console-based email client (pymail), and later in this book we’ll code a full-blown GUI email tool (PyMailGUI) and a web-based email program of our own (PyMailCGI). For now, we’ll start with the basics.

Mail Configuration Module

Before we get to the scripts, let’s first take a look at a common module they import and use. The module in Example 13-17 is used to configure email parameters appropriately for a particular user. It’s simply a collection of assignments to variables used by mail programs that appear in this book; each major mail client has its own version, to allow content to vary. Isolating these configuration settings in this single module makes it easy to configure the book’s email programs for a particular user, without having to edit actual program logic code.

If you want to use any of this book’s email programs to do mail processing of your own, be sure to change its assignments to reflect your servers, account usernames, and so on (as shown, they refer to email accounts used for developing this book). Not all scripts use all of these settings; we’ll revisit this module in later examples to explain more of them.

Note that some ISPs may require that you be connected directly to their systems in order to use their SMTP servers to send mail. For example, when connected directly by dial-up in the past, I could use my ISP’s server directly, but when connected via broadband, I had to route requests through a cable Internet provider. You may need to adjust these settings to match your configuration; see your ISP to obtain the required POP and SMTP servers. Also, some SMTP servers check domain name validity in addresses, and may require an authenticating login step—see the SMTP section later in this chapter for interface details.

Example 13-17. PP4E\Internet\Email\mailconfig.py

"""
user configuration settings for various email programs (pymail/mailtools version);
email scripts get their server names and other email config options from this
module: change me to reflect your server names and mail preferences;
"""

#------------------------------------------------------------------------------
# (required for load, delete: all) POP3 email server machine, user
#------------------------------------------------------------------------------

popservername = 'pop.secureserver.net'
popusername   = '[email protected]'

#------------------------------------------------------------------------------
# (required for send: all) SMTP email server machine name
# see Python smtpd module for a SMTP server class to run locally;
#------------------------------------------------------------------------------

smtpservername = 'smtpout.secureserver.net'

#------------------------------------------------------------------------------
# (optional: all) personal information used by clients to fill in mail if set;
# signature  -- can be a triple-quoted block, ignored if empty string;
# address -- used for initial value of "From" field if not empty,
# no longer tries to guess From for replies: this had varying success;
#------------------------------------------------------------------------------

myaddress   = '[email protected]'
mysignature = ('Thanks,\n'
               '--Mark Lutz  (http://learning-python.com/books)')

#------------------------------------------------------------------------------
# (optional: mailtools) may be required for send; SMTP user/password if
# authenticated; set user to None or '' if no login/authentication is
# required; set pswd to name of a file holding your SMTP password, or
# an empty string to force programs to ask (in a console, or GUI);
#------------------------------------------------------------------------------

smtpuser  = None                           # per your ISP
smtppasswdfile  = ''                       # set to '' to be asked

#------------------------------------------------------------------------------
# (optional: mailtools) name of local one-line text file with your pop
# password; if empty or file cannot be read, pswd is requested when first
# connecting; pswd not encrypted: leave this empty on shared machines;
#------------------------------------------------------------------------------

poppasswdfile  = r'c:\temp\pymailgui.txt'      # set to '' to be asked

#------------------------------------------------------------------------------
# (required: mailtools) local file where sent messages are saved by some clients;
#------------------------------------------------------------------------------

sentmailfile   = r'.\sentmail.txt'             # . means in current working dir

#------------------------------------------------------------------------------
# (required: pymail, pymail2) local file where pymail saves pop mail on request;
#------------------------------------------------------------------------------

savemailfile   = r'c:\temp\savemail.txt'       # not used in PyMailGUI: dialog

#------------------------------------------------------------------------------
# (required: pymail, mailtools) fetchEncoding is the Unicode encoding used to
# decode fetched full message bytes, and to encode and decode message text if
# stored in text-mode save files; see Chapter 13 for details: this is a limited
# and temporary approach to Unicode encodings until a new bytes-friendly email
# package is developed; headersEncodeTo is for sent headers: see chapter13;
#------------------------------------------------------------------------------

fetchEncoding = 'utf8'      # 4E: how to decode and store message text (or latin1?)
headersEncodeTo = None      # 4E: how to encode non-ASCII headers sent (None=utf8)

#------------------------------------------------------------------------------
# (optional: mailtools) the maximum number of mail headers or messages to
# download on each load request; given this setting N, mailtools fetches at
# most N of the most recently arrived mails; older mails outside this set are
# not fetched from the server, but are returned as empty/dummy emails; if this
# is assigned to None (or 0), loads will have no such limit; use this if you
# have very many mails in your inbox, and your Internet or mail server speed
# makes full loads too slow to be practical;  some clients also load only
# newly-arrived emails, but this setting is independent of that feature;
#------------------------------------------------------------------------------

fetchlimit = 25             # 4E: maximum number headers/emails to fetch on loads
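
To illustrate how a client might honor the password file settings described in this module's comments, here is a small sketch. The helper name get_password and its ask parameter are hypothetical and are not the code used by this book's mail clients; the logic simply follows the comments above, reading a one-line file if one is named and readable, and asking the user otherwise:

```python
import getpass

def get_password(passwdfile, prompt, ask=getpass.getpass):
    """
    return a password from a one-line local text file if passwdfile is
    set and readable, else ask for it (at the console, by default)
    """
    if passwdfile:
        try:
            with open(passwdfile) as file:
                return file.readline().rstrip('\n')    # first line only
        except OSError:
            pass                                       # unreadable: fall through
    return ask(prompt)                                 # input is not echoed
```

A script could call this as get_password(mailconfig.poppasswdfile, 'Password for %s?' % mailconfig.popservername), and a GUI client could pass in its own ask function that pops up a dialog instead.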

POP Mail Reader Script

On to reading email in Python: the script in Example 13-18 employs Python’s standard poplib module, an implementation of the client-side interface to POP—the Post Office Protocol. POP is a well-defined and widely available way to fetch email from servers over sockets. This script connects to a POP server to implement a simple yet portable email download and display tool.

Example 13-18. PP4E\Internet\Email\popmail.py

#!/usr/local/bin/python
"""
##############################################################################
use the Python POP3 mail interface module to view your POP email account
messages;  this is just a simple listing--see pymail.py for a client with
more user interaction features, and smtpmail.py for a script which sends
mail;  POP is used to retrieve mail, and runs on a socket using port number
110 on the server machine, but Python's poplib hides all protocol details;
to send mail, use the smtplib module (or os.popen('mail...')).  see also:
imaplib module for IMAP alternative, PyMailGUI/PyMailCGI for more features;
##############################################################################
"""

import poplib, getpass, sys, mailconfig

mailserver = mailconfig.popservername      # ex: 'pop.rmi.net'
mailuser   = mailconfig.popusername        # ex: 'lutz'
mailpasswd = getpass.getpass('Password for %s?' % mailserver)

print('Connecting...')
server = poplib.POP3(mailserver)
server.user(mailuser)                      # connect, log in to mail server
server.pass_(mailpasswd)                   # pass is a reserved word

try:
    print(server.getwelcome())             # print returned greeting message
    msgCount, msgBytes = server.stat()
    print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
    print(server.list())
    print('-' * 80)
    input('[Press Enter key]')

    for i in range(msgCount):
        hdr, message, octets = server.retr(i+1)    # octets is byte count
        for line in message: print(line.decode())  # retrieve, print all mail
        print('-' * 80)                            # mail text is bytes in 3.x
        if i < msgCount - 1:
            input('[Press Enter key]')             # mail box locked till quit
finally:                                           # make sure we unlock mbox
    server.quit()                                  # else locked till timeout
print('Bye.')

Though primitive, this script illustrates the basics of reading email in Python. To establish a connection to an email server, we start by making an instance of the poplib.POP3 object, passing in the email server machine’s name as a string:

server = poplib.POP3(mailserver)

If this call doesn’t raise an exception, we’re connected (by socket) to the POP server listening on POP port number 110 at the machine where our email account lives.

The next thing we need to do before fetching messages is tell the server our username and password; notice that the password method is called pass_. Without the trailing underscore, pass would name a reserved word and trigger a syntax error:

server.user(mailuser)                      # connect, log in to mail server
server.pass_(mailpasswd)                   # pass is a reserved word

To keep things simple and relatively secure, this script always asks for the account password interactively; the getpass module we met in the FTP section of this chapter is used to input but not display a password string typed by the user.

Once we’ve told the server our username and password, we’re free to fetch mailbox information with the stat method (number of messages, total bytes among all messages) and fetch the full text of a particular message with the retr method (pass the message number; numbering starts at 1). The full text includes all headers, followed by a blank line, followed by the mail’s text and any attached parts. The retr call sends back a tuple that includes a list of line strings (bytes in 3.X) representing the content of the mail:

msgCount, msgBytes = server.stat()
hdr, message, octets = server.retr(i+1)    # octets is byte count

We close the email server connection by calling the POP object’s quit method:

server.quit()                              # else locked till timeout

Notice that this call appears inside the finally clause of a try statement that wraps the bulk of the script. To minimize complications associated with changes, POP servers lock your email inbox between the time you first connect and the time you close your connection (or until an arbitrary, system-defined timeout expires). Because the POP quit method also unlocks the mailbox, it’s crucial that we do this before exiting, whether an exception is raised during email processing or not. By wrapping the action in a try/finally statement, we guarantee that the script calls quit on exit to unlock the mailbox to make it accessible to other processes (e.g., delivery of incoming email).
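As a sketch of the same guarantee in reusable form, the try/finally logic can be packaged as a context manager. The names pop_session and pop_class below are hypothetical (they are not part of poplib or of this book's examples); the pop_class argument exists only so the helper can be exercised without a live server:

```python
import poplib
from contextlib import contextmanager

@contextmanager
def pop_session(hostname, username, password, pop_class=poplib.POP3):
    """Log in to a POP server and always send quit on the way out."""
    server = pop_class(hostname)          # connect (raises on failure)
    try:
        server.user(username)
        server.pass_(password)
        yield server                      # caller fetches mail here
    finally:
        server.quit()                     # unlock mailbox even on errors

# hypothetical usage; host and account names are placeholders:
# with pop_session('pop.example.com', 'someuser', 'secret') as server:
#     print(server.stat())
```

Whether the with block ends normally or by exception, the finally clause runs quit, so the mailbox should not stay locked until a server timeout.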

Fetching Messages

Here is the popmail script of Example 13-18 in action, displaying two messages in my account’s mailbox on machine pop.secureserver.net—the domain name of the mail server machine used by the ISP hosting my learning-python.com domain name, as configured in the module mailconfig. To keep this output reasonably sized, I’ve omitted or truncated a few irrelevant message header lines here, including most of the Received: headers that chronicle an email’s journey; run this on your own to see all the gory details of raw email text:

C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 2 mail messages in 3268 bytes
(b'+OK ', [b'1 1860', b'2 1408'], 16)
--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 −0000
X-IronPort-Anti-Spam-Result: AskCAG4r4UvRVllAlGdsb2JhbACDF44FjCkVAQEBAQkLCAkRAx+
Received: from 72.236.109.185 by webmail.earthlink.net with HTTP; Wed, 5 May 201
Message-ID: <27293081.1273073376592.JavaMail.root@mswamui-thinleaf.atl.sa.earthl
Date: Wed, 5 May 2010 11:29:36 −0400 (EDT)
From: [email protected]
Reply-To: [email protected]
To: [email protected]
Subject: I'm a Lumberjack, and I'm Okay
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Mailer: EarthLink Zoo Mail 1.0
X-ELNK-Trace: 309f369105a89a174e761f5d55cab8bca866e5da7af650083cf64d888edc8b5a35
X-Originating-IP: 209.86.224.51
X-Nonspam: None

I cut down trees, I skip and jump,
I like to press wild flowers...


--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 17482 invoked from network); 5 May 2010 15:33:47 −0000
X-IronPort-Anti-Spam-Result: AlIBAIss4UthSoc7mWdsb2JhbACDF44FjD4BAQEBAQYNCgcRIq1
Received: (qmail 4009 invoked by uid 99); 5 May 2010 15:33:47 −0000
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"
X-Originating-IP: 72.236.109.185
User-Agent: Web-Based Email 5.2.13
Message-Id: <20100505083347.deec9532fd532622acfef00cad639f45.0371a89d29.wbe@emai
From: [email protected]
To: [email protected]
Cc: [email protected]
Subject: testing
Date: Wed, 05 May 2010 08:33:47 −0700
Mime-Version: 1.0
X-Nonspam: None

Testing Python mail tools.


--------------------------------------------------------------------------------
Bye.

This user interface is about as simple as it could be—after connecting to the server, it prints the complete and raw full text of one message at a time, pausing between each until you press the Enter key. The input built-in is called to wait for the key press between message displays. The pause keeps messages from scrolling off the screen too fast; to make them visually distinct, emails are also separated by lines of dashes.

We could make the display fancier (e.g., we can use the email package to parse headers, bodies, and attachments—watch for examples in this and later chapters), but here we simply display the whole message that was sent. This works well for simple mails like these two, but it can be inconvenient for larger messages with attachments; we’ll improve on this in later clients.

This book won’t cover the full set of headers that may appear in emails, but we’ll make use of some along the way. For example, the X-Mailer header line, if present, typically identifies the sending program; we’ll use it later to identify Python-coded email senders we write. The more common headers such as From and Subject are more crucial to a message. In fact, a variety of extra header lines can be sent in a message’s text. The Received headers, for example, trace the machines that a message passed through on its way to the target mailbox.

Because popmail prints the entire raw text of a message, you see all headers here, but you usually see only a few by default in end-user-oriented mail GUIs such as Outlook and webmail pages. The raw text here also makes apparent the email structure we noted earlier: an email in general consists of a set of headers like those here, followed by a blank line, which is followed by the mail’s main text, though as we’ll see later, they can be more complex if there are alternative parts or attachments.

The script in Example 13-18 never deletes mail from the server. Mail is simply retrieved and printed and will be shown again the next time you run the script (barring deletion in another tool, of course). To really remove mail permanently, we need to call other methods (e.g., server.dele(msgnum)), but such a capability is best deferred until we develop more interactive mail tools.

Notice how the reader script decodes each mail content line with line.decode into a str string for display; as mentioned earlier, poplib returns content as bytes strings in 3.X. In fact, if we change the script to not decode, this becomes more obvious in its output:

[Press Enter key]
...assorted lines omitted...
b'Date: Wed, 5 May 2010 11:29:36 −0400 (EDT)'
b'From: [email protected]'
b'Reply-To: [email protected]'
b'To: [email protected]'
b"Subject: I'm a Lumberjack, and I'm Okay"
b'Mime-Version: 1.0'
b'Content-Type: text/plain; charset=UTF-8'
b'Content-Transfer-Encoding: 7bit'
b'X-Mailer: EarthLink Zoo Mail 1.0'
b''
b'I cut down trees, I skip and jump,'
b'I like to press wild flowers...'
b''

As we’ll see later, we’ll need to decode similarly in order to parse this text with email tools. The next section exposes the bytes-based interface as well.
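As a preview, here is a minimal sketch of that decode-then-parse step. It assumes the mail text is UTF-8 (the fetchEncoding setting shown earlier) and uses a few of the bytes lines from the output above as stand-in data instead of a live retr call:

```python
import email

# bytes lines as poplib's retr would return them (subset of the output above)
lines = [
    b"Subject: I'm a Lumberjack, and I'm Okay",
    b'Mime-Version: 1.0',
    b'Content-Type: text/plain; charset=UTF-8',
    b'Content-Transfer-Encoding: 7bit',
    b'X-Mailer: EarthLink Zoo Mail 1.0',
    b'',
    b'I cut down trees, I skip and jump,',
    b'I like to press wild flowers...',
    b'',
]

fulltext = b'\n'.join(lines).decode('utf8')     # bytes lines -> one str
msg = email.message_from_string(fulltext)       # parse headers and body

print(msg['Subject'])                # I'm a Lumberjack, and I'm Okay
print(msg.get_content_type())        # text/plain
print(msg.get_payload())             # the two body lines
```

The parser splits headers from body at the first blank line, so the joined text must preserve the empty b'' line that follows the headers.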

Fetching Email at the Interactive Prompt

If you don’t mind typing code and reading POP server messages, it’s possible to use the Python interactive prompt as a simple email client, too. The following session uses two additional interfaces we’ll apply in later examples:

conn.list(): Returns a list of “message-number message-size” strings.
conn.top(N, 0): Retrieves just the header text portion of message number N.

The top call also returns a tuple that includes the list of line strings sent back; its second argument tells the server how many additional lines after the headers to send, if any. If all you need are header details, top can be much quicker than the full text fetch of retr, provided your mail server implements the TOP command (most do):

C:\...\PP4E\Internet\Email> python
>>> from poplib import POP3
>>> conn = POP3('pop.secureserver.net')         # connect to server
>>> conn.user('[email protected]')       # log in to account
b'+OK '
>>> conn.pass_('xxxxxxxx')
b'+OK '

>>> conn.stat()      # num mails, num bytes
(2, 3268)
>>> conn.list()
(b'+OK ', [b'1 1860', b'2 1408'], 16)

>>> conn.top(1, 0)
(b'+OK 1860 octets ', [b'Received: (qmail 7690 invoked from network); 5 May 2010
...lines omitted...
b'X-Originating-IP: 209.86.224.51', b'X-Nonspam: None', b'', b''], 1827)

>>> conn.retr(1)
(b'+OK 1860 octets ', [b'Received: (qmail 7690 invoked from network); 5 May 2010
...lines omitted...
b'X-Originating-IP: 209.86.224.51', b'X-Nonspam: None', b'',
b'I cut down trees, I skip and jump,', b'I like to press wild flowers...',
b'', b''], 1898)

>>> conn.quit()
b'+OK '
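
The “message-number message-size” strings returned by list are easy to turn into numbers. The helper below is a sketch only (parse_pop_list is a hypothetical name, not a poplib call), run here on the list result copied from the session above:

```python
def parse_pop_list(list_result):
    """Turn a poplib list() result into (message number, size) pairs."""
    response, listings, octets = list_result
    pairs = []
    for entry in listings:                      # each entry is like b'1 1860'
        num, size = entry.decode().split()
        pairs.append((int(num), int(size)))
    return pairs

sample = (b'+OK ', [b'1 1860', b'2 1408'], 16)  # copied from conn.list() above
sizes = parse_pop_list(sample)
print(sizes)                                    # [(1, 1860), (2, 1408)]
print(sum(size for (num, size) in sizes))       # 3268, same total as stat()
```

A client could use such sizes to skip very large messages, or to decide when a top header fetch is worth preferring over a full retr.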

Printing the full text of a message at the interactive prompt is easy once it’s fetched: simply decode each line to a normal string as it is printed, as our popmail script did, or concatenate the line strings returned by retr or top, adding a newline between them; any of the following will suffice for an open POP server object:

>>> info, msg, oct = connection.retr(1)       # fetch first email in mailbox

>>> for x in msg: print(x.decode())           # four ways to display message lines
>>> print(b'\n'.join(msg).decode())
>>> x = [print(x.decode()) for x in msg]
>>> x = list(map(print, map(bytes.decode, msg)))

Parsing email text to extract headers and components is more complex, especially for mails with attached and possibly encoded parts, such as images. As we’ll see later in this chapter, the standard library’s email package can parse the mail’s full or headers text after it has been fetched with poplib (or imaplib).

See the Python library manual for details on other POP module tools. As of Python 2.4, there is also a POP3_SSL class in the poplib module that connects to the server over an SSL-encrypted socket on port 995 by default (the standard port for POP over SSL). It provides an identical interface, but it uses secure sockets for the conversation where supported by servers.
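Because the two classes share an interface, switching between them can be a one-line choice. The following is a sketch under that assumption; pop_class_for and connect are hypothetical helper names, and whether your server accepts SSL connections is something to check with your provider:

```python
import poplib

def pop_class_for(use_ssl):
    """Pick the poplib class for a plain or SSL-wrapped connection."""
    return poplib.POP3_SSL if use_ssl else poplib.POP3

def connect(hostname, use_ssl=True):
    # uses port 995 for POP3_SSL and 110 for POP3 unless a port is passed
    return pop_class_for(use_ssl)(hostname)

print(poplib.POP3_PORT, poplib.POP3_SSL_PORT)    # 110 995
```

The rest of a script such as popmail (user, pass_, stat, retr, quit) would run unchanged on the object that connect returns.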

SMTP: Sending Email

There is a proverb in hackerdom that states that every useful computer program eventually grows complex enough to send email. Whether such wisdom rings true or not in practice, the ability to automatically initiate email from within a program is a powerful tool.

For instance, test systems can automatically email failure reports, user interface programs can ship purchase orders to suppliers by email, and so on. Moreover, a portable Python mail script could be used to send messages from any computer in the world with Python and an Internet connection that supports standard email protocols. Freedom from dependence on mail programs like Outlook is an attractive feature if you happen to make your living traveling around teaching Python on all sorts of computers.

Luckily, sending email from within a Python script is just as easy as reading it. In fact, there are at least four ways to do so:

Calling os.popen to launch a command-line mail program

On some systems, you can send email from a script with a call of the form:

os.popen('mail -s "xxx" [email protected]', 'w').write(text)

As we saw earlier in the book, the popen tool runs the command-line string passed to its first argument, and returns a file-like object connected to it. If we use an open mode of w, we are connected to the command’s standard input stream—here, we write the text of the new mail message to the standard Unix mail command-line program. The net effect is as if we had run mail interactively, but it happens inside a running Python script.

Running the sendmail program

The open source sendmail program offers another way to initiate mail from a program. Assuming it is installed and configured on your system, you can launch it using Python tools like the os.popen call of the previous paragraph.

Using the standard smtplib Python module

Python’s standard library comes with support for the client-side interface to SMTP—the Simple Mail Transfer Protocol—a higher-level Internet standard for sending mail over sockets. Like the poplib module we met in the previous section, smtplib hides all the socket and protocol details and can be used to send mail on any machine with Python and a suitable socket-based Internet link.

Fetching and using third-party packages and tools

Other tools in the open source library provide higher-level mail handling packages for Python; most build upon one of the prior three techniques.

Of these four options, smtplib is by far the most portable and direct. Using os.popen to spawn a mail program usually works on Unix-like platforms only, not on Windows (it assumes a command-line mail program), and requires spawning one or more processes along the way. And although the sendmail program is powerful, it is also somewhat Unix-biased, complex, and may not be installed even on all Unix-like machines.

By contrast, the smtplib module works on any machine that has Python and an Internet link that supports SMTP access, including Unix, Linux, Mac, and Windows. It sends mail over sockets in-process, instead of starting other programs to do the work. Moreover, SMTP affords us much control over the formatting and routing of email.

SMTP Mail Sender Script

Since SMTP is arguably the best option for sending mail from a Python script, let’s explore a simple mailing program that illustrates its interfaces. The Python script shown in Example 13-19 is intended to be used from an interactive command line; it reads a new mail message from the user and sends the new mail by SMTP using Python’s smtplib module.

Example 13-19. PP4E\Internet\Email\smtpmail.py

#!/usr/local/bin/python
"""
###########################################################################
use the Python SMTP mail interface module to send email messages; this
is just a simple one-shot send script--see pymail, PyMailGUI, and
PyMailCGI for clients with more user interaction features; also see
popmail.py for a script that retrieves mail, and the mailtools pkg
for attachments and formatting with the standard library email package;
###########################################################################
"""

import smtplib, sys, email.utils, mailconfig
mailserver = mailconfig.smtpservername         # ex: smtp.rmi.net

From = input('From? ').strip()                 # or import from mailconfig
To   = input('To?   ').strip()                 # ex: [email protected]
Tos  = To.split(';')                           # allow a list of recipients
Subj = input('Subj? ').strip()
Date = email.utils.formatdate()                # curr datetime, rfc2822

# standard headers, followed by blank line, followed by text
text = ('From: %s\nTo: %s\nDate: %s\nSubject: %s\n\n' % (From, To, Date, Subj))

print('Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]')
while True:
    line = sys.stdin.readline()
    if not line:
        break                        # exit on ctrl-d/z
   #if line[:4] == 'From':
   #    line = '>' + line            # servers may escape
    text += line

print('Connecting...')
server = smtplib.SMTP(mailserver)              # connect, no log-in step
failed = server.sendmail(From, Tos, text)
server.quit()
if failed:                                     # smtplib may raise exceptions
    print('Failed recipients:', failed)        # too, but let them pass here
else:
    print('No errors.')
print('Bye.')

Most of this script is user interface: it inputs the sender’s address (From), one or more recipient addresses (To, separated by “;” if more than one), and a subject line. The sending date is produced by the email.utils.formatdate call, standard header lines are formatted, and the while loop reads message lines until the user types the end-of-file character (Ctrl-Z on Windows, Ctrl-D on Linux).

To be robust, be sure to add a blank line between the header lines and the body in the message’s text; it’s required by the Internet email message format, and some SMTP servers enforce this. Our script conforms by inserting an empty line with \n\n at the end of the string format expression: one \n to terminate the current line and another for a blank line. smtplib expands \n to Internet-style \r\n internally prior to transmission, so the short form is fine here. Later in this chapter, we’ll format our messages with the Python email package, which handles such details for us automatically.
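To make the blank-line rule concrete, here is a small sketch that builds the same kind of message two ways: by hand with a format string, and with the email package’s Message class. The addresses are example.com placeholders, not accounts used elsewhere in this chapter:

```python
import email.utils
from email.message import Message

From, To, Subj = 'sender@example.com', 'recipient@example.com', 'testing'
Date = email.utils.formatdate()
body = 'Lovely Spam! Wonderful Spam!\n'

# by hand: the trailing \n\n ends the last header and adds the blank line
manual = 'From: %s\nTo: %s\nDate: %s\nSubject: %s\n\n' % (From, To, Date, Subj)
manual += body

# with the email package: the separator is inserted for us
msg = Message()
msg['From'] = From
msg['To'] = To
msg['Date'] = Date
msg['Subject'] = Subj
msg.set_payload(body)
generated = msg.as_string()

# both texts split into headers and body at the first blank line
for text in (manual, generated):
    headers, blank, content = text.partition('\n\n')
    print(len(headers.splitlines()), 'header lines;', repr(content))
```

Either string could be passed as the message text to smtplib’s sendmail.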

The rest of the script is where all the SMTP magic occurs: to send a mail by SMTP, simply run these two sorts of calls:

server = smtplib.SMTP(mailserver): Make an instance of the SMTP object, passing in the name of the SMTP server that will dispatch the message first. If this doesn’t throw an exception, you’re connected to the SMTP server via a socket when the call returns. Technically, the connect method establishes connection to a server, but the SMTP object calls this method automatically if the mail server name is passed in this way.
failed = server.sendmail(From, Tos, text): Call the SMTP object’s sendmail method, passing in the sender address, one or more recipient addresses, and the raw text of the message itself with as many standard mail header lines as you care to provide.

When you’re done, be sure to call the object’s quit method to disconnect from the server and finalize the transaction. Notice that, on failure, the sendmail method may either raise an exception or return a dictionary of the recipient addresses that were refused (empty if all were accepted); the script handles the latter case itself but lets exceptions kill the script with a Python error message.

Subtly, calling the server object’s quit method after sendmail raises an exception may or may not work as expected—quit can actually hang until a server timeout if the send fails internally and leaves the interface in an unexpected state. For instance, this can occur on Unicode encoding errors when translating the outgoing mail to bytes per the ASCII scheme (the rset reset request hangs in this case, too). An alternative close method simply closes the client’s sockets without attempting to send a quit command to the server; quit calls close internally as a last step (assuming the quit command can be sent!).
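One way to act on this advice is sketched below: call quit only after a send that returned normally, and fall back to close if sendmail raised. The function name send_and_disconnect is hypothetical, and this is just one reasonable policy, not the approach of the script above:

```python
import smtplib

def send_and_disconnect(server, From, Tos, text):
    """Send one message; use close() instead of quit() if the send raised."""
    try:
        failed = server.sendmail(From, Tos, text)
    except Exception:
        server.close()       # drop the socket without talking to the server
        raise                # let the caller see the original error
    else:
        server.quit()        # normal case: polite disconnect
        return failed        # dictionary of refused recipients, if any

# hypothetical usage, given a reachable SMTP server name:
# server = smtplib.SMTP('smtp.example.com')
# failed = send_and_disconnect(server, From, Tos, text)
```

Re-raising after close keeps the error visible to the caller while avoiding a quit call that might hang.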

For advanced usage, SMTP objects provide additional calls not used in this example:

server.login(user, password) provides an interface to SMTP servers that require and support authentication; watch for this call to appear as an option in the mailtools package example later in this chapter.
server.starttls([keyfile[, certfile]]) puts the SMTP connection in Transport Layer Security (TLS) mode; all commands will be encrypted using the Python ssl module’s socket wrapper SSL support, and they assume the server supports this mode.

See the Python library manual for more on these and other calls not covered here.
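For servers that require both, these two calls are typically combined as in the sketch below. Everything specific here is an assumption to adjust for your provider: the function name send_authenticated is hypothetical, port 587 is a common mail submission port but not universal, and the smtp_class argument exists only so the logic can be tried without a real server:

```python
import smtplib

def send_authenticated(hostname, user, password, From, Tos, text,
                       port=587, smtp_class=smtplib.SMTP):
    """Connect, switch to TLS, log in, and send one message."""
    server = smtp_class(hostname, port)    # port 587 is an assumption
    try:
        server.starttls()                  # encrypt before sending the password
        server.login(user, password)       # raises if authentication fails
        failed = server.sendmail(From, Tos, text)
        server.quit()
        return failed
    finally:
        server.close()                     # harmless if quit already closed
```

Calling starttls before login matters: otherwise the username and password would cross the network unencrypted.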

Sending Messages

Let’s ship a few messages across the world. The smtpmail script is a one-shot tool: each run allows you to send a single new mail message. Like most of the client-side tools in this chapter, it can be run from any computer with Python and an Internet link that supports SMTP (most do, though some public access machines may restrict users to HTTP [Web] access only or require special server SMTP configuration). Here it is running on Windows:

C:\...\PP4E\Internet\Email> smtpmail.py
From? [email protected]
To?   [email protected]
Subj? A B C D E F G
Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]
Fiddle de dum, Fiddle de dee,
Eric the half a bee.
^Z
Connecting...
No errors.
Bye.

This mail is sent to the book’s email account address ([email protected]), so it ultimately shows up in the inbox at my ISP, but only after being routed through an arbitrary number of machines on the Net, and across arbitrarily distant network links. It’s complex at the bottom, but usually, the Internet “just works.”

Notice the From address, though—it’s completely fictitious (as far as I know, at least). It turns out that we can usually provide any From address we like because SMTP doesn’t check its validity (only its general format is checked). Furthermore, unlike POP, there is usually no notion of a username or password in SMTP, so the sender is more difficult to determine. We need only pass email to any machine with a server listening on the SMTP port, and we don’t need an account or login on that machine. Here, the name [email protected] works just fine as the sender; Marketing.Geek.[email protected] might work just as well.

In fact, I didn’t import a From email address from the mailconfig.py module on purpose, because I wanted to be able to demonstrate this behavior; it’s the basis of some of those annoying junk emails that show up in your mailbox without a real sender’s address.^[52] Marketers infected with e-millionaire mania will email advertising to all addresses on a list without providing a real From address, to cover their tracks.

Normally, of course, you should use the same To address in the message and the SMTP call and provide your real email address as the From value (that’s the only way people will be able to reply to your message). Moreover, apart from teasing your significant other, sending phony addresses is often just plain bad Internet citizenship. Let’s run the script again to ship off another mail with more politically correct coordinates:

C:\...\PP4E\Internet\Email> smtpmail.py
From? [email protected]
To?   [email protected]
Subj? testing smtpmail
Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]
Lovely Spam! Wonderful Spam!
^Z
Connecting...
No errors.
Bye.

Verifying receipt

At this point, we could run whatever email tool we normally use to access our mailbox to verify the results of these two send operations; the two new emails should show up in our mailbox regardless of which mail client is used to view them. Since we’ve already written a Python script for reading mail, though, let’s put it to use as a verification tool—running the popmail script from the last section reveals our two new messages at the end of the mail list (again parts of the output have been trimmed to conserve space and protect the innocent here):

C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 4 mail messages in 5326 bytes
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009'], 32)
--------------------------------------------------------------------------------
[Press Enter key]

...first two mails omitted...

Received: (qmail 25683 invoked from network); 6 May 2010 14:12:07 −0000
Received: from unknown (HELO p3pismtp01-018.prod.phx3.secureserver.net) ([10.6.1
          (envelope-sender <[email protected]>)
          by p3plsmtp06-04.prod.phx3.secureserver.net (qmail-1.03) with SMTP
          for <[email protected]>; 6 May 2010 14:12:07 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
        for <[email protected]>; Thu, 06 May 2010 10:12:12 −0400
From: [email protected]
To: [email protected]
Date: Thu, 06 May 2010 14:11:07 −0000
Subject: A B C D E F G
Message-ID: <jdlohzf0j8dp8z4x06052010101212@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None

Fiddle de dum, Fiddle de dee,
Eric the half a bee.


--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 4634 invoked from network); 6 May 2010 14:16:57 −0000
Received: from unknown (HELO p3pismtp01-025.prod.phx3.secureserver.net) ([10.6.1
          (envelope-sender <[email protected]>)
          by p3plsmtp06-05.prod.phx3.secureserver.net (qmail-1.03) with SMTP
          for <[email protected]>; 6 May 2010 14:16:57 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
        for <[email protected]>; Thu, 06 May 2010 10:17:03 −0400
From: [email protected]
To: [email protected]
Date: Thu, 06 May 2010 14:16:31 −0000
Subject: testing smtpmail
Message-ID: <8fad1n462667fik006052010101703@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None

Lovely Spam! Wonderful Spam!


--------------------------------------------------------------------------------
Bye.

Notice how the fields we input to our script show up as headers and text in the email’s raw text delivered to the recipient. Technically, some ISPs test to make sure that at least the domain of the email sender’s address (the part after “@”) is a real, valid domain name, and disallow delivery if not. As mentioned earlier, some servers also require that SMTP senders have a direct connection to their network and may require an authentication call with username and password (described near the end of the preceding section). In the second edition of the book, I used an ISP that let me get away with more nonsense, but this may vary per server; the rules have tightened since then to limit spam.

Manipulating both From and To

The first mail listed at the end of the preceding section was the one we sent with a fictitious sender address; the second was the more legitimate message. Like sender addresses, header lines are a bit arbitrary under SMTP. Our smtpmail script automatically adds From and To header lines in the message’s text with the same addresses that are passed to the SMTP interface, but only as a polite convention. Sometimes, though, you can’t tell who a mail was sent to, either—to obscure the target audience or to support legitimate email lists, senders may manipulate the contents of both these headers in the message’s text.

For example, if we change smtpmail to not automatically generate a “To:” header line with the same address(es) sent to the SMTP interface call:

text = ('From: %s\nDate: %s\nSubject: %s\n' % (From, Date, Subj))

we can then manually type a “To:” header that differs from the address we’re really sending to—the “To” address list passed into the smtplib send call gives the true recipients, but the “To:” header line in the text of the message is what most mail clients will display (see smtpmail-noTo.py in the examples package for the code needed to support such anonymous behavior, and be sure to type a blank line after “To:”):

C:\...\PP4E\Internet\Email> smtpmail-noTo.py
From? [email protected]
To?   [email protected]
Subj? a b c d e f g
Type message text, end with line=(ctrl + D or Z)
To: [email protected]

Spam; Spam and eggs; Spam, spam, and spam.
^Z
Connecting...
No errors.
Bye.

In some ways, the From and To addresses in send method calls and message header lines are similar to addresses on envelopes and letters in envelopes, respectively. The former is used for routing, but the latter is what the reader sees. Here, From is fictitious in both places. Moreover, I gave the real To address for the account on the server, but then gave a fictitious name in the manually typed “To:” header line—the first address is where it really goes and the second appears in mail clients. If your mail tool picks out the “To:” line, such mails will look odd when viewed.
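The envelope-versus-letter distinction can be sketched in a few lines. In this illustration, all addresses are example.com placeholders and build_mail is a hypothetical helper; the send call is left as a comment because it needs a live server:

```python
import email.utils

def build_mail(From, header_to, Subj, body):
    """Format message text; header_to is only what readers will see."""
    Date = email.utils.formatdate()
    return ('From: %s\nTo: %s\nDate: %s\nSubject: %s\n\n%s'
            % (From, header_to, Date, Subj, body))

# the "envelope": where the mail really goes
envelope_to = ['member1@example.com', 'member2@example.com']

# the "letter": what mail clients display
text = build_mail('owner@example.com', 'python-fans@example.com',
                  'hello', 'Spam; Spam and eggs.\n')
print(text)

# with a connected smtplib.SMTP object named server:
# server.sendmail('owner@example.com', envelope_to, text)
```

The recipient list handed to sendmail never appears in the message text unless the script chooses to put it there.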

For instance, when the mail we just sent shows up in my mailbox at learning-python.com, it’s difficult to tell much about its origin or destination in the webmail interface my ISP provides, as captured in Figure 13-5.

Figure 13-5. Anonymous mail in a web-mail client (see also ahead: PyMailGUI)

Furthermore, this email’s raw text won’t help unless we look closely at the “Received:” headers added by the machines it has been routed through:

C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 5 mail messages in 6364 bytes
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009', b'5 1038'], 40)
--------------------------------------------------------------------------------
[Press Enter key]

...first three mails omitted...


Received: (qmail 30325 invoked from network); 6 May 2010 14:33:45 −0000
Received: from unknown (HELO p3pismtp01-004.prod.phx3.secureserver.net) ([10.6.1
          (envelope-sender <[email protected]>)
          by p3plsmtp06-03.prod.phx3.secureserver.net (qmail-1.03) with SMTP
          for <[email protected]>; 6 May 2010 14:33:45 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
        for <[email protected]>; Thu, 06 May 2010 10:33:16 −0400
From: [email protected]
Date: Thu, 06 May 2010 14:32:32 −0000
Subject: a b c d e f g
To: [email protected]
Message-ID: <66koqg66e0q1c8hl06052010103316@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None

Spam; Spam and eggs; Spam, spam, and spam.


--------------------------------------------------------------------------------
Bye.

Once again, though, don’t do this unless you have good cause. This demonstration is intended only to help you understand how mail headers factor into email processing. To write an automatic spam filter that deletes incoming junk mail, for instance, you need to know some of the telltale signs to look for in a message’s text. Spamming techniques have grown much more sophisticated than simply forging sender and recipient names, of course (you’ll find much more on the subject on the Web at large and in the SpamBayes mail filter written in Python), but it’s one common trick.

On the other hand, such To address juggling may also be useful in the context of legitimate mailing lists—the name of the list appears in the “To:” header when the message is viewed, not the potentially many individual recipients named in the send-mail call. As the next section’s example demonstrates, a mail client can simply send a mail to all on the list but insert the general list name in the “To:” header.

But in other contexts, sending email with bogus “From:” and “To:” lines is equivalent to making anonymous phone calls. Most mailers won’t even let you change the From line, and they don’t distinguish between the To address and header line. When you program mail scripts of your own, though, SMTP is wide open in this regard. So be good out there, OK?

In the prior version of the smtpmail script of Example 13-19, a simple date format was used for the Date email header that didn’t quite follow the SMTP date formatting standard:

>>> import time
>>> time.asctime()
'Wed May 05 17:52:05 2010'

Most servers don’t care and will let any sort of date text appear in date header lines, or even add one if needed. Clients are often similarly forgiving, but not always; one of my ISP webmail programs shows dates correctly anyhow, but another leaves such ill-formed dates blank in mail displays. If you want to be more in line with the standard, you could format the date header with code like this (the result can be parsed with standard tools such as the time.strptime call):

import time
gmt = time.gmtime(time.time())
fmt = '%a, %d %b %Y %H:%M:%S GMT'
datestr = time.strftime(fmt, gmt)        # named datestr to avoid shadowing built-in str
hdr = 'Date: ' + datestr
print(hdr)

The hdr variable’s value looks like this when this code is run:

Date: Wed, 05 May 2010 21:49:32 GMT
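As a quick check of the parsing claim, the following sketch feeds the header value shown above back through time.strptime with the same format string. The variable names are illustrative only, and the day and month abbreviations assume the default English (C) locale:

```python
import time

fmt = '%a, %d %b %Y %H:%M:%S GMT'            # same format used to build the header
value = 'Wed, 05 May 2010 21:49:32 GMT'      # header value, minus the 'Date: ' prefix

parsed = time.strptime(value, fmt)           # struct_time with separate fields
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday, parsed.tm_hour)

# formatting the parsed fields again reproduces the original text
roundtrip = time.strftime(fmt, parsed)
print(roundtrip == value)
```

Because the format string hardcodes the GMT suffix, this round trip only applies to dates generated the same way; dates with numeric offsets need a more general parser.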

The time.strftime call allows arbitrary date and time formatting; time.asctime is just one standard format. Better yet, do what smtpmail does now—in the newer email package (described in this chapter), an email.utils call can be used to properly format date and time automatically. The smtpmail script uses the first of the following format alternatives:

>>> import email.utils
>>> email.utils.formatdate()
'Wed, 05 May 2010 21:54:28 -0000'
>>> email.utils.formatdate(localtime=True)
'Wed, 05 May 2010 17:54:52 -0400'
>>> email.utils.formatdate(usegmt=True)
'Wed, 05 May 2010 21:55:22 GMT'
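The same module can also read such dates back. The sketch below is not taken from smtpmail; it formats an arbitrary fixed timestamp (an assumption chosen so the result is repeatable) and recovers it with the email.utils parsing calls:

```python
import email.utils

stamp = 1000000000                                   # arbitrary fixed time: seconds since the epoch
text = email.utils.formatdate(stamp, usegmt=True)    # standard date text ending in 'GMT'
print(text)

fields = email.utils.parsedate_tz(text)              # 10-item tuple; the last item is the UTC offset
recovered = email.utils.mktime_tz(fields)            # back to seconds since the epoch
print(recovered == stamp)
```

Passing the time value explicitly, rather than letting formatdate use the current time, is only done here to make the round trip easy to verify.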

See the pymail and mailtools examples in this chapter for additional usage examples; the latter is reused by the larger PyMailGUI and PyMailCGI email clients later in this book.

Sending Email at the Interactive Prompt

So where are we in the Internet abstraction model now? With all this email fetching and sending going on, it’s easy to lose the forest for the trees. Keep in mind that because mail is transferred over sockets (remember sockets?), they are at the root of all this activity. All email read and written ultimately consists of formatted bytes shipped over sockets between computers on the Net. As we’ve seen, though, the POP and SMTP interfaces in Python hide all the details. Moreover, the scripts we’ve begun writing even hide the Python interfaces and provide higher-level interactive tools.

Both the popmail and smtpmail scripts provide portable email tools but aren’t quite what we’d expect in terms of usability these days. Later in this chapter, we’ll use what we’ve seen thus far to implement a more interactive, console-based mail tool. In the next chapter, we’ll also code a tkinter email GUI, and then we’ll go on to build a web-based interface in a later chapter. All of these tools, though, vary primarily in terms of user interface only; each ultimately employs the Python mail transfer modules we’ve met here to transfer mail message text over the Internet with sockets.

Before we move on, one more SMTP note: just as for reading mail, we can use the Python interactive prompt as our email sending client, too, if we type calls manually. The following, for example, sends a message through my ISP’s SMTP server to two recipient addresses assumed to be part of a mail list:

C:\...\PP4E\Internet\Email> python
>>> from smtplib import SMTP
>>> conn = SMTP('smtpout.secureserver.net')
>>> conn.sendmail(
... '[email protected]',                           # true sender
... ['[email protected]', '[email protected]'],         # true recipients
... """From: [email protected]
... To: maillist
... Subject: test interactive smtplib
...
... testing 1 2 3...
... """)
{}
>>> conn.quit()                                # quit() required, Date added
(221, b'Closing connection. Good bye.')

We’ll verify receipt of this message in a later email client program; the “To” recipient shows up as “maillist” in email clients—a completely valid use case for header manipulation. In fact, you can achieve the same effect with the smtpmail-noTo script by separating recipient addresses at the “To?” prompt with a semicolon (e.g. [email protected]; [email protected]) and typing the email list’s name in the “To:” header line. Mail clients that support mailing lists automate such steps.

Sending mail interactively this way is a bit tricky to get right, though—header lines are governed by standards: the blank line after the subject line is required and significant, for instance, and Date is omitted altogether (one is added for us). Furthermore, mail formatting gets much more complex as we start writing messages with attachments. In practice, the email package in the standard library is generally used to construct emails, before shipping them off with smtplib. The package lets us build mails by assigning headers and attaching and possibly encoding parts, and creates a correctly formatted mail text. To learn how, let’s move on to the next section.

email: Parsing and Composing Mail Content

The second edition of this book used a handful of standard library modules (rfc822, StringIO, and more) to parse the contents of messages, and simple text processing to compose them. Additionally, that edition included a section on extracting and decoding attached parts of a message using modules such as mhlib, mimetools, and base64.

In the third edition, those tools were still available, but were, frankly, a bit clumsy and error-prone. Parsing attachments from messages, for example, was tricky, and composing even basic messages was tedious (in fact, an early printing of the prior edition contained a potential bug, because it omitted one character in a string formatting operation). Adding attachments to sent messages wasn’t even attempted, due to the complexity of the formatting involved. Most of these tools are gone completely in Python 3.X as I write this fourth edition, partly because of their complexity, and partly because they’ve been made obsolete.

Luckily, things are much simpler today. After the second edition, Python sprouted a new email package—a powerful collection of tools that automate most of the work behind parsing and composing email messages. This module gives us an object-based message interface and handles all the textual message structure details, both analyzing and creating it. Not only does this eliminate a whole class of potential bugs, it also promotes more advanced mail processing.

Things like attachments, for instance, become accessible to mere mortals (and authors with limited book real estate). In fact, an entire original section on manual attachment parsing and decoding was deleted in the third edition—it’s essentially automatic with email. The new package parses and constructs headers and attachments; generates correct email text; decodes and encodes Base64, quoted-printable, and uuencoded data; and much more.

We won’t cover the email package in its entirety in this book; it is well documented in Python’s library manual. Our goal here is to explore some example usage code, which you can study in conjunction with the manuals. But to help get you started, let’s begin with a quick overview. In a nutshell, the email package is based around the Message object it provides:

Parsing mail: A mail’s full text, fetched from poplib or imaplib, is parsed into a new Message object, with an API for accessing its components. In the object, mail headers become dictionary-like keys, and components become a “payload” that can be walked with a generator interface (more on payloads in a moment).
Creating mail: New mails are composed by creating a new Message object, using an API to attach headers and parts, and asking the object for its print representation—a correctly formatted mail message text, ready to be passed to the smtplib module for delivery. Headers are added by key assignment and attachments by method calls.

In other words, the Message object is used both for accessing existing messages and for creating new ones from scratch. In both cases, email can automatically handle details like content encodings (e.g., attached binary images can be treated as text with Base64 encoding and decoding), content types, and more.

Message Objects

Since the email module’s Message object is at the heart of its API, you need a cursory understanding of its form to get started. In short, it is designed to reflect the structure of a formatted email message. Each Message consists of three main pieces of information:

Type: A content type (plain text, HTML text, JPEG image, and so on), encoded as a MIME main type and a subtype. For instance, “text/html” means the main type is text and the subtype is HTML (a web page); “image/jpeg” means a JPEG photo. A “multipart/mixed” type means there are nested parts within the message.
Headers: A dictionary-like mapping interface, with one key per mail header (From, To, and so on). This interface supports almost all of the usual dictionary operations, and headers may be fetched or set by normal key indexing.
Content: A “payload,” which represents the mail’s content. This can be either a string (bytes or str) for simple messages, or a list of additional Message objects for multipart container messages with attached or alternative parts. For some oddball types, the payload may be a Python None object.

The MIME type of a Message is key to understanding its content. For example, mails with attached images may have a main top-level Message (type multipart/mixed), with three more Message objects in its payload—one for its main text (type text/plain), followed by two of type image for the photos (type image/jpeg). The photo parts may be encoded for transmission as text with Base64 or another scheme; the encoding type, as well as the original image filename, are specified in the part’s headers.

Similarly, mails that include both simple text and an HTML alternative will have two nested Message objects in their payload, of type plain text (text/plain) and HTML text (text/html), along with a main root Message of type multipart/alternative. Your mail client decides which part to display, often based on your preferences.

Simpler messages may have just a root Message of type text/plain or text/html, representing the entire message body. The payload for such mails is a simple string. They may also have no explicitly given type at all, which generally defaults to text/plain. Some single-part messages are text/html, with no text/plain alternative—they require a web browser or other HTML viewer (or a very keen-eyed user).
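To make the three pieces concrete, here is a minimal sketch that builds a simple message with no explicit type and inspects its type, headers, and payload. The header and body text are invented for illustration:

```python
from email.message import Message

m = Message()
m['Subject'] = 'status'                  # headers: dictionary-like keys
m.set_payload('all systems go\n')        # content: a simple string payload

contype = m.get_content_type()           # no Content-Type header set, so the default applies
print(contype)
print(m.get_content_maintype(), m.get_content_subtype())
print(m.keys(), m.is_multipart())
print(type(m.get_payload()))
```

The type is reported as text/plain here even though no Content-Type header was ever assigned, which is the default behavior described above.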

Other combinations are possible, including some types that are not commonly seen in practice, such as message/delivery-status. Most messages have a main text part, though it is not required, and may be nested in a multipart or other construct.

In all cases, an email message is a simple, linear string, but these message structures are automatically detected when mail text is parsed and are created by your method calls when new messages are composed. For instance, when creating messages, the message attach method adds parts for multipart mails, and set_payload sets the entire payload to a string for simple mails.

Message objects also have assorted properties (e.g., the filename of an attachment), and they provide a convenient walk generator method, which returns the next Message in the payload each time through in a for loop or other iteration context. Because the walker yields the root Message object first (i.e., self), single-part messages don’t have to be handled as a special case; a nonmultipart message is effectively a Message with a single item in its payload—itself.

Ultimately, the Message object structure closely mirrors the way mails are formatted as text. Special header lines in the mail’s text give its type (e.g., plain text or multipart), as well as the separator used between the content of nested parts. Since the underlying textual details are automated by the email package—both when parsing and when composing—we won’t go into further formatting details here.

If you are interested in seeing how this translates to real emails, a great way to learn mail structure is by inspecting the full raw text of messages displayed by email clients you already use, as we’ll see with some we meet in this book. In fact, we’ve already seen a few—see the raw text printed by our earlier POP email scripts for simple mail text examples. For more on the Message object, and email in general, consult the email package’s entry in Python’s library manual. We’re skipping details such as its available encoders and MIME object classes here in the interest of space.

Beyond the email package, the Python library includes other tools for mail-related processing. For instance, mimetypes maps a filename to and from a MIME type:

mimetypes.guess_type(filename): Maps a filename to a MIME type. Name spam.txt maps to text/plain.
mimetypes.guess_extension(contype): Maps a MIME type to a filename extension. Type text/html maps to .html.

We also used the mimetypes module earlier in this chapter to guess FTP transfer modes from filenames (see Example 13-10), as well as in Chapter 6, where we used it to guess a media player for a filename (see the examples there, including playfile.py, Example 6-23). For email, these can come in handy when attaching files to a new message (guess_type) and saving parsed attachments that do not provide a filename (guess_extension). In fact, this module’s source code is a fairly complete reference to MIME types. See the library manual for more on these tools.
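The two calls can be tried in isolation. In this sketch the filenames are made up, and exact results for less common types may vary with the platform's MIME tables:

```python
import mimetypes

for filename in ('spam.txt', 'photo.jpg', 'spam.txt.gz'):
    contype, encoding = mimetypes.guess_type(filename)    # (type, encoding) pair
    maintype, subtype = contype.split('/', 1)             # e.g. 'text' and 'plain'
    print(filename, '=>', maintype, subtype, encoding)

ext = mimetypes.guess_extension('text/html')              # includes the leading dot
print(ext)
```

Note that guess_type returns a pair: the second item names a compression encoding such as gzip when one is implied by the filename, and is None otherwise. For filenames it does not recognize, the type itself is None, so real code should check before splitting it.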

Basic email Package Interfaces in Action

Although we can’t provide an exhaustive reference here, let’s step through a simple interactive session to illustrate the fundamentals of email processing. To compose the full text of a message—to be delivered with smtplib, for instance—make a Message, assign headers to its keys, and set its payload to the message body. Converting to a string yields the mail text. This process is substantially simpler and less error-prone than the manual text operations we used earlier in Example 13-19 to build mail as strings:

>>> from email.message import Message
>>> m = Message()
>>> m['from'] = 'Jane Doe <[email protected]>'
>>> m['to']   = '[email protected]'
>>> m.set_payload('The owls are not what they seem...')
>>>
>>> s = str(m)
>>> print(s)
from: Jane Doe <[email protected]>
to: [email protected]

The owls are not what they seem...

Parsing a message’s text—like the kind you obtain with poplib—is similarly simple, and essentially the inverse: we get back a Message object from the text, with keys for headers and a payload for the body:

>>> s    # same as in prior interaction
'from: Jane Doe <[email protected]>\nto: [email protected]\n\nThe owls are not...'

>>> from email.parser import Parser
>>> x = Parser().parsestr(s)
>>> x
<email.message.Message object at 0x015EA9F0>
>>>
>>> x['From']
'Jane Doe <[email protected]>'
>>> x.get_payload()
'The owls are not what they seem...'
>>> x.items()
[('from', 'Jane Doe <[email protected]>'), ('to', '[email protected]')]

So far this isn’t much different from the older and now-defunct rfc822 module, but as we’ll see in a moment, things get more interesting when there is more than one part. For simple messages like this one, the message walk generator treats it as a single-part mail, of type plain text:

>>> for part in x.walk():
...     print(part.get_content_type())
...     print(part.get_payload())
...
text/plain
The owls are not what they seem...

Handling multipart messages

Making a mail with attachments is a little more work, but not much: we just make a root Message and attach nested Message objects created from the MIME type object that corresponds to the type of data we’re attaching. The MIMEText class, for instance, is a subclass of Message, which is tailored for text parts, and knows how to generate the right types of header information when printed. MIMEImage and MIMEAudio similarly customize Message for images and audio, and also know how to apply Base64 and other MIME encodings to binary data. The root message is where we store the main headers of the mail, and we attach parts here, instead of setting the entire payload—the payload is a list now, not a string. MIMEMultipart is a Message that provides the extra header protocol we need for the root:

>>> from email.mime.multipart import MIMEMultipart      # Message subclasses
>>> from email.mime.text import MIMEText                # with extra headers+logic
>>>
>>> top = MIMEMultipart()                               # root Message object
>>> top['from'] = 'Art <[email protected]>'            # subtype default=mixed
>>> top['to']   = '[email protected]'
>>>
>>> sub1 = MIMEText('nice red uniforms...\n')           # part Message attachments
>>> sub2 = MIMEText(open('data.txt').read())
>>> sub2.add_header('Content-Disposition', 'attachment', filename='data.txt')
>>> top.attach(sub1)
>>> top.attach(sub2)

When we ask for the text, a correctly formatted full mail text is returned, separators and all, ready to be sent with smtplib—quite a trick, if you’ve ever tried this by hand:

>>> text = top.as_string()    # or do: str(top) or print(top)
>>> print(text)
Content-Type: multipart/mixed; boundary="===============1574823535=="
MIME-Version: 1.0
from: Art <[email protected]>
to: [email protected]

--===============1574823535==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

nice red uniforms...

--===============1574823535==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="data.txt"

line1
line2
line3

--===============1574823535==--

If we are sent this message and retrieve it via poplib, parsing its full text yields a Message object just like the one we built to send. The message walk generator allows us to step through each part, fetching their types and payloads:

>>> text    # same as in prior interaction
'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ver...'

>>> from email.parser import Parser
>>> msg = Parser().parsestr(text)
>>> msg['from']
'Art <[email protected]>'

>>> for part in msg.walk():
...     print(part.get_content_type())
...     print(part.get_payload())
...     print()
...
multipart/mixed
[<email.message.Message object at 0x015EC610>,
<email.message.Message object at 0x015EC630>]

text/plain
nice red uniforms...


text/plain
line1
line2
line3

Multipart alternative messages (with text and HTML renditions of the same message) can be composed and parsed in similar fashion. Because email clients are able to parse and compose messages with a simple object-based API, they are freed to focus on user-interface instead of text processing.
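As a rough sketch of the alternative case, the following composes a message with plain-text and HTML renditions and then parses it back. The addresses and body text are placeholders, not values from the book's examples:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.parser import Parser

top = MIMEMultipart('alternative')                      # subtype given explicitly
top['from'] = 'sender@example.com'                      # placeholder addresses
top['to'] = 'reader@example.com'
top.attach(MIMEText('plain rendition\n', 'plain'))      # simplest rendition first
top.attach(MIMEText('<p>HTML rendition</p>\n', 'html'))

msg = Parser().parsestr(top.as_string())                # parse the generated text back
types = [part.get_content_type() for part in msg.walk()]
print(types)
```

The walk yields the root container first, followed by the two renditions in the order they were attached; a client would pick one of the nested parts to display.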

Unicode, Internationalization, and the Python 3.1 email Package

Now that I’ve shown you how “cool” the email package is, I unfortunately need to let you know that it’s not completely operational in Python 3.1. The email package works as shown for simple messages, but is severely impacted by Python 3.X’s Unicode/bytes string dichotomy in a number of ways.

In short, the email package in Python 3.1 is still somewhat coded to operate in the realm of 2.X str text strings. Because these have become Unicode in 3.X, and because some tools that email uses are now oriented toward bytes strings, which do not mix freely with str, a variety of conflicts crop up and cause issues for programs that depend upon this module.

At this writing, a new version of email is being developed which will handle bytes and Unicode encodings better, but the going consensus is that it won’t be folded back into Python until release 3.3 or later, long after this book’s release. Although a few patches might make their way into 3.2, the current sense is that fully addressing the package’s problems appears to require a full redesign.

To be fair, it’s a substantial problem. Email has historically been oriented toward single-byte ASCII text, and generalizing it for Unicode is difficult to do well. In fact, the same holds true for most of the Internet today—as discussed elsewhere in this chapter, FTP, POP, SMTP, and even webpage bytes fetched over HTTP pose the same sorts of issues. Interpreting the bytes shipped over networks as text is easy if the mapping is one-to-one, but allowing for arbitrary Unicode encoding in that text opens a Pandora’s box of dilemmas. The extra complexity is necessary today, but, as email attests, can be a daunting task.

Frankly, I considered not releasing this edition of this book until this package’s issues could be resolved, but I decided to go forward because a new email package may be years away (two Python releases, by all accounts). Moreover, the issues serve as a case study of the types of problems you’ll run into in the real world of large-scale software development. Things change over time, and program code is no exception.

Instead, this book’s examples provide new Unicode and Internationalization support but adopt policies to work around issues where possible. Programs in books are meant to be educational, after all, not commercially viable. Given the state of the email package that the examples depend on, though, the solutions used here might not be completely universal, and there may be additional Unicode issues lurking. To address the future, watch this book’s website (described in the Preface) for updated notes and code examples if/when the anticipated new email package appears. Here, we’ll work with what we have.

The good news is that we’ll be able to make use of email in its current form to build fairly sophisticated and full-featured email clients in this book anyhow. It still offers an amazing number of tools, including MIME encoding and decoding, message formatting and parsing, Internationalized headers extraction and construction, and more. The bad news is that this will require a handful of obscure workarounds and may need to be changed in the future, though few software projects are exempt from such realities.

Because email’s limitations have implications for later email code in this book, I’m going to quickly run through them in this section. Some of this can be safely saved for later reference, but parts of later examples may be difficult to understand if you don’t have this background. The upside is that exploring the package’s limitations here also serves as a vehicle for digging a bit deeper into the email package’s interfaces in general.

Parser decoding requirement

The first Unicode issue in Python 3.1’s email package is nearly a showstopper in some contexts: the bytes strings of the sort produced by poplib for mail fetches must be decoded to str prior to parsing with email. Unfortunately, because there may not be enough information to know how to decode the message bytes per Unicode, some clients of this package may need to be generalized to detect whole-message encodings prior to parsing; in worst cases other than email that may mandate mixed data types, the current package cannot be used at all. Here’s the issue live:

>>> text    # from prior example in this section
'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ver...'

>>> btext = text.encode()
>>> btext
b'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ve...'

>>> msg = Parser().parsestr(text)           # email parser expects Unicode str
>>> msg = Parser().parsestr(btext)          # but poplib fetches email as bytes!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\email\parser.py", line 82, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
TypeError: initial_value must be str or None, not bytes

>>> msg = Parser().parsestr(btext.decode())               # okay per default
>>> msg = Parser().parsestr(btext.decode('utf8'))         # ascii encoded (default)
>>> msg = Parser().parsestr(btext.decode('latin1'))       # ascii is same in all 3
>>> msg = Parser().parsestr(btext.decode('ascii'))

This is less than ideal, as a bytes-based email would be able to handle message encodings more directly. As mentioned, though, the email package is not really fully functional in Python 3.1, because of its legacy str focus, and the sharp distinction that Python 3.X makes between Unicode text and byte strings. In this case, its parser should accept bytes and not expect clients to know how to decode.

Because of that, this book’s email clients take simplistic approaches to decoding fetched message bytes to be parsed by email. Specifically, full-text decoding will try a user-configurable encoding name, then fall back on trying common types as a heuristic, and finally attempt to decode just message headers.
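The fallback policy just described might be sketched as follows. This is a hypothetical helper written for illustration, not the code used by this book's clients; the function name, the default encoding list, and the headers-only last resort are all assumptions:

```python
from email.parser import Parser

def decode_for_parse(raw, preferred='ascii', fallbacks=('utf-8', 'cp1252')):
    """
    Hypothetical helper: decode the bytes of a fetched message to str
    so that the email parser will accept it.
    """
    for name in (preferred,) + tuple(fallbacks):
        try:
            return raw.decode(name)                  # first encoding that works wins
        except (UnicodeDecodeError, LookupError):    # bad bytes, or unknown encoding name
            pass
    # last resort: keep only the header block, replacing undecodable bytes
    headers = raw.replace(b'\r\n', b'\n').split(b'\n\n', 1)[0]
    return headers.decode('ascii', errors='replace') + '\n\n'

raw = 'From: Jane\n\ncaf\u00e9\n'.encode('latin-1')  # simulated fetch result
msg = Parser().parsestr(decode_for_parse(raw))
print(msg['From'], ascii(msg.get_payload()))
```

Here the simulated bytes are neither valid ASCII nor valid UTF-8, so the third candidate is the one that succeeds. Listing a single-byte encoding such as latin-1 among the fallbacks would make the loop succeed for any input, at the risk of silently producing the wrong characters.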

This will suffice for the examples shown but may need to be enhanced for broader applicability. In some cases, encoding may have to be determined by other schemes such as inspecting email headers (if present at all), guessing from bytes structure analysis, or dynamic user feedback. Adding such enhancements in a robust fashion is likely too complex to attempt in a book’s example code, and it is better performed in common standard library tools in any event.

Really, robust decoding of mail text may not be possible today at all, if it requires headers inspections—we can’t inspect a message’s encoding information headers unless we parse the message, but we can’t parse a message with 3.1’s email package unless we already know the encoding. That is, scripts may need to parse in order to decode, but they need to decode in order to parse! The byte strings of poplib and Unicode strings of email in 3.1 are fundamentally at odds. Even within its own libraries, Python 3.X’s changes have created a chicken-and-egg dependency problem that still exists nearly two years after 3.0’s release.

Short of writing our own email parser, or pursuing other similarly complex approaches, the best bet today for fetched messages seems to be decoding per user preferences and defaults, and that’s how we’ll proceed in this edition. The PyMailGUI client of Chapter 14, for instance, will allow Unicode encodings for full mail text to be set on a per-session basis.

The real issue, of course, is that email in general is inherently complicated by the presence of arbitrary text encodings. Besides full mail text, we also must consider Unicode encoding issues for the text components of a message once it’s parsed—both its text parts and its message headers. To see why, let’s move on.

Note

Related Issue for CGI scripts: I should also note that the full text decoding issue may not be as large a factor for email as it is for some other email package clients. Because the original email standards call for ASCII text and require binary data to be MIME encoded, most emails are likely to decode properly according to a 7- or 8-bit encoding such as Latin-1.

As we’ll see in Chapter 15, though, a more insurmountable and related issue looms for server-side scripts that support CGI file uploads on the Web. Because Python’s cgi module also uses the email package to parse multipart form data; because this package requires data to be decoded to str for parsing; and because such data might have mixed text and binary data (including raw binary data that is not MIME-encoded, text of any encoding, and even arbitrary combinations of these), these uploads fail in Python 3.1 if any binary or incompatible text files are included. The cgi module triggers Unicode decoding or type errors internally, before the Python script has a chance to intervene.

CGI uploads worked in Python 2.X, because the str type represented both possibly encoded text and binary data. Saving this type’s content to a binary mode file as a string of bytes in 2.X sufficed for both arbitrary text and binary data such as images. Email parsing worked in 2.X for the same reason. For better or worse, the 3.X str/bytes dichotomy makes this generality impossible.

In other words, although we can generally work around the email parser’s str requirement for fetched emails by decoding per an 8-bit encoding, it’s much more malignant for web scripting today. Watch for more details on this in Chapter 15, and stay tuned for a future fix, which may have materialized by the time you read these words.

Text payload encodings: Handling mixed type results

Our next email Unicode issue seems to fly in the face of Python’s generic programming model: the data types of message payload objects may differ, depending on how they are fetched. Especially for programs that walk and process payloads of mail parts generically, this complicates code.

Specifically, the Message object’s get_payload method we used earlier accepts an optional decode argument to control automatic email-style MIME decoding (e.g., Base64, uuencode, quoted-printable). If this argument is passed in as 1 (or equivalently, True), the payload’s data is MIME-decoded when fetched, if required. Because this argument is so useful for complex messages with arbitrary parts, it will normally be passed as true in all cases. Binary parts are normally MIME-encoded, but even text parts might also be present in Base64 or another MIME form if their bytes fall outside email standards. Some types of Unicode text, for example, require MIME encoding.

The upshot is that get_payload normally returns str strings for str text parts, but returns bytes strings if its decode argument is true—even if the message part is known to be text by nature. If this argument is not used, the payload’s type depends upon how it was set: str or bytes. Because Python 3.X does not allow str and bytes to be mixed freely, clients that need to use the result in text processing or store it in files need to accommodate the difference. Let’s run some code to illustrate:

>>> from email.message import Message
>>> m = Message()
>>> m['From'] = 'Lancelot'
>>> m.set_payload('Line?...')

>>> m['From']
'Lancelot'
>>> m.get_payload()                 # str, if payload is str
'Line?...'
>>> m.get_payload(decode=1)         # bytes, if MIME decode (same as decode=True)
b'Line?...'

The combination of these different return types and Python 3.X’s strict str/bytes dichotomy can cause problems in code that processes the result, unless that code decodes carefully:

>>> m.get_payload(decode=True) + 'spam'                        # can't mix in 3.X!
TypeError: can't concat bytes to str
>>> m.get_payload(decode=True).decode() + 'spam'               # convert if required
'Line?...spam'

To make sense of these examples, it may help to remember that there are two different concepts of “encoding” for email text:

Email-style MIME encodings such as Base64, uuencode, and quoted-printable, which are applied to binary and otherwise unusual content to make them acceptable for transmission in email text
Unicode text encodings for strings in general, which apply to message text as well as its parts, and may be required after MIME encoding for text message parts

The email package handles email-style MIME encodings automatically when we pass decode=1 to fetch parsed payloads, or generate text for messages that have nonprintable parts, but scripts still need to take Unicode encodings into consideration because of Python 3.X’s sharp string types differentiation. For example, the first decode in the following refers to MIME, and the second to Unicode:

m.get_payload(decode=True).decode()  # to bytes via MIME, then to str via Unicode
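
The two layers can also be seen in isolation with the standard base64 and quopri modules, outside the email package. This is a minimal sketch: the sample values mirror the non-ASCII text used later in this section, and latin-1 is assumed to be the text's Unicode encoding.

```python
import base64
import quopri

# Step 1: undo the email-style MIME encoding (transfer text -> raw bytes)
raw_b64 = base64.b64decode('QeRC')          # Base64 form
raw_qp = quopri.decodestring(b'A=E4B')      # quoted-printable form
assert raw_b64 == raw_qp                    # same bytes either way

# Step 2: undo the Unicode encoding (raw bytes -> str)
text = raw_b64.decode('latin-1')            # assumed encoding for this sketch
print(ascii(text))
```

Either MIME scheme yields the same bytes; only the second step depends on the character set.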

Even without the MIME decode argument, the payload type may also differ if it is stored in different forms:

>>> m = Message(); m.set_payload('spam'); m.get_payload()      # fetched as stored
'spam'
>>> m = Message(); m.set_payload(b'spam'); m.get_payload()
b'spam'

Moreover, the same holds true for the text-specific MIME subclass (though as we’ll see later in this section, we cannot pass bytes to its constructor to force a binary payload):

>>> from email.mime.text import MIMEText
>>> m = MIMEText('Line...?')
>>> m['From'] = 'Lancelot'
>>> m['From']
'Lancelot'
>>> m.get_payload()
'Line...?'
>>> m.get_payload(decode=1)
b'Line...?'

Unfortunately, the fact that payloads might be either str or bytes today not only flies in the face of Python’s type-neutral mindset, it can complicate your code—scripts may need to convert in contexts that require one or the other type. For instance, GUI libraries might allow both, but file saves and web page content generation may be less flexible. In our example programs, we’ll process payloads as bytes whenever possible, but decode to str text in cases where required using the encoding information available in the header API described in the next section.
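
For file saves, one way to accommodate both payload types is to pick the file mode from the object's type. The following is only a sketch: the save_payload name, its default encoding, and the temporary filenames are assumptions made for illustration, not part of the email package or this book's examples.

```python
import os
import tempfile

def save_payload(payload, path, encoding='utf-8'):
    """Write a str or bytes payload to path; return the open mode used."""
    if isinstance(payload, bytes):
        mode = 'wb'                          # bytes: binary mode, no encoding
        with open(path, mode) as f:
            f.write(payload)
    else:
        mode = 'w'                           # str: text mode, explicit encoding
        with open(path, mode, encoding=encoding) as f:
            f.write(payload)
    return mode

tmpdir = tempfile.mkdtemp()
bin_path = os.path.join(tmpdir, 'part.bin')
txt_path = os.path.join(tmpdir, 'part.txt')
bin_mode = save_payload(b'A\xe4B', bin_path)                     # raw bytes
txt_mode = save_payload('A\xe4B', txt_path, encoding='latin-1')  # decoded text
```

Both calls leave the same three bytes on disk here, because the text-mode save names the encoding that the bytes were originally in.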

Text payload encodings: Using header information to decode

More profoundly, text in email can be even richer than implied so far—in principle, text payloads of a single message may be encoded in a variety of different Unicode schemes (e.g., three HTML webpage file attachments, all in different Unicode encodings, and possibly different than the full message text’s encoding). Although treating such text as binary byte strings can sometimes finesse encoding issues, saving such parts in text-mode files for opening must respect the original encoding types. Further, any text processing performed on such parts will be similarly type-specific.

Luckily, the email package both adds character-set headers when generating message text and retains character-set information for parts if it is present when parsing message text. For instance, adding non-ASCII text attachments simply requires passing in an encoding name—the appropriate message headers are added automatically on text generation, and the character set is available directly via the get_content_charset method:

>>> s = b'A\xe4B'
>>> s.decode('latin1')
'AäB'

>>> from email.message import Message
>>> m = Message()
>>> m.set_payload(b'A\xe4B', charset='latin1')      # or 'latin-1': see ahead
>>> t = m.as_string()
>>> print(t)
MIME-Version: 1.0
Content-Type: text/plain; charset="latin1"
Content-Transfer-Encoding: base64

QeRC

>>> m.get_content_charset()
'latin1'

Notice how email automatically applies Base64 MIME encoding to non-ASCII text parts on generation, to conform to email standards. The same is true for the more specific MIME text subclass of Message:

>>> from email.mime.text import MIMEText
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> t = m.as_string()
>>> print(t)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> m.get_content_charset()
'latin1'

Now, if we parse this message’s text string with email, we get back a new Message whose text payload is the Base64 MIME-encoded text used to represent the non-ASCII Unicode string. Requesting MIME decoding for the payload with decode=1 returns the byte string we originally attached:

>>> from email.parser import Parser
>>> q = Parser().parsestr(t)
>>> q
<email.message.Message object at 0x019ECA50>
>>> q.get_content_type()
'text/plain'
>>> q._payload
'QeRC\n'
>>> q.get_payload()
'QeRC\n'
>>> q.get_payload(decode=1)
b'A\xe4B'

However, running Unicode decoding on this byte string to convert to text fails if we rely on the default encoding of the bytes decode method (UTF-8). To be more accurate, and support a wide variety of text types, we need to use the character-set information saved by the parser and attached to the Message object. This is especially important if we need to save the data to a file: we either have to store as bytes in binary mode files, or specify the correct (or at least a compatible) Unicode encoding in order to use such strings for text-mode files. Decoding manually works the same way:

>>> q.get_payload(decode=1).decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected

>>> q.get_content_charset()
'latin1'
>>> q.get_payload(decode=1).decode('latin1')                    # known type
'AäB'
>>> q.get_payload(decode=1).decode(q.get_content_charset())     # allow any type
'AäB'

In fact, all the header details are available on Message objects, if we know where to look. The character set can also be absent entirely, in which case it’s returned as None; clients need to define policies for such ambiguous text (they might try common types, guess, or treat the data as a raw byte string):

>>> q['content-type']                                     # mapping interface
'text/plain; charset="latin1"'
>>> q.items()
[('Content-Type', 'text/plain; charset="latin1"'), ('MIME-Version', '1.0'),
('Content-Transfer-Encoding', 'base64')]

>>> q.get_params(header='Content-Type')                   # param interface
[('text/plain', ''), ('charset', 'latin1')]
>>> q.get_param('charset', header='Content-Type')
'latin1'

>>> charset = q.get_content_charset()                     # might be missing
>>> if charset:
...    print(q.get_payload(decode=1).decode(charset))
...
AäB
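
A policy for missing or wrong character sets might be sketched as follows. The decode_text_part function and its fallback list are assumptions chosen for illustration; it expects a non-multipart text part, and a real client would likely make the fallbacks a user setting.

```python
from email import message_from_string

def decode_text_part(part, fallbacks=('utf-8', 'latin-1')):
    """Return a non-multipart text part's payload as str.

    Tries the charset declared in the part's headers first, then each
    name in fallbacks (an arbitrary policy chosen for this sketch).
    """
    data = part.get_payload(decode=True)         # MIME decode: bytes
    declared = part.get_content_charset()        # None if not in headers
    candidates = ([declared] if declared else []) + list(fallbacks)
    for name in candidates:
        try:
            return data.decode(name)
        except (UnicodeDecodeError, LookupError):    # bad bytes or unknown name
            continue
    return data.decode('ascii', errors='replace')    # last resort: lossy

with_charset = message_from_string(
    'Content-Type: text/plain; charset="latin1"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'QeRC\n')
no_charset = message_from_string('From: Lancelot\n\nLine?...\n')

decoded_declared = decode_text_part(with_charset)    # uses header charset
decoded_guessed = decode_text_part(no_charset)       # falls back to utf-8
```

Because latin-1 maps every byte to a character, listing it last means the lossy branch is reached only if the fallback list is changed.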

This handles encodings for message text parts in parsed emails. For composing new emails, we still must apply session-wide user settings or allow the user to specify an encoding for each part interactively. In some of this book’s email clients, payload conversions are performed as needed—using encoding information in message headers after parsing and provided by users during mail composition.

Message header encodings: email package support

On a related note, the email package also provides support for encoding and decoding message headers themselves (e.g., From, Subject) per email standards when they are not simple text. Such headers are often called Internationalized (or i18n) headers, because they support inclusion of non-ASCII character set text in emails. This term is also sometimes used to refer to encoded text of message payloads; unlike message headers, though, message payload encoding is used for both international Unicode text and truly binary data such as images (as we’ll see in the next section).

Like mail payload parts, i18n headers are encoded specially for email, and may also be encoded per Unicode. For instance, here’s how to decode an encoded subject line from an arguably spammish email that just showed up in my inbox; its =?UTF-8?Q? preamble declares that the data following it is UTF-8 encoded Unicode text, which is also MIME-encoded per quoted-printable for transmission in email (in short, unlike the prior section’s part payloads, which declare their encodings in separate header lines, headers themselves may declare their Unicode and MIME encodings by embedding them in their own content this way):

>>> rawheader = '=?UTF-8?Q?Introducing=20Top=20Values=3A=20A=20Special=20Selecti
on=20of=20Great=20Money=20Savers?='

>>> from email.header import decode_header        # decode per email+MIME
>>> decode_header(rawheader)
[(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')]

>>> bin, enc = decode_header(rawheader)[0]        # and decode per Unicode
>>> bin, enc
(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')
>>> bin.decode(enc)
'Introducing Top Values: A Special Selection of Great Money Savers'

Subtly, the email package can return multiple parts if there are encoded substrings in the header, and each must be decoded individually and joined to produce decoded header text. Even more subtly, in 3.1, this package returns all bytes when any substring (or the entire header) is encoded but returns str for a fully unencoded header. Unencoded substrings returned as bytes are encoded per “raw-unicode-escape” in the package, an encoding scheme useful to convert str to bytes when no encoding type applies:

>>> from email.header import decode_header

>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode() using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

# join decoded parts if more than one
>>> parts = decode_header(S3)
>>> ' '.join(abytes.decode('raw-unicode-escape' if enc == None else enc)
...          for (abytes, enc) in parts)
'Man where did you get that assistant?'

We’ll use logic similar to the last step here in the mailtools package ahead, but also retain str substrings intact without attempting to decode.
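
As a rough sketch of that idea (not the mailtools code itself), the following hypothetical decode_header_text function decodes bytes parts and passes str parts through unchanged. One assumption to note: it joins parts with an empty string, because later Python 3 releases keep the whitespace of unencoded substrings in the decode_header result; under the 3.1 behavior shown above, where that whitespace is stripped, join on a space instead.

```python
from email.header import decode_header

def decode_header_text(rawheader):
    """Decode a possibly MIME-encoded header value into a single str."""
    pieces = []
    for value, enc in decode_header(rawheader):
        if isinstance(value, bytes):
            # bytes: MIME-decoded data, or unencoded text returned as bytes
            value = value.decode(enc or 'raw-unicode-escape')
        pieces.append(value)                 # str: already text, keep intact
    return ''.join(pieces)

S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
results = [decode_header_text(s) for s in (S1, S2, S3)]
```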

Note

Late-breaking news: As I write this in mid-2010, it seems possible that this mixed type, nonpolymorphic, and frankly, non-Pythonic API behavior may be addressed in a future Python release. In response to a rant posted on the Python developers list by a book author whose work you might be familiar with, there is presently a vigorous discussion of the topic there. Among other ideas is a proposal for a bytes-like type which carries with it an explicit Unicode encoding; this may make it possible to treat some text cases in a more generic fashion. While it’s impossible to foresee the outcome of such proposals, it’s good to see that the issues are being actively explored. Stay tuned to this book’s website for further developments in the Python 3.X library API and Unicode stories.

Message address header encodings and parsing, and header creation

One wrinkle pertaining to the prior section: for message headers that contain email addresses (e.g., From), the name component of the name/address pair might be encoded this way as well. Because the email package’s header parser expects encoded substrings to be followed by whitespace or the end of string, we cannot ask it to decode a complete address-related header; quotes around name components cause the decode to fail.

To support such Internationalized address headers, we must also parse out the first part of the email address and then decode. First of all, we need to extract the name and address parts of an email address using email package tools:

>>> from email.utils import parseaddr, formataddr
>>> p = parseaddr('"Smith, Bob" <[email protected]>')      # split into name/addr pair
>>> p                                                # unencoded addr
('Smith, Bob', '[email protected]')
>>> formataddr(p)
'"Smith, Bob" <[email protected]>'

>>> parseaddr('Bob Smith <[email protected]>')             # unquoted name part
('Bob Smith', '[email protected]')
>>> formataddr(parseaddr('Bob Smith <[email protected]>'))
'Bob Smith <[email protected]>'

>>> parseaddr('[email protected]')                          # simple, no name
('', '[email protected]')
>>> formataddr(parseaddr('[email protected]'))
'[email protected]'

Fields with multiple addresses (e.g., To) separate individual addresses by commas. Since email names might embed commas, too, blindly splitting on commas to run each through parsing won’t always work. Instead, another utility can be used to parse each address individually: getaddresses ignores commas in names when splitting apart separate addresses, and parseaddr does, too, because it simply returns the first pair in the getaddresses result (some line breaks were added to the following for legibility):

>>> from email.utils import getaddresses
>>> multi = '"Smith, Bob" <[email protected]>, Bob Smith <[email protected]>, [email protected],
"Bob" <[email protected]>'

>>> getaddresses([multi])
[('Smith, Bob', '[email protected]'), ('Bob Smith', '[email protected]'), ('', '[email protected]'),
('Bob', '[email protected]')]

>>> [formataddr(pair) for pair in getaddresses([multi])]
['"Smith, Bob" <[email protected]>', 'Bob Smith <[email protected]>', '[email protected]',
'Bob <[email protected]>']

>>> ', '.join([formataddr(pair) for pair in getaddresses([multi])])
'"Smith, Bob" <[email protected]>, Bob Smith <[email protected]>, [email protected],
Bob <[email protected]>'

>>> getaddresses(['[email protected]'])     # handles single address cases too
[('', '[email protected]')]
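
To see why blind comma splitting fails, compare it with getaddresses on a header whose first name embeds a comma. The addresses below are placeholders invented for this sketch.

```python
from email.utils import getaddresses

# Hypothetical header value: the first display name embeds a comma
header = '"Smith, Bob" <bob@example.com>, Ann Jones <ann@example.com>'

naive = [piece.strip() for piece in header.split(',')]
pairs = getaddresses([header])

print(len(naive), naive)     # quoted name is cut in two by the naive split
print(len(pairs), pairs)     # one (name, address) pair per real address
```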

Now, decoding email addresses is really just an extra step before and after the normal header decoding logic we saw earlier:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" <[email protected]>'

>>> from email.utils import parseaddr, formataddr
>>> from email.header import decode_header

>>> name, addr = parseaddr(rawfromheader)             # split into name/addr parts
>>> name, addr
('=?UTF-8?Q?Walmart?=', '[email protected]')

>>> abytes, aenc = decode_header(name)[0]             # do email+MIME decoding
>>> abytes, aenc
(b'Walmart', 'utf-8')

>>> name = abytes.decode(aenc)                        # do Unicode decoding
>>> name
'Walmart'

>>> formataddr((name, addr))                          # put parts back together
'Walmart <[email protected]>'

Although From headers will typically have just one address, to be fully robust we need to apply this to every address in headers, such as To, Cc, and Bcc. Again, the multiaddress getaddresses utility avoids comma clashes between names and address separators; since it also handles the single address case, it suffices for From headers as well:

>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" <[email protected]>'
>>> rawtoheader = rawfromheader + ', ' + rawfromheader
>>> rawtoheader
'"=?UTF-8?Q?Walmart?=" <[email protected]>, "=?UTF-8?Q?Walmart?=" <newslet
[email protected]>'

>>> pairs = getaddresses([rawtoheader])
>>> pairs
[('=?UTF-8?Q?Walmart?=', '[email protected]'), ('=?UTF-8?Q?Walmart?=', 'ne
[email protected]')]

>>> addrs = []
>>> for name, addr in pairs:
...     abytes, aenc = decode_header(name)[0]      # email+MIME
...     name = abytes.decode(aenc)                 # Unicode
...     addrs.append(formataddr((name, addr)))     # one or more addrs
...
>>> ', '.join(addrs)
'Walmart <[email protected]>, Walmart <[email protected]>'

These tools are generally forgiving for unencoded content and return them intact. To be robust, though, the last portion of code here should also allow for multiple parts returned by decode_header (for encoded substrings), None encoding values for parts (for unencoded substrings), and str substring values instead of bytes (for fully unencoded names).
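
A more defensive variant along those lines might look like the following sketch. The function names and the example.com addresses are assumptions made for illustration; the code returns decoded (name, address) pairs rather than formatted strings, and joins decoded parts with an empty string on the assumption that decode_header keeps whitespace in unencoded substrings, as later Python 3 releases do.

```python
from email.header import decode_header
from email.utils import getaddresses

def decode_name(rawname):
    """MIME- and Unicode-decode one display name; str parts pass through."""
    pieces = []
    for value, enc in decode_header(rawname):      # may yield several parts
        if isinstance(value, bytes):               # None enc: unencoded bytes
            value = value.decode(enc or 'raw-unicode-escape')
        pieces.append(value)
    return ''.join(pieces)

def decode_address_header(rawheader):
    """Return (decoded name, address) pairs for a From/To/Cc/Bcc value."""
    return [(decode_name(name) if name else name, addr)
            for name, addr in getaddresses([rawheader])]

# Addresses here are placeholders invented for this sketch
raw = ('"=?UTF-8?Q?Walmart?=" <news@example.com>, '
       'Bob Smith <bob@example.com>, ann@example.com')
pairs = decode_address_header(raw)
print(pairs)
```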

Decoding this way applies both MIME and Unicode decoding steps to fetched mails. Creating properly encoded headers for inclusion in new mails composed and sent is similarly straightforward:

>>> from email.header import make_header
>>> hdr = make_header([(b'A\xc4B\xe4C', 'latin-1')])
>>> print(hdr)
AÄBäC
>>> print(hdr.encode())
=?iso-8859-1?q?A=C4B=E4C?=
>>> decode_header(hdr.encode())
[(b'A\xc4B\xe4C', 'iso-8859-1')]

This can be applied to entire headers such as Subject, as well as the name component of each email address in an address-related header line such as From and To (use getaddresses to split into individual addresses first if needed). The Header object provides an alternative interface; both techniques handle additional details, such as line lengths, for which we’ll defer to Python manuals:

>>> from email.header import Header
>>> h = Header(b'A\xe4B\xc4X', charset='latin-1')
>>> h.encode()
'=?iso-8859-1?q?A=E4B=C4X?='
>>>
>>> h = Header('spam', charset='ascii')       # same as Header('spam')
>>> h.encode()
'spam'
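
When composing, a client might encode a header only if it actually contains non-ASCII text. The following is a minimal sketch of that policy; the encode_header_text name and its utf-8 default are assumptions, not an email package API.

```python
from email.header import Header, decode_header, make_header

def encode_header_text(text, charset='utf-8'):
    """Return header text for a new message: as is if ASCII, else MIME-encoded."""
    try:
        text.encode('ascii')
        return text                                  # plain ASCII: no encoding
    except UnicodeEncodeError:
        return Header(text, charset=charset).encode()

plain = encode_header_text('spam')
encoded = encode_header_text('A\xc4B\xe4C', charset='latin-1')
roundtrip = str(make_header(decode_header(encoded)))   # decode it back to str
print(plain, encoded)
```

The round trip through decode_header and make_header is a quick check that a receiver can recover the original text.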

The mailtools package ahead and its PyMailGUI client of Chapter 14 will use these interfaces to automatically decode message headers in fetched mails per their content for display, and to encode headers sent that are not in ASCII format. The latter also applies to the name component of email addresses, and assumes that SMTP servers will allow these to pass. This may encroach on some SMTP server issues which we don’t have space to address in this book. See the Web for more on SMTP headers handling. For more on headers decoding, see also file _test-i18n-headers.py in the examples package; it decodes additional subject and address-related headers using mailtools methods, and displays them in a tkinter Text widget, a foretaste of how these will be displayed in PyMailGUI.

Workaround: Message text generation for binary attachment payloads is broken

Our last two email Unicode issues are outright bugs which we must work around today, though they will almost certainly be fixed in a future Python release. The first breaks message text generation for all but trivial messages—the email package today no longer supports generation of full mail text for messages that contain any binary parts, such as images or audio files. Without coding workarounds, only simple emails that consist entirely of text parts can be composed and generated in Python 3.1’s email package; any MIME-encoded binary part causes mail text generation to fail.

This is a bit tricky to understand without poring over email’s source code (which, thankfully, we can in the land of open source), but to demonstrate the issue, first notice how simple text payloads are rendered as full message text when printed as we’ve already seen:

C:\...\PP4E\Internet\Email> python
>>> from email.message import Message            # generic message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> m.set_payload(open('text.txt').read())       # payload is str text
>>> print(m)                                     # print uses as_string()
From: [email protected]

spam
Spam
SPAM!

As we’ve also seen, for convenience, the email package also provides subclasses of the Message object, tailored to add message headers that provide the extra descriptive details used by email clients to know how to process the data:

>>> from email.mime.text import MIMEText         # Message subclass with headers
>>> text = open('text.txt').read()
>>> m = MIMEText(text)                           # payload is str text
>>> m['From'] = '[email protected]'
>>> print(m)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: [email protected]

spam
Spam
SPAM!

This works for text, but watch what happens when we try to render a message part with truly binary data, such as an image that could not be decoded as Unicode text:

>>> from email.message import Message            # generic Message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> bytes = open('monkeys.jpg', 'rb').read()     # read binary bytes (not Unicode)
>>> m.set_payload(bytes)                         # we set the payload to bytes
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>

>>> m.get_payload()[:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00x\x00x\x00\x00'

The problem here is that the email package’s text generator assumes that the message’s payload data is a Base64 (or similar) encoded str text string by generation time, not bytes. Really, the error is probably our fault in this case, because we set the payload to raw bytes manually. We should use the MIMEImage MIME subclass tailored for images; if we do, the email package internally performs Base64 MIME email encoding on the data when the message object is created. Unfortunately, it still leaves it as bytes, not str, despite the fact the whole point of Base64 is to change binary data to text (though the exact Unicode flavor this text should take may be unclear). This leads to additional failures in Python 3.1:

>>> from email.mime.image import MIMEImage       # Message subclass with hdrs+base64
>>> bytes = open('monkeys.jpg', 'rb').read()     # read binary bytes again
>>> m = MIMEImage(bytes)                         # MIME class does Base64 on data
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>

>>> m.get_payload()[:40]                         # this is already Base64 text
b'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'

>>> m.get_payload()[:40].decode('ascii')         # but it's still bytes internally!
'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'

In other words, not only does the Python 3.1 email package not fully support the Python 3.X Unicode/bytes dichotomy, it was actually broken by it. Luckily, there’s a workaround for this case.

To address this specific issue, I opted to create a custom encoding function for binary MIME attachments, and pass it in to the email package’s MIME message object subclasses for all binary data types. This custom function is coded in the upcoming mailtools package of this chapter (Example 13-23). Because it is used by email to encode from bytes to text at initialization time, it is able to decode to ASCII text per Unicode as an extra step, after running the original call to perform Base64 encoding and arrange content-encoding headers. The fact that email does not do this extra Unicode decoding step itself is a genuine bug in that package (albeit, one introduced by changes elsewhere in Python standard libraries), but the workaround does its job:

# in mailtools.mailSender module ahead in this chapter...
def fix_encode_base64(msgobj):
     from email.encoders import encode_base64
     encode_base64(msgobj)                # what email does normally: leaves bytes
     bytes = msgobj.get_payload()         # bytes fails in email pkg on text gen
     text  = bytes.decode('ascii')        # decode to unicode str so text gen works
     ...line splitting logic omitted...
     msgobj.set_payload('\n'.join(lines))

>>> from email.mime.image import MIMEImage
>>> from mailtools.mailSender import fix_encode_base64      # use custom workaround
>>> bytes = open('monkeys.jpg', 'rb').read()
>>> m = MIMEImage(bytes, _encoder=fix_encode_base64)        # convert to ascii str
>>> print(m.as_string()[:500])
Content-Type: image/jpeg
MIME-Version: 1.0
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcG
BwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwM
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAHoAvQDASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc

>>> print(m)   # to print the entire message: very long

Another possible workaround involves defining a custom MIMEImage class that is like the original but does not attempt to perform Base64 encoding on creation; that way, we could encode and translate to str before message object creation, but still make use of the original class’s header-generation logic. If you take this route, though, you’ll find that it requires repeating (really, cutting and pasting) far too much of the original logic to be reasonable, and this repeated code would have to mirror any future email changes:

>>> from email.mime.nonmultipart import MIMENonMultipart
>>> class MyImage(MIMENonMultipart):
...     def __init__(self, imagedata, subtype):
...         MIMENonMultipart.__init__(self, 'image', subtype)
...         self.set_payload(imagedata)

...repeat all the base64 logic here, with an extra ASCII Unicode decode...
>>> m = MyImage(text_from_bytes)

Interestingly, this regression in email actually reflects an unrelated change in Python’s base64 module made in 2007, which was completely benign until the Python 3.X bytes/str differentiation came online. Prior to that, the email encoder worked in Python 2.X, because bytes was really str. In 3.X, though, because base64 returns bytes, the normal mail encoder in email also leaves the payload as bytes, even though it’s been encoded to Base64 text form. This in turn breaks email text generation, because it assumes the payload is text in this case, and requires it to be str. As is common in large-scale software systems, the effects of some 3.X changes may have been difficult to anticipate or accommodate in full.

By contrast, parsing binary attachments (as opposed to generating text for them) works fine in 3.X, because the parsed message payload is saved in message objects as a Base64-encoded str string, not bytes, and is converted to bytes only when fetched. This bug seems likely to also go away in a future Python and email package (perhaps even as a simple patch in Python 3.2), but it’s more serious than the other Unicode decoding issues described here, because it prevents mail composition for all but trivial mails.

The flexibility afforded by the package and the Python language allows such a workaround to be developed external to the package, rather than hacking the package’s code directly. With open source and forgiving APIs, you rarely are truly stuck.

Note

Late-breaking news: This section’s bug is scheduled to be fixed in Python 3.2, making our workaround here unnecessary in this and later Python releases. This is per communications with members of Python’s email special interest group (on the “email-sig” mailing list).

Regrettably, this fix didn’t appear until after this chapter and its examples had been written. I’d like to remove the workaround and its description entirely, but this book is based on Python 3.1, both before and after the fix was incorporated.

So that it works under Python 3.2 alpha, too, though, the workaround code ahead was specialized just before publication to check for bytes prior to decoding. Moreover, the workaround still must manually split lines in Base64 data, because 3.2 still does not.
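
The version-tolerant idea can be sketched roughly as follows. This is not the mailtools code: the tolerant_encode_base64 name, the linelen parameter, and the stand-in image bytes are assumptions made for illustration, and the line splitting shown is just one simple way to do it.

```python
import base64
from email.encoders import encode_base64
from email.mime.image import MIMEImage

def tolerant_encode_base64(msgobj, linelen=76):
    """Base64-encode msgobj's payload and store it as line-split str text."""
    encode_base64(msgobj)                     # standard step: payload + header
    payload = msgobj.get_payload()
    if isinstance(payload, bytes):            # only some releases leave bytes
        payload = payload.decode('ascii')     # Base64 output is always ASCII
    data = ''.join(payload.split())           # drop any line breaks present
    lines = [data[i:i + linelen] for i in range(0, len(data), linelen)]
    msgobj.set_payload('\n'.join(lines))

fake_image = bytes(range(256)) * 4            # stand-in bytes, not a real JPEG
m = MIMEImage(fake_image, _subtype='jpeg', _encoder=tolerant_encode_base64)
text = m.as_string()                          # text generation now succeeds
print(text[:120])
```

Checking the payload's type before decoding lets the same function run whether or not the underlying encoder has already produced str.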

Workaround: Message composition for non-ASCII text parts is broken

Our final email Unicode issue is as severe as the prior one: changes like that of the prior section introduced yet another regression for mail composition. In short, it’s impossible to make text message parts today without specializing for different Unicode encodings.

Some types of text are automatically MIME-encoded for transmission. Unfortunately, because of the str/bytes split, the MIME text message class in email now requires different string object types for different Unicode encodings. The net effect is that you now have to know how the email package will process your text data when making a text message object, or repeat most of its logic redundantly.

For example, to properly generate Unicode encoding headers and apply required MIME encodings, here’s how we must proceed today for common Unicode text types:

>>> m = MIMEText('abc', _charset='ascii')             # pass text for ascii
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

abc
>>> m = MIMEText('abc', _charset='latin-1')           # pass text for latin-1
>>> print(m)                                          # but not for 'latin1': ahead
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText(b'abc', _charset='utf-8')            # pass bytes for utf8
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj

This works, but if you look closely, you’ll notice that we must pass str to the first two, but bytes to the third. That requires that we special-case code for Unicode types based upon the package’s internal operation. Types other than those expected for a Unicode encoding don’t work at all, because of newly invalid str/bytes combinations that occur inside the email package in 3.1:

>>> m = MIMEText('abc', _charset='ascii')
>>> m = MIMEText(b'abc', _charset='ascii')            # bug: assumes 2.X str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\encoders.py", line 60, in encode_7or8bit
    orig.encode('ascii')
AttributeError: 'bytes' object has no attribute 'encode'

>>> m = MIMEText('abc', _charset='latin-1')
>>> m = MIMEText(b'abc', _charset='latin-1')          # bug: qp uses str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface

>>> m = MIMEText(b'abc', _charset='utf-8')
>>> m = MIMEText('abc', _charset='utf-8')             # bug: base64 uses bytes
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str

Moreover, the email package is pickier about encoding name synonyms than Python and most other tools are: “latin-1” is detected as a quoted-printable MIME type, but “latin1” is unknown and so defaults to Base64 MIME. In fact, this is why Base64 was used for the “latin1” Unicode type earlier in this section—an encoding choice that is irrelevant to any recipient that understands the “latin1” synonym, including Python itself. Unfortunately, that means that we also need to pass in a different string type if we use a synonym the package doesn’t understand today:

>>> m = MIMEText('abc', _charset='latin-1')            # str for 'latin-1'
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc
>>> m = MIMEText('abc', _charset='latin1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str

>>> m = MIMEText(b'abc', _charset='latin1')            # bytes for 'latin1'!
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj

There are ways to add aliases and new encoding types in the email package, but they’re not supported out of the box. Programs that care about being robust would have to cross-check the user’s spelling, which may be valid for Python itself, against that expected by email. This also holds true if your data is not ASCII in general—you’ll have to first decode to text in order to use the expected “latin-1” name because its quoted-printable MIME encoding expects str, even though bytes are required if “latin1” triggers the default Base64 MIME:

>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> m = MIMEText(b'A\xe4B', _charset='latin-1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface

>>> m = MIMEText(b'A\xe4B'.decode('latin1'), _charset='latin-1')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

A=E4B
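
Rather than hardcoding which names need which type, a script could ask the email package how it will MIME-encode a given character set and choose the string type from that. The sketch below rests on the Python 3.1 behavior shown in this section (bytes for Base64, str otherwise); both function names are assumptions, and later email releases may accept str in every case, so treat this as version-specific.

```python
from email.charset import Charset

def payload_type_for(charset_name):
    """Return the type (str or bytes) that 3.1's MIMEText expects, per the text."""
    body_encoding = Charset(charset_name).get_body_encoding()
    return bytes if body_encoding == 'base64' else str

def coerce_for_charset(data, charset_name):
    """Convert str or bytes data to the type chosen by payload_type_for."""
    wanted = payload_type_for(charset_name)
    if isinstance(data, wanted):
        return data                           # already the expected type
    if wanted is str:
        return data.decode(charset_name)      # bytes -> str
    return data.encode(charset_name)          # str -> bytes

for name in ('ascii', 'latin-1', 'utf-8'):
    print(name, payload_type_for(name).__name__)
```

This keeps the special-casing in one place, so a change in the package's encoding choices requires no edits to calling code.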

In fact, the text message object doesn’t check to see that the data you’re MIME-encoding is valid per Unicode in general: we can send invalid UTF-8 text, but the receiver may have trouble decoding it:

>>> m = MIMEText(b'A\xe4B', _charset='utf-8')
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> b'A\xe4B'.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...

>>> import base64
>>> base64.b64decode(b'QeRC')
b'Axe4B'
>>> base64.b64decode(b'QeRC').decode('utf')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
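
Because the message object does not validate its data, a sender that wants to catch such problems early could check the bytes itself before composing. The following is only a minimal sketch of that idea; the helper name is_valid_text is hypothetical and is not part of the email package or of this book's examples:

```python
def is_valid_text(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under the named Unicode encoding.

    Hypothetical helper for illustration; an unknown encoding name
    still raises LookupError, which is deliberately not caught here.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

raw = b'A\xe4B'                          # the same bytes used in the session above
print(is_valid_text(raw, 'latin-1'))     # valid: every byte is a Latin-1 character
print(is_valid_text(raw, 'utf-8'))       # invalid: \xe4 starts an incomplete sequence
```

A check like this costs one extra decode per part, but it moves the failure from the receiver back to the sender, where it can still be corrected.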

So what should we do when we need to attach text to a composed message, but the data type required for that text is dictated indirectly by its Unicode encoding name? The generic Message superclass doesn’t help here directly if we specify an encoding, as it exhibits the same encoding-specific behavior:

>>> m = Message()
>>> m.set_payload('spam', charset='us-ascii')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> m = Message()
>>> m.set_payload(b'spam', charset='us-ascii')
AttributeError: 'bytes' object has no attribute 'encode'

>>> m.set_payload('spam', charset='utf-8')
TypeError: must be bytes or buffer, not str

Although we could try to work around these issues by repeating much of the code that email runs, the redundancy would make us hopelessly tied to its current implementation and dependent upon its future changes. The following, for example, parrots the steps that email runs internally to create a text message object for ASCII encoding text; unlike the MIMEText class, this approach allows all data to be read from files as binary byte strings, even if it’s simple ASCII:

>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'us-ascii')
>>> m.add_header('Content-Transfer-Encoding', '7bit')
>>> data = b'spam'
>>> m.set_payload(data.decode('ascii'))           # data read as bytes here
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
>>> print(MIMEText('spam', _charset='ascii'))     # same, but type-specific
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam

To do the same for other kinds of text that require MIME encoding, just insert an extra encoding step; although we’re concerned with text parts here, a similar imitative approach could address the binary parts text generation bug we met earlier:

>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'utf-8')
>>> m.add_header('Content-Transfer-Encoding', 'base64')
>>> data = b'spam'
>>> from binascii import b2a_base64               # add MIME encode if needed
>>> data = b2a_base64(data)                       # data read as bytes here too
>>> m.set_payload(data.decode('ascii'))
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

c3BhbQ==

>>> print(MIMEText(b'spam', _charset='utf-8'))    # same, but type-specific
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

c3BhbQ==

This works, but besides the redundancy and dependency it creates, to use this approach broadly we’d also have to generalize to account for all the various kinds of Unicode encodings and MIME encodings possible, like the email package already does internally. We might also have to support encoding name synonyms to be flexible, adding further redundancy. In other words, this requires additional work, and in the end, we’d still have to specialize our code for different Unicode types.

Any way we go, some dependence on the current implementation seems unavoidable today. It seems the best we can do here, apart from hoping for an improved email package in a few years’ time, is to specialize text message construction calls by Unicode type, and assume both that encoding names match those expected by the package and that message data is valid for the Unicode type selected. Here is the sort of arguably magic code that the upcoming mailtools package (again in Example 13-23) will apply to choose text types:

>>> from email.charset import Charset, BASE64, QP
>>> for e in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
...     cset = Charset(e)
...     benc = cset.body_encoding
...     if benc in (None, QP):
...         print(e, benc, 'text')              # read/fetch data as str
...     else:
...         print(e, benc, 'binary')            # read/fetch data as bytes
...
us-ascii None text
latin-1 1 text
utf8 2 binary
latin1 2 binary
ascii None text

We’ll proceed this way in this book, with the major caveat that this code will almost certainly require changes in the future because of its strong coupling with the current email implementation.
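
The loop above can be packaged as a small function that a client calls before reading or fetching a text part. This is only a sketch of the idea under the same assumptions as the interactive session; the function name payload_kind is hypothetical, and the upcoming mailtools code may structure this choice differently:

```python
from email.charset import Charset, QP

def payload_kind(encoding_name):
    """Classify how text for this encoding should be supplied to email.

    Returns 'text' when the package leaves the body unencoded or uses
    quoted-printable (str expected in Python 3.1), and 'binary' when it
    applies Base64 (bytes expected in Python 3.1).
    """
    body_encoding = Charset(encoding_name).body_encoding
    return 'text' if body_encoding in (None, QP) else 'binary'

for name in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
    print(name, payload_kind(name))
```

Keeping the test in one function at least confines the dependence on email internals to a single place, which is easier to update if the package's behavior changes.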

Note

Late-breaking news: As with the prior section’s bug, it now appears that this section’s bug will also be fixed in Python 3.2, making the workaround here unnecessary in that and later Python releases. The nature of the fix is unknown, though, and we still need the workaround for the version of Python current when this chapter was written. As of just before publication, the alpha release of 3.2 is still somewhat type specific on this issue, but now accepts either str or bytes for text that triggers Base64 encodings, instead of just bytes.

Summary: Solutions and workarounds

The email package in Python 3.1 provides powerful tools for parsing and composing mails, and can be used as the basis for full-featured mail clients like those in this book with just a few workarounds. As you can see, though, it is less than fully functional today. Because of that, further specializing code to its current API is perhaps a temporary solution. Short of writing our own email parser and composer (not a practical option in a finitely-sized book!), some compromises are in order here. Moreover, the inherent complexity of Unicode support in email places some limits on how much we can pursue this thread in this book.

In this edition, we will support Unicode encodings of text parts and headers in messages composed, and respect the Unicode encodings in text parts and mail headers of messages fetched. To make this work with the partially crippled email package in Python 3.1, though, we’ll apply the following Unicode policies in various email clients in this book:

Use user preferences and defaults for the preparse decoding of full mail text fetched and encoding of text payloads sent.
Use header information, if available, to decode the bytes payloads returned by get_payload when text parts must be treated as str text, but use binary mode files to finesse the issue in other contexts.
Use formats prescribed by email standard to decode and encode message headers such as From and Subject if they are not simple text.
Apply the fix described to work around the message text generation issue for binary parts.
Special-case construction of text message objects according to Unicode types and email behavior.

These are not necessarily complete solutions. For example, some of this edition’s email clients allow for Unicode encodings for both text attachments and mail headers, but they do nothing about encoding the full text of messages sent beyond the policies inherited from smtplib and implement policies that might be inconvenient in some use cases. But as we’ll see, despite their limitations, our email clients will still be able to handle complex email tasks and a very large set of emails.

Again, since this story is in flux in Python today, watch this book’s website for updates that may improve or be required of code that uses email in the future. A future email may handle Unicode encodings more accurately. As with Python 3.X itself, though, backward compatibility may be sacrificed in the process, requiring updates to this book’s code. For more on this issue, see the Web as well as up-to-date Python release notes.

Although this quick tour captures the basic flavor of the interface, we need to step up to larger examples to see more of the email package’s power. The next section takes us on the first of those steps.

A Console-Based Email Client

Let’s put together what we’ve learned about fetching, sending, parsing, and composing email in a simple but functional command-line console email tool. The script in Example 13-20 implements an interactive email session—users may type commands to read, send, and delete email messages. It uses poplib and smtplib to fetch and send, and uses the email package directly to parse and compose.

Example 13-20. PP4E\Internet\Email\pymail.py

#!/usr/local/bin/python
"""
##########################################################################
pymail - a simple console email interface client in Python; uses Python
poplib module to view POP email messages, smtplib to send new mails, and
the email package to extract mail headers and payload and compose mails;
##########################################################################
"""

import poplib, smtplib, email.utils, mailconfig
from email.parser  import Parser
from email.message import Message
fetchEncoding = mailconfig.fetchEncoding

def decodeToUnicode(messageBytes, fetchEncoding=fetchEncoding):
    """
    4E, Py3.1: decode fetched bytes to str Unicode string for display or parsing;
    use global setting (or by platform default, hdrs inspection, intelligent guess);
    in Python 3.2/3.3, this step may not be required: if so, return message intact;
    """
    return [line.decode(fetchEncoding) for line in messageBytes]

def splitaddrs(field):
    """
    4E: split address list on commas, allowing for commas in name parts
    """
    pairs = email.utils.getaddresses([field])                 # [(name,addr)]
    return [email.utils.formataddr(pair) for pair in pairs]   # [name <addr>]

def inputmessage():
    import sys
    From = input('From? ').strip()
    To   = input('To?   ').strip()           # datetime hdr may be set auto
    To   = splitaddrs(To)                    # possible many, name+<addr> okay
    Subj = input('Subj? ').strip()           # don't split blindly on ',' or ';'
    print('Type message text, end with line="."')
    text = ''
    while True:
        line = sys.stdin.readline()
        if line == '.\n': break
        text += line
    return From, To, Subj, text

def sendmessage():
    From, To, Subj, text = inputmessage()
    msg = Message()
    msg['From']    = From
    msg['To']      = ', '.join(To)                     # join for hdr, not send
    msg['Subject'] = Subj
    msg['Date']    = email.utils.formatdate()          # curr datetime, rfc2822
    msg.set_payload(text)
    server = smtplib.SMTP(mailconfig.smtpservername)
    try:
        failed = server.sendmail(From, To, str(msg))   # may also raise exc
    except:
        print('Error - send failed')
    else:
        if failed: print('Failed:', failed)

def connect(servername, user, passwd):
    print('Connecting...')
    server = poplib.POP3(servername)
    server.user(user)                     # connect, log in to mail server
    server.pass_(passwd)                  # pass is a reserved word
    print(server.getwelcome())            # print returned greeting message
    return server

def loadmessages(servername, user, passwd, loadfrom=1):
    server = connect(servername, user, passwd)
    try:
        print(server.list())
        (msgCount, msgBytes) = server.stat()
        print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
        print('Retrieving...')
        msgList = []                                     # fetch mail now
        for i in range(loadfrom, msgCount+1):            # empty if low >= high
            (hdr, message, octets) = server.retr(i)      # save text on list
            message = decodeToUnicode(message)           # 4E, Py3.1: bytes to str
            msgList.append('\n'.join(message))           # leave mail on server
    finally:
        server.quit()                                    # unlock the mail box
    assert len(msgList) == (msgCount - loadfrom) + 1     # msg nums start at 1
    return msgList

def deletemessages(servername, user, passwd, toDelete, verify=True):
    print('To be deleted:', toDelete)
    if verify and input('Delete?')[:1] not in ['y', 'Y']:
        print('Delete cancelled.')
    else:
        server = connect(servername, user, passwd)
        try:
            print('Deleting messages from server...')
            for msgnum in toDelete:                 # reconnect to delete mail
                server.dele(msgnum)                 # mbox locked until quit()
        finally:
            server.quit()

def showindex(msgList):
    count = 0                                       # show some mail headers
    for msgtext in msgList:
        msghdrs = Parser().parsestr(msgtext, headersonly=True)  # expects str in 3.1
        count += 1
        print('%d:\t%d bytes' % (count, len(msgtext)))
        for hdr in ('From', 'To', 'Date', 'Subject'):
            try:
                print('\t%-8s=>%s' % (hdr, msghdrs[hdr]))
            except KeyError:
                print('\t%-8s=>(unknown)' % hdr)
        if count % 5 == 0:
            input('[Press Enter key]')  # pause after each 5

def showmessage(i, msgList):
    if 1 <= i <= len(msgList):
       #print(msgList[i-1])             # old: prints entire mail--hdrs+text
        print('-' * 79)
        msg = Parser().parsestr(msgList[i-1])      # expects str in 3.1
        content = msg.get_payload()     # prints payload: string, or [Messages]
        if isinstance(content, str):    # keep just one end-line at end
            content = content.rstrip() + '\n'
        print(content)
        print('-' * 79)                 # to get text only, see email.parsers
    else:
        print('Bad message number')

def savemessage(i, mailfile, msgList):
    if 1 <= i <= len(msgList):
        savefile = open(mailfile, 'a', encoding=mailconfig.fetchEncoding)  # 4E
        savefile.write('\n' + msgList[i-1] + '-'*80 + '\n')
    else:
        print('Bad message number')

def msgnum(command):
    try:
        return int(command.split()[1])
    except:
        return -1   # assume this is bad

helptext = """
Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text
"""

def interact(msgList, mailfile):
    showindex(msgList)
    toDelete = []
    while True:
        try:
            command = input('[Pymail] Action? (i, l, d, s, m, q, ?) ')
        except EOFError:
            command = 'q'
        if not command: command = '*'

        # quit
        if command == 'q':
            break

        # index
        elif command[0] == 'i':
            showindex(msgList)

        # list
        elif command[0] == 'l':
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    showmessage(i, msgList)
            else:
                showmessage(msgnum(command), msgList)

        # save
        elif command[0] == 's':
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    savemessage(i, mailfile, msgList)
            else:
                savemessage(msgnum(command), mailfile, msgList)

        # delete
        elif command[0] == 'd':
            if len(command) == 1:                          # delete all later
                toDelete = list(range(1, len(msgList)+1))  # 3.x requires list
            else:
                delnum = msgnum(command)
                if (1 <= delnum <= len(msgList)) and (delnum not in toDelete):
                    toDelete.append(delnum)
                else:
                    print('Bad message number')

        # mail
        elif command[0] == 'm':                # send a new mail via SMTP
            sendmessage()
            #execfile('smtpmail.py', {})       # alt: run file in own namespace

        elif command[0] == '?':
            print(helptext)
        else:
            print('What? -- type "?" for commands help')
    return toDelete

if __name__ == '__main__':
    import getpass, mailconfig
    mailserver = mailconfig.popservername        # ex: 'pop.rmi.net'
    mailuser   = mailconfig.popusername          # ex: 'lutz'
    mailfile   = mailconfig.savemailfile         # ex:  r'c:\stuff\savemail'
    mailpswd   = getpass.getpass('Password for %s?' % mailserver)
    print('[Pymail email client]')
    msgList    = loadmessages(mailserver, mailuser, mailpswd)     # load all
    toDelete   = interact(msgList, mailfile)
    if toDelete: deletemessages(mailserver, mailuser, mailpswd, toDelete)
    print('Bye.')

There isn’t much new here—just a combination of user-interface logic and tools we’ve already met, plus a handful of new techniques:

Loads: This client loads all email from the server into an in-memory Python list only once, on startup; you must exit and restart to reload newly arrived email.
Saves: On demand, pymail saves the raw text of a selected message into a local file, whose name you place in the mailconfig module of Example 13-17.
Deletions: We finally support on-request deletion of mail from the server here: in pymail, mails are selected for deletion by number, but are still only physically removed from your server on exit, and then only if you verify the operation. By deleting only on exit, we avoid changing mail message numbers during a session—under POP, deleting a mail not at the end of the list decrements the number assigned to all mails following the one deleted. Since mail is cached in memory by pymail, future operations on the numbered messages in memory can be applied to the wrong mail if deletions were done immediately.^[53]
Parsing and composing messages: pymail now displays just the payload of a message on listing commands, not the entire raw text, and the mail index listing only displays selected headers parsed out of each message. Python’s email package is used to extract headers and content from a message, as shown in the prior section. Similarly, we use email to compose a message and ask for its string to ship as a mail.
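
The header and payload extraction described in the last item can be tried in isolation on a message string. The following is a minimal sketch; the message text and its example.com addresses are made up for illustration:

```python
from email.parser import Parser

# A tiny hand-made message: headers, a blank line, then the body
raw = (
    'From: sender@example.com\n'
    'To: receiver@example.com\n'
    'Subject: testing\n'
    '\n'
    'Fiddle de dum, Fiddle de dee\n'
)

# Index display: parse only the headers and show a selected few
headers = Parser().parsestr(raw, headersonly=True)
for name in ('From', 'To', 'Date', 'Subject'):
    print('%-8s=>%s' % (name, headers.get(name, '(unknown)')))

# Listing display: parse the full message and show just its payload
message = Parser().parsestr(raw)
print(message.get_payload().rstrip())
```

Note that indexing a message object by a header name that is absent returns None rather than raising KeyError, so this sketch uses get with a default to display a placeholder for the missing Date header.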

By now, I expect that you know enough to read this script for a deeper look, so instead of saying more about its design here, let’s jump into an interactive pymail session to see how it works.

Running the pymail Console Client

Let’s start up pymail to read and delete email at our mail server and send new messages. pymail runs on any machine with Python and sockets, fetches mail from any email server with a POP interface on which you have an account, and sends mail via the SMTP server you’ve named in the mailconfig module we wrote earlier (Example 13-17).

Here it is in action running on my Windows laptop machine; its operation is identical on other machines thanks to the portability of both Python and its standard library. First, we start the script, supply a POP password (remember, SMTP servers usually require no password), and wait for the pymail email list index to appear; as is, this version loads the full text of all mails in the inbox on startup:

C:\...\PP4E\Internet\Email> pymail.py
Password for pop.secureserver.net?
[Pymail email client]
Connecting...
b'+OK <[email protected]>'
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009', b'5 1038', b'6 957'], 47)
There are 6 mail messages in 7321 bytes
Retrieving...
1:      1861 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
        Subject =>I'm a Lumberjack, and I'm Okay
2:      1409 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 05 May 2010 08:33:47 -0700
        Subject =>testing
3:      1050 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:11:07 -0000
        Subject =>A B C D E F G
4:      1010 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:16:31 -0000
        Subject =>testing smtpmail
5:      1039 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:32:32 -0000
        Subject =>a b c d e f g
[Press Enter key]
6:      958 bytes
        From    =>[email protected]
        To      =>maillist
        Date    =>Thu, 06 May 2010 10:58:40 -0400
        Subject =>test interactive smtplib
[Pymail] Action? (i, l, d, s, m, q, ?) l 6
-------------------------------------------------------------------------------
testing 1 2 3...

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) l 3
-------------------------------------------------------------------------------
Fiddle de dum, Fiddle de dee,
Eric the half a bee.

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)

Once pymail downloads your email to a Python list on the local client machine, you type command letters to process it. The l command lists (prints) the contents of a given mail number; here, we just used it to list two emails we sent in the preceding section, with the smtpmail script, and interactively.

pymail also lets us get command help, delete messages (deletions actually occur at the server on exit from the program), and save messages away in a local text file whose name is listed in the mailconfig module we saw earlier:

[Pymail] Action? (i, l, d, s, m, q, ?) ?

Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text

[Pymail] Action? (i, l, d, s, m, q, ?) s 4
[Pymail] Action? (i, l, d, s, m, q, ?) d 4

Now, let’s pick the m mail compose option—pymail inputs the mail parts, builds mail text with email, and ships it off with smtplib. You can separate recipients with a comma, and use either simple “addr” or full “name <addr>” address pairs if desired. Because the mail is sent by SMTP, you can use arbitrary From addresses here; but again, you generally shouldn’t do that (unless, of course, you’re trying to come up with interesting examples for a book):

[Pymail] Action? (i, l, d, s, m, q, ?) m
From? [email protected]
To?   [email protected]
Subj? Among our weapons are these
Type message text, end with line="."
Nobody Expects the Spanish Inquisition!
.
[Pymail] Action? (i, l, d, s, m, q, ?) q
To be deleted: [4]
Delete?y
Connecting...
b'+OK <[email protected]>'
Deleting messages from server...
Bye.

As mentioned, deletions really happen only on exit. When we quit pymail with the q command, it tells us which messages are queued for deletion, and verifies the request. Once verified, pymail finally contacts the mail server again and issues POP calls to delete the selected mail messages. Because deletions change message numbers in the server’s inbox, postponing deletion until exit simplifies the handling of already loaded email (we’ll improve on this in the PyMailGUI client of the next chapter).
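
To see why renumbering matters, consider a toy model of the situation. This sketch does not talk to a real server: plain Python lists stand in for the server's inbox and for pymail's in-memory cache, and the delete_on_server function is a hypothetical stand-in for a POP deletion followed by a reconnect:

```python
def delete_on_server(inbox, number):
    """Toy model: remove message `number` (1-based); later mails shift down by one."""
    del inbox[number - 1]

# Immediate deletion: the cache and the server drift apart
server = ['A', 'B', 'C', 'D']
cache = list(server)                 # what the client loaded at startup
delete_on_server(server, 2)          # user deletes message 2 ('B') right away
wanted = cache[3 - 1]                # user now refers to message 3 in the cache
actual = server[3 - 1]               # but number 3 on the server is a different mail
print(wanted, actual)

# Deferred deletion: collect numbers first, apply them against one numbering
server = ['A', 'B', 'C', 'D']
to_delete = [2, 3]
survivors = [mail for num, mail in enumerate(server, 1) if num not in to_delete]
print(survivors)
```

In the deferred case every number in to_delete is interpreted against the same numbering the user saw, which is the property pymail relies on by deleting only at exit.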

Because pymail downloads mail from your server into a local Python list only once at startup, though, we need to start pymail again to refetch mail from the server if we want to see the result of the mail we sent and the deletion we made. Here, our new mail shows up at the end as new number 6, and the original mail assigned number 4 in the prior session is gone:

C:\...\PP4E\Internet\Email> pymail.py
Password for pop.secureserver.net?
[Pymail email client]
Connecting...
b'+OK <[email protected]>'
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1038', b'5 957', b'6 1037'], 47)
There are 6 mail messages in 7349 bytes
Retrieving...
1:      1861 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
        Subject =>I'm a Lumberjack, and I'm Okay
2:      1409 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 05 May 2010 08:33:47 -0700
        Subject =>testing
3:      1050 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:11:07 -0000
        Subject =>A B C D E F G
4:      1039 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:32:32 -0000
        Subject =>a b c d e f g
5:      958 bytes
        From    =>[email protected]
        To      =>maillist
        Date    =>Thu, 06 May 2010 10:58:40 -0400
        Subject =>test interactive smtplib
[Press Enter key]
6:      1038 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Fri, 07 May 2010 20:32:38 -0000
        Subject =>Among our weapons are these
[Pymail] Action? (i, l, d, s, m, q, ?) l 6
-------------------------------------------------------------------------------
Nobody Expects the Spanish Inquisition!

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) q
Bye.

Though not shown in this session, you can also send to multiple recipients, and include full name and address pairs in your email addresses. This works just because the script employs email utilities described earlier to split up addresses and fully parse to allow commas as both separators and name characters. The following, for example, would send to two and three recipients, respectively, using mostly full address formats:

[Pymail] Action? (i, l, d, s, m, q, ?) m
From? "moi 1" <[email protected]>
To?   "pp 4e" <[email protected]>, "lu,tz" <[email protected]>

[Pymail] Action? (i, l, d, s, m, q, ?) m
From? The Book <[email protected]>
To?   "pp 4e" <[email protected]>, "lu,tz" <[email protected]>,
[email protected]
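
The address handling behind these sessions can be explored on its own. The following sketch repeats the two email.utils calls used by the script's splitaddrs function and compares them with a naive comma split; the example.com addresses are placeholders, not real accounts:

```python
from email.utils import getaddresses, formataddr

# Two recipients; the second has a comma inside its quoted name part
field = '"pp 4e" <pp4e@example.com>, "lu,tz" <lutz@example.com>'

naive = field.split(',')                       # splits inside "lu,tz" too
pairs = getaddresses([field])                  # [(name, addr), ...]
recipients = [formataddr(pair) for pair in pairs]

print(len(naive), 'pieces from a blind split')
print(len(recipients), 'recipients from getaddresses')
for name, addr in pairs:
    print(repr(name), '=>', addr)
```

The blind split yields one piece too many because it cannot tell a separator comma from a comma inside a quoted name, which is why the script avoids it.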

Finally, if you are running this live, you will also find the mail save file on your machine, containing the one message we asked to be saved in the prior session; it’s simply the raw text of saved emails, with separator lines. This is both human and machine-readable: in principle, another script could load saved mail from this file into a Python list by calling the string object’s split method on the file’s text with the separator line as a delimiter. As configured for this book, the file is C:\temp\savemail.txt, but you can change this as you like in the mailconfig module.
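
As a rough sketch of that idea, the following splits save-file text on the separator that pymail's savemessage function writes (80 dashes and a newline). The function name split_saved_mail is hypothetical, and the sample text is built in memory here rather than read from a real save file:

```python
SEPARATOR = '-' * 80 + '\n'          # what savemessage appends after each mail

def split_saved_mail(filetext):
    """Split save-file text into a list of raw message strings.

    Assumes no message body contains the separator line itself;
    a mail that does would be split in the wrong place.
    """
    chunks = filetext.split(SEPARATOR)
    # drop the empty tail and the newline savemessage writes before each mail
    return [chunk.lstrip('\n') for chunk in chunks if chunk.strip()]

# Build sample text the same way savemessage does: '\n' + mail + separator
saved = ''
for mail in ('Subject: one\n\nfirst body\n', 'Subject: two\n\nsecond body\n'):
    saved += '\n' + mail + SEPARATOR

mails = split_saved_mail(saved)
print(len(mails), 'messages loaded')
```

Each resulting string could then be passed to the email package's parser, just as pymail does for mail fetched from the server.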

The mailtools Utility Package

The email package used by the pymail example of the prior section is a collection of powerful tools—in fact, perhaps too powerful to remember completely. At the minimum, some reusable boilerplate code for common use cases can help insulate you from some of its details; by isolating module usage, such code can also ease the migration to possible future email changes. To simplify email interfacing for more complex mail clients, and to further demonstrate the use of standard library email tools, I developed the custom utility modules listed in this section—a package called mailtools.

mailtools is a Python modules package: a directory of code, with one module per tool class, and an initialization module run when the directory is first imported. This package’s modules are essentially just a wrapper layer above the standard library’s email package, as well as its poplib and smtplib modules. They make some assumptions about the way email is to be used, but they are reasonable and allow us to forget some of the underlying complexity of the standard library tools employed.

In a nutshell, the mailtools package provides three classes—to fetch, send, and parse email messages. These classes can be used as superclasses in order to mix in their methods to an application-specific class, or as standalone or embedded objects that export their methods for direct calls. We’ll see these classes deployed both ways in this text.

As a simple example of this package’s tools in action, its selftest.py module serves as a self-test script. When run, it sends a message from you, to you, which includes the selftest.py file as an attachment. It also fetches and displays some mail headers and parsed and unparsed content. These interfaces, along with some user-interface magic, will lead us to full-blown email clients and websites in later chapters.

Two design notes worth mentioning up front: First, none of the code in this package knows anything about the user interface it will be used in (console, GUI, web, or other) or does anything about things like threads; it is just a toolkit. As we’ll see, its clients are responsible for deciding how it will be deployed. By focusing on just email processing here, we simplify the code, as well as the programs that will use it.

Second, each of the main modules in this package illustrates Unicode issues that confront Python 3.X code, especially when using the 3.1 Python email package:

The sender must address encodings for the main message text, attachment input files, saved-mail output files, and message headers.
The fetcher must resolve full mail text encodings when new mails are fetched.
The parser must deal with encodings in text part payloads of parsed messages, as well as those in message headers.

In addition, the sender must provide workarounds for the binary parts generation and text part creation issues in email described earlier in this chapter. Since these highlight Unicode factors in general, and cannot be solved as broadly as we might like due to limitations of the current Python email package, I’ll elaborate on each of these choices along the way.

The next few sections list mailtools source code. Together, its files consist of roughly 1,050 lines of code, including whitespace and comments. We won’t cover all of this package’s code in depth—study its listings for more details, and see its self-test module for a usage example. Also, for more context and examples, watch for the three clients that will use this package—the modified pymail2.py following this listing, the PyMailGUI client in Chapter 14, and the PyMailCGI server in Chapter 16. By sharing and reusing this module, all three systems inherit all its utility, as well as any future enhancements.

Initialization File

The module in Example 13-21 implements the initialization logic of the mailtools package; as usual, its code is run automatically the first time a script imports through the package’s directory. Notice how this file collects the contents of all the nested modules into the directory’s namespace with from * statements—because mailtools began life as a single .py file, this provides backward compatibility for existing clients. We also must use package-relative import syntax here (from .module), because Python 3.X no longer includes the package’s own directory on the module import search path (only the package’s container is on the path). Since this is the root module, global comments appear here as well.

Example 13-21. PP4E\Internet\Email\mailtools\__init__.py

"""
##################################################################################
mailtools package: interface to mail server transfers, used by pymail2, PyMailGUI,
and PyMailCGI;  does loads, sends, parsing, composing, and deleting, with part
attachments, encodings (of both the email and Unicode kind), etc.;  the parser,
fetcher, and sender classes here are designed to be mixed-in to subclasses which
use their methods, or used as embedded or standalone objects;

this package also includes convenience subclasses for silent mode, and more;
loads all mail text if pop server doesn't do top;  doesn't handle threads or UI
here, and allows askPassword to differ per subclass;  progress callback funcs get
status;  all calls raise exceptions on error--client must handle in GUI/other;
this changed from file to package: nested modules imported here for bw compat;

4E: need to use package-relative import syntax throughout, because in Py 3.X
package dir is no longer on module import search path if package is imported
elsewhere (from another directory which uses this package);  also performs
Unicode decoding on mail text when fetched (see mailFetcher), as well as for
some text part payloads which might have been email-encoded (see mailParser);

TBD: in saveparts, should file be opened in text mode for text/ contypes?
TBD: in walkNamedParts, should we skip oddballs like message/delivery-status?
TBD: Unicode support has not been tested exhaustively: see Chapter 13 for more
on the Py3.1 email package and its limitations, and the policies used here;
##################################################################################
"""

# collect contents of all modules here, when package dir imported directly
from .mailFetcher import *
from .mailSender  import *                 # 4E: package-relative
from .mailParser  import *

# export nested modules here, when from mailtools import *
__all__ = 'mailFetcher', 'mailSender', 'mailParser'

# self-test code is in selftest.py to allow mailconfig's path
# to be set before running the nested module imports above

MailTool Class

Example 13-22 contains common superclasses for the other classes in the package. This is in part meant for future expansion. At present, these are used only to enable or disable trace message output (some clients, such as web-based programs, may not want text to be printed to the output stream). Subclasses mix in the silent variant to turn off output.
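
To see why listing the silent variant first is enough to turn off output, here is a minimal sketch of the mix-in pattern. The DemoSender and SilentDemoSender classes are hypothetical stand-ins, not part of the package; only the two trace classes mirror the listing that follows.

```python
class MailTool:                        # default: print trace messages
    def trace(self, message):
        print(message)

class SilentMailTool:                  # mix-in: swallow trace messages
    def trace(self, message):
        pass

class DemoSender(MailTool):            # hypothetical client class
    def send(self):
        self.trace('Sending...')       # resolved by normal attribute lookup
        return 'sent'

class SilentDemoSender(SilentMailTool, DemoSender):
    pass                               # SilentMailTool.trace is found first

DemoSender().send()                    # prints the trace message
SilentDemoSender().send()              # prints nothing
```

Because SilentMailTool appears leftmost in the subclass header, its trace precedes MailTool's in the method resolution order, so no code in the sender itself needs to change.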

Example 13-22. PP4E\Internet\Email\mailtools\mailTool.py

"""
###############################################################################
common superclasses: used to turn trace messages on/off
###############################################################################
"""

class MailTool:                    # superclass for all mail tools
    def trace(self, message):      # redef me to disable or log to file
        print(message)

class SilentMailTool:              # to mixin instead of subclassing
    def trace(self, message):
        pass

MailSender Class

The class used to compose and send messages is coded in Example 13-23. This module provides a convenient interface that combines standard library tools we’ve already met in this chapter—the email package to compose messages with attachments and encodings, and the smtplib module to send the resulting email text. Attachments are passed in as a list of filenames—MIME types and any required encodings are determined automatically with the module mimetypes. Moreover, date and time strings are automated with an email.utils call, and non-ASCII headers are encoded per email, MIME, and Unicode standards. Study this file’s code and comments for more on its operation.
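
As a quick illustration of the automatic type and date handling just described, the following sketch guesses a content type from a filename and falls back on a generic binary type when the guess fails or the file looks compressed. The pick_content_type helper name and the sample filenames are invented for this example.

```python
import mimetypes
import email.utils

def pick_content_type(filename):
    """Return a MIME type for filename, or a generic binary default."""
    contype, encoding = mimetypes.guess_type(filename)
    if contype is None or encoding is not None:    # unknown, or compressed
        contype = 'application/octet-stream'
    return contype

for name in ('notes.txt', 'photo.png', 'archive.tar.gz'):
    contype = pick_content_type(name)
    maintype, subtype = contype.split('/', 1)      # selects the MIME class to build
    print(name, '=>', maintype, subtype)

print(email.utils.formatdate())        # current time as a standard date header
```

The main type returned here is what a sender can use to choose among text, image, audio, and application message classes.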

Unicode issues for attachments, save files, and headers

This is also where we open and add attachment files, generate message text, and save sent messages to a local file. Most attachment files are opened in binary mode, but as we’ve seen, some text attachments must be opened in text mode because the current email package requires them to be str strings when message objects are created. As we also saw earlier, the email package requires attachments to be str text when mail text is later generated, possibly as the result of MIME encoding.

To satisfy these constraints with the Python 3.1 email package, we must apply the two fixes described earlier: part file open calls select between text or binary mode (and thus read str or bytes) based upon the way email will process the data, and MIME encoding calls for binary data are augmented to decode the result to ASCII text. The latter of these also splits the Base64 text into lines here for binary parts (unlike email), because it is otherwise sent as one long line, which may work in some contexts, but causes problems in some text editors if the raw text is viewed.
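
The second fix can be sketched in isolation. This standalone version uses the base64 module directly rather than email.encoders, and the base64_lines name and sample payload are assumptions made for the example. It encodes bytes, decodes the result to an ASCII str, and wraps it at the 76-character limit.

```python
import base64

def base64_lines(data, linelen=76):
    """Base64-encode bytes and return ASCII text wrapped at linelen chars."""
    text = base64.b64encode(data).decode('ascii')        # bytes -> str
    lines = [text[i:i + linelen] for i in range(0, len(text), linelen)]
    return '\n'.join(lines)

payload = bytes(range(256)) * 2          # stand-in for binary file content
wrapped = base64_lines(payload)
print(wrapped.splitlines()[0])           # first wrapped line of the encoded text
```

The standard library's base64.encodebytes also inserts a newline after every 76 characters of output, but it returns bytes, so the decode to str would still be needed.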

Beyond these fixes, clients may optionally provide the names of the Unicode encoding scheme associated with the main text part and each text attachment part. In Chapter 14’s PyMailGUI, this is controlled in the mailconfig user settings module, with UTF-8 used as a fallback default whenever user settings fail to encode a text part. We could in principle also catch part file decoding errors and return an error indicator string (as we do for received mails in the mail fetcher ahead), but sending an invalid attachment is much more grievous than displaying one. Instead, the send request fails entirely on errors.
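
The encoding name a client supplies also decides whether a text part is read as str or bytes. A rough sketch of that check follows, using the email package's Charset object. The text_must_be_str name is made up here, and the str/bytes split it reports applies to the Python 3.1 email package described in this chapter; later releases may not need it.

```python
from email.charset import Charset, BASE64, QP

def text_must_be_str(encodingname):
    """True if the body encoding is none or quoted-printable (str input)."""
    return Charset(encodingname).body_encoding in (None, QP)

labels = {None: 'none', QP: 'quoted-printable', BASE64: 'base64'}
for name in ('us-ascii', 'latin-1', 'utf-8'):
    bodyenc = Charset(name).body_encoding          # how email will MIME-encode
    kind = 'str' if text_must_be_str(name) else 'bytes'
    print(name, '=>', labels[bodyenc], '| read file as', kind)
```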

Finally, there is also new support for encoding non-ASCII headers (both full headers and names of email addresses) per a client-selectable encoding that defaults to UTF-8, and the sent message save file is opened in the same mailconfig Unicode encoding mode used to decode messages when they are fetched.

The latter policy for sent mail saves is used because the sent file may be opened to fetch full mail text in this encoding later by clients which apply this encoding scheme. This is intended to mirror the way that clients such as PyMailGUI save full message text in local files to be opened and parsed later. It might fail if the mail fetcher resorted to guessing a different and incompatible encoding, and it assumes that no message gives rise to incompatibly encoded data in the file across multiple sessions. We could instead keep one save file per encoding, but encodings for full message text probably will not vary; ASCII was the original standard for full mail text, so 7- or 8-bit text is likely.
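
Returning to the header encoding support mentioned above, the following is a simplified sketch of the idea rather than the package's own code. Full headers are encoded as a whole, while address headers have only their display names encoded so that the address part stays plain ASCII. The function names and sample addresses are invented for illustration.

```python
import email.header
import email.utils

def encode_header_text(text, charset='utf-8'):
    """Return text unchanged if ASCII, else as a MIME encoded-word string."""
    try:
        text.encode('ascii')
        return text
    except UnicodeError:
        return email.header.make_header([(text, charset)]).encode()

def encode_address_names(addrs, charset='utf-8'):
    """Encode only display names in an address list; leave addresses alone."""
    pairs = email.utils.getaddresses([addrs])      # [(name, addr), ...]
    encoded = [email.utils.formataddr((encode_header_text(name, charset), addr))
               for name, addr in pairs]
    return ', '.join(encoded)

subject = encode_header_text('Résumé attached')
sender  = encode_address_names('Zoë Smith <zoe@example.com>, bob@example.com')
print(subject)
print(sender)
```

Unlike the class in the listing, this sketch does not fall back on dropping a name that fails to encode, and it does not break long results across lines.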

Example 13-23. PP4E\Internet\Email\mailtools\mailSender.py

"""
###############################################################################
send messages, add attachments (see __init__ for docs, test)
###############################################################################
"""

import mailconfig                                      # client's mailconfig
import smtplib, os, mimetypes                          # mime: name to type
import email.utils, email.encoders                     # date string, base64
from .mailTool import MailTool, SilentMailTool         # 4E: package-relative

from email.message          import Message             # general message, obj->text
from email.mime.multipart   import MIMEMultipart       # type-specific messages
from email.mime.audio       import MIMEAudio           # format/encode attachments
from email.mime.image       import MIMEImage
from email.mime.text        import MIMEText
from email.mime.base        import MIMEBase
from email.mime.application import MIMEApplication     # 4E: use new app class


def fix_encode_base64(msgobj):
    """
    4E: workaround for a genuine bug in Python 3.1 email package that prevents
    mail text generation for binary parts encoded with base64 or other email
    encodings;  the normal email.encoder run by the constructor leaves payload
    as bytes, even though it's encoded to base64 text form;  this breaks email
    text generation which assumes this is text and requires it to be str;  net
    effect is that only simple text part emails can be composed in Py 3.1 email
    package as is - any MIME-encoded binary part cause mail text generation to
    fail;  this bug seems likely to go away in a future Python and email package,
    in which case this should become a no-op;  see Chapter 13 for more details;
    """

    linelen = 76  # per MIME standards
    from email.encoders import encode_base64

    encode_base64(msgobj)                # what email does normally: leaves bytes
    text = msgobj.get_payload()          # bytes fails in email pkg on text gen
    if isinstance(text, bytes):          # payload is bytes in 3.1, str in 3.2 alpha
        text = text.decode('ascii')      # decode to unicode str so text gen works

    lines = []                           # split into lines, else 1 massive line
    text  = text.replace('\n', '')       # no \n present in 3.1, but futureproof me!
    while text:
        line, text = text[:linelen], text[linelen:]
        lines.append(line)
    msgobj.set_payload('\n'.join(lines))


def fix_text_required(encodingname):
    """
    4E: workaround for str/bytes combination errors in email package;  MIMEText
    requires different types for different Unicode encodings in Python 3.1, due
    to the different ways it MIME-encodes some types of text;  see Chapter 13;
    the only other alternative is using generic Message and repeating much code;
    """
    from email.charset import Charset, BASE64, QP

    charset = Charset(encodingname)   # how email knows what to do for encoding
    bodyenc = charset.body_encoding   # utf8, others require bytes input data
    return bodyenc in (None, QP)      # ascii, latin1, others require str


class MailSender(MailTool):
    """
    send mail: format a message, interface with an SMTP server;
    works on any machine with Python+Inet, doesn't use cmdline mail;
    a nonauthenticating client: see MailSenderAuth if login required;
    4E: tracesize is num chars of msg text traced: 0=none, big=all;
    4E: supports Unicode encodings for main text and text parts;
    4E: supports header encoding, both full headers and email names;
    """
    def __init__(self, smtpserver=None, tracesize=256):
        self.smtpServerName = smtpserver or mailconfig.smtpservername
        self.tracesize = tracesize

    def sendMessage(self, From, To, Subj, extrahdrs, bodytext, attaches,
                                      saveMailSeparator=(('=' * 80) + 'PY\n'),
                                      bodytextEncoding='us-ascii',
                                      attachesEncodings=None):
        """
        format and send mail: blocks caller, thread me in a GUI;
        bodytext is main text part, attaches is list of filenames,
        extrahdrs is list of (name, value) tuples to be added;
        raises uncaught exception if send fails for any reason;
        saves sent message text in a local file if successful;

        assumes that To, Cc, Bcc hdr values are lists of 1 or more already
        decoded addresses (possibly in full name+<addr> format); client
        must parse to split these on delimiters, or use multiline input;
        note that SMTP allows full name+<addr> format in recipients;
        4E: Bcc addrs now used for send/envelope, but header is dropped;
        4E: duplicate recipients removed, else will get >1 copies of mail;
        caveat: no support for multipart/alternative mails, just /mixed;
        """

        # 4E: assume main body text is already in desired encoding;
        # clients can decode to user pick, default, or utf8 fallback;
        # either way, email needs either str xor bytes specifically;

        if fix_text_required(bodytextEncoding):
            if not isinstance(bodytext, str):
                bodytext = bodytext.decode(bodytextEncoding)
        else:
            if not isinstance(bodytext, bytes):
                bodytext = bodytext.encode(bodytextEncoding)

        # make message root
        if not attaches:
            msg = Message()
            msg.set_payload(bodytext, charset=bodytextEncoding)
        else:
            msg = MIMEMultipart()
            self.addAttachments(msg, bodytext, attaches,
                                     bodytextEncoding, attachesEncodings)

        # 4E: non-ASCII hdrs encoded on sends; encode just name in address,
        # else smtp may drop the message completely; encodes all envelope
        # To names (but not addr) also, and assumes servers will allow;
        # msg.as_string retains any line breaks added by encoding headers;

        hdrenc = mailconfig.headersEncodeTo or 'utf-8'        # default=utf8
        Subj = self.encodeHeader(Subj, hdrenc)                # full header
        From = self.encodeAddrHeader(From, hdrenc)            # email names
        To   = [self.encodeAddrHeader(T, hdrenc) for T in To] # each recip
        Tos  = ', '.join(To)                                  # hdr+envelope

        # add headers to root
        msg['From']    = From
        msg['To']      = Tos                        # poss many: addr list
        msg['Subject'] = Subj                       # servers reject ';' sept
        msg['Date']    = email.utils.formatdate()   # curr datetime, rfc2822 utc
        recip = To
        for name, value in extrahdrs:               # Cc, Bcc, X-Mailer, etc.
            if value:
                if name.lower() not in ['cc', 'bcc']:
                    value = self.encodeHeader(value, hdrenc)
                    msg[name] = value
                else:
                    value = [self.encodeAddrHeader(V, hdrenc) for V in value]
                    recip += value                     # some servers reject ['']
                    if name.lower() != 'bcc':          # 4E: bcc gets mail, no hdr
                        msg[name] = ', '.join(value)   # add commas between cc

        recip = list(set(recip))                       # 4E: remove duplicates
        fullText = msg.as_string()                     # generate formatted msg

        # sendmail call raises except if all Tos failed,
        # or returns failed Tos dict for any that failed

        self.trace('Sending to...' + str(recip))
        self.trace(fullText[:self.tracesize])                  # SMTP calls connect
        server = smtplib.SMTP(self.smtpServerName, timeout=15) # this may fail too
        self.getPassword()                                     # if srvr requires
        self.authenticateServer(server)                        # login in subclass
        try:
            failed = server.sendmail(From, recip, fullText)    # except or dict
        except:
            server.close()                                     # 4E: quit may hang!
            raise                                              # reraise except
        else:
            server.quit()                                      # connect + send OK
        self.saveSentMessage(fullText, saveMailSeparator)      # 4E: do this first
        if failed:
            class SomeAddrsFailed(Exception): pass
            raise SomeAddrsFailed('Failed addrs:%s\n' % failed)
        self.trace('Send exit')

    def addAttachments(self, mainmsg, bodytext, attaches,
                                      bodytextEncoding, attachesEncodings):
        """
        format a multipart message with attachments;
        use Unicode encodings for text parts if passed;
        """
        # add main text/plain part
        msg = MIMEText(bodytext, _charset=bodytextEncoding)
        mainmsg.attach(msg)

        # add attachment parts
        encodings = attachesEncodings or (['us-ascii'] * len(attaches))
        for (filename, fileencode) in zip(attaches, encodings):
            # filename may be absolute or relative
            if not os.path.isfile(filename):             # skip dirs, etc.
                continue

            # guess content type from file extension, ignore encoding
            contype, encoding = mimetypes.guess_type(filename)
            if contype is None or encoding is not None:  # no guess, compressed?
                contype = 'application/octet-stream'     # use generic default
            self.trace('Adding ' + contype)

            # build sub-Message of appropriate kind
            maintype, subtype = contype.split('/', 1)
            if maintype == 'text':                       # 4E: text needs encoding
                if fix_text_required(fileencode):        # requires str or bytes
                    data = open(filename, 'r', encoding=fileencode)
                else:
                    data = open(filename, 'rb')
                msg = MIMEText(data.read(), _subtype=subtype, _charset=fileencode)
                data.close()

            elif maintype == 'image':
                data = open(filename, 'rb')              # 4E: use fix for binaries
                msg  = MIMEImage(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            elif maintype == 'audio':
                data = open(filename, 'rb')
                msg  = MIMEAudio(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            elif maintype == 'application':              # new  in 4E
                data = open(filename, 'rb')
                msg  = MIMEApplication(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            else:
                data = open(filename, 'rb')              # application/* could
                msg  = MIMEBase(maintype, subtype)       # use this code too
                msg.set_payload(data.read())
                data.close()                             # make generic type
                fix_encode_base64(msg)                   # was broken here too!
               #email.encoders.encode_base64(msg)        # encode using base64

            # set filename (ascii or utf8/mime encoded) and attach to container
            basename = self.encodeHeader(os.path.basename(filename)) # oct 2011
            msg.add_header('Content-Disposition',
                           'attachment', filename=basename)
            mainmsg.attach(msg)

        # text outside mime structure, seen by non-MIME mail readers
        mainmsg.preamble = 'A multi-part MIME format message.\n'
        mainmsg.epilogue = ''  # make sure message ends with a newline

    def saveSentMessage(self, fullText, saveMailSeparator):
        """
        append sent message to local file if send worked for any;
        client: pass separator used for your application, splits;
        caveat: user may change the file at same time (unlikely);
        """
        try:
            sentfile = open(mailconfig.sentmailfile, 'a',
                                  encoding=mailconfig.fetchEncoding)    # 4E
            if fullText[-1] != '\n': fullText += '\n'
            sentfile.write(saveMailSeparator)
            sentfile.write(fullText)
            sentfile.close()
        except:
            self.trace('Could not save sent message')    # not a show-stopper

    def encodeHeader(self, headertext, unicodeencoding='utf-8'):
        """
        4E: encode composed non-ascii message headers content per both email
        and Unicode standards, according to an optional user setting or UTF-8;
        header.encode adds line breaks in header string automatically if needed;
        """
        try:
            headertext.encode('ascii')
        except:
            try:
                hdrobj = email.header.make_header([(headertext, unicodeencoding)])
                headertext = hdrobj.encode()
            except:
                pass         # auto splits into multiple cont lines if needed
        return headertext    # smtplib may fail if it won't encode to ascii

    def encodeAddrHeader(self, headertext, unicodeencoding='utf-8'):
        """
        4E: try to encode non-ASCII names in email addresses per email, MIME,
        and Unicode standards; if this fails drop name and use just addr part;
        if cannot even get addresses, try to decode as a whole, else smtplib
        may run into errors when it tries to encode the entire mail as ASCII;
        utf-8 default should work for most, as it formats code points broadly;

        inserts newlines if too long or hdr.encode split names to multiple lines,
        but this may not catch some lines longer than the cutoff (improve me);
        as used, Message.as_string formatter won't try to break lines further;
        see also decodeAddrHeader in mailParser module for the inverse of this;
        """
        try:
            pairs = email.utils.getaddresses([headertext])   # split addrs + parts
            encoded = []
            for name, addr in pairs:
                try:
                    name.encode('ascii')         # use as is if okay as ascii
                except UnicodeError:             # else try to encode name part
                    try:
                        uni  = name.encode(unicodeencoding)
                        hdr  = email.header.make_header([(uni, unicodeencoding)])
                        name = hdr.encode()
                    except:
                        name = None              # drop name, use address part only
                joined = email.utils.formataddr((name, addr))  # quote name if need
                encoded.append(joined)

            fullhdr = ', '.join(encoded)
            if len(fullhdr) > 72 or '\n' in fullhdr:      # not one short line?
                fullhdr = ',\n '.join(encoded)            # try multiple lines
            return fullhdr
        except:
            return self.encodeHeader(headertext)

    def authenticateServer(self, server):
        pass  # no login required for this server/class

    def getPassword(self):
        pass  # no login required for this server/class


################################################################################
# specialized subclasses
################################################################################

class MailSenderAuth(MailSender):
    """
    use for servers that require login authorization;
    client: choose MailSender or MailSenderAuth super
    class based on mailconfig.smtpuser setting (None?)
    """
    smtpPassword = None    # 4E: on class, not self, shared by poss N instances
    
    def __init__(self, smtpserver=None, smtpuser=None, tracesize=256):
        MailSender.__init__(self, smtpserver, tracesize)
        self.smtpUser = smtpuser or mailconfig.smtpuser
        #self.smtpPassword = None # 4E: makes PyMailGUI ask for each send!

    def authenticateServer(self, server):
        server.login(self.smtpUser, self.smtpPassword)

    def getPassword(self):
        """
        get SMTP auth password if not yet known;
        may be called by superclass auto, or client manual:
        not needed until send, but don't run in GUI thread;
        get from client-side file or subclass method
        """
        if not self.smtpPassword:
            try:
                localfile = open(mailconfig.smtppasswdfile)
                MailSenderAuth.smtpPassword = localfile.readline()[:-1] # 4E
                self.trace('local file password' + repr(self.smtpPassword))
            except:
                MailSenderAuth.smtpPassword = self.askSmtpPassword()    # 4E

    def askSmtpPassword(self):
        assert False, 'Subclass must define method'

class MailSenderAuthConsole(MailSenderAuth):
    def askSmtpPassword(self):
        import getpass
        prompt = 'Password for %s on %s?' % (self.smtpUser, self.smtpServerName)
        return getpass.getpass(prompt)

class SilentMailSender(SilentMailTool, MailSender):
    pass   # replaces trace

MailFetcher Class

The class defined in Example 13-24 does the work of interfacing with a POP email server—loading, deleting, and synchronizing. This class merits a few additional words of explanation.

General usage

This module deals strictly in email text; parsing email after it has been fetched is delegated to a different module in the package. Moreover, this module doesn’t cache already loaded information; clients must add their own mail-retention tools if desired. Clients must also provide password input methods or pass one in, if they cannot use the console input subclass here (e.g., GUIs and web-based programs).

The loading and deleting tasks use the standard library poplib module in ways we saw earlier in this chapter, but notice that there are interfaces for fetching just message header text with the TOP action in POP if the mail server supports it. This can save substantial time if clients need to fetch only basic details for an email index. In addition, the header and full-text fetchers are equipped to load just mails newer than a particular number (useful once an initial load is run), and to restrict fetches to a fixed-sized set of the most recently arrived emails (useful for large inboxes with slow Internet access or servers).
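
The header-only and limited fetches described here might be sketched as follows. This is an illustration under stated assumptions, not the class's actual logic: the function names, the loadfrom and fetchlimit parameters, and the FakePOP stand-in (used so the example runs without a network) are all hypothetical. Only the stat and top calls correspond to real poplib.POP3 methods.

```python
def message_numbers(msg_count, loadfrom=1, fetchlimit=None):
    """Numbers to fetch: loadfrom..msg_count, trimmed to the newest fetchlimit."""
    first = loadfrom
    if fetchlimit is not None:
        first = max(first, msg_count - fetchlimit + 1)
    return list(range(first, msg_count + 1))

def fetch_headers(server, loadfrom=1, fetchlimit=None):
    """Fetch header lines only, using POP TOP with zero body lines."""
    msg_count, _mbox_size = server.stat()            # inbox message count
    headers = []
    for num in message_numbers(msg_count, loadfrom, fetchlimit):
        _resp, lines, _octets = server.top(num, 0)   # headers, no body text
        headers.append((num, lines))
    return headers

class FakePOP:
    """Hypothetical stand-in for a poplib.POP3 connection (no network)."""
    def __init__(self, count):
        self.count = count
    def stat(self):
        return (self.count, 0)
    def top(self, which, howmuch):
        return (b'+OK', [b'Subject: msg %d' % which], 0)

print(fetch_headers(FakePOP(10), loadfrom=4, fetchlimit=3))
```

With a real connection, a server that lacks TOP would raise an error on the top call, and the client would have to fall back on retr to load the full message text.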

This module also supports the notion of progress indicators—for methods that perform multiple downloads or deletions, callers may pass in a function that will be called as each mail is processed. This function will receive the current and total step numbers. It’s left up to the caller to render this in a GUI, console, or other user interface.
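
A progress callback of this kind takes only a few lines to support. In the sketch below, process_all and console_progress are illustrative names; the only detail taken from the text is that the callback receives the current and total step numbers.

```python
def process_all(msgnums, action, progress=None):
    """Run action on each message number, reporting (current, total) steps."""
    total = len(msgnums)
    for step, num in enumerate(msgnums, start=1):
        if progress:
            progress(step, total)        # caller decides how to display this
        action(num)

def console_progress(current, total):    # one possible renderer
    print('processing %d of %d' % (current, total))

process_all([3, 5, 8], action=lambda num: None, progress=console_progress)
```

A GUI client could pass a function that updates a progress bar instead, with no change to the processing loop.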

Unicode decoding for full mail text on fetches

Additionally, this module is where we apply the session-wide message bytes Unicode decoding policy required for parsing, as discussed earlier in this chapter. This decoding uses an encoding name user setting in the mailconfig module, followed by heuristics. Because this decoding is performed immediately when a mail is fetched, all clients of this package can assume message text is str Unicode strings, including in any later parsing, display, or save operations. In addition to the mailconfig setting, we also apply a few guesses with common encoding types, though this may lead to problems if mails decoded by guessing cannot be written to mail save files using the mailconfig setting.

As described, this session-wide approach to encodings is not ideal, but it can be adjusted per client session and reflects the current limitations of email in Python 3.1—its parser requires already decoded Unicode strings, but fetches return bytes. If this decoding fails, as a last resort we attempt to decode headers only, as either ASCII (or other common format) text or the platform default, and insert an error message in the email body—a heuristic that attempts to avoid killing clients with exceptions if possible (see file _test-decoding.py in the examples package for a test of this logic). In practice, an 8-bit Unicode encoding such as Latin-1 will probably suffice in most cases, because ASCII was the original requirement of email standards.
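
The first stage of this policy, trying a preferred encoding and then a few common ones, can be condensed into a standalone sketch. The decode_lines name and its return of the encoding used are assumptions of this example; the headers-only last resort is omitted.

```python
import sys

def decode_lines(message_bytes, preferred='utf8'):
    """Decode a list of bytes lines with the first encoding that works."""
    kinds = [preferred, 'ascii', 'latin1', 'utf8', sys.getdefaultencoding()]
    for kind in kinds:
        try:
            return [line.decode(kind) for line in message_bytes], kind
        except (UnicodeError, LookupError):    # bad bytes, or unknown codec name
            continue
    return None, None                          # caller applies a last resort

raw = [b'Subject: caf\xe9', b'', b'body text']   # Latin-1 byte, invalid as UTF-8
text, used = decode_lines(raw, preferred='utf8')
print(used, text)
```

Because Latin-1 assigns a character to every byte value, that step cannot raise an error; a successful decode therefore does not prove that the guess was right, only that parsing can proceed.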

In principle, we could try to search for encoding information in message headers if it’s present, by parsing mails partially ourselves. We might then take a per-message instead of per-session approach to decoding full text, and associate an encoding type with each mail for later processing such as saves, though this raises further complications, as a save file can have just one (compatible) encoding, not one per message. Moreover, character sets in email headers may refer to individual components, not the entire email’s text. Since most mails will conform to 7- or 8-bit standards, and since a future email release will likely address this issue, extra complexity is probably not warranted for this case in this book.

Also keep in mind that the Unicode decoding performed here is for the entire mail text fetched from a server. Really, this is just one part of the email encoding story in the Unicode-aware world of today. In addition:

Payloads of parsed message parts may still be returned as bytes and require special handling or further Unicode decoding (see the parser module ahead).
Text parts and attachments in composed mails impose encoding choices as well (see the sender module earlier).
Message headers have their own encoding conventions, and may be both MIME and Unicode encoded if Internationalized (see both the parser and sender modules).

Inbox synchronization tools

When you start studying this example, you’ll also notice that Example 13-24 devotes substantial code to detecting synchronization errors between an email list held by a client and the current state of the inbox at the POP email server. Normally, POP assigns relative message numbers to email in the inbox, and only adds newly arrived emails to the end of the inbox. As a result, relative message numbers from an earlier fetch may usually be used to delete and fetch in the future.

However, although rare, it is not impossible for the server’s inbox to change in ways that invalidate previously fetched message numbers. For instance, emails may be deleted in another client, and the server itself may move mails from the inbox to an undeliverable state on download errors (this may vary per ISP). In both cases, email may be removed from the middle of the inbox, throwing some prior relative message numbers out of sync with the server.

This situation can result in fetching the wrong message in an email client—users receive a different message than the one they thought they had selected. Worse, this can make deletions inaccurate—if a mail client uses a relative message number in a delete request, the wrong mail may be deleted if the inbox has changed since the index was fetched.

To assist clients, Example 13-24 includes tools that match message headers on deletions to ensure accuracy and perform general inbox synchronization tests on demand. These tools are useful only to clients that retain the fetched email list as state information. We’ll use these in the PyMailGUI client in Chapter 14. There, deletions use the safe interface, and loads run the on-demand synchronization test; on detection of synchronization errors, the inbox index is automatically reloaded. For now, see the Example 13-24 source code and comments for more details.

Note that the synchronization tests try a variety of matching techniques, but require the complete headers text and, in the worst case, must parse headers and match many header fields. In many cases, the single previously fetched message-id header field would be sufficient for matching against messages in the server’s inbox. However, because this field is optional and can be forged to have any value, it might not always be a reliable way to identify messages. In other words, a same-valued message-id may not suffice to guarantee a match, although it can be used to identify a mismatch; in Example 13-24, the message-id is used to rule out a match if either message has one, and they differ in value. This test is performed before falling back on slower parsing and multiple header matches.
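
The matching order just described can be approximated by the sketch below. It is a simplified illustration and not the code of Example 13-24: the headers_match name and the particular fields compared in the final step are assumptions.

```python
import email

def headers_match(hdrtext1, hdrtext2, fields=('From', 'To', 'Subject', 'Date')):
    """Decide whether two header texts appear to describe the same message."""
    if hdrtext1 == hdrtext2:                     # fastest test: identical text
        return True
    msg1 = email.message_from_string(hdrtext1)
    msg2 = email.message_from_string(hdrtext2)
    id1, id2 = msg1.get('Message-Id'), msg2.get('Message-Id')
    if (id1 or id2) and id1 != id2:              # differing ids rule out a match
        return False
    # equal or absent ids prove nothing: compare several other fields
    return all(msg1.get(field) == msg2.get(field) for field in fields)

saved  = 'From: a@example.com\nSubject: hi\nMessage-Id: <1@example.com>\n\n'
server = 'Status: RO\n' + saved                  # server added a header line
print(headers_match(saved, server))
print(headers_match(saved, saved.replace('<1@', '<2@')))
```

Here the first comparison succeeds even though the texts differ by an added line, while the second fails as soon as the two message-id values disagree.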

Example 13-24. PP4E\Internet\Email\mailtools\mailFetcher.py

"""
###############################################################################
retrieve, delete, match mail from a POP server (see __init__ for docs, test)
###############################################################################
"""

import poplib, mailconfig, sys               # client's mailconfig on sys.path
print('user:', mailconfig.popusername)       # script dir, pythonpath, changes

from .mailParser import MailParser                 # for headers matching (4E: .)
from .mailTool   import MailTool, SilentMailTool   # trace control supers (4E: .)

# index/server msgnum out of synch tests
class DeleteSynchError(Exception): pass            # msg out of synch in del
class TopNotSupported(Exception): pass             # can't run synch test
class MessageSynchError(Exception): pass           # index list out of sync

class MailFetcher(MailTool):
    """
    fetch mail: connect, fetch headers+mails, delete mails
    works on any machine with Python+Inet; subclass me to cache
    implemented with the POP protocol; IMAP requires new class;
    4E: handles decoding of full mail text on fetch for parser;
    """
    def __init__(self, popserver=None, popuser=None, poppswd=None, hastop=True):
        self.popServer   = popserver or mailconfig.popservername
        self.popUser     = popuser   or mailconfig.popusername
        self.srvrHasTop  = hastop
        self.popPassword = poppswd  # ask later if None

    def connect(self):
        self.trace('Connecting...')
        self.getPassword()                          # file, GUI, or console
        server = poplib.POP3(self.popServer, timeout=15)
        server.user(self.popUser)                   # connect,login POP server
        server.pass_(self.popPassword)              # pass is a reserved word
        self.trace(server.getwelcome())             # print returned greeting
        return server

    # use setting in client's mailconfig on import search path;
    # to tailor, this can be changed in class or per instance;
    fetchEncoding = mailconfig.fetchEncoding

    def decodeFullText(self, messageBytes):
        """
        4E, Py3.1: decode full fetched mail text bytes to str Unicode string;
        done at fetch, for later display or parsing (full mail text is always
        Unicode thereafter);  decode with per-class or per-instance setting, or
        common types;  could also try headers inspection, or intelligent guess
        from structure; in Python 3.2/3.3, this step may not be required: if so,
        change to return message line list intact; for more details see Chapter 13;

        an 8-bit encoding such as latin-1 will likely suffice for most emails, as
        ASCII is the original standard;  this method applies to entire/full message
        text, which is really just one part of the email encoding story: Message
        payloads and Message headers may also be encoded per email, MIME, and
        Unicode standards; see Chapter 13 and mailParser and mailSender for more;
        """
        text = None
        kinds =  [self.fetchEncoding]             # try user setting first
        kinds += ['ascii', 'latin1', 'utf8']      # then try common types
        kinds += [sys.getdefaultencoding()]       # and platform dflt (may differ)
        for kind in kinds:                        # may cause mail saves to fail
            try:
                text = [line.decode(kind) for line in messageBytes]
                break
            except (UnicodeError, LookupError):   # LookupError: bad name
                pass

        if text is None:
            # try returning headers + error msg, else except may kill client;
            # still try to decode headers per ascii, other, platform default;

            blankline = messageBytes.index(b'')
            hdrsonly  = messageBytes[:blankline]
            commons   = ['ascii', 'latin1', 'utf8']
            for common in commons:
                try:
                    text = [line.decode(common) for line in hdrsonly]
                    break
                except UnicodeError:
                    pass
            else:                                                  # none worked
                try:
                    text = [line.decode() for line in hdrsonly]    # platform dflt?
                except UnicodeError:
                    text = ['From: (sender of unknown Unicode format headers)']
            text += ['', '--Sorry: mailtools cannot decode this mail content!--']
        return text

    def downloadMessage(self, msgnum):
        """
        load full raw text of one mail msg, given its
        POP relative msgnum; caller must parse content
        """
        self.trace('load ' + str(msgnum))
        server = self.connect()
        try:
            resp, msglines, respsz = server.retr(msgnum)
        finally:
            server.quit()
        msglines = self.decodeFullText(msglines)   # raw bytes to Unicode str
        return '\n'.join(msglines)                 # concat lines for parsing

    def downloadAllHeaders(self, progress=None, loadfrom=1):
        """
        get sizes, raw header text only, for all or new msgs
        begins loading headers from message number loadfrom
        use loadfrom to load newly arrived mails only
        use downloadMessage to get a full msg text later
        progress is a function called with (count, total);
        returns: [headers text], [mail sizes], loadedfull?

        4E: add mailconfig.fetchlimit to support large email
        inboxes: if not None, only fetches that many headers,
        and returns others as dummy/empty mail; else inboxes
        like one of mine (4K emails) are not practical to use;
        4E: pass loadfrom along to downloadAllMessages (a buglet);
        """
        if not self.srvrHasTop:                    # not all servers support TOP
            # naively load full msg text
            return self.downloadAllMessages(progress, loadfrom)
        else:
            self.trace('loading headers')
            fetchlimit = mailconfig.fetchlimit
            server = self.connect()                # mbox now locked until quit
            try:
                resp, msginfos, respsz = server.list()   # 'num size' lines list
                msgCount = len(msginfos)                 # alt to srvr.stat[0]
                msginfos = msginfos[loadfrom-1:]         # drop already loadeds
                allsizes = [int(x.split()[1]) for x in msginfos]
                allhdrs  = []
                for msgnum in range(loadfrom, msgCount+1):          # poss empty
                    if progress: progress(msgnum, msgCount)         # run callback
                    if fetchlimit and (msgnum <= msgCount - fetchlimit):
                        # skip, add dummy hdrs
                        hdrtext = 'Subject: --mail skipped--\n\n'
                        allhdrs.append(hdrtext)
                    else:
                        # fetch, retr hdrs only
                        resp, hdrlines, respsz = server.top(msgnum, 0)
                        hdrlines = self.decodeFullText(hdrlines)
                        allhdrs.append('\n'.join(hdrlines))
            finally:
                server.quit()                          # make sure unlock mbox
            assert len(allhdrs) == len(allsizes)
            self.trace('load headers exit')
            return allhdrs, allsizes, False

    def downloadAllMessages(self, progress=None, loadfrom=1):
        """
        load full message text for all msgs from loadfrom..N,
        despite any caching that may be being done in the caller;
        much slower than downloadAllHeaders, if just need hdrs;

        4E: support mailconfig.fetchlimit: see downloadAllHeaders;
        could use server.list() to get sizes of skipped emails here
        too, but clients probably don't care about these anyhow;
        """
        self.trace('loading full messages')
        fetchlimit = mailconfig.fetchlimit
        server = self.connect()
        try:
            (msgCount, msgBytes) = server.stat()          # inbox on server
            allmsgs  = []
            allsizes = []
            for i in range(loadfrom, msgCount+1):         # empty if low >= high
                if progress: progress(i, msgCount)
                if fetchlimit and (i <= msgCount - fetchlimit):
                    # skip, add dummy mail
                    mailtext = 'Subject: --mail skipped--\n\nMail skipped.\n'
                    allmsgs.append(mailtext)
                    allsizes.append(len(mailtext))
                else:
                    # fetch, retr full mail
                    (resp, message, respsz) = server.retr(i)  # save text on list
                    message = self.decodeFullText(message)
                    allmsgs.append('\n'.join(message))        # leave mail on server
                    allsizes.append(respsz)                   # diff from len(msg)
        finally:
            server.quit()                                     # unlock the mail box
        assert len(allmsgs) == (msgCount - loadfrom) + 1      # msg nums start at 1
       #assert sum(allsizes) == msgBytes                      # not if loadfrom > 1
        return allmsgs, allsizes, True                        # not if fetchlimit

    def deleteMessages(self, msgnums, progress=None):
        """
        delete multiple msgs off server; assumes email inbox
        unchanged since msgnums were last determined/loaded;
        use if msg headers not available as state information;
        fast, but poss dangerous: see deleteMessagesSafely
        """
        self.trace('deleting mails')
        server = self.connect()
        try:
            for (ix, msgnum) in enumerate(msgnums):   # don't reconnect for each
                if progress: progress(ix+1, len(msgnums))
                server.dele(msgnum)
        finally:                                      # changes msgnums: reload
            server.quit()

    def deleteMessagesSafely(self, msgnums, synchHeaders, progress=None):
        """
        delete multiple msgs off server, but use TOP fetches to
        check for a match on each msg's header part before deleting;
        assumes the email server supports the TOP interface of POP,
        else raises TopNotSupported - client may call deleteMessages;

        use if the mail server might change the inbox since the email
        index was last fetched, thereby changing POP relative message
        numbers;  this can happen if email is deleted in a different
        client;  some ISPs may also move a mail from inbox to the
        undeliverable box in response to a failed download;

        synchHeaders must be a list of already loaded mail hdrs text,
        corresponding to selected msgnums (requires state);  raises
        exception if any out of synch with the email server;  inbox is
        locked until quit, so it should not change between TOP check
        and actual delete: synch check must occur here, not in caller;
        may be enough to call checkSynchError+deleteMessages, but check
        each msg here in case deletes and inserts in middle of inbox;
        """
        if not self.srvrHasTop:
            raise TopNotSupported('Safe delete cancelled')

        self.trace('deleting mails safely')
        errmsg  = 'Message %s out of synch with server.\n'
        errmsg += 'Delete terminated at this message.\n'
        errmsg += 'Mail client may require restart or reload.'

        server = self.connect()                       # locks inbox till quit
        try:                                          # don't reconnect for each
            (msgCount, msgBytes) = server.stat()      # inbox size on server
            for (ix, msgnum) in enumerate(msgnums):
                if progress: progress(ix+1, len(msgnums))
                if msgnum > msgCount:                            # msgs deleted
                    raise DeleteSynchError(errmsg % msgnum)
                resp, hdrlines, respsz = server.top(msgnum, 0)   # hdrs only
                hdrlines = self.decodeFullText(hdrlines)
                msghdrs = '\n'.join(hdrlines)
                if not self.headersMatch(msghdrs, synchHeaders[msgnum-1]):
                    raise DeleteSynchError(errmsg % msgnum)
                else:
                    server.dele(msgnum)               # safe to delete this msg
        finally:                                      # changes msgnums: reload
            server.quit()                             # unlock inbox on way out

    def checkSynchError(self, synchHeaders):
        """
        check to see if already loaded hdrs text in synchHeaders
        list matches what is on the server, using the TOP command in
        POP to fetch headers text; use if inbox can change due to
        deletes in other client, or automatic action by email server;
        raises except if out of synch, or error while talking to server;

        for speed, only checks last in list: this catches inbox deletes,
        but assumes server won't insert before last (true for incoming
        mails); check inbox size first: smaller if just deletes;  else
        top will differ if deletes and newly arrived messages added at
        end;  result valid only when run: inbox may change after return;
        """
        self.trace('synch check')
        errormsg  = 'Message index out of synch with mail server.\n'
        errormsg += 'Mail client may require restart or reload.'
        server = self.connect()
        try:
            lastmsgnum = len(synchHeaders)                      # 1..N
            (msgCount, msgBytes) = server.stat()                # inbox size
            if lastmsgnum > msgCount:                           # fewer now?
                raise MessageSynchError(errormsg)               # none to cmp
            if self.srvrHasTop:
                resp, hdrlines, respsz = server.top(lastmsgnum, 0)  # hdrs only
                hdrlines = self.decodeFullText(hdrlines)
                lastmsghdrs = '\n'.join(hdrlines)
                if not self.headersMatch(lastmsghdrs, synchHeaders[-1]):
                    raise MessageSynchError(errormsg)
        finally:
            server.quit()

    def headersMatch(self, hdrtext1, hdrtext2):
        """
        may not be as simple as a string compare: some servers add
        a "Status:" header that changes over time; on one ISP, it
        begins as "Status: U" (unread), and changes to "Status: RO"
        (read, old) after fetched once - throws off synch tests if
        new when index fetched, but have been fetched once before
        delete or last-message check;  "Message-id:" line is unique
        per message in theory, but optional, and can be anything if
        forged; match more common: try first; parsing costly: try last
        """
        # try match by simple string compare
        if hdrtext1 == hdrtext2:
            self.trace('Same headers text')
            return True

        # try match without status lines
        split1 = hdrtext1.splitlines()       # s.split('\n'), but no final ''
        split2 = hdrtext2.splitlines()
        strip1 = [line for line in split1 if not line.startswith('Status:')]
        strip2 = [line for line in split2 if not line.startswith('Status:')]
        if strip1 == strip2:
            self.trace('Same without Status')
            return True

        # try mismatch by message-id headers if either has one
        msgid1 = [line for line in split1 if line[:11].lower() == 'message-id:']
        msgid2 = [line for line in split2 if line[:11].lower() == 'message-id:']
        if (msgid1 or msgid2) and (msgid1 != msgid2):
            self.trace('Different Message-Id')
            return False

        # try full hdr parse and common headers if msgid missing or trash
        tryheaders  = ('From', 'To', 'Subject', 'Date')
        tryheaders += ('Cc', 'Return-Path', 'Received')
        msg1 = MailParser().parseHeaders(hdrtext1)
        msg2 = MailParser().parseHeaders(hdrtext2)
        for hdr in tryheaders:                          # poss multiple Received
            if msg1.get_all(hdr) != msg2.get_all(hdr):  # case insens, dflt None
                self.trace('Diff common headers')
                return False

        # all common hdrs match and don't have a diff message-id
        self.trace('Same common headers')
        return True

    def getPassword(self):
        """
        get POP password if not yet known
        not required until go to server
        from client-side file or subclass method
        """
        if not self.popPassword:
            try:
                localfile = open(mailconfig.poppasswdfile)
                self.popPassword = localfile.readline()[:-1]
                self.trace('local file password ' + repr(self.popPassword))
            except:
                self.popPassword = self.askPopPassword()

    def askPopPassword(self):
        assert False, 'Subclass must define method'


################################################################################
# specialized subclasses
################################################################################

class MailFetcherConsole(MailFetcher):
    def askPopPassword(self):
        import getpass
        prompt = 'Password for %s on %s?' % (self.popUser, self.popServer)
        return getpass.getpass(prompt)

class SilentMailFetcher(SilentMailTool, MailFetcher):
    pass   # replaces trace

MailParser Class

Example 13-25 implements the last major class in the mailtools package: given the (already decoded) text of an email message, its tools parse the mail’s content into a message object, with headers and decoded parts. This module is largely just a wrapper around the standard library’s email package, but it adds convenience tools for finding the main text part of a message, generating filenames for message parts, saving attached parts to files, decoding headers, splitting address lists, and so on. See the code for more information. Also notice the parts walker here: by coding its search logic in one place as a generator function, we guarantee that all three of its clients here, as well as any others elsewhere, implement the same traversal.
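To make the shared-traversal idea concrete, here is a minimal standalone sketch of a parts walker coded as a generator, with two small clients built on it. The names walk_leaf_parts, list_types, and count_parts are hypothetical and are not part of mailtools; the sketch skips only multipart containers, which is simpler than what the real walker in the listing does.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

def walk_leaf_parts(message):
    """Hypothetical helper: yield (index, content type, part) for non-containers."""
    for ix, part in enumerate(message.walk()):      # walk yields message itself first
        if part.get_content_maintype() == 'multipart':
            continue                                # skip multipart/* containers
        yield ix, part.get_content_type(), part

# two clients that reuse the same traversal instead of repeating it
def list_types(message):
    return [ctype for (ix, ctype, part) in walk_leaf_parts(message)]

def count_parts(message):
    return sum(1 for item in walk_leaf_parts(message))

# build a small two-part message to walk
msg = MIMEMultipart()
msg.attach(MIMEText('main text'))
msg.attach(MIMEApplication(b'\x00\x01'))

print(list_types(msg))
print(count_parts(msg))
```

Because both clients go through the generator, any change to the skip rules is picked up by each of them automatically. Note that the index counts skipped containers too, as in the listing.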

Unicode decoding for text part payloads and message headers

This module also provides support for decoding message headers per email standards (both full headers and names in address headers), and handles decoding per text part encodings. Headers are decoded according to their content, using tools in the email package; the headers themselves give their MIME and Unicode encodings, so no user intervention is required. For client convenience, we also perform Unicode decoding for main text parts to convert them from bytes to str here if needed.
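As a rough illustration of how a header carries its own encoding information, the following sketch runs the standard library’s email.header.decode_header on an encoded and an unencoded header. The function name decode_header_text is hypothetical, and the sketch deliberately exercises only single-part headers: how whitespace between encoded and unencoded substrings is reported may differ between Python releases, so joining multiple parts is left as an assumption to verify on your own version.

```python
import email.header

def decode_header_text(raw):
    """Hypothetical helper: decode an i18n header to str using its own content."""
    decoded = []
    for text, enc in email.header.decode_header(raw):
        if isinstance(text, bytes):
            # encoded substring, or unencoded bytes reported with enc=None
            decoded.append(text.decode(enc or 'raw-unicode-escape'))
        else:
            decoded.append(text)         # whole header was unencoded: already str
    # assumption: plain concatenation; spacing between parts varies by version
    return ''.join(decoded)

encoded = '=?UTF-8?Q?Introducing=20Top=20Values?='
plain = 'Just a plain subject'

print(decode_header_text(encoded))
print(decode_header_text(plain))
```

The MIME encoding (Q for quoted-printable here) and the Unicode encoding name (UTF-8) both come from the header text itself, which is why no user setting is needed for this step.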

The latter main-text decoding merits elaboration. As discussed earlier in this chapter, Message objects (main or attached) may return their payloads as bytes if we fetch with a decode=1 argument, or if they are bytes to begin with; in other cases, payloads may be returned as str. We generally need to decode bytes in order to treat payloads as text.
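The difference between the two fetch modes can be seen in a short interactive-style sketch. The message text below is made up for illustration; it declares a base64 transfer encoding so that the decode argument has something to undo.

```python
import email

# hypothetical one-part message with a base64-encoded body
raw = (
    'Content-Type: text/plain; charset="utf-8"\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    'c3BhbQ==\n'
)
msg = email.message_from_string(raw)

as_str = msg.get_payload()                 # still MIME-encoded text, a str
as_bytes = msg.get_payload(decode=True)    # MIME encoding undone, now bytes

print(type(as_str), repr(as_str))
print(type(as_bytes), repr(as_bytes))
print(as_bytes.decode(msg.get_content_charset() or 'ascii'))
```

The last line performs the second, Unicode decoding step by hand, using the charset named in the part’s Content-Type header.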

In mailtools itself, str text part payloads are automatically encoded to bytes by decode=1 and then saved to binary-mode files to finesse encoding issues, but main-text payloads are decoded to str if they are bytes. This main-text decoding is performed per the encoding name in the part’s message header (if present and correct), the platform default, or a guess. As we learned in Chapter 9, while GUIs may allow bytes for display, str text generally provides broader Unicode support; furthermore, str is sometimes needed for later processing such as line wrapping and webpage generation.
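The try-in-order strategy described here can be sketched in isolation as follows. This is a simplified stand-in, not the package’s actual method: the function name decode_with_fallbacks is hypothetical, and it returns the encoding name that worked so the outcome is easy to inspect.

```python
import sys

def decode_with_fallbacks(data, declared=None):
    """Hypothetical helper: try the declared charset, then default, then guesses."""
    tries = [declared] if declared else []
    tries += [sys.getdefaultencoding()]      # same as a bare bytes.decode()
    tries += ['latin1', 'utf8']              # guesses: latin1 accepts any byte
    for name in tries:
        try:
            return data.decode(name), name
        except (UnicodeError, LookupError):  # LookupError: unknown codec name
            continue
    return None, None                        # caller decides what to display

print(decode_with_fallbacks('café'.encode('utf8'), 'utf8'))
print(decode_with_fallbacks(b'caf\xe9', 'no-such-codec'))
```

One consequence worth noticing: because latin1 maps every byte to some character, any guess listed after it is never reached, and a wrong guess yields wrong characters rather than an error. That is the sense in which this is a heuristic.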

Since this package can’t predict the role of part payloads other than the main text, clients are responsible for decoding and encoding them as necessary. For instance, other text parts that are saved in binary mode here may require that message headers be consulted later to extract Unicode encoding names for better display. Chapter 14’s PyMailGUI proceeds this way to open text parts on demand, passing message header encoding information on to PyEdit for decoding as text is loaded.
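A client-side version of this division of labor might look like the following sketch: the part is saved as raw bytes, and the encoding name from its header is applied only later, when the file is opened as text. The message text and the filename are invented for the example, and the utf8 fallback is an assumption, not something mailtools prescribes.

```python
import email, os, tempfile

# hypothetical text part that declares a latin-1 charset
raw = (
    'Content-Type: text/plain; charset="latin-1"\n'
    'Content-Transfer-Encoding: quoted-printable\n'
    '\n'
    'caf=E9\n'
)
part = email.message_from_string(raw)
payload = part.get_payload(decode=True)      # bytes, MIME encoding undone
charset = part.get_content_charset()         # encoding name from the header

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'part-001.txt')
    with open(path, 'wb') as saved:          # save in binary mode: no decoding
        saved.write(payload)
    # later, on demand: decode using the name found in the message header
    with open(path, encoding=charset or 'utf8') as viewed:
        text = viewed.read()

print(charset, repr(text))
```

Saving bytes sidesteps encoding errors at save time; the cost is that whoever opens the file needs the header information, or a guess, to display it properly.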

Some of the to-text conversions performed here are potentially partial solutions (some parts may lack the required headers and fail per the platform defaults) and may need to be improved; since this seems likely to be addressed in a future release of Python’s email package, we’ll settle for our assumptions here.

Example 13-25. PP4E\Internet\Email\mailtools\mailParser.py

"""
###############################################################################
parsing and attachment extract, analyse, save (see __init__ for docs, test)
###############################################################################
"""

import os, mimetypes, sys                       # mime: map type to name
import email.parser                             # parse text to Message object
import email.header                             # 4E: headers decode/encode
import email.utils                              # 4E: addr header parse/decode
from email.message import Message               # Message may be traversed
from .mailTool import MailTool                  # 4E: package-relative

class MailParser(MailTool):
    """
    methods for parsing message text, attachments

    subtle thing: Message object payloads are either a simple
    string for non-multipart messages, or a list of Message
    objects if multipart (possibly nested); we don't need to
    distinguish between the two cases here, because the Message
    walk generator always returns self first, and so works fine
    on non-multipart messages too (a single object is walked);

    for simple messages, the message body is always considered
    here to be the sole part of the mail;  for multipart messages,
    the parts list includes the main message text, as well as all
    attachments;  this allows simple messages not of type text to
    be handled like attachments in a UI (e.g., saved, opened);
    Message payload may also be None for some oddball part types;

    4E note: in Py 3.1, text part payloads are returned as bytes
    for decode=1, and might be str otherwise; in mailtools, text
    is stored as bytes for file saves, but main-text bytes payloads
    are decoded to Unicode str per mail header info or platform
    default+guess; clients may need to convert other payloads:
    PyMailGUI uses headers to decode parts saved to binary files;

    4E supports fetched message header auto-decoding per its own
    content, both for general headers such as Subject, as well as
    for names in address header such as From and To; client must
    request this after parse, before display: parser doesn't decode;
    """

    def walkNamedParts(self, message):
        """
        generator to avoid repeating part naming logic;
        skips multipart headers, makes part filenames;
        message is already parsed email.message.Message object;
        doesn't skip oddball types: payload may be None, must
        handle in part saves; some others may warrant skips too;
        """
        for (ix, part) in enumerate(message.walk()):    # walk includes message
            fulltype = part.get_content_type()          # ix includes parts skipped
            maintype = part.get_content_maintype()
            if maintype == 'multipart':                 # multipart/*: container
                continue
            elif fulltype == 'message/rfc822':          # 4E: skip message/rfc822
                continue                                # skip all message/* too?
            else:
                filename, contype = self.partName(part, ix)
                yield (filename, contype, part)

    def partName(self, part, ix):
        """
        extract filename and content type from message part;
        filename: tries Content-Disposition, then Content-Type
        name param, or generates one based on mimetype guess;
        """
        filename = part.get_filename()                # filename in msg hdrs?
        contype  = part.get_content_type()            # lowercase maintype/subtype
        if not filename:
            filename = part.get_param('name')         # try content-type name
        if not filename:
            if contype == 'text/plain':               # hardcode plain text ext
                ext = '.txt'                          # else guesses .ksh!
            else:
                ext = mimetypes.guess_extension(contype)
                if not ext: ext = '.bin'              # use a generic default
            filename = 'part-%03d%s' % (ix, ext)
        return (self.decodeHeader(filename), contype) # oct 2011: decode i18n fnames

    def saveParts(self, savedir, message):
        """
        store all parts of a message as files in a local directory;
        returns [('maintype/subtype', 'filename')] list for use by
        callers, but does not open any parts or attachments here;
        get_payload decodes base64, quoted-printable, uuencoded data;
        mail parser may give us a None payload for oddball types we
        probably should skip over: convert to str here to be safe;
        """
        if not os.path.exists(savedir):
            os.mkdir(savedir)
        partfiles = []
        for (filename, contype, part) in self.walkNamedParts(message):
            fullname = os.path.join(savedir, filename)
            fileobj  = open(fullname, 'wb')             # use binary mode
            content  = part.get_payload(decode=1)       # decode base64,qp,uu
            if not isinstance(content, bytes):          # 4E: need bytes for wb
                content = b'(no content)'               # decode=1 returns bytes,
            fileobj.write(content)                      # but some payloads None
            fileobj.close()                             # 4E: not str(content)
            partfiles.append((contype, fullname))       # for caller to open
        return partfiles

    def saveOnePart(self, savedir, partname, message):
        """
        ditto, but find and save just one part by name
        """
        if not os.path.exists(savedir):
            os.mkdir(savedir)
        fullname = os.path.join(savedir, partname)
        (contype, content) = self.findOnePart(partname, message)
        if not isinstance(content, bytes):          # 4E: need bytes for wb
            content = b'(no content)'               # decode=1 returns bytes,
        open(fullname, 'wb').write(content)         # but some payloads None
        return (contype, fullname)                  # 4E: not str(content)

    def partsList(self, message):
        """
        return a list of filenames for all parts of an
        already parsed message, using same filename logic
        as saveParts, but do not store the part files here
        """
        validParts = self.walkNamedParts(message)
        return [filename for (filename, contype, part) in validParts]

    def findOnePart(self, partname, message):
        """
        find and return part's content, given its name;
        intended to be used in conjunction with partsList;
        we could also mimetypes.guess_type(partname) here;
        we could also avoid this search by saving in dict;
        4E: content may be str or bytes--convert as needed;
        """
        for (filename, contype, part) in self.walkNamedParts(message):
            if filename == partname:
                content = part.get_payload(decode=1)          # does base64,qp,uu
                return (contype, content)                     # may be bytes text

    def decodedPayload(self, part, asStr=True):
        """
        4E: decode text part bytes to Unicode str for display, line wrap,
        etc.; part is a Message; (decode=1) undoes MIME email encodings
        (base64, uuencode, qp), bytes.decode() performs additional Unicode
        text string decodings; tries charset encoding name in message
        headers first (if present, and accurate), then tries platform
        defaults and a few guesses before giving up with error string;
        """
        payload = part.get_payload(decode=1)           # payload may be bytes
        if asStr and isinstance(payload, bytes):       # decode=1 returns bytes
            tries = []
            enchdr = part.get_content_charset()        # try msg headers first!
            if enchdr:
                tries += [enchdr]                      # try headers first
            tries += [sys.getdefaultencoding()]        # same as bytes.decode()
            tries += ['latin1', 'utf8']                # try 8-bit, incl ascii
            for trie in tries:                         # try utf8 (windows dflt)
                try:
                    payload = payload.decode(trie)     # give it a shot, eh?
                    break
                except (UnicodeError, LookupError):    # lookuperr: bad name
                    pass
            else:
                payload = '--Sorry: cannot decode Unicode text--'
        return payload

    def findMainText(self, message, asStr=True):
        """
        for text-oriented clients, return first text part's str;
        for the payload of a simple message, or all parts of
        a multipart message, looks for text/plain, then text/html,
        then text/*, before deducing that there is no text to
        display;  this is a heuristic, but covers most simple,
        multipart/alternative, and multipart/mixed messages;
        content-type defaults to text/plain if not in simple msg;

        handles message nesting at top level by walking instead
        of list scans;  if non-multipart but type is text/html,
        returns the HTML as the text with an HTML type: caller
        may open in web browser, extract plain text, etc;  if
        nonmultipart and not text, there is  no text to display:
        save/open message content in UI; caveat: does not try
        to concatenate multiple inline text/plain parts if any;
        4E: text payloads may be bytes--decodes to str here;
        4E: asStr=False to get raw bytes for HTML file saves;
        """

        # try to find a plain text
        for part in message.walk():                            # walk visits message
            type = part.get_content_type()                     # if nonmultipart
            if type == 'text/plain':                           # may be base64,qp,uu
                return type, self.decodedPayload(part, asStr)  # bytes to str too?

        # try to find an HTML part
        for part in message.walk():
            type = part.get_content_type()                     # caller renders html
            if type == 'text/html':
                return type, self.decodedPayload(part, asStr)

        # try any other text type, including XML
        for part in message.walk():
            if part.get_content_maintype() == 'text':
                return part.get_content_type(), self.decodedPayload(part, asStr)

        # punt: could use first part, but it's not marked as text
        failtext = '[No text to display]' if asStr else b'[No text to display]'
        return 'text/plain', failtext

    def decodeHeader(self, rawheader):
        """
        4E: decode existing i18n message header text per both email and Unicode
        standards, according to its content; return as is if unencoded or fails;
        client must call this to display: parsed Message object does not decode;
        i18n header example: '=?UTF-8?Q?Introducing=20Top=20Values=20..Savers?=';
        i18n header example: 'Man where did you get that =?UTF-8?Q?assistant=3F?=';

        decode_header handles any line breaks in header string automatically, may
        return multiple parts if any substrings of hdr are encoded, and returns all
        bytes in parts list if any encodings found (with unencoded parts encoded as
        raw-unicode-escape and enc=None) but returns a single part with enc=None
        that is str instead of bytes in Py3.1 if the entire header is unencoded
        (must handle mixed types here); see Chapter 13 for more details/examples;

        the following first attempt code was okay unless any encoded substrings, or
        enc was returned as None (raised except which returned rawheader unchanged):
        hdr, enc = email.header.decode_header(rawheader)[0]
        return hdr.decode(enc) # fails if enc=None: no encoding or encoded substrs
        """
        try:
            parts = email.header.decode_header(rawheader)
            decoded = []
            for (part, enc) in parts:                      # for all substrings
                if enc is None:                            # part unencoded?
                    if not isinstance(part, bytes):        # str: full hdr unencoded
                        decoded += [part]                  # else do unicode decode
                    else:
                        decoded += [part.decode('raw-unicode-escape')]
                else:
                    decoded += [part.decode(enc)]
            return ' '.join(decoded)
        except:
            return rawheader         # punt!

    def decodeAddrHeader(self, rawheader):
        """
        4E: decode existing i18n address header text per email and Unicode,
        according to its content; must parse out first part of email address
        to get i18n part: '"=?UTF-8?Q?Walmart?=" <[email protected]>';
        From will probably have just 1 addr, but To, Cc, Bcc may have many;

        decodeHeader handles nested encoded substrings within an entire hdr,
        but we can't simply call it for entire hdr here because it fails if
        encoded name substring ends in " quote instead of whitespace or endstr;
        see also encodeAddrHeader in mailSender module for the inverse of this;

        the following first attempt code failed to handle encoded substrings in
        name, and raised exc for unencoded bytes parts if any encoded substrings;
        namebytes, nameenc = email.header.decode_header(name)[0]  (do email+MIME)
        if nameenc: name = namebytes.decode(nameenc)              (do Unicode?)
        """
        try:
            pairs = email.utils.getaddresses([rawheader])  # split addrs and parts
            decoded = []                                   # handles name commas
            for (name, addr) in pairs:
                try:
                    name = self.decodeHeader(name)                # email+MIME+Uni
                except:
                    name = None   # but uses encoded name if exc in decodeHeader
                joined = email.utils.formataddr((name, addr))     # join parts
                decoded.append(joined)
            return ', '.join(decoded)                             # >= 1 addrs
        except:
            return self.decodeHeader(rawheader)    # try decoding entire string

    def splitAddresses(self, field):
        """
        4E: use comma separator for multiple addrs in the UI, and
        getaddresses to split correctly and allow for comma in the
        name parts of addresses; used by PyMailGUI to split To, Cc,
        Bcc as needed for user inputs and copied headers;  returns
        empty list if field is empty, or any exception occurs;
        """
        try:
            pairs = email.utils.getaddresses([field])                # [(name,addr)]
            return [email.utils.formataddr(pair) for pair in pairs]  # [name <addr>]
        except:
            return []   # syntax error in user-entered field?, etc.

    # returned when parses fail
    errorMessage = Message()
    errorMessage.set_payload('[Unable to parse message - format error]')

    def parseHeaders(self, mailtext):
        """
        parse headers only, return root email.message.Message object
        stops after headers parsed, even if nothing else follows (top)
        email.message.Message object is a mapping for mail header fields
        payload of message object is None, not raw body text
        """
        try:
            return email.parser.Parser().parsestr(mailtext, headersonly=True)
        except:
            return self.errorMessage

    def parseMessage(self, fulltext):
        """
        parse entire message, return root email.message.Message object
        payload of message object is a string if not is_multipart()
        payload of message object is more Messages if multiple parts
        the call here same as calling email.message_from_string()
        """
        try:
            return email.parser.Parser().parsestr(fulltext)       # may fail!
        except:
            return self.errorMessage     # or let call handle? can check return

    def parseMessageRaw(self, fulltext):
        """
        parse headers only, return root email.message.Message object
        stops after headers parsed, for efficiency (not yet used here)
        payload of message object is raw text of mail after headers
        """
        try:
            return email.parser.HeaderParser().parsestr(fulltext)
        except:
            return self.errorMessage
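
To see what decodeHeader and splitAddresses have to cope with, the following standalone sketch exercises the same standard library calls outside the class. The sample header strings and example.com addresses are made up for illustration, and the decode_parts and split_addresses helpers are hypothetical simplifications of the methods above, not the package's own code.

```python
import email.header, email.utils

def decode_parts(rawheader):
    """Hypothetical helper: decode each (text, charset) part of one header."""
    decoded = []
    for part, enc in email.header.decode_header(rawheader):
        if isinstance(part, bytes):                       # encoded or mixed header
            part = part.decode(enc or 'raw-unicode-escape')
        decoded.append(part)                              # str: header had no encoding
    return ' '.join(decoded)

def split_addresses(field):
    """Hypothetical helper: split on commas between addresses, not inside names."""
    pairs = email.utils.getaddresses([field])             # [(name, addr)]
    return [email.utils.formataddr(pair) for pair in pairs]

subject = decode_parts('=?UTF-8?Q?Caf=C3=A9?=')           # MIME-encoded UTF-8 text
plain   = decode_parts('plain subject')                   # unencoded: returned as is
addrs   = split_addresses('"Smith, Bob" <bob@example.com>, sue@example.com')

print(subject)
print(plain)
print(addrs)
```

Notice that the comma inside the quoted name does not split the first address; this is why the class uses getaddresses rather than a simple str.split on commas.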

Self-Test Script

The last file in the mailtools package, Example 13-26, lists the self-test code for the package. This code is a separate script file, in order to allow for import search path manipulation—it emulates a real client, which is assumed to have a mailconfig.py module in its own source directory (this module can vary per client).

Example 13-26. PP4E\Internet\Email\mailtools\selftest.py

"""
###############################################################################
self-test when this file is run as a program
###############################################################################
"""

#
# mailconfig normally comes from the client's source directory or
# sys.path; for testing, get it from Email directory one level up
#
import sys
sys.path.append('..')
import mailconfig
print('config:', mailconfig.__file__)

# get these from __init__
from mailtools import (MailFetcherConsole,
                       MailSender, MailSenderAuthConsole,
                       MailParser)

if not mailconfig.smtpuser:
    sender = MailSender(tracesize=5000)
else:
    sender = MailSenderAuthConsole(tracesize=5000)

sender.sendMessage(From      = mailconfig.myaddress,
                   To        = [mailconfig.myaddress],
                   Subj      = 'testing mailtools package',
                   extrahdrs = [('X-Mailer', 'mailtools')],
                   bodytext  = 'Here is my source code\n',
                   attaches  = ['selftest.py'],
                  )

                   # bodytextEncoding='utf-8',          # other tests to try
                   # attachesEncodings=['latin-1'],     # inspect text headers
                   # attaches=['monkeys.jpg'])          # verify Base64 encoded
                   # to='i18n addr list...',            # test mime/unicode headers


# change mailconfig to test fetchlimit
fetcher = MailFetcherConsole()
def status(*args): print(args)

hdrs, sizes, loadedall = fetcher.downloadAllHeaders(status)
for num, hdr in enumerate(hdrs[:5]):
    print(hdr)
    if input('load mail?') in ['y', 'Y']:
        print(fetcher.downloadMessage(num+1).rstrip(), '\n', '-'*70)

last5 = len(hdrs)-4
msgs, sizes, loadedall = fetcher.downloadAllMessages(status, loadfrom=last5)
for msg in msgs:
    print(msg[:200], '\n', '-'*70)

parser = MailParser()
for i in [0]:                  # try [0 , len(msgs)]
    fulltext = msgs[i]
    message  = parser.parseMessage(fulltext)
    ctype, maintext = parser.findMainText(message)
    print('Parsed:', message['Subject'])
    print(maintext)
input('Press Enter to exit')   # pause if clicked on Windows

Running the self-test

Here’s a run of the self-test script; it generates a lot of output, most of which has been deleted here for presentation in this book—as usual, run this on your own for further details:

C:\...\PP4E\Internet\Email\mailtools> selftest.py
config: ..\mailconfig.py
user: [email protected]
Adding text/x-python
Sending to...['[email protected]']
Content-Type: multipart/mixed; boundary="===============0085314748=="
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Subject: testing mailtools package
Date: Sat, 08 May 2010 19:26:22 -0000
X-Mailer: mailtools

A multi-part MIME format message.

--===============0085314748==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

Here is my source code

--===============0085314748==
Content-Type: text/x-python; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="selftest.py"

"""
###############################################################################
self-test when this file is run as a program
###############################################################################
"""
...more lines omitted...

print(maintext)
input('Press Enter to exit')   # pause if clicked on Windows

--===============0085314748==--

Send exit
loading headers
Connecting...
Password for [email protected] on pop.secureserver.net?
b'+OK <[email protected]>'
(1, 7)
(2, 7)
(3, 7)
(4, 7)
(5, 7)
(6, 7)
(7, 7)
load headers exit
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 -0000
Received: from unknown (HELO p3pismtp01-026.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...

load mail?y
load 1
Connecting...
b'+OK <[email protected]>'
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 -0000
Received: from unknown (HELO p3pismtp01-026.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...

load mail?
loading full messages
Connecting...
b'+OK <[email protected]>'
(3, 7)
(4, 7)
(5, 7)
(6, 7)
(7, 7)
Received: (qmail 25683 invoked from network); 6 May 2010 14:12:07 -0000
Received: from unknown (HELO p3pismtp01-018.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...

Parsed: A B C D E F G
Fiddle de dum, Fiddle de dee,
Eric the half a bee.

Press Enter to exit

Updating the pymail Console Client

As a final email example in this chapter, and to give a better use case for the mailtools module package of the preceding sections, Example 13-27 provides an updated version of the pymail program we met earlier (Example 13-20). It uses our mailtools package to access email, instead of interfacing with Python’s email package directly. Compare its code to the original pymail in this chapter to see how mailtools is employed here. You’ll find that its mail download and send logic is substantially simpler.

Example 13-27. PP4E\Internet\Email\pymail2.py

#!/usr/local/bin/python
"""
################################################################################
pymail2 - simple console email interface client in Python;  this version uses
the mailtools package, which in turn uses poplib, smtplib, and the email package
for parsing and composing emails;  displays first text part of mails, not the
entire full text;  fetches just mail headers initially, using the TOP command;
fetches full text of just email selected to be displayed;  caches already
fetched mails; caveat: no way to refresh index;  uses standalone mailtools
objects - they can also be used as superclasses;
################################################################################
"""

import mailconfig, mailtools
from pymail import inputmessage
mailcache = {}

def fetchmessage(i):
    try:
        fulltext = mailcache[i]
    except KeyError:
        fulltext = fetcher.downloadMessage(i)
        mailcache[i] = fulltext
    return fulltext

def sendmessage():
    From, To, Subj, text = inputmessage()
    sender.sendMessage(From, To, Subj, [], text, attaches=None)

def deletemessages(toDelete, verify=True):
    print('To be deleted:', toDelete)
    if verify and input('Delete?')[:1] not in ['y', 'Y']:
        print('Delete cancelled.')
    else:
        print('Deleting messages from server...')
        fetcher.deleteMessages(toDelete)

def showindex(msgList, msgSizes, chunk=5):
    count = 0
    for (msg, size) in zip(msgList, msgSizes):     # email.message.Message, int
        count += 1                                 # 3.x iter ok here
        print('%d:\t%d bytes' % (count, size))
        for hdr in ('From', 'To', 'Date', 'Subject'):
            print('\t%-8s=>%s' % (hdr, msg.get(hdr, '(unknown)')))
        if count % chunk == 0:
            input('[Press Enter key]')             # pause after each chunk

def showmessage(i, msgList):
    if 1 <= i <= len(msgList):
        fulltext = fetchmessage(i)
        message  = parser.parseMessage(fulltext)
        ctype, maintext = parser.findMainText(message)
        print('-' * 79)
        print(maintext.rstrip() + '\n')   # main text part, not entire mail
        print('-' * 79)                   # and not any attachments after
    else:
        print('Bad message number')

def savemessage(i, mailfile, msgList):
    if 1 <= i <= len(msgList):
        fulltext = fetchmessage(i)
        with open(mailfile, 'a', encoding=mailconfig.fetchEncoding) as savefile:   # 4E
            savefile.write('\n' + fulltext + '-'*80 + '\n')     # file closed on block exit
    else:
        print('Bad message number')

def msgnum(command):
    try:
        return int(command.split()[1])
    except:
        return -1   # assume this is bad

helptext = """
Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text
"""

def interact(msgList, msgSizes, mailfile):
    showindex(msgList, msgSizes)
    toDelete = []
    while True:
        try:
            command = input('[Pymail] Action? (i, l, d, s, m, q, ?) ')
        except EOFError:
            command = 'q'
        if not command: command = '*'

        if command == 'q':                     # quit
            break

        elif command[0] == 'i':                # index
            showindex(msgList, msgSizes)

        elif command[0] == 'l':                # list
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    showmessage(i, msgList)
            else:
                showmessage(msgnum(command), msgList)

        elif command[0] == 's':                # save
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    savemessage(i, mailfile, msgList)
            else:
                savemessage(msgnum(command), mailfile, msgList)

        elif command[0] == 'd':                # mark for deletion later
            if len(command) == 1:              # 3.x needs list(): iter
                toDelete = list(range(1, len(msgList)+1))
            else:
                delnum = msgnum(command)
                if (1 <= delnum <= len(msgList)) and (delnum not in toDelete):
                    toDelete.append(delnum)
                else:
                    print('Bad message number')

        elif command[0] == 'm':                # send a new mail via SMTP
            try:
                sendmessage()
            except:
                print('Error - mail not sent')

        elif command[0] == '?':
            print(helptext)
        else:
            print('What? -- type "?" for commands help')
    return toDelete

def main():
    global parser, sender, fetcher
    mailserver = mailconfig.popservername
    mailuser   = mailconfig.popusername
    mailfile   = mailconfig.savemailfile

    parser     = mailtools.MailParser()
    sender     = mailtools.MailSender()
    fetcher    = mailtools.MailFetcherConsole(mailserver, mailuser)

    def progress(i, max):
        print(i, 'of', max)

    hdrsList, msgSizes, ignore = fetcher.downloadAllHeaders(progress)
    msgList = [parser.parseHeaders(hdrtext) for hdrtext in hdrsList]

    print('[Pymail email client]')
    toDelete = interact(msgList, msgSizes, mailfile)
    if toDelete: deletemessages(toDelete)

if __name__ == '__main__': main()

Running the pymail2 console client

This program is used interactively, the same as the original. In fact, the output is nearly identical, so we won’t go into further details. Here’s a quick look at this script in action; run this on your own machine to see it firsthand:

C:\...\PP4E\Internet\Email> pymail2.py
user: [email protected]
loading headers
Connecting...
Password for [email protected] on pop.secureserver.net?
b'+OK <[email protected]>'
1 of 7
2 of 7
3 of 7
4 of 7
5 of 7
6 of 7
7 of 7
load headers exit
[Pymail email client]
1:      1860 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
        Subject =>I'm a Lumberjack, and I'm Okay
2:      1408 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 05 May 2010 08:33:47 -0700
        Subject =>testing
3:      1049 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:11:07 -0000
        Subject =>A B C D E F G
4:      1038 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:32:32 -0000
        Subject =>a b c d e f g
5:      957 bytes
        From    =>[email protected]
        To      =>maillist
        Date    =>Thu, 06 May 2010 10:58:40 -0400
        Subject =>test interactive smtplib
[Press Enter key]
6:      1037 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Fri, 07 May 2010 20:32:38 -0000
        Subject =>Among our weapons are these
7:      3248 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Sat, 08 May 2010 19:26:22 -0000
        Subject =>testing mailtools package
[Pymail] Action? (i, l, d, s, m, q, ?) l 7
load 7
Connecting...
b'+OK <[email protected]>'
-------------------------------------------------------------------------------
Here is my source code

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) d 7
[Pymail] Action? (i, l, d, s, m, q, ?) m
From? [email protected]
To?   [email protected]
Subj? test pymail2 send
Type message text, end with line="."
Run away! Run away!
.
Sending to...['[email protected]']
From: [email protected]
To: [email protected]
Subject: test pymail2 send
Date: Sat, 08 May 2010 19:44:25 -0000

Run away! Run away!

Send exit
[Pymail] Action? (i, l, d, s, m, q, ?) q
To be deleted: [7]
Delete?y
Deleting messages from server...
deleting mails
Connecting...
b'+OK <[email protected]>'

The messages in our mailbox have quite a few origins now—ISP webmail clients, basic SMTP scripts, the Python interactive command line, mailtools self-test code, and two console-based email clients; in later chapters, we’ll add even more. All their mails look the same to our script; here’s a verification of the email we just sent (the second fetch finds it already in-cache):

C:\...\PP4E\Internet\Email> pymail2.py
user: [email protected]
loading headers
Connecting...
...more lines omitted...

[Press Enter key]
6:      1037 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Fri, 07 May 2010 20:32:38 -0000
        Subject =>Among our weapons are these
7:      984 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Sat, 08 May 2010 19:44:25 -0000
        Subject =>test pymail2 send
[Pymail] Action? (i, l, d, s, m, q, ?) l 7
load 7
Connecting...
b'+OK <[email protected]>'
-------------------------------------------------------------------------------
Run away! Run away!

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) l 7
-------------------------------------------------------------------------------
Run away! Run away!

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) q

Study pymail2’s code for more insights. As you’ll see, this version eliminates some complexities, such as the manual formatting of composed mail message text. It also does a better job of displaying a mail’s text—instead of blindly listing the full mail text (attachments and all), it uses mailtools to fetch the first text part of the message. The messages we’re using are too simple to show the difference, but for a mail with attachments, this new version will be more focused about what it displays.
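
The idea of showing only a mail’s main text can be sketched with the standard email package alone. The first_text_part function below is a hypothetical stand-in, not mailtools’ actual findMainText; it assumes that the first text/plain part found by walk() is the one worth displaying, and the sample message is built in place for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def first_text_part(message):
    """Hypothetical helper: return (content type, text) of first text/plain part."""
    for part in message.walk():                         # root first, then nested parts
        if part.get_content_type() == 'text/plain':
            payload = part.get_payload(decode=True)     # bytes, transfer encoding undone
            charset = part.get_content_charset() or 'ascii'
            return 'text/plain', payload.decode(charset, errors='replace')
    return 'text/plain', '[No text to display]'         # assumed placeholder text

# build a small multipart mail: one text body plus one source file attachment
mail = MIMEMultipart()
mail['Subject'] = 'testing'
mail.attach(MIMEText('Here is my source code\n'))
attach = MIMEText('print("spam")\n', _subtype='x-python')
attach.add_header('Content-Disposition', 'attachment', filename='selftest.py')
mail.attach(attach)

# round-trip through full mail text, as if it had been fetched from a server
parsed = email.message_from_string(mail.as_string())
ctype, text = first_text_part(parsed)
print(ctype)
print(text.rstrip())
```

Only the body text is shown; the attached source file is skipped because its content type is text/x-python, not text/plain.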

Moreover, because the interface to mail is encapsulated in the mailtools package’s modules, if it ever must change, it will only need to be changed in that package, regardless of how many mail clients use its tools. And because the code in mailtools is shared, if we know it works for one client, we can be reasonably confident it will work in another; there is far less new code to debug.

On the other hand, pymail2 doesn’t really leverage much of the power of either mailtools or the underlying email package it uses. For example, things like attachments, internationalized headers, and inbox synchronization are not handled at all, and the decoded main text it prints may contain characters that the console terminal cannot display. To see the full scope of the email package, we need to explore a larger email system, such as PyMailGUI or PyMailCGI. The first of these is the topic of the next chapter, and the second appears in Chapter 16. First, though, let’s quickly survey a handful of additional client-side protocol tools.

NNTP: Accessing Newsgroups

So far in this chapter, we have focused on Python’s FTP and email processing tools and have met a handful of client-side scripting modules along the way: ftplib, poplib, smtplib, email, mimetypes, urllib, and so on. This set is representative of Python’s client-side library tools for transferring and processing information over the Internet, but it’s not at all complete.

A more or less comprehensive list of Python’s Internet-related modules appears at the start of the previous chapter. Among other things, Python also includes client-side support libraries for Internet news, Telnet, HTTP, XML-RPC, and other standard protocols. Most of these are analogous to modules we’ve already met—they provide an object-based interface that automates the underlying sockets and message structures.

For instance, Python’s nntplib module supports the client-side interface to NNTP—the Network News Transfer Protocol—which is used for reading and posting articles to Usenet newsgroups on the Internet. Like other protocols, NNTP runs on top of sockets and merely defines a standard message protocol; like other modules, nntplib hides most of the protocol details and presents an object-based interface to Python scripts.

We won’t get into full protocol details here, but in brief, NNTP servers store a range of articles on the server machine, usually in a flat-file database. If you have the domain or IP name of a server machine that runs an NNTP server program listening on the NNTP port, you can write scripts that fetch or post articles from any machine that has Python and an Internet connection. For instance, the script in Example 13-28 by default fetches and displays the last 10 articles from Python’s Internet newsgroup, comp.lang.python, from the news.rmi.net NNTP server at one of my ISPs.

Example 13-28. PP4E\Internet\Other\readnews.py

"""
fetch and print usenet newsgroup posting from comp.lang.python via the
nntplib module, which really runs on top of sockets; nntplib also supports
posting new messages, etc.; note: posts not deleted after they are read;
"""

listonly = False
showhdrs = ['From', 'Subject', 'Date', 'Newsgroups', 'Lines']
try:
    import sys
    servername, groupname, showcount = sys.argv[1:]
    showcount  = int(showcount)
except:
    import nntpconfig                        # assumed: your own module with servername
    servername = nntpconfig.servername       # assign this to your server
    groupname  = 'comp.lang.python'          # cmd line args or defaults
    showcount  = 10                          # show last showcount posts

# connect to nntp server
print('Connecting to', servername, 'for', groupname)
from nntplib import NNTP
connection = NNTP(servername)
(reply, count, first, last, name) = connection.group(groupname)
print('%s has %s articles: %s-%s' % (name, count, first, last))

# get request headers only
fetchfrom = str(int(last) - (showcount-1))
(reply, subjects) = connection.xhdr('subject', (fetchfrom + '-' + last))

# show headers, get message hdr+body
for (id, subj) in subjects:                  # [-showcount:] if fetch all hdrs
    print('Article %s [%s]' % (id, subj))
    if not listonly and input('=> Display?') in ['y', 'Y']:
        reply, num, tid, list = connection.head(id)
        for line in list:
            for prefix in showhdrs:
                if line[:len(prefix)] == prefix:
                    print(line[:80])
                    break
        if input('=> Show body?') in ['y', 'Y']:
            reply, num, tid, list = connection.body(id)
            for line in list:
                print(line[:80])
    print()
print(connection.quit())

As for FTP and email tools, the script creates an NNTP object and calls its methods to fetch newsgroup information and articles’ header and body text. The xhdr method, for example, loads selected headers from a range of messages.

For NNTP servers that require authentication, you may also have to pass a username, a password, and possibly a reader-mode flag to the NNTP call. See the Python Library manual for more on other NNTP parameters and object methods.

In the interest of space and time, I’ll omit this script’s outputs here. When run, it connects to the server and displays each article’s subject line, pausing to ask whether it should fetch and show the article’s header information lines (headers listed in the variable showhdrs only) and body text. We can also pass this script an explicit server name, newsgroup, and display count on the command line to apply it in different ways. With a little more work, we could turn this script into a full-blown news interface. For instance, new articles could be posted from within a Python script with code of this form (assuming the local file already contains proper NNTP header lines):

# to post, say this (but only if you really want to post!)
connection = NNTP(servername)
localfile = open('filename')      # file has proper headers
connection.post(localfile)        # send text to newsgroup
connection.quit()

We might also add a tkinter-based GUI frontend to this script to make it more usable, but we’ll leave such an extension on the suggested exercise heap (see also the PyMailGUI interface’s suggested extensions at the end of the next chapter—email and news messages have a similar structure).

HTTP: Accessing Websites

Python’s standard library (the modules that are installed with the interpreter) also includes client-side support for HTTP—the Hypertext Transfer Protocol—a message structure and port standard used to transfer information on the World Wide Web. In short, this is the protocol that your web browser (e.g., Internet Explorer, Firefox, Chrome, or Safari) uses to fetch web pages and run applications on remote servers as you surf the Web. Essentially, it’s just bytes sent over port 80.

To really understand HTTP-style transfers, you need to know some of the server-side scripting topics covered in Chapter 15 (e.g., script invocations and Internet address schemes), so this section may be less useful to readers with no such background. Luckily, though, the basic HTTP interfaces in Python are simple enough for a cursory understanding even at this point in the book, so let’s take a brief look here.

Python’s standard http.client module automates much of the protocol defined by HTTP and allows scripts to fetch web pages as clients much like web browsers; as we’ll see in Chapter 15, http.server also allows us to implement web servers to handle the other side of the dialog. For instance, the script in Example 13-29 can be used to grab any file from any server machine running an HTTP web server program. As usual, the file (and descriptive header lines) is ultimately transferred as formatted messages over a standard socket port, but most of the complexity is hidden by the http.client module (see our raw socket dialog with a port 80 HTTP server in Chapter 12 for a comparison).

Example 13-29. PP4E\Internet\Other\http-getfile.py

"""
fetch a file from an HTTP (web) server over sockets via http.client;  the filename
parameter may have a full directory path, and may name a CGI script with ? query
parameters on the end to invoke a remote program;  fetched file data or remote
program output could be saved to a local file to mimic FTP, or parsed with str.find
or html.parser module;  also: http.client request(method, url, body=None, hdrs={});
"""

import sys, http.client
showlines = 6
try:
    servername, filename = sys.argv[1:]           # cmdline args?
except:
    servername, filename = 'learning-python.com', '/index.html'

print(servername, filename)
server = http.client.HTTPConnection(servername)   # connect to http site/server
server.putrequest('GET', filename)                # send request and headers
server.putheader('Accept', 'text/html')           # POST requests work here too
server.endheaders()                               # as do CGI script filenames

reply = server.getresponse()                      # read reply headers + data
if reply.status != 200:                           # 200 means success
    print('Error sending request', reply.status, reply.reason)
else:
    data = reply.readlines()                      # file obj for data received
    reply.close()                                 # show lines with eoln at end
    for line in data[:showlines]:                 # to save, write data to file
        print(line)                               # line already has \n, but bytes

Desired server names and filenames can be passed on the command line to override hardcoded defaults in the script. You need to know something of the HTTP protocol to make the most sense of this code, but it’s fairly straightforward to decipher. When run on the client, this script makes an HTTP connection object to connect to the server, sends it a GET request along with acceptable reply types, and then reads the server’s reply. Much like raw email message text, the HTTP server’s reply usually begins with a set of descriptive header lines, followed by the contents of the requested file. The connection object’s getresponse method gives us a response object with file-like read methods (readlines here) from which we can read the downloaded data.
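
To watch the same dialog without depending on a remote site, here is a minimal sketch that starts a throwaway web server in a thread and fetches from it with http.client. The HelloHandler class, its page content, and the use of 127.0.0.1 with an arbitrary free port are all assumptions made for this illustration; the request shorthand used here is an alternative to the putrequest, putheader, and endheaders calls in Example 13-29.

```python
import http.client, http.server, threading

class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler: replies to any GET with one fixed HTML page."""
    def do_GET(self):
        body = b'<HTML>\n<TITLE>Hello</TITLE>\n</HTML>\n'
        self.send_response(200)                           # status line
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()                                # blank line ends headers
        self.wfile.write(body)                            # then the file's content
    def log_message(self, *args):                         # silence request logging
        pass

server = http.server.HTTPServer(('127.0.0.1', 0), HelloHandler)   # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('127.0.0.1', server.server_port)
conn.request('GET', '/index.html', headers={'Accept': 'text/html'})
reply  = conn.getresponse()                               # headers parsed, body pending
status, reason = reply.status, reply.reason
ctype  = reply.getheader('Content-Type')                  # one reply header line
lines  = reply.readlines()                                # body as a list of bytes lines
conn.close()
server.shutdown()
server.server_close()

print(status, reason, ctype)
for line in lines:
    print(line)
```

The reply’s status and headers are available as soon as getresponse returns; the body is read separately, and arrives as bytes.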

Let’s fetch a few files with this script. Like all Python client-side scripts, this one works on any machine with Python and an Internet connection (here it runs on a Windows client). Assuming that all goes well, the first few lines of the downloaded file are printed; in a more realistic application, the text we fetch would probably be saved to a local file, parsed with Python’s html.parser module (introduced in Chapter 19), and so on. Without arguments, the script simply fetches the HTML index page at http://learning-python.com, a domain name I host at a commercial service provider:

C:\...\PP4E\Internet\Other> http-getfile.py
learning-python.com /index.html
b'<HTML>\n'
b' \n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Python Training Services</TITLE>\n"
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD>\n'

Notice that in Python 3.X the fetched data comes back as bytes strings again, not str; since the Python html.parser HTML parser we’ll meet in Chapter 19 expects str text strings instead of bytes, you’ll likely need to resolve a Unicode encoding choice here in order to parse, much the same as we did for email message text earlier in this chapter. As there, we might decode from bytes to str per a default, user preferences or selections, headers inspection, or byte structure analysis. Because sockets send raw bytes, we confront this choice point whenever data shipped over them is text in nature; unless that text’s type is known or always simple in form, Unicode implies extra steps.
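
As one possible way to make that choice, the sketch below decodes reply bytes per the charset named in a Content-Type header value, falling back on a default. The decode_body function, its utf-8 default, and its last-resort latin-1 fallback are assumptions for this example only; real pages may need more careful detection.

```python
from email.message import Message

def decode_body(data, content_type=None, default='utf-8'):
    """Hypothetical helper: bytes to str per a Content-Type charset, else a default."""
    charset = None
    if content_type:
        header = Message()                        # reuse email's header parameter parser
        header['Content-Type'] = content_type
        charset = header.get_content_charset()    # lowercased name, or None if absent
    try:
        return data.decode(charset or default)
    except (LookupError, UnicodeDecodeError):     # unknown codec name or bad bytes
        return data.decode('latin-1')             # assumed fallback: never fails

named    = decode_body(b'caf\xe9', 'text/html; charset=ISO-8859-1')
unnamed  = decode_body(b'caf\xc3\xa9', 'text/html')
fallback = decode_body(b'caf\xe9')
print(named, unnamed, fallback)
```

With http.client, the header value passed in would come from a call such as reply.getheader('Content-Type') on the response object.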

We can also list a server and file to be fetched on the command line, if we want to be more specific. In the following code, we use the script to fetch files from two different websites by listing their names on the command lines (I’ve truncated some of these lines so they fit in this book). Notice that the filename argument can include an arbitrary remote directory path to the desired file, as in the last fetch here:

C:\...\PP4E\Internet\Other> http-getfile.py www.python.org /index.html
www.python.org /index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

C:\...\PP4E\Internet\Other> http-getfile.py www.python.org index.html
www.python.org index.html
Error sending request 400 Bad Request

C:\...\PP4E\Internet\Other> http-getfile.py www.learning-python.com /books
www.learning-python.com /books
Error sending request 301 Moved Permanently

C:\...\PP4E\Internet\Other> http-getfile.py www.learning-python.com /books/index.html
www.learning-python.com /books/index.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Book Support Site</TITLE>\n"
b'</HEAD>\n'
b'<BODY BGCOLOR="#f1f1ff">\n'

Notice the second and third attempts in this code: if the request fails, the script receives and displays an HTTP error code from the server (we forgot the leading slash on the second, and the “index.html” on the third—required for this server and interface). With the raw HTTP interfaces, we need to be precise about what we want.
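
The numeric codes in these replies follow a standard scheme, and http.client includes a table of their reason phrases. The following sketch classifies a status code by its range; the describe function and its category labels are hypothetical names chosen for this example.

```python
import http.client

def describe(status):
    """Hypothetical helper: standard reason phrase plus a rough category."""
    reason = http.client.responses.get(status, 'Unknown')   # stdlib code-to-phrase table
    if 200 <= status < 300:
        kind = 'success'
    elif 300 <= status < 400:
        kind = 'redirect'           # reply's Location header names the new URL
    elif 400 <= status < 500:
        kind = 'client error'       # e.g., a malformed path in the request
    else:
        kind = 'other'
    return '%s %s (%s)' % (status, reason, kind)

for code in (200, 301, 400):
    print(describe(code))
```

A script that wants to behave more like a browser could use the redirect category to issue a second request for the address given in the reply’s Location header.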

Technically, the string we call filename in the script can refer to either a simple static web page file or a server-side program that generates HTML as its output. Those server-side programs are usually called CGI scripts—the topic of Chapters 15 and 16. For now, keep in mind that when filename refers to a script, this program can be used to invoke another program that resides on a remote server machine. In that case, we can also specify parameters (called a query string) to be passed to the remote program after a ?.
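
Query strings can be typed by hand, but values containing special characters must be escaped. As a small sketch, the standard urllib.parse module can build and take apart such strings; the script_path function is a hypothetical helper, and the script name is borrowed from the example that follows.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def script_path(script, **params):
    """Hypothetical helper: append escaped query parameters to a script path."""
    if not params:
        return script
    return script + '?' + urlencode(params)       # name=value pairs joined by &

path    = script_path('/cgi-bin/languages.py', language='Python')
escaped = script_path('/cgi-bin/languages.py', language='C++')
inputs  = parse_qs(urlsplit(escaped).query)       # what the server side would see

print(path)
print(escaped)
print(inputs)
```

The resulting string can be passed as the filename argument of the script in Example 13-29; the + characters in the second call are escaped so that they are not mistaken for encoded spaces.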

Here, for instance, we pass a language=Python parameter to a CGI script we will meet in Chapter 15 (to make this work, we also need to first spawn a locally running HTTP web server coded in Python using a script we first met in Chapter 1 and will revisit in Chapter 15):

In a different window
C:\...\PP4E\Internet\Web> webserver.py
webdir ".", port 80

C:\...\PP4E\Internet\Other> http-getfile.py localhost
                               /cgi-bin/languages.py?language=Python
localhost /cgi-bin/languages.py?language=Python
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>Python</H3><P><PRE>\n'
b" print('Hello World')               \n"
b'</PRE></P><BR>\n'
b'<HR>\n'

This book has much more to say later about HTML, CGI scripts, and the meaning of the HTTP GET request used in Example 13-29 (along with POST, one of two ways to format information sent to an HTTP server), so we’ll skip additional details here.

Suffice it to say, though, that we could use the HTTP interfaces to write our own web browsers and build scripts that use websites as though they were subroutines. By sending parameters to remote programs and parsing their results, websites can take on the role of simple in-process functions (albeit, much more slowly and indirectly).
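
To make the subroutine idea concrete, here is a minimal sketch that wraps the languages.py invocation shown earlier in an ordinary function call. The names PreText, fetch_page, and syntax_for are hypothetical helpers invented for this illustration, the script URL assumes the local web server from the prior run, and the fetch parameter is an assumption added so the logic can be tried with a saved reply instead of a live server.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

class PreText(HTMLParser):
    """Collect the text found inside <PRE>...</PRE> in a reply page."""
    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'pre':                      # tag names arrive lowercased
            self.inside = True

    def handle_endtag(self, tag):
        if tag == 'pre':
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

def fetch_page(address):
    """Default fetcher: download the reply over HTTP as bytes."""
    with urlopen(address) as reply:
        return reply.read()

def syntax_for(language, fetch=fetch_page,
               script='http://localhost/cgi-bin/languages.py'):
    """Call the remote script like a function: parameter in, result out."""
    address = script + '?' + urlencode({'language': language})
    parser = PreText()
    parser.feed(fetch(address).decode('utf-8', errors='replace'))
    parser.close()
    return ''.join(parser.chunks).strip()
```

With the local server running, syntax_for('Python') would send the request for real; passing a fetch function that returns saved reply bytes exercises the same parsing offline.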

The urllib Package Revisited

The http.client module we just met provides low-level control for HTTP clients. When dealing with items available on the Web, though, it’s often easier to code downloads with Python’s standard urllib.request module, introduced in the FTP section earlier in this chapter. Since this module is another way to talk HTTP, let’s expand on its interfaces here.

Recall that given a URL, urllib.request either downloads the requested object over the Net to a local file or gives us a file-like object from which we can read the requested object’s contents. As a result, the script in Example 13-30 does the same work as the http.client script we just wrote but requires noticeably less code.

Example 13-30. PP4E\Internet\Other\http-getfile-urllib1.py

"""
fetch a file from an HTTP (web) server over sockets via urllib;  urllib supports
HTTP, FTP, files, and HTTPS via URL address strings;  for HTTP, the URL can name
a file or trigger a remote CGI script;  see also the urllib example in the FTP
section, and the CGI script invocation in a later chapter;  files can be fetched
over the net with Python in many ways that vary in code and server requirements:
over sockets, FTP, HTTP, urllib, and CGI outputs;  caveat: should run filename
through urllib.parse.quote to escape properly unless hardcoded--see later chapters;
"""

import sys
from urllib.request import urlopen
showlines = 6
try:
    servername, filename = sys.argv[1:]              # cmdline args?
except ValueError:                                   # wrong number of args
    servername, filename = 'learning-python.com', '/index.html'

remoteaddr = 'http://%s%s' % (servername, filename)  # can name a CGI script too
print(remoteaddr)
remotefile = urlopen(remoteaddr)                     # returns input file object
remotedata = remotefile.readlines()                  # read data directly here
remotefile.close()
for line in remotedata[:showlines]: print(line)      # bytes with embedded \n
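
Because urlopen is driven by the URL's scheme rather than by HTTP specifically, the same file-like interface can be tried without any server at all. The sketch below is an illustration only: it points urlopen at a file: URL for a temporary page, and the helper names first_lines and demo are hypothetical.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def first_lines(url, count=6):
    """Return the first few lines of whatever the URL names, as bytes."""
    with urlopen(url) as remotefile:          # file-like object, closed on exit
        return remotefile.readlines()[:count]

def demo():
    # Stand-in for a remote page: a temporary local file reached by a file: URL
    with tempfile.TemporaryDirectory() as folder:
        page = Path(folder) / 'index.html'
        page.write_bytes(b'<HTML>\n<HEAD>\n')
        return first_lines(page.as_uri())

if __name__ == '__main__':
    for line in demo():
        print(line)                           # bytes with embedded \n
```

Swapping the file: URL for an http:// address is the only change needed to fetch from a web server instead.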

Almost all HTTP transfer details are hidden behind the urllib.request interface here. This version works in almost the same way as the http.client version we wrote first, but it builds and submits an Internet URL address to get its work done (the constructed URL is printed as the script’s first output line). As we saw in the FTP section of this chapter, the urllib.request function urlopen returns a file-like object from which we can read the remote data. But because the constructed URLs begin with “http://” here, the urllib.request module automatically employs the lower-level HTTP interfaces to download the requested file instead of FTP:

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py
http://learning-python.com/index.html
b'<HTML>\n'
b' \n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Python Training Services</TITLE>\n"
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py www.python.org /index
http://www.python.org/index
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py www.learning-python.com /books
http://www.learning-python.com/books
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Book Support Site</TITLE>\n"
b'</HEAD>\n'
b'<BODY BGCOLOR="#f1f1ff">\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py
                                localhost /cgi-bin/languages.py?language=Java
http://localhost/cgi-bin/languages.py?language=Java
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>Java</H3><P><PRE>\n'
b' System.out.println("Hello World"); \n'
b'</PRE></P><BR>\n'
b'<HR>\n'

As before, the filename argument can name a simple file or a program invocation with optional parameters at the end, as in the last run here. If you read this output carefully, you’ll notice that this script still works if you leave the “index.html” off the end of a site’s root filename (in the third command line); unlike the raw HTTP version of the preceding section, the URL-based interface is smart enough to do the right thing.
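
One likely reason for the difference is redirect handling: the raw http.client run earlier reported 301 Moved Permanently for /books, while urllib.request follows such redirects on its own. The following sketch reproduces that behavior against a throwaway local server. DemoHandler and compare_clients are hypothetical names, the redirect target is assumed for illustration, and the real site's configuration may differ.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that redirects /books to its index page."""
    def do_GET(self):
        if self.path == '/books':
            self.send_response(301)
            self.send_header('Location', '/books/index.html')
            self.send_header('Content-Length', '0')
            self.end_headers()
        elif self.path == '/books/index.html':
            body = b'<HTML>\n<HEAD>\n'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass                                            # keep the demo quiet

def compare_clients():
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # Raw interface: we get exactly what the server said, redirect and all
        conn = http.client.HTTPConnection('127.0.0.1', port)
        conn.request('GET', '/books')
        raw_status = conn.getresponse().status
        conn.close()

        # URL interface: the 301 is followed for us (proxies disabled for localhost)
        opener = build_opener(ProxyHandler({}))
        with opener.open('http://127.0.0.1:%d/books' % port) as reply:
            final_url, first_line = reply.geturl(), reply.readline()
    finally:
        server.shutdown()
        server.server_close()
    return raw_status, final_url, first_line
```

Calling compare_clients() returns the raw status code seen by http.client, followed by the final URL and first line that the URL-based interface ended up reading.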

Other urllib Interfaces

One last mutation: the following urllib.request downloader script uses the slightly higher-level urlretrieve interface in that module to automatically save the downloaded file or script output to a local file on the client machine. This interface is handy if we really mean to store the fetched data (e.g., to mimic the FTP protocol). If we plan on processing the downloaded data immediately, though, this form may be less convenient than the version we just met: we need to open and read the saved file. Moreover, we need to provide an extra protocol for specifying or extracting a local filename, as in Example 13-31.

Example 13-31. PP4E\Internet\Other\http-getfile-urllib2.py

"""
fetch a file from an HTTP (web) server over sockets via urllib;  this version
uses an interface that saves the fetched data to a local binary-mode file; the
local filename is either passed in as a cmdline arg or stripped from the URL with
urllib.parse: the filename argument may have a directory path at the front and query
parameters at end, so os.path.split is not enough (only splits off directory path);
caveat: should urllib.parse.quote filename unless known ok--see later chapters;
"""

import sys, os, urllib.request, urllib.parse
showlines = 6
try:
    servername, filename = sys.argv[1:3]              # first 2 cmdline args?
except ValueError:                                    # fewer than 2 args given
    servername, filename = 'learning-python.com', '/index.html'

remoteaddr = 'http://%s%s' % (servername, filename)   # any address on the Net
if len(sys.argv) == 4:                                # get result filename
    localname = sys.argv[3]
else:
    (scheme, server, path, parms, query, frag) = urllib.parse.urlparse(remoteaddr)
    localname = os.path.split(path)[1]

print(remoteaddr, localname)
urllib.request.urlretrieve(remoteaddr, localname)       # can be file or script
with open(localname, 'rb') as localfile:                # saved to local file
    remotedata = localfile.readlines()                  # closed on block exit
for line in remotedata[:showlines]: print(line)         # file is bytes/binary

Let’s run this last variant from a command line. Its basic operation is the same as the last two versions: like the prior one, it builds a URL, and like both of the last two, we can list an explicit target server and file path on the command line:

C:\...\PP4E\Internet\Other> http-getfile-urllib2.py
http://learning-python.com/index.html index.html
b'<HTML>\n'
b' \n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Python Training Services</TITLE>\n"
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib2.py www.python.org /index.html
http://www.python.org/index.html index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

Because this version uses a urllib.request interface that automatically saves the downloaded data in a local file, it’s similar to FTP downloads in spirit. But this script must also somehow come up with a local filename for storing the data. You can either let the script strip and use the base filename from the constructed URL, or explicitly pass a local filename as a last command-line argument. In the prior run, for instance, the downloaded web page is stored in the local file index.html in the current working directory—the base filename stripped from the URL (the script prints the URL and local filename as its first output line). In the next run, the local filename is passed explicitly as py-index.html:

C:\...\PP4E\Internet\Other> http-getfile-urllib2.py
                                www.python.org /index.html py-index.html
http://www.python.org/index.html py-index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib2.py www.learning-python.com /books books.html
http://www.learning-python.com/books books.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Book Support Site</TITLE>\n"
b'</HEAD>\n'
b'<BODY BGCOLOR="#f1f1ff">\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib2.py www.learning-python.com /books/about-pp.html
http://www.learning-python.com/books/about-pp.html about-pp.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b'<TITLE>About "Programming Python"</TITLE>\n'
b'</HEAD>\n'
b'\n'
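
The local-filename logic of Example 13-31 can also be tried in isolation, including on a URL that carries query parameters. This is a minimal sketch; the function name local_name_for is hypothetical, and the sample URLs are only inputs for the demonstration.

```python
import os
from urllib.parse import urlparse

def local_name_for(url):
    """Base filename of a URL's path: directory path and query string removed."""
    path = urlparse(url).path                 # drops scheme, server, and ?query
    return os.path.split(path)[1]             # drops the leading directory path

if __name__ == '__main__':
    for url in ('http://www.python.org/index.html',
                'http://localhost/cgi-bin/languages.py?language=Scheme',
                'http://learning-python.com/books/'):
        print(repr(local_name_for(url)))
```

The last URL ends in a slash, so the extracted name is an empty string; a more robust script would need a fallback name for that case.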

Invoking programs and escaping text

The next listing shows this script being used to trigger a remote program. As before, if you don’t give the local filename explicitly, the script strips the base filename out of the filename argument. That’s not always easy or appropriate for program invocations—the filename can contain both a remote directory path at the front and query parameters at the end for a remote program invocation.

Given a script invocation URL and no explicit output filename, the script extracts the base filename in the middle by using first the standard urllib.parse module to pull out the file path, and then os.path.split to strip off the directory path. However, the resulting filename is a remote script’s name, and it may or may not be an appropriate place to store the data locally. In the first run that follows, for example, the script’s output goes in a local file called languages.py, the script name in the middle of the URL; in the second, we instead name the output CxxSyntax.html explicitly to suppress filename extraction:

C:\...\PP4E\Internet\Other> python http-getfile-urllib2.py localhost
                                /cgi-bin/languages.py?language=Scheme
http://localhost/cgi-bin/languages.py?language=Scheme languages.py
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>Scheme</H3><P><PRE>\n'
b' (display "Hello World") (newline)  \n'
b'</PRE></P><BR>\n'
b'<HR>\n'

C:\...\PP4E\Internet\Other> python http-getfile-urllib2.py localhost
                                /cgi-bin/languages.py?language=C++ CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C++ CxxSyntax.html
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>C  </H3><P><PRE>\n'
b"Sorry--I don't know that language\n"
b'</PRE></P><BR>\n'
b'<HR>\n'

The remote script returns a not-found message when passed “C++” in the last command here. It turns out that “+” is a special character in URL strings (meaning a space), and to be robust, both of the urllib scripts we’ve just written should really run the filename string through something called urllib.parse.quote, a tool that escapes special characters for transmission. We will talk about this in depth in Chapter 15, so consider this a preview for now. But to make this invocation work, we need to use special sequences in the constructed URL. Here’s how to do it by hand:

C:\...\PP4E\Internet\Other> python http-getfile-urllib2.py  localhost
                               /cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>C++</H3><P><PRE>\n'
b' cout &lt;&lt; "Hello World" &lt;&lt; endl;     \n'
b'</PRE></P><BR>\n'
b'<HR>\n'

The odd %2b strings in this command line are not entirely magical: the escaping required for URLs can be seen by running standard Python tools manually. This is what these scripts should do automatically to handle all possible cases well, and urllib.parse.unquote can undo these escapes if needed:

C:\...\PP4E\Internet\Other> python
>>> import urllib.parse
>>> urllib.parse.quote('C++')
'C%2B%2B'
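
For completeness, here is a short sketch of the round trip, along with the decoding step that plausibly explains the earlier "C  " reply: query-string decoding of the kind done by unquote_plus turns each + into a space. The script URL is the local one used in this section, and the variable names are just for illustration.

```python
from urllib.parse import quote, unquote, unquote_plus

escaped = quote('C++')                        # escape characters special in URLs
print(escaped)                                # C%2B%2B
print(unquote(escaped))                       # C++
print(unquote('C%2b%2b'))                     # C++ (hex digits are case-insensitive)
print(repr(unquote_plus('C++')))              # 'C  ' (+ decoded as a space)

# What the downloader scripts could do before building the URL
remoteaddr = 'http://localhost/cgi-bin/languages.py?language=' + quote('C++')
print(remoteaddr)
```

The hand-typed %2b and the generated %2B name the same character, so either form works in the command line shown earlier.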

Again, don’t work too hard at understanding these last few commands; we will revisit URLs and URL escapes in Chapter 15, while exploring server-side scripting in Python. I will also explain there why the C++ result came back with other oddities like &lt;&lt;, the HTML escapes for <<. These are generated by the tool cgi.escape in the script on the server that produces the reply, and are usually undone by HTML parsers, including Python’s html.parser module we’ll meet in Chapter 19:

>>> import cgi
>>> cgi.escape('<<')
'&lt;&lt;'
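
The cgi.escape call reflects the Python 3.1 used in this book; the function was later removed from the standard library (in Python 3.8), with html.escape as its replacement. A sketch of the equivalent calls on a newer Python, including html.unescape to reverse the process, might look like this:

```python
import html

print(html.escape('<<'))                      # &lt;&lt;

# Undo the escapes in the reply line from the C++ run above
line = 'cout &lt;&lt; "Hello World" &lt;&lt; endl;'
print(html.unescape(line))                    # cout << "Hello World" << endl;
```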

Also in Chapter 15, we’ll meet urllib support for proxies, and its support for client-side cookies. We’ll discuss the related HTTPS concept in Chapter 16—HTTP transmissions over secure sockets, supported by urllib.request on the client side if SSL support is compiled into your Python. For now, it’s time to wrap up our look at the Web, and the Internet at large, from the client side of the fence.

Other Client-Side Scripting Options

In this chapter, we focused on client-side interfaces to standard protocols that run over sockets, but as suggested in an earlier footnote, client-side programming can take other forms, too. We outlined many of these at the start of Chapter 12—web service protocols (including SOAP and XML-RPC); Rich Internet Application toolkits (including Flex, Silverlight, and pyjamas); cross-language framework integration (including Java and .NET); and more.

As mentioned, most of these serve to extend the functionality of web browsers, and so ultimately run on top of the HTTP protocol we explored in this chapter. For instance:

The Jython system is a compiler that supports Python-coded Java applets: general-purpose programs downloaded from a server and run locally on the client when accessed or referenced by a URL, which extend the functionality of web browsers and interactions.
Similarly, RIAs provide AJAX communication and widget toolkits that allow JavaScript to implement user interaction within web browsers, which is more dynamic and rich than HTML and web browsers otherwise support.
In Chapter 19, we’ll also study Python’s support for XML—structured text that is used as the data transfer medium of client/server dialogs in web service protocols such as XML-RPC, which transfer XML-encoded objects over HTTP, and are supported by Python’s xmlrpc standard library package. Such protocols can simplify the interface to web servers in their clients.
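
As a small taste of what "XML-encoded objects" means, the sketch below uses the dumps and loads functions of the standard xmlrpc.client module to encode a call and decode it again, with no server involved. The method name add and its arguments are made up for this illustration.

```python
import xmlrpc.client

# Encode a call to a hypothetical remote method add(2, 3) as XML text
request = xmlrpc.client.dumps((2, 3), methodname='add')
print(request)                                # the XML that would travel over HTTP

# Decode it again, as the receiving side would
params, method = xmlrpc.client.loads(request)
print(method, params)
```

In a real exchange, this XML text would form the body of an HTTP request, which is why such protocols ultimately run on top of the HTTP tools covered in this chapter.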

In deference to time and space, though, we won’t go into further details on these and other client-side tools here. If you are interested in using Python to script clients, you should take a few minutes to become familiar with the list of Internet tools documented in the Python library reference manual. All work on similar principles but have slightly distinct interfaces.

In Chapter 15, we’ll hop the fence to the other side of the Internet world and explore scripts that run on server machines. Such programs give rise to the grander notion of applications that live entirely on the Web and are launched by web browsers. As we take this leap in structure, keep in mind that the tools we met in this and the preceding chapter are often sufficient to implement all the distributed processing that many applications require, and they can work in harmony with scripts that run on a server. To completely understand the Web worldview, though, we need to explore the server realm, too.

Before we get there, though, the next chapter puts concepts we’ve learned here to work by presenting a complete client-side program—a full-blown mail client GUI, which ties together many of the tools we’ve learned and coded. In fact, much of the email work we’ve done in this chapter was designed to lay the groundwork we’ll need to tackle the realistically scaled PyMailGUI example of the next chapter. Really, much of this book so far has served to build up skills required to equip us for this task: as we’ll see, PyMailGUI combines system tools, GUIs, and client-side Internet protocols to produce a useful system that does real work. As an added bonus, this example will help us understand the trade-offs between the client solutions we’ve met here and the server-side solutions we’ll study later in this part of the book.

^[48]There is also support in the Python world for other technologies that some might classify as “client-side scripting,” too, such as Jython/Java applets; XML-RPC and SOAP web services; and Rich Internet Application tools like Flex, Silverlight, pyjamas, and AJAX. These were all introduced early in Chapter 12. Such tools are generally bound up with the notion of web-based interactions—they either extend the functionality of a web browser running on a client machine, or simplify web server access in clients. We’ll study browser-based techniques in Chapters 15 and 16; here, client-side scripting means the client side of common Internet protocols such as FTP and email, independent of the Web or web browsers. At the bottom, web browsers are really just desktop GUI applications that make use of client-side protocols, including those we’ll study here, such as HTTP and FTP. See Chapter 12 as well as the end of this chapter for more on other client-side techniques.

^[49]No, really. The second edition of this book included a tale of woe here about how my ISP forced its users to wean themselves off Telnet access. This seems like a small issue today. Common practice on the Internet has come far in a short time. One of my sites has even grown too complex for manual edits (except, of course, to work around bugs in the site-builder tool). Come to think of it, so has Python’s presence on the Web. When I first found Python in 1992, it was a set of encoded email messages, which users decoded and concatenated and hoped the result worked. Yes, yes, I know—gee, Grandpa, tell us more…

^[50]Usage note: These scripts are highly dependent on the FTP server functioning properly. For a while, the upload script occasionally had timeout errors when running over my current broadband connection. These errors went away later, when my ISP fixed or reconfigured their server. If you have failures, try running against a different server; connecting and disconnecting around each transfer may or may not help (some servers limit their number of connections).

^[51]IMAP, or Internet Message Access Protocol, was designed as an alternative to POP, but it is still not as widely available today, and so it is not presented in this text. For instance, major commercial providers used for this book’s examples provide only POP (or web-based) access to email. See the Python library manual for IMAP server interface details. Python used to have an rfc822 module as well, but it’s been subsumed by the email package in 3.X.

^[52]We all know by now that such junk mail is usually referred to as spam, but not everyone knows that this name is a reference to a Monty Python skit in which a restaurant’s customers find it difficult to hear the reading of menu options over a group of Vikings singing an increasingly loud chorus of “spam, spam, spam…”. Hence the tie-in to junk email. Spam is used in Python program examples as a sort of generic variable name, though it also pays homage to the skit.

^[53]There will be more on POP message numbers when we study mailtools later in this chapter. Interestingly, the list of message numbers to be deleted need not be sorted; they remain valid for the duration of the delete connection, so deletions earlier in the list don’t change numbers of messages later in the list while you are still connected to the POP server. We’ll also see that some subtle issues may arise if mails in the server inbox are deleted without pymail’s knowledge (e.g., by your ISP or another email client); although very rare, suffice it to say for now that deletions in this script are not guaranteed to be accurate.
