The preceding chapter introduced Internet fundamentals and explored sockets—the underlying communications mechanism over which bytes flow on the Net. In this chapter, we climb the encapsulation hierarchy one level and shift our focus to Python tools that support the client-side interfaces of common Internet protocols.
We talked about the Internet’s higher-level protocols in the abstract at the start of the preceding chapter, and you should probably review that material if you skipped over it the first time around. In short, protocols define the structure of the conversations that take place to accomplish most of the Internet tasks we’re all familiar with—reading email, transferring files by FTP, fetching web pages, and so on.
At the most basic level, all of these protocol dialogs happen over sockets using fixed and standard message structures and ports, so in some sense this chapter builds upon the last. But as we’ll see, Python’s protocol modules hide most of the underlying details—scripts generally need to deal only with simple objects and methods, and Python automates the socket and messaging logic required by the protocol.
In this chapter, we’ll concentrate on the FTP and email protocol modules in Python, and we’ll peek at a few others along the way (NNTP news, HTTP web pages, and so on). Because it is so prevalent, we will especially focus on email in much of this chapter, as well as in the two to follow—we’ll use tools and techniques introduced here in the larger PyMailGUI and PyMailCGI client and server-side programs of Chapters 14 and 16.
All of the tools employed in examples here are in the standard Python library and come with the Python system. All of the examples here are also designed to run on the client side of a network connection—these scripts connect to an already running server to request interaction and can be run from a basic PC or other client device (they require only a server to converse with). And as usual, all the code here is also designed to teach us something about Python programming in general—we’ll refactor FTP examples and package email code to show object-oriented programming (OOP) in action.
In the next chapter, we’ll look at a complete client-side program example before moving on to explore scripts designed to be run on the server side instead. Python programs can also produce pages on a web server, and there is support in the Python world for implementing the server side of things like HTTP, email, and FTP. For now, let’s focus on the client.[48]
As we saw in the preceding chapter, sockets see plenty of action on the
Net. For instance, the last chapter’s getfile
example allowed us to transfer
entire files between machines. In practice, though, higher-level
protocols are behind much of what happens on the Net. Protocols run on
top of sockets, but they hide much of the complexity of the network
scripting examples of the prior chapter.
FTP—the File Transfer Protocol—is one of the more commonly used
Internet protocols. It defines a higher-level conversation model that
is based on exchanging command strings and file contents over sockets.
By using FTP, we can accomplish the same task as the prior chapter’s
getfile
script, but the interface
is simpler, standard and more general—FTP lets us ask for files from
any server machine that supports FTP, without requiring that it run
our custom getfile
script. FTP also supports more advanced operations such as
uploading files to the server, getting remote directory listings, and
more.
Really, FTP runs on top of two sockets: one for passing control
commands between client and server (port 21), and another for
transferring bytes. By using a two-socket model, FTP avoids the
possibility of deadlocks (i.e., transfers on the data socket do not
block dialogs on the control socket). Ultimately, though, Python’s
ftplib
support module allows us to
upload and download files at a remote server machine by FTP, without
dealing in raw socket calls or FTP protocol details.
Because the Python FTP interface is so easy to use, let’s jump right into a realistic example. The script in Example 13-1 automatically fetches (a.k.a. “downloads”) and opens a remote file with Python. More specifically, this Python script does the following:
Downloads an image file (by default) from a remote FTP site
Opens the downloaded file with a utility we wrote in Example 6-23, in Chapter 6
The download portion will run on any machine with Python and an Internet connection, though you’ll probably want to change the script’s settings so it accesses a server and file of your own. The opening part works if your playfile.py supports your platform; see Chapter 6 for details, and change as needed.
#!/usr/local/bin/python
"""
A Python script to download and play a media file by FTP.  Uses ftplib, the ftp
protocol handler which uses sockets.  Ftp runs on 2 sockets (one for data, one
for control--on ports 20 and 21) and imposes message text formats, but Python's
ftplib module hides most of this protocol's details.  Change for your site/file.
"""

import os, sys
from getpass import getpass                   # hidden password input
from ftplib import FTP                        # socket-based FTP tools

nonpassive = False                            # force active mode FTP for server?
filename   = 'monkeys.jpg'                    # file to be downloaded
dirname    = '.'                              # remote directory to fetch from
sitename   = 'ftp.rmi.net'                    # FTP site to contact
userinfo   = ('lutz', getpass('Pswd?'))       # use () for anonymous
if len(sys.argv) > 1: filename = sys.argv[1]  # filename on command line?

print('Connecting...')
connection = FTP(sitename)                    # connect to FTP site
connection.login(*userinfo)                   # default is anonymous login
connection.cwd(dirname)                       # change to remote directory
if nonpassive:                                # force active FTP if server requires
    connection.set_pasv(False)

print('Downloading...')
localfile = open(filename, 'wb')              # local file to store download
connection.retrbinary('RETR ' + filename, localfile.write, 1024)   # xfer 1k at a time
connection.quit()
localfile.close()

if input('Open file?') in ['Y', 'y']:
    from PP4E.System.Media.playfile import playfile
    playfile(filename)
Most of the FTP protocol details are encapsulated by the Python
ftplib
module imported here. This
script uses some of the simplest interfaces in ftplib
(we’ll see others later in this
chapter), but they are representative of the module in general.
To open a connection to a remote (or local) FTP server, create
an instance of the ftplib.FTP
object, passing in the string name (domain or IP style) of the machine
you wish to connect to:
connection = FTP(sitename) # connect to ftp site
Assuming this call doesn’t throw an exception, the resulting FTP object exports methods that correspond to the usual FTP operations. In fact, Python scripts act much like typical FTP client programs—just replace commands you would normally type or select with method calls:
connection.login(*userinfo)                   # default is anonymous login
connection.cwd(dirname)                       # change to remote directory
Once connected, we log in and change to the remote directory
from which we want to fetch a file. The login
method allows us to pass in a username
and password as additional optional arguments to specify an account
login; by default, it performs anonymous FTP. Notice the use of the
nonpassive
flag in this
script:
if nonpassive:                                # force active FTP if server requires
    connection.set_pasv(False)
If this flag is set to True, the script will transfer the file in active FTP mode rather than the default passive mode. We’ll finesse the details of the difference here (it has to do with which end of the dialog chooses port numbers for the transfer), but if you have trouble doing transfers with any of the FTP scripts in this chapter, try using active mode as a first step. In Python 2.1 and later, passive FTP mode is on by default. Now, open a local file to receive the file’s content, and fetch the file:
localfile = open(filename, 'wb')
connection.retrbinary('RETR ' + filename, localfile.write, 1024)
Once we’re in the target remote directory, we simply call the
retrbinary
method to download the
target server file in binary mode. The retrbinary
call will take a while to complete, since it must
download a big file. It gets three arguments:
An FTP command string; here, the string RETR filename, which is the standard format for FTP retrievals.
A function or method to which Python passes each chunk of the downloaded file’s bytes; here, the write method of a newly created and opened local file.
A size for those chunks of bytes; here, 1,024 bytes are downloaded at a time, but the default is reasonable if this argument is omitted.
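The second argument can be any callable that accepts a chunk of bytes, not only a file's write method. The sketch below imitates that pattern without a network connection: feed_chunks is a hypothetical stand-in written for this illustration (it is not ftplib's implementation), and the sample data and callback names are assumptions.

```python
import hashlib
import io

def feed_chunks(source, callback, blocksize=8192):
    """Hypothetical stand-in for a chunked download: read blocksize bytes
    at a time from source and hand each chunk to callback."""
    while True:
        chunk = source.read(blocksize)
        if not chunk:
            break
        callback(chunk)

data = b'x' * 2500                    # pretend this is the remote file's content
sizes = []                            # records the size of each chunk received
digest = hashlib.sha256()             # checksums the stream as it arrives

def on_chunk(chunk):
    sizes.append(len(chunk))
    digest.update(chunk)

feed_chunks(io.BytesIO(data), on_chunk, 1024)
print(sizes)                          # [1024, 1024, 452]
```

With a live connection, a function like on_chunk could be passed as the second argument to retrbinary in place of localfile.write.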
Because this script creates a local file object, localfile, with the same name as the remote file being fetched, and passes its write method to the FTP retrieval method, the remote file’s contents will automatically appear in a local, client-side file after the download is finished.
Observe how this file is opened in wb binary output mode. If this script is run on Windows, we want to avoid automatically expanding any \n bytes into \r\n byte sequences; as we saw in Chapter 4, this happens automatically on Windows when writing files opened in w text mode. We also want to avoid Unicode issues in Python 3.X: as we also saw in Chapter 4, strings are encoded when written in text mode, and this isn’t appropriate for binary data such as images. A text-mode file would also not accept the bytes strings passed to write by the FTP library’s retrbinary in any event, so wb is effectively required here (more on output file modes later).
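The effect of the output mode can be checked locally without any FTP server. The following sketch is only an illustration under stated assumptions: the sample bytes and temporary file names are made up, and newline='\r\n' is passed explicitly to imitate Windows text-mode output on any platform.

```python
import os
import tempfile

chunk = b'\x89PNG\n\x00\xff'               # sample binary data containing a \n byte

tmpdir = tempfile.mkdtemp()
binpath = os.path.join(tmpdir, 'bin.dat')
txtpath = os.path.join(tmpdir, 'txt.dat')

with open(binpath, 'wb') as f:             # binary mode: bytes are stored verbatim
    f.write(chunk)
with open(binpath, 'rb') as f:
    stored = f.read()

try:
    with open(txtpath, 'w') as f:          # text mode: write expects str, not bytes
        f.write(chunk)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# newline='\r\n' imitates Windows text-mode output on any platform
with open(txtpath, 'w', newline='\r\n') as f:
    f.write('a\nb')
with open(txtpath, 'rb') as f:
    expanded = f.read()

print(stored == chunk, type(text_mode_error).__name__, expanded)
```

The binary file round-trips unchanged, the text-mode write of bytes is rejected, and the text-mode write of a str grows by one byte for each \n.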
Finally, we call the FTP quit
method to break the connection with the server and manually close
the local file to force it to be
complete before it is further processed (it’s not impossible that
parts of the file are still held in buffers before the close
call):
connection.quit()
localfile.close()
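As an alternative to calling quit and close by hand, FTP objects and files both work with the with statement (FTP support for it arrived in Python 3.2, so this does not apply to earlier releases). The fetch function below is a hypothetical repackaging of the steps shown so far, not the book's script, and it has not been run against a live server here.

```python
from ftplib import FTP

def fetch(sitename, filename, userinfo=(), dirname='.'):
    """Download filename in binary mode; the connection and the local file
    are both released when the with block exits, even if the transfer
    raises an exception."""
    with FTP(sitename) as connection, open(filename, 'wb') as localfile:
        connection.login(*userinfo)            # () means anonymous login
        connection.cwd(dirname)
        connection.retrbinary('RETR ' + filename, localfile.write, 1024)
```

A call such as fetch('ftp.rmi.net', 'monkeys.jpg', ('lutz', password)) would mirror the settings used in Example 13-1.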
And that’s all there is to it—all the FTP, socket, and
networking details are hidden behind the ftplib
interface module. Here is this script
in action on a Windows 7 machine; after the download, the image file
pops up in a Windows picture viewer on my laptop, as captured in Figure 13-1. Change the
server and file assignments in this script to test on your own, and be
sure your PYTHONPATH
environment
variable includes the PP4E root’s
container, as we’re importing across directories on the examples tree
here:
C:\...\PP4E\Internet\Ftp> python getone.py
Pswd?
Connecting...
Downloading...
Open file?y
Notice how the standard Python getpass.getpass is used to ask for an FTP password. Like the input built-in function, this call prompts for and reads a line of text from the console user; unlike input, getpass does not echo typed characters on the screen at all (see the moreplus stream redirection example of Chapter 3 for related tools). This is handy for protecting things like passwords from potentially prying eyes. Be careful, though: after issuing a warning, the IDLE GUI echoes the password anyhow!
The main thing to notice is that this otherwise typical Python script fetches information from an arbitrarily remote FTP site and machine. Given an Internet link, any information published by an FTP server on the Net can be fetched by and incorporated into Python scripts using interfaces such as these.
In fact, FTP is just one way to transfer information across the Net, and there are more general tools in the Python library to accomplish the prior script’s download. Perhaps the most straightforward is the Python urllib.request module: given an Internet address string (a URL, or Uniform Resource Locator), this module opens a connection to the specified server and returns a file-like object ready to be read with normal file object method calls (e.g., read, readline).
We can use such a higher-level interface to download anything with an address on the Web: files published by FTP sites (using URLs that start with ftp://); web pages and output of scripts that live on remote servers (using http:// URLs); and even local files (using file:// URLs). For instance, the script in Example 13-2 does the same as the one in Example 13-1, but it uses the general urllib.request module to fetch the image file, instead of the protocol-specific ftplib.
#!/usr/local/bin/python
"""
A Python script to download a file by FTP by its URL string; use higher-level
urllib instead of ftplib to fetch file;  urllib supports FTP, HTTP, client-side
HTTPS, and local files, and handles proxies, redirects, cookies, and more;
urllib also allows downloads of html pages, images, text, etc.;  see also
Python html/xml parsers for web pages fetched by urllib in Chapter 19;
"""

import os, getpass
from urllib.request import urlopen         # socket-based web tools
filename = 'monkeys.jpg'                   # remote/local filename
password = getpass.getpass('Pswd?')

remoteaddr = 'ftp://lutz:%s@ftp.rmi.net/%s;type=i' % (password, filename)
print('Downloading', remoteaddr)

# this works too:
# urllib.request.urlretrieve(remoteaddr, filename)

remotefile = urlopen(remoteaddr)           # returns input file-like object
localfile  = open(filename, 'wb')          # where to store data locally
localfile.write(remotefile.read())
localfile.close()
remotefile.close()
Note how we use a binary mode output file again; urllib
fetches return byte strings, even
for HTTP web pages. Don’t sweat the details of the URL string used
here; it is fairly complex, and we’ll explain its structure and that
of URLs in general in Chapter 15.
We’ll also use urllib
again in
this and later chapters to fetch web pages, format generated URL
strings, and get the output of remote scripts on the Web.
Technically speaking, urllib.request supports a variety of Internet protocols (HTTP, FTP, and local files). Unlike ftplib, urllib.request is generally used for reading remote objects, not for writing or uploading them (though the HTTP and FTP protocols support file uploads too). As with ftplib, retrievals must generally be run in threads if blocking is a concern. But the basic interface shown in this script is straightforward. The call:
remotefile = urllib.request.urlopen(remoteaddr) # returns input file-like object
contacts the server named in the remoteaddr
URL string and returns a
file-like object connected to its download stream (here, an
FTP-based socket). Calling this file’s read
method pulls down the file’s
contents, which are written to a local client-side file. An even
simpler interface:
urllib.request.urlretrieve(remoteaddr, filename)
also does the work of opening a local file and writing the downloaded bytes into it—things we do manually in the script as coded. This comes in handy if we want to download a file, but it is less useful if we want to process its data immediately.
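The two interfaces can be compared side by side without a server by pointing them at a file:// URL. This is a minimal sketch under assumptions: the temporary file stands in for a remote file, and the names source, target, and the sample bytes are invented for the illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

tmpdir = tempfile.mkdtemp()
source = Path(tmpdir) / 'source.bin'
source.write_bytes(b'\x00\x01binary\r\ndata')   # stands in for a server file
remoteaddr = source.as_uri()                    # a file:// URL for this path

# urlopen: read the bytes ourselves, then process or save them
with urlopen(remoteaddr) as remotefile:
    data = remotefile.read()

# urlretrieve: let the library create and fill the local file
target = os.path.join(tmpdir, 'copy.bin')
urlretrieve(remoteaddr, target)

with open(target, 'rb') as f:
    copied = f.read()
print(data == copied)
```

Both routes yield the same bytes; urlopen leaves them in memory for immediate processing, while urlretrieve leaves them in a file.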
Either way, the end result is the same: the desired server file shows up on the client machine. The output is similar to the original version, but we don’t try to automatically open this time (I’ve changed the password in the URL here to protect the innocent):
C:\...\PP4E\Internet\Ftp> getone-urllib.py
Pswd?
Downloading ftp://lutz:XXX@ftp.rmi.net/monkeys.jpg;type=i

C:\...\PP4E\Internet\Ftp> fc monkeys.jpg testmonkeys.jpg
FC: no differences encountered

C:\...\PP4E\Internet\Ftp> start monkeys.jpg
For more urllib download examples, see the section on HTTP later in this chapter, and the server-side examples in Chapter 15. As we’ll see in Chapter 15, in bigger terms, tools like the urllib.request urlopen function allow scripts to both download remote files and invoke programs that are located on a remote server machine, and so serve as useful tools for testing and using websites in Python scripts. In Chapter 15, we’ll also see that urllib.parse includes tools for formatting (escaping) URL strings for safe transmission.
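Escaping matters for URLs like the one built in Example 13-2, because a password containing characters such as @, / or : would otherwise be misread as URL structure. The helper below is a hypothetical sketch using urllib.parse.quote; the function name ftp_url and the sample password are assumptions made for illustration.

```python
from urllib.parse import quote

def ftp_url(user, password, site, filename):
    """Hypothetical helper: build an FTP URL, escaping each component."""
    return 'ftp://%s:%s@%s/%s;type=i' % (
        quote(user, safe=''),          # safe='' escapes '/' too
        quote(password, safe=''),
        site,
        quote(filename))               # default keeps '/' for directory paths

url = ftp_url('lutz', 'p@ss/w:rd', 'ftp.rmi.net', 'my file.jpg')
print(url)
```

The printed URL carries the password as p%40ss%2Fw%3Ard and the filename as my%20file.jpg, so the only unescaped @ and : are the ones that delimit the login and host.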
When I present the ftplib interfaces in Python classes, students often ask why programmers need to supply the RETR string in the retrieval method. It’s a good question: the RETR string is the name of the download command in the FTP protocol, but ftplib is supposed to encapsulate that protocol. As we’ll see in a moment, we have to supply an arguably odd STOR string for uploads as well. It’s boilerplate code that you accept on faith once you see it, but that begs the question. You could propose a patch to ftplib, but that’s not really a good answer for beginning Python students, and it may break existing code (the interface is as it is for a reason).
Perhaps a better answer is that Python makes it easy to extend the standard library modules with higher-level interfaces of our own: with just a few lines of reusable code, we can make the FTP interface look any way we want in Python. For instance, we could, once and for all, write utility modules that wrap the ftplib interfaces to hide the RETR string. If we place these utility modules in a directory on PYTHONPATH, they become just as accessible as ftplib itself, automatically reusable in any Python script we write in the future. Besides removing the RETR string requirement, a wrapper module could also make assumptions that simplify FTP operations into single function calls.
For instance, given a module that encapsulates and simplifies
ftplib
, our Python fetch-and-play
script could be further reduced to the script shown in Example 13-3—essentially just two
function calls plus a password prompt, but with a net effect exactly
like Example 13-1 when
run.
#!/usr/local/bin/python
"""
A Python script to download and play a media file by FTP.
Uses getfile.py, a utility module which encapsulates FTP step.
"""

import getfile
from getpass import getpass
filename = 'monkeys.jpg'

# fetch with utility
getfile.getfile(file=filename,
                site='ftp.rmi.net',
                dir='.',
                user=('lutz', getpass('Pswd?')),
                refetch=True)

# rest is the same
if input('Open file?') in ['Y', 'y']:
    from PP4E.System.Media.playfile import playfile
    playfile(filename)
Besides having a much smaller line count, the meat of this
script has been split off into a file for reuse elsewhere. If you
ever need to download a file again, simply import an existing
function instead of copying code with cut-and-paste editing. Changes
in download operations would need to be made in only one file, not
everywhere we’ve copied boilerplate code; getfile.getfile
could even be changed to
use urllib
rather than ftplib
without affecting any of its
clients. It’s good engineering.
So just how would we go about writing such an FTP interface
wrapper (he asks, rhetorically)? Given the ftplib
library module, wrapping
downloads of a particular file in a particular directory is
straightforward. Connected FTP objects support two download
methods:
retrbinary
This method downloads the requested file in binary mode, sending its bytes in chunks to a supplied function, without line-feed mapping. Typically, the supplied function is a write method of an open local file object, such that the bytes are placed in the local file on the client.
retrlines
This method downloads the requested file in ASCII text mode, sending each line of text to a supplied function with all end-of-line characters stripped. Typically, the supplied function adds a \n newline (mapped appropriately for the client machine), and writes the line to a local file.
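Because text-mode retrieval delivers each line with its end-of-line characters removed, the callback is responsible for putting a newline back. The sketch below simulates that hand-off locally: the list of lines stands in for what a server would send, and store_line and the StringIO target are hypothetical names chosen for the illustration.

```python
import io

server_lines = ['spam', 'eggs', 'ham']   # lines as delivered: no end-of-line chars

local = io.StringIO()                    # stands in for a text-mode local file

def store_line(line):
    local.write(line + '\n')             # restore the newline that was stripped

for line in server_lines:                # a real transfer would make these calls
    store_line(line)

print(repr(local.getvalue()))
```

With a live connection, store_line would be the function passed to retrlines, and writing to a real text-mode file would map each \n to the client platform's convention.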
We will meet the retrlines method in a later example; the getfile utility module in Example 13-4 always transfers in binary mode with retrbinary. That is, files are downloaded exactly as they were on the server, byte for byte, with the server’s line-feed conventions in text files (you may need to convert line feeds after downloads if they look odd in your text editor; see your editor or system shell commands for pointers, or write a Python script that opens and writes the text as needed).
#!/usr/local/bin/python
"""
Fetch an arbitrary file by FTP.  Anonymous FTP unless you pass a
user=(name, pswd) tuple. Self-test FTPs a test file and site.
"""

from ftplib  import FTP          # socket-based FTP tools
from os.path import exists       # file existence test

def getfile(file, site, dir, user=(), *, verbose=True, refetch=False):
    """
    fetch a file by ftp from a site/directory
    anonymous or real login, binary transfer
    """
    if exists(file) and not refetch:
        if verbose: print(file, 'already fetched')
    else:
        if verbose: print('Downloading', file)
        local = open(file, 'wb')                # local file of same name
        try:
            remote = FTP(site)                  # connect to FTP site
            remote.login(*user)                 # anonymous=() or (name, pswd)
            remote.cwd(dir)
            remote.retrbinary('RETR ' + file, local.write, 1024)
            remote.quit()
        finally:
            local.close()                       # close file no matter what
        if verbose: print('Download done.')     # caller handles exceptions

if __name__ == '__main__':
    from getpass import getpass
    file = 'monkeys.jpg'
    dir  = '.'
    site = 'ftp.rmi.net'
    user = ('lutz', getpass('Pswd?'))
    getfile(file, site, dir, user)
This module is mostly just a repackaging of the FTP code we
used to fetch the image file earlier, to make it simpler and
reusable. Because it is a callable function, the exported getfile.getfile
here tries to be as
robust and generally useful as possible, but even a function this
small implies some design decisions. Here are a few usage
notes:
The getfile function in this script runs in anonymous FTP mode by default, but a two-item tuple containing a username and password string may be passed to the user argument in order to log in to the remote server in nonanonymous mode. To use anonymous FTP, either don’t pass the user argument or pass it an empty tuple, (). The FTP object login method allows two optional arguments to denote a username and password, and the function(*args) call syntax in Example 13-4 sends it whatever argument tuple you pass to user as individual arguments.
If passed, the last two arguments (verbose, refetch) allow us to turn off status messages printed to the stdout stream (perhaps undesirable in a GUI context) and to force downloads to happen even if the file already exists locally (the download overwrites the existing local file).
These two arguments are coded as Python 3.X default
keyword-only arguments, so if used they
must be passed by name, not position. The user
argument instead can be
passed either way, if it is passed at all. Keyword-only
arguments here prevent passed verbose or refetch values from
being incorrectly matched against the user
argument if the user value is
actually omitted in a call.
The caller is expected to handle exceptions; this function wraps downloads in a try/finally statement to guarantee that the local output file is closed, but it lets exceptions propagate. If used in a GUI or run from a thread, for instance, exceptions may require special handling unknown in this file.
If run standalone, this file downloads an image file again from my website as a self-test (configure for your server and file as desired), but the function will normally be passed FTP filenames, site names, and directory names as well.
As in earlier examples, this script is careful to open the local output file in wb binary mode to suppress end-line mapping and conform to Python 3.X’s Unicode string model. As we learned in Chapter 4, it’s not impossible that true binary datafiles may have bytes whose value is equal to a \n line-feed character; opening in w text mode instead would make these bytes automatically expand to a \r\n two-byte sequence when written locally on Windows. This is only an issue when run on Windows; mode w doesn’t change end-lines elsewhere.
As we also learned in Chapter 4, though, binary mode is required to suppress the automatic Unicode translations performed for text in Python 3.X. Without binary mode, Python would attempt to encode fetched data when written per a default or passed Unicode encoding scheme, which might fail for some types of fetched text and would normally fail for truly binary data such as images and audio.
Because retrbinary writes bytes strings in 3.X, we really cannot open the output file in text mode anyhow, or write will raise exceptions. Recall that in 3.X text-mode files require str strings, and binary mode files expect bytes.
Since retrbinary
writes bytes
and retrlines
writes str
, they implicitly require
binary and text-mode output files, respectively. This
constraint is irrespective of end-line or Unicode issues,
but it effectively accomplishes those goals as well.
As we’ll see in later examples, text-mode retrievals
have additional encoding requirements; in fact, ftplib
will turn out to be a good
example of the impacts of Python 3.X’s Unicode string model
on real-world code. By always using binary mode in the
script here, we sidestep the issue altogether.
This function currently uses the same filename to
identify both the remote file and the local file where the
download should be stored. As such, it should be run in the
directory where you want the file to show up; use os.chdir
to move to directories if
needed. (We could instead assume
filename is the local file’s name, and
strip the local directory with os.path.split
to get the remote
name, or accept two distinct filename arguments—local and
remote.)
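The argument-passing rules in these notes can be tried without a server. In the sketch below, login is a hypothetical stand-in that simply returns what it receives (its 'anonymous' default is an assumption for the demonstration, not ftplib's signature), and fetch copies getfile's signature but only reports how its arguments were bound.

```python
def login(user='anonymous', passwd=''):
    """Hypothetical stand-in for an FTP login call; returns what it received."""
    return (user, passwd)

def fetch(file, site, dir, user=(), *, verbose=True, refetch=False):
    """Same signature as getfile, but only reports how arguments were bound."""
    return login(*user), verbose, refetch

# user omitted: the empty tuple unpacks to no arguments, so defaults apply
anonymous = fetch('monkeys.jpg', 'ftp.rmi.net', '.')

# user passed by position, refetch passed by name
named = fetch('monkeys.jpg', 'ftp.rmi.net', '.', ('lutz', 'XXX'), refetch=True)

try:
    fetch('monkeys.jpg', 'ftp.rmi.net', '.', ('lutz', 'XXX'), False)
    positional_error = None
except TypeError as exc:                 # a fifth positional argument is rejected
    positional_error = exc

print(anonymous, named, type(positional_error).__name__)
```

The last call fails instead of silently binding False to verbose, which is the protection the keyword-only declaration is meant to provide.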
Also notice that, despite its name, this module is very
different from the getfile.py script we
studied at the end of the sockets material in the preceding
chapter. The socket-based getfile
implemented custom client and
server-side logic to download a server file to a client machine
over raw sockets.
The new getfile
here is a
client-side tool only. Instead of raw sockets, it uses the
standard FTP protocol to request a file from a server; all
socket-level details are hidden in the simpler ftplib
module’s implementation of the
FTP client protocol. Furthermore, the server here is a perpetually
running program on the server machine, which listens for and
responds to FTP requests on a socket, on the dedicated FTP port
(number 21). The net functional effect is that this script
requires an FTP server to be running on the machine where the
desired file lives, but such a server is much more likely to be
available.
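While we’re at it, let’s write a script to upload a single file by FTP to a remote machine. The upload interfaces in the FTP module are symmetric with the download interfaces. Given a connected FTP object, its storbinary method can be used to upload bytes from an open local file object, and its storlines method can be used to upload text in ASCII mode from an open local file object.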
Unlike the download interfaces, both of these methods are
passed a file object as a whole, not a file object method (or
other function). We will meet the storlines
method in a later example. The
utility module in Example 13-5 uses storbinary
such that the file whose name
is passed in is always uploaded verbatim—in binary mode, without
Unicode encodings or line-feed translations for the target
machine’s conventions. If this script uploads a text file, it will
arrive exactly as stored on the machine it came from, with client
line-feed markers and existing Unicode encoding.
#!/usr/local/bin/python
"""
Store an arbitrary file by FTP in binary mode.  Uses anonymous
ftp unless you pass in a user=(name, pswd) tuple of arguments.
"""

import ftplib                    # socket-based FTP tools

def putfile(file, site, dir, user=(), *, verbose=True):
    """
    store a file by ftp to a site/directory
    anonymous or real login, binary transfer
    """
    if verbose: print('Uploading', file)
    local  = open(file, 'rb')               # local file of same name
    remote = ftplib.FTP(site)               # connect to FTP site
    remote.login(*user)                     # anonymous or real login
    remote.cwd(dir)
    remote.storbinary('STOR ' + file, local, 1024)
    remote.quit()
    local.close()
    if verbose: print('Upload done.')

if __name__ == '__main__':
    site = 'ftp.rmi.net'
    dir  = '.'
    import sys, getpass
    pswd = getpass.getpass(site + ' pswd?')                # filename on cmdline
    putfile(sys.argv[1], site, dir, user=('lutz', pswd))   # nonanonymous login
Notice that for portability, the local file is opened in rb binary input mode this time to suppress automatic line-feed character conversions. If this is binary information, we don’t want any bytes that happen to have the value of the \r carriage-return character to mysteriously go away during the transfer when run on a Windows client. We also want to suppress Unicode encodings for nontext files, and we want reads to produce the bytes strings expected by the storbinary upload operation (more on input file modes later).
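The difference between the two input modes is easy to see on a small local file. This sketch assumes a made-up sample file and an explicit latin-1 encoding so that it behaves the same everywhere; in Python 3, the default newline handling on text-mode reads translates \r\n to \n on every platform, so the result is not limited to Windows.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'sample.dat')
with open(path, 'wb') as f:
    f.write(b'line1\r\nline2\r\n')               # bytes containing \r characters

with open(path, 'rb') as f:                      # binary mode: bytes, unchanged
    raw = f.read()

with open(path, 'r', encoding='latin-1') as f:   # text mode: decoded to str,
    text = f.read()                              # with \r\n translated to \n

print(raw, repr(text))
```

Only the rb read preserves every byte and yields the bytes type that a binary upload expects.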
This script uploads a file you name on the command line as a
self-test, but you will normally pass in real remote filename,
site name, and directory name strings. Also like the download
utility, you may pass a (username,
password)
tuple to the user
argument to trigger nonanonymous
FTP mode (anonymous FTP is the default).
It’s time for a bit of fun. To test, let’s use these scripts to transfer a copy of the Monty Python theme song audio file I have at my website. First, let’s write a module that downloads and plays the sample file, as shown in Example 13-6.
#!/usr/local/bin/python
"""
Usage: sousa.py.  Fetch and play the Monty Python theme song.
This will not work on your system as is: it requires a machine with Internet
access and an FTP server account you can access, and uses audio filters on Unix
and your .au player on Windows.  Configure this and playfile.py as needed for
your platform.
"""

from getpass import getpass
from PP4E.Internet.Ftp.getfile  import getfile
from PP4E.System.Media.playfile import playfile

file = 'sousa.au'                      # default file coordinates
site = 'ftp.rmi.net'                   # Monty Python theme song
dir  = '.'
user = ('lutz', getpass('Pswd?'))

getfile(file, site, dir, user)         # fetch audio file by FTP
playfile(file)                         # send it to audio player

# import os
# os.system('getone.py sousa.au')      # equivalent command line
There’s not much to this script, because it really just
combines two tools we’ve already coded. We’re reusing Example 13-4’s getfile
to download, and Chapter 6’s playfile
module (Example 6-23) to play the audio
sample after it is downloaded (turn back to that example for more
details on the player part of the task). Also notice the last two
lines in this file—we can achieve the same effect by passing in
the audio filename as a command-line argument to our original
script, but it’s less direct.
As is, this script assumes my FTP server account; configure as desired (alas, this file used to be at the ftp.python.org anonymous FTP site, but that site went dark for security reasons between editions of this book). Once configured, this script will run on any machine with Python, an Internet link, and a recognizable audio player; it works on my Windows laptop with a broadband Internet connection, and it plays the music clip in Windows Media Player (and if I could insert an audio file hyperlink here to show what it sounds like, I would…):
C:\...\PP4E\Internet\Ftp> sousa.py
Pswd?
Downloading sousa.au
Download done.

C:\...\PP4E\Internet\Ftp> sousa.py
Pswd?
sousa.au already fetched
The getfile
and putfile
modules themselves can be used to move the sample
file around too. Both can either be imported by clients that wish
to use their functions, or run as top-level programs to trigger
self-tests and command-line usage. For variety, let’s run these
scripts from a command line and the interactive prompt to see how
they work. When run standalone, the filename is passed in the
command line to putfile
and
both use password input and default site settings:
C:\...\PP4E\Internet\Ftp> putfile.py sousa.py
ftp.rmi.net pswd?
Uploading sousa.py
Upload done.
When imported, parameters are passed explicitly to functions:
C:\...\PP4E\Internet\Ftp> python
>>> from getfile import getfile
>>> getfile(file='sousa.au', site='ftp.rmi.net', dir='.', user=('lutz', 'XXX'))
sousa.au already fetched

C:\...\PP4E\Internet\Ftp> del sousa.au
C:\...\PP4E\Internet\Ftp> python
>>> from getfile import getfile
>>> getfile(file='sousa.au', site='ftp.rmi.net', dir='.', user=('lutz', 'XXX'))
Downloading sousa.au
Download done.

>>> from PP4E.System.Media.playfile import playfile
>>> playfile('sousa.au')
Although Python’s ftplib
already automates the underlying socket and message formatting
chores of FTP, tools of our own like these can make the process
even simpler.
If you read the preceding chapter, you’ll recall that it concluded with a quick
look at scripts that added a user interface to a socket-based
getfile
script—one that
transferred files over a proprietary socket dialog, instead of over
FTP. At the end of that presentation, I mentioned that FTP is a much
more generally useful way to move files around because FTP servers
are so widely available on the Net. For illustration purposes, Example 13-7 shows a simple
mutation of the prior chapter’s user interface, implemented as a new
subclass of the preceding chapter’s general form builder, form.py of Example 12-20.

"""
#################################################################################
launch FTP getfile function with a reusable form GUI class;  uses os.chdir to
goto target local dir (getfile currently assumes that filename has no local
directory path prefix);  runs getfile.getfile in thread to allow more than one
to be running at once and avoid blocking GUI during downloads;  this differs
from socket-based getfilegui, but reuses Form GUI builder tool;  supports both
user and anonymous FTP as currently coded;

caveats: the password field is not displayed as stars here, errors are printed
to the console instead of shown in the GUI (threads can't generally update the
GUI on Windows), this isn't 100% thread safe (there is a slight delay between
os.chdir here and opening the local output file in getfile) and we could
display both a save-as popup for picking the local dir, and a remote directory
listing for picking the file to get;  suggested exercises: improve me;
#################################################################################
"""

from tkinter import Tk, mainloop
from tkinter.messagebox import showinfo
import getfile, os, sys, _thread                # FTP getfile here, not socket
from PP4E.Internet.Sockets.form import Form     # reuse form tool in socket dir

class FtpForm(Form):
    def __init__(self):
        root = Tk()
        root.title(self.title)
        labels = ['Server Name', 'Remote Dir', 'File Name',
                  'Local Dir',   'User Name?', 'Password?']
        Form.__init__(self, labels, root)
        self.mutex   = _thread.allocate_lock()
        self.threads = 0

    def transfer(self, filename, servername, remotedir, userinfo):
        try:
            self.do_transfer(filename, servername, remotedir, userinfo)
            print('%s of "%s" successful'  % (self.mode, filename))
        except:
            print('%s of "%s" has failed:' % (self.mode, filename), end=' ')
            print(sys.exc_info()[0], sys.exc_info()[1])
        self.mutex.acquire()
        self.threads -= 1
        self.mutex.release()

    def onSubmit(self):
        Form.onSubmit(self)
        localdir   = self.content['Local Dir'].get()
        remotedir  = self.content['Remote Dir'].get()
        servername = self.content['Server Name'].get()
        filename   = self.content['File Name'].get()
        username   = self.content['User Name?'].get()
        password   = self.content['Password?'].get()
        userinfo   = ()
        if username and password:
            userinfo = (username, password)
        if localdir:
            os.chdir(localdir)
        self.mutex.acquire()
        self.threads += 1
        self.mutex.release()
        ftpargs = (filename, servername, remotedir, userinfo)
        _thread.start_new_thread(self.transfer, ftpargs)
        showinfo(self.title, '%s of "%s" started' % (self.mode, filename))

    def onCancel(self):
        if self.threads == 0:
            Tk().quit()
        else:
            showinfo(self.title,
                     'Cannot exit: %d threads running' % self.threads)

class FtpGetfileForm(FtpForm):
    title = 'FtpGetfileGui'
    mode  = 'Download'
    def do_transfer(self, filename, servername, remotedir, userinfo):
        getfile.getfile(
            filename, servername, remotedir, userinfo, verbose=False, refetch=True)

if __name__ == '__main__':
    FtpGetfileForm()
    mainloop()
If you flip back to the end of the preceding chapter, you’ll
find that this version is similar in structure to its counterpart
there; in fact, it has the same name (and is distinct only because
it lives in a different directory). The class here, though, knows
how to use the FTP-based getfile
module from earlier in this chapter instead of the socket-based
getfile
module we met a chapter
ago. When run, this version also implements more input fields, as in
Figure 13-2, shown on Windows
7.
Notice that a full absolute file path can be entered for the
local directory here. If not, the script assumes the current working
directory, which changes after each download and can vary depending
on where the GUI is launched (e.g., the current directory differs
when this script is run by the PyDemos program at the top of the
examples tree). When we click this GUI’s Submit button (or press the
Enter key), the script simply passes the form’s input field values
as arguments to the getfile.getfile
FTP utility function of
Example 13-4 earlier in
this section. It also posts a pop up to tell us the download has
begun (Figure 13-3).
As currently coded, further download status messages, including any FTP error messages, show up in the console window; here are the messages for successful downloads as well as one that fails (with added blank lines for readability):
C:\...\PP4E\Internet\Ftp> getfilegui.py
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => test
File Name => about-pp.html
Password? => xxxxxxxx
Remote Dir => .
Download of "about-pp.html" successful
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => C:\temp
File Name => ora-lp4e-big.jpg
Password? => xxxxxxxx
Remote Dir => .
Download of "ora-lp4e-big.jpg" successful
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => C:\temp
File Name => ora-lp4e.jpg
Password? => xxxxxxxx
Remote Dir => .
Download of "ora-lp4e.jpg" has failed: <class 'ftplib.error_perm'>
550 ora-lp4e.jpg: No such file or directory
Given a username and password, the downloader logs into the specified account. To do anonymous FTP instead, leave the username and password fields blank.
Now, to illustrate the threading capabilities of this GUI, start a download of a large file, then start another download while this one is in progress. The GUI stays active while downloads are underway, so we simply change the input fields and press Submit again.
This second download starts and runs in parallel with the first, because each download is run in a thread, and more than one Internet connection can be active at once. In fact, the GUI itself stays active during downloads only because downloads are run in threads; if they were not, even screen redraws wouldn’t happen until a download finished.
We discussed threads in Chapter 5, and their application to GUIs in Chapters 9 and 10, but this script illustrates some practical thread concerns:
This program takes care to not do anything GUI-related in a download thread. As we’ve learned, only the thread that makes GUIs can generally process them.
To avoid killing spawned download threads on some platforms, the GUI must also be careful not to exit while any downloads are in progress. It keeps track of the number of in-progress threads, and just displays a pop up if we try to kill the GUI by pressing the Cancel button while both of these downloads are in progress.
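The thread-count bookkeeping described in these two points can be sketched without any GUI code. The following is a minimal illustration and not the book's script: the `TransferTracker` class and its method names are hypothetical, and it uses the higher-level `threading` module in place of `_thread`. A GUI's Cancel handler could consult something like `can_exit` before closing the window.

```python
import threading
import time

class TransferTracker:
    """Count in-progress worker threads so a caller can refuse to exit early."""

    def __init__(self):
        self.mutex = threading.Lock()     # guards the shared counter
        self.threads = 0                  # number of transfers still running

    def _run(self, action, args):
        try:
            action(*args)                 # the long-running transfer itself
        finally:
            with self.mutex:              # decrement even if the action failed
                self.threads -= 1

    def start(self, action, *args):
        with self.mutex:                  # increment before the thread starts
            self.threads += 1
        worker = threading.Thread(target=self._run, args=(action, args))
        worker.start()
        return worker

    def can_exit(self):
        with self.mutex:
            return self.threads == 0

if __name__ == '__main__':
    tracker = TransferTracker()
    workers = [tracker.start(time.sleep, 0.2) for _ in range(2)]
    print('can exit while running?', tracker.can_exit())
    for worker in workers:
        worker.join()
    print('can exit when finished?', tracker.can_exit())
```

Because the counter is changed only while the lock is held, overlapping transfers cannot corrupt it, and the worker threads themselves never touch any GUI objects.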
We learned about ways to work around the no-GUI rule for
threads in Chapter 10, and we will
apply such techniques when we explore the PyMailGUI example in the
next chapter. To be portable, though, we can’t really close the GUI
until the active-thread count falls to zero; the exit model of the
threading
module of Chapter 5 can be used to achieve the same
effect. Here is the sort of output that appears in the console
window when two downloads overlap in time:
C:\...\PP4E\Internet\Ftp> python getfilegui.py
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => C:\temp
File Name => spain08.JPG
Password? => xxxxxxxx
Remote Dir => .
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => C:\temp
File Name => index.html
Password? => xxxxxxxx
Remote Dir => .
Download of "index.html" successful
Download of "spain08.JPG" successful
This example isn’t much more useful than a command line-based tool, of course, but it can be easily modified by changing its Python code, and it provides enough of a GUI to qualify as a simple, first-cut FTP user interface. Moreover, because this GUI runs downloads in Python threads, more than one can be run at the same time from this GUI without having to start or restart a different FTP client tool.
While we’re in a GUI mood, let’s add a simple interface to the putfile utility, too. The script in Example 13-8 creates a dialog that starts uploads in threads, using core FTP logic imported from Example 13-5. It’s almost the same as the getfile GUI we just wrote, so there’s not much new to say. In fact, because get and put operations are so similar from an interface perspective, most of the get form’s logic was deliberately factored out into a single generic class (FtpForm), so changes need be made in only a single place. That is, the put GUI here is mostly just a reuse of the get GUI, with distinct output labels and transfer methods. It’s in a file by itself, though, to make it easy to launch as a standalone program.

"""
###############################################################
launch FTP putfile function with a reusable form GUI class;
see getfilegui for notes: most of the same caveats apply;
the get and put forms have been factored into a single
class such that changes need be made in only one place;
###############################################################
"""

from tkinter import mainloop
import putfile, getfilegui

class FtpPutfileForm(getfilegui.FtpForm):
    title = 'FtpPutfileGui'
    mode  = 'Upload'
    def do_transfer(self, filename, servername, remotedir, userinfo):
        putfile.putfile(filename, servername, remotedir, userinfo, verbose=False)

if __name__ == '__main__':
    FtpPutfileForm()
    mainloop()
Running this script looks much like running the download GUI, because it’s almost entirely the same code at work. Let’s upload some files from the client machine to the server; Figure 13-4 shows the state of the GUI while starting one.
And here is the console window output we get when uploading two files in serial fashion; here again, uploads run in parallel threads, so if we start a new upload before one in progress is finished, they overlap in time:
C:\...\PP4E\Internet\Ftp\test> ..\putfilegui.py
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => .
File Name => sousa.au
Password? => xxxxxxxx
Remote Dir => .
Upload of "sousa.au" successful
Server Name => ftp.rmi.net
User Name? => lutz
Local Dir => .
File Name => about-pp.html
Password? => xxxxxxxx
Remote Dir => .
Upload of "about-pp.html" successful
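Stripped of its GUI code, the factoring that lets the upload form reuse the download form's logic reduces to a base class that runs a transfer, plus subclasses that supply only a label and a `do_transfer` method. The sketch below is illustrative only: `TransferBase`, `FakeDownload`, and `FakeUpload` are hypothetical stand-ins that return status messages instead of contacting an FTP server.

```python
class TransferBase:
    """Common logic: run the subclass's transfer and report the outcome."""
    mode = 'Transfer'                          # subclasses override this label

    def do_transfer(self, filename):
        raise NotImplementedError('subclasses must define do_transfer')

    def transfer(self, filename):
        try:
            self.do_transfer(filename)
            return '%s of "%s" successful' % (self.mode, filename)
        except Exception as exc:
            return '%s of "%s" has failed: %s' % (self.mode, filename, exc)

class FakeDownload(TransferBase):
    mode = 'Download'
    def do_transfer(self, filename):
        pass                                   # a real version would call getfile

class FakeUpload(TransferBase):
    mode = 'Upload'
    def do_transfer(self, filename):
        raise IOError('no such local file')    # simulate a failed upload

if __name__ == '__main__':
    print(FakeDownload().transfer('index.html'))
    print(FakeUpload().transfer('missing.txt'))
```

The message formatting and error handling live in one place, so a change to either applies to both transfer directions at once.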
Finally, we can bundle up both GUIs in a single launcher
script that knows how to start the get
and put
interfaces, regardless of which
directory we are in when the script is started, and independent of
the platform on which it runs. Example 13-9 shows this
process.

"""
spawn FTP get and put GUIs no matter what directory I'm run from; os.getcwd is not
necessarily the place this script lives;  could also hardcode path from $PP4EHOME,
or guessLocation;  could also do:  [from PP4E.launchmodes import PortableLauncher,
PortableLauncher('getfilegui', '%s/getfilegui.py' % mydir)()], but need the DOS
console pop up on Windows to view status messages which describe transfers made;
"""

import os, sys
print('Running in: ', os.getcwd())

# PP3E
# from PP4E.Launcher import findFirst
# mydir = os.path.split(findFirst(os.curdir, 'PyFtpGui.pyw'))[0]

# PP4E
from PP4E.Tools.find import findlist
mydir = os.path.dirname(findlist('PyFtpGui.pyw', startdir=os.curdir)[0])

if sys.platform[:3] == 'win':
    os.system('start %s\\getfilegui.py' % mydir)
    os.system('start %s\\putfilegui.py' % mydir)
else:
    os.system('python %s/getfilegui.py &' % mydir)
    os.system('python %s/putfilegui.py &' % mydir)
Notice that we’re reusing the find
utility from Chapter 6’s Example 6-13 again here—this time
to locate the home directory of the script in order to build command
lines. When run by launchers in the examples root directory or
command lines elsewhere in general, the current working directory
may not always be this script’s container. In the prior edition,
this script used a tool in the Launcher
module instead to search for its
own directory (see the examples distribution for that
equivalent).
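As a rough alternative to searching the directory tree, a script can often work out its own directory from `sys.argv[0]`, the name under which it was started. This is a sketch under that assumption and is not how the book's launcher works. The `script_dir` and `build_commands` helpers are hypothetical, and the code only builds command strings rather than running them.

```python
import os
import sys

def script_dir(script_path):
    """Return the absolute directory containing the given script file."""
    return os.path.dirname(os.path.abspath(script_path))

def build_commands(mydir, platform=sys.platform):
    """Build launch commands for both GUIs, following the launcher's pattern."""
    scripts = ['getfilegui.py', 'putfilegui.py']
    if platform[:3] == 'win':
        # 'start' opens each script in its own console window on Windows
        return ['start %s\\%s' % (mydir, name) for name in scripts]
    # elsewhere, a trailing '&' runs each script in the background
    return ['python %s/%s &' % (mydir, name) for name in scripts]

if __name__ == '__main__':
    mydir = script_dir(sys.argv[0])   # sys.argv[0] names the running script
    for command in build_commands(mydir):
        print(command)
```

Each resulting string could be passed to `os.system`, as the launcher does. Paths that contain spaces would need quoting, which this sketch leaves out.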
When this script is started, both the get
and put
GUIs appear as distinct, independently
run programs; alternatively, we might attach both forms to a single
interface. We could get much fancier than these two interfaces, of
course. For instance, we could pop up local file selection dialogs,
and we could display widgets that give the status of downloads and
uploads in progress. We could even list files available at the
remote site in a selectable listbox by requesting remote directory
listings over the FTP connection. To learn how to add features like
that, though, we need to move on to the next section.
Once upon a time, I used Telnet to manage my website at my Internet Service Provider (ISP). I logged in to the web server in a shell window, and performed all my edits directly on the remote machine. There was only one copy of a site’s files—on the machine that hosted it. Moreover, content updates could be performed from any machine that ran a Telnet client—ideal for people with travel-based careers.[49]
Of course, times have changed. Like most personal websites, today mine are maintained on my laptop and I transfer their files to and from my ISP as needed. Often, this is a simple matter of one or two files, and it can be accomplished with a command-line FTP client. Sometimes, though, I need an easy way to transfer the entire site. Maybe I need to download to detect files that have become out of sync. Occasionally, the changes are so involved that it’s easier to upload the entire site in a single step.
Although there are a variety of ways to approach this task (including options in site-builder tools), Python can help here, too: writing Python scripts to automate the upload and download tasks associated with maintaining my website on my laptop provides a portable and mobile solution. Because Python FTP scripts will work on any machine with sockets, they can be run on my laptop and on nearly any other computer where Python is installed. Furthermore, the same scripts used to transfer page files to and from my PC can be used to copy my site to another web server as a backup copy, should my ISP experience an outage. The effect is sometimes called a mirror—a copy of a remote site.
The following two scripts address these needs. The first, downloadflat.py, automatically downloads (i.e., copies) by FTP all the files in a directory at a remote site to a directory on the local machine. I keep the main copy of my website files on my PC these days, but I use this script in two ways:
To download my website to client machines where I want to make edits, I fetch the contents of my web directory of my account on my ISP’s machine.
To mirror my site to my account on another server, I run this script periodically on the target machine if it supports Telnet or SSH secure shell; if it does not, I simply download to one machine and upload from there to the target server.
More generally, this script (shown in Example 13-10) will download a directory full of files to any machine with Python and sockets, from any machine running an FTP server.
#!/bin/env python
"""
###############################################################################
use FTP to copy (download) all files from a single directory at a remote
site to a directory on the local machine; run me periodically to mirror
a flat FTP site directory to your ISP account;  set user to 'anonymous'
to do anonymous FTP;  we could use try to skip file failures, but the FTP
connection is likely closed if any files fail;  we could also try to
reconnect with a new FTP instance before each transfer: connects once now;
if failures, try setting nonpassive for active FTP, or disable firewalls;
this also depends on a working FTP server, and possibly its load policies.
###############################################################################
"""

import os, sys, ftplib
from getpass import getpass
from mimetypes import guess_type

nonpassive = False                        # passive FTP on by default in 2.1+
remotesite = 'home.rmi.net'               # download from this site
remotedir  = '.'                          # and this dir (e.g., public_html)
remoteuser = 'lutz'
remotepass = getpass('Password for %s on %s: ' % (remoteuser, remotesite))
localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
cleanall   = input('Clean local directory first? ')[:1] in ['y', 'Y']

print('connecting...')
connection = ftplib.FTP(remotesite)                 # connect to FTP site
connection.login(remoteuser, remotepass)            # login as user/password
connection.cwd(remotedir)                           # cd to directory to copy
if nonpassive:                                      # force active mode FTP
    connection.set_pasv(False)                      # most servers do passive

if cleanall:
    for localname in os.listdir(localdir):          # try to delete all locals
        try:                                        # first, to remove old files
            print('deleting local', localname)      # os.listdir omits . and ..
            os.remove(os.path.join(localdir, localname))
        except:
            print('cannot delete local', localname)

count = 0                                           # download all remote files
remotefiles = connection.nlst()                     # nlst() gives files list
                                                    # dir()  gives full details
for remotename in remotefiles:
    if remotename in ('.', '..'): continue          # some servers include . and ..
    mimetype, encoding = guess_type(remotename)     # e.g., ('text/plain', 'gzip')
    mimetype  = mimetype or '?/?'                   # may be (None, None)
    maintype  = mimetype.split('/')[0]              # .jpg ('image/jpeg', None)

    localpath = os.path.join(localdir, remotename)
    print('downloading', remotename, 'to', localpath, end=' ')
    print('as', maintype, encoding or '')

    if maintype == 'text' and encoding == None:
        # use ascii mode xfer and text file
        # use encoding compatible with ftplib's
        localfile = open(localpath, 'w', encoding=connection.encoding)
        callback  = lambda line: localfile.write(line + '\n')
        connection.retrlines('RETR ' + remotename, callback)
    else:
        # use binary mode xfer and bytes file
        localfile = open(localpath, 'wb')
        connection.retrbinary('RETR ' + remotename, localfile.write)

    localfile.close()
    count += 1

connection.quit()
print('Done:', count, 'files downloaded.')
There’s not much that is new to speak of in this script, compared to other FTP examples we’ve seen thus far. We open a connection with the remote FTP server, log in with a username and password for the desired account (this script never uses anonymous FTP), and go to the desired remote directory. New here, though, are loops to iterate over all the files in local and remote directories, text-based retrievals, and file deletions:
This script has a cleanall
option, enabled by an
interactive prompt. If selected, the script first deletes all
the files in the local directory before downloading, to make
sure there are no extra files that aren’t also on the server
(there may be junk here from a prior download). To delete
local files, the script calls os.listdir
to get a list of
filenames in the directory, and os.remove
to delete each; see Chapter 4 (or the Python library
manual) for more details if you’ve forgotten what these calls
do.
Notice the use of os.path.join
to concatenate a
directory path and filename according to the host platform’s
conventions; os.listdir
returns filenames without their directory paths, and this
script is not necessarily run in the local directory where
downloads will be placed. The local directory defaults to the
current directory (“.”), but can be set differently with a
command-line argument to the script.
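The local cleanup step just described can be tried in isolation. The following sketch packages the same `os.listdir`, `os.path.join`, and `os.remove` calls into a hypothetical `clean_directory` function and runs it against a throwaway temporary directory, so no real files are at risk.

```python
import os
import tempfile

def clean_directory(dirpath):
    """Try to delete every file in dirpath; return names that were removed."""
    removed = []
    for name in os.listdir(dirpath):              # names only, no directory path
        path = os.path.join(dirpath, name)        # portable path concatenation
        try:
            os.remove(path)                       # fails for subdirectories
            removed.append(name)
        except OSError:
            print('cannot delete local', name)
    return removed

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as workdir:
        for name in ('a.html', 'b.txt'):
            open(os.path.join(workdir, name), 'w').close()
        os.mkdir(os.path.join(workdir, 'subdir'))
        print(sorted(clean_directory(workdir)))
        print(os.listdir(workdir))
```

Unlike the script's bare `except`, this version catches only `OSError`, which is a design choice of the sketch rather than something the original requires.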
To grab all the files in a remote directory, we first need a list of their
names. The FTP object’s nlst
method is the remote equivalent of os.listdir
: nlst
returns a list of the string
names of all files in the current remote directory. Once we
have this list, we simply step through it in a loop, running
FTP retrieval commands for each filename in turn (more on this
in a minute).
The nlst method is, more or less, like requesting a directory listing with an ls command in typical interactive FTP programs, but Python automatically splits up the listing’s text into a list of filenames. We can pass it a remote directory to be listed; by default it lists the current server directory. A related FTP method, dir, produces the line strings generated by an FTP LIST command; by default it prints them, but it passes each line to a callback function instead if one is given as its last argument. Its output is like typing a dir command in an FTP session, and its lines contain complete file information, unlike nlst. If you need to know more about all the remote files, collect and parse the lines produced by a dir method call (we’ll see how in a later example).
Notice how we skip “.” and “..” current and parent
directory indicators if present in remote directory listings;
unlike os.listdir
, some
(but not all) servers include these, so we need to either skip
these or catch the exceptions they may trigger (more on this
later when we start using dir
, too).
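To make the listing step concrete without a live server, here is a small sketch. The `remote_file_names` helper is hypothetical and relies only on the connection object's `nlst` method; `FakeConnection` is a stand-in for an `ftplib.FTP` object that returns a canned listing.

```python
def remote_file_names(connection, remotedir=None):
    """Return names from an FTP name listing, minus '.' and '..' entries."""
    # ftplib.FTP.nlst accepts an optional directory argument
    names = connection.nlst(remotedir) if remotedir else connection.nlst()
    return [name for name in names if name not in ('.', '..')]

class FakeConnection:
    """Stand-in for ftplib.FTP that returns a canned listing (no network)."""
    def nlst(self, *args):
        return ['.', '..', 'index.html', 'logo.gif']

if __name__ == '__main__':
    print(remote_file_names(FakeConnection()))
```

With a real, logged-in `ftplib.FTP` object in place of the stand-in, the same function should yield the names to pass to the retrieval loop.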
We discussed output file modes for FTP earlier, but
now that we’ve started transferring text, too, I
can fill in the rest of this story. To handle Unicode
encodings and to keep line-ends in sync with the machines that
my web files live on, this script distinguishes between binary
and text file transfers. It uses the Python mimetypes
module to choose between
text and binary transfer modes for each file.
We met mimetypes
in
Chapter 6 near Example 6-23, where we used
it to play media files (see the examples and description there
for an introduction). Here, mimetypes
is used to decide whether
a file is text or binary by guessing from its filename
extension. For instance, HTML web pages and simple text files
are transferred as text with automatic line-end mappings, and
images and tar archives are transferred
in raw binary mode.
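The mode decision can be isolated in a few lines. This sketch wraps the script's test in a hypothetical `transfer_mode` function; note that a compressed text file reports an encoding such as gzip and therefore falls into the binary case, as does any name whose type cannot be guessed.

```python
from mimetypes import guess_type

def transfer_mode(filename):
    """Return 'text' or 'binary' for a filename, using the script's test."""
    mimetype, encoding = guess_type(filename)    # either item may be None
    maintype = (mimetype or '?/?').split('/')[0]
    if maintype == 'text' and encoding is None:
        return 'text'
    return 'binary'                              # images, archives, unknowns

if __name__ == '__main__':
    for name in ('index.html', 'notes.txt', 'photo.jpg', 'site.tar.gz', 'README'):
        print(name, '=>', transfer_mode(name))
```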
For binary files, data is pulled down with the retrbinary method we met earlier, and stored in a local file with binary open mode of wb. This file open mode is required to allow for the bytes strings passed to the write method by retrbinary, but it also suppresses line-end byte mapping and Unicode encodings in the process. Again, text mode requires encodable text in Python 3.X, and this fails for binary data like images. This script may also be run on Windows or Unix-like platforms, and we don’t want a \n byte embedded in an image to get expanded to \r\n on Windows. We don’t use a chunk-size third argument for binary transfers here, though; it defaults to a reasonable size if omitted.
For text files, the script instead uses the retrlines method, passing in a function to be called for each line in the text file downloaded. The text line handler function receives lines in str string form, and mostly just writes the line to a local text file. But notice that the handler function created by the lambda here also adds a \n line-end character to the end of the line it is passed. Python’s retrlines method strips all line-feed characters from lines to sidestep platform differences. By adding a \n, the script ensures the proper line-end marker character sequence for the local platform on which this script runs when written to the file (\n or \r\n).
For this auto-mapping of the \n in the script to work, of course, we must also open text output files in w text mode, not in wb: the mapping from \n to \r\n on Windows happens when data is written to the file. As discussed earlier, text mode also means that the file’s write method will allow for the str string passed in by retrlines, and that text will be encoded per Unicode when written.
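The line-end mapping performed by text-mode files can be observed directly. In the sketch below, the `newline` argument to `open` is used only to simulate each platform's default behavior, so the result is the same wherever it is run; the download script itself relies on the platform default instead. The helper names are hypothetical.

```python
import os
import tempfile

def write_lines(path, lines, newline):
    """Write stripped lines in text mode, adding a line-end to each."""
    with open(path, 'w', encoding='latin-1', newline=newline) as output:
        for line in lines:
            output.write(line + '\n')      # mapped per the newline argument

def raw_bytes(path):
    """Read a file's content back with no line-end mapping."""
    with open(path, 'rb') as source:       # binary mode: bytes as stored
        return source.read()

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, 'page.html')
        write_lines(path, ['<html>', '</html>'], newline='\r\n')  # Windows-style
        print(raw_bytes(path))
        write_lines(path, ['<html>', '</html>'], newline='\n')    # Unix-style
        print(raw_bytes(path))
```

Reading the file back in binary mode shows the bytes actually stored, which is where the two platforms' conventions differ.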
Subtly, though, we also explicitly use the FTP connection
object’s Unicode
encoding scheme for our text output file
in open
, instead of the
default. Without this encoding option, the script aborted with
a UnicodeEncodeError
exception for some files in my site. In retrlines
, the FTP object itself
reads the remote file data over a socket with a text-mode file
wrapper and an explicit encoding scheme for decoding; since
the FTP object can do no better than this encoding anyhow, we
use its encoding for our output file as well.
By default, FTP objects in Python 3.1 use the latin1 scheme for decoding text fetched (as well as for encoding text sent), but this can be specialized by assigning to their encoding attribute; note that Python 3.9 and later default to UTF-8 instead. Our script’s local text output file will inherit whatever encoding ftplib uses and so be compatible with the encoded text data that it produces and passes.
We could try to also catch Unicode exceptions for files
outside the Unicode encoding used by the FTP object, but
exceptions leave the FTP object in an unrecoverable state in
tests I’ve run in Python 3.1. Alternatively, we could use
wb
binary mode for the
local text output file and manually encode line strings with
line.encode
, or simply use
retrbinary
and binary mode
files in all cases, but both of these would fail to map
end-lines portably—the whole point of making text distinct in
this context.
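The reason for matching the output file's encoding to the FTP object's can be sketched with plain strings and no network at all. The helper names here are hypothetical, and 'ascii' merely stands in for a narrower platform default encoding that cannot represent every decoded character.

```python
def roundtrip(data, encoding):
    """Decode bytes and re-encode them with the same scheme."""
    return data.decode(encoding).encode(encoding)

def can_encode(text, encoding):
    """Report whether text can be written with the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

if __name__ == '__main__':
    raw = bytes(range(256))                     # every possible byte value
    print(roundtrip(raw, 'latin-1') == raw)     # latin-1 maps each byte to one character
    line = b'caf\xe9'.decode('latin-1')         # text as the FTP object might decode it
    print(can_encode(line, 'latin-1'), can_encode(line, 'ascii'))
```

Text decoded with one scheme can always be written back with that same scheme, which is why reusing the connection's encoding for the output file avoids the encode error.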
All of this is simpler in action than in words. Here is the command I use to download my entire book support website from my ISP server account to my Windows laptop PC, in a single step:
C:\...\PP4E\Internet\Ftp\Mirror> downloadflat.py test
Password for lutz on home.rmi.net:
Clean local directory first? y
connecting...
deleting local 2004-longmont-classes.html
deleting local 2005-longmont-classes.html
deleting local 2006-longmont-classes.html
deleting local about-hopl.html
deleting local about-lp.html
deleting local about-lp2e.html
deleting local about-pp-japan.html
...lines omitted...
downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
downloading 2005-longmont-classes.html to test\2005-longmont-classes.html as text
downloading 2006-longmont-classes.html to test\2006-longmont-classes.html as text
downloading about-hopl.html to test\about-hopl.html as text
downloading about-lp.html to test\about-lp.html as text
downloading about-lp2e.html to test\about-lp2e.html as text
downloading about-pp-japan.html to test\about-pp-japan.html as text
...lines omitted...
downloading ora-pyref4e.gif to test\ora-pyref4e.gif as image
downloading ora-lp4e-big.jpg to test\ora-lp4e-big.jpg as image
downloading ora-lp4e.gif to test\ora-lp4e.gif as image
downloading pyref4e-updates.html to test\pyref4e-updates.html as text
downloading lp4e-updates.html to test\lp4e-updates.html as text
downloading lp4e-examples.html to test\lp4e-examples.html as text
downloading LP4E-examples.zip to test\LP4E-examples.zip as application
Done: 297 files downloaded.
This may take a few moments to complete, depending on your
site’s size and your connection speed (it’s bound by network speed
constraints, and it usually takes roughly two to three minutes for
my site on my current laptop and wireless broadband connection). It
is much more accurate and easier than downloading files by hand,
though. The script simply iterates over all the remote files
returned by the nlst
method, and
downloads each with the FTP protocol (i.e., over sockets) in turn.
It uses text transfer mode for names that imply text data, and
binary mode for others.
With the script running this way, I make sure the initial assignments in it reflect the machines involved, and then run the script from the local directory where I want the site copy to be stored. Because the target download directory is often not where the script lives, I may need to give Python the full path to the script file. When run on a server in a Telnet or SSH session window, for instance, the execution and script directory paths are different, but the script works the same way.
If you elect to delete local files in the download directory, you may also see a batch of “deleting local…” messages scroll by on the screen before any “downloading…” lines appear: this automatically cleans out any garbage lingering from a prior download. And if you botch the input of the remote site password, a Python exception is raised; I sometimes need to run it again (and type more slowly):
C:\...\PP4E\Internet\Ftp\Mirror> downloadflat.py test
Password for lutz on home.rmi.net:
Clean local directory first?
connecting...
Traceback (most recent call last):
  File "C:\...\PP4E\Internet\Ftp\Mirror\downloadflat.py", line 29, in <module>
    connection.login(remoteuser, remotepass)            # login as user/password
  File "C:\Python31\lib\ftplib.py", line 375, in login
    if resp[0] == '3': resp = self.sendcmd('PASS ' + passwd)
  File "C:\Python31\lib\ftplib.py", line 245, in sendcmd
    return self.getresp()
  File "C:\Python31\lib\ftplib.py", line 220, in getresp
    raise error_perm(resp)
ftplib.error_perm: 530 Login incorrect.
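If a traceback like this is unwelcome, the login call can be wrapped to report the failure instead. The sketch below is one possible approach and not part of the book's script: `try_login` is a hypothetical helper, and `FakeServer` is a stand-in that raises the same `ftplib.error_perm` exception a real server's refusal would trigger.

```python
import ftplib

def try_login(connection, user, password):
    """Attempt a login; return True on success, False if the server refuses."""
    try:
        connection.login(user, password)
        return True
    except ftplib.error_perm as exc:       # permanent errors such as 530 replies
        print('login failed:', exc)
        return False

class FakeServer:
    """Stand-in for ftplib.FTP that rejects one password (no network used)."""
    def login(self, user, password):
        if password != 'secret':
            raise ftplib.error_perm('530 Login incorrect.')
        return '230 Login successful.'

if __name__ == '__main__':
    print(try_login(FakeServer(), 'lutz', 'oops'))
    print(try_login(FakeServer(), 'lutz', 'secret'))
```

A script could call such a helper in a loop to prompt for the password again after a failed attempt.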
It’s worth noting that this script is at least partially configured by assignments near the top of the file. In addition, the password and deletion options are given by interactive inputs, and one command-line argument is allowed—the local directory name to store the downloaded files (it defaults to “.”, the directory where the script is run). Command-line arguments could be employed to universally configure all the other download parameters and options, too, but because of Python’s simplicity and lack of compile/link steps, changing settings in the text of Python scripts is usually just as easy as typing words on a command line.
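For readers who do prefer command-line configuration, the settings could be gathered along the following lines. This is only a sketch: the option names (`--site`, `--dir`, `--user`, `--nonpassive`) are invented for illustration, the defaults simply repeat the script's assignments, and the `argparse` module it uses arrived in Python 3.2, after the release this book's examples target.

```python
import argparse

def parse_settings(argv=None):
    """Parse download settings; defaults mirror the script's assignments."""
    parser = argparse.ArgumentParser(description='flat FTP download settings')
    parser.add_argument('localdir', nargs='?', default='.',
                        help='local directory to store downloaded files')
    parser.add_argument('--site', default='home.rmi.net', help='remote FTP site')
    parser.add_argument('--dir', dest='remotedir', default='.',
                        help='remote directory to copy')
    parser.add_argument('--user', default='lutz', help='remote account name')
    parser.add_argument('--nonpassive', action='store_true',
                        help='force active-mode FTP')
    return parser.parse_args(argv)

if __name__ == '__main__':
    settings = parse_settings(['test', '--site', 'ftp.example.com'])
    print(settings.localdir, settings.site, settings.remotedir, settings.nonpassive)
```

The password is deliberately left out of the options here, since the script's `getpass` prompt keeps it off the command line.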
To check for version skew after a batch of downloads and uploads, you can run the diffall script we wrote in Chapter 6, Example 6-12. For instance, I find files that have diverged over time due to updates on multiple platforms by comparing the download to a local copy of my website using a shell command line such as C:\...\PP4E\Internet\Ftp> ..\..\System\Filetools\diffall.py Mirror\test C:\...\Websites\public_html. See Chapter 6 for more details on this tool, and file diffall.out.txt in the diffs subdirectory of the examples distribution for a sample run; its text file differences stem from either final line newline characters or newline differences reflecting binary transfers that Windows fc commands and FTP servers do not notice.
Uploading a full directory is symmetric to downloading: it’s mostly a matter of swapping the local and remote machines and operations in the program we just met. The script in Example 13-11 uses FTP to copy all files in a directory on the local machine on which it runs up to a directory on a remote machine.
I really use this script, too, most often to upload all of the files maintained on my laptop PC to my ISP account in one fell swoop. I also sometimes use it to copy my site from my PC to a mirror machine or from the mirror machine back to my ISP. Because this script runs on any computer with Python and sockets, it happily transfers a directory from any machine on the Net to any machine running an FTP server. Simply change the initial setting in this module as appropriate for the transfer you have in mind.
#!/bin/env python
"""
##############################################################################
use FTP to upload all files from one local dir to a remote site/directory;
e.g., run me to copy a web/FTP site's files from your PC to your ISP;
assumes a flat directory upload: uploadall.py does nested directories.
see downloadflat.py comments for more notes: this script is symmetric.
##############################################################################
"""

import os, sys, ftplib
from getpass import getpass
from mimetypes import guess_type

nonpassive = False                                  # passive FTP by default
remotesite = 'learning-python.com'                  # upload to this site
remotedir  = 'books'                                # from machine running on
remoteuser = 'lutz'
remotepass = getpass('Password for %s on %s: ' % (remoteuser, remotesite))
localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
cleanall   = input('Clean remote directory first? ')[:1] in ['y', 'Y']

print('connecting...')
connection = ftplib.FTP(remotesite)                 # connect to FTP site
connection.login(remoteuser, remotepass)            # log in as user/password
connection.cwd(remotedir)                           # cd to directory to copy
if nonpassive:                                      # force active mode FTP
    connection.set_pasv(False)                      # most servers do passive

if cleanall:
    for remotename in connection.nlst():            # try to delete all remotes
        try:                                        # first, to remove old files
            print('deleting remote', remotename)
            connection.delete(remotename)           # skips . and .. if attempted
        except:
            print('cannot delete remote', remotename)

count = 0                                           # upload all local files
localfiles = os.listdir(localdir)                   # listdir() strips dir path
                                                    # any failure ends script
for localname in localfiles:
    mimetype, encoding = guess_type(localname)      # e.g., ('text/plain', 'gzip')
    mimetype  = mimetype or '?/?'                   # may be (None, None)
    maintype  = mimetype.split('/')[0]              # .jpg ('image/jpeg', None)

    localpath = os.path.join(localdir, localname)
    print('uploading', localpath, 'to', localname, end=' ')
    print('as', maintype, encoding or '')

    if maintype == 'text' and encoding == None:
        # use ascii mode xfer and bytes file
        # need rb mode for ftplib's crlf logic
        localfile = open(localpath, 'rb')
        connection.storlines('STOR ' + localname, localfile)
    else:
        # use binary mode xfer and bytes file
        localfile = open(localpath, 'rb')
        connection.storbinary('STOR ' + localname, localfile)

    localfile.close()
    count += 1

connection.quit()
print('Done:', count, 'files uploaded.')
Similar to the mirror download script, this program illustrates a handful of new FTP interfaces and a set of FTP scripting techniques:
Just like the mirror script, the upload begins by asking
whether we want to delete all the files in the remote target
directory before copying any files there. This cleanall
option is useful if we’ve
deleted files in the local copy of the directory in the
client—the deleted files would remain on the server-side copy
unless we delete all files there first.
To implement the remote cleanup, this script simply gets
a listing of all the files in the remote directory with the
FTP nlst
method, and
deletes each in turn with the FTP delete
method. Assuming we have delete permission, the directory will
be emptied (file permissions depend on the account we logged
into when connecting to the server). We’ve already moved to
the target remote directory when deletions occur, so no
directory paths need to be prepended to filenames here. Note
that nlst
may raise an
exception for some servers if the remote directory is empty;
we don’t catch the exception here, but you can simply not
select a cleaning if one fails for you. We do catch deletion
exceptions, because directory names like “.” and “..” may be
returned in the listing by some servers.
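The cleanup step just described can be sketched as a small helper. This is an illustration only, not the script's own code: the name clean_remote_dir is hypothetical, and the function assumes it is handed an object that behaves like a logged-in ftplib.FTP connection already moved to the target directory. Unlike the script, the sketch also catches the ftplib.error_perm exception that some servers raise from nlst for an empty directory.

```python
import ftplib

def clean_remote_dir(connection):
    """Try to delete every name listed in the current remote directory.

    `connection` is assumed to behave like a logged-in ftplib.FTP object.
    Returns (deleted, skipped) lists of names.
    """
    try:
        names = connection.nlst()          # simple list of remote names
    except ftplib.error_perm:              # some servers raise this if empty
        names = []
    deleted, skipped = [], []
    for name in names:
        try:
            connection.delete(name)        # fails for '.', '..', and dirs
            deleted.append(name)
        except ftplib.all_errors:          # keep going on any FTP failure
            skipped.append(name)
    return deleted, skipped
```

Catching ftplib.all_errors rather than using a bare except is a design choice of this sketch; it lets keyboard interrupts and programming errors propagate while still skipping names the server refuses to delete.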
To apply the upload operation to each file in the local
directory, we get a list of local filenames with the standard
os.listdir
call, and
take care to prepend the local source directory
path to each filename with the os.path.join
call. Recall that
os.listdir
returns
filenames without directory paths, and the source directory
may not be the same as the script’s execution directory if
passed on the command line.
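As a minimal sketch of the same idea, the following generator pairs each bare name from os.listdir with a full path built by os.path.join. The name local_paths is hypothetical, and the sorted call is an addition for predictable ordering that the upload script itself does not use.

```python
import os

def local_paths(localdir):
    """Yield (name, path) pairs for every entry in localdir."""
    for name in sorted(os.listdir(localdir)):      # bare names, no dir prefix
        yield name, os.path.join(localdir, name)   # path valid from any cwd
```

The bare name is what we would send to the server, and the joined path is what we would pass to open on the local machine.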
This script may also be run on both Windows and
Unix-like clients, so we need to handle text files specially.
Like the mirror download, this script picks text or binary
transfer modes by using Python’s mimetypes
module to guess a file’s type from its filename extension;
HTML and text files are moved in FTP text mode, for instance.
We already met the storbinary
FTP object method used to
upload files in binary mode—an exact, byte-for-byte copy
appears at the remote site.
Text-mode transfers work almost identically: the storlines
method accepts an FTP
command string and a local file (or file-like) object, and
simply copies each line read from the local file to a
same-named file on the remote machine.
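To make the mode choice concrete, here is a minimal sketch of a single-file upload that picks storlines or storbinary from the filename. The names is_text_name and upload_one are hypothetical, and the connection argument is assumed to behave like a logged-in ftplib.FTP object; the logic mirrors the script's loop body but is not a copy of it.

```python
import os
from mimetypes import guess_type

def is_text_name(filename):
    """True if the extension suggests uncompressed text."""
    mimetype, encoding = guess_type(filename)      # may be (None, None)
    maintype = (mimetype or '?/?').split('/')[0]   # 'text', 'image', '?', ...
    return maintype == 'text' and encoding is None

def upload_one(connection, localpath):
    """Send one file under its own base name; return the mode used."""
    remotename = os.path.basename(localpath)
    with open(localpath, 'rb') as localfile:       # bytes file for both modes
        if is_text_name(remotename):
            connection.storlines('STOR ' + remotename, localfile)
            return 'text'
        connection.storbinary('STOR ' + remotename, localfile)
        return 'binary'
```

A compressed text file such as notes.txt.gz has a text main type but a gzip encoding, so this sketch sends it in binary mode, as does a file with no recognizable extension.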
Notice, though, that the local text input file must be opened in rb binary mode in Python 3.X. Text input files are normally opened in r text mode to perform Unicode decoding and to convert any \r\n end-of-line sequences on Windows to the platform-neutral \n character as lines are read. However, ftplib in Python 3.1 requires that the text file be opened in rb binary mode, because it converts all end-lines to the \r\n sequence for transmission; to do so, it must read lines as raw bytes with readlines and perform bytes string processing, which implies binary mode files.
This ftplib string processing worked with text-mode files in Python 2.X, but only because there was no separate bytes type; \n was expanded to \r\n. Opening the local file in binary mode for ftplib to read also means no Unicode decoding will occur: the text is sent over sockets as a byte string in already encoded form. All of which is, of course, a prime lesson on the impacts of Unicode encodings; consult the module ftplib.py in the Python source library directory for more details.
For binary mode transfers, things are simpler: we open the local file in rb binary mode to suppress Unicode decoding and automatic end-of-line mapping everywhere, and to return the bytes strings expected by ftplib on read. Binary data is not Unicode text, and we don't want bytes in an audio file that happen to have the same value as \r to magically disappear when read on Windows.
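The end-of-line processing that makes binary mode necessary can be illustrated with a few lines of bytes handling. This is a sketch of the idea only, not ftplib's actual source code, and the name to_network_line is hypothetical: each line read from a binary-mode file is normalized to end in the \r\n pair used for text transfers.

```python
import io

CRLF = b'\r\n'

def to_network_line(line):
    """Return a bytes line that ends in exactly one CRLF pair."""
    return line.rstrip(CRLF) + CRLF     # drop trailing CR/LF bytes, add CRLF

# a stand-in for a file opened in 'rb' mode, with mixed line endings
source = io.BytesIO(b'unix line\nwindows line\r\nlast line')
sent = [to_network_line(line) for line in source]
print(sent)
```

Because rstrip here is given a bytes argument, passing it a str line read from a text-mode file fails with a TypeError in Python 3.X, which is the same kind of mismatch the text describes.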
As for the mirror download script, this program simply iterates over all files to be transferred (files in the local directory listing this time), and transfers each in turn—in either text or binary mode, depending on the files’ names. Here is the command I use to upload my entire website from my laptop Windows PC to a remote Linux server at my ISP, in a single step:
C:\...\PP4E\Internet\Ftp\Mirror> uploadflat.py test
Password for lutz on learning-python.com:
Clean remote directory first? y
connecting...
deleting remote .
cannot delete remote .
deleting remote ..
cannot delete remote ..
deleting remote 2004-longmont-classes.html
deleting remote 2005-longmont-classes.html
deleting remote 2006-longmont-classes.html
deleting remote about-lp1e.html
deleting remote about-lp2e.html
deleting remote about-lp3e.html
deleting remote about-lp4e.html
...lines omitted...
uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
uploading test\2005-longmont-classes.html to 2005-longmont-classes.html as text
uploading test\2006-longmont-classes.html to 2006-longmont-classes.html as text
uploading test\about-lp1e.html to about-lp1e.html as text
uploading test\about-lp2e.html to about-lp2e.html as text
uploading test\about-lp3e.html to about-lp3e.html as text
uploading test\about-lp4e.html to about-lp4e.html as text
uploading test\about-pp-japan.html to about-pp-japan.html as text
...lines omitted...
uploading test\whatsnew.html to whatsnew.html as text
uploading test\whatsold.html to whatsold.html as text
uploading test\wxPython.doc.tgz to wxPython.doc.tgz as application gzip
uploading test\xlate-lp.html to xlate-lp.html as text
uploading test\zaurus0.jpg to zaurus0.jpg as image
uploading test\zaurus1.jpg to zaurus1.jpg as image
uploading test\zaurus2.jpg to zaurus2.jpg as image
uploading test\zoo-jan-03.jpg to zoo-jan-03.jpg as image
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.
For my site and on my current laptop and wireless broadband connection, this process typically takes six minutes, depending on server load. As with the download script, I often run this command from the local directory where my web files are kept, and I pass Python the full path to the script. When I run this on a Linux server, it works in the same way, but the paths to the script and my web files directory differ.[50]
The directory upload and download scripts of the prior two
sections work as advertised and, apart from the mimetypes
logic, were the only FTP
examples that were included in the second edition of this book. If
you look at these two scripts long enough, though, their
similarities will pop out at you eventually. In fact, they are
largely the same—they use identical code to configure transfer
parameters, connect to the FTP server, and determine file type. The
exact details have been lost to time, but some of this code was
certainly copied from one file to the other.
Although such redundancy isn’t a cause for alarm if we never plan on changing these scripts, it can be a killer in software projects in general. When you have two copies of identical bits of code, not only is there a danger of them becoming out of sync over time (you’ll lose uniformity in user interface and behavior), but you also effectively double your effort when it comes time to change code that appears in both places. Unless you’re a big fan of extra work, it pays to avoid redundancy wherever possible.
This redundancy is especially glaring when we look at the
complex code that uses mimetypes
to determine file types. Repeating magic like this in more than one
place is almost always a bad idea—not only do we have to remember
how it works every time we need the same utility, but it is a recipe
for errors.
As originally coded, our download and upload scripts comprise top-level script code that relies on global variables. Such a structure is difficult to reuse—code runs immediately on imports, and it’s difficult to generalize for varying contexts. Worse, it’s difficult to maintain—when you program by cut-and-paste of existing code, you increase the cost of future changes every time you click the Paste button.
To demonstrate how we might do better, Example 13-12 shows one way to refactor (reorganize) the download script. By wrapping its parts in functions, they become reusable in other modules, including our upload program.
#!/bin/env python
"""
##############################################################################
use FTP to copy (download) all files from a remote site and directory
to a directory on the local machine; this version works the same, but has
been refactored to wrap up its code in functions that can be reused by the
uploader, and possibly other programs in the future - else code redundancy,
which may make the two diverge over time, and can double maintenance costs.
##############################################################################
"""

import os, sys, ftplib
from getpass import getpass
from mimetypes import guess_type, add_type

defaultSite = 'home.rmi.net'
defaultRdir = '.'
defaultUser = 'lutz'

def configTransfer(site=defaultSite, rdir=defaultRdir, user=defaultUser):
    """
    get upload or download parameters
    uses a class due to the large number
    """
    class cf: pass
    cf.nonpassive = False                 # passive FTP on by default in 2.1+
    cf.remotesite = site                  # transfer to/from this site
    cf.remotedir  = rdir                  # and this dir ('.' means acct root)
    cf.remoteuser = user
    cf.localdir   = (len(sys.argv) > 1 and sys.argv[1]) or '.'
    cf.cleanall   = input('Clean target directory first? ')[:1] in ['y','Y']
    cf.remotepass = getpass(
                    'Password for %s on %s:' % (cf.remoteuser, cf.remotesite))
    return cf

def isTextKind(remotename, trace=True):
    """
    use mimetype to guess if filename means text or binary
    for 'f.html',  guess is ('text/html', None): text
    for 'f.jpeg'   guess is ('image/jpeg', None): binary
    for 'f.txt.gz' guess is ('text/plain', 'gzip'): binary
    for unknowns,  guess may be (None, None): binary
    mimetype can also guess name from type: see PyMailGUI
    """
    add_type('text/x-python-win', '.pyw')                      # not in tables
    mimetype, encoding = guess_type(remotename, strict=False)  # allow extras
    mimetype = mimetype or '?/?'                               # type unknown?
    maintype = mimetype.split('/')[0]                          # get first part
    if trace: print(maintype, encoding or '')
    return maintype == 'text' and encoding == None             # not compressed

def connectFtp(cf):
    print('connecting...')
    connection = ftplib.FTP(cf.remotesite)           # connect to FTP site
    connection.login(cf.remoteuser, cf.remotepass)   # log in as user/password
    connection.cwd(cf.remotedir)                     # cd to directory to xfer
    if cf.nonpassive:                                # force active mode FTP
        connection.set_pasv(False)                   # most servers do passive
    return connection

def cleanLocals(cf):
    """
    try to delete all locals files first to remove garbage
    """
    if cf.cleanall:
        for localname in os.listdir(cf.localdir):    # local dirlisting
            try:                                     # local file delete
                print('deleting local', localname)
                os.remove(os.path.join(cf.localdir, localname))
            except:
                print('cannot delete local', localname)

def downloadAll(cf, connection):
    """
    download all files from remote site/dir per cf config
    ftp nlst() gives files list, dir() gives full details
    """
    remotefiles = connection.nlst()                  # nlst is remote listing
    for remotename in remotefiles:
        if remotename in ('.', '..'): continue
        localpath = os.path.join(cf.localdir, remotename)
        print('downloading', remotename, 'to', localpath, 'as', end=' ')
        if isTextKind(remotename):
            # use text mode xfer
            localfile = open(localpath, 'w', encoding=connection.encoding)
            def callback(line): localfile.write(line + '\n')
            connection.retrlines('RETR ' + remotename, callback)
        else:
            # use binary mode xfer
            localfile = open(localpath, 'wb')
            connection.retrbinary('RETR ' + remotename, localfile.write)
        localfile.close()
    connection.quit()
    print('Done:', len(remotefiles), 'files downloaded.')

if __name__ == '__main__':
    cf = configTransfer()
    conn = connectFtp(cf)
    cleanLocals(cf)          # don't delete if can't connect
    downloadAll(cf, conn)
Compare this version with the original. This script, and every other in this section, runs the same as the original flat download and upload programs. Although we haven’t changed its behavior, we’ve modified the script’s software structure radically: its code is now a set of tools that can be imported and reused in other programs.
The refactored upload program in Example 13-13, for instance, is now noticeably simpler, and the code it shares with the download script only needs to be changed in one place if it ever requires improvement.
#!/bin/env python
"""
##############################################################################
use FTP to upload all files from a local dir to a remote site/directory;
this version reuses downloader's functions, to avoid code redundancy;
##############################################################################
"""

import os
from downloadflat_modular import configTransfer, connectFtp, isTextKind

def cleanRemotes(cf, connection):
    """
    try to delete all remote files first to remove garbage
    """
    if cf.cleanall:
        for remotename in connection.nlst():             # remote dir listing
            try:                                         # remote file delete
                print('deleting remote', remotename)     # skips . and .. exc
                connection.delete(remotename)
            except:
                print('cannot delete remote', remotename)

def uploadAll(cf, connection):
    """
    upload all files to remote site/dir per cf config
    listdir() strips dir path, any failure ends script
    """
    localfiles = os.listdir(cf.localdir)                 # listdir is local listing
    for localname in localfiles:
        localpath = os.path.join(cf.localdir, localname)
        print('uploading', localpath, 'to', localname, 'as', end=' ')
        if isTextKind(localname):
            # use text mode xfer
            localfile = open(localpath, 'rb')
            connection.storlines('STOR ' + localname, localfile)
        else:
            # use binary mode xfer
            localfile = open(localpath, 'rb')
            connection.storbinary('STOR ' + localname, localfile)
        localfile.close()
    connection.quit()
    print('Done:', len(localfiles), 'files uploaded.')

if __name__ == '__main__':
    cf = configTransfer(site='learning-python.com', rdir='books', user='lutz')
    conn = connectFtp(cf)
    cleanRemotes(cf, conn)
    uploadAll(cf, conn)
Not only is the upload script simpler now because it reuses
common code, but it will also inherit any changes made in the
download module. For instance, the isTextKind
function was later augmented
with code that adds the .pyw extension to
mimetypes
tables (this file
type is not recognized by default); because it is a shared
function, the change is automatically picked up in the upload
program, too.
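The table extension mentioned here can be tried in isolation. The following minimal sketch uses only the standard mimetypes calls that the shared function relies on; the filename gui.pyw is just a placeholder.

```python
from mimetypes import add_type, guess_type

# register the extension; it then shows up in later guesses
add_type('text/x-python-win', '.pyw')

guess = guess_type('gui.pyw', strict=False)   # also consult non-standard types
print(guess)
```

Because add_type changes module-level tables, every later guess_type call in the same process sees the new entry, which is why a single call in the shared function is enough for both the download and upload scripts.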
This script and the one it imports achieve the same goals as the originals, but changing them for easier code maintenance is a big deal in the real world of software development. The following, for example, downloads the site from one server and uploads to another:
C:\...\PP4E\Internet\Ftp\Mirror> python downloadflat_modular.py test
Clean target directory first?
Password for lutz on home.rmi.net:
connecting...
downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
...lines omitted...
downloading relo-feb010-index.html to test\relo-feb010-index.html as text
Done: 297 files downloaded.

C:\...\PP4E\Internet\Ftp\Mirror> python uploadflat_modular.py test
Clean target directory first?
Password for lutz on learning-python.com:
connecting...
uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
...lines omitted...
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.
The function-based approach of the last two examples addresses the
redundancy issue, but the scripts are perhaps clumsier than they need to
be. For instance, their cf
configuration options object provides a namespace that replaces
global variables and breaks cross-file dependencies. Once we start
making objects to model namespaces, though, Python’s OOP support
tends to be a more natural structure for our code. As one last
twist, Example 13-14
refactors the FTP code one more time in order to leverage Python’s
class feature.
#!/bin/env python
"""
##############################################################################
use FTP to download or upload all files in a single directory from/to a
remote site and directory; this version has been refactored to use classes
and OOP for namespace and a natural structure; we could also structure this
as a download superclass, and an upload subclass which redefines the clean
and transfer methods, but then there is no easy way for another client to
invoke both an upload and download; for the uploadall variant and possibly
others, also make single file upload/download code in orig loops methods;
##############################################################################
"""

import os, sys, ftplib
from getpass import getpass
from mimetypes import guess_type, add_type

# defaults for all clients
dfltSite = 'home.rmi.net'
dfltRdir = '.'
dfltUser = 'lutz'

class FtpTools:

    # allow these 3 to be redefined
    def getlocaldir(self):
        return (len(sys.argv) > 1 and sys.argv[1]) or '.'

    def getcleanall(self):
        return input('Clean target dir first?')[:1] in ['y','Y']

    def getpassword(self):
        return getpass(
               'Password for %s on %s:' % (self.remoteuser, self.remotesite))

    def configTransfer(self, site=dfltSite, rdir=dfltRdir, user=dfltUser):
        """
        get upload or download parameters
        from module defaults, args, inputs, cmdline
        anonymous ftp: user='anonymous' pass=emailaddr
        """
        self.nonpassive = False             # passive FTP on by default in 2.1+
        self.remotesite = site              # transfer to/from this site
        self.remotedir  = rdir              # and this dir ('.' means acct root)
        self.remoteuser = user
        self.localdir   = self.getlocaldir()
        self.cleanall   = self.getcleanall()
        self.remotepass = self.getpassword()

    def isTextKind(self, remotename, trace=True):
        """
        use mimetypes to guess if filename means text or binary
        for 'f.html',  guess is ('text/html', None): text
        for 'f.jpeg'   guess is ('image/jpeg', None): binary
        for 'f.txt.gz' guess is ('text/plain', 'gzip'): binary
        for unknowns,  guess may be (None, None): binary
        mimetypes can also guess name from type: see PyMailGUI
        """
        add_type('text/x-python-win', '.pyw')                     # not in tables
        mimetype, encoding = guess_type(remotename, strict=False) # allow extras
        mimetype = mimetype or '?/?'                              # type unknown?
        maintype = mimetype.split('/')[0]                         # get 1st part
        if trace: print(maintype, encoding or '')
        return maintype == 'text' and encoding == None            # not compressed

    def connectFtp(self):
        print('connecting...')
        connection = ftplib.FTP(self.remotesite)            # connect to FTP site
        connection.login(self.remoteuser, self.remotepass)  # log in as user/pswd
        connection.cwd(self.remotedir)                      # cd to dir to xfer
        if self.nonpassive:                                 # force active mode
            connection.set_pasv(False)                      # most do passive
        self.connection = connection

    def cleanLocals(self):
        """
        try to delete all local files first to remove garbage
        """
        if self.cleanall:
            for localname in os.listdir(self.localdir):     # local dirlisting
                try:                                        # local file delete
                    print('deleting local', localname)
                    os.remove(os.path.join(self.localdir, localname))
                except:
                    print('cannot delete local', localname)

    def cleanRemotes(self):
        """
        try to delete all remote files first to remove garbage
        """
        if self.cleanall:
            for remotename in self.connection.nlst():       # remote dir listing
                try:                                        # remote file delete
                    print('deleting remote', remotename)
                    self.connection.delete(remotename)
                except:
                    print('cannot delete remote', remotename)

    def downloadOne(self, remotename, localpath):
        """
        download one file by FTP in text or binary mode
        local name need not be same as remote name
        """
        if self.isTextKind(remotename):
            localfile = open(localpath, 'w', encoding=self.connection.encoding)
            def callback(line): localfile.write(line + '\n')
            self.connection.retrlines('RETR ' + remotename, callback)
        else:
            localfile = open(localpath, 'wb')
            self.connection.retrbinary('RETR ' + remotename, localfile.write)
        localfile.close()

    def uploadOne(self, localname, localpath, remotename):
        """
        upload one file by FTP in text or binary mode
        remote name need not be same as local name
        """
        if self.isTextKind(localname):
            localfile = open(localpath, 'rb')
            self.connection.storlines('STOR ' + remotename, localfile)
        else:
            localfile = open(localpath, 'rb')
            self.connection.storbinary('STOR ' + remotename, localfile)
        localfile.close()

    def downloadDir(self):
        """
        download all files from remote site/dir per config
        ftp nlst() gives files list, dir() gives full details
        """
        remotefiles = self.connection.nlst()         # nlst is remote listing
        for remotename in remotefiles:
            if remotename in ('.', '..'): continue
            localpath = os.path.join(self.localdir, remotename)
            print('downloading', remotename, 'to', localpath, 'as', end=' ')
            self.downloadOne(remotename, localpath)
        print('Done:', len(remotefiles), 'files downloaded.')

    def uploadDir(self):
        """
        upload all files to remote site/dir per config
        listdir() strips dir path, any failure ends script
        """
        localfiles = os.listdir(self.localdir)       # listdir is local listing
        for localname in localfiles:
            localpath = os.path.join(self.localdir, localname)
            print('uploading', localpath, 'to', localname, 'as', end=' ')
            self.uploadOne(localname, localpath, localname)
        print('Done:', len(localfiles), 'files uploaded.')

    def run(self, cleanTarget=lambda:None, transferAct=lambda:None):
        """
        run a complete FTP session
        default clean and transfer are no-ops
        don't delete if can't connect to server
        """
        self.connectFtp()
        cleanTarget()
        transferAct()
        self.connection.quit()

if __name__ == '__main__':
    ftp = FtpTools()
    xfermode = 'download'
    if len(sys.argv) > 1:
        xfermode = sys.argv.pop(1)   # get+del 2nd arg
    if xfermode == 'download':
        ftp.configTransfer()
        ftp.run(cleanTarget=ftp.cleanLocals, transferAct=ftp.downloadDir)
    elif xfermode == 'upload':
        ftp.configTransfer(site='learning-python.com', rdir='books', user='lutz')
        ftp.run(cleanTarget=ftp.cleanRemotes, transferAct=ftp.uploadDir)
    else:
        print('Usage: ftptools.py ["download" | "upload"] [localdir]')
In fact, this last mutation combines uploads and downloads
into a single file, because they are so closely related. As
before, common code is factored into methods to avoid redundancy.
New here, the instance object itself becomes a natural namespace
for storing configuration options (they become self
attributes). Study this example’s
code for more details of the restructuring applied.
Again, this revision runs the same as our original site download and upload scripts; see its self-test code at the end for usage details, and pass in a command-line argument to specify “download” or “upload.” We haven’t changed what it does; we’ve refactored it for maintainability and reuse:
C:\...\PP4E\Internet\Ftp\Mirror> ftptools.py download test
Clean target dir first?
Password for lutz on home.rmi.net:
connecting...
downloading 2004-longmont-classes.html to test\2004-longmont-classes.html as text
...lines omitted...
downloading relo-feb010-index.html to test\relo-feb010-index.html as text
Done: 297 files downloaded.

C:\...\PP4E\Internet\Ftp\Mirror> ftptools.py upload test
Clean target dir first?
Password for lutz on learning-python.com:
connecting...
uploading test\2004-longmont-classes.html to 2004-longmont-classes.html as text
...lines omitted...
uploading test\zopeoutline.htm to zopeoutline.htm as text
Done: 297 files uploaded.
Although this file can still be run as a command-line script
like this, its class is really now a package of FTP tools that can
be mixed into other programs and reused. Because its code is wrapped in a
class, it can be easily customized by redefining its methods: its
configuration calls, such as getlocaldir, for example, may be
redefined in subclasses for custom scenarios.
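As a rough illustration of this kind of customization, the sketch below redefines two configuration methods in a subclass. To keep it self-contained, it uses a stripped-down stand-in class (ConfigHooks, a hypothetical name) that copies only the configuration logic and no network code; BatchConfig and the site name ftp.example.com are likewise placeholders, not part of the book's examples.

```python
import sys

class ConfigHooks:
    """Stand-in for the configuration part of FtpTools (no network code)."""
    def getlocaldir(self):
        return (len(sys.argv) > 1 and sys.argv[1]) or '.'

    def getcleanall(self):
        return input('Clean target dir first?')[:1] in ['y', 'Y']

    def configTransfer(self, site, rdir, user):
        self.remotesite, self.remotedir, self.remoteuser = site, rdir, user
        self.localdir = self.getlocaldir()       # runs the subclass version
        self.cleanall = self.getcleanall()

class BatchConfig(ConfigHooks):
    """Hypothetical non-interactive client: fixed directory, never cleans."""
    def __init__(self, localdir):
        self.fixeddir = localdir

    def getlocaldir(self):
        return self.fixeddir                     # ignore the command line

    def getcleanall(self):
        return False                             # never prompt the user

cfg = BatchConfig('site-copy')
cfg.configTransfer(site='ftp.example.com', rdir='.', user='anonymous')
print(cfg.localdir, cfg.cleanall)
```

The superclass method never changes; it simply calls self.getlocaldir and self.getcleanall, and Python's inheritance search finds the subclass versions first.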
Perhaps most importantly, using classes optimizes code reusability. Clients of this file can both upload and download directories by simply subclassing or embedding an instance of this class and calling its methods. To see one example of how, let’s move on to the next section.
Perhaps the biggest limitation of the website download and upload scripts we just met is that they assume the site directory is flat (hence their names). That is, the preceding scripts transfer simple files only, and none of them handle nested subdirectories within the web directory to be transferred.
For my purposes, that’s often a reasonable constraint. I avoid nested subdirectories to keep things simple, and I store my book support home website as a simple directory of files. For other sites, though, including one I keep at another machine, site transfer scripts are easier to use if they also automatically transfer subdirectories along the way.
It turns out that supporting directories on uploads is fairly simple—we need to add only a bit of recursion and remote directory creation calls. The upload script in Example 13-15 extends the class-based version we just saw in Example 13-14, to handle uploading all subdirectories nested within the transferred directory. Furthermore, it recursively transfers subdirectories within subdirectories—the entire directory tree contained within the top-level transfer directory is uploaded to the target directory at the remote server.
In terms of its code structure, Example 13-15 is just a
customization of the FtpTools
class of the prior section—really, we’re just adding a method for
recursive uploads, by subclassing. As one consequence, we get tools such as
parameter configuration, content type testing, and connection and
upload code for free here; with OOP, some of the work is done before
we start.
#!/bin/env python
"""
############################################################################
extend the FtpTools class to upload all files and subdirectories from a
local dir tree to a remote site/dir; supports nested dirs too, but not
the cleanall option (that requires parsing FTP listings to detect remote
dirs: see cleanall.py); to upload subdirectories, uses os.path.isdir(path)
to see if a local file is really a directory, FTP().mkd(path) to make dirs
on the remote machine (wrapped in a try in case it already exists there),
and recursion to upload all files/dirs inside the nested subdirectory.
############################################################################
"""

import os, ftptools

class UploadAll(ftptools.FtpTools):
    """
    upload an entire tree of subdirectories
    assumes top remote directory exists
    """
    def __init__(self):
        self.fcount = self.dcount = 0

    def getcleanall(self):
        return False  # don't even ask

    def uploadDir(self, localdir):
        """
        for each directory in an entire tree
        upload simple files, recur into subdirectories
        """
        localfiles = os.listdir(localdir)
        for localname in localfiles:
            localpath = os.path.join(localdir, localname)
            print('uploading', localpath, 'to', localname, end=' ')
            if not os.path.isdir(localpath):
                self.uploadOne(localname, localpath, localname)
                self.fcount += 1
            else:
                try:
                    self.connection.mkd(localname)
                    print('directory created')
                except:
                    print('directory not created')
                self.connection.cwd(localname)       # change remote dir
                self.uploadDir(localpath)            # upload local subdir
                self.connection.cwd('..')            # change back up
                self.dcount += 1
                print('directory exited')

if __name__ == '__main__':
    ftp = UploadAll()
    ftp.configTransfer(site='learning-python.com', rdir='training', user='lutz')
    ftp.run(transferAct = lambda: ftp.uploadDir(ftp.localdir))
    print('Done:', ftp.fcount, 'files and', ftp.dcount, 'directories uploaded.')
Like the flat upload script, this one can be run on any machine with Python and sockets and upload to any machine running an FTP server; I run it both on my laptop PC and on other servers by Telnet or SSH to upload sites to my ISP.
The crux of the matter in this script is the os.path.isdir test near the top; if this test detects a directory in the current local directory, we create an identically named directory on the remote machine with connection.mkd, descend into it with connection.cwd, and recur into the subdirectory on the local machine (we have to use recursive calls here, because the shape and depth of the tree are arbitrary). Like all FTP object methods, the mkd and cwd methods issue FTP commands to the remote server. When we exit a local subdirectory, we run a remote cwd('..') to climb to the remote parent directory and continue; the recursive call level’s return restores the prior directory on the local machine. The rest of the script is roughly the same as the original.
In the interest of space, I’ll leave studying this variant in more depth as a suggested exercise. For more context, try changing this script so as not to assume that the top-level remote directory already exists. As usual in software, there are a variety of implementation and operation options here.
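As one possible starting point for that exercise, here is a hedged sketch of a helper that moves into a remote directory and creates it first when the server reports it missing. The name enter_remote_dir is hypothetical, the connection argument is assumed to behave like a logged-in ftplib.FTP object, and the sketch assumes the server signals a missing directory with a permanent error reply, which ftplib raises as error_perm.

```python
import ftplib

def enter_remote_dir(connection, dirname):
    """cd into dirname on the server, creating it first if needed.

    `connection` is assumed to behave like a logged-in ftplib.FTP object.
    Returns True if the directory had to be created.
    """
    try:
        connection.cwd(dirname)            # already there: nothing to make
        return False
    except ftplib.error_perm:              # permanent error: assume missing
        connection.mkd(dirname)            # create it, then descend
        connection.cwd(dirname)
        return True
```

Trying cwd first and creating the directory only on failure is one design choice among several; the upload script itself takes the opposite order, attempting mkd and ignoring the failure when the directory already exists.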
Here is the sort of output displayed on the console when the upload-all script is run, uploading a site with multiple subdirectory levels which I maintain with site builder tools. It’s similar to the flat upload (which you might expect, given that it is reusing much of the same code by inheritance), but notice that it traverses and uploads nested subdirectories along the way:
C:\...\PP4E\Internet\Ftp\Mirror> uploadall.py Website-Training
Password for lutz on learning-python.com:
connecting...
uploading Website-Training\2009-public-classes.htm to 2009-public-classes.htm text
uploading Website-Training\2010-public-classes.html to 2010-public-classes.html text
uploading Website-Training\about.html to about.html text
uploading Website-Training\books to books directory created
uploading Website-Training\books\index.htm to index.htm text
uploading Website-Training\books\index.html to index.html text
uploading Website-Training\books\_vti_cnf to _vti_cnf directory created
uploading Website-Training\books\_vti_cnf\index.htm to index.htm text
uploading Website-Training\books\_vti_cnf\index.html to index.html text
directory exited
directory exited
uploading Website-Training\calendar.html to calendar.html text
uploading Website-Training\contacts.html to contacts.html text
uploading Website-Training\estes-nov06.htm to estes-nov06.htm text
uploading Website-Training\formalbio.html to formalbio.html text
uploading Website-Training\fulloutline.html to fulloutline.html text
...lines omitted...
uploading Website-Training\_vti_pvt\writeto.cnf to writeto.cnf ?
uploading Website-Training\_vti_pvt\_vti_cnf to _vti_cnf directory created
uploading Website-Training\_vti_pvt\_vti_cnf\_x_todo.htm to _x_todo.htm text
uploading Website-Training\_vti_pvt\_vti_cnf\_x_todoh.htm to _x_todoh.htm text
directory exited
uploading Website-Training\_vti_pvt\_x_todo.htm to _x_todo.htm text
uploading Website-Training\_vti_pvt\_x_todoh.htm to _x_todoh.htm text
directory exited
Done: 366 files and 18 directories uploaded.
As is, the script of Example 13-15 handles only directory tree uploads; recursive uploads are generally more useful than recursive downloads if you maintain your websites on your local PC and upload to a server periodically, as I do. To also download (mirror) a website that has subdirectories, a script must parse the output of a remote listing command to detect remote directories. For the same reason, the recursive upload script was not coded to support the remote directory tree cleanup option of the original—such a feature would require parsing remote listings as well. The next section shows how.
One last example of code reuse at work: when I initially tested the prior section’s upload-all script, it contained a bug that caused it to fall into an infinite recursion loop, and keep copying the full site into new subdirectories, over and over, until the FTP server kicked me off (not an intended feature of the program!). In fact, the upload got 13 levels deep before being killed by the server; it effectively locked my site until the mess could be repaired.
To get rid of all the files accidentally uploaded, I quickly
wrote the script in Example 13-16 in emergency
(really, panic) mode; it deletes all files and nested subdirectories
in an entire remote tree. Luckily, this was very easy to do given
all the reuse that Example 13-16 inherits from the
FtpTools
superclass. Here, we
just have to define the extension for recursive remote deletions.
Even in tactical mode like this, OOP can be a decided
advantage.
#!/bin/env python
"""
##############################################################################
extend the FtpTools class to delete files and subdirectories from a remote
directory tree; supports nested directories too; depends on the dir()
command output format, which may vary on some servers! - see Python's
Tools/Scripts/ftpmirror.py for hints; extend me for remote tree downloads;
##############################################################################
"""

from ftptools import FtpTools

class CleanAll(FtpTools):
    """
    delete an entire remote tree of subdirectories
    """
    def __init__(self):
        self.fcount = self.dcount = 0

    def getlocaldir(self):
        return None  # irrelevant here

    def getcleanall(self):
        return True  # implied here

    def cleanDir(self):
        """
        for each item in current remote directory,
        del simple files, recur into and then del subdirectories
        the dir() ftp call passes each line to a func or method
        """
        lines = []                                   # each level has own lines
        self.connection.dir(lines.append)            # list current remote dir
        for line in lines:
            parsed  = line.split()                   # split on whitespace
            permiss = parsed[0]                      # assume 'drw... ... filename'
            fname   = parsed[-1]
            if fname in ('.', '..'):                 # some include cwd and parent
                continue
            elif permiss[0] != 'd':                  # simple file: delete
                print('file', fname)
                self.connection.delete(fname)
                self.fcount += 1
            else:                                    # directory: recur, del
                print('directory', fname)
                self.connection.cwd(fname)           # chdir into remote dir
                self.cleanDir()                      # clean subdirectory
                self.connection.cwd('..')            # chdir remote back up
                self.connection.rmd(fname)           # delete empty remote dir
                self.dcount += 1
                print('directory exited')

if __name__ == '__main__':
    ftp = CleanAll()
    ftp.configTransfer(site='learning-python.com', rdir='training', user='lutz')
    ftp.run(cleanTarget=ftp.cleanDir)
    print('Done:', ftp.fcount, 'files and', ftp.dcount, 'directories cleaned.')
Besides again being recursive in order to handle arbitrarily shaped trees, the main trick employed here is to parse the output of a remote directory listing. The FTP nlst call used earlier gives us a simple list of filenames; here, we use dir to also get file detail lines like these:
C:\...\PP4E\Internet\Ftp> ftp learning-python.com
ftp> cd training
ftp> dir
drwxr-xr-x   11 5693094  450   4096 May  4 11:06 .
drwx---r-x   19 5693094  450   8192 May  4 10:59 ..
-rw----r--    1 5693094  450  15825 May  4 11:02 2009-public-classes.htm
-rw----r--    1 5693094  450  18084 May  4 11:02 2010-public-classes.html
drwx---r-x    3 5693094  450   4096 May  4 11:02 books
-rw----r--    1 5693094  450   3783 May  4 11:02 calendar-save-aug09.html
-rw----r--    1 5693094  450   3923 May  4 11:02 calendar.html
drwx---r-x    2 5693094  450   4096 May  4 11:02 images
-rw----r--    1 5693094  450   6143 May  4 11:02 index.html
...lines omitted...
This output format is potentially server-specific, so check this on your own server before relying on this script. For this Unix ISP, if the first character of the first item on a line is “d”, the filename at the end of the line names a remote directory. To parse, the script simply splits each line on whitespace and takes the first and last items.
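The parsing rule just described can be tried without a server. The following is a minimal sketch and not part of the book's script: the helper name classify_listing_line is hypothetical, and it assumes Unix-style listing lines like those shown above, in which the filename is the last whitespace-separated field.

```python
def classify_listing_line(line):
    """
    Split one Unix-style 'dir' listing line on whitespace and return
    (is_directory, filename), or None for the '.' and '..' entries.
    Assumes the name is the last field, so names containing spaces
    are not handled by this sketch.
    """
    parsed = line.split()
    permiss, fname = parsed[0], parsed[-1]
    if fname in ('.', '..'):                 # skip cwd and parent entries
        return None
    return (permiss[0] == 'd', fname)        # leading 'd' marks a directory

sample = [
    'drwxr-xr-x 11 5693094 450 4096 May 4 11:06 .',
    'drwx---r-x 3 5693094 450 4096 May 4 11:02 books',
    '-rw----r-- 1 5693094 450 6143 May 4 11:02 index.html',
]
results = [classify_listing_line(line) for line in sample]
print(results)
```

Because the name is taken from the last field only, a server that lists filenames with embedded spaces, or that uses a different listing layout, would need a different rule.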
Notice how this script, like others before it, must skip the symbolic “.” and “..” current and parent directory names in listings to work properly for this server. Oddly, this can vary per server as well; one of the servers I used for this book’s examples, for instance, does not include these special names in listings. We can verify by running ftplib at the interactive prompt, as though it were a portable FTP client interface:
C:\...\PP4E\Internet\Ftp> python
>>> from ftplib import FTP
>>> f = FTP('ftp.rmi.net')
>>> f.login('lutz', 'xxxxxxxx')             # output lines omitted
>>> for x in f.nlst()[:3]: print(x)         # no . or .. in listings
...
2004-longmont-classes.html
2005-longmont-classes.html
2006-longmont-classes.html
>>> L = []
>>> f.dir(L.append)                         # ditto for detailed list
>>> for x in L[:3]: print(x)
...
-rw-r--r--   1 ftp   ftp   8173 Mar 19  2006 2004-longmont-classes.html
-rw-r--r--   1 ftp   ftp   9739 Mar 19  2006 2005-longmont-classes.html
-rw-r--r--   1 ftp   ftp    805 Jul  8  2006 2006-longmont-classes.html
On the other hand, the server I’m using in this section does include the special dot names; to be robust, our scripts must skip over these names in remote directory listings just in case they’re run against a server that includes them (here, the test is required to avoid falling into an infinite recursive loop!). We don’t need to care about local directory listings because Python’s os.listdir never includes “.” or “..” in its result, but things are not quite so consistent in the “Wild West” that is the Internet today:
>>> f = FTP('learning-python.com')
>>> f.login('lutz', 'xxxxxxxx')             # output lines omitted
>>> for x in f.nlst()[:5]: print(x)         # includes . and .. here
...
.
..
.hcc.thumbs
2009-public-classes.htm
2010-public-classes.html
>>> L = []
>>> f.dir(L.append)                         # ditto for detailed list
>>> for x in L[:5]: print(x)
...
drwx---r-x   19 5693094  450   8192 May  4 10:59 .
drwx---r-x   19 5693094  450   8192 May  4 10:59 ..
drwx------    2 5693094  450   4096 Feb 18 05:38 .hcc.thumbs
-rw----r--    1 5693094  450  15824 May  1 14:39 2009-public-classes.htm
-rw----r--    1 5693094  450  18083 May  4 09:05 2010-public-classes.html
The output of our clean-all script in action follows; it shows up in the system console window where the script is run. You might be able to achieve the same effect with an “rm -rf” Unix shell command in an SSH or Telnet window on some servers, but the Python script runs on the client and requires no remote access other than basic FTP:
C:\PP4E\Internet\Ftp\Mirror> cleanall.py
Password for lutz on learning-python.com:
connecting...
file 2009-public-classes.htm
file 2010-public-classes.html
file Learning-Python-interview.doc
file Python-registration-form-010.pdf
file PythonPoweredSmall.gif
directory _derived
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn.gif
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_deepblue100_vbtn.gif
directory _vti_cnf
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn.gif
file 2009-public-classes.htm_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_DeepBlue100_vbtn_p.gif
file 2010-public-classes.html_cmp_deepblue100_vbtn.gif
directory exited
directory exited
...lines omitted...
file priorclients.html
file public_classes.htm
file python_conf_ora.gif
file topics.html
Done: 366 files and 18 directories cleaned.
It is possible to extend this remote tree-cleaner to also download a remote tree with subdirectories: rather than deleting, as you walk the remote tree simply create a local directory to match a remote one, and download nondirectory files. We’ll leave this final step as a suggested exercise, though, partly because its dependence on the format produced by server directory listings makes it complex to be robust and partly because this use case is less common for me—in practice, I am more likely to maintain a site on my PC and upload to the server than to download a tree.
If you do wish to experiment with a recursive download, though, be sure to consult the script Tools\Scripts\ftpmirror.py in Python’s install or source tree for hints. That script attempts to download a remote directory tree by FTP, and allows for various directory listing formats which we’ll skip here in the interest of space. For our purposes, it’s time to move on to the next protocol on our tour: Internet email.
Some of the other most common, higher-level Internet protocols have to do with reading and sending email messages: POP and IMAP for fetching email from servers, SMTP for sending new messages, and other formalisms such as RFC822 for specifying email message content and format. You don’t normally need to know about such acronyms when using common email tools, but internally, programs like Microsoft Outlook and webmail systems generally talk to POP and SMTP servers to do your bidding.
Like FTP, email ultimately consists of formatted commands and byte streams shipped over sockets and ports (port 110 for POP; 25 for SMTP). Regardless of the nature of its content and attachments, an email message is little more than a string of bytes sent and received through sockets. But also like FTP, Python has standard library modules to simplify all aspects of email processing:
poplib and imaplib for fetching email
smtplib for sending email
The email package for parsing and composing email
These modules are related: for nontrivial messages, we typically use email to parse mail text which has been fetched with poplib and use email to compose mail text to be sent with smtplib. The email package also handles tasks such as address parsing, date and time formatting, attachment formatting and extraction, and encoding and decoding of email content (e.g., uuencode, Base64). Additional modules handle more specific tasks (e.g., mimetypes to map filenames to and from content types).
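A few of the utility tasks just listed can be tried in isolation. The following is a minimal sketch using the standard library's email.utils and mimetypes modules; the name and address are made-up sample data, and the calls shown are only a small selection of what these modules offer.

```python
import mimetypes
from email.utils import parseaddr, formataddr, formatdate

# address parsing and formatting (sample name and address are hypothetical)
name, addr = parseaddr('Mark Lutz <lutz@example.com>')
rebuilt = formataddr((name, addr))

# date formatting: seconds since the epoch to a standard mail date string
stamp = formatdate(0, usegmt=True)

# filename to content type, and content type back to a filename extension
ctype, encoding = mimetypes.guess_type('index.html')
ext = mimetypes.guess_extension('text/plain')

print(name, addr, rebuilt, stamp, ctype, ext, sep='\n')
```

The extension guessed for a content type can depend on the platform's type tables, so scripts should not rely on one particular answer there.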
In the next few sections, we explore the POP and SMTP interfaces for fetching and sending email from and to servers, and the email package interfaces for parsing and composing email message text. Other email interfaces in Python are analogous and are documented in the Python library reference manual.[51]
In the prior sections of this chapter, we studied in some depth how Unicode encodings can impact scripts using Python’s ftplib FTP tools, because this illustrates the implications of Python 3.X’s Unicode string model for real-world programming. In short:
All binary mode transfers should open local output and input files in binary mode (modes wb and rb).
Text-mode downloads should open local output files in text mode with explicit encoding names (mode w, with an encoding argument that defaults to latin1 within ftplib itself).
Text-mode uploads should open local input files in binary mode (mode rb).
The prior sections describe why these rules are in force. The last two points here differ for scripts written originally for Python 2.X. As you might expect, given that the underlying sockets transfer byte strings today, the email story is somewhat convoluted for Unicode in Python 3.X as well. As a brief preview:
The poplib module returns fetched email text in bytes string form. Command text sent to the server is encoded per UTF8 internally, but replies are returned as raw binary bytes and not decoded into str text.
The smtplib module accepts email content to send as str strings. Internally, message text passed in str form is encoded to binary bytes for transmission using the ascii encoding scheme. Passing an already encoded bytes string to the send call may allow more explicit control.
The email package produces Unicode str strings containing plain text when generating full email text for sending with smtplib and accepts optional encoding specifications for messages and their parts, which it applies according to email standard rules. Message headers may also be encoded per email, MIME, and Unicode conventions.
The email package in 3.1 currently requires raw email byte strings of the type fetched with poplib to be decoded into Unicode str strings as appropriate before they can be passed in to be parsed into a message object. This pre-parse decoding might be done by a default, user preference, mail headers inspection, or intelligent guess. Because this requirement raises difficult issues for package clients, it may be dropped in a future version of email and Python.
The email package returns most message components as str strings, though parts content decoded by Base64 and other email encoding schemes may be returned as bytes strings, parts fetched without such decoding may be str or bytes, and some str string parts are internally encoded to bytes with scheme raw-unicode-escape before processing. Message headers may be decoded by the package on request as well.
If you’re migrating email scripts (or your mindset) from 2.X, you’ll need to treat email text fetched from a server as byte strings, and decode it to str before passing it along for parsing; scripts that send or compose email are generally unaffected (and this may be the majority of Python email-aware scripts), though content may have to be treated specially if it may be returned as byte strings.
This is the story in Python 3.1, which is of course prone to change over time. We’ll see how these email constraints translate into code as we move along in this section. Suffice it to say, the text on the Internet is not as simple as it used to be, though it probably shouldn’t have been anyhow.
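As a preview of how the decode-before-parse rule looks in code, here is a minimal sketch under stated assumptions: the bytes lines are made-up sample data shaped like what poplib's retr call returns, and utf8 is just one possible choice of decoding. Python 3.2 and later also provide email.message_from_bytes, which accepts the bytes directly; it is shown last for comparison.

```python
import email

# bytes lines shaped like those returned by poplib's retr (sample data)
raw_lines = [
    b'From: sender@example.com',
    b'To: receiver@example.com',
    b'Subject: testing',
    b'',
    b'Testing Python mail tools.',
]

# decode first (utf8 is an assumed choice here), then parse the str text
text = b'\n'.join(raw_lines).decode('utf8')
msg = email.message_from_string(text)
subject = msg['Subject']
body = msg.get_payload()
print(subject, '=>', body)

# in Python 3.2 and later, the bytes can also be parsed without decoding
msg2 = email.message_from_bytes(b'\n'.join(raw_lines))
print(msg2['From'])
```

For real mail, the right decoding is not always utf8; as noted above, it might come from a default, a user preference, or the message's own headers.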
I confess: up until just before 2000, I took a lowest-common-denominator approach to email. I preferred to check my messages by Telnetting to my ISP and using a simple command-line email interface. Of course, that’s not ideal for mail with attachments, pictures, and the like, but its portability was staggering—because Telnet runs on almost any machine with a network link, I was able to check my mail quickly and easily from anywhere on the planet. Given that I make my living traveling around the world teaching Python classes, this wild accessibility was a big win.
As with website maintenance, times have changed on this front. Somewhere along the way, most ISPs began offering web-based email access with similar portability and dropped Telnet altogether. When my ISP took away Telnet access, however, they also took away one of my main email access methods. Luckily, Python came to the rescue again—by writing email access scripts in Python, I could still read and send email from any machine in the world that has Python and an Internet connection. Python can be as portable a solution as Telnet, but much more powerful.
Moreover, I can still use these scripts as an alternative to tools suggested by the ISP. Besides my not being fond of delegating control to commercial products of large companies, closed email tools impose choices on users that are not always ideal and may sometimes fail altogether. In many ways, the motivation for coding Python email scripts is the same as it was for the larger GUIs in Chapter 11: the scriptability of Python programs can be a decided advantage.
For example, Microsoft Outlook historically and by default has preferred to download mail to your PC and delete it from the mail server as soon as you access it. This keeps your email box small (and your ISP happy), but it isn’t exactly friendly to people who travel and use multiple machines along the way—once accessed, you cannot get to a prior email from any machine except the one to which it was initially downloaded. Worse, the web-based email interfaces offered by my ISPs have at times gone offline completely, leaving me cut off from email (and usually at the worst possible time).
The next two scripts represent one first-cut solution to such portability and reliability constraints (we’ll see others in this and later chapters). The first, popmail.py, is a simple mail reader tool, which downloads and prints the contents of each email in an email account. This script is admittedly primitive, but it lets you read your email on any machine with Python and sockets; moreover, it leaves your email intact on the server, and isn’t susceptible to webmail outages. The second, smtpmail.py, is a one-shot script for writing and sending a new email message that is as portable as Python itself.
Later in this chapter, we’ll implement an interactive console-based email client (pymail), and later in this book we’ll code a full-blown GUI email tool (PyMailGUI) and a web-based email program of our own (PyMailCGI). For now, we’ll start with the basics.
Before we get to the scripts, let’s first take a look at a common module they import and use. The module in Example 13-17 is used to configure email parameters appropriately for a particular user. It’s simply a collection of assignments to variables used by mail programs that appear in this book; each major mail client has its own version, to allow content to vary. Isolating these configuration settings in this single module makes it easy to configure the book’s email programs for a particular user, without having to edit actual program logic code.
If you want to use any of this book’s email programs to do mail processing of your own, be sure to change its assignments to reflect your servers, account usernames, and so on (as shown, they refer to email accounts used for developing this book). Not all scripts use all of these settings; we’ll revisit this module in later examples to explain more of them.
Note that some ISPs may require that you be connected directly to their systems in order to use their SMTP servers to send mail. For example, when connected directly by dial-up in the past, I could use my ISP’s server directly, but when connected via broadband, I had to route requests through a cable Internet provider. You may need to adjust these settings to match your configuration; see your ISP to obtain the required POP and SMTP servers. Also, some SMTP servers check domain name validity in addresses, and may require an authenticating login step—see the SMTP section later in this chapter for interface details.
"""
user configuration settings for various email programs (pymail/mailtools version);
email scripts get their server names and other email config options from this
module: change me to reflect your server names and mail preferences;
"""

#------------------------------------------------------------------------------
# (required for load, delete: all) POP3 email server machine, user
#------------------------------------------------------------------------------

popservername = 'pop.secureserver.net'
popusername   = '[email protected]'

#------------------------------------------------------------------------------
# (required for send: all) SMTP email server machine name
# see Python smtpd module for a SMTP server class to run locally;
#------------------------------------------------------------------------------

smtpservername = 'smtpout.secureserver.net'

#------------------------------------------------------------------------------
# (optional: all) personal information used by clients to fill in mail if set;
# signature -- can be a triple-quoted block, ignored if empty string;
# address -- used for initial value of "From" field if not empty,
# no longer tries to guess From for replies: this had varying success;
#------------------------------------------------------------------------------

myaddress   = '[email protected]'
mysignature = ('Thanks,\n'
               '--Mark Lutz (http://learning-python.com/books)')

#------------------------------------------------------------------------------
# (optional: mailtools) may be required for send; SMTP user/password if
# authenticated; set user to None or '' if no login/authentication is
# required; set pswd to name of a file holding your SMTP password, or
# an empty string to force programs to ask (in a console, or GUI);
#------------------------------------------------------------------------------

smtpuser       = None                         # per your ISP
smtppasswdfile = ''                           # set to '' to be asked

#------------------------------------------------------------------------------
# (optional: mailtools) name of local one-line text file with your pop
# password; if empty or file cannot be read, pswd is requested when first
# connecting; pswd not encrypted: leave this empty on shared machines;
#------------------------------------------------------------------------------

poppasswdfile = r'c:\temp\pymailgui.txt'      # set to '' to be asked

#------------------------------------------------------------------------------
# (required: mailtools) local file where sent messages are saved by some clients;
#------------------------------------------------------------------------------

sentmailfile = r'.\sentmail.txt'              # . means in current working dir

#------------------------------------------------------------------------------
# (required: pymail, pymail2) local file where pymail saves pop mail on request;
#------------------------------------------------------------------------------

savemailfile = r'c:\temp\savemail.txt'        # not used in PyMailGUI: dialog

#------------------------------------------------------------------------------
# (required: pymail, mailtools) fetchEncoding is the Unicode encoding used to
# decode fetched full message bytes, and to encode and decode message text if
# stored in text-mode save files; see Chapter 13 for details: this is a limited
# and temporary approach to Unicode encodings until a new bytes-friendly email
# package is developed; headersEncodeTo is for sent headers: see chapter13;
#------------------------------------------------------------------------------

fetchEncoding   = 'utf8'      # 4E: how to decode and store message text (or latin1?)
headersEncodeTo = None        # 4E: how to encode non-ASCII headers sent (None=utf8)

#------------------------------------------------------------------------------
# (optional: mailtools) the maximum number of mail headers or messages to
# download on each load request; given this setting N, mailtools fetches at
# most N of the most recently arrived mails; older mails outside this set are
# not fetched from the server, but are returned as empty/dummy emails; if this
# is assigned to None (or 0), loads will have no such limit; use this if you
# have very many mails in your inbox, and your Internet or mail server speed
# makes full loads too slow to be practical; some clients also load only
# newly-arrived emails, but this setting is independent of that feature;
#------------------------------------------------------------------------------

fetchlimit = 25               # 4E: maximum number headers/emails to fetch on loads
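To make the fetchlimit comment concrete, here is a minimal sketch of how a client might turn that setting into a set of message numbers to download. It is an illustration only, not the mailtools implementation: the function name message_numbers_to_fetch is hypothetical, and the sketch assumes that POP message numbers start at 1 and follow arrival order, so the highest numbers are the most recent mails.

```python
def message_numbers_to_fetch(msg_count, fetchlimit):
    """
    Return the POP message numbers to download, given the number of
    messages in the mailbox and a fetchlimit setting; None or 0 means
    no limit (hypothetical helper, assumes numbers follow arrival order).
    """
    if not fetchlimit:                              # None or 0: fetch all
        return list(range(1, msg_count + 1))
    first = max(1, msg_count - fetchlimit + 1)      # keep the newest N only
    return list(range(first, msg_count + 1))

print(message_numbers_to_fetch(30, 25))     # numbers 6 through 30
print(message_numbers_to_fetch(3, 25))      # fewer mails than the limit
print(message_numbers_to_fetch(3, None))    # no limit
```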
On to reading email in Python: the script in Example 13-18 employs Python’s standard poplib module, an implementation of the client-side interface to POP, the Post Office Protocol. POP is a well-defined and widely available way to fetch email from servers over sockets. This script connects to a POP server to implement a simple yet portable email download and display tool.
#!/usr/local/bin/python
"""
##############################################################################
use the Python POP3 mail interface module to view your POP email account
messages; this is just a simple listing--see pymail.py for a client with
more user interaction features, and smtpmail.py for a script which sends
mail; POP is used to retrieve mail, and runs on a socket using port number
110 on the server machine, but Python's poplib hides all protocol details;
to send mail, use the smtplib module (or os.popen('mail...')). see also:
imaplib module for IMAP alternative, PyMailGUI/PyMailCGI for more features;
##############################################################################
"""

import poplib, getpass, sys, mailconfig

mailserver = mailconfig.popservername      # ex: 'pop.rmi.net'
mailuser   = mailconfig.popusername        # ex: 'lutz'
mailpasswd = getpass.getpass('Password for %s?' % mailserver)

print('Connecting...')
server = poplib.POP3(mailserver)
server.user(mailuser)                      # connect, log in to mail server
server.pass_(mailpasswd)                   # pass is a reserved word

try:
    print(server.getwelcome())             # print returned greeting message
    msgCount, msgBytes = server.stat()
    print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
    print(server.list())
    print('-' * 80)
    input('[Press Enter key]')

    for i in range(msgCount):
        hdr, message, octets = server.retr(i+1)    # octets is byte count
        for line in message: print(line.decode())  # retrieve, print all mail
        print('-' * 80)                            # mail text is bytes in 3.x
        if i < msgCount - 1:
            input('[Press Enter key]')             # mail box locked till quit
finally:                                           # make sure we unlock mbox
    server.quit()                                  # else locked till timeout
print('Bye.')
Though primitive, this script illustrates the basics of reading email in Python. To establish a connection to an email server, we start by making an instance of the poplib.POP3 object, passing in the email server machine’s name as a string:
server = poplib.POP3(mailserver)
If this call doesn’t raise an exception, we’re connected (by socket) to the POP server listening on POP port number 110 at the machine where our email account lives.
The next thing we need to do before fetching messages is tell the server our username and password; notice that the password method is called pass_. Without the trailing underscore, pass would name a reserved word and trigger a syntax error:
server.user(mailuser)                      # connect, log in to mail server
server.pass_(mailpasswd)                   # pass is a reserved word
To keep things simple and relatively secure, this script always asks for the account password interactively; the getpass module we met in the FTP section of this chapter is used to input but not display a password string typed by the user.
Once we’ve told the server our username and password, we’re free to fetch mailbox information with the stat method (number of messages, total bytes among all messages) and fetch the full text of a particular message with the retr method (pass the message number; they start at 1). The full text includes all headers, followed by a blank line, followed by the mail’s text and any attached parts. The retr call sends back a tuple that includes a list of line strings representing the content of the mail:
msgCount, msgBytes = server.stat()
hdr, message, octets = server.retr(i+1)    # octets is byte count
We close the email server connection by calling the POP object’s quit method:
server.quit()                              # else locked till timeout
Notice that this call appears inside the finally clause of a try statement that wraps the bulk of the script. To minimize complications associated with changes, POP servers lock your email inbox between the time you first connect and the time you close your connection (or until an arbitrary, system-defined timeout expires). Because the POP quit method also unlocks the mailbox, it’s crucial that we do this before exiting, whether an exception is raised during email processing or not. By wrapping the action in a try/finally statement, we guarantee that the script calls quit on exit to unlock the mailbox to make it accessible to other processes (e.g., delivery of incoming email).
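The same guarantee can be packaged for reuse. The following is a minimal sketch, not code from the book: the pop_session helper is hypothetical, and the FakeServer class is a stand-in used only so that the example runs without a real POP server. It shows that quit is still called when the mail processing code raises an exception.

```python
from contextlib import contextmanager

@contextmanager
def pop_session(server):
    """Hypothetical wrapper: always call server.quit() on the way out."""
    try:
        yield server
    finally:
        server.quit()                      # unlock mailbox, error or not

class FakeServer:
    """Stand-in for a connected poplib.POP3 object (illustration only)."""
    def __init__(self):
        self.quit_called = False
    def stat(self):
        raise RuntimeError('simulated failure during processing')
    def quit(self):
        self.quit_called = True

fake = FakeServer()
try:
    with pop_session(fake) as server:
        server.stat()                      # raises inside the with block
except RuntimeError as exc:
    print('caught:', exc)
print('quit called:', fake.quit_called)
```

With a real server, the object passed in would be the logged-in poplib.POP3 instance; the try/finally form in the script above is equivalent and needs no helper.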
Here is the popmail script of Example 13-18 in action, displaying two messages in my account’s mailbox on machine pop.secureserver.net, the domain name of the mail server machine used by the ISP hosting my learning-python.com domain name, as configured in the module mailconfig. To keep this output reasonably sized, I’ve omitted or truncated a few irrelevant message header lines here, including most of the Received: headers that chronicle an email’s journey; run this on your own to see all the gory details of raw email text:
C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 2 mail messages in 3268 bytes
(b'+OK ', [b'1 1860', b'2 1408'], 16)
--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 -0000
X-IronPort-Anti-Spam-Result: AskCAG4r4UvRVllAlGdsb2JhbACDF44FjCkVAQEBAQkLCAkRAx+
Received: from 72.236.109.185 by webmail.earthlink.net with HTTP; Wed, 5 May 201
Message-ID: <27293081.1273073376592.JavaMail.root@mswamui-thinleaf.atl.sa.earthl
Date: Wed, 5 May 2010 11:29:36 -0400 (EDT)
From: [email protected]
Reply-To: [email protected]
To: [email protected]
Subject: I'm a Lumberjack, and I'm Okay
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Mailer: EarthLink Zoo Mail 1.0
X-ELNK-Trace: 309f369105a89a174e761f5d55cab8bca866e5da7af650083cf64d888edc8b5a35
X-Originating-IP: 209.86.224.51
X-Nonspam: None
I cut down trees, I skip and jump,
I like to press wild flowers...
--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 17482 invoked from network); 5 May 2010 15:33:47 -0000
X-IronPort-Anti-Spam-Result: AlIBAIss4UthSoc7mWdsb2JhbACDF44FjD4BAQEBAQYNCgcRIq1
Received: (qmail 4009 invoked by uid 99); 5 May 2010 15:33:47 -0000
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"
X-Originating-IP: 72.236.109.185
User-Agent: Web-Based Email 5.2.13
Message-Id: <20100505083347.deec9532fd532622acfef00cad639f45.0371a89d29.wbe@emai
From: [email protected]
To: [email protected]
Cc: [email protected]
Subject: testing
Date: Wed, 05 May 2010 08:33:47 -0700
Mime-Version: 1.0
X-Nonspam: None
Testing Python mail tools.
--------------------------------------------------------------------------------
Bye.
This user interface is about as simple as it could be: after connecting to the server, it prints the complete and raw full text of one message at a time, pausing between each until you press the Enter key. The input built-in is called to wait for the key press between message displays. The pause keeps messages from scrolling off the screen too fast; to make them visually distinct, emails are also separated by lines of dashes.
We could make the display fancier (e.g., we can use the email package to parse headers, bodies, and attachments; watch for examples in this and later chapters), but here we simply display the whole message that was sent. This works well for simple mails like these two, but it can be inconvenient for larger messages with attachments; we’ll improve on this in later clients.
This book won’t cover the full set of headers that may appear in emails, but we’ll make use of some along the way. For example, the X-Mailer header line, if present, typically identifies the sending program; we’ll use it later to identify Python-coded email senders we write. The more common headers such as From and Subject are more crucial to a message. In fact, a variety of extra header lines can be sent in a message’s text. The Received headers, for example, trace the machines that a message passed through on its way to the target mailbox.
Because popmail prints the entire raw text of a message, you see all headers here, but you usually see only a few by default in end-user-oriented mail GUIs such as Outlook and webmail pages. The raw text here also makes apparent the email structure we noted earlier: an email in general consists of a set of headers like those here, followed by a blank line, which is followed by the mail’s main text, though as we’ll see later, they can be more complex if there are alternative parts or attachments.
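The headers, blank line, and body layout can be seen directly by splitting a fetched line list at its first empty line. This is a minimal sketch for simple single-part mails only: the helper name split_message is hypothetical, the sample lines are made up, and real code would normally hand this job to the email package instead.

```python
def split_message(lines):
    """
    Split a list of bytes lines (as returned by poplib's retr) into
    (header_lines, body_lines) at the first blank line; hypothetical
    helper, suitable only for simple single-part messages.
    """
    try:
        blank = lines.index(b'')           # first empty line ends headers
    except ValueError:
        return lines, []                   # no blank line: headers only
    return lines[:blank], lines[blank + 1:]

sample = [
    b'From: sender@example.com',
    b'Subject: testing',
    b'Mime-Version: 1.0',
    b'',
    b'Testing Python mail tools.',
    b'',
]
headers, body = split_message(sample)
print(len(headers), 'header lines')
for line in body:
    print(line.decode())
```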
The script in Example 13-18 never deletes mail from the server. Mail is simply retrieved and printed and will be shown again the next time you run the script (barring deletion in another tool, of course). To really remove mail permanently, we need to call other methods (e.g., server.dele(msgnum)), but such a capability is best deferred until we develop more interactive mail tools.
Notice how the reader script decodes each mail content line with line.decode into a str string for display; as mentioned earlier, poplib returns content as bytes strings in 3.X. In fact, if we change the script to not decode, this becomes more obvious in its output:
[Press Enter key]
...assorted lines omitted...
b'Date: Wed, 5 May 2010 11:29:36 -0400 (EDT)'
b'From: [email protected]'
b'Reply-To: [email protected]'
b'To: [email protected]'
b"Subject: I'm a Lumberjack, and I'm Okay"
b'Mime-Version: 1.0'
b'Content-Type: text/plain; charset=UTF-8'
b'Content-Transfer-Encoding: 7bit'
b'X-Mailer: EarthLink Zoo Mail 1.0'
b''
b'I cut down trees, I skip and jump,'
b'I like to press wild flowers...'
b''
As we’ll see later, we’ll need to decode similarly in order to parse this text with email tools. The next section exposes the bytes-based interface as well.
If you don’t mind typing code and reading POP server messages, it’s possible to use the Python interactive prompt as a simple email client, too. The following session uses two additional interfaces we’ll apply in later examples:
conn.list()
    Returns a list of “message-number message-size” strings.
conn.top(N, 0)
    Retrieves just the header text portion of message number N.
The top call also returns a tuple that includes the list of line strings sent back; its second argument tells the server how many additional lines after the headers to send, if any. If all you need are header details, top can be much quicker than the full text fetch of retr, provided your mail server implements the TOP command (most do):
C:\...\PP4E\Internet\Email> python
>>> from poplib import POP3
>>> conn = POP3('pop.secureserver.net')      # connect to server
>>> conn.user('[email protected]')             # log in to account
b'+OK '
>>> conn.pass_('xxxxxxxx')
b'+OK '
>>> conn.stat()                              # num mails, num bytes
(2, 3268)
>>> conn.list()
(b'+OK ', [b'1 1860', b'2 1408'], 16)
>>> conn.top(1, 0)
(b'+OK 1860 octets ', [b'Received: (qmail 7690 invoked from network); 5 May 2010
...lines omitted...
b'X-Originating-IP: 209.86.224.51', b'X-Nonspam: None', b'', b''], 1827)
>>> conn.retr(1)
(b'+OK 1860 octets ', [b'Received: (qmail 7690 invoked from network); 5 May 2010
...lines omitted...
b'X-Originating-IP: 209.86.224.51', b'X-Nonspam: None', b'',
b'I cut down trees, I skip and jump,', b'I like to press wild flowers...',
b'', b''], 1898)
>>> conn.quit()
b'+OK '
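The tuple returned by the list call in this session is easy to turn into numbers. The following is a minimal sketch with a hypothetical helper name, parse_list_response; it uses the response shown above as sample data and assumes each entry has the two-field “number size” form.

```python
def parse_list_response(response):
    """
    Convert a poplib list() result into (message_number, size) int pairs;
    hypothetical helper, assumes 'number size' entries as bytes strings.
    """
    status, entries, octets = response
    pairs = []
    for entry in entries:
        number, size = entry.decode().split()   # bytes -> str -> two fields
        pairs.append((int(number), int(size)))
    return pairs

response = (b'+OK ', [b'1 1860', b'2 1408'], 16)   # copied from the session
sizes = parse_list_response(response)
total = sum(size for (number, size) in sizes)
print(sizes, total)
```

For this mailbox, the sizes sum to the same byte total that stat reported in the session.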
Printing the full text of a message at the interactive prompt is easy once it’s fetched: simply decode each line to a normal string as it is printed, as our popmail script did, or concatenate the line strings returned by retr or top, adding a newline between them; any of the following will suffice for an open POP server object:
>>> info, msg, oct = connection.retr(1)          # fetch first email in mailbox

>>> for x in msg: print(x.decode())              # four ways to display message lines
>>> print(b'\n'.join(msg).decode())
>>> x = [print(x.decode()) for x in msg]
>>> x = list(map(print, map(bytes.decode, msg)))
Parsing email text to extract headers and components is more complex, especially for mails with attached and possibly encoded parts, such as images. As we’ll see later in this chapter, the standard library’s email package can parse the mail’s full or headers text after it has been fetched with poplib (or imaplib).
See the Python library manual for details on other POP module tools. As of Python 2.4, there is also a POP3_SSL class in the poplib module that connects to the server over an SSL-encrypted socket on port 995 by default (the standard port for POP over SSL). It provides an identical interface, but it uses secure sockets for the conversation where supported by servers.
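As a rough sketch of the encrypted variant, the function below fetches one message’s headers through POP3_SSL. The function name fetch_headers and its pop_class parameter are assumptions added here so the logic can be exercised with a stand-in class; the host, user, and password would be your own account details.

```python
import poplib

def fetch_headers(host, user, password, msgnum=1, pop_class=poplib.POP3_SSL):
    """Return the header lines of one message over an encrypted connection."""
    conn = pop_class(host)                     # POP3_SSL uses port 995 by default
    try:
        conn.user(user)
        conn.pass_(password)
        resp, lines, octets = conn.top(msgnum, 0)      # headers only
        return [line.decode(errors='replace') for line in lines]
    finally:
        conn.quit()                            # unlock the mailbox and disconnect
```

Because POP3_SSL offers the same interface as POP3, passing pop_class=poplib.POP3 should be the only change needed for an unencrypted server.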
There is a proverb in hackerdom that states that every useful computer program eventually grows complex enough to send email. Whether such wisdom rings true or not in practice, the ability to automatically initiate email from within a program is a powerful tool.
For instance, test systems can automatically email failure reports, user interface programs can ship purchase orders to suppliers by email, and so on. Moreover, a portable Python mail script could be used to send messages from any computer in the world with Python and an Internet connection that supports standard email protocols. Freedom from dependence on mail programs like Outlook is an attractive feature if you happen to make your living traveling around teaching Python on all sorts of computers.
Luckily, sending email from within a Python script is just as easy as reading it. In fact, there are at least four ways to do so:
os.popen to launch a command-line mail program
On some systems, you can send email from a script with a call of the form:
os.popen('mail -s "xxx" [email protected]', 'w').write(text)
As we saw earlier in the book, the popen tool runs the command-line string passed to its first argument, and returns a file-like object connected to it. If we use an open mode of w, we are connected to the command’s standard input stream; here, we write the text of the new mail message to the standard Unix mail command-line program. The net effect is as if we had run mail interactively, but it happens inside a running Python script.
sendmail program
The open source sendmail program offers another way to initiate mail from a program. Assuming it is installed and configured on your system, you can launch it using Python tools like the os.popen call of the previous paragraph.
smtplib Python module
Python’s standard library comes with support for the client-side interface to SMTP (the Simple Mail Transfer Protocol), a higher-level Internet standard for sending mail over sockets. Like the poplib module we met in the previous section, smtplib hides all the socket and protocol details and can be used to send mail on any machine with Python and a suitable socket-based Internet link.
Other tools in the open source library provide higher-level mail handling packages for Python; most build upon one of the prior three techniques.
Of these four options, smtplib is by far the most portable and direct. Using os.popen to spawn a mail program usually works on Unix-like platforms only, not on Windows (it assumes a command-line mail program), and requires spawning one or more processes along the way. And although the sendmail program is powerful, it is also somewhat Unix-biased, complex, and may not be installed even on all Unix-like machines.
By contrast, the smtplib module works on any machine that has Python and an Internet link that supports SMTP access, including Unix, Linux, Mac, and Windows. It sends mail over sockets in-process, instead of starting other programs to do the work. Moreover, SMTP affords us much control over the formatting and routing of email.
Since SMTP is arguably the best option for sending mail from a Python script, let’s explore a simple mailing program that illustrates its interfaces. The Python script shown in Example 13-19 is intended to be used from an interactive command line; it reads a new mail message from the user and sends the new mail by SMTP using Python’s smtplib module.
#!/usr/local/bin/python
"""
###########################################################################
use the Python SMTP mail interface module to send email messages; this
is just a simple one-shot send script--see pymail, PyMailGUI, and
PyMailCGI for clients with more user interaction features; also see
popmail.py for a script that retrieves mail, and the mailtools pkg
for attachments and formatting with the standard library email package;
###########################################################################
"""

import smtplib, sys, email.utils, mailconfig
mailserver = mailconfig.smtpservername         # ex: smtp.rmi.net

From = input('From? ').strip()                 # or import from mailconfig
To   = input('To?   ').strip()                 # ex: [email protected]
Tos  = To.split(';')                           # allow a list of recipients
Subj = input('Subj? ').strip()
Date = email.utils.formatdate()                # curr datetime, rfc2822

# standard headers, followed by blank line, followed by text
text = ('From: %s\nTo: %s\nDate: %s\nSubject: %s\n\n' % (From, To, Date, Subj))

print('Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]')
while True:
    line = sys.stdin.readline()
    if not line:
        break                                  # exit on ctrl-d/z
    #if line[:4] == 'From':
    #    line = '>' + line                     # servers may escape
    text += line

print('Connecting...')
server = smtplib.SMTP(mailserver)              # connect, no log-in step
failed = server.sendmail(From, Tos, text)
server.quit()
if failed:                                     # smtplib may raise exceptions
    print('Failed recipients:', failed)        # too, but let them pass here
else:
    print('No errors.')
print('Bye.')
Most of this script is user interface: it inputs the sender’s address (From), one or more recipient addresses (To, separated by “;” if more than one), and a subject line. The sending date is picked up from the formatdate call in the standard library’s email.utils module, standard header lines are formatted, and the while loop reads message lines until the user types the end-of-file character (Ctrl-Z on Windows, Ctrl-D on Linux).
To be robust, be sure to add a blank line between the header lines and the body in the message’s text; it’s required by the SMTP protocol and some SMTP servers enforce this. Our script conforms by inserting an empty line with \n\n at the end of the string format expression: one \n to terminate the current line and another for a blank line; smtplib expands \n to Internet-style \r\n internally prior to transmission, so the short form is fine here. Later in this chapter, we’ll format our messages with the Python email package, which handles such details for us automatically.
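The header layout just described can be sketched in isolation, with no server involved. The helper name build_text and the example.com addresses below are hypothetical, chosen only to make the blank-line rule visible.

```python
import email.utils

def build_text(sender, recipient, subject, body):
    """Format minimal headers, one blank line, then the body."""
    headers = [
        'From: %s' % sender,
        'To: %s' % recipient,
        'Date: %s' % email.utils.formatdate(),
        'Subject: %s' % subject,
    ]
    # '\n'.join separates the header lines; the trailing '\n\n' ends the
    # last header and adds the empty separator line before the body
    return '\n'.join(headers) + '\n\n' + body

text = build_text('[email protected]', '[email protected]', 'test', 'Hello\n')
head, sep, body = text.partition('\n\n')       # split at the first blank line
print(head)
print(repr(body))
```

Splitting at the first blank line, as partition does here, is also roughly how a reader of the raw text tells headers from body.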
The rest of the script is where all the SMTP magic occurs: to send a mail by SMTP, simply run these two sorts of calls:
server = smtplib.SMTP(mailserver)
Make an instance of the SMTP object, passing in the name of the SMTP server that will dispatch the message first. If this doesn’t throw an exception, you’re connected to the SMTP server via a socket when the call returns. Technically, the connect method establishes connection to a server, but the SMTP object calls this method automatically if the mail server name is passed in this way.
failed = server.sendmail(From, Tos, text)
Call the SMTP object’s sendmail method, passing in the sender address, one or more recipient addresses, and the raw text of the message itself with as many standard mail header lines as you care to provide.
When you’re done, be sure to call the object’s quit method to disconnect from the server and finalize the transaction. Notice that, on failure, the sendmail method may either raise an exception or return a dictionary with one entry per recipient address that failed; the script handles the latter case itself but lets exceptions kill the script with a Python error message.
Subtly, calling the server object’s quit method after sendmail raises an exception may or may not work as expected: quit can actually hang until a server timeout if the send fails internally and leaves the interface in an unexpected state. For instance, this can occur on Unicode encoding errors when translating the outgoing mail to bytes per the ASCII scheme (the rset reset request hangs in this case, too). An alternative close method simply closes the client’s sockets without attempting to send a quit command to the server; quit calls close internally as a last step (assuming the quit command can be sent!).
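One way to act on the quit-versus-close distinction is sketched below. This is not how smtpmail is coded; the function name send_once and its smtp_class parameter are assumptions added so that the control flow can be tried with a stand-in class instead of a live server.

```python
import smtplib

def send_once(mailserver, sender, recipients, text, smtp_class=smtplib.SMTP):
    """Connect, send one message, and always release the socket."""
    server = smtp_class(mailserver)
    try:
        failed = server.sendmail(sender, recipients, text)
    except Exception:
        server.close()          # skip the QUIT command; just drop the socket
        raise                   # let the caller see the original error
    else:
        server.quit()           # normal case: polite disconnect
        return failed           # dictionary of refused recipients, if any
```

Calling close instead of quit in the error path trades a clean protocol shutdown for the guarantee that the script will not wait on a server that is no longer listening as expected.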
For advanced usage, SMTP objects provide additional calls not used in this example:
server.login(user, password)
Provides an interface to SMTP servers that require and support authentication; watch for this call to appear as an option in the mailtools package example later in this chapter.
server.starttls([keyfile[, certfile]])
Puts the SMTP connection in Transport Layer Security (TLS) mode; all commands will be encrypted using the Python ssl module’s socket wrapper SSL support, and they assume the server supports this mode.
See the Python library manual for more on these and other calls not covered here.
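The two calls are commonly combined as in the sketch below. Several things here are assumptions rather than facts from this chapter: the function name send_secure, the smtp_class parameter (present only so the sequence can be tried with a stand-in), port 587 as the submission port, and the context argument to starttls, which is the form accepted by recent Python 3.X releases in place of the keyfile and certfile arguments shown above.

```python
import smtplib, ssl

def send_secure(host, user, password, sender, recipients, text,
                port=587, smtp_class=smtplib.SMTP):
    """Encrypt the connection, log in, and send one message."""
    server = smtp_class(host, port)
    try:
        server.starttls(context=ssl.create_default_context())  # encrypt first
        server.login(user, password)                           # then authenticate
        failed = server.sendmail(sender, recipients, text)
        server.quit()
        return failed
    except Exception:
        server.close()          # drop the socket if any step fails
        raise
```

Calling starttls before login matters: it keeps the username and password from crossing the network in the clear.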
Let’s ship a few messages across the world. The smtpmail script is a one-shot tool: each run allows you to send a single new mail message. Like most of the client-side tools in this chapter, it can be run from any computer with Python and an Internet link that supports SMTP (most do, though some public access machines may restrict users to HTTP [Web] access only or require special server SMTP configuration). Here it is running on Windows:
C:\...\PP4E\Internet\Email> smtpmail.py
From? [email protected]
To?   [email protected]
Subj? A B C D E F G
Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]
Fiddle de dum, Fiddle de dee,
Eric the half a bee.
^Z
Connecting...
No errors.
Bye.
This mail is sent to the book’s email account address ([email protected]), so it ultimately shows up in the inbox at my ISP, but only after being routed through an arbitrary number of machines on the Net, and across arbitrarily distant network links. It’s complex at the bottom, but usually, the Internet “just works.”
Notice the From address, though—it’s completely fictitious (as far as I know, at least). It turns out that we can usually provide any From address we like because SMTP doesn’t check its validity (only its general format is checked). Furthermore, unlike POP, there is usually no notion of a username or password in SMTP, so the sender is more difficult to determine. We need only pass email to any machine with a server listening on the SMTP port, and we don’t need an account or login on that machine. Here, the name [email protected] works just fine as the sender; Marketing.Geek.[email protected] might work just as well.
In fact, I didn’t import a From email address from the mailconfig.py module on purpose, because I wanted to be able to demonstrate this behavior; it’s the basis of some of those annoying junk emails that show up in your mailbox without a real sender’s address.[52] Marketers infected with e-millionaire mania will email advertising to all addresses on a list without providing a real From address, to cover their tracks.
Normally, of course, you should use the same To address in the message and the SMTP call and provide your real email address as the From value (that’s the only way people will be able to reply to your message). Moreover, apart from teasing your significant other, sending phony addresses is often just plain bad Internet citizenship. Let’s run the script again to ship off another mail with more politically correct coordinates:
C:\...\PP4E\Internet\Email> smtpmail.py
From? [email protected]
To?   [email protected]
Subj? testing smtpmail
Type message text, end with line=[Ctrl+d (Unix), Ctrl+z (Windows)]
Lovely Spam! Wonderful Spam!
^Z
Connecting...
No errors.
Bye.
At this point, we could run whatever email tool we normally use to access our mailbox to verify the results of these two send operations; the two new emails should show up in our mailbox regardless of which mail client is used to view them. Since we’ve already written a Python script for reading mail, though, let’s put it to use as a verification tool: running the popmail script from the last section reveals our two new messages at the end of the mail list (again, parts of the output have been trimmed to conserve space and protect the innocent here):
C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 4 mail messages in 5326 bytes
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009'], 32)
--------------------------------------------------------------------------------
[Press Enter key]
...first two mails omitted...
Received: (qmail 25683 invoked from network); 6 May 2010 14:12:07 −0000
Received: from unknown (HELO p3pismtp01-018.prod.phx3.secureserver.net) ([10.6.1
(envelope-sender <[email protected]>)
by p3plsmtp06-04.prod.phx3.secureserver.net (qmail-1.03) with SMTP
for <[email protected]>; 6 May 2010 14:12:07 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
for <[email protected]>; Thu, 06 May 2010 10:12:12 −0400
From: [email protected]
To: [email protected]
Date: Thu, 06 May 2010 14:11:07 −0000
Subject: A B C D E F G
Message-ID: <jdlohzf0j8dp8z4x06052010101212@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None
Fiddle de dum, Fiddle de dee,
Eric the half a bee.
--------------------------------------------------------------------------------
[Press Enter key]
Received: (qmail 4634 invoked from network); 6 May 2010 14:16:57 −0000
Received: from unknown (HELO p3pismtp01-025.prod.phx3.secureserver.net) ([10.6.1
(envelope-sender <[email protected]>)
by p3plsmtp06-05.prod.phx3.secureserver.net (qmail-1.03) with SMTP
for <[email protected]>; 6 May 2010 14:16:57 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
for <[email protected]>; Thu, 06 May 2010 10:17:03 −0400
From: [email protected]
To: [email protected]
Date: Thu, 06 May 2010 14:16:31 −0000
Subject: testing smtpmail
Message-ID: <8fad1n462667fik006052010101703@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None
Lovely Spam! Wonderful Spam!
--------------------------------------------------------------------------------
Bye.
Notice how the fields we input to our script show up as headers and text in the email’s raw text delivered to the recipient. Technically, some ISPs test to make sure that at least the domain of the email sender’s address (the part after “@”) is a real, valid domain name, and disallow delivery if not. As mentioned earlier, some servers also require that SMTP senders have a direct connection to their network and may require an authentication call with username and password (described near the end of the preceding section). In the second edition of the book, I used an ISP that let me get away with more nonsense, but this may vary per server; the rules have tightened since then to limit spam.
The first mail listed at the end of the preceding section was the one we sent with a fictitious sender address; the second was the more legitimate message. Like sender addresses, header lines are a bit arbitrary under SMTP. Our smtpmail script automatically adds From and To header lines in the message’s text with the same addresses that are passed to the SMTP interface, but only as a polite convention. Sometimes, though, you can’t tell who a mail was sent to, either: to obscure the target audience or to support legitimate email lists, senders may manipulate the contents of both these headers in the message’s text.
For example, if we change smtpmail to not automatically generate a “To:” header line with the same address(es) sent to the SMTP interface call:
text = ('From: %s\nDate: %s\nSubject: %s\n' % (From, Date, Subj))
we can then manually type a “To:” header that differs from the address we’re really sending to: the “To” address list passed into the smtplib send call gives the true recipients, but the “To:” header line in the text of the message is what most mail clients will display (see smtpmail-noTo.py in the examples package for the code needed to support such anonymous behavior, and be sure to type a blank line after “To:”):
C:\...\PP4E\Internet\Email> smtpmail-noTo.py
From? [email protected]
To?   [email protected]
Subj? a b c d e f g
Type message text, end with line=(ctrl + D or Z)
To: [email protected]

Spam; Spam and eggs; Spam, spam, and spam.
^Z
Connecting...
No errors.
Bye.
In some ways, the From and To addresses in send method calls and message header lines are similar to addresses on envelopes and letters in envelopes, respectively. The former is used for routing, but the latter is what the reader sees. Here, From is fictitious in both places. Moreover, I gave the real To address for the account on the server, but then gave a fictitious name in the manually typed “To:” header line—the first address is where it really goes and the second appears in mail clients. If your mail tool picks out the “To:” line, such mails will look odd when viewed.
For instance, when the mail we just sent shows up in my mailbox at learning-python.com, it’s difficult to tell much about its origin or destination in the webmail interface my ISP provides, as captured in Figure 13-5.
Furthermore, this email’s raw text won’t help unless we look closely at the “Received:” headers added by the machines it has been routed through:
C:\...\PP4E\Internet\Email> popmail.py
Password for pop.secureserver.net?
Connecting...
b'+OK <[email protected]>'
There are 5 mail messages in 6364 bytes
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009', b'5 1038'], 40)
--------------------------------------------------------------------------------
[Press Enter key]
...first three mails omitted...
Received: (qmail 30325 invoked from network); 6 May 2010 14:33:45 −0000
Received: from unknown (HELO p3pismtp01-004.prod.phx3.secureserver.net) ([10.6.1
(envelope-sender <[email protected]>)
by p3plsmtp06-03.prod.phx3.secureserver.net (qmail-1.03) with SMTP
for <[email protected]>; 6 May 2010 14:33:45 −0000
...more deleted...
Received: from [66.194.109.3] by smtp.mailmt.com (ArGoSoft Mail Server .NET v.1.
for <[email protected]>; Thu, 06 May 2010 10:33:16 −0400
From: [email protected]
Date: Thu, 06 May 2010 14:32:32 −0000
Subject: a b c d e f g
To: [email protected]
Message-ID: <66koqg66e0q1c8hl06052010103316@SMTP>
X-FromIP: 66.194.109.3
X-Nonspam: None
Spam; Spam and eggs; Spam, spam, and spam.
--------------------------------------------------------------------------------
Bye.
Once again, though, don’t do this unless you have good cause. This demonstration is intended only to help you understand how mail headers factor into email processing. To write an automatic spam filter that deletes incoming junk mail, for instance, you need to know some of the telltale signs to look for in a message’s text. Spamming techniques have grown much more sophisticated than simply forging sender and recipient names, of course (you’ll find much more on the subject on the Web at large and in the SpamBayes mail filter written in Python), but it’s one common trick.
On the other hand, such To address juggling may also be useful in the context of legitimate mailing lists—the name of the list appears in the “To:” header when the message is viewed, not the potentially many individual recipients named in the send-mail call. As the next section’s example demonstrates, a mail client can simply send a mail to all on the list but insert the general list name in the “To:” header.
But in other contexts, sending email with bogus “From:” and “To:” lines is equivalent to making anonymous phone calls. Most mailers won’t even let you change the From line, and they don’t distinguish between the To address and header line. When you program mail scripts of your own, though, SMTP is wide open in this regard. So be good out there, OK?
So where are we in the Internet abstraction model now? With all this email fetching and sending going on, it’s easy to lose the forest for the trees. Keep in mind that because mail is transferred over sockets (remember sockets?), they are at the root of all this activity. All email read and written ultimately consists of formatted bytes shipped over sockets between computers on the Net. As we’ve seen, though, the POP and SMTP interfaces in Python hide all the details. Moreover, the scripts we’ve begun writing even hide the Python interfaces and provide higher-level interactive tools.
Both the popmail and smtpmail scripts provide portable email tools but aren’t quite what we’d expect in terms of usability these days. Later in this chapter, we’ll use what we’ve seen thus far to implement a more interactive, console-based mail tool. In the next chapter, we’ll also code a tkinter email GUI, and then we’ll go on to build a web-based interface in a later chapter. All of these tools, though, vary primarily in terms of user interface only; each ultimately employs the Python mail transfer modules we’ve met here to transfer mail message text over the Internet with sockets.
Before we move on, one more SMTP note: just as for reading mail, we can use the Python interactive prompt as our email sending client, too, if we type calls manually. The following, for example, sends a message through my ISP’s SMTP server to two recipient addresses assumed to be part of a mail list:
C:\...\PP4E\Internet\Email> python
>>> from smtplib import SMTP
>>> conn = SMTP('smtpout.secureserver.net')
>>> conn.sendmail(
... '[email protected]',                          # true sender
... ['[email protected]', '[email protected]'],   # true recipients
... """From: [email protected]
... To: maillist
... Subject: test interactive smtplib
...
... testing 1 2 3...
... """)
{}
>>> conn.quit()                                   # quit() required, Date added
(221, b'Closing connection. Good bye.')
We’ll verify receipt of this message in a later email client program; the “To” recipient shows up as “maillist” in email clients, a completely valid use case for header manipulation. In fact, you can achieve the same effect with the smtpmail-noTo script by separating recipient addresses at the “To?” prompt with a semicolon (e.g., [email protected]; [email protected]) and typing the email list’s name in the “To:” header line. Mail clients that support mailing lists automate such steps.
Sending mail interactively this way is a bit tricky to get right, though. Header lines are governed by standards: the blank line after the subject line is required and significant, for instance, and Date is omitted altogether (one is added for us). Furthermore, mail formatting gets much more complex as we start writing messages with attachments. In practice, the email package in the standard library is generally used to construct emails, before shipping them off with smtplib. The package lets us build mails by assigning headers and attaching and possibly encoding parts, and creates a correctly formatted mail text. To learn how, let’s move on to the next section.
The second edition of this book used a handful of standard library modules (rfc822, StringIO, and more) to parse the contents of messages, and simple text processing to compose them. Additionally, that edition included a section on extracting and decoding attached parts of a message using modules such as mhlib, mimetools, and base64.
In the third edition, those tools were still available, but were, frankly, a bit clumsy and error-prone. Parsing attachments from messages, for example, was tricky, and composing even basic messages was tedious (in fact, an early printing of the prior edition contained a potential bug, because it omitted one \n character in a string formatting operation). Adding attachments to sent messages wasn’t even attempted, due to the complexity of the formatting involved. Most of these tools are gone completely in Python 3.X as I write this fourth edition, partly because of their complexity, and partly because they’ve been made obsolete.
Luckily, things are much simpler today. After the second edition, Python sprouted a new email package: a powerful collection of tools that automate most of the work behind parsing and composing email messages. This package gives us an object-based message interface and handles all the textual message structure details, both analyzing and creating it. Not only does this eliminate a whole class of potential bugs, it also promotes more advanced mail processing.
Things like attachments, for instance, become accessible to mere mortals (and authors with limited book real estate). In fact, an entire original section on manual attachment parsing and decoding was deleted in the third edition; it’s essentially automatic with email. The new package parses and constructs headers and attachments; generates correct email text; decodes and encodes Base64, quoted-printable, and uuencoded data; and much more.
We won’t cover the email package in its entirety in this book; it is well documented in Python’s library manual. Our goal here is to explore some example usage code, which you can study in conjunction with the manuals. But to help get you started, let’s begin with a quick overview. In a nutshell, the email package is based around the Message object it provides:
A mail’s full text, fetched from poplib or imaplib, is parsed into a new Message object, with an API for accessing its components. In the object, mail headers become dictionary-like keys, and components become a “payload” that can be walked with a generator interface (more on payloads in a moment).
New mails are composed by creating a new Message object, using an API to attach headers and parts, and asking the object for its print representation: a correctly formatted mail message text, ready to be passed to the smtplib module for delivery. Headers are added by key assignment and attachments by method calls.
In other words, the Message object is used both for accessing existing messages and for creating new ones from scratch. In both cases, email can automatically handle details like content encodings (e.g., attached binary images can be treated as text with Base64 encoding and decoding), content types, and more.
Since the email module’s Message object is at the heart of its API, you need a cursory understanding of its form to get started. In short, it is designed to reflect the structure of a formatted email message. Each Message consists of three main pieces of information:
A content type (plain text, HTML text, JPEG image, and so on), encoded as a MIME main type and a subtype. For instance, “text/html” means the main type is text and the subtype is HTML (a web page); “image/jpeg” means a JPEG photo. A “multipart/mixed” type means there are nested parts within the message.
A dictionary-like mapping interface, with one key per mail header (From, To, and so on). This interface supports almost all of the usual dictionary operations, and headers may be fetched or set by normal key indexing.
A “payload,” which represents the mail’s content. This can be either a string (bytes or str) for simple messages, or a list of additional Message objects for multipart container messages with attached or alternative parts. For some oddball types, the payload may be a Python None object.
The MIME type of a Message is key to understanding its content. For example, mails with attached images may have a main top-level Message (type multipart/mixed), with three more Message objects in its payload: one for its main text (type text/plain), followed by two of type image for the photos (type image/jpeg). The photo parts may be encoded for transmission as text with Base64 or another scheme; the encoding type, as well as the original image filename, are specified in the part’s headers.
Similarly, mails that include both simple text and an HTML alternative will have two nested Message objects in their payload, of type plain text (text/plain) and HTML text (text/html), along with a main root Message of type multipart/alternative. Your mail client decides which part to display, often based on your preferences.
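To make the multipart/alternative layout concrete, here is a small sketch that builds such a message and inspects its payload. It relies on the MIMEMultipart and MIMEText classes from the email package; the subject and body strings are made up for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

root = MIMEMultipart('alternative')              # root type: multipart/alternative
root['Subject'] = 'two renderings'
root.attach(MIMEText('plain version\n', 'plain'))
root.attach(MIMEText('<p>HTML version</p>\n', 'html'))

parts = root.get_payload()                       # a list of nested Message objects
types = [part.get_content_type() for part in parts]
print(root.get_content_type(), types)
```

The root holds the headers and a list payload, and each nested part carries its own content type, which mirrors the description above.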
Simpler messages may have just a root Message of type text/plain or text/html, representing the entire message body. The payload for such mails is a simple string. They may also have no explicitly given type at all, which generally defaults to text/plain. Some single-part messages are text/html, with no text/plain alternative; they require a web browser or other HTML viewer (or a very keen-eyed user).
Other combinations are possible, including some types that are not commonly seen in practice, such as message/delivery-status. Most messages have a main text part, though it is not required, and may be nested in a multipart or other construct.
In all cases, an email message is a simple, linear string, but these message structures are automatically detected when mail text is parsed and are created by your method calls when new messages are composed. For instance, when creating messages, the message attach method adds parts for multipart mails, and set_payload sets the entire payload to a string for simple mails.
Message objects also have assorted properties (e.g., the filename of an attachment), and they provide a convenient walk generator method, which returns the next Message in the payload each time through in a for loop or other iteration context. Because the walker yields the root Message object first (i.e., self), single-part messages don’t have to be handled as a special case; a nonmultipart message is effectively a Message with a single item in its payload: itself.
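The uniform treatment that walk provides can be seen in a short sketch. The helper name content_types is an assumption for illustration; everything else uses the email package’s own classes.

```python
from email.message import Message
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def content_types(msg):
    """List the MIME type of every Message yielded by walk, root first."""
    return [part.get_content_type() for part in msg.walk()]

simple = Message()                           # no explicit type: text/plain
simple.set_payload('just text')

multi = MIMEMultipart()                      # subtype defaults to mixed
multi.attach(MIMEText('main text\n'))
multi.attach(MIMEText('<b>extra</b>\n', 'html'))

print(content_types(simple))
print(content_types(multi))
```

The same loop handles both messages: the single-part mail yields only itself, and the multipart mail yields its root followed by each nested part.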
Ultimately, the Message object structure closely mirrors the way mails are formatted as text. Special header lines in the mail’s text give its type (e.g., plain text or multipart), as well as the separator used between the content of nested parts. Since the underlying textual details are automated by the email package, both when parsing and when composing, we won’t go into further formatting details here.
If you are interested in seeing how this translates to real emails, a great way to learn mail structure is by inspecting the full raw text of messages displayed by email clients you already use, as we’ll see with some we meet in this book. In fact, we’ve already seen a few: see the raw text printed by our earlier POP email scripts for simple mail text examples. For more on the Message object, and email in general, consult the email package’s entry in Python’s library manual. We’re skipping details such as its available encoders and MIME object classes here in the interest of space.
Beyond the email package, the Python library includes other tools for mail-related processing. For instance, mimetypes maps a filename to and from a MIME type.
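A brief sketch of both directions follows, using only the guess_type and guess_extension calls of the standard mimetypes module. The filenames are arbitrary examples, and the extension chosen for a given type can vary with the Python release and the platform’s type tables, so no particular extension is promised here.

```python
import mimetypes

print(mimetypes.guess_type('spam.jpg'))          # a (type, encoding) pair
print(mimetypes.guess_type('archive.tar.gz'))    # encoding is reported separately
print(mimetypes.guess_extension('text/plain'))   # an extension for a MIME type
```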
We also used the mimetypes module earlier in this chapter to guess FTP transfer modes from filenames (see Example 13-10), as well as in Chapter 6, where we used it to guess a media player for a filename (see the examples there, including playfile.py, Example 6-23). For email, the module’s calls can come in handy when attaching files to a new message (guess_type) and saving parsed attachments that do not provide a filename (guess_extension). In fact, this module’s source code is a fairly complete reference to MIME types. See the library manual for more on these tools.
Although we can’t provide an exhaustive reference here, let’s step through a simple interactive session to illustrate the fundamentals of email processing. To compose the full text of a message (to be delivered with smtplib, for instance), make a Message, assign headers to its keys, and set its payload to the message body. Converting to a string yields the mail text. This process is substantially simpler and less error-prone than the manual text operations we used earlier in Example 13-19 to build mail as strings:
>>> from email.message import Message
>>> m = Message()
>>> m['from'] = 'Jane Doe <[email protected]>'
>>> m['to']   = '[email protected]'
>>> m.set_payload('The owls are not what they seem...')
>>>
>>> s = str(m)
>>> print(s)
from: Jane Doe <[email protected]>
to: [email protected]

The owls are not what they seem...
Parsing a message’s text, like the kind you obtain with poplib, is similarly simple, and essentially the inverse: we get back a Message object from the text, with keys for headers and a payload for the body:
>>> s                                     # same as in prior interaction
'from: Jane Doe <[email protected]>\nto: [email protected]\n\nThe owls are not...'
>>> from email.parser import Parser
>>> x = Parser().parsestr(s)
>>> x
<email.message.Message object at 0x015EA9F0>
>>>
>>> x['From']
'Jane Doe <[email protected]>'
>>> x.get_payload()
'The owls are not what they seem...'
>>> x.items()
[('from', 'Jane Doe <[email protected]>'), ('to', '[email protected]')]
So far this isn’t much different from the older and
now-defunct rfc822
module, but
as we’ll see in a moment, things get more interesting when there is
more than one part. For simple messages like this one, the message
walk
generator treats it as a
single-part mail, of type plain text:
>>> for part in x.walk():
...     print(part.get_content_type())
...     print(part.get_payload())
...
text/plain
The owls are not what they seem...
Making a mail with attachments is a
little more work, but not much: we just make a root
Message
and attach nested
Message
objects created from
the MIME type object that corresponds to the type of data we’re
attaching. The MIMEText
class,
for instance, is a subclass of Message
, which is tailored for text
parts, and knows how to generate the right types of header
information when printed. MIMEImage
and MIMEAudio
similarly customize Message
for images and audio, and also know how to apply Base64 and other
MIME encodings to binary data. The root message is where we store
the main headers of the mail, and we attach parts here, instead of
setting the entire payload—the payload is a list now, not a
string. MIMEMultipart
is a
Message
that provides the extra
header protocol we need for the root:
>>> from email.mime.multipart import MIMEMultipart     # Message subclasses
>>> from email.mime.text import MIMEText               # with extra headers+logic
>>>
>>> top = MIMEMultipart()                               # root Message object
>>> top['from'] = 'Art <[email protected]>'             # subtype default=mixed
>>> top['to'] = '[email protected]'
>>>
>>> sub1 = MIMEText('nice red uniforms...\n')           # part Message attachments
>>> sub2 = MIMEText(open('data.txt').read())
>>> sub2.add_header('Content-Disposition', 'attachment', filename='data.txt')
>>> top.attach(sub1)
>>> top.attach(sub2)
When we ask for the text, a correctly formatted full mail
text is returned, separators and all, ready to be sent with
smtplib
—quite a trick, if
you’ve ever tried this by hand:
>>> text = top.as_string()            # or do: str(top) or print(top)
>>> print(text)
Content-Type: multipart/mixed; boundary="===============1574823535=="
MIME-Version: 1.0
from: Art <[email protected]>
to: [email protected]

--===============1574823535==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

nice red uniforms...

--===============1574823535==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="data.txt"

line1
line2
line3

--===============1574823535==--
If we are sent this message and retrieve it via poplib
, parsing its full text yields a
Message
object just like the one we built to send. The message walk
generator allows us to step through
each part, fetching their types and payloads:
>>> text                              # same as in prior interaction
'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ver...'
>>> from email.parser import Parser
>>> msg = Parser().parsestr(text)
>>> msg['from']
'Art <[email protected]>'
>>> for part in msg.walk():
...     print(part.get_content_type())
...     print(part.get_payload())
...     print()
...
multipart/mixed
[<email.message.Message object at 0x015EC610>, <email.message.Message object at 0x015EC630>]

text/plain
nice red uniforms...

text/plain
line1
line2
line3
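To tie the walk generator back to the mimetypes tools mentioned earlier, here is a rough sketch of how a client might pick a save name for each part. The named_parts function, its part-N naming scheme, and the addresses are hypothetical, invented for this illustration; they are not code from this book's examples:

```python
import mimetypes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.parser import Parser

def named_parts(msg):
    """Yield (filename, payload bytes) for each non-container part of msg."""
    for count, part in enumerate(msg.walk()):
        if part.get_content_maintype() == 'multipart':
            continue                          # containers hold parts, not data
        filename = part.get_filename()        # from Content-Disposition, if any
        if not filename:
            # hypothetical naming scheme: part-N plus a guessed extension
            ext = mimetypes.guess_extension(part.get_content_type()) or '.bin'
            filename = 'part-%d%s' % (count, ext)
        yield filename, part.get_payload(decode=True)

# build and reparse a message shaped like the one in the session above
top = MIMEMultipart()
top['from'] = 'Art <art@example.com>'         # placeholder address
top.attach(MIMEText('nice red uniforms...\n'))
sub = MIMEText('line1\nline2\nline3\n')
sub.add_header('Content-Disposition', 'attachment', filename='data.txt')
top.attach(sub)

parsed = Parser().parsestr(top.as_string())
for filename, data in named_parts(parsed):
    print(filename, len(data))
```

Because decode=True is passed, each payload comes back as bytes, which suits saving in binary-mode files regardless of the part's type.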
Multipart alternative messages (with text and HTML
renditions of the same message) can be composed and parsed in
similar fashion. Because email
clients are able to parse and compose messages with a simple
object-based API, they are freed to focus on user-interface
instead of text processing.
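As a small sketch of the multipart alternative case, assuming placeholder addresses and invented message text, the same object API composes and parses both renditions:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.parser import Parser

# compose: a root of subtype 'alternative' holding two renditions of one message
root = MIMEMultipart('alternative')
root['From'] = 'sender@example.com'           # placeholder addresses
root['To'] = 'receiver@example.com'
root['Subject'] = 'Two renditions'
root.attach(MIMEText('Hello in plain text\n', 'plain'))
root.attach(MIMEText('<html><body><p>Hello in <b>HTML</b></p></body></html>\n', 'html'))

# parse: walk the reparsed message and index the renditions by content type
parsed = Parser().parsestr(root.as_string())
renditions = {part.get_content_type(): part.get_payload(decode=True)
              for part in parsed.walk()
              if part.get_content_maintype() != 'multipart'}
print(parsed.get_content_type())
print(sorted(renditions))
```

By MIME convention the alternatives are listed from plainest to richest, so the HTML rendition is attached last here; a client is free to display whichever one it supports best.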
Now that I’ve shown you how “cool” the email package is, I
unfortunately need to let you know that it’s not completely
operational in Python 3.1. The email
package
works as shown for simple messages, but is severely impacted by
Python 3.X’s Unicode/bytes string dichotomy in a number of
ways.
In short, the email
package
in Python 3.1 is still somewhat coded to operate in the realm of 2.X
str
text strings. Because these
have become Unicode in 3.X, and because some tools that email
uses are now oriented toward
bytes
strings, which do not mix
freely with str
, a variety of
conflicts crop up and cause issues for programs that depend upon
this module.
At this writing, a new version of email
is being developed which will handle
bytes
and Unicode encodings
better, but the going consensus is that it won’t be folded back into
Python until release 3.3 or later, long after this book’s release.
Although a few patches might make their way into 3.2, the current
sense is that fully addressing the package’s problems appears to
require a full redesign.
To be fair, it’s a substantial problem. Email has historically
been oriented toward single-byte ASCII text, and generalizing it for
Unicode is difficult to do well. In fact, the same holds true for
most of the Internet today—as discussed elsewhere in this chapter,
FTP, POP, SMTP, and even webpage bytes fetched over HTTP pose the
same sorts of issues. Interpreting the bytes shipped over networks
as text is easy if the mapping is one-to-one, but allowing for
arbitrary Unicode encoding in that text opens a Pandora’s box of
dilemmas. The extra complexity is necessary today, but, as email
attests, can be a daunting
task.
Frankly, I considered not releasing this edition of this book
until this package’s issues could be resolved, but I decided to go
forward because a new email
package may be years away (two Python releases, by all accounts).
Moreover, the issues serve as a case study of the types of problems
you’ll run into in the real world of large-scale software
development. Things change over time, and program code is no
exception.
Instead, this book’s examples provide new Unicode and
Internationalization support but adopt policies to work around
issues where possible. Programs in books are meant to be
educational, after all, not commercially viable. Given the state of
the email
package that the
examples depend on, though, the solutions used here might not be
completely universal, and there may be additional Unicode issues
lurking. To address the future, watch this book’s website (described
in the Preface) for updated notes and code examples if/when the
anticipated new email
package
appears. Here, we’ll work with what we have.
The good news is that we’ll be able to make use of email
in its current form to build fairly
sophisticated and full-featured email clients in this book anyhow.
It still offers an amazing number of tools, including MIME encoding
and decoding, message formatting and parsing, Internationalized
headers extraction and construction, and more. The bad news is that
this will require a handful of obscure workarounds and may need to
be changed in the future, though few software projects are exempt
from such realities.
Because email
’s limitations
have implications for later email code in this book, I’m going to
quickly run through them in this section. Some of this can be safely
saved for later reference, but parts of later examples may be
difficult to understand if you don’t have this background. The
upside is that exploring the package’s limitations here also serves
as a vehicle for digging a bit deeper into the email
package’s interfaces in
general.
The first Unicode issue in Python 3.1’s email
package is nearly a showstopper in
some contexts: the bytes
strings of the sort produced by poplib
for mail fetches must be decoded
to str
prior to parsing with
email
. Unfortunately, because
there may not be enough information to know how to decode the
message bytes per Unicode, some clients of this package may need
to be generalized to detect whole-message encodings prior to
parsing; in worst cases other than email that may mandate mixed
data types, the current package cannot be used at all. Here’s the
issue live:
>>> text                              # from prior example in this section
'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ver...'
>>> btext = text.encode()
>>> btext
b'Content-Type: multipart/mixed; boundary="===============1574823535=="\nMIME-Ve...'
>>> msg = Parser().parsestr(text)               # email parser expects Unicode str
>>> msg = Parser().parsestr(btext)              # but poplib fetches email as bytes!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\email\parser.py", line 82, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
TypeError: initial_value must be str or None, not bytes
>>> msg = Parser().parsestr(btext.decode())                # okay per default
>>> msg = Parser().parsestr(btext.decode('utf8'))          # ascii encoded (default)
>>> msg = Parser().parsestr(btext.decode('latin1'))        # ascii is same in all 3
>>> msg = Parser().parsestr(btext.decode('ascii'))
This is less than ideal, as a bytes
-based email
would be able to handle message
encodings more directly. As mentioned, though, the email
package is not really fully
functional in Python 3.1, because of its legacy str
focus, and the sharp distinction
that Python 3.X makes between Unicode text and byte strings. In
this case, its parser should accept bytes
and not expect clients to know how
to decode.
Because of that, this book’s email clients take simplistic
approaches to decoding fetched message bytes to be parsed by
email
. Specifically, full-text
decoding will try a user-configurable encoding name, then fall
back on trying common types as a heuristic, and finally attempt to
decode just message headers.
This will suffice for the examples shown but may need to be enhanced for broader applicability. In some cases, encoding may have to be determined by other schemes such as inspecting email headers (if present at all), guessing from bytes structure analysis, or dynamic user feedback. Adding such enhancements in a robust fashion is likely too complex to attempt in a book’s example code, and it is better performed in common standard library tools in any event.
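The fallback order just described can be sketched roughly as follows. The function name, its parameters, and the particular fallback list are assumptions made for illustration; this is not the code used by this book's mail clients:

```python
import re

def decode_fetched_text(raw, user_encoding=None, fallbacks=('ascii', 'utf-8')):
    """Decode full message bytes to str, trying encodings in order.

    Hypothetical helper: the name, parameters, and fallback list are
    assumptions for illustration only.
    """
    candidates = ([user_encoding] if user_encoding else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeError, LookupError):   # bad bytes, or unknown codec name
            continue
    # last resort: keep only the header block, replacing undecodable bytes
    headers = re.split(br'\r?\n\r?\n', raw, maxsplit=1)[0]
    return headers.decode('ascii', errors='replace') + '\n\n[message body not decoded]'

raw = b'Subject: test\n\nA\xe4B'              # Latin-1 bytes: not valid ASCII or UTF-8
print(repr(decode_fetched_text(raw, user_encoding='latin-1')))
print(repr(decode_fetched_text(raw)))
```

One design note: an 8-bit encoding such as Latin-1 accepts every byte value, so placing it in the fallback list means no later candidate, nor the headers-only step, is ever reached. That is convenient, but it can silently mislabel text that was really in some other encoding.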
Really, robust decoding of mail text may not be possible
today at all, if it requires headers inspections—we can’t inspect
a message’s encoding information headers unless we parse the
message, but we can’t parse a message with 3.1’s email
package unless we already know the
encoding. That is, scripts may need to parse in order to decode,
but they need to decode in order to parse! The byte strings of
poplib
and Unicode strings of
email
in 3.1 are fundamentally
at odds. Even within its own libraries, Python 3.X’s changes have
created a chicken-and-egg dependency problem that still exists
nearly two years after 3.0’s release.
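For readers running something newer than the 3.1 described here: to my knowledge, Python 3.2 and later added bytes-oriented entry points to the package (email.message_from_bytes and email.parser.BytesParser) that sidestep this loop. The following sketch assumes such a release and will not run on 3.1; the message bytes and address are invented stand-ins for data fetched with poplib:

```python
from email import message_from_bytes      # assumed available: Python 3.2 and later
from email.parser import BytesParser      # same assumption

raw = (b'From: sender@example.com\n'      # stand-in for bytes fetched by poplib
       b'Content-Type: text/plain; charset="latin-1"\n'
       b'Content-Transfer-Encoding: 8bit\n'
       b'\n'
       b'A\xe4B\n')

msg = message_from_bytes(raw)             # no whole-message decode() required
same = BytesParser().parsebytes(raw)      # equivalent class-based interface

charset = msg.get_content_charset()       # declared per part, read after parsing
print(charset)
print(msg.get_payload(decode=True).decode(charset))
```

With this interface, the per-part character set can be read after parsing and applied to just that part's bytes, rather than guessing one encoding for the entire message up front.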
Short of writing our own email parser, or pursuing other similarly complex approaches, the best bet today for fetched messages seems to be decoding per user preferences and defaults, and that’s how we’ll proceed in this edition. The PyMailGUI client of Chapter 14, for instance, will allow Unicode encodings for full mail text to be set on a per-session basis.
The real issue, of course, is that email in general is inherently complicated by the presence of arbitrary text encodings. Besides full mail text, we also must consider Unicode encoding issues for the text components of a message once it’s parsed—both its text parts and its message headers. To see why, let’s move on.
Related Issue for CGI scripts: I
should also note that the full text decoding issue may not be as
large a factor for email as it is for some other email
package clients. Because the
original email standards call for ASCII text and require binary data
to be MIME encoded, most emails are likely to decode properly
according to a 7- or 8-bit encoding such as Latin-1.
As we’ll see in Chapter 15,
though, a more insurmountable and related issue looms for
server-side scripts that support CGI file
uploads on the Web—because Python’s CGI module also
uses the email
package to
parse multipart form data; because this package requires data to
be decoded to str
for
parsing; and because such data might have mixed text and binary
data (including raw binary data that is not
MIME-encoded, text of any encoding, and even arbitrary
combinations of these), these uploads fail in Python 3.1 if any
binary or incompatible text files are included. The cgi
module triggers Unicode decoding
or type errors internally, before the Python script has a chance
to intervene.
CGI uploads worked in Python 2.X, because the str
type represented both possibly
encoded text and binary data. Saving this type’s content to a
binary mode file as a string of bytes in 2.X sufficed for both
arbitrary text and binary data such as images. Email parsing
worked in 2.X for the same reason. For better or worse, the 3.X
str
/bytes
dichotomy makes this generality
impossible.
In other words, although we can generally work around the
email
parser’s str
requirement for fetched emails by
decoding per an 8-bit encoding, it’s much more malignant for web
scripting today. Watch for more details on this in Chapter 15, and stay tuned for a future
fix, which may have materialized by the time you read these
words.
Our next email
Unicode
issue seems to fly in the face of Python’s generic
programming model: the data types of message payload objects may
differ, depending on how they are fetched. Especially for programs
that walk and process payloads of mail parts generically, this complicates
code.
Specifically, the Message
object’s get_payload
method we used earlier accepts an optional decode
argument to control automatic
email-style MIME decoding (e.g., Base64, uuencode,
quoted-printable). If this argument is passed in as 1
(or equivalently, True
), the payload’s data is
MIME-decoded when fetched, if required. Because this argument is
so useful for complex messages with arbitrary parts, it will
normally be passed as true in all cases. Binary parts are normally
MIME-encoded, but even text parts might also be present in Base64
or another MIME form if their bytes fall outside email standards.
Some types of Unicode text, for example, require MIME
encoding.
The upshot is that get_payload
normally returns str
strings for str
text parts, but returns bytes
strings if its decode
argument is true—even if the
message part is known to be text by nature. If this argument is
not used, the payload’s type depends upon how it was set: str
or bytes
. Because Python 3.X does not allow
str
and bytes
to be mixed freely, clients that
need to use the result in text processing or store it in files
need to accommodate the difference. Let’s run some code to
illustrate:
>>> from email.message import Message
>>> m = Message()
>>> m['From'] = 'Lancelot'
>>> m.set_payload('Line?...')
>>> m['From']
'Lancelot'
>>> m.get_payload()                   # str, if payload is str
'Line?...'
>>> m.get_payload(decode=1)           # bytes, if MIME decode (same as decode=True)
b'Line?...'
The combination of these different return types and Python
3.X’s strict str
/bytes
dichotomy can cause problems in
code that processes the result unless it decodes carefully:
>>> m.get_payload(decode=True) + 'spam'                # can't mix in 3.X!
TypeError: can't concat bytes to str
>>> m.get_payload(decode=True).decode() + 'spam'       # convert if required
'Line?...spam'
To make sense of these examples, it may help to remember that there are two different concepts of “encoding” for email text:
Email-style MIME encodings such as Base64, uuencode, and quoted-printable, which are applied to binary and otherwise unusual content to make them acceptable for transmission in email text
Unicode text encodings for strings in general, which apply to message text as well as its parts, and may be required after MIME encoding for text message parts
The email
package handles
email-style MIME encodings automatically when we pass decode=1
to fetch parsed payloads, or
generate text for messages that have nonprintable parts, but
scripts still need to take Unicode encodings into consideration
because of Python 3.X’s sharp string types differentiation. For
example, the first decode
in
the following refers to MIME, and the second to Unicode:
m.get_payload(decode=True).decode() # to bytes via MIME, then to str via Unicode
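To see the two layers separately, here is a short sketch that assumes a recent Python release, where a UTF-8 text part built from a str is Base64-encoded by the package (behavior under 3.1 may differ, and the transfer encoding chosen is the package's decision, not ours):

```python
import base64
from email.mime.text import MIMEText

text = 'A\xe4B'                                # non-ASCII str text
part = MIMEText(text, _charset='utf-8')        # charset names the Unicode encoding

print(part['Content-Transfer-Encoding'])       # the MIME layer chosen by the package
stored = part.get_payload()                    # payload as stored: MIME-encoded text
raw = part.get_payload(decode=True)            # layer 1: MIME decoding yields bytes
print(repr(stored), raw)
print(base64.b64decode(stored) == raw)         # undoing the MIME layer by hand
print(raw.decode(part.get_content_charset()))  # layer 2: Unicode decoding yields str
```

The stored payload is ASCII text safe for transmission; only after both decoding steps do we get back the original characters.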
Even without the MIME decode
argument, the payload type may
also differ if it is stored in different forms:
>>> m = Message(); m.set_payload('spam'); m.get_payload()     # fetched as stored
'spam'
>>> m = Message(); m.set_payload(b'spam'); m.get_payload()
b'spam'
Moreover, the same holds true for the text-specific MIME
subclass (though as we’ll see later in this section, we cannot
pass a bytes
to its constructor
to force a binary payload):
>>> from email.mime.text import MIMEText
>>> m = MIMEText('Line...?')
>>> m['From'] = 'Lancelot'
>>> m['From']
'Lancelot'
>>> m.get_payload()
'Line...?'
>>> m.get_payload(decode=1)
b'Line...?'
Unfortunately, the fact that payloads might be either
str
or bytes
today not only flies in the face
of Python’s type-neutral mindset, it can complicate your
code—scripts may need to convert in contexts that require one or
the other type. For instance, GUI libraries might allow both, but
file saves and web page content generation may be less flexible.
In our example programs, we’ll process payloads as bytes
whenever possible, but decode to
str
text in cases where
required using the encoding information available in the header
API described in the next section.
More profoundly, text in email can be even richer than implied so far—in principle, text payloads of a single message may be encoded in a variety of different Unicode schemes (e.g., three HTML webpage file attachments, all in different Unicode encodings, and possibly different than the full message text’s encoding). Although treating such text as binary byte strings can sometimes finesse encoding issues, saving such parts in text-mode files for opening must respect the original encoding types. Further, any text processing performed on such parts will be similarly type-specific.
Luckily, the email
package both adds character-set headers when generating message
text and retains character-set information for parts if it is
present when parsing message text. For instance, adding non-ASCII
text attachments simply requires passing in an encoding name—the
appropriate message headers are added automatically on text
generation, and the character set is available directly via
the get_content_charset
method:
>>> s = b'A\xe4B'
>>> s.decode('latin1')
'AäB'
>>> from email.message import Message
>>> m = Message()
>>> m.set_payload(b'A\xe4B', charset='latin1')      # or 'latin-1': see ahead
>>> t = m.as_string()
>>> print(t)
MIME-Version: 1.0
Content-Type: text/plain; charset="latin1"
Content-Transfer-Encoding: base64

QeRC

>>> m.get_content_charset()
'latin1'
Notice how email
automatically applies Base64 MIME encoding to non-ASCII text parts
on generation, to conform to email standards. The same is true for
the more specific MIME text subclass of Message:
>>> from email.mime.text import MIMEText
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> t = m.as_string()
>>> print(t)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> m.get_content_charset()
'latin1'
Now, if we parse this message’s text string with email
, we get back a new Message
whose text payload is the Base64
MIME-encoded text used to represent the non-ASCII Unicode string.
Requesting MIME decoding for the payload with decode=1
returns the byte string we
originally attached:
>>> from email.parser import Parser
>>> q = Parser().parsestr(t)
>>> q
<email.message.Message object at 0x019ECA50>
>>> q.get_content_type()
'text/plain'
>>> q._payload
'QeRC\n'
>>> q.get_payload()
'QeRC\n'
>>> q.get_payload(decode=1)
b'A\xe4B'
However, running Unicode decoding on this byte string to
convert to text fails if we rely on the default encoding of the
bytes decode method (UTF-8). To be more accurate, and support a wide variety of
text types, we need to use the character-set information saved by
the parser and attached to the Message
object. This is especially
important if we need to save the data to a file—we either have to
store as bytes in binary mode files, or specify the correct (or at
least a compatible) Unicode encoding in order to use such strings
for text-mode files. Decoding manually works the same way:
>>> q.get_payload(decode=1).decode()
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected
>>> q.get_content_charset()
'latin1'
>>> q.get_payload(decode=1).decode('latin1')                     # known type
'AäB'
>>> q.get_payload(decode=1).decode(q.get_content_charset())      # allow any type
'AäB'
In fact, all the header details are available on Message
objects, if we know where to
look. The character set can also be absent entirely, in which case
it’s returned as None
; clients
need to define policies for such ambiguous text (they might try
common types, guess, or treat the data as a raw byte
string):
>>> q['content-type']                             # mapping interface
'text/plain; charset="latin1"'
>>> q.items()
[('Content-Type', 'text/plain; charset="latin1"'), ('MIME-Version', '1.0'),
('Content-Transfer-Encoding', 'base64')]
>>> q.get_params(header='Content-Type')           # param interface
[('text/plain', ''), ('charset', 'latin1')]
>>> q.get_param('charset', header='Content-Type')
'latin1'
>>> charset = q.get_content_charset()             # might be missing
>>> if charset:
...     print(q.get_payload(decode=1).decode(charset))
...
AäB
This handles encodings for message text parts in parsed emails. For composing new emails, we still must apply session-wide user settings or allow the user to specify an encoding for each part interactively. In some of this book’s email clients, payload conversions are performed as needed—using encoding information in message headers after parsing and provided by users during mail composition.
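For the parsing half of that policy, a client might wrap the steps above in a helper along these lines. The payload_as_text name and its Latin-1 fallback are assumptions made for this sketch, and the example assumes a recent Python release for composing the test message:

```python
from email.mime.text import MIMEText
from email.parser import Parser

def payload_as_text(part, default='latin-1'):
    """Return a text part's payload as str (hypothetical helper).

    The fallback policy is an assumption: when no charset is declared, or the
    declared name is unusable, decode as `default` and replace bad bytes.
    """
    raw = part.get_payload(decode=True)       # MIME-decoded bytes; None for multiparts
    if raw is None:
        return ''
    charset = part.get_content_charset() or default
    try:
        return raw.decode(charset)
    except (UnicodeError, LookupError):
        return raw.decode(default, errors='replace')

declared = Parser().parsestr(MIMEText('A\xe4B', _charset='latin-1').as_string())
undeclared = Parser().parsestr('From: someone@example.com\n\nplain body\n')
print(payload_as_text(declared), payload_as_text(undeclared))
```

Code that needs the original bytes, such as a binary-mode file save, can skip the helper and use get_payload(decode=True) directly.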
On a related note, the email
package also provides support for encoding and decoding
message headers themselves (e.g., From, Subject) per email
standards when they are not simple text. Such headers are often
called Internationalized (or
i18n) headers, because they support inclusion
of non-ASCII character set text in emails. This term is also
sometimes used to refer to encoded text of message payloads;
unlike message headers, though, message payload encoding is used
for both international Unicode text and truly binary data such as
images (as we’ll see in the next section).
Like mail payload parts, i18n headers are encoded specially
for email, and may also be encoded per Unicode. For instance,
here’s how to decode an encoded subject line from an arguably
spammish email that just showed up in my inbox; its =?UTF-8?Q?
preamble declares that the
data following it is UTF-8 encoded Unicode text, which is also
MIME-encoded per quoted-printable for transmission in email (in
short, unlike the prior section’s part payloads, which declare
their encodings in separate header lines, headers themselves may
declare their Unicode and MIME encodings by embedding them in
their own content this way):
>>> rawheader = '=?UTF-8?Q?Introducing=20Top=20Values=3A=20A=20Special=20Selection=20of=20Great=20Money=20Savers?='
>>> from email.header import decode_header        # decode per email+MIME
>>> decode_header(rawheader)
[(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')]
>>> bin, enc = decode_header(rawheader)[0]        # and decode per Unicode
>>> bin, enc
(b'Introducing Top Values: A Special Selection of Great Money Savers', 'utf-8')
>>> bin.decode(enc)
'Introducing Top Values: A Special Selection of Great Money Savers'
Subtly, the email
package
can return multiple parts if there are encoded substrings in the
header, and each must be decoded individually and joined to
produce decoded header text. Even more subtly, in 3.1, this
package returns all bytes
when
any substring (or the entire header) is encoded but returns
str
for a fully unencoded
header, and unencoded substrings returned as bytes
are encoded per
“raw-unicode-escape” in the package—an encoding scheme useful to
convert str
to bytes
when no encoding type
applies:
>>> from email.header import decode_header
>>> S1 = 'Man where did you get that assistant?'
>>> S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
>>> S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='

# str: don't decode()
>>> decode_header(S1)
[('Man where did you get that assistant?', None)]

# bytes: do decode()
>>> decode_header(S2)
[(b'Man where did you get that assistant?', 'utf-8')]

# bytes: do decode() using raw-unicode-escape applied in package
>>> decode_header(S3)
[(b'Man where did you get that', None), (b'assistant?', 'utf-8')]

# join decoded parts if more than one
>>> parts = decode_header(S3)
>>> ' '.join(abytes.decode('raw-unicode-escape' if enc == None else enc)
...          for (abytes, enc) in parts)
'Man where did you get that assistant?'
We’ll use logic similar to the last step here in the
mailtools
package ahead, but
also retain str
substrings
intact without attempting to decode.
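A rough sketch of that combined logic follows. The decode_header_text name and its Latin-1 last-ditch policy are assumptions for illustration, not the mailtools code itself. The function strips each piece and rejoins with single spaces, mirroring the session above; releases that preserve whitespace inside the returned parts may call for a different joining policy:

```python
from email.header import decode_header

def decode_header_text(raw):
    """Decode a possibly i18n-encoded header to display text (a sketch)."""
    pieces = []
    for value, enc in decode_header(raw):
        if isinstance(value, bytes):          # encoded, or mixed with encoded parts
            try:
                value = value.decode(enc or 'raw-unicode-escape')
            except (UnicodeError, LookupError):
                value = value.decode('latin-1')   # assumed last-ditch policy
        pieces.append(value.strip())          # str parts pass through undecoded
    return ' '.join(piece for piece in pieces if piece)

S1 = 'Man where did you get that assistant?'
S2 = '=?utf-8?q?Man_where_did_you_get_that_assistant=3F?='
S3 = 'Man where did you get that =?UTF-8?Q?assistant=3F?='
for header in (S1, S2, S3):
    print(decode_header_text(header))
```

All three forms should come out as the same display text, whichever type decode_header returned for them.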
Late-breaking news: As I write this
in mid-2010, it seems possible that this mixed type,
nonpolymorphic, and frankly, non-Pythonic API behavior may be
addressed in a future Python release. In response to a rant
posted on the Python developers list by a book author whose work
you might be familiar with, there is presently a vigorous
discussion of the topic there. Among other ideas is a proposal
for a bytes
-like type which
carries with it an explicit Unicode encoding; this may make it
possible to treat some text cases in a more generic fashion.
While it’s impossible to foresee the outcome of such proposals,
it’s good to see that the issues are being actively explored.
Stay tuned to this book’s website for further developments in
the Python 3.X library API and Unicode stories.
One wrinkle pertaining to the prior section: for message headers that contain email addresses (e.g., From), the name component of the name/address pair might be encoded this way as well. Because the email package’s header parser expects encoded substrings to be followed by whitespace or the end of string, we cannot ask it to decode a complete address-related header—quotes around name components will fail.
To support such Internationalized address headers, we must
also parse out the first part of the email address and then
decode. First of all, we need to extract the name and address
parts of an email address using email
package tools:
>>> from email.utils import parseaddr, formataddr
>>> p = parseaddr('"Smith, Bob" <[email protected]>')     # split into name/addr pair
>>> p                                                     # unencoded addr
('Smith, Bob', '[email protected]')
>>> formataddr(p)
'"Smith, Bob" <[email protected]>'
>>> parseaddr('Bob Smith <[email protected]>')             # unquoted name part
('Bob Smith', '[email protected]')
>>> formataddr(parseaddr('Bob Smith <[email protected]>'))
'Bob Smith <[email protected]>'
>>> parseaddr('[email protected]')                         # simple, no name
('', '[email protected]')
>>> formataddr(parseaddr('[email protected]'))
'[email protected]'
Fields with multiple addresses (e.g., To) separate
individual addresses by commas. Since email names might embed
commas, too, blindly splitting on commas to run each through
parsing won’t always work. Instead, another utility can be used to
parse each address individually: getaddresses
ignores commas in names
when splitting apart separate addresses, and parseaddr
does, too, because it simply
returns the first pair in the getaddresses
result (some line breaks
were added to the following for legibility):
>>> from email.utils import getaddresses
>>> multi = '"Smith, Bob" <[email protected]>, Bob Smith <[email protected]>, [email protected], "Bob" <[email protected]>'
>>> getaddresses([multi])
[('Smith, Bob', '[email protected]'), ('Bob Smith', '[email protected]'), ('', '[email protected]'),
('Bob', '[email protected]')]
>>> [formataddr(pair) for pair in getaddresses([multi])]
['"Smith, Bob" <[email protected]>', 'Bob Smith <[email protected]>', '[email protected]',
'Bob <[email protected]>']
>>> ', '.join([formataddr(pair) for pair in getaddresses([multi])])
'"Smith, Bob" <[email protected]>, Bob Smith <[email protected]>, [email protected], Bob <[email protected]>'
>>> getaddresses(['[email protected]'])           # handles single address cases too
[('', '[email protected]')]
Now, decoding email addresses is really just an extra step before and after the normal header decoding logic we saw earlier:
>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" <[email protected]>'
>>> from email.utils import parseaddr, formataddr
>>> from email.header import decode_header
>>> name, addr = parseaddr(rawfromheader)         # split into name/addr parts
>>> name, addr
('=?UTF-8?Q?Walmart?=', '[email protected]')
>>> abytes, aenc = decode_header(name)[0]         # do email+MIME decoding
>>> abytes, aenc
(b'Walmart', 'utf-8')
>>> name = abytes.decode(aenc)                    # do Unicode decoding
>>> name
'Walmart'
>>> formataddr((name, addr))                      # put parts back together
'Walmart <[email protected]>'
Although From headers will typically have just one address,
to be fully robust we need to apply this to every address in
headers, such as To, Cc, and Bcc. Again, the multiaddress getaddresses
utility avoids comma clashes between names and address
separators; since it also handles the single address case, it
suffices for From headers as well:
>>> rawfromheader = '"=?UTF-8?Q?Walmart?=" <[email protected]>'
>>> rawtoheader = rawfromheader + ', ' + rawfromheader
>>> rawtoheader
'"=?UTF-8?Q?Walmart?=" <[email protected]>, "=?UTF-8?Q?Walmart?=" <[email protected]>'
>>> pairs = getaddresses([rawtoheader])
>>> pairs
[('=?UTF-8?Q?Walmart?=', '[email protected]'), ('=?UTF-8?Q?Walmart?=', '[email protected]')]
>>> addrs = []
>>> for name, addr in pairs:
...     abytes, aenc = decode_header(name)[0]           # email+MIME
...     name = abytes.decode(aenc)                      # Unicode
...     addrs.append(formataddr((name, addr)))          # one or more addrs
...
>>> ', '.join(addrs)
'Walmart <[email protected]>, Walmart <[email protected]>'
These tools are generally forgiving for unencoded content
and return them intact. To be robust, though, the last portion of
code here should also allow for multiple parts returned by
decode_header
(for encoded
substrings), None
encoding
values for parts (for unencoded substrings), and str
substring values instead of bytes
(for fully unencoded names).
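Putting those cautions together, a more defensive version might look like the following sketch. Both function names and the example addresses are hypothetical. It returns decoded (name, address) pairs instead of reformatted strings, because how formataddr treats non-ASCII names can differ between Python releases:

```python
from email.header import decode_header
from email.utils import getaddresses

def decode_words(text):
    """Decode i18n-encoded words in text, passing unencoded str through."""
    pieces = []
    for value, enc in decode_header(text):
        if isinstance(value, bytes):          # may be several parts; enc may be None
            value = value.decode(enc or 'raw-unicode-escape', errors='replace')
        pieces.append(value.strip())
    return ' '.join(piece for piece in pieces if piece)

def decode_address_header(raw):
    """Split an address header into (decoded name, address) pairs."""
    return [(decode_words(name), addr) for name, addr in getaddresses([raw])]

raw = ('"=?UTF-8?Q?Walmart?=" <news@example.com>, '
       '"Smith, Bob" <bob@example.com>, plain@example.com')
for name, addr in decode_address_header(raw):
    print(repr(name), addr)
```

This sketch still raises an exception for an unrecognized encoding name in a header; a production client would want a policy for that case too.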
Decoding this way applies both MIME and Unicode decoding steps to fetched mails. Creating properly encoded headers for inclusion in new mails composed and sent is similarly straightforward:
>>> from email.header import make_header
>>> hdr = make_header([(b'A\xc4B\xe4C', 'latin-1')])
>>> print(hdr)
AÄBäC
>>> print(hdr.encode())
=?iso-8859-1?q?A=C4B=E4C?=
>>> decode_header(hdr.encode())
[(b'A\xc4B\xe4C', 'iso-8859-1')]
This can be applied to entire headers such as Subject, as
well as the name component of each email address in an
address-related header line such as From and To (use getaddresses
to split into individual
addresses first if needed). The header object provides an
alternative interface; both techniques handle additional details,
such as line lengths, for which we’ll defer to Python
manuals:
>>> from email.header import Header
>>> h = Header(b'A\xe4B\xc4X', charset='latin-1')
>>> h.encode()
'=?iso-8859-1?q?A=E4B=C4X?='
>>>
>>> h = Header('spam', charset='ascii')           # same as Header('spam')
>>> h.encode()
'spam'
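For composing, a client might encode only when needed, and only the name component of an address. The two helper names, the UTF-8 default, and the example address below are assumptions made for this sketch, which was written against a recent Python release:

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr, parseaddr

def encode_header_text(text, charset='utf-8'):
    """Return text unchanged if it is ASCII, else an encoded header string."""
    try:
        text.encode('ascii')
        return text
    except UnicodeError:
        return Header(text, charset=charset).encode()

def encode_address(address, charset='utf-8'):
    """Encode only the name component of a single 'Name <addr>' string."""
    name, addr = parseaddr(address)
    return formataddr((encode_header_text(name, charset), addr))

subject = encode_header_text('A\xc4B\xe4C')
sender = encode_address('B\xf6b Smith <bob@example.com>')
print(subject)
print(sender)
print(make_header(decode_header(subject)))    # round trip back to display text
```

For headers holding several addresses, split them with getaddresses first and apply encode_address logic to each pair, then rejoin with commas.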
The mailtools
package
ahead and its PyMailGUI client of Chapter 14 will use these interfaces to
automatically decode message headers in fetched mails per their
content for display, and to encode headers sent that are not in
ASCII format. The latter also applies to the name component of
email addresses, and assumes that SMTP servers will allow these to
pass. This may encroach on some SMTP server issues which we don’t
have space to address in this book. See the Web for more on SMTP
headers handling. For more on headers decoding, see also file
_test-i18n-headers.py in the
examples package; it decodes additional subject and
address-related headers using mailtools
methods, and displays them in a
tkinter Text
widget—a foretaste
of how these will be displayed in PyMailGUI.
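To make the address case concrete, here is a hedged sketch of encoding only the name component of each address in a header line. The helper name encode_address_header, the utf-8 default, and the example.com addresses are placeholders invented for this illustration; the mailtools package may structure this differently.

```python
from email.header import Header, decode_header
from email.utils import getaddresses, formataddr

def encode_address_header(value, charset='utf-8'):
    """
    Hypothetical helper: MIME-encode just the name part of each address.
    """
    encoded = []
    for name, addr in getaddresses([value]):      # split into (name, addr) pairs
        try:
            name.encode('ascii')                  # ASCII names are left as is
        except UnicodeEncodeError:
            name = Header(name, charset).encode() # non-ASCII name: encoded-word
        encoded.append(formataddr((name, addr)))
    return ', '.join(encoded)

line = encode_address_header('J\xfcrgen <jurgen@example.com>, bob@example.com')
print(line)
```

The address part is never encoded here, only the display name, which matches the policy described above.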
Our last two email Unicode issues are outright bugs which we must work around today, though they will almost certainly be fixed in a future Python release. The first breaks message text generation for all but trivial messages: the email package today no longer supports generation of full mail text for messages that contain any binary parts, such as images or audio files. Without coding workarounds, only simple emails that consist entirely of text parts can be composed and generated in Python 3.1’s email package; any MIME-encoded binary part causes mail text generation to fail.
This is a bit tricky to understand without poring over email’s source code (which, thankfully, we can in the land of open source), but to demonstrate the issue, first notice how simple text payloads are rendered as full message text when printed, as we’ve already seen:
C:\...\PP4E\Internet\Email> python
>>> from email.message import Message         # generic message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> m.set_payload(open('text.txt').read())    # payload is str text
>>> print(m)                                  # print uses as_string()
From: [email protected]

spam
Spam
SPAM!
As we’ve also seen, for convenience, the email package also provides subclasses of the Message object, tailored to add message headers that provide the extra descriptive details used by email clients to know how to process the data:
>>> from email.mime.text import MIMEText      # Message subclass with headers
>>> text = open('text.txt').read()
>>> m = MIMEText(text)                        # payload is str text
>>> m['From'] = '[email protected]'
>>> print(m)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: [email protected]

spam
Spam
SPAM!
This works for text, but watch what happens when we try to render a message part with truly binary data, such as an image that could not be decoded as Unicode text:
>>> from email.message import Message         # generic Message object
>>> m = Message()
>>> m['From'] = '[email protected]'
>>> bytes = open('monkeys.jpg', 'rb').read()  # read binary bytes (not Unicode)
>>> m.set_payload(bytes)                      # we set the payload to bytes
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>

>>> m.get_payload()[:20]
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00x\x00x\x00\x00'
The problem here is that the email package’s text generator assumes that the message’s payload data is a Base64 (or similar) encoded str text string by generation time, not bytes. Really, the error is probably our fault in this case, because we set the payload to raw bytes manually.
We should use the MIMEImage MIME subclass tailored for images; if we do, the email package internally performs Base64 MIME email encoding on the data when the message object is created. Unfortunately, it still leaves it as bytes, not str, despite the fact the whole point of Base64 is to change binary data to text (though the exact Unicode flavor this text should take may be unclear). This leads to additional failures in Python 3.1:
>>> from email.mime.image import MIMEImage    # Message subclass with hdrs+base64
>>> bytes = open('monkeys.jpg', 'rb').read()  # read binary bytes again
>>> m = MIMEImage(bytes)                      # MIME class does Base64 on data
>>> print(m)
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\generator.py", line 155, in _handle_text
    raise TypeError('string payload expected: %s' % type(payload))
TypeError: string payload expected: <class 'bytes'>

>>> m.get_payload()[:40]                      # this is already Base64 text
b'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'
>>> m.get_payload()[:40].decode('ascii')      # but it's still bytes internally!
'/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIB'
In other words, not only does the Python 3.1 email package not fully support the Python 3.X Unicode/bytes dichotomy, it was actually broken by it. Luckily, there’s a workaround for this case.
To address this specific issue, I opted to create a custom encoding function for binary MIME attachments, and pass it in to the email package’s MIME message object subclasses for all binary data types. This custom function is coded in the upcoming mailtools package of this chapter (Example 13-23). Because it is used by email to encode from bytes to text at initialization time, it is able to decode to ASCII text per Unicode as an extra step, after running the original call to perform Base64 encoding and arrange content-encoding headers. The fact that email does not do this extra Unicode decoding step itself is a genuine bug in that package (albeit, one introduced by changes elsewhere in Python standard libraries), but the workaround does its job:
# in mailtools.mailSender module ahead in this chapter...

def fix_encode_base64(msgobj):
    from email.encoders import encode_base64
    encode_base64(msgobj)                # what email does normally: leaves bytes
    bytes = msgobj.get_payload()         # bytes fails in email pkg on text gen
    text  = bytes.decode('ascii')        # decode to unicode str so text gen works
    ...line splitting logic omitted...
    msgobj.set_payload('\n'.join(lines))

>>> from email.mime.image import MIMEImage
>>> from mailtools.mailSender import fix_encode_base64     # use custom workaround
>>> bytes = open('monkeys.jpg', 'rb').read()
>>> m = MIMEImage(bytes, _encoder=fix_encode_base64)       # convert to ascii str
>>> print(m.as_string()[:500])
Content-Type: image/jpeg
MIME-Version: 1.0
Content-Transfer-Encoding: base64

/9j/4AAQSkZJRgABAQEAeAB4AAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYHBwcG
BwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcIDAwMDAwM
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAARCAHoAvQDASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc

>>> print(m)                                  # to print the entire message: very long
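The core idea of this workaround, decoding the Base64 result to an ASCII str before it is stored as the payload, can be sketched without the mailtools package. This is an illustration under stated assumptions rather than the book's implementation: make_binary_part is a hypothetical name, and it relies on base64.encodebytes to split the encoded data into lines instead of the custom line-splitting logic omitted above.

```python
import base64
from email.mime.base import MIMEBase

def make_binary_part(data, maintype='application', subtype='octet-stream'):
    """
    Hypothetical helper: build a Base64 MIME part whose payload is str, not bytes.
    """
    part = MIMEBase(maintype, subtype)            # adds Content-Type, MIME-Version
    encoded = base64.encodebytes(data)            # bytes in, Base64 bytes out
    part.set_payload(encoded.decode('ascii'))     # the extra step: bytes to str
    part['Content-Transfer-Encoding'] = 'base64'
    return part

data = bytes(range(256))                          # stand-in for image file bytes
part = make_binary_part(data, 'image', 'jpeg')
text = part.as_string()                           # full mail text generation
print(text[:200])
```

Because the payload is already str when text generation runs, the generator has nothing left to trip over, and get_payload(decode=True) still returns the original bytes.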
Another possible workaround involves defining a custom MIMEImage class that is like the original but does not attempt to perform Base64 encoding on creation; that way, we could encode and translate to str before message object creation, but still make use of the original class’s header-generation logic. If you take this route, though, you’ll find that it requires repeating (really, cutting and pasting) far too much of the original logic to be reasonable; this repeated code would have to mirror any future email changes:
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> class MyImage(MIMENonMultipart):
...     def __init__(self, imagedata, subtype):
...         MIMENonMultipart.__init__(self, 'image', subtype)
...         self.set_payload(imagedata)
...         ...repeat all the base64 logic here, with an extra ASCII Unicode decode...
...
>>> m = MyImage(text_from_bytes)
Interestingly, this regression in email actually reflects an unrelated change in Python’s base64 module made in 2007, which was completely benign until the Python 3.X bytes/str differentiation came online. Prior to that, the email encoder worked in Python 2.X, because bytes was really str. In 3.X, though, because base64 returns bytes, the normal mail encoder in email also leaves the payload as bytes, even though it’s been encoded to Base64 text form. This in turn breaks email text generation, because it assumes the payload is text in this case, and requires it to be str. As is common in large-scale software systems, the effects of some 3.X changes may have been difficult to anticipate or accommodate in full.
By contrast, parsing binary attachments (as opposed to generating text for them) works fine in 3.X, because the parsed message payload is saved in message objects as a Base64-encoded str string, not bytes, and is converted to bytes only when fetched. This bug seems likely to also go away in a future Python and email package (perhaps even as a simple patch in Python 3.2), but it’s more serious than the other Unicode decoding issues described here, because it prevents mail composition for all but trivial mails.
The flexibility afforded by the package and the Python language allows such a workaround to be developed external to the package, rather than hacking the package’s code directly. With open source and forgiving APIs, you rarely are truly stuck.
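The parsing side of this story, where the payload is kept as Base64 str text and becomes bytes only when fetched with decoding, can be checked with a short, self-contained sketch. The message text below is fabricated for illustration and stands in for a mail fetched from a server.

```python
import base64
from email.parser import Parser

data = b'\xff\xd8\xff\xe0\x00\x10JFIF'            # a few binary bytes for the demo
mail_text = (
    'Content-Type: image/jpeg\n'
    'MIME-Version: 1.0\n'
    'Content-Transfer-Encoding: base64\n'
    '\n'
    + base64.b64encode(data).decode('ascii') + '\n')

msg = Parser().parsestr(mail_text)                # parser expects str text
stored = msg.get_payload()                        # Base64 text, kept as str
fetched = msg.get_payload(decode=True)            # MIME-decoded on request: bytes
print(type(stored).__name__, type(fetched).__name__)
```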
Late-breaking news: This section’s bug is scheduled to be fixed in Python 3.2, making our workaround here unnecessary in this and later Python releases. This is per communications with members of Python’s email special interest group (on the “email-sig” mailing list).
Regrettably, this fix didn’t appear until after this chapter and its examples had been written. I’d like to remove the workaround and its description entirely, but this book is based on Python 3.1, both before and after the fix was incorporated.
So that it works under Python 3.2 alpha, too, though, the workaround code ahead was specialized just before publication to check for bytes prior to decoding. Moreover, the workaround still must manually split lines in Base64 data, because 3.2 still does not.
Our final email Unicode issue is as severe as the prior one: changes like that of the prior section introduced yet another regression for mail composition. In short, it’s impossible to make text message parts today without specializing for different Unicode encodings.
Some types of text are automatically MIME-encoded for transmission. Unfortunately, because of the str/bytes split, the MIME text message class in email now requires different string object types for different Unicode encodings. The net effect is that you now have to know how the email package will process your text data when making a text message object, or repeat most of its logic redundantly.
For example, to properly generate Unicode encoding headers and apply required MIME encodings, here’s how we must proceed today for common Unicode text types:
>>> m = MIMEText('abc', _charset='ascii')       # pass text for ascii
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

abc

>>> m = MIMEText('abc', _charset='latin-1')     # pass text for latin-1
>>> print(m)                                    # but not for 'latin1': ahead
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc

>>> m = MIMEText(b'abc', _charset='utf-8')      # pass bytes for utf8
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj
This works, but if you look closely, you’ll notice that we must pass str to the first two, but bytes to the third. That requires that we special-case code for Unicode types based upon the package’s internal operation. Types other than those expected for a Unicode encoding don’t work at all, because of newly invalid str/bytes combinations that occur inside the email package in 3.1:
>>> m = MIMEText('abc', _charset='ascii')
>>> m = MIMEText(b'abc', _charset='ascii')      # bug: assumes 2.X str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\encoders.py", line 60, in encode_7or8bit
    orig.encode('ascii')
AttributeError: 'bytes' object has no attribute 'encode'

>>> m = MIMEText('abc', _charset='latin-1')
>>> m = MIMEText(b'abc', _charset='latin-1')    # bug: qp uses str
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface

>>> m = MIMEText(b'abc', _charset='utf-8')
>>> m = MIMEText('abc', _charset='utf-8')       # bug: base64 uses bytes
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str
Moreover, the email package is pickier about encoding name synonyms than Python and most other tools are: “latin-1” is detected as a quoted-printable MIME type, but “latin1” is unknown and so defaults to Base64 MIME. In fact, this is why Base64 was used for the “latin1” Unicode type earlier in this section, an encoding choice that is irrelevant to any recipient that understands the “latin1” synonym, including Python itself. Unfortunately, that means that we also need to pass in a different string type if we use a synonym the package doesn’t understand today:
>>> m = MIMEText('abc', _charset='latin-1')     # str for 'latin-1'
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

abc

>>> m = MIMEText('abc', _charset='latin1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\base64mime.py", line 94, in body_encode
    enc = b2a_base64(s[i:i + max_unencoded]).decode("ascii")
TypeError: must be bytes or buffer, not str

>>> m = MIMEText(b'abc', _charset='latin1')     # bytes for 'latin1'!
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

YWJj
There are ways to add aliases and new encoding types in the email package, but they’re not supported out of the box. Programs that care about being robust would have to cross-check the user’s spelling, which may be valid for Python itself, against that expected by email. This also holds true if your data is not ASCII in general: you’ll have to first decode to text in order to use the expected “latin-1” name because its quoted-printable MIME encoding expects str, even though bytes are required if “latin1” triggers the default Base64 MIME:
>>> m = MIMEText(b'A\xe4B', _charset='latin1')
>>> print(m)
Content-Type: text/plain; charset="latin1"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> m = MIMEText(b'A\xe4B', _charset='latin-1')
Traceback (most recent call last):
  ...lines omitted...
  File "C:\Python31\lib\email\quoprimime.py", line 176, in body_encode
    if line.endswith(CRLF):
TypeError: expected an object with the buffer interface

>>> m = MIMEText(b'A\xe4B'.decode('latin1'), _charset='latin-1')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

A=E4B
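One of the ways to add aliases mentioned above is the add_alias function in the email.charset module. The sketch below registers “latin1” as a synonym so that it is treated like “latin-1”. Note the assumptions: this changes a module-level table for the whole process, and the value printed for the name before registration depends on the Python release in use.

```python
import email.charset
from email.charset import Charset, QP

before = Charset('latin1').body_encoding          # whatever email assumes by default
email.charset.add_alias('latin1', 'iso-8859-1')   # register synonym, process-wide
after = Charset('latin1').body_encoding           # now follows the iso-8859-1 entry

print(before, after)
print(after == Charset('latin-1').body_encoding)  # both spellings now agree
```

A program that accepts encoding names from users could register such aliases once at startup rather than cross-checking spellings at every call site.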
In fact, the text message object doesn’t check to see that the data you’re MIME-encoding is valid per Unicode in general—we can send invalid UTF text but the receiver may have trouble decoding it:
>>> m = MIMEText(b'A\xe4B', _charset='utf-8')
>>> print(m)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

QeRC

>>> b'A\xe4B'.decode('utf8')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...

>>> import base64
>>> base64.b64decode(b'QeRC')
b'A\xe4B'
>>> base64.b64decode(b'QeRC').decode('utf')
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected...
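Since the message object performs no such check, a composing program can validate data itself before attaching it. A minimal sketch follows; validated_text is a hypothetical helper name, and returning None for invalid data is just one possible policy.

```python
def validated_text(data, charset):
    """
    Hypothetical helper: return data decoded to str if valid per charset, else None.
    """
    try:
        return data.decode(charset)
    except (UnicodeDecodeError, LookupError):     # invalid bytes or unknown name
        return None

print(validated_text(b'A\xe4B', 'latin-1'))       # valid as Latin-1 text
print(validated_text(b'A\xe4B', 'utf-8'))         # not valid UTF-8: rejected
```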
So what should we do when we need to attach message text to composed messages, given that the text’s datatype requirement is indirectly dictated by its Unicode encoding name? The generic Message superclass doesn’t help here directly if we specify an encoding, as it exhibits the same encoding-specific behavior:
>>> m = Message()
>>> m.set_payload('spam', charset='us-ascii')
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam

>>> m = Message()
>>> m.set_payload(b'spam', charset='us-ascii')
AttributeError: 'bytes' object has no attribute 'encode'

>>> m.set_payload('spam', charset='utf-8')
TypeError: must be bytes or buffer, not str
Although we could try to work around these issues by repeating much of the code that email runs, the redundancy would make us hopelessly tied to its current implementation and dependent upon its future changes. The following, for example, parrots the steps that email runs internally to create a text message object for ASCII encoding text; unlike the MIMEText class, this approach allows all data to be read from files as binary byte strings, even if it’s simple ASCII:
>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'us-ascii')
>>> m.add_header('Content-Transfer-Encoding', '7bit')
>>> data = b'spam'
>>> m.set_payload(data.decode('ascii'))         # data read as bytes here
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam

>>> print(MIMEText('spam', _charset='ascii'))   # same, but type-specific
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

spam
To do the same for other kinds of text that require MIME encoding, just insert an extra encoding step; although we’re concerned with text parts here, a similar imitative approach could address the binary parts text generation bug we met earlier:
>>> m = Message()
>>> m.add_header('Content-Type', 'text/plain')
>>> m['MIME-Version'] = '1.0'
>>> m.set_param('charset', 'utf-8')
>>> m.add_header('Content-Transfer-Encoding', 'base64')
>>> data = b'spam'
>>> from binascii import b2a_base64             # add MIME encode if needed
>>> data = b2a_base64(data)                     # data read as bytes here too
>>> m.set_payload(data.decode('ascii'))
>>> print(m)
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

c3BhbQ==

>>> print(MIMEText(b'spam', _charset='utf-8'))  # same, but type-specific
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: base64

c3BhbQ==
This works, but besides the redundancy and dependency it creates, to use this approach broadly we’d also have to generalize to account for all the various kinds of Unicode encodings and MIME encodings possible, like the email package already does internally. We might also have to support encoding name synonyms to be flexible, adding further redundancy. In other words, this requires additional work, and in the end, we’d still have to specialize our code for different Unicode types.
Any way we go, some dependence on the current implementation seems unavoidable today. It seems the best we can do here, apart from hoping for an improved email package in a few years’ time, is to specialize text message construction calls by Unicode type, and assume both that encoding names match those expected by the package and that message data is valid for the Unicode type selected. Here is the sort of arguably magic code that the upcoming mailtools package (again in Example 13-23) will apply to choose text types:
>>> from email.charset import Charset, BASE64, QP

>>> for e in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
...     cset = Charset(e)
...     benc = cset.body_encoding
...     if benc in (None, QP):
...         print(e, benc, 'text')       # read/fetch data as str
...     else:
...         print(e, benc, 'binary')     # read/fetch data as bytes
...
us-ascii None text
latin-1 1 text
utf8 2 binary
latin1 2 binary
ascii None text
We’ll proceed this way in this book, with the major caveat that this will almost certainly require changes in the future because of its strong coupling with the current email implementation.
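The selection logic above can be packaged as a small function that tells a caller how to open the file holding a text part's data. This is a sketch of the idea only, assuming the Python 3.1 behavior described in this section; open_mode_for is a hypothetical name and not part of mailtools.

```python
from email.charset import Charset, QP

def open_mode_for(encoding_name):
    """
    Hypothetical helper: pick a file mode from email's body encoding choice.
    """
    benc = Charset(encoding_name).body_encoding
    if benc in (None, QP):
        return 'r'                   # unencoded or quoted-printable: read as str
    else:
        return 'rb'                  # Base64: read as bytes

for name in ('us-ascii', 'latin-1', 'utf8', 'latin1', 'ascii'):
    print(name, open_mode_for(name))
```

Keeping the decision in one function limits the coupling to email internals to a single place, which matters given how likely this code is to change.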
Late-breaking news: As in the prior section, it now appears that this section’s bug will also be fixed in Python 3.2, making the workaround here unnecessary in this and later Python releases. The nature of the fix is unknown, though, and we still need the workaround for the version of Python current when this chapter was written. As of just before publication, the alpha release of 3.2 is still somewhat type specific on this issue, but now accepts either str or bytes for text that triggers Base64 encodings, instead of just bytes.
The email package in Python 3.1 provides powerful tools for parsing and composing mails, and can be used as the basis for full-featured mail clients like those in this book with just a few workarounds. As you can see, though, it is less than fully functional today. Because of that, further specializing code to its current API is perhaps a temporary solution. Short of writing our own email parser and composer (not a practical option in a finitely-sized book!), some compromises are in order here. Moreover, the inherent complexity of Unicode support in email places some limits on how much we can pursue this thread in this book.
In this edition, we will support Unicode encodings of text parts and headers in messages composed, and respect the Unicode encodings in text parts and mail headers of messages fetched. To make this work with the partially crippled email package in Python 3.1, though, we’ll apply the following Unicode policies in various email clients in this book:
Use user preferences and defaults for the preparse decoding of full mail text fetched and encoding of text payloads sent.
Use header information, if available, to decode the bytes payloads returned by get_payload when text parts must be treated as str text, but use binary mode files to finesse the issue in other contexts.
Use formats prescribed by email standards to decode and encode message headers such as From and Subject if they are not simple text.
Apply the fix described to work around the message text generation issue for binary parts.
Special-case construction of text message objects according to Unicode types and email behavior.
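The second policy in this list, decoding payload bytes per header information, might be sketched as follows. The helper name part_text and its latin-1 fallback are assumptions made for this illustration; the clients in this book make their own fallback choices, and this sketch handles only non-multipart parts.

```python
import base64
from email.parser import Parser

def part_text(part, default='latin-1'):
    """
    Hypothetical helper: return the text of a non-multipart part as str.
    """
    raw = part.get_payload(decode=True)               # bytes after MIME decoding
    charset = part.get_content_charset() or default   # from Content-Type, if any
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):         # bad name or bad data
        return raw.decode(default, errors='replace')

body = base64.b64encode('A\xe4B'.encode('utf-8')).decode('ascii')
mail_text = ('Content-Type: text/plain; charset="utf-8"\n'
             'MIME-Version: 1.0\n'
             'Content-Transfer-Encoding: base64\n'
             '\n' + body + '\n')
print(part_text(Parser().parsestr(mail_text)))
```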
These are not necessarily complete solutions. For example, some of this edition’s email clients allow for Unicode encodings for both text attachments and mail headers, but they do nothing about encoding the full text of messages sent beyond the policies inherited from smtplib and implement policies that might be inconvenient in some use cases. But as we’ll see, despite their limitations, our email clients will still be able to handle complex email tasks and a very large set of emails.
Again, since this story is in flux in Python today, watch this book’s website for updates that may improve or be required of code that uses email in the future. A future email may handle Unicode encodings more accurately. Like Python 3.X, though, backward compatibility may be sacrificed in the process and require updates to this book’s code. For more on this issue, see the Web as well as up-to-date Python release notes.
Although this quick tour captures the basic flavor of the interface, we need to step up to larger examples to see more of the email package’s power. The next section takes us on the first of those steps.
Let’s put together what we’ve learned about fetching, sending, parsing, and composing email in a simple but functional command-line console email tool. The script in Example 13-20 implements an interactive email session: users may type commands to read, send, and delete email messages. It uses poplib and smtplib to fetch and send, and uses the email package directly to parse and compose.
#!/usr/local/bin/python
"""
##########################################################################
pymail - a simple console email interface client in Python; uses Python
poplib module to view POP email messages, smtplib to send new mails, and
the email package to extract mail headers and payload and compose mails;
##########################################################################
"""

import poplib, smtplib, email.utils, mailconfig
from email.parser  import Parser
from email.message import Message
fetchEncoding = mailconfig.fetchEncoding

def decodeToUnicode(messageBytes, fetchEncoding=fetchEncoding):
    """
    4E, Py3.1: decode fetched bytes to str Unicode string for display or parsing;
    use global setting (or by platform default, hdrs inspection, intelligent guess);
    in Python 3.2/3.3, this step may not be required: if so, return message intact;
    """
    return [line.decode(fetchEncoding) for line in messageBytes]

def splitaddrs(field):
    """
    4E: split address list on commas, allowing for commas in name parts
    """
    pairs = email.utils.getaddresses([field])                # [(name,addr)]
    return [email.utils.formataddr(pair) for pair in pairs]  # [name <addr>]

def inputmessage():
    import sys
    From = input('From? ').strip()
    To   = input('To?   ').strip()       # datetime hdr may be set auto
    To   = splitaddrs(To)                # possible many, name+<addr> okay
    Subj = input('Subj? ').strip()       # don't split blindly on ',' or ';'
    print('Type message text, end with line="."')
    text = ''
    while True:
        line = sys.stdin.readline()
        if line == '.\n': break
        text += line
    return From, To, Subj, text

def sendmessage():
    From, To, Subj, text = inputmessage()
    msg = Message()
    msg['From']    = From
    msg['To']      = ', '.join(To)                     # join for hdr, not send
    msg['Subject'] = Subj
    msg['Date']    = email.utils.formatdate()          # curr datetime, rfc2822
    msg.set_payload(text)
    server = smtplib.SMTP(mailconfig.smtpservername)
    try:
        failed = server.sendmail(From, To, str(msg))   # may also raise exc
    except:
        print('Error - send failed')
    else:
        if failed: print('Failed:', failed)

def connect(servername, user, passwd):
    print('Connecting...')
    server = poplib.POP3(servername)
    server.user(user)                     # connect, log in to mail server
    server.pass_(passwd)                  # pass is a reserved word
    print(server.getwelcome())            # print returned greeting message
    return server

def loadmessages(servername, user, passwd, loadfrom=1):
    server = connect(servername, user, passwd)
    try:
        print(server.list())
        (msgCount, msgBytes) = server.stat()
        print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
        print('Retrieving...')
        msgList = []                                     # fetch mail now
        for i in range(loadfrom, msgCount+1):            # empty if low >= high
            (hdr, message, octets) = server.retr(i)      # save text on list
            message = decodeToUnicode(message)           # 4E, Py3.1: bytes to str
            msgList.append('\n'.join(message))           # leave mail on server
    finally:
        server.quit()                                    # unlock the mail box
    assert len(msgList) == (msgCount - loadfrom) + 1     # msg nums start at 1
    return msgList

def deletemessages(servername, user, passwd, toDelete, verify=True):
    print('To be deleted:', toDelete)
    if verify and input('Delete?')[:1] not in ['y', 'Y']:
        print('Delete cancelled.')
    else:
        server = connect(servername, user, passwd)
        try:
            print('Deleting messages from server...')
            for msgnum in toDelete:                 # reconnect to delete mail
                server.dele(msgnum)                 # mbox locked until quit()
        finally:
            server.quit()

def showindex(msgList):
    count = 0                                # show some mail headers
    for msgtext in msgList:
        msghdrs = Parser().parsestr(msgtext, headersonly=True)  # expects str in 3.1
        count += 1
        print('%d:\t%d bytes' % (count, len(msgtext)))
        for hdr in ('From', 'To', 'Date', 'Subject'):
            try:
                print('\t%-8s=>%s' % (hdr, msghdrs[hdr]))
            except KeyError:
                print('\t%-8s=>(unknown)' % hdr)
        if count % 5 == 0:
            input('[Press Enter key]')       # pause after each 5

def showmessage(i, msgList):
    if 1 <= i <= len(msgList):
        #print(msgList[i-1])            # old: prints entire mail--hdrs+text
        print('-' * 79)
        msg = Parser().parsestr(msgList[i-1])      # expects str in 3.1
        content = msg.get_payload()     # prints payload: string, or [Messages]
        if isinstance(content, str):    # keep just one end-line at end
            content = content.rstrip() + '\n'
        print(content)
        print('-' * 79)                 # to get text only, see email.parsers
    else:
        print('Bad message number')

def savemessage(i, mailfile, msgList):
    if 1 <= i <= len(msgList):
        savefile = open(mailfile, 'a', encoding=mailconfig.fetchEncoding)  # 4E
        savefile.write('\n' + msgList[i-1] + '-'*80 + '\n')
    else:
        print('Bad message number')

def msgnum(command):
    try:
        return int(command.split()[1])
    except:
        return -1   # assume this is bad

helptext = """
Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text
"""

def interact(msgList, mailfile):
    showindex(msgList)
    toDelete = []
    while True:
        try:
            command = input('[Pymail] Action? (i, l, d, s, m, q, ?) ')
        except EOFError:
            command = 'q'
        if not command: command = '*'

        # quit
        if command == 'q':
            break

        # index
        elif command[0] == 'i':
            showindex(msgList)

        # list
        elif command[0] == 'l':
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    showmessage(i, msgList)
            else:
                showmessage(msgnum(command), msgList)

        # save
        elif command[0] == 's':
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    savemessage(i, mailfile, msgList)
            else:
                savemessage(msgnum(command), mailfile, msgList)

        # delete
        elif command[0] == 'd':
            if len(command) == 1:                      # delete all later
                toDelete = list(range(1, len(msgList)+1))     # 3.x requires list
            else:
                delnum = msgnum(command)
                if (1 <= delnum <= len(msgList)) and (delnum not in toDelete):
                    toDelete.append(delnum)
                else:
                    print('Bad message number')

        # mail
        elif command[0] == 'm':                # send a new mail via SMTP
            sendmessage()
            #execfile('smtpmail.py', {})       # alt: run file in own namespace

        elif command[0] == '?':
            print(helptext)
        else:
            print('What? -- type "?" for commands help')
    return toDelete

if __name__ == '__main__':
    import getpass, mailconfig
    mailserver = mailconfig.popservername        # ex: 'pop.rmi.net'
    mailuser   = mailconfig.popusername          # ex: 'lutz'
    mailfile   = mailconfig.savemailfile         # ex:  r'c:\stuff\savemail'
    mailpswd   = getpass.getpass('Password for %s?' % mailserver)
    print('[Pymail email client]')
    msgList    = loadmessages(mailserver, mailuser, mailpswd)     # load all
    toDelete   = interact(msgList, mailfile)
    if toDelete: deletemessages(mailserver, mailuser, mailpswd, toDelete)
    print('Bye.')
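The splitaddrs helper in this listing is worth trying in isolation, because it shows why the script does not split the To line blindly on commas. The sketch below repeats the function so that it runs on its own; the addresses are placeholders at example.com invented for the demonstration.

```python
import email.utils

def splitaddrs(field):
    # same logic as the listing: parse first, then reformat each pair
    pairs = email.utils.getaddresses([field])                # [(name, addr)]
    return [email.utils.formataddr(pair) for pair in pairs]  # [name <addr>]

field = '"Smith, Bob" <bob@example.com>, sue@example.com'
print(field.split(','))        # naive split breaks the quoted name apart
print(splitaddrs(field))       # parser-based split keeps each address whole
```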
There isn’t much new here—just a combination of user-interface logic and tools we’ve already met, plus a handful of new techniques:
This client loads all email from the server into an in-memory Python list only once, on startup; you must exit and restart to reload newly arrived email.
On demand, pymail saves the raw text of a selected message into a local file, whose name you place in the mailconfig module of Example 13-17.
We finally support on-request deletion of mail from the server here: in pymail, mails are selected for deletion by number, but are still only physically removed from your server on exit, and then only if you verify the operation. By deleting only on exit, we avoid changing mail message numbers during a session; under POP, deleting a mail not at the end of the list decrements the number assigned to all mails following the one deleted. Since mail is cached in memory by pymail, future operations on the numbered messages in memory can be applied to the wrong mail if deletions were done immediately.[53]
pymail now displays just the payload of a message on listing commands, not the entire raw text, and the mail index listing only displays selected headers parsed out of each message. Python’s email package is used to extract headers and content from a message, as shown in the prior section. Similarly, we use email to compose a message and ask for its string to ship as a mail.
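The renumbering hazard behind the delete-on-exit design can be seen in a small simulation that uses a plain list in place of a server mailbox. This is purely illustrative and makes an assumption worth stating: it models each immediate deletion as being committed, with the remaining mails renumbered, before the next one is applied, as would happen if the client reconnected for every deletion. Both function names are invented here.

```python
def delete_immediately(mailbox, numbers):
    """Simulate committing each deletion at once, renumbering after each."""
    box = list(mailbox)
    for num in numbers:               # numbers refer to the original index display
        del box[num - 1]              # later messages shift down by one
    return box

def delete_on_exit(mailbox, numbers):
    """Simulate collecting numbers and removing them in one final pass."""
    doomed = set(numbers)
    return [msg for num, msg in enumerate(mailbox, start=1) if num not in doomed]

mailbox = ['mail-1', 'mail-2', 'mail-3', 'mail-4']
print(delete_immediately(mailbox, [2, 3]))   # removes mail-2 and mail-4: wrong
print(delete_on_exit(mailbox, [2, 3]))       # removes mail-2 and mail-3: intended
```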
By now, I expect that you know enough to read this script for a deeper look, so instead of saying more about its design here, let’s jump into an interactive pymail session to see how it works.
Let’s start up pymail to read and delete email at our mail server and send new messages. pymail runs on any machine with Python and sockets, fetches mail from any email server with a POP interface on which you have an account, and sends mail via the SMTP server you’ve named in the mailconfig module we wrote earlier (Example 13-17).
Here it is in action running on my Windows laptop machine; its operation is identical on other machines thanks to the portability of both Python and its standard library. First, we start the script, supply a POP password (remember, SMTP servers usually require no password), and wait for the pymail email list index to appear; as is, this version loads the full text of all mails in the inbox on startup:
C:\...\PP4E\Internet\Email> pymail.py
Password for pop.secureserver.net?
[Pymail email client]
Connecting...
b'+OK <[email protected]>'
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1009', b'5 1038', b'6 957'], 47)
There are 6 mail messages in 7321 bytes
Retrieving...
1:      1861 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
        Subject =>I'm a Lumberjack, and I'm Okay
2:      1409 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 05 May 2010 08:33:47 -0700
        Subject =>testing
3:      1050 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:11:07 -0000
        Subject =>A B C D E F G
4:      1010 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:16:31 -0000
        Subject =>testing smtpmail
5:      1039 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:32:32 -0000
        Subject =>a b c d e f g
[Press Enter key]
6:      958 bytes
        From    =>[email protected]
        To      =>maillist
        Date    =>Thu, 06 May 2010 10:58:40 -0400
        Subject =>test interactive smtplib
[Pymail] Action? (i, l, d, s, m, q, ?) l 6
-------------------------------------------------------------------------------
testing 1 2 3...

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?) l 3
-------------------------------------------------------------------------------
Fiddle de dum, Fiddle de dee, Eric the half a bee.

-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)
Once pymail
downloads your
email to a Python list on the local client machine, you type command
letters to process it. The l
command lists (prints) the contents of a given mail number; here, we
just used it to list two emails we sent in the preceding section,
with the smtpmail
script, and
interactively.
pymail
also lets us get
command help, delete messages (deletions actually occur at the
server on exit from the program), and save messages away in a local
text file whose name is listed in the mailconfig
module we saw earlier:
[Pymail] Action? (i, l, d, s, m, q, ?)?
Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text
[Pymail] Action? (i, l, d, s, m, q, ?)s 4
[Pymail] Action? (i, l, d, s, m, q, ?)d 4
Now, let’s pick the m mail compose option: pymail inputs the mail parts, builds mail text with email, and ships it off with smtplib. You can separate recipients with a comma, and use either simple “addr” or full “name <addr>” address pairs if desired. Because the mail is sent by SMTP, you can use arbitrary From addresses here; but again, you generally shouldn’t do that (unless, of course, you’re trying to come up with interesting examples for a book):
[Pymail] Action? (i, l, d, s, m, q, ?)m
From?[email protected]
To?[email protected]
Subj?Among our weapons are these
Type message text, end with line="."
Nobody Expects the Spanish Inquisition!
.
[Pymail] Action? (i, l, d, s, m, q, ?)q
To be deleted: [4]
Delete?y
Connecting...
b'+OK <[email protected]>'
Deleting messages from server...
Bye.
As mentioned, deletions really happen only on exit. When we
quit pymail
with the q
command, it tells us which messages are
queued for deletion, and verifies the request. Once verified,
pymail
finally contacts the mail
server again and issues POP calls to delete the selected mail
messages. Because deletions change message numbers in the server’s
inbox, postponing deletion until exit simplifies the handling of
already loaded email (we’ll improve on this in the PyMailGUI client
of the next chapter).
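The deferred-deletion policy just described can be sketched in a few lines. This is a simplified illustration rather than pymail’s own code: the delete_queued function and the FakePopServer stand-in are hypothetical names of mine, and only the dele and quit method names mirror those of real poplib.POP3 connection objects.

```python
def delete_queued(server, to_delete):
    """
    Issue one POP delete request per queued message number, then quit.
    'server' is assumed to be a connected poplib.POP3-like object; a POP
    server removes the marked messages when the session ends with quit().
    """
    try:
        for msgnum in sorted(set(to_delete)):   # ignore duplicate requests
            server.dele(msgnum)                 # mark for deletion at server
    finally:
        server.quit()                           # end session, unlock mailbox


class FakePopServer:
    """Hypothetical stand-in so the sketch runs without a network."""
    def __init__(self):
        self.calls = []
    def dele(self, msgnum):
        self.calls.append(('dele', msgnum))
    def quit(self):
        self.calls.append(('quit',))


queued = [4, 2, 4]          # numbers collected by "d" commands during a session
fake = FakePopServer()
delete_queued(fake, queued)
print(fake.calls)
```

Because nothing is sent to the server until the end, the message numbers shown in the already loaded index stay valid for the whole session.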
Because pymail
downloads
mail from your server into a local Python list only once at startup,
though, we need to start pymail
again to refetch mail from the server if we want to see the result
of the mail we sent and the deletion we made. Here, our new mail
shows up at the end as new number 6, and the original mail assigned
number 4 in the prior session is gone:
C:\...\PP4E\Internet\Email>pymail.py
Password for pop.secureserver.net?
[Pymail email client]
Connecting...
b'+OK <[email protected]>'
(b'+OK ', [b'1 1860', b'2 1408', b'3 1049', b'4 1038', b'5 957', b'6 1037'], 47)
There are 6 mail messages in 7349 bytes
Retrieving...
1:      1861 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
        Subject =>I'm a Lumberjack, and I'm Okay
2:      1409 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Wed, 05 May 2010 08:33:47 -0700
        Subject =>testing
3:      1050 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:11:07 -0000
        Subject =>A B C D E F G
4:      1039 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Thu, 06 May 2010 14:32:32 -0000
        Subject =>a b c d e f g
5:      958 bytes
        From    =>[email protected]
        To      =>maillist
        Date    =>Thu, 06 May 2010 10:58:40 -0400
        Subject =>test interactive smtplib
[Press Enter key]
6:      1038 bytes
        From    =>[email protected]
        To      =>[email protected]
        Date    =>Fri, 07 May 2010 20:32:38 -0000
        Subject =>Among our weapons are these
[Pymail] Action? (i, l, d, s, m, q, ?)l 6
-------------------------------------------------------------------------------
Nobody Expects the Spanish Inquisition!
-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)q
Bye.
Though not shown in this session, you can also send to multiple recipients, and include full name and address pairs in your email addresses. This works because the script uses the email utilities described earlier to split and fully parse address lists, so commas may appear both as separators between addresses and as characters within quoted names. The following, for example, would send to two and three recipients, respectively, using mostly full address formats:
[Pymail] Action? (i, l, d, s, m, q, ?)m
From?"moi 1" <[email protected]>
To?"pp 4e" <[email protected]>, "lu,tz" <[email protected]>
[Pymail] Action? (i, l, d, s, m, q, ?)m
From?The Book <[email protected]>
To?"pp 4e" <[email protected]>, "lu,tz" <[email protected]>,
[email protected]
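To make the address-handling point concrete, here is a small sketch of the standard library’s email.utils.getaddresses call splitting a recipient line of the same shape as those above. The addresses are made-up placeholders of mine, not the ones used in the session.

```python
from email.utils import getaddresses, formataddr

# one To line: commas separate addresses, but one also appears inside a name
to_line = ('"pp 4e" <pp4e@example.com>, "lu,tz" <lutz@example.org>, '
           'bare@example.net')

pairs = getaddresses([to_line])          # list of (name, addr) tuples
for name, addr in pairs:
    print(repr(name), '=>', addr)

# rebuild a header-ready string; formataddr re-quotes names that need it
rebuilt = ', '.join(formataddr(pair) for pair in pairs)
print(rebuilt)
```

Splitting the same line with a plain str.split(',') would instead cut the quoted "lu,tz" name in two, which is why a full parse is used.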
Finally, if you are running this live, you will also find the mail save file on your machine, containing the one message we asked to be saved in the prior session; it’s simply the raw text of saved emails, with separator lines. This is both human- and machine-readable: in principle, another script could load saved mail from this file into a Python list by calling the string object’s split method on the file’s text with the separator line as a delimiter. As shown in this book, it shows up in file C:\temp\savemail.txt, but you can configure this as you like in the mailconfig module.
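As a rough sketch of that idea, the following splits simulated save-file text on a separator line. The separator string and the sample mail text here are assumptions for illustration only; a real script would read the text from the save file named in mailconfig and use whatever separator the saving program actually wrote.

```python
separator = ('=' * 80) + 'PY\n'      # assumed separator line, for illustration

# simulate a save file's text: each saved mail is preceded by the separator
saved_text = (separator + 'From: a@example.com\n\nfirst saved mail\n' +
              separator + 'From: b@example.com\n\nsecond saved mail\n')

# split on the separator; the text before the first separator is empty
mails = [part for part in saved_text.split(separator) if part]
print(len(mails), 'mails loaded')
```

Each item in the resulting list is the full text of one saved message, ready to be passed to a parser or displayed.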
The email package used by the pymail example of the prior section is a collection of powerful tools; in fact, perhaps too powerful to remember completely. At the minimum, some reusable boilerplate code for common use cases can help insulate you from some of its details; by isolating module usage, such code can also ease the migration to possible future email changes. To simplify email interfacing for more complex mail clients, and to further demonstrate the use of standard library email tools, I developed the custom utility modules listed in this section: a package called mailtools.
mailtools is a Python module package: a directory of code, with one module per tool class, and an initialization module run when the directory is first imported. This package’s modules are essentially just a wrapper layer above the standard library’s email package, as well as its poplib and smtplib modules. They make some assumptions about the way email is to be used, but they are reasonable and allow us to forget some of the underlying complexity of the standard library tools employed.
In a nutshell, the mailtools
package provides three classes—to fetch, send, and parse email
messages. These classes can be used as
superclasses in order to mix in their methods to
an application-specific class, or as standalone
or embedded objects that export their methods for
direct calls. We’ll see these classes deployed both ways in this
text.
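The two deployment modes can be sketched with a stand-in class. Everything named here (DemoFetcher, MixinClient, EmbeddingClient) is hypothetical and exists only to show the coding patterns; the real mailtools classes appear in the listings of this section.

```python
class DemoFetcher:                          # hypothetical stand-in tool class
    def download(self, count):
        return ['mail %d' % i for i in range(1, count + 1)]

# 1) mix-in: the tool's methods become methods of the application class
class MixinClient(DemoFetcher):
    def show(self):
        return ', '.join(self.download(2))          # inherited, called via self

# 2) embedding: the application holds a tool object and calls it directly
class EmbeddingClient:
    def __init__(self):
        self.fetcher = DemoFetcher()
    def show(self):
        return ', '.join(self.fetcher.download(2))  # delegated to the object

print(MixinClient().show())
print(EmbeddingClient().show())
```

Mixing in keeps calls short and lets the application redefine tool methods in its own class; embedding keeps the tool’s names out of the application’s namespace.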
As a simple example of this package’s tools in action, its
selftest.py
module serves as a self-test script. When run, it sends
a message from you, to you, which includes the
selftest.py file as an attachment. It also
fetches and displays some mail headers and parsed and unparsed
content. These interfaces, along with some user-interface magic, will
lead us to full-blown email clients and websites in later
chapters.
Two design notes worth mentioning up front: First, none of the code in this package knows anything about the user interface it will be used in (console, GUI, web, or other) or does anything about things like threads; it is just a toolkit. As we’ll see, its clients are responsible for deciding how it will be deployed. By focusing on just email processing here, we simplify the code, as well as the programs that will use it.
Second, each of the main modules in this package illustrates Unicode issues that confront Python 3.X code, especially when using the 3.1 Python email package:
The sender must address encodings for the main message text, attachment input files, saved-mail output files, and message headers.
The fetcher must resolve full mail text encodings when new mails are fetched.
The parser must deal with encodings in text part payloads of parsed messages, as well as those in message headers.
In addition, the sender must provide workarounds for the binary
parts generation and text part creation issues in email
described earlier in this chapter.
Since these highlight Unicode factors in general, and might not be solved as broadly as they could be due to limitations of the current Python email package, I’ll elaborate on each of these choices along the way.
The next few sections list mailtools
source code. Together, its files
consist of roughly 1,050 lines of code, including whitespace and
comments. We won’t cover all of this package’s code in depth—study its
listings for more details, and see its self-test module for a usage
example. Also, for more context and examples, watch for the three
clients that will use this package—the modified pymail2.py
following this listing, the
PyMailGUI client in Chapter 14, and the PyMailCGI server in Chapter 16. By sharing and reusing this module,
all three systems inherit all its utility, as well as any future
enhancements.
The module in Example 13-21 implements the initialization logic of the mailtools package; as usual, its code is run automatically the first time a script imports through the package’s directory. Notice how this file collects the contents of all the nested modules into the directory’s namespace with from * statements; because mailtools began life as a single .py file, this provides backward compatibility for existing clients. We also must use package-relative import syntax here (from .module), because Python 3.X no longer includes the package’s own directory on the module import search path (only the package’s container is on the path). Since this is the root module, global comments appear here as well.
"""
##################################################################################
mailtools package: interface to mail server transfers, used by pymail2, PyMailGUI,
and PyMailCGI;  does loads, sends, parsing, composing, and deleting, with part
attachments, encodings (of both the email and Unicode kind), etc.;  the parser,
fetcher, and sender classes here are designed to be mixed-in to subclasses which
use their methods, or used as embedded or standalone objects;

this package also includes convenience subclasses for silent mode, and more;
loads all mail text if pop server doesn't do top;  doesn't handle threads or UI
here, and allows askPassword to differ per subclass;  progress callback funcs get
status;  all calls raise exceptions on error--client must handle in GUI/other;
this changed from file to package: nested modules imported here for bw compat;

4E: need to use package-relative import syntax throughout, because in Py 3.X
package dir is no longer on module import search path if package is imported
elsewhere (from another directory which uses this package);  also performs
Unicode decoding on mail text when fetched (see mailFetcher), as well as for
some text part payloads which might have been email-encoded (see mailParser);

TBD: in saveparts, should file be opened in text mode for text/ contypes?
TBD: in walkNamedParts, should we skip oddballs like message/delivery-status?
TBD: Unicode support has not been tested exhaustively: see Chapter 13 for more
on the Py3.1 email package and its limitations, and the policies used here;
##################################################################################
"""

# collect contents of all modules here, when package dir imported directly
from .mailFetcher import *
from .mailSender  import *                     # 4E: package-relative
from .mailParser  import *

# export nested modules here, when from mailtools import *
__all__ = 'mailFetcher', 'mailSender', 'mailParser'

# self-test code is in selftest.py to allow mailconfig's path
# to be set before running the nested module imports above
Example 13-22 contains common superclasses for the other classes in the package. This is in part meant for future expansion. At present, these are used only to enable or disable trace message output (some clients, such as web-based programs, may not want text to be printed to the output stream). Subclasses mix in the silent variant to turn off output.
"""
###############################################################################
common superclasses: used to turn trace messages on/off
###############################################################################
"""

class MailTool:                    # superclass for all mail tools
    def trace(self, message):      # redef me to disable or log to file
        print(message)

class SilentMailTool:              # to mixin instead of subclassing
    def trace(self, message):
        pass
The class used to compose and send messages is coded in Example 13-23. This module provides a convenient interface that combines standard library tools we’ve already met in this chapter: the email package to compose messages with attachments and encodings, and the smtplib module to send the resulting email text. Attachments are passed in as a list of filenames; MIME types and any required encodings are determined automatically with the module mimetypes. Moreover, date and time strings are automated with an email.utils call, and non-ASCII headers are encoded per email, MIME, and Unicode standards. Study this file’s code and comments for more on its operation.
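The automatic MIME type selection can be tried on its own with the standard library’s mimetypes.guess_type call. The sketch below is a standalone helper of my own (the name guess_content_type is hypothetical), though its fallback rule follows the same test the sender applies: no guess, or a compressed file, is treated as generic binary data.

```python
import mimetypes

def guess_content_type(filename):
    """
    Return a 'maintype/subtype' string for a filename, falling back on a
    generic binary type when nothing is guessed or the file looks compressed.
    """
    contype, encoding = mimetypes.guess_type(filename)
    if contype is None or encoding is not None:
        contype = 'application/octet-stream'
    return contype

for name in ('notes.txt', 'photo.jpg', 'archive.tar.gz', 'data.unknownext'):
    print(name, '=>', guess_content_type(name))
```

The guess is based on the filename alone; the file is never opened, so the names used here need not exist.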
This is also where we open and add attachment files, generate
message text, and save sent messages to a local file. Most
attachment files are opened in binary mode, but as we’ve seen,
some text attachments must be opened in text mode because the
current email
package requires
them to be str
strings when
message objects are created. As we also saw earlier, the email
package requires attachments to be
str
text when mail text is
later generated, possibly as the result of MIME encoding.
To satisfy these constraints with the Python 3.1 email package, we must apply the two fixes described earlier: part file open calls select between text or binary mode (and thus read str or bytes) based upon the way email will process the data, and MIME encoding calls for binary data are augmented to decode the result to ASCII text. The latter of these also splits the Base64 text into lines here for binary parts (unlike email), because it is otherwise sent as one long line, which may work in some contexts, but causes problems in some text editors if the raw text is viewed.
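The decode-and-split step can be illustrated apart from the email package. The following is a minimal sketch that uses the standard base64 module directly instead of email.encoders, so it is a variation on the idea rather than the package’s fix itself; base64_lines is a hypothetical name.

```python
import base64

def base64_lines(data, linelen=76):
    """
    Encode bytes as Base64, decode the result to an ASCII str, and
    break it into lines of at most linelen characters each.
    """
    text = base64.b64encode(data).decode('ascii')        # bytes -> str
    lines = [text[i:i + linelen] for i in range(0, len(text), linelen)]
    return '\n'.join(lines)

raw = bytes(range(256)) * 2            # stand-in for binary attachment data
encoded = base64_lines(raw)
print(encoded.splitlines()[0])         # first line of the encoded text
print(base64.b64decode(encoded) == raw)
```

The inserted newlines are harmless to receivers, because Base64 decoders skip characters outside the Base64 alphabet by default.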
Beyond these fixes, clients may optionally provide the names
of the Unicode encoding scheme associated with the main text part
and each text attachment part. In Chapter 14’s PyMailGUI, this is controlled
in the mailconfig
user settings
module, with UTF-8 used as a fallback default whenever user
settings fail to encode a text part. We could in principle also
catch part file decoding errors and return an error indicator
string (as we do for received mails in the mail fetcher ahead),
but sending an invalid attachment is much more grievous than
displaying one. Instead, the send request fails entirely on
errors.
Finally, there is also new support for encoding non-ASCII
headers (both full headers and names of email addresses) per a
client-selectable encoding that defaults to UTF-8, and the sent
message save file is opened in the same mailconfig
Unicode encoding mode used to
decode messages when they are fetched.
The latter policy for sent mail saves is used because the sent file may be opened to fetch full mail text in this encoding later by clients which apply this encoding scheme. This is intended to mirror the way that clients such as PyMailGUI save full message text in local files to be opened and parsed later. It might fail if the mail fetcher resorted to guessing a different and incompatible encoding, and it assumes that no message gives rise to incompatibly encoded data in the file across multiple sessions. We could instead keep one save file per encoding, but encodings for full message text probably will not vary; ASCII was the original standard for full mail text, so 7- or 8-bit text is likely.
"""
###############################################################################
send messages, add attachments (see __init__ for docs, test)
###############################################################################
"""

import mailconfig                                      # client's mailconfig
import smtplib, os, mimetypes                          # mime: name to type
import email.utils, email.encoders                     # date string, base64
from .mailTool import MailTool, SilentMailTool         # 4E: package-relative

from email.message          import Message             # general message, obj->text
from email.mime.multipart   import MIMEMultipart       # type-specific messages
from email.mime.audio       import MIMEAudio           # format/encode attachments
from email.mime.image       import MIMEImage
from email.mime.text        import MIMEText
from email.mime.base        import MIMEBase
from email.mime.application import MIMEApplication     # 4E: use new app class


def fix_encode_base64(msgobj):
    """
    4E: workaround for a genuine bug in Python 3.1 email package that prevents
    mail text generation for binary parts encoded with base64 or other email
    encodings;  the normal email.encoder run by the constructor leaves payload
    as bytes, even though it's encoded to base64 text form;  this breaks email
    text generation which assumes this is text and requires it to be str;  net
    effect is that only simple text part emails can be composed in Py 3.1 email
    package as is - any MIME-encoded binary part cause mail text generation to
    fail;  this bug seems likely to go away in a future Python and email package,
    in which case this should become a no-op;  see Chapter 13 for more details;
    """
    linelen = 76                         # per MIME standards
    from email.encoders import encode_base64

    encode_base64(msgobj)                # what email does normally: leaves bytes
    text = msgobj.get_payload()          # bytes fails in email pkg on text gen
    if isinstance(text, bytes):          # payload is bytes in 3.1, str in 3.2 alpha
        text = text.decode('ascii')      # decode to unicode str so text gen works

    lines = []                           # split into lines, else 1 massive line
    text = text.replace('\n', '')        # no \n present in 3.1, but futureproof me!
    while text:
        line, text = text[:linelen], text[linelen:]
        lines.append(line)
    msgobj.set_payload('\n'.join(lines))


def fix_text_required(encodingname):
    """
    4E: workaround for str/bytes combination errors in email package;  MIMEText
    requires different types for different Unicode encodings in Python 3.1, due
    to the different ways it MIME-encodes some types of text;  see Chapter 13;
    the only other alternative is using generic Message and repeating much code;
    """
    from email.charset import Charset, BASE64, QP

    charset = Charset(encodingname)      # how email knows what to do for encoding
    bodyenc = charset.body_encoding      # utf8, others require bytes input data
    return bodyenc in (None, QP)         # ascii, latin1, others require str


class MailSender(MailTool):
    """
    send mail: format a message, interface with an SMTP server;
    works on any machine with Python+Inet, doesn't use cmdline mail;
    a nonauthenticating client: see MailSenderAuth if login required;
    4E: tracesize is num chars of msg text traced: 0=none, big=all;
    4E: supports Unicode encodings for main text and text parts;
    4E: supports header encoding, both full headers and email names;
    """
    def __init__(self, smtpserver=None, tracesize=256):
        self.smtpServerName = smtpserver or mailconfig.smtpservername
        self.tracesize = tracesize

    def sendMessage(self, From, To, Subj, extrahdrs, bodytext, attaches,
                          saveMailSeparator=(('=' * 80) + 'PY\n'),
                          bodytextEncoding='us-ascii',
                          attachesEncodings=None):
        """
        format and send mail: blocks caller, thread me in a GUI;
        bodytext is main text part, attaches is list of filenames,
        extrahdrs is list of (name, value) tuples to be added;
        raises uncaught exception if send fails for any reason;
        saves sent message text in a local file if successful;

        assumes that To, Cc, Bcc hdr values are lists of 1 or more already
        decoded addresses (possibly in full name+<addr> format); client
        must parse to split these on delimiters, or use multiline input;
        note that SMTP allows full name+<addr> format in recipients;
        4E: Bcc addrs now used for send/envelope, but header is dropped;
        4E: duplicate recipients removed, else will get >1 copies of mail;
        caveat: no support for multipart/alternative mails, just /mixed;
        """

        # 4E: assume main body text is already in desired encoding;
        # clients can decode to user pick, default, or utf8 fallback;
        # either way, email needs either str xor bytes specifically;

        if fix_text_required(bodytextEncoding):
            if not isinstance(bodytext, str):
                bodytext = bodytext.decode(bodytextEncoding)
        else:
            if not isinstance(bodytext, bytes):
                bodytext = bodytext.encode(bodytextEncoding)

        # make message root
        if not attaches:
            msg = Message()
            msg.set_payload(bodytext, charset=bodytextEncoding)
        else:
            msg = MIMEMultipart()
            self.addAttachments(msg, bodytext, attaches,
                                bodytextEncoding, attachesEncodings)

        # 4E: non-ASCII hdrs encoded on sends; encode just name in address,
        # else smtp may drop the message completely; encodes all envelope
        # To names (but not addr) also, and assumes servers will allow;
        # msg.as_string retains any line breaks added by encoding headers;

        hdrenc = mailconfig.headersEncodeTo or 'utf-8'           # default=utf8
        Subj = self.encodeHeader(Subj, hdrenc)                   # full header
        From = self.encodeAddrHeader(From, hdrenc)               # email names
        To   = [self.encodeAddrHeader(T, hdrenc) for T in To]    # each recip
        Tos  = ', '.join(To)                                     # hdr+envelope

        # add headers to root
        msg['From']    = From
        msg['To']      = Tos                        # poss many: addr list
        msg['Subject'] = Subj                       # servers reject ';' sept
        msg['Date']    = email.utils.formatdate()   # curr datetime, rfc2822 utc
        recip = To
        for name, value in extrahdrs:               # Cc, Bcc, X-Mailer, etc.
            if value:
                if name.lower() not in ['cc', 'bcc']:
                    value = self.encodeHeader(value, hdrenc)
                    msg[name] = value
                else:
                    value = [self.encodeAddrHeader(V, hdrenc) for V in value]
                    recip += value                     # some servers reject ['']
                    if name.lower() != 'bcc':          # 4E: bcc gets mail, no hdr
                        msg[name] = ', '.join(value)   # add commas between cc

        recip = list(set(recip))                       # 4E: remove duplicates
        fullText = msg.as_string()                     # generate formatted msg

        # sendmail call raises except if all Tos failed,
        # or returns failed Tos dict for any that failed

        self.trace('Sending to...' + str(recip))
        self.trace(fullText[:self.tracesize])                    # SMTP calls connect
        server = smtplib.SMTP(self.smtpServerName, timeout=15)   # this may fail too
        self.getPassword()                                       # if srvr requires
        self.authenticateServer(server)                          # login in subclass
        try:
            failed = server.sendmail(From, recip, fullText)      # except or dict
        except:
            server.close()                                       # 4E: quit may hang!
            raise                                                # reraise except
        else:
            server.quit()                                        # connect + send OK
        self.saveSentMessage(fullText, saveMailSeparator)        # 4E: do this first
        if failed:
            class SomeAddrsFailed(Exception): pass
            raise SomeAddrsFailed('Failed addrs:%s\n' % failed)
        self.trace('Send exit')

    def addAttachments(self, mainmsg, bodytext, attaches,
                             bodytextEncoding, attachesEncodings):
        """
        format a multipart message with attachments;
        use Unicode encodings for text parts if passed;
        """
        # add main text/plain part
        msg = MIMEText(bodytext, _charset=bodytextEncoding)
        mainmsg.attach(msg)

        # add attachment parts
        encodings = attachesEncodings or (['us-ascii'] * len(attaches))
        for (filename, fileencode) in zip(attaches, encodings):
            # filename may be absolute or relative
            if not os.path.isfile(filename):             # skip dirs, etc.
                continue

            # guess content type from file extension, ignore encoding
            contype, encoding = mimetypes.guess_type(filename)
            if contype is None or encoding is not None:  # no guess, compressed?
                contype = 'application/octet-stream'     # use generic default
            self.trace('Adding ' + contype)

            # build sub-Message of appropriate kind
            maintype, subtype = contype.split('/', 1)
            if maintype == 'text':                       # 4E: text needs encoding
                if fix_text_required(fileencode):        # requires str or bytes
                    data = open(filename, 'r', encoding=fileencode)
                else:
                    data = open(filename, 'rb')
                msg = MIMEText(data.read(), _subtype=subtype, _charset=fileencode)
                data.close()

            elif maintype == 'image':
                data = open(filename, 'rb')              # 4E: use fix for binaries
                msg  = MIMEImage(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            elif maintype == 'audio':
                data = open(filename, 'rb')
                msg  = MIMEAudio(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            elif maintype == 'application':              # new in 4E
                data = open(filename, 'rb')
                msg  = MIMEApplication(
                       data.read(), _subtype=subtype, _encoder=fix_encode_base64)
                data.close()

            else:
                data = open(filename, 'rb')              # application/* could
                msg  = MIMEBase(maintype, subtype)       # use this code too
                msg.set_payload(data.read())
                data.close()                             # make generic type
                fix_encode_base64(msg)                   # was broken here too!
               #email.encoders.encode_base64(msg)        # encode using base64

            # set filename (ascii or utf8/mime encoded) and attach to container
            basename = self.encodeHeader(os.path.basename(filename))   # oct 2011
            msg.add_header('Content-Disposition',
                           'attachment', filename=basename)
            mainmsg.attach(msg)

        # text outside mime structure, seen by non-MIME mail readers
        mainmsg.preamble = 'A multi-part MIME format message.\n'
        mainmsg.epilogue = ''    # make sure message ends with a newline

    def saveSentMessage(self, fullText, saveMailSeparator):
        """
        append sent message to local file if send worked for any;
        client: pass separator used for your application, splits;
        caveat: user may change the file at same time (unlikely);
        """
        try:
            sentfile = open(mailconfig.sentmailfile, 'a',
                            encoding=mailconfig.fetchEncoding)   # 4E
            if fullText[-1] != '\n': fullText += '\n'
            sentfile.write(saveMailSeparator)
            sentfile.write(fullText)
            sentfile.close()
        except:
            self.trace('Could not save sent message')    # not a show-stopper

    def encodeHeader(self, headertext, unicodeencoding='utf-8'):
        """
        4E: encode composed non-ascii message headers content per both email
        and Unicode standards, according to an optional user setting or UTF-8;
        header.encode adds line breaks in header string automatically if needed;
        """
        try:
            headertext.encode('ascii')
        except:
            try:
                hdrobj = email.header.make_header([(headertext, unicodeencoding)])
                headertext = hdrobj.encode()
            except:
                pass         # auto splits into multiple cont lines if needed
        return headertext    # smtplib may fail if it won't encode to ascii

    def encodeAddrHeader(self, headertext, unicodeencoding='utf-8'):
        """
        4E: try to encode non-ASCII names in email addresses per email, MIME,
        and Unicode standards; if this fails drop name and use just addr part;
        if cannot even get addresses, try to decode as a whole, else smtplib
        may run into errors when it tries to encode the entire mail as ASCII;
        utf-8 default should work for most, as it formats code points broadly;

        inserts newlines if too long or hdr.encode split names to multiple lines,
        but this may not catch some lines longer than the cutoff (improve me);
        as used, Message.as_string formatter won't try to break lines further;
        see also decodeAddrHeader in mailParser module for the inverse of this;
        """
        try:
            pairs = email.utils.getaddresses([headertext])   # split addrs + parts
            encoded = []
            for name, addr in pairs:
                try:
                    name.encode('ascii')         # use as is if okay as ascii
                except UnicodeError:             # else try to encode name part
                    try:
                        uni  = name.encode(unicodeencoding)
                        hdr  = email.header.make_header([(uni, unicodeencoding)])
                        name = hdr.encode()
                    except:
                        name = None              # drop name, use address part only
                joined = email.utils.formataddr((name, addr))   # quote name if need
                encoded.append(joined)

            fullhdr = ', '.join(encoded)
            if len(fullhdr) > 72 or '\n' in fullhdr:     # not one short line?
                fullhdr = ',\n '.join(encoded)           # try multiple lines
            return fullhdr
        except:
            return self.encodeHeader(headertext)

    def authenticateServer(self, server):
        pass    # no login required for this server/class

    def getPassword(self):
        pass    # no login required for this server/class


################################################################################
# specialized subclasses
################################################################################

class MailSenderAuth(MailSender):
    """
    use for servers that require login authorization;
    client: choose MailSender or MailSenderAuth super
    class based on mailconfig.smtpuser setting (None?)
    """
    smtpPassword = None    # 4E: on class, not self, shared by poss N instances

    def __init__(self, smtpserver=None, smtpuser=None, tracesize=256):
        MailSender.__init__(self, smtpserver, tracesize)
        self.smtpUser = smtpuser or mailconfig.smtpuser
        #self.smtpPassword = None     # 4E: makes PyMailGUI ask for each send!

    def authenticateServer(self, server):
        server.login(self.smtpUser, self.smtpPassword)

    def getPassword(self):
        """
        get SMTP auth password if not yet known;
        may be called by superclass auto, or client manual:
        not needed until send, but don't run in GUI thread;
        get from client-side file or subclass method
        """
        if not self.smtpPassword:
            try:
                localfile = open(mailconfig.smtppasswdfile)
                MailSenderAuth.smtpPassword = localfile.readline()[:-1]    # 4E
                self.trace('local file password' + repr(self.smtpPassword))
            except:
                MailSenderAuth.smtpPassword = self.askSmtpPassword()       # 4E

    def askSmtpPassword(self):
        assert False, 'Subclass must define method'

class MailSenderAuthConsole(MailSenderAuth):
    def askSmtpPassword(self):
        import getpass
        prompt = 'Password for %s on %s?' % (self.smtpUser, self.smtpServerName)
        return getpass.getpass(prompt)

class SilentMailSender(SilentMailTool, MailSender):
    pass    # replaces trace
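To see in isolation what the header encoding step in this listing accomplishes, here is a standalone sketch built on the same email.header.make_header call. It is a simplified variation, not the package’s code: the encode_header function name and the sample subject text are mine, and the error handling is reduced to the ASCII test alone.

```python
from email.header import make_header, decode_header

def encode_header(text, encoding='utf-8'):
    """Return text as is if it is ASCII, else in MIME encoded-word form."""
    try:
        text.encode('ascii')
        return text
    except UnicodeError:
        return make_header([(text, encoding)]).encode()

plain   = encode_header('testing 1 2 3')
encoded = encode_header('Él Señor')          # non-ASCII subject text
print(plain)
print(encoded)                               # an =?utf-8?...?= encoded word

# decoding reverses the process: (bytes, charset) pairs back to a str
decoded = ''.join(part.decode(enc) for part, enc in decode_header(encoded))
print(decoded)
```

The encoded form consists only of ASCII characters, which is what allows it to travel safely in message headers.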
The class defined in Example 13-24 does the work of interfacing with a POP email server—loading, deleting, and synchronizing. This class merits a few additional words of explanation.
This module deals strictly in email text; parsing email after it has been fetched is delegated to a different module in the package. Moreover, this module doesn’t cache already loaded information; clients must add their own mail-retention tools if desired. Clients must also provide password input methods or pass one in, if they cannot use the console input subclass here (e.g., GUIs and web-based programs).
The loading and deleting tasks use the standard library poplib module in ways we saw earlier in this chapter, but notice that there are interfaces for fetching just message header text with the TOP action in POP if the mail server supports it. This can save substantial time if clients need to fetch only basic details for an email index. In addition, the header and full-text fetchers are equipped to load just mails newer than a particular number (useful once an initial load is run), and to restrict fetches to a fixed-sized set of the most recently arrived emails (useful for large inboxes with slow Internet access or servers).
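The “newer than” and “most recent N” options boil down to choosing a range of POP message numbers. The helper below is a hypothetical sketch of that arithmetic only; its name and parameters are assumptions of mine, and it does not reproduce the fetcher module’s actual logic.

```python
def message_numbers(msg_count, load_from=1, fetch_limit=None):
    """
    Return the POP message numbers to fetch: those from load_from through
    msg_count, trimmed to the last fetch_limit arrivals if a limit is given.
    """
    start = load_from
    if fetch_limit and msg_count - start + 1 > fetch_limit:
        start = msg_count - fetch_limit + 1      # keep only the newest mails
    return list(range(start, msg_count + 1))

print(message_numbers(6))                        # initial load: everything
print(message_numbers(8, load_from=7))           # later: only newly arrived
print(message_numbers(100, fetch_limit=3))       # large inbox: newest three
```

With a live connection, each number in the result would be passed to a poplib call such as top (for header text) or retr (for full text).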
This module also supports the notion of progress indicators—for methods that perform multiple downloads or deletions, callers may pass in a function that will be called as each mail is processed. This function will receive the current and total step numbers. It’s left up to the caller to render this in a GUI, console, or other user interface.
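The progress protocol is simple enough to sketch generically. In the following, download_all and console_progress are hypothetical names used for illustration; the only point is the calling pattern, in which the caller supplies a function that receives the current and total step numbers.

```python
def download_all(count, fetch_one, progress=None):
    """
    Fetch items 1..count with fetch_one(number); if a progress function
    is passed, call progress(current, total) after each item is processed.
    """
    results = []
    for number in range(1, count + 1):
        results.append(fetch_one(number))
        if progress:
            progress(number, count)
    return results

def console_progress(current, total):            # one possible renderer
    print('%d of %d' % (current, total))

seen = []                                        # a GUI might update a bar here
mails = download_all(3, lambda n: 'mail %d text' % n,
                     progress=lambda cur, tot: seen.append((cur, tot)))
download_all(2, lambda n: None, progress=console_progress)
```

Because the downloader knows nothing about how progress is displayed, the same code serves console, GUI, and web clients alike.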
Additionally, this module is where we apply the session-wide message bytes Unicode decoding policy required for parsing, as discussed earlier in this chapter. This decoding uses an encoding name user setting in the mailconfig module, followed by heuristics. Because this decoding is performed immediately when a mail is fetched, all clients of this package can assume message text is str Unicode strings, including any later parsing, display, or save operations. In addition to the mailconfig setting, we also apply a few guesses with common encoding types, though it’s not impossible that this may lead to problems if mails decoded by guessing cannot be written to mail save files using the mailconfig setting.
As described, this session-wide approach to encodings is not ideal, but it can be adjusted per client session and reflects the current limitations of email in Python 3.1: its parser requires already decoded Unicode strings, but fetches return bytes. If this decoding fails, as a last resort we attempt to decode headers only, as either ASCII (or other common format) text or the platform default, and insert an error message in the email body. This heuristic attempts to avoid killing clients with exceptions if possible (see file _test-decoding.py in the examples package for a test of this logic). In practice, an 8-bit encoding such as Latin-1 will probably suffice in most cases, because ASCII was the original requirement of email standards, and Latin-1 both includes ASCII and assigns a character to every possible byte value.
In principle, we could try to search for encoding information in message headers if it’s present, by parsing mails partially ourselves. We might then take a per-message instead of per-session approach to decoding full text, and associate an encoding type with each mail for later processing such as saves, though this raises further complications, as a save file can have just one (compatible) encoding, not one per message. Moreover, character sets in email headers may refer to individual components, not the entire email’s text. Since most mails will conform to 7- or 8-bit standards, and since a future email release will likely address this issue, extra complexity is probably not warranted for this case in this book.
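For readers on later Pythons, here is a hedged sketch of what a per-message approach might look like. It assumes Python 3.2 or later, where the email package can parse bytes directly with email.message_from_bytes; it is not part of the mailtools package, and the sample message is invented for illustration.

```python
import email

# an invented single-part message whose body is 8-bit Latin-1 text
raw = (b'From: a@example.com\r\n'
       b'Subject: test\r\n'
       b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
       b'Content-Transfer-Encoding: 8bit\r\n'
       b'\r\n'
       b'caf\xe9\r\n')

msg = email.message_from_bytes(raw)            # no up-front Unicode decoding
charset = msg.get_content_charset()            # from the Content-Type header
body = msg.get_payload(decode=True)            # still bytes at this point
text = body.decode(charset or 'latin1')        # decode per this part's charset

print(charset)
print(repr(text))
```

Even here, the charset applies to one part only; a multipart mail may name a different encoding for each of its components, which is the complication noted above.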
Also keep in mind that the Unicode decoding performed here is for the entire mail text fetched from a server. Really, this is just one part of the email encoding story in the Unicode-aware world of today. In addition:
Payloads of parsed message parts may still be returned as bytes and require special handling or further Unicode decoding (see the parser module ahead).
Text parts and attachments in composed mails impose encoding choices as well (see the sender module earlier).
Message headers have their own encoding conventions, and may be both MIME and Unicode encoded if internationalized (see both the parser and sender modules).
When you start studying this example, you’ll also notice that Example 13-24 devotes substantial code to detecting synchronization errors between an email list held by a client and the current state of the inbox at the POP email server. Normally, POP assigns relative message numbers to email in the inbox, and only adds newly arrived emails to the end of the inbox. As a result, relative message numbers from an earlier fetch may usually be used to delete and fetch in the future.
However, although rare, it is not impossible for the server’s inbox to change in ways that invalidate previously fetched message numbers. For instance, emails may be deleted in another client, and the server itself may move mails from the inbox to an undeliverable state on download errors (this may vary per ISP). In both cases, email may be removed from the middle of the inbox, throwing some prior relative message numbers out of sync with the server.
This situation can result in fetching the wrong message in an email client—users receive a different message than the one they thought they had selected. Worse, this can make deletions inaccurate—if a mail client uses a relative message number in a delete request, the wrong mail may be deleted if the inbox has changed since the index was fetched.
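The hazard is easy to see with a toy model. The sketch below uses plain Python lists to stand in for the server's inbox and a client's saved index; no real POP calls are made, and all the names are illustrative only.

```python
# server inbox: POP relative message numbers are simply positions 1..N
server_inbox = ['mail-A', 'mail-B', 'mail-C', 'mail-D']

# a client fetches its index, and the user later selects message 3
client_index = list(server_inbox)
selected = 3
expected = client_index[selected - 1]          # what the user thinks it is

# meanwhile, another client deletes message 2 from the same inbox
del server_inbox[1]

# the saved number now refers to a different mail on the server
actual = server_inbox[selected - 1]
print('expected:', expected)
print('actual:  ', actual)
```

A delete request for message 3 at this point would remove mail-D, not the mail-C the user selected, which is why the fetcher compares headers before deleting.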
To assist clients, Example 13-24 includes tools that match message headers on deletions to ensure accuracy, and that perform general inbox synchronization tests on demand. These tools are useful only to clients that retain the fetched email list as state information. We’ll use these in the PyMailGUI client in Chapter 14. There, deletions use the safe interface, and loads run the on-demand synchronization test; on detection of synchronization errors, the inbox index is automatically reloaded. For now, see Example 13-24 source code and comments for more details.
Note that the synchronization tests try a variety of matching techniques, but require the complete headers text and, in the worst case, must parse headers and match many header fields. In many cases, the single previously fetched message-id header field would be sufficient for matching against messages in the server’s inbox. However, because this field is optional and can be forged to have any value, it might not always be a reliable way to identify messages. In other words, a same-valued message-id may not suffice to guarantee a match, although it can be used to identify a mismatch; in Example 13-24, the message-id is used to rule out a match if either message has one, and they differ in value. This test is performed before falling back on slower parsing and multiple header matches.
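The asymmetric role of message-id can be isolated in a small sketch. The function below is a hypothetical reduction of just this one test; it reports a definite mismatch or defers to other checks, and never declares a match by itself.

```python
def message_id_verdict(hdrtext1, hdrtext2):
    """
    Hypothetical helper: 'mismatch' if Message-Id lines rule out a
    match, else 'unknown' (equal or missing ids prove nothing).
    """
    def ids(text):
        return [line for line in text.splitlines()
                if line[:11].lower() == 'message-id:']
    ids1, ids2 = ids(hdrtext1), ids(hdrtext2)
    if (ids1 or ids2) and ids1 != ids2:
        return 'mismatch'                  # at least one has an id; they differ
    return 'unknown'                       # fall back on other header tests

a = 'From: x@example.com\nMessage-Id: <1@example.com>\nSubject: hi'
b = 'From: x@example.com\nMessage-Id: <2@example.com>\nSubject: hi'
c = 'From: x@example.com\nSubject: hi'

print(message_id_verdict(a, b))    # mismatch
print(message_id_verdict(a, a))    # unknown
print(message_id_verdict(c, c))    # unknown
```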
"""
###############################################################################
retrieve, delete, match mail from a POP server (see __init__ for docs, test)
###############################################################################
"""

import poplib, mailconfig, sys               # client's mailconfig on sys.path
print('user:', mailconfig.popusername)       # script dir, pythonpath, changes

from .mailParser import MailParser                # for headers matching (4E: .)
from .mailTool import MailTool, SilentMailTool    # trace control supers (4E: .)

# index/server msgnum out of synch tests
class DeleteSynchError(Exception): pass           # msg out of synch in del
class TopNotSupported(Exception): pass            # can't run synch test
class MessageSynchError(Exception): pass          # index list out of sync

class MailFetcher(MailTool):
    """
    fetch mail: connect, fetch headers+mails, delete mails
    works on any machine with Python+Inet; subclass me to cache
    implemented with the POP protocol; IMAP requires new class;
    4E: handles decoding of full mail text on fetch for parser;
    """
    def __init__(self, popserver=None, popuser=None, poppswd=None, hastop=True):
        self.popServer = popserver or mailconfig.popservername
        self.popUser = popuser or mailconfig.popusername
        self.srvrHasTop = hastop
        self.popPassword = poppswd                # ask later if None

    def connect(self):
        self.trace('Connecting...')
        self.getPassword()                        # file, GUI, or console
        server = poplib.POP3(self.popServer, timeout=15)
        server.user(self.popUser)                 # connect,login POP server
        server.pass_(self.popPassword)            # pass is a reserved word
        self.trace(server.getwelcome())           # print returned greeting
        return server

    # use setting in client's mailconfig on import search path;
    # to tailor, this can be changed in class or per instance;
    fetchEncoding = mailconfig.fetchEncoding

    def decodeFullText(self, messageBytes):
        """
        4E, Py3.1: decode full fetched mail text bytes to str Unicode string;
        done at fetch, for later display or parsing (full mail text is always
        Unicode thereafter); decode with per-class or per-instance setting, or
        common types; could also try headers inspection, or intelligent guess
        from structure; in Python 3.2/3.3, this step may not be required: if
        so, change to return message line list intact; for more details see
        Chapter 13;

        an 8-bit encoding such as latin-1 will likely suffice for most emails,
        as ASCII is the original standard; this method applies to entire/full
        message text, which is really just one part of the email encoding
        story: Message payloads and Message headers may also be encoded per
        email, MIME, and Unicode standards; see Chapter 13 and mailParser and
        mailSender for more;
        """
        text = None
        kinds = [self.fetchEncoding]              # try user setting first
        kinds += ['ascii', 'latin1', 'utf8']      # then try common types
        kinds += [sys.getdefaultencoding()]       # and platform dflt (may differ)
        for kind in kinds:                        # may cause mail saves to fail
            try:
                text = [line.decode(kind) for line in messageBytes]
                break
            except (UnicodeError, LookupError):   # LookupError: bad name
                pass

        if text is None:
            # try returning headers + error msg, else except may kill client;
            # still try to decode headers per ascii, other, platform default;

            blankline = messageBytes.index(b'')
            hdrsonly = messageBytes[:blankline]
            commons = ['ascii', 'latin1', 'utf8']
            for common in commons:
                try:
                    text = [line.decode(common) for line in hdrsonly]
                    break
                except UnicodeError:
                    pass
            else:                                                  # none worked
                try:
                    text = [line.decode() for line in hdrsonly]    # platform dflt?
                except UnicodeError:
                    text = ['From: (sender of unknown Unicode format headers)']
            text += ['', '--Sorry: mailtools cannot decode this mail content!--']
        return text

    def downloadMessage(self, msgnum):
        """
        load full raw text of one mail msg, given its
        POP relative msgnum; caller must parse content
        """
        self.trace('load ' + str(msgnum))
        server = self.connect()
        try:
            resp, msglines, respsz = server.retr(msgnum)
        finally:
            server.quit()
        msglines = self.decodeFullText(msglines)  # raw bytes to Unicode str
        return '\n'.join(msglines)                # concat lines for parsing

    def downloadAllHeaders(self, progress=None, loadfrom=1):
        """
        get sizes, raw header text only, for all or new msgs
        begins loading headers from message number loadfrom
        use loadfrom to load newly arrived mails only
        use downloadMessage to get a full msg text later
        progress is a function called with (count, total);
        returns: [headers text], [mail sizes], loadedfull?

        4E: add mailconfig.fetchlimit to support large email
        inboxes: if not None, only fetches that many headers,
        and returns others as dummy/empty mail; else inboxes
        like one of mine (4K emails) are not practical to use;
        4E: pass loadfrom along to downloadAllMessages (a buglet);
        """
        if not self.srvrHasTop:                   # not all servers support TOP
            # naively load full msg text
            # (the method defined below is downloadAllMessages)
            return self.downloadAllMessages(progress, loadfrom)
        else:
            self.trace('loading headers')
            fetchlimit = mailconfig.fetchlimit
            server = self.connect()               # mbox now locked until quit
            try:
                resp, msginfos, respsz = server.list()    # 'num size' lines list
                msgCount = len(msginfos)                  # alt to srvr.stat[0]
                msginfos = msginfos[loadfrom-1:]          # drop already loadeds
                allsizes = [int(x.split()[1]) for x in msginfos]
                allhdrs = []
                for msgnum in range(loadfrom, msgCount+1):        # poss empty
                    if progress: progress(msgnum, msgCount)       # run callback
                    if fetchlimit and (msgnum <= msgCount - fetchlimit):
                        # skip, add dummy hdrs
                        hdrtext = 'Subject: --mail skipped--\n\n'
                        allhdrs.append(hdrtext)
                    else:
                        # fetch, retr hdrs only
                        resp, hdrlines, respsz = server.top(msgnum, 0)
                        hdrlines = self.decodeFullText(hdrlines)
                        allhdrs.append('\n'.join(hdrlines))
            finally:
                server.quit()                     # make sure unlock mbox
            assert len(allhdrs) == len(allsizes)
            self.trace('load headers exit')
            return allhdrs, allsizes, False

    def downloadAllMessages(self, progress=None, loadfrom=1):
        """
        load full message text for all msgs from loadfrom..N,
        despite any caching that may be being done in the caller;
        much slower than downloadAllHeaders, if just need hdrs;

        4E: support mailconfig.fetchlimit: see downloadAllHeaders;
        could use server.list() to get sizes of skipped emails here
        too, but clients probably don't care about these anyhow;
        """
        self.trace('loading full messages')
        fetchlimit = mailconfig.fetchlimit
        server = self.connect()
        try:
            (msgCount, msgBytes) = server.stat()          # inbox on server
            allmsgs = []
            allsizes = []
            for i in range(loadfrom, msgCount+1):         # empty if low >= high
                if progress: progress(i, msgCount)
                if fetchlimit and (i <= msgCount - fetchlimit):
                    # skip, add dummy mail
                    mailtext = 'Subject: --mail skipped--\n\nMail skipped.\n'
                    allmsgs.append(mailtext)
                    allsizes.append(len(mailtext))
                else:
                    # fetch, retr full mail
                    (resp, message, respsz) = server.retr(i)   # save text on list
                    message = self.decodeFullText(message)
                    allmsgs.append('\n'.join(message))         # leave mail on server
                    allsizes.append(respsz)                    # diff from len(msg)
        finally:
            server.quit()                                      # unlock the mail box
        assert len(allmsgs) == (msgCount - loadfrom) + 1       # msg nums start at 1
        #assert sum(allsizes) == msgBytes                      # not if loadfrom > 1
        return allmsgs, allsizes, True                         # not if fetchlimit

    def deleteMessages(self, msgnums, progress=None):
        """
        delete multiple msgs off server; assumes email inbox
        unchanged since msgnums were last determined/loaded;
        use if msg headers not available as state information;
        fast, but poss dangerous: see deleteMessagesSafely
        """
        self.trace('deleting mails')
        server = self.connect()
        try:
            for (ix, msgnum) in enumerate(msgnums):   # don't reconnect for each
                if progress: progress(ix+1, len(msgnums))
                server.dele(msgnum)
        finally:                                      # changes msgnums: reload
            server.quit()

    def deleteMessagesSafely(self, msgnums, synchHeaders, progress=None):
        """
        delete multiple msgs off server, but use TOP fetches to
        check for a match on each msg's header part before deleting;
        assumes the email server supports the TOP interface of POP,
        else raises TopNotSupported - client may call deleteMessages;

        use if the mail server might change the inbox since the email
        index was last fetched, thereby changing POP relative message
        numbers; this can happen if email is deleted in a different
        client; some ISPs may also move a mail from inbox to the
        undeliverable box in response to a failed download;

        synchHeaders must be a list of already loaded mail hdrs text,
        corresponding to selected msgnums (requires state); raises
        exception if any out of synch with the email server; inbox is
        locked until quit, so it should not change between TOP check
        and actual delete: synch check must occur here, not in caller;
        may be enough to call checkSynchError+deleteMessages, but check
        each msg here in case deletes and inserts in middle of inbox;
        """
        if not self.srvrHasTop:
            raise TopNotSupported('Safe delete cancelled')

        self.trace('deleting mails safely')
        errmsg = 'Message %s out of synch with server.\n'
        errmsg += 'Delete terminated at this message.\n'
        errmsg += 'Mail client may require restart or reload.'

        server = self.connect()                       # locks inbox till quit
        try:                                          # don't reconnect for each
            (msgCount, msgBytes) = server.stat()      # inbox size on server
            for (ix, msgnum) in enumerate(msgnums):
                if progress: progress(ix+1, len(msgnums))
                if msgnum > msgCount:                            # msgs deleted
                    raise DeleteSynchError(errmsg % msgnum)
                resp, hdrlines, respsz = server.top(msgnum, 0)   # hdrs only
                hdrlines = self.decodeFullText(hdrlines)
                msghdrs = '\n'.join(hdrlines)
                if not self.headersMatch(msghdrs, synchHeaders[msgnum-1]):
                    raise DeleteSynchError(errmsg % msgnum)
                else:
                    server.dele(msgnum)               # safe to delete this msg
        finally:                                      # changes msgnums: reload
            server.quit()                             # unlock inbox on way out

    def checkSynchError(self, synchHeaders):
        """
        check to see if already loaded hdrs text in synchHeaders
        list matches what is on the server, using the TOP command in
        POP to fetch headers text; use if inbox can change due to
        deletes in other client, or automatic action by email server;
        raises except if out of synch, or error while talking to server;

        for speed, only checks last in last: this catches inbox deletes,
        but assumes server won't insert before last (true for incoming
        mails); check inbox size first: smaller if just deletes; else
        top will differ if deletes and newly arrived messages added at
        end; result valid only when run: inbox may change after return;
        """
        self.trace('synch check')
        errormsg = 'Message index out of synch with mail server.\n'
        errormsg += 'Mail client may require restart or reload.'
        server = self.connect()
        try:
            lastmsgnum = len(synchHeaders)                       # 1..N
            (msgCount, msgBytes) = server.stat()                 # inbox size
            if lastmsgnum > msgCount:                            # fewer now?
                raise MessageSynchError(errormsg)                # none to cmp
            if self.srvrHasTop:
                resp, hdrlines, respsz = server.top(lastmsgnum, 0)   # hdrs only
                hdrlines = self.decodeFullText(hdrlines)
                lastmsghdrs = '\n'.join(hdrlines)
                if not self.headersMatch(lastmsghdrs, synchHeaders[-1]):
                    raise MessageSynchError(errormsg)
        finally:
            server.quit()

    def headersMatch(self, hdrtext1, hdrtext2):
        """
        may not be as simple as a string compare: some servers add
        a "Status:" header that changes over time; on one ISP, it
        begins as "Status: U" (unread), and changes to "Status: RO"
        (read, old) after fetched once - throws off synch tests if
        new when index fetched, but have been fetched once before
        delete or last-message check; "Message-id:" line is unique
        per message in theory, but optional, and can be anything if
        forged; match more common: try first; parsing costly: try last
        """
        # try match by simple string compare
        if hdrtext1 == hdrtext2:
            self.trace('Same headers text')
            return True

        # try match without status lines
        split1 = hdrtext1.splitlines()       # s.split('\n'), but no final ''
        split2 = hdrtext2.splitlines()
        strip1 = [line for line in split1 if not line.startswith('Status:')]
        strip2 = [line for line in split2 if not line.startswith('Status:')]
        if strip1 == strip2:
            self.trace('Same without Status')
            return True

        # try mismatch by message-id headers if either has one
        msgid1 = [line for line in split1 if line[:11].lower() == 'message-id:']
        msgid2 = [line for line in split2 if line[:11].lower() == 'message-id:']
        if (msgid1 or msgid2) and (msgid1 != msgid2):
            self.trace('Different Message-Id')
            return False

        # try full hdr parse and common headers if msgid missing or trash
        tryheaders = ('From', 'To', 'Subject', 'Date')
        tryheaders += ('Cc', 'Return-Path', 'Received')
        msg1 = MailParser().parseHeaders(hdrtext1)
        msg2 = MailParser().parseHeaders(hdrtext2)
        for hdr in tryheaders:                          # poss multiple Received
            if msg1.get_all(hdr) != msg2.get_all(hdr):  # case insens, dflt None
                self.trace('Diff common headers')
                return False

        # all common hdrs match and don't have a diff message-id
        self.trace('Same common headers')
        return True

    def getPassword(self):
        """
        get POP password if not yet known
        not required until go to server
        from client-side file or subclass method
        """
        if not self.popPassword:
            try:
                localfile = open(mailconfig.poppasswdfile)
                self.popPassword = localfile.readline()[:-1]
                self.trace('local file password' + repr(self.popPassword))
            except:
                self.popPassword = self.askPopPassword()

    def askPopPassword(self):
        assert False, 'Subclass must define method'

################################################################################
# specialized subclasses
################################################################################

class MailFetcherConsole(MailFetcher):
    def askPopPassword(self):
        import getpass
        prompt = 'Password for %s on %s?' % (self.popUser, self.popServer)
        return getpass.getpass(prompt)

class SilentMailFetcher(SilentMailTool, MailFetcher):
    pass   # replaces trace
Example 13-25 implements the last major class in the mailtools package: given the (already decoded) text of an email message, its tools parse the mail’s content into a message object, with headers and decoded parts. This module is largely just a wrapper around the standard library’s email package, but it adds convenience tools, including finding the main text part of a message, filename generation for message parts, saving attached parts to files, decoding headers, splitting address lists, and so on. See the code for more information. Also notice the parts walker here: by coding its search logic in one place as a generator function, we guarantee that all three of its clients here, as well as any others elsewhere, implement the same traversal.
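To see why a shared generator helps, consider the reduced sketch below. It is not the package's walker: walk_leaf_parts is a hypothetical name, it skips only multipart containers, and the test message is built on the spot with the standard library's email.mime classes.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def walk_leaf_parts(message):
    """
    Hypothetical reduced walker: yield (index, content type, part) for
    each non-container part; index counts skipped containers too.
    """
    for ix, part in enumerate(message.walk()):     # walk yields message first
        if part.get_content_maintype() == 'multipart':
            continue                               # container: no payload text
        yield ix, part.get_content_type(), part

# build a two-part message to traverse
msg = MIMEMultipart()
msg.attach(MIMEText('hello', 'plain'))
msg.attach(MIMEText('<p>hello</p>', 'html'))

# two different "clients" of the same traversal logic
types = [ctype for (ix, ctype, part) in walk_leaf_parts(msg)]
names = ['part-%03d' % ix for (ix, ctype, part) in walk_leaf_parts(msg)]
print(types)
print(names)
```

Because both list comprehensions consume the same generator, a change to the skip rules is made once and is picked up by every caller.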
This module also provides support for decoding message headers per email standards (both full headers and names in address headers), and handles decoding per text part encodings. Headers are decoded according to their content, using tools in the email package; the headers themselves give their MIME and Unicode encodings, so no user intervention is required. For client convenience, we also perform Unicode decoding for main text parts to convert them from bytes to str here if needed.
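As a brief illustration of content-driven header decoding, the sketch below leans on two standard library calls, email.header.decode_header and email.header.make_header. It is a shortcut for demonstration, not the approach taken in the listing that follows (which joins decoded parts manually); the function name decode_hdr is hypothetical.

```python
from email.header import decode_header, make_header

def decode_hdr(rawheader):
    """
    Hypothetical helper: decode an internationalized header per the
    encoding names embedded in it; return it unchanged on any failure.
    """
    try:
        return str(make_header(decode_header(rawheader)))
    except Exception:
        return rawheader                   # punt: show the raw text instead

print(decode_hdr('=?UTF-8?Q?Introducing=20Top=20Values?='))
print(decode_hdr('A plain ASCII subject'))
```

The encoded example names its own character set (UTF-8) and MIME encoding (Q, for quoted-printable style), which is why no user setting is needed for headers.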
The latter main-text decoding merits elaboration. As discussed earlier in this chapter, Message objects (main or attached) may return their payloads as bytes if we fetch with a decode=1 argument, or if they are bytes to begin with; in other cases, payloads may be returned as str. We generally need to decode bytes in order to treat payloads as text.
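The difference between the two fetch modes is easiest to see on a small example. The sketch below builds a UTF-8 text part with the standard library's MIMEText class (which applies Base64 encoding for this character set) and then fetches its payload both ways; the variable names are illustrative only.

```python
from email.mime.text import MIMEText

part = MIMEText('caf\xe9', 'plain', 'utf-8')   # non-ASCII text, MIME encoded

encoded = part.get_payload()                   # str: still Base64 text
data = part.get_payload(decode=True)           # bytes: MIME encoding undone
text = data.decode(part.get_content_charset()) # str: Unicode decoding applied

print(type(encoded).__name__, repr(encoded))
print(type(data).__name__, repr(data))
print(type(text).__name__, repr(text))
```

Two separate decoding steps are at work here: decode=True (the same as decode=1) reverses the MIME transfer encoding, and bytes.decode then applies the character set named in the part's headers.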
In mailtools itself, str text part payloads are automatically encoded to bytes by decode=1 and then saved to binary-mode files to finesse encoding issues, but main-text payloads are decoded to str if they are bytes. This main-text decoding is performed per the encoding name in the part’s message header (if present and correct), the platform default, or a guess. As we learned in Chapter 9, while GUIs may allow bytes for display, str text generally provides broader Unicode support; furthermore, str is sometimes needed for later processing such as line wrapping and webpage generation.
Since this package can’t predict the role of other part payloads besides the main text, clients are responsible for decoding and encoding as necessary. For instance, other text parts which are saved in binary mode here may require that message headers be consulted later to extract Unicode encoding names for better display. For example, Chapter 14’s PyMailGUI will proceed this way to open text parts on demand, passing message header encoding information on to PyEdit for decoding as text is loaded.
Some of the to-text conversions performed here are potentially partial solutions (some parts may lack the required headers and fail per the platform defaults) and may need to be improved; since this seems likely to be addressed in a future release of Python’s email package, we’ll settle for our assumptions here.
""" ############################################################################### parsing and attachment extract, analyse, save (see __init__ for docs, test) ############################################################################### """ import os, mimetypes, sys # mime: map type to name import email.parser # parse text to Message object import email.header # 4E: headers decode/encode import email.utils # 4E: addr header parse/decode from email.message import Message # Message may be traversed from .mailTool import MailTool # 4E: package-relative class MailParser(MailTool): """ methods for parsing message text, attachments subtle thing: Message object payloads are either a simple string for non-multipart messages, or a list of Message objects if multipart (possibly nested); we don't need to distinguish between the two cases here, because the Message walk generator always returns self first, and so works fine on non-multipart messages too (a single object is walked); for simple messages, the message body is always considered here to be the sole part of the mail; for multipart messages, the parts list includes the main message text, as well as all attachments; this allows simple messages not of type text to be handled like attachments in a UI (e.g., saved, opened); Message payload may also be None for some oddball part types; 4E note: in Py 3.1, text part payloads are returned as bytes for decode=1, and might be str otherwise; in mailtools, text is stored as bytes for file saves, but main-text bytes payloads are decoded to Unicode str per mail header info or platform default+guess; clients may need to convert other payloads: PyMailGUI uses headers to decode parts saved to binary files; 4E supports fetched message header auto-decoding per its own content, both for general headers such as Subject, as well as for names in address header such as From and To; client must request this after parse, before display: parser doesn't decode; """ def walkNamedParts(self, 
message): """ generator to avoid repeating part naming logic; skips multipart headers, makes part filenames; message is already parsed email.message.Message object; doesn't skip oddball types: payload may be None, must handle in part saves; some others may warrant skips too; """ for (ix, part) in enumerate(message.walk()): # walk includes message fulltype = part.get_content_type() # ix includes parts skipped maintype = part.get_content_maintype() if maintype == 'multipart': # multipart/*: container continue elif fulltype == 'message/rfc822': # 4E: skip message/rfc822 continue # skip all message/* too? else: filename, contype = self.partName(part, ix) yield (filename, contype, part) def partName(self, part, ix): """ extract filename and content type from message part; filename: tries Content-Disposition, then Content-Type name param, or generates one based on mimetype guess; """ filename = part.get_filename() # filename in msg hdrs? contype = part.get_content_type() # lowercase maintype/subtype if not filename: filename = part.get_param('name') # try content-type name if not filename: if contype == 'text/plain': # hardcode plain text ext ext = '.txt' # else guesses .ksh! 
else: ext = mimetypes.guess_extension(contype) if not ext: ext = '.bin' # use a generic default filename = 'part-%03d%s' % (ix, ext) return (self.decodeHeader(filename), contype) # oct 2011: decode i18n fnames def saveParts(self, savedir, message): """ store all parts of a message as files in a local directory; returns [('maintype/subtype', 'filename')] list for use by callers, but does not open any parts or attachments here; get_payload decodes base64, quoted-printable, uuencoded data; mail parser may give us a None payload for oddball types we probably should skip over: convert to str here to be safe; """ if not os.path.exists(savedir): os.mkdir(savedir) partfiles = [] for (filename, contype, part) in self.walkNamedParts(message): fullname = os.path.join(savedir, filename) fileobj = open(fullname, 'wb') # use binary mode content = part.get_payload(decode=1) # decode base64,qp,uu if not isinstance(content, bytes): # 4E: need bytes for rb content = b'(no content)' # decode=1 returns bytes, fileobj.write(content) # but some payloads None fileobj.close() # 4E: not str(content) partfiles.append((contype, fullname)) # for caller to open return partfiles def saveOnePart(self, savedir, partname, message): """ ditto, but find and save just one part by name """ if not os.path.exists(savedir): os.mkdir(savedir) fullname = os.path.join(savedir, partname) (contype, content) = self.findOnePart(partname, message) if not isinstance(content, bytes): # 4E: need bytes for rb content = b'(no content)' # decode=1 returns bytes, open(fullname, 'wb').write(content) # but some payloads None return (contype, fullname) # 4E: not str(content) def partsList(self, message): """" return a list of filenames for all parts of an already parsed message, using same filename logic as saveParts, but do not store the part files here """ validParts = self.walkNamedParts(message) return [filename for (filename, contype, part) in validParts] def findOnePart(self, partname, message): """ find and return 
part's content, given its name; intended to be used in conjunction with partsList; we could also mimetypes.guess_type(partname) here; we could also avoid this search by saving in dict; 4E: content may be str or bytes--convert as needed; """ for (filename, contype, part) in self.walkNamedParts(message): if filename == partname: content = part.get_payload(decode=1) # does base64,qp,uu return (contype, content) # may be bytes text def decodedPayload(self, part, asStr=True): """ 4E: decode text part bytes to Unicode str for display, line wrap, etc.; part is a Message; (decode=1) undoes MIME email encodings (base64, uuencode, qp), bytes.decode() performs additional Unicode text string decodings; tries charset encoding name in message headers first (if present, and accurate), then tries platform defaults and a few guesses before giving up with error string; """ payload = part.get_payload(decode=1) # payload may be bytes if asStr and isinstance(payload, bytes): # decode=1 returns bytes tries = [] enchdr = part.get_content_charset() # try msg headers first! if enchdr: tries += [enchdr] # try headers first tries += [sys.getdefaultencoding()] # same as bytes.decode() tries += ['latin1', 'utf8'] # try 8-bit, incl ascii for trie in tries: # try utf8 (windows dflt) try: payload = payload.decode(trie) # give it a shot, eh? 
break except (UnicodeError, LookupError): # lookuperr: bad name pass else: payload = '--Sorry: cannot decode Unicode text--' return payload def findMainText(self, message, asStr=True): """ for text-oriented clients, return first text part's str; for the payload of a simple message, or all parts of a multipart message, looks for text/plain, then text/html, then text/*, before deducing that there is no text to display; this is a heuristic, but covers most simple, multipart/alternative, and multipart/mixed messages; content-type defaults to text/plain if not in simple msg; handles message nesting at top level by walking instead of list scans; if non-multipart but type is text/html, returns the HTML as the text with an HTML type: caller may open in web browser, extract plain text, etc; if nonmultipart and not text, there is no text to display: save/open message content in UI; caveat: does not try to concatenate multiple inline text/plain parts if any; 4E: text payloads may be bytes--decodes to str here; 4E: asStr=False to get raw bytes for HTML file saves; """ # try to find a plain text for part in message.walk(): # walk visits message type = part.get_content_type() # if nonmultipart if type == 'text/plain': # may be base64,qp,uu return type, self.decodedPayload(part, asStr) # bytes to str too? 
# try to find an HTML part for part in message.walk(): type = part.get_content_type() # caller renders html if type == 'text/html': return type, self.decodedPayload(part, asStr) # try any other text type, including XML for part in message.walk(): if part.get_content_maintype() == 'text': return part.get_content_type(), self.decodedPayload(part, asStr) # punt: could use first part, but it's not marked as text failtext = '[No text to display]' if asStr else b'[No text to display]' return 'text/plain', failtext def decodeHeader(self, rawheader): """ 4E: decode existing i18n message header text per both email and Unicode standards, according to its content; return as is if unencoded or fails; client must call this to display: parsed Message object does not decode; i18n header example: '=?UTF-8?Q?Introducing=20Top=20Values=20..Savers?='; i18n header example: 'Man where did you get that =?UTF-8?Q?assistant=3F?='; decode_header handles any line breaks in header string automatically, may return multiple parts if any substrings of hdr are encoded, and returns all bytes in parts list if any encodings found (with unencoded parts encoded as raw-unicode-escape and enc=None) but returns a single part with enc=None that is str instead of bytes in Py3.1 if the entire header is unencoded (must handle mixed types here); see Chapter 13 for more details/examples; the following first attempt code was okay unless any encoded substrings, or enc was returned as None (raised except which returned rawheader unchanged): hdr, enc = email.header.decode_header(rawheader)[0] return hdr.decode(enc) # fails if enc=None: no encoding or encoded substrs """ try: parts = email.header.decode_header(rawheader) decoded = [] for (part, enc) in parts: # for all substrings if enc == None: # part unencoded? 
                    if not isinstance(part, bytes):        # str: full hdr unencoded
                        decoded += [part]                  # else do unicode decode
                    else:
                        decoded += [part.decode('raw-unicode-escape')]
                else:
                    decoded += [part.decode(enc)]
            return ' '.join(decoded)
        except:
            return rawheader         # punt!

    def decodeAddrHeader(self, rawheader):
        """
        4E: decode existing i18n address header text per email and Unicode,
        according to its content; must parse out first part of email address
        to get i18n part: '"=?UTF-8?Q?Walmart?=" <[email protected]>';
        From will probably have just 1 addr, but To, Cc, Bcc may have many;
        decodeHeader handles nested encoded substrings within an entire hdr,
        but we can't simply call it for entire hdr here because it fails if
        encoded name substring ends in " quote instead of whitespace or endstr;
        see also encodeAddrHeader in mailSender module for the inverse of this;

        the following first attempt code failed to handle encoded substrings in
        name, and raised exc for unencoded bytes parts if any encoded substrings;
            namebytes, nameenc = email.header.decode_header(name)[0]  (do email+MIME)
            if nameenc: name = namebytes.decode(nameenc)              (do Unicode?)
        """
        try:
            pairs = email.utils.getaddresses([rawheader])  # split addrs and parts
            decoded = []                                   # handles name commas
            for (name, addr) in pairs:
                try:
                    name = self.decodeHeader(name)             # email+MIME+Uni
                except:
                    name = None   # but uses encoded name if exc in decodeHeader
                joined = email.utils.formataddr((name, addr))  # join parts
                decoded.append(joined)
            return ', '.join(decoded)                          # >= 1 addrs
        except:
            return self.decodeHeader(rawheader)    # try decoding entire string

    def splitAddresses(self, field):
        """
        4E: use comma separator for multiple addrs in the UI, and
        getaddresses to split correctly and allow for comma in the name
        parts of addresses; used by PyMailGUI to split To, Cc, Bcc as
        needed for user inputs and copied headers; returns empty list
        if field is empty, or any exception occurs;
        """
        try:
            pairs = email.utils.getaddresses([field])                # [(name,addr)]
            return [email.utils.formataddr(pair) for pair in pairs]  # [name <addr>]
        except:
            return []   # empty list, per docstring: syntax error in field?, etc.

    # returned when parses fail
    errorMessage = Message()
    errorMessage.set_payload('[Unable to parse message - format error]')

    def parseHeaders(self, mailtext):
        """
        parse headers only, return root email.message.Message object
        stops after headers parsed, even if nothing else follows (top)
        email.message.Message object is a mapping for mail header fields
        payload of message object is None, not raw body text
        """
        try:
            return email.parser.Parser().parsestr(mailtext, headersonly=True)
        except:
            return self.errorMessage

    def parseMessage(self, fulltext):
        """
        parse entire message, return root email.message.Message object
        payload of message object is a string if not is_multipart()
        payload of message object is more Messages if multiple parts
        the call here same as calling email.message_from_string()
        """
        try:
            return email.parser.Parser().parsestr(fulltext)       # may fail!
        except:
            return self.errorMessage     # or let call handle? can check return

    def parseMessageRaw(self, fulltext):
        """
        parse headers only, return root email.message.Message object
        stops after headers parsed, for efficiency (not yet used here)
        payload of message object is raw text of mail after headers
        """
        try:
            return email.parser.HeaderParser().parsestr(fulltext)
        except:
            return self.errorMessage
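The split-then-decode order used by decodeAddrHeader can be tried in isolation. The following is only a minimal sketch built on the standard email.header and email.utils tools, not the package's own code: the function names decode_header_text and decode_addr_header are hypothetical, and the example.com addresses are placeholders rather than mailboxes from this chapter.

```python
import email.header
import email.utils

def decode_header_text(raw):
    """Decode MIME-encoded words in a header string; return raw text on failure."""
    try:
        decoded = []
        for part, enc in email.header.decode_header(raw):
            if isinstance(part, str):          # whole header was unencoded
                decoded.append(part)
            else:                              # bytes: use named charset if any
                decoded.append(part.decode(enc or 'raw-unicode-escape'))
        return ' '.join(decoded)
    except Exception:
        return raw

def decode_addr_header(raw):
    """Split an address header first, then decode only each name part."""
    pairs = email.utils.getaddresses([raw])    # copes with commas inside names
    return ', '.join(
        email.utils.formataddr((decode_header_text(name), addr))
        for name, addr in pairs)

# placeholder addresses, assumed for illustration only
single = decode_addr_header('"=?UTF-8?Q?Walmart?=" <news@example.com>')
several = decode_addr_header('"Smith, Bob" <bob@example.com>, sue@example.com')
print(single)
print(several)
```

Splitting before decoding is the point of the exercise: the quoted name is handed to the decoder without its surrounding quotes, and a comma inside a name is not mistaken for an address separator.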
The last file in the mailtools package, Example 13-26, lists the self-test code for the package. This code is a separate script file so that it can manipulate the import search path: it emulates a real client, which is assumed to have a mailconfig.py module in its own source directory (this module can vary per client).

"""
###############################################################################
self-test when this file is run as a program
###############################################################################
"""

#
# mailconfig normally comes from the client's source directory or
# sys.path; for testing, get it from Email directory one level up
#
import sys
sys.path.append('..')
import mailconfig
print('config:', mailconfig.__file__)

# get these from __init__
from mailtools import (MailFetcherConsole,
                       MailSender, MailSenderAuthConsole,
                       MailParser)

if not mailconfig.smtpuser:
    sender = MailSender(tracesize=5000)
else:
    sender = MailSenderAuthConsole(tracesize=5000)

sender.sendMessage(From      = mailconfig.myaddress,
                   To        = [mailconfig.myaddress],
                   Subj      = 'testing mailtools package',
                   extrahdrs = [('X-Mailer', 'mailtools')],
                   bodytext  = 'Here is my source code\n',
                   attaches  = ['selftest.py'],
                  )
                 # bodytextEncoding='utf-8',         # other tests to try
                 # attachesEncodings=['latin-1'],    # inspect text headers
                 # attaches=['monkeys.jpg'])         # verify Base64 encoded
                 # to='i18n addr list...',           # test mime/unicode headers

# change mailconfig to test fetchlimit
fetcher = MailFetcherConsole()
def status(*args): print(args)

hdrs, sizes, loadedall = fetcher.downloadAllHeaders(status)
for num, hdr in enumerate(hdrs[:5]):
    print(hdr)
    if input('load mail?') in ['y', 'Y']:
        print(fetcher.downloadMessage(num+1).rstrip(), '\n', '-'*70)

last5 = len(hdrs)-4
msgs, sizes, loadedall = fetcher.downloadAllMessages(status, loadfrom=last5)
for msg in msgs:
    print(msg[:200], '\n', '-'*70)

parser = MailParser()
for i in [0]:                  # try [0 , len(msgs)]
    fulltext = msgs[i]
    message  = parser.parseMessage(fulltext)
    ctype, maintext = parser.findMainText(message)
    print('Parsed:', message['Subject'])
    print(maintext)
input('Press Enter to exit')   # pause if clicked on Windows
Here’s a run of the self-test script; it generates a lot of output, most of which has been deleted here for presentation in this book—as usual, run this on your own for further details:
C:\...\PP4E\Internet\Email\mailtools> selftest.py
config: ..\mailconfig.py
user: [email protected]
Adding text/x-python
Sending to...['[email protected]']
Content-Type: multipart/mixed; boundary="===============0085314748=="
MIME-Version: 1.0
From: [email protected]
To: [email protected]
Subject: testing mailtools package
Date: Sat, 08 May 2010 19:26:22 -0000
X-Mailer: mailtools

A multi-part MIME format message.
--===============0085314748==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

Here is my source code
--===============0085314748==
Content-Type: text/x-python; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="selftest.py"

"""
###############################################################################
self-test when this file is run as a program
###############################################################################
"""
...more lines omitted...
print(maintext)
input('Press Enter to exit')   # pause if clicked on Windows
--===============0085314748==--
Send exit
loading headers
Connecting...
Password for [email protected] on pop.secureserver.net?
b'+OK <[email protected]>'
(1, 7)
(2, 7)
(3, 7)
(4, 7)
(5, 7)
(6, 7)
(7, 7)
load headers exit
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 -0000
Received: from unknown (HELO p3pismtp01-026.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...
load mail?y
load 1
Connecting...
b'+OK <[email protected]>'
Received: (qmail 7690 invoked from network); 5 May 2010 15:29:43 -0000
Received: from unknown (HELO p3pismtp01-026.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...
load mail?
loading full messages
Connecting...
b'+OK <[email protected]>'
(3, 7)
(4, 7)
(5, 7)
(6, 7)
(7, 7)
Received: (qmail 25683 invoked from network); 6 May 2010 14:12:07 -0000
Received: from unknown (HELO p3pismtp01-018.prod.phx3.secureserver.net) ([10.6.1
...more lines omitted...
Parsed: A B C D E F G
Fiddle de dum, Fiddle de dee,
Eric the half a bee.
Press Enter to exit
As a final email example in this chapter, and to give a better use case for the mailtools package of the preceding sections, Example 13-27 provides an updated version of the pymail program we met earlier (Example 13-20). It uses our mailtools package to access email, instead of interfacing with Python’s email package directly. Compare its code to the original pymail in this chapter to see how mailtools is employed here. You’ll find that its mail download and send logic is substantially simpler.
#!/usr/local/bin/python
"""
################################################################################
pymail2 - simple console email interface client in Python; this version uses
the mailtools package, which in turn uses poplib, smtplib, and the email package
for parsing and composing emails; displays first text part of mails, not the
entire full text; fetches just mail headers initially, using the TOP command;
fetches full text of just email selected to be displayed; caches already
fetched mails; caveat: no way to refresh index; uses standalone mailtools
objects - they can also be used as superclasses;
################################################################################
"""

import mailconfig, mailtools
from pymail import inputmessage
mailcache = {}

def fetchmessage(i):
    try:
        fulltext = mailcache[i]
    except KeyError:
        fulltext = fetcher.downloadMessage(i)
        mailcache[i] = fulltext
    return fulltext

def sendmessage():
    From, To, Subj, text = inputmessage()
    sender.sendMessage(From, To, Subj, [], text, attaches=None)

def deletemessages(toDelete, verify=True):
    print('To be deleted:', toDelete)
    if verify and input('Delete?')[:1] not in ['y', 'Y']:
        print('Delete cancelled.')
    else:
        print('Deleting messages from server...')
        fetcher.deleteMessages(toDelete)

def showindex(msgList, msgSizes, chunk=5):
    count = 0
    for (msg, size) in zip(msgList, msgSizes):     # email.message.Message, int
        count += 1                                 # 3.x iter ok here
        print('%d: %d bytes' % (count, size))
        for hdr in ('From', 'To', 'Date', 'Subject'):
            print(' %-8s=>%s' % (hdr, msg.get(hdr, '(unknown)')))
        if count % chunk == 0:
            input('[Press Enter key]')             # pause after each chunk

def showmessage(i, msgList):
    if 1 <= i <= len(msgList):
        fulltext = fetchmessage(i)
        message  = parser.parseMessage(fulltext)
        ctype, maintext = parser.findMainText(message)
        print('-' * 79)
        print(maintext.rstrip() + '\n')   # main text part, not entire mail
        print('-' * 79)                   # and not any attachments after
    else:
        print('Bad message number')

def savemessage(i, mailfile, msgList):
    if 1 <= i <= len(msgList):
        fulltext = fetchmessage(i)
        savefile = open(mailfile, 'a', encoding=mailconfig.fetchEncoding)   # 4E
        savefile.write('\n' + fulltext + '-'*80 + '\n')
    else:
        print('Bad message number')

def msgnum(command):
    try:
        return int(command.split()[1])
    except:
        return -1   # assume this is bad

helptext = """
Available commands:
i     - index display
l n?  - list all messages (or just message n)
d n?  - mark all messages for deletion (or just message n)
s n?  - save all messages to a file (or just message n)
m     - compose and send a new mail message
q     - quit pymail
?     - display this help text
"""

def interact(msgList, msgSizes, mailfile):
    showindex(msgList, msgSizes)
    toDelete = []
    while True:
        try:
            command = input('[Pymail] Action? (i, l, d, s, m, q, ?) ')
        except EOFError:
            command = 'q'
        if not command: command = '*'

        if command == 'q':                     # quit
            break

        elif command[0] == 'i':                # index
            showindex(msgList, msgSizes)

        elif command[0] == 'l':                # list
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    showmessage(i, msgList)
            else:
                showmessage(msgnum(command), msgList)

        elif command[0] == 's':                # save
            if len(command) == 1:
                for i in range(1, len(msgList)+1):
                    savemessage(i, mailfile, msgList)
            else:
                savemessage(msgnum(command), mailfile, msgList)

        elif command[0] == 'd':                # mark for deletion later
            if len(command) == 1:              # 3.x needs list(): iter
                toDelete = list(range(1, len(msgList)+1))
            else:
                delnum = msgnum(command)
                if (1 <= delnum <= len(msgList)) and (delnum not in toDelete):
                    toDelete.append(delnum)
                else:
                    print('Bad message number')

        elif command[0] == 'm':                # send a new mail via SMTP
            try:
                sendmessage()
            except:
                print('Error - mail not sent')

        elif command[0] == '?':
            print(helptext)
        else:
            print('What? -- type "?" for commands help')
    return toDelete

def main():
    global parser, sender, fetcher
    mailserver = mailconfig.popservername
    mailuser   = mailconfig.popusername
    mailfile   = mailconfig.savemailfile

    parser     = mailtools.MailParser()
    sender     = mailtools.MailSender()
    fetcher    = mailtools.MailFetcherConsole(mailserver, mailuser)

    def progress(i, max):
        print(i, 'of', max)

    hdrsList, msgSizes, ignore = fetcher.downloadAllHeaders(progress)
    msgList = [parser.parseHeaders(hdrtext) for hdrtext in hdrsList]

    print('[Pymail email client]')
    toDelete = interact(msgList, msgSizes, mailfile)
    if toDelete: deletemessages(toDelete)

if __name__ == '__main__': main()
This program is used interactively, the same as the original. In fact, the output is nearly identical, so we won’t go into further details. Here’s a quick look at this script in action; run this on your own machine to see it firsthand:
C:\...\PP4E\Internet\Email> pymail2.py
user: [email protected]
loading headers
Connecting...
Password for [email protected] on pop.secureserver.net?
b'+OK <[email protected]>'
1 of 7
2 of 7
3 of 7
4 of 7
5 of 7
6 of 7
7 of 7
load headers exit
[Pymail email client]
1: 1860 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Wed, 5 May 2010 11:29:36 -0400 (EDT)
 Subject =>I'm a Lumberjack, and I'm Okay
2: 1408 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Wed, 05 May 2010 08:33:47 -0700
 Subject =>testing
3: 1049 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Thu, 06 May 2010 14:11:07 -0000
 Subject =>A B C D E F G
4: 1038 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Thu, 06 May 2010 14:32:32 -0000
 Subject =>a b c d e f g
5: 957 bytes
 From    =>[email protected]
 To      =>maillist
 Date    =>Thu, 06 May 2010 10:58:40 -0400
 Subject =>test interactive smtplib
[Press Enter key]
6: 1037 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Fri, 07 May 2010 20:32:38 -0000
 Subject =>Among our weapons are these
7: 3248 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Sat, 08 May 2010 19:26:22 -0000
 Subject =>testing mailtools package
[Pymail] Action? (i, l, d, s, m, q, ?)l 7
load 7
Connecting...
b'+OK <[email protected]>'
-------------------------------------------------------------------------------
Here is my source code
-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)d 7
[Pymail] Action? (i, l, d, s, m, q, ?)m
From?[email protected]
To?[email protected]
Subj?test pymail2 send
Type message text, end with line="."
Run away! Run away!
.
Sending to...['[email protected]']
From: [email protected]
To: [email protected]
Subject: test pymail2 send
Date: Sat, 08 May 2010 19:44:25 -0000

Run away! Run away!
Send exit
[Pymail] Action? (i, l, d, s, m, q, ?)q
To be deleted: [7]
Delete?y
Deleting messages from server...
deleting mails
Connecting...
b'+OK <[email protected]>'
The messages in our mailbox have quite a few origins now: ISP webmail clients, basic SMTP scripts, the Python interactive command line, mailtools self-test code, and two console-based email clients; in later chapters, we’ll add even more. All their mails look the same to our script; here’s a verification of the email we just sent (the second fetch finds it already in-cache):
C:\...\PP4E\Internet\Email> pymail2.py
user: [email protected]
loading headers
Connecting...
...more lines omitted...
[Press Enter key]
6: 1037 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Fri, 07 May 2010 20:32:38 -0000
 Subject =>Among our weapons are these
7: 984 bytes
 From    =>[email protected]
 To      =>[email protected]
 Date    =>Sat, 08 May 2010 19:44:25 -0000
 Subject =>test pymail2 send
[Pymail] Action? (i, l, d, s, m, q, ?)l 7
load 7
Connecting...
b'+OK <[email protected]>'
-------------------------------------------------------------------------------
Run away! Run away!
-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)l 7
-------------------------------------------------------------------------------
Run away! Run away!
-------------------------------------------------------------------------------
[Pymail] Action? (i, l, d, s, m, q, ?)q
Study pymail2’s code for more insights. As you’ll see, this version eliminates some complexities, such as the manual formatting of composed mail message text. It also does a better job of displaying a mail’s text: instead of blindly listing the full mail text (attachments and all), it uses mailtools to fetch the first text part of the message. The messages we’re using are too simple to show the difference, but for a mail with attachments, this new version will be more focused about what it displays.
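To see why showing only the first text part matters for mail with attachments, here is a minimal sketch that uses the standard email package directly. It is a simplified stand-in for what mailtools does, not its actual findMainText logic; the function name first_text_part and the tiny two-part test message are assumptions made for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def first_text_part(message):
    """Return (content type, text) for the first text/* part found."""
    for part in message.walk():                      # root first, then subparts
        if part.get_content_maintype() == 'text':
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or 'ascii'
            return part.get_content_type(), payload.decode(charset, 'replace')
    return 'text/plain', '[No text to display]'

# build a two-part mail: main text plus an attached source file
mail = MIMEMultipart()
mail['Subject'] = 'testing'
mail.attach(MIMEText('Here is my source code\n'))
attachment = MIMEText('print("spam")\n', _subtype='x-python')
attachment.add_header('Content-Disposition', 'attachment', filename='selftest.py')
mail.attach(attachment)

# round-trip through full message text, as a fetched mail would arrive
parsed = email.message_from_string(mail.as_string())
ctype, maintext = first_text_part(parsed)
print(ctype)
print(maintext.rstrip())
```

Printing the full text of this message would include MIME boundaries, part headers, and the attachment; the sketch displays just the body line.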
Moreover, because the interface to mail is encapsulated in the mailtools package’s modules, if it ever must change, it will need to be changed only in that package, regardless of how many mail clients use its tools. And because the code in mailtools is shared, once we know it works for one client, we can expect it to work in another; there is far less new code to debug.
On the other hand, pymail2 doesn’t really leverage much of the power of either mailtools or the underlying email package it uses. For example, things like attachments, internationalized headers, and inbox synchronization are not handled at all, and some decoded main text may contain characters that the console terminal interface cannot display. To see the full scope of the email package, we need to explore a larger email system, such as PyMailGUI or PyMailCGI. The first of these is the topic of the next chapter, and the second appears in Chapter 16.
First, though, let’s quickly survey a handful of additional client-side protocol tools.
So far in this chapter, we have focused on Python’s FTP and email processing tools and have met a handful of client-side scripting modules along the way: ftplib, poplib, smtplib, email, mimetypes, urllib, and so on. This set is representative of Python’s client-side library tools for transferring and processing information over the Internet, but it’s not at all complete.
A more or less comprehensive list of Python’s Internet-related modules appears at the start of the previous chapter. Among other things, Python also includes client-side support libraries for Internet news, Telnet, HTTP, XML-RPC, and other standard protocols. Most of these are analogous to modules we’ve already met—they provide an object-based interface that automates the underlying sockets and message structures.
For instance, Python’s nntplib module supports the client-side interface to NNTP (the Network News Transfer Protocol), which is used for reading and posting articles to Usenet newsgroups on the Internet. Like other protocols, NNTP runs on top of sockets and merely defines a standard message protocol; like other modules, nntplib hides most of the protocol details and presents an object-based interface to Python scripts.
We won’t get into full protocol details here, but in brief, NNTP servers store a range of articles on the server machine, usually in a flat-file database. If you have the domain name or IP address of a server machine that runs an NNTP server program listening on the NNTP port, you can write scripts that fetch or post articles from any machine that has Python and an Internet connection. For instance, the script in Example 13-28 by default fetches and displays the last 10 articles from Python’s Internet newsgroup, comp.lang.python, from the news.rmi.net NNTP server at one of my ISPs.

"""
fetch and print usenet newsgroup posting from comp.lang.python via the
nntplib module, which really runs on top of sockets; nntplib also supports
posting new messages, etc.; note: posts not deleted after they are read;
"""

listonly = False
showhdrs = ['From', 'Subject', 'Date', 'Newsgroups', 'Lines']
try:
    import sys
    servername, groupname, showcount = sys.argv[1:]
    showcount  = int(showcount)
except:
    import nntpconfig                          # assumed: your own module with servername
    servername = nntpconfig.servername         # assign this to your server
    groupname  = 'comp.lang.python'            # cmd line args or defaults
    showcount  = 10                            # show last showcount posts

# connect to nntp server
print('Connecting to', servername, 'for', groupname)
from nntplib import NNTP
connection = NNTP(servername)
(reply, count, first, last, name) = connection.group(groupname)
print('%s has %s articles: %s-%s' % (name, count, first, last))

# get request headers only
fetchfrom = str(int(last) - (showcount-1))
(reply, subjects) = connection.xhdr('subject', (fetchfrom + '-' + last))

# show headers, get message hdr+body
for (id, subj) in subjects:                   # [-showcount:] if fetch all hdrs
    print('Article %s [%s]' % (id, subj))
    if not listonly and input('=> Display?') in ['y', 'Y']:
        reply, num, tid, list = connection.head(id)
        for line in list:
            for prefix in showhdrs:
                if line[:len(prefix)] == prefix:
                    print(line[:80])
                    break
        if input('=> Show body?') in ['y', 'Y']:
            reply, num, tid, list = connection.body(id)
            for line in list:
                print(line[:80])
    print()
print(connection.quit())
As with the FTP and email tools, the script creates an NNTP object and calls its methods to fetch newsgroup information and articles’ header and body text. The xhdr method, for example, loads selected headers from a range of messages.
For NNTP servers that require authentication, you may also have to pass a username, a password, and possibly a reader-mode flag to the NNTP call. See the Python Library manual for more on other NNTP parameters and object methods.
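Two small pieces of the news script's logic can be exercised without any server: the arithmetic that builds the article range passed to xhdr, and the prefix test that picks which header lines to show. The sketch below deliberately avoids importing nntplib, which is not present in every Python release; the function names header_range and select_headers and the sample header lines are hypothetical, chosen only for this illustration.

```python
def header_range(last, showcount):
    """Build the 'first-last' range string covering the newest showcount articles."""
    fetchfrom = int(last) - (showcount - 1)
    return '%d-%d' % (fetchfrom, int(last))

def select_headers(lines, showhdrs, width=80):
    """Keep only lines that start with one of the wanted header names."""
    return [line[:width] for line in lines
            if any(line.startswith(prefix) for prefix in showhdrs)]

# made-up header lines standing in for a server's reply
sample = ['From: someone@example.com',
          'Path: news.example.com!not-for-mail',
          'Subject: A B C D E F G',
          'Lines: 3']
showhdrs = ['From', 'Subject', 'Date', 'Newsgroups', 'Lines']

print(header_range('5000', 10))
print(select_headers(sample, showhdrs))
```

With a last article number of 5000 and a count of 10, the range starts at 4991, so exactly ten article numbers are requested.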
In the interest of space and time, I’ll omit this script’s outputs here. When run, it connects to the server and displays each article’s subject line, pausing to ask whether it should fetch and show the article’s header information lines (headers listed in the variable showhdrs only) and body text. We can also pass this script an explicit server name, newsgroup, and display count on the command line to apply it in different ways. With a little more work, we could turn this script into a full-blown news interface. For instance, new articles could be posted from within a Python script with code of this form (assuming the local file already contains proper NNTP header lines):
# to post, say this (but only if you really want to post!)
connection = NNTP(servername)
localfile = open('filename')      # file has proper headers
connection.post(localfile)        # send text to newsgroup
connection.quit()
We might also add a tkinter-based GUI frontend to this script to make it more usable, but we’ll leave such an extension on the suggested exercise heap (see also the PyMailGUI interface’s suggested extensions at the end of the next chapter—email and news messages have a similar structure).
Python’s standard library (the modules that are installed with the interpreter) also includes client-side support for HTTP—the Hypertext Transfer Protocol—a message structure and port standard used to transfer information on the World Wide Web. In short, this is the protocol that your web browser (e.g., Internet Explorer, Firefox, Chrome, or Safari) uses to fetch web pages and run applications on remote servers as you surf the Web. Essentially, it’s just bytes sent over port 80.
To really understand HTTP-style transfers, you need to know some of the server-side scripting topics covered in Chapter 15 (e.g., script invocations and Internet address schemes), so this section may be less useful to readers with no such background. Luckily, though, the basic HTTP interfaces in Python are simple enough for a cursory understanding even at this point in the book, so let’s take a brief look here.
Python’s standard http.client module automates much of the protocol defined by HTTP and allows scripts to fetch web pages as clients much like web browsers; as we’ll see in Chapter 15, http.server also allows us to implement web servers to handle the other side of the dialog. For instance, the script in Example 13-29 can be used to grab any file from any server machine running an HTTP web server program. As usual, the file (and descriptive header lines) is ultimately transferred as formatted messages over a standard socket port, but most of the complexity is hidden by the http.client module (see our raw socket dialog with a port 80 HTTP server in Chapter 12 for a comparison).

"""
fetch a file from an HTTP (web) server over sockets via http.client; the filename
parameter may have a full directory path, and may name a CGI script with ? query
parameters on the end to invoke a remote program; fetched file data or remote
program output could be saved to a local file to mimic FTP, or parsed with str.find
or html.parser module; also: http.client request(method, url, body=None, hdrs={});
"""

import sys, http.client
showlines = 6
try:
    servername, filename = sys.argv[1:]           # cmdline args?
except:
    servername, filename = 'learning-python.com', '/index.html'

print(servername, filename)
server = http.client.HTTPConnection(servername)   # connect to http site/server
server.putrequest('GET', filename)                # send request and headers
server.putheader('Accept', 'text/html')           # POST requests work here too
server.endheaders()                               # as do CGI script filenames

reply = server.getresponse()                      # read reply headers + data
if reply.status != 200:                           # 200 means success
    print('Error sending request', reply.status, reply.reason)
else:
    data = reply.readlines()                      # file obj for data received
    reply.close()                                 # show lines with eoln at end
    for line in data[:showlines]:                 # to save, write data to file
        print(line)                               # line already has \n, but bytes
Desired server names and filenames can be passed on the command line to override hardcoded defaults in the script. You need to know something of the HTTP protocol to make the most sense of this code, but it’s fairly straightforward to decipher. When run on the client, this script makes an HTTPConnection object to connect to the server, sends it a GET request along with acceptable reply types, and then reads the server’s reply. Much like raw email message text, the HTTP server’s reply usually begins with a set of descriptive header lines, followed by the contents of the requested file. The response object returned by the connection’s getresponse method is file-like: its readlines method gives us the downloaded data.
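The request-and-reply sequence just described can be tried without reaching any outside site. The following sketch is an assumption-laden illustration rather than the book's script: it starts a throwaway http.server instance in a thread on a free local port, and the names DemoHandler and fetch_lines, along with the three-line page it serves, are invented for the demo. The client side uses the same http.client calls as the example, with the one-step request method in place of putrequest, putheader, and endheaders.

```python
import http.client
import http.server
import threading

class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Serve one tiny page at /index.html; anything else is a 404."""
    def do_GET(self):
        if self.path == '/index.html':
            body = b'<HTML>\n<HEAD>\n</HEAD>\n'
            self.send_response(200)
            self.send_header('Content-Type', 'text/html')
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)
    def log_message(self, *args):          # keep the demo's console quiet
        pass

def fetch_lines(host, port, filename):
    """GET filename; return (status, reason, list of bytes lines)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request('GET', filename, headers={'Accept': 'text/html'})
        reply = conn.getresponse()         # file-like response object
        return reply.status, reply.reason, reply.readlines()
    finally:
        conn.close()

server = http.server.HTTPServer(('127.0.0.1', 0), DemoHandler)   # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

status, reason, lines = fetch_lines('127.0.0.1', port, '/index.html')
missing_status, missing_reason, _ = fetch_lines('127.0.0.1', port, '/nothere.html')
server.shutdown()
server.server_close()

print(status, reason, lines)
print(missing_status, missing_reason)
```

As in the script, a status other than 200 signals that the server could not deliver the file as asked, and the data lines come back as bytes, each with its end-of-line still attached.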
Let’s fetch a few files with this script. Like all Python client-side scripts, this one works on any machine with Python and an Internet connection (here it runs on a Windows client). Assuming that all goes well, the first few lines of the downloaded file are printed; in a more realistic application, the text we fetch would probably be saved to a local file, parsed with Python’s html.parser module (introduced in Chapter 19), and so on. Without arguments, the script simply fetches the HTML index page at http://learning-python.com, a domain name I host at a commercial service provider:
C:\...\PP4E\Internet\Other> http-getfile.py
learning-python.com /index.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Python Training Services</TITLE>\n"
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD>\n'
Notice that in Python 3.X the fetched data comes back as bytes strings again, not str; since the Python html.parser HTML parser we’ll meet in Chapter 19 expects str text strings instead of bytes, you’ll likely need to resolve a Unicode encoding choice here in order to parse, much the same as we did for email message text earlier in this chapter. As there, we might decode from bytes to str per a default, user preferences or selections, headers inspection, or byte structure analysis. Because sockets send raw bytes, we confront this choice point whenever data shipped over them is text in nature; unless that text’s type is known or always simple in form, Unicode implies extra steps.
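One of the choices listed above, headers inspection, can be sketched briefly. This is only an illustration under stated assumptions: the function name decode_body and the UTF-8 default are my own, and the Content-Type value is passed in as a plain string (with an http.client reply it could come from the reply's headers). The standard email.message.Message class is used here only to parse the charset parameter.

```python
from email.message import Message

def decode_body(data, content_type, default='utf-8'):
    """Decode bytes per the charset named in a Content-Type value, if any."""
    hdrs = Message()
    hdrs['Content-Type'] = content_type
    charset = hdrs.get_content_charset() or default    # None if not named
    try:
        return data.decode(charset)
    except (LookupError, UnicodeDecodeError):           # unknown codec or bad bytes
        return data.decode(default, errors='replace')

latin = decode_body('café'.encode('latin-1'), 'text/html; charset=ISO-8859-1')
plain = decode_body(b'<HTML>\n', 'text/html')
print(latin)
print(repr(plain))
```

Falling back to a default with replacement characters keeps the text usable for display or parsing even when the server's declared encoding is missing or wrong.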
We can also list a server and file to be fetched on the command line, if we want to be more specific. In the following code, we use the script to fetch files from two different websites by listing their names on the command lines (I’ve truncated some of these lines so they fit in this book). Notice that the filename argument can include an arbitrary remote directory path to the desired file, as in the last fetch here:
C:\...\PP4E\Internet\Other> http-getfile.py www.python.org /index.html
www.python.org /index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

C:\...\PP4E\Internet\Other> http-getfile.py www.python.org index.html
www.python.org index.html
Error sending request 400 Bad Request

C:\...\PP4E\Internet\Other> http-getfile.py www.learning-python.com /books
www.learning-python.com /books
Error sending request 301 Moved Permanently

C:\...\PP4E\Internet\Other> http-getfile.py www.learning-python.com /books/index.html
www.learning-python.com /books/index.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Book Support Site</TITLE>\n"
b'</HEAD>\n'
b'<BODY BGCOLOR="#f1f1ff">\n'
Notice the second and third attempts in this code: if the request fails, the script receives and displays an HTTP error code from the server (we forgot the leading slash on the second, and the “index.html” on the third—required for this server and interface). With the raw HTTP interfaces, we need to be precise about what we want.
Technically, the string we call filename in the script can refer to either a simple static web page file or a server-side program that generates HTML as its output. Those server-side programs are usually called CGI scripts, the topic of Chapters 15 and 16. For now, keep in mind that when filename refers to a script, this program can be used to invoke another program that resides on a remote server machine. In that case, we can also specify parameters (called a query string) to be passed to the remote program after a ?.
Here, for instance, we pass a language=Python parameter to a CGI script we will meet in Chapter 15 (to make this work, we also need to first spawn a locally running HTTP web server coded in Python using a script we first met in Chapter 1 and will revisit in Chapter 15):
In a different window
C:\...\PP4E\Internet\Web> webserver.py
webdir ".", port 80

C:\...\PP4E\Internet\Other> http-getfile.py localhost /cgi-bin/languages.py?language=Python
localhost /cgi-bin/languages.py?language=Python
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>Python</H3><P><PRE>\n'
b" print('Hello World') \n"
b'</PRE></P><BR>\n'
b'<HR>\n'
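Query strings typed by hand work for simple values like the one above, but values containing spaces or punctuation need escaping before they are sent. As a hedged sketch of one way to build the filename argument safely, the code below uses the standard urllib.parse tools; the helper name script_path is hypothetical, and the second call's C++ value is just an example of text that must be escaped.

```python
from urllib.parse import quote, urlencode, urlsplit, parse_qs

def script_path(script, **params):
    """Join a script path and escaped name=value parameters with '?'."""
    path = quote(script)                   # escape the path part, keeping '/'
    query = urlencode(params)              # escape and join name=value pairs
    if query:
        return path + '?' + query
    return path

plain = script_path('/cgi-bin/languages.py', language='Python')
escaped = script_path('/cgi-bin/languages.py', language='C++')
roundtrip = parse_qs(urlsplit(escaped).query)   # what a server would decode

print(plain)
print(escaped)
print(roundtrip)
```

The result of script_path could be passed as the second command-line argument of the fetch script in place of a hand-typed path.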
This book has much more to say later about HTML, CGI scripts, and the meaning of the HTTP GET request used in Example 13-29 (along with POST, one of two ways to format information sent to an HTTP server), so we’ll skip additional details here.
Suffice it to say, though, that we could use the HTTP interfaces to write our own web browsers and build scripts that use websites as though they were subroutines. By sending parameters to remote programs and parsing their results, websites can take on the role of simple in-process functions (albeit, much more slowly and indirectly).
The http.client module we just met provides low-level control for HTTP clients. When dealing with items available on the Web, though, it’s often easier to code downloads with Python’s standard urllib.request module, introduced in the FTP section earlier in this chapter. Since this module is another way to talk HTTP, let’s expand on its interfaces here.
Recall that given a URL, urllib.request either downloads the requested object over the Net to a local file or gives us a file-like object from which we can read the requested object’s contents. As a result, the script in Example 13-30 does the same work as the http.client script we just wrote but requires noticeably less code.

"""
fetch a file from an HTTP (web) server over sockets via urllib; urllib supports
HTTP, FTP, files, and HTTPS via URL address strings; for HTTP, the URL can name
a file or trigger a remote CGI script; see also the urllib example in the FTP
section, and the CGI script invocation in a later chapter; files can be fetched
over the net with Python in many ways that vary in code and server requirements:
over sockets, FTP, HTTP, urllib, and CGI outputs; caveat: should run filename
through urllib.parse.quote to escape properly unless hardcoded--see later chapters;
"""

import sys
from urllib.request import urlopen
showlines = 6
try:
    servername, filename = sys.argv[1:]              # cmdline args?
except:
    servername, filename = 'learning-python.com', '/index.html'

remoteaddr = 'http://%s%s' % (servername, filename)  # can name a CGI script too
print(remoteaddr)
remotefile = urlopen(remoteaddr)                     # returns input file object
remotedata = remotefile.readlines()                  # read data directly here
remotefile.close()
for line in remotedata[:showlines]: print(line)      # bytes with embedded \n
Almost all HTTP transfer details are hidden behind the urllib.request interface here. This version works in almost the same way as the http.client version we wrote first, but it builds and submits an Internet URL address to get its work done (the constructed URL is printed as the script’s first output line). As we saw in the FTP section of this chapter, the urllib.request function urlopen returns a file-like object from which we can read the remote data. But because the constructed URLs begin with “http://” here, the urllib.request module automatically employs the lower-level HTTP interfaces to download the requested file instead of FTP:
C:\...\PP4E\Internet\Other> http-getfile-urllib1.py
http://learning-python.com/index.html
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Python Training Services</TITLE>\n"
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py www.python.org /index
http://www.python.org/index
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b'\n'
b'\n'
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
b'\n'
b'<head>\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py www.learning-python.com /books
http://www.learning-python.com/books
b'<HTML>\n'
b'\n'
b'<HEAD>\n'
b"<TITLE>Mark Lutz's Book Support Site</TITLE>\n"
b'</HEAD>\n'
b'<BODY BGCOLOR="#f1f1ff">\n'

C:\...\PP4E\Internet\Other> http-getfile-urllib1.py localhost /cgi-bin/languages.py?language=Java
http://localhost/cgi-bin/languages.py?language=Java
b'<TITLE>Languages</TITLE>\n'
b'<H1>Syntax</H1><HR>\n'
b'<H3>Java</H3><P><PRE>\n'
b' System.out.println("Hello World"); \n'
b'</PRE></P><BR>\n'
b'<HR>\n'
As before, the filename argument can name a simple file or a program invocation with optional parameters at the end, as in the last run here. If you read this output carefully, you’ll notice that this script still works if you leave the “index.html” off the end of a site’s root filename (in the third command line); unlike the raw HTTP version of the preceding section, the URL-based interface is smart enough to do the right thing.
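Because urlopen chooses its transport from the URL's scheme, the same read logic can be exercised with no network at all by handing it a file: URL. The sketch below is a minimal illustration under that assumption; the function name first_lines, the temporary directory, and the three-line page written into it are all made up for the demo.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def first_lines(url, showlines=6):
    """Open any URL urlopen supports and return its first few bytes lines."""
    with urlopen(url) as remotefile:       # file-like object, closed on exit
        return remotefile.readlines()[:showlines]

with tempfile.TemporaryDirectory() as tmpdir:
    page = Path(tmpdir) / 'index.html'
    page.write_bytes(b'<HTML>\n<HEAD>\n</HEAD>\n')
    fetched = first_lines(page.as_uri())   # a file:///... address

for line in fetched:
    print(line)                            # bytes, each with its end-of-line
```

Swapping in an address that begins with “http://” makes the same function fetch from a web server instead, which is the convenience the URL-based interface offers over coding each protocol separately.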
One last mutation: the following urllib.request downloader script uses the slightly higher-level urlretrieve interface in that module to automatically save the downloaded file or script output to a local file on the client machine. This interface is handy if we really mean to store the fetched data (e.g., to mimic the FTP protocol). If we plan on processing the downloaded data immediately, though, this form may be less convenient than the version we just met: we need to open and read the saved file. Moreover, we need to provide an extra protocol for specifying or extracting a local filename, as in Example 13-31.

"""
fetch a file from an HTTP (web) server over sockets via urllib; this version
uses an interface that saves the fetched data to a local binary-mode file; the
local filename is either passed in as a cmdline arg or stripped from the URL with
urllib.parse: the filename argument may have a directory path at the front and query
parameters at end, so os.path.split is not enough (only splits off directory path);
caveat: should urllib.parse.quote filename unless known ok--see later chapters;
"""

import sys, os, urllib.request, urllib.parse
showlines = 6
try:
    servername, filename = sys.argv[1:3]              # first 2 cmdline args?
except:
    servername, filename = 'learning-python.com', '/index.html'

remoteaddr = 'http://%s%s' % (servername, filename)   # any address on the Net
if len(sys.argv) == 4:                                # get result filename
    localname = sys.argv[3]
else:
    (scheme, server, path, parms, query, frag) = urllib.parse.urlparse(remoteaddr)
    localname = os.path.split(path)[1]

print(remoteaddr, localname)
urllib.request.urlretrieve(remoteaddr, localname)     # can be file or script
remotedata = open(localname, 'rb').readlines()        # saved to local file
for line in remotedata[:showlines]: print(line)       # file is bytes/binary
Let’s run this last variant from a command line. Its basic operation is the same as the last two versions: like the prior one, it builds a URL, and like both of the last two, we can list an explicit target server and file path on the command line:
C:...PP4EInternetOther>http-getfile-urllib2.py
http://learning-python.com/index.html index.html
b'<HTML> '
b' '
b'<HEAD> '
b"<TITLE>Mark Lutz's Python Training Services</TITLE> "
b'<!--mstheme--><link rel="stylesheet" type="text/css" href="_themes/blends/blen...'
b'</HEAD> '

C:...PP4EInternetOther>http-getfile-urllib2.py www.python.org /index.html
http://www.python.org/index.html index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b' '
b' '
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> '
b' '
b'<head> '
Because this version uses a urllib.request
interface that
automatically saves the downloaded data in a local file, it’s
similar to FTP downloads in spirit. But this script must also
somehow come up with a local filename for storing the data. You can
either let the script strip and use the base filename from the
constructed URL, or explicitly pass a local filename as a last
command-line argument. In the prior run, for instance, the
downloaded web page is stored in the local file
index.html in the current working directory—the base filename stripped
from the URL (the script prints the URL and local filename as its
first output line). In the next run, the local filename is passed
explicitly as py-index.html:
C:...PP4EInternetOther>http-getfile-urllib2.py www.python.org /index.html py-index.html
http://www.python.org/index.html py-index.html
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3....'
b' '
b' '
b'<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> '
b' '
b'<head> '

C:...PP4EInternetOther>http-getfile-urllib2.py www.learning-python.com /books books.html
http://learning-python.com/books books.html
b'<HTML> '
b' '
b'<HEAD> '
b"<TITLE>Mark Lutz's Book Support Site</TITLE> "
b'</HEAD> '
b'<BODY BGCOLOR="#f1f1ff"> '

C:...PP4EInternetOther>http-getfile-urllib2.py www.learning-python.com /books/about-pp.html
http://learning-python.com/books/about-pp.html about-pp.html
b'<HTML> '
b' '
b'<HEAD> '
b'<TITLE>About "Programming Python"</TITLE> '
b'</HEAD> '
b' '
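The local names printed in these runs come from the script's two-step filename extraction. The following is a minimal sketch of that step in isolation; the function name local_name_for is hypothetical, and the third URL simply reuses the program invocation from an earlier run to show why urllib.parse is needed before os.path.split.

```python
import os
import urllib.parse

def local_name_for(remoteaddr):
    """Derive a local filename from a URL (hypothetical helper)."""
    # urlparse separates the path from any query parameters at the end
    path = urllib.parse.urlparse(remoteaddr).path
    # os.path.split then strips the directory path at the front
    return os.path.split(path)[1]

names = [local_name_for(url) for url in (
    'http://learning-python.com/index.html',
    'http://learning-python.com/books/about-pp.html',
    'http://localhost/cgi-bin/languages.py?language=Java',
)]
print(names)
```

Note that a URL whose path is empty or ends in a slash yields an empty name under this approach, so a more robust script would probably need a fallback filename for that case.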
The next listing shows this script being used to trigger a remote program. As before, if you don’t give the local filename explicitly, the script strips the base filename out of the filename argument. That’s not always easy or appropriate for program invocations—the filename can contain both a remote directory path at the front and query parameters at the end for a remote program invocation.
Given a script invocation URL and no explicit output
filename, the script extracts the base filename in the middle by
using first the standard urllib.parse
module to pull out the file path, and then os.path.split
to strip off the directory
path. However, the resulting filename is a remote script’s name,
and it may or may not be an appropriate place to store the data
locally. In the first run that follows, for example, the script’s
output goes in a local file called
languages.py, the script name in the middle
of the URL; in the second, we instead name the output
CxxSyntax.html explicitly to suppress
filename extraction:
C:...PP4EInternetOther>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=Scheme
http://localhost/cgi-bin/languages.py?language=Scheme languages.py
b'<TITLE>Languages</TITLE> '
b'<H1>Syntax</H1><HR> '
b'<H3>Scheme</H3><P><PRE> '
b' (display "Hello World") (newline) '
b'</PRE></P><BR> '
b'<HR> '

C:...PP4EInternetOther>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=C++ CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C++ CxxSyntax.html
b'<TITLE>Languages</TITLE> '
b'<H1>Syntax</H1><HR> '
b'<H3>C </H3><P><PRE> '
b"Sorry--I don't know that language "
b'</PRE></P><BR> '
b'<HR> '
The remote script returns a not-found message when passed “C++” in the last command here. It turns out that “+” is a special character in URL strings (meaning a space), and to be robust, both of the urllib scripts we’ve just written should really run the filename string through something called urllib.parse.quote, a tool that escapes special characters for transmission. We will talk about this in depth in Chapter 15, so consider this a preview for now. But to make this invocation work, we need to use special sequences in the constructed URL. Here’s how to do it by hand:
C:...PP4EInternetOther>python http-getfile-urllib2.py localhost /cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
http://localhost/cgi-bin/languages.py?language=C%2b%2b CxxSyntax.html
b'<TITLE>Languages</TITLE> '
b'<H1>Syntax</H1><HR> '
b'<H3>C++</H3><P><PRE> '
b' cout &lt;&lt; "Hello World" &lt;&lt; endl; '
b'</PRE></P><BR> '
b'<HR> '
The odd %2b strings in this command line are not entirely magical: the escaping required for URLs can be seen by running standard Python tools manually, which is what these scripts should do automatically to handle all possible cases well; urllib.parse.unquote can undo these escapes if needed:
C:...PP4EInternetOther>python
>>> import urllib.parse
>>> urllib.parse.quote('C++')
'C%2B%2B'
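To make the round trip concrete, here is a small sketch of how a script might escape the parameter value before building its URL. The variable names are illustrative only, and the unquote_plus call is an assumption about roughly how a server-side decoder treats an unescaped “+”, not a description of the languages.py script itself.

```python
import urllib.parse

value = 'C++'
escaped = urllib.parse.quote(value)            # '+' becomes %2B
restored = urllib.parse.unquote(escaped)       # back to the original text

# Escape only the parameter value: quoting the whole filename string
# would also escape the '?' and '=' that give the query its structure
filename = '/cgi-bin/languages.py?language=' + escaped
remoteaddr = 'http://%s%s' % ('localhost', filename)

# Left unescaped, each '+' is decoded as a space in a query string
unescaped_seen_as = urllib.parse.unquote_plus(value)

print(escaped, restored)
print(remoteaddr)
print(repr(unescaped_seen_as))
```

Quoting just the value is a deliberate choice in this sketch: by default, quote leaves “/” alone but escapes most other punctuation, so applying it to the entire path-plus-query string would change what the server receives.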
Again, don’t work too hard at understanding these last few commands; we will revisit URLs and URL escapes in Chapter 15, while exploring server-side scripting in Python. I will also explain there why the C++ result came back with other oddities like &lt;&lt;, which are HTML escapes for <<, generated by the tool cgi.escape in the script on the server that produces the reply, and usually undone by HTML parsers, including Python’s html.parser module we’ll meet in Chapter 19:
>>> import cgi
>>> cgi.escape('<<')
'&lt;&lt;'
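In Python releases later than the one used for this book, cgi.escape was deprecated and then removed (as of 3.8), and the cgi module itself has since left the standard library, so the session above may fail on a current interpreter. The following is a minimal sketch of the same escaping using html.escape, the standard library replacement. Passing quote=False is a choice made here to mirror cgi.escape's default of leaving double quotes alone.

```python
import html

line = ' cout << "Hello World" << endl;'
escaped = html.escape(line, quote=False)   # escapes &, <, and > only
restored = html.unescape(escaped)          # undoes the HTML escapes

print(escaped)
print(restored == line)
```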
Also in Chapter 15, we’ll meet urllib support for proxies, and its support for client-side cookies. We’ll discuss the related HTTPS concept in Chapter 16: HTTP transmissions over secure sockets, supported by urllib.request on the client side if SSL support is compiled into your Python. For now, it’s time to wrap up our look at the Web, and the Internet at large, from the client side of the fence.
In this chapter, we focused on client-side interfaces to standard protocols that run over sockets, but as suggested in an earlier footnote, client-side programming can take other forms, too. We outlined many of these at the start of Chapter 12—web service protocols (including SOAP and XML-RPC); Rich Internet Application toolkits (including Flex, Silverlight, and pyjamas); cross-language framework integration (including Java and .NET); and more.
As mentioned, most of these serve to extend the functionality of web browsers, and so ultimately run on top of the HTTP protocol we explored in this chapter. For instance:
The Jython system is a compiler that supports Python-coded Java applets: general-purpose programs downloaded from a server and run locally on the client when accessed or referenced by a URL, which extend the functionality of web browsers and interactions.
Similarly, RIAs provide AJAX communication and widget toolkits that allow JavaScript to implement user interaction within web browsers that is more dynamic and rich than HTML and web browsers otherwise support.
In Chapter 19, we’ll also study Python’s support for XML, structured text that is used as the data transfer medium of client/server dialogs in web service protocols such as XML-RPC, which transfer XML-encoded objects over HTTP and are supported by Python’s xmlrpc standard library package. Such protocols can simplify the interface to web servers in their clients.
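As a small preview of what “XML-encoded objects” means in practice, the sketch below uses the xmlrpc.client module to encode a call as XML text and decode it again, without contacting any server. The method name echo and its arguments are hypothetical; a real client would more likely use xmlrpc.client.ServerProxy, which performs this encoding and the HTTP transfer automatically.

```python
import xmlrpc.client

# Hypothetical call: a remote method named 'echo' with three arguments
params = ('spam', 42, [1, 2, 3])
request_xml = xmlrpc.client.dumps(params, methodname='echo')

# A server would parse the same text back into Python objects
decoded_params, decoded_name = xmlrpc.client.loads(request_xml)

print(request_xml)                    # the XML text sent over HTTP
print(decoded_params, decoded_name)
```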
In deference to time and space, though, we won’t go into further details on these and other client-side tools here. If you are interested in using Python to script clients, you should take a few minutes to become familiar with the list of Internet tools documented in the Python library reference manual. All work on similar principles but have slightly distinct interfaces.
In Chapter 15, we’ll hop the fence to the other side of the Internet world and explore scripts that run on server machines. Such programs give rise to the grander notion of applications that live entirely on the Web and are launched by web browsers. As we take this leap in structure, keep in mind that the tools we met in this and the preceding chapter are often sufficient to implement all the distributed processing that many applications require, and they can work in harmony with scripts that run on a server. To completely understand the Web worldview, though, we need to explore the server realm, too.
Before we get there, though, the next chapter puts concepts we’ve learned here to work by presenting a complete client-side program—a full-blown mail client GUI, which ties together many of the tools we’ve learned and coded. In fact, much of the email work we’ve done in this chapter was designed to lay the groundwork we’ll need to tackle the realistically scaled PyMailGUI example of the next chapter. Really, much of this book so far has served to build up skills required to equip us for this task: as we’ll see, PyMailGUI combines system tools, GUIs, and client-side Internet protocols to produce a useful system that does real work. As an added bonus, this example will help us understand the trade-offs between the client solutions we’ve met here and the server-side solutions we’ll study later in this part of the book.
[48] There is also support in the Python world for other technologies that some might classify as “client-side scripting,” too, such as Jython/Java applets; XML-RPC and SOAP web services; and Rich Internet Application tools like Flex, Silverlight, pyjamas, and AJAX. These were all introduced early in Chapter 12. Such tools are generally bound up with the notion of web-based interactions—they either extend the functionality of a web browser running on a client machine, or simplify web server access in clients. We’ll study browser-based techniques in Chapters 15 and 16; here, client-side scripting means the client side of common Internet protocols such as FTP and email, independent of the Web or web browsers. At the bottom, web browsers are really just desktop GUI applications that make use of client-side protocols, including those we’ll study here, such as HTTP and FTP. See Chapter 12 as well as the end of this chapter for more on other client-side techniques.
[49] No, really. The second edition of this book included a tale of woe here about how my ISP forced its users to wean themselves off Telnet access. This seems like a small issue today. Common practice on the Internet has come far in a short time. One of my sites has even grown too complex for manual edits (except, of course, to work around bugs in the site-builder tool). Come to think of it, so has Python’s presence on the Web. When I first found Python in 1992, it was a set of encoded email messages, which users decoded and concatenated and hoped the result worked. Yes, yes, I know—gee, Grandpa, tell us more…
[50] Usage note: These scripts are highly dependent on the FTP server functioning properly. For a while, the upload script occasionally had timeout errors when running over my current broadband connection. These errors went away later, when my ISP fixed or reconfigured their server. If you have failures, try running against a different server; connecting and disconnecting around each transfer may or may not help (some servers limit their number of connections).
[51] IMAP, or Internet Message Access Protocol, was designed as an alternative to POP, but it is still not as widely available today, and so it is not presented in this text. For instance, major commercial providers used for this book’s examples provide only POP (or web-based) access to email. See the Python library manual for IMAP server interface details. Python used to have an rfc822 module as well, but it’s been subsumed by the email package in 3.X.
[52] We all know by now that such junk mail is usually referred to as spam, but not everyone knows that this name is a reference to a Monty Python skit in which a restaurant’s customers find it difficult to hear the reading of menu options over a group of Vikings singing an increasingly loud chorus of “spam, spam, spam…”. Hence the tie-in to junk email. Spam is used in Python program examples as a sort of generic variable name, though it also pays homage to the skit.
[53] There will be more on POP message numbers when we study mailtools later in this chapter. Interestingly, the list of message numbers to be deleted need not be sorted; they remain valid for the duration of the delete connection, so deletions earlier in the list don’t change numbers of messages later in the list while you are still connected to the POP server. We’ll also see that some subtle issues may arise if mails in the server inbox are deleted without pymail’s knowledge (e.g., by your ISP or another email client); although very rare, suffice it to say for now that deletions in this script are not guaranteed to be accurate.