Chapter 19. Client-Side Network Protocol Modules

Python’s standard library supplies several modules to simplify the use of Internet protocols, particularly on the client side (or for some simple servers). These days, the Python Package Index (PyPI) offers many more such packages. These third-party packages support a wider array of protocols, and several offer better APIs than the standard library’s equivalents. When you need to use a protocol that’s missing from the standard library, or covered by the standard library in a way you think is not satisfactory, be sure to search PyPI—you’re likely to find better solutions there.

In this chapter, we cover some standard library packages that may prove satisfactory for some uses of network protocols, especially simple ones: when you can code without requiring third-party packages, your application or library is easier to install on other machines. We also mention a few third-party packages covering important network protocols not included in the standard library. This chapter does not cover third-party packages using an asynchronous programming approach: for that kind of package, see “The asyncio Module (v3 Only)”.

For the specific, very frequent use case of HTTP clients, and of other network resources (such as anonymous FTP sites) best accessed via URLs, the standard library offers only complex support that is incompatible between v2 and v3. For that case, therefore, we have chosen to cover and recommend the splendid third-party package requests, with its well-designed API and v2/v3 compatibility, instead of any standard library modules.

Email Protocols

Most email today is sent via servers implementing the Simple Mail Transfer Protocol (SMTP) and received via servers and clients implementing the Post Office Protocol version 3 (POP3) and/or IMAP4 (either in the original version, specified in RFC 1730, or IMAP4rev1, specified in RFC 2060). Client-side, these protocols are supported by the Python standard library modules smtplib, poplib, and imaplib.

If you need to write a client that can connect via either POP3 or IMAP4, the standard recommendation is to pick IMAP4: it is definitely more powerful, and—according to Python’s own online docs—often more accurately implemented on the server side. Unfortunately, the standard library module for IMAP4, imaplib, is also inordinately complicated, and far too vast to cover in this book. If you do choose to go that route, use the online docs, inevitably complemented by the voluminous RFCs 1730 or 2060, and possibly other related RFCs, such as 5161 and 6855 for capabilities, and 2342 for namespaces. Using the RFCs, in addition to the online docs for the standard library module, can’t be avoided: many of the arguments you pass to imaplib functions and methods, and many results of such calls, are strings whose formats are documented only in the RFCs, not in Python’s own online docs. Therefore, we don’t cover imaplib in this book; in the following sections, we only cover poplib and smtplib.

If you do want to use the rich and powerful IMAP4 protocol, we highly recommend that you do so, not directly via the standard library’s imaplib, but rather by leveraging the simpler, higher-abstraction-level third-party package IMAPClient, available with a pip install for both v2 and v3, and well-documented online.

The poplib Module

The poplib module supplies a class POP3 to access a POP mailbox. The specifications of the POP protocol are in RFC 1939.

POP3

class POP3(host,port=110)

Returns an instance p of class POP3 connected to host and port.

Sister class POP3_SSL behaves just the same, but connects to the host (by default on port 995) over a more secure TLS channel; it’s needed to connect to email servers that demand some minimum security, such as 'pop.gmail.com'.a

a To connect to a Gmail account in particular, you also need to configure that account to enable POP, “Allow less secure apps,” and avoid two-step verification—actions that in general we can’t recommend, since they weaken your email’s security.

Instance p supplies many methods, of which the most frequently used are the following (in each case, msgnum, the identifying number of a message, can be a string or an int):

dele

p.dele(msgnum)

Marks message msgnum for deletion and returns the server response string. The server queues such deletion requests, and executes them later when you terminate this connection by calling p.quit.

list

p.list(msgnum=None)

Returns a tuple (response, messages, octets), where response is the server response string; messages is a list of bytestrings, each of two words b'msgnum bytes', giving the message number and length in bytes of each message in the mailbox; and octets is the length in bytes of the total response. When msgnum is not None, list returns just a string (the server response for the given msgnum) instead of a tuple.

pass_

p.pass_(password)

Sends the password to the server. Must be called after p.user. The trailing underscore in the name is needed because pass is a Python keyword. Returns the server response string.

quit

p.quit()

Ends the session and tells the server to perform deletions that were requested by calls to p.dele. Returns the server response string.

retr

p.retr(msgnum)

Returns a three-item tuple (response,lines,bytes), where response is the server response string, lines the list of all lines in message msgnum as bytestrings, and bytes the total number of bytes in the message.

set_debuglevel

p.set_debuglevel(debug_level)

Sets debug level to int debug_level: 0, default, for no debugging; 1 for a modest amount of debugging output; and 2 or more for a complete output trace of all control information exchanged with the server.

stat

p.stat()

Returns pair (num_msgs,bytes), where num_msgs is the number of messages in the mailbox, bytes the total number of bytes.

top

p.top(msgnum,maxlines)

Like retr, but returns at most maxlines lines from the message’s body (in addition to returning all the lines from the headers, just like retr does). Can be useful for peeking at the start of long messages.

user

p.user(username)

Sends the username to the server; must be followed by a call to p.pass_.
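Putting these methods together, here is a minimal sketch of a POP3 session over TLS; the host, username, and password arguments are placeholders for your own server’s values, and the helper that parses p.list() entries is ours, not part of poplib:

```python
import poplib

def parse_list_entry(entry):
    """Split one b'msgnum bytes' item from p.list() into two ints."""
    msgnum, nbytes = entry.split()
    return int(msgnum), int(nbytes)

def mailbox_sizes(host, username, password):
    """Log in over TLS; return (num_msgs, [(msgnum, bytes), ...])."""
    p = poplib.POP3_SSL(host)      # port 995 by default
    try:
        p.user(username)
        p.pass_(password)
        num_msgs, total_bytes = p.stat()
        response, messages, octets = p.list()
        return num_msgs, [parse_list_entry(m) for m in messages]
    finally:
        p.quit()                   # also commits any p.dele requests
```

Wrapping the session in try/finally ensures p.quit runs (and thus any queued deletions execute) even when an intervening call raises.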

The smtplib Module

The smtplib module supplies a class SMTP to send mail to an SMTP server. The specifications of the SMTP protocol are in RFC 2821.

SMTP

class SMTP([host,port=25])

Returns an instance s of the class SMTP. When host (and optionally port) is given, implicitly calls s.connect(host,port).

The sister class SMTP_SSL behaves just the same, but connects to the host (by default on port 465) over a more secure TLS channel; it’s needed to connect to email servers that demand some minimum security, such as 'smtp.gmail.com'.

The instance s supplies many methods, of which the most frequently used are the following:

connect

s.connect(host='127.0.0.1',port=25)

Connects to an SMTP server on the given host (by default, the local host) and port (port 25 is the default port for the SMTP service; 465 is the default port for the more-secure “SMTP over TLS”).

login

s.login(user,password)

Logs into the server with the given user and password. Needed only if the SMTP server requires authentication (nowadays, just about all do).

quit

s.quit()

Terminates the SMTP session.

sendmail

s.sendmail(from_addr,to_addrs,msg_string)

Sends mail message msg_string from the sender whose address is in string from_addr to each of the recipients whose addresses are the items of list to_addrs. msg_string must be a complete RFC 822 message in a single multiline bytestring: the headers, an empty line for separation, then the body. from_addr and to_addrs only direct the mail transport, and don’t add or change headers in msg_string. To prepare RFC 822–compliant messages, use the package email, covered in “MIME and Email Format Handling”.
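Here is a minimal sketch tying sendmail to the email package; the host, user, and password arguments are placeholders for your own SMTP server’s values:

```python
import smtplib
from email.mime.text import MIMEText

def build_message(from_addr, to_addrs, subject, body):
    """Prepare an RFC 822-compliant message via the email package."""
    msg = MIMEText(body)
    msg['From'] = from_addr
    msg['To'] = ', '.join(to_addrs)
    msg['Subject'] = subject
    return msg

def send_note(host, user, password, from_addr, to_addrs, subject, body):
    """Send one text message over SMTP-over-TLS (port 465 by default)."""
    s = smtplib.SMTP_SSL(host)
    try:
        s.login(user, password)
        # sendmail transports the message; the headers the recipient
        # sees come only from the message string itself
        s.sendmail(from_addr, to_addrs,
                   build_message(from_addr, to_addrs,
                                 subject, body).as_string())
    finally:
        s.quit()
```

Note that the To header in the message and the to_addrs argument to sendmail are set separately, as the entry above explains: the former is what recipients see, the latter is what directs delivery.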

HTTP and URL Clients

Most of the time, your code uses HTTP and FTP protocols through the higher-abstraction URL layer, supported by the modules and packages covered in the following sections. Python’s standard library also offers lower-level, protocol-specific modules that are less often used: for FTP clients, ftplib; for HTTP clients, in v3, http.client, and in v2, httplib (we cover HTTP servers in Chapter 20). If you need to write an FTP server, look at the third-party module pyftpdlib. Implementations of the brand-new HTTP/2 protocol are still at very early stages, but your best bet as of this writing is the third-party module hyper. (Third-party modules, as usual, can be installed from the PyPI repository with the highly recommended tool pip.) We do not cover any of these lower-level modules in this book, but rather focus on higher-abstraction, URL-level access throughout the following sections.

URL Access

A URL is a type of URI (Uniform Resource Identifier). A URI is a string that identifies a resource (but does not necessarily locate it), while a URL locates a resource on the Internet. A URL is a string composed of several optional parts, called components: scheme, location, path, query, and fragment. (The second component is sometimes also known as a net location, or netloc for short.) A URL with all its parts looks like:

scheme://lo.ca.ti.on/pa/th?qu=ery#fragment

In https://www.python.org/community/awards/psf-awards/#october-2016, for example, the scheme is https, the location is www.python.org, the path is /community/awards/psf-awards/, there is no query, and the fragment is october-2016. (Most schemes default to a “well-known port” when the port is not explicitly specified; for example, 80 is the “well-known port” for the HTTP scheme, and 443 for HTTPS.) Some punctuation is part of one of the components it separates; other punctuation characters are just separators, not part of any component. Omitting punctuation implies missing components. For example, in mailto:[email protected], the scheme is mailto, the path is [email protected], and there is no location, query, or fragment. The missing // means the URI has no location, the missing ? means it has no query, and the missing # means it has no fragment.

If the location ends with a colon followed by a number, this denotes a TCP port for the endpoint. Otherwise, the connection uses the “well-known port” associated with the scheme (e.g., port 80 for HTTP).
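In v3, for example, the urllib.parse module (covered next) exposes the explicit port, when one is present, as an attribute of a split URL (the hostname and URL here are just illustrative):

```python
from urllib.parse import urlsplit

# An explicit port in the location shows up as an int attribute
parts = urlsplit('https://www.example.com:8080/some/path')
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080

# With no explicit port, .port is None, and the scheme's
# well-known port applies
print(urlsplit('http://www.example.com/x').port)  # None
```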

The urllib.parse (v3) / urlparse (v2) modules

The urllib.parse (in v3; urlparse in v2) module supplies functions for analyzing and synthesizing URL strings. The most frequently used of these functions are urljoin, urlsplit, and urlunsplit:

urljoin

urljoin(base_url_string,relative_url_string)

Returns a URL string u, obtained by joining relative_url_string, which may be relative, with base_url_string. The joining procedure that urljoin performs to obtain its result u may be summarized as follows:

  • When either of the argument strings is empty, u is the other argument.

  • When relative_url_string explicitly specifies a scheme that is different from that of base_url_string, u is relative_url_string. Otherwise, u’s scheme is that of base_url_string.

  • When the scheme does not allow relative URLs (e.g., mailto) or relative_url_string explicitly specifies a location (even when it is the same as the location of base_url_string), all other components of u are those of relative_url_string. Otherwise, u’s location is that of base_url_string.

  • u’s path is obtained by joining the paths of base_url_string and relative_url_string according to standard syntax for absolute and relative URL paths, as per RFC 1808. For example:

from urllib import parse as urlparse  # in v3
# in v2 it would be: import urlparse
urlparse.urljoin(
  'http://somehost.com/some/path/here','../other/path')
# Result is: 'http://somehost.com/some/other/path'

urlsplit

urlsplit(url_string,default_scheme='',allow_fragments=True)

Analyzes url_string and returns a tuple (actually an instance of SplitResult, which you can treat as a tuple or use with named attributes) with five string items: scheme, netloc, path, query, and fragment. default_scheme is the first item when the url_string lacks an explicit scheme. When allow_fragments is False, the tuple’s last item is always '', whether or not url_string has a fragment. Items corresponding to missing parts are ''. For example:

urlparse.urlsplit(
'http://www.python.org:80/faq.cgi?src=fie')
# Result is: 
# ('http','www.python.org:80','/faq.cgi','src=fie','')

urlunsplit

urlunsplit(url_tuple)

url_tuple is any iterable with exactly five items, all strings. Any return value from a urlsplit call is an acceptable argument for urlunsplit. urlunsplit returns a URL string with the given components and the needed separators, but with no redundant separators (e.g., there is no # in the result when the fragment, url_tuple’s last item, is ''). For example:

urlparse.urlunsplit((
'http','www.python.org:80','/faq.cgi','src=fie',''))
# Result is: 'http://www.python.org:80/faq.cgi?src=fie'

urlunsplit(urlsplit(x)) returns a normalized form of URL string x, which is not necessarily equal to x because x need not be normalized. For example:

urlparse.urlunsplit(
urlparse.urlsplit('http://a.com/path/a?'))
# Result is: 'http://a.com/path/a'

In this case, the normalization ensures that redundant separators, such as the trailing ? in the argument to urlsplit, are not present in the result.

The Third-Party requests Package

The third-party requests package (very well documented online) supports both v2 and v3, and it’s how we recommend you access HTTP URLs. As usual for third-party packages, it’s best installed with a simple pip install requests. In this section, we summarize how best to use it for reasonably simple cases.

Natively, requests only supports the HTTP transport protocol; to access URLs using other protocols, you need to also install other third-party packages (known as protocol adapters), such as requests-ftp for FTP URLs, or others supplied as part of the rich, useful requests-toolbelt package of requests utilities.

requests’ functionality hinges mostly on three classes it supplies: Request, modeling an HTTP request to be sent to a server; Response, modeling a server’s HTTP response to a request; and Session, offering continuity across multiple requests, also known as a session. For the common use case of a single request/response interaction, you don’t need continuity, so you may often just ignore Session.

Sending requests

Most often, you don’t need to explicitly consider the Request class: rather, you call the utility function request, which internally prepares and sends the Request, and returns the Response instance. request has two mandatory positional arguments, both strs: method, the HTTP method to use, then url, the URL to address; and then, many optional named parameters (in the next section, we cover the most commonly used named parameters to the request function).

For further convenience, the requests module also supplies functions whose names are the HTTP methods delete, get, head, options, patch, post, and put; each takes a single mandatory positional argument, url, then the same optional named arguments as the function request.

When you want some continuity across multiple requests, call Session to make an instance s, then use s’s methods request, get, post, and so on, which are just like the functions with the same names directly supplied by the requests module (however, s’s methods merge s’s settings with the optional named parameters to prepare the request to send to the given url).
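Here is a minimal sketch of that merging, inspectable without any network traffic via the session’s prepare_request method; the header name X-Client and the URL are purely illustrative:

```python
import requests

# A Session carries settings (here, a default header) across requests
s = requests.Session()
s.headers.update({'X-Client': 'nutshell-demo'})

# prepare_request merges the session's settings into the request,
# letting us inspect the result without sending anything
req = requests.Request('GET', 'http://www.example.com',
                       params={'q': '1'})
prepared = s.prepare_request(req)
print(prepared.url)                  # http://www.example.com/?q=1
print(prepared.headers['X-Client'])  # nutshell-demo
```

In real code you would instead call, e.g., s.get(url, params={'q': '1'}), and the same merging happens implicitly before the request is sent.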

request’s optional named parameters

The function request (just like the functions get, post, and so on—and methods with the same names on an instance s of class Session) accepts many optional named parameters—refer to the requests package’s excellent online docs for the full set, if you need advanced functionality such as control over proxies, authentication, special treatment of redirection, streaming, cookies, and so on. The most frequently used named parameters are:

data

A dict, a sequence of key/value pairs, a bytestring, or a file-like object, to use as the body of the request

headers

A dict of HTTP headers to send in the request

json

Python data (usually a dict) to encode as JSON as the body of the request

files

A dict with names as keys, file-like objects, or file tuples as values, used with the POST method to specify a multipart-encoding file upload; we cover the format of values for files= in the next section

params

A dict, or a bytestring, to send as the query string with the request

timeout

A float number of seconds, the maximum time to wait for the response before raising an exception

data, json, and files are mutually incompatible ways to specify a body for the request; use only one of them, and only for HTTP methods that do use a body, namely PATCH, POST, and PUT. The one exception: you can pass both a data= with a dict value and a files=, and that is in fact very common usage. In this case, both the key/value pairs in the dict and the files form the body of the request, as a single multipart/form-data whole, according to RFC 2388.

The files argument (and other ways to specify the request’s body)

When you specify the request’s body with json=, or data= passing a bytestring or a file-like object (which must be open for reading, usually in binary mode), the resulting bytes are directly used as the request’s body; when you specify it with data= (passing a dict, or a sequence of key/value pairs), the body is built as a form, from the key/value pairs formatted in application/x-www-form-urlencoded format, according to the relevant web standard.

When you specify the request’s body with files=, the body is also built as a form, in this case with the format set to multipart/form-data (the only way to upload files in a PATCH, POST, or PUT HTTP request). Each file you’re uploading is formatted into its own part of the form; if, in addition, you want the form to pass further nonfile parameters to the server, then also pass a data= with a dict value (or a sequence of key/value pairs) for those parameters—they get encoded into a supplementary part of the multipart form.

To offer you a lot of flexibility, the value of the files= argument can be a dict (its items are taken as an arbitrary-order sequence of name/value pairs), or a sequence of name/value pairs (in the latter case, the sequence’s order is maintained in the resulting request body).

Either way, each value in a name/value pair can be a str (or, best,3 a bytes or bytearray) to be used directly as the uploaded file’s contents; or, a file-like object open for reading (then, requests calls .read() on it, and uses the result as the uploaded file’s contents; we strongly recommend that, in such cases, you open the file in binary mode to avoid any ambiguity regarding content-length). In either case, requests uses the name part of the pair (e.g., the key into the dict) as the file’s name (unless it can improve on that because the open file object is able to reveal its underlying filename), takes its best guess at a content-type, and uses minimal headers for the file’s form-part.

Alternatively, the value in each name/value pair can be a tuple with two to four items: fn, fp, [ft, [fh]] (using square brackets as meta-syntax to indicate optional parts). In this case, fn is the file’s name, fp provides the contents (in just the same way as in the previous paragraph), optional ft provides the content type (if missing, requests guesses it, as in the previous paragraph), and the optional dict fh provides extra headers for the file’s form-part.
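A quick sketch of the four-item tuple form, using the prepare technique shown in the next section to inspect the result without any network traffic (the name report, the filename, and the contents are purely illustrative):

```python
import requests

# files= tuple form: (fn, fp, ft, fh) -- filename, contents,
# content type, and extra headers for this form-part
r = requests.Request('POST', 'http://www.example.com',
    files={'report': ('report.csv', b'a,b\n1,2\n', 'text/csv',
                      {'X-Part': 'demo'})})
p = r.prepare()
print(p.headers['Content-Type'])  # multipart/form-data; boundary=...
```

The prepared body then contains one form-part whose headers carry the given filename, content type, and the extra X-Part header.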

How to study examples of requests

In practical applications, you don’t usually need to consider the internal instance r of the class requests.Request, which functions like requests.post are building, preparing, and then sending on your behalf. However, to understand exactly what requests is doing, working at a lower level of abstraction (building, preparing, and examining r—no need to send it!) is instructive. For example:

import requests

r = requests.Request('GET', 'http://www.example.com',
    data={'foo': 'bar'}, params={'fie': 'foo'})
p = r.prepare()
print(p.url)
print(p.headers)
print(p.body)

prints out (splitting the p.headers dict’s printout for readability):

http://www.example.com/?fie=foo
{'Content-Length': '7',
 'Content-Type': 'application/x-www-form-urlencoded'}
foo=bar

Similarly, when files= is involved:

import requests

r = requests.Request('POST', 'http://www.example.com',
    data={'foo': 'bar'}, files={'fie': 'foo'})
p = r.prepare()
print(p.headers)
print(p.body)

prints out (with several lines split for readability):

{'Content-Type': 'multipart/form-data;
  boundary=5d1cf4890fcc4aa280304c379e62607b',
  'Content-Length': '228'}
b'--5d1cf4890fcc4aa280304c379e62607b
Content-Disposition: form-
data; name="foo"

bar
--5d1cf4890fcc4aa280304c379e62607b


Content-Disposition: form-data; name="fie"; filename="fie"



foo
--5d1cf4890fcc4aa280304c379e62607b--
'

Happy interactive exploring!

The Response class

The one class from the requests module that you always have to consider is Response: every request, once sent to the server (typically, that’s done implicitly by methods such as get), returns an instance r of requests.Response.

The first thing you usually want to do is to check r.status_code, an int that tells you how the request went, in typical “HTTPese”: 200 means “everything’s fine,” 404 means “not found,” and so on. If you’d rather just get an exception for status codes indicating some kind of error, call r.raise_for_status(); that does nothing if the request went fine, but raises a requests.exceptions.HTTPError otherwise. (Other exceptions, not corresponding to any specific HTTP status code, can and do get raised without requiring any such explicit call: e.g., requests.exceptions.ConnectionError for any kind of network problem, or requests.exceptions.Timeout for a timeout.)
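A minimal sketch of the status-check pattern, with no network traffic: we hand-build a Response here purely for illustration (real code gets one back from requests.get and friends, never constructs one):

```python
import requests

# Hand-built Response, for demonstration only
r = requests.models.Response()
r.status_code = 404

caught = None
try:
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    caught = err   # e.g., "404 Client Error: ..."
print(caught)
```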

Next, you may want to check the response’s HTTP headers: for that, use r.headers, a dict (with the special feature of having case-insensitive string-only keys, indicating the header names as listed, e.g., in Wikipedia, per HTTP specs). Most headers can be safely ignored, but sometimes you’d rather check. For example, you may check whether the response specifies which natural language its body is written in, via r.headers.get('content-language'), to offer the user different presentation choices, such as the option to use some kind of natural-language translation service to make the response more acceptable to the user.

You don’t usually need to make specific status or header checks for redirects: by default, requests automatically follows redirects for all methods except HEAD (you can explicitly pass the allow_redirects named parameter in the request to alter that behavior). If you do allow redirects, you may optionally want to check r.history, a list of all Response instances accumulated along the way, oldest to newest, up to but excluding r itself (so, in particular, r.history is empty if there have been no redirects).

In most cases, perhaps after some checks on status and headers, you’ll want to use the response’s body. In simple cases, you’ll just access the response’s body as a bytestring, r.content, or decoded via JSON (once you’ve determined that’s how it’s encoded, e.g., via r.headers.get('content-type')) by calling r.json().
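Again with a hand-built Response for illustration only (real code never touches the internal _content attribute; we fill it in here just to make the example self-contained and network-free):

```python
import requests

r = requests.models.Response()
r.status_code = 200
r._content = b'{"answer": 42}'   # for demonstration only

print(r.content)   # the raw body, as bytes: b'{"answer": 42}'
print(r.json())    # the body decoded as JSON: {'answer': 42}
```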

Often, you’d rather access the response’s body as (Unicode) text, with property r.text. The latter gets decoded (from the octets that actually make up the response’s body) with the codec requests thinks is best, based on the content-type header and a cursory examination of the body itself. You can check what codec has been used (or is about to be used) via the attribute r.encoding, the name of a codec registered with the codecs module, covered in “The codecs Module”. You can even override the choice of codec to use by assigning to r.encoding the name of the codec you choose.

We do not cover other advanced issues, such as streaming, in this book; if you need information on that, check requests’ online docs.

The urllib Package (v3)

Beyond urllib.parse, covered in “The urllib.parse (v3) / urlparse (v2) modules”, the urllib package in v3 supplies the module urllib.robotparser for the specific purpose of parsing a site’s robots.txt file as documented in a well-known informal standard (in v2, use the standard library module robotparser); the module urllib.error, containing all exception types raised by other urllib modules; and, mainly, the module urllib.request, for opening and reading URLs.

The functionality supplied by v3’s urllib.request is parallel to that of v2’s urllib2 module, covered in “The urllib2 Module (v2)”, plus some functionality from v2’s urllib module, covered in “The urllib Module (v2)”. For coverage of urllib.request, check the online docs, supplemented with Michael Foord’s HOWTO.

The urllib Module (v2)

The urllib module, in v2, supplies functions to read data from URLs. urllib supports the following protocols (schemes): http, https, ftp, gopher, and file. file indicates a local file. urllib uses file as the default scheme for URLs that lack an explicit scheme. You can find simple, typical examples of urllib use in Chapter 22, where urllib.urlopen is used to fetch HTML and XML pages that all the various examples parse and analyze.

Functions

The module urllib in v2 supplies a number of functions described in Table 19-1, with urlopen being the simplest and most frequently used.

Table 19-1. Functions of the v2 urllib module

quote

quote(str,safe='/')

Returns a copy of str where special characters are changed into Internet-standard quoted form %xx. Does not quote alphanumeric characters, any of the characters _.- (underscore, period, hyphen), nor any of the characters in string safe. For example:

print(urllib.quote('zip&zap'))
# prints: zip%26zap

quote_plus

quote_plus(str, safe='/')

Like quote, but also changes spaces into plus signs.

unquote

unquote(str)

Returns a copy of str where each quoted form %xx is changed into the corresponding character. For example:

print(urllib.unquote('zip%26zap'))
# prints: zip&zap

unquote_plus

unquote_plus(str)

Like unquote, but also changes plus signs into spaces.

urlcleanup

urlcleanup()

Clears the cache of function urlretrieve, covered below.

urlencode

urlencode(query,doseq=False)

Returns a string with the URL-encoded form of query. query can be either a sequence of (name, value) pairs, or a mapping, in which case the resulting string encodes the mapping’s (key, value) pairs. For example:

urllib.urlencode([('ans',42),('key','val')])
# result is: 'ans=42&key=val'
urllib.urlencode({'ans':42, 'key':'val'})
# result is: 'key=val&ans=42'

The order of items in a dictionary is arbitrary. Should you need the URL-encoded form to have key/value pairs in a specific order, use a sequence as the query argument, as in the first call in this snippet.

When doseq is true, any value in query that is a sequence and is not a string is encoded as separate parameters, one per item in value. For example:

urllib.urlencode([('K',('x','y','z'))], True)
# result is: 'K=x&K=y&K=z'

When doseq is false (the default), each value is encoded as the quote_plus of its string form given by built-in str, regardless of whether the value is a sequence:

urllib.urlencode([('K',('x','y','z'))], False)
# result is: 'K=%28%27x%27%2C+%27y%27%2C+%27z%27%29'

urlopen

urlopen(urlstring,data=None,proxies=None)

Accesses the given URL and returns a read-only file-like object f. f supplies file-like methods read, readline, readlines, and close, as well as two others:

f.geturl()

Returns the URL of f. This may differ from urlstring by normalization (as mentioned for function urlunsplit earlier) and because of HTTP redirects (i.e., indications that the requested data is located elsewhere). urllib supports redirects transparently, and the method geturl lets you check for them if you want.

f.info()

Returns an instance m of the class Message of the module mimetools, covered in “rfc822 and mimetools Modules (v2)”. m’s headers provide metadata about f. For example, 'Content-Type' is the MIME type of the data in f, and m’s methods m.gettype(), m.getmaintype(), and m.getsubtype() provide the same information.

When data is None and urlstring’s scheme is http, urlopen sends a GET request. When data is not None, urlstring’s scheme must be http, and urlopen sends a POST request. data must then be in URL-encoded form, and you normally prepare it with the function urlencode, covered above.

urlopen can use proxies that do not require authentication. Set the environment variables http_proxy, ftp_proxy, and gopher_proxy to the proxies’ URLs to exploit this. You normally perform such settings in your system’s environment, in platform-dependent ways, before you start Python. On macOS only, urlopen transparently and implicitly retrieves proxy URLs from your Internet configuration settings. Alternatively, you can pass as argument proxies a mapping whose keys are scheme names, with the corresponding values being proxy URLs. For example:

f = urllib.urlopen('http://python.org',
proxies={'http':'http://prox:999'})

urlopen does not support proxies that require authentication; for such advanced needs, use the richer library module urllib2, covered in “The urllib2 Module (v2)”.

urlretrieve

urlretrieve(urlstring,filename=None,reporthook=None,
data=None)

Similar to urlopen(urlstring,data), but meant for downloading large files. Returns a pair (f,m), where f is a string that specifies the path to a file on the local filesystem, m an instance of class Message of module mimetools, like the result of method info called on the result value of urlopen, covered above.

When filename is None, urlretrieve copies retrieved data to a temporary local file, and f is the path to the temporary local file. When filename is not None, urlretrieve copies retrieved data to the file named filename, and f is filename. When reporthook is not None, it must be a callable with three arguments, as in the function:

def reporthook(block_count, block_size, file_size):
    print(block_count)

urlretrieve calls reporthook zero or more times while retrieving data. At each call, it passes block_count, the number of blocks of data retrieved so far; block_size, the size in bytes of each block; and file_size, the total size of the file in bytes. urlretrieve passes file_size as -1 when it cannot determine file size, which depends on the protocol involved and on how completely the server implements that protocol. The purpose of reporthook is to allow your program to give graphical or textual feedback to the user about the progress of the file-retrieval operation that urlretrieve performs.

The FancyURLopener class

You normally use the module urllib in v2 through the functions it supplies (most often urlopen). To customize urllib’s functionality, however, you can subclass urllib’s FancyURLopener class and bind an instance of your subclass to the attribute _urlopener of the module urllib. The customizable aspects of an instance f of a subclass of FancyURLopener are the following:

prompt_user_passwd

f.prompt_user_passwd(host,realm)

Returns a pair (user,password) that urllib is to use in order to authenticate access to host in the security realm. The default implementation in class FancyURLopener prompts the user for this data in interactive text mode. Your subclass can override this method in order to interact with the user via a GUI or to fetch authentication data from persistent storage.

version

f.version

The string that f uses to identify itself to the server—for example, via the User-Agent header in the HTTP protocol. You can override this attribute by subclassing or rebind it directly on an instance of FancyURLopener.

The urllib2 Module (v2)

The urllib2 module is a rich, highly customizable superset of the urllib module. urllib2 lets you work directly with advanced aspects of protocols such as HTTP. For example, you can send requests with customized headers as well as URL-encoded POST bodies, and handle authentication in various realms, in both Basic and Digest forms, directly or via HTTP proxies.

In the rest of this section, we cover only the ways in which v2’s urllib2 lets your program customize these advanced aspects of URL retrieval. We do not try to impart the advanced knowledge of HTTP and other network protocols, independent of Python, that you need to make full use of urllib2’s rich functionality.

Functions

urllib2 supplies a function urlopen that is basically identical to urllib’s urlopen. To customize urllib2, install, before calling urlopen, any number of handlers, grouped into an opener object, using the build_opener and install_opener functions.

You can also optionally pass to urlopen an instance of the class Request instead of a URL string. Such an instance may include both a URL string and supplementary information about how to access it, as covered in “The Request class”.

build_opener

build_opener(*handlers)

Creates and returns an instance of the class OpenerDirector (covered in “The OpenerDirector class”) with the given handlers. Each handler can be a subclass of the class BaseHandler, instantiable without arguments, or an instance of such a subclass, however instantiated. build_opener adds instances of various handler classes provided by the module urllib2 in front of the handlers you specify, in order to handle proxies; unknown schemes; the http, file, and https schemes; HTTP errors; and HTTP redirects. However, if you have instances or subclasses of said classes in handlers, this indicates that you want to override these defaults.
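You can verify that the director returned by build_opener holds both the handler you pass and the defaults (a sketch; the import fallback lets the same code run under v3, where these names live in urllib.request):

```python
try:
    import urllib2 as urlreq             # v2
except ImportError:
    import urllib.request as urlreq      # v3 equivalent

# Pass a handler class: build_opener instantiates it for us.
opener = urlreq.build_opener(urlreq.HTTPBasicAuthHandler)
handler_types = [type(h).__name__ for h in opener.handlers]
print('HTTPBasicAuthHandler' in handler_types)  # the handler we supplied
print('HTTPRedirectHandler' in handler_types)   # a default, added in front
```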

install_opener

install_opener(opener)

Installs opener as the opener for further calls to urlopen. opener can be an instance of the class OpenerDirector, such as the result of a call to the function build_opener, or any signature-compatible object.

urlopen

urlopen(url,data=None)

Almost identical to the urlopen function in the module urllib. However, you customize behavior via the opener and handler classes of urllib2 (covered in “The OpenerDirector class” and “Handler classes”) rather than via the class FancyURLopener as in the module urllib. The argument url can be a URL string, like for the urlopen function in the module urllib. Alternatively, url can be an instance of the class Request, covered in the next section.

The Request class

You can optionally pass to the function urlopen an instance of the class Request instead of a URL string. Such an instance can embody both a URL and, optionally, other information about how to access the target URL.

Request

class Request(urlstring,data=None,headers={})

urlstring is the URL that this instance of the Request class embodies. For example, when there are no data and headers, calling:

urllib2.urlopen(urllib2.Request(urlstring))

is just like calling:

urllib2.urlopen(urlstring)

When data is not None, the Request constructor implicitly calls on the new instance r its method r.add_data(data). headers must be a mapping of header names to header values. The Request constructor executes the equivalent of the loop:

for k,v in headers.items(): r.add_header(k,v)

The Request constructor also accepts optional parameters allowing fine-grained control of HTTP cookie behavior, but such advanced functionality is rarely necessary—the class’s default handling of cookies is generally sufficient. For fine-grained, client-side control of cookies, see the online docs; we do not cover the cookielib module of the standard library in this book.
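Building a Request with custom headers needs no network access, so its accessors are easy to explore (a sketch; www.example.com is a placeholder, and the import fallback covers v3, whose urllib.request.Request supplies the same methods; note that header keys are stored in capitalized form):

```python
try:
    from urllib2 import Request            # v2
except ImportError:
    from urllib.request import Request     # v3 equivalent

r = Request('http://www.example.com/path',
            headers={'User-Agent': 'MyApp/1.0'})
print(r.get_full_url())             # the URL given to the constructor
print(r.get_method())               # 'GET', since there is no data
print(r.has_header('User-agent'))   # True: keys are stored capitalized
```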

An instance r of the class Request supplies the following methods:

add_data

r.add_data(data)

Sets data as r’s data. Calling urlopen(r) then becomes like calling urlopen(r,data)—that is, it requires r’s scheme to be http and uses a POST request with a body of data, which must be a URL-encoded string.

Despite its name, the method add_data does not necessarily add the data. If r already had data, set in r’s constructor or by previous calls to r.add_data, then the latest call to r.add_data replaces the previous value of r’s data with the new given one. In particular, r.add_data(None) removes r’s previous data, if any.

add_header

r.add_header(key,value)

Adds a header with the given key and value to r’s headers. If r’s scheme is http, r’s headers are sent as part of the request. When you add more than one header with the same key, later additions overwrite previous ones, so out of all headers with one given key, only the one given last matters.
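The overwriting behavior is easy to verify with the get_header method, which both v2's and v3's Request supply (a sketch; the header name and values are made up):

```python
try:
    from urllib2 import Request            # v2
except ImportError:
    from urllib.request import Request     # v3 equivalent

r = Request('http://www.example.com/')
r.add_header('X-Token', 'first')
r.add_header('X-Token', 'second')   # replaces, does not append
print(r.get_header('X-token'))      # 'second'; keys stored capitalized
```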

add_unredirected_header


r.add_unredirected_header(key,value)

Like add_header, except that the header is added only for the first request, and is not used if the requesting procedure meets and follows any further HTTP redirection.

get_data

r.get_data()

Returns the data of r, either None or a URL-encoded string.

get_full_url

r.get_full_url()

Returns the URL of r, as given in the constructor for r.

get_host

r.get_host()

Returns the host component of r’s URL.

get_method

r.get_method()

Returns the HTTP method of r, the string 'GET' or 'POST'.

get_selector

r.get_selector()

Returns the selector component of r’s URL (the path and all following components).

get_type

r.get_type()

Returns the scheme component of r’s URL (i.e., the protocol).

has_data

r.has_data()

Returns True when r has data; equivalent to r.get_data() is not None.

has_header

r.has_header(key)

Returns True if r has a header with the given key; otherwise, returns False.

set_proxy

r.set_proxy(host,scheme)

Sets r to use a proxy at the given host and scheme for accessing r’s URL.

The OpenerDirector class

An instance d of the class OpenerDirector collects instances of handler classes and orchestrates their use to open URLs of various schemes and to handle errors. Normally, you create d by calling the function build_opener and then install it by calling the function install_opener. For advanced uses, you may also access various attributes and methods of d, but this is a rare need and we do not cover it further in this book.

Handler classes

The urllib2 module supplies a class called BaseHandler to use as the superclass of any custom handler classes you write. urllib2 also supplies many concrete subclasses of BaseHandler that handle the schemes gopher, ftp, http, https, and file, as well as authentication, proxies, redirects, and errors. Writing custom urllib2 handlers is an advanced topic, and we do not cover it further in this book.

Handling authentication

urllib2’s default opener does no authentication. To get authentication, call build_opener to build an opener with instances of HTTPBasicAuthHandler, ProxyBasicAuthHandler, HTTPDigestAuthHandler, and/or ProxyDigestAuthHandler, depending on whether you want authentication to be directly in HTTP or to a proxy, and on whether you need Basic or Digest authentication.

To instantiate each of these authentication handlers, use an instance x of the class HTTPPasswordMgrWithDefaultRealm as the only argument to the authentication handler’s constructor. You normally use the same x to instantiate all the authentication handlers you need. To record users and passwords for given authentication realms and URLs, call x.add_password one or more times.

add_password

x.add_password(realm,URLs,user,password)

Records in x the pair (user,password) as the credentials in the given realm for URLs given by URLs. realm is a string that names an authentication realm; when realm is None, it matches any realm not specifically recorded. URLs is a URL string or a sequence of URL strings. A URL u is deemed applicable for these credentials if there is an item u1 of URLs such that the location components of u and u1 are equal, and the path component of u1 is a prefix of that of u. Other components (scheme, query, fragment) don’t affect applicability for authentication purposes.
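You can check which credentials a given URL would receive via the password manager's find_user_password method (a sketch; myhost.com and the credentials are placeholders, and the same class exists in v3 as urllib.request.HTTPPasswordMgrWithDefaultRealm):

```python
try:
    from urllib2 import HTTPPasswordMgrWithDefaultRealm          # v2
except ImportError:
    from urllib.request import HTTPPasswordMgrWithDefaultRealm   # v3

x = HTTPPasswordMgrWithDefaultRealm()
x.add_password(None, 'http://myhost.com/app/', 'auser', 'apassword')

# The recorded path is a prefix of this URL's path, so it matches:
print(x.find_user_password(None, 'http://myhost.com/app/page.html'))
# A URL whose path does not share that prefix does not match:
print(x.find_user_password(None, 'http://myhost.com/other/'))
```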

The following example shows how to use urllib2 with basic HTTP authentication:

import urllib2

x = urllib2.HTTPPasswordMgrWithDefaultRealm()
x.add_password(None, 'http://myhost.com/',
               'auser', 'apassword')
auth = urllib2.HTTPBasicAuthHandler(x)
opener = urllib2.build_opener(auth)
urllib2.install_opener(opener)

flob = urllib2.urlopen('http://myhost.com/index.html')
for line in flob.readlines(): print line,

Other Network Protocols

Many, many other network protocols are in use—a few are best supported by Python’s standard library, but, for most of them, you’ll be happier researching third-party modules on PyPI.

To connect as if you were logging into another machine (or, into a separate login session on your own node), you can use the secure SSH protocol, supported by the third-party module paramiko, or the higher abstraction layer wrapper around it, the third-party module spur. (You can also, with some likely security risks, still use classic telnet, supported by the standard library module telnetlib.)

Other network protocols include:

  • NNTP, to access the somewhat-old Usenet News servers, supported by the standard library module nntplib

  • XML-RPC, for a rudimentary remote procedure call functionality, supported by xmlrpc.client (xmlrpclib in v2)

  • gRPC, for a more modern and advanced remote procedure call functionality, supported by the third-party module grpcio

  • NTP, to get precise time off the network, supported by the third-party module ntplib

  • SNMP, for network management, supported by the third-party module pysnmp

…among many others. No single book, including this one, could possibly cover all these protocols and their supporting modules. Rather, our best suggestion in the matter is a strategic one: whenever you decide that your application needs to interact with some other system via a certain networking protocol, don’t rush to implement your own modules to support that protocol. Instead, search and ask around, and you’re likely to find excellent existing Python modules already supporting that protocol.4
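As a taste of how simple some of these protocol modules are to use, XML-RPC marshals a method call into XML and back in a few lines, with no server needed (a sketch; 'multiply' is a made-up method name, and the import fallback covers v2's xmlrpclib):

```python
try:
    import xmlrpclib as xmlrpc_client           # v2
except ImportError:
    from xmlrpc import client as xmlrpc_client  # v3

# Marshal a call to a hypothetical 'multiply' method, then unmarshal it.
payload = xmlrpc_client.dumps((6, 7), methodname='multiply')
params, method = xmlrpc_client.loads(payload)
print(method, params)   # the method name and parameters round-trip
```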

Should you find some bug or missing feature in such modules, open a bug or feature request (and, ideally, supply a patch or pull request that would fix the problem and satisfy your application’s needs!). In other words, become an active member of the open source community, rather than just a passive user: you will be welcome there, meet your own needs, and help many other people in the process. “Give forward,” since you cannot “give back” to all the awesome people who contributed to give you most of the tools you’re using!

1 HTTP, the Hypertext Transfer Protocol, is the core protocol of the World Wide Web: every web server and browser uses it, and it has become the dominant application-level protocol on the Internet today.

2 Uniform Resource Locators

3 As it gives you complete, explicit control of exactly what octets are uploaded.

4 Even more important: if you think you need to invent a brand-new protocol and implement it on top of sockets, think again, and search carefully: it’s far more likely that one or more of the huge number of existing Internet protocols meets your needs just fine!
