Content compression with the Accept-Encoding header and language selection with the Accept-Language header are examples of content negotiation, where the client specifies its preferences regarding the format and the content of the requested resource. The following headers can also be used for this:
Accept
: For requesting a preferred file format
Accept-Charset
: For requesting the resource in a preferred character set
There are additional aspects to the content negotiation mechanism, but because it is inconsistently supported and can become quite involved, we won't be covering it in this chapter. RFC 7231 contains all the details that you need. Take a look at sections such as 3.4, 5.3, 6.4.1, and 6.5.6 if you find that your application requires this.
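As a rough illustration of sending these preferences, the following sketch builds a request that carries Accept and Accept-Charset headers by using urllib.request.Request. The URL and the preference values are placeholders chosen for this example, and a server is free to ignore them.

```python
from urllib.request import Request

# Hypothetical URL and preference values, chosen only for illustration.
req = Request(
    'http://www.example.com/',
    headers={
        'Accept': 'application/json, text/html;q=0.8',
        'Accept-Charset': 'utf-8',
    },
)

# Request normalizes header names with str.capitalize(), so
# 'Accept-Charset' is stored as 'Accept-charset'.
for name, value in req.header_items():
    print(name, '=', value)

# Passing req to urlopen() would send the request with these headers.
```

Nothing is sent over the network here; the sketch only shows how the headers are attached to the request object.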
HTTP can be used as a transport for any type of file or data. The server can use the Content-Type header in a response to inform the client about the type of data that it has sent in the body. This is the primary means by which an HTTP client determines how it should handle the body data that the server returns to it.
To view the content type, we inspect the value of the response header, as shown here:
>>> from urllib.request import urlopen
>>> response = urlopen('http://www.debian.org')
>>> response.getheader('Content-Type')
'text/html'
The values in this header are taken from a list which is maintained by IANA. These values are variously called content types, Internet media types, or MIME types (MIME stands for Multipurpose Internet Mail Extensions, the specification in which the convention was first established). The full list can be found at http://www.iana.org/assignments/media-types.
There are registered media types for many of the types of data that are transmitted across the Internet. Some common ones are:
Media type | Description |
---|---|
text/html | HTML document |
text/plain | Plain text document |
image/jpeg | JPG image |
application/pdf | PDF document |
application/json | JSON data |
application/xhtml+xml | XHTML document |
Another media type of interest is application/octet-stream, which in practice is used for files that don't have an applicable media type. An example of this would be a pickled Python object. It is also used for files whose format is not known by the server. In order to handle responses with this media type correctly, we need to discover the format in some other way. Possible approaches are as follows:
- If the resource has a file name or URL with an extension, the mimetypes module can be used for determining the media type (go to Chapter 3, APIs in Action to see an example of this).
- The imghdr module can be used for images (note that imghdr was deprecated in Python 3.11 and removed in 3.13), and the third-party python-magic package, or the GNU file command, can be used for other types.
Content type values can contain optional additional parameters that provide further information about the type. This is usually used to supply the character set that the data uses. For example:
Content-Type: text/html; charset=UTF-8
In this case, we're being told that the character set of the document is UTF-8. The parameter is included after a semicolon, and it always takes the form of a key/value pair.
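One way to pull the media type and the charset parameter apart without splitting strings by hand is to lean on the standard library's email.message.Message class, which understands this header syntax. The helper name parse_content_type is made up for this sketch; it is only one possible approach.

```python
from email.message import Message


def parse_content_type(value):
    """Return (media_type, charset) for a Content-Type header value.

    charset is None when the header carries no charset parameter.
    """
    msg = Message()
    msg['Content-Type'] = value
    # Both accessors return lower-cased values.
    return msg.get_content_type(), msg.get_content_charset()


print(parse_content_type('text/html; charset=UTF-8'))  # ('text/html', 'utf-8')
print(parse_content_type('application/json'))          # ('application/json', None)
```

The headers attribute of the response object returned by urlopen() is a subclass of the same Message class, so response.headers.get_content_charset() should behave in the same way.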
Let's work through an example: downloading the Python home page and using the Content-Type value it returns. First, we submit our request:
>>> response = urlopen('http://www.python.org')
Then, we check the Content-Type value of our response, and extract the character set:
>>> format, params = response.getheader('Content-Type').split(';')
>>> params
' charset=utf-8'
>>> charset = params.split('=')[1]
>>> charset
'utf-8'
Lastly, we decode our response content by using the supplied character set:
>>> content = response.read().decode(charset)
Note that quite often, the server either doesn't supply a charset in the Content-Type header, or it supplies the wrong charset. So, this value should be taken as a suggestion. This is one of the reasons that we look at the Requests library later in this chapter. It will automatically gather all the hints that it can find about what character set should be used for decoding a response body and make a best guess for us.
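In the meantime, a minimal standard-library sketch of treating the charset as a suggestion might look like the following. The function name decode_body and its fallback parameter are invented for this illustration, and the approach is far simpler than the detection that Requests performs: it tries the declared charset first, and falls back to a default with replacement characters if that fails.

```python
def decode_body(raw, charset=None, fallback='utf-8'):
    """Decode raw bytes, treating the declared charset as a hint only."""
    if charset:
        try:
            return raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that don't match the
            # declared charset: fall through to the fallback.
            pass
    # Replace undecodable bytes rather than raising an exception.
    return raw.decode(fallback, errors='replace')


print(decode_body('café'.encode('utf-8'), 'utf-8'))     # declared and correct
print(decode_body('café'.encode('utf-8'), None))        # no charset supplied
print(decode_body(b'plain ascii', 'no-such-charset'))   # bogus charset name
```

With a real response, the first two arguments would be response.read() and the charset extracted from the Content-Type header.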