15. Python and the Web

This chapter tackles some aspects of web programming with Python. This is a really vast area, but I’ve selected three main topics for your amusement: screen scraping, CGI, and web frameworks, represented here by Flask.

In addition, I give you some pointers for finding the proper toolkits for more advanced web application and web service development. For extended examples using CGI, see Chapters 25 and 26. For an example of using the specific web service protocol XML-RPC, see Chapter 27.

Screen Scraping

Screen scraping is a process whereby your program downloads web pages and extracts information from them. This is a useful technique that is applicable whenever there is a page online that has information you want to use in your program. It is especially useful, of course, if the web page in question is dynamic, that is, if it changes over time. Otherwise, you could just download it once and extract the information manually. (The ideal situation is, of course, one where the information is available through web services, as discussed later in this chapter.)

Conceptually, the technique is very simple. You download the data and analyze it. You could, for example, simply use urllib, get the web page’s HTML source, and then use regular expressions (see Chapter 10) or another technique to extract the information. Let’s say, for example, that you wanted to extract the various job titles and their URLs from the Python Job Board, at http://python.org/jobs . You browse the source and see that the titles and URLs can be found as links like this one:

<a href="/jobs/1970/">Python Engineer</a>

Listing 15-1 shows a sample program that uses urllib and re to extract the required information.

Listing 15-1. A Simple Screen-Scraping Program
from urllib.request import urlopen
import re
p = re.compile(r'<a href="(/jobs/\d+)/">(.*?)</a>')
text = urlopen('http://python.org/jobs').read().decode()
for url, name in p.findall(text):
    print('{} ({})'.format(name, url))

The code could certainly be improved, but it does its job pretty well. There are, however, at least three weaknesses with this approach.

  • The regular expression isn’t exactly readable. For more complex HTML code and more complex queries, the expressions can become even hairier and more unmaintainable.

  • It doesn’t deal with HTML peculiarities like CDATA sections and character entities (such as &amp;). If you encounter such beasts, the program will, most likely, fail.

  • The regular expression is tied to details in the HTML source code, rather than some more abstract structure. This means that small changes in how the web page is structured can break the program. (By the time you’re reading this, it may already be broken.)

The following sections deal with two possible solutions for the problems posed by the regular expression-based approach. The first is to use a program called Tidy (as a Python library) together with XHTML parsing. The second is to use a library called Beautiful Soup, specifically designed for screen scraping.

Note

There are other tools for screen scraping with Python. You might, for example, want to check out Ka-Ping Yee’s scrape.py (found at http://zesty.ca/python ).

Tidy and XHTML Parsing

The Python standard library has plenty of support for parsing structured formats such as HTML and XML (see the Python Library Reference’s “Structured Markup Processing Tools” section). I discuss XML and XML parsing in more depth in Chapter 22. In this section, I just give you the tools needed to deal with XHTML, one of the two concrete syntaxes described by the HTML 5 specification, which happens to be a form of XML. Much of what is described should work equally well with plain HTML.

If every web page consisted of correct and valid XHTML, the job of parsing it would be quite simple. The problem is that older HTML dialects are a bit sloppier, and some people don’t even care about the strictures of those sloppier dialects. The reason for this is, probably, that most web browsers are quite forgiving and will try to render even the most jumbled and meaningless HTML as best they can. If this happens to look acceptable to the page authors, they may be satisfied. This does make the job of screen scraping quite a bit harder, though.

The general approach for parsing HTML in the standard library is event-based; you write event handlers that are called as the parser moves along the data. The standard library module html.parser will let you parse really sloppy HTML in this manner, but if you want to extract data based on document structure (such as the first item after the second level-two heading), you’ll need to do some heavy guessing if there are missing tags, for example. You are certainly welcome to do this, if you like, but there is another way: Tidy.

What’s Tidy ?

Tidy is a tool for fixing ill-formed and sloppy HTML. It can fix a range of common errors in a rather intelligent manner, doing a lot of work that you would probably rather not do yourself. It’s also quite configurable, letting you turn various corrections on or off.

Here is an example of an HTML file filled with errors, some of them just old-school HTML, and some of them plain wrong (can you spot all the problems?):

<h1>Pet Shop
<h2>Complaints</h3>


<p>There is <b>no <i>way</b> at all</i> we can accept returned
parrots.


<h1><i>Dead Pets</h1>

<p>Our pets may tend to rest at times, but rarely die within the
warranty period.


<i><h2>News</h2></i>

<p>We have just received <b>a really nice parrot.

<p>It’s really nice.</b>

<h3><hr>The Norwegian Blue</h3>

<h4>Plumage and <hr>pining behavior</h4>
<a href="#norwegian-blue">More information<a>


<p>Features:
<body>
<li>Beautiful plumage

Here is the version that is fixed by Tidy:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h1>Pet Shop</h1>
<h2>Complaints</h2>
<p>There is <b>no <i>way</i></b> <i>at all</i> we can accept
returned parrots.</p>
<h1><i>Dead Pets</i></h1>
<p><i>Our pets may tend to rest at times, but rarely die within the
warranty period.</i></p>
<h2><i>News</i></h2>
<p>We have just received <b>a really nice parrot.</b></p>
<p><b>It’s really nice.</b></p>
<hr>
<h3>The Norwegian Blue</h3>
<h4>Plumage and</h4>
<hr>
<h4>pining behavior</h4>
<a href="#norwegian-blue">More information</a>
<p>Features:</p>
<ul>
<li>Beautiful plumage</li>
</ul>
</body>
</html>

Of course, Tidy can’t fix all problems with an HTML file, but it does make sure it’s well formed (that is, all elements nest properly), which makes it much easier for you to parse it.

Getting Tidy

There are several Python wrappers for the Tidy library, and which one is the most up-to-date seems to vary a bit. If you’re using pip, you can have a look at your options by using this:

$ pip search tidy

A good candidate is PyTidyLib, which you could install as follows:

$ pip install pytidylib
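
If you do install PyTidyLib, cleaning up a page can be as simple as the following minimal sketch, using the library's tidy_document function, which takes an HTML string and returns the cleaned-up document together with a report of warnings. (This assumes the underlying Tidy library is available on your system; the file name messy.html is made up for the occasion.)

from tidylib import tidy_document  # third-party: pip install pytidylib

messy = open('messy.html').read()
fixed, report = tidy_document(messy)

print(fixed)   # the well-formed document
print(report)  # what Tidy warned about and fixed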

You don’t have to install a wrapper for the library, though. If you’re running a UNIX or Linux machine of some sort, it’s quite possible that you have the command-line version of Tidy available. And no matter what operating system you’re using, you can probably get an executable binary from the Tidy web site ( http://html-tidy.org ). Once you have the binary version, you can use the subprocess module (or some of the popen functions) to run the Tidy program. Assuming, for example, that you have a messy HTML file called messy.html and that you have the command-line version of Tidy in your execution path, the following program will run Tidy on it and print the result:

from subprocess import Popen, PIPE

text = open('messy.html').read()
tidy = Popen('tidy', stdin=PIPE, stdout=PIPE, stderr=PIPE)

tidy.stdin.write(text.encode())
tidy.stdin.close()

print(tidy.stdout.read().decode())

If Popen can’t find tidy, you might want to provide it with a full path to the executable.

In practice, instead of printing the result, you would, most likely, extract some useful information from it, as demonstrated in the following sections.

But Why XHTML ?

The main difference between XHTML and older forms of HTML (at least for our current purposes) is that XHTML is quite strict about closing all elements explicitly. So in HTML you might end one paragraph simply by beginning another (with a <p> tag), but in XHTML, you first need to close the paragraph explicitly (with a </p> tag). This makes XHTML much easier to parse, because you can tell directly when you enter or leave the various elements. Another advantage of XHTML (which I won’t really capitalize on in this chapter) is that it is an XML dialect, so you can use all kinds of nifty XML tools on it, such as XPath. (For more about XML, see Chapter 22; for more about the uses of XPath, see, for example, http://www.w3schools.com/xml/xml_xpath.asp .)
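
As a tiny illustration of the XML route (which, as mentioned, I won’t pursue further here), this sketch picks links out of a well-formed snippet with the standard library’s xml.etree.ElementTree module, ignoring namespace details. The snippet itself is made up:

from xml.etree.ElementTree import fromstring

doc = fromstring('<html><body><a href="/jobs/1970/">Python Engineer</a></body></html>')
for link in doc.iter('a'):  # find every a element, at any depth
    print(link.text, link.get('href'))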

A very simple way of parsing the kind of well-behaved XHTML you get from Tidy is using the HTMLParser class from the standard library module html.parser.

Using HTMLParser

Using HTMLParser simply means subclassing it and overriding various event-handling methods such as handle_starttag and handle_data. Table 15-1 summarizes the relevant methods and when they’re called (automatically) by the parser.

Table 15-1. The HTMLParser Callback Methods

Callback Method                   When Is It Called?
handle_starttag(tag, attrs)       When a start tag is found; attrs is a sequence of (name, value) pairs.
handle_startendtag(tag, attrs)    For empty tags; the default implementation handles start and end separately.
handle_endtag(tag)                When an end tag is found.
handle_data(data)                 For textual data.
handle_charref(ref)               For character references of the form &#ref;.
handle_entityref(name)            For entity references of the form &name;.
handle_comment(data)              For comments; called with only the comment contents.
handle_decl(decl)                 For declarations of the form <!...>.
handle_pi(data)                   For processing instructions.
unknown_decl(data)                When an unknown declaration is read.

For screen-scraping purposes, you usually won’t need to implement all the parser callbacks (the event handlers), and you probably won’t need to construct some abstract representation of the entire document (such as a document tree) to find what you want. If you just keep track of the minimum of information needed to find what you’re looking for, you’re in business. (See Chapter 22 for more about this topic, in the context of XML parsing with SAX.) Listing 15-2 shows a program that solves the same problem as Listing 15-1, but this time using HTMLParser.

Listing 15-2. A Screen-Scraping Program Using the HTMLParser Module
from urllib.request import urlopen
from html.parser import HTMLParser


def isjob(url):
    try:
        a, b, c, d = url.split('/')
    except ValueError:
        return False
    return a == d == '' and b == 'jobs' and c.isdigit()


class Scraper(HTMLParser):

    in_link = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        url = attrs.get('href', '')
        if tag == 'a' and isjob(url):
            self.url = url
            self.in_link = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            print('{} ({})'.format(''.join(self.chunks), self.url))
            self.in_link = False


text = urlopen('http://python.org/jobs').read().decode()
parser = Scraper()
parser.feed(text)
parser.close()

A few things are worth noting. First, I’ve dropped the use of Tidy here, because the HTML in the web page is well behaved enough. If you’re lucky, you may find that you don’t need to use Tidy either. Second, note that I’ve used a Boolean state variable (attribute) to keep track of whether I’m inside a relevant link; I check and update this attribute in the event handlers. Third, the attrs argument to handle_starttag is a list of (key, value) tuples, so I’ve used dict to turn them into a dictionary, which I find to be more manageable.

The handle_data method (and the chunks attribute) may need some explanation. It uses a technique that is quite common in event-based parsing of structured markup such as HTML and XML. Instead of assuming that I’ll get all the text I need in a single call to handle_data, I assume that I may get several chunks of it, spread over more than one call. This may happen for several reasons—buffering, character entities, markup that I’ve ignored, and so on—and I just need to make sure I get all the text. Then, when I’m ready to present my result (in the handle_endtag method), I simply join all the chunks together. To actually run the parser, I call its feed method with the text and then call its close method.
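
Here is a tiny demonstration of why the chunking happens, using a made-up snippet. The convert_charrefs=False argument keeps the parser from merging the pieces for us, so the entity reference splits the surrounding text into two handle_data calls:

from html.parser import HTMLParser

class Demo(HTMLParser):
    def handle_data(self, data):
        print(repr(data))

Demo(convert_charrefs=False).feed('<p>fish &amp; chips</p>')
# prints 'fish ' and ' chips' -- the &amp; entity splits the text in two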

Solutions like these may in some cases be more robust to changes in the input data than using regular expressions. Still, you may object that it is too verbose and perhaps no clearer or easier to understand than the regular expression. For a more complex extraction task, the arguments in favor of this sort of parsing might seem more convincing, but one is still left with the feeling that there must be a better way. And, if you don’t mind installing another module, there is . . .

Beautiful Soup

Beautiful Soup is a spiffy little module for parsing and dissecting the kind of HTML you sometimes find on the Web—the sloppy and ill-formed kind. To quote the Beautiful Soup web site ( http://crummy.com/software/BeautifulSoup ):

You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help.

Downloading and installing Beautiful Soup is a breeze. As with most packages, you can use pip.

$ pip install beautifulsoup4

You might want to do a pip search to see if there’s a more recent version. With Beautiful Soup installed, the running example of extracting Python jobs from the Python Job Board becomes really, really simple and quite readable, as shown in Listing 15-3. Instead of checking the contents of the URL, I now navigate the structure of the document.

Listing 15-3. A Screen-Scraping Program Using Beautiful Soup
from urllib.request import urlopen
from bs4 import BeautifulSoup


text = urlopen('http://python.org/jobs').read()
soup = BeautifulSoup(text, 'html.parser')

jobs = set()
for job in soup.body.section('h2'):
    jobs.add('{} ({})'.format(job.a.string, job.a['href']))

print('\n'.join(sorted(jobs, key=str.lower)))

I simply instantiate the BeautifulSoup class with the HTML text I want to scrape and then use various mechanisms to extract parts of the resulting parse tree. For example, I use soup.body to get the body of the document and then access its first section. I call the resulting object with 'h2' as an argument, and this is equivalent to using its find_all method, which gives me a collection of all the h2 elements inside the section. Each of those represents one job, and I’m interested in the first link it contains, job.a. The string attribute is its textual content, while a['href'] is the href attribute. As I’m sure you noticed, I added the use of set and sorted (with a key function set to ignore case differences) in Listing 15-3. This has nothing to do with Beautiful Soup; it was just to make the program more useful, by eliminating duplicates and printing the names in sorted order.
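
If you want to get a feel for these mechanisms without hitting the network, you can experiment on a small hand-written snippet. The snippet here is made up, but the method calls are the same as in Listing 15-3; calling find_all explicitly is equivalent to calling the tag object directly:

from bs4 import BeautifulSoup

snippet = """
<section>
  <h2><a href="/jobs/1970/">Python Engineer</a></h2>
  <h2><a href="/jobs/1971/">Web Developer</a></h2>
</section>
"""
soup = BeautifulSoup(snippet, 'html.parser')
for h2 in soup.section.find_all('h2'):
    print(h2.a.string, h2.a['href'])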

If you want to use your scrapings for an RSS feed (discussed later in this chapter), you can use another tool related to Beautiful Soup, called Scrape ’N’ Feed (at http://crummy.com/software/ScrapeNFeed ).

Dynamic Web Pages with CGI

While the first part of this chapter dealt with client-side technology, now we switch gears and tackle the server side. This section deals with a basic web programming technology: the Common Gateway Interface (CGI). CGI is a standard mechanism by which a web server can pass your queries (typically supplied through a web form) to a dedicated program (for example, your Python program) and display the result as a web page. It is a simple way of creating web applications without writing your own special-purpose application server. For more information about CGI programming in Python, see the Web Programming topic guide on the Python web site ( http://wiki.python.org/moin/WebProgramming ).

The key tool in Python CGI programming is the cgi module. Another module that can be very useful during the development of CGI scripts is cgitb—more about that later, in the section “Debugging with cgitb.”

Before you can make your CGI scripts accessible (and runnable) through the Web, you need to put them where a web server can access them, add a pound bang line, and set the proper file permissions. These three steps are explained in the following sections.

Step 1: Preparing the Web Server

I’m assuming that you have access to a web server —in other words, that you can put stuff on the Web. Usually, that is a matter of putting your web pages, images, and so on, in a particular directory (in UNIX, typically called public_html). If you don’t know how to do this, you should ask your Internet service provider (ISP) or system administrator.

Tip

If you are running macOS, you have the Apache web server as part of your operating system installation. It can be switched on through the Sharing preference pane of System Preferences, by checking the Web Sharing option.

If you’re just experimenting a bit, you could run a temporary web server directly from Python, using the http.server module. Like any module, it can be imported and run by supplying your Python executable with the -m switch. If you add the --cgi switch, the resulting server will support CGI. Note that the server will serve up files in the directory where you run it, so make sure you don’t have anything secret in there.

$ python -m http.server --cgi                                          
Serving HTTP on 0.0.0.0 port 8000 ...

If you now point your browser to http://127.0.0.1:8000 or http://localhost:8000, you should see a listing of the directory where you run the server. You should also see the server telling you about the connection.

Your CGI programs must also be put in a directory where they can be accessed via the Web. In addition, they must somehow be identified as CGI scripts, so the web server doesn’t just serve the plain source code as a web page. There are two typical ways of doing this:

  • Put the script in a subdirectory called cgi-bin.

  • Give your script the file name extension .cgi.

Exactly how this works varies from server to server—again, check with your ISP or system administrator if you’re in doubt. (For example, if you’re using Apache, you may need to turn on the ExecCGI option for the directory in question.) If you’re using the server from the http.server module, you should use a cgi-bin subdirectory.

Step 2: Adding the Pound Bang Line

When you’ve put the script in the right place (and possibly given it a specific file name extension), you must add a pound bang line to the beginning of the script. I mentioned this in Chapter 1 as a way of executing your scripts without needing to explicitly execute the Python interpreter. Usually, this is just convenient, but for CGI scripts, it’s crucial—without it, the web server won’t know how to execute your script. (For all it knows, the script could be written in some other programming language such as Perl or Ruby.) In general, simply adding the following line to the beginning of your script will do:

#!/usr/bin/env python

Note that it must be the very first line. (No empty lines before it.) If that doesn’t work, you need to find out exactly where the Python executable is and use the full path in the pound bang line, as in the following:

#!/usr/bin/python

If you have both Python 2 and 3 installed, you may need to use python3 instead. (That is also possible together with the env solution, shown earlier.) If it still doesn’t work, it may be that there is something wrong that you cannot see, namely, that the line ends in \r\n instead of simply \n, and your web server gets confused. Make sure you’re saving the file as a plain UNIX-style text file.

In Windows, you use the full path to your Python binary, as in this example:

#!C:\Python36\python.exe

Step 3: Setting the File Permissions

The final thing you need to do (at least if your web server is running on a UNIX or Linux machine) is to set the proper file permissions. You must make sure that everyone is allowed to read and execute your script file (otherwise the web server wouldn’t be able to run it) but also make sure that only you are allowed to write to it (so no one can change your script).

Tip

Sometimes, if you edit a script in Windows and it’s stored on a UNIX disk server (you may be accessing it through Samba or FTP, for example), the file permissions may be fouled up after you’ve made a change to your script. So if your script won’t run, make sure that the permissions are still correct.

The UNIX command for changing file permissions (or file mode) is chmod. Simply run the following command (if your script is called somescript.cgi), using your normal user account, or perhaps one set up specifically for such web tasks.

chmod 755 somescript.cgi

After having performed all these preparations, you should be able to open the script as if it were a web page and have it executed.

Note

You shouldn’t open the script in your browser as a local file. You must open it with a full HTTP URL so that you actually fetch it via the Web (through your web server).

Your CGI script won’t normally be allowed to modify any files on your computer. If you want to allow it to change a file, you must explicitly give it permission to do so. You have two options. If you have root (system administrator) privileges, you may create a specific user account for your script and change ownership of the files that need to be modified. If you don’t have root access, you can set the file permissions for the file so all users on the system (including that used by the web server to run your CGI scripts) are allowed to write to the file. You can set the file permissions with this command:

chmod 666 editable_file.txt
Caution

Using file mode 666 is a potential security risk. Unless you know what you’re doing, it’s best to avoid it.

CGI Security Risks

Some security issues are associated with using CGI programs. If you allow your CGI script to write to files on your server, that ability may be used to destroy data unless you write your program carefully. Similarly, if you evaluate data supplied by a user as if it were Python code (for example, with exec or eval) or as a shell command (for example, with os.system or using the subprocess module), you risk performing arbitrary commands, which is a huge (as in humongous) risk. Even using a user-supplied string as part of a SQL query is risky, unless you take great care to sanitize the string first; so-called SQL injection is a common way of attacking or breaking into a system.
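
The SQL injection risk, at least, has a simple remedy: let the database module do the quoting, by passing user input as query parameters rather than pasting it into the SQL string. Here is a minimal sketch using the standard library's sqlite3 module (not otherwise covered in this chapter); the hostile input is, of course, made up:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE users (name TEXT)')

name = "Robert'); DROP TABLE users; --"  # hostile, user-supplied input

# Dangerous: the input becomes part of the SQL statement itself.
# conn.execute("INSERT INTO users VALUES ('{}')".format(name))

# Safe: the ? placeholder makes the driver treat the input as plain data.
conn.execute('INSERT INTO users VALUES (?)', (name,))
print(conn.execute('SELECT name FROM users').fetchall())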

A Simple CGI Script

The simplest possible CGI script looks something like Listing 15-4.

Listing 15-4. A Simple CGI Script
#!/usr/bin/env python

print('Content-type: text/plain')
print() # Prints an empty line, to end the headers

print('Hello, world!')

If you save this in a file called simple1.cgi and open it through your web server, you should see a web page containing only the words “Hello, world!” in plain text. To be able to open this file through a web server, you must put it where the web server can access it. In a typical UNIX environment, putting it in a directory called public_html in your home directory would enable you to open it with the URL http://localhost/~username/simple1.cgi (substitute your user name for username). Ask your ISP or system administrator for details. If you’re using a cgi-bin directory, you may as well call it something like simple1.py.

As you can see, everything the program writes to standard output (for example, with print) ends up in the resulting web page—at least almost everything. The fact is that the first things you print are HTTP headers, which are lines of information about the page. The only header I concern myself with here is Content-type. As you can see, the phrase Content-type is followed by a colon, a space, and the type name text/plain. This indicates that the page is plain text. To indicate HTML, this line should instead be as follows:

print('Content-type: text/html')

After all the headers have been printed, a single empty line is printed to signal that the document itself is about to begin. And, as you can see, in this case the document is simply the string 'Hello, world!'.

Debugging with cgitb

Sometimes a programming error makes your program terminate with a stack trace because of an uncaught exception. When running the program through CGI, this will most likely result in an unhelpful error message from the web server, or perhaps even just a blank page. If you have access to the server log (for example, if you’re using http.server), you can probably get some information there. To help you debug your CGI scripts in general, though, the standard library contains a useful module called cgitb (for CGI traceback). By importing it and calling its enable function, you can get a quite helpful web page with information about what went wrong. Listing 15-5 gives an example of how you might use the cgitb module.

Listing 15-5. A CGI Script That Invokes a Traceback (faulty.cgi)
#!/usr/bin/env python

import cgitb; cgitb.enable()

print('Content-type: text/html\n')

print(1/0)

print('Hello, world!')

The result of accessing this script in a browser (through a web server) is shown in Figure 15-1.

Figure 15-1. A CGI traceback from the cgitb module

Note that you’ll probably want to turn off the cgitb functionality after developing the program, since the traceback page isn’t meant for the casual user of your program.1

Using the cgi Module

So far, the programs have only produced output; they haven’t used any form of input. Input is supplied to the CGI script from an HTML form (described in the next section) as key-value pairs, or fields. You can retrieve these fields in your CGI script using the FieldStorage class from the cgi module. When you create your FieldStorage instance (you should create only one), it fetches the input variables (or fields) from the request and presents them to your program through a dictionary-like interface. The values of the FieldStorage can be accessed through ordinary key lookup, but because of some technicalities (related to file uploads, which we won’t be dealing with here), the elements of the FieldStorage aren’t really the values you’re after. For example, if you knew the request contained a value named name, you couldn’t simply do this:

form = cgi.FieldStorage()
name = form['name']

You would need to do this:

form = cgi.FieldStorage()
name = form['name'].value

A slightly simpler way of fetching the values is the getvalue method, which is similar to the dictionary method get, except that it returns the value of the value attribute of the item. Here is an example:

form = cgi.FieldStorage()
name = form.getvalue('name', 'Unknown')

In the preceding example, I supplied a default value ('Unknown'). If you don’t supply one, None will be the default. The default is used if the field is not filled in.

Listing 15-6 contains a simple example that uses cgi.FieldStorage.

Listing 15-6. A CGI Script That Retrieves a Single Value from a FieldStorage (simple2.cgi)
#!/usr/bin/env python

import cgi
form = cgi.FieldStorage()

name = form.getvalue('name', 'world')

print('Content-type: text/plain\n')

print('Hello, {}!'.format(name))

A Simple Form

Now you have the tools for handling a user request; it’s time to create a form that the user can submit. That form can be a separate page, but I’ll just put it all in the same script.

To find out more about writing HTML forms (or HTML in general), you should perhaps get a good book on HTML (your local bookstore probably has several). You can also find plenty of information on the subject online. And, as always, if you find some page that you think looks like a good example for what you would like to do, you can inspect its source in your browser by choosing View Source or something similar (depending on which browser you have) from one of the menus.

Note

There are two main ways of getting information from a CGI script: the GET method and the POST method. For the purposes of this chapter, the difference between the two isn’t really important. Basically, GET is for retrieving things and encodes its query in the URL, while POST can be used for any kind of query but encodes the query a bit differently. Conveniently, cgi.FieldStorage handles both in the same way; to use POST, you would simply add method='post' to the form tag.

Let’s return to our script. An extended version can be found in Listing 15-7.

Listing 15-7. A Greeting Script with an HTML Form (simple3.cgi)
#!/usr/bin/env python

import cgi
form = cgi.FieldStorage()

name = form.getvalue('name', 'world')

print("""Content-type: text/html

<html>
  <head>
    <title>Greeting Page</title>
  </head>
  <body>
    <h1>Hello, {}!</h1>

    <form action='simple3.cgi'>
    Change name <input type='text' name='name' />
    <input type='submit' />
    </form>
  </body>
</html>
""".format(name))

In the beginning of this script, the CGI parameter name is retrieved, as before, with the default 'world'. If you just open the script in your browser without submitting anything, the default is used.

Then a simple HTML page is printed, containing name as a part of the headline. In addition, this page contains an HTML form whose action attribute is set to the name of the script itself (simple3.cgi). That means that if the form is submitted, you are taken back to the same script. The only input element in the form is a text field called name. Thus, if you submit the field with a new name, the headline should change because the name parameter now has a value.

Figure 15-2 shows the result of accessing the script in Listing 15-7 through a web server.

Figure 15-2. The result of executing the CGI script in Listing 15-7

Using a Web Framework

Most people don’t write CGI scripts directly for any serious web applications; rather, they use a web framework, which does a lot of heavy lifting for you. There are plenty of such frameworks available, and I’ll mention a few of them later—but for now, let’s stick to a really simple but highly useful one called Flask ( http://flask.pocoo.org ). It’s easily installed using pip.

$ pip install flask

Suppose you’ve written an exciting function that calculates powers of two.

def powers(n=10):
    return ', '.join(str(2**i) for i in range(n))

Now you want to make this masterpiece available to the world! To do that with Flask, you first instantiate the Flask class with the appropriate name and tell it which URL path corresponds to your function.

from flask import Flask
app = Flask(__name__)


@app.route('/')
def powers(n=10):
    return ', '.join(str(2**i) for i in range(n))

If your script is called powers.py, you can have Flask run it as follows (assuming a UNIX-style shell):

$ export FLASK_APP=powers.py
$ flask run
 * Serving Flask app "powers"
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

The last two lines are output from Flask. If you enter the URL in your browser, you should see the string returned from powers. You could also supply a more specific path to your function. For example, if you use route('/powers') instead of route('/'), the function would be available at http://127.0.0.1:5000/powers. You could then set up multiple functions, each with its own URL.

You can even provide arguments to your function. You specify a parameter using angle brackets, so you might use '/powers/<n>', for example. Whatever you specified after the slash would then be supplied as a keyword argument named n. It would be a string, though, and in our case we want an integer. We can add this conversion by using route('/powers/<int:n>'). Then, after restarting Flask, if you access the URL http://127.0.0.1:5000/powers/3, you should get the output 1, 2, 4.
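
Putting that together, the script might end up looking something like this. (This is a sketch; registering a second, number-free route is one way of keeping the default value of n available.)

from flask import Flask
app = Flask(__name__)


@app.route('/powers')            # no number given: n defaults to 10
@app.route('/powers/<int:n>')    # /powers/3 gives n=3, as an integer
def powers(n=10):
    return ', '.join(str(2**i) for i in range(n))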

Flask has plenty of other features, and its documentation is highly readable. If you’d like to experiment with simple server-side web app development, I recommend giving it a look.

Other Web Application Frameworks

There are plenty of other web frameworks available, both large and small. Some are rather obscure, while others have regular conferences devoted to them. Popular choices include Django ( https://djangoproject.com ) and TurboGears ( http://turbogears.org ); for a more comprehensive list, you should consult the Python web pages ( https://wiki.python.org/moin/WebFrameworks ).

Web Services: Scraping Done Right

Web services are a bit like computer-friendly web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program (the client or service requester) asking for some information or service, and the other program (the server or service provider) providing this information or service. Yes, this is glaringly obvious stuff, and it also seems very similar to the network programming discussed in Chapter 14, but there are differences.

Web services often work on a rather high level of abstraction. They use HTTP (the “Web’s protocol”) as the underlying protocol. On top of this, they use more content-oriented protocols, such as some XML format to encode requests and responses. This means that a web server can be the platform for web services. As the title of this section indicates, it’s web scraping taken to another level. You could see the web service as a dynamic web page designed for a computerized client, rather than for human consumption.

There are standards for web services that go really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I give only a brief introduction to the subject, with some pointers to where you can find the tools and information you might need.

Note

As there are many ways of implementing web services, including a multitude of protocols, and each web service system may provide several services, it can sometimes be necessary to describe a service in a manner that can be interpreted automatically by a client—a metaservice, so to speak. The standard for this sort of description is the Web Service Description Language (WSDL). WSDL is an XML format that describes such things as which methods are available through a service, along with their arguments and return values. Many, if not most, web service toolkits will include support for WSDL in addition to the actual service protocols, such as SOAP.

RSS and Friends

RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a blog or the like. A newer format used for the same thing is Atom. For information about RSS and its relative Resource Description Framework (RDF), see http://www.w3.org/RDF . For a specification of Atom, see http://tools.ietf.org/html/rfc4287 .

Plenty of RSS readers are out there, and often they can also handle other formats such as Atom. Because the RSS format is so easy to deal with, developers keep coming up with new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed and will then give you a dynamic bookmark submenu with the individual news items as menu items. RSS is also the backbone of podcasting; a podcast is essentially an RSS feed listing sound files.

The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats, and you may even need to parse HTML fragments found in the individual entries of the feed. Even though you could use BeautifulSoup (or one of its XML-oriented versions) to tackle this, it’s probably a better idea to use Mark Pilgrim’s Universal Feed Parser ( https://pypi.python.org/pypi/feedparser ), which handles several feed formats (including RSS and Atom, along with some extensions) and has support for some degree of content cleanup. Pilgrim has also written a useful article, “Parsing RSS At All Costs” ( http://xml.com/pub/a/2003/01/22/dive-into-xml.html ), in case you want to deal with some of the cleanup yourself.
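
Once installed (with pip install feedparser), basic use is pleasantly uniform across feed formats. Here is a minimal sketch, with a made-up feed URL:

import feedparser  # third-party: pip install feedparser

d = feedparser.parse('http://example.com/feed.xml')  # hypothetical URL
print(d.feed.get('title', '(untitled feed)'))
for entry in d.entries:
    print(entry.title, '-', entry.link)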

Remote Procedure Calls with XML-RPC

Beyond the simple download-and-parse mechanic of RSS lies the remote procedure call. A remote procedure call is an abstraction of a basic network interaction. Your client program asks the server program to perform some computation and return the result, but it is all camouflaged as a simple procedure (or function or method) call. In the client code, it looks like an ordinary method is called, but the object on which it is called actually resides on a different machine entirely. Probably the simplest mechanism for this sort of procedure call is XML-RPC, which implements the network communication with HTTP and XML. Because there is nothing language-specific about the protocol, it is easy for client programs written in one language to call functions on a server program written in another.

Tip

Try a web search to find plenty of other RPC options for Python.

The Python standard library includes support for both client-side and server-side XML-RPC programming. For examples of using XML-RPC, see Chapters 27 and 28.
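
As a small foretaste of those chapters, here is a sketch of both halves, using the standard library's xmlrpc.server and xmlrpc.client modules. The function and port number are made up, and the two parts would run as separate programs:

# In the server program:
from xmlrpc.server import SimpleXMLRPCServer

def twice(x):
    return 2 * x

server = SimpleXMLRPCServer(('localhost', 8000))
server.register_function(twice)  # expose twice as a remote procedure
server.serve_forever()

# In the client program:
from xmlrpc.client import ServerProxy

proxy = ServerProxy('http://localhost:8000')
print(proxy.twice(21))  # prints 42, computed by the server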

SOAP

SOAP 2 is also a protocol for exchanging messages, with XML and HTTP as underlying technologies. Like XML-RPC, SOAP supports remote procedure calls, but the SOAP specification is much more complex than that of XML-RPC. SOAP is asynchronous, supports metarequests about routing, and has a complex typing system (as opposed to XML-RPC’s simple set of fixed types).

There is no single standard SOAP toolkit for Python. You might want to consider Twisted ( http://twistedmatrix.com ), ZSI ( http://pywebsvcs.sf.net ) or SOAPy ( http://soapy.sf.net ). For more information about the SOAP format, see http://www.w3.org/TR/soap .

A Quick Summary

Here is a summary of the topics covered in this chapter:

  • Screen scraping: This is the practice of downloading web pages automatically and extracting information from them. The Tidy program and its library version are useful tools for fixing ill-formed HTML before using an HTML parser. Another option is to use Beautiful Soup, which is very forgiving of messy input.

  • CGI: The Common Gateway Interface is a way of creating dynamic web pages, by making a web server run and communicate with your programs and display the results. The cgi and cgitb modules are useful for writing CGI scripts. CGI scripts are usually invoked from HTML forms.

  • Flask: A simple web framework that lets you publish your code as a web application, without worrying too much about the web part of things.

  • Web application frameworks: For developing large, complex web applications in Python, a web application framework is almost a must. Flask is a good choice for simpler projects. For larger projects, you might want to consider something like Django or TurboGears.

  • Web services: Web services are to programs what (dynamic) web pages are to people. You may see them as a way of making it possible to do network programming at a higher level of abstraction. Common web service standards are RSS (and its relatives, RDF and Atom), XML-RPC, and SOAP.

New Functions in This Chapter

Function          Description
cgitb.enable()    Enables tracebacks in CGI scripts

What Now?

I’m sure you’ve tested the programs you’ve written so far by running them. In the next chapter, you will learn how you can really test them—thoroughly and methodically, maybe even obsessively (if you’re lucky).

Footnotes

1 An alternative is to turn off the display and log the errors to files instead. See the Python Library Reference for more information.

2 While the name once stood for Simple Object Access Protocol, this is no longer true. Now it’s just SOAP.
