© Magnus Lie Hetland 2017

Magnus Lie Hetland, Beginning Python, 10.1007/978-1-4842-0028-5_23

23. Project 4: In the News

Magnus Lie Hetland

(1)Trondheim, Norway

The Internet is replete with news sources in many forms, including newspapers, video channels, blogs, and podcasts, to name a few. Some of these also provide services, such as RSS or Atom feeds, that let you retrieve the latest news using relatively simple code, without having to parse their web pages. In this project, we’ll be exploring a mechanism that predates the Web: the Network News Transfer Protocol (NNTP). We’ll go from a simple prototype without any form of abstraction (no functions, no classes) to a generic system in which some important abstractions have been added. We’ll be using the nntplib library, which lets you interact with NNTP servers, but adding other protocols and mechanisms should be straightforward.

NNTP is a standard network protocol for managing messages posted on so-called Usenet discussion groups. NNTP servers form a global network that collectively manages these newsgroups, and through an NNTP client (also called a newsreader) you can post and read messages. The main network of NNTP servers, called Usenet, was established in 1980 (although NNTP itself wasn’t used until 1985). Compared to current web trends, this is quite “old school,” but most of the Internet is based (to some degree) on such old-school technologies,1 and it probably doesn’t hurt to play around with the low-level stuff a bit. Also, you could always replace the NNTP stuff in this chapter with some news-gathering module of your own (perhaps using the web API of some social networking site like Facebook or Twitter).

What’s the Problem?

The program you write in this project will be an information-gathering agent, a program that can gather information (more specifically, news) and compile a report for you. Given the network functionality you have already encountered, that might not seem very difficult—and it isn’t, really. But in this project you go a bit beyond the simple “download a file with urllib” approach. You use another network library that is a bit more difficult to use than urllib, namely, nntplib. In addition, you get to refactor the program to allow many types of news sources and various types of destinations, making a clear separation between the front end and the back end, with the main engine in the middle.

The main goals for the final program are as follows:

  • The program should be able to gather news from many different sources.

  • It should be easy to add new news sources (and even new kinds of sources).

  • The program should be able to dispatch its compiled news report to many different destinations, in many different formats.

  • It should be easy to add new destinations (and even new kinds of destinations).

Useful Tools

For this project, you don’t need to install separate software. However, you do need some standard library modules, including one that you haven’t seen before, nntplib, which deals with NNTP servers. Instead of explaining all the details of that module, let’s examine it through some prototyping.

Preparations

To be able to use nntplib, you need to have access to an NNTP server. If you’re not sure whether you do, you could ask your ISP or system administrator for details. In the code examples in this chapter, I use the newsgroup comp.lang.python.announce, so you should make sure that your news (NNTP) server has that group, or you should find some other group you would like to use. If you don’t have access to an NNTP server, several open servers are available for anyone to use. A quick web search for “free nntp server” should give you some servers to choose from. (The code examples in the official documentation for nntplib use news.gmane.org.) Assuming that your news server is news.foo.bar (this is not a real server name, and won’t work), you can test your NNTP server like this:

>>> from nntplib import NNTP
>>> server = NNTP('news.foo.bar')
>>> server.group('comp.lang.python.announce')[0]
Note

To connect to some servers, you may need to supply additional parameters for authentication. Consult the Python Library Reference (https://docs.python.org/library/nntplib.html) for details on the optional parameters of the NNTP constructor.

The result of the last line should be a string beginning with '211' (basically meaning that the server has the group you asked for) or '411' (which means that the server doesn’t have the group). It might look something like this:

'211 51 1876 1926 comp.lang.python.announce'

If the returned string starts with '411', you should use a newsreader to look for another group you might want to use. (You may also get an exception with an equivalent error message.) If an exception is raised, perhaps you got the server name wrong. Another possibility is that you were “timed out” between the time you created the server object and the time you called the group method—the server may allow you to stay connected for only a short period of time (such as 10 seconds). If you’re having trouble typing that fast, simply put the code in a script and execute it (with an added print) or put the server object creation and method call on the same line (separated by a semicolon).
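The status line can also be picked apart by hand, which is handy when checking a response in a script. The following is a minimal sketch, assuming the response has the form shown above (code, estimated count, first number, last number, group name). The helper name parse_group_response is hypothetical and is not part of nntplib.

```python
def parse_group_response(resp):
    """Split a GROUP status line into its parts (hypothetical helper)."""
    code, _, rest = resp.partition(' ')
    if code != '211':
        # For example '411', meaning the server does not carry the group.
        raise ValueError('group not available: ' + resp)
    count, first, last, name = rest.split()
    return int(count), int(first), int(last), name


# The sample response shown earlier in this section.
count, first, last, name = parse_group_response(
    '211 51 1876 1926 comp.lang.python.announce')
print(count, first, last, name)
```

No server connection is needed to try this, since it only works on the string.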

First Implementation

In the spirit of prototyping, let’s just tackle the problem head on. The first thing you want to do is download the most recent messages from a newsgroup on an NNTP server. To keep things simple, just print out the result to standard output (with print). Before looking at the details of the implementation, you might want to browse the source code in Listing 23-1 later in this section, and perhaps even execute the program to see how it works. The program logic isn’t very complicated; the challenge lies mostly in using nntplib. We’ll be using a single object of the NNTP class, and as you saw in the previous section, this class is instantiated with the name of an NNTP server. You need to call three methods on this instance.

  • group, which selects a given newsgroup as the current one, and returns some information about it, including the number of the last message

  • over, which gives you overview information of a set of messages specified by their numbers

  • body, which returns the body text of a given message

Using the same fictitious server name as earlier, we can set things up as follows:

servername = 'news.foo.bar'
group = 'comp.lang.python.announce'
server = NNTP(servername)
howmany = 10

The howmany variable indicates how many articles we want to retrieve. We can then select our group.

resp, count, first, last, name = server.group(group)

The returned values are a general server response, the estimated number of messages in the group, the first and last message numbers, and the name of the group. We’re mainly interested in last, which we’ll use to construct the interval of article numbers we’re interested in, starting at start = last - howmany + 1 and ending with last. We supply this pair of numbers to the over method, which gives us a series of (id, overview) pairs for the messages. We extract the subject from the overview and use the ID to fetch the message body from the server.
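The interval arithmetic is easy to get wrong by one, so here is a small sketch of it. The function name article_range is hypothetical, and clamping to first is an added assumption (the prototype does not do this) to cover groups holding fewer than howmany messages.

```python
def article_range(first, last, howmany):
    """Return (start, last) covering at most the howmany newest articles."""
    start = last - howmany + 1
    # Assumption: never start before the first article the server reports.
    return max(start, first), last


# Using the first and last numbers from the sample response shown earlier.
print(article_range(1876, 1926, 10))   # (1917, 1926)
```

The range is inclusive at both ends, which is why 1 is added back after subtracting howmany.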

The lines of the message body are returned as bytes. If we decode them using the default UTF-8, we may get some illegal byte sequences if we’ve guessed wrong. Ideally, we should extract encoding info, but to keep things simple, we’ll just use the Latin-1 encoding, which works for plain ASCII and won’t complain about non-ASCII bytes. After printing all the articles, we call server.quit(), and that’s it. In a UNIX shell such as bash, you could run this program like this:

$ python newsagent1.py | less

Piping the output through less lets you read the articles one screen at a time. If you have no such pager program available, you could rewrite the print part of the program to store the resulting text in a file, which you’ll also be doing in the second implementation (see Chapter 11 for more information about file handling). The source code for the simple news-gathering agent is shown in Listing 23-1.

Listing 23-1. A Simple News-Gathering Agent (newsagent1.py)
from nntplib import NNTP

servername = 'news.foo.bar'
group = 'comp.lang.python.announce'
server = NNTP(servername)
howmany = 10


resp, count, first, last, name = server.group(group)

start = last - howmany + 1

resp, overviews = server.over((start, last))

for id, over in overviews:
    subject = over['subject']
    resp, info = server.body(id)
    print(subject)
    print('-' * len(subject))
    for line in info.lines:
        print(line.decode('latin1'))
    print()


server.quit()
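The choice of Latin-1 for decoding the body lines can be illustrated without a server. This is a minimal sketch using a made-up byte string; it shows why UTF-8 may fail on a wrong guess while Latin-1 never raises an error, and also what Latin-1 costs when the text really was UTF-8.

```python
raw = b'Caf\xe9 au lait'   # made-up sample line, encoded as Latin-1

try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    print('UTF-8 failed:', err.reason)

# Latin-1 maps every byte value to a code point, so this cannot fail.
print(raw.decode('latin1'))

# If the bytes were really UTF-8, Latin-1 still "works" but garbles
# the non-ASCII characters instead of raising an error.
garbled = '\xe9'.encode('utf-8').decode('latin1')
print(garbled)
```

In other words, Latin-1 trades correctness on non-ASCII text for the guarantee that decoding always succeeds.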

Second Implementation

The first implementation worked but was quite inflexible in that it let you retrieve news only from Usenet discussion groups. In the second implementation, you fix that by refactoring the code a bit. You add structure and abstraction by creating some classes and methods to represent the various parts of the code. Once you’ve done that, some of the parts may be replaced by other classes much more easily than you could replace parts of the code in the original program.

Again, before immersing yourself in the details of the second implementation, you might want to skim (and perhaps execute) the code in Listing 23-2, later in this chapter.

Note

You need to set the clpa_server variable to a usable NNTP server before the code in Listing 23-2 will work.

So, what classes do you need? Let’s just do a quick review of the important nouns in the problem description, as suggested in Chapter 7: information, agent, news, report, network, news source, destination, front end, back end, and main engine. This list of nouns suggests the following main classes (or kinds of classes): NewsAgent, NewsItem, Source, and Destination.

The various sources will constitute the front end, and the destinations will constitute the back end, with the news agent sitting in the middle.

The easiest of these is NewsItem. It represents only a piece of data, consisting of a title and a body (a short text), and can be implemented as follows:

class NewsItem:

    def __init__(self, title, body):

        self.title = title
        self.body = body

To find out exactly what is needed from the news sources and the news destinations, it could be a good idea to start by writing the agent itself. The agent must maintain two lists: one of sources and one of destinations. Adding sources and destinations can be done through the methods addSource and addDestination.

class NewsAgent:

    def __init__(self):
        self.sources = []
        self.destinations = []


    def addSource(self, source):
        self.sources.append(source)


    def addDestination(self, dest):
        self.destinations.append(dest)

The only thing missing now is a method to distribute the news items from the sources to the destinations. During distribution, each source must have a method that returns all its news items, and each destination needs a method for receiving all the news items that are being distributed. Let’s call these methods getItems and receiveItems. In the interest of flexibility, let’s just require getItems to return an arbitrary iterator of NewsItems. To make the destinations easier to implement, however, let’s assume that receiveItems is callable with a sequence argument (which can be iterated over more than once, to make a table of contents before listing the news items, for example). After this has been decided, the distribute method of NewsAgent simply becomes as follows:

def distribute(self):
    items = []
    for source in self.sources:
        items.extend(source.getItems())
    for dest in self.destinations:
        dest.receiveItems(items)

This iterates through all the sources, building a list of news items. Then it iterates through all the destinations and supplies each of them with the full list of news items.
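To see the distribution step work without any network access, you can plug in stand-in objects. The following is a minimal sketch that repeats the NewsItem and NewsAgent code from above and adds two hypothetical classes, CannedSource and CollectingDestination, which exist only for this illustration.

```python
class NewsItem:
    def __init__(self, title, body):
        self.title = title
        self.body = body


class NewsAgent:
    def __init__(self):
        self.sources = []
        self.destinations = []

    def addSource(self, source):
        self.sources.append(source)

    def addDestination(self, dest):
        self.destinations.append(dest)

    def distribute(self):
        items = []
        for source in self.sources:
            items.extend(source.getItems())
        for dest in self.destinations:
            dest.receiveItems(items)


class CannedSource:
    """Hypothetical source that yields fixed items (for testing only)."""
    def __init__(self, titles):
        self.titles = titles

    def getItems(self):
        for title in self.titles:
            yield NewsItem(title, 'Body of ' + title)


class CollectingDestination:
    """Hypothetical destination that just remembers what it received."""
    def __init__(self):
        self.titles = []

    def receiveItems(self, items):
        self.titles = [item.title for item in items]


agent = NewsAgent()
agent.addSource(CannedSource(['A', 'B']))
agent.addSource(CannedSource(['C']))
first_dest, second_dest = CollectingDestination(), CollectingDestination()
agent.addDestination(first_dest)
agent.addDestination(second_dest)
agent.distribute()
print(first_dest.titles, second_dest.titles)
```

Because distribute collects everything into a list before handing it out, every destination sees the items from all sources, even though each getItems is a one-shot generator.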

Now, all you need is a couple of sources and destinations. To begin testing, you can simply create a destination that works like the printing in the first prototype.

class PlainDestination:

    def receiveItems(self, items):
        for item in items:
            print(item.title)
            print('-' * len(item.title))
            print(item.body)

The formatting is the same; the difference is that you have encapsulated the formatting. It is now one of several alternative destinations, rather than a hard-coded part of the program. A slightly more complicated destination (HTMLDestination, which produces HTML) can be seen in Listing 23-2, later in this chapter. It builds on the approach of PlainDestination, adding a few features.

  • The text it produces is HTML.

  • It writes the text to a specific file, rather than standard output.

  • It creates a table of contents in addition to the main list of items.

And that’s it, really. The table of contents is created using hyperlinks that link to parts of the page. We’ll accomplish this by using links of the form <a href="#nn">...</a> (where nn is some number), each of which leads to the headline enclosed in the anchor tag <a name="nn">...</a> (where nn should be the same number as in the table of contents). The table of contents and the main listing of news items are built in two different for loops. You can see a sample result (using the upcoming NNTPSource) in Figure 23-1.

Figure 23-1. An automatically generated news page
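The two-loop anchor mechanism can be tried on its own, writing to an in-memory buffer rather than a file. This is a sketch, not the HTMLDestination code: the function name write_toc_page is hypothetical, and the use of enumerate and html.escape (so that characters such as & in titles do not break the markup) are additions for the sake of illustration.

```python
import io
from html import escape


def write_toc_page(titles, out):
    """Write a table of contents plus matching headline anchors to out."""
    # First loop: the table of contents, linking to #1, #2, ...
    print('<ul>', file=out)
    for number, title in enumerate(titles, start=1):
        print('  <li><a href="#{}">{}</a></li>'
              .format(number, escape(title)), file=out)
    print('</ul>', file=out)
    # Second loop: the headlines, carrying the anchors with the same numbers.
    for number, title in enumerate(titles, start=1):
        print('<h2><a name="{}">{}</a></h2>'
              .format(number, escape(title)), file=out)


buffer = io.StringIO()
write_toc_page(['Release 1.0', 'Q & A'], buffer)
page = buffer.getvalue()
print(page)
```

Iterating over the titles twice is exactly why the destinations are handed a sequence rather than a one-shot iterator.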

When thinking about the design, I considered using a generic superclass to represent news sources and one to represent news destinations. As it turns out, the sources and destinations don’t really share any behavior, so there is no point in using a common superclass. As long as they implement the necessary methods (getItems and receiveItems) correctly, the NewsAgent will be happy. (This is an example of using a protocol, as described in Chapter 9, rather than requiring a specific, common superclass.)

When creating an NNTPSource, much of the code can be snipped from the original prototype. As you will see in Listing 23-2, the main differences from the original are the following:

  • The code has been encapsulated in the get_items method (the name the listing uses for getItems). The servername and group variables are now arguments to the constructor. Also, the howmany variable has been turned into a constructor argument for this class.

  • I’ve added a call to decode_header, which deals with some specialized encodings used in header fields such as the subject.

  • Instead of printing each news item directly, a NewsItem object is yielded (making get_items a generator).
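The "specialized encodings" in header fields are typically encoded words such as =?utf-8?q?...?=. As a stand-in illustration that does not depend on nntplib, the sketch below uses the standard library's email.header module, which handles the same kind of encoded words. The helper name decode_subject and the sample subjects are made up for this example.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Decode 'encoded words' in a header value to plain text."""
    return str(make_header(decode_header(raw_subject)))


encoded = decode_subject('=?utf-8?q?Caf=C3=A9?=')
plain = decode_subject('Plain ASCII subject')
print(encoded)
print(plain)
```

A subject without encoded words passes through unchanged, so the decoding step is safe to apply to every overview entry.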

To show the flexibility of the design, let’s add another news source: one that can extract news items from web pages (using regular expressions; see Chapter 10 for more information). SimpleWebSource (see Listing 23-2) takes a URL and two regular expressions (one representing titles and one representing bodies) as its constructor arguments. In get_items, it uses the regular expression method findall to find all the occurrences (titles and bodies) and zip to combine these. It then iterates over the resulting (title, body) pairs, yielding a NewsItem for each. As you can see, adding new kinds of sources (or destinations, for that matter) isn’t very difficult.
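The findall-plus-zip technique can be tried on a string before pointing it at a live page. The sketch below uses a made-up HTML fragment and two patterns of the same general shape as the ones discussed for the web source; real pages will almost certainly need different patterns.

```python
import re

# Made-up HTML fragment; real pages will need their own patterns.
sample = ('<h2><a href="/a">First headline</a></h2><p>First body.</p>'
          '<h2><a href="/b" >Second headline</a></h2><p>Second body.</p>')

title_pattern = re.compile(r'<h2><a href="[^"]*"\s*>(.*?)</a>')
body_pattern = re.compile(r'</h2><p>(.*?)</p>')

# With a single group in the pattern, findall returns the group contents.
titles = title_pattern.findall(sample)
bodies = body_pattern.findall(sample)

pairs = list(zip(titles, bodies))
for title, body in pairs:
    print(title, '->', body)
```

Note that zip stops at the shorter of the two lists, so if one pattern matches more often than the other, titles and bodies can end up silently misaligned. Comparing len(titles) and len(bodies) is a cheap sanity check.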

To put the code to work, let’s instantiate an agent, some sources, and some destinations. In the function runDefaultSetup (which is called if the module is run as a program), several such objects are instantiated.

  • A SimpleWebSource for the Reuters web site, which uses two simple regular expressions to extract the information it needs

Note

The layout of the HTML on the Reuters pages might change, in which case you need to rewrite the regular expressions. This also applies if you are using some other page, of course. Just view the HTML source and try to find a pattern that applies.

  • An NNTPSource for comp.lang.python.announce, with howmany set to 10, so it works just like the first prototype

  • A PlainDestination, which prints all the news gathered

  • An HTMLDestination, which generates a news page called news.html

When all of these objects have been created and added to the NewsAgent, the distribute method is called. You can run the program like this:

$ python newsagent2.py

The resulting news.html page is shown in Figure 23-2. The full source code of the second implementation is found in Listing 23-2.

Figure 23-2. A news page with more than one source

Listing 23-2. A More Flexible News-Gathering Agent (newsagent2.py)
from nntplib import NNTP, decode_header
from urllib.request import urlopen
import textwrap
import re


class NewsAgent:
    """
    An object that can distribute news items from news sources to news
    destinations.
    """


    def __init__(self):
        self.sources = []
        self.destinations = []


    def add_source(self, source):
        self.sources.append(source)


    def addDestination(self, dest):
        self.destinations.append(dest)


    def distribute(self):
        """
        Retrieve all news items from all sources, and distribute them to all
        destinations.
        """
        items = []
        for source in self.sources:
            items.extend(source.get_items())
        for dest in self.destinations:
            dest.receive_items(items)


class NewsItem:
    """
    A simple news item consisting of a title and body text.
    """
    def __init__(self, title, body):
        self.title = title
        self.body = body


class NNTPSource:
    """
    A news source that retrieves news items from an NNTP group.
    """
    def __init__(self, servername, group, howmany):
        self.servername = servername
        self.group = group
        self.howmany = howmany


    def get_items(self):
        server = NNTP(self.servername)
        resp, count, first, last, name = server.group(self.group)
        start = last - self.howmany + 1
        resp, overviews = server.over((start, last))
        for id, over in overviews:
            title = decode_header(over['subject'])
            resp, info = server.body(id)
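            # Keep the original line breaks; the body is later shown
            # line by line (plain text) or inside <pre> (HTML).
            body = '\n'.join(line.decode('latin1')
                             for line in info.lines) + '\n\n'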
            yield NewsItem(title, body)
        server.quit()


class SimpleWebSource:
    """
    A news source that extracts news items from a web page using regular
    expressions.
    """
    def __init__(self, url, title_pattern, body_pattern, encoding='utf8'):
        self.url = url
        self.title_pattern = re.compile(title_pattern)
        self.body_pattern = re.compile(body_pattern)
        self.encoding = encoding


    def get_items(self):
        text = urlopen(self.url).read().decode(self.encoding)
        titles = self.title_pattern.findall(text)
        bodies = self.body_pattern.findall(text)
        for title, body in zip(titles, bodies):
            yield NewsItem(title, textwrap.fill(body) + '\n')


class PlainDestination:
    """
    A news destination that formats all its news items as plain text.
    """
    def receive_items(self, items):
        for item in items:
            print(item.title)
            print('-' * len(item.title))
            print(item.body)


class HTMLDestination:
    """
    A news destination that formats all its news items as HTML.
    """
    def __init__(self, filename):
        self.filename = filename


    def receive_items(self, items):

        out = open(self.filename, 'w')
        print("""
        <html>
          <head>
            <title>Today's News</title>
          </head>
          <body>
          <h1>Today's News</h1>
        """, file=out)


        print('<ul>', file=out)
        id = 0
        for item in items:
            id += 1
            print('  <li><a href="#{}">{}</a></li>'
                    .format(id, item.title), file=out)
        print('</ul>', file=out)


        id = 0
        for item in items:
            id += 1
            print('<h2><a name="{}">{}</a></h2>'
                    .format(id, item.title), file=out)
            print('<pre>{}</pre>'.format(item.body), file=out)


        print("""
          </body>
        </html>
        """, file=out)
        # Close the file so that everything is flushed to disk.
        out.close()


def runDefaultSetup():
    """
    A default setup of sources and destination. Modify to taste.
    """
    agent = NewsAgent()


    # A SimpleWebSource that retrieves news from Reuters:
    reuters_url = 'http://www.reuters.com/news/world'
    reuters_title = r'<h2><a href="[^"]*"\s*>(.*?)</a>'
    reuters_body = r'</h2><p>(.*?)</p>'
    reuters = SimpleWebSource(reuters_url, reuters_title, reuters_body)


    agent.add_source(reuters)

    # An NNTPSource that retrieves news from comp.lang.python.announce:
    clpa_server = 'news.foo.bar' # Insert real server name (for example, 'news.ntnu.no')
    clpa_group = 'comp.lang.python.announce'
    clpa_howmany = 10
    clpa = NNTPSource(clpa_server, clpa_group, clpa_howmany)


    agent.add_source(clpa)

    # Add plain-text destination and an HTML destination:
    agent.addDestination(PlainDestination())
    agent.addDestination(HTMLDestination('news.html'))


    # Distribute the news items:
    agent.distribute()


if __name__ == '__main__': runDefaultSetup()

Further Exploration

Because of its extensible nature, this project invites further exploration. Here are some ideas:

  • Create a more ambitious WebSource, using the screen-scraping techniques discussed in Chapter 15.

  • Create an RSSSource, which parses RSS, also discussed briefly in Chapter 15.

  • Improve the layout for the HTMLDestination.

  • Create a page monitor that gives you a news item if a given web page has changed since the last time you examined it. (Just download a copy when it has changed and compare that later. Take a look at the standard library module filecmp for comparing files.)

  • Create a CGI version of the news script (see Chapter 15).

  • Create an EmailDestination, which sends you an email message with news items. (See the standard library module smtplib for sending email.)

  • Add command-line switches to decide which news formats you want. (See the standard library module argparse for some techniques.)

  • Give the destinations information about where the news comes from, to allow fancier layout.

  • Try to categorize your news items (by searching for keywords, perhaps).

  • Create an XMLDestination, which produces XML files suitable for the site builder in Project 3 (Chapter 22). Voilà—you have a news web site.
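As a starting point for the command-line idea above, here is a minimal sketch using argparse. The switch names --plain and --html, and the function name parse_options, are hypothetical choices for this illustration. The parsed options would then decide which destinations are added to the agent.

```python
import argparse


def parse_options(argv):
    """Parse hypothetical switches that select news destinations."""
    parser = argparse.ArgumentParser(description='Gather and report news.')
    parser.add_argument('--plain', action='store_true',
                        help='print the news as plain text')
    parser.add_argument('--html', metavar='FILE',
                        help='write an HTML news page to FILE')
    return parser.parse_args(argv)


# Passing an explicit list makes the function easy to try interactively;
# in a script you would pass sys.argv[1:] instead.
options = parse_options(['--plain', '--html', 'news.html'])
print(options.plain, options.html)
```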

What Now?

We’ve done a lot of file creation and file handling (including downloading the required files), and although that is very useful for a lot of things, it isn’t very interactive. In the next project, we’ll create a chat server, where you can chat with your friends online. You can even extend it to create your own virtual (textual) environment.

Footnotes

1 Did you know, for example, that the discussion groups at http://groups.google.com, such as sci.math and rec.arts.sf.written, are really Usenet groups under the hood?
