To transform an XML repository into an HTML docbase with an NNTP discussion component, you need to do the following three things:
1. Instrument the docbase with link targets. Links in the discussion area point back to these targets, as do links in the web-based table of contents.
2. Instrument the docbase with comment links. These links invoke the comment form, or rather the script that generates that form. The links encode the information that the script needs to produce an NNTP message that will bind to the right spot in the newsgroup and that will point back to the right spot in the docbase.
3. Populate the newsgroup with a matching structure. The docbase’s headers (h1..h6) define the desired structure. To populate the newsgroup accordingly, you generate a set of NNTP messages whose Message-ID: and References: headers correspond to that structure, then load those messages using one of several techniques.
Let’s consider each of these three steps in more detail.
We want to translate <p> or <li> into <a name="252"><p> or <a name="1124"><li> so that comments posted regarding these elements can point back to the right spot in the text.
Although the final solution I’ll present uses Perl’s XML::Parser, the examples in Example 9.3 and Example 9.4 use two other parsers, one driven by Java and one by JavaScript. Why? There’s more than one way to do it, and that can come in handy when you’re stuck. For example, when I started working with XML, I’d rather have used Perl, but the XML::Parser module wasn’t quite ready at the time. No matter. At the end of the day, a component is just a component. What matters is getting the job done, not which programming language you use. There isn’t One True Language for the successful developer of Internet groupware. This book includes examples of Perl, Java, JavaScript, Visual Basic, SQL, and C. If I had eschewed XML because I couldn’t (at the time) write parser-based Perl scripts, I would have been cutting off my nose to spite my face. Value resides in components, not in programming languages.
There are macrocomponents—the clients and servers that make up the mail/news/Web trio—and there are microcomponents that can bind the macrocomponents into useful new configurations. Keep an open mind and a well-stocked toolkit. Microcomponents such as XML parsers and NNTP interface modules come in many varieties. When you need one that doesn’t happen to come in your favorite flavor, try a different flavor. If you’re a Perl programmer, but the component you need happens to come in only the Python or Java flavor, it may be quicker to learn the little bit of Python or Java you’ll need to use that component than to reinvent it in Perl. That’s particularly true in web environments where, as we’ve seen, parts can easily combine. In the case of my first reviewable-docbase builder, for example, links inserted into the generated docbase by a Java program invoked CGI scripts written in Perl.
Example 9.3 shows a Java-based solution to the problem of instrumenting a docbase with link targets. It uses the DataChannel/Microsoft XJ Parser.
Example 9-3. Inserting Link Targets Using the DataChannel/Microsoft XJ Parser
import java.util.*;
import java.io.*;
import java.net.*;
import com.datachannel.xml.om.*;

public class parseXML
{
    static int element = 0;

    public final static void main(String argv[])
    {
        String myURL = "book.xml";
        boolean caseSensitive = false;
        boolean validating = true;
        boolean preserveWhiteSpaces = false;
        Document doc = new Document();
        try
        {
            doc.load(myURL);
            traverse((IXMLDOMNode) doc.getDocumentElement());
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }

    public static void traverse(IXMLDOMNode node)
    {
        XMLDOMNamedNodeMap attrMap = (XMLDOMNamedNodeMap) node.getAttributes();
        XMLDOMNodeList childList = (XMLDOMNodeList) node.getChildNodes();

        if (node.getNodeType() == node.ELEMENT_NODE)
        {
            // prepend a numbered link target to each reviewable element
            if (node.getNodeName().equals("p") || node.getNodeName().equals("li"))
            {
                System.out.print("<a name=\"" + element++ + "\">");
            }
            System.out.print("<" + node.getNodeName());
            // echo every attribute so that CSS classes survive
            IXMLDOMNode attr = attrMap.nextNode();
            while (attr != null)
            {
                System.out.print(" " + attr.getNodeName() + "=\"" + attr.getNodeValue() + "\"");
                attr = attrMap.nextNode();
            }
            System.out.println(">");
        }
        else if (node.getNodeType() == node.TEXT_NODE)
        {
            System.out.println(node.getNodeValue());
        }
        else if (node.getNodeType() == node.ENTITY_NODE)
        {
            System.out.print(node.getNodeValue());
        }
        else
        {
            System.out.println(" node: " + node.getNodeType());
        }

        // recurse into the children
        IXMLDOMNode child = childList.nextNode();
        while (child != null)
        {
            traverse(child);
            child = childList.nextNode();
        }

        if (node.getNodeType() == node.ELEMENT_NODE)   // close the element
        {
            System.out.println("</" + node.getNodeName() + ">");
        }
    }
}
This Java program begins by reading the whole XML document into an in-memory tree. Then it traverses that tree, emitting element tags, attributes, and contents. It applies just one transformation to the XML source, prepending link targets to the elements that are the reviewable chunks of the docbase. In this example, these are paragraphs and list items. The code emits the XML tags themselves and all the attributes that come with each tag. Why? Remember that we’re depending on this XML to be HTML/CSS as well. This book, for example, uses CSS-enhanced tags like <h1 class="chapter"> and <p class="figure-title">. The transformed docbase has to preserve the tags with their attributes so that a browser can render the output as HTML, governed by CSS styles.
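The same tree walk can be sketched in a few lines of Python. This is only an illustration, not the code used for the book: it assumes the standard library's xml.dom.minidom in place of the XJ parser, and the names add_link_targets and COMMENTABLE are invented here. Like the Java version, it leaves the <a name="..."> targets unclosed.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape, quoteattr

# Assumption: paragraphs and list items are the reviewable chunks, as in Example 9-3.
COMMENTABLE = {"p", "li"}


def add_link_targets(xml_text):
    """Re-emit the markup, prepending <a name="N"> to each commentable element."""
    out = []
    counter = 0

    def traverse(node):
        nonlocal counter
        if node.nodeType == node.ELEMENT_NODE:
            if node.tagName in COMMENTABLE:
                out.append('<a name="%d">' % counter)
                counter += 1
            # Echo every attribute so that CSS classes survive the transformation.
            attrs = "".join(
                " %s=%s" % (name, quoteattr(value))
                for name, value in node.attributes.items()
            )
            out.append("<%s%s>" % (node.tagName, attrs))
            for child in node.childNodes:
                traverse(child)
            out.append("</%s>" % node.tagName)
        elif node.nodeType == node.TEXT_NODE:
            out.append(escape(node.data))

    traverse(minidom.parseString(xml_text).documentElement)
    return "".join(out)
```

Calling add_link_targets on a small document with one paragraph and one list item numbers them 0 and 1 in document order and leaves every other tag untouched.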
Let’s look at another way to do it. Example 9.4 inserts link targets using JavaScript to drive the MSXML parser. And in this example, the script is embedded in a web page.
Example 9-4. Inserting Element Anchors Using MSXML in an ASP Script
<%@ language = "jscript"%>
<%
var element = 1;

var doc = Server.CreateObject("microsoft.xmldom");
doc.async = false;                 // load synchronously so the tree is complete before traversal
doc.load("c:\\web\\book.xml");     // backslashes must be doubled inside a JScript string

if (doc.parseError.errorCode != 0)
  {
  Response.write( doc.parseError.reason + "," +
                  doc.parseError.line + "," +
                  doc.parseError.linepos + "," +
                  doc.parseError.srcText);
  }

traverse(doc.documentElement);

function traverse(node)
  {
  if (node.nodeTypeString == "element")
    {
    doStartTag(node);
    if (node.childNodes.length != null)
      {
      var i;
      for (i = 0; i < node.childNodes.length; i++)
        {
        traverse(node.childNodes.item(i));
        }
      }
    doEndTag(node);
    }
  else if (node.nodeTypeString == "text")
    {
    Response.write(node.nodeValue);
    }
  else if (node.nodeTypeString == "entity")
    {
    Response.write(node.nodeValue);
    }
  else
    Response.write ("node: " + node.nodeType);
  }

function doStartTag(node)
  {
  if ( (node.nodeName == 'p') || (node.nodeName == 'li') )
    {
    Response.write( '<a name="' + element++ + '"> ');   // numbered link target
    }
  Response.write("<" + node.nodeName);
  doAttrs(node);
  Response.write("> ");
  }

function doEndTag(node)
  {
  Response.write("</" + node.nodeName + ">");
  }

function doAttrs(node)
  {
  if ( node.attributes.length > 0 )
    {
    var i;
    for (i = 0; i < node.attributes.length; i++)
      {
      // quote each value so that attributes containing spaces stay intact
      Response.write( " " + node.attributes.item(i).nodeName +
                      '="' + node.attributes.item(i).nodeValue + '"');
      }
    }
  }
%>
Because this script runs in the Active Server Pages environment, it can do XML-to-HTML conversion on the fly. This is useful, but since on-the-fly conversion can be slow for a large document, the technique I actually used for this book instead generates HTML pages that are statically served, or just read into a browser using the file:// protocol.
Comment links are the numbered links at the end of each paragraph and list element, as shown in Figure 9.2. The text of each link is the same sequence number encoded in the link targets we just made. The address lurking behind those few digits, though, includes all sorts of instrumentation:
The message ID of the governing header. The fourth paragraph under an <h2> header, for example, will encode that header’s message ID so that a comment posted by way of that paragraph’s comment link will nest under the NNTP message that represents that header.
The location of the element in the docbase. For this book, I processed the whole set of chapters as a single XML stream. But since it would be inconvenient to view the book as a single HTML document, I carved the HTML output into per-chapter chunks. So if paragraph 253 occurs in Chapter 7, its URL (for Version 2 of the draft) would be /groupware/v2/chap7.htm#253.
The text of the element. The comment form quotes this text so that reviewers can refer to it as they compose their comments. That form’s handler, which constructs and posts the NNTP message that is the comment, uses a leading fragment of the text as the Subject: header of the message.
When you click on the comment link, these items enable a CGI script to generate a form that quotes the section heading and paragraph from the book, collects comments about it, and posts a message containing all this information to the reviewers’ newsgroup. As we’ll see shortly, there’s an alternate implementation in which clicking the link launches a mail message that works in a similar way.
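To make the instrumentation concrete, here is a hedged Python sketch of how such a link address might be assembled. The query parameter names (docbase, chapnum, elt, fragment, id) and the comment.pl script name follow the EndTag( ) listing in Example 9-10; the function name comment_url and the server and path values in the usage note are assumptions made for illustration.

```python
from urllib.parse import urlencode


def comment_url(server, cgi_path, version, chapter, element, header_id, text):
    """Build the address behind a numbered comment link.

    header_id is the message ID of the governing header; text is the element
    to be quoted on the comment form. urlencode() percent-escapes both, so
    angle brackets, ampersands, and spaces survive the trip through the URL.
    """
    query = urlencode({
        "docbase": "groupware." + version,
        "chapnum": chapter,
        "elt": element,
        "fragment": text,
        "id": header_id,
    })
    return "http://%s/%s/comment.pl?%s" % (server, cgi_path, query)
```

For example, comment_url("localhost", "cgi-bin", "v2", 7, 253, "<925327035_158@local>", "some paragraph text") yields one long URL whose visible link text would be just 253. A form-generating script can recover every field with an ordinary query-string parser.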
Controlling NNTP message and reference IDs is the key to this step. Newsreaders don’t transmit message IDs when they post. It’s normally the server’s job to create those IDs. It assigns a unique ID, such as [email protected]. But if you create a message that includes a Message-ID: header, the news server will honor that ID so long as it doesn’t conflict with that of any existing message. Since you can’t use a newsreader to transmit such a message, how do you send it? We saw in Chapter 5 how to use telnet to drive an NNTP server “by hand.” There are several ways to automate the posting of a news message. Standard INN and most derived implementations (including Netscape’s Collabra Server, but not Microsoft’s NNTP Service) come with a command-line tool called inews. Given a file called msg.txt containing a set of NNTP headers and a message body, you can post a message like this:
inews -h msg.txt
A hybrid Web/NNTP application might use a CGI Perl script to pipe the data to an instance of inews, as shown in this Perl fragment:
open (INEWS, " | inews -h") or die "cannot open pipe to inews $!";
print INEWS $msg;
If you lack the inews tool, you can use one of a number of NNTP client modules. These are available for Perl, Python, Java, and doubtless many other languages that can use TCP/IP sockets. For Perl programmers, the hardest part is deciding which module to use. There are at least three available on the Comprehensive Perl Archive Network (CPAN, http://www.cpan.org/): Net::NNTP, LWP (which is nominally a web client but which also handles NNTP), and NNTPClient. Example 9.5 shows how to post a message using Net::NNTP.
Example 9-5. Posting a Message Using Net::NNTP
use Net::NNTP;

my $nntp = Net::NNTP->new('localhost');

my @msg = (
  "Newsgroups: groupware.v3\n",
  "Subject: (What's more, you can join components...\n",
  "From: [email protected]\n",
  "Message-ID: <925327035_159\@local>\n",     # \@ keeps Perl from interpolating an array
  "References: <925327035_158\@local>\n",
  "\n",                                        # blank line separates headers from body
  "I almost wonder if you need somewhere to develop a metaphor\n",
  "analogous to \"the pipeline.\" Maybe go reread the wonderful...\n",
);

$nntp->post(@msg);
Newsgroup hierarchy arises from References: headers. This header, which is optional, can contain one or more message IDs. Newsreaders use this information to create hierarchical views of newsgroups. In our example, we want each message representing an <h1> docbase tag to omit the References: header. These chapter names will form the top level of the tree. Messages corresponding to all other docbase <hn> tags should carry a References: header that is the message ID of the closest ancestral (that is, <hn-1>) tag. A series of <h2> tags, for example, should all refer back to the nearest preceding <h1>; an <h3> following one of those <h2> tags should refer back to that <h2>. If the message ID of that <h2> is <925327035_158@local>, then the message shown in Example 9.5 will become a reply to it.
How should we form the message IDs? It’s a good idea to incorporate a timestamp so that this batch of autogenerated messages won’t conflict with any others. Since it only takes a second to generate the batch, the timestamp alone won’t guarantee uniqueness. So we’ll tack a sequence number onto the end of each ID. That yields IDs like the ones shown in Example 9.5.
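The ID scheme and the References: rule can be sketched together. The following Python is an illustration under stated assumptions, not the book's generator: plan_messages is a made-up name, headers arrive as (level, title) pairs in document order, and the @local suffix simply copies the IDs shown in Example 9.5.

```python
import time


def plan_messages(headers, timestamp=None):
    """Assign a Message-ID and a References value to each docbase header.

    headers is a list of (level, title) pairs in document order, where
    level 1 stands for <h1>. One timestamp is taken for the whole batch and
    a sequence number is appended, so IDs stay unique within the run.
    """
    if timestamp is None:
        timestamp = int(time.time())
    last_id = {}        # most recent message ID seen at each header level
    messages = []
    for seq, (level, title) in enumerate(headers, start=1):
        msg_id = "<%d_%d@local>" % (timestamp, seq)
        # An <h1> starts a new thread; every other level replies to the
        # nearest preceding header one level up.
        parent = last_id.get(level - 1) if level > 1 else None
        last_id[level] = msg_id
        messages.append(
            {"Message-ID": msg_id, "References": parent, "Subject": title}
        )
    return messages
```

Given an <h1>, an <h2>, an <h3>, and then a second <h2>, the two <h2> messages both reference the <h1>, and the <h3> references the first <h2>.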
I started with the Java DXP parser but switched immediately to Perl’s XML::Parser when it became available. You can use XML::Parser in a variety of modes, or “styles.” For example, the Tree style builds a complete in-memory representation of parsed XML content, which your script can then navigate and transform. The Stream style, which I’ll demonstrate here, doesn’t build an in-memory tree. Instead, it calls handlers, registered by your script, for three events—recognition of the beginning of a tag, of a tag’s content, or of the end of a tag. Here’s the skeleton of an XML::Parser script that uses the Stream style:
#! perl -w
use strict;
use XML::Parser;

my $xml = new XML::Parser (Style => 'Stream');
$xml->parsefile("book.xml");

sub StartTag {}
sub Text     {}
sub EndTag   {}
The work of transforming this book’s XML source into a reviewable docbase is divided among the three handlers, StartTag( ), Text( ), and EndTag( ). Let’s walk through these one at a time.
The parser calls StartTag( ) (see Example 9.6) when it recognizes a tag, passing the tag name explicitly, and a hash representation of the attributes in Perl’s default hash, %_. What’s that? It was news to me too. I was familiar with $_, Perl’s default scalar, which magically stores the current line in a file-reading loop, or the current list element in a foreach loop. And I knew about @_, the default list that holds subroutine arguments. But I never suspected there might also be a default hash. Live and learn!
Example 9-6. The StartTag Handler
sub StartTag
  {
  my ($expat,$element) = @_;

  if ( withinCommentableElement($expat,$element) )   # inline tag: echo it, keep accumulating
    {
    print DOCBASE $_;
    return;
    }

  $comment_chars = "";                               # start a new quote

  if ( isPreformattedElement ($element) )            # work around broken CSS in MSIE
    { print DOCBASE " <pre>"; }

  if ( $element eq 'h1' )                            # new chapter
    {
    $counters->{chapter}++;                          # update counters
    $counters->{figure}  = 0;
    $counters->{listing} = 0;
    $counters->{table}   = 0;

    # start new HTML output file
    open (DOCBASE, ">./docbase/chap$counters->{chapter}.htm")
      or die "cannot open chap$counters->{chapter}.htm $!";

    print DOCBASE <<EOT;                             # emit boilerplate
<head>
<link rel="stylesheet" type="text/css" href="chap-style.css">
</head>
<body>
EOT
    }

  $tocListTags = '';

  if ( my $hdr = isHeader ($element) )               # do table-of-contents outline
    {
    $newTocLev  = $hdr;
    $lastHdrElt = $element;
    $tocPreamble = "<a name=\"$counters->{element}\"> <a href=\"chap$counters->{chapter}.htm#$counters->{element}\" target=\"chap\"> ";

    if ($newTocLev > $tocStack[-1])
      {
      $tocListTags .= "<ul> ";
      push (@tocStack, $newTocLev);
      }
    else
      {
      while ($tocStack[-1] > $newTocLev )
        {
        $oldTocLev = pop @tocStack;
        $tocListTags .= "</ul> ";
        }
      }
    }

  if ( isCommentableElement ($element) )             # emit tag with jump target
    { print DOCBASE " <a name=$counters->{element}>$_ "; }
  else                                               # emit plain tag
    { print DOCBASE "$_"; }
  }
StartTag( ) begins by calling withinCommentableElement( ), a routine that tests whether the current element is contained within any of those to which comment links can attach.
sub withinCommentableElement
  {
  my ($expat,$element) = @_;
  my $within = 0;
  foreach my $elt ('p','li','h1','h2','h3','h4','h5')
    {
    if ( $expat->within_element($elt) )
      { $within = 1; }
    }
  return $within;
  }
Why do we need this routine? We want to accumulate complete paragraphs, list items, or headings for the quote that will be included in each comment link. Suppose a paragraph contains a <span>...</span>. We don’t want the StartTag( ) invocation that handles that tag to clear $comment_chars, the variable that’s accumulating the paragraph that contains this element. So if withinCommentableElement( ) succeeds, StartTag( ) just echoes the tag and returns.
When a new chapter appears in the stream, StartTag( ) increments the chapter counter, resets the figure, listing, and table counters, and begins a new output file for that chapter’s generated HTML.
When a header appears, StartTag( ) records the HTML list syntax (<ul> tags) for the table of contents so that headers will indent properly. It also records link targets for these table-of-contents entries, so the links wrapped around the corresponding headers in the generated web page can jump to the right spot in the table of contents.
Finally, it writes the header tag to the generated web page. To headers, paragraphs, and list items (those elements that participate in the commenting system) it prepends link targets. The headers in the generated web pages, and the references in newsgroup messages, point to these targets in the docbase.
The parser sends all the characters it finds between matched pairs of start and end tags to the Text( ) routine, shown in Example 9.7.
Example 9-7. The Text( ) Handler
sub Text
  {
  my ($expat) = @_;
  my $chars = $_;

  $comment_chars .= $chars;                              # save text for use by EndTag

  if ($expat->current_element() eq 'h1')                 # if new chapter
    {
    $chars = "Chapter $counters->{chapter}: " . $chars;  # announce its number
    }

  if ( my $level = isHeader ($expat->current_element) )  # if header
    {
    my ($prev) = $level-1;                               # compute parent level
    $prev = "h" . $prev;                                 # form parent h tag
    $msg_id++;                                           # update msg_id counter
    my $s_msg_id = $timestamp . "_" . $msg_id;           # form message id
    $current_header = $chars;                            # remember current header's text
    $lastHdrs{$expat->current_element} = $s_msg_id;      # remember governing ID
    $lastHdrId = $s_msg_id;                              # remember last ID
    my $s_ref_id = "";
    if ($expat->current_element ne 'h1')                 # if not an h1
      { $s_ref_id = $lastHdrs{$prev} }                   # make a References: header
    make_nntp_msg ( $s_msg_id, $s_ref_id,                # add an entry to nntp load file
      $counters->{chapter},$current_header);
    }

  if ( my $type = isFigureOrListingOrTable ($expat->current_element) )
    {
    $current_figttl = $chars;
    $counters->{$type}++;
    print DOCBASE "$type $counters->{chapter}-$counters->{$type}: ";
    }

  if ( isHeader ($expat->current_element))               # if header
    {
    my $elt = $counters->{element};
    my $cnum = ($expat->current_element() eq 'h1')
      ? "$counters->{chapter}: "
      : '';
    print TOC                                            # write table-of-contents entry
      "$tocListTags $tocPreamble <li> <span class=\"lev$newTocLev\">$cnum $_</span></li></a> ";
    print DOCBASE                                        # write HTML doc fragment
      "<a href=\"toc.htm#$elt\" target=\"toc\">$chars</a>";
    }
  else
    { print DOCBASE $chars; }
  }
Note that this routine also runs in what you might think of as the interstitial spaces of the XML stream. For example:
<p>some text</p>      <- Text receives "some text"
                      <- Text receives two newlines
<p>more text</p>      <- Text receives "more text"
Why does the parser report the newlines in this apparent no-man’s land? Because there’s always an enclosing scope. There has to be an outermost tag pair, which could be <html>..</html> or <GroupwareBook>..</GroupwareBook>, enclosing the whole stream. So there really is no interstitial space.
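You can observe this behavior with any event-driven parser. The sketch below is an assumption-laden stand-in for the Perl: it uses Python's standard xml.sax module rather than XML::Parser, and TextLogger and text_by_enclosing_element are names made up for the demonstration. It files each run of characters under whichever element is innermost when the text handler fires.

```python
import xml.sax


class TextLogger(xml.sax.ContentHandler):
    """Record which element encloses each run of character data."""

    def __init__(self):
        super().__init__()
        self.stack = []              # currently open elements, outermost first
        self.text_by_element = {}    # element name -> all text seen directly inside it

    def startElement(self, name, attrs):
        self.stack.append(name)

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # The parser may deliver text in several pieces, so append each piece.
        owner = self.stack[-1]
        self.text_by_element[owner] = self.text_by_element.get(owner, "") + content


def text_by_enclosing_element(xml_text):
    handler = TextLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.text_by_element
```

Fed two paragraphs separated by two newlines inside a <GroupwareBook> wrapper, the handler attributes the newlines to GroupwareBook, not to either paragraph.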
Note how characters accumulate in $comment_chars, the variable that will ultimately produce the quoted version of each element that appears on the comment form. Like StartTag( ), Text( ) may be called within an <li> or <p> tag pair. This happens, for example, when the parser sees an inline element such as <span> or <strong>. So the Text( ) routine uses $comment_chars to accumulate characters across multiple calls. StartTag( ), as we’ve seen, resets $comment_chars to the empty string.
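The accumulate-and-reset pattern is easy to get wrong, so here is a minimal Python sketch of it. It is not a port of the Perl handlers: it assumes the standard xml.sax module, the class name QuoteCollector is invented, and it tracks nesting with a depth counter where the Perl asks the parser via within_element( ). The set of commentable tags mirrors the list in withinCommentableElement( ).

```python
import xml.sax

COMMENTABLE = {"p", "li", "h1", "h2", "h3", "h4", "h5"}


class QuoteCollector(xml.sax.ContentHandler):
    """Collect the full text of each commentable element, inline tags included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than zero while inside a commentable element
        self.buffer = []    # pieces of the element currently being accumulated
        self.quotes = []    # finished quotes, in document order

    def startElement(self, name, attrs):
        if self.depth:                  # inline tag such as <span>: do not reset
            self.depth += 1
        elif name in COMMENTABLE:       # outermost commentable tag: start afresh
            self.depth = 1
            self.buffer = []

    def characters(self, content):
        if self.depth:
            self.buffer.append(content)

    def endElement(self, name):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:         # the commentable element itself has closed
                self.quotes.append("".join(self.buffer))


def collect_quotes(xml_text):
    handler = QuoteCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.quotes
```

A paragraph such as <p>Keep an <strong>open</strong> mind.</p> arrives as three separate text events, yet comes out as the single quote "Keep an open mind."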
When Text( ) encounters an HTML header (that is, an element in the set h1..h6), it builds an entry in a file of NNTP messages that will be used to populate the newsgroup. It forms a message ID from the timestamp taken at the beginning of the run, plus a message counter. The characters received from the parser (that is, the contents of the header) go into the variable $current_header. It will later be used by the EndTag( ) routine to complete the table-of-contents entry for this element. The Text( ) routine also passes $current_header to the make_nntp_msg( ) routine for use as the Subject: header of the NNTP message.
make_nntp_msg( ) also receives the message ID created for this element. And for headers other than the top-level <h1>, it receives another message ID for use in the References: header. The Text( ) routine finds this ID in the %lastHdrs hashtable, which it also maintains. In the case of an <h3> header, for example, it looks up $lastHdrs{h2} to find the ID of the <h3>’s parent.
The Text( ) routine could post NNTP messages as it creates them, but instead it just builds a file that looks like Example 9.8:
Example 9-8. An NNTP Load File to Create the Discussion Framework
From: [email protected]
Message-ID: <925224566_151@local>
Subject: Chapter 8: Docbase Search
Newsgroups: groupware.v3
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#969">docbase</a>
From: [email protected]
Message-ID: <925224566_152@local>
Subject: A docbase's Web API
Newsgroups: groupware.v3
References: <925224566_151@local>
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#973">docbase</a>
From: [email protected]
Message-ID: <925224566_153@local>
Subject: URL namespace reengineering
Newsgroups: groupware.v3
References: <925224566_152@local>
Content-type: text/html

Refer to <a href="http://localhost/.//chap8.htm#978">docbase</a>
Why a standalone file of messages? It enables pipelined processing. For example, the first version of this generator was written in Java, and the NNTP loader was written in Perl. When I rebuilt the generator in Perl, there was no need to change the loader. The Perl generator only had to target the same interface—the file format shown in Example 9.8—as the Java version had. The loader itself, shown in Example 9.9, is very simple.
Example 9-9. An NNTP Message Loader
use Net::NNTP;

my $nntp = Net::NNTP->new('localhost');

my @msg = ();

open(F,"nntp_msgs") or die "cannot open nntp_msgs $!";

while (<F>)
  {
  push (@msg,$_);
  if ( m/^Refer to/ )                  # the body line marks the end of a message
    {
    if (! $nntp->post(@msg))           # post the accumulated lines
      { die "cannot post" }
    @msg = ();
    }
  }

close F;
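Because the load file is the only interface between generator and loader, the loader's logic can be reproduced in any language. Here is a hedged Python sketch of the same splitting rule. The function names are invented, and the actual posting step is left as a caller-supplied post function rather than tied to a particular NNTP library.

```python
def split_load_file(lines, sentinel="Refer to"):
    """Group load-file lines into messages.

    As in Example 9-9, a line beginning with the sentinel is the one-line
    body of a message and therefore closes it.
    """
    messages, current = [], []
    for line in lines:
        current.append(line)
        if line.startswith(sentinel):
            messages.append(current)
            current = []
    return messages


def load(lines, post):
    """Hand each message to post(); stop at the first failure, like the Perl die."""
    count = 0
    for message in split_load_file(lines):
        if not post(message):
            raise RuntimeError("cannot post: " + message[0].strip())
        count += 1
    return count
```

For a dry run you might pass a post function that only prints each message and returns True. Swapping in a real NNTP client later would not require touching the splitting logic.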
The Text( ) routine also takes care of autonumbering figures and listings by trapping elements of these types and inserting formatted numbers into two output streams: the table of contents and the docbase itself. Finally, it emits the characters received from the parser, wrapping a table-of-contents link around header text to create the other half of the table-of-contents/chapter cross-linkage.
The EndTag( ) routine (Example 9.10) adds the instrumented comment link to each commentable element. The link’s address encodes the information that the form-generating script passes to its handler, which in turn posts the comment to the news server using Net::NNTP. By the time the parser calls EndTag( ), all this information is available. Before emitting a </p> or </li> tag, it writes a link whose text is just an element number, but whose address is a muscular CGI call that passes the docbase name, chapter number, element number, the NNTP message ID for this element’s governing header, and the complete text of the element for quoting purposes. Finally, EndTag( ) increments the element counter.
Example 9-10. The EndTag( ) Handler
sub EndTag
  {
  my ($expat,$element) = @_;

  if ( withinCommentableElement($expat,$element) )   # inline tag: echo it and return
    {
    print DOCBASE $_;
    return;
    }

  if ( isCommentableElement($element) )              # need to add comment link
    {
    my $escaped_current_header =                     # escape current header
      escape ($current_header);
    my $encoded_chars = escape($comment_chars);      # escape current element

    if ($protocol eq 'mail')                         # email version
      {
      $comment_chars = "Chapter: $counters->{chapter}, Section: $escaped_current_header, Para $counters->{element}: [$comment_chars]";
      print DOCBASE " <span class=\"eltnum\"> <a href=\"mailto:[email protected]?subject=groupware.$version, $escaped_current_header&body=$encoded_chars\">"
        . $counters->{element} . "</a></span>";
      }
    else                                             # nntp version
      {
      $comment_chars = "Chapter: $counters->{chapter}, Section: $current_header, Para $counters->{element}<p>$comment_chars";
      # build the URL by concatenation so that no whitespace ends up inside it
      print DOCBASE " <span class=\"eltnum\"> <a href=\"http://$server/$cgi_path/comment.pl?"
        . "docbase=groupware.$version&chapnum=$counters->{chapter}&"
        . "elt=$counters->{element}&fragment=$encoded_chars&"
        . "id=$lastHdrId\">" . $counters->{element} . "</a></span>";
      }
    }

  if ( my $type = isFigureOrListingOrTable($element) )   # update table-of-contents
    {
    my $tocElt = "<a name=\"$counters->{element}\"><a href=\"chap$counters->{chapter}.htm#$counters->{element}\" target=\"chap\"><li> $type $counters->{chapter}-$counters->{$type}: $current_figttl</li></a> ";
    }

  print DOCBASE $_;                                  # emit tag

  if ( isPreformattedElement ($element) )            # CSS workaround
    { print DOCBASE "</pre>"; }

  if ( isCommentableElement ($element) )             # update element counter
    { $counters->{element}++; }
  }