In the Docbase system described here and in Chapter 7, the repository format coincides with the delivery format. The same set of HTML pages serves both purposes. This differs from the BYTE Magazine docbase we explored in Chapter 5; in that case, a translator read the repository format and wrote deliverable pages.
Why use one versus the other? The BYTE docbase had a fairly complex format but was batch-oriented and maintained by a single production expert who exported material from QuarkXPress and then massaged it to meet a detailed repository specification. There was no need to preview individual pages or validate input interactively, and although a tool could have provided these features, it would have been costly to build and maintain.
The Virtual Press Room, by contrast, had a relatively simple format and was built interactively by many untrained users. These users required an authoring tool that did validate and preview the information they supplied. Because the format was simple, that tool was cheap to build and maintain. Since the preview pages had to be produced immediately, it was convenient to just store them as is.
In the Docbase system, the deliverable HTML pages can be XML pages too, if you format the docbase template as XML. When the two formats coincide, HTML pages are much more manageable than they otherwise would be. The Virtual Press Room was, like the BYTE docbase, a pre-XML-era invention. Had I built it in 1999 rather than 1995, I’d have exploited XML, as I’ll demonstrate here.
When a user enters a new record in the ProductAnalysis docbase, Docbase::Input( ) interpolates the validated input into the same docbase template used for the preview. That template might be plain HTML, perhaps augmented with CSS styling. But as Example 6.8 shows, it can also conform to the rules for well-formed XML. As we’ll see again in Chapter 9, those rules are minimal. In this case, they simply require that all tags be closed and all attributes quoted.
Example 6-8. An HTML/XML Docbase Record Template
<html>
<head>
<meta name="company" content="[company]"/>
<meta name="product" content="[product]"/>
<meta name="analyst" content="[analyst]"/>
<meta name="duedate" content="[duedate]"/>
<title>[company],[product],[title]</title>
<link rel="stylesheet" type="text/css" href="../../../Docbase/ProductAnalysis/style.css"/>
</head>
<body>
<!-- navcontrols -->
<!-- navigation controls go here -->
<h1>[company]/[product]</h1>
<table border="1" cellpadding="4">
<tr>
<td align="right" valign="top" class="label">Date</td>
<td align="left" class="duedate">[duedate]</td>
</tr>
<tr>
<td align="right" valign="top" class="label">Analyst</td>
<td align="left" class="analyst">[analyst]</td>
</tr>
<tr>
<td align="right" valign="top" class="label">Title</td>
<td align="left" class="title">[title]</td>
</tr>
<tr>
<td align="right" valign="top" class="label">Summary</td>
<td align="left" class="summary">[summary]</td>
</tr>
<tr>
<td align="right" valign="top" class="label">Full Report</td>
<td align="left" class="fulltext">[fulltext]</td>
</tr>
<tr>
<td align="right" valign="top" class="label">Contact Info</td>
<td align="left" class="contact">[contact]</td>
</tr>
</table>
</body>
</html>
This template is used twice—first to create the preview, as we’ve already seen, and again to create the final record stored in the docbase.
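The interpolation step itself can be sketched in a few lines of Perl. What follows is an illustration under stated assumptions, not the Docbase code: fill_template is a hypothetical helper that takes the template as a string (the real _fillTemplate( ) is handed a template filename), swaps each [name] placeholder for the matching hash value, and leaves unknown placeholders untouched.

```perl
use strict;
use warnings;

# Hypothetical helper, not the Docbase _fillTemplate( ) method.
# Replace each [name] placeholder in $template with $vars->{name}.
# A placeholder with no matching key is left as it was.
sub fill_template {
    my ($template, $vars) = @_;
    $template =~ s{\[(\w+)\]}{ exists $vars->{$1} ? $vars->{$1} : "[$1]" }ge;
    return $template;
}

# One row of the record template, and one validated form variable.
my $template = '<td align="left" class="analyst">[analyst]</td>';
my %vars     = ( analyst => 'Jon Udell' );

my $preview = fill_template($template, \%vars);   # first use: the preview
my $record  = fill_template($template, \%vars);   # second use: the stored record
print "$record\n";
```

Because the same function and the same template produce both the preview and the stored record, what the user approves is what lands in the docbase.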
The combination of HTML, CSS, and XML shown here is a transitional strategy. You could, instead, write a pure XML template like this:
<company>[company]</company>
<product>[product]</product>
<analyst>[analyst]</analyst>
The problem with this approach is that, for most browsers, you’ll end up with a repository format that doesn’t coincide with a delivery format. Internet Explorer 5.0 can associate XML tags with CSS or extensible stylesheet language (XSL) styles and thus render a page of XML as it would render a page of HTML. So can the beta version of Navigator 5.0. But this is a new capability that’s not yet universally deployed and won’t be for a while. So in practice, to support the installed base of browsers, you’d need another step to translate between repository and delivery formats.
The middle-ground approach shown in Example 6.8, which we’ll see again in Chapter 9, makes ordinary CSS class attributes do double duty. In the presence of a CSS style sheet, these attributes exert stylistic control over the docbase record. That control can be as detailed as your tagging will support; you could even assign a unique style to each field of the record. What’s more, styles obey inheritance rules, so styles assigned to a class attached to the <body> tag, or to a <table> tag, will ripple down through these structures unless explicitly overridden at lower levels. Well, in theory that’s what happens. In practice neither the Netscape nor the Microsoft browser currently implements all of CSS1, and you’ll run into the usual headaches when you try to figure out which features, and combinations of features, work reliably in both.
Flaky CSS implementations don’t detract at all from another role played by the class attribute. It is, fundamentally, a selector that operates on a document and returns a subset of its elements. Normally it’s a CSS-aware application (e.g., your browser) that does the selection in order to apply a style. But any other application can use the selectors too. Suppose you want to create a view of the docbase that presents report summaries containing a search term. In SQL terms, you’d like to issue the query:
select summary from docbase where summary like '%LDAP%'
Example 6.9 demonstrates a filter, called xml-grep, that reads one of the HTML/CSS/XML files in this docbase and performs the same query.
Example 6-9. A Docbase Query Based on CSS Tags
# usage:   xml-grep FILENAME TAG PATTERN
# example: xml-grep 000127.htm summary LDAP

use XML::Parser;

my $xml = XML::Parser->new(Style => 'Stream');
$xml->parsefile($ARGV[0]);                      # parse the file

sub StartTag {}                                 # not needed here
sub EndTag   {}                                 # not needed here

sub Text
  {
  my $expat = shift;
  if ( $expat->current_element eq 'td'  and     # table cell
       $_{class} eq $ARGV[1]            and     # of class 'summary'
       m/$ARGV[2]/                              # matching LDAP
     )
    { print "$_{class}: $_\n"; }                # found a hit
  }
This script expects three arguments: a filename, a class attribute, and a search string. It’s a whole lot slower than grep. But it’s more flexible, because it will match, for example, either of these patterns:
<td class="summary" width="20%">...</td>
<td valign="top" class="summary" align="left" colspan="2">...</td>
What’s more, this approach can deal with inheritance in the same way that CSS display processors do. For example, the analyst field might not always be immediately contained within a cell of an HTML table. Suppose that inside that cell, the name is wrapped up in link syntax, like this:
<td class="analyst"><a href="mailto:[email protected]"> Jon Udell</a></td>
We can still capture my name like this:
if ( $expat->within_element('td')       and     # inside a table cell
     $last_seen_class_attr eq $ARGV[1]  and     # class="analyst"
     m/$ARGV[2]/                                # match "Jon Udell"
   )
If we saved the value of the last-seen class attribute as $last_seen_class_attr, then this fragment, which runs in the context of the <a href> tag, will succeed. A line-oriented grep can’t do this. But an XML query that understands the hierarchy of an attributed docbase can find things that are nested in other things. Several formal query languages are proposed for XML, notably XQL (http://www.w3.org/TandS/QL/QL98/pp/xql.html) and XML-QL (http://www.w3.org/TR/NOTE-xml-ql/). Even without a general-purpose XML query language, though, you can see that it’s not hard to write parser-enabled code to do simple queries.
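One way to keep track of the class in effect at any depth is a stack. This is a minimal sketch of that idea, not the xml-grep code: start_tag, end_tag, and text_matches are hypothetical names, and the sketch feeds them events by hand. With XML::Parser you would call them from the parser's start-tag, end-tag, and text callbacks instead. A stack is used rather than a single variable so that the class reverts correctly when a nested element closes.

```perl
use strict;
use warnings;

my @class_stack;    # class in effect for each currently open element

# Start of an element: its own class attribute wins;
# otherwise it inherits the class of the enclosing element.
sub start_tag {
    my ($name, %attr) = @_;
    my $inherited = @class_stack ? $class_stack[-1] : '';
    push @class_stack, defined $attr{class} ? $attr{class} : $inherited;
}

# End of an element: fall back to the enclosing element's class.
sub end_tag { pop @class_stack }

# Text: true if the class in effect and the pattern both match.
sub text_matches {
    my ($text, $class, $pattern) = @_;
    return @class_stack
        && $class_stack[-1] eq $class
        && $text =~ /$pattern/ ? 1 : 0;
}

# Events for: <td class="analyst"><a href="...">Jon Udell</a></td>
start_tag('td', class => 'analyst');
start_tag('a',  href  => '...');     # no class of its own, so inherits 'analyst'
my $hit = text_matches('Jon Udell', 'analyst', 'Udell');
end_tag();                           # </a>
end_tag();                           # </td>

print $hit ? "hit\n" : "miss\n";
```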
The XML nature of the docbase records created by the template in Example 6.8 solves another important problem too. When I managed the Virtual Press Room, I sometimes had to make wholesale changes to the docbase. That was never a problem with the BYTE docbase, because its “object code” was routinely “compiled” from its “source code.” But the VPR’s “object code” was its “source code,” and there was no “compiler” in the same sense.
Because the VPR’s HTML pages were machine written, they exhibited regular patterns that Perl scripts could latch onto and use to make systematic transformations. But the pages weren’t trivially rewritable. Creating those scripts was feasible but was a time-consuming and ultimately wasteful exercise. XML means never having to waste your time writing custom parsing code.
Docbases need to evolve. Inevitably you’ll run into situations that require wholesale rewriting of a set of records. The XML discipline makes that kind of rewrite vastly simpler than it otherwise would be. That’s a huge bonus for a manager of semistructured information.
HTML’s <meta> tag has for years provided a way to make the header of a web page behave much like the header of an email or news message. You can use the <meta> tag to tuck a set of name/value pairs into a document header. In the long run, XML may make this way of maintaining a structured header inside a web page obsolete. But for the near future, it’s a really useful technique. Like email headers, these kinds of web-page headers are easy to parse and manipulate, using a variety of tools. Because the <meta> tags in Example 6.1 are well-formed XML, any XML parser can work with them. But as we’ll see in the next chapter, sometimes that can be overkill. It’s faster and easier to deal with a simple pattern like this one using Perl’s native regular-expression engine.
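Here is a minimal sketch of the regular-expression approach. It is not the Chapter 7 code: meta_fields is a hypothetical helper, the sample values are invented for illustration, and the pattern assumes the attribute order and double quoting that the record template always produces.

```perl
use strict;
use warnings;

# Hypothetical helper: collect name/content pairs from <meta> tags.
# Assumes name comes before content and both are double-quoted,
# which holds for machine-written records but not for arbitrary HTML.
sub meta_fields {
    my ($html) = @_;
    my %fields;
    while ( $html =~ /<meta\s+name="([^"]+)"\s+content="([^"]*)"\s*\/>/g ) {
        $fields{$1} = $2;
    }
    return %fields;
}

# A record header with made-up values.
my $header = <<'END';
<head>
<meta name="company" content="Example Corp"/>
<meta name="product" content="Example Server"/>
<meta name="analyst" content="Jon Udell"/>
<meta name="duedate" content="1999-07-01"/>
</head>
END

my %record = meta_fields($header);
print "$_: $record{$_}\n" for sort keys %record;
```

The pattern is this simple only because the records are machine written. Hand-edited pages with reordered or single-quoted attributes would need a real parser.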
In Chapter 7, we’ll use the meta-tagged header in the docbase record to build indexes that enable several modes of navigation. In Chapter 8, we’ll see how full-text indexers can automatically recognize the meta-tagged header and use it to support field-level as well as full-text search of the docbase. That’s a powerful capability, but one that’s seldom used. Why? It requires a tagging discipline that many web archives lack. By doing that tagging automatically, the Docbase system creates potential value. A smart navigational system is one way to actualize that potential; a smart search system is another.
Note that some of the fields defined in Example 6.8 with <meta> tags duplicate fields governed by CSS class attributes. Why do it both ways? Sometimes you just need to scan for indexable fields, as we’ll be doing in the next chapter, and then it’s handy to have a nice neat header tucked into the top of every docbase record. Sometimes you need to do a wholesale transformation of the docbase, in which case you’ll want to deal with XML elements rather than simple text patterns. There’s more than one way to do it!
Now let’s see how a record, having been previewed and submitted by a user, enters the docbase. The preview is hardwired to a common script, final-submit.pl, shown in Example 6.10.
Example 6-10. The final-submit.pl Script
#!/usr/bin/perl -w

use strict;

use TinyCGI;
my $tc = TinyCGI->new();
print $tc->printHeader;
my $vars = $tc->readParse();

use Docbase::Docbase;
my $db = Docbase::Docbase->new($vars->{app});

use Docbase::Input;
my $di = Docbase::Input->new($db);
$di->writeDoc($vars);
It’s brief, needing only to pass a hashtable of CGI variables to the Docbase::Input method writeDoc( ), shown in Example 6.11.
Example 6-11. The writeDoc Method
sub writeDoc
  {
  my ($self,$vars) = @_;
  my $app = $self->{app};
  my $cgi_absolute = $self->{docbase_cgi_absolute};
  my $web_absolute = $self->{docbase_web_absolute};

  my $db_template =                       # make template name
    "$cgi_absolute/$app/docbase-template.htm";

  my $content =                           # interpolate vars into template
    _fillTemplate($db_template,$vars);

  my $docnum =                            # get next record number
    _nextFilenum("$web_absolute/$app/docs","htm");

  my $docfile =                           # make record's filename
    "$web_absolute/$app/docs/$docnum.htm";

  if ( open(F,">$docfile") )
    {
    print F $content;                     # store record
    close F;
    print "<br>Done. Your reference number is $docnum\n";
    }
  else
    {
    print "<p>cannot open docfile $docfile";
    }
  }
The writeDoc( ) method is also brief. It uses _fillTemplate( ) again to interpolate form variables into the record template, asks for the next available record number, creates a file named for that record number, and writes the record to the file.
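The listing doesn’t show how the next record number is found. This is a sketch of one plausible approach, not the Docbase implementation: next_filenum is a hypothetical stand-in for _nextFilenum( ) that scans the docs directory for the highest-numbered record and adds one. The six-digit zero padding is an assumption, inferred from the 000127.htm filename used earlier.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for _nextFilenum( ): scan $dir for files named
# NNNNNN.$ext and return the next number, zero-padded to six digits
# (the padding width is an assumption).
sub next_filenum {
    my ($dir, $ext) = @_;
    opendir( my $dh, $dir ) or die "cannot open directory $dir: $!";
    my $max = 0;
    for my $name ( readdir $dh ) {
        next unless $name =~ /^(\d+)\.\Q$ext\E$/;   # skip non-record files
        $max = $1 if $1 > $max;
    }
    closedir $dh;
    return sprintf '%06d', $max + 1;
}

# Try it against a scratch directory holding two existing records.
my $dir = tempdir( CLEANUP => 1 );
for my $n ( '000126', '000127' ) {
    open( my $fh, '>', "$dir/$n.htm" ) or die "cannot create $dir/$n.htm: $!";
    close $fh;
}
print next_filenum( $dir, 'htm' ), "\n";    # prints 000128
```

A scan like this is not safe when two submissions arrive at the same moment, because both could be handed the same number. A busy docbase would need file locking or some other way to serialize the request.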