Search implementations often begin with the question: “What’s the best (or cheapest, or easiest-to-use) search engine?” This quest for the One True Search Engine is, I think, misguided. Search engines are more alike than they are different. You plug in a search term, you get back a list of URLs. In this chapter, we’ll focus on the much-neglected art of organizing that list of URLs in useful ways.
Webmasters who obsess about the design of their standard web pages often invest little or no effort in the design of their search-results pages. By design I don’t simply mean the kinds of template alterations that stamp a site-branded style onto the default Excite or Verity or Microsoft result pages. And I don’t mean just ordering results by relevance, which begs the question: relevant to whom and for what? Rather, I mean a deep reorganization of the result set, which both reflects and adapts to the underlying information architecture of one or more docbases. Search engines can’t do this for you. Most aren’t intrinsically programmable; those that are (such as the Microsoft Index Server) still can’t easily do the kinds of data wrangling required to get the job done well. And yet the task isn’t really hard at all, if you build docbases with an eye toward search integration and if you understand how web components work.
We began developing this notion of web components in Chapter 6 and Chapter 7. There, we saw how sets of CGI-based template/processor pairs, connected in series, create the Web equivalent of the Unix pipeline. But there’s a larger story to tell about components in the Web environment, and that theme will play out in this chapter and throughout the rest of this book. Let’s call the things we built in Chapter 6 and Chapter 7 microcomponents. Now we’ll shift our attention to macrocomponents. These can include search engines, docbases, or any other kind of application that exports a Web API. We first encountered the term Web API back in Chapter 4, when I showed how the Polls servlet’s Web API enables formless data entry using URLs sent in mail messages. In this chapter, we’ll first discover and then exploit the Web APIs presented by docbases and by search engines. Once you understand how these two kinds of macrocomponents work, you can effectively organize search results drawn from any combination of docbases by any search engine.
Is this groupware? Yes, if you construe the term broadly, as I do, to mean a variety of ways to connect people to each other and to information. In Chapter 5, we saw how a navigational system can express bindings among groups of users and classes of documents. A search system should work the same way, not just within an individual docbase but across sets of them.
A docbase is one kind of macrocomponent. What are its APIs, and how can we manipulate them? One API to a docbase, so obvious that we tend to overlook it, is the namespace. Consider a random URL from the ProductAnalysis docbase: /Docbase/ProductAnalysis/docs/000127.htm. This access handle contains important clues about the document it points to:
It’s a member of the Docbase family. That fact alone distinguishes the kinds of documents likely to be found here from the more informal documents you’d expect to find in, say, a newsgroup.
The ProductAnalysis docbase stores a particular kind of structured report, related to a particular business process. When scanning search results drawn from multiple docbases, you might use that fact to zero in on, or conversely to pass over, documents of this type.
The single filename 000127.htm doesn’t say anything about the age of the document it contains. A set of such names, though, carries information about the order in which the documents were added to the docbase. We used that information to impose a secondary ordering on the index pages. We can use the same strategy on search pages.
This is, admittedly, an implied API. The docbase doesn’t really have a documentAge property or a corresponding getDocumentAge() method. But it’s so straightforward to create such methods that I’ll claim poetic license and just pretend that these implied APIs really exist.
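To make the implied API concrete, here is a minimal sketch, written in Python purely for illustration. It assumes that every record lives at a path of the form /Docbase/<docbase>/docs/<record>.htm; the function names parse_docbase_path and sort_oldest_first are hypothetical and are not part of the Docbase system.

```python
import re

# Assumed layout of a Docbase URL path: /Docbase/<docbase>/docs/<record>.htm
DOCBASE_PATH = re.compile(r"^/Docbase/(?P<docbase>[^/]+)/docs/(?P<record>\d+)\.htm$")


def parse_docbase_path(path):
    """Return (docbase_name, sequence_number), or None if the path is not a Docbase record."""
    match = DOCBASE_PATH.match(path)
    if match is None:
        return None
    return match.group("docbase"), int(match.group("record"))


def sort_oldest_first(paths):
    """Order Docbase record paths by sequence number, skipping anything unrecognized."""
    records = [(parse_docbase_path(p), p) for p in paths]
    known = [(info[1], p) for info, p in records if info is not None]
    return [p for _, p in sorted(known)]
```

Because sequence numbers are only meaningful within one docbase, a results page would presumably group hits by docbase name first and apply sort_oldest_first within each group.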
Although the URL namespaces built by the Docbase system convey the relative age of records, they don’t encode their absolute age. That would be valuable, and it’s easy to do. Suppose we tweak Docbase::Input so that it generates record numbers like 1999-04-15-000127 rather than simply 000127. That’s a trivial change that affects none of the input processing or the sequential- and tabbed-index processing. To all these processes, record numbers (and their corresponding filenames) are just opaque tokens that happen to sort by age. The new format preserves that property but makes the names more useful. Now we can read off a document’s creation date directly from its name, whether or not a docbase’s structured header carries a creation date field. We can arrange sets of records in yearly, monthly, or daily groupings. More subtly, the relationship between the two components of the name (date and sequence number) carries information about the rate at which the docbase is growing.
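The following sketch shows what those date-stamped names make possible. It is in Python for illustration only, and it assumes the YYYY-MM-DD-NNNNNN record format proposed above; parse_record, group_by_month, and records_per_day are hypothetical helpers, not functions of Docbase::Input.

```python
import re
from collections import defaultdict
from datetime import date

# Assumed record-number format: <year>-<month>-<day>-<sequence>, e.g. 1999-04-15-000127
RECORD = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(\d+)$")


def parse_record(name):
    """Split a record number into (creation date, sequence number)."""
    match = RECORD.match(name)
    if match is None:
        raise ValueError(f"not a date-stamped record number: {name!r}")
    year, month, day, seq = match.groups()
    return date(int(year), int(month), int(day)), int(seq)


def group_by_month(names):
    """Bucket record numbers by (year, month), keeping each bucket in age order."""
    groups = defaultdict(list)
    for name in sorted(names):  # zero-padded names still sort by age as plain strings
        created, _ = parse_record(name)
        groups[(created.year, created.month)].append(name)
    return dict(groups)


def records_per_day(older, newer):
    """Estimate the growth rate from two record numbers created on different days."""
    first_day, first_seq = parse_record(older)
    last_day, last_seq = parse_record(newer)
    days = (last_day - first_day).days
    if days <= 0:
        raise ValueError("the second record must be created on a later day")
    return (last_seq - first_seq) / days
```

For example, if record 000107 was created on 1999-04-05 and record 000127 on 1999-04-15, the docbase grew by about two records per day over that stretch.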
A URL like http://host/Docbase/ProductAnalysis/docs/1999-04-15-000127.htm is admittedly long and hard to type. As we saw in Part I, users are best advised to manipulate it using cut-and-paste or drag-and-drop. They should also enclose it in angle brackets for email transmission, to prevent breakage. These strategies help, but a long name still trades away some convenience in exchange for descriptive power. On balance I think the advantages of a more richly descriptive name outweigh the disadvantages of untypeability and possible breakage.
Some sites take pains to hide URLs behind “friendly” labels. One university’s faculty directory uses a JavaScript onMouseOver handler to prettify the link addresses displayed in the browser’s status window. For a link labeled “Home page for John Doe,” with an address of /directories/faculty/es/JohnDoe.html, the status window’s display echoes “Home page for John Doe.” This effort is misguided for several reasons. Recapitulating the link’s label in the status bar is redundant. Worse, it reduces the total amount of information available at this level of the system. The information-rich address is now completely hidden. That raw address, presumed unfriendly by the webmaster, actually tells us quite a bit about the document that it points to:
It’s part of a directory, as distinct from other kinds of docbases on the site.
It’s part of a faculty directory, as distinct from the parallel docbase whose records describe administrative personnel.
It refers to a member of the environmental science department, as distinct from computer science or English.
Admittedly es is a cryptic departmental tag. And in the context in which I found that page, it happened to be an unnecessary clue, because the directory entries were already grouped by department. But consider what happens when John Doe’s home page appears as a search result. Now es is a vital clue as to the document’s type and purpose. An intelligent search-results filter can exploit that clue to integrate this document into a collection of search results.
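Here is one way such a filter might read those clues, sketched in Python for illustration. It assumes that directory pages always follow the four-part layout /directories/<kind>/<department>/<page> seen in the example address; describe_directory_url and group_by_department are hypothetical names.

```python
from collections import defaultdict


def describe_directory_url(path):
    """Split /directories/<kind>/<department>/<page> into labeled parts, or return None."""
    parts = path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "directories":
        return None
    _, kind, department, page = parts
    return {"kind": kind, "department": department, "page": page}


def group_by_department(paths):
    """Bucket directory hits by (kind, department); other URLs go under the key None."""
    groups = defaultdict(list)
    for path in paths:
        info = describe_directory_url(path)
        key = (info["kind"], info["department"]) if info else None
        groups[key].append(path)
    return dict(groups)
```

Given a mixed list of search hits, group_by_department files /directories/faculty/es/JohnDoe.html under the key ("faculty", "es"), which a results page could then render under its own heading.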
That filter won’t care whether the department ID is es or environmental_science. Of course, people who encounter the URL would prefer the latter. Why was es used? Almost certainly because the directory was built manually, by a webmaster who quite understandably did not want to type out environmental_science or psychology. Were the directory a Docbase-style application, these longer departmental names could be maintained as a controlled vocabulary and automatically interpolated into the docbase’s namespace.
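A minimal sketch of that idea follows, again in Python for illustration. The DEPARTMENTS set and the directory_path function are assumptions made up for this example; a real Docbase-style application would presumably keep its controlled vocabulary in its own configuration.

```python
# Hypothetical controlled vocabulary of department names (an assumption for this sketch).
DEPARTMENTS = frozenset({
    "environmental_science",
    "computer_science",
    "english",
    "psychology",
})


def directory_path(kind, department, page):
    """Build a directory URL path, accepting only departments from the vocabulary."""
    if department not in DEPARTMENTS:
        raise ValueError(f"unknown department: {department!r}")
    return f"/directories/{kind}/{department}/{page}.html"
```

Since the software interpolates the name, nobody has to type environmental_science by hand, and a cryptic tag like es is rejected rather than silently entering the namespace.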