Chapter 8. Organizing Search Results

Search implementations often begin with the question: “What’s the best (or cheapest, or easiest-to-use) search engine?” This quest for the One True Search Engine is, I think, misguided. Search engines are more alike than they are different. You plug in a search term, you get back a list of URLs. In this chapter, we’ll focus on the much-neglected art of organizing that list of URLs in useful ways.

Webmasters who obsess about the design of their standard web pages often invest little or no effort in the design of their search-results pages. By design I don’t simply mean the kinds of template alterations that stamp a site-branded style onto the default Excite or Verity or Microsoft result pages. And I don’t mean just ordering results by relevance, which raises the question: relevant to whom and for what? Rather, I mean a deep reorganization of the result set, which both reflects and adapts to the underlying information architecture of one or more docbases. Search engines can’t do this for you. Most aren’t intrinsically programmable; those that are (such as the Microsoft Index Server) still can’t easily do the kinds of data wrangling required to get the job done well. And yet the task isn’t really hard at all, if you build docbases with an eye toward search integration and if you understand how web components work.

We began developing this notion of web components in Chapter 6 and Chapter 7. There, we saw how sets of CGI-based template/processor pairs, connected in series, create the Web equivalent of the Unix pipeline. But there’s a larger story to tell about components in the Web environment, and that theme will play out in this chapter and throughout the rest of this book. Let’s call the things we built in Chapter 6 and Chapter 7 microcomponents. Now we’ll shift our attention to macrocomponents. These can include search engines, docbases, or any other kind of application that exports a Web API. We first encountered the term Web API back in Chapter 4, when I showed how the Polls servlet’s Web API enables formless data entry using URLs sent in mail messages. In this chapter, we’ll first discover and then exploit the Web APIs presented by docbases and by search engines. When you understand properly how these two kinds of macrocomponents work, you can effectively organize search results drawn from any combination of docbases by any search engine.

Is this groupware? Yes, if you construe the term broadly, as I do, to mean a variety of ways to connect people to each other and to information. In Chapter 5, we saw how a navigational system can express bindings among groups of users and classes of documents. A search system should work the same way, and not just within an individual docbase, but across sets of them.

A Docbase’s Web API

A docbase is one kind of macrocomponent. What are its APIs, and how can we manipulate them? One API to a docbase, so obvious that we tend to overlook it, is the namespace. Consider a random URL from the ProductAnalysis docbase: /Docbase/ProductAnalysis/docs/000127.htm. This access handle contains important clues about the document it points to:

The general type

It’s a member of the Docbase family. That fact alone distinguishes the kinds of documents likely to be found here from the more informal documents you’d expect to find in, say, a newsgroup.

The specific type

The ProductAnalysis docbase stores a particular kind of structured report, related to a particular business process. When scanning search results drawn from multiple docbases, you might use that fact to zero in on, or conversely to pass over, documents of this type.

The relative age

The single filename 000127.htm doesn’t say anything about the age of the document it contains. A set of such names, though, carries information about the order in which the documents were added to the docbase. We used that information to impose a secondary ordering on the index pages. We can use the same strategy on search pages.

This is, admittedly, an implied API. The docbase doesn’t really have a documentAge property or a corresponding getDocumentAge( ) method. But it’s so straightforward to create such methods that I’ll claim poetic license and just pretend that these implied APIs really exist.
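To make the implied API concrete, here is a minimal sketch in Python of what such methods might look like. It is an illustration only, not part of the Docbase system: the names DocHandle, parse_doc_url, and newest_first are hypothetical, and the code assumes every path follows the family/docbase/docs/record.htm layout of the example URL, with a purely numeric record name.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class DocHandle(NamedTuple):
    """The clues carried by a Docbase-style URL (hypothetical structure)."""
    family: str    # general type, e.g. "Docbase"
    docbase: str   # specific type, e.g. "ProductAnalysis"
    sequence: int  # record number; a larger number means a later addition


def parse_doc_url(url: str) -> DocHandle:
    """Split a URL into general type, specific type, and record number."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed layout: <family>/<docbase>/docs/<record>.htm
    if len(parts) != 4 or parts[2] != "docs":
        raise ValueError(f"not a Docbase-style path: {url!r}")
    family, docbase, _, filename = parts
    record = filename.rsplit(".", 1)[0]
    return DocHandle(family, docbase, int(record))


def newest_first(urls):
    """Order URLs by relative age, most recently added record first."""
    return sorted(urls, key=lambda u: parse_doc_url(u).sequence, reverse=True)
```

Calling parse_doc_url on the example URL yields the family Docbase, the docbase ProductAnalysis, and the record number 127. The record number says nothing on its own, but newest_first shows how a set of such numbers yields an ordering by relative age.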

URL Namespace Reengineering

Although the URL namespaces built by the Docbase system convey the relative age of records, they don’t encode their absolute age. That would be valuable, and it’s easy to do. Suppose we tweak Docbase::Input so that it generates record numbers like 1999-04-15-000127 rather than simply 000127. That’s a trivial change that affects none of the input processing or the sequential- and tabbed-index processing. To all these processes, record numbers (and their corresponding filenames) are just opaque tokens that happen to sort by age. The new format preserves that property but makes the names more useful. Now we can read off a document’s creation date directly from its name, whether or not a docbase’s structured header carries a creation date field. We can arrange sets of records in yearly, monthly, or daily groupings. More subtly, the relationship between the two components of the name—date and sequence number—carries information about the rate at which the docbase is growing.
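The following Python sketch illustrates the date-stamped naming scheme. It is not the actual Docbase::Input code; the function names are hypothetical, and it assumes a six-digit, ever-increasing sequence number appended to an ISO-style date, as in the example 1999-04-15-000127.

```python
from collections import defaultdict
from datetime import date


def make_record_id(created: date, sequence: int) -> str:
    """Build a name like 1999-04-15-000127 from a date and a sequence number."""
    return f"{created.isoformat()}-{sequence:06d}"


def split_record_id(record_id: str):
    """Recover (creation date, sequence number) from a record name."""
    date_part, seq_part = record_id.rsplit("-", 1)
    return date.fromisoformat(date_part), int(seq_part)


def group_by_month(record_ids):
    """Arrange record names into (year, month) groups, oldest first."""
    groups = defaultdict(list)
    # Plain string sorting still orders the names by age.
    for rid in sorted(record_ids):
        created, _ = split_record_id(rid)
        groups[(created.year, created.month)].append(rid)
    return dict(groups)


def records_per_day(older_id: str, newer_id: str) -> float:
    """Estimate the docbase's growth rate from two record names."""
    d1, s1 = split_record_id(older_id)
    d2, s2 = split_record_id(newer_id)
    days = (d2 - d1).days
    if days <= 0:
        raise ValueError("newer_id must be dated after older_id")
    return (s2 - s1) / days
```

Because the names remain opaque tokens that sort by age, code that only sorts them needs no change. The last function shows the subtler point: comparing the date and sequence components of two names gives a rough rate of growth without consulting any record's contents.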

A URL like http://host/Docbase/ProductAnalysis/docs/1999-04-15-000127.htm is admittedly long and hard to type. As we saw in Part I, users are best advised to manipulate it using cut-and-paste or drag-and-drop. They should also enclose it in angle brackets for email transmission, to prevent breakage. These strategies help matters, but a long name still trades away some convenience in exchange for descriptive power. On balance I think the advantages of a more richly descriptive name outweigh the disadvantages of untypeability and possible breakage.

Issues in URL Namespace Design

Some sites take pains to hide URLs behind “friendly” labels. One university’s faculty directory uses a JavaScript onMouseOver handler to prettify the link addresses displayed in the browser’s status window. For a link labeled Home page for John Doe, with an address of /directories/faculty/es/JohnDoe.html, the status window’s display echoes “Home page for John Doe.” This effort is misguided for several reasons. Recapitulating the link’s label in the status bar is redundant. Worse, it reduces the total amount of information available at this level of the system. The information-rich address is now completely hidden. That raw address, presumed unfriendly by the webmaster, actually tells us quite a bit about the document that it points to:

  • It’s part of a directory, as distinct from other kinds of docbases on the site.

  • It’s part of a faculty directory, as distinct from the parallel docbase whose records describe administrative personnel.

  • It refers to a member of the environmental science department, as distinct from computer science or English.

Admittedly es is a cryptic departmental tag. And in the context in which I found that page, it happened to be an unnecessary clue, because the directory entries were already grouped by department. But consider what happens when John Doe’s home page appears as a search result. Now es is a vital clue as to the document’s type and purpose. An intelligent search-results filter can exploit that clue to integrate this document into a collection of search results.

That filter won’t care whether the department ID is es or environmental_science. Of course, people who encounter the URL would prefer the latter. Why was es used? Almost certainly because the directory was built manually, by a webmaster who quite understandably did not want to type out environmental_science or psychology. Were the directory a Docbase-style application, these longer departmental names could be maintained as a controlled vocabulary and automatically interpolated into the docbase’s namespace.
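Here is a rough Python sketch of how a search-results filter might exploit these clues. Everything in it is an assumption made for illustration: the DEPARTMENTS table stands in for a controlled vocabulary (only the es tag comes from the example above; cs is invented), and the code assumes every address follows the /directories/directory/department/page layout of the John Doe URL.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary mapping short tags to full names.
# Only "es" appears in the example; "cs" is an invented entry.
DEPARTMENTS = {
    "es": "environmental_science",
    "cs": "computer_science",
}


def describe(url_path: str) -> dict:
    """Read the directory type and department out of an address."""
    parts = url_path.strip("/").split("/")
    # Assumed layout: directories/<directory>/<department>/<page>
    if len(parts) != 4 or parts[0] != "directories":
        raise ValueError(f"not a directory-style path: {url_path!r}")
    _, directory, dept, page = parts
    return {
        "directory": directory,
        # Unknown tags pass through unchanged rather than being guessed.
        "department": DEPARTMENTS.get(dept, dept),
        "page": page,
    }


def group_results(paths):
    """Group search results by (directory, department)."""
    groups = defaultdict(list)
    for path in paths:
        info = describe(path)
        groups[(info["directory"], info["department"])].append(path)
    return dict(groups)
```

The filter itself is indifferent to whether it sees es or environmental_science. The vocabulary table matters only at the point where a label is shown to a person, which is also where a Docbase-style application would interpolate the longer name.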
