Semantics and Search ◾  367
2007; García et al., 2003; Moench et al., 2003), and they usually use standardized
taxonomies to provide drill-down functionality on the result page. The hierarchical
links shown in Figure 13.13 for Squirrel are semantic categories used to classify the
documents at index time.
13.7.2 Faceted Search
In faceted search applications, we use multiple classification schemes to describe a
range of document dimensions. The method provides a more fine-grained description
than those offered by taxonomies, although the same dimensions can also
be modeled within extensive ontologies. Normally, the relevant facets are derived
from an analysis of the document text, so that aggregated views of these facets
are available when the documents are retrieved and presented as the result set to
a query.
Figure13.14 illustrates some of the facets used to describe document result
sets in Endeca’s topic search. e initial result set is listed at right and the user is
encouraged to drill into the result set using the facets listed at left. Choosing for
example, futures trading in organization and the Beverly Hills location, the system
will display only documents that satisfy the query and deal with futures trading
and Beverly Hills. Ben-Yitzhak et al. (2008) present an extended faceted search
application in which business intelligence features are added to the standard drill-
down functions of faceted search. Another system from Hyvönen et al. (2003)
builds facets from ontology concepts and combines them with recommendation
functionality.
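The drill-down behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not Endeca's implementation: the documents, the facet names, and the helper functions `search`, `facet_counts`, and `drill_down` are all hypothetical, and query matching is reduced to naive term containment.

```python
from collections import Counter

# Hypothetical toy collection; facet names and values are illustrative only.
DOCS = [
    {"id": 1, "text": "futures trading desk expands",
     "facets": {"organization": "futures trading", "location": "Beverly Hills"}},
    {"id": 2, "text": "futures trading rules change",
     "facets": {"organization": "futures trading", "location": "Chicago"}},
    {"id": 3, "text": "bank opens trading office",
     "facets": {"organization": "banking", "location": "Beverly Hills"}},
]

def search(docs, query):
    """Return documents whose text contains every query term (naive matching)."""
    terms = query.lower().split()
    return [d for d in docs if all(t in d["text"].lower().split() for t in terms)]

def facet_counts(results):
    """Aggregate, per facet, how many results carry each facet value."""
    counts = {}
    for doc in results:
        for facet, value in doc["facets"].items():
            counts.setdefault(facet, Counter())[value] += 1
    return counts

def drill_down(results, selections):
    """Keep only results that match every selected facet value."""
    return [d for d in results
            if all(d["facets"].get(f) == v for f, v in selections.items())]

results = search(DOCS, "trading")
counts = facet_counts(results)  # aggregated view shown next to the result list
narrowed = drill_down(results, {"organization": "futures trading",
                                "location": "Beverly Hills"})
```

The facet counts correspond to the aggregated views offered to the user, and each selection simply adds one more conjunctive filter on the current result set.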
13.7.3 Ontology Rules and Navigators
In the SmartSearch project of Deutsche Telekom Laboratories (Burkhardt et al.,
2008), a technique similar to faceted search is used to interpret and refine posted
search queries. Within a limited domain, so-called navigators offer drill-down
functionality based on ontological category classification. Figure 13.15 shows the results
from posting a new bruce willis query. In addition to the standard ranked list of documents
Figure 13.13 Links for drilling into search result set in Squirrel.
368 ◾  Jon Atle Gulla et al.
under pages, several recent movies starring Bruce Willis are listed in the movie
navigator on the left side. An internal rule system uses a movie ontology to match bruce
willis to an instance of the actor concept and new to movies that are less than 5
years old. If a user clicks on the Live Free or Die Hard link under movies, the
output displayed in Figure 13.16 appears. The result page now consists of documents
referring to this particular movie, while the navigators extend to movies that are
similar (based on a tf-idf ranking), actors who appear in the movie, and companies
connected to it. Again, this technique is based on a navigational rule that connects
Figure 13.15 Navigational search with SmartSearch.
Figure 13.14 Faceted search with Endeca (www.endeca.com).
instances of movies to similar movies, actors, and companies. With the help of
navigators, the user is assisted in exploratory search (the user may have only a vague
idea of his needs and the target documents are not known).
The navigational rules state which information from the ontology should be used
to interpret queries, retrieve relevant information from the ontology, and construct
drill-down semantic navigators. For example, the system retrieves The Untouchables
movie to respond to a james bond mafia movies query because the ontology
determines that this movie deals with the mafia and includes Sean Connery, an actor
who played James Bond. In a search for Brangelina movies, the system presents the
Mr. and Mrs. Smith movie because the main characters are played by Brad Pitt and
Angelina Jolie and the ontology defines Brangelina as this group of two actors. To
date, these rules have been manually crafted and are domain-specific, which is sufficient
for the concrete task of creating a flexible and intelligent movie search interface.
Whether it will be possible to partially automate the generation of such rules for
this and other domains remains to be seen.
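To make the two query examples concrete, the following Python sketch shows how hand-written rules might map query phrases to ontology constraints. It is an illustration under stated assumptions, not the SmartSearch rule system: the dictionaries `MOVIES`, `CHARACTERS`, and `GROUPS`, the topic labels, and the functions `interpret` and `find_movies` are all hypothetical.

```python
# Hypothetical mini-ontology; structure, names, and topic labels are assumptions.
MOVIES = {
    "The Untouchables": {"actors": {"Sean Connery", "Kevin Costner"},
                         "topics": {"mafia"}},
    "Mr. and Mrs. Smith": {"actors": {"Brad Pitt", "Angelina Jolie"},
                           "topics": {"assassins"}},
}
CHARACTERS = {"james bond": {"Sean Connery"}}             # character -> actors who played it
GROUPS = {"brangelina": {"Brad Pitt", "Angelina Jolie"}}  # group -> member actors

def interpret(query):
    """Map query phrases to ontology constraints using hand-written rules."""
    q = query.lower()
    constraints = {"any_actor": set(), "all_actors": set(), "topics": set()}
    for name, actors in CHARACTERS.items():
        if name in q:   # rule: a character name stands for the actors who played it
            constraints["any_actor"] |= actors
    for name, members in GROUPS.items():
        if name in q:   # rule: a group name requires all of its members
            constraints["all_actors"] |= members
    for movie in MOVIES.values():
        for topic in movie["topics"]:
            if topic in q:  # rule: a known topic word restricts the movie topic
                constraints["topics"].add(topic)
    return constraints

def find_movies(query):
    """Return titles of movies that satisfy every constraint derived from the query."""
    c = interpret(query)
    hits = []
    for title, movie in MOVIES.items():
        if c["any_actor"] and not (c["any_actor"] & movie["actors"]):
            continue
        if not c["all_actors"] <= movie["actors"]:
            continue
        if not c["topics"] <= movie["topics"]:
            continue
        hits.append(title)
    return hits

bond_hits = find_movies("james bond mafia movies")
brangelina_hits = find_movies("Brangelina movies")
```

Each rule is tied to one kind of ontology relation, which is why such rules remain domain-specific: a different domain would need different relations and different rules.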
13.8 Temporal Aspects of Search
Normally, search applications assume that documents are time-independent and
express content that is meaningful and valid over time. Users do not normally
Figure 13.16 Navigational search with SmartSearch.
worry about dates of documents. In some cases, though, dates may severely affect
document retrieval and ranking; for example, a user may be seeking only published
information or information from particular time periods, or a user may post queries that
use terms that were not in use at the time of publication, even though the
document content is relevant to the query.
One observation with ranking based on cosine similarity and PageRank is
that older documents tend to receive higher scores than newer ones. Since older
documents are more likely to be known to a large community, often more links
lead to them, giving them a higher PageRank and thereby higher relevance score
than recently published documents. For developing domains like scientific
disciplines in which new information tends to be more reliable and relevant than
old information, this effect of PageRank may be unfortunate. Yu et al. (2004)
present a timed page rank score TPR that boosts documents that have been
recently published:
$$\mathrm{TPR}(A) = \mathrm{Aging}(A) \cdot PR_T(A)$$

$$PR_T(A) = (1 - d) + d \left( \frac{w_1 \, PR_T(p_1)}{C(p_1)} + \cdots + \frac{w_n \, PR_T(p_n)}{C(p_n)} \right)$$
where $PR_T(A)$ is the time-weighted PageRank score of paper $A$, $PR_T(p_i)$ is the
time-weighted PageRank score of paper $p_i$ that links to paper $A$, $C(p_i)$ is the number of
outbound links of paper $p_i$, and $d$ is a damping factor set to 0.85. A time weight,
$w_i$, has a value that reduces exponentially with citation age. $\mathrm{Aging}(A)$ is an aging
factor for paper $A$ that is set to 1 for brand new papers and declines linearly with
time down to 0.5. Initial experiments with papers about particle physics suggest
that the new score better reflects the relevance of new papers and better captures
the likelihood that later papers include citations to them.
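A small Python sketch may help to show how the two factors interact. It is not the implementation of Yu et al. (2004): the decay rate of the time weight, the time span over which the aging factor drops from 1 to 0.5, the use of the citing paper's age as the citation age, and the fixed number of iterations are all assumptions made for illustration.

```python
import math

def aging(age_years, max_age=10.0):
    """Aging factor: 1 for a brand new paper, declining linearly to a floor of 0.5.
    The 10-year span is an assumed parameter."""
    return max(0.5, 1.0 - 0.5 * age_years / max_age)

def timed_pagerank(cites, age, d=0.85, decay=0.5, iterations=50):
    """cites maps each paper to the papers it cites; age maps each paper to its age in years.
    The citation age is taken to be the age of the citing paper (an assumption)."""
    papers = list(age)
    out_degree = {p: len(cites.get(p, [])) for p in papers}
    incoming = {p: [] for p in papers}          # incoming[a] = papers that cite a
    for p, targets in cites.items():
        for a in targets:
            incoming[a].append(p)
    pr = {p: 1.0 for p in papers}
    for _ in range(iterations):
        # Each citation contributes w_i * PR_T(p_i) / C(p_i), with w_i = exp(-decay * age).
        pr = {
            a: (1 - d) + d * sum(
                math.exp(-decay * age[p]) * pr[p] / out_degree[p]
                for p in incoming[a]
            )
            for a in papers
        }
    return {a: aging(age[a]) * pr[a] for a in papers}

# Hypothetical citation graph: "new" cites both older papers, "mid" cites "old".
cites = {"new": ["mid", "old"], "mid": ["old"], "old": []}
age = {"new": 0.0, "mid": 5.0, "old": 10.0}
scores = timed_pagerank(cites, age)
```

In this toy graph the oldest paper collects the most citations, yet the aging factor and the down-weighted old citation leave it with a lower final score than the five-year-old paper.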
More generally, a user may want to retrieve documents from particular time
periods. Systems like GEIN support this by storing temporal data as part of the
metadata of documents (see Figure 13.17). Every document or catalog item in
GEIN is associated with a number of valid time periods, a number of semantic
categories, a number of geographical locations, and a specific item class (Tochtermann
et al., 1997). When searching for information sources, a user may add a temporal
restriction to the query that specifies the relevant period, e.g., all documents about
emission levels in Munich from 1990 to 1995. Grandi et al. (2005) developed a
similar approach, but included a more fine-grained temporal system with attributes
for validity time, efficacy time, transaction time, and publication time.
If a document date is not known, it may be feasible to estimate it using temporal
mining or information extraction. Statistical techniques for estimating dates
of documents were suggested by Kanhabua and Nørvåg (2009) and de Jong et al.
(2005). These methods rely on extensive training data and are suitable for documents
that contain no direct or indirect time indications. Sometimes temporal phrases in
documents can be extracted and interpreted with appropriate information extraction
techniques. Kalczynski and Chou (2005) define a t-zoidal fuzzy time representation
that can model both direct time stamps such as 1990 and temporal content
of phrases (last month, several days ago). By adding the fuzzy time representations
of every time reference extracted from a document, the system can compute a
tf-idf-like score for every calendar date relevant to the document. A query is assumed to
contain an exact date and the system will calculate date similarities between the
query and all documents following a standard cosine similarity. Unlike temporal
mining, the fuzzy approach allows every document to be relevant to different time
periods with different weights. If no proper time indication is cited in the text,
however, the approach will fail and temporal mining will be the only option.
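The general idea can be sketched in Python as follows. This is not the formulation of Kalczynski and Chou (2005): a plain trapezoid-shaped membership function is assumed here as a stand-in for their fuzzy time representation, the memberships are simply summed without any idf-style normalization, and the functions `trapezoid`, `date_weights`, and `date_similarity` as well as the sample time references are hypothetical.

```python
import math
from datetime import date, timedelta

def trapezoid(day, a, b, c, d):
    """Membership of `day` in a fuzzy period: 0 outside [a, d], 1 on [b, c],
    rising linearly on [a, b) and falling linearly on (c, d]."""
    if day < a or day > d:
        return 0.0
    if b <= day <= c:
        return 1.0
    if day < b:
        return (day - a).days / (b - a).days
    return (d - day).days / (d - c).days

def date_weights(references, calendar):
    """Sum the memberships of all extracted time references for each calendar day."""
    return {day: sum(trapezoid(day, *ref) for ref in references) for day in calendar}

def date_similarity(weights, query_day):
    """Cosine similarity between the document's date vector and a one-hot query date."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return weights.get(query_day, 0.0) / norm if norm else 0.0

# Hypothetical document with two time references in January 2005.
calendar = [date(2005, 1, 1) + timedelta(days=i) for i in range(31)]
references = [
    (date(2005, 1, 10),) * 4,  # crisp time stamp: full membership on a single day
    (date(2005, 1, 12), date(2005, 1, 14),
     date(2005, 1, 17), date(2005, 1, 19)),  # vague phrase such as "several days ago"
]
weights = date_weights(references, calendar)
sim_core = date_similarity(weights, date(2005, 1, 15))
sim_edge = date_similarity(weights, date(2005, 1, 13))
sim_none = date_similarity(weights, date(2005, 1, 25))
```

Because the query is an exact date, the cosine similarity reduces to the document's weight on that day divided by the length of its date vector, so a document can be partially relevant to several dates at once.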
The examples above assume that the terminology relevant to queries does not
change over time. The same query terms would be used for retrieving information
about the same topic at different time points, independently of any terminological
changes in the domain. Obviously, this is a simplification that is not satisfactory
for searches of evolving or unstable domains. For example, using Berlin as a search
term to look for information about the capital of Germany in the 1970s would not
work because Bonn was the capital of West Germany until reunification in 1990.
With the ontologies at hand, we can model the validity of terminology over
time. In Eder and Koncilia’s approach (2004), every concept is modeled in OWL
and, when necessary, associated with a start date and/or end date. The OWL
specification below shows how they model the fact that class DegreeCourseSchema is a
valid concept from 1 January 1990:
Figure 13.17 Valid time periods associated with documents (catalog items) in
GEIN. The diagram shows a catalog item together with gazetteer objects, thesaurus
terms, classification objects, and time periods, connected by four relationship
types: (a) semantic, (b) geographical, (c) temporal, and (d) classification.