Chapter 13: Semantics and Search (5/7)



2007; García et al., 2003; Moench et al., 2003), and they usually use standardized taxonomies to provide drill-down functionality on the result page. The hierarchical links shown in Figure 13.13 for Squirrel are semantic categories used to classify the documents at index time.

13.7.2 Faceted Search

In faceted search applications, we use multiple classification schemes to describe a range of document dimensions. The method provides a more fine-grained description than those offered by taxonomies, although the same dimensions can also be modeled within extensive ontologies. Normally, the relevant facets are derived from an analysis of the document text, so that aggregated views of these facets are available when the documents are retrieved and presented as the result set to a query.

Figure13.14 illustrates some of the facets used to describe document result

sets in Endeca’s topic search. e initial result set is listed at right and the user is

encouraged to drill into the result set using the facets listed at left. Choosing for

example, futures trading in organization and the Beverly Hills location, the system

will display only documents that satisfy the query and deal with futures trading

and Beverly Hills. Ben-Yitzhak et al. (2008) present an extended faceted search

application in which business intelligence features are added to the standard drill-

down functions of faceted search. Another system from Hyvönen et al. (2003)

builds facets from ontology concepts and combines them with recommendation

functionality.
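The drill-down behavior described above can be illustrated with a minimal Python sketch. It is not Endeca's implementation: the result set, the document ids, and the function names `facet_counts` and `drill_down` are hypothetical, and facets are assumed to be stored as lists of values per document.

```python
from collections import Counter

# Hypothetical result set for some query; the facet values are illustrative only.
RESULTS = [
    {"id": 1, "facets": {"organization": ["futures trading"], "location": ["Beverly Hills"]}},
    {"id": 2, "facets": {"organization": ["futures trading"], "location": ["Chicago"]}},
    {"id": 3, "facets": {"organization": ["retail banking"], "location": ["Beverly Hills"]}},
]


def facet_counts(docs, facet):
    """Aggregate view of one facet: how many documents carry each value."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["facets"].get(facet, []))
    return counts


def drill_down(docs, selections):
    """Keep only documents that carry every selected facet value."""
    return [
        doc
        for doc in docs
        if all(value in doc["facets"].get(facet, []) for facet, value in selections.items())
    ]


narrowed = drill_down(
    RESULTS, {"organization": "futures trading", "location": "Beverly Hills"}
)
print([doc["id"] for doc in narrowed])
print(facet_counts(RESULTS, "location"))
```

The counts returned by `facet_counts` correspond to the aggregated views shown beside a result list, and each call to `drill_down` corresponds to one click on a facet value.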

13.7.3 Ontology Rules and Navigators

In the SmartSearch project of Deutsche Telekom Laboratories (Burkhardt et al., 2008), a technique similar to faceted search is used to interpret and refine posted search queries. Within a limited domain, so-called navigators offer drill-down functionality based on ontological category classification. Figure 13.15 shows the results from posting a new bruce willis query. Alongside the standard ranked list of documents under pages, several recent movies starring Bruce Willis are listed in the movie navigator on the left side. An internal rule system uses a movie ontology to match bruce willis to an instance of the actor concept and new to movies that are less than 5 years old. If a user clicks on the Live Free or Die Hard link under movies, the output displayed in Figure 13.16 appears. The result page now consists of documents referring to this particular movie, while the navigators extend to movies that are similar (based on a tf-idf ranking), actors who appear in the movie, and companies connected to it. Again, this technique is based on a navigational rule that connects instances of movies to similar movies, actors, and companies. With the help of navigators, the user is assisted in exploratory search, where the user may have only a vague idea of their needs and the target documents are not known.

Figure 13.13 Links for drilling into search result set in Squirrel.

Figure 13.14 Faceted search with Endeca (www.endeca.com).

Figure 13.15 Navigational search with SmartSearch.

368 ◾ Jon Atle Gulla et al.

The navigational rules state which information from the ontology should be used to interpret queries, retrieve relevant information from the ontology, and construct drill-down semantic navigators. For example, the system retrieves The Untouchables movie to respond to a james bond mafia movies query because the ontology determines that this movie deals with the mafia and includes Sean Connery, an actor who played James Bond. In a search for Brangelina movies, the system presents the Mr. and Mrs. Smith movie because the main characters are played by Brad Pitt and Angelina Jolie and the ontology defines Brangelina as this group of two actors. To date, these rules have been manually crafted and are domain-specific, sufficient for the concrete task of creating a flexible and intelligent movie search interface. Whether it will be possible to partially automate the generation of such rules for this and other domains remains to be seen.
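To make the two query examples concrete, here is a small Python sketch of hand-written rules over a toy movie ontology. It is not the SmartSearch rule system: the dictionaries `ROLE_ACTORS`, `GROUPS`, `TOPICS`, and `MOVIES`, the substring matching, and the functions `interpret` and `search` are all assumptions made for illustration, and the ontology is deliberately tiny.

```python
# Toy ontology (hypothetical structure, a handful of entries only).
ROLE_ACTORS = {"james bond": {"Sean Connery"}}              # role -> actors who played it
GROUPS = {"brangelina": {"Brad Pitt", "Angelina Jolie"}}    # group name -> members
TOPICS = {"mafia", "espionage"}
MOVIES = {
    "The Untouchables": {"actors": {"Sean Connery"}, "topics": {"mafia"}},
    "Goldfinger": {"actors": {"Sean Connery"}, "topics": {"espionage"}},
    "Mr. and Mrs. Smith": {"actors": {"Brad Pitt", "Angelina Jolie"}, "topics": set()},
}


def interpret(query):
    """Map query phrases to ontology constraints using simple substring rules."""
    text = query.lower()
    any_of = [actors for role, actors in ROLE_ACTORS.items() if role in text]
    all_of = set()
    for group, members in GROUPS.items():
        if group in text:
            all_of |= members
    topics = {topic for topic in TOPICS if topic in text}
    return any_of, all_of, topics


def search(query):
    """Return movies that satisfy every constraint derived from the query."""
    any_of, all_of, topics = interpret(query)
    if not (any_of or all_of or topics):
        return []  # nothing in the query matched the ontology
    hits = []
    for title, movie in MOVIES.items():
        has_role_actor = all(actors & movie["actors"] for actors in any_of)
        has_group = all_of <= movie["actors"]
        has_topics = topics <= movie["topics"]
        if has_role_actor and has_group and has_topics:
            hits.append(title)
    return hits


print(search("james bond mafia movies"))
print(search("Brangelina movies"))
```

A role phrase is satisfied by any one actor who played the role, whereas a group name requires all of its members, which mirrors the difference between the James Bond and Brangelina examples in the text.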

13.8 Temporal Aspects of Search

Normally, search applications assume that documents are time-independent and express content that is meaningful and valid over time. Users do not normally worry about dates of documents. In some cases, though, dates may severely affect document retrieval and ranking. For example, a user may be seeking only published information or information from particular time periods, or may post queries that use terms that were not in use at the time of publication, even though the document content is relevant to the query.

Figure 13.16 Navigational search with SmartSearch.

One observation with ranking based on cosine similarity and PageRank is that older documents tend to receive higher scores than newer ones. Since older documents are more likely to be known to a large community, often more links lead to them, giving them a higher PageRank and thereby higher relevance score than recently published documents. For developing domains like scientific disciplines in which new information tends to be more reliable and relevant than old information, this effect of PageRank may be unfortunate. Yu et al. (2004) present a timed page rank score TPR that boosts documents that have been recently published:

$$TPR(A) = Aging(A) \cdot PR_T(A)$$

$$PR_T(A) = (1 - d) + d \left( \frac{w_1 \cdot PR_T(p_1)}{C(p_1)} + \cdots + \frac{w_n \cdot PR_T(p_n)}{C(p_n)} \right)$$

where $PR_T(A)$ is the time-weighted PageRank score of paper A, $PR_T(p_i)$ is the time-weighted PageRank score of paper $p_i$ that links to paper A, $C(p_i)$ is the number of outbound links of paper $p_i$, and d is a dampening factor set to 0.85. A time weight, $w_i$, has a value that reduces exponentially with citation age. Aging(A) is an aging factor for paper A set to 1 for brand new papers, and which declines linearly with time down to 0.5. Initial experiments with papers about particle physics suggest that the new score better reflects the relevance of new papers and better captures the likelihood that later papers include citations to them.
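The following Python sketch shows how such a score could be computed by simple iteration. It is an illustration under stated assumptions, not the procedure of Yu et al.: the text gives neither the base of the exponential decay nor the time span of the linear aging, so the `decay` value of 0.5 per year and the `horizon` of 10 years are invented here, as are the paper names and the helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source: str       # citing paper
    target: str       # cited paper
    age_years: float  # how long ago the citation was made


def time_weight(age_years, decay=0.5):
    """Exponentially decreasing citation weight (decay base is an assumption)."""
    return decay ** age_years


def aging(paper_age_years, horizon=10.0):
    """1.0 for a brand new paper, falling linearly to 0.5 at the assumed horizon."""
    fraction = min(max(paper_age_years / horizon, 0.0), 1.0)
    return 1.0 - 0.5 * fraction


def time_weighted_pagerank(papers, citations, d=0.85, iterations=50):
    """Iterate PR_T(A) = (1 - d) + d * sum(w_i * PR_T(p_i) / C(p_i))."""
    out_links = {p: 0 for p in papers}
    for c in citations:
        out_links[c.source] += 1
    pr = {p: 1.0 for p in papers}
    for _ in range(iterations):
        pr = {
            a: (1 - d) + d * sum(
                time_weight(c.age_years) * pr[c.source] / out_links[c.source]
                for c in citations
                if c.target == a
            )
            for a in papers
        }
    return pr


def timed_page_rank(paper_ages, citations):
    """TPR(A) = Aging(A) * PR_T(A) for every paper."""
    pr = time_weighted_pagerank(list(paper_ages), citations)
    return {p: aging(age) * pr[p] for p, age in paper_ages.items()}


# Hypothetical data: one paper cites an old paper (long ago) and a new one (just now).
ages = {"old": 10.0, "new": 0.0, "citer": 1.0}
links = [Citation("citer", "old", 8.0), Citation("citer", "new", 0.0)]
scores = timed_page_rank(ages, links)
print(scores)
```

With these inputs the recent paper ends up ahead of the old one even though both receive exactly one citation from the same source, which is the boosting effect the score is meant to produce.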

More generally, a user may want to retrieve documents from particular time periods. Systems like GEIN support this by storing temporal data as part of the metadata of documents (see Figure 13.17). Every document or catalog item in GEIN is associated with a number of valid time periods, a number of semantic categories, a number of geographical locations, and a specific item class (Tochtermann et al., 1997). When searching for information sources, a user may add a temporal restriction to the query that specifies the relevant period, e.g., all documents about emission levels in Munich from 1990 to 1995. Grandi et al. (2005) developed a similar approach, but included a more fine-grained temporal system with attributes for validity time, efficacy time, transaction time, and publication time.
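A temporal restriction of this kind can be sketched in a few lines of Python. This is not GEIN's data model or query interface: the `CatalogItem` class, the sample reports, and the choice to treat a document as matching when any of its valid periods overlaps the requested period (rather than lying fully inside it) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogItem:
    title: str
    categories: set
    locations: set
    valid_periods: list  # list of (start_year, end_year) pairs, inclusive


def overlaps(period, start, end):
    """True if an inclusive (start, end) period intersects the requested range."""
    p_start, p_end = period
    return p_start <= end and start <= p_end


def search(items, category, location, start, end):
    """Filter by category and location, then apply the temporal restriction."""
    return [
        item.title
        for item in items
        if category in item.categories
        and location in item.locations
        and any(overlaps(p, start, end) for p in item.valid_periods)
    ]


# Hypothetical catalog items.
ITEMS = [
    CatalogItem("Report A", {"emission levels"}, {"Munich"}, [(1988, 1992)]),
    CatalogItem("Report B", {"emission levels"}, {"Munich"}, [(1996, 1999)]),
    CatalogItem("Report C", {"emission levels"}, {"Hamburg"}, [(1990, 1995)]),
]

hits = search(ITEMS, "emission levels", "Munich", 1990, 1995)
print(hits)
```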

If a document date is not known, it may be feasible to estimate it using temporal mining or information extraction. Statistical techniques for estimating dates of documents were suggested by Kanhabua and Nørvåg (2009) and de Jong et al. (2005). These methods rely on extensive training data and are suitable for documents that contain no direct or indirect time indications. Sometimes temporal phrases in documents can be extracted and interpreted with appropriate information extraction techniques. Kalczynski and Chou (2005) define a t-zoidal fuzzy time representation that can model both direct time stamps such as 1990 and temporal content of phrases (last month, several days ago). By adding the fuzzy time representations of every time reference extracted from a document, the system can compute a tf-idf-like score for every calendar date relevant to the document. A query is assumed to contain an exact date and the system will calculate date similarities between the query and all documents following a standard cosine similarity. Unlike temporal mining, the fuzzy approach allows every document to be relevant to different time periods with different weights. If no proper time indication is cited in the text, however, the approach will fail and temporal mining will be the only option.
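The idea of summing fuzzy time references and comparing the result with an exact query date can be sketched as follows. This is a simplified illustration, not the model of Kalczynski and Chou: it assumes each time reference has already been turned into a trapezoid over calendar days, it leaves out any idf-style weighting across the collection, and the reading of "last month" with three-day shoulders is invented for the example.

```python
import math
from datetime import date


def trapezoid(x, a, b, c, d):
    """Membership of day x in a trapezoidal fuzzy interval with a <= b <= c <= d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


def date_profile(references):
    """Sum the memberships of all time references for every calendar day."""
    profile = {}
    for a, b, c, d in references:
        for day in range(a, d + 1):
            profile[day] = profile.get(day, 0.0) + trapezoid(day, a, b, c, d)
    return profile


def date_similarity(query_day, profile):
    """Cosine similarity between a one-day query vector and the document profile."""
    norm = math.sqrt(sum(w * w for w in profile.values()))
    if norm == 0.0:
        return 0.0
    return profile.get(query_day, 0.0) / norm


# Hypothetical reading of "last month" in a document published on 2005-03-15:
# fully true for February 2005, with three-day shoulders on either side.
last_month = tuple(
    day.toordinal()
    for day in (date(2005, 1, 29), date(2005, 2, 1), date(2005, 2, 28), date(2005, 3, 3))
)
profile = date_profile([last_month])

core = date_similarity(date(2005, 2, 10).toordinal(), profile)
edge = date_similarity(date(2005, 1, 30).toordinal(), profile)
outside = date_similarity(date(2004, 1, 1).toordinal(), profile)
print(core, edge, outside)
```

Because the query vector has a single nonzero entry, the cosine reduces to the profile weight on the query date divided by the length of the profile vector, so a date inside the core of the interval scores higher than one on its shoulder.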

The examples above assume that the terminology relevant to queries does not change over time. The same query terms would be used for retrieving information about the same topic at different time points, independently of any terminological changes in the domain. Obviously, this is a simplification that is not satisfactory for searches of evolving or unstable domains. For example, using Berlin as a search term to look for information about the capital of Germany in the 1970s would not work because Bonn was the capital of West Germany until reunification in 1990.
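The Berlin and Bonn example can be made concrete with a small Python sketch that attaches a validity interval to each term and picks the term to search with for a given year. The `TermValidity` class, the concept label, and the open-ended start for Bonn are assumptions of this sketch, and the table ignores the separate situation of East Germany.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TermValidity:
    term: str
    concept: str
    start_year: Optional[int]  # None means no known start in this sketch
    end_year: Optional[int]    # None means still valid


# Toy table based on the example in the text.
TABLE = [
    TermValidity("Bonn", "capital of Germany", None, 1990),
    TermValidity("Berlin", "capital of Germany", 1990, None),
]


def terms_for(concept, year, table):
    """Return the terms that denote the concept in the given year."""
    return [
        entry.term
        for entry in table
        if entry.concept == concept
        and (entry.start_year is None or entry.start_year <= year)
        and (entry.end_year is None or year <= entry.end_year)
    ]


print(terms_for("capital of Germany", 1975, TABLE))
print(terms_for("capital of Germany", 2000, TABLE))
```

A search front end could use such a lookup to rewrite or expand a query about the 1970s so that it matches documents written with the terminology of that period.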

With the ontologies at hand, we can model the validity of terminology over time. In Eder and Koncilia's approach (2004), every concept is modeled in OWL and, when necessary, associated with a start date and/or end date. The OWL specification below shows how they model the fact that class DegreeCourseSchema is a valid concept from 1 January 1990:

Figure 13.17 Valid time periods associated with documents (catalog items) in GEIN. [Diagram elements: catalog item, gazetteer object, thesaurus term, classification object, time period. Relationship types: (a) semantic, (b) geographical, (c) temporal, (d) classification.]
