Chapter 13: Semantics and Search (2/7)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

352 ◾ Jon Atle Gulla et al.

13.3 Layered Model of Semantic Search

e architecture of a search application is composed of components for indexing,

querying, searching and ranking, result presentation, and result navigation. In a

traditional search application, the query terms are matched against an index of all

terms in the documents. A result page with ranked links to all the retrieved docu-

ments is presented to a user. Some applications oer navigational features to help

a user specify a new and more precise query or drill down into the result set. All

these stages operate on terms that are, by denition, meaningless to the application

itself.

We can imagine a similar search process at a semantic layer (see Figure13.4).

Documents are indexed by means of semantic structures; queries include refer-

ences to concepts, instances, and their relationships; retrieved documents are

described by semantic content; and navigational links address semantic clas-

sication or renement of result sets. However, in a real semantic search appli-

cation, only some of the search stages operate at this semantic level and the

others follow a traditional syntactic approach. We classify a search application

as semantic if at least one of the stages of the search process establishes a cor-

respondence between the syntactic and semantic realms. is correspondence is

facilitated by an underlying ontology or other mechanism that exposes semantic

aspects of text.

Common to all semantic applications is an explicit or implicit understanding of

how ontological concepts or instances are referred to in the real world. Ontologies

may be linked to terms in documents and queries in four basic ways:

◾ Syntactic correspondence: Concept names are assumed to be identical with the

terms used to refer to them. For example, the movie concept is assumed to be cited

by the same term in the real world as well.

Reasoning

Class

instances

Glossary/

WordNet

Morpho-

syntactic

Keyword

Car Cars Automobile Honda

..the engine

is used to run

a vehicle..

Figure 13.3 Which documents are relevant to the car query?

Semantics and Search ◾ 353

◾ Morpho-syntactic correspondence: Concept names are assumed to be identical to

the lemmas of the terms used to refer to them. A concept like movie will then be

referred to by movie, movies, movie’s, etc.

◾ Synonym correspondence: Concept names are linked to a set of synonyms that may

all be used to refer to the concept. For a concept like movie, a synonym set may

contain movie, lm, and motion picture.

◾ Relative correspondence: is method involves relative relatedness among concepts

and terms. A relative score, usually between 0 and 1, indicates to what extent a

specic term is used to refer to a specic concept. A similar score reveals the likeli-

hood that a concept is materialized by a particular term in the text. For example,

bank refers to the concept of river banks with a score of 0.2 and to nancial banks

with a score of 0.6.

We now discuss the various approaches or techniques of semantic search-

ing. Extending existing frameworks (Hildebrand et al., 2007; Mäkelä, 2005;

Mangold, 2007), we classify semantic search applications with respect to their

approach to index matching, query processing, result presentation, and navi-

gation. Many search applications use several semantic strategies; for reasons of

simplicity we discuss them in isolation. We will also cover aspects of temporal

search because these applications require a semantics not readily available from

current ontologies.

13.4 Index Matching Approaches

Most semantic search applications make use of standard indexing mechanisms

from the vector space model. y utilize both content- and link-related relevance

calculations.

Ontology

Listof terms

Search & Rank NavigationIndexIndexing

Presentation

Querying

Pages

Syntactic layer

Semantic layer

Figure 13.4 Search process and representations.

354 ◾ Jon Atle Gulla et al.

Content-related relevance—Most search applications use a variant of the tf-

idf (term frequency–inverse document frequency) weight and the cosine similarity

score to calculate the relevance of a document to a query. Every document and

query is given a vector representation, (w

, w

,…, w

), where the weight w

of each

term i in the vector of document j is given by the tf-idf score: w

= tf

*idf

= (j/

max )*log(N/n).tf

is the frequency of term i in j, max f

is the max frequency of

any term in j, N is the number of documents in the collection, and n is the num-

ber of documents that contain i. A document is deemed relevant to a query if the

cosine similarity of the document and query vector representations (views) is above

a certain threshold:

similarity Q D cos

Q D

( , ) ( )= =

∗

where Q and D are vector representations of the query and the document, and ||Q||

and ||D|| are the lengths of the vectors.

Link-related relevance—Web search applications also make use of link-

related information using techniques similar to Google’s Page rank algo-

rithm (Page et al., 1997) or the Hypertext Induced Topic Selection (HITS)

of Kleinberg (1999). PageRank is a probability distribution that models the

likelihood that a person randomly clicking on links will arrive at a particular

page. e idea is that links from other high-quality pages to a particular page

count as a vote of condence in that page and indicate the particular page is

more important than other pages with fewer good links. e Page rank of a web

page is given by:

PR u

PR v

L v

v B

( )

∈

∑

where PR(u) is the page rank of page u, B

is the set of all pages linking to u, and

L(v) is the number of links from v to other pages. When ranking documents with

respect to a query, we boost documents with high Page rank scores, assuming that

they have higher general quality than low score documents.

In addition, semantic search may involve graph traversals, semantic struc-

ture matching, or reasoning with implicit information. All these approaches

assume that semantic aspects of documents can be extracted and used to set up

semantically structured indices instead of or in addition to traditional indices.

To what extent Page rank and cosine similarity can be applied to semantic indi-

ces remains unclear, although attempts have been made to dene new ranking

schemes like ReConRank to extend Page rank into the semantic realm (Hogan

et al., 2006).

Semantics and Search ◾ 355

13.4.1 Named Entity Matching

Named entity recognition has been used to recognize occurrences of entity refer-

ences in documents either by looking up carefully crafted entity dictionaries or by

matching document text against regular expressions for types of entities. A regular

expression can, for example, indicate that a word starting with a capital letter and

followed by an Ltd. abbreviation is the name of a company. e range of entity types

(person names, locations, dates) recognized is usually restricted and the types yield

a minimal semantic characterization of entities. If an entity like George W. Bush is

recognized at the beginning of a document, short forms of that entity (George Bush

or simply Bush) may also be recognized.

e recognized entity references may go to a semantic index or be normalized for

a traditional inverted index. Normalization means that all variants of an entity (like

George W. Bush, President Bush, George Bush, and Bush) are combined and we cal-

culate a total frequency for the entity rather than for each separate entity reference.

Since the entities are typed, this algorithm can support entity searches based on type

indications. For example, a query such as Q = LOCATION Deutsche Telekom may list

and rank all entities of type location that are prominent in documents citing Deutsche

Telekom. Presumably, the ranking should show cities and countries in which Deutsche

Telekom is located or does business. Special named entity indices can be found in sev-

eral search applications (Amaral et al., 2004; Duke et al., 2007; Kiryakov et al., 2004).

13.4.2 Graph Traversal

If a semantic index is set up and a query is mapped onto conceptual classes, the

user’s query may correspond to nodes in a graph-structured index. Finding relevant

information means traversing a graph on the basis of the user’s selected nodes,

certain constraints, and some general search strategies. One traversal strategy may,

for example, specify how the graph is traversed to nd interesting instances of a

particular class.

Graph-structured indices are used most often when the documents already have

some inherent uniform structure, e.g., a particular XML format. It is not obvious

that these structures are semantically sound or meaningful and they may not make

much sense to a user. e approach tends to be more useful in professional settings

in which the documents and index structures are understood and respected. e

semantics are given by conventions in the community and may not be explicitly

dened with ontology languages.

Graph-traversing strategies have been used in a number of search systems for

XML documents: XSearch from Cohen et al. (2003) and XRANK from Guo et

al. (2003). Recent systems like Tabulator (Berners-Lee et al., 2006) and Swoogle

(Li et al., 2004) use similar strategies for RDF documents. e SSARK system of

Anyanwu et al. (2005) uses a congurable ranking algorithm for ranking the asso-

ciations between entities in the result set.

356 ◾ Jon Atle Gulla et al.

13.4.3 Conceptual Matching

e conceptual matching approach requires that both documents and queries

be specied at a conceptual (semantic) level using similar conceptual structures.

During retrieval, a document is deemed relevant to a query if its conceptual content

subsumes the conceptual content of the query. Since there are no vector calcula-

tions of similarity scores, how relevant documents are to be ranked in the result set

is not obvious.

If a document has semantic annotations that can be used to construct seman-

tic indices, the search process is a matching of semantic query structures against

semantic document structures. Separate concept indices are used in several search

engines (Bonino et al., 2003; Celino et al., 2006; Davies et al., 2004; Finin et al.,

2005; Hein and Hendler, 2000; Lei et al., 2006). As shown in Figure13.5 from

Hein and Hendler (2000), the SHOE search application asks the user to select

an ontology before posting a query. Available categories from the ontology are pre-

sented to the user, and the user selects the relevant ones and species the details

(properties) of the query. e categories and properties are all reected in the docu-

ment index. is strategy is in many ways similar to query reformulation strategies

Figure 13.5 SHOE search interface. (Source: Heﬂin, J. and J. Hendler. In Artiﬁcial

Intelligence for Web Search: Papers from AAAI Workshop, 2000, AAAI Press.

pp. 35–40.)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 13: Semantics and Search (2/7)

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 13: Semantics and Search (2/7)