352 ◾  Jon Atle Gulla et al.
13.3 Layered Model of Semantic Search
e architecture of a search application is composed of components for indexing,
querying, searching and ranking, result presentation, and result navigation. In a
traditional search application, the query terms are matched against an index of all
terms in the documents. A result page with ranked links to all the retrieved docu-
ments is presented to a user. Some applications oer navigational features to help
a user specify a new and more precise query or drill down into the result set. All
these stages operate on terms that are, by denition, meaningless to the application
itself.
We can imagine a similar search process at a semantic layer (see Figure13.4).
Documents are indexed by means of semantic structures; queries include refer-
ences to concepts, instances, and their relationships; retrieved documents are
described by semantic content; and navigational links address semantic clas-
sication or renement of result sets. However, in a real semantic search appli-
cation, only some of the search stages operate at this semantic level and the
others follow a traditional syntactic approach. We classify a search application
as semantic if at least one of the stages of the search process establishes a cor-
respondence between the syntactic and semantic realms. is correspondence is
facilitated by an underlying ontology or other mechanism that exposes semantic
aspects of text.
Common to all semantic applications is an explicit or implicit understanding of
how ontological concepts or instances are referred to in the real world. Ontologies
may be linked to terms in documents and queries in four basic ways:
Syntactic correspondence: Concept names are assumed to be identical with the
terms used to refer to them. For example, the movie concept is assumed to be cited
by the same term in the real world as well.
Reasoning
Class
instances
Glossary/
WordNet
Morpho-
syntactic
search
Keyword
search
Car Cars Automobile Honda
..the engine
is used to run
a vehicle..
Figure 13.3 Which documents are relevant to the car query?
Semantics and Search ◾  353
Morpho-syntactic correspondence: Concept names are assumed to be identical to
the lemmas of the terms used to refer to them. A concept like movie will then be
referred to by movie, movies, movie’s, etc.
Synonym correspondence: Concept names are linked to a set of synonyms that may
all be used to refer to the concept. For a concept like movie, a synonym set may
contain movie, lm, and motion picture.
Relative correspondence: is method involves relative relatedness among concepts
and terms. A relative score, usually between 0 and 1, indicates to what extent a
specic term is used to refer to a specic concept. A similar score reveals the likeli-
hood that a concept is materialized by a particular term in the text. For example,
bank refers to the concept of river banks with a score of 0.2 and to nancial banks
with a score of 0.6.
We now discuss the various approaches or techniques of semantic search-
ing. Extending existing frameworks (Hildebrand et al., 2007; kelä, 2005;
Mangold, 2007), we classify semantic search applications with respect to their
approach to index matching, query processing, result presentation, and navi-
gation. Many search applications use several semantic strategies; for reasons of
simplicity we discuss them in isolation. We will also cover aspects of temporal
search because these applications require a semantics not readily available from
current ontologies.
13.4 Index Matching Approaches
Most semantic search applications make use of standard indexing mechanisms
from the vector space model. y utilize both content- and link-related relevance
calculations.
Ontology
Listof terms
Search & Rank NavigationIndexIndexing
Presentation
Querying
Pages
Syntactic layer
Semantic layer
on
Figure 13.4 Search process and representations.
354 ◾  Jon Atle Gulla et al.
Content-related relevanceMost search applications use a variant of the tf-
idf (term frequencyinverse document frequency) weight and the cosine similarity
score to calculate the relevance of a document to a query. Every document and
query is given a vector representation, (w
1j
, w
2j
,…, w
kj
), where the weight w
ij
of each
term i in the vector of document j is given by the tf-idf score: w
ij
= tf
ij
*idf
i
= (j/
max )*log(N/n).tf
ij
is the frequency of term i in j, max f
j
is the max frequency of
any term in j, N is the number of documents in the collection, and n is the num-
ber of documents that contain i. A document is deemed relevant to a query if the
cosine similarity of the document and query vector representations (views) is above
a certain threshold:
similarity Q D cos
Q D
Q D
( , ) ( )= =
θ
where Q and D are vector representations of the query and the document, and ||Q||
and ||D|| are the lengths of the vectors.
Link-related relevanceWeb search applications also make use of link-
related information using techniques similar to Google’s Page rank algo-
rithm (Page et al., 1997) or the Hypertext Induced Topic Selection (HITS)
of Kleinberg (1999). PageRank is a probability distribution that models the
likelihood that a person randomly clicking on links will arrive at a particular
page. e idea is that links from other high-quality pages to a particular page
count as a vote of condence in that page and indicate the particular page is
more important than other pages with fewer good links. e Page rank of a web
page is given by:
PR u
PR v
L v
v B
u
( )
( )
( )
=
where PR(u) is the page rank of page u, B
u
is the set of all pages linking to u, and
L(v) is the number of links from v to other pages. When ranking documents with
respect to a query, we boost documents with high Page rank scores, assuming that
they have higher general quality than low score documents.
In addition, semantic search may involve graph traversals, semantic struc-
ture matching, or reasoning with implicit information. All these approaches
assume that semantic aspects of documents can be extracted and used to set up
semantically structured indices instead of or in addition to traditional indices.
To what extent Page rank and cosine similarity can be applied to semantic indi-
ces remains unclear, although attempts have been made to dene new ranking
schemes like ReConRank to extend Page rank into the semantic realm (Hogan
et al., 2006).
Semantics and Search ◾  355
13.4.1 Named Entity Matching
Named entity recognition has been used to recognize occurrences of entity refer-
ences in documents either by looking up carefully crafted entity dictionaries or by
matching document text against regular expressions for types of entities. A regular
expression can, for example, indicate that a word starting with a capital letter and
followed by an Ltd. abbreviation is the name of a company. e range of entity types
(person names, locations, dates) recognized is usually restricted and the types yield
a minimal semantic characterization of entities. If an entity like George W. Bush is
recognized at the beginning of a document, short forms of that entity (George Bush
or simply Bush) may also be recognized.
e recognized entity references may go to a semantic index or be normalized for
a traditional inverted index. Normalization means that all variants of an entity (like
George W. Bush, President Bush, George Bush, and Bush) are combined and we cal-
culate a total frequency for the entity rather than for each separate entity reference.
Since the entities are typed, this algorithm can support entity searches based on type
indications. For example, a query such as Q = LOCATION Deutsche Telekom may list
and rank all entities of type location that are prominent in documents citing Deutsche
Telekom. Presumably, the ranking should show cities and countries in which Deutsche
Telekom is located or does business. Special named entity indices can be found in sev-
eral search applications (Amaral et al., 2004; Duke et al., 2007; Kiryakov et al., 2004).
13.4.2 Graph Traversal
If a semantic index is set up and a query is mapped onto conceptual classes, the
user’s query may correspond to nodes in a graph-structured index. Finding relevant
information means traversing a graph on the basis of the user’s selected nodes,
certain constraints, and some general search strategies. One traversal strategy may,
for example, specify how the graph is traversed to nd interesting instances of a
particular class.
Graph-structured indices are used most often when the documents already have
some inherent uniform structure, e.g., a particular XML format. It is not obvious
that these structures are semantically sound or meaningful and they may not make
much sense to a user. e approach tends to be more useful in professional settings
in which the documents and index structures are understood and respected. e
semantics are given by conventions in the community and may not be explicitly
dened with ontology languages.
Graph-traversing strategies have been used in a number of search systems for
XML documents: XSearch from Cohen et al. (2003) and XRANK from Guo et
al. (2003). Recent systems like Tabulator (Berners-Lee et al., 2006) and Swoogle
(Li et al., 2004) use similar strategies for RDF documents. e SSARK system of
Anyanwu et al. (2005) uses a congurable ranking algorithm for ranking the asso-
ciations between entities in the result set.
356 ◾  Jon Atle Gulla et al.
13.4.3 Conceptual Matching
e conceptual matching approach requires that both documents and queries
be specied at a conceptual (semantic) level using similar conceptual structures.
During retrieval, a document is deemed relevant to a query if its conceptual content
subsumes the conceptual content of the query. Since there are no vector calcula-
tions of similarity scores, how relevant documents are to be ranked in the result set
is not obvious.
If a document has semantic annotations that can be used to construct seman-
tic indices, the search process is a matching of semantic query structures against
semantic document structures. Separate concept indices are used in several search
engines (Bonino et al., 2003; Celino et al., 2006; Davies et al., 2004; Finin et al.,
2005; Hein and Hendler, 2000; Lei et al., 2006). As shown in Figure13.5 from
Hein and Hendler (2000), the SHOE search application asks the user to select
an ontology before posting a query. Available categories from the ontology are pre-
sented to the user, and the user selects the relevant ones and species the details
(properties) of the query. e categories and properties are all reected in the docu-
ment index. is strategy is in many ways similar to query reformulation strategies
Figure 13.5 SHOE search interface. (Source: Heflin, J. and J. Hendler. In Artificial
Intelligence for Web Search: Papers from AAAI Workshop, 2000, AAAI Press.
pp. 35–40.)
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset