Semantics and Search ◾ 355
13.4.1 Named Entity Matching
Named entity recognition has been used to identify occurrences of entity refer-
ences in documents either by looking up carefully crafted entity dictionaries or by
matching document text against regular expressions for types of entities. A regular
expression can, for example, indicate that a word starting with a capital letter and
followed by an Ltd. abbreviation is the name of a company. The range of entity types
(person names, locations, dates) recognized is usually restricted and the types yield
a minimal semantic characterization of entities. If an entity like George W. Bush is
recognized at the beginning of a document, short forms of that entity (George Bush
or simply Bush) may also be recognized.
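
The two ideas above, a regular expression for one entity type and the recognition of short forms, can be sketched as follows. This is a minimal illustration, not a production recognizer: the pattern `COMPANY` and the helpers `find_companies` and `short_forms` are hypothetical names introduced here, and the short-form heuristic (first name plus last name, or last name alone) is an assumption.

```python
import re

# Hypothetical rule: one or more capitalized words followed by "Ltd."
COMPANY = re.compile(r"\b((?:[A-Z][A-Za-z&]*\s+)+)Ltd\.")


def find_companies(text):
    """Return substrings matched by the capitalized-word + 'Ltd.' rule."""
    return [m.group(0) for m in COMPANY.finditer(text)]


def short_forms(full_name):
    """Generate simple short forms of a person name (illustrative heuristic)."""
    parts = full_name.split()
    first, last = parts[0], parts[-1]
    forms = {f"{first} {last}", last}
    forms.discard(full_name)  # the full name is not a short form of itself
    return forms


text = "Acme Widgets Ltd. signed a deal with Borealis Ltd. last year."
print(find_companies(text))
print(sorted(short_forms("George W. Bush")))
```

A rule this simple also picks up any capitalized word that happens to precede the company name (a sentence-initial "The", for instance), which is one reason such patterns are usually combined with dictionaries.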
The recognized entity references may go to a semantic index or be normalized for
a traditional inverted index. Normalization means that all variants of an entity (like
George W. Bush, President Bush, George Bush, and Bush) are combined and we cal-
culate a total frequency for the entity rather than for each separate entity reference.
Since the entities are typed, this approach can support entity searches based on type
indications. For example, a query such as Q = LOCATION Deutsche Telekom may list
and rank all entities of type location that are prominent in documents citing Deutsche
Telekom. Presumably, the ranking should show cities and countries in which Deutsche
Telekom is located or does business. Special named entity indices can be found in sev-
eral search applications (Amaral et al., 2004; Duke et al., 2007; Kiryakov et al., 2004).
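
Normalization and a typed query such as Q = LOCATION Deutsche Telekom might be sketched as below. The alias table, the sample documents, and the function names are all hypothetical, and ranking by raw co-occurrence frequency is an assumption made for brevity. The systems cited above may rank quite differently.

```python
from collections import Counter

# Hypothetical alias table: surface form -> (canonical entity, entity type).
ALIASES = {
    "George W. Bush": ("George W. Bush", "PERSON"),
    "President Bush": ("George W. Bush", "PERSON"),
    "George Bush": ("George W. Bush", "PERSON"),
    "Bush": ("George W. Bush", "PERSON"),
    "Deutsche Telekom": ("Deutsche Telekom", "ORGANIZATION"),
    "Bonn": ("Bonn", "LOCATION"),
    "Germany": ("Germany", "LOCATION"),
}


def index_documents(docs):
    """Count normalized (entity, type) pairs per document."""
    return {
        doc_id: Counter(ALIASES[m] for m in mentions if m in ALIASES)
        for doc_id, mentions in docs.items()
    }


def typed_search(index, entity_type, anchor):
    """Rank entities of entity_type by frequency in documents citing anchor."""
    totals = Counter()
    for counts in index.values():
        if any(name == anchor for name, _ in counts):
            for (name, etype), n in counts.items():
                if etype == entity_type:
                    totals[name] += n
    return totals.most_common()


# Made-up entity references, as a recognizer might emit them per document.
docs = {
    "d1": ["Deutsche Telekom", "Bonn", "Bonn", "Germany"],
    "d2": ["Deutsche Telekom", "Bonn"],
    "d3": ["George W. Bush", "Bush", "President Bush", "Germany"],
}
index = index_documents(docs)
print(index["d3"][("George W. Bush", "PERSON")])  # total over all variants
print(typed_search(index, "LOCATION", "Deutsche Telekom"))
```

In document d3 the three variants are counted as one entity, which is the point of normalization. The typed query only considers documents that mention the anchor entity, so the reference to Germany in d3 does not contribute to the ranking.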
13.4.2 Graph Traversal
If a semantic index is set up and a query is mapped onto conceptual classes, the
user’s query may correspond to nodes in a graph-structured index. Finding relevant
information means traversing a graph on the basis of the user’s selected nodes,
certain constraints, and some general search strategies. One traversal strategy may,
for example, specify how the graph is traversed to find interesting instances of a
particular class.
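
One possible traversal strategy of this kind is a bounded breadth-first search that starts at the node selected by the user and collects instances of a target class. The sketch below assumes a tiny hand-built graph. The names `GRAPH`, `NODE_CLASS`, and `find_instances`, the edge labels, and the depth limit are all hypothetical and do not come from any of the systems discussed in this section.

```python
from collections import deque

# Hypothetical graph-structured index: node -> list of (edge label, neighbor).
GRAPH = {
    "Deutsche Telekom": [("locatedIn", "Bonn"), ("subsidiary", "T-Mobile")],
    "T-Mobile": [("locatedIn", "Bonn")],
    "Bonn": [("partOf", "Germany")],
    "Germany": [],
}

# Hypothetical class assignment for each node.
NODE_CLASS = {
    "Deutsche Telekom": "Company",
    "T-Mobile": "Company",
    "Bonn": "City",
    "Germany": "Country",
}


def find_instances(start, target_class, max_depth=2, allowed_edges=None):
    """Breadth-first search from start; return (node, depth) pairs of target_class."""
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, depth = queue.popleft()
        if node != start and NODE_CLASS.get(node) == target_class:
            hits.append((node, depth))
        if depth == max_depth:
            continue  # constraint: do not expand beyond the depth limit
        for label, neighbor in GRAPH.get(node, []):
            if allowed_edges is not None and label not in allowed_edges:
                continue  # constraint: follow only the permitted edge labels
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return hits


print(find_instances("Deutsche Telekom", "City"))
print(find_instances("Deutsche Telekom", "Company", allowed_edges={"subsidiary"}))
```

The depth limit and the edge-label filter stand in for the "certain constraints" mentioned above. Because the search is breadth-first, results come back ordered by distance from the selected node, which gives a crude ranking for free.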
Graph-structured indices are used most often when the documents already have
some inherent uniform structure, e.g., a particular XML format. It is not obvious
that these structures are semantically sound or meaningful and they may not make
much sense to a user. The approach tends to be more useful in professional settings
in which the documents and index structures are understood and respected. The
semantics are given by conventions in the community and may not be explicitly
defined with ontology languages.
Graph-traversing strategies have been used in a number of search systems for
XML documents: XSearch from Cohen et al. (2003) and XRANK from Guo et
al. (2003). Recent systems like Tabulator (Berners-Lee et al., 2006) and Swoogle
(Li et al., 2004) use similar strategies for RDF documents. The SSARK system of
Anyanwu et al. (2005) uses a configurable ranking algorithm for ranking the asso-
ciations between entities in the result set.