and products. We then extracted all person names from the textual corpus and
searched for each of them among the entity labels in the ontology.
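A minimal sketch of this coverage check follows; the variable names and the exact, case-insensitive string comparison are illustrative assumptions standing in for the actual extraction and lookup pipeline.

# Sketch of the coverage measurement: what fraction of the person names
# extracted from the reviews appear verbatim as labels in the ontology?
# `extracted_names` and `ontology_labels` are hypothetical inputs.
def name_coverage(extracted_names, ontology_labels):
    labels = {label.lower() for label in ontology_labels}
    distinct = {name.lower() for name in extracted_names}
    matched = sum(1 for name in distinct if name in labels)
    return matched / len(distinct) if distinct else 0.0

# Two of the three distinct names are covered, giving 67% coverage.
coverage = name_coverage(
    ["William Jackson", "Jane Doe", "William Jackson", "Some Critic"],
    ["William Jackson", "Jane Doe"])
print(f"coverage: {coverage:.0%}")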
Results show that 74% of the named entities that appear in professional reviews
also appear as terms in our ontology. For user reviews (non-edited), the figure is 50%.
The main reasons for mismatches lie in orthography variations (accents or trans-
literation differences), mentions of people not related to a movie in the reviews, and
aliasing or spelling variations (mostly in user reviews). We conclude that the cover-
age of people entities in the ontology is satisfactory. However, whether a search for
these entities will find them or will find the intended individuals in the ontology is
not certain. This fuzziness is caused by term variation (as observed especially in user
reviews) and term ambiguities.
10.5.4 Assessing Terminological Precision
To investigate terminological variation, we measured the ambiguity levels of named-
entity labels. By ambiguity, we refer to the possibility that a single name refers to
more than one ontology individual. Variation relates to the opposite case—one
ontology individual can be described by various terms in text.
We measured the level of terminological variation for each ontology individual,
i.e., given a single ontology individual (e.g., an actor), how many variations of the
name are found in the corpus? Bilenko and Mooney [19] used a similar method
in a different setting. To identify variations in the text, we used the StringMetrics
similarity matching library.* We experimented with the Levenshtein, Jaro-Winkler,
and q-gram similarity measures. For example, using such similarity measures, we
could match Bill Jackson (a name often used in blogs to informally refer to the
actor) with William Jackson (the name under which the actor is described in the
ontology). Such flexibility in aligning query terms with ontology terms increases
search system recall, but it also introduces a risk of precision loss when two dis-
tinct individuals in the ontology can be named by the same term. For example,
if an ontology contains an actor named Bill Johnson and another named William
Johnson, fuzzy string matching would confuse the two actors.
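The sketch below illustrates this trade-off with a plain normalized Levenshtein similarity in place of the StringMetrics implementations; the helper functions, the lower-casing, and the example scores are our own assumptions.

def levenshtein(a: str, b: str) -> int:
    # Classic edit distance computed with a rolling dynamic-programming row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Normalize the distance into a 0..1 similarity score.
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

# The variant "Bill Jackson" scores high against the ontology label
# "William Jackson" (good for recall), but the two distinct actors
# "Bill Johnson" and "William Johnson" score just as high (precision risk).
print(similarity("Bill Jackson", "William Jackson"))   # ~0.73
print(similarity("Bill Johnson", "William Johnson"))   # ~0.73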
To measure the practical impacts of the name variability and ambiguity factors,
we extracted information from the corpus of movie reviews we collected. We first
developed a NER specialized to the Movies domain. (The OpenCalais NER we
used above properly tags person names, but cannot distinguish actors and directors
or identify movie names.) We manually tagged a corpus of 200 movie reviews from
the Ebert corpus, to indicate the occurrences of movie names and actor names.
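As an illustration of what such an annotated corpus looks like, the sketch below writes one tagged sentence in a one-token-per-line, BIO-labelled column format of the kind consumed by trainers such as YAMCHA; the particular feature columns (surface form, lower-cased form, capitalization flag) and the example sentence are assumptions of ours, not details from the chapter.

def to_training_rows(tokens, tags):
    # One row per token: surface form, lower-cased form, capitalization
    # flag, and the gold BIO tag in the last column; a blank line marks
    # the sentence boundary.
    rows = []
    for token, tag in zip(tokens, tags):
        rows.append(f"{token}\t{token.lower()}\t{int(token[0].isupper())}\t{tag}")
    rows.append("")
    return "\n".join(rows)

# Hypothetical annotated sentence with an actor name and a movie title.
tokens = ["William", "Jackson", "shines", "in", "The", "Last", "Reel", "."]
tags   = ["B-ACTOR", "I-ACTOR", "O", "O", "B-MOVIE", "I-MOVIE", "I-MOVIE", "O"]
print(to_training_rows(tokens, tags))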
We then applied the YAMCHA† package to train an automatic NER system on
our corpus. YAMCHA uses a support vector machine (SVM) classifier to recog-
nize named entities in text based on features describing each word. We used two
* http://www.dcs.shef.ac.uk/sam/stringmetrics.html
† http://chasen.org/~taku/software/yamcha/