185
Chapter 8
Relation Extraction
for Semantic Web
with Taxonomic
Sequential Patterns
Sebastian Blohm
Universität Karlsruhe, Karlsruhe, Germany
Krisztian Buza and Lars Schmidt-Thieme
Universität Hildesheim, Hildesheim, Germany
Philipp Cimiano
Universität Bielefeld, Bielefeld, Germany
Contents
8.1 Introduction .............................................................................................186
8.2 Problem Statement ...................................................................................188
8.3 Process .....................................................................................................190
8.4 Taxonomic Sequential Patterns ................................................................192
8.5 Pattern Mining .........................................................................................194
8.6 Experiments .............................................................................................198
186 ◾  Sebastian Blohm et al.
8.1 Introduction
e Semantic Web relies on explicit content captured by means of the RDF(S) and
OWL Semantic Web languages. As most of the content (text in particular) avail-
able on the Web is still unstructured, eective and ecient techniques are needed
to extract and capture the relevant information so that it can be formalized in RDF
and/or OWL, thus becoming accessible to the Semantic Web. e data models of
the Semantic Web (especially RDF) build on the central notion of a triple of the
form (s, p, o) where s is the so-called subject, p the predicate, and o the object. e
extraction of binary relations or tuples (s, o) standing in a relation (predicate or
property) p is thus a crucial task from a Semantic Web perspective. In this chapter
we are concerned with extracting binary relations or tuples from large bodies of
textual data.
Many approaches have been presented to deal with this task. On the one
hand, we can distinguish discriminative techniques that train a statistical classi-
er using machine learning techniques to spot mentions of the relation of inter-
est [5,21,22,23,24]. On the other hand, pattern-based approaches that induce and
match explicit patterns have also been proposed [4,6,8,15,16,18,10].
Patterns can be understood as constraints on the surface form of a text frag-
ment. We say that a pattern matches a textual fragment if the fragment fullls all
the constraints dened by the pattern. In this work, we follow the pattern-based
approach and present a new class Taxonomic Sequential Patterns (TSPs). Patterns
can be thought of as simple crisp classiers that match or do not match a given
text fragment. If they match, we can extract the corresponding tuple by way of
marked argument positions in the pattern. Clearly, the expressiveness of the pat-
tern class considered determines the performance of a pattern-based approach to
relation extraction.
Many pattern-based approaches to relation extraction incorporate some sort of
morpho-syntactic or semantic types into the pattern class to yield more general pat-
terns. However, the impact of these features on extraction quality is typically not
assessed. We can expect that at least two dimensions of pattern classes have a major
impact on extraction performance: (1) the pattern language elements that allow for
under-specication (wild card, skip, disjunction) and (2) the set of features (mor-
pho-syntactic, semantic types, etc.) taken into account during pattern matching.
8.6.1 Evaluation Protocol .......................................................................198
8.6.2 Dataset and Preprocessing ............................................................199
8.6.3 Experimental Setup ..................................................................... 200
8.6.4 Experimental Results ....................................................................202
8.7 Related Work ...........................................................................................205
8.8 Conclusions ..............................................................................................207
Acknowledgement ............................................................................................ 208
References ........................................................................................................ 208
Relation Extraction for Semantic Web ◾  187
is work constitutes a generalization of approaches that allow integration of a
taxonomy of morpho-syntactic and lexico-semantic features directly into the pat-
tern mining process. Figure8.1 gives an example of a sentence that is indicative of
an instance of the locatedIn relation along with linguistic information available for
each token. e features for each token are ordered by generality, i.e., each column
above the surface string of a token corresponds to the tokens path in a taxonomy.
e topmost row contains for each token the * wild card !!! feature that matches any
token. It constitutes the top concept of the taxonomy or, in a constraint view, is the
constraint that does not exclude any token.
e goal is to identify for each token in the pattern the right level of detail at
which a constraint is added to the pattern. Figure8.2 illustrates this for the example
sentence. e greyed token information will not be part of the pattern. To identify
patterns, we present a principled and uniform mining approach based on sound tech-
niques from the area of knowledge discovery and, in particular, frequent sequence
mining. More specically, this chapter makes the following contributions:
We introduce Taxonomic Sequential Patterns (TSPs) as generalizations of
many pattern classes adopted in the literature. rough this pattern class we
can study the eect of taxonomic knowledge on the information extraction
task and reproduce many other pattern classes from the literature with which
to compare the TSP pattern class. e main question we address in this chap-
ter is whether TSPs are superior to other types of patterns by looking in par-
ticular at the precision–recall trade-o.
We present a principled mining algorithm as an extension of the well known
ECLAT algorithm [13] that allows us to mine taxonomic sequential pat-
terns and all the pattern classes that we directly compare with, e.g., the pat-
terns used in the URES system [18] and a few baseline pattern classes. Such
comparisons of performance across dierent classes do not exist, probably
because most mining algorithms are somewhat ad hoc and cannot be directly
extended to mine other types of patterns. e mining algorithm we present
is principled in the sense that it is complete (guaranteed to nd all patterns
with a given frequency threshold of occurrence, called minimum support)
* * * * * * * *
quantifier
determiner
adjective
superlative
noun
living being
person
other
preposition
noun
location
country
verb
stative
other
preposition
noun
location
city
e happiest people in Germany live in Osnabrück
Figure 8.1 Example sentence with morpho-syntactic token features. The fea-
tures for each token are ordered by generality. * denotes the most general con-
straint, matching everything.
188 ◾  Sebastian Blohm et al.
and extensible by making minimal assumptions on the patterns (an order of
specicity dened on them).
We present results of experiments on four relations for ve pattern classes,
showing that TSPs perform generally better compared to the URES and
other baseline pattern classes.
e chapter is structured as follows. After dening the task addressed in more
detail in the following section, we describe the overall settings and workow of
our approach in the process section. We then introduce the pattern class of TSPs.
In the section on pattern mining, we describe a novel extension of the well-known
ECLAT algorithm that allows ecient mining of TSPs and many of the other pat-
tern classes suggested in the literature. We then present our experimental results,
showing the superiority of TSPs in relation to three other pattern classes as base-
lines. Before concluding, we discuss related work.
8.2 Problem Statement
As noted in the introduction, the basic data units in the Semantic Web are triples
of the form (s, p o) where s is the so-called subject, p the predicate, and o the object.
e extraction of binary relations or tuples (s, o) standing in a relation (predicate or
property) p is crucial from a Semantic Web perspective. is chapter is concerned
with extracting binary relations or tuples from large bodies of textual data. For
example, we want to extract information like Osnabrück is located in Germany or
Paris is the capital of France. More formally, we can say, that the tuple (Osnabrück,
Germany) belongs to the extension of the binary relation locatedIn. e second
tuple (Paris, France) is an instance of the relation capitalOf. For given relations, we
aim to extract all their tuples from large collections of texts.
We dene the text formally as follows: given (1) a large collection of texts (e.g.,
the Web as a whole), (2) a binary relation R between entities mentioned in the texts,
and (3) a few given tuples belonging to R (the so-called seeds), the goal is to extract
all the tuples of R that occur somewhere in the given collection oftexts.
e relation R can be, e.g., locatedIn (Figure8.1), productOf (a company), born-
InYear, etc. We assume that some (few) tuples of R are already known and we want
* * * * * * * *
quantifier
determiner
adjective
superlative
noun
living being
person
other
preposition
noun
location
country
verb
stative
other
preposition
noun
location
city
e happiest people in Germany live in Osnabrück
Figure 8.2 Possible choice of features for pattern from example sentence.
Relation Extraction for Semantic Web ◾  189
to extract all additional tuples mentioned in a large collection of texts. e goal
of our work is not to develop a completely new system, but to analyze the impact
of a crucial design choice on pattern-based relation extraction: the pattern class or
pattern language that denes which patterns can be expressed. e choice of the
pattern class directly determines the search space for patterns and thus clearly has
the potential of aecting performance.
We explore one particular type of design choice by analyzing the impact of
factoring taxonomic information into the pattern class. We do so by designing a
generic pattern class (TSPs) that subsumes many of the pattern classes discussed
in the literature and allows us to explore the impact of taxonomic versus non-
taxonomic patterns from a general view. While many pattern classes incorporate
type information (sense information, semantic or named entity tags, etc.) into
the patterns, the positive eects have not been empirically demonstrated. For
the sake of simplicity, we assume that all information can be encoded into a
single hierarchyclearly an over-simplication because the hierarchy will then
contain classes corresponding to dierent linguistic levels.is assumption may
lead to ad hoc modeling decisions, such as putting the person class under the
noun part of speech in the hierarchy. Nevertheless, this assumption eases the
task of mining patterns, as the algorithm must deal with a single taxonomy
instead of many.
In many approaches in the literature, the description of the pattern class or
language is mixed with the description of the way the patterns are matched and
thus takes on a somewhat procedural nature. A declarative description such as we
attempt here makes systematic comparison of the features used in dierent pattern
languages easier.
Overall, we are re-examining and substantiating with experimental evidence
an assumption that many systems have made: that abstraction with respect to a
type system or taxonomy can have a positive eect on extraction performance. In
this sense we are contributing to clarication of a fundamental issue that will help
designers make more informed decisions about the designs of pattern classes in
future work on relation extraction.
While it sounds straightforward that additional information (part of speech or
semantic tags) can improve extraction, the result is far from obvious as an adverse
eect is possible (compare Figure8.3). Suppose we are given a simple pattern class
that only includes words and (untyped) wild cards. e introduction of a taxonomy
could have at least three eects:
1. e integration of type information (by way of a taxonomy of concepts)
increases recall (and potentially decreases precision) as concrete tokens
in the pattern may be replaced by more generic “types” (generalization
eect).
2. Gaps or untyped wild cards (tokens in patterns that match arbitrary input
tokens) may be replaced by typed gaps or wild cards, thus restricting the
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset