Chapter 8: Relation Extraction for Semantic Web with Taxonomic Sequential Patterns (1/6)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

185

Chapter 8

Relation Extraction

for Semantic Web

with Taxonomic

Sequential Patterns

Sebastian Blohm

Universität Karlsruhe, Karlsruhe, Germany

Krisztian Buza and Lars Schmidt-Thieme

Universität Hildesheim, Hildesheim, Germany

Philipp Cimiano

Universität Bielefeld, Bielefeld, Germany

Contents

8.1 Introduction .............................................................................................186

8.2 Problem Statement ...................................................................................188

8.3 Process .....................................................................................................190

8.4 Taxonomic Sequential Patterns ................................................................192

8.5 Pattern Mining .........................................................................................194

8.6 Experiments .............................................................................................198

186 ◾ Sebastian Blohm et al.

8.1 Introduction

e Semantic Web relies on explicit content captured by means of the RDF(S) and

OWL Semantic Web languages. As most of the content (text in particular) avail-

able on the Web is still unstructured, eective and ecient techniques are needed

to extract and capture the relevant information so that it can be formalized in RDF

and/or OWL, thus becoming accessible to the Semantic Web. e data models of

the Semantic Web (especially RDF) build on the central notion of a triple of the

form (s, p, o) where s is the so-called subject, p the predicate, and o the object. e

extraction of binary relations or tuples (s, o) standing in a relation (predicate or

property) p is thus a crucial task from a Semantic Web perspective. In this chapter

we are concerned with extracting binary relations or tuples from large bodies of

textual data.

Many approaches have been presented to deal with this task. On the one

hand, we can distinguish discriminative techniques that train a statistical classi-

er using machine learning techniques to spot mentions of the relation of inter-

est [5,21,22,23,24]. On the other hand, pattern-based approaches that induce and

match explicit patterns have also been proposed [4,6,8,15,16,18,10].

Patterns can be understood as constraints on the surface form of a text frag-

ment. We say that a pattern matches a textual fragment if the fragment fullls all

the constraints dened by the pattern. In this work, we follow the pattern-based

approach and present a new class Taxonomic Sequential Patterns (TSPs). Patterns

can be thought of as simple crisp classiers that match or do not match a given

text fragment. If they match, we can extract the corresponding tuple by way of

marked argument positions in the pattern. Clearly, the expressiveness of the pat-

tern class considered determines the performance of a pattern-based approach to

relation extraction.

Many pattern-based approaches to relation extraction incorporate some sort of

morpho-syntactic or semantic types into the pattern class to yield more general pat-

terns. However, the impact of these features on extraction quality is typically not

assessed. We can expect that at least two dimensions of pattern classes have a major

impact on extraction performance: (1) the pattern language elements that allow for

under-specication (wild card, skip, disjunction) and (2) the set of features (mor-

pho-syntactic, semantic types, etc.) taken into account during pattern matching.

8.6.1 Evaluation Protocol .......................................................................198

8.6.2 Dataset and Preprocessing ............................................................199

8.6.3 Experimental Setup ..................................................................... 200

8.6.4 Experimental Results ....................................................................202

8.7 Related Work ...........................................................................................205

8.8 Conclusions ..............................................................................................207

Acknowledgement ............................................................................................ 208

References ........................................................................................................ 208

Relation Extraction for Semantic Web ◾ 187

is work constitutes a generalization of approaches that allow integration of a

taxonomy of morpho-syntactic and lexico-semantic features directly into the pat-

tern mining process. Figure8.1 gives an example of a sentence that is indicative of

an instance of the locatedIn relation along with linguistic information available for

each token. e features for each token are ordered by generality, i.e., each column

above the surface string of a token corresponds to the token’s path in a taxonomy.

e topmost row contains for each token the * wild card !!! feature that matches any

token. It constitutes the top concept of the taxonomy or, in a constraint view, is the

constraint that does not exclude any token.

e goal is to identify for each token in the pattern the right level of detail at

which a constraint is added to the pattern. Figure8.2 illustrates this for the example

sentence. e greyed token information will not be part of the pattern. To identify

patterns, we present a principled and uniform mining approach based on sound tech-

niques from the area of knowledge discovery and, in particular, frequent sequence

mining. More specically, this chapter makes the following contributions:

◾ We introduce Taxonomic Sequential Patterns (TSPs) as generalizations of

many pattern classes adopted in the literature. rough this pattern class we

can study the eect of taxonomic knowledge on the information extraction

task and reproduce many other pattern classes from the literature with which

to compare the TSP pattern class. e main question we address in this chap-

ter is whether TSPs are superior to other types of patterns by looking in par-

ticular at the precision–recall trade-o.

◾ We present a principled mining algorithm as an extension of the well known

ECLAT algorithm [13] that allows us to mine taxonomic sequential pat-

terns and all the pattern classes that we directly compare with, e.g., the pat-

terns used in the URES system [18] and a few baseline pattern classes. Such

comparisons of performance across dierent classes do not exist, probably

because most mining algorithms are somewhat ad hoc and cannot be directly

extended to mine other types of patterns. e mining algorithm we present

is principled in the sense that it is complete (guaranteed to nd all patterns

with a given frequency threshold of occurrence, called minimum support)

* * * * * * * *

quantiﬁer

determiner

adjective

superlative

noun

living being

person

other

preposition

noun

location

country

verb

stative

other

preposition

noun

location

city

e happiest people in Germany live in Osnabrück

Figure 8.1 Example sentence with morpho-syntactic token features. The fea-

tures for each token are ordered by generality. * denotes the most general con-

straint, matching everything.

188 ◾ Sebastian Blohm et al.

and extensible by making minimal assumptions on the patterns (an order of

specicity dened on them).

◾ We present results of experiments on four relations for ve pattern classes,

showing that TSPs perform generally better compared to the URES and

other baseline pattern classes.

e chapter is structured as follows. After dening the task addressed in more

detail in the following section, we describe the overall settings and workow of

our approach in the process section. We then introduce the pattern class of TSPs.

In the section on pattern mining, we describe a novel extension of the well-known

ECLAT algorithm that allows ecient mining of TSPs and many of the other pat-

tern classes suggested in the literature. We then present our experimental results,

showing the superiority of TSPs in relation to three other pattern classes as base-

lines. Before concluding, we discuss related work.

8.2 Problem Statement

As noted in the introduction, the basic data units in the Semantic Web are triples

of the form (s, p o) where s is the so-called subject, p the predicate, and o the object.

e extraction of binary relations or tuples (s, o) standing in a relation (predicate or

property) p is crucial from a Semantic Web perspective. is chapter is concerned

with extracting binary relations or tuples from large bodies of textual data. For

example, we want to extract information like Osnabrück is located in Germany or

Paris is the capital of France. More formally, we can say, that the tuple (Osnabrück,

Germany) belongs to the extension of the binary relation locatedIn. e second

tuple (Paris, France) is an instance of the relation capitalOf. For given relations, we

aim to extract all their tuples from large collections of texts.

We dene the text formally as follows: given (1) a large collection of texts (e.g.,

the Web as a whole), (2) a binary relation R between entities mentioned in the texts,

and (3) a few given tuples belonging to R (the so-called seeds), the goal is to extract

all the tuples of R that occur somewhere in the given collection oftexts.

e relation R can be, e.g., locatedIn (Figure8.1), productOf (a company), born-

InYear, etc. We assume that some (few) tuples of R are already known and we want

* * * * * * * *

quantiﬁer

determiner

adjective

superlative

noun

living being

person

other

preposition

noun

location

country

verb

stative

other

preposition

noun

location

city

e happiest people in Germany live in Osnabrück

Figure 8.2 Possible choice of features for pattern from example sentence.

Relation Extraction for Semantic Web ◾ 189

to extract all additional tuples mentioned in a large collection of texts. e goal

of our work is not to develop a completely new system, but to analyze the impact

of a crucial design choice on pattern-based relation extraction: the pattern class or

pattern language that denes which patterns can be expressed. e choice of the

pattern class directly determines the search space for patterns and thus clearly has

the potential of aecting performance.

We explore one particular type of design choice by analyzing the impact of

factoring taxonomic information into the pattern class. We do so by designing a

generic pattern class (TSPs) that subsumes many of the pattern classes discussed

in the literature and allows us to explore the impact of taxonomic versus non-

taxonomic patterns from a general view. While many pattern classes incorporate

type information (sense information, semantic or named entity tags, etc.) into

the patterns, the positive eects have not been empirically demonstrated. For

the sake of simplicity, we assume that all information can be encoded into a

single hierarchy—clearly an over-simplication because the hierarchy will then

contain classes corresponding to dierent linguistic levels. is assumption may

lead to ad hoc modeling decisions, such as putting the person class under the

noun part of speech in the hierarchy. Nevertheless, this assumption eases the

task of mining patterns, as the algorithm must deal with a single taxonomy

instead of many.

In many approaches in the literature, the description of the pattern class or

language is mixed with the description of the way the patterns are matched and

thus takes on a somewhat procedural nature. A declarative description such as we

attempt here makes systematic comparison of the features used in dierent pattern

languages easier.

Overall, we are re-examining and substantiating with experimental evidence

an assumption that many systems have made: that abstraction with respect to a

type system or taxonomy can have a positive eect on extraction performance. In

this sense we are contributing to clarication of a fundamental issue that will help

designers make more informed decisions about the designs of pattern classes in

future work on relation extraction.

While it sounds straightforward that additional information (part of speech or

semantic tags) can improve extraction, the result is far from obvious as an adverse

eect is possible (compare Figure8.3). Suppose we are given a simple pattern class

that only includes words and (untyped) wild cards. e introduction of a taxonomy

could have at least three eects:

1. e integration of type information (by way of a taxonomy of concepts)

increases recall (and potentially decreases precision) as concrete tokens

in the pattern may be replaced by more generic “types” (generalization

eect).

2. Gaps or untyped wild cards (tokens in patterns that match arbitrary input

tokens) may be replaced by typed gaps or wild cards, thus restricting the

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 8: Relation Extraction for Semantic Web with Taxonomic Sequential Patterns (1/6)

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 8: Relation Extraction for Semantic Web with Taxonomic Sequential Patterns (1/6)