to construct a taxonomy T by starting with the primitive attributes in A as the leaves of T and recursively adding nodes to T, one at a time, by merging two existing nodes.
Let DM(P(x)||Q(x)) denote a measure of pair-wise divergence between two probability distributions P and Q of the random variable x. We use a pair-wise measure of divergence between the distributions of the class labels associated with the corresponding Boolean attributes as a measure of dissimilarity. The lower the divergence between the class distributions of two attributes, the greater their similarity. The choice of this measure is motivated by the intended use of the propositionalized attribute taxonomy by a PAT-aware algorithm to generate classifiers. If two attributes are indistinguishable with respect to their class distributions, they provide statistically similar information for the classification of instances.
The algorithm maintains a cut γ through the taxonomy T, updating the cut γ as new nodes are added to T. At each step, the two words to be grouped to yield an abstract word added to T are selected from γ based on the divergence between the class distributions associated with the corresponding words. That is, a pair of words in γ is merged if they have more similar class distributions than any other pair of words in γ. This process terminates when the cut γ contains a single word that corresponds to the root of T. Overall, the algorithm performs (|A| − 1) merge steps, and each merge adds one more abstract node to the taxonomy. Hence, the resulting T has (2|A| − 1) nodes when the algorithm terminates.
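In outline, this is a standard agglomerative clustering loop over the current cut. The sketch below is an illustration under stated assumptions, not the published implementation: the dictionary of class distributions, the simple averaging used to form an abstract node's distribution, and the function and parameter names are our own choices. A concrete divergence function (for example, Jensen–Shannon) can be plugged in; implementations of the candidate measures are sketched at the end of Section 9.2.5 below.

```python
from itertools import combinations

def learn_taxonomy(class_dists, divergence):
    """Agglomeratively build a taxonomy over primitive attributes (or words).

    class_dists: dict mapping each primitive attribute/word to the class
        distribution associated with it (a list of probabilities).
    divergence:  a pair-wise divergence function over two such distributions,
        e.g. Jensen-Shannon divergence.
    Returns the list of merges (parent, left_child, right_child); the cut
    starts as the set of primitive attributes and ends as the single root.
    """
    cut = dict(class_dists)              # current cut (gamma) through T
    merges = []
    while len(cut) > 1:
        # merge the pair in the cut whose class distributions are most similar
        a, b = min(combinations(cut, 2),
                   key=lambda pair: divergence(cut[pair[0]], cut[pair[1]]))
        parent = f"({a}+{b})"            # naming mirrors the labels in Figure 9.10
        # class distribution of the abstract node: a simple average here;
        # the actual learner re-estimates it from aggregated frequency counts
        cut[parent] = [(x + y) / 2.0 for x, y in zip(cut[a], cut[b])]
        merges.append((parent, a, b))
        del cut[a], cut[b]
    return merges
```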
9.2.5 Pair-Wise Divergence Measures
Similarities between two probability distributions can be measured by several methods. We tested 13 divergence measures for probability distributions P and Q, including J-divergence, Jensen–Shannon divergence, and arithmetic and geometric mean divergence.
J-divergence (Topsøe, 2000). Also known as the Jeffreys–Kullback–Leibler divergence, this measure is a symmetric version of the Kullback–Leibler (KL) divergence. The J-divergence between two probability distributions P and Q is defined as:

$$J(P \,\|\, Q) = K(P \,\|\, Q) + K(Q \,\|\, P) = \sum_i (p_i - q_i)\log\frac{p_i}{q_i}$$

where the Kullback–Leibler divergence, also known as relative information, directed divergence, relative entropy, and the function of discrimination, is given by:

$$K(P \,\|\, Q) = \sum_i p_i \log\frac{p_i}{q_i}$$
The Kullback–Leibler divergence is a natural measure of the dissimilarity of distributions. It is nonnegative and reflexive, but it is asymmetric and does not satisfy the triangle inequality.
Jensen–Shannon divergence (Slonim et al., 2006). This is a weighted information gain measure, also called the Jensen difference divergence, information radius, and Sibson–Burbea–Rao–Jensen–Shannon divergence:

$$I(P \,\|\, Q) = \frac{1}{2}\left[\sum_i p_i \log\frac{2p_i}{p_i + q_i} + \sum_i q_i \log\frac{2q_i}{p_i + q_i}\right]$$
Jensen–Shannon divergence is reflexive, symmetric, and bounded. Figure 9.10 shows an AVT of an odor attribute generated by AVT learner (with binary clustering).
Figure 9.10 AVT of the odor attribute of UCI's AGARICUS-LEPIOTA mushroom data set generated by AVT learner using Jensen–Shannon divergence (binary clustering).
Arithmetic and geometric (A&G) mean divergence (Taneja, 1995). Popularly known as the backward Jensen–Shannon divergence, it is defined as:

$$T(P \,\|\, Q) = \sum_i \frac{p_i + q_i}{2}\,\log\frac{p_i + q_i}{2\sqrt{p_i q_i}}$$
It is the KL divergence between the arithmetic and geometric means of the two distributions. Since the results from the different symmetric divergence measures do not reveal remarkable differences, we limit the discussion to the Jensen–Shannon divergence measure.
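For reference, the symmetric measures above follow directly from their definitions. The sketch below is our own illustration (the function names, the use of natural logarithms, and the skipping of zero-probability terms are assumptions); any of these functions can be passed as the divergence argument of the learn_taxonomy sketch given earlier, and learn_taxonomy(class_dists, jensen_shannon) would perform the kind of Jensen–Shannon-based grouping illustrated in Figure 9.10.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence K(P||Q) over discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j_divergence(p, q):
    """J-divergence: J(P||Q) = K(P||Q) + K(Q||P)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence I(P||Q): average KL to the midpoint distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ag_mean_divergence(p, q):
    """A&G mean divergence T(P||Q): KL between the arithmetic and geometric means."""
    # terms where either probability is zero are skipped for simplicity
    return sum(((pi + qi) / 2.0) * math.log((pi + qi) / (2.0 * math.sqrt(pi * qi)))
               for pi, qi in zip(p, q) if pi > 0 and qi > 0)
```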
9.3 Evaluation of Taxonomies
Our approach to evaluating AVTs generated by AVT learner is based on the observation that an AVT that captures relevant relationships among attribute values can yield simple and accurate classifiers from data, just as an appropriate choice of axioms in a mathematical domain can simplify proofs of theorems. Thus, the simplicity and predictive accuracy of the classifiers learned with alternative choices of AVT can be used to evaluate the utility of the corresponding AVT in a specific context. Similar arguments apply to WTL and PAT learner. For evaluation, it is necessary to discuss the learning algorithms that can exploit taxonomies. We describe AVT-NBL (Zhang and Honavar, 2004) for structured multivariate data, WTNBL-MN (Kang et al., 2005) for unstructured data, PAT-DTL (Kang and Sohn, 2009), and PAT-NBL.
9.3.1 AVT-Guided Variants of Standard Learning Algorithms
To make standard learning algorithms taxonomy-aware, we extend them in principled ways to exploit the information provided by an AVT. AVT-DTL (Yamazaki et al., 1995; Zhang et al., 2002; Zhang and Honavar, 2003) and AVT-NBL (Zhang and Honavar, 2004), which extend the decision tree learning algorithm (Quinlan, 1993) and the naive Bayes learning algorithm (Langley et al., 1992) respectively, are examples of such algorithms.
The basic idea behind AVT-NBL is to start with a naive Bayes classifier based on the most abstract attribute values in the AVTs and to successively refine the classifier using a scoring function, the conditional minimum description length (CMDL) score suggested by Friedman et al. (1997), which captures the trade-off between the accuracy of classification and the complexity of the resulting naive Bayes classifier.
The experiments reported by Zhang and Honavar (2004) on several benchmark data sets show that AVT-NBL, when supplied with human-generated AVTs, learns substantially more accurate classifiers than the naive Bayes learner (NBL) applied directly to the data sets, and than NBL applied to data sets represented by a set of binary features that correspond to the nodes of the AVT (PROP-NBL). The classifiers generated by AVT-NBL are also substantially more compact than those generated by NBL and PROP-NBL, and these results hold across a wide range of missing attribute values in the data sets. Hence, the performance of the naive Bayes classifiers generated by AVT-NBL when supplied with AVTs produced by AVT learner provides a useful measure of the effectiveness of AVT learner in discovering hierarchical groupings of attribute values that yield compact and accurate classifiers from data.
9.3.2 WTNBL-MN Algorithm
If you understand the underlying idea of AVT-guided variants of standard learning algorithms, it is easy to understand the other taxonomy-aware algorithms because they are similar. The problem of learning classifiers from a word taxonomy and sequence data is a natural generalization of the problem of learning classifiers from sequence data. An original data set D is a collection of labeled instances <I_j, C_j> where I_j ∈ I. A classifier is a hypothesis in the form of a function h: I → C, whose domain is the instance space I and whose range is the set of classes C. A hypothesis space H is a set of hypotheses that can be represented in some hypothesis language or by a parameterized family of functions. The task of learning classifiers from the data set D is to induce a hypothesis h ∈ H that satisfies given criteria.
Learning classifiers from a word taxonomy and data can be described by assuming a word taxonomy T_Σ over the words Σ and a data set D. The aim is to induce a classifier h_γ*: I_γ* → C, where γ* is a cut that maximizes the given criteria. Note that the hypothesis space H_γ̂ induced by a chosen cut γ̂ allows an efficient search for hypotheses that are both concise and accurate. Word taxonomy-guided NBL has two major components: (1) estimation of the parameters of naive Bayes classifiers based on a cut, and (2) a criterion for refining a cut.
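To make the two components concrete, the following sketch (illustrative only; the data layout, parameter names, and the use of log probabilities are our assumptions) classifies a word sequence under a given cut by abstracting each primitive word to its ancestor in the cut and applying the multinomial naive Bayes decision rule:

```python
def classify(words, cut_ancestor, log_priors, log_p):
    """Multinomial naive Bayes classification under a cut of the word taxonomy.

    words:        list of primitive words in the instance.
    cut_ancestor: dict mapping each primitive word to its ancestor node in the cut.
    log_priors:   dict mapping class label -> log P(c).
    log_p:        nested dict, log_p[c][node] = log p(node | c) for nodes in the cut.
    Returns the class with the highest posterior score; the multinomial
    coefficient is class independent and is therefore omitted.
    """
    abstracted = [cut_ancestor[w] for w in words if w in cut_ancestor]
    scores = {c: lp + sum(log_p[c][node] for node in abstracted)
              for c, lp in log_priors.items()}
    return max(scores, key=scores.get)
```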
9.3.3 Aggregation of Class Conditional Frequency Counts
We can estimate the relevant parameters of a naive Bayes classifier efficiently by aggregating class conditional frequency counts. For a particular node of a given cut, the parameters of the node can be estimated by summing the class conditional frequency counts of its children (Zhang and Honavar, 2004).
Given a word taxonomy T_Σ, we can define a tree of class conditional frequency counts T_f so that there is a one-to-one correspondence between the nodes of the word taxonomy T_Σ and the nodes of the corresponding T_f. The class conditional frequency counts associated with a nonleaf node of T_f are aggregations of the corresponding class conditional frequency counts associated with its children. Because a cut through the word taxonomy corresponds to a partition of the set of words, the corresponding cut through T_f specifies a valid class conditional probability table for words. Therefore, to estimate each node of T_f, we simply estimate the class conditional frequency counts of the primitive words in Σ that correspond to the leaves of T_f. Then we aggregate them recursively to calculate the class conditional frequency counts associated with their parent nodes.
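This bottom-up aggregation can be sketched as a short recursion (our own illustration; the node structure with a children list, the word attribute at leaves, and the use of Counter objects are assumptions):

```python
from collections import Counter

def aggregate_counts(node, leaf_counts):
    """Recursively compute class conditional frequency counts for each node of T_f.

    node:        a node of the word taxonomy; node.children is empty for a leaf
                 and node.word names the primitive word at that leaf.
    leaf_counts: dict mapping primitive word -> Counter {class label: count},
                 estimated directly from the training data.
    Attaches the aggregated Counter to node.counts and returns it.
    """
    if not node.children:                  # leaf: counts come straight from the data
        node.counts = Counter(leaf_counts[node.word])
    else:                                  # internal node: sum the children's counts
        node.counts = Counter()
        for child in node.children:
            node.counts += aggregate_counts(child, leaf_counts)
    return node.counts
```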
9.3.3.1 Multinomial Model for Representing Text and Sequence
In a multinomial model, a sequence is represented as a vector of word occurrence frequencies f_{i,j}. The probability of an instance I_j (written d_j below) given its class c_j is defined as follows:

$$P(d_j \mid c_j) = \frac{\left(\sum_i^{|\Sigma|} f_{i,j}\right)!}{\prod_i^{|\Sigma|} (f_{i,j})!}\;\prod_i^{|\Sigma|} p_{i,j}^{\,f_{i,j}} \qquad (9\text{-}1)$$
The term $\left(\sum_i^{|\Sigma|} f_{i,j}\right)! \big/ \prod_i^{|\Sigma|} (f_{i,j})!$ represents the number of possible combinations of words for the instance I_j. In Equation 9-1, p_{i,j} is calculated as:

$$p_{i,j} = \frac{\mathrm{Count}(c_j, w_i)}{\mathrm{Count}(c_j)}$$
Count(c_j, w_i) is the number of times the word w_i appears in all the instances that have class label c_j, and Count(c_j) is the total number of words in a particular class label c_j.
With Laplacian smoothing, p_{i,j} becomes:

$$p_{i,j} = \frac{1 + \mathrm{Count}(c_j, w_i)}{|\Sigma| + \mathrm{Count}(c_j)}$$
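As a concrete illustration, the Laplace-smoothed estimates can be computed from labeled word sequences as in the sketch below (our own example; the data layout and function name are assumptions, and the vocabulary plays the role of Σ):

```python
from collections import Counter, defaultdict

def estimate_multinomial_params(instances, vocabulary):
    """Estimate class conditional word probabilities p_{i,j} with Laplace smoothing.

    instances:  iterable of (word_list, class_label) pairs.
    vocabulary: the set of words (Sigma).
    Returns p[label][word] = (1 + Count(c, w)) / (|Sigma| + Count(c)).
    """
    word_counts = defaultdict(Counter)     # Count(c_j, w_i)
    total_words = Counter()                # Count(c_j)
    for words, label in instances:
        in_vocab = [w for w in words if w in vocabulary]
        word_counts[label].update(in_vocab)
        total_words[label] += len(in_vocab)

    p = {}
    for label in word_counts:
        denom = len(vocabulary) + total_words[label]
        p[label] = {w: (1 + word_counts[label][w]) / denom for w in vocabulary}
    return p
```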
Alternatively, if we follow the Dirichlet prior, p_{v,c_i} will be:

$$p_{v,c_i} = \frac{L/|v| + \mathrm{Count}(c_i, v)}{L + \mathrm{Count}(c_i)}$$

where L is an average length and |v| is the number of values. If we consider the number of words in an instance (i.e., the document length) (McCallum and Nigam, 1998) and assume that document length is independent of class for simplicity, we get:
$$P(\{v_i\} \mid c) = P(|d|)\,|d|!\;\prod_i^{|v|} \frac{p_{v_i,c}^{\,v_i}}{v_i!} \qquad (9\text{-}2)$$

where $|d| = \sum_i^{|v|} v_i$ is the number of words in a particular instance (the document length). In practice, document length may be class dependent: