to construct a taxonomy T by starting with the primitive attributes in A as the leaves of T and recursively adding nodes to T, one at a time, by merging two existing nodes.
Let DM(P(x)||Q(x)) denote a measure of pair-wise divergence between two probability distributions P and Q of the random variable x. We use a pair-wise measure of divergence between the distributions of the class labels associated with the corresponding Boolean attributes as a measure of dissimilarity. The lower the divergence between the class distributions of two attributes, the greater their similarity. The choice of this measure is motivated by the intended use of the propositionalized attribute taxonomy by a PAT-aware algorithm to generate classifiers. If two attributes are indistinguishable with respect to their class distributions, they provide statistically similar information for the classification of instances.
The algorithm maintains a cut γ through the taxonomy T, updating the cut γ as new nodes are added to T. At each step, the two words to be grouped to yield an abstract word added to T are selected from γ based on the divergence between the class distributions associated with the corresponding words. That is, a pair of words in γ is merged if they have more similar class distributions than any other pair of words in γ. This process terminates when the cut γ contains a single word that corresponds to the root of T. Overall, the algorithm performs (|A| − 1) merge steps, and each merge adds one more abstract node to the taxonomy. Hence, the resulting T has (2|A| − 1) nodes when the algorithm terminates.
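In outline, this is a standard agglomerative clustering loop over the current cut. The sketch below is an illustration under stated assumptions, not the published implementation: the dictionary of class distributions, the simple averaging used to form an abstract node's distribution, and the function and parameter names are our own choices. A concrete divergence function (for example, Jensen–Shannon) can be plugged in; implementations of the candidate measures are sketched at the end of Section 9.2.5 below.

```python
from itertools import combinations

def learn_taxonomy(class_dists, divergence):
    """Agglomeratively build a taxonomy over primitive attributes (or words).

    class_dists: dict mapping each primitive attribute/word to the class
        distribution associated with it (a list of probabilities).
    divergence:  a pair-wise divergence function over two such distributions,
        e.g. Jensen-Shannon divergence.
    Returns the list of merges (parent, left_child, right_child); the cut
    starts as the set of primitive attributes and ends as the single root.
    """
    cut = dict(class_dists)              # current cut (gamma) through T
    merges = []
    while len(cut) > 1:
        # merge the pair in the cut whose class distributions are most similar
        a, b = min(combinations(cut, 2),
                   key=lambda pair: divergence(cut[pair[0]], cut[pair[1]]))
        parent = f"({a}+{b})"            # naming mirrors the labels in Figure 9.10
        # class distribution of the abstract node: a simple average here;
        # the actual learner re-estimates it from aggregated frequency counts
        cut[parent] = [(x + y) / 2.0 for x, y in zip(cut[a], cut[b])]
        merges.append((parent, a, b))
        del cut[a], cut[b]
    return merges
```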
9.2.5 Pair-Wise Divergence Measures
Similarities between two probability distributions can be measured by several methods. We tested 13 divergence measures for probability distributions P and Q, including J-divergence, Jensen–Shannon divergence, and arithmetic and geometric mean divergence.
J-divergence (Topsøe, 2000). Also known as the Jeffreys–Kullback–Leibler divergence, this measure is a symmetric version of the Kullback–Leibler (KL) divergence. The J-divergence between two probability distributions P and Q is defined as:

$$J(P \,\|\, Q) = K(P \,\|\, Q) + K(Q \,\|\, P) = \sum_i (p_i - q_i)\log\frac{p_i}{q_i}$$

where the Kullback–Leibler divergence, also known as relative information, directed divergence, relative entropy, and the function of discrimination, is given by:

$$K(P \,\|\, Q) = \sum_i p_i \log\frac{p_i}{q_i}$$
The Kullback–Leibler divergence is a natural measure of the dissimilarity of distributions. It is nonnegative and reflexive, but it is asymmetric and does not satisfy the triangle inequality.
Jensen–Shannon divergence (Slonim et al., 2006). This is a weighted information gain measure, also called the Jensen difference divergence, information radius, and Sibson–Burbea–Rao–Jensen–Shannon divergence:

$$I(P \,\|\, Q) = \frac{1}{2}\left[\sum_i p_i \log\frac{2p_i}{p_i + q_i} + \sum_i q_i \log\frac{2q_i}{p_i + q_i}\right]$$
Jensen–Shannon divergence is reflexive, symmetric, and bounded. Figure 9.10 shows an AVT of an odor attribute generated by AVT learner (with binary clustering).
Figure 9.10 AVT of the odor attribute of UCI's AGARICUS-LEPIOTA mushroom data set generated by AVT learner using Jensen–Shannon divergence (binary clustering).
Arithmetic and geometric (A&G) mean divergence (Taneja, 1995). Popularly known as the backward Jensen–Shannon divergence, it is defined as:

$$T(P \,\|\, Q) = \sum_i \frac{p_i + q_i}{2}\,\log\frac{p_i + q_i}{2\sqrt{p_i q_i}}$$
It is the KL divergence between the arithmetic and geometric means of the two distributions. Since the results from the different symmetric divergence measures do not reveal remarkable differences, we limit the discussion to the Jensen–Shannon divergence measure.
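For reference, the symmetric measures above follow directly from their definitions. The sketch below is our own illustration (the function names, the use of natural logarithms, and the skipping of zero-probability terms are assumptions); any of these functions can be passed as the divergence argument of the learn_taxonomy sketch given earlier, and learn_taxonomy(class_dists, jensen_shannon) would perform the kind of Jensen–Shannon-based grouping illustrated in Figure 9.10.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence K(P||Q) over discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j_divergence(p, q):
    """J-divergence: J(P||Q) = K(P||Q) + K(Q||P)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence I(P||Q): average KL to the midpoint distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def ag_mean_divergence(p, q):
    """A&G mean divergence T(P||Q): KL between the arithmetic and geometric means."""
    # terms where either probability is zero are skipped for simplicity
    return sum(((pi + qi) / 2.0) * math.log((pi + qi) / (2.0 * math.sqrt(pi * qi)))
               for pi, qi in zip(p, q) if pi > 0 and qi > 0)
```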
9.3 Evaluation of Taxonomies
Our approach to evaluating AVTs generated by AVT learner is based on the observation that an AVT that captures relevant relationships among attribute values can yield simple and accurate classifiers from data, just as an appropriate choice of axioms in a mathematical domain can simplify proofs of theorems. Thus, the simplicity and predictive accuracy of the classifiers learned with alternative choices of AVT can be used to evaluate the utility of the corresponding AVT in a specific context. Similar arguments apply to WTL and PAT learner. For evaluation, it is necessary to discuss the learning algorithms that can exploit taxonomies. We describe AVT-NBL (Zhang and Honavar, 2004) for structured multivariate data, WTNBL-MN (Kang et al., 2005) for unstructured data, PAT-DTL (Kang and Sohn, 2009), and PAT-NBL.
9.3.1 AVT-Guided Variants of Standard Learning Algorithms
To make standard learning algorithms taxonomy-aware, we extend them in principled ways to exploit the information provided by an AVT. AVT-DTL (Yamazaki et al., 1995; Zhang et al., 2002; Zhang and Honavar, 2003) and AVT-NBL (Zhang and Honavar, 2004), which extend the decision tree learning algorithm (Quinlan, 1993) and the naive Bayes learning algorithm (Langley et al., 1992) respectively, are examples of such algorithms.
The basic idea behind AVT-NBL is to start with a naive Bayes classifier based on the most abstract attribute values in the AVTs and to successively refine the classifier using a scoring function, the conditional minimum description length (CMDL) score suggested by Friedman et al. (1997), which captures the trade-off between the accuracy of classification and the complexity of the resulting naive Bayes classifier.
The experiments reported by Zhang and Honavar (2004) on several benchmark data sets show that AVT-NBL, when supplied with human-generated AVTs, learns substantially more accurate classifiers than the naive Bayes learner (NBL) applied directly to the data sets, and than NBL applied to data sets represented by a set of binary features that correspond to the nodes of the AVT (PROP-NBL). The classifiers generated by AVT-NBL are also substantially more compact than those generated by NBL and PROP-NBL, and these results hold across a wide range of missing attribute values in the data sets. Hence, the performance of the naive Bayes classifiers generated by AVT-NBL when supplied with AVTs produced by AVT learner provides a useful measure of the effectiveness of AVT learner in discovering hierarchical groupings of attribute values that yield compact and accurate classifiers from data.
9.3.2 WTNBL-MN Algorithm
If you understand the underlying idea of AVT-guided variants of standard learning algorithms, it is easy to understand the other taxonomy-aware algorithms because they are similar. The problem of learning classifiers from a word taxonomy and sequence data is a natural generalization of the problem of learning classifiers from sequence data. An original data set D is a collection of labeled instances <I_j, C_j> where I_j ∈ I. A classifier is a hypothesis in the form of a function h: I → C, whose domain is the instance space I and whose range is the set of classes C. A hypothesis space H is a set of hypotheses that can be represented in some hypothesis language or by a parameterized family of functions. The task of learning classifiers from the data set D is to induce a hypothesis h ∈ H that satisfies given criteria.
Learning classifiers from a word taxonomy and data can be described by assuming a word taxonomy T_Σ over the words Σ and a data set D. The aim is to induce a classifier h_γ*: I_γ* → C, where γ* is a cut that maximizes the given criteria. Note that the hypothesis space H_γ̂ induced by a chosen cut γ̂ allows an efficient search for hypotheses that are both concise and accurate. Word taxonomy-guided NBL has two major components: (1) estimation of the parameters of naive Bayes classifiers based on a cut, and (2) a criterion for refining a cut.
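To make the two components concrete, the following sketch (illustrative only; the data layout, parameter names, and the use of log probabilities are our assumptions) classifies a word sequence under a given cut by abstracting each primitive word to its ancestor in the cut and applying the multinomial naive Bayes decision rule:

```python
def classify(words, cut_ancestor, log_priors, log_p):
    """Multinomial naive Bayes classification under a cut of the word taxonomy.

    words:        list of primitive words in the instance.
    cut_ancestor: dict mapping each primitive word to its ancestor node in the cut.
    log_priors:   dict mapping class label -> log P(c).
    log_p:        nested dict, log_p[c][node] = log p(node | c) for nodes in the cut.
    Returns the class with the highest posterior score; the multinomial
    coefficient is class independent and is therefore omitted.
    """
    abstracted = [cut_ancestor[w] for w in words if w in cut_ancestor]
    scores = {c: lp + sum(log_p[c][node] for node in abstracted)
              for c, lp in log_priors.items()}
    return max(scores, key=scores.get)
```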
9.3.3 Aggregation of Class Conditional Frequency Counts
We can estimate the relevant parameters of a naive Bayes classifier efficiently by aggregating class conditional frequency counts. For a particular node of a given cut, the parameters of the node can be estimated by summing the class conditional frequency counts of its children (Zhang and Honavar, 2004).
Given a word taxonomy T_Σ, we can define a tree of class conditional frequency counts T_f so that there is a one-to-one correspondence between the nodes of the word taxonomy T_Σ and the nodes of the corresponding T_f. The class conditional frequency counts associated with a nonleaf node of T_f are aggregations of the corresponding class conditional frequency counts associated with its children. Because a cut through the word taxonomy corresponds to a partition of the set of words, the corresponding cut through T_f specifies a valid class conditional probability table for words. Therefore, to estimate each node of T_f, we simply estimate the class conditional frequency counts of the primitive words in Σ that correspond to the leaves of T_f. Then we aggregate them recursively to calculate the class conditional frequency counts associated with their parent nodes.
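This bottom-up aggregation can be sketched as a short recursion (our own illustration; the node structure with a children list, the word attribute at leaves, and the use of Counter objects are assumptions):

```python
from collections import Counter

def aggregate_counts(node, leaf_counts):
    """Recursively compute class conditional frequency counts for each node of T_f.

    node:        a node of the word taxonomy; node.children is empty for a leaf
                 and node.word names the primitive word at that leaf.
    leaf_counts: dict mapping primitive word -> Counter {class label: count},
                 estimated directly from the training data.
    Attaches the aggregated Counter to node.counts and returns it.
    """
    if not node.children:                  # leaf: counts come straight from the data
        node.counts = Counter(leaf_counts[node.word])
    else:                                  # internal node: sum the children's counts
        node.counts = Counter()
        for child in node.children:
            node.counts += aggregate_counts(child, leaf_counts)
    return node.counts
```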
9.3.3.1 Multinomial Model for Representing Text and Sequence
In a multinomial model, a sequence is represented as a vector of word occurrence frequencies f_{i,j}. The probability of an instance I_j (written d_j below) given its class c_j is defined as follows:

$$P(d_j \mid c_j) = \frac{\left(\sum_i^{|\Sigma|} f_{i,j}\right)!}{\prod_i^{|\Sigma|} (f_{i,j})!}\;\prod_i^{|\Sigma|} p_{i,j}^{\,f_{i,j}} \qquad (9\text{-}1)$$
The term $\left(\sum_i^{|\Sigma|} f_{i,j}\right)! \big/ \prod_i^{|\Sigma|} (f_{i,j})!$ represents the number of possible combinations of words for the instance I_j. In Equation 9-1, p_{i,j} is calculated as:

$$p_{i,j} = \frac{\mathrm{Count}(c_j, w_i)}{\mathrm{Count}(c_j)}$$
Count(c_j, w_i) is the number of times the word w_i appears in all the instances that have class label c_j, and Count(c_j) is the total number of words in a particular class label c_j.
With Laplacian smoothing, p_{i,j} becomes:

$$p_{i,j} = \frac{1 + \mathrm{Count}(c_j, w_i)}{|\Sigma| + \mathrm{Count}(c_j)}$$
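As a concrete illustration, the Laplace-smoothed estimates can be computed from labeled word sequences as in the sketch below (our own example; the data layout and function name are assumptions, and the vocabulary plays the role of Σ):

```python
from collections import Counter, defaultdict

def estimate_multinomial_params(instances, vocabulary):
    """Estimate class conditional word probabilities p_{i,j} with Laplace smoothing.

    instances:  iterable of (word_list, class_label) pairs.
    vocabulary: the set of words (Sigma).
    Returns p[label][word] = (1 + Count(c, w)) / (|Sigma| + Count(c)).
    """
    word_counts = defaultdict(Counter)     # Count(c_j, w_i)
    total_words = Counter()                # Count(c_j)
    for words, label in instances:
        in_vocab = [w for w in words if w in vocabulary]
        word_counts[label].update(in_vocab)
        total_words[label] += len(in_vocab)

    p = {}
    for label in word_counts:
        denom = len(vocabulary) + total_words[label]
        p[label] = {w: (1 + word_counts[label][w]) / denom for w in vocabulary}
    return p
```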
Alternatively, if we follow the Dirichlet prior, p_{v,c_i} will be:

$$p_{v,c_i} = \frac{L/|v| + \mathrm{Count}(c_i, v)}{L + \mathrm{Count}(c_i)}$$

where L is an average length and |v| is the number of values. If we consider the number of words in an instance (i.e., the document length) (McCallum and Nigam, 1998) and assume that document length is independent of class for simplicity, we get:
$$P(\{v_i\} \mid c) = P(|d|)\,|d|!\;\prod_i^{|v|} \frac{p_{v_i,c}^{\,v_i}}{v_i!} \qquad (9\text{-}2)$$

where $|d| = \sum_i^{|v|} v_i$ is the number of words in a particular instance (the document length). In practice, document length may be class dependent: