Data-Driven Evaluation of Ontologies ◾  241
dierence between the two was at a minimum. Precision and recall of text catego-
rization are dened as:
Precision = |detected documents in the category (true positives)| / |detected documents (true positives + false positives)|

Recall = |detected documents in the category (true positives)| / |documents in the category (true positives + false negatives)|
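As a minimal illustration of these definitions (a hedged sketch; the counts and the function name are ours, not the chapter's implementation):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    # Precision: fraction of detected documents that truly belong to the category.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of documents in the category that were detected.
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example: 8 correct detections, 2 spurious detections, 8 missed documents.
print(precision_recall(8, 2, 8))  # (0.8, 0.5)
```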
Table9.4 shows the break-even point of precision and recall and the size of the
classier (from Equation 9-6) for the 10 most frequent categories. WTNBL-MN
usually shows similar performance in terms of break-even performance, except in
the case of the corn category, while the classiers generated by WTNBL-MN were
smaller than those generated by NBL. Figure9.18 shows the precision–recall curve
(Fawcett, 2003, 2006) for the grain category. WTNBL-MN generated a naive
Bayes classier that is more compact than (but performs comparable to) the classi-
er generated by NBL.
Table9.4 Break-Even Points of Classifiers from Use of NBL-MN and
WTNBL-MN on 10 Largest Categories of Reuters 21578 Data
Data
NBL-MN WTNBL-MN
Number of
Documents
Break-even Size Break-even Size Train Test
Earn 94.94 602 94.57 348 2877 1087
Acq 89.43 602 89.43 472 1650 719
Money-fx 64.80 602 65.36 346 538 179
Grain 74.50 602 77.85 198 433 149
Crude 79.89 602 76.72 182 389 189
Trade 59.83 602 47.01 208 369 118
Interest 61.07 602 59.54 366 347 131
Ship 82.02 602 82.02 348 197 89
Wheat 57.75 602 53.52 226 212 71
Corn 57.14 602 21.43 106 182 56
Average (top 5) 80.71 602 80.79 309.20
Average (top 10) 72.14 602 66.75 280
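The break-even point reported in Table 9.4 is the point on the precision–recall curve where precision and recall are (approximately) equal. A minimal sketch of how such a point can be located on a sampled curve (the function and the sample values are illustrative assumptions, not the chapter's code):

```python
def break_even_point(precisions, recalls):
    # Pick the sampled (precision, recall) pair whose two values are
    # closest, and report their midpoint as the approximate break-even value.
    p, r = min(zip(precisions, recalls), key=lambda pr: abs(pr[0] - pr[1]))
    return (p + r) / 2.0

# Illustrative sampled curve: precision falls as recall rises.
print(break_even_point([0.9, 0.8, 0.7], [0.5, 0.75, 0.9]))  # 0.775
```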
WTNBL-MN did not show good performance for the corn category, possibly because conditional minimum description length trades off the accuracy of the model against its complexity, which may not necessarily optimize precision and recall for a particular class. As a consequence, WTNBL-MN may terminate refinement of the classifier prematurely for class labels with low support, i.e., when the data set is unbalanced.
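This trade-off can be sketched with a toy conditional-MDL-style score; the exact scoring function used by WTNBL-MN differs, so the penalty form and the numbers below are illustrative assumptions only:

```python
import math

def cmdl_style_score(cond_log_likelihood, n_params, n_examples):
    # Higher is better: fit (conditional log-likelihood) minus a
    # complexity penalty that grows with model size but only
    # logarithmically with the number of examples.
    return cond_log_likelihood - (n_params / 2.0) * math.log(n_examples)

# On a small, unbalanced class (here 56 examples, a made-up count), a refined
# model with a better fit (-35 vs. -40) but more parameters (30 vs. 10)
# can still score worse, so refinement stops early.
coarse = cmdl_style_score(-40.0, 10, 56)
refined = cmdl_style_score(-35.0, 30, 56)
print(coarse > refined)  # True
```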
9.4.2.2 Protein Sequences
We applied the WTNBL-MN algorithm to two protein data sets with a view to identifying their localizations (Reinhardt and Hubbard, 1998; Andorf et al., 2006).
The first data set contained 997 prokaryotic protein sequences derived from the SWISS-PROT database (Bairoch and Apweiler, 2000). It included proteins from three subcellular locations: cytoplasmic (688 proteins), periplasmic (202 proteins), and extracellular (107 proteins).
The second data set contained 2,427 eukaryotic protein sequences derived from SWISS-PROT (Bairoch and Apweiler, 2000) and included proteins from four subcellular locations: nuclear (1,097 proteins), cytoplasmic (684 proteins), mitochondrial (321 proteins), and extracellular (325 proteins).
For these data sets,* we conducted 10-fold cross validation. To measure performance, the following measures (Yan et al., 2004) were applied and the results for the data sets are reported:

* These datasets are available for download at http://www.doe-mbi.ucla.edu/~astrid/astrid.html
Figure 9.18 Precision–recall curves for the Grain category (naive Bayes multinomial vs. WTNBL-MN; recall on the x-axis, precision on the y-axis).
Correlation coefficient = (TP × TN - FP × FN) / sqrt((TP + FN)(TP + FP)(TN + FP)(TN + FN))

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity+ = TP / (TP + FN)

Specificity+ = TP / (TP + FP)
TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. Figure 9.19 shows the amino acid taxonomy constructed for the prokaryotic protein sequences. Table 9.5 shows the results for the two protein sequence data sets. For both data sets, the classifiers generated by WTNBL were more concise and performed more accurately than the classifiers generated by NBL, based on the measures reported.
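These four measures can be computed directly from the confusion counts. The following helper is a sketch written against the definitions above (note that specificity+ here is TP/(TP+FP), as the text defines it, not the usual TN-based specificity):

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    # Matthews-style correlation coefficient over the four counts.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity_pos = tp / (tp + fn)   # sensitivity+ for the positive class
    specificity_pos = tp / (tp + fp)   # specificity+ as defined in the text
    return mcc, accuracy, sensitivity_pos, specificity_pos

mcc, acc, sens, spec = confusion_metrics(50, 10, 30, 10)
print(round(acc, 2), round(mcc, 4))  # 0.8 0.5833
```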
9.4.3 Experiments for PAT
9.4.3.1 Experimental Results from PAT-DTL
In this section, we explore certain performance issues of the proposed algorithms through various experimental settings: (1) performance of PAT-DTL compared with that of the C4.5 decision tree learner, to see whether taxonomies (as ontologies) can help the algorithm improve performance; (2) dissimilarity measures for comparing two probability distributions, to see whether the algorithm assesses taxonomies from different disciplines; and (3) comprehensibility of the hypothesis, to see whether humans can comprehend the generated hypothesis.

Comparison with C4.5 decision tree learner. We conducted experiments on 37 data sets from the UCI Machine Learning Repository (Blake and Merz, 1998). We tested four settings: (1) C4.5 (Quinlan, 1993) decision tree learner on the original attributes, (2) C4.5 decision tree learner on propositionalized attributes, (3) PAT-DTL with abstraction, and (4) PAT-DTL with refinement. Ten-fold cross-validation was used for evaluation. Taxonomies were generated using the PAT learner, and a decision tree was constructed using PAT-DTL on the resulting PAT and data.
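Ten-fold cross-validation partitions the data into ten disjoint folds, training on nine and testing on the held-out one in turn. A minimal index-splitting sketch (a generic helper, not the chapter's experimental harness):

```python
def k_fold_indices(n_examples, k=10):
    # Yield (train_indices, test_indices) pairs for k disjoint folds,
    # assigning example i to fold i mod k.
    for fold in range(k):
        test_idx = list(range(fold, n_examples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_examples) if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(20, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2
```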
The results of the experiments indicate that none of the algorithms showed the highest accuracy over most data sets. Table 9.6 shows classifier accuracy and tree size on UCI data sets for the C4.5 decision tree learner on the original attributes, C4.5 decision
Figure 9.19 Taxonomy from prokaryotic protein localization sequences constructed by WTL. (The figure shows an is-a hierarchy over the 20 amino acid symbols, built from the bag-of-words attribute of the subcell2prodata data, with class labels cytoplasmicPro, periplasmicPro, and extracellularPro.)
Table9.5 Localization Prediction Results of Use of NBL-MN and
WTNBL-MN on Prokaryotic (a) and Eukaryotic (b) Protein Sequences
(a) Prokaryotic protein sequences
NBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Cytoplasmic 71.96±2.79 88.26±2.00 89.60±1.89 93.90±1.49 42
Extracellular 70.57±2.83 93.58±1.52 65.93±2.94 83.18±2.32 42
Periplasmic 51.31±3.10 81.85±2.39 53.85±3.09 72.77±2.76 42
WTNBL-MN
Correlation
Coefficient Accuracy Specificity+ Sensitivity+ Size
Cytoplasmic 72.43±2.77 88.47±1.98 89.63±1.89 94.19±1.45 20
Extracellular 69.31±2.86 93.18±1.56 64.03±2.98 83.18±2.32 20
Periplasmic 51.53±3.10 81.85±2.39 53.82±3.09 73.27±2.75 40
(b) Eukaryotic protein sequences
NBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Nuclear 61.00±1.94 80.72±1.57 82.06±1.53 73.38±1.76 46
Extracellular 36.83±1.92 83.11±1.49 40.23±1.95 53.85±1.98 46
Mitochondrial 25.13±1.73 71.69±1.79 25.85±1.74 61.06±1.94 46
Cytoplasmic 44.05±1.98 71.41±1.80 49.55±1.99 81.29±1.55 46
WTNBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Nuclear 60.82±1.94 80.63±1.57 81.70±1.54 73.66±1.75 24
Extracellular 38.21±1.93 84.01±1.46 42.30±1.97 53.23±1.99 36
Mitochondrial 25.48±1.73 72.35±1.78 26.29±1.75 60.44±1.95 34
Cytoplasmic 43.46±1.97 71.24±1.80 49.37±1.99 80.56±1.57 32
Error rates calculated by 10-fold cross validation with 95% confidence interval.
Note: ‘+’ symbol after specificity and sensitivity means the criteria are measured
for the positive class label.