Data-Driven Evaluation of Ontologies ◾  241
dierence between the two was at a minimum. Precision and recall of text catego-
rization are dened as:
Precision = |detected documents in the category (true positives)| / |detected documents (true positives + false positives)|

Recall = |detected documents in the category (true positives)| / |documents in the category (true positives + false negatives)|
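As a minimal illustration of these definitions (a hedged sketch; the counts and the function name are ours, not the chapter's implementation):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    # Precision: fraction of detected documents that truly belong to the category.
    precision = true_positives / (true_positives + false_positives)
    # Recall: fraction of documents in the category that were detected.
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example: 8 correct detections, 2 spurious detections, 8 missed documents.
print(precision_recall(8, 2, 8))  # (0.8, 0.5)
```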
Table9.4 shows the break-even point of precision and recall and the size of the
classier (from Equation 9-6) for the 10 most frequent categories. WTNBL-MN
usually shows similar performance in terms of break-even performance, except in
the case of the corn category, while the classiers generated by WTNBL-MN were
smaller than those generated by NBL. Figure9.18 shows the precision–recall curve
(Fawcett, 2003, 2006) for the grain category. WTNBL-MN generated a naive
Bayes classier that is more compact than (but performs comparable to) the classi-
er generated by NBL.
Table9.4 Break-Even Points of Classifiers from Use of NBL-MN and
WTNBL-MN on 10 Largest Categories of Reuters 21578 Data
Data
NBL-MN WTNBL-MN
Number of
Documents
Break-even Size Break-even Size Train Test
Earn 94.94 602 94.57 348 2877 1087
Acq 89.43 602 89.43 472 1650 719
Money-fx 64.80 602 65.36 346 538 179
Grain 74.50 602 77.85 198 433 149
Crude 79.89 602 76.72 182 389 189
Trade 59.83 602 47.01 208 369 118
Interest 61.07 602 59.54 366 347 131
Ship 82.02 602 82.02 348 197 89
Wheat 57.75 602 53.52 226 212 71
Corn 57.14 602 21.43 106 182 56
Average (top 5) 80.71 602 80.79 309.20
Average (top 10) 72.14 602 66.75 280
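The break-even point reported in Table 9.4 is the point on the precision–recall curve where precision and recall are (approximately) equal. A minimal sketch of how such a point can be located on a sampled curve (the function and the sample values are illustrative assumptions, not the chapter's code):

```python
def break_even_point(precisions, recalls):
    # Pick the sampled (precision, recall) pair whose two values are
    # closest, and report their midpoint as the approximate break-even value.
    p, r = min(zip(precisions, recalls), key=lambda pr: abs(pr[0] - pr[1]))
    return (p + r) / 2.0

# Illustrative sampled curve: precision falls as recall rises.
print(break_even_point([0.9, 0.8, 0.7], [0.5, 0.75, 0.9]))  # 0.775
```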
WTNBL-MN did not show good performance for the corn category, possibly because conditional minimum description length trades off the accuracy of the model against its complexity, which may not necessarily optimize precision and recall for a particular class. As a consequence, WTNBL-MN may terminate refinement of the classifier prematurely for class labels with low support, i.e., when the data set is unbalanced.
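This trade-off can be sketched with a toy conditional-MDL-style score; the exact scoring function used by WTNBL-MN differs, so the penalty form and the numbers below are illustrative assumptions only:

```python
import math

def cmdl_style_score(cond_log_likelihood, n_params, n_examples):
    # Higher is better: fit (conditional log-likelihood) minus a
    # complexity penalty that grows with model size but only
    # logarithmically with the number of examples.
    return cond_log_likelihood - (n_params / 2.0) * math.log(n_examples)

# On a small, unbalanced class (here 56 examples, a made-up count), a refined
# model with a better fit (-35 vs. -40) but more parameters (30 vs. 10)
# can still score worse, so refinement stops early.
coarse = cmdl_style_score(-40.0, 10, 56)
refined = cmdl_style_score(-35.0, 30, 56)
print(coarse > refined)  # True
```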
9.4.2.2 Protein Sequences
We applied the WTNBL-MN algorithm to two protein data sets with a view to identifying their localizations (Reinhardt and Hubbard, 1998; Andorf et al., 2006).
The first data set contained 997 prokaryotic protein sequences derived from the SWISS-PROT database (Bairoch and Apweiler, 2000). It included proteins from three subcellular locations: cytoplasmic (688 proteins), periplasmic (202 proteins), and extracellular (107 proteins).
The second data set contained 2,427 eukaryotic protein sequences derived from SWISS-PROT (Bairoch and Apweiler, 2000) and included proteins from four subcellular locations: nuclear (1,097 proteins), cytoplasmic (684 proteins), mitochondrial (321 proteins), and extracellular (325 proteins).
For these data sets,* we conducted 10-fold cross validation. To measure performance, the following measures (Yan et al., 2004) were applied and the results for the data sets are reported:

* These datasets are available for download at http://www.doe-mbi.ucla.edu/~astrid/astrid.html
Figure 9.18 Precision–recall curves for the Grain category (naive Bayes multinomial vs. WTNBL-MN; recall on the x-axis, precision on the y-axis).
Correlation coefficient = (TP × TN - FP × FN) / sqrt((TP + FN)(TP + FP)(TN + FP)(TN + FN))

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity+ = TP / (TP + FN)

Specificity+ = TP / (TP + FP)
TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. Figure 9.19 shows the amino acid taxonomy constructed for the prokaryotic protein sequences. Table 9.5 shows the results for the two protein sequence data sets. For both data sets, the classifiers generated by WTNBL were more concise and performed more accurately than the classifiers generated by NBL, based on the measures reported.
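These four measures can be computed directly from the confusion counts. The following helper is a sketch written against the definitions above (note that specificity+ here is TP/(TP+FP), as the text defines it, not the usual TN-based specificity):

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    # Matthews-style correlation coefficient over the four counts.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity_pos = tp / (tp + fn)   # sensitivity+ for the positive class
    specificity_pos = tp / (tp + fp)   # specificity+ as defined in the text
    return mcc, accuracy, sensitivity_pos, specificity_pos

mcc, acc, sens, spec = confusion_metrics(50, 10, 30, 10)
print(round(acc, 2), round(mcc, 4))  # 0.8 0.5833
```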
9.4.3 Experiments for PAT
9.4.3.1 Experimental Results from PAT-DTL
In this section, we explore certain performance issues of the proposed algorithms through various experimental settings: (1) performance of PAT-DTL compared with that of the C4.5 decision tree learner, to see whether taxonomies (as ontologies) can help the algorithm improve performance; (2) dissimilarity measures for comparing two probability distributions, to see whether the algorithm assesses taxonomies from different disciplines; and (3) comprehensibility of the hypothesis, to see whether humans can comprehend the generated hypothesis.

Comparison with C4.5 decision tree learner. We conducted experiments on 37 data sets from the UCI Machine Learning Repository (Blake and Merz, 1998). We tested four settings: (1) C4.5 (Quinlan, 1993) decision tree learner on the original attributes, (2) C4.5 decision tree learner on propositionalized attributes, (3) PAT-DTL with abstraction, and (4) PAT-DTL with refinement. Ten-fold cross-validation was used for evaluation. Taxonomies were generated using the PAT learner, and a decision tree was constructed using PAT-DTL on the resulting PAT and data.
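Ten-fold cross-validation partitions the data into ten disjoint folds, training on nine and testing on the held-out one in turn. A minimal index-splitting sketch (a generic helper, not the chapter's experimental harness):

```python
def k_fold_indices(n_examples, k=10):
    # Yield (train_indices, test_indices) pairs for k disjoint folds,
    # assigning example i to fold i mod k.
    for fold in range(k):
        test_idx = list(range(fold, n_examples, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n_examples) if i not in held_out]
        yield train_idx, test_idx

folds = list(k_fold_indices(20, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 18 2
```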
The results of the experiments indicate that none of the algorithms showed the highest accuracy over most data sets. Table 9.6 shows classifier accuracy and tree size on UCI data sets for the C4.5 decision tree learner on the original attributes, C4.5 decision
Figure 9.19 Taxonomy from prokaryotic protein localization sequences constructed by WTL. (The figure shows an is-a hierarchy over the 20 amino acid symbols, built from the bag-of-words attribute of the subcell2prodata data, with class labels cytoplasmicPro, periplasmicPro, and extracellularPro.)
Table9.5 Localization Prediction Results of Use of NBL-MN and
WTNBL-MN on Prokaryotic (a) and Eukaryotic (b) Protein Sequences
(a) Prokaryotic protein sequences
NBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Cytoplasmic 71.96±2.79 88.26±2.00 89.60±1.89 93.90±1.49 42
Extracellular 70.57±2.83 93.58±1.52 65.93±2.94 83.18±2.32 42
Periplasmic 51.31±3.10 81.85±2.39 53.85±3.09 72.77±2.76 42
WTNBL-MN
Correlation
Coefficient Accuracy Specificity+ Sensitivity+ Size
Cytoplasmic 72.43±2.77 88.47±1.98 89.63±1.89 94.19±1.45 20
Extracellular 69.31±2.86 93.18±1.56 64.03±2.98 83.18±2.32 20
Periplasmic 51.53±3.10 81.85±2.39 53.82±3.09 73.27±2.75 40
(b) Eukaryotic protein sequences
NBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Nuclear 61.00±1.94 80.72±1.57 82.06±1.53 73.38±1.76 46
Extracellular 36.83±1.92 83.11±1.49 40.23±1.95 53.85±1.98 46
Mitochondrial 25.13±1.73 71.69±1.79 25.85±1.74 61.06±1.94 46
Cytoplasmic 44.05±1.98 71.41±1.80 49.55±1.99 81.29±1.55 46
WTNBL-MN
Correlation
Coefficient Accuracy Specificity
+
Sensitivity
+
Size
Nuclear 60.82±1.94 80.63±1.57 81.70±1.54 73.66±1.75 24
Extracellular 38.21±1.93 84.01±1.46 42.30±1.97 53.23±1.99 36
Mitochondrial 25.48±1.73 72.35±1.78 26.29±1.75 60.44±1.95 34
Cytoplasmic 43.46±1.97 71.24±1.80 49.37±1.99 80.56±1.57 32
Error rates calculated by 10-fold cross validation with 95% confidence interval.
Note: ‘+’ symbol after specificity and sensitivity means the criteria are measured
for the positive class label.