Data-Driven Evaluation of Ontologies ◾  251
[Figure 9.20(b): Decision tree learned by C4.5 and PAT-DTL (with specialization) for Cleveland Clinic Foundation's Heart Disease (heart-c) data. Panel (b): PAT-DTL with refinement (accuracy = 82.84 ± 0.06, tree size = 5). Node tests shown: (ca > 0.5) (thal = reversable_defect) and (thalach ≤ 147.5) (cp = asympt), with False/True branches. Leaves shown: <50 (103.0/14.0), <50 (38.0/11.0), >50_1 (102.0/16.0).]
252 ◾  Dae-Ki Kang
Table9.7 Running Times (Minutes:Seconds) of DTL (C4.5 Decision Tree
Learner) for Original and Propositionalized Data and PAT-DTL with
Abstraction and Refinement on UCI Data Sets
Data
DTL
(Original)
DTL
(Propositionalized)
PAT-DTL
(Abstraction)
PAT-DTL
(Refinement)
Anneal 00:01.69 00:02.91 22:55.76 03:07.08
Audiology 00:01.60 00:02.24 18:31.92 00:27.70
Autos 00:01.29 00:01.81 10:45.95 00:24.68
Balance-scale 00:00.95 00:00.85 00:06.30 00:04.86
Breast-cancer 00:01.06 00:06.90 04:27.15 00:40.29
Breast-w 00:01.20 00:01.13 01:23.71 00:13.82
Car 00:01.31 00:01.75 01:12.98 00:27.00
Colic 00:01.05 00:01.35 05:54.09 00:17.09
Credit-a 00:01.09 00:01.83 04:36.33 00:18.07
Credit-g 00:01.49 00:03.60 16:53.16 00:30.69
Dermatology 00:01.05 00:01.42 32:59.23 00:31.77
Diabetes 00:00.97 00:01.13 00:34.93 00:05.18
Glass 00:00.92 00:00.98 00:13.50 00:04.14
Heart-c 00:01.24 00:01.16 00:30.92 00:05.70
Heart-h 00:01.48 00:01.66 00:26.08 00:07.10
Heart-statlog 00:01.17 00:01.43 00:24.71 00:08.94
Hepatitis 00:01.05 00:01.04 00:43.84 00:06.34
Hypothyroid 00:03.26 00:05.63 33:05.69 08:54.18
Ionosphere 00:01.48 00:02.22 60:29.00 01:05.04
Iris 00:00.84 00:00.99 00:03.76 00:03.64
Kr-vs-kp 00:04.09 00:05.52 60:56.00 02:21.48
Labor 00:00.85 00:00.87 00:13.60 00:02.96
Letter 00:14.10 04:20.12 840:57.00 420:00.12
Lymph 00:00.80 00:00.86 00:49.24 00:03.65
Mushroom 00:02.84 00:22.05 563:58.00 11:07.54
Examination of the results of the experiments shown in Table 9.8 indicates that all
three divergence measures (JKL, JS, and AGM) yielded PATs that, when used by
PAT-NBL, produced classifiers with similar accuracy. Of the 13 divergence measures
we tested, Hellinger discrimination (Topsøe, 2000), symmetric diversion, and
triangular discrimination (Topsøe, 2000) showed similar performance. Thus, the PAT
learner appears able to use a broad class of measures of similarity of attribute values,
based on the class distributions associated with the respective values, to generate PATs
that are useful for constructing compact and accurate classifiers from data.
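
To make the three divergences named above concrete, the following is a minimal sketch that compares the class distributions associated with two attribute values. The formulas are the commonly used definitions of the Jeffreys–Kullback–Leibler (J), Jensen–Shannon, and arithmetic-geometric mean divergences; the chapter does not restate them at this point, so treating them as the exact forms used by the PAT learner is an assumption. The function names and the two example distributions are hypothetical, and the distributions are assumed to be strictly positive (for example, after smoothing).

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats (assumes q_i > 0 wherever p_i > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jeffreys(p, q):
    """Jeffreys-Kullback-Leibler (J) divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def agm(p, q):
    """Arithmetic-geometric mean divergence:
    sum of a_i * ln(a_i / g_i), with a_i the arithmetic and g_i the geometric mean."""
    total = 0.0
    for pi, qi in zip(p, q):
        a = (pi + qi) / 2
        g = math.sqrt(pi * qi)
        total += a * math.log(a / g)
    return total

# Hypothetical class distributions P(class | attribute value) for two values.
p = [0.9, 0.1]
q = [0.5, 0.5]
for name, fn in [("JKL", jeffreys), ("JS", jensen_shannon), ("AGM", agm)]:
    print(name, fn(p, q))
```

All three measures are symmetric and equal zero when the two distributions coincide, which is what makes them usable as dissimilarity scores between attribute values.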
Comprehensibility of hypothesis. Figure 9.21 shows an example of a PAT for
a subset of the attributes in the UCI repository's Balance Scale Weight and Distance
Database (balance scale). The leaf nodes (gray boxes) correspond to the original
attribute values of the balance scale data set, and the dotted lines show the original
attribute–value relationships. After propositionalization, each leaf node is treated
as an attribute. The solid lines represent ISA relationships; therefore the nodes joined
by solid lines form a taxonomy. If we remove all the dotted lines and the nodes inside
the dotted box in the figure, we can see the taxonomy more clearly.
The balance scale data set has only four attributes, but most data sets have
many more attributes, and their taxonomies are too big to fit on one page. To aid
Table9.7 (Continued) Running Times (Minutes:Seconds) of DTL
(C4.5Decision Tree Learner) for Original and Propositionalized Data and
PAT-DTL with Abstraction and Refinement on UCI Data Sets
Data
DTL
(Original)
DTL
(Propositionalized)
PAT-DTL
(Abstraction)
PAT-DTL
(Refinement)
Nursery 00:05.09 00:12.90 21:03.24 00:47.45
Primary-tumor 00:01.54 00:01.76 01:18.51 00:09.23
Segment 00:01.91 00:13.05 04:38.55 02:55.53
Sick 00:04.58 00:07.67 43:01.48 02:53.22
Sonar 00:01.06 00:01.09 03:53.15 00:19.08
Soybean 00:01.70 00:02.48 10:16.57 04:15.50
Splice 00:01.50 00:05.77 803:08.00 22:07.00
Vehicle 00:00.77 00:01.12 09:38.08 01:27.03
Vote 00:00.65 00:00.47 00:17.29 00:03.47
Vowel 00:00.63 00:01.69 05:00.22 01:12.13
Waveform-5000 00:02.22 00:17.27 917:01.00 55:23.00
Zoo 00:00.46 00:00.42 00:04.47 00:02.63
Table9.8 Accuracy and Tree Size of PAT-DTL with Refinement Coupled
with Divergences* on Selected UCI Data Sets
Data
PAT-DTL(JKL) PAT-DTL(JS) PAT-DTL(AGM)
Acccuracy Size Accuracy Size Accuracy Size
Anneal 92.43±1.73 5 90.31±1.93 11 76.17±2.79 1
Audiology 47.35±6.51 7 46.46±6.50 3 46.46±6.50 3
Autos 71.22±6.20 13 44.88±6.81 3 45.37±6.82 9
Balance-scale 72.80±3.49 7 73.92±3.44 11 74.40±3.42 7
Breast-cancer 70.28±5.30 1 73.43±5.12 7 69.23±5.35 5
Breast-w 97.14±1.24 5 96.85±1.29 5 97.14±1.24 5
Car 81.13±1.84 7 85.47±1.66 23 94.16±1.11 61
Colic 85.22±3.63 3 86.41±3.50 3 86.41±3.50 3
Credit-a 85.22±2.65 3 85.36±2.64 3 84.93±2.67 5
Credit-g 73.50±2.74 7 73.00±2.75 5 73.80±2.73 15
Dermatology 36.89±4.94 5 30.60±4.72 1 30.60±4.72 1
Diabetes 78.39±2.91 5 73.70±3.11 3 73.70±3.11 3
Glass 71.96±6.02 19 55.14±6.66 9 63.08±6.47 13
Heart-c 83.50±4.18 5 82.84±4.25 5 83.50±4.18 5
Heart-h 84.01±4.19 5 82.65±4.33 3 82.99±4.29 3
Heart-statlog 80.00±4.77 5 82.22±4.56 5 82.96±4.48 5
Hepatitis 83.23±5.88 5 83.23±5.88 7 83.23±5.88 5
Hypothyroid 99.13±0.30 11 97.00±0.54 7 97.22±0.52 7
Ionosphere 96.01±2.05 3 92.88±2.69 3 92.31±2.79 3
Iris 94.00±3.80 5 88.67±5.07 5 96.00±3.14 5
Kr-vs-kp 66.05±1.64 3 72.84±1.54 5 85.83±1.21 5
Labor 87.72±8.52 3 89.47±7.97 3 85.96±9.02 5
Letter 55.91±0.56 527 68.30±0.64 2047 67.22±0.65 3261
Lymph 53.38±8.04 3 77.03±6.78 7 73.65±7.10 3
Mushroom 99.70±0.12 3 98.52±0.26 3 99.41±0.17 3
understanding and interpretation of the taxonomy and the generated decision
tree, we show the Cleveland Clinic Foundation's Heart Disease (heart-c) data set
from the UCI repository. We already showed the decision trees of the heart-c data
set generated by C4.5 and PAT-DTL in Figure 9.20. Figure 9.22 shows the
attribute–value relationships for a subset of the attributes in heart-c. The data set has 13
attributes (age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and
thal). It is hard to show the attribute–value relationships and the taxonomy
for the original data set with 13 attributes clearly on one page, so we choose 4 (cp, thalach, ca,
and thal) of the 13 for presentation, based on the results in Figure 9.20(b).
Figure 9.22(a) shows the original attribute–value relationships of this subset of the
heart-c data set. After propositionalization (Figure 9.22(b)), each attribute value is
treated as an attribute that takes the value 0 or 1: if the original attribute has a certain value,
the corresponding propositionalized attribute has the value 1. Figure 9.23 shows the taxonomy
generated from the propositionalized attributes of heart-c shown in Figure 9.22(b).
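
The propositionalization step described above can be sketched as follows. This is an illustration under stated assumptions, not the chapter's implementation: the function name `propositionalize` and the two example records are hypothetical, and the sketch covers only nominal attributes (numeric attributes such as thalach or ca would first need to be discretized, which is not shown).

```python
def propositionalize(records, attributes):
    """Turn each (attribute, value) pair seen in the data into a 0/1 attribute.

    records    -- list of dicts mapping attribute name to nominal value
    attributes -- attribute names to expand, in the desired column order
    """
    # Collect the distinct values observed for every attribute.
    values = {a: sorted({r[a] for r in records}) for a in attributes}
    # One new binary column per attribute value, named "attribute=value".
    columns = [f"{a}={v}" for a in attributes for v in values[a]]
    # A column is 1 exactly when the record holds that value for that attribute.
    rows = [
        [1 if r[a] == v else 0 for a in attributes for v in values[a]]
        for r in records
    ]
    return columns, rows

# Hypothetical records using two heart-c attribute names.
records = [
    {"cp": "asympt", "thal": "reversable_defect"},
    {"cp": "non_anginal", "thal": "normal"},
]
columns, rows = propositionalize(records, ["cp", "thal"])
print(columns)
for row in rows:
    print(row)
```

Each original attribute contributes exactly one 1 per record, so the binary columns derived from the same attribute are mutually exclusive.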
Table9.8 (Continued) Accuracy and Tree Size of PAT-DTL with Refinement
Coupled with Divergences* on Selected UCI Data Sets
Data
PAT-DTL(JKL) PAT-DTL(JS) PAT-DTL(AGM)
Acccuracy Size Accuracy Size Accuracy Size
Nursery 66.25±0.81 3 66.25±0.81 3 70.97±0.78 5
Primary-tumor 33.63±5.03 9 38.64±5.18 19 30.68±4.91 1
Segment 84.07±1.49 23 78.10±1.69 21 87.10±1.37 41
Sick 96.85±0.56 3 97.61±0.49 7 97.14±0.53 3
Sonar 75.48±5.85 3 76.44±5.77 3 76.44±5.77 13
Soybean 87.41±2.49 63 70.72±3.41 31 68.08±3.50 53
Splice 80.31±1.38 17 87.93±1.13 55 67.67±1.62 5
Vehicle 68.79±3.12 35 62.41±3.26 33 66.43±3.18 17
Vote 95.63±1.92 3 95.63±1.92 3 95.63±1.92 3
Vowel 55.76±3.09 133 46.46±3.11 135 49.90±3.11 185
Waveform-5000 79.88±1.11 63 68.76±1.28 37 71.38±1.25 39
Zoo 92.08±5.27 15 85.15±6.94 9 85.15±6.94 13
# of wins 19 20 11 24 14 21
* Error rates estimated using 10-fold cross validation with 95% confidence interval.
Jeffreys–Kullback–Leibler divergence = JKL. Jensen–Shannon divergence = JS.
Arithmetic and Geometric Mean divergence = AGM.
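
The ± values in Table 9.8 appear consistent with a normal-approximation binomial interval around the estimated accuracy. That is an inference from the reported numbers, not something the chapter states, so the sketch below is only one plausible reading. The function name is hypothetical; the instance count of 150 for Iris comes from the UCI data set itself.

```python
import math

def accuracy_ci_half_width(accuracy_pct, n, z=1.96):
    """Half-width (in percentage points) of a normal-approximation
    confidence interval for an accuracy estimated on n instances.
    z = 1.96 corresponds to a 95% interval."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)

# Iris has 150 instances; Table 9.8 reports 94.00 ± 3.80 for PAT-DTL(JKL).
print(round(accuracy_ci_half_width(94.00, 150), 2))  # about 3.80
```

Under this reading, the interval width depends only on the accuracy and the data set size, which would explain why large data sets such as Letter show much narrower intervals than small ones such as Labor.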