Here, j is the smallest integer such that max(|v′i|) < 1.

Example 3.6 (Decimal scaling). The recorded values of the Sweden economy data (Table 2.8) vary from 102 to 192; the maximum absolute value pertaining to Sweden is therefore 192. To normalize by decimal scaling, each value is divided by 1,000 (i.e., j = 3), so that 102 normalizes to 0.102 and 192 normalizes to 0.192.

Note that normalization can slightly alter the original data, particularly when z-score normalization or decimal scaling is used. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if z-score normalization is used) so that future data can be normalized in the same way.
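
A minimal sketch of decimal scaling in Python, assuming NumPy is available; the value 150 in the example call is only an illustrative stand-in for the other Sweden values of Table 2.8, which are not reproduced here. The function also returns j so that the same scaling parameter can be saved and applied to future data.

```python
import numpy as np

def decimal_scaling(values):
    """Normalize by decimal scaling: v' = v / 10**j, where j is the
    smallest integer such that max(|v'|) < 1.  Returning j as well lets
    the same scaling be reused on future data."""
    v = np.asarray(values, dtype=float)
    j = 0
    while np.max(np.abs(v)) / 10 ** j >= 1:
        j += 1
    return v / 10 ** j, j

# Example 3.6: Sweden values in Table 2.8 range from 102 to 192.
scaled, j = decimal_scaling([102, 150, 192])
print(j)       # 3 -> every value is divided by 1,000
print(scaled)  # [0.102 0.15  0.192]
```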

3.8 Classification of data

This section defines classification and outlines the general approach to classification, along with decision tree induction, attribute selection measures and Bayesian classification methods.

3.8.1 Definition of classification

A specialist in the field can use the parameters provided in Table 2.12 (in Section 2) to diagnose whether an individual has MS or is healthy. This data analysis task is in fact a classification task. In classification, a model or classifier is constructed to predict class labels, that is, categorical labels, such as “RRMS,” “SPMS,” “PPMS” or “healthy” for the medical application data, or “USA,” “New Zealand,” “Italy” or “Sweden” for the economy data. These categories can be represented by discrete values, where the ordering among values has no meaning. In other types of data analysis, the task may be numeric prediction, in which the model constructed predicts a continuous-valued function, or an ordered value, rather than a class label; such a model is a predictor. Regression analysis is a statistical methodology frequently used for numeric prediction, and the two terms are often used synonymously, although other methods for numeric prediction also exist. Classification and numeric prediction are the two major types of prediction problems.

Let us start with the general approach to classification. Data classification is a two-step process. In the first step, the training phase (or learning step), a classification model is built. In the second step, the classification step, the model is used to predict class labels for given data. Figure 3.5 details these steps.

In the first step, a classifier is built to describe a predetermined set of data classes (concepts). This is the learning step (or training phase), in which a classification algorithm builds the classifier by analyzing, or learning from, a training set consisting of datasets and their associated class labels. A dataset, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the dataset from n dataset attributes, respectively, A1, A2, ..., An. (Each attribute represents a “feature” of X.) For this reason, the pattern recognition literature uses the term feature vector rather than attribute vector. Each dataset, X, is assumed to belong to a predefined class, determined by another dataset attribute called the class label attribute. The class label attribute is discrete-valued and unordered; it is categorical (or nominal), in that each value serves as a category or class. The individual datasets making up the training set are referred to as training datasets and are sampled randomly from the dataset under analysis. In the context of classification, datasets are also known as instances, data points, objects, samples or examples. The term training sample is common in the machine learning literature [16–20].

This step is also called supervised learning because the class label of each training dataset is provided. It contrasts with unsupervised learning (or clustering), in which the class label of each training dataset is not known, and the number or set of classes to be learned may not be known in advance either.

Figure 3.5: The data classification process. (a) Learning: a classification algorithm analyzes the training data; the class label attribute is class, and the learned model (classifier) is shown in the form of classification rules. (b) Classification: test data are used to estimate the accuracy of the classification rules. If the accuracy is deemed acceptable, the rules can be applied to the classification of new datasets.

The first step of the classification process can also be viewed as the learning of a mapping or function, y = f(X), that can predict the associated class label y of a given dataset X. In this view, the mapping or function separates the data classes. The mapping is typically represented in the form of classification rules, decision trees or mathematical formulae. In our example, the mapping is depicted as classification rules that identify medical data applications as the MS subgroups RRMS, SPMS or PPMS, or as healthy (Figure 3.5(a)). The rules serve to classify future datasets and to provide deeper insight into the data content, as well as a compressed representation of the data.

In the second step (Figure 3.5(b)), the model is used for classification, which includes estimating the predictive accuracy of the classifier. If we used the training set to measure the classifier’s accuracy, the estimate would likely be optimistic because the classifier tends to overfit the data (e.g., during learning it may incorporate some anomalies specific to the training data that are not present in the general dataset). Hence, we use a test set, consisting of test datasets and their associated class labels, that is independent of the training datasets, meaning that the test datasets were not used to construct the classifier.

Accuracy is a central quantity in this estimation. The accuracy of a classifier on a given test set is the percentage of test set datasets that are correctly classified by the classifier. The associated class label of each test dataset is compared with the learned classifier’s class prediction for that dataset. Section 3.9 provides several methods that can be employed to estimate classifier accuracy. If the accuracy of the classifier is considered acceptable, the classifier can be used to classify future datasets for which the class label is not known. (In the machine learning literature, such data are referred to as “unknown” or “previously unseen” data [16–18].) For example, the classification rules learned in Figure 3.5(a) from the analysis of previous medical data applications can be used to classify new or future medical data applications.
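
As a minimal sketch of this two-step process, the snippet below trains a decision tree classifier (one of the model forms named above) on a labeled training set, estimates its accuracy on an independent test set, and then classifies a new, previously unseen dataset. It assumes scikit-learn and uses randomly generated data with the class labels "RRMS", "SPMS", "PPMS" and "healthy" purely for illustration, so the resulting accuracy is meaningless in itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
labels = np.array(["RRMS", "SPMS", "PPMS", "healthy"])

# Training set: attribute vectors X with their known class labels y (supervised).
X_train = rng.normal(size=(120, 4))
y_train = labels[rng.integers(0, 4, size=120)]

# Step 1 (learning): the algorithm builds the classifier, i.e., the mapping y = f(X).
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2 (classification): estimate accuracy on an independent test set ...
X_test = rng.normal(size=(40, 4))
y_test = labels[rng.integers(0, 4, size=40)]
print(accuracy_score(y_test, clf.predict(X_test)))

# ... and, if the accuracy is acceptable, classify new ("previously unseen") datasets.
print(clf.predict(rng.normal(size=(1, 4))))
```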

3.9 Model evaluation and selection

Even though we have built a classification model, many questions may still require elucidation. For instance, suppose you used data from previous sales to build a classifier that predicts customer purchasing behavior. You would like an estimate of how accurately the classifier can predict the buying behavior of prospective customers, that is, of customer data on which the classifier has not been trained. You may even have tried different methods to build more than one classifier and now intend to compare their accuracy [21]. At this point we may ask what accuracy is and how we can estimate it. Further questions follow: are some measures of a classifier’s accuracy more appropriate than others, and how can we obtain a reliable accuracy estimate? You will find the answers to such questions in this section of the book.

Holdout and random subsampling (Section 3.9.1.1) and cross-validation (Section 3.9.1.2) are common techniques for assessing accuracy, based on randomly sampled partitions of the given data. What about the case in which there is more than one classifier and we aim to choose the “best” one? This is referred to as model selection (choosing one classifier over another).

Following this introductory part, certain issues may still seem unclear. MS subgroups are classified based on the existence of lesions in previously taken MR images together with the EDSS score. The MR image and EDSS score data can then be set aside to be tested for classification purposes.

One might have tried different methods to build more than one classifier and wish to compare their accuracy. One may ask how accuracy is defined and how it is estimated [22, 23], whether some measures of a classifier’s accuracy are more appropriate than others, and how a reliable estimate of accuracy can be obtained.

You can find the answers and explanations for such questions in this section of the book. Section 3.9.1 presents various metrics for evaluating the predictive accuracy of a classifier. Two common assessment techniques are used: holdout and random subsampling (Section 3.9.1.1), and cross-validation (Section 3.9.1.2), both based on randomly sampled partitions of the data. Suppose you have more than one classifier and need to choose the “best” one. This is known as model selection, in other words, choosing one classifier over another.

3.9.1 Metrics for evaluating classifier performance

This section presents measures for assessing how “accurate” a classifier is at predicting the class label of datasets. Table 3.4 summarizes the classifier evaluation measures, which include accuracy, sensitivity (or recall), specificity, precision, F1 and Fβ. Accuracy refers to the predictive ability of a classifier. It is better to measure the accuracy of a classifier on a test set consisting of class-labeled datasets that were not used to train the model [16, 21–23]. In the machine learning literature, positive samples are datasets of the main class of interest, whereas negative samples are all the other datasets. With two classes, the positive datasets may be patient = yes, whereas the negative datasets are healthy = no. Suppose our classifier is used on a test set of labeled datasets, where P is the number of positive samples and N is the number of negative samples. For each dataset, the class label prediction of the classifier is compared with the dataset’s known class label.

Below are some of the “building block” terms that are used: TP (true positives) is the number of positive datasets correctly labeled by the classifier; TN (true negatives) is the number of negative datasets correctly labeled; FP (false positives) is the number of negative datasets incorrectly labeled as positive; and FN (false negatives) is the number of positive datasets incorrectly labeled as negative.

The confusion matrix provided in Table 3.5 summarizes these building block terms. The confusion matrix is a useful tool for analyzing how well the classifier can recognize datasets of different classes. TP and TN indicate when the classifier gets things right, whereas FP and FN indicate when it gets things wrong, that is, when it mislabels datasets.

A confusion matrix for the two classes patient = yes (positive) and healthy = no (negative) is provided in Table 3.6.

Table 3.4: Evaluation measures.

Measure | Formula
Accuracy, recognition rate | (TP + TN) / (P + N)
Error rate, misclassification rate | (FP + FN) / (P + N)
Sensitivity, true positive rate, recall | TP / P
Specificity, true negative rate | TN / N
Precision | TP / (TP + FP)
F, F1, F-score, harmonic mean of precision and recall | (2 × precision × recall) / (precision + recall)
Fβ, where β is a non-negative real number | ((1 + β²) × precision × recall) / (β² × precision + recall)

Note: Some measures are known by more than one name. TP, TN, FP, FN, P and N denote the number of true positive, true negative, false positive, false negative, positive and negative samples, respectively (see text).

Table 3.5: Confusion matrix, shown with totals for positive and negative datasets.

Actual class \ Predicted class | yes | no | Total
yes | TP | FN | P
no | FP | TN | N
Total | P′ | N′ | P + N

Table 3.6: Confusion matrix for the classes patient = yes and healthy = no.

Note: An entry in row i and column j shows the number of datasets of class i that were labeled by the classifier as class j. Ideally, the nondiagonal entries should be zero or close to zero.

Given m classes, where m ≥ 2, a confusion matrix is a table of at least size m × m. An entry CMi,j in the first m rows and m columns indicates the number of datasets of class i that were labeled by the classifier as class j. For a classifier to have good accuracy, ideally most of the datasets would be represented along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with the rest of the entries being zero or close to zero; that is, FP and FN would be close to zero.

The table may have additional rows or columns to provide totals. P and N are shown in the confusion matrix given in Table 3.5. In addition, P′ is the number of datasets labeled as positive (TP + FP), whereas N′ is the number of datasets labeled as negative (TN + FN). The total number of datasets is TP + TN + FP + FN, or P + N, or P′ + N′. Note that although the confusion matrix is shown here for a binary classification problem, such matrices can be drawn analogously for multiple classes.

We can now consider the evaluation measures, starting with accuracy. The accuracy of a classifier on a given test set is the percentage of test set datasets that the classifier correctly classifies:

Accuracy = (TP + TN) / (P + N)    (3.12)

Another common term for this is the classifier’s recognition rate, which reflects how well the classifier recognizes the datasets of the various classes. The confusion matrix for the two classes patient = yes (positive) and healthy = no (negative) is given in Table 3.6, which shows the totals as well as the recognition rates per class and overall. A glance at the matrix reveals whether the classifier is confusing the two classes: here it has mislabeled 600 no datasets as yes. Accuracy is most effective when the class distribution is relatively balanced.
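
The following sketch shows how the building block counts and the accuracy of eq. (3.12) can be computed; it assumes scikit-learn, and the eight labels are invented for illustration rather than taken from Table 3.6.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]   # actual class
y_pred = ["yes", "no",  "no", "yes", "no", "yes", "no", "yes"]  # predicted class

# Rows = actual class, columns = predicted class, ordered ["yes", "no"] as in Table 3.5.
cm = confusion_matrix(y_true, y_pred, labels=["yes", "no"])
tp, fn = cm[0]   # actual yes: predicted yes / predicted no
fp, tn = cm[1]   # actual no : predicted yes / predicted no
print(cm)

print((tp + tn) / (tp + tn + fp + fn))   # accuracy = (TP + TN) / (P + N)
print(accuracy_score(y_true, y_pred))    # the same value via scikit-learn
```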

Other terms to consider are the error rate or misclassification rate of a classifier M, which is simply 1 − accuracy(M), where accuracy(M) is the accuracy of M. It can be computed as

Error rate = (FP + FN) / (P + N)    (3.13)

If the training set (rather than the test set) were used to estimate the error rate of a model, this quantity would be called the resubstitution error. This estimate is optimistic of the true error rate because the model has not been tested on samples it has not yet seen; the corresponding accuracy estimate is accordingly optimistic as well.

The class imbalance problem arises when the main class of interest is rare, that is, when the dataset distribution shows a small minority positive class and a substantial majority negative class. For instance, in disease identification applications the class of interest (positive class), patient, occurs much less often than the negative class, healthy. As a medical example, consider data with a rare class such as RRMS. Suppose a researcher has trained a classifier to classify medical datasets, where the class label attribute is RRMS with possible values yes and no. An accuracy rate of 98% may make the classifier appear highly accurate, but what if only 2% of the training datasets actually have RRMS (patient)? Clearly, 98% would not be an acceptable accuracy rate; the classifier could simply be labeling correctly only the datasets that are not RRMS (patient), so that all RRMS datasets in Table 2.12 would be misclassified. Instead, the researcher needs measures that assess how well the classifier recognizes the positive samples (RRMS = yes) and how well it recognizes the negative samples (RRMS = no).

For this purpose, a researcher can use the measures of sensitivity and specificity, respectively. Sensitivity is also referred to as the true positive (recognition) rate, that is, the proportion of positive datasets that are correctly identified, while specificity is the true negative rate, the proportion of negative datasets that are correctly identified. These measures are defined as

Sensitivity = TP / P    (3.14)

Specificity = TN / N    (3.15)

We see accuracy as a function of both sensitivity and specificity:

Accuracy = Sensitivity × P / (P + N) + Specificity × N / (P + N)    (3.16)
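
A minimal sketch of sensitivity (3.14) and specificity (3.15) on an imbalanced, artificially constructed label set (assumed here; it is not the RRMS data), illustrating how a 98% accuracy can coexist with a sensitivity of zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([1] * 20 + [0] * 980)   # 2% positive (patient = yes), 98% negative
y_pred = np.zeros_like(y_true)            # a degenerate classifier that always predicts "no"

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)              # TP / P
specificity = tn / (tn + fp)              # TN / N

print(accuracy_score(y_true, y_pred))     # 0.98 -- looks impressive
print(sensitivity, specificity)           # 0.0 and 1.0 -- every patient is missed
```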

The measures of precision and recall are also widely used in classification. Precision can be thought of as a measure of exactness (i.e., what percentage of the datasets labeled as positive are actually positive), whereas recall is a measure of completeness (i.e., what percentage of the positive datasets are labeled as positive). These measures are computed as

Precision = TP / (TP + FP)    (3.17)

Recall = TP / (TP + FN) = TP / P    (3.18)

A perfect precision score of 1.0 for a class C means that every dataset the classifier labeled as belonging to class C does indeed belong to class C; however, it tells us nothing about the number of class C datasets that the classifier mislabeled. A perfect recall score of 1.0 for C means that every item of class C was labeled as such, but it does not tell us how many other datasets were incorrectly labeled as belonging to class C. There tends to be an inverse relationship between precision and recall, so that one can be increased at the expense of reducing the other. For instance, our medical classifier might achieve high precision by labeling as an MS subgroup only the MS datasets that present in a certain way, but it might then achieve low recall if it mislabels many other instances of MS datasets in Table 2.12. Precision and recall scores are typically used together, with precision values compared at a fixed value of recall, or vice versa. For example, the researcher might compare precision values at a recall value of 0.75.

An alternative way to use precision and recall is to combine them into a single measure. This is the approach of the F measure (also referred to as the F1-score or F-score) and the Fβ measure, defined as

F = (2 × precision × recall) / (precision + recall)    (3.19)

Fβ = ((1 + β²) × precision × recall) / (β² × precision + recall)    (3.20)

In these formulae, β is a non-negative real number. The F measure is the harmonic mean of precision and recall, giving equal weight to both. The Fβ measure is a weighted measure of precision and recall. The most commonly used Fβ measures are F2 (which weights recall twice as much as precision) and F0.5 (which weights precision twice as much as recall).
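
A minimal sketch of eqs. (3.17)–(3.20) using scikit-learn on a small set of illustrative labels (not the book’s data):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)            # TP / (TP + FP)
r = recall_score(y_true, y_pred)               # TP / (TP + FN)
print(p, r)                                    # 0.8 0.8

print(f1_score(y_true, y_pred))                # harmonic mean of precision and recall
print(2 * p * r / (p + r))                     # the same value from eq. (3.19)

print(fbeta_score(y_true, y_pred, beta=2))     # F2: recall weighted twice as much
print(fbeta_score(y_true, y_pred, beta=0.5))   # F0.5: precision weighted twice as much
```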

Another question to pose at this point is whether there are other cases in which accuracy may not be appropriate. In classification problems it is commonly assumed that all datasets are uniquely classifiable, that is, that each training dataset can belong to only one class. However, large datasets contain a wide variety of data, so it is not always reasonable to assume that all datasets are uniquely classifiable. It is more plausible to assume that each dataset may belong to more than one class. How, then, can the accuracy of classifiers on large datasets be measured? The accuracy measure is not appropriate in this case, because it does not take into account the possibility of datasets belonging to more than one class. In such cases it is useful to return a probability class distribution rather than a single class label [24]. Accuracy measures may then use a second-guess heuristic, whereby a class prediction is judged correct if it agrees with the first or second most probable class. Although this does take into account the nonunique classification of datasets to some extent, it is not a complete solution.

Besides accuracy-based measures, classifiers can also be compared with respect to the following additional aspects. Speed refers to the computational costs involved in generating and using the given classifier. Robustness is the ability of the classifier to make correct predictions given noisy data or data with missing values; it is typically assessed with a series of synthetic datasets representing increasing degrees of noise and missing values. Scalability is the ability to construct the classifier efficiently given large amounts of data; it is typically assessed with a series of datasets of increasing size. Finally, interpretability is the level of understanding and insight that the classifier or predictor provides. Interpretability is subjective, and therefore harder to assess. Classification rules and decision trees are easy to interpret, but their interpretability may diminish as they become more complex (Figure 3.6) [55].

Figure 3.6: Estimating accuracy with the holdout method.

So far, several evaluation measures have been presented. The accuracy measure works best when the data classes are fairly evenly distributed. Other measures, such as precision, sensitivity/recall, specificity, F and Fβ, are more appropriate for the class imbalance problem, in which the main class of interest is rare.

In the following sections, we focus on obtaining reliable estimates of classifier accuracy.

3.9.1.1 Holdout method and random subsampling

One way to obtain a reliable estimate of classifier accuracy is the holdout method, in which the given data are randomly divided into two independent sets: a training set and a test set. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to derive the model, whose accuracy is then estimated with the test set (Figure 3.7). The estimate is pessimistic because only a portion of the initial data is used to derive the model. Random subsampling is a variation of the holdout method in which the holdout is repeated over several iterations, and the overall accuracy estimate is taken as the average of the accuracies obtained from each iteration [25].

Figure 3.7: Random subsampling carries out k data splits of the dataset.
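
A minimal sketch of the holdout method with random subsampling, assuming scikit-learn and a synthetic dataset standing in for the given data; the split proportions (2/3 training, 1/3 test) follow the text, while the number of repetitions and the decision tree model are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

accuracies = []
for i in range(5):                                  # repeat the holdout k = 5 times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, random_state=i)        # 2/3 training, 1/3 test
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

print(accuracies)           # one accuracy estimate per holdout iteration
print(np.mean(accuracies))  # random subsampling: the averaged accuracy
```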

In the holdout experiments experiment 1, experiment 2 and experiment 3 of Figure 3.7, the same dataset is split into k portions (in different proportions). Let us now give the procedural steps for calculating the test accuracy rates of experiment 1, experiment 2 and experiment 3, obtained by splitting the same dataset into different k proportions.

Step (1) By splitting the dataset into different k proportions (test data), experiment 1, experiment 2 and experiment 3 are formed.

The following steps describe how to choose the experiment with the best classification accuracy rate among experiment 1, experiment 2 and experiment 3 (best validation).

Step (2) The portion of data allocated to the test data should be smaller than the portion allocated to the training data.

Step (3) The classification accuracy rates are obtained from experiment 1, experiment 2 and experiment 3, and the k-split with the highest classification accuracy rate among the results is preferred.

Below, the holdout method is applied to the MS dataset (see Table 2.12).

Example 3.7 The MS dataset is split into k portions (in different proportions) based on the holdout method. Let us give the procedural steps for calculating the test accuracy rates of experiment 1 and experiment 2, obtained by splitting the same dataset into different k proportions and classifying with the k-NN (k-nearest neighbor) algorithm.

Step (1) By splitting the MS dataset into different k proportions (test data), experiment 1 and experiment 2 are formed.

The following steps describe how to choose the experiment (experiment 1 or experiment 2) with the best classification accuracy rate obtained with the k-NN algorithm (best validation).

Step (2) For experiment 1, the MS dataset has been split as follows: one-third of the dataset as the test dataset and two-thirds as the training dataset. For experiment 2, the MS dataset has been split as follows: one-fourth of the dataset as the test dataset and three-fourths as the training dataset.

Step (3) The classification accuracy rate of the MS dataset, obtained with the k-NN algorithm as in Figure 3.8, has been calculated for the two experiments of Step (2). For experiments 1 and 2, the results are 68.31% and 60.5%, respectively. As can be seen, the result of experiment 1 is higher than that of experiment 2. The best classification accuracy rate is therefore yielded by experiment 1, in which the MS dataset is split with two-thirds allocated as training data and one-third as test data.

Figure 3.8: The classification of MS dataset based on holdout method with k-NN algorithm.
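
A minimal sketch of the two holdout experiments of Example 3.7 with a k-NN classifier, assuming scikit-learn; a synthetic four-class dataset stands in for the MS data of Table 2.12, and k = 3 neighbors is an illustrative choice, so the printed accuracies will not reproduce the 68.31% and 60.5% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the MS dataset (four class labels, as in RRMS/SPMS/PPMS/healthy).
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)

def holdout_accuracy(test_size):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    return accuracy_score(y_te, knn.predict(X_te))

print(holdout_accuracy(1/3))   # experiment 1: 2/3 training, 1/3 test
print(holdout_accuracy(1/4))   # experiment 2: 3/4 training, 1/4 test
```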

The remainder of this book is organized as follows: Chapter 6 describes decision tree algorithms (ID3, C4.5 and CART), Chapter 7 covers the naïve Bayes algorithm, Chapter 8 describes SVM algorithms (linear and nonlinear), Chapter 9 describes the k-NN algorithm and Chapter 10 describes ANN algorithms (FFBP and LVQ). In these chapters, the accuracy rate in the classification of data is obtained by applying the holdout method: the MS dataset (see Table 2.12), the economy (U.N.I.S.) dataset (see Table 2.8) and the WAIS-R dataset (see Table 2.19) are split with two-thirds (66.66%) used for the training procedure and one-third for the test procedure.

3.9.1.2 Cross-validation method

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are used together to train the model. That is, in the first iteration subsets D2, ..., Dk collectively serve as the training set to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the holdout and random subsampling methods, here each sample is used the same number of times for training and exactly once for testing. The accuracy estimate is the overall number of correct classifications from the k iterations divided by the total number of datasets in the initial data. Stratified 10-fold cross-validation is generally recommended for estimating accuracy, owing to its relatively low bias and variance, even when computational power allows the use of more folds. In stratified cross-validation, the folds are stratified so that the class distribution of the datasets in each fold is approximately the same as that in the initial data. Leave-one-out is a special case of k-fold cross-validation in which k is set to the number of initial datasets; only a single sample is “left out” at a time for the test set, which is how the method gets its name [26, 27].
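
The loop below is a minimal sketch of the k-fold mechanics just described, using scikit-learn’s StratifiedKFold on a synthetic dataset (assumed for illustration): each fold is held out once as the test set while the remaining k − 1 folds train the model, and the per-fold accuracies are averaged.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified 10-fold

scores = []
for train_idx, test_idx in skf.split(X, y):            # k = 10 iterations
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])              # train on the k - 1 remaining folds
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print(np.mean(scores))   # the overall accuracy estimate across the k folds
```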

The procedural steps for the cross-validation method are as follows:

Step (1) The data for the test procedure are split from the dataset into k folds.

Step (2) The dataset is split into k equal-sized subsets. While k − 1 of the subsets are used for the training procedure, one subset is excluded from training so that it can be used for testing.

Step (3) The procedure in Step (2) is iterated k times, and the accuracy rate of each iteration is calculated. The test accuracy rate is obtained by averaging these accuracy rates.

Now let us work through the application of the cross-validation method in line with the procedural steps specified earlier.

Example 3.8 Let us obtain the test accuracy rate of the WAIS-R dataset (Table 2.19) by splitting it according to the 10-fold cross-validation method and classifying with the k-NN algorithm.

Step (1) The WAIS-R dataset (400 × 21) is split into 10 folds for the test procedure.

Step (2) The dataset is split into 10 equal-sized subsets of 40 × 21 each. While nine of the subsets are used for the training procedure, one is excluded from training so that it can be used for testing.

Step (3) The procedure in Step (2) is iterated k = 10 times, and the accuracy rate of each iteration is calculated (the cells marked in dark color in Figure 3.9). By averaging the accuracy rates based on Figure 3.9 ((50% + 51% + 42% + ... + 75%)/10), the test accuracy rate is obtained. With the k-NN algorithm, the test accuracy rate is obtained as 70.01%.

Figure 3.9: WAIS-R dataset 10-fold cross-validation.

As shown in Figure 3.9, in steps k = 1, 2, ..., 10, the 40 × 21 training and test data matrices in each cell differ from one another.
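
A minimal sketch mirroring Example 3.8 with scikit-learn’s cross_val_score: a synthetic 400-sample, 20-feature matrix stands in for the WAIS-R data (the 400 × 21 table presumably includes the class label column), and k = 3 neighbors is an illustrative choice, so the averaged accuracy will not match the 70.01% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the WAIS-R dataset of Table 2.19.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)

fold_accuracies = cross_val_score(knn, X, y, cv=10)   # one accuracy per fold
print(fold_accuracies)
print(fold_accuracies.mean())                         # the averaged test accuracy rate
```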

References

[1]Larose DT, Larose CD. Discovering knowledge in data: an introduction to data mining. USA: John Wiley & Sons, Inc., Publication, 2014.

[2]Ott RL, Longnecker MT. An introduction to statistical methods and data analysis. USA: Nelson Education. 2015.

[3]Linoff GS, Berry MJ. Data mining techniques: for marketing, sales, and customer relationship management. USA: Wiley Publishing, Inc., 2011.

[4]Tufféry S. Data mining and statistics for decision making. UK: John Wiley & Sons, Ltd., Publication, 2011.

[5]Kantardzic M. Data mining: concepts, models, methods, and algorithms. USA: John Wiley & Sons, Ltd., Publication, 2011.

[6]Giudici P. Applied data mining: statistical methods for business and industry. England: John Wiley & Sons, Ltd., Publication, 2005.

[7]Myatt GJ. Making sense of data: a practical guide to exploratory data analysis and data mining. USA: John Wiley & Sons, Ltd., Publication, 2007.

[8]Harrington P. Machine learning in action. USA: Manning Publications, 2012.

[9]Mourya SK, Gupta S. Data Mining and Data Warehousing. Oxford, United Kingdom: Alpha Science International Ltd., 2012.

[10]North M. Data mining for the masses. Global Text Project, 2012.

[11]Nisbet R, Elder J, Miner G. Handbook of statistical analysis and data mining applications. Canada: Academic Press publications, Elsevier, 2009.

[12]Pujari AK. Data mining techniques. United Kingdom: Universities Press, Sangam Books Ltd., 2001.

[13]Borgelt C, Steinbrecher M, Kruse R. Graphical models: representations for learning, reasoning and data mining. USA: John Wiley & Sons, Ltd., Publication, 2009.

[14]Brown ML, John FK. Data mining and the impact of missing data. Industrial Management & Data Systems, 2003, 103(8), 611–621.

[15]Gupta GK. Introduction to data mining with case studies. Delhi: PHI Learning Pvt. Ltd., 2014.

[16]Han J, Kamber M, Pei J, Data mining Concepts and Techniques. USA: The Morgan Kaufmann Series in Data Management Systems, Elsevier, 2012.

[17]Raudenbush SW, Anthony SB. Hierarchical linear models: Applications and data analysis methods. USA: Sage Publications, 2002.

[18]John Lu ZQ. The elements of statistical learning: Data mining, inference, and prediction. Journal of the Royal Statistical Society: Series A (Statistics in Society), 2010, 173(3), 693–694.

[19]LaValle S, Lesser E, Shockley R, Hopkins MS, Kruschwitz N. Big data, analytics and the path from insights to value. MIT Sloan Management Review, 2011, 52(2), 21–31.

[20]Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical machine learning tools and techniques. USA: Morgan Kaufmann Series in Data Management Systems, Elsevier, 2016.

[21]Arlot S, Alain C. A survey of cross-validation procedures for model selection. Statistics Surveys, 2010, 4, 40–79.

[22]Picard RR, Cook DR. Cross-validation of regression models. Journal of the American Statistical Association, 1984, 79(387), 575–583.

[23]Whitney AW. A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 1971, 100 (9), 1100–1103.

[24]Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1995, 14(2), 1137–1145.

[25]Gutierrez-Osuna R. Pattern analysis for machine olfaction: A review, IEEE Sensors Journal, 2002, 2 (3), 189–202.

[26]Golub GH, Heath M, Wahba G. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 1979, 21(2), 215–223.

[27]Arlot S, Matthieu L. Choice of V for V-fold cross-validation in least-squares density estimation. The Journal of Machine Learning Research, 2016, 17(208), 7256–7305.
