In the era of big data, sampling has become a popular and essential part of the data science pipeline. Even if you can get all the available data to fit into a large data structure, it may be ill-advised (unless you already have an adequate model at your disposal). Just because the cloud and large computer clusters make it possible, doesn’t mean that you should use all the available data as-is. You shouldn’t even need all of your data to see whether a feature holds value; a sample can make the whole process much more efficient.
Equally important is the evaluation of the results of your models. Although there are some obvious techniques we have used already, there is much more to it, as not all data science problems involve the same goals. Besides, when it comes to evaluating results, there are often conflicting objectives which must be considered.
In this chapter we’ll examine the following topics:
Sampling Techniques
Although we have glanced at sampling throughout the book, we’ve merely scratched the surface. Sampling can be a sophisticated process for certain problems. In this section we’ll examine the two main types of sampling and how we can employ Julia for these tasks. In all cases of sampling used in data science, we make use of randomization, as this yields more unbiased samples.
Basic sampling
This is the type of sampling we have been using so far. It works well for the majority of cases and it is very fast in its implementation, particularly in a language like Julia. Because it is such a simple way to create samples, you don’t need a package for it. However, there are a variety of options available in the StatsBase package, using the sample() function. We recommend you load that function into memory using the import StatsBase.sample command, as we’ll be extending it later with a custom function.
The idea behind basic sampling is to take a random subset of the original data and yield the features and the target variable values that correspond to it. The only parameter we need to use is the number of data points we expect to have in the sample (e.g. 1000). So, if we were to take the magic dataset, for instance, and apply basic sampling, we would type:
In[1]: p = 0.1
n = round(Int64, 19020*p)
ind = sample(1:19020, n)
SampleData1 = data[ind,:]
Where the value multiplied by p is the number of elements in the sample, rounded to the nearest integer. Alternatively, you can use the built-in function randperm(n), which provides you with a random permutation of a range from 1 to n. So, in the case of the magic dataset we could type:
In[1]: ind = randperm(19020)[1:n]
SampleData2 = data[ind,:]
Both of these methods are fast, but the latter is preferable if you don’t want to worry about external packages (that often tend to be difficult to remember when you need them!).
Stratified sampling
Stratified sampling is a bit trickier as it aims to preserve the class structure of the dataset, when classes are available. However, you can use any discreet variable instead of a class, if you want to apply this type of sampling to a dataset (e.g. the gender variable for a demographics dataset). The objective is to create a sample with a distribution of a given variable similar to that of the original dataset.
Again, the only parameter we need to address is the number of elements that need to be included in the sample. As this is not a trivial task, we need to make use of an external script (like the sample.jl included in this book, which extends the sample() function of the StatsBase package) as currently there is no package that provides this method. So, applying stratified sampling to the magic dataset would look like this:
In[3]: StratifiedSample = sample(data[:,1:10], data[:,11], n)
This works only because we have extended the sample() function using the sample.jl script. If you were to run this snippet with the default sample() function of the StatsBase package, you would get an error. The output of this function yields two distinct arrays of the sample: one for the features and one for the labels. You could easily merge the two by employing the built-in hcat() function:
In[4]: SampleData3 = hcat(StratifiedSample[1], StratifiedSample[2])
In this example stratified sampling may be an overkill, but there are several real-world scenarios where this approach would be appropriate and necessary. If a particular class is very small compared to the other classes, its signal may be weakened or lost altogether through the basic sampling approach.
Performance Metrics for Classification
Once we have completed a classification task, we end up with a vector of predicted labels that correspond to the test set we have run through a classification system. But how close were we to the actual labels? To answer this question in a reliable and insightful way, we employ a series of metrics. Let’s look at each one of them more closely.
Confusion matrix
Although this is not strictly a classification metric, a confusion matrix (or table) is essential for evaluating a classifier’s performance. The matrix should indicate how close the predictions were to the actual labels, and divulge some insight as to where the errors (misclassifications) lie. This will not only allow us to go deeper into the classifier’s performance, but also get a better understanding of the dataset. This is especially true if we run several classifications on the same data, using different parameters and different classifiers. Let us now see how Julia can produce a confusion matrix for a basic classification experiment (so that you can check it by hand, for more in-depth understanding of the methods described in this section). First, let’s create the predictions and the labels’ vectors:
In[5]: t = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2]
y = [1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1]
Now, let’s calculate the corresponding confusion matrix:
In[6]: CM = confusmat(2, t, y)
Out[6]: 2x2 Array{Int64,2}:
1 3
2 6
The target variable can be anything (e.g. a string, a character), but for this function of the MLBase package to work, it must be encoded as an array of integers. If you are unsure about how to accomplish this, please refer to Chapter 6.
In this simple example, we can deduce that the classifier has mediocre performance, since there are a lot of cases where class 1 was predicted as class 2 and vice versa. In general, the closer the confusion matrix is to a diagonal matrix, the better the performance of the classifier. Let us see how we can quantify this rule of thumb, using a series of metrics based on the confusion matrix.
Accuracy metrics
Basic accuracy
In the beginning of the book, as well as in the “Caffeine for the Force” case study, we looked briefly at this metric for evaluating the performance of various classifiers. It is the simplest metric for this task and it works well for relatively balanced datasets (i.e. datasets where the classes are of comparable size).
In essence, this metric calculates the proportion of correct classifications over the total number of classifications. By correct classifications, we mean cases where the predictions and the labels are the same. This corresponds to the diagonal of the confusion matrix. So, in our previous example, the accuracy of the classification is (1 + 6) / length(t) = 7 / 12 equals roughly 0.583. The same result can be obtained through the correctrate() function of the MLBase package:
In[7]: correctrate(t, y)
Out[7]: 0.58333333334
The same result would be obtained if you reversed the inputs of the function by accident, since Julia has no way of knowing which labels are the ground truth and which are the predicted ones.
Accuracy is often referred to as “accuracy rate” and it always takes a value between 0.0 and 1.0 (usually expressed as a percentage). Usually an accuracy higher than 1/q, where q is the number of classes, is considered satisfactory (although you’d like to get as close to 1.0 as possible). Unfortunately, the accuracy metric performs poorly when the dataset is heavily unbalanced, yielding misleading results. For instance, in a fraud detection scenario, even a horrible classifier is bound to produce accuracy rates close to 1 simply by defaulting to the dominant class, which in this case would be “safe transaction.”
Weighted accuracy
Although accuracy on its own is a very blunt metric (considering every hit or miss of exactly the same value), it has a modified version that works well for even the most challenging datasets: weighted accuracy. One benefit of weighted accuracy is that it makes some data points more important than others, as is often the case in unbalanced datasets, where the smaller class usually holds the most value.
In the previous example, if we were more interested in predicting the first class correctly, we would assign higher weights to the elements of the corresponding test set. The weights are normalized. Their exact values don’t matter all that much, since it is the ratio of the weight of one data point to another that affects the overall score. So, in our example, we could apply weighted accuracy by first defining and normalizing the weights:
In[8]: w = [4, 1]
w ./= sum(w)
Out[8]: 2-element Array{Float64,1}:
0.8
0.2
Now, we need to apply them to the calculation of the accuracy rate:
In[9]: 2*sum(diag(CM) .* w) / 12
Out[9]: 0.333333333333333
Note that “2” corresponds to the number of classes and “12” to the total number of classifications. In general, the formula for the weighted accuracy metric would be:
q*sum(diag(CM) .* w) / N
If you plan to use weighted accuracy often, it would be helpful to code it into a function such as the following:
function wa{T <: Real}(t::Array{Int64, 1}, y::Array{Int64, 1}, w::Array{T, 1})
q = length(unique(t))
N = length(t)
CM = confusmat(q, t, y)
return q*sum(diag(CM) .* w) / N
end
In the example, we chose the first class to be much more important than the second one, making the overall performance of the classifier vastly different. Before, the classifier seemed decent; after the application of the weights it seems subpar for this particular problem. If we were to reverse the weights and make the second class more important, the weighted accuracy would swing in the opposite direction, taking a value of about 0.833. It is important to choose the weights carefully if you want the weighted accuracy to provide meaningful and insightful results.
Precision and recall metrics
Accuracy often fails to capture how well a classifier performs for a particular class, as it focuses on delivering the classifier’s overall performance. If we wish to zoom in on a particular class we’ll need to make use of alternative metrics, such as precision and recall.
Precision depicts how reliable a classifier is for predicting a particular class. Mathematically it is defined as the number of hits over the total number of predictions for that particular class: TP / (TP + FP). So, for the given example, the precision of the classifier for class 1 is 1 / (1 + 2) = 0.333. In general, you can use the confusion matrix CM to calculate the precision of class c using the snippet:
CM[c,c] / sum(CM[:,c])
Recall shows how many of the cases of the particular class the classifier got right. Mathematically this is equivalent to the number of hits over the total number of cases for that class, predicted or otherwise: TP / (TP + FN). For our example, the recall of the classifier for class 1 is 1 / (1 + 3) = 0.25. The code to calculate recall in the general case is:
CM[c,c] / sum(CM[c,:])
Clearly, both the precision and the recall for class 1 are not excellent. The precision and recall of class 2 will be larger, since overall the performance of the system is 0.583 and there are only two classes. Of course, to know exactly what they are we’ll need to do the corresponding calculations.
F1 metric
Although precision and recall capture important information about the performance of a classifier for a given class, neither of them provides a good snapshot of the bigger picture. One can easily tweak a classifier so that it has amazing precision or amazing recall, but it is extremely difficult to get a very good score for both (unless you actually improve the classifier or the features significantly). Any metric that took into account both precision and recall would be able to deliver a more holistic view of the classifier’s performance.
One such metric (the most popular one) is the F1 metric, which is expressed as a number between precision and recall, leaning toward whichever one is smaller. This is expressed mathematically using the harmonic mean (which you can read more about at http://bit.ly/29p4FpY), or in terms of the confusion matrix: F1 = 1/(1+(FP+FN)/(2TP)) = 2TP/(2TP+FP+FN). In the aforementioned example the F1 of the classifier would be: 2*1/(2*1+2+3) = 2/7 = 0.286. This general formula for the F1 metric can be expressed in Julia as:
2*CM[c,c] / (sum(CM[:,c]) + sum(CM[c,:]))
Alternatively, you can just calculate the harmonic mean of R(CM,c) and P(CM,c).
Not surprisingly, the F1 metric for class 1 of this classification is not good either, as both the precision and the recall for that class are relatively low. Yet even if one of them were satisfactory, the other one would pull the F1 metric to a low value. As an exercise, you can calculate the F1 metric of this classification for the second class of this classification experiment.
Misclassification cost
Rarely do all errors have the same importance. In security applications, for example, if you misclassify a normal person as an intruder, it’s not the end of the world (though that person may get frustrated). If you misclassify an intruder for a normal person, though, you’ll make the headlines–and not in a positive way!
Defining the cost matrix
Mathematically we can express this difference in the importance of errors as different misclassification costs, usually arranged in a matrix. Obviously in such a setup, the diagonal of the matrix would be all zeros (since this corresponds to the correct classifications), with all the remaining elements being some non-negative numbers. This is often referred to as a cost matrix and in our example it could be something like this:
0 10
3 0
You can store this in Julia as:
C = [0 10; 3 0]
Note that this is unrelated to the classifier used and it is usually arbitrary defined (unless we know how much each type of mistake would cost us). Also, the cost is not necessarily in a particular currency; it is usually the relative size of each cost that is important. Even if you double the whole matrix, it shouldn’t affect your view of the classifier’s performance over a series of classification experiments.
Calculating the total misclassification cost
In order to make use of the cost matrix, we need to combine it with the confusion matrix we calculated earlier. You can do this by applying the following simple piece of code:
In[10]: total_cost = sum(CM .* C)
Out[10]: 36
This number may not mean much to us, but if the cost matrix is populated in a meaningful way, this result would be more insightful. For example, if the numbers in C represented thousands in lost revenue, the above output would mean that this classifier’s imperfect performance would cost us $36,000. When optimizing for the total misclassification cost we want as small a number as possible.
Receiver Operating Characteristic (ROC) Curve and related metrics
ROC Curve
The ROC curve is one of the most common methods for assessing a classifier’s performance for cases with just two classes, particularly if you are new to the field. It doesn’t take any parameters and provides you with an insightful plot that you can share with your supervisors and anyone else who doesn’t need to see all of the details.
Contrary to what its name suggests, it’s not usually a curve, but more like a zig-zag line. It basically shows how well a classifier performs in terms of true positives and false positives, under various conditions. Usually the classifier is applied with a different set of parameters and its performance is noted as a point in the plot. Afterward, all the points are connected, resulting in a plot called the ROC curve. The closer the curve lies to the top-left corner of the plot, the better the classifier is overall.
In Figure 9.1 you can see a typical ROC curve. The dotted line across the plot shows the performance of a naive classifier as a reference. The closer the curve is to that line, the worse the classifier’s performance is.
Figure 9.1. A typical ROC curve of a classifier.
In order to create a ROC curve, though, you’ll need more than just the predicted labels of the classification. You’ll also require the probabilities of each data point’s classification. This is basically a measure of how confident the classifier is for a particular label, and in many cases is the score that defines the label (this score is sometimes referred to as “degree of confidence”). Let us first generate an array p with these probabilities (this would normally come from the classifier as one of its outputs):
In[11]: p = [0.6, 0.55, 0.65, 0.6, 0.7, 0.65, 0.9, 0.75, 0.55, 0.65, 0.8, 0.45]
You can generate ROC curves in Julia using the following code, after you’ve loaded the ROC package:
In[12]: using ROC
In[13]: z = (y .== t)
rc = ROC.roc(p, z)
Since the roc() function exists in the MLBase as well, we should make it explicit to Julia that we want to use the function from the ROC package, which is why we call this function by adding “ROC.” in front of it.
Now, let’s create a plot based on the ROC object we have created:
In[14]: plot(rc)
Figure 9.2. The ROC curve of the classifier in our example.
The MLBase package also has a few functions for the ROC curve, but if you want to create a plot of the curve itself, the easiest way is through the ROC package. Still, once you are comfortable with Julia, we recommend you give it a try as it is much better documented than the ROC package.
AUC Metric
Since the ROC curve itself doesn’t tell us much about exactly how well a classifier performs, we often need to quantify its insight through a metric. AUC, which stands for area under curve, is the most popular metric as it shows exactly that, taking values between 0 and 1. The higher its value, the better the classifier performed in this particular experiment. Let’s look at how we can calculate AUC using Julia:
In[15]: AUC(rc)
Out[15]: 0.7142857142857143
Usually scores over 0.7 are considered satisfactory. In this fictitious example the classifier performed acceptably based on the AUC metric, though there is plenty of room for improvement. The relatively good performance is partly due to the fact that the classifier was not confident about the cases where it failed to provide a correct prediction. It is important to keep in mind that although this imaginary classifier performed decently overall, it may still not be a good choice for the problem at hand; take the evaluation of AUC with a grain of salt.
Gini Coefficient
This is another metric that stems from the ROC curve; it shows how a classifier’s performance compares to the random classifier represented by the straight line in the ROC plot. It takes values between -1 and 1, with 0 representing a performance as good as a random guess. The Gini Coefficient is calculated by the formula G = 2AUC - 1, where AUC is the area under the ROC curve metric, as we saw previously. Generally, a positive Gini Coefficient is good. However, it is not an absolute metric, since we may be interested in how a classifier performs for a particular class–something that this metric fails to capture.
In Julia there is no function that yields the Gini Coefficient of a classification, but if you want to calculate it you can use the AUC() function as a basis:
In[16]: function GC(roc_curve) = 2*AUC(roc_curve) - 1
In[17]: GC(rc)
Out[17]: 0.4285714285714286
Clearly, evaluation of this fictitious classification experiment is pretty good (about 43% better than chance), though there is still a lot of room for improvement.
Performance Metrics for Regression
Since the regression aspect of machine learning is a whole new animal, it comes with its own set of metrics. This is because the output of a regressor is a continuous variable, just like the target variable that it tries to predict. None of the classification metrics would work properly (if at all).
Regression performance metrics basically compute how different the prediction is from the target (this difference is also known as error). The lower they are, the better the performance of the regressor. The most widely-used metrics are the mean square error (MSE) and the sum squared error (SSE). There are a few reasons that performance metrics for regression tend to focus on squared errors:
MSE Metric and its variant, RMSE
By taking all the aforementioned squared errors and averaging them, we obtain the MSE metric. If we were to take the square root of this value, we would get the RMSE metric, which is a variant of MSE. RMSE is often used in physics, particularly in signal processing.
In Julia we can compute the MSE and RMSE metric as follows. First we need to get some sample data going:
In[18]: t = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 2.0, 3.0, 2.5, 1.5, 3.0]
y = [-0.5, 0.6, 1.4, 1.2, 1.9, 2.6, 3.1, 2.4, 2.9, 2.5, 1.3, 2.8]
The first vector (t) has the target values, while the second (y) has the corresponding predictions. Next, we need to calculate the squared errors (se) based on these vectors:
In[19]: se = (t - y).^2
Now, we can calculate MSE and RMSE easily by executing the following code:
In[20]: mse = mean(se)
Out[20]: 0.06583333333333334
In[21]: rmse = sqrt(mse)
Out[21]: 0.2565800719723442
Here are a couple of functions for calculating the MSE and RMSE metrics swiftly. Unfortunately, they aren’t yet contained in a reliable package.
MSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = mean((t-y).^2)
RMSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = sqrt(mean((t-y).^2))
SSE Metric
If we were to add the squared errors instead of taking their average, we would end up with the SSE metric. This is also popular when evaluating the performance of a regression system. Though it correlates perfectly with the MSE metric (so there is no need to calculate both), it may be useful in some cases where you need to calculate performance several times (e.g. through an optimization process), as it is slightly faster to compute. Here is how Julia can help you make use of this metric by using the squared errors vector we calculated previously:
In[22]: sse = sum(se)
Out[22]: 0.79
Should you want to wrap this whole process in a function, you could use something like this:
SSE(t::Array{Float64, 1}, y::Array{Float64, 1}) = sum((t-y).^2)
Other metrics
All the performance metrics for regression are basically variants of the Euclidean distance between the predicted targets and the actual ones. So it is not difficult to create your own performance metrics, if you want to dig deeper into the evaluation of your regressor’s performance.
For example, instead of the vanilla-flavored mean used in MSE, you could use the harmonic or even the reverse harmonic mean (which is not canon in statistics, despite its utility) on the squared errors. These two alternative metrics will be milder and harsher on the extreme errors performed by the regression system being assessed, respectively. Here are the corresponding functions for these metrics:
function harmean{T<:Real}(X::Array{T,1}, tol::Float64 = 0.1)
# tol = tolerance parameter for avoiding “division by 0” errors
return length(X) / sum(1./(X + tol)) - tol
end
function revharmean{T<:Real}(X::Array{T,1}, tol::Float64 = 0.1)
# Reverse Harmonic Mean
return maximum(X) + minimum(X) - harmean(X, tol)
end
function HSE{T<:Real}(y::Array{T,1}, t::Array{T,1})
# Harmonic mean of Squared Error
SE = (y-t).^2
return harmean(SE)
end
function RHSE{T<:Real}(y::Array{T,1}, t::Array{T,1})
# Reverse Harmonic mean of Squared Error
SE = (y-t).^2
return revharmean(SE)
end
Feel free to explore other alternatives as well. At the end of the day, the data science field is work in progress, so who is to say that the best performance metric for regression isn’t yet to be discovered? Besides, it may be the case that a custom performance metric works better for your data science problem.
K-fold Cross Validation (KFCV)
KFCV is an evaluation strategy that allows you to check whether a particular machine learning system achieves good generalization by performing a series of experiments on it. The experiments make use of the whole dataset by partitioning it into K roughly equal parts, which are used as the test set in K different iterations. In order to obtain useful results from KFCV in the case of classification problems, we often must perform stratified sampling.
The logic of KFCV is that if a machine learning system happens by chance to perform well at a particular sample, it might fool us into believing that it is a reliable one. However, it would be practically impossible to repeat this lucky break again and again, over mutually exclusive test sets. So, if a classifier or regressor has a lucky run, it may influence the results of a single experiment, but not of the whole series of experiments dictated by KFCV.
The selection of the K parameter in KFCV can be tricky. There is no “right answer” since it depends, as usual, on the problem at hand. Usually a K value of 10 provides adequate results (which is why it is the default option in the corresponding script for this task).
If we were to choose the maximum possible value for K (i.e. the total number of data points in the dataset), we would end up with a special evaluation technique that is referred to as Leave-One-Out validation. This is particularly useful for very small datasets, where sampling the data doesn’t make much sense as it would compromise the signal in the data. Although well-documented, this particular evaluation method is rarely used in data science (you could say that data scientists tend to “leave it out” of their toolbox!).
If you want to learn more about KFCV, we recommend studying a set of slides prepared by Professor Gutierrez-Osuna; they can be found at http://bit.ly/1gOxGwp. Also, even though you can use any kind of sampling for KFCV when dealing with classification problems, we strongly recommend that you go with stratified sampling; it is easy to get a biased sample at some point, after repeating the process K times.
Applying KFCV in Julia
You can use Julia to apply K-fold Cross Validation to your data using the provided script KFCV.jl (unfortunately, there isn’t a decent implementation of the KFCV method in any of the available packages at this time). Here is how you can do this for the magic dataset, using K = 10.
First you need to make sure that the KFCV.jl file is on your working directory. Then you can go ahead and load it:
In[23]: include(“KFCV.jl”)
Finally, you can apply it on your data as follows:
In[24]: P, T, PT, TT = KFCV(data[:,1:(end-1)], data[:,end], 10);
The semicolon at the end is optional, but can help keep your notebook more usable (if you don’t use the semicolon, Julia will flood your notebook with data that you don’t need to view).
The outputs of this function are all arrays consisting of 10 elements. Each one of these outputs are respectively:
KFCV tips
Although KFCV is better than any single test you can run on a machine learning system (even with stratified sampling), it is not a panacea. There are times that a classifier or a regressor would perform relatively well on a series of KFCV experiments; this is why we recommend you run a number of KFCVs—enough to yield statistical significance. However, as the data you use for developing your model gets larger, the need for repeated KFCV runs diminishes.
Also, if you want to compare two classifiers or two regressors to each other, you are better off using the same KFCV partitioning for both systems. This is particularly important if both machine learning systems are of about the same level of performance.
Summary
Sampling:
Classification:
Regression:
K-fold Cross Validation (KFCV):
Chapter Challenge