Understanding cross-validation

In the previous chapter, we built a model with certain assumptions and settings and measured its performance with the accuracy metric (the overall ratio of correctly classified labels). To do this, we split our data randomly into training and testing sets. While that approach is fundamental, it has its problems. Most importantly, with a single split, we may fine-tune our model to gain better performance on that particular test set at the expense of unseen data (in other words, the model may get worse overall while the metric on this specific dataset improves). This phenomenon is called overfitting.
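As a quick refresher, that single-split workflow looks roughly like the following sketch; the X and y names are placeholders for the feature dataframe and target series we'll define shortly:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# hold out a random test set, train on the rest,
# and measure accuracy on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2019)

model = DecisionTreeClassifier(random_state=2019)
model.fit(X_train, y_train)
accuracy_score(y_test, model.predict(X_test))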

To combat this issue, we'll use a slightly more complex approach: cross-validation. In its basic form, cross-validation splits the data into multiple so-called folds, or data subsections. Usually, each fold has approximately the same size and can be further balanced by the representation of the target variable or any other criterion. Cross-validation then iterates over the folds so that each fold serves as the test set exactly once, with all the remaining folds forming the training set. It trains the same model on the training data of each split and measures performance on the corresponding test set. Once that is done, we average each metric across all splits.
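To make the mechanics concrete, here is a rough manual sketch of this procedure built on scikit-learn's KFold splitter; again, X and y are placeholders for the features and target we'll define in a moment:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

kf = KFold(n_splits=4)  # 4 folds of roughly equal size
scores = []

for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=2019)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # score() returns accuracy for classifiers
    scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))

np.mean(scores)  # the average metric across all splits

In practice, we won't write this loop by hand; scikit-learn's helper functions do all of this for us, as we'll see next.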

Before we start improving our model, let's recreate our model from Chapter 13, Training a Machine Learning Model:

from sklearn.tree import DecisionTreeClassifier

model1 = DecisionTreeClassifier(random_state=2019, max_depth=10)

cols = [
    'allies_infantry', 'axis_infantry', 'allies_tanks', 'axis_tanks',
    'allies_planes', 'axis_planes', 'duration'
]

y = data['result_num']

Note that it is easier not to create an X variable but to take a subset of the columns on the fly, as we'll be adding more features throughout this chapter.

We don't need to split the data into training and testing sets ourselves; cross-validation will do that for us. Let's import the function and run it. We'll use 4 folds, as we only have 4 tie cases:

from sklearn.model_selection import cross_validate

cv = cross_validate(model1,
                    data[cols], y,
                    cv=4)

Now, let's convert cv from a dictionary into a dataframe:

>>> cv = pd.DataFrame(cv)
>>> cv
   fit_time  score_time  test_score
0  0.003015    0.001030    0.500000
1  0.003033    0.000937    0.571429
2  0.002311    0.000868    0.428571
3  0.001939    0.001125    0.250000

>>> cv['test_score'].mean()
0.4375
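As a quick aside, the same average can be obtained in a single step with cross_val_score, a convenience wrapper that reuses the model and data we defined above:

from sklearn.model_selection import cross_val_score

# returns only the array of per-fold test scores
scores = cross_val_score(model1, data[cols], y, cv=4)
scores.mean()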

The cross_validate function we just used is the basic, core function. In the next sections, we'll rely on cross_val_score, the simple wrapper shown above that returns only the test scores. As you can see, our average performance is rather poor: roughly 44% accuracy. This is our starting point; let's now improve it. To do that, let's first understand how feature engineering works!
