Splitting the data into train and test sets

Now that we have our response variable, the next step is to split the dataset into training and testing sets. In data science, the training set is the data used to fit the model's parameters. During the training phase, the model considers the predictor variable values together with the response values to "discover" the rules and weights that will guide the prediction of new data. The testing set is then used to measure our model's performance, as we discussed in Chapter 3, Machine Learning Foundations. Typical splits allot 70-80% of the data to the training set and 20-30% to the testing set (unless the dataset is very large, in which case a smaller percentage can be allotted to the testing set).

Some practitioners also set aside a validation set, which is used to tune hyperparameters, such as the tree size in a random forest model or the regularization strength (the lasso parameter) in regularized logistic regression.
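For example, a validation set can be created by splitting the training data a second time. The following is a minimal sketch of this idea, assuming X_train and y_train have already been produced by the train/test split we perform later in this section; the 0.25 validation fraction and the variable names are arbitrary choices for illustration:

from sklearn.model_selection import train_test_split

# Carve a validation set out of the training data. Hyperparameters are
# tuned against the validation set, so the test set is touched only
# once, for the final performance estimate.
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1234
)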

Fortunately, the scikit-learn library has a handy function called train_test_split() that takes care of the random splitting for us when given the test set percentage. To use this function, we must first separate the target variable from the rest of the data, which we do as follows:

def split_target(data, target_name):
    # Extract the target column as a single-column DataFrame.
    target = data[[target_name]]
    # Remove the target column from the predictors (modifies data in place).
    data.drop(target_name, axis=1, inplace=True)
    return (data, target)

X, y = split_target(df_ed, 'ADMITFINAL')

After running the preceding code, y holds our response variable and X holds our dataset. We feed these two variables to the train_test_split() function, along with 0.25 for the test_size and a random state for reproducibility:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=1234
)

The result is four objects: X_train, X_test, y_train, and y_test. We can now use X_train and y_train to train the model, and X_test and y_test to measure the model's performance.
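As a quick sanity check (not part of the original workflow), we can confirm the 75/25 proportions by printing the shapes of the resulting objects:

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)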

An important thing to remember is that any transformation made to the training set during the preprocessing phase must also be applied to the testing set at test time; otherwise, the model's output for the new data will be incorrect.
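To make this concrete, here is a minimal sketch using scikit-learn's StandardScaler (a hypothetical preprocessing step, not one we apply in this chapter), assuming the predictors are numeric. The scaler is fit on the training data only, and the same fitted transformation is then reused on the testing data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit the scaling parameters on the training data only...
X_train_scaled = scaler.fit_transform(X_train)

# ...and reuse the same fitted transformation on the testing data.
# Refitting on X_test would produce inconsistent inputs at test time.
X_test_scaled = scaler.transform(X_test)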

As a sanity check and also to detect any target variable imbalance, let's check the number of positive and negative responses in the response variable:

print(y_train.groupby('ADMITFINAL').size())

The output is as follows:

ADMITFINAL
0    15996
1     2586
dtype: int64
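Equivalently (a quick check, not in the original code), the positive response rate can be computed directly:

print(y_train['ADMITFINAL'].mean())

This gives 2586 / 18582, or approximately 0.14.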

Our result indicates that approximately 1 out of 7 observations in the training set has a positive response. While this is not a perfectly balanced dataset (in which case the ratio would be 1 out of 2), it is not so imbalanced that we need to upsample or downsample the data.
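If the imbalance were more severe, or if we wanted to guarantee that the training and testing sets preserve the same proportion of positive responses, train_test_split() accepts a stratify parameter. The following variation (not used in this chapter's code) illustrates it:

from sklearn.model_selection import train_test_split

# Stratified splitting preserves the class proportions of y in both
# the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y
)

Let's proceed with preprocessing the predictors.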
