Train and test data

When we build predictive models, we need to split our data into separate segments. One segment is used by the model to learn the task, and another is used to test how well the model learned it. Here are the segments we will look at:

  • Train data: The segment of the data used to fit the model. During training, the model has access to the explanatory variables, or independent variables, which are the columns selected to describe a record in your data, as well as the target variable, or dependent variable, which is the value we are trying to predict. This segment should usually be between 50% and 80% of your total data.
  • Test data: The segment of the data used to evaluate the model's results. The model should never have access to this data during the learning process and should never see its target variable. After fitting our model during the training phase, we use it to predict target values for the test set. During this phase, only we have the correct answers for the dependent variable; the model never has access to these values. After the model makes its predictions, we can evaluate how well it performed by comparing the predicted values to the actual values. A simple split into train and test segments is shown in the first sketch after this list.
  • Validation data: A portion of the training dataset that is used to tune hyperparameters. As different values are tried for the hyperparameters, each candidate model is checked against the validation set, and the results gathered during this process are used to select the hyperparameter values that produce the best-performing model (see the second sketch after this list).
  • Cross-validation: One potential issue with using only one train and test set is that the model may learn descriptive features that are particular to that segment of the data, so what it learns may not generalize well to other data in the future. This is known as overfitting. To mitigate this problem, we can use a process known as cross-validation. In a simple example, we do an 80/20 split of the data, where 20% is held out as test data, and we model and test on this split. We then create a separate 80/20 split and repeat the modeling and testing. We can repeat this process 5 times with 5 different test sets, each composed of a different fifth of the data. This type of cross-validation is known as 5-fold cross-validation. After all of the iterations are completed, we check whether the results are consistent across the folds; if so, we can feel more confident that our model is not overfitting and will generalize to new data. The final sketch after this list shows a 5-fold run.
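
The following is a minimal sketch of an 80/20 train/test split using pandas and scikit-learn's train_test_split. The column names and toy data are hypothetical stand-ins for your own dataset:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Hypothetical toy dataset; in practice this would be your own data.
    df = pd.DataFrame({
        "feature_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "feature_b": [5, 3, 8, 1, 9, 2, 7, 4, 6, 0],
        "target":    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    })

    X = df[["feature_a", "feature_b"]]   # explanatory (independent) variables
    y = df["target"]                     # target (dependent) variable

    # Hold out 20% of the rows as test data; stratify keeps both classes
    # represented in each segment. The model never sees y_test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = LogisticRegression()
    model.fit(X_train, y_train)            # fit on the training segment only

    predictions = model.predict(X_test)    # predict on the unseen test segment
    print((predictions == y_test).mean())  # fraction of correct predictions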
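
To illustrate the role of a validation set, here is a sketch that carves a validation segment out of X_train and y_train from the previous snippet and uses it to choose among a hypothetical grid of values for logistic regression's C hyperparameter:

    # Carve a validation segment out of the training data from above.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.25, stratify=y_train, random_state=42
    )

    best_score, best_c = -1.0, None
    for c in [0.01, 0.1, 1.0, 10.0]:           # candidate hyperparameter values
        candidate = LogisticRegression(C=c).fit(X_fit, y_fit)
        score = candidate.score(X_val, y_val)  # check against the validation set
        if score > best_score:
            best_score, best_c = score, c

    print(f"best C: {best_c} (validation accuracy: {best_score:.2f})")

Note that the final model would then be refit on all of the training data with the chosen hyperparameter value, and the test set is still only touched once, at the very end.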
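
Finally, a minimal sketch of 5-fold cross-validation using scikit-learn's cross_val_score, applied to the full X and y from the first snippet; each fold serves once as the held-out fifth:

    from sklearn.model_selection import cross_val_score

    # Each of the 5 folds serves once as the held-out test fifth.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(scores)         # one accuracy score per fold
    print(scores.mean())  # consistent fold scores suggest the model generalizes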