D.3. Knowing is half the battle

In machine learning practice, if data is gold, labeled data is raritanium (or whatever metaphor works for what you hold most precious). Your first instinct may be to take every last bit of labeled data and feed it to the model. More training data leads to a more resilient model, right? But that would leave us with no way to test the model short of throwing it out into the real world and hoping for the best. This obviously isn’t practical. The solution is to split your labeled data into two, and sometimes three, datasets: a training set, a validation set, and, when you can afford it, a test set.

The training set is obvious. The validation set is a smaller portion of the labeled data we hold out and keep hidden from the model for one round of training. Good performance on the validation set is a first step to verifying that the trained model will perform well in the wild, as novel data comes in. You will often see an 80%/20% or 70%/30% split for training versus validation from a given labeled dataset. The test set is like the validation set—a subset of the labeled training data to run the model against and measure performance. But how is this test set different from the validation set then? In formulation, they aren’t different at all. The difference comes in how you use each of them.
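
If you’re working in Python, a split like this is a one-liner with scikit-learn’s train_test_split (just one reasonable way to do it; the X and y arrays below are toy stand-ins for your real features and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for a labeled dataset: 1,000 samples with 10 features each
# and binary labels. Swap in your own features (X) and labels (y).
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# Hold out 20% of the labeled data for validation (shuffling is on by default).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape)  # (800, 10) (200, 10)
```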

While training the model on the training set, there will be several iterations with various hyperparameters; the final model you choose will be the one that performs the best on the validation set. But there’s a catch. How do you know you haven’t tuned a model that’s merely highly biased toward the validation set? There’s no way to verify that the model will perform well on data from the wild. And this is what your boss or the readers of your white paper are most interested in: how well it will work on their data.

So if you have enough data, you want to hold out a third chunk of the labeled dataset as a test set. This will allow your readers (or boss) to have more confidence that your model will work on data that your training and tuning process was never allowed to see. Once the trained model is selected based on validation set performance, and you’re no longer training or tweaking your model at all, you can run predictions (inference) on each sample in the test set. If the model performs well against this third set of data, it has generalized well. For this kind of high-confidence model verification, you will often see a 60%/20%/20% training/validation/test dataset split.
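
One way to get that 60%/20%/20% split, again assuming scikit-learn, is to call train_test_split twice: once to carve off the test set, and once to divide what remains into training and validation sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)            # toy features
y = np.random.randint(0, 2, size=1000)  # toy labels

# First carve off 20% as the test set, then split the remaining 80%
# into 75%/25%, which works out to 60%/20%/20% of the original data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```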

Tip

Shuffling your dataset before you make the split between training, validation, and testing datasets is vital. You want each subset to be a representative sample of the “real world,” and they need to have roughly equal proportions of each of the labels you expect to see. If your training set has 25% positive examples and 75% negative examples, you want your test and validation sets to have 25% positive and 75% negative examples, too. And if your original dataset had all the negative examples first and you did a 50%/50% train/test split without shuffling the dataset first, you’d end up with 100% negative examples in your training set and 50%/50% in your test set. Your model would never learn from the positive examples in your dataset.
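
The same train_test_split function sketched above can take care of both the shuffling and the label balancing through its stratify parameter; here’s a rough example with a made-up 25%/75% label mix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 25% positive, 75% negative.
X = np.random.rand(1000, 5)
y = np.array([1] * 250 + [0] * 750)

# stratify=y shuffles the data and preserves the 25%/75% label mix in
# both subsets, so neither half ends up all-negative.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(y_train.mean(), y_test.mean())  # both print 0.25
```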
