D.6. Imbalanced training sets

Machine learning models are only ever as good as the data you feed them. Having a huge amount of data helps only if that data covers all the cases you hope to predict in the wild. And covering each case only once isn’t necessarily enough. Imagine you’re trying to predict whether an image is a dog or a cat, but your training set contains 20,000 pictures of cats and only 200 pictures of dogs. A model trained on this dataset would likely learn to predict “cat” for any image, regardless of the input. And from the model’s perspective, that would be fine, right? It would be correct in 99% of the cases from the training set. Of course, that argument is bogus and that model is worthless. But regardless of which model you choose, the most likely cause of this kind of failure is the imbalanced training set, not the model itself.
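
To see concretely how misleading raw accuracy can be here, consider a minimal sketch in Python (the `labels` array below is a stand-in for the 20,000-cat/200-dog dataset just described):

```python
import numpy as np

# Stand-in labels for the imbalanced set above: 0 = cat, 1 = dog
labels = np.array([0] * 20000 + [1] * 200)

# A degenerate "model" that always predicts the majority class (cat)
predictions = np.zeros_like(labels)

accuracy = (predictions == labels).mean()
print(f"Accuracy of always guessing 'cat': {accuracy:.1%}")  # -> 99.0%
```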

Models can be very finicky about training sets, for the simple reason that the signal from an over-represented class in the labeled data can overwhelm the signal from the minority class. The weights will be updated more often by the error generated by the dominant class, and the signal from the minority class will be washed out. It isn’t vital to get an exactly even representation of each class, because models have some ability to overcome noise. The goal here is just to get the counts into the same ballpark.

The first step, as with any machine learning task, is to look long and hard at your data. Get a feel for the details and run some rough statistics on what the data actually represents. Find out not only how much data you have, but how much of each kind you have.
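
One quick way to run those rough statistics is simply to count the labels. Here’s a minimal sketch, assuming your labels live in a plain Python list (the `labels` list here is hypothetical):

```python
from collections import Counter

# Hypothetical label list; substitute your own training labels
labels = ['cat'] * 20000 + ['dog'] * 200

counts = Counter(labels)
total = sum(counts.values())
for cls, n in counts.most_common():
    print(f'{cls}: {n} ({n / total:.1%})')
# cat: 20000 (99.0%)
# dog: 200 (1.0%)
```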

So what do you do if things aren’t magically even from the beginning? If the goal is to even out the class representations (and it is), there are three main options: oversampling, undersampling, and augmenting.

D.6.1. Oversampling

Oversampling is the technique of repeating examples from the under-represented class or classes. Let’s take the dog/cat example from earlier (only 200 dogs to 20,000 cats). You can simply repeat the dog images you do have 100 times each and end up with 40,000 total samples, half dogs/half cats.
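
In code, that repetition is a one-liner. Here’s a minimal sketch, assuming your examples are referenced by file paths (the paths below are hypothetical stand-ins):

```python
import random

# Hypothetical file paths standing in for the dataset above
cat_images = [f'cat_{i}.jpg' for i in range(20000)]
dog_images = [f'dog_{i}.jpg' for i in range(200)]

# Repeat each dog image 100 times: 20,000 cats + 20,000 dogs
balanced = cat_images + dog_images * 100
random.shuffle(balanced)  # interleave the classes before training
print(len(balanced))  # 40000
```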

This is an extreme example, and as such will lead to its own problems. The network will likely get very good at recognizing those specific 200 dogs and not generalize well to other dogs not in the training set. But the technique of oversampling can certainly help balance a training set in cases that aren’t so radically imbalanced.

D.6.2. Undersampling

Undersampling is the opposite side of the same coin. Here you drop examples from the over-represented class. In the dog/cat example, you would randomly drop 19,800 cat images and be left with 400 examples: half dogs, half cats. This approach, of course, has a glaring problem of its own: you’ve thrown away the vast majority of the data and are working from a much narrower footing. Extreme cases such as this aren’t ideal, but undersampling can be a good path forward if the under-represented class already contains a large number of examples. Having that much data is definitely a luxury.
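
The sketch is just as simple as for oversampling; again, the file paths are hypothetical:

```python
import random

# Hypothetical file paths standing in for the dataset above
cat_images = [f'cat_{i}.jpg' for i in range(20000)]
dog_images = [f'dog_{i}.jpg' for i in range(200)]

# Randomly keep only as many cats as you have dogs
cats_kept = random.sample(cat_images, k=len(dog_images))

balanced = cats_kept + dog_images
random.shuffle(balanced)
print(len(balanced))  # 400
```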

D.6.3. Augmenting your data

This one is a little trickier, but in the right circumstances, augmenting the data can be your friend. The concept of augmentation is to generate novel data, either by perturbing the existing data or by generating it from scratch. affNIST (http://www.cs.toronto.edu/~tijmen/affNIST) is one such example. The famous MNIST dataset is a set of handwritten digits, 0-9 (see figure D.4). affNIST takes each of those digits and skews, rotates, and scales them in various ways, while maintaining the original labels. The purpose of that particular effort wasn’t to balance the training set; it was to make networks such as convolutional neural nets more resilient to new data written in other ways. But the concept of augmenting data applies here as well.

Figure D.4. The entries in the leftmost column are examples from the original MNIST; the other columns are all affine transformations of the data included in affNIST

[image credit: “affNIST” (http://www.cs.toronto.edu/~tijmen/affNIST)].
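
If you want to try affNIST-style perturbations on your own images, a minimal sketch with `scipy.ndimage` might look like the following. This isn’t the affNIST code itself, just an illustration of the idea; the rotation and shift ranges are arbitrary assumptions:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng()

def augment(image):
    """Return a randomly rotated and shifted copy of a 2-D image array."""
    # Small random rotation (in degrees), keeping the original frame size
    rotated = ndimage.rotate(image, angle=rng.uniform(-15, 15), reshape=False)
    # Small random shift (in pixels) along each axis
    return ndimage.shift(rotated, shift=rng.uniform(-2, 2, size=2))

# Hypothetical usage: generate 100 perturbed copies of each minority-class
# image, where minority_images is a list of 2-D numpy arrays
# augmented = [augment(img) for img in minority_images for _ in range(100)]
```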

You must be cautious, though. Adding data that isn’t truly representative of what you’re trying to model can hurt more than it helps. Say your dataset is the 200/20,000 dogs/cats split from earlier, and further assume that the images are all high-resolution color photos taken under ideal conditions. Handing a box of crayons to 19,000 kindergartners and asking for dog drawings wouldn’t necessarily get you the augmented data you desire. So think a bit about what augmenting your data will do to the model. The answer isn’t always clear, so if you do go down this path, keep it in mind while you validate the resulting model, and try to test around its edges to verify that you didn’t unintentionally introduce unexpected behavior.

And lastly, probably the least helpful thing to say, but it’s true: if your dataset is “incomplete,” you should always consider going back to the well for additional data. It isn’t always feasible, but you should at least treat it as an option.
