Weak or noisy data

The most important ingredient of a successful model is the dataset. If the data is noisy or incomplete, no machine learning technique will produce a highly performant model.

Let's illustrate this with a simple example. Suppose we study a population (in the statistical sense) of cars and we gather data about each car's color, shape, and manufacturer. It is difficult to build an accurate model that predicts the manufacturer, as many cars share the same color and shape but are made by different manufacturers.

The best any model can do is achieve 33% classification accuracy, as there are three equally likely manufacturers for any given combination of color and shape. Adding more informative features to the dataset can greatly improve the model's performance, whereas adding more models to an ensemble cannot. The following table depicts this sample dataset:

Color	Shape	Manufacturer
Black	Sedan	BMW
Black	Sedan	Audi
Black	Sedan	Alfa Romeo
Blue	Hatchback	Ford
Blue	Hatchback	Opel
Blue	Hatchback	Fiat

Car dataset
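
The 33% ceiling can be checked with a minimal sketch, given here under stated assumptions rather than as a definitive implementation. The helper `best_possible_accuracy` is a hypothetical name introduced for illustration: for every feature combination it counts the most frequent label, which is the best any deterministic classifier can do on these rows. The `origin` column in the second half is not part of the car dataset; it is an assumed extra feature, added only to show how a more informative feature raises the ceiling.

```python
from collections import Counter, defaultdict

# The car dataset from the table: ((color, shape), manufacturer)
cars = [
    (("Black", "Sedan"), "BMW"),
    (("Black", "Sedan"), "Audi"),
    (("Black", "Sedan"), "Alfa Romeo"),
    (("Blue", "Hatchback"), "Ford"),
    (("Blue", "Hatchback"), "Opel"),
    (("Blue", "Hatchback"), "Fiat"),
]


def best_possible_accuracy(rows):
    """Upper bound on the accuracy of any deterministic classifier on rows.

    Rows with identical features must receive the same prediction, so the
    best choice for each feature combination is its most frequent label.
    """
    labels_by_features = defaultdict(Counter)
    for features, label in rows:
        labels_by_features[features][label] += 1
    correct = sum(max(counts.values()) for counts in labels_by_features.values())
    return correct / len(rows)


accuracy = best_possible_accuracy(cars)
print(f"Color and shape only: {accuracy:.2%}")  # prints 33.33%

# ASSUMPTION: a hypothetical "origin" feature that is not in the original
# dataset, used only to illustrate the effect of adding a feature.
origin = {
    "BMW": "Germany", "Audi": "Germany", "Alfa Romeo": "Italy",
    "Ford": "USA", "Opel": "Germany", "Fiat": "Italy",
}
cars_with_origin = [
    (features + (origin[label],), label) for features, label in cars
]

accuracy_with_origin = best_possible_accuracy(cars_with_origin)
print(f"With hypothetical origin feature: {accuracy_with_origin:.2%}")  # prints 83.33%
```

With color and shape alone, each of the two feature combinations maps to three different manufacturers, so only two of the six rows can be classified correctly. The bound depends only on the data, not on the model, which is why an ensemble of any size cannot exceed it. With the assumed extra feature, only BMW and Audi remain indistinguishable, and the ceiling rises to five out of six.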
