Weak or noisy data

The most important ingredient of a successful model is the dataset. If the data is noisy or incomplete, no machine learning technique will produce a highly performant model.

Let's illustrate this with a simple example. Suppose we study populations (in the statistical sense) of cars and we gather data about the color, shape, and manufacturer. It is difficult to build a very accurate model for any of these variables, as many cars share the same color and shape but are made by different manufacturers. The following table depicts this sample dataset.

When predicting the manufacturer from color and shape, the best any model can do is achieve about 33% classification accuracy, as there are three equally likely manufacturers for each feature combination. Adding more informative features to the dataset can greatly improve the model's performance. Adding more models to an ensemble cannot, because every model in the ensemble sees the same ambiguous features:

Color    Shape        Manufacturer
Black    Sedan        BMW
Black    Sedan        Audi
Black    Sedan        Alfa Romeo
Blue     Hatchback    Ford
Blue     Hatchback    Opel
Blue     Hatchback    Fiat

Car dataset
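
The 33% figure can be checked with a few lines of Python. The following is a minimal sketch, not part of the original text: the helper name accuracy_ceiling is hypothetical, and it assumes the task is to predict the manufacturer from color and shape on exactly the six rows of the table. For each distinct feature combination, a classifier can output only one label, so at best it matches the most frequent manufacturer in that group.

```python
from collections import Counter, defaultdict

# Rows copied from the car dataset table: (color, shape, manufacturer).
CARS = [
    ("Black", "Sedan", "BMW"),
    ("Black", "Sedan", "Audi"),
    ("Black", "Sedan", "Alfa Romeo"),
    ("Blue", "Hatchback", "Ford"),
    ("Blue", "Hatchback", "Opel"),
    ("Blue", "Hatchback", "Fiat"),
]


def accuracy_ceiling(rows):
    """Return the best accuracy a deterministic classifier can reach on rows.

    Hypothetical helper: the last item of each row is treated as the label,
    and everything before it as the features.
    """
    labels_by_features = defaultdict(Counter)
    for *features, label in rows:
        labels_by_features[tuple(features)][label] += 1
    # Identical inputs must receive the same prediction, so within each
    # group only the most frequent label can be counted as correct.
    correct = sum(max(counts.values()) for counts in labels_by_features.values())
    return correct / len(rows)


ceiling = accuracy_ceiling(CARS)
print(f"{ceiling:.0%}")  # prints 33%
```

The ceiling depends only on the features and labels, not on the learning algorithm. An ensemble is still a function of the same color and shape values, so it is bound by the same limit.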