Underfitting and overfitting

Underfitting and overfitting are not problems of one particular classifier; they affect all supervised learning methods.

Imagine you have a classifier with just one rule that tries to distinguish between healthy and unhealthy patients. The rule is as follows:

If Temperature < 37 then Healthy

This classifier will label every patient with a temperature below 37 degrees as healthy, and everyone else as unhealthy. Such a crude rule will have a huge error rate. The tree that represents this rule will have only the root node and two branches, with a leaf at the end of each branch.

Underfitting occurs when the tree is too shallow to classify new observations correctly; the rules are too general.
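As a minimal sketch of this behaviour (assuming Python with scikit-learn and a made-up toy dataset; neither comes from this chapter), we can force a tree to depth 1 so that it can express only a single rule like the one above, and watch its error rate stay high even on the training data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: health depends on temperature AND a symptom,
# so no single temperature rule can capture it.
rng = np.random.default_rng(42)
n = 1000
temperature = rng.normal(37.0, 1.0, size=n)
symptom = rng.integers(0, 2, size=n)        # 0 = absent, 1 = present
healthy = ((temperature < 37.5) & (symptom == 0)).astype(int)
X = np.column_stack([temperature, symptom])

X_train, X_test, y_train, y_test = train_test_split(X, healthy, random_state=0)

# A depth-1 tree (a "stump") can express only one rule, for example
# "If Temperature < 37.5 then Healthy" -- it underfits.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("training accuracy:", stump.score(X_train, y_train))
print("test accuracy:    ", stump.score(X_test, y_test))

Because health here depends on the symptom as well as the temperature, the one-rule stump scores poorly on both the training set and the test set; high error everywhere is the hallmark of underfitting.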

On the other hand, if we have a dataset with many attributes and we grow a very deep Decision Tree, we risk producing a Tree that fits the training dataset well but is unable to predict new examples. In our previous example, we might end up with a rule such as this:

If Temperature < 37 and Symptom_A = V and … and Symptom_B = Y and … and Age = 12 and … and Eyes = Blue and Height = 182 and Weight = 74.6 then Healthy

In this case, the rule is too specific. What happens if the next patient weighs 76 kg? The classifier will not be able to classify the new patient correctly. The Tree is too deep and its rules are too specific; this problem is called overfitting. We'll see a very low error rate on the training data but a high error rate on the test data.
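Here is a hedged sketch of this opposite failure (again assuming Python with scikit-learn and fabricated noisy data): an unrestricted tree memorizes the training set, including its noise, so its training error is near zero while its test error is clearly higher.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical noisy data: many attributes that are irrelevant to health
# (eye colour, height, weight, ...) plus some label noise, mimicking the
# over-specific rule above.
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 20))                # 20 attributes, mostly irrelevant
y = (X[:, 0] < 0).astype(int)               # only the first attribute matters
noise = rng.random(n) < 0.15                # flip 15% of the labels
y = np.where(noise, 1 - y, y)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unrestricted tree keeps splitting until every training example is
# isolated: near-perfect training accuracy, noticeably lower test accuracy.
deep = DecisionTreeClassifier().fit(X_train, y_train)
print("training accuracy:", deep.score(X_train, y_train))   # close to 1.0
print("test accuracy:    ", deep.score(X_test, y_test))     # clearly lower

# Capping the depth trades a little training accuracy for generalization.
shallow = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("capped-depth test accuracy:", shallow.score(X_test, y_test))

The gap between the two printed accuracies of the deep tree is exactly the symptom described above: a very low error rate on the training data and a high error rate on the test data.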

We'll come back to overfitting in the next chapter.
