Classifying a harder dataset

The previous dataset was an easy dataset for classification using texture features. In fact, many of the problems that are interesting from a business point of view are relatively easy. However, sometimes we may be faced with a tougher problem and need better and more modern techniques to get good results.

We will now test a public dataset, which has the same structure: several photographs split into a small number of classes. The classes are animals, cars, transportation, and natural scenes.

When compared to the three-class problem we discussed previously, these classes are harder to tell apart. Natural scenes, buildings, and text have very different textures. In this dataset, however, texture and color are not clear markers of the image class. The following is one example from the animal class:

And here is another example from the car class:

Both objects are set against natural backgrounds, with large, smooth areas inside the objects themselves. This is a harder problem than the previous dataset, so we will need to use more advanced methods. The first improvement will be to use a slightly more powerful classifier. The logistic regression that scikit-learn provides is a penalized form of logistic regression, which contains an adjustable parameter, C. By default, C = 1.0, but this may not be optimal. We can use a grid search to find a good value for this parameter as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

C_range = 10.0 ** np.arange(-4, 3)  # candidate values from 1e-4 to 1e2
grid = GridSearchCV(LogisticRegression(), param_grid={'C': C_range})
clf = Pipeline([('preproc', StandardScaler()),
                ('classifier', grid)])
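
Once clf has been fitted, the grid search object exposes the value of C it selected. The following is a small sketch, not part of the original listing, assuming the ifeatures and labels arrays computed for this dataset (they are used in the cross-validation code below):

# Hypothetical usage: fit on the full data and inspect the selected C
clf.fit(ifeatures, labels)
print('Best C:', clf.named_steps['classifier'].best_params_['C'])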

The data is not organized in a random order inside the dataset: similar images are close together. Thus, we use a cross-validation schedule that shuffles the data so that each fold has a more representative training set, as shown in the following code:

from sklearn import model_selection

cv = model_selection.KFold(n_splits=5,
                           shuffle=True, random_state=123)
scores = model_selection.cross_val_score(
    clf, ifeatures, labels, cv=cv)
print('Accuracy: {:.1%}'.format(scores.mean()))
Accuracy: 73.4%
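
To see which of the four classes are most often confused with each other, we can also build a confusion matrix from cross-validated predictions. This is a small sketch, not from the original code, reusing the clf, cv, ifeatures, and labels objects defined above:

from sklearn.metrics import confusion_matrix

# Hedged sketch: collect out-of-fold predictions and tabulate them
predictions = model_selection.cross_val_predict(
    clf, ifeatures, labels, cv=cv)
print(confusion_matrix(labels, predictions))

Each row of the matrix corresponds to a true class and each column to a predicted class, so large off-diagonal entries show which pairs of classes the texture features fail to separate.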

This is not so bad for four classes, but we will now see if we can do better by using a different set of features. In fact, we will see that we need to combine these features with other methods to get the best possible results.
