Seeds classification

We will now look at three main groups of classification (learning) models: Support Vector Machines (SVM), Nearest Neighbors, and Random Forests. An SVM divides the feature space into regions, one per class, separated by boundaries. The boundary can be allowed to take different shapes; for example, it can be linear or quadratic. Nearest Neighbor classification identifies the k nearest neighbors of a data point and assigns the point to the class that most of those k neighbors belong to. The Random Forest classifier is a decision tree learning method which, in simple terms, derives rules from the given training data that can then be used to classify new data. A chain of if-statements, one after the other, is what gives the decision tree its name.
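To make the nearest neighbor idea concrete, here is a toy sketch; the knn_predict function and the small arrays are made up for illustration and are not part of the seeds example:

import numpy as np 
from collections import Counter 
 
def knn_predict(X_train, y_train, x_new, k=3): 
    # Euclidean distance from x_new to every training point 
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1)) 
    # Labels of the k closest training points 
    nearest = y_train[np.argsort(dists)[:k]] 
    # Majority vote among those labels 
    return Counter(nearest).most_common(1)[0][0] 
 
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]]) 
y_toy = np.array([0, 0, 1, 1]) 
print(knn_predict(X_toy, y_toy, np.array([1.1, 0.9])))  # prints 0 

The scikit-learn estimators we use later in this section wrap these three ideas behind a common fit/predict interface.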

The data that we are going to use comes from the UCI Machine Learning Repository (Lichman, M. (2013), http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science). The dataset contains several measured attributes of three different types of wheat grains (M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, and S. Zak, A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images, in: Information Technologies in Biomedicine, Ewa Pietka and Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24).

We want to create a classifier: something that, given the measured parameters of a seed, can tell us what type of seed it is. A description of the columns is supplied with the dataset and can also be found on the UCI web page for the dataset. There are eight columns: seven for the parameters and one for the known type of the seed (that is, the label). I have created a text file with this description; in case you are running a Linux-based system, you can list its contents with Jupyter magic:

%%bash 
less data/seeds.desc 

[Output: the dataset description from data/seeds.desc]

Tip

If you are running Microsoft Windows, you can use the more command instead, that is, more data/seeds.desc. It gives you the same output, but in a pop-up window; not as convenient, but still useful.

Now that we know what columns are there, we can read it into a Pandas DataFrame:

seeds = pd.read_csv('data/seeds_dataset.txt', 
       delim_whitespace=True, 
       names=['A', 'P', 'C', 'lkern', 'wkern',  
       'asym', 'lgro', 'gr']) 

As always, list the contents of what was read:

seeds.head() 

[Output: the first five rows of the seeds DataFrame]

Visualizing the data

We could simply run the whole classification process on all seven parameters in the dataset, but that is computationally expensive, and the cost grows quickly as the amount of data increases. To make a first attempt at selecting only the attributes that matter for classification, I want to visually inspect how the values of each attribute are distributed for the different types of grains. To do this, we first create a selection filter for each group:

gr1 = seeds.gr == 1 
gr2 = seeds.gr == 2 
gr3 = seeds.gr == 3 

To only plot the relevant parameters, we also create a list of the ones that we want to see (that is, not the type of grain):

pars = ['A','C','P','asym','lgro','lkern','wkern'] 

With Pandas' built-in histogram function, we can plot the attributes for each group. I have added some extra commands to make the figures look a bit better and less cluttered:

axes = seeds[pars][gr1].hist(figsize=(8,6)) 
despine(list(axes.flatten())) 
_ = [ax.grid() for ax in list(axes.flatten())] 
_ = [ax.locator_params(axis='x', nbins=4) for ax in  
    list(axes.flatten())] 
_ = [ax.locator_params(axis='y', nbins=2) for ax in  
    list(axes.flatten())] 
plt.subplots_adjust(wspace=0.5, hspace=0.7) 

Once again, we use the despine function to make the plots clearer. The preceding code plots all of the attributes for the first group, gr1; the histograms show how the values of each attribute are distributed.
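The despine function was defined earlier in the book; in case you do not have it at hand, a minimal stand-in could look like the following sketch (its name and behavior, hiding the top and right spines, are assumptions based on how it is used here):

def despine(axs): 
    # Accept a single Axes object or a list of Axes 
    if not isinstance(axs, (list, tuple)): 
        axs = [axs] 
    for ax in axs: 
        # Hide the top and right spines and keep ticks on the remaining sides 
        ax.spines['top'].set_visible(False) 
        ax.spines['right'].set_visible(False) 
        ax.xaxis.set_ticks_position('bottom') 
        ax.yaxis.set_ticks_position('left') 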

[Figure: histograms of the seven attributes for group 1]

With the selection filters, it is easy to plot the other groups:

axes = seeds[pars][gr2].hist(figsize=(8,6)) 
despine(list(axes.flatten())) 
_ = [ax.grid() for ax in list(axes.flatten())] 
_ = [ax.locator_params(axis='x', nbins=4) for ax in  
    list(axes.flatten())] 
_ = [ax.locator_params(axis='y', nbins=2) for ax in  
    list(axes.flatten())] 
plt.subplots_adjust(wspace=0.5, hspace=0.7) 

By plotting all of the groups, we can compare how the distributions of the attribute values differ between them; we are trying to identify the attributes that differ the most:

[Figure: histograms of the seven attributes for group 2]

axes = seeds[pars][gr3].hist(figsize=(8,6)) 
despine(list(axes.flatten())) 
_ = [ax.grid() for ax in list(axes.flatten())] 
_ = [ax.locator_params(axis='x', nbins=5) for ax in  
    list(axes.flatten())] 
_ = [ax.locator_params(axis='y', nbins=2) for ax in  
    list(axes.flatten())] 
plt.subplots_adjust(wspace=0.5, hspace=0.7) 

After plotting this last group, we can inspect each attribute and look for the ones whose distributions are well separated between the groups. Those are the attributes we can use to distinguish the various groups (types of seeds), that is, to classify them:

[Figure: histograms of the seven attributes for group 3]

From this, I would argue that the perimeter and groove length are good parameters due to their fairly well-defined and, for the three groups, separated peaks. To check this, we plot them against each other, marking the three groups of grains:

fig = plt.figure() 
ax = fig.add_subplot(111) 
ax.scatter(seeds.P[gr1], seeds.lgro[gr1],  
           color='LightCoral')  
ax.scatter(seeds.P[gr2], seeds.lgro[gr2],  
           color='SteelBlue', marker='s')  
ax.scatter(seeds.P[gr3], seeds.lgro[gr3],  
           color='Green', marker='<') 
ax.text(seeds.P[gr1].mean(), seeds.lgro[gr1].mean(), 
        '1', bbox=dict(color='w', alpha=0.7,  
                 boxstyle="Round")) 
ax.text(seeds.P[gr2].mean(), seeds.lgro[gr2].mean(), 
        '2', bbox=dict(color='w', alpha=0.7,  
                  boxstyle="Round")) 
ax.text(seeds.P[gr3].mean(), seeds.lgro[gr3].mean(), 
        '3', bbox=dict(color='w', alpha=0.7,  
                 boxstyle="Round")) 
ax.set_xlabel('Perimeter') 
ax.set_ylabel('Groove length') 
ax.set_title('Seed parameters') 
despine(ax) 
plt.minorticks_on() 
ax.locator_params(axis='x', nbins=5) 
ax.locator_params(axis='y', nbins=4) 
ax.set_xlim(11.8,18) 
ax.set_ylim(3.8,7.1); 

[Figure: scatter plot of perimeter versus groove length for the three groups]

Now, these are only two of the parameters; we have several more. Still, from this plot it seems that groups 1 and 3, that is, the circles and the triangles, are the hardest to separate.
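If you want more than a visual impression, one quick check (an addition to the original text) is to compare the per-group means and standard deviations of these two attributes:

# Mean and spread of perimeter (P) and groove length (lgro) per grain type 
print(seeds.groupby('gr')[['P', 'lgro']].agg(['mean', 'std'])) 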

Feature selection

Built into Scikit-learn are several ways of determining the best parameters to look at. This is sometimes called feature selection: trying to determine which parameters differ the most between the groups and are therefore best suited to describe the various groups as exactly that, distinct groups. Here, we use a method where we give a number, K (not to be confused with the K in K-means), and it selects the K best features.

First, we store the seeds table as a matrix, a NumPy array, and then we separate the data and labels:

X_raw = seeds.values  # DataFrame.as_matrix() has been removed from newer versions of Pandas 
X_pre, labels = X_raw[:, :-1], X_raw[:, -1] 

Now we can import the selection algorithm and run it on the data. Note that we also import the chi2 scoring function and supply it to the selection object; this means that chi-squared test statistics between each feature and the labels are used to rank the features, and the highest-scoring ones are kept:

from sklearn.feature_selection import SelectKBest 
from sklearn.feature_selection import chi2 
X_best = SelectKBest(chi2, k=2).fit_transform(X_pre, labels) 

It has now selected two columns; to check which columns, we print out the first few rows of the selection and the raw data:

X_best[:5] 

[Output: the first five rows of X_best]

seeds.head()

[Output: the first five rows of the seeds DataFrame]
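Instead of matching the printed values by eye, you can also ask the selector directly which columns it kept; a small sketch that refits the selector so that we keep a reference to the fitted object:

selector = SelectKBest(chi2, k=2).fit(X_pre, labels) 
# Map the boolean support mask back to the DataFrame's column names 
print(seeds.columns[:-1][selector.get_support()]) 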

The area (A) and the asymmetry coefficient (asym) are the two best parameters to work with according to this selection algorithm. Before we run the data through one of the machine learning algorithms for classification, we plot all of it again, but this time using the features selected by the algorithm:

fig = plt.figure() 
ax = fig.add_subplot(111) 
ax.scatter(seeds.A[gr1], seeds.asym[gr1],  
           color='LightCoral') 
ax.text(seeds.A[gr1].mean(), seeds.asym[gr1].mean(), 
        '1', bbox=dict(color='w', alpha=0.7,  
        boxstyle="Round")) 
ax.scatter(seeds.A[gr2], seeds.asym[gr2],  
           color='SteelBlue', 
           marker='s')  
ax.text(seeds.A[gr2].mean(), seeds.asym[gr2].mean(), 
        '2', bbox=dict(color='w', alpha=0.7,  
        boxstyle="Round")) 
ax.scatter(seeds.A[gr3], seeds.asym[gr3],  
           color='Green', 
           marker='<') 
ax.text(seeds.A[gr3].mean(), seeds.asym[gr3].mean(), 
        '3', bbox=dict(color='w', alpha=0.7,  
        boxstyle="Round")) 
ax.set_xlabel('Area') 
ax.set_ylabel('Asymmetry') 
ax.set_title('Seed parameters') 
despine(ax) 
plt.minorticks_on() 
ax.locator_params(axis='x', nbins=5) 
ax.locator_params(axis='y', nbins=3) 
ax.set_xlim(9.6,22) 
ax.set_ylim(-0.6,10); 

[Figure: scatter plot of area versus asymmetry for the three groups]

Compared with the previous, similar plot, the points are more spread out, and perhaps the circles and triangles overlap a little less, but it is hard to tell by eye.
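To make "hard to tell" a bit more concrete, one option (going beyond the original text) is to cross-validate a simple classifier on both two-column subsets and compare the mean accuracies; treat the numbers as indicative only:

from sklearn.model_selection import cross_val_score 
from sklearn.neighbors import KNeighborsClassifier 
 
# Compare the two hand-picked columns with the two selected by SelectKBest 
for cols in (['P', 'lgro'], ['A', 'asym']): 
    scores = cross_val_score(KNeighborsClassifier(), seeds[cols].values, labels, cv=5) 
    print(cols, round(scores.mean(), 3)) 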

Classifying the data

To start classifying the data, we prepare a few things first. We import the SVM module and K-nearest neighbor along with the random forest estimators. Within the SVM module is the Support Vector Classification (SVC) estimator, the main estimator for SVM. SVC can be run with different kernels; we will cover the linear, radial basis function, and polynomial kernels. I will give a short explanation of them before we run them.

To visualize the classification, I want to plot the boundaries, and we will use contour lines to do this. For this, we need to create a grid of points and evaluate them with our trained classifier:

from sklearn import svm 
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.ensemble import RandomForestClassifier 
 
res = 0.01  # resolution of the grid used to draw the decision boundaries 
#X, y = X_best[::2], labels[::2]  # uncomment to train on every second point only 
X, y = X_best, labels 
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1 
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1 
xx, yy = np.meshgrid(np.arange(x_min, x_max, res), 
                     np.arange(y_min, y_max, res)) 

Here, we write a function to draw the results. To draw the boundaries, the x and y grid that we created previously is flattened, stacked into a two-column array, and passed to the estimator's predict method. The function takes two inputs: the trained estimator, that is, the output of one of the machine learning classification models (the different SVMs, K-Nearest Neighbors, and Random Forest), and the title of the plot. The contour plot draws the boundaries; you can change ax.contour to ax.contourf to get filled contours. Now that we have a function to take care of the visualization, we can focus on testing the different models:

def plot_results(clf, title): 
    fig = plt.figure() 
    ax = fig.add_subplot(111) 
    plt.subplots_adjust(wspace=0.2, hspace=0.4) 
    # Predict a class for every point on the grid... 
    xxyy = np.vstack((xx.flatten(), yy.flatten())).T 
    Z = clf.predict(xxyy).reshape(xx.shape) 
    # ...and draw the class boundaries as contour lines 
    ax.contour(xx, yy, Z, 
               colors=['Green', 'LightCoral', 'SteelBlue'], 
               alpha=0.7, zorder=-1) 
    ax.scatter(seeds.A[gr1], seeds.asym[gr1],  
               color='LightCoral')  
    ax.scatter(seeds.A[gr2], seeds.asym[gr2],  
               color='SteelBlue', marker='s')  
    ax.scatter(seeds.A[gr3], seeds.asym[gr3],  
               color='Green', marker='<')
    ax.text(seeds.A[gr1].mean(), seeds.asym[gr1].mean(), 
            '1', bbox=dict(color='w', alpha=0.7,  
            boxstyle="Round")) 
    ax.text(seeds.A[gr2].mean(), seeds.asym[gr2].mean(), 
            '2', bbox=dict(color='w', alpha=0.7,  
            boxstyle="Round")) 
    ax.text(seeds.A[gr3].mean(), seeds.asym[gr3].mean(), 
            '3', bbox=dict(color='w', alpha=0.7,  
            boxstyle="Round")) 
    despine(ax) 
    plt.minorticks_on() 
    ax.locator_params(axis='x', nbins=5) 
    ax.locator_params(axis='y', nbins=3) 
    ax.set_xlabel('Area') 
    ax.set_ylabel('Asymmetry') 
    ax.set_title(title, size=10) 
    ax.set_xlim(9.6,22) 
    ax.set_ylim(-0.6,10); 

The SVC linear kernel

One simple kernel in SVC is the linear kernel, which assumes linear boundaries. As input, it takes C, the regularization parameter that determines how strongly misclassified (noisy) points are penalized; with very noisy data, you can decrease this parameter to allow a softer margin. To select the linear kernel, run the classification on our data, and plot the results with our function, we run the following:

svc = svm.SVC(kernel='linear', C=1.).fit(X, y) 
plot_results(svc, 'SVC-Linear') 

[Figure: decision boundaries of the SVC with the linear kernel]

As you can see, the boundaries are linear, and the classifier does a reasonable job of dividing the points into groups that correspond to the labels assigned by the researchers.
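If you want a number to go with the visual impression, the estimator's score method gives the mean accuracy; on the training data itself this is optimistic, so a cross-validated score (an addition to the original text) is also shown:

from sklearn.model_selection import cross_val_score 
 
print(svc.score(X, y))  # accuracy on the data the model was trained on 
print(cross_val_score(svm.SVC(kernel='linear', C=1.), X, y, cv=5).mean())  # less optimistic estimate 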

The SVC Radial Basis Function

The next kernel, the Radial Basis Function (RBF), is the default kernel used if no kernel is given in the SVC call; it is basically a Gaussian kernel. The resulting decision function is built up from a combination of Gaussians centered on the support vectors. In addition to the C parameter, the gamma parameter can be given here; it is the inverse width of the Gaussians, so it controls how sharply the boundary bends around individual points:

rbf_svc = svm.SVC(kernel='rbf', gamma=0.4, C=1.).fit(X, y) 
plot_results(rbf_svc, 'SVC-Radial Basis Function') 

[Figure: decision boundaries of the SVC with the RBF kernel]

Here, the boundaries are smoother. The borders in the center are roughly the same as with the linear kernel, but they differ going away from the dense regions.

The SVC polynomial

The last SVC kernel that we will cover is the polynomial kernel, and it is exactly what it sounds like: a polynomial. As input, it takes the degree of the polynomial:

poly_svc = svm.SVC(kernel='poly', degree=3, C=1.).fit(X, y) 
plot_results(poly_svc, 'SVC-Polynomial') 

[Figure: decision boundaries of the SVC with the polynomial kernel]

The borders are thus represented by polynomials. I suggest that you try to change the degree to something else and see what happens with the borders.
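As a starting point for that experiment, here is a small sketch that fits and plots a few different degrees (each call to plot_results opens a new figure):

# Try a few polynomial degrees and compare the resulting boundaries 
for degree in (2, 4, 5): 
    poly_clf = svm.SVC(kernel='poly', degree=degree, C=1.).fit(X, y) 
    plot_results(poly_clf, 'SVC-Polynomial, degree=%d' % degree) 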

K-Nearest Neighbour

Now we use the K-Nearest Neighbors classifier. As input, it takes the weighting scheme and the number of neighbors to compare with. The default, weights='uniform', gives equal weight to all n_neighbors nearby points; changing the weights keyword to 'distance' makes the weights decrease with distance:

knn = KNeighborsClassifier(weights = 'uniform', n_neighbors=5).fit(X, y) 
plot_results(knn, 'k-Nearest Neighbours') 

[Figure: decision boundaries of the k-nearest neighbors classifier]

This resembles some of the SVC kernels, but the boundary adapts to even smaller-scale structure in the data. Try changing weights to 'distance', and also change the n_neighbors parameter, to see how the results change; a sketch of such an experiment follows. What happens when you increase n_neighbors to something larger than 30, and which of the other classifiers does it then almost exactly replicate?
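Here is one way to set up that experiment; swap weights='uniform' for 'distance' to try the other weighting scheme:

# Try a few neighborhood sizes and compare the resulting boundaries 
for n in (5, 15, 30, 60): 
    knn_n = KNeighborsClassifier(weights='uniform', n_neighbors=n).fit(X, y) 
    plot_results(knn_n, 'k-Nearest Neighbours, n_neighbors=%d' % n) 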

Random Forest

As a last example classifier, we use the Random Forest method. You can think of it as the dendrogram for hierarchical clustering in Chapter 5, Clustering; here, however, each branch is a rule used to classify the data. We give the object three inputs: max_depth, n_estimators, and max_features. The first, max_depth, determines how deep each decision tree goes, and n_estimators gives how many decision trees there are in the forest. This is not really intuitive, so to see what it does, first set n_estimators=1, run the code, and look at the output; then change it to a higher number and look at the new output:

rfc = RandomForestClassifier( 
                             max_depth=3,  
                             n_estimators=10,  
                             max_features='sqrt').fit(X, y)  # 'sqrt' replaces 'auto', removed in newer scikit-learn 
plot_results(rfc, 'Random Forest Classifier') 

[Figure: decision boundaries of the random forest classifier]

Remember that n_estimators sets how many decision trees the classifier is built from. The resulting plot shows that the random forest classifier, although conceptually simple, is able to classify fairly complex problems. SVC with the linear kernel is also simple, but you can imagine how the random forest classifier would handle an even more complex problem with fairly low depth and few estimators. I suggest that you take some time to play around with the input parameters of all the classifiers and see how the results change.

Choosing your classifier

The preceding examples show you the different results for the SVC (with various kernels), kNN, and Random Forest classifiers. However, when should you use one over the other? In general, try all of the methods on the given problem. The main advantage of SVC is highlighted in these examples: it is very versatile because of its many different kernels. Decision tree methods, such as the random forest, run the risk of becoming too complex and thus overfitting the data. As the kNN figure shows, kNN is very good for classification problems where the boundary is clearly not linear. There are also other kernels and classifiers; for example, a comparison of 179 classifiers was made in the article Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? (M. Fernandez-Delgado et al., 2014, JMLR, 15, 3133-3181), which covers all of the common classifiers. However, as stated before, the important thing is to try various classifiers on your data and see what works; a simple cross-validation comparison, such as the sketch below, is often a good start.
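This sketch (an addition beyond the text above) cross-validates each of the estimators we built on the same two selected features and compares the mean accuracies; the exact numbers will vary and should only be taken as indicative:

from sklearn.model_selection import cross_val_score 
 
candidates = [('SVC linear', svm.SVC(kernel='linear', C=1.)), 
              ('SVC RBF', svm.SVC(kernel='rbf', gamma=0.4, C=1.)), 
              ('SVC poly', svm.SVC(kernel='poly', degree=3, C=1.)), 
              ('kNN', KNeighborsClassifier(n_neighbors=5)), 
              ('Random Forest', RandomForestClassifier(max_depth=3, n_estimators=10))] 
for name, clf in candidates: 
    scores = cross_val_score(clf, X, y, cv=5) 
    print('%-14s %.3f +/- %.3f' % (name, scores.mean(), scores.std())) 

If one classifier stands out, that is the one to investigate further, ideally with a proper train/test split.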
