Using the ColumnDataClassifier class for classification

The ColumnDataClassifier class classifies items that are each described by multiple values. In this demonstration, we will use a training file to create a classifier and then use a test file to assess its performance. The class uses a properties file to configure the creation process.

We will be creating a classifier that attempts to classify a box based on its dimensions. There are three possible categories: small, medium, and large. The height, width, and length dimensions of a box will be expressed as floating-point numbers. They are used to characterize a box.

The properties file specifies parameter information and supplies data about the training and test files. There are many possible properties that can be specified. For this example, we will use only a few of the more relevant properties.

We will use the following properties file, saved as box.prop. The first set of properties deals with the number of features that are contained in the training and test files. Since we used three values, three realValued columns are specified. The trainFile and testFile properties specify the location and names of the respective files:

useClassFeature=true 
1.realValued=true 
2.realValued=true 
3.realValued=true 
trainFile=.box.train 
testFile=.box.test 
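To make the structure of these keys concrete, the following sketch loads similar settings with the JDK's java.util.Properties class and lists the columns marked as realValued. It is only an illustration of the key format and does not show how ColumnDataClassifier reads the file internally. The class name BoxPropsSketch and the helpers load and realValuedColumns are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical helper class; not part of the Stanford API.
class BoxPropsSketch {

    // Parses properties-file text into a Properties object.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns every column number N for which "N.realValued" is true, ascending.
    static List<Integer> realValuedColumns(Properties props) {
        List<Integer> columns = new ArrayList<>();
        for (String key : props.stringPropertyNames()) {
            // Values are trimmed because Properties keeps trailing spaces.
            if (key.endsWith(".realValued")
                    && Boolean.parseBoolean(props.getProperty(key).trim())) {
                columns.add(Integer.parseInt(key.substring(0, key.indexOf('.'))));
            }
        }
        Collections.sort(columns);
        return columns;
    }

    public static void main(String[] args) {
        String text = "useClassFeature=true\n"
                + "1.realValued=true\n"
                + "2.realValued=true\n"
                + "3.realValued=true\n";
        System.out.println(realValuedColumns(load(text)));
    }
}
```

Column 0 does not appear in the result because it holds the category rather than a feature value.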

The training and test files use the same format. Each line consists of a category followed by the defining values, each separated by a tab. The box.train training file consists of 60 entries and the box.test file consists of 30 entries. These files can be downloaded from the book's GitHub repository at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Second-Edition/. The first line of the box.train file is shown next. The category is small; its height, width, and length are 2.34, 1.60, and 1.50, respectively:

small  2.34  1.60  1.50
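The line format described above can be sketched in plain Java as follows. This is only meant to show how a tab-separated line breaks down into a category and three dimensions; the class BoxLineSketch and its parse method are invented for this example and are not part of the Stanford library.

```java
import java.util.Arrays;

// Hypothetical holder for one line of box.train or box.test.
class BoxLineSketch {
    final String category;      // column 0: small, medium, or large
    final double[] dimensions;  // columns 1-3: height, width, length

    BoxLineSketch(String category, double[] dimensions) {
        this.category = category;
        this.dimensions = dimensions;
    }

    // Splits a line on tabs and converts the three values to doubles.
    static BoxLineSketch parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 4) {
            throw new IllegalArgumentException(
                "Expected 4 tab-separated fields but found " + fields.length);
        }
        double[] dims = new double[3];
        for (int i = 0; i < dims.length; i++) {
            dims[i] = Double.parseDouble(fields[i + 1].trim());
        }
        return new BoxLineSketch(fields[0].trim(), dims);
    }

    public static void main(String[] args) {
        BoxLineSketch box = parse("small\t2.34\t1.60\t1.50");
        System.out.println(box.category + " " + Arrays.toString(box.dimensions));
    }
}
```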

The classifier is created as follows. An instance of the ColumnDataClassifier class is created using the name of the properties file as the constructor's argument. An instance of the Classifier interface is returned by the makeClassifier method. This interface supports three methods, two of which we will demonstrate. The readTrainingExamples method reads the training data from the training file:

ColumnDataClassifier cdc =  
    new ColumnDataClassifier("box.prop"); 
Classifier<String, String> classifier =  
    cdc.makeClassifier(cdc.readTrainingExamples("box.train")); 

Executing this code produces extensive output. We will discuss the more relevant parts in this section. The first part of the output repeats parts of the properties file:

    3.realValued = true
    testFile = .box.test
    ...
    trainFile = .box.train

The next part displays the number of data items read, along with information about the labels and features, as shown here:

    Reading dataset from box.train ... done [0.1s, 60 items].
    numDatums: 60
    numLabels: 3 [small, medium, large]
    ...
    AVEIMPROVE     The average improvement / current value
    EVALSCORE      The last available eval score
    Iter ## evals ## <SCALING> [LINESEARCH] VALUE TIME |GNORM| {RELNORM} AVEIMPROVE EVALSCORE
  

The optimizer then iterates over the data to train the classifier:

    Iter 1 evals 1 <D> [113M 3.107E-4] 5.985E1 0.00s |3.829E1| {1.959E-1} 0.000E0 - 
    Iter 2 evals 5 <D> [M 1.000E0] 5.949E1 0.01s |1.862E1| {9.525E-2} 3.058E-3 - 
    Iter 3 evals 6 <D> [M 1.000E0] 5.923E1 0.01s |1.741E1| {8.904E-2} 3.485E-3 - 
    ...
    Iter 21 evals 24 <D> [1M 2.850E-1] 3.306E1 0.02s |4.149E-1| {2.122E-3} 1.775E-4 - 
    Iter 22 evals 26 <D> [M 1.000E0] 3.306E1 0.02s
    QNMinimizer terminated due to average improvement: | newest_val - previous_val | / |newestVal| < TOL 
    Total time spent in optimization: 0.07s
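The termination message quotes a relative-improvement test. The following sketch applies that formula to two consecutive objective values, purely to illustrate what the message means. The class ConvergenceSketch, the method converged, and the tolerance of 1e-4 are assumptions made for this example; the library's own check may differ, for instance by averaging over several past iterations.

```java
// Hypothetical illustration of the formula in the termination message:
// | newest_val - previous_val | / |newestVal| < TOL
class ConvergenceSketch {

    // Returns true when the relative change between two values is below tol.
    // If newest is 0, the ratio is NaN or infinite and the result is false.
    static boolean converged(double newest, double previous, double tol) {
        return Math.abs(newest - previous) / Math.abs(newest) < tol;
    }

    public static void main(String[] args) {
        double tol = 1e-4; // assumed tolerance, chosen only for this example

        // VALUE column of iterations 1 and 2: still improving noticeably.
        System.out.println(converged(59.49, 59.85, tol));

        // VALUE column of iterations 21 and 22: no change at the printed precision.
        System.out.println(converged(33.06, 33.06, tol));
    }
}
```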

At this point, the classifier is ready to use. Next, we use the test file to verify the classifier. We read the test file one line at a time using the ObjectBank class's getLineIterator method. This class supports the conversion of data that has been read into a more standardized form. The getLineIterator method returns one line at a time in a format that can be used by the classifier. The loop for this process is shown here:

for (String line :  
        ObjectBank.getLineIterator("box.test", "utf-8")) { 
    ... 
} 

Within the for-each statement, a Datum instance is created from the line and then passed to the classifier's classOf method, as shown in the following code. The Datum interface supports objects that contain features. The classOf method returns the category that the classifier determines for the datum:

Datum<String, String> datum = cdc.makeDatumFromLine(line); 
System.out.println("Datum: {"  
    + line + "]	Predicted Category: "  
    + classifier.classOf(datum)); 

When this sequence is executed, each line of the test file is processed and the predicted category is displayed. Only the first two and last two lines of the output are shown here. Of these, the first box, which is labeled small, is predicted to be medium; the other three are classified correctly:

    Datum: {small  1.33  3.50  5.43]  Predicted Category: medium
    Datum: {small  1.18  1.73  3.14]  Predicted Category: small
    ...
    Datum: {large  6.01  9.35  16.64]  Predicted Category: large
    Datum: {large  6.76  9.66  15.44]  Predicted Category: large
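Rather than reading the output by eye, the predictions can be compared with the category at the start of each test line to obtain an accuracy figure. The following sketch shows one way to do this with the JDK alone. The class AccuracySketch and the toyPredict rule are invented for this example; toyPredict uses arbitrary height thresholds and merely stands in for a call such as classifier.classOf(cdc.makeDatumFromLine(line)).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for scoring predictions against labeled test lines.
class AccuracySketch {

    // Fraction of lines whose first tab-separated field equals the prediction.
    static double accuracy(List<String> lines, Function<String, String> predictor) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("No test lines supplied");
        }
        int correct = 0;
        for (String line : lines) {
            String gold = line.split("\t", 2)[0];
            if (gold.equals(predictor.apply(line))) {
                correct++;
            }
        }
        return (double) correct / lines.size();
    }

    // Toy stand-in for the trained classifier: thresholds are arbitrary.
    static String toyPredict(String line) {
        double height = Double.parseDouble(line.split("\t")[1]);
        if (height < 1.3) {
            return "small";
        }
        return height < 6.0 ? "medium" : "large";
    }

    public static void main(String[] args) {
        // The four test lines displayed above, with tabs between fields.
        List<String> lines = Arrays.asList(
            "small\t1.33\t3.50\t5.43",
            "small\t1.18\t1.73\t3.14",
            "large\t6.01\t9.35\t16.64",
            "large\t6.76\t9.66\t15.44");
        System.out.println("Accuracy: " + accuracy(lines, AccuracySketch::toyPredict));
    }
}
```

With the real classifier, the predictor argument would wrap the makeDatumFromLine and classOf calls, and the list would hold every line of box.test.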
  

To test an individual entry, we can use the makeDatumFromStrings method to create a Datum instance. In the following code sequence, a one-dimensional array of strings is created, where each element holds one column of data for a box. The first entry, the category, is left as an empty string. The Datum instance is then used as the argument of the classOf method to predict its category:

String sample[] = {"", "6.90", "9.8", "15.69"}; 
Datum<String, String> datum =  
    cdc.makeDatumFromStrings(sample); 
System.out.println("Category: " + classifier.classOf(datum)); 

The output for this sequence is shown here. It correctly classifies the box:

Category: large