Sentiment analysis using LingPipe

Sentiment analysis is performed in much the same way as general text classification. One difference is that it uses only two categories: positive and negative.

We need data files to train our model. We will follow a simplified version of the sentiment analysis tutorial at http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html, using sentiment data that was developed for movies (http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz). This data consists of 1,000 positive and 1,000 negative reviews of movies that are in IMDb's movie archives.

These reviews need to be downloaded and extracted. A txt_sentoken directory will be extracted along with its two subdirectories: neg and pos. Both of these subdirectories contain movie reviews. Although some of these files can be held in reserve to evaluate the model that was created, we will use all of them to simplify the explanation.
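The idea of holding some files in reserve can be sketched with plain Java collections. This is a minimal illustration and not part of LingPipe: the HoldoutSplit class, its split method, and the heldOutFraction and seed parameters are names made up for this example. It shuffles a list of items (for instance, the review files of one category) and sets a fraction aside for later evaluation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldoutSplit {
    /** Holds the two parts of a split (hypothetical helper type). */
    public static final class Split<T> {
        public final List<T> training;
        public final List<T> heldOut;

        Split(List<T> training, List<T> heldOut) {
            this.training = training;
            this.heldOut = heldOut;
        }
    }

    /** Shuffles a copy of items and reserves the given fraction for evaluation. */
    public static <T> Split<T> split(List<T> items,
            double heldOutFraction, long seed) {
        if (heldOutFraction < 0.0 || heldOutFraction > 1.0) {
            throw new IllegalArgumentException(
                "heldOutFraction must be between 0 and 1");
        }
        // Work on a copy so the caller's list is left untouched
        List<T> shuffled = new ArrayList<>(items);
        // A fixed seed makes the split repeatable between runs
        Collections.shuffle(shuffled, new Random(seed));
        int heldOutCount =
            (int) Math.round(shuffled.size() * heldOutFraction);
        List<T> heldOut =
            new ArrayList<>(shuffled.subList(0, heldOutCount));
        List<T> training = new ArrayList<>(
            shuffled.subList(heldOutCount, shuffled.size()));
        return new Split<>(training, heldOut);
    }

    public static void main(String[] args) {
        // Placeholder file names, used only to demonstrate the split
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            names.add("review" + i + ".txt");
        }
        Split<String> result = split(names, 0.2, 42L);
        System.out.println("training: " + result.training.size()
            + ", held out: " + result.heldOut.size());
    }
}
```

Splitting each category separately keeps the positive and negative reviews balanced in both parts.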

We will start by reinitializing the variables declared in the Using LingPipe to classify text section. The categories array is set to a two-element array that holds the two categories. The classifier variable is assigned a new DynamicLMClassifier instance that is created with the new categories array and an nGramSize of 8:

categories = new String[2]; 
categories[0] = "neg"; 
categories[1] = "pos"; 
nGramSize = 8; 
classifier = DynamicLMClassifier.createNGramProcess( 
    categories, nGramSize); 
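To make the n-gram size more concrete: as far as we can tell from the LingPipe API, the second argument of createNGramProcess is a maximum character n-gram length, so the model works with sequences of characters and not words. The following sketch is independent of LingPipe and does not show how it builds its language models. The CharNGrams class and nGrams method are hypothetical names used only to show what the character 8-grams of a short string look like:

```java
import java.util.ArrayList;
import java.util.List;

public class CharNGrams {
    /** Returns every substring of exactly n characters, in order. */
    public static List<String> nGrams(CharSequence text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<>();
        // Slide a window of n characters across the text
        for (int start = 0; start + n <= text.length(); start++) {
            result.add(text.subSequence(start, start + n).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [sweetnes, weetness]
        System.out.println(nGrams("sweetness", 8));
    }
}
```

A larger n-gram size lets a model capture longer character patterns, at the cost of more memory and a greater need for training data.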

As we did earlier, we will create a series of Classified instances based on the content found in the training files. We will not examine the following code in detail as it is very similar to the code found in the Training text using the Classified class section. The main difference is that there are only two categories to process:

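String directory = "...";
File trainingDirectory = new File(directory, "txt_sentoken");
for (int i = 0; i < categories.length; ++i) {
    // One Classification per category ("neg" or "pos")
    Classification classification =
        new Classification(categories[i]);
    // The subdirectory name matches the category name
    File file = new File(trainingDirectory, categories[i]);
    File[] trainingFiles = file.listFiles();
    if (trainingFiles == null) {
        // listFiles returns null if the directory cannot be read
        System.err.println("Cannot read directory: " + file);
        continue;
    }
    for (int j = 0; j < trainingFiles.length; ++j) {
        try {
            // Files here is LingPipe's com.aliasi.util.Files
            String review = Files.readFromFile(
                trainingFiles[j], "ISO-8859-1");
            Classified<CharSequence> classified =
                new Classified<>(review, classification);
            // Train the classifier with the labeled review
            classifier.handle(classified);
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}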

The model is now ready to be used. We will test it with a review of the movie Forrest Gump:

String review = "An overly sentimental film with a somewhat " 
    + "problematic message, but its sweetness and charm " 
    + "are occasionally enough to approximate true depth " 
    + "and grace. "; 

We use the classify method to perform the actual work. It returns a Classification instance whose bestCategory method returns the best category, as shown here:

Classification classification = classifier.classify(review); 
String bestCategory = classification.bestCategory(); 
System.out.println("Best Category: " + bestCategory); 

When executed, we get the following output:

Best Category: pos  
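A single prediction says little about how reliable the model is. If some reviews had been held in reserve, the fraction of correct predictions could be measured as sketched here. This is an illustration under stated assumptions: the AccuracyCheck class and accuracy method are hypothetical names, the three short texts are invented for the example, and a trivial stand-in function takes the place of the trained classifier so that the code runs without LingPipe. With the classifier trained earlier, the function could instead be text -> classifier.classify(text).bestCategory():

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class AccuracyCheck {
    /** Returns the fraction of texts whose predicted category matches. */
    public static double accuracy(Map<String, String> expectedByText,
            Function<String, String> classify) {
        if (expectedByText.isEmpty()) {
            throw new IllegalArgumentException("no examples to evaluate");
        }
        int correct = 0;
        for (Map.Entry<String, String> entry : expectedByText.entrySet()) {
            String predicted = classify.apply(entry.getKey());
            if (entry.getValue().equals(predicted)) {
                correct++;
            }
        }
        return (double) correct / expectedByText.size();
    }

    public static void main(String[] args) {
        // Invented examples, not taken from the movie review data
        Map<String, String> heldOut = new LinkedHashMap<>();
        heldOut.put("a charming, graceful film", "pos");
        heldOut.put("a dull and lifeless mess", "neg");
        heldOut.put("sweet but forgettable", "neg");

        // Stand-in for a trained classifier (an assumption for this demo)
        Function<String, String> standIn =
            text -> text.contains("dull") ? "neg" : "pos";

        System.out.println("accuracy: " + accuracy(heldOut, standIn));
    }
}
```

Measuring accuracy only on reviews that were not used for training avoids an overly optimistic estimate.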

The same approach can be applied to other kinds of text, provided that labeled training data is available for each category.
