TextBlob is an interesting library that offers a collection of tools for text processing. It provides an API for common natural language processing (NLP) tasks, such as classification, noun phrase extraction, part-of-speech tagging, and sentiment analysis.
A few steps are involved before one can use TextBlob. Any NLP library needs some corpora; therefore, the following installation and configuration sequence needs to be completed before attempting to use this library:
Install TextBlob (either via conda or pip). Anaconda users can run binstar search -t conda textblob to find where to install it from; more details can be found in Appendix, Go Forth and Explore Visualization.
The following command downloads the corpora:

$ python -m textblob.download_corpora
[nltk_data] Downloading package brown to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping taggers/maxent_treebank_pos_tagger.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/administrator/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
TextBlob makes it easy to create custom text classifiers. To understand this better, one may need to experiment with one's own training and test data. In TextBlob version 0.6.0, the following classifiers are available:
BaseClassifier
DecisionTreeClassifier
MaxEntClassifier
NLTKClassifier *
NaiveBayesClassifier
PositiveNaiveBayesClassifier
The classifier marked with * is the abstract class that wraps around the nltk.classify module.
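To see the idea behind NaiveBayesClassifier, it helps to recall how a bag-of-words Naive Bayes model works: count how often each word occurs under each label, then score a new sentence by combining the per-word probabilities with the label priors. The following is a minimal sketch of that idea in plain Python with add-one smoothing; it is an illustration of the technique, not TextBlob's actual implementation (TextBlob delegates to nltk.classify and uses contains(word) features):

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Count word occurrences per label (bag-of-words with add-one smoothing)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify_nb(model, text):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label in label_counts:
        # log P(label) + sum of log P(word | label), add-one smoothed
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [('I like this new tv show', 'pos'),
         ('this movie was amazing', 'pos'),
         ('I do not enjoy my job', 'neg'),
         ('the service was horrible', 'neg')]
model = train_nb(train)
print(classify_nb(model, 'the show was amazing'))   # 'pos'
print(classify_nb(model, 'the service was horrible'))  # 'neg'
```

Real classifiers differ mainly in the feature extractor and the estimator, which is exactly the axis along which the classes in the list above vary.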
For sentiment analysis, one can use the Naive Bayes classifier and train the system with this classifier and textblob.en.sentiments.PatternAnalyzer. A simple example is as follows:
from textblob.classifiers import NaiveBayesClassifier
from textblob.blob import TextBlob

train = [('I like this new tv show.', 'pos'),
         # similar train sentences with sentiments go here
         ]
test = [('I do not enjoy my job', 'neg'),
        # similar test sentences with sentiments go here
        ]

cl = NaiveBayesClassifier(train)
cl.classify("The new movie was amazing.")  # shows if pos or neg

cl.update(test)

# Classify a TextBlob
blob = TextBlob("The food was good. But the service was horrible. "
                "My father was not pleased.", classifier=cl)
print(blob)
print(blob.classify())

for sentence in blob.sentences:
    print(sentence)
    print(sentence.classify())
Here is the result that will be displayed when the preceding code is run:
pos
neg
The food was good.
pos
But the service was horrible.
neg
My father was not pleased.
pos
One can read the training data from a file, in either text or JSON format. The sample data in the JSON file is shown here:
[
  {"text": "mission impossible three is awesome btw", "label": "pos"},
  {"text": "brokeback mountain was beautiful", "label": "pos"},
  {"text": "da vinci code is awesome so far", "label": "pos"},
  {"text": "10 things i hate about you + a knight's tale * brokeback mountain", "label": "neg"},
  {"text": "mission impossible 3 is amazing", "label": "pos"},
  {"text": "harry potter = gorgeous", "label": "pos"},
  {"text": "i love brokeback mountain too: ]", "label": "pos"}
]

The following code trains a classifier from the JSON file, classifies every line of a test file, and collects the nouns and verbs that appear in positive and negative sentences:

from textblob.classifiers import NaiveBayesClassifier
from textblob.blob import TextBlob
from nltk.corpus import stopwords

stop = stopwords.words('english')
pos_dict = {}
neg_dict = {}

with open('/Users/administrator/json_train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")
print("Done Training")

rp = open('/Users/administrator/test_data.txt', 'r')
res_writer = open('/Users/administrator/results.txt', 'w')

for line in rp:
    line = line.rstrip('\n')  # strip the trailing newline
    sentvalue = cl.classify(line)
    blob = TextBlob(line)
    sentence = blob.sentences[0]
    for word, pos in sentence.tags:
        if (word not in stop) and (len(word) > 3 and sentvalue == 'pos'):
            if pos == 'NN' or pos == 'V':
                pos_dict[word.lower()] = word.lower()
        if (word not in stop) and (len(word) > 3 and sentvalue == 'neg'):
            if pos == 'NN' or pos == 'V':
                neg_dict[word.lower()] = word.lower()
    res_writer.write(line + " => sentiment " + sentvalue + "\n")

print("Lengths of positive and negative sentiments", len(pos_dict), len(neg_dict))

The output is as follows:

Lengths of positive and negative sentiments 203 128
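The format="json" reader expects exactly this shape: a JSON array of objects with text and label keys. Before training, a file can be sanity-checked with nothing but the standard library; the following is a minimal sketch, with the records abbreviated from the sample above:

```python
import json

sample = '''[
  {"text": "mission impossible three is awesome btw", "label": "pos"},
  {"text": "10 things i hate about you", "label": "neg"}
]'''

records = json.loads(sample)
# every record must carry both keys, and labels should form a known set
assert all("text" in r and "label" in r for r in records)
labels = {r["label"] for r in records}
print(len(records), labels)
```

Note that a trailing comma after the last object is invalid JSON and will make json.loads (and the classifier's reader) fail, so it is worth validating a hand-edited training file this way first.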
We can add more training data from the corpus and evaluate the accuracy of the classifier with the following code:
test = [
    ("mission impossible three is awesome btw", 'pos'),
    ("brokeback mountain was beautiful", 'pos'),
    ("that and the da vinci code is awesome so far", 'pos'),
    ("10 things i hate about you =", 'neg'),
    ("brokeback mountain is a spectacularly beautiful movie", 'pos'),
    ("mission impossible 3 is amazing", 'pos'),
    ("the actor who plays harry potter sucks", 'neg'),
    ("harry potter = gorgeous", 'pos'),
    ('The beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling very good today.", 'pos'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'pos'),
    ("i went to see brokeback mountain, which is beautiful(", 'pos'),
    ("and i love brokeback mountain too: ]", 'pos')
]

print("Accuracy: {0}".format(cl.accuracy(test)))

from nltk.corpus import movie_reviews

reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]
new_train, new_test = reviews[0:100], reviews[101:200]

cl.update(new_train)
accuracy = cl.accuracy(test + new_test)
print("Accuracy: {0}".format(accuracy))

# Show the four most informative features
cl.show_informative_features(4)
The output would be as follows:
Accuracy: 0.973913043478
Most Informative Features
      contains(awesome) = True    pos : neg = 51.9 : 1.0
         contains(with) = True    neg : pos = 49.1 : 1.0
          contains(for) = True    neg : pos = 48.6 : 1.0
           contains(on) = True    neg : pos = 45.2 : 1.0
First, the training set had 250 samples and the accuracy was 0.813; after another 100 samples from the movie reviews corpus were added, the accuracy went up to 0.974. We therefore tried different test samples and plotted sample size versus accuracy, as shown in the following graph: