We create a document-term matrix with 934 tokens as follows:
vectorizer = CountVectorizer(min_df=.001, max_df=.8, stop_words='english')
train_dtm = vectorizer.fit_transform(train.text)
<1566668x934 sparse matrix of type '<class 'numpy.int64'>'
with 6332930 stored elements in Compressed Sparse Row format>
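To see how min_df and max_df prune the vocabulary, the following minimal sketch applies the same kind of settings to a tiny corpus. The four documents and the thresholds of .5 and .7 are hypothetical and chosen only so that the effect is visible on so little data; they are not the values used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the tweet text
docs = [
    "the movie was great",
    "great plot",
    "terrible movie",
    "great soundtrack",
]

# No frequency pruning: every token that is not an English stop word is kept
full = CountVectorizer(stop_words='english').fit(docs)

# Float thresholds are proportions of documents:
# min_df=.5 keeps tokens that occur in at least 2 of the 4 documents,
# max_df=.7 drops tokens that occur in more than 2.8 documents, i.e. 3 or more
pruned = CountVectorizer(min_df=.5, max_df=.7, stop_words='english').fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(full_vocab)
print(pruned_vocab)
```

Here 'great' is removed for being too frequent, and 'plot', 'terrible', and 'soundtrack' for being too rare, which is the same mechanism that reduces the tweet vocabulary to 934 tokens.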
We then transform the test set with the fitted vectorizer, train the MultinomialNB classifier as before, and predict the test set:
test_dtm = vectorizer.transform(test.text)
nb = MultinomialNB()
nb.fit(train_dtm, train.polarity)
predicted_polarity = nb.predict(test_dtm)
The result is an accuracy of roughly 77.7%:
accuracy_score(test.polarity, predicted_polarity)
0.7768361581920904
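The complete workflow can be sketched end to end on a toy example. The texts and labels below are hypothetical stand-ins for the train and test sets, and the vectorizer omits min_df and max_df because the corpus is too small for frequency pruning. The point to note is that the test set is only transformed with the vocabulary learned from the training set, never fitted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical labeled data: 1 = positive, 0 = negative polarity
train_text = [
    "love this great phone",
    "great happy day",
    "hate this awful phone",
    "awful sad day",
]
train_polarity = [1, 1, 0, 0]
test_text = ["great love", "awful hate"]
test_polarity = [1, 0]

vectorizer = CountVectorizer(stop_words='english')

# Learn the vocabulary on the training set only
train_dtm = vectorizer.fit_transform(train_text)

# Reuse that vocabulary for the test set so the columns line up
test_dtm = vectorizer.transform(test_text)

nb = MultinomialNB()
nb.fit(train_dtm, train_polarity)
predicted_polarity = nb.predict(test_dtm)

accuracy = accuracy_score(test_polarity, predicted_polarity)
print(accuracy)
```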