We have covered the fundamentals of the SVM classifier. Now let's apply it to news topic classification, starting with a binary case that distinguishes two topics,
comp.graphics and sci.space.
First, load the training and testing subsets of the computer graphics and space science newsgroups respectively:
>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = ['comp.graphics', 'sci.space']
>>> data_train = fetch_20newsgroups(subset='train',
...     categories=categories, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...     categories=categories, random_state=42)
Again, don't forget to specify a random state so that experiments are reproducible.
Clean the text data and retrieve label information:
>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> len(label_train), len(label_test)
(1177, 783)
As a good practice, check whether the classes are imbalanced:
>>> from collections import Counter
>>> Counter(label_train)
Counter({1: 593, 0: 584})
>>> Counter(label_test)
Counter({1: 394, 0: 389})
Next, extract tf-idf features using the TfidfVectorizer extractor that we learned about earlier:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True,
...     max_df=0.5, stop_words='english', max_features=8000)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)
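Note that the vectorizer is fitted only on the training documents; the testing documents are mapped onto the vocabulary learned from training. A minimal sketch on a toy corpus (not the news data) illustrates this fit/transform split:

```python
# Toy illustration: TfidfVectorizer learns a vocabulary in fit_transform
# and reuses that same vocabulary in transform.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['the graphics card renders images',
              'the space shuttle orbits earth',
              'rendering graphics on the gpu']
test_docs = ['new gpu renders space images']

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses the same vocabulary

# Both matrices share one column per learned term
print(X_train.shape[1] == X_test.shape[1])
```

Terms unseen during fitting (such as `new` here) are simply ignored at transform time, which keeps the train and test feature spaces aligned.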
With the features ready, we can now apply our SVM algorithm. Initialize an SVC model with the kernel parameter set to linear (we will explain what this means shortly) and the penalty C set to its default value of 1:
>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)
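The penalty C trades off a wider margin against misclassified training samples, so it is usually worth tuning rather than always keeping the default. As a hedged sketch (on synthetic data, with illustrative candidate values), cross-validated grid search can pick C:

```python
# Sketch: tuning the penalty C with cross-validation on synthetic data.
# The candidate values and dataset here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

grid = GridSearchCV(SVC(kernel='linear', random_state=42),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the C value with the best cross-validated accuracy
```

A small C tolerates more margin violations (stronger regularization), while a large C fits the training data more aggressively.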
Then fit our model on the training set:
>>> svm.fit(term_docs_train, label_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto',
    kernel='linear', max_iter=-1, probability=False, random_state=42,
    shrinking=True, tol=0.001, verbose=False)
Then predict on the testing set with the trained model and obtain the prediction accuracy directly:
>>> accuracy = svm.score(term_docs_test, label_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 96.4%
Our first SVM model works very well, achieving 96.4% accuracy. What about more than two topics? How does SVM handle multiclass classification?
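Before diving in, it is worth knowing that scikit-learn's SVC already accepts multiclass labels directly, training multiple binary SVMs internally. A minimal sketch on the three-class iris dataset (used here purely as a stand-in for multiple topics):

```python
# Sketch: SVC handles more than two classes out of the box by
# combining binary SVMs internally (iris has three classes).
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # labels 0, 1, 2
clf = SVC(kernel='linear', C=1.0, random_state=42)
clf.fit(X, y)                            # no special multiclass setup needed
print(len(clf.classes_))                 # 3
```

How those internal binary problems are constructed is exactly what we look at next.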