We have covered the fundamentals of the SVM classifier. Now let's apply it to news topic classification, starting with a binary case that distinguishes two topics,
comp.graphics and sci.space.
First, load the training and testing subsets of the computer graphics and space science newsgroups respectively:
>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = ['comp.graphics', 'sci.space']
>>> data_train = fetch_20newsgroups(subset='train',
...     categories=categories, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...     categories=categories, random_state=42)
Again, don't forget to specify a random state so that experiments are reproducible.
Clean the text data and retrieve label information:
>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> len(label_train), len(label_test)
(1177, 783)
As a good practice, check whether the classes are imbalanced:
>>> from collections import Counter
>>> Counter(label_train)
Counter({1: 593, 0: 584})
>>> Counter(label_test)
Counter({1: 394, 0: 389})
Next, extract tf-idf features using the TfidfVectorizer extractor that we learned about earlier:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vectorizer = TfidfVectorizer(sublinear_tf=True,
...     max_df=0.5, stop_words='english', max_features=8000)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)
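Note that the vectorizer is fitted only on the training documents; the testing documents are mapped onto the vocabulary learned from training. A minimal sketch on a toy corpus (not the news data) illustrates this fit/transform split:

```python
# Toy illustration: TfidfVectorizer learns a vocabulary in fit_transform
# and reuses that same vocabulary in transform.
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['the graphics card renders images',
              'the space shuttle orbits earth',
              'rendering graphics on the gpu']
test_docs = ['new gpu renders space images']

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english')
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses the same vocabulary

# Both matrices share one column per learned term
print(X_train.shape[1] == X_test.shape[1])
```

Terms unseen during fitting (such as `new` here) are simply ignored at transform time, which keeps the train and test feature spaces aligned.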
With the features ready, we can now apply our SVM algorithm. Initialize an SVC model with the kernel parameter set to linear (we will explain what this means shortly) and the penalty C set to its default value of 1:
>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)
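The penalty C trades off a wider margin against misclassified training samples, so it is usually worth tuning rather than always keeping the default. As a hedged sketch (on synthetic data, with illustrative candidate values), cross-validated grid search can pick C:

```python
# Sketch: tuning the penalty C with cross-validation on synthetic data.
# The candidate values and dataset here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=42)

grid = GridSearchCV(SVC(kernel='linear', random_state=42),
                    param_grid={'C': [0.1, 1.0, 10.0]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)  # the C value with the best cross-validated accuracy
```

A small C tolerates more margin violations (stronger regularization), while a large C fits the training data more aggressively.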
Then fit our model on the training set:
>>> svm.fit(term_docs_train, label_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto',
    kernel='linear', max_iter=-1, probability=False, random_state=42,
    shrinking=True, tol=0.001, verbose=False)
Then predict on the testing set with the trained model and obtain the prediction accuracy directly:
>>> accuracy = svm.score(term_docs_test, label_test)
>>> print('The accuracy on testing set is: {0:.1f}%'.format(accuracy*100))
The accuracy on testing set is: 96.4%
Our first SVM model works very well, achieving 96.4% accuracy. What about more than two topics? How does SVM handle multiclass classification?
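Before diving in, it is worth knowing that scikit-learn's SVC already accepts multiclass labels directly, training multiple binary SVMs internally. A minimal sketch on the three-class iris dataset (used here purely as a stand-in for multiple topics):

```python
# Sketch: SVC handles more than two classes out of the box by
# combining binary SVMs internally (iris has three classes).
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # labels 0, 1, 2
clf = SVC(kernel='linear', C=1.0, random_state=42)
clf.fit(X, y)                            # no special multiclass setup needed
print(len(clf.classes_))                 # 3
```

How those internal binary problems are constructed is exactly what we look at next.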