Implementing SVM

We have largely covered the fundamentals of the SVM classifier. Now, let's apply it right away to newsgroup topic classification, starting with a binary case that distinguishes two topics, comp.graphics and sci.space.

Let's take a look at the following steps:

  1. First, we load the training and testing subsets of the computer graphics and science space newsgroup data, respectively:
>>> from sklearn.datasets import fetch_20newsgroups
>>> categories = ['comp.graphics', 'sci.space']
>>> data_train = fetch_20newsgroups(subset='train',
...     categories=categories, random_state=42)
>>> data_test = fetch_20newsgroups(subset='test',
...     categories=categories, random_state=42)
Don't forget to specify a random state in order to reproduce experiments.
  2. Clean the text data using the clean_text function we developed in previous chapters and retrieve the label information:
>>> cleaned_train = clean_text(data_train.data)
>>> label_train = data_train.target
>>> cleaned_test = clean_text(data_test.data)
>>> label_test = data_test.target
>>> len(label_train), len(label_test)
(1177, 783)

There are 1,177 training samples and 783 testing ones.

  3. By way of good practice, check whether the two classes are imbalanced:
>>> from collections import Counter
>>> Counter(label_train)
Counter({1: 593, 0: 584})
>>> Counter(label_test)
Counter({1: 394, 0: 389})

They are quite balanced.

  4. Next, we extract the tf-idf features from the cleaned text data:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=None)
>>> term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
>>> term_docs_test = tfidf_vectorizer.transform(cleaned_test)
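Note that we call fit_transform on the training set but only transform on the testing set: the vocabulary is learned from the training documents alone, and the testing documents are mapped into that same feature space. A minimal, self-contained sketch with a made-up toy corpus (the documents and feature counts are illustrative, not from the newsgroup data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned newsgroup text (illustrative only)
train_docs = ['graphics rendering image',
              'space orbit satellite',
              'image pixel graphics']
test_docs = ['satellite image of space']

tfidf = TfidfVectorizer(stop_words='english', max_features=None)
# fit_transform learns the vocabulary from the training documents only
train_matrix = tfidf.fit_transform(train_docs)
# transform reuses that vocabulary, so both matrices share the same columns
test_matrix = tfidf.transform(test_docs)

print(train_matrix.shape[1] == test_matrix.shape[1])  # True: same feature space
```

A word that appears only in the testing set (such as a previously unseen term) is simply ignored by transform, which is exactly the behavior we want at prediction time.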
  5. We can now apply the SVM classifier to the data. We first initialize an SVC model with the kernel parameter set to linear (we will explain what kernel means in the next section) and the penalty hyperparameter C set to the default value, 1.0:
>>> from sklearn.svm import SVC
>>> svm = SVC(kernel='linear', C=1.0, random_state=42)
  6. We then fit our model on the training set as follows:
>>> svm.fit(term_docs_train, label_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto',
kernel='linear', max_iter=-1, probability=False, random_state=42,
shrinking=True, tol=0.001, verbose=False)
  7. Finally, we predict on the testing set with the trained model and obtain the prediction accuracy directly:
>>> accuracy = svm.score(term_docs_test, label_test)
>>> print('The accuracy of binary classification is: {0:.1f}%'.format(accuracy*100))

The accuracy of binary classification is: 96.4%
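For a classifier, score is just a convenience wrapper: it returns the mean accuracy of predict against the true labels. A minimal sketch on a made-up, linearly separable toy dataset (the data stands in for the tf-idf features and is purely illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data standing in for the tf-idf features
X_train = np.array([[0., 1.], [1., 2.], [3., 0.], [4., 1.]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0., 2.], [4., 0.]])
y_test = np.array([0, 1])

svm = SVC(kernel='linear', C=1.0, random_state=42)
svm.fit(X_train, y_train)

# score is equivalent to computing accuracy from predict by hand
manual_accuracy = np.mean(svm.predict(X_test) == y_test)
print(svm.score(X_test, y_test) == manual_accuracy)  # True
```

Using predict explicitly becomes useful when we want metrics other than plain accuracy, such as precision, recall, or a confusion matrix.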

Our first SVM model works just great, achieving an accuracy of 96.4%. How about more than two topics? How does SVM handle multiclass classification?
