There's more...

Logistic regression is normally the benchmark used to assess the relative performance of other classification models, that is, whether they perform better or worse. The drawback of logistic regression, however, is that it cannot handle cases where the two classes cannot be separated by a line. SVMs do not suffer from this problem, as their kernels can be expressed in quite flexible ways:

income_model_svm = cl.SVMWithSGD.train(
    final_data_income
    , miniBatchFraction=1/2.0
)
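As an aside, the linear-separability limitation mentioned above can be illustrated with the classic XOR pattern. The following plain-Python sketch (independent of Spark, and not part of the recipe's dataset) brute-forces a grid of linear decision boundaries and confirms that none of them separates the two XOR classes:

```python
import itertools

# Four XOR points: the label is 1 when exactly one coordinate is 1.
points = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def separates(w1, w2, b):
    """Return True if sign(w1*x + w2*y + b) matches every label."""
    return all(
        (w1 * x + w2 * y + b > 0) == (label == 1)
        for (x, y), label in points
    )

# Brute-force a coarse grid of candidate linear boundaries.
grid = [i / 2.0 for i in range(-10, 11)]
found = any(
    separates(w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)
print(found)  # → False: no linear boundary separates XOR
```

This is exactly the kind of dataset where a purely linear classifier such as logistic regression fails, while a model with a more flexible decision function can succeed.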

In this example, just as with the LogisticRegressionWithSGD model, we can specify a host of parameters (we will not repeat them here). However, the miniBatchFraction parameter instructs the SVM model to use only half of the data in each iteration; this helps prevent overfitting.
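To illustrate what miniBatchFraction=1/2.0 means, here is a plain-Python sketch of drawing half of the records at each SGD iteration. Note that this is a simplification: Spark actually samples each record independently with probability equal to the fraction, whereas this sketch draws an exact-size subset.

```python
import random

random.seed(42)

data = list(range(100))        # stand-in for the training records
mini_batch_fraction = 0.5      # mirrors miniBatchFraction=1/2.0

def sample_mini_batch(records, fraction):
    """Draw a random subset of roughly `fraction` of the records,
    as an SGD iteration would before computing its gradient."""
    k = int(len(records) * fraction)
    return random.sample(records, k)

batch = sample_mini_batch(data, mini_batch_fraction)
print(len(batch))  # → 50
```

Because each iteration sees a different random half of the data, no single noisy subset of observations can dominate the fitted weights.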

The results for the 10 observations from the small_sample_income RDD are calculated the same way as with the logistic regression model:

for t, p in zip(
    small_sample_income
        .map(lambda row: row.label)
        .collect()
    , income_model_svm.predict(
        small_sample_income
            .map(lambda row: row.features)
    ).collect()
):
    print(t, p)

The model produces the same results as the logistic regression model, so we will not repeat them here. However, in the Computing performance statistics recipe, we will see how the two models differ.
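As a small preview of those performance computations: once the true labels and the predictions have been collected to the driver (as in the loop above), a basic accuracy figure can be computed with plain Python. The lists below are hypothetical stand-ins for the collected values, not results from the recipe's data:

```python
# Hypothetical collected values standing in for the
# small_sample_income labels and the model's predictions.
true_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
predictions = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# Fraction of observations where the prediction matches the label.
accuracy = sum(
    int(t == p) for t, p in zip(true_labels, predictions)
) / len(true_labels)

print(accuracy)  # → 0.8
```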
