How it works...

As with the LinearRegressionWithSGD model, the only required parameter is the RDD with labeled points. Also, you can specify the same set of parameters:

  • numIterations — the number of SGD iterations; the default is 100
  • step — the step size used in SGD; the default is 1.0
  • miniBatchFraction — the fraction of the data to be used in each SGD iteration; the default is 1.0
  • initialWeights — allows us to initialize the coefficients to specific values; there is no default, and when omitted the algorithm starts with all weights equal to 0.0
  • regType — allows us to specify the type of the regularizer used: 'l1' for L1 regularization or 'l2' for L2 regularization; the default is None, that is, no regularization
  • regParam — the regularization parameter; the default is 0.0
  • intercept — whether the model should also fit an intercept; the default is false
  • validateData — whether to validate the data before training; the default is true
  • convergenceTol — the convergence tolerance for SGD; the default is 0.001
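To make the roles of these parameters concrete, here is a plain-Python sketch of the SGD loop that such a model runs (MLlib executes the equivalent logic in a distributed fashion); the function name and toy data are illustrative, not part of the MLlib API:

    import math
    import random

    def train_logistic_sgd(data, iterations=100, step=1.0,
                           mini_batch_fraction=1.0, reg_param=0.0,
                           initial_weights=None, seed=42):
        """Sketch of logistic regression trained with mini-batch SGD.
        `data` is a list of (label, features) pairs with label in {0, 1}."""
        rng = random.Random(seed)
        n_features = len(data[0][1])
        # initialWeights: start from the supplied coefficients, else all zeros
        w = list(initial_weights) if initial_weights else [0.0] * n_features
        # miniBatchFraction: how much of the data each iteration sees
        batch_size = max(1, int(len(data) * mini_batch_fraction))
        for _ in range(iterations):
            batch = rng.sample(data, batch_size)
            grad = [0.0] * n_features
            for label, x in batch:
                margin = sum(wi * xi for wi, xi in zip(w, x))
                p = 1.0 / (1.0 + math.exp(-margin))   # sigmoid
                for j in range(n_features):
                    grad[j] += (p - label) * x[j]
            # step: scales the update; reg_param: L2 penalty (regType='l2')
            for j in range(n_features):
                w[j] -= step * (grad[j] / batch_size + reg_param * w[j])
        return w

    # toy, linearly separable data: class 1 whenever the first feature > 0
    data = [(1, [1.0, 0.5]), (1, [2.0, -0.3]),
            (0, [-1.5, 0.2]), (0, [-2.0, -1.0])]
    weights = train_logistic_sgd(data, iterations=200, step=0.5)

After training on this toy set, the first coefficient comes out positive, which is what separates the two classes here.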

The LogisticRegressionModel(...) object returned once training finishes allows us to utilize the model. By passing a vector of features to the .predict(...) method, we can predict the class the observation will most likely be associated with.

Any classification model produces a set of probabilities, and logistic regression is no exception. In the binary case, we can specify a threshold that, once breached, indicates that the observation should be assigned class 1 rather than 0. LogisticRegressionModel(...) assumes 0.5 by default, but you can change it by calling the .setThreshold(...) method and passing a desired threshold value between 0 and 1 (not inclusive).
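The thresholding behavior can be sketched in plain Python (the weights and features below are hypothetical, not taken from the trained model):

    import math

    def predict(weights, features, threshold=0.5):
        # mimic the decision rule: sigmoid of the linear margin,
        # then compare the probability against the threshold
        margin = sum(w * x for w, x in zip(weights, features))
        prob = 1.0 / (1.0 + math.exp(-margin))
        return 1 if prob > threshold else 0

    weights = [1.2, -0.4]   # hypothetical trained coefficients
    features = [0.8, 0.5]   # margin = 0.76, sigmoid ≈ 0.68

    print(predict(weights, features))                  # 1 with the default 0.5
    print(predict(weights, features, threshold=0.75))  # 0 once we raise the bar

Raising the threshold trades recall for precision on the positive class: the same probability of roughly 0.68 clears 0.5 but not 0.75.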

Let's see how our model performs:

small_sample_income = sc.parallelize(final_data_income_test.take(10))

for t, p in zip(
    small_sample_income
        .map(lambda row: row.label)
        .collect(),
    income_model_lr.predict(
        small_sample_income
            .map(lambda row: row.features)
    ).collect()
):
    print(t, p)

As with the linear regression example, we first extract 10 records from our test dataset so they fit on the screen. Next, we extract the true labels and call the .predict(...) method of the income_model_lr model on the features to obtain the predicted classes. Here's what we get back:

So, out of 10 records, we got 9 right. Not bad.

In the Computing performance statistics recipe, we will learn how to use the full testing dataset to more formally evaluate our models.