How it works...

Before we create the RDDs, we have to import the pyspark.mllib.regression submodule, as that is where we can access the LabeledPoint class:

import pyspark.mllib.regression as reg

Next, we simply loop through all the elements of the final_data RDD and create a labeled point for each element using the .map(...) transformation.

The first parameter of LabeledPoint(...) is the label. If you look at the two code snippets, the only difference between them is what we consider labels and features.

As a reminder, a classification problem aims to find the probability of an observation belonging to a specific class; thus, the label is normally categorical or, in other words, discrete. A regression problem, on the other hand, aims to predict a value given an observation; thus, the label is normally numerical, or continuous.

So, in the final_data_income case, we use the binary indicator of whether the census respondent earns more (a label of 1) or less (a label of 0) than $50,000, whereas in final_data_hours we use the hours-per-week feature (see the Loading the data recipe), which, in our case, is the fifth piece of each element of the final_data RDD. Note that this label has to be scaled back to its original units: we multiply it by the standard deviation and add the mean.
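The scale-back step described above simply inverts a standardization. Here is a minimal plain-Python sketch of that arithmetic, assuming the feature was standardized as z = (x - mean) / std. The sample values and the names hours, mu, and sigma are made up for illustration; in the recipe, the statistics would presumably come from the sModel object instead.

```python
import statistics

# Hypothetical hours-per-week values (illustrative only, not from the dataset).
hours = [40.0, 50.0, 35.0, 60.0, 45.0]

mu = statistics.mean(hours)      # mean of the original values
sigma = statistics.stdev(hours)  # sample standard deviation

# Forward step: standardize to zero mean and unit standard deviation.
scaled = [(h - mu) / sigma for h in hours]

# Scale back: multiply by the standard deviation and add the mean.
restored = [z * sigma + mu for z in scaled]

print(restored)
```

Up to floating-point rounding, restored should match hours, which is why the regression label ends up expressed in actual hours rather than in standardized units.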

We assume here that you are working through the 5. Machine Learning with MLlib.ipynb notebook and have the sModel object already created. If you do not, please go back to the previous recipe and follow the steps outlined there.

The second parameter of LabeledPoint(...) is a vector of all the features. You can pass a NumPy array, a list, a scipy.sparse column matrix, a pyspark.mllib.linalg.SparseVector, or a pyspark.mllib.linalg.DenseVector; in our case, the features are held in a DenseVector, as we have already encoded all of them using the hashing trick.
