Before we create the RDDs, we have to import the pyspark.mllib.regression submodule, as that is where we can access the LabeledPoint class:
import pyspark.mllib.regression as reg
Next, we use the .map(...) transformation to convert each element of the final_data RDD into a LabeledPoint.
The first parameter of LabeledPoint(...) is the label. If you look at the two code snippets, the only difference between them is what we treat as the label and what we treat as the features.
So, in the final_data_income case, the label is the binary indicator of whether the census respondent earns more than $50,000 (a value of 1) or not (a value of 0), whereas in the final_data_hours case, we use the hours-per-week feature (see the Loading the data recipe), which, in our case, is the fifth element of each record in the final_data RDD. Note that this label was standardized earlier, so we need to scale it back by multiplying by the standard deviation and adding the mean.
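As a plain-Python sketch of the two mappings (Spark is omitted for brevity; the record layout, the `mu`/`sigma` names, and the use of tuples in place of LabeledPoint objects are all illustrative assumptions, not the book's exact code):

```python
# Simplified stand-in for the final_data RDD: each element is a tuple of
# encoded features, where we assume the income flag sits last and the
# standardized hours-per-week value is the fifth field (index 4).
final_data = [
    (0.5, -0.1, 1.2, 0.3, -0.8, 1),   # respondent earns > $50,000
    (0.1,  0.7, -0.4, 0.2,  1.1, 0),  # respondent earns <= $50,000
]

mu, sigma = 40.0, 12.0  # hypothetical mean/std used when standardizing hours

# Income model: the binary indicator becomes the label,
# everything else becomes the feature vector.
income_points = [(row[-1], list(row[:-1])) for row in final_data]

# Hours model: the fifth field becomes the label, scaled back to real
# hours (multiply by the standard deviation and add the mean);
# the remaining fields become the features.
hours_points = [
    (row[4] * sigma + mu, list(row[:4]) + list(row[5:]))
    for row in final_data
]
```

In the actual recipe, each `(label, features)` pair would instead be a `reg.LabeledPoint(label, features)` built inside an RDD `.map(...)` call.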
The second parameter of LabeledPoint(...) is a vector of all the features. You can pass a NumPy array, a list, a scipy.sparse column matrix, a pyspark.mllib.linalg.SparseVector, or a pyspark.mllib.linalg.DenseVector; in our case, we use DenseVector, as we have already encoded all our features using the hashing trick.
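To illustrate the dense-versus-sparse distinction with plain-Python stand-ins (the real classes live in pyspark.mllib.linalg; `to_dense` is a hypothetical helper, not part of that API):

```python
# A dense vector stores every entry explicitly, including the zeros.
dense = [0.0, 3.0, 0.0, 0.0, 7.0]

# A sparse vector stores only the size plus the non-zero (index, value)
# pairs -- much cheaper when most entries are zero.
sparse = (5, [1, 4], [3.0, 7.0])

def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out
```

Because the hashing trick already produced fully populated feature vectors in this recipe, there is little to gain from the sparse representation here, which is why DenseVector is the natural choice.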