How it works...

As you can see, we first create the LinearRegressionWithSGD object and call its .train(...) method.

For a very good overview of different derivatives of stochastic gradient descent, check this out: http://ruder.io/optimizing-gradient-descent/.

The first and only required parameter we pass to the method is the RDD of labeled points that we created earlier. There are a number of other parameters, though, that you can specify:

  • iterations sets the number of iterations; the default is 100
  • step is the step size used in SGD; the default is 1.0
  • miniBatchFraction specifies the proportion of data to be used in each SGD iteration; the default is 1.0
  • The initialWeights parameter allows us to initialize the coefficients to some specific values; the default is None, in which case the algorithm starts with all weights equal to 0.0
  • The regularizer type parameter, regType, allows us to specify the type of the regularizer used: 'l1' for L1 regularization and 'l2' for L2 regularization; the default is None (no regularization)
  • The regParam parameter specifies the regularizer parameter; the default is 0.0
  • intercept specifies whether the model should also fit the intercept; the default is False
  • validateData specifies whether the data should be validated before training; the default is True
  • You can also specify convergenceTol, the convergence tolerance; the default is 0.001
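To build some intuition for what these parameters control, here is a minimal, plain-Python sketch of mini-batch gradient descent for linear regression. It is not Spark's implementation: the function name sgd_linear_regression, the snake_case argument names, the seed argument, the step that shrinks with the square root of the iteration number, the simple L2 penalty, and the exact form of the convergence check are all assumptions made for illustration.

```python
import math
import random


def sgd_linear_regression(data, iterations=100, step=1.0,
                          mini_batch_fraction=1.0, initial_weights=None,
                          reg_param=0.0, convergence_tol=0.001, seed=42):
    """Illustrative sketch only. data is a list of (label, features) pairs."""
    n_features = len(data[0][1])
    # No initial weights given: start with all coefficients equal to 0.0
    if initial_weights is None:
        weights = [0.0] * n_features
    else:
        weights = list(initial_weights)

    rng = random.Random(seed)
    # The fraction of the data used to estimate the gradient in each iteration
    batch_size = max(1, int(round(mini_batch_fraction * len(data))))

    for i in range(1, iterations + 1):
        batch = rng.sample(data, batch_size)

        # Average gradient of the squared error over the mini-batch
        grad = [0.0] * n_features
        for label, features in batch:
            error = sum(w * x for w, x in zip(weights, features)) - label
            for j, x in enumerate(features):
                grad[j] += error * x / batch_size

        # Assumption: the effective step shrinks as the iterations progress
        step_i = step / math.sqrt(i)
        # Assumption: a plain L2 penalty scaled by reg_param
        new_weights = [w - step_i * (g + reg_param * w)
                       for w, g in zip(weights, grad)]

        # Stop early once the weights barely move between iterations
        change = math.sqrt(sum((a - b) ** 2
                               for a, b in zip(new_weights, weights)))
        norm = math.sqrt(sum(a ** 2 for a in new_weights))
        weights = new_weights
        if change < convergence_tol * max(norm, 1.0):
            break

    return weights


# Hypothetical toy data following y = 2 * x exactly
toy_data = [(2.0 * x / 10, [x / 10]) for x in range(1, 11)]
print(sgd_linear_regression(toy_data))
```

On this noise-free toy data, the single learned weight should end up close to 2. A smaller step or fewer iterations leaves it further from that value, while a larger convergence_tol stops the loop earlier.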

Let's now see how well our model predicts working hours:

small_sample_hours = sc.parallelize(final_data_hours_test.take(10))

for t, p in zip(
    small_sample_hours
        .map(lambda row: row.label)
        .collect(),
    workhours_model_lm.predict(
        small_sample_hours
            .map(lambda row: row.features)
    ).collect()
):
    print(t, p)

First, from our full testing dataset, we select 10 observations (so we can print them on the screen). Next, we extract the true value from the testing dataset, whereas for the prediction we simply call the .predict(...) method of the workhours_model_lm model and pass the .features vector. Here is what we get:

As you can see, our model does not do very well, so further refining would be necessary. This, however, goes beyond the scope of this chapter and the book itself.
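If you want to put a number on how far off the predictions are, a small helper like the following can summarize the (true, predicted) pairs printed in the loop. This is only a sketch: the function name regression_errors and the sample values are hypothetical and are not taken from the census dataset.

```python
import math


def regression_errors(pairs):
    """Return (MAE, RMSE) for an iterable of (true, predicted) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    # Mean absolute error: the average size of the miss, in hours
    mae = sum(abs(t - p) for t, p in pairs) / n
    # Root mean squared error: penalizes large misses more heavily
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)
    return mae, rmse


# Hypothetical (true, predicted) working hours, for illustration only
sample_pairs = [(40.0, 38.0), (20.0, 24.0)]
mae, rmse = regression_errors(sample_pairs)
print(mae, rmse)
```

With the recipe's objects, the same function could be fed the zipped labels and predictions collected earlier. Both metrics are expressed in the units of the label, so here they read as hours.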
