There's more...

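To calculate a correlation matrix, you need to assemble it somewhat manually, because the DataFrame's .corr(...) method returns the correlation coefficient for a single pair of columns only. Here's our solution: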

n_features = len(features)

corr = []

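for i in range(0, n_features):
    # pad the cells below the diagonal with None
    temp = [None] * i

    for j in range(i, n_features):
        # correlation between features i and j
        temp.append(no_outliers.corr(features[i], features[j]))
    # prefix each row with the name of the feature it describes
    corr.append([features[i]] + temp)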

correlations = spark.createDataFrame(corr, ['Column'] + features)

The preceding code loops through our list of features and computes the pairwise correlations between them, filling the upper-triangular portion of the matrix. The cells below the diagonal are left as None.

We introduced the features list in the Handling outliers recipe earlier.
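The upper-triangular layout can be tried out without a Spark session. The following is only a minimal sketch under stated assumptions: the pearson helper and the toy data dictionary are hypothetical stand-ins (they are not part of the recipe) that take the place of no_outliers.corr(...) and the real dataset, while the loop mirrors the one in the recipe.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain-Python Pearson correlation of two equal-length sequences.

    Hypothetical stand-in for DataFrame.corr(...); not part of the recipe.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy columns invented for illustration only
data = {
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],  # rises with 'a'
    'c': [4.0, 3.0, 2.0, 1.0],  # falls as 'a' rises
}
features = list(data)
n_features = len(features)

corr = []
for i in range(n_features):
    temp = [None] * i  # pad the cells below the diagonal
    for j in range(i, n_features):
        temp.append(pearson(data[features[i]], data[features[j]]))
    corr.append([features[i]] + temp)

for row in corr:
    print(row)
```

Each row starts with the feature name, followed by i None placeholders and then the coefficients from the diagonal onward, which is the shape that spark.createDataFrame(...) receives in the recipe.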

In the recipe's loop, each calculated coefficient is appended to the temp list, which, in turn, gets added to the corr list. Finally, we create the correlations DataFrame. Here's what it looks like:

As you can see, the only strong correlation is between Displacement and Cylinders and this, of course, comes as no surprise. FuelEconomy is not strongly correlated with Displacement, as other factors, such as drag and the weight of the car, also affect fuel economy. However, if you were trying to predict, for example, maximum speed, and assuming (and it is a fair assumption to make) that both Displacement and Cylinders are highly positively correlated with it, then you should use only one of them, since the two features carry largely the same information.
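The idea of keeping only one feature from a strongly correlated pair can be sketched in plain Python. Everything here is an assumption made for illustration: the strong_pairs helper is hypothetical, the 0.8 threshold is an arbitrary choice, and the numbers in corr are invented (they are not the recipe's output). The rows use the same layout as the recipe: a leading name followed by the upper-triangular cells.

```python
# Hypothetical upper-triangular rows; the values are invented for
# illustration and are NOT the correlations computed in the recipe.
features = ['f1', 'f2', 'f3']
corr = [
    ['f1', 1.0, 0.93, -0.12],
    ['f2', None, 1.0, -0.08],
    ['f3', None, None, 1.0],
]


def strong_pairs(rows, names, threshold=0.8):
    """Return (name_i, name_j, r) for off-diagonal cells with |r| >= threshold."""
    pairs = []
    for i, row in enumerate(rows):
        cells = row[1:]  # drop the leading feature name
        for j in range(i + 1, len(names)):  # skip the diagonal itself
            r = cells[j]
            if r is not None and abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    return pairs


redundant = strong_pairs(corr, features)

# Keep the first feature of each strongly correlated pair, drop the second
to_drop = {second for _, second, _ in redundant}
selected = [f for f in features if f not in to_drop]

print(redundant)
print(selected)
```

Which member of a pair to keep is a modeling decision; dropping the second one is only the simplest rule to demonstrate.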
