There's more...

We cannot use the .ChiSqSelector(...) method to select features when the target is continuous, that is, for regression problems. One approach to selecting the best features is to check the correlation between each feature and the target, and pick those that are most highly correlated with the target but exhibit little to no correlation with the other features:

import pyspark.ml.feature as feat
import pyspark.ml.stat as st

# collate all the columns of the forest DataFrame (label included)
# into a single vector column
features_and_label = feat.VectorAssembler(
    inputCols=forest.columns
    , outputCol='features'
)

# compute the Pearson correlation matrix of the assembled vector
corr = st.Correlation.corr(
    features_and_label.transform(forest),
    'features',
    'pearson'
)

# show the full correlation matrix (a DenseMatrix)
print(str(corr.collect()[0][0]))
There is no automatic way to do this in Spark, but starting with Spark 2.2, we can now calculate correlations between features in DataFrames.

The Correlation class is part of the pyspark.ml.stat module, so we import it first.

Next, we create a .VectorAssembler(...) Transformer that collates all the columns of the forest DataFrame into a single vector column. We then use the Transformer and pass the resulting DataFrame to the Correlation class. The .corr(...) method of the Correlation class accepts a DataFrame as its first parameter, the name of the column with all the features as the second, and the type of correlation to calculate as the third; the available values are pearson (the default) and spearman.
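For reference, switching to Spearman's rank correlation only requires changing the third argument; this is a minimal variation of the preceding snippet:

# same assembled DataFrame, Spearman's rank correlation instead of Pearson's
corr_spearman = st.Correlation.corr(
    features_and_label.transform(forest),
    'features',
    'spearman'
)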

Check out this website for more information about the two correlation methods: http://bit.ly/2xm49s7.

Calling corr.collect()[0][0] returns the correlation matrix as a DenseMatrix holding the pairwise correlation coefficients of all the assembled columns.

Now that we have the correlation matrix, we can extract the 10 features that are most highly correlated with our label:

import numpy as np

num_of_features = 10

# map each column index to its name, as the correlation
# matrix carries only indices, not the feature names
cols = dict([
    (i, e)
    for i, e
    in enumerate(forest.columns)
])

# extract the correlations of every feature with the first
# column (the Elevation label), dropping the label itself
corr_matrix = corr.collect()[0][0]
label_corr_with_idx = [
    (i[0], e)
    for i, e
    in np.ndenumerate(corr_matrix.toArray()[:,0])
][1:]

# sort by the absolute value of the correlation, descending
label_corr_with_idx_sorted = sorted(
    label_corr_with_idx
    , key=lambda el: -abs(el[1])
)

# translate the top indices back into column names
features_selected = np.array([
    cols[el[0]]
    for el
    in label_corr_with_idx_sorted
])[0:num_of_features]

First, we specify the number of features we want to extract and create a dictionary with all the columns from our forest DataFrame; note that we enumerate the columns so each name is paired with its index, as the correlation matrix does not propagate the feature names, only the indices.

Next, we extract the first column from the corr_matrix (as this is our target, the Elevation feature); the .toArray() method converts the DenseMatrix into a NumPy array representation. Note that we also attach the index to each element of this array so we know which feature each coefficient belongs to, and that the [1:] slice drops the first entry, which is the label's correlation with itself.

Next, we sort the list in descending order by looking at the absolute values of the correlation coefficient.

Finally, we loop through the sorted list, translate each index into the corresponding column name from the cols dictionary, and keep the top 10 (in this case) features.
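The preceding code only ranks features by their correlation with the target; the second part of the criterion from the opening paragraph, little to no correlation between the features themselves, can be checked against the off-diagonal entries of the same matrix. The following is a minimal sketch, not part of the original recipe, that assumes the corr_matrix and cols objects defined previously and a hypothetical cutoff of 0.7:

# flag pairs of features (label excluded) whose mutual correlation
# exceeds the assumed cutoff; such pairs are candidates for pruning
corr_array = corr_matrix.toArray()
threshold = 0.7  # assumed cutoff, tune for your data

highly_correlated_pairs = [
    (cols[i], cols[j], corr_array[i, j])
    for i in range(1, corr_array.shape[0])
    for j in range(i + 1, corr_array.shape[1])
    if abs(corr_array[i, j]) > threshold
]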

For our problem, which aims at estimating the forest elevation, the features_selected array holds the names of the 10 features most strongly correlated with the Elevation column.
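A natural next step, not covered in the recipe, is to assemble just the selected columns into a new feature vector for the regression model. Here is a minimal sketch under that assumption; the vec_selected and selected_features names are ours, not part of the recipe:

# a hypothetical assembler that keeps only the selected features;
# the Elevation column remains available in forest as the label
vec_selected = feat.VectorAssembler(
    inputCols=[str(c) for c in features_selected]
    , outputCol='selected_features'
)

forest_selected = vec_selected.transform(forest)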
