There's more...

Another important metric you might want to check is the correlation between numerical variables. Calculating correlations with MLlib is very easy:

correlations = st.Statistics.corr(rdd_num)

The .corr(...) action returns a NumPy array of arrays, or, in other words, a matrix where each element is a Pearson (by default) or Spearman correlation coefficient.
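To make the shape of such a matrix concrete, here is a minimal sketch that uses plain NumPy rather than MLlib. The data array is made up for illustration, and the rank trick used for Spearman assumes there are no ties within a column.

```python
import numpy as np

# Hypothetical data: 6 observations (rows) of 3 numerical variables (columns).
data = np.array([
    [1.0, 2.1, 9.0],
    [2.0, 3.9, 7.5],
    [3.0, 6.2, 8.1],
    [4.0, 8.1, 6.0],
    [5.0, 9.8, 6.4],
    [6.0, 12.3, 5.2],
])

# Pearson: np.corrcoef treats rows as variables by default,
# so rowvar=False tells it that the variables are in the columns.
pearson = np.corrcoef(data, rowvar=False)

# Spearman is the Pearson coefficient computed on ranks.
# argsort applied twice yields 0-based ranks (valid here: no ties).
ranks = data.argsort(axis=0).argsort(axis=0)
spearman = np.corrcoef(ranks, rowvar=False)

print(pearson.round(3))
print(spearman.round(3))
```

Both results are square, symmetric matrices with ones on the diagonal, one row and one column per variable.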

To print the correlation matrix, we loop through all of its elements:

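# Boolean mask: True where the absolute correlation exceeds 0.05
for i, el_i in enumerate(abs(correlations) > 0.05):
    print(cols_num[i])

    for j, el_j in enumerate(el_i):
        # Skip the diagonal (a variable's correlation with itself)
        if el_j and j != i:
            print(
                '    '
                , cols_num[j]
                , correlations[i][j]
            )

    print()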

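Under each variable, the loop prints every other variable whose correlation with it exceeds 0.05 in absolute value. The diagonal is skipped, and because the matrix is symmetric, each pair appears twice. Using enumerate allows us to print out the column names, since the correlations NumPy matrix does not list them. Here's what we get: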

As you can see, there is not much correlation between our numerical variables. This is actually a good thing, as we can use all of them in our model since we will not suffer from much multicollinearity.

If you do not know what multicollinearity is, check out this lecture: https://onlinecourses.science.psu.edu/stat501/node/343.
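Going back to the printing step: to list each pair of variables exactly once, the comparison can be restricted to the upper triangle of the matrix (j > i). The following is a minimal sketch of that idea. The column names and the small matrix are hypothetical stand-ins for cols_num and for the result of .corr(...).

```python
import numpy as np

# Hypothetical stand-ins for cols_num and the matrix returned by .corr(...).
cols_num = ['a', 'b', 'c']
correlations = np.array([
    [1.00, 0.30, -0.02],
    [0.30, 1.00, 0.10],
    [-0.02, 0.10, 1.00],
])
threshold = 0.05

# k=1 skips the diagonal, so only index pairs with j > i are returned.
rows, cols = np.triu_indices(len(cols_num), k=1)

# Keep only the pairs whose absolute correlation exceeds the threshold.
pairs = [
    (cols_num[i], cols_num[j], correlations[i, j])
    for i, j in zip(rows, cols)
    if abs(correlations[i, j]) > threshold
]

for name_i, name_j, value in pairs:
    print(name_i, name_j, value)
```

With this made-up matrix, only the (a, b) and (b, c) pairs pass the threshold, and each is reported once.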