How to do it...

To find the best features for the problem at hand, we first need to understand what kind of problem we are dealing with, because regression and classification problems call for different feature selection methods:

Regression: In regression, your target (or ground truth) is a continuous variable (such as number of work hours per week). You have two methods to select your best features:
- Pearson's correlation: We covered this one in the previous recipe. As noted there, the correlation can only be calculated between two numerical (continuous) features.
- Analysis of variance (ANOVA): This is a tool to explain (or test) how observations are distributed conditional on some categories. It can therefore be used to select the categorical features that best discriminate between values of the continuous dependent variable.
Classification: In classification, your target (or label) is a discrete variable of two (binomial) or many (multinomial) levels. There are also two methods that help to select the best features:
- Linear discriminant analysis (LDA): This finds the linear combination of continuous features that best separates the classes of the categorical label.
- χ² test: This tests whether two categorical variables are independent of each other.
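To make the χ² test more concrete, here is a minimal plain-Python sketch of Pearson's χ² statistic for a contingency table. The function name chi_square_statistic and the counts are hypothetical and serve only as an illustration. The sketch shows the textbook calculation and is not meant to reproduce Spark's internal implementation:

```python
def chi_square_statistic(table):
    """Return Pearson's chi-square statistic and the degrees of freedom
    for a contingency table given as a list of equal-length rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are the two label values, columns are two categories
observed_counts = [
    [30, 10],
    [20, 40],
]
statistic, dof = chi_square_statistic(observed_counts)
print(statistic, dof)
```

The larger the statistic relative to the degrees of freedom, the stronger the evidence against independence. A table whose rows are exact multiples of each other yields a statistic of 0.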

For now, Spark only lets us test (or select) the best features between comparable variables: it implements just the correlations (the pyspark.mllib.stat.Statistics.corr(...) method we covered earlier) and the χ² test (the pyspark.mllib.stat.Statistics.chiSqTest(...) method or the pyspark.mllib.feature.ChiSqSelector(...) class).

In this recipe, we will use .chiSqTest(...) to test the independence between our label (that is, an indicator that someone is earning more than $50,000) and the occupation of the census responder. Here's a snippet that does this for us:

import pyspark.mllib.linalg as ln
import pyspark.mllib.stat as st

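# Contingency table: one row per label value, one column per occupation
census_occupation = (
    census
    .groupby('label')
    .pivot('occupation')
    .count()
    .na.fill(0)  # label/occupation pairs that never occur come back as null
)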

census_occupation_coll = (
    census_occupation
    .rdd
    .map(lambda row: (row[1:]))
    .flatMap(lambda row: row)
    .collect()
)

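# The collected list is row-major: each label row holds one count per occupation
num_rows = census_occupation.count()
num_cols = len(census_occupation.columns) - 1  # skip the label column
dense_mat = ln.DenseMatrix(
    num_rows
    , num_cols
    , census_occupation_coll
    , True)  # True: the values are given in row-major order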
chi_sq = st.Statistics.chiSqTest(dense_mat)

print(chi_sq.pValue)
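The final True passed to DenseMatrix(...) is the isTransposed flag. It tells Spark that the flat list of values is laid out row by row, which is how the counts were collected from the pivoted table. Without the flag, the values would be read column by column. The following plain-Python sketch illustrates the difference. The entry helper and the counts are hypothetical and only mimic the indexing rule:

```python
def entry(values, num_rows, num_cols, i, j, is_transposed=False):
    """Look up element (i, j) of a num_rows x num_cols matrix stored as a flat list."""
    if is_transposed:
        return values[i * num_cols + j]  # row-major layout
    return values[j * num_rows + i]      # column-major layout (the default)


# Hypothetical flattened counts for 2 labels x 3 occupations, collected row by row
flat = [11, 12, 13, 21, 22, 23]

row_major = [[entry(flat, 2, 3, i, j, True) for j in range(3)] for i in range(2)]
col_major = [[entry(flat, 2, 3, i, j, False) for j in range(3)] for i in range(2)]

print(row_major)
print(col_major)
```

Only the row-major reading keeps each label's counts together in one row, which is why the flag matters whenever the values are collected one DataFrame row at a time.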
