Interpreting the p-value

A p-value is a decimal between 0 and 1 that represents the probability of seeing data at least as extreme as ours if the null hypothesis of the test were true. Simply put, the lower the p-value, the stronger the evidence for rejecting the null hypothesis. For our purposes, the smaller a feature's p-value, the better the chance that the feature has some relevance to our response variable and that we should keep it.
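To make this concrete, here is a minimal sketch (not an example from this book) that runs a one-way ANOVA on two made-up groups of feature values using scipy.stats.f_oneway; the data, variable names, and the 0.05 cutoff are all assumptions for illustration only.

```python
from scipy.stats import f_oneway

# Hypothetical feature values, split by class label
group_a = [2.1, 2.5, 2.3, 2.7, 2.4]   # feature values where y == 0
group_b = [3.9, 4.2, 4.0, 4.4, 4.1]   # feature values where y == 1

# One-way ANOVA: the null hypothesis is that both groups share the same mean
f_stat, p_value = f_oneway(group_a, group_b)

# A small p-value means the group means differ more than chance alone
# would explain, suggesting the feature is related to the class label
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
if p_value < 0.05:  # 0.05 is a common, though arbitrary, cutoff
    print("Reject the null hypothesis -- keep this feature")
```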

For a more in-depth treatment of statistical testing, check out Principles of Data Science, https://www.packtpub.com/big-data-and-business-intelligence/principles-data-science, from Packt Publishing.

The big takeaway from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. SelectKBest will then rank the features by that p-value (the lower, the better) and keep only the k best features, where k is a number we supply. Let's try this out in Python.
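As a rough sketch of what this looks like in scikit-learn (the iris dataset, the choice of k=2, and the variable names below are placeholders, not the book's actual example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Run an ANOVA F-test on each feature independently (univariate testing)
# and keep only the k features with the lowest p-values
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("p-values per feature:", selector.pvalues_)
print("features kept (mask):", selector.get_support())
print("original shape:", X.shape, "-> reduced shape:", X_selected.shape)
```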
