Feature selection

If you have a dataset with many variables, a good way to check correlations among columns is to visualize the correlation matrix as a heatmap. We can then identify columns that are highly correlated with one another and remove the redundant ones, thereby simplifying our analysis. The visualization can be achieved using the seaborn library in Python.
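As a minimal sketch of such a heatmap, the snippet below assumes a pandas DataFrame named df that holds only numeric columns. The column names are the ones discussed in this section, but the values are synthetic stand-ins generated for illustration, and the output file name is arbitrary.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in data: the column names come from the text,
# the values are made up purely for illustration.
rng = np.random.default_rng(0)
n = 200
days_old = rng.integers(0, 365, size=n)
age_median = rng.normal(40, 8, size=n)
df = pd.DataFrame({
    "days_old": days_old,
    "soldBy": 365 - days_old + rng.normal(0, 20, size=n),
    "age_median": age_median,
    "income_median": 1000 * age_median + rng.normal(0, 5000, size=n),
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Draw the matrix as a heatmap; fixing vmin/vmax to [-1, 1] keeps the
# color scale symmetric around zero.
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap of the DataFrame")
fig.tight_layout()
fig.savefig("correlation_heatmap.png")  # arbitrary file name
```

Because the data here is synthetic, the coefficients will not match those of a real dataset; only the df.corr() and sns.heatmap calls carry over.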

The following will be the output:

Correlation heatmap of the DataFrame

We can observe the following in the preceding heatmap:

  • soldBy and days_old are highly negatively correlated
  • age_median and income_median are positively correlated

Similarly, we can derive the correlation between any other sets of variables. Based on the correlation results, we can then reduce the number of independent features by keeping only one feature from each highly correlated group.
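One common way to act on the correlation matrix is a simple threshold rule: for every pair of columns whose absolute correlation exceeds a cutoff, drop the later column. The sketch below illustrates that heuristic. The function name drop_highly_correlated, the 0.9 threshold, and the demo data are all assumptions made for this example, not something prescribed by the text.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Return (reduced_df, dropped_columns).

    A column is dropped when its absolute Pearson correlation with any
    earlier column exceeds `threshold` (hypothetical helper for illustration).
    """
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Made-up demo data: "b" is almost exactly the negative of "a",
# while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
demo = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.05, size=100),
    "c": rng.normal(size=100),
})

reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

Using the absolute value treats strong negative correlations, such as the one between soldBy and days_old, the same as strong positive ones. Which column of a pair to keep is a judgment call; this sketch simply keeps the one that appears first.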
