Feature selection is one of the toughest parts of financial model building. It can be done statistically or by applying domain knowledge. Here we discuss only a few of the statistical feature selection methods used in finance.
Data may contain highly correlated features, and a model generally does better when highly correlated features are not included together. Base R's cor() function computes the correlation matrix between the features, and the caret R package provides findCorrelation() to identify highly correlated features from that matrix. The correlation matrix is computed in the following example.
A few lines of data used for correlation analysis and multiple regression analysis are displayed here by executing the following code:
> DataMR = read.csv("C:/Users/prashant.vats/Desktop/Projects/BOOK R/DataForMultipleRegression.csv")
> head(DataMR)
  StockYPrice StockX1Price StockX2Price StockX3Price StockX4Price
1       80.13        72.86         93.1         63.7         83.1
2       79.57        72.88         90.2         63.5           82
3       79.93        71.72           99         64.5         82.8
4       81.69        71.54         90.9         66.7         86.5
5       80.82           71         90.7         60.7         80.8
6       81.07        71.78         93.1         62.9         84.2
The preceding output shows five variables in DataMR: StockYPrice, StockX1Price, StockX2Price, StockX3Price, and StockX4Price. Here StockYPrice is the dependent variable and the other four are independent variables. Studying the dependence structure among these variables is an important step before going deeper into the analysis.
The following command calculates the correlation matrix between the first four columns, which are StockYPrice, StockX1Price, StockX2Price, and StockX3Price:
> correlationMatrix <- cor(DataMR[,1:4])
Figure 3.11: Correlation matrix table
The preceding correlation matrix shows which variables are highly correlated. Features are then selected so that highly correlated features do not enter the model together.
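The selection step described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: it uses just the six rows printed by head(DataMR) rather than the full CSV, it looks at the four independent variables only (columns 2 to 5), and the cutoff of 0.75 is an arbitrary value chosen for the example, not one taken from the text. The names predictorCor, cutoff, highPairs, and flagged are made up for this sketch.

```r
# Six rows copied from the head(DataMR) output; the full CSV is not reproduced here.
DataMR <- data.frame(
  StockYPrice  = c(80.13, 79.57, 79.93, 81.69, 80.82, 81.07),
  StockX1Price = c(72.86, 72.88, 71.72, 71.54, 71.00, 71.78),
  StockX2Price = c(93.1, 90.2, 99.0, 90.9, 90.7, 93.1),
  StockX3Price = c(63.7, 63.5, 64.5, 66.7, 60.7, 62.9),
  StockX4Price = c(83.1, 82.0, 82.8, 86.5, 80.8, 84.2)
)

# Correlation matrix of the independent variables only (columns 2 to 5)
predictorCor <- cor(DataMR[, 2:5])

# Assumed cutoff for "highly correlated"; choose a value suited to the data
cutoff <- 0.75

# Look only at the upper triangle so each pair is reported once
highPairs <- which(abs(predictorCor) > cutoff & upper.tri(predictorCor),
                   arr.ind = TRUE)

# Table of flagged pairs with their correlation
flagged <- data.frame(
  feature1    = rownames(predictorCor)[highPairs[, "row"]],
  feature2    = colnames(predictorCor)[highPairs[, "col"]],
  correlation = predictorCor[highPairs]
)
print(flagged)

# Optional: if caret is installed, findCorrelation() suggests columns to drop
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::findCorrelation(predictorCor, cutoff = cutoff, names = TRUE))
}
```

For each flagged pair, one of the two features would typically be dropped before fitting the model. With only six observations the correlations are very noisy, so in practice the same steps would be run on the full dataset.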