Making sense of result parameters

Apart from the R2 statistic, there are other statistics and parameters that one needs to look at in order to do the following:

  1. Select some variables and discard others for the model.
  2. Assess the relationship between the predictor and output variable and check whether a predictor variable is significant in the model or not.
  3. Calculate the error in the values predicted by the selected model.

Let us now look at some of the statistics that help address the issues listed earlier.

p-values

One thing to realize here is that the calculated values of a and β are estimates, not exact values. Whether these estimates are significant or not needs to be tested using a hypothesis test.

The hypothesis tests whether the value of β is non-zero or not; in other words, whether there is a significant relationship between X and yact. If there is, β will be non-zero.

In the equation y = a + β*x, if we set β = 0, there is no relationship between y and x. Hence, the hypothesis test is defined as follows:

H0: β = 0 (null hypothesis: there is no relationship between X and yact)
Ha: β ≠ 0 (alternative hypothesis: there is a relationship between X and yact)

So, whenever a regression task is performed and β is estimated, there will be an accompanying t-statistic and a p-value corresponding to this hypothesis test, calculated automatically by the program. Our task is to choose a significance level and compare it with the p-value. The test is two-tailed, and if the p-value is less than the chosen significance level, the null hypothesis that β = 0 is rejected.

If the p-value for the t-statistic is less than the significance level, the null hypothesis is rejected and β is taken to be significant and non-zero. A p-value larger than the significance level indicates that β is not significant in explaining the relationship between the two variables. As we will see in the case of multiple regression (multiple input variables/predictors), this fact can be used to weed out unwanted columns from the model: the higher the p-value, the less significant that variable is to the model, and the least significant ones can be weeded out first.
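As a quick sketch (an illustration, not part of the original example), the t-statistics and p-values for the estimates can be inspected with statsmodels. The data frame df and the predictor column name 'x' are assumptions here; only the 'Actual_Output(yact)' column appears later in this section:

import statsmodels.api as sm

# df is assumed to hold the predictor 'x' and the output 'Actual_Output(yact)'
X = sm.add_constant(df['x'])                        # adds the intercept term a
model = sm.OLS(df['Actual_Output(yact)'], X).fit()  # ordinary least squares fit
print(model.params)    # estimates of a and β
print(model.tvalues)   # t-statistics for the estimates
print(model.pvalues)   # p-values for the hypothesis test that β = 0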

F-statistics

When one moves from a simple linear regression to a multiple regression, there will be multiple βs, each of them an estimate. In such a case, apart from testing the significance of the individual variables in the model by checking the p-values associated with their estimates, it is also required to check whether, as a group, all the predictors are significant or not. This can be done using the following hypothesis:

H0: β1 = β2 = ... = βp = 0
Ha: at least one βj is non-zero

The statistic that is used to test this hypothesis is called the F-statistic and is defined as follows:

F = ((SST - SSD)/p) / (SSD/(n - p - 1))

Where SST and SSD have been defined earlier as:

SST = Σ(yact - ymean)²   (total sum of squares)
SSD = Σ(yact - ymodel)²  (sum of squared residuals)

where n = number of rows in the dataset and p = number of predictor variables in the model.

The F-statistic follows the F-distribution. There will be a p-value associated with this F-statistic. If this p-value is small enough (smaller than the chosen significance level), the null hypothesis can be rejected.
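As a minimal sketch (an assumption, not taken from the original text), the F-statistic and its p-value can be computed directly from SST and SSD using scipy; yact and ymodel are assumed to be NumPy arrays holding the actual outputs and the model predictions:

import numpy as np
from scipy import stats

# yact and ymodel are assumed arrays of actual and predicted outputs
n = len(yact)    # number of rows in the dataset
p = 1            # number of predictor variables (1 for a simple regression)
SST = np.sum((yact - yact.mean())**2)   # total sum of squares
SSD = np.sum((yact - ymodel)**2)        # sum of squared residuals
F = ((SST - SSD) / p) / (SSD / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)   # right-tail p-value from the F-distribution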

The significance of the F-statistic is as follows:

  • p-values are about the individual relationship between one predictor and the outcome variable. When there is more than one predictor variable, one predictor's relationship with the output might change due to the presence of the other variables. The F-statistic provides us with a way to look at the partial change in the associated p-value due to the addition of a new variable.
  • When the number of predictors in the model is very large and all the βi are very close to 0, some of the individual p-values associated with the predictors might still come out very small, simply by chance. In such a case, if we rely solely on individual p-values, we might incorrectly conclude that there is a relationship between the predictors and the outcome when there actually is none. In such cases, we should look at the p-value associated with the F-statistic.

Residual Standard Error

Another concept to learn is the Residual Standard Error (RSE). For a simple linear regression model, it can be written as:

RSE = sqrt(SSD/(n - 2))

where n = number of data points. In general,

RSE = sqrt(SSD/(n - p - 1))

where p = number of predictor variables in the model.

The RSE is an estimate of the standard deviation of the error term (res). This is the error that remains even if the model coefficients are known exactly. It may remain because the model is missing something, for example another relevant variable. We have looked at only one-variable regression so far, but most practical scenarios involve multiple regression, where there is more than one input variable. In multiple regression, the RSE generally goes down as we add variables that are more significant predictors of the output variable.

The RSE for a model can be calculated using the following code snippet. Here, we are calculating the RSE for the data frame we have used for the model, df:

import numpy as np

# Squared residual for each data point
df['RSE'] = (df['Actual_Output(yact)'] - df['ymodel'])**2
RSEd = df.sum()['RSE']      # sum of squared residuals (SSD)
RSE = np.sqrt(RSEd / 98)    # divide by n - 2 = 98 before taking the square root
RSE

The value of the RSE comes out to be 0.46 in this case. As you might have guessed, the smaller the RSE, the better the model. Again, the benchmark to compare this error against is the mean of the actual values, yact. As we have seen earlier, this value is ymean = 2.53. So, we observe an error of 0.46 over 2.53, which amounts to around an 18% error.
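To put the two numbers together (a small illustrative addition, using the same data frame as above), the relative error can be computed as follows:

error_ratio = RSE / df['Actual_Output(yact)'].mean()   # roughly 0.46 / 2.53 ≈ 0.18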
