Best practices for statistics

Statistics are an integral part of any predictive modelling assignment because they help us gauge how well a model performs. Each predictive model generates a set of statistics that indicate how good the model is and how it can be fine-tuned to perform better. The following is a summary of the most widely reported statistics and their desired values for the predictive models described in this book:

| Algorithm | Statistics/Parameters | Desired values of the statistics |
|---|---|---|
| Linear regression | R², p-values, F-statistic, and Adj. R² | High Adj. R², high F-statistic, and low p-values |
| Logistic regression | Sensitivity, specificity, Area Under the Curve (AUC), and KS statistic | High AUC (proximity to 1) |
| Clustering | Intra-cluster distance and silhouette coefficient | Low intra-cluster distance and high silhouette coefficient (proximity to 1) |
| Decision trees (classification) | AUC and KS statistic | High AUC (proximity to 1) |
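
To make the statistics in the summary above concrete, the following is a minimal sketch of how four of them can be computed from first principles, using only the Python standard library. The function names, the toy labels, scores, and points are all hypothetical and chosen for illustration; in practice a statistics or machine-learning library would normally report these values.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2: R^2 penalised for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """KS statistic: largest gap between the cumulative score
    distributions of the positive and the negative class."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    gaps = []
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        gaps.append(abs(cdf_pos - cdf_neg))
    return max(gaps)


def silhouette(points, clusters):
    """Mean silhouette coefficient for 1-D points with cluster ids.
    Assumes at least two clusters."""
    values = []
    for i, (x, c) in enumerate(zip(points, clusters)):
        # a: mean distance to the other members of the point's own cluster
        own = [abs(x - y) for j, (y, d) in enumerate(zip(points, clusters))
               if d == c and j != i]
        if not own:  # a singleton cluster is scored as 0 by convention
            values.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(abs(x - y) for y, d in zip(points, clusters) if d == other)
            / clusters.count(other)
            for other in set(clusters) if other != c
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return sum(values) / len(values)


# Hypothetical toy data, for illustration only
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5]
points = [1.0, 2.0, 10.0, 11.0]
clusters = [0, 0, 1, 1]

print("Adj. R2   :", round(adjusted_r2(0.80, n_obs=50, n_predictors=4), 4))
print("AUC       :", auc(labels, scores))
print("KS        :", ks_statistic(labels, scores))
print("Silhouette:", round(silhouette(points, clusters), 4))
```

On this toy data the AUC and the silhouette coefficient both come out close to 1, which is the desired direction for both; the tight, well-separated clusters are what produce the high silhouette value.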

When reporting the results of a predictive model, state the values of these statistics and their meaning in the business context explicitly. A brief, lucid explanation of the relevance and significance of each statistic is appreciated. Report the best values attained for these statistics, and fine-tune the model based on them until they can no longer be improved.

Apart from these statistics, there are various statistical tests that can be performed on the dataset to test certain hypotheses about the data before fitting any predictive model to it. These tests include the Z-test, t-test, chi-square test, ANOVA, and so on. If such tests have been performed, their results (value and significance) and their implications should be clearly stated.
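
As an illustration of reporting both the value and the significance of such a test, here is a minimal sketch of a two-sided, one-sample Z-test written with the Python standard library. The function name `one_sample_z_test`, the sample values, the hypothesised mean `mu0`, and the assumed known population standard deviation `sigma` are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma):
    """Two-sided Z-test of H0: population mean == mu0.
    Assumes the population standard deviation sigma is known."""
    z = (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Probability of a value at least this extreme under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements, for illustration only
sample = [52, 48, 55, 51, 53, 49, 54, 50, 56, 52]
z, p = one_sample_z_test(sample, mu0=50, sigma=3)

print(f"z = {z:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("Reject H0 at the 5% level: the mean differs from 50.")
else:
    print("Fail to reject H0 at the 5% level.")
```

Stating the test statistic, the p-value, the significance level, and the resulting conclusion together, as the printout does, covers both the value and the implication of the test.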
