After dealing with a variety of problems in our dataset, from identifying missing values hidden as zeros and imputing those missing values to normalizing features measured on different scales, it's time to put all of our scores together into a single table and see which combination of feature engineering steps performed best:
| Pipeline description | # rows model learned from | Cross-validated accuracy |
| --- | --- | --- |
| Drop rows with missing values | 392 | 0.7449 |
| Impute values with 0 | 768 | 0.7304 |
| Impute values with mean of column | 768 | 0.7318 |
| Impute values with median of column | 768 | 0.7357 |
| Z-score normalization with median imputing | 768 | 0.7422 |
| Min-max normalization with mean imputing | 768 | 0.7461 |
| Row-normalization with mean imputing | 768 | 0.6823 |
It seems that we were finally able to beat the drop-rows baseline by applying mean imputation and min-max normalization to our dataset, while still making use of all 768 available rows. Great!
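The winning combination, mean imputation followed by min-max normalization, can be sketched in a few lines of NumPy. This is a minimal illustration on a tiny made-up matrix (the actual dataset, model, and cross-validation loop are not reproduced here); in practice you would typically chain `SimpleImputer` and `MinMaxScaler` inside a scikit-learn `Pipeline` so the statistics are learned only from the training folds:

```python
import numpy as np

# Toy feature matrix with missing values marked as NaN
# (in our dataset, they were originally hidden as zeros).
X = np.array([
    [1.0, 200.0],
    [np.nan, 400.0],
    [3.0, np.nan],
    [5.0, 600.0],
])

# Mean imputation: replace each NaN with its column's mean,
# computed while ignoring the missing entries.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# Min-max normalization: rescale every column to the [0, 1] range,
# so features on very different scales become comparable.
col_min = X_imputed.min(axis=0)
col_max = X_imputed.max(axis=0)
X_scaled = (X_imputed - col_min) / (col_max - col_min)

print(X_scaled)
```

Because imputation fills the gaps before scaling, all 768 rows survive into the model, which is exactly why this pipeline could outperform the 392-row drop-rows approach.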