In this chapter, we started with an introduction to a typical machine learning problem, online advertising click-through prediction, and its challenges, including the handling of categorical features. We then turned to tree-based algorithms, which can take in both numerical and categorical features. Next, we had an in-depth discussion of the decision tree algorithm: its mechanics, its different types, how to construct a tree, and two metrics, Gini impurity and entropy, for measuring the effectiveness of a split at a tree node. After constructing a tree by hand in an example, we implemented the algorithm from scratch. We also learned how to use the decision tree package from scikit-learn and applied it to predict click-through. We then improved performance by adopting random forest, a feature-based bagging algorithm. The chapter ended with tips for tuning a random forest model.
More practice is always good for honing skills. Another great project in the same area is the Display Advertising Challenge from CriteoLabs (https://www.kaggle.com/c/criteo-display-ad-challenge). The data and its description are available at https://www.kaggle.com/c/criteo-display-ad-challenge/data. What is the best AUC you can achieve on the second 100,000 samples with a decision tree or random forest model that you train and fine-tune on the first 100,000 samples?
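As a starting point for this exercise, the evaluation protocol (train on the first block of rows, score AUC on the next block) might be sketched as follows. This is a minimal baseline under stated assumptions, not a tuned solution: the helper name `evaluate_split`, the fill value `-1`, the `"missing"` placeholder, and the forest hyperparameters are all illustrative choices. The commented-out loading lines further assume the training file is named `train.txt`, is tab-separated with no header row, and has the label in the first column; check the data page before relying on that.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder


def evaluate_split(df, label_col, n_train=100_000, n_test=100_000, random_state=42):
    """Train on the first n_train rows; return ROC AUC on the next n_test rows."""
    train = df.iloc[:n_train]
    test = df.iloc[n_train:n_train + n_test]

    feature_cols = [c for c in df.columns if c != label_col]
    num_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(df[c])]
    cat_cols = [c for c in feature_cols if c not in num_cols]

    # Categories that appear only in the test block are encoded as -1.
    encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

    def build_features(part, fit):
        # Illustrative choices: missing numbers become -1, missing categories "missing".
        num = part[num_cols].fillna(-1).to_numpy(dtype=float)
        cat_raw = part[cat_cols].fillna("missing").astype(str).to_numpy()
        cat = encoder.fit_transform(cat_raw) if fit else encoder.transform(cat_raw)
        return np.hstack([num, cat])

    X_train = build_features(train, fit=True)
    X_test = build_features(test, fit=False)

    # Hypothetical starting hyperparameters; tuning them is the point of the exercise.
    model = RandomForestClassifier(
        n_estimators=100, min_samples_split=30, random_state=random_state
    )
    model.fit(X_train, train[label_col])

    # AUC is computed from the predicted probability of the positive class.
    pos_prob = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(test[label_col], pos_prob)


# Assumed file name and layout (verify against the data description):
# df = pd.read_csv("train.txt", sep="\t", header=None, nrows=200_000)
# print(evaluate_split(df, label_col=0))
```

From this baseline, the tuning tips from this chapter can be applied by varying the forest's hyperparameters and comparing the resulting AUC on the held-out second block.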