Summary

In this chapter, we started with an introduction to a typical machine learning problem, online advertising click-through prediction, and its challenges, including categorical features. We then turned to tree-based algorithms, which can take in both numerical and categorical features. Next, we had an in-depth discussion of the decision tree algorithm: its mechanics, its different types, how to construct a tree, and two metrics, Gini impurity and entropy, that measure the effectiveness of a split at a tree node. After constructing a tree by hand in an example, we implemented the algorithm from scratch. We also learned how to use the decision tree package from scikit-learn and applied it to predict click-through. We then improved performance by adopting random forest, a feature-based bagging algorithm. The chapter ended with tips for tuning a random forest model.
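
As a quick refresher on the two split metrics named above, here is a minimal sketch in plain Python. The function names (gini_impurity, entropy, weighted_impurity) are chosen for illustration and are not necessarily those used in the chapter's from-scratch implementation. It assumes the standard definitions: Gini impurity is 1 minus the sum of squared class fractions, and entropy is the negative sum of each class fraction times its base-2 logarithm.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits of a collection of class labels: -sum(p_k * log2(p_k))."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted_impurity(groups, criterion=gini_impurity):
    """Quality of a split: impurity of each child, weighted by child size.

    `groups` is a list of label lists, one per child node.
    Lower values indicate a better split.
    """
    total = sum(len(group) for group in groups)
    if total == 0:
        return 0.0
    return sum(len(group) / total * criterion(group) for group in groups)


if __name__ == "__main__":
    # A split that separates the two classes perfectly has zero impurity.
    print(weighted_impurity([[1, 1], [0, 0]]))
    # A split that leaves both children evenly mixed is as impure as possible.
    print(weighted_impurity([[1, 0], [1, 0]], criterion=entropy))
```

A candidate split with a lower weighted impurity than its alternatives is the one a tree node would prefer under either metric.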

More practice is always good for honing skills. Another great project in the same area is the Display Advertising Challenge from CriteoLabs (https://www.kaggle.com/c/criteo-display-ad-challenge). The data and its description are available at https://www.kaggle.com/c/criteo-display-ad-challenge/data. What is the best AUC you can achieve on the second 100,000 samples with a decision tree or random forest model that you train and fine-tune on the first 100,000 samples?
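
One way to set up the evaluation for this exercise is sketched below. It is only a starting point under stated assumptions: the helper name auc_on_second_block is hypothetical, and the sketch assumes you have already loaded the data and encoded the categorical features into a numeric feature matrix X with a binary label array y, ordered as in the original file. It uses scikit-learn's RandomForestClassifier and roc_auc_score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def auc_on_second_block(X, y, block_size=100_000, **forest_params):
    """Train on the first `block_size` rows and report AUC on the next block.

    X: numeric feature matrix (categorical features already encoded).
    y: binary click labels (0 or 1), aligned with the rows of X.
    forest_params: keyword arguments passed to RandomForestClassifier,
        for example n_estimators, max_depth, or min_samples_split.
    """
    X_train, y_train = X[:block_size], y[:block_size]
    X_test = X[block_size:2 * block_size]
    y_test = y[block_size:2 * block_size]

    # A fixed seed keeps runs comparable while tuning; callers may override it.
    params = {"random_state": 42, **forest_params}
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # AUC is computed from the predicted probability of the positive class.
    pos_prob = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, pos_prob)
```

Calling the function with different hyperparameter values, for instance `auc_on_second_block(X, y, n_estimators=100, max_depth=10)`, gives a simple way to compare settings. Because the second block is then being used for model selection, the best score found this way is likely to be somewhat optimistic.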
