Predicting ad click-through with a decision tree

After several examples, it is now time to predict ad click-through with the decision tree algorithm we have just thoroughly learned about and practiced. We will use the dataset from a Kaggle machine learning competition, Click-Through Rate Prediction (https://www.kaggle.com/c/avazu-ctr-prediction). The dataset can be downloaded from https://www.kaggle.com/c/avazu-ctr-prediction/data.

Only the train.gz file contains labeled samples, so it is the only file we need to download and unzip (this will take a while). In this chapter, we focus on just the first 300,000 samples of the unzipped training file, referred to as train.csv in the code that follows.

The fields in the raw file are id, click (the target variable), hour, C1, banner_pos, site_id, site_domain, site_category, app_id, app_domain, app_category, device_id, device_ip, device_model, device_type, device_conn_type, and the anonymized columns C14 through C21.

We take a glance at the head of the file by running the following command:

head train.csv | sed 's/,,/, ,/g;s/,,/, ,/g' | column -s, -t

Compared with a plain head train.csv, the output is much cleaner because all the columns are aligned:

Don't be scared by the anonymized and hashed values. They are categorical features, and each of their possible values corresponds to a real and meaningful value, but is presented this way for privacy reasons. Perhaps C1 represents user gender, with 1005 and 1002 standing for male and female, respectively.

Now, let's get started with reading the dataset using pandas. That's right, pandas is extremely good at handling data in a tabular format:

>>> import pandas as pd
>>> n_rows = 300000
>>> df = pd.read_csv("train.csv", nrows=n_rows)

The first 300,000 lines of the file are loaded and stored in a dataframe. Take a quick look at the first five rows of the dataframe:

>>> print(df.head(5))
id click hour C1 banner_pos site_id ... C16 C17 C18 C19 C20 C21
0 1.000009e+18 0 14102100 1005 0 1fbe01fe ... 50 1722 0 35 -1 79
1 1.000017e+19 0 14102100 1005 0 1fbe01fe ... 50 1722 0 35 100084 79
2 1.000037e+19 0 14102100 1005 0 1fbe01fe ... 50 1722 0 35 100084 79
3 1.000064e+19 0 14102100 1005 0 1fbe01fe ... 50 1722 0 35 100084 79
4 1.000068e+19 0 14102100 1005 1 fe8cc448 ... 50 2161 0 35 -1 157

The target variable is the click column:

>>> Y = df['click'].values

For the remaining columns, there are several columns that should be removed from the features (id, hour, device_id, and device_ip) as they do not contain much useful information:

>>> X = df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'],  
axis=1).values
>>> print(X.shape)
(300000, 19)

Each sample has 19 predictive attributes.
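
As a quick check (assuming the standard Avazu column layout described earlier), you can list which feature columns remain after the drop; this step is optional:

>>> print(df.drop(['click', 'id', 'hour', 'device_id', 'device_ip'],
axis=1).columns.tolist())
['C1', 'banner_pos', 'site_id', 'site_domain', 'site_category',
'app_id', 'app_domain', 'app_category', 'device_model', 'device_type',
'device_conn_type', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21']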

Next, we need to split the data into training and testing sets. Normally, we do so by randomly picking samples. However, in our case, samples are in chronological order as indicated in the hour field. Obviously, we cannot use future samples to predict the past ones. Hence, we take the first 90% as training samples and the rest as testing samples:

>>> n_train = int(n_rows * 0.9)
>>> X_train = X[:n_train]
>>> Y_train = Y[:n_train]
>>> X_test = X[n_train:]
>>> Y_test = Y[n_train:]
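
As a quick sanity check of the chronological assumption (using the hour column of the original dataframe, which we excluded from the features), you can verify that the latest training hour does not exceed the earliest testing hour:

>>> print(df['hour'].iloc[:n_train].max(), df['hour'].iloc[n_train:].min())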

As mentioned, decision tree models can take in categorical features. However, because the tree-based algorithms in scikit-learn (the current version is 0.20.0 as of the end of 2018) only allow numerical input, we need to transform categorical features into numerical ones. But note that in general we do not need to do so; for example, the decision tree classifier we developed from scratch earlier can directly take in categorical features.

We now transform string-based categorical features into one-hot encoded vectors using the OneHotEncoder module from scikit-learn. One-hot encoding was briefly mentioned in Chapter 1, Getting Started with Machine Learning and Python. To recap, it converts a categorical feature with k possible values into k binary features. For example, a site category feature with three possible values, news, education, and sports, will be encoded into three binary features (is_news, is_education, and is_sports), each taking the value 1 or 0.
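
Before applying it to our dataset, here is a toy sketch of what one-hot encoding does, using a made-up, single-column site category sample (the values are purely illustrative):

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder
>>> toy_enc = OneHotEncoder(handle_unknown='ignore')
>>> toy_X = np.array([['news'], ['education'], ['sports'], ['news']])
>>> print(toy_enc.fit_transform(toy_X).toarray())
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

The three columns correspond to education, news, and sports, as scikit-learn sorts the category values alphabetically.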

Initialize a OneHotEncoder object as follows:

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder(handle_unknown='ignore')

Fit it on the training set as follows:

>>> X_train_enc = enc.fit_transform(X_train)
>>> X_train_enc[0]
<1x8385 sparse matrix of type '<class 'numpy.float64'>'
with 19 stored elements in Compressed Sparse Row format>
>>> print(X_train_enc[0])
(0, 2) 1.0
(0, 6) 1.0
(0, 30) 1.0
(0, 1471) 1.0
(0, 2743) 1.0
(0, 3878) 1.0
(0, 4000) 1.0
(0, 4048) 1.0
(0, 6663) 1.0
(0, 7491) 1.0
(0, 7494) 1.0
(0, 7861) 1.0
(0, 8004) 1.0
(0, 8008) 1.0
(0, 8085) 1.0
(0, 8158) 1.0
(0, 8163) 1.0
(0, 8202) 1.0
(0, 8383) 1.0

Each converted sample is a sparse vector.
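
It is also worth checking the dimensions of the encoded training set; with thousands of binary columns, a dense float64 representation would require on the order of 270,000 * 8,385 * 8 bytes, roughly 18 GB, which is why the sparse format matters:

>>> print(X_train_enc.shape)
(270000, 8385)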

Transform the testing set using the trained one-hot encoder as follows:

>>> X_test_enc = enc.transform(X_test)

Remember, we specified the handle_unknown='ignore' parameter in the one-hot encoder earlier. This prevents errors caused by any unseen categorical values. Using the previous site category example, if a sample comes with the unseen value movie, all three converted binary features (is_news, is_education, and is_sports) become 0. If we had not specified ignore, an error would be raised.
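
Reusing the toy encoder fitted in the earlier sketch, an unseen value such as movie simply maps to an all-zero vector instead of raising an error:

>>> print(toy_enc.transform(np.array([['movie']])).toarray())
[[0. 0. 0.]]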

Next, we train a decision tree model using grid search, which we learned about in Chapter 5, Classifying Newsgroup Topics with a Support Vector Machine. For demonstration purposes, we only tweak the max_depth hyperparameter, but tuning others, such as min_samples_split and class_weight, is also highly recommended. The classification metric should be the ROC AUC, as this is an imbalanced binary case (only 51,211 of the 300,000 samples are clicks, that is, a positive click-through rate of about 17%):

>>> from sklearn.tree import DecisionTreeClassifier
>>> parameters = {'max_depth': [3, 10, None]}

We pick three options for the maximum depth: 3, 10, and unbounded (None). Then we initialize a decision tree model with Gini Impurity as the splitting metric and 30 as the minimum number of samples required to split a node further:

>>> decision_tree = DecisionTreeClassifier(criterion='gini',
min_samples_split=30)
>>> from sklearn.model_selection import GridSearchCV

As for grid search, we use three-fold (as there are enough training samples) cross-validation and select the best performing hyperparameter measured by AUC:

>>> grid_search = GridSearchCV(decision_tree, parameters,
n_jobs=-1, cv=3, scoring='roc_auc')

Note that n_jobs=-1 means all available CPU cores are used:

>>> grid_search.fit(X_train_enc, Y_train)
>>> print(grid_search.best_params_)
{'max_depth': 10}
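
You can also inspect the best cross-validated AUC found by the grid search; the exact value depends on the data slice and the scikit-learn version:

>>> print(grid_search.best_score_)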

Use the model with the optimal parameter to predict future test cases, as follows:

>>> decision_tree_best = grid_search.best_estimator_
>>> pos_prob = decision_tree_best.predict_proba(X_test_enc)[:, 1]

>>> from sklearn.metrics import roc_auc_score
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
        roc_auc_score(Y_test, pos_prob)))
The ROC AUC on testing set is: 0.719

The AUC we can achieve with the optimal decision tree model is 0.72. It does not seem very high, but click-through involves many intricate human factors, which is why predicting it is not an easy problem. Although we could further optimize the hyperparameters, an AUC of 0.72 is actually pretty good. For comparison, randomly labeling 17% of the samples as clicks generates an AUC of only 0.496:

>>> import numpy as np
>>> pos_prob = np.zeros(len(Y_test))
>>> click_index = np.random.choice(len(Y_test),
        int(len(Y_test) * 51211.0/300000), replace=False)
>>> pos_prob[click_index] = 1
>>> roc_auc_score(Y_test, pos_prob)
0.496

Looking back, we can see that a decision tree is a sequence of greedy searches for the best splitting point at each step, based on the training dataset. However, this tends to cause overfitting, as the chosen splitting points are likely to work well only for the training samples. Fortunately, ensembling is a technique that corrects for this, and random forest is an ensemble tree model that usually outperforms a single decision tree.
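
As a preview, here is a minimal sketch of how a random forest could be plugged into the same pipeline; the hyperparameter values are illustrative rather than tuned:

>>> from sklearn.ensemble import RandomForestClassifier
>>> random_forest = RandomForestClassifier(n_estimators=100,
criterion='gini', min_samples_split=30, n_jobs=-1)
>>> random_forest.fit(X_train_enc, Y_train)
>>> pos_prob = random_forest.predict_proba(X_test_enc)[:, 1]
>>> print('The ROC AUC on testing set is: {0:.3f}'.format(
        roc_auc_score(Y_test, pos_prob)))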
