The notebook also illustrates how to run cross-validation manually to obtain custom tree attributes, such as the total number of nodes or the number of leaf nodes associated with certain hyperparameter settings. The following function accesses the internal .tree_ attribute to retrieve the total node count and determine how many of these nodes are leaf nodes:
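def get_leaves_count(tree):
    """Return the number of leaf nodes in a fitted decision tree."""
    t = tree.tree_
    n = t.node_count  # total number of nodes (internal nodes plus leaves)
    # A leaf has no children; scikit-learn marks a missing child with -1
    leaves = len([i for i in range(n) if t.children_left[i] == -1])
    return leaves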
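As a quick sanity check, the manual count can be compared with the estimator's built-in get_n_leaves() method. The following is a minimal sketch, assuming a synthetic dataset from make_classification as a stand-in for the notebook data; the names X_demo, y_demo, and demo_tree are illustrative only, and the function is rewritten in vectorized form:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def get_leaves_count(tree):
    """Count the leaf nodes of a fitted tree via its internal tree_ arrays."""
    t = tree.tree_
    # A node is a leaf when it has no left child (marked by -1)
    return int((t.children_left == -1).sum())


# Synthetic stand-in data (an assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

n_leaves = get_leaves_count(demo_tree)
n_nodes = demo_tree.tree_.node_count
print(f'nodes: {n_nodes}, leaves: {n_leaves}')
```

Because every split creates exactly two children, the total node count should equal twice the number of leaves minus one.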
We can combine this information with the train and test scores to gain detailed knowledge about the model behavior throughout the cross-validation process, as follows:
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# X, y_binary, and the splitter cv are assumed to be defined earlier in the notebook
train_scores, val_scores, leaves = {}, {}, {}
for max_depth in range(1, 26):
    print(max_depth, end=' ', flush=True)
    # 'sqrt' replaces 'auto', which meant the same for classifiers
    # and was removed in scikit-learn 1.3
    clf = DecisionTreeClassifier(criterion='gini',
                                 max_depth=max_depth,
                                 min_samples_leaf=500,
                                 max_features='sqrt',
                                 random_state=42)
    train_scores[max_depth], val_scores[max_depth], leaves[max_depth] = [], [], []
    for train_idx, test_idx in cv.split(X):
        X_train, y_train = X.iloc[train_idx], y_binary.iloc[train_idx]
        X_test, y_test = X.iloc[test_idx], y_binary.iloc[test_idx]
        clf.fit(X=X_train, y=y_train)

        # In-sample AUC from the predicted probability of the positive class
        train_pred = clf.predict_proba(X=X_train)[:, 1]
        train_score = roc_auc_score(y_score=train_pred, y_true=y_train)
        train_scores[max_depth].append(train_score)

        # Out-of-sample AUC on the held-out fold
        test_pred = clf.predict_proba(X=X_test)[:, 1]
        val_score = roc_auc_score(y_score=test_pred, y_true=y_test)
        val_scores[max_depth].append(val_score)

        # Tree size for this fold and depth setting
        leaves[max_depth].append(get_leaves_count(clf))
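One way to summarize dictionaries of this shape is to convert them into a DataFrame with one column per max_depth value and one row per fold, and then aggregate across folds. The sketch below uses made-up scores purely for illustration; val_scores_demo, val_df, and summary are hypothetical names, not part of the notebook:

```python
import pandas as pd

# Made-up per-fold AUC scores keyed by max_depth (illustrative values only)
val_scores_demo = {
    1: [0.50, 0.52, 0.51],
    2: [0.53, 0.55, 0.54],
    3: [0.52, 0.53, 0.51],
}

# One row per fold, one column per max_depth setting
val_df = pd.DataFrame(val_scores_demo)

# Mean and standard deviation across folds for each depth
summary = pd.DataFrame({'mean': val_df.mean(), 'std': val_df.std()})
summary.index.name = 'max_depth'

# The depth with the highest average validation score
best_depth = summary['mean'].idxmax()
print(summary)
print('best max_depth:', best_depth)
```

The same pattern would apply to the training scores and the leaf counts, which makes it straightforward to compare in-sample and out-of-sample behavior per depth.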
The left panel of the following chart shows the result. It plots the in- and out-of-sample performance across the range of max_depth settings, with a confidence interval around the AUC scores. The right-hand axis shows the number of leaf nodes on a log scale, and the vertical black line marks the best-performing setting at 13 consecutive splits.
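A chart with this layout could be assembled with matplotlib's twin axes. The following is only a sketch under stated assumptions: the curves are synthetic and constructed to peak at a depth of 13, the band width is hypothetical, and the code is not the notebook's actual plotting routine:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

depths = np.arange(1, 26)

# Synthetic curves for illustration only (not the notebook's results)
train_mean = 0.5 + 0.2 * (1 - np.exp(-depths / 8))
val_mean = 0.5 + 0.05 * np.exp(-((depths - 13) / 6) ** 2)
band = 0.01  # hypothetical half-width of the band around the validation curve
n_leaves = np.minimum(2 ** depths, 400)  # synthetic leaf counts, capped

# Depth with the highest validation score
best_depth = int(depths[np.argmax(val_mean)])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(depths, train_mean, label='Train AUC')
ax.plot(depths, val_mean, label='Validation AUC')
ax.fill_between(depths, val_mean - band, val_mean + band, alpha=0.2)
ax.axvline(best_depth, color='k', lw=1)  # vertical line at the best setting
ax.set_xlabel('max_depth')
ax.set_ylabel('ROC AUC')
ax.legend(loc='upper left')

# Second y-axis on the right for the leaf count, on a log scale
ax2 = ax.twinx()
ax2.plot(depths, n_leaves, color='gray', ls='--')
ax2.set_yscale('log')
ax2.set_ylabel('Leaf nodes (log scale)')
fig.tight_layout()
```

Placing the leaf count on a separate log-scaled axis keeps the scores readable even though the tree size can grow by orders of magnitude as max_depth increases.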