The following table lists key parameters available for this purpose in the sklearn decision tree implementation. After introducing the most important parameters, we will illustrate how to use cross-validation to tune the hyperparameter settings, balancing the bias-variance tradeoff to lower prediction errors:
| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| max_depth | None | int | Maximum number of levels: split nodes until reaching max_depth or until all leaves are pure or contain fewer than min_samples_split samples. |
| max_features | None | None: all features; int: number of features; float: fraction of features; auto, sqrt: sqrt(n_features); log2: log2(n_features) | Number of features to consider for a split. |
| max_leaf_nodes | None | None: unlimited number of leaf nodes; int | Split nodes until creating this many leaves. |
| min_impurity_decrease | 0 | float | Split a node only if impurity decreases by at least this value. |
| min_samples_leaf | 1 | int; float (as a fraction of N) | Minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. May smooth the model, especially for regression. |
| min_samples_split | 2 | int; float (fraction of N) | Minimum number of samples required to split an internal node. |
| min_weight_fraction_leaf | 0 | float | Minimum weighted fraction of the sum total of all sample weights required at a leaf node. Samples have equal weight unless sample_weight is provided in the fit method. |
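As a minimal sketch of how these parameters are passed when instantiating a tree, consider the following snippet; the specific settings and the synthetic dataset are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration; substitute your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Arbitrary example settings; the defaults are listed in the table above
clf = DecisionTreeClassifier(max_depth=5,           # hard cap on consecutive splits
                             max_features='sqrt',   # consider sqrt(n_features) per split
                             min_samples_leaf=10,   # require 10 samples in each leaf
                             min_samples_split=20,  # require 20 samples to split a node
                             random_state=42)
clf.fit(X, y)
```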
The max_depth parameter imposes a hard limit on the number of consecutive splits and represents the most straightforward way to cap the growth of a tree.
The min_samples_split and min_samples_leaf parameters are alternative, data-driven ways to limit the growth of a tree. Rather than imposing a hard limit on the number of consecutive splits, these parameters control the minimum number of samples required to further split the data. min_samples_leaf guarantees a certain number of samples per leaf, whereas min_samples_split can create very small leaves if a split produces a very uneven distribution. Small parameter values facilitate overfitting, while high values may prevent the tree from learning the signal in the data. The default values are often quite low, and you should use cross-validation to explore a range of potential values. You can also use a float to indicate a fraction of the number of samples instead of an absolute number.
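A minimal cross-validation sketch along these lines could use GridSearchCV; the parameter grid below is an assumption chosen only for illustration, and the ranges should reflect the size of your dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hypothetical search grid; a float expresses the value as a fraction of N
param_grid = {'max_depth': [3, 5, 10, None],
              'min_samples_leaf': [1, 5, 25],
              'min_samples_split': [2, 10, .01]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=param_grid,
                    cv=5,               # five-fold cross-validation
                    scoring='roc_auc')  # binary classification metric
grid.fit(X, y)
print(grid.best_params_)
print(f'Best CV AUC: {grid.best_score_:.3f}')
```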
The sklearn documentation contains additional detail on how to use the various parameters for different use cases; see the references on GitHub.