The following table lists key parameters available for this purpose in the sklearn decision tree implementation. After introducing the most important parameters, we will illustrate how to use cross-validation to tune the hyperparameter settings, balancing the bias-variance tradeoff to lower prediction errors:
| Parameter | Default | Options | Description |
| --- | --- | --- | --- |
| max_depth | None | int | Maximum number of levels: split nodes until reaching max_depth or until all leaves are pure or contain fewer than min_samples_split samples. |
| max_features | None | None: all features; int: number of features; float: fraction of features; auto, sqrt: sqrt(n_features); log2: log2(n_features) | Number of features to consider for a split. |
| max_leaf_nodes | None | None: unlimited number of leaf nodes; int | Split nodes until creating this many leaves. |
| min_impurity_decrease | 0 | float | Split a node only if impurity decreases by at least this value. |
| min_samples_leaf | 1 | int; float (as a fraction of N) | Minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. May smooth the model, especially for regression. |
| min_samples_split | 2 | int; float (fraction of N) | Minimum number of samples required to split an internal node. |
| min_weight_fraction_leaf | 0 | float | Minimum weighted fraction of the sum total of all sample weights required at a leaf node. Samples have equal weight unless sample_weight is provided in the fit method. |
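As a minimal sketch of how these parameters are passed when instantiating a tree, consider the following snippet; the specific settings and the synthetic dataset are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration; substitute your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Arbitrary example settings; the defaults are listed in the table above
clf = DecisionTreeClassifier(max_depth=5,           # hard cap on consecutive splits
                             max_features='sqrt',   # consider sqrt(n_features) per split
                             min_samples_leaf=10,   # require 10 samples in each leaf
                             min_samples_split=20,  # require 20 samples to split a node
                             random_state=42)
clf.fit(X, y)
```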
The max_depth parameter imposes a hard limit on the number of consecutive splits and represents the most straightforward way to cap the growth of a tree.
The min_samples_split and min_samples_leaf parameters are alternative, data-driven ways to limit the growth of a tree. Rather than imposing a hard limit on the number of consecutive splits, these parameters control the minimum number of samples required to further split the data. min_samples_leaf guarantees a certain number of samples per leaf, whereas min_samples_split can create very small leaves if a split produces a very uneven distribution. Small parameter values facilitate overfitting, while high values may prevent the tree from learning the signal in the data. The default values are often quite low, and you should use cross-validation to explore a range of potential values. You can also use a float to indicate a fraction of the number of samples instead of an absolute number.
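A minimal cross-validation sketch along these lines could use GridSearchCV; the parameter grid below is an assumption chosen only for illustration, and the ranges should reflect the size of your dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hypothetical search grid; a float expresses the value as a fraction of N
param_grid = {'max_depth': [3, 5, 10, None],
              'min_samples_leaf': [1, 5, 25],
              'min_samples_split': [2, 10, .01]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid=param_grid,
                    cv=5,               # five-fold cross-validation
                    scoring='roc_auc')  # binary classification metric
grid.fit(X, y)
print(grid.best_params_)
print(f'Best CV AUC: {grid.best_score_:.3f}')
```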
The sklearn documentation contains additional detail on how to use the various parameters for different use cases; see the references on GitHub.