Random forests are ensemble learning methods used for either classification or regression. A random forest combines the predictions of many decision trees, typically by majority vote for classification, into a single prediction. Random forests generally outperform individual decision trees because averaging over many trees, each trained on a random subsample of the data and features, reduces overfitting.
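To see the claim about overfitting in action, we can compare cross-validated accuracy of a single decision tree against a forest. This is a minimal sketch on a synthetic dataset (`make_classification` stands in for the chapter's data, which is an assumption here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chapter's dataset: noisy labels (flip_y)
# make single trees prone to overfitting.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# Cross-validated accuracy of one tree vs. an ensemble of 100 trees
tree_acc = np.mean(cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, scoring='accuracy', cv=5))
forest_acc = np.mean(cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y,
    scoring='accuracy', cv=5))
print(tree_acc, forest_acc)
```

On noisy data like this, the forest's averaged prediction typically generalizes better than the single tree's.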
We then try the random forest classifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))

rf.fit(X, y)
The output can be seen as follows:
Out[39]:
0.9997307783262454
With the random forest, we obtain a better result than with a single decision tree. Fitting the classifier displays its parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
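The parameters shown above (such as n_estimators and max_depth) can also be tuned rather than left at their defaults. A hedged sketch using GridSearchCV follows; the grid values are illustrative, and the synthetic X, y stand in for the chapter's data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the chapter's dataset
X, y = make_classification(n_samples=500, random_state=0)

# Search a small, illustrative grid over two key hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [None, 10]},
    scoring='accuracy', cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Larger grids (or RandomizedSearchCV) are common in practice; this only shows the mechanics.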
pd.DataFrame({'feature':X.columns, 'importance':rf.feature_importances_}).sort_values('importance', ascending=False).head(10)
The feature importances can be seen here:
                    feature  importance
53           service__ecr_i    0.278599
25            same_srv_rate    0.129464
20                srv_count    0.108782
1                 src_bytes    0.101766
113                flag__SF    0.073368
109                flag__S0    0.058412
19                    count    0.055665
29       dst_host_srv_count    0.038069
38       protocol_type__tcp    0.036816
30   dst_host_same_srv_rate    0.026287