Random forests are ensemble learning methods used for either classification or regression. A random forest combines the predictions of many decision trees, typically by majority vote for classification, into a single prediction. Random forests generally outperform individual decision trees because averaging over many trees, each trained on a random subsample of the data and features, reduces overfitting.
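To see the claim about overfitting in action, we can compare cross-validated accuracy of a single decision tree against a forest. This is a minimal sketch on a synthetic dataset (`make_classification` stands in for the chapter's data, which is an assumption here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chapter's dataset: noisy labels (flip_y)
# make single trees prone to overfitting.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)

# Cross-validated accuracy of one tree vs. an ensemble of 100 trees
tree_acc = np.mean(cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, scoring='accuracy', cv=5))
forest_acc = np.mean(cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y,
    scoring='accuracy', cv=5))
print(tree_acc, forest_acc)
```

On noisy data like this, the forest's averaged prediction typically generalizes better than the single tree's.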
We then try the random forest classifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))

rf.fit(X, y)
The output can be seen as follows:
Out[39]:
0.9997307783262454
With the random forest, we obtain a better result than with a single decision tree. Fitting the classifier displays its parameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
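The parameters shown above (such as n_estimators and max_depth) can also be tuned rather than left at their defaults. A hedged sketch using GridSearchCV follows; the grid values are illustrative, and the synthetic X, y stand in for the chapter's data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the chapter's dataset
X, y = make_classification(n_samples=500, random_state=0)

# Search a small, illustrative grid over two key hyperparameters
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 50], 'max_depth': [None, 10]},
    scoring='accuracy', cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

Larger grids (or RandomizedSearchCV) are common in practice; this only shows the mechanics.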
pd.DataFrame({'feature':X.columns, 'importance':rf.feature_importances_}).sort_values('importance', ascending=False).head(10)
The feature importances can be seen here:
                    feature  importance
53           service__ecr_i    0.278599
25            same_srv_rate    0.129464
20                srv_count    0.108782
1                 src_bytes    0.101766
113                flag__SF    0.073368
109                flag__S0    0.058412
19                    count    0.055665
29       dst_host_srv_count    0.038069
38       protocol_type__tcp    0.036816
30   dst_host_same_srv_rate    0.026287