Random forest

Random forests are ensemble learning methods that can be used for either classification or regression. A random forest combines many decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates their predictions by majority vote (or by averaging, for regression). Random forests generally perform better than a single decision tree because this averaging reduces the variance, and hence the overfitting, that individual deep trees are prone to, as the sketch below illustrates:
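To make the variance reduction concrete, here is a minimal sketch that compares a single decision tree against a random forest under 5-fold cross-validation. The synthetic dataset from make_classification and all parameter values are illustrative assumptions, not the chapter's data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only (not the chapter's dataset)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Averaging many decorrelated trees typically beats one deep tree
print(cross_val_score(tree, X_demo, y_demo, cv=5).mean())
print(cross_val_score(forest, X_demo, y_demo, cv=5).mean())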

We then train and evaluate the random forest classifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier()
# Estimate generalization accuracy with 5-fold cross-validation
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))
# Fit the final model on the full dataset
rf.fit(X, y)

The output can be seen as follows:

Out[39]:
0.9997307783262454

Fitting the classifier returns the estimator with all of its parameters; compared with a single decision tree, a random forest gives us more parameters to tune (a tuning sketch follows this output):

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
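Because the ensemble exposes parameters such as n_estimators and max_features, we could search over them; a minimal sketch with GridSearchCV, where the grid values are illustrative assumptions rather than tuned choices:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; these values are assumptions, not recommendations
params = {'n_estimators': [10, 50, 100],
          'max_features': ['sqrt', 'log2']}
grid = GridSearchCV(RandomForestClassifier(), params,
                    scoring='accuracy', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)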
We can then rank the ten most important features:

import pandas as pd

pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_}) \
    .sort_values('importance', ascending=False).head(10)

The feature importances can be seen here:

      Feature                   Importance
53    service__ecr_i            0.278599
25    same_srv_rate             0.129464
20    srv_count                 0.108782
1     src_bytes                 0.101766
113   flag__SF                  0.073368
109   flag__S0                  0.058412
19    count                     0.055665
29    dst_host_srv_count        0.038069
38    protocol_type__tcp        0.036816
30    dst_host_same_srv_rate    0.026287
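These importances are impurity-based (Gini) and sum to 1 across all features. As a quick visual check, here is a minimal sketch that assumes matplotlib is available and reuses the rf and X names from above:

import matplotlib.pyplot as plt
import pandas as pd

# Rebuild the ranked table and plot it as a horizontal bar chart
imp = pd.DataFrame({'feature': X.columns,
                    'importance': rf.feature_importances_})
imp = imp.sort_values('importance', ascending=False).head(10)

plt.barh(imp['feature'][::-1], imp['importance'][::-1])
plt.xlabel('Gini importance')
plt.title('Top 10 random forest feature importances')
plt.tight_layout()
plt.show()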
