Features of the data

The KDD data that we use for this example has the following features.

The following table shows the basic features of individual TCP connections:

| Feature name | Description | Type |
|---|---|---|
| duration | Length (number of seconds) of the connection | continuous |
| protocol_type | Type of the protocol, for example, tcp, udp, and so on | discrete |
| service | Network service on the destination, for example, http, telnet, and so on | discrete |
| src_bytes | Number of data bytes from source to destination | continuous |
| dst_bytes | Number of data bytes from destination to source | continuous |
| flag | Normal or error status of the connection | discrete |
| land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
| wrong_fragment | Number of wrong fragments | continuous |
| urgent | Number of urgent packets | continuous |

 

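The dataset also contains content features within a connection suggested by domain knowledge, such as hot, num_failed_logins, and logged_in; they are not tabulated here, but they appear in the list of column names later in this section. The following table shows the traffic features computed using a two-second time window: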

| Feature name | Description | Type |
|---|---|---|
| count | Number of connections to the same host as the current connection in the past two seconds | continuous |

 

The following features refer to these same-host connections:

| Feature name | Description | Type |
|---|---|---|
| serror_rate | % of connections that have SYN errors | continuous |
| rerror_rate | % of connections that have REJ errors | continuous |
| same_srv_rate | % of connections to the same service | continuous |
| diff_srv_rate | % of connections to different services | continuous |

 

The following features refer to these same-service connections:

| Feature name | Description | Type |
|---|---|---|
| srv_count | Number of connections to the same service as the current connection in the past two seconds | continuous |
| srv_serror_rate | % of connections that have SYN errors | continuous |
| srv_rerror_rate | % of connections that have REJ errors | continuous |
| srv_diff_host_rate | % of connections to different hosts | continuous |
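
To make the window-based definitions above concrete, here is a minimal sketch that derives count and same_srv_rate for a toy list of connections. The record layout (time, dst_host, service), the helper name window_features, and the sample data are assumptions made for illustration only. The KDD data already ships with these features computed, and the exact window boundaries used by its authors are not stated in this section.

```python
from collections import namedtuple

# Hypothetical record layout; the real KDD preprocessing is not shown here.
Conn = namedtuple("Conn", ["time", "dst_host", "service"])


def window_features(conns, index, window=2.0):
    """Return (count, same_srv_rate) for conns[index].

    count: connections to the same destination host in the past `window`
           seconds (assumption: the current connection is included).
    same_srv_rate: fraction of those connections using the same service.
    """
    current = conns[index]
    same_host = [
        c for c in conns[: index + 1]
        if c.dst_host == current.dst_host and current.time - c.time <= window
    ]
    count = len(same_host)
    same_service = sum(1 for c in same_host if c.service == current.service)
    return count, same_service / float(count)


# Toy traffic, ordered by time.
conns = [
    Conn(0.0, "10.0.0.5", "http"),
    Conn(0.5, "10.0.0.5", "http"),
    Conn(1.0, "10.0.0.5", "telnet"),
    Conn(1.5, "10.0.0.9", "http"),
    Conn(2.4, "10.0.0.5", "http"),
]

count, same_srv_rate = window_features(conns, 4)
print(count, same_srv_rate)
```

For the last connection, the record at time 0.0 falls outside the two-second window and the record for 10.0.0.9 targets a different host, so only three connections are counted, two of which use http.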

 

Now let us print the first few rows of the data:

import pandas as pd

# X and y are assumed to hold the KDD features and labels loaded earlier.
feature_cols = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
    'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
    'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate',
]
X = pd.DataFrame(X, columns=feature_cols)
y = pd.Series(y)
X.head()

The previous code displays the first few rows of the table with all the column names. Next, we convert the columns into floats for efficient processing; columns that hold text, such as protocol_type, raise a ValueError and are left unchanged:

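for col in X.columns:
    try:
        # Numeric columns convert cleanly to float.
        X[col] = X[col].astype(float)
    except ValueError:
        # Categorical columns cannot be parsed as numbers; keep them as they are.
        pass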
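
To see which columns survive the conversion as text, the same loop can be run on a small stand-in frame. The toy DataFrame and its values below are assumptions for illustration and are not taken from the KDD data.

```python
import pandas as pd

# Toy frame with object-typed columns, standing in for the raw KDD columns.
toy = pd.DataFrame(
    {
        "duration": ["0", "12", "3"],
        "protocol_type": ["tcp", "udp", "icmp"],
        "src_bytes": ["181", "239", "0"],
    },
    dtype=object,
)

for col in toy.columns:
    try:
        toy[col] = toy[col].astype(float)  # works for numeric strings
    except ValueError:
        pass  # non-numeric text stays as object

# Columns that are still object-typed are the categorical ones.
remaining = list(toy.select_dtypes(include=[object]).columns)
print(toy.dtypes)
print(remaining)
```

Listing the remaining object columns in this way is a quick check that only the intended categorical columns are left before they are encoded.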

We convert the categorical columns into dummy (indicator) variables:

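# Encode the three text columns; drop_first removes one level per column
# so that the indicator columns are not redundant.
X = pd.get_dummies(X, columns=['protocol_type', 'service', 'flag'],
                   prefix=['protocol_type_', 'service_', 'flag_'],
                   drop_first=True)
X.head()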
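
The effect of prefix and drop_first is easier to see on a small example. The following sketch uses a toy frame (an assumption for illustration) with a single categorical column and the same prefix style as the call above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "duration": [0.0, 12.0, 3.0, 5.0],
        "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    }
)

encoded = pd.get_dummies(
    toy,
    columns=["protocol_type"],
    prefix=["protocol_type_"],  # same trailing underscore as in the text
    drop_first=True,            # drops the first level in sorted order ('icmp')
)

print(list(encoded.columns))
```

The resulting indicator columns are named protocol_type__tcp and protocol_type__udp. The double underscore appears because get_dummies joins the prefix and the category value with its default separator, "_". A row whose protocol is icmp is represented by zeros in both indicator columns.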

On executing, the previous code displays 5 rows × 115 columns. Now we will count how many records belong to each class:

y.value_counts()

Out:
smurf.              280790
neptune.            107201
normal.              97278
back.                 2203
satan.                1589
ipsweep.              1247
portsweep.            1040
warezclient.          1020
teardrop.              979
pod.                   264
nmap.                  231
guess_passwd.           53
buffer_overflow.        30
land.                   21
warezmaster.            20
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
dtype: int64
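
These counts are heavily imbalanced, which matters when reading any accuracy figure. As a rough sketch, the following snippet reuses the counts printed above to compute the share of the largest class and to list the classes with fewer than five records. The variable names are illustrative assumptions.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "smurf.": 280790, "neptune.": 107201, "normal.": 97278, "back.": 2203,
    "satan.": 1589, "ipsweep.": 1247, "portsweep.": 1040, "warezclient.": 1020,
    "teardrop.": 979, "pod.": 264, "nmap.": 231, "guess_passwd.": 53,
    "buffer_overflow.": 30, "land.": 21, "warezmaster.": 20, "imap.": 12,
    "rootkit.": 10, "loadmodule.": 9, "ftp_write.": 8, "multihop.": 7,
    "phf.": 4, "perl.": 3, "spy.": 2,
}

total = sum(counts.values())

# Accuracy of a trivial model that always predicts the largest class.
majority_share = max(counts.values()) / float(total)

# Share of the records covered by the three largest classes.
top3_share = sum(sorted(counts.values(), reverse=True)[:3]) / float(total)

# Classes with fewer records than the five folds used for cross-validation.
rare = sorted(label for label, n in counts.items() if n < 5)

print(total)
print(round(majority_share, 4), round(top3_share, 4))
print(rare)
```

A model that always predicts smurf. would already be right for roughly 57% of the records, and the three largest classes cover about 98% of them, so overall accuracy says little about the rare attack types. Classes with fewer than five records cannot appear in every fold of a five-fold split, and scikit-learn's stratified splitter typically warns about this.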

We estimate the accuracy of a classification tree with max_depth=7 using five-fold cross-validation, and then fit it on all data as follows:


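import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

treeclf = DecisionTreeClassifier(max_depth=7)

# Mean accuracy over five cross-validation folds.
scores = cross_val_score(treeclf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))

# Refit on the full dataset.
treeclf.fit(X, y)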

The mean cross-validated accuracy printed by the preceding code is as follows:

0.9955204407492013
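
The preceding code depends on the X and y built earlier in this section. For readers who want to try the same cross-validate-then-fit pattern without the KDD data, here is a self-contained sketch on synthetic data. The dataset produced by make_classification, its parameters, and the random_state values are assumptions chosen for illustration, so the scores it prints are unrelated to the figure above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the KDD features and labels.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

treeclf = DecisionTreeClassifier(max_depth=7, random_state=0)

# One accuracy value per fold; the spread shows how stable the estimate is.
scores = cross_val_score(treeclf, X_toy, y_toy, scoring="accuracy", cv=5)
print(scores)
print(np.mean(scores), np.std(scores))

# Refit on all the data, as in the text, and check the depth limit.
treeclf.fit(X_toy, y_toy)
print(treeclf.get_depth())
```

Printing the individual fold scores alongside the mean makes it easier to spot a single fold that behaves very differently from the rest.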