Features of the data

The KDD data used in this example has the features listed in the following tables.

The following table shows the basic features of individual TCP connections:

Feature name	Description	Type
`duration`	Length (number of seconds) of the connection	continuous
`protocol_type`	Type of the protocol, for example, tcp, udp, and so on	discrete
`service`	Network service on the destination, for example, http, telnet, and so on	discrete
`src_bytes`	Number of data bytes from source to destination	continuous
`dst_bytes`	Number of data bytes from destination to source	continuous
`flag`	Normal or error status of the connection	discrete
`land`	1 if connection is from/to the same host/port; 0 otherwise	discrete
`wrong_fragment`	Number of wrong fragments	continuous
`urgent`	Number of urgent packets	continuous

The dataset also includes content features within a connection suggested by domain knowledge, such as `hot`, `num_failed_logins`, and `logged_in`. The following table shows the traffic features computed using a two-second time window:

Feature name	Description	Type
`count`	Number of connections to the same host as the current connection in the past two seconds	continuous

The following features refer to these same-host connections:

Feature name	Description	Type
`serror_rate`	% of connections that have `SYN` errors	continuous
`rerror_rate`	% of connections that have `REJ` errors	continuous
`same_srv_rate`	% of connections to the same service	continuous
`diff_srv_rate`	% of connections to different services	continuous

The following features refer to these same-service connections:

Feature name	Description	Type
`srv_count`	Number of connections to the same service as the current connection in the past two seconds	continuous
`srv_serror_rate`	% of connections that have `SYN` errors	continuous
`srv_rerror_rate`	% of connections that have `REJ` errors	continuous
`srv_diff_host_rate`	% of connections to different hosts	continuous
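To make the two-second window concrete, here is a minimal sketch of how a few of these traffic features could be derived from a list of earlier connections. The `Conn` record, the `window_features` helper, and the sample values are hypothetical and exist only for illustration. The sketch also assumes that the flags `S0` to `S3` count as `SYN` errors and that the current connection itself is excluded from the counts; the original dataset was built from raw TCP dump data and may differ on both points.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical connection record used only for this illustration."""
    time: float      # start time in seconds
    dst_host: str
    service: str
    flag: str        # for example 'SF' (normal), 'S0' (SYN error), 'REJ'


# Assumption: these flags are the ones treated as SYN errors.
SYN_ERROR_FLAGS = {'S0', 'S1', 'S2', 'S3'}


def rate(items, predicate):
    """Fraction of items that satisfy predicate (0.0 for an empty list)."""
    return sum(1 for c in items if predicate(c)) / len(items) if items else 0.0


def window_features(history, current, window=2.0):
    """Compute a few traffic features for `current` from earlier connections."""
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    return {
        'count': len(same_host),
        'srv_count': len(same_srv),
        'serror_rate': rate(same_host, lambda c: c.flag in SYN_ERROR_FLAGS),
        'rerror_rate': rate(same_host, lambda c: c.flag == 'REJ'),
        'same_srv_rate': rate(same_host, lambda c: c.service == current.service),
        'diff_srv_rate': rate(same_host, lambda c: c.service != current.service),
        'srv_diff_host_rate': rate(same_srv, lambda c: c.dst_host != current.dst_host),
    }


history = [
    Conn(0.0, 'h1', 'http', 'SF'),     # older than two seconds, ignored
    Conn(1.5, 'h1', 'http', 'S0'),
    Conn(2.0, 'h1', 'telnet', 'REJ'),
    Conn(2.5, 'h2', 'http', 'SF'),
    Conn(2.8, 'h1', 'http', 'SF'),
]
current = Conn(3.0, 'h1', 'http', 'SF')
features = window_features(history, current)
print(features)
```

The same-host rates are computed over the connections counted by `count`, while `srv_diff_host_rate` is computed over the connections counted by `srv_count`, which mirrors the grouping of the tables.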

Now let us print the first few rows of the data:

import pandas as pd

# Column names for the 41 features, in dataset order
feature_cols = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']

# Wrap the feature matrix and the labels in pandas objects
X = pd.DataFrame(X, columns=feature_cols)
y = pd.Series(y)
X.head()

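The preceding code displays the first few rows of the table with all the column names. We then convert the columns to floats where possible for efficient processing; columns that hold non-numeric values raise a `ValueError` and are left unchanged: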

for col in X.columns:  
    try:
        X[col] = X[col].astype(float)
    except ValueError:
        pass
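The effect of this loop is easier to see on a small frame. The following sketch uses a hypothetical three-column frame with made-up values, not the KDD data itself, to show that numeric-looking columns become floats while a categorical column such as `protocol_type` keeps its original type.

```python
import pandas as pd

# Toy frame with made-up values; every column starts out holding strings.
toy = pd.DataFrame({
    'duration': ['0', '12', '3'],
    'protocol_type': ['tcp', 'udp', 'tcp'],
    'src_bytes': ['181', '239', '235'],
}, dtype=object)

for col in toy.columns:
    try:
        toy[col] = toy[col].astype(float)
    except ValueError:
        # Non-numeric columns such as protocol_type are left unchanged.
        pass

numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)
print(categorical_cols)
```

Listing the non-numeric columns after the loop is a quick way to confirm which columns still need encoding before a model can use them.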

We convert the categorical columns into dummy (indicator) variables:

X = pd.get_dummies(X, prefix=['protocol_type_', 'service_', 'flag_'], drop_first=True)
X.head()
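As a small illustration of what `pd.get_dummies` does with `prefix` and `drop_first`, consider the sketch below. The toy frame and its values are assumptions made for this example. Because the default `prefix_sep` is `'_'`, a prefix that already ends in an underscore produces column names with a double underscore.

```python
import pandas as pd

# Toy frame with made-up values: one numeric and one categorical column.
toy = pd.DataFrame({
    'duration': [0.0, 12.0, 3.0],
    'protocol_type': ['tcp', 'udp', 'icmp'],
})

# drop_first=True drops the first category in sorted order ('icmp'),
# which then acts as the baseline level.
encoded = pd.get_dummies(toy, prefix=['protocol_type'], drop_first=True)
print(list(encoded.columns))

# A prefix with a trailing underscore yields a double underscore in the names.
encoded_trailing = pd.get_dummies(toy, prefix=['protocol_type_'], drop_first=True)
print(list(encoded_trailing.columns))
```

When `prefix` is passed as a list, it needs one entry per column being encoded, which is why the listing for the KDD data supplies three prefixes for `protocol_type`, `service`, and `flag`.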

On executing, the previous code displays the first 5 rows × 115 columns.

Now we will generate the counts of each label:

y.value_counts()

Out: 
smurf.              280790
neptune.            107201
normal.              97278
back.                 2203
satan.                1589
ipsweep.              1247
portsweep.            1040
warezclient.          1020
teardrop.              979
pod.                   264
nmap.                  231
guess_passwd.           53
buffer_overflow.        30
land.                   21
warezmaster.            20
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
dtype: int64
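These counts are heavily imbalanced, which matters when reading an accuracy score. The following sketch copies the counts from the output above into a dictionary and computes the total number of connections, the share held by the three largest classes, and the labels with fewer than five examples. The threshold of five is chosen here only because it matches a five-fold split; it is not part of the original analysis.

```python
# Label counts copied from the value_counts() output.
label_counts = {
    'smurf.': 280790, 'neptune.': 107201, 'normal.': 97278, 'back.': 2203,
    'satan.': 1589, 'ipsweep.': 1247, 'portsweep.': 1040, 'warezclient.': 1020,
    'teardrop.': 979, 'pod.': 264, 'nmap.': 231, 'guess_passwd.': 53,
    'buffer_overflow.': 30, 'land.': 21, 'warezmaster.': 20, 'imap.': 12,
    'rootkit.': 10, 'loadmodule.': 9, 'ftp_write.': 8, 'multihop.': 7,
    'phf.': 4, 'perl.': 3, 'spy.': 2,
}

total = sum(label_counts.values())

# Share of all connections covered by the three most frequent labels.
top3 = sorted(label_counts.values(), reverse=True)[:3]
top3_share = sum(top3) / total

# Labels too rare to appear in every fold of a five-fold split.
rare = [label for label, n in label_counts.items() if n < 5]

print(total)
print(round(top3_share, 4))
print(rare)
```

Since three labels cover more than 98% of the rows, a model can reach a very high overall accuracy while still misclassifying most of the rare attack types.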

We fit a classification tree with `max_depth=7` on all the data and estimate its accuracy with five-fold cross-validation as follows:


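import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Limit the depth of the tree to reduce overfitting
treeclf = DecisionTreeClassifier(max_depth=7)

# Estimate accuracy with five-fold cross-validation
scores = cross_val_score(treeclf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))

# Fit the final tree on all the data
treeclf.fit(X, y)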

The mean cross-validated accuracy printed by the preceding code is as follows:

0.9955204407492013
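The listing imports `export_graphviz` without using it in this section. As a hedged sketch of how a fitted tree could be exported for inspection, the following example fits a shallow tree on scikit-learn's bundled iris data, used here only as a small stand-in for the KDD frame, and produces the DOT source as a string. The variable names and the choice of `max_depth=2` are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Stand-in data: iris is used only because it ships with scikit-learn.
iris = load_iris()

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_source = export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)
print(dot_source[:200])
```

The resulting DOT text can be rendered with Graphviz; for the KDD tree, `feature_names` would be the columns of `X` after the dummy encoding.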
