The KDD data that we use for this example has the features listed in the following tables.
The first table shows the basic features of individual TCP connections:
Feature name | Description | Type |
duration | Length (number of seconds) of the connection | continuous |
protocol_type | Type of the protocol, for example, tcp, udp, and so on | discrete |
service | Network service on the destination, for example, http, telnet, and so on | discrete |
src_bytes | Number of data bytes from source to destination | continuous |
dst_bytes | Number of data bytes from destination to source | continuous |
flag | Normal or error status of the connection | discrete |
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
wrong_fragment | Number of wrong fragments | continuous |
urgent | Number of urgent packets | continuous |
The dataset also contains content features within a connection suggested by domain knowledge, such as num_failed_logins and logged_in. The following table shows the traffic features computed using a two-second time window:
Feature name | Description | Type |
count | Number of connections to the same host as the current connection in the past two seconds | continuous |
The following features refer to these same-host connections:
Feature name | Description | Type |
serror_rate | % of connections that have SYN errors | continuous |
rerror_rate | % of connections that have REJ errors | continuous |
same_srv_rate | % of connections to the same service | continuous |
diff_srv_rate | % of connections to different services | continuous |
The following features refer to these same-service connections:
Feature name | Description | Type |
srv_count | Number of connections to the same service as the current connection in the past two seconds | continuous |
srv_serror_rate | % of connections that have SYN errors | continuous |
srv_rerror_rate | % of connections that have REJ errors | continuous |
srv_diff_host_rate | % of connections to different hosts | continuous |
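To make the windowed traffic features more concrete, here is a minimal sketch of how a few of them could be derived from a connection log. It is an illustration only, not the procedure used to build the KDD data: the `Conn` record and the `window_features` helper are hypothetical, the history is assumed to exclude the current connection, only the flag value `S0` is treated as a SYN error, and rates are returned as fractions rather than percentages.

```python
from collections import namedtuple

# Hypothetical minimal connection record; real KDD records carry many more fields.
Conn = namedtuple("Conn", ["time", "dst_host", "service", "flag"])


def rate(matches, total):
    """Return matches / total, or 0.0 when there are no connections to compare."""
    return matches / total if total else 0.0


def window_features(history, current, window=2.0):
    """Compute a few same-host and same-service features for `current`."""
    # Keep only earlier connections that started within the time window.
    recent = [c for c in history if 0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    return {
        "count": len(same_host),
        # Assumption: only the flag "S0" counts as a SYN error in this sketch.
        "serror_rate": rate(sum(c.flag == "S0" for c in same_host), len(same_host)),
        "same_srv_rate": rate(
            sum(c.service == current.service for c in same_host), len(same_host)
        ),
        "srv_count": len(same_srv),
        "srv_diff_host_rate": rate(
            sum(c.dst_host != current.dst_host for c in same_srv), len(same_srv)
        ),
    }


history = [
    Conn(0.0, "10.0.0.5", "http", "SF"),    # too old, outside the window
    Conn(9.0, "10.0.0.5", "http", "S0"),    # same host, same service, SYN error
    Conn(9.5, "10.0.0.5", "telnet", "SF"),  # same host, different service
    Conn(9.8, "10.0.0.9", "http", "SF"),    # different host, same service
]
current = Conn(10.0, "10.0.0.5", "http", "SF")
features = window_features(history, current)
print(features)
```

In this toy log, two earlier connections reach the same host within two seconds, so `count` is 2, and one of them carries the `S0` flag, so `serror_rate` is 0.5.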
Now let us print the first few rows of the table:
import pandas as pd

# X and y are assumed to hold the raw feature values and labels loaded earlier.
feature_cols = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
                'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in',
                'num_compromised', 'root_shell', 'su_attempted', 'num_root',
                'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
                'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
                'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
                'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
                'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
                'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
                'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
                'dst_host_srv_rerror_rate']
X = pd.DataFrame(X, columns=feature_cols)
y = pd.Series(y)
X.head()
The preceding code displays the first few rows of the table with all the column names. Then we convert the columns that hold numeric values into floats for efficient processing:
for col in X.columns:
    try:
        X[col] = X[col].astype(float)
    except ValueError:
        # Non-numeric columns (protocol_type, service, flag) are left unchanged.
        pass
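The effect of this loop is easier to see on a small table. The following sketch uses a hypothetical three-row frame (`toy`) whose values are all stored as strings; the columns that parse as numbers become floats, while the `ValueError` raised for `protocol_type` is swallowed and that column keeps its original values.

```python
import pandas as pd

# Hypothetical toy data: every value starts out as a string in an object column.
toy = pd.DataFrame(
    {
        "duration": ["0", "12", "3"],
        "protocol_type": ["tcp", "udp", "tcp"],
        "src_bytes": ["181", "239", "0"],
    },
    dtype=object,
)

for col in toy.columns:
    try:
        toy[col] = toy[col].astype(float)
    except ValueError:
        # "tcp" cannot be parsed as a float, so this column stays as it is.
        pass

# Collect the columns that were successfully converted.
numeric_cols = [col for col in toy.columns if toy[col].dtype == float]
print(numeric_cols)
print(toy.dtypes)
```

One consequence of this pattern is that a column is converted only if every value in it parses; a single non-numeric entry leaves the whole column unconverted.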
We convert the categorical columns into dummy (indicator) variables:
X = pd.get_dummies(X, prefix=['protocol_type_', 'service_', 'flag_'], drop_first=True)
X.head()
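As a small illustration of what `get_dummies` does with `drop_first=True`, the sketch below encodes a hypothetical four-row frame (`toy`). Each categorical column is replaced by one indicator column per category except the first in sorted order, which becomes the implicit reference level. The sketch passes prefixes without a trailing underscore because pandas already inserts `_` as the default `prefix_sep`; a prefix that ends in an underscore yields names with a double underscore.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
toy = pd.DataFrame(
    {
        "duration": [0.0, 12.0, 3.0, 1.0],
        "protocol_type": ["tcp", "udp", "icmp", "tcp"],
        "flag": ["SF", "S0", "SF", "REJ"],
    }
)

encoded = pd.get_dummies(
    toy,
    columns=["protocol_type", "flag"],
    prefix=["protocol_type", "flag"],
    drop_first=True,  # drop the first category of each column (icmp, REJ)
)
print(list(encoded.columns))
print(encoded)
```

The row whose protocol is `icmp` has both `protocol_type_tcp` and `protocol_type_udp` unset, which is how the dropped category is represented.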
Calling X.head() on the encoded table displays 5 rows × 115 columns.
Now we count the occurrences of each class label:
y.value_counts()
Out:
smurf.              280790
neptune.            107201
normal.              97278
back.                 2203
satan.                1589
ipsweep.              1247
portsweep.            1040
warezclient.          1020
teardrop.              979
pod.                   264
nmap.                  231
guess_passwd.           53
buffer_overflow.        30
land.                   21
warezmaster.            20
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
dtype: int64
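The counts above are heavily skewed, which matters when accuracy is used as the metric. The following sketch simply redoes the arithmetic on the printed counts (the `counts` dictionary is copied from the output above): it computes the total number of records, the share of the largest class, and the combined share of the three largest classes.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "smurf.": 280790, "neptune.": 107201, "normal.": 97278, "back.": 2203,
    "satan.": 1589, "ipsweep.": 1247, "portsweep.": 1040, "warezclient.": 1020,
    "teardrop.": 979, "pod.": 264, "nmap.": 231, "guess_passwd.": 53,
    "buffer_overflow.": 30, "land.": 21, "warezmaster.": 20, "imap.": 12,
    "rootkit.": 10, "loadmodule.": 9, "ftp_write.": 8, "multihop.": 7,
    "phf.": 4, "perl.": 3, "spy.": 2,
}

total = sum(counts.values())

# Share of the single most frequent class (the accuracy of always guessing it).
majority_label = max(counts, key=counts.get)
majority_share = counts[majority_label] / total

# Combined share of the three most frequent classes.
top3_share = sum(sorted(counts.values(), reverse=True)[:3]) / total

print(total)                                    # 494021
print(majority_label, round(majority_share, 3)) # smurf. 0.568
print(round(top3_share, 3))                     # 0.982
```

A model that always predicts `smurf.` would already be about 56.8% accurate, and the three largest classes cover about 98.2% of the records, so overall accuracy says little about the rare attack types.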
We evaluate a classification tree with max_depth=7 using five-fold cross-validation and then fit it on all the data as follows:
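import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz

treeclf = DecisionTreeClassifier(max_depth=7)
# Mean accuracy over five cross-validation folds.
scores = cross_val_score(treeclf, X, y, scoring='accuracy', cv=5)
print(np.mean(scores))
# Refit the tree on the full dataset.
treeclf.fit(X, y)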
The mean cross-validated accuracy printed by the preceding code is as follows:
0.9955204407492013
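To put a cross-validated accuracy into context, it can help to compare it with a trivial baseline. The sketch below is not run on the KDD data: it uses a synthetic, imbalanced three-class problem from `make_classification` (all of its parameters are assumptions chosen for illustration) and compares a depth-7 tree against scikit-learn's `DummyClassifier`, which always predicts the most frequent class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an imbalanced multi-class problem (not the KDD data).
X_demo, y_demo = make_classification(
    n_samples=600,
    n_features=10,
    n_informative=5,
    n_classes=3,
    weights=[0.6, 0.3, 0.1],
    class_sep=2.0,
    random_state=0,
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

tree_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=7, random_state=0),
    X_demo, y_demo, scoring="accuracy", cv=cv,
)
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"),
    X_demo, y_demo, scoring="accuracy", cv=cv,
)

print("tree mean accuracy:    ", tree_scores.mean())
print("baseline mean accuracy:", baseline_scores.mean())
```

The gap between the two means is a more informative number than the tree's accuracy alone, because the baseline shows how much of that accuracy comes from class imbalance.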