We will use the intrusion detection problem again, this time to detect anomalies. First, we import pandas:
import pandas as pd
The feature names come from the dataset's documentation at http://icsdweb.aegean.gr/awid/features.html. We include them as a Python list:
features = ['frame.interface_id',
'frame.dlt',
'frame.offset_shift',
'frame.time_epoch',
'frame.time_delta',
'frame.time_delta_displayed',
'frame.time_relative',
'frame.len',
'frame.cap_len',
'frame.marked',
'frame.ignored',
'radiotap.version',
'radiotap.pad',
'radiotap.length',
'radiotap.present.tsft',
'radiotap.present.flags',
'radiotap.present.rate',
'radiotap.present.channel',
'radiotap.present.fhss',
'radiotap.present.dbm_antsignal',
...
The full list (truncated above) contains all 155 features in the AWID dataset. We import the training set and check the number of rows and columns:
awid = pd.read_csv("../data/AWID-CLS-R-Trn.csv", header=None, names=features)
# see the number of rows/columns
awid.shape
We can ignore the warning raised during the import. The output of shape is a tuple giving the number of rows and columns of the 155-feature training set:
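As a minimal sketch of why header=None and names matter, here is the same pattern on an in-memory CSV (with hypothetical column names) in place of the AWID file:

```python
import io

import pandas as pd

# a tiny headerless CSV, standing in for the AWID training file
raw = io.StringIO("1,2,normal\n3,4,injection\n")

# header=None tells pandas the first row is data, not column names;
# names supplies the column labels explicitly
df = pd.read_csv(raw, header=None, names=['a', 'b', 'class'])

print(df.shape)
```

Without header=None, pandas would consume the first data row as column names.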
We will eventually have to deal with the null values; the dataset marks them with ?:
# they use ? as a null attribute.
awid.head()
The preceding code will produce a table of 5 rows × 155 columns as an output.
We check the class distribution:
awid['class'].value_counts(normalize=True)
normal 0.909564
injection 0.036411
impersonation 0.027023
flooding 0.027002
Name: class, dtype: float64
We check for NAs:
# claims there are no null values because of the ?s
awid.isna().sum()
The output looks like this:
We replace all ? marks with None:
# replace the ? marks with None
awid.replace({"?": None}, inplace=True)
The sum shows a large amount of missing data:
# Many missing pieces of data!
awid.isna().sum()
Here is what the output looks like:
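As a minimal sketch of this pattern (on a toy frame, not the AWID data): before the replacement, pandas treats ? as an ordinary string, so isna() counts nothing; after it, the missing values become countable:

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '?', '3'], 'b': ['?', '?', '6']})

# before replacement, pandas sees '?' as an ordinary string
assert df.isna().sum().sum() == 0

# after replacement, the missing values are countable
df.replace({'?': None}, inplace=True)
missing = df.isna().sum()
print(missing)
```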
Here, we identify columns that have over 50% of their data missing:
columns_with_mostly_null_data = awid.columns[awid.isnull().mean() >= 0.5]
# 72 columns are going to be affected!
columns_with_mostly_null_data.shape
(72,)
We drop the columns with over 50% of their data missing:
awid.drop(columns_with_mostly_null_data, axis=1, inplace=True)
We confirm the new shape:
awid.shape
(1795575, 83)
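The column-dropping pattern can be sketched on a toy frame (hypothetical column names, chosen only to illustrate the threshold):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],   # 75% null
    'mostly_present': [1.0, 2.0, np.nan, 4.0],         # 25% null
})

# isnull().mean() gives the fraction of nulls per column
null_fraction = df.isnull().mean()
bad_cols = df.columns[null_fraction >= 0.5]

df = df.drop(bad_cols, axis=1)
print(list(df.columns))
```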
Now, drop the rows that have missing values:
awid.dropna(inplace=True)  # drop rows with null data
We lost 456,169 rows:
awid.shape
(1339406, 83)
However, it doesn't affect our distribution too much:
# 0.878763 is our null accuracy. Our model must be better than this number to be a contender
awid['class'].value_counts(normalize=True)
normal 0.878763
injection 0.048812
impersonation 0.036227
flooding 0.036198
Name: class, dtype: float64
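The comment above refers to the null accuracy: the accuracy a model would achieve by always predicting the majority class. A minimal sketch of how it is computed (on made-up labels):

```python
import pandas as pd

# made-up labels: 88 normal, 7 injection, 5 flooding
labels = pd.Series(['normal'] * 88 + ['injection'] * 7 + ['flooding'] * 5)

# the largest normalized class frequency is the null accuracy
null_accuracy = labels.value_counts(normalize=True).max()
print(null_accuracy)
```

Any model worth keeping must beat this number on the test set.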
Our ML algorithms need numerical columns. Selecting them shows fewer columns than expected, because many numeric columns were imported as strings:
awid.select_dtypes(['number']).shape
(1339406, 45)
for col in awid.columns:
    awid[col] = pd.to_numeric(awid[col], errors='ignore')
# that makes more sense
awid.select_dtypes(['number']).shape
The output can be seen here:
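The errors='ignore' option (deprecated in recent pandas versions) tries to parse each column and leaves it untouched on failure. A version-safe sketch of the same idea, on a toy frame with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({'num_as_str': ['1', '2', '3'], 'text': ['a', 'b', 'c']})

# equivalent to the errors='ignore' loop: try to parse each column,
# and leave it unchanged if parsing fails
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass

print(df.dtypes)
```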
We derive basic descriptive statistics:
awid.describe()
Executing the preceding code produces a table of 8 rows × 74 columns. We separate the numerical features from the class labels:
X, y = awid.select_dtypes(['number']), awid['class']
We fit a basic Gaussian Naive Bayes model to the data:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X, y)
The fitted Gaussian Naive Bayes estimator is displayed as follows:
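As a minimal sketch of the fit/predict cycle (on synthetic one-dimensional points, not the AWID features):

```python
from sklearn.naive_bayes import GaussianNB

# two well-separated 1-D clusters with made-up class labels
X_toy = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
y_toy = ['normal', 'normal', 'normal', 'injection', 'injection', 'injection']

clf = GaussianNB()
clf.fit(X_toy, y_toy)

# points near each cluster are assigned that cluster's class
pred = clf.predict([[0.05], [10.05]])
print(pred)
```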
We read in the test data and do the same transformations to it, to match the training data:
awid_test = pd.read_csv("../data/AWID-CLS-R-Tst.csv", header=None, names=features)
# drop the problematic columns
awid_test.drop(columns_with_mostly_null_data, axis=1, inplace=True)
# replace ? with None
awid_test.replace({"?": None}, inplace=True)
# drop the rows with null data
awid_test.dropna(inplace=True) # drop rows with null data
# convert columns to numerical values
for col in awid_test.columns:
    awid_test[col] = pd.to_numeric(awid_test[col], errors='ignore')
awid_test.shape
The output is as follows:
We compute the basic metric, accuracy:
from sklearn.metrics import accuracy_score
We define a simple function to test the accuracy of a model fitted on training data by using our testing data:
X_test = awid_test.select_dtypes(['number'])
y_test = awid_test['class']
def get_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(y_preds, y_test)
# naive bayes does very poorly on its own!
get_test_accuracy_of(nb)
The output can be seen here:
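accuracy_score simply computes the fraction of matching labels; a quick sketch on made-up predictions:

```python
from sklearn.metrics import accuracy_score

y_true = ['normal', 'normal', 'injection', 'flooding']
y_pred = ['normal', 'normal', 'injection', 'normal']

# 3 of 4 labels match
acc = accuracy_score(y_true, y_pred)
print(acc)
```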
We perform logistic regression, but it performs even worse:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)
# Logistic regression does even worse
get_test_accuracy_of(lr)
We can ignore this warning:
The following shows the output:
We test with DecisionTreeClassifier as shown here:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X, y)
# Tree does very well!
get_test_accuracy_of(tree)
The output can be seen as follows:
We test the Gini scores of the decision tree features as follows:
pd.DataFrame({'feature': awid.select_dtypes(['number']).columns,
              'importance': tree.feature_importances_}).sort_values('importance', ascending=False).head(10)
The output of the preceding code gives the following table:
We import RandomForestClassifier as shown here:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X, y)
# Random Forest does slightly worse
get_test_accuracy_of(forest)
We can ignore this warning:
The following is the output:
We create a pipeline that will scale the numerical data and then feed the resulting data into a decision tree:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", DecisionTreeClassifier())
])
# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}
# instantiate a gridsearch module
grid = GridSearchCV(pipeline, params)
# fit the module
grid.fit(X, y)
# test the best model
get_test_accuracy_of(grid.best_estimator_)
We can ignore this warning:
The output is as follows:
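After fitting, the grid object exposes the winning configuration via best_params_ alongside the refit best_estimator_. A minimal sketch on a bundled toy dataset (iris here, purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])

# search the same max_depth grid as above
grid = GridSearchCV(pipe, {"classifier__max_depth": [None, 3, 5, 10]})
grid.fit(X_toy, y_toy)

# the best hyperparameters found by cross-validation
print(grid.best_params_)
```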
We try the same thing with a random forest:
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", RandomForestClassifier())
])
# try varying levels of depth
params = {
    "classifier__max_depth": [None, 3, 5, 10],
}
grid = GridSearchCV(pipeline, params)
grid.fit(X, y)
# best accuracy so far!
get_test_accuracy_of(grid.best_estimator_)
The following shows the output:
0.8893431144571348
We import LabelEncoder:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded_y = encoder.fit_transform(y)
encoded_y.shape
The output is as follows:
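To see what LabelEncoder does, here is a minimal sketch on a few made-up labels; classes_ is sorted alphabetically, and each code is an index into it:

```python
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
codes = enc.fit_transform(['normal', 'injection', 'normal', 'flooding'])

# classes_ is sorted alphabetically; the codes index into it
print(enc.classes_)
print(codes)
```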
We import LabelBinarizer to one-hot encode the class labels:
from sklearn.preprocessing import LabelBinarizer
binarizer = LabelBinarizer()
binarized_y = binarizer.fit_transform(encoded_y)
binarized_y.shape
We will get the following output:
Now, execute the following code:
binarized_y[:5,]
And the output will be as follows:
Run the y.head() command:
y.head()
The output is as follows:
Now run the following code:
print(encoder.classes_)
print(binarizer.classes_)
The output can be seen as follows:
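To see what LabelBinarizer produces, here is a minimal sketch on toy integer labels: each row becomes a one-hot vector with one column per class:

```python
from sklearn.preprocessing import LabelBinarizer

binarizer_toy = LabelBinarizer()
onehot = binarizer_toy.fit_transform([0, 1, 2, 3, 1])

# one column per class; exactly one 1 per row
print(onehot.shape)
print(onehot[0])
```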
Import the following packages:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
We build a baseline model for the neural network. We choose a hidden layer of 10 neurons. A lower number of neurons helps to eliminate the redundancies in the data and select the most important features:
def create_baseline_model(n, input_dim):
    # create model
    model = Sequential()
    model.add(Dense(n, input_dim=input_dim, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='sigmoid'))
    # Compile model. We use the logarithmic loss function and the Adam gradient optimizer.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
KerasClassifier(build_fn=create_baseline_model, epochs=100, batch_size=5, verbose=0, n=20)
We can see the following output:
Run the following code:
# use the KerasClassifier
from sklearn.model_selection import cross_val_score

preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=create_baseline_model, epochs=2, batch_size=128,
                                   verbose=1, n=10, input_dim=74))
])
cross_val_score(pipeline, X, binarized_y)
The epoch progress can be seen as follows:
The output for the preceding code is as follows:
# notice the LARGE variance in scores of a neural network. This is due to the high-variance nature of how networks fit
# using stochastic gradient descent
pipeline.fit(X, binarized_y)
We will get the following output:
Now execute the following code:
# re-encode the test labels
encoded_y_test = encoder.transform(y_test)
def get_network_test_accuracy_of(model):
    y_preds = model.predict(X_test)
    return accuracy_score(y_preds, encoded_y_test)
# not the best accuracy
get_network_test_accuracy_of(pipeline)
389185/389185 [==============================] - 3s 7us/step
The following is the output of the preceding input:
By fitting again, we get a different test accuracy. This also highlights the variance of the network:
pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)
We will get the following output:
We add some more epochs to learn more:
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=create_baseline_model, epochs=10, batch_size=128,
                                   verbose=1, n=10, input_dim=74))
])
cross_val_score(pipeline, X, binarized_y)
We get output as follows:
By fitting again, we get a different test accuracy. This also highlights the variance of the network:
pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)
The output of the preceding code is as follows:
This took much longer and still didn't increase the accuracy. We change our function to have multiple hidden layers in our network:
def network_builder(hidden_dimensions, input_dim):
    # create model
    model = Sequential()
    model.add(Dense(hidden_dimensions[0], input_dim=input_dim, kernel_initializer='normal', activation='relu'))
    # add multiple hidden layers
    for dimension in hidden_dimensions[1:]:
        model.add(Dense(dimension, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='sigmoid'))
    # Compile model. We use the logarithmic loss function and the Adam gradient optimizer.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
We add some more hidden layers to learn more:
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=network_builder, epochs=10, batch_size=128,
                                   verbose=1, hidden_dimensions=(60,30,10), input_dim=74))
])
cross_val_score(pipeline, X, binarized_y)
We get the output as follows:
We fit on the full training data and evaluate on the test set:
pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)
We get the epoch output as follows:
We got a small bump by increasing the hidden layers. Adding some more hidden layers to learn more, we get the following:
preprocessing = Pipeline([
    ("scale", StandardScaler()),
])
pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("classifier", KerasClassifier(build_fn=network_builder, epochs=10, batch_size=128,
                                   verbose=1, hidden_dimensions=(30,30,30,10), input_dim=74))
])
cross_val_score(pipeline, X, binarized_y)
The Epoch output is as shown here:
The output can be seen as follows:
Execute the pipeline.fit() command:
pipeline.fit(X, binarized_y)
get_network_test_accuracy_of(pipeline)
By executing the preceding code, we will get the following output:
The best result so far comes from using deep learning. However, deep learning isn't the best choice for all datasets.