We start with importing the relevant packages that will be used. Since the data is very big, we may choose to use Spark.
Spark is an open source distributed cluster-computing system that is used for handling big data:
import os
import sys
import re
import time
import itertools
from operator import add
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as func
from pyspark.mllib.clustering import KMeans, KMeansModel
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from mpl_toolkits.mplot3d import Axes3D  # required for the 3D scatter plot later
import pandas as pd
import numpy as np
We start by loading the entire dataset:
input_path_of_file = "/datasets/kddcup.data"
raw_data = sc.textFile(input_path_of_file, 12)
Since the data is associated with the label, we write a function that will separate the label from the feature vector:
def parseVector(line):
    columns = line.split(',')
    thelabel = columns[-1]
    featurevector = columns[:-1]
    # drop the three categorical columns (indices 1-3) and keep the numeric ones
    featurevector = [element for i, element in enumerate(featurevector) if i not in [1, 2, 3]]
    featurevector = np.array(featurevector, dtype=float)
    return (thelabel, featurevector)
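Before running this over the full RDD, the parsing logic can be checked locally on a single record. The sample line below is a made-up record in the KDD Cup format (duration, protocol_type, service, flag, then numeric features, with the label last), not an actual row from the dataset:

```python
import numpy as np

def parseVector(line):
    # split the CSV record, keep the last column as the label,
    # drop the three categorical columns (indices 1-3), cast the rest to floats
    columns = line.split(',')
    thelabel = columns[-1]
    featurevector = [e for i, e in enumerate(columns[:-1]) if i not in [1, 2, 3]]
    return (thelabel, np.array(featurevector, dtype=float))

# hypothetical record for illustration only
sample = "0,tcp,http,SF,181,5450,normal."
label, vec = parseVector(sample)
print(label)     # normal.
print(len(vec))  # 3 numeric features remain after dropping indices 1-3
```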
labelsAndData = raw_data.map(parseVector).cache()
thedata = labelsAndData.map(lambda row: row[1]).cache()
n = thedata.count()
len(thedata.first())
The outputs are n, that is, the number of connections, and the length of each feature vector:
4898431
38
We use the k-means algorithm from the MLlib package. The initial choice here is to use two clusters, because first we need to understand the data:
time1 = time.time()
k_clusters = KMeans.train(thedata, 2, maxIterations=10, runs=10, initializationMode="random")
print(time.time() - time1)
We will display how these features look. Since we cannot plot all 38 dimensions, we will project the data onto three of the features; to pick informative ones, we first compute the variance of each feature:
def getVariances(thedata):
    n = thedata.count()
    means = thedata.reduce(add) / n
    vars_ = thedata.map(lambda x: (x - means)**2).reduce(add) / n
    return vars_
vars_ = getVariances(thedata)
On displaying the variances, we see that the features vary on very different scales:
print(vars_)
array([5.23205909e+05, 8.86292287e+11, 4.16040826e+11, 5.71608336e-06,
       1.83649380e-03, 5.20574220e-05, 2.19940474e-01, 5.32813401e-05,
       1.22928440e-01, 1.48724429e+01, 6.81804492e-05, 6.53256901e-05,
       1.55084339e+01, 1.54220970e-02, 7.63454566e-05, 1.26099403e-03,
       0.00000000e+00, 4.08293836e-07, 8.34467881e-04, 4.49400827e+04,
       6.05124011e+04, 1.45828938e-01, 1.46118156e-01, 5.39414093e-02,
       5.41308521e-02, 1.51551218e-01, 6.84170094e-03, 1.97569872e-02,
       4.09867958e+03, 1.12175120e+04, 1.69073904e-01, 1.17816269e-02,
       2.31349138e-01, 1.70236904e-03, 1.45800386e-01, 1.46059565e-01,
       5.33345749e-02, 5.33506914e-02])
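The reduce-based computation above is the population variance of each feature. As a sanity check, the same formula can be reproduced locally with NumPy on a small toy matrix (the values below are made up for illustration):

```python
import numpy as np

# toy matrix: 4 connections x 3 features (illustrative values only)
X = np.array([[0.0, 181.0, 5450.0],
              [0.0, 239.0, 486.0],
              [2.0, 235.0, 1337.0],
              [0.0, 219.0, 1970.0]])

n = X.shape[0]
means = X.sum(axis=0) / n
# population variance, mirroring the reduce(add)-based Spark computation
vars_ = ((X - means) ** 2).sum(axis=0) / n

# NumPy's built-in population variance agrees with the manual formula
assert np.allclose(vars_, X.var(axis=0))
print(vars_)
```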
For example, for the second feature, we can count how many connections exceed ten times its mean value:
mean = thedata.map(lambda x: x[1]).reduce(add) / n
print(thedata.filter(lambda x: x[1] > 10*mean).count())
4499
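The same filter can be sketched locally without Spark; the toy feature values below are chosen purely to illustrate a single extreme connection among many ordinary ones:

```python
import numpy as np

# hypothetical second-feature values: 19 ordinary connections and one extreme one
feature = np.array([1.0] * 19 + [1000.0])

mean = feature.sum() / len(feature)   # 50.95
# count the connections whose value exceeds ten times the mean
count = (feature > 10 * mean).sum()
print(count)  # 1
```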
We now identify the three features that vary the most, so that we can plot the data along them:
indices_of_variance = [t[0] for t in sorted(enumerate(vars_), key=lambda x: x[1])[-3:]]
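This sort-and-slice idiom returns the indices of the three largest variances. It can be verified locally with toy variances (illustrative values only):

```python
import numpy as np

# hypothetical per-feature variances
vars_ = np.array([5.0, 0.1, 300.0, 42.0, 7.0])

# sort (index, variance) pairs by variance and keep the last three,
# i.e. the indices of the three largest variances, in ascending order
indices_of_variance = [t[0] for t in sorted(enumerate(vars_), key=lambda x: x[1])[-3:]]
print(indices_of_variance)  # [4, 3, 2]
```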
dataprojected = thedata.randomSplit([10, 90])[0]  # random 10% sample
# separate the points into two RDDs, one per cluster
rdd0 = thedata.filter(lambda point: k_clusters.predict(point)==0)
rdd1 = thedata.filter(lambda point: k_clusters.predict(point)==1)
center_0 = k_clusters.centers[0]
center_1 = k_clusters.centers[1]
cluster_0 = rdd0.take(5)
cluster_1 = rdd1.take(5)
cluster_0_projected = np.array([[point[i] for i in indices_of_variance] for point in cluster_0])
cluster_1_projected = np.array([[point[i] for i in indices_of_variance] for point in cluster_1])
M = max(max(cluster_1_projected.flatten()), max(cluster_0_projected.flatten()))
m = min(min(cluster_1_projected.flatten()), min(cluster_0_projected.flatten()))
fig2plot = plt.figure(figsize=(8, 8))
pltx = fig2plot.add_subplot(111, projection='3d')
pltx.scatter(cluster_0_projected[:, 0], cluster_0_projected[:, 1], cluster_0_projected[:, 2], c="b")
pltx.scatter(cluster_1_projected[:, 0], cluster_1_projected[:, 1], cluster_1_projected[:, 2], c="r")
pltx.set_xlim(m, M)
pltx.set_ylim(m, M)
pltx.set_zlim(m, M)
pltx.legend(["cluster 0", "cluster 1"])
The graph we get from the preceding is as follows:
We see that cluster 1 contains far more elements than cluster 0, and that the elements of cluster 0 lie far from the center of the data, which is indicative of the imbalance in the data.
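This imbalance can be illustrated locally with a nearest-center assignment, which is what KMeansModel.predict does under the hood. The centers and points below are invented for illustration: one center sits in a far-out region, the other in the bulk of the data:

```python
import numpy as np

# hypothetical 2-cluster centers in a 2-feature space (illustrative only)
centers = np.array([[100.0, 100.0],   # center 0: far-out region
                    [0.0, 0.0]])      # center 1: bulk of the data

# toy data: 50 points near the origin plus one extreme point
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 1, size=(50, 2)),
                    [[120.0, 95.0]]])

# assign each point to its nearest center (mimicking KMeansModel.predict)
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)

counts = np.bincount(labels, minlength=2)
print(counts)  # cluster 0 captures only the single extreme point
```

When one cluster ends up holding only a handful of extreme points, as here, the clustering mostly separates outliers from the bulk rather than revealing structure within the bulk, which motivates trying larger values of k.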