We start with importing the relevant packages that will be used. Since the data is very big, we may choose to use Spark.
Spark is an open source distributed cluster-computing system that is used for handling big data:
import os
import sys
import re
import time
import itertools
from operator import add
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
import pyspark.sql.functions as func
from pyspark.mllib.clustering import KMeans, KMeansModel
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from mpl_toolkits.mplot3d import Axes3D  # required for the 3D scatter plot later
import pandas as pd
import numpy as np
We start by loading the entire dataset:
input_path_of_file = "/datasets/kddcup.data"
raw_data = sc.textFile(input_path_of_file, 12)
Since the data is associated with the label, we write a function that will separate the label from the feature vector:
def parseVector(line):
    columns = line.split(',')
    thelabel = columns[-1]
    featurevector = columns[:-1]
    # drop the three categorical columns (indices 1-3) and keep the numeric ones
    featurevector = [element for i, element in enumerate(featurevector) if i not in [1, 2, 3]]
    featurevector = np.array(featurevector, dtype=float)
    return (thelabel, featurevector)
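Before running this over the full RDD, the parsing logic can be checked locally on a single record. The sample line below is a made-up record in the KDD Cup format (duration, protocol_type, service, flag, then numeric features, with the label last), not an actual row from the dataset:

```python
import numpy as np

def parseVector(line):
    # split the CSV record, keep the last column as the label,
    # drop the three categorical columns (indices 1-3), cast the rest to floats
    columns = line.split(',')
    thelabel = columns[-1]
    featurevector = [e for i, e in enumerate(columns[:-1]) if i not in [1, 2, 3]]
    return (thelabel, np.array(featurevector, dtype=float))

# hypothetical record for illustration only
sample = "0,tcp,http,SF,181,5450,normal."
label, vec = parseVector(sample)
print(label)     # normal.
print(len(vec))  # 3 numeric features remain after dropping indices 1-3
```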
labelsAndData = raw_data.map(parseVector).cache()
thedata = labelsAndData.map(lambda row: row[1]).cache()
n = thedata.count()
len(thedata.first())
The outputs are n, that is, the number of connections, and the length of each feature vector:
4898431
38
We use the k-means algorithm from the MLlib package. The initial choice here is to use two clusters, because first we need to understand the data:
time1 = time.time()
k_clusters = KMeans.train(thedata, 2, maxIterations=10, runs=10, initializationMode="random")
print(time.time() - time1)
We will display how these features look. Since we cannot plot all 38 dimensions, we will project the data onto three of the features; to pick informative ones, we first compute the variance of each feature:
def getVariances(thedata):
    n = thedata.count()
    means = thedata.reduce(add) / n
    vars_ = thedata.map(lambda x: (x - means)**2).reduce(add) / n
    return vars_
vars_ = getVariances(thedata)
On displaying the variances, we see that the features vary on very different scales:
print(vars_)
array([5.23205909e+05, 8.86292287e+11, 4.16040826e+11, 5.71608336e-06,
       1.83649380e-03, 5.20574220e-05, 2.19940474e-01, 5.32813401e-05,
       1.22928440e-01, 1.48724429e+01, 6.81804492e-05, 6.53256901e-05,
       1.55084339e+01, 1.54220970e-02, 7.63454566e-05, 1.26099403e-03,
       0.00000000e+00, 4.08293836e-07, 8.34467881e-04, 4.49400827e+04,
       6.05124011e+04, 1.45828938e-01, 1.46118156e-01, 5.39414093e-02,
       5.41308521e-02, 1.51551218e-01, 6.84170094e-03, 1.97569872e-02,
       4.09867958e+03, 1.12175120e+04, 1.69073904e-01, 1.17816269e-02,
       2.31349138e-01, 1.70236904e-03, 1.45800386e-01, 1.46059565e-01,
       5.33345749e-02, 5.33506914e-02])
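The reduce-based computation above is the population variance of each feature. As a sanity check, the same formula can be reproduced locally with NumPy on a small toy matrix (the values below are made up for illustration):

```python
import numpy as np

# toy matrix: 4 connections x 3 features (illustrative values only)
X = np.array([[0.0, 181.0, 5450.0],
              [0.0, 239.0, 486.0],
              [2.0, 235.0, 1337.0],
              [0.0, 219.0, 1970.0]])

n = X.shape[0]
means = X.sum(axis=0) / n
# population variance, mirroring the reduce(add)-based Spark computation
vars_ = ((X - means) ** 2).sum(axis=0) / n

# NumPy's built-in population variance agrees with the manual formula
assert np.allclose(vars_, X.var(axis=0))
print(vars_)
```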
For example, for the second feature, we can count how many connections exceed ten times its mean value:
mean = thedata.map(lambda x: x[1]).reduce(add) / n
print(thedata.filter(lambda x: x[1] > 10*mean).count())
4499
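The same filter can be sketched locally without Spark; the toy feature values below are chosen purely to illustrate a single extreme connection among many ordinary ones:

```python
import numpy as np

# hypothetical second-feature values: 19 ordinary connections and one extreme one
feature = np.array([1.0] * 19 + [1000.0])

mean = feature.sum() / len(feature)   # 50.95
# count the connections whose value exceeds ten times the mean
count = (feature > 10 * mean).sum()
print(count)  # 1
```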
We now identify the three features that vary the most, so that we can plot the data along them:
indices_of_variance = [t[0] for t in sorted(enumerate(vars_), key=lambda x: x[1])[-3:]]
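This sort-and-slice idiom returns the indices of the three largest variances. It can be verified locally with toy variances (illustrative values only):

```python
import numpy as np

# hypothetical per-feature variances
vars_ = np.array([5.0, 0.1, 300.0, 42.0, 7.0])

# sort (index, variance) pairs by variance and keep the last three,
# i.e. the indices of the three largest variances, in ascending order
indices_of_variance = [t[0] for t in sorted(enumerate(vars_), key=lambda x: x[1])[-3:]]
print(indices_of_variance)  # [4, 3, 2]
```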
dataprojected = thedata.randomSplit([10, 90])[0]  # random 10% sample
# separate the points into two RDDs, one per cluster
rdd0 = thedata.filter(lambda point: k_clusters.predict(point)==0)
rdd1 = thedata.filter(lambda point: k_clusters.predict(point)==1)
center_0 = k_clusters.centers[0]
center_1 = k_clusters.centers[1]
cluster_0 = rdd0.take(5)
cluster_1 = rdd1.take(5)
cluster_0_projected = np.array([[point[i] for i in indices_of_variance] for point in cluster_0])
cluster_1_projected = np.array([[point[i] for i in indices_of_variance] for point in cluster_1])
M = max(max(cluster_1_projected.flatten()), max(cluster_0_projected.flatten()))
m = min(min(cluster_1_projected.flatten()), min(cluster_0_projected.flatten()))
fig2plot = plt.figure(figsize=(8, 8))
pltx = fig2plot.add_subplot(111, projection='3d')
pltx.scatter(cluster_0_projected[:, 0], cluster_0_projected[:, 1], cluster_0_projected[:, 2], c="b")
pltx.scatter(cluster_1_projected[:, 0], cluster_1_projected[:, 1], cluster_1_projected[:, 2], c="r")
pltx.set_xlim(m, M)
pltx.set_ylim(m, M)
pltx.set_zlim(m, M)
pltx.legend(["cluster 0", "cluster 1"])
The graph we get from the preceding is as follows:
We see that cluster 1 contains far more elements than cluster 0, and that the elements of cluster 0 lie far from the center of the data, which is indicative of the imbalance in the data.
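This imbalance can be illustrated locally with a nearest-center assignment, which is what KMeansModel.predict does under the hood. The centers and points below are invented for illustration: one center sits in a far-out region, the other in the bulk of the data:

```python
import numpy as np

# hypothetical 2-cluster centers in a 2-feature space (illustrative only)
centers = np.array([[100.0, 100.0],   # center 0: far-out region
                    [0.0, 0.0]])      # center 1: bulk of the data

# toy data: 50 points near the origin plus one extreme point
rng = np.random.RandomState(0)
points = np.vstack([rng.normal(0, 1, size=(50, 2)),
                    [[120.0, 95.0]]])

# assign each point to its nearest center (mimicking KMeansModel.predict)
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)

counts = np.bincount(labels, minlength=2)
print(counts)  # cluster 0 captures only the single extreme point
```

When one cluster ends up holding only a handful of extreme points, as here, the clustering mostly separates outliers from the bulk rather than revealing structure within the bulk, which motivates trying larger values of k.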