Data cleansing and transformation

Our dataset is largely clean, so we will transform the data directly into more meaningful forms. For example, the timestamps in the data are in epoch format, which is alternatively known as Unix or POSIX time. We will convert this to the date-time format that we discussed previously, as shown in the following:

import time
time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(1521388078))

Out: '2018-03-18 21:17:58'

We perform the preceding operation on the Time column and store the result in a new column called Newtime:

pdata_frame['Newtime'] = pdata_frame['Time'].apply(lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(float(x))))
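
The row-by-row apply works, but it can be slow on large captures. A vectorized alternative (a minimal sketch, assuming the Time column holds epoch seconds) is pandas' own pd.to_datetime; note that it returns naive UTC timestamps, whereas time.localtime above uses the local time zone:

import pandas as pd

# Vectorized conversion of epoch seconds to datetime64 values (UTC)
pdata_frame['Newtime'] = pd.to_datetime(pdata_frame['Time'].astype(float), unit='s')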

Once we have transformed the timestamp into a more readable format, we look at the other data columns. Since they already look clean and well formed, we leave them as they are. The volume column is the next one we examine. We aggregate volume by the hour in the same way and plot it with the following code:

import matplotlib.pyplot as plt
plt.scatter(pdata_frame['time'], pdata_frame['volume'])
plt.show()  # may not be needed in interactive or IPython sessions
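
The scatter plot shows the raw volume over time. To compute the hourly aggregate described above, one option (a sketch, assuming the converted Newtime column and a volume column; adjust the names to your schema) is to resample on the timestamp:

import pandas as pd

# Hourly totals of volume, indexed by the converted timestamp
hourly = (pdata_frame
          .assign(Newtime=pd.to_datetime(pdata_frame['Newtime']))
          .set_index('Newtime')
          .resample('h')['volume']
          .sum())

plt.plot(hourly.index, hourly.values)
plt.xlabel('hour')
plt.ylabel('total volume')
plt.show()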

To carry out any further analysis on the data, we need to aggregate the data to generate features.

We extract the following features; a short aggregation sketch follows the list:

  • For any source, we compute the volume of packets exchanged per minute
  • For any source, we count the total number of connections received per minute
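
A rough sketch of both per-minute features, assuming hypothetical column names Source (the source IP), Newtime (the converted timestamp), and volume; adjust these to match the actual schema:

import pandas as pd

df = pdata_frame.copy()
df['Newtime'] = pd.to_datetime(df['Newtime'])
df['minute'] = df['Newtime'].dt.floor('min')    # truncate timestamps to the minute

features = df.groupby(['Source', 'minute']).agg(
    volume_per_minute=('volume', 'sum'),        # volume of packets exchanged per minute
    connections_per_minute=('Source', 'size'),  # number of connections received per minute
).reset_index()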

The following image shows how the unprocessed data is turned into input for the feature engine through this analysis:
