Data transformation

Consider the following pieces of a single dataset:

import pandas as pd

ts_complete_df = pd.read_csv("sensor_df.csv")

 

The following screenshot shows the head of sensor data that contains the time series components of unequal length:

Head of sensor data containing time series components of unequal length

The following screenshot shows the tail of sensor data that contains the time series components of unequal length:

Tail of sensor data containing time series components of unequal length

The dataset consists of time series data captured at 10-minute intervals for 314 different devices, and each device has been recorded for a different duration. Let's examine how long data has been captured for each device:

ts_complete_df.groupby("ID").size().describe()

The following is the output:

Summary of sensor data

The lengths of the series vary drastically across devices. Several time series techniques, such as shapelet transformation and Long Short-Term Memory (LSTM) networks, require every series to have the same length. The following code snippet truncates each device's series to the length of the shortest series, which is the longest common length possible:

# length of the shortest series across all devices
min_len = ts_complete_df.groupby("ID").size().min()

truncated = []
for i in range(1, 315):
    # keep only the first min_len rows of device i
    truncated.append(ts_complete_df[ts_complete_df["ID"] == i].iloc[0:min_len, :])
truncate_df = pd.concat(truncated)
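The same truncation can also be written without an explicit loop. The following is a minimal sketch using groupby().head(), assuming the rows within each device are already ordered by time:

# Keep only the first min_len rows of every device in a single call.
# Equivalent to the loop above, assuming each device's rows are time-ordered.
min_len = ts_complete_df.groupby("ID").size().min()
truncate_df = ts_complete_df.groupby("ID").head(min_len)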

After truncating, the series lengths are uniform, which can be verified by running the following:

truncate_df.groupby("ID").size().describe()  

The following is the output:

Summary of sensor data after all the time series components have been made equal in length

Let's perform feature extraction for the following univariate time series data:

 ts = pd.read_csv("D:datatest.txt").iloc[:,0:2].set_index("date") ts

The following is the output:

 Reading the occupancy data and setting the datetime column as an index
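The date column read in this way remains a plain string index. If time-aware operations such as resampling are needed later, it can be converted explicitly; the following is a minimal sketch that assumes the same ts DataFrame as above:

# Convert the string index to a DatetimeIndex (optional for the
# count-based rolling windows used below, but useful for resampling)
ts.index = pd.to_datetime(ts.index)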

Feature extraction is vital when applying machine learning to time series data, as it helps obtain better performance. Here, let's extract the rolling mean, rolling standard deviation, and gradient of the temperature data:

feat_ext = pd.concat([ts.rolling(5).mean(), ts.rolling(5).std(),
                      (ts - ts.shift(5)) / ts], axis=1).iloc[5:, :]
feat_ext.columns = ['5_day_mean', '5_day_std', '5_day_gradient']
feat_ext.head(5)

The following is the output:

 Feature (5_day_mean, 5_day_std) generation using rolling functions

The first five rows, which contain NA values introduced by the rolling and shift operations, have been dropped in the feature extraction process. Here, the features have been extracted over a rolling window of 5 days. Using a similar method, it is possible to extract hundreds of features from a time series variable.
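The following is a minimal sketch of how such a larger feature set might be scripted for the same ts DataFrame; the window sizes and statistic names here are illustrative choices, not part of the original example:

# Build a wider feature set by looping over several window sizes
# and rolling statistics on the temperature series.
temp = ts.iloc[:, 0]
windows = [5, 10, 30]                    # illustrative window sizes
stats = ["mean", "std", "min", "max"]    # illustrative rolling statistics

features = {}
for w in windows:
    roll = temp.rolling(w)
    for s in stats:
        features[f"{w}_period_{s}"] = getattr(roll, s)()
    # relative change over the window, analogous to the gradient above
    features[f"{w}_period_gradient"] = (temp - temp.shift(w)) / temp

many_feat = pd.DataFrame(features).dropna()
many_feat.head()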
