Consider the following pieces of a single dataset:
import pandas as pd
ts_complete_df = pd.read_csv("sensor_df.csv")
The following screenshot shows the head of sensor data that contains the time series components of unequal length:
The following screenshot shows the tail of sensor data that contains the time series components of unequal length:
The dataset consists of time series data recorded at 10-minute intervals for 314 different devices. Data was captured for a different duration on each device. Let's examine the number of records captured for each device:
ts_complete_df.groupby("ID").size().describe()
The following is the output:
The length of the data varies drastically from device to device. Several time series techniques, such as the shapelet transform and Long Short-Term Memory (LSTM) networks, require every series to have the same length. The following code snippet truncates the data for each device to the length of the shortest series, which is the greatest length that all devices have in common:
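min_len = ts_complete_df.groupby("ID").size().min()
# Collect the first min_len rows of each device, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
truncated = []
for i in range(1, 315):
    df = ts_complete_df[ts_complete_df["ID"] == i].iloc[0:min_len, :]
    truncated.append(df)
truncate_df = pd.concat(truncated)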
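As an alternative to an explicit loop, the same truncation can be sketched with `groupby(...).head(n)`, which keeps the first `n` rows of every group and does not assume that the IDs run from 1 to 314. The small frame below is made-up data used only for illustration; the `value` column and the helper name `truncate_to_shortest` are assumptions, not part of the sensor dataset.

```python
import pandas as pd

def truncate_to_shortest(df, id_col="ID"):
    """Keep the first min_len rows of every group, where min_len is the shortest group length."""
    min_len = df.groupby(id_col).size().min()
    return df.groupby(id_col).head(min_len)

# Toy data: device 1 has 4 rows, device 2 has 2 rows, device 3 has 3 rows
toy_df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 3, 3, 3],
    "value": [10, 11, 12, 13, 20, 21, 30, 31, 32],
})

toy_truncated = truncate_to_shortest(toy_df)
lengths = toy_truncated.groupby("ID").size()
print(lengths)
```

Because `head` preserves the original row order, the result is not re-sorted by ID as it is in the loop version; for data already grouped by device, the two approaches should give the same rows.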
After truncation, every device has the same number of rows. This can be verified by running the following:
truncate_df.groupby("ID").size().describe()
The following is the output:
Let's perform feature extraction for the following univariate time series data:
ts = pd.read_csv("D:datatest.txt").iloc[:,0:2].set_index("date")
ts
The following is the output:
Feature extraction is vital for performing machine learning with time series data in order to obtain better performance metrics. Here, let's extract the rolling mean, rolling standard deviation, and gradient for the temperature data:
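# Gradient: relative change versus the value 5 rows earlier
# (shift(5) looks back; shift(-5) would use future values and leave NA values at the end)
gradient = (ts - ts.shift(5)) / ts
feat_ext = pd.concat([ts.rolling(5).mean(), ts.rolling(5).std(), gradient], axis=1).iloc[5:, :]
feat_ext.columns = ['5_day_mean', '5_day_std', '5_day_gradient']
feat_ext.head(5)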
The following is the output:
The first 5 rows with NA values have been dropped in the feature extraction process. Here, the features have been extracted for a rolling window of 5 days. Using a similar method, it is possible to extract hundreds of features from a time series variable.
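The same pattern extends to further features. The following is a minimal sketch on a made-up series; the helper name `rolling_features`, the column name `Temperature`, the dates, and the choice of additional statistics (rolling minimum and maximum) are assumptions for illustration only. It uses a backward-looking relative change, so each feature depends only on current and past observations, and `dropna()` to discard rows whose window is incomplete.

```python
import pandas as pd

def rolling_features(series, window=5):
    """Build rolling-window features for a single time series."""
    rolled = series.rolling(window)
    feats = pd.DataFrame({
        "mean": rolled.mean(),
        "std": rolled.std(),
        "min": rolled.min(),
        "max": rolled.max(),
        # Relative change versus the value `window` steps earlier
        "gradient": (series - series.shift(window)) / series,
    })
    # Rows without a full window of history contain NaN and are dropped
    return feats.dropna()

# Toy series with a hypothetical name and index
toy_ts = pd.Series(
    [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0],
    index=pd.date_range("2015-02-02", periods=8, freq="D"),
    name="Temperature",
)

toy_feats = rolling_features(toy_ts, window=5)
print(toy_feats)
```

Adding another feature is then a matter of adding one more entry to the dictionary, for example a rolling median or a second window size.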