Developing a predictive model for time series data

RNNs, and LSTM models in particular, are often difficult topics to understand. Time series prediction is a useful application of RNNs because of the temporal dependencies in the data, and time series data is abundantly available online. In this section, we will work through an example of using an LSTM to handle time series data: our network will learn to predict the number of airline passengers in the future.

Description of the dataset

The dataset that I will be using contains monthly totals of international airline passengers from 1949 to 1960. The dataset can be downloaded from https://datamarket.com/data/set/22u3/international-airline-passengers-monthly-totals-in#!ds=22u3&display=line. The following screenshot shows the metadata of the international airline passengers dataset:

Figure 14: Metadata of the international airline passengers (source: https://datamarket.com/)

You can download the data by choosing the Export tab and then selecting CSV (,) in the Export group. You will have to edit the CSV file manually to remove the header line as well as the additional footer line. I have downloaded the data and saved it as international-airline-passengers.csv.
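
If you prefer not to edit the file by hand, a few lines of Python can do the same cleanup. This is only a minimal sketch; the raw filename and the assumption that just the first and last lines need removing are mine, so adjust them to match your download:

# Hypothetical cleanup script: strip the header line and the trailing
# footer line from the raw export (filenames are assumptions).
with open('international-airline-passengers-raw.csv') as src:
    lines = src.read().splitlines()
with open('international-airline-passengers.csv', 'w') as dst:
    dst.write('\n'.join(lines[1:-1]) + '\n')

The following graph shows a plot of the time series data: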

Figure 15: International airline passengers: monthly totals in thousands from Jan 49 – Dec 60 

Pre-processing and exploratory analysis

Now let's load the original dataset and look at some basic facts. First, we load the time series as follows (see time_series_preprocessor.py):

import csv
import numpy as np

Next, we define load_series(), a user-defined method that loads the time series and normalizes it:

def load_series(filename, series_idx=1):
    try:
        with open(filename) as csvfile:
            csvreader = csv.reader(csvfile)
            # Read the passenger counts, skipping any empty rows
            data = [float(row[series_idx]) for row in csvreader if len(row) > 0]
            # Normalize to zero mean and unit variance
            normalized_data = (np.array(data) - np.mean(data)) / np.std(data)
        return normalized_data
    except IOError:
        print("Error occurred while reading {}".format(filename))
        return None

Now let's invoke the preceding method to load the time series, print its values, and then check its shape (run $ python3 plot_time_series.py in a Terminal):

import csv
import numpy as np
import matplotlib.pyplot as plt
import time_series_preprocessor as tsp
timeseries = tsp.load_series('international-airline-passengers.csv')
print(timeseries)

The following is the output of the preceding code:

>>>
[-1.40777884 -1.35759023 -1.24048348 -1.26557778 -1.33249593 -1.21538918
 -1.10664719 -1.10664719 -1.20702441 -1.34922546 -1.47469699 -1.35759023
…
 2.85825285  2.72441656  1.9046693   1.5115252   0.91762667  1.26894693]

Next, we print the shape of the series:

print(np.shape(timeseries))
>>>
(144,)

That means there are 144 entries in the time series. Let's plot the time series:

plt.figure()
plt.plot(timeseries, label='Normalized series')
plt.title('Normalized time series')
plt.xlabel('ID')
plt.ylabel('Normalized value')
plt.legend(loc='upper left')
plt.show()

The following is the output of the preceding code:


Figure 16: Time series (y-axis, normalized value versus x-axis, ID)

Once we have loaded the time series dataset, the next task is to prepare the training set. Since we will be evaluating the model multiple times to predict future values, we need separate training and test data. To be more specific, the split_data() function divides the dataset into two parts: 75% for training and 25% for testing:

def split_data(data, percent_train):
    num_rows = len(data)
    train_data, test_data = [], []
    for idx, row in enumerate(data):
        if idx < num_rows * percent_train:
            train_data.append(row)
        else:
            test_data.append(row)
    return train_data, test_data
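
For example, on our 144-point series a 75/25 split yields 108 training points and 36 test points (assuming timeseries holds the normalized series loaded earlier):

train_data, test_data = split_data(timeseries, percent_train=0.75)
print(len(train_data), len(test_data))
>>>
108 36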

LSTM predictive model

Once we have our dataset ready, we can train the predictor by loading the data in an acceptable format. For this step, I have written a Python script called TimeSeriesPredictor.py, which starts by importing the necessary libraries and modules (run $ python3 TimeSeriesPredictor.py in a Terminal to execute this script):

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn, rnn_cell
import time_series_preprocessor as tsp
import matplotlib.pyplot as plt

Next, we define the hyperparameters for the LSTM network (tune them as needed):

input_dim = 1
seq_size = 5
hidden_dim = 5
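
To make the role of seq_size concrete, here is a small standalone sketch (not part of the book's scripts) of the sliding-window scheme used later in main(): each input window covers seq_size consecutive values, and its target is the same window shifted one step into the future:

import numpy as np

seq_size = 5
series = np.arange(10, dtype=np.float32)  # toy stand-in for the real series
windows_x = [series[i:i+seq_size] for i in range(len(series) - seq_size - 1)]
windows_y = [series[i+1:i+seq_size+1] for i in range(len(series) - seq_size - 1)]
print(windows_x[0], windows_y[0])
>>>
[0. 1. 2. 3. 4.] [1. 2. 3. 4. 5.]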

We now define the weight and bias variables of the output layer, along with the input placeholders:

W_out = tf.get_variable("W_out", shape=[hidden_dim, 1], dtype=tf.float32)
b_out = tf.get_variable("b_out", shape=[1], dtype=tf.float32)
x = tf.placeholder(tf.float32, [None, seq_size, input_dim])
y = tf.placeholder(tf.float32, [None, seq_size])

The next task is to construct the LSTM network. The method LSTM_Model() shown next takes no arguments; instead, it works with the placeholders and variables defined above:

  • x: the input placeholder of shape [batch_size, seq_size, input_dim]
  • W_out: the fully-connected output layer weight matrix
  • b_out: the fully-connected output layer bias vector

Now let's see the body of the method:

def LSTM_Model():
    # A single-layer LSTM followed by a per-time-step linear output layer
    cell = rnn_cell.BasicLSTMCell(hidden_dim)
    outputs, states = rnn.dynamic_rnn(cell, x, dtype=tf.float32)
    num_examples = tf.shape(x)[0]
    # Replicate the output weights across the batch so every sequence
    # is projected with the same fully-connected layer
    W_repeated = tf.tile(tf.expand_dims(W_out, 0), [num_examples, 1, 1])
    out = tf.matmul(outputs, W_repeated) + b_out
    out = tf.squeeze(out)
    return out
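
The tile-and-matmul trick applies the same output layer to every time step of every sequence. A minimal NumPy sketch of the shape bookkeeping (the sizes here are hypothetical, chosen to match seq_size and hidden_dim) may make this clearer:

import numpy as np

batch, seq, hidden = 2, 5, 5                              # hypothetical sizes
outputs = np.random.randn(batch, seq, hidden)             # stand-in for dynamic_rnn outputs
W = np.random.randn(hidden, 1)                            # stand-in for W_out
W_repeated = np.tile(W[np.newaxis, :, :], (batch, 1, 1))  # [batch, hidden, 1]
out = np.squeeze(np.matmul(outputs, W_repeated))          # [batch, seq, 1] -> [batch, seq]
print(out.shape)
>>>
(2, 5)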

Additionally, we create three empty lists to store the training loss, test loss, and the step:

train_loss = []
test_loss = []
step_list = []

The next method, called trainNetwork(), is used to train the LSTM network:

def trainNetwork(train_x, train_y, test_x, test_y):
    with tf.Session() as sess:
        # Allow later calls to LSTM_Model() to reuse the existing variables
        tf.get_variable_scope().reuse_variables()
        sess.run(tf.global_variables_initializer())
        # Early stopping: give up once the test error fails to improve
        # for max_patience consecutive evaluations
        max_patience = 3
        patience = max_patience
        min_test_err = float('inf')
        step = 0
        while patience > 0:
            _, train_err = sess.run([train_op, cost], feed_dict={x: train_x, y: train_y})
            if step % 100 == 0:
                test_err = sess.run(cost, feed_dict={x: test_x, y: test_y})
                print('step: {}\t\ttrain err: {}\t\ttest err: {}'.format(step, train_err, test_err))
                train_loss.append(train_err)
                test_loss.append(test_err)
                step_list.append(step)
                if test_err < min_test_err:
                    min_test_err = test_err
                    patience = max_patience
                else:
                    patience -= 1
            step += 1
        save_path = saver.save(sess, 'model.ckpt')
        print('Model saved to {}'.format(save_path))

The next task is to create the cost function (the mean squared error between the model's output and the target sequence) and to instantiate the train_op optimizer:

cost = tf.reduce_mean(tf.square(LSTM_Model() - y))
train_op = tf.train.AdamOptimizer(learning_rate=0.003).minimize(cost)

Additionally, here we have an auxiliary op for saving the model:

saver = tf.train.Saver()

Now that we have created the model, the next method, called testLSTM(), is used to test the prediction power of the model on the test set:

def testLSTM(sess, test_x):
    tf.get_variable_scope().reuse_variables()
    saver.restore(sess, 'model.ckpt')
    output = sess.run(LSTM_Model(), feed_dict={x: test_x})
    return output

To plot the predicted results, we have a function called plot_results(). The signature is as follows:

def plot_results(train_x, predictions, actual, filename):
    plt.figure()
    num_train = len(train_x)
    plt.plot(list(range(num_train)), train_x, color='b', label='training data')
    plt.plot(list(range(num_train, num_train + len(predictions))), predictions, color='r', label='predicted')
    plt.plot(list(range(num_train, num_train + len(actual))), actual, color='g', label='test data')
    plt.legend()
    if filename is not None:
        plt.savefig(filename)
    else:
        plt.show()

Model evaluation

To evaluate the model, we have a method called main() that invokes the preceding methods to create and train the LSTM network. The workflow of the code is as follows:

  1. Load the data
  2. Slide a window through the time series data to construct the training dataset
  3. Do the same window sliding strategy to construct the test dataset
  4. Train a model on the training dataset
  5. Visualize the model's performance

Let's see the signature of the method:

def main():
    data = tsp.load_series('international-airline-passengers.csv')
    train_data, actual_vals = tsp.split_data(data=data, percent_train=0.75)
    # Slide a window of length seq_size over the training portion;
    # each target sequence is the input shifted one step ahead
    train_x, train_y = [], []
    for i in range(len(train_data) - seq_size - 1):
        train_x.append(np.expand_dims(train_data[i:i+seq_size], axis=1).tolist())
        train_y.append(train_data[i+1:i+seq_size+1])
    # Apply the same windowing strategy to the test portion
    test_x, test_y = [], []
    for i in range(len(actual_vals) - seq_size - 1):
        test_x.append(np.expand_dims(actual_vals[i:i+seq_size], axis=1).tolist())
        test_y.append(actual_vals[i+1:i+seq_size+1])
    trainNetwork(train_x, train_y, test_x, test_y)
    with tf.Session() as sess:
        # Prediction results of the model given the ground truth values
        predicted_vals = testLSTM(sess, test_x)[:,0]
        plot_results(train_data, predicted_vals, actual_vals, 'ground_truth_prediction.png')
        # Closed-loop prediction: feed the model its own predictions,
        # starting from the last training window
        prev_seq = train_x[-1]
        predicted_vals = []
        for i in range(1000):
            next_seq = testLSTM(sess, [prev_seq])
            predicted_vals.append(next_seq[-1])
            prev_seq = np.vstack((prev_seq[1:], next_seq[-1]))
        plot_results(train_data, predicted_vals, actual_vals, 'prediction_on_train_set.png')
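
The closed-loop forecast at the end of main() deserves a closer look: at each step the model's latest prediction is appended to the input window and the oldest value is dropped, so the network keeps feeding on its own output. A minimal NumPy sketch of that rolling update (with a dummy value standing in for the model's prediction) shows the mechanics:

import numpy as np

prev_seq = np.arange(5, dtype=np.float32).reshape(5, 1)  # one window, shape [seq_size, 1]
next_val = np.float32(99.0)                              # dummy stand-in for the prediction
prev_seq = np.vstack((prev_seq[1:], next_val))           # drop the oldest, append the newest
print(prev_seq.ravel())
>>>
[ 1.  2.  3.  4. 99.]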

Finally, we call the main() method to perform the training. Once training is complete, it plots two sets of results: the model's predictions given the ground truth values, and its predictions when only the training data was available:


Figure 17: The results of the model on the ground truth values

The next graph shows the prediction results when only the training data is given. The model has less information available in this setting, but it still did a good job of matching the trends in the data:

Figure 18: The results of the model on the training set

The following method helps us plot the training and the test error:

def plot_error():
    # Plot training loss over time
    plt.plot(step_list, train_loss, 'r--', label='LSTM training loss per iteration', linewidth=4)
    plt.title('LSTM training loss per iteration')
    plt.xlabel('Iteration')
    plt.ylabel('Training loss')
    plt.legend(loc='upper right')
    plt.show()

    # Plot test loss over time
    plt.plot(step_list, test_loss, 'r--', label='LSTM test loss per iteration', linewidth=4)
    plt.title('LSTM test loss per iteration')
    plt.xlabel('Iteration')
    plt.ylabel('Test loss')
    plt.legend(loc='upper left')
    plt.show()

Now we call the preceding method as follows:

plot_error()

Figure 19: a) LSTM training loss per iteration, b) LSTM test loss per iteration

We can use a time series predictor to reproduce realistic fluctuations in data. Now you can prepare your own dataset and perform some other predictive analytics. The next example is about sentiment analysis on product and movie review datasets, where we will also see how to develop a more complex RNN using an LSTM network.
