Performance tips

In many financial applications, speed is of the essence. Machine learning, especially deep learning, has a reputation for being slow. However, recently, there have been many advances in hardware and software that enable faster machine learning applications.

Using the right hardware for your problem

A lot of progress in deep learning has been driven by the use of graphics processing units (GPUs). GPUs enable highly parallel computation at the expense of operating frequency. Recently, multiple manufacturers have started working on specialized deep learning hardware. Most of the time, GPUs are a good choice for deep learning models or other parallelizable algorithms, such as XGBoost gradient-boosted trees. However, not all applications benefit equally.

In natural language processing (NLP), for instance, batch sizes often need to be small, so the parallelization of operations does not work as well since not that many samples are processed at the same time. Additionally, some words appear much more often than others, giving large benefits to caching frequent words. Thus, many NLP tasks run faster on CPUs than GPUs. If you can work with large batches, however, a GPU or even specialized hardware is preferable.

Making use of distributed training with TF estimators

Keras is not only a standalone library that can use TensorFlow, but it is also an integrated part of TensorFlow. TensorFlow features multiple high-level APIs that can be used to create and train models.

From version 1.8 onward, the estimator API features distributed training across multiple machines, while the Keras API does not yet. Estimators also come with a number of other speed-up tricks, so they are usually faster than Keras models.

Note

You can find information on how to set up your cluster for distributed TensorFlow here: https://www.tensorflow.org/deploy/distributed.

By changing the import statements, you can easily use Keras as part of TensorFlow without having to change your main code:

import tensorflow as tf
from tensorflow.python import keras

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Activation

In this section, we will create a model to learn the MNIST problem before training it using the estimator API. First, we load and prepare the dataset as usual. For more efficient dataset loading, see the next section:

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 28 * 28)
x_train = x_train / 255
y_train = keras.utils.to_categorical(y_train)

We can create a Keras model as usual:

model = Sequential()
model.add(Dense(786, input_dim=28*28))
model.add(Activation('relu'))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(160))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(optimizer=keras.optimizers.SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics=['accuracy'])

The TensorFlow version of Keras offers a one-line conversion to a TF estimator:

estimator = keras.estimator.model_to_estimator(keras_model=model)

To set up training, we need to know the name assigned to the model input. We can quickly check this with the following code:

model.input_names
['dense_1_input']

Estimators get trained with an input function. The input function allows us to specify a whole pipeline, which will be executed efficiently. In this case, we only want an input function that yields our training set:

train_input_fn = tf.estimator.inputs.numpy_input_fn(x={'dense_1_input': x_train}, y=y_train, num_epochs=1, shuffle=False)

Finally, we train the estimator on the input. And that is it; you can now utilize distributed TensorFlow with estimators:

estimator.train(input_fn=train_input_fn, steps=2000)
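
The estimator.train call above still runs on a single machine. For actual distributed training, each machine in the cluster describes its role in the TF_CONFIG environment variable and then calls tf.estimator.train_and_evaluate. The following is a minimal sketch: the host names are made up, and the training input function is reused for evaluation purely for illustration:

import os
import json

# Hypothetical cluster definition; every host runs the same script but sets
# its own 'task' entry before launching.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'chief': ['host1:2222'],
        'worker': ['host2:2222'],
        'ps': ['host3:2222']
    },
    'task': {'type': 'chief', 'index': 0}
})

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=2000)
eval_spec = tf.estimator.EvalSpec(input_fn=train_input_fn)

# train_and_evaluate reads TF_CONFIG and distributes training accordingly
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)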

Using optimized layers such as CuDNNLSTM

You will often find that someone created a special layer optimized to perform certain tasks on certain hardware. Keras' CuDNNLSTM layer, for example, only runs on GPUs that support CUDA, NVIDIA's parallel computing platform for GPUs.

When you lock in your model to specialized hardware, you can often make significant gains in your performance. If you have the resources, it might even make sense to write your own specialized layer in CUDA. If you want to change hardware later, you can usually export weights and import them to a different layer.
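
As a minimal sketch, here is a model using CuDNNLSTM. The sequence length of 100 and the 8 input features are made-up values, and depending on your setup the layer is imported from keras.layers or tensorflow.python.keras.layers:

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

# Hypothetical input: sequences of 100 timesteps with 8 features each
rnn = Sequential()
rnn.add(CuDNNLSTM(32, input_shape=(100, 8)))  # only runs on CUDA-capable GPUs
rnn.add(Dense(1))
rnn.compile(optimizer='adam', loss='mse')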

Optimizing your pipeline

With the right hardware and optimized software in place, your model often ceases to be the bottleneck. You should check your GPU utilization by entering the following command in your Terminal:

nvidia-smi -l 2

If your GPU utilization is not at around 80% to 100%, you can gain significantly by optimizing your pipeline. There are several steps you can take to optimize your pipeline:

  • Create a pipeline running parallel to the model: Otherwise, your GPU will be idle while the data is loading. Keras does this by default. If you have a generator and want to have a larger queue of data to be held ready for preprocessing, change the max_queue_size parameter of the fit_generator method. If you set the workers argument of the fit_generator method to zero, the generator will run on the main thread, which slows things down.
  • Preprocess data in parallel: Even if you have a generator working independently of the model training, it might not keep up with the model. So, it is better to run multiple generators in parallel. In Keras, you can do this by setting use_multiprocessing to true and setting the number of workers to anything larger than one, preferably to the number of CPUs available. Let's look at an example:
    model.fit_generator(generator, steps_per_epoch=40, workers=4, use_multiprocessing=True)

    You need to make sure your generator is thread safe. You can make any generator thread safe with the following code snippet:

    import threading
    
    class thread_safe_iter:                   #1
        def __init__(self, it):
            self.it = it
            self.lock = threading.Lock()
    
        def __iter__(self):
            return self
    
        def __next__(self):                   #2
            with self.lock:
                return next(self.it)
    
    def thread_safe_generator(f):             #3
        def g(*a, **kw):
            return thread_safe_iter(f(*a, **kw))
        return g
    
    @thread_safe_generator
    def gen():
        # Minimal example body; in practice, yield batches of training data here
        while True:
            yield x_train[:32], y_train[:32]

    Let's look at the three key components of the preceding code:

    1. The thread_safe_iter class makes any iterator thread safe by locking threads when the iterator has to produce the next yield.
    2. When next() is called on the iterator, the iterator's lock is acquired. Locking means that no other thread can access the iterator while it is locked. Once the lock is acquired, the iterator yields the next element.
    3. thread_safe_generator is a Python decorator that turns any iterator it decorates into a thread-safe iterator. It takes the function, passes it to the thread-safe iterator, and then returns the thread-safe version of the function.

    You can also use the tf.data API together with an estimator, which does most of the work for you.

  • Combine files into large files: Reading a file takes time. If you have to read thousands of small files, this can significantly slow you down. TensorFlow offers its own data format called TFRecord. You can also just fuse an entire batch into a single NumPy array and save that array instead of every example; see the sketch after this list.
  • Train with the tf.data.Dataset API: If you are using the TensorFlow version of Keras, you can use the Dataset API, which optimizes data loading and processing for you. The Dataset API is the recommended way to load data into TensorFlow. It offers a wide range of ways to load data, for instance, from a CSV file with tf.data.TextLineDataset, or from TFRecord files with tf.data.TFRecordDataset.

    Note

    For a more comprehensive guide to the Dataset API, see https://www.tensorflow.org/get_started/datasets_quickstart.

    In this example, we will use the dataset API with NumPy arrays that we have already loaded into RAM, such as the MNIST database.

    First, we create two plain datasets for data and targets:

    dxtrain = tf.data.Dataset.from_tensor_slices(x_train)
    dytrain = tf.data.Dataset.from_tensor_slices(y_train)

    The map function allows us to perform operations on data before passing it to the model. In this case, we apply one-hot encoding to our targets; if you already one-hot encoded y_train with to_categorical earlier, use the raw integer labels here instead. The transformation could be any function. By setting the num_parallel_calls argument, we can specify how many processes we want to run in parallel:

    def apply_one_hot(z):
        return tf.one_hot(z, 10)
    
    dytrain = dytrain.map(apply_one_hot, num_parallel_calls=4)

    We zip the data and targets into one dataset. We instruct TensorFlow to shuffle the data when loading, keeping 200 instances in memory from which to draw samples. Finally, we make the dataset yield batches of size 32:

    train_data = tf.data.Dataset.zip((dxtrain,dytrain)).shuffle(200).batch(32)

    We can now fit a Keras model on this dataset just as we would fit it to a generator:

    model.fit(train_data, epochs=10, steps_per_epoch=60000 // 32)
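
As a rough sketch of the file-combining idea mentioned above, the snippet below writes a batch of MNIST examples into a single TFRecord file; the file name and the batch size of 1,024 are made up for illustration. Alternatively, you can simply save a whole batch as one NumPy archive:

import numpy as np

# Option 1: fuse 1,024 examples into one TFRecord file instead of many small files
writer = tf.python_io.TFRecordWriter('train_batch_0.tfrecord')
for image, label in zip(x_train[:1024], y_train[:1024]):
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(float_list=tf.train.FloatList(value=image.astype(np.float32).ravel())),
        'label': tf.train.Feature(float_list=tf.train.FloatList(value=label.astype(np.float32).ravel()))
    }))
    writer.write(example.SerializeToString())
writer.close()

# Option 2: save an entire batch as a single NumPy archive
np.savez('train_batch_0.npz', x=x_train[:1024], y=y_train[:1024])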

If you have truly large datasets, the more you can parallelize, the better. Parallelization does come with overhead costs, however, and not every problem actually features huge datasets. In these cases, refrain from trying to do too much in parallel and focus on slimming down your network, using CPUs and keeping all your data in RAM if possible.

Speeding up your code with Cython

Python is a popular language because developing code in Python is easy and fast. However, Python can be slow, which is why many production applications are written in either C or C++. Cython is Python with C data types, which significantly speeds up execution. Using this language, you can write pretty much normal Python code, and Cython converts it to fast-running C code.

Note

You can read the full Cython documentation here: http://cython.readthedocs.io. This section is a short introduction to Cython. If performance is important to your application, you should consider diving deeper.

Say you have a Python function that prints out the Fibonacci series up to a specified point. This code snippet is taken straight from the Python documentation:

from __future__ import print_function
def fib(n):
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()

Note that we have to import the print_function to make sure that print() works in the Python 3 style. To use this snippet with Cython, save it as cython_fib_8_7.pyx.

Now create a new file called 8_7_cython_setup.py:

from distutils.core import setup                   #1
from Cython.Build import cythonize                 #2

setup(                                             #3
    ext_modules=cythonize("cython_fib_8_7.pyx"),
)

The three main features of the code are these:

  1. The setup function is a Python function to create modules, such as the ones you install with pip.
  2. cythonize is a function to turn a pyx Python file into Cython C code.
  3. We create a new module by calling setup and passing it our Cythonized code.

To run this, we now run the following command in a Terminal:

python 8_7_cython_setup.py build_ext --inplace

This will create a C file, a build file, and a compiled module. We can import this module now by running:

import cython_fib_8_7
cython_fib_8_7.fib(1000)

This will print out the Fibonacci numbers up to 1,000. Cython also comes with a handy debugger that shows where Cython has to fall back onto Python code, which will slow things down. Type the following command into your Terminal:

cython -a cython_fib_8_7.pyx

This will create an HTML file that looks similar to this when opened in a browser:

Cython profile

As you can see, Cython has to fall back on Python all the time in our script because we did not specify the types of variables. By letting Cython know what data type a variable has, we can speed up the code significantly. To define a variable with a type, we use cdef:

from __future__ import print_function
def fib(int n):
    cdef int a = 0
    cdef int b = 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()

This snippet already performs better. Further optimization is certainly possible: by calculating the numbers first and only printing them afterward, we can reduce the reliance on Python print statements. Overall, Cython is a great way to keep the development speed and ease of Python while gaining execution speed.
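
One possible variant along these lines, as a rough sketch rather than a tuned implementation, collects the numbers in a list inside the fast loop and prints them with a single call afterward (the list itself is still a Python object, so this is only a partial optimization):

from __future__ import print_function
def fib(int n):
    cdef int a = 0
    cdef int b = 1
    numbers = []                  # Python list, filled inside the typed loop
    while b < n:
        numbers.append(b)
        a, b = b, a + b
    print(' '.join(str(x) for x in numbers))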

Caching frequent requests

An under-appreciated way to make models run faster is to cache frequent requests in a database. You can go so far as to cache millions of predictions in a database and then look them up. This has the advantage that you can make your model as large as you like and expend a lot of computing power to make predictions.

By using a MapReduce database, it is entirely feasible to look up requests in a very large pool of possible requests and predictions. Of course, this requires requests to be somewhat discrete. If you have continuous features, you can round them if precision is not essential.
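
As a minimal in-process sketch of this idea, rounding the features turns continuous requests into discrete cache keys. A real deployment would use an external key-value store rather than a Python dictionary, and model here stands for any trained predictor:

import numpy as np

prediction_cache = {}                 # stand-in for a real key-value store

def cached_predict(model, raw_features, precision=2):
    # Round continuous features so that similar requests share a cache key
    key = tuple(round(float(f), precision) for f in raw_features)
    if key not in prediction_cache:
        features = np.array(key)[np.newaxis, :]
        prediction_cache[key] = model.predict(features)[0]
    return prediction_cache[key]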
