In many financial applications, speed is of the essence. Machine learning, especially deep learning, has a reputation for being slow. However, recently, there have been many advances in hardware and software that enable faster machine learning applications.
A lot of progress in deep learning has been driven by the use of graphics processing units (GPUs). GPUs enable highly parallel computing at the expense of operating frequency. Recently, multiple manufacturers have started working on specialized deep learning hardware. Most of the time, GPUs are a good choice for deep learning models or other parallelizable algorithms such as XGBoost gradient-boosted trees. However, not all applications benefit equally.
In natural language processing (NLP), for instance, batch sizes often need to be small, so the parallelization of operations does not work as well since not that many samples are processed at the same time. Additionally, some words appear much more often than others, giving large benefits to caching frequent words. Thus, many NLP tasks run faster on CPUs than GPUs. If you can work with large batches, however, a GPU or even specialized hardware is preferable.
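The benefit of caching frequent words can be illustrated with a minimal sketch using Python's built-in functools.lru_cache. The embed function here is a hypothetical stand-in for an expensive per-word computation (it is not part of any library mentioned in this chapter), and the token list is made up for illustration:

```python
from functools import lru_cache


@lru_cache(maxsize=10000)
def embed(word):
    # Hypothetical stand-in for an expensive per-word computation,
    # such as looking up or computing a word vector.
    return tuple(float(ord(c)) for c in word)


tokens = "the cat sat on the mat the end".split()
vectors = [embed(t) for t in tokens]

# cache_info() reports how many calls were answered from the cache.
info = embed.cache_info()
print(info.hits, info.misses)
```

Because "the" appears three times, only its first occurrence triggers the computation. In real text, where a small set of words accounts for most tokens, the share of cache hits tends to be much higher.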
Keras is not only a standalone library that can use TensorFlow, but it is also an integrated part of TensorFlow. TensorFlow features multiple high-level APIs that can be used to create and train models.
From version 1.8 onward, the estimator API supports distributed training on multiple machines, a feature the Keras API does not offer yet. Estimators also have a number of other speed-up tricks, so they are usually faster than Keras models.
You can find information on how to set up your cluster for distributed TensorFlow here: https://www.tensorflow.org/deploy/distributed.
By changing the import statements, you can easily use Keras as part of TensorFlow and don't have to change your main code:
import tensorflow as tf
from tensorflow.python import keras
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Activation
In this section, we will create a model to learn the MNIST problem before training it using the estimator API. First, we load and prepare the dataset as usual. For more efficient dataset loading, see the next section:
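(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train.shape = (60000, 28 * 28)  # flatten each 28x28 image into one vector
x_train = x_train / 255  # scale pixel values to the range [0, 1]
y_train = keras.utils.to_categorical(y_train)  # one-hot encode the labels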
We can create a Keras model as usual:
model = Sequential()
model.add(Dense(786, input_dim=28*28))
model.add(Activation('relu'))
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(160))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(optimizer=keras.optimizers.SGD(lr=0.0001, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
The TensorFlow version of Keras offers a one-line conversion to a TF estimator:
estimator = keras.estimator.model_to_estimator(keras_model=model)
To set up training, we need to know the name assigned to the model input. We can quickly check this with the following code:
model.input_names
['dense_1_input']
Estimators get trained with an input function. The input function allows us to specify a whole pipeline, which will be executed efficiently. In this case, we only want an input function that yields our training set:
train_input_fn = tf.estimator.inputs.numpy_input_fn(x={'dense_1_input': x_train},y=y_train,num_epochs=1,shuffle=False)
Finally, we train the estimator on the input. And that is it; you can now utilize distributed TensorFlow with estimators:
estimator.train(input_fn=train_input_fn, steps=2000)
You will often find that someone created a special layer optimized to perform certain tasks on certain hardware. Keras' CuDNNLSTM layer, for example, only runs on GPUs supporting CUDA, NVIDIA's parallel computing platform and programming model for GPUs.
When you lock in your model to specialized hardware, you can often make significant gains in your performance. If you have the resources, it might even make sense to write your own specialized layer in CUDA. If you want to change hardware later, you can usually export weights and import them to a different layer.
With the right hardware and optimized software in place, your model often ceases to be the bottleneck. You should check your GPU utilization by entering the following command in your Terminal:
nvidia-smi -l 2
If your GPU utilization is not at around 80% to 100%, you can gain significantly by optimizing your pipeline. There are several steps you can take to optimize your pipeline:
- Keep a queue of data ready for the model: If you feed the model from a generator and want a larger queue of data to be held ready, change the max_queue_size parameter of the fit_generator method. If you set the workers argument of the fit_generator method to zero, the generator will run on the main thread, which slows things down.
- Preprocess data in parallel: In Keras, you can run several generator workers at once by setting use_multiprocessing to True and setting the number of workers to anything larger than one, preferably to the number of CPUs available. Let's look at an example:
model.fit_generator(generator, steps_per_epoch=40, workers=4, use_multiprocessing=True)
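To illustrate why a queue of ready data keeps the model busy, here is a minimal sketch in plain Python. It is not how Keras implements its queue. The helper name prefetch is hypothetical, and its max_queue_size argument is only named to mirror the fit_generator parameter:

```python
import queue
import threading


def prefetch(generator, max_queue_size=10):
    """Run `generator` in a background thread and keep up to
    `max_queue_size` items ready for the consumer.

    Simplification: errors raised inside the generator are not
    forwarded to the consumer in this sketch.
    """
    q = queue.Queue(maxsize=max_queue_size)
    sentinel = object()  # marks the end of the data

    def producer():
        for item in generator:
            q.put(item)  # blocks while the queue is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = q.get()  # blocks only if no item is ready yet
        if item is sentinel:
            return
        yield item


batches = list(prefetch(iter(range(100)), max_queue_size=5))
```

While the consumer works on one item, the background thread is already preparing the next ones, so the consumer rarely has to wait as long as the producer can keep the queue filled.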
You need to make sure your generator is thread safe. You can make any generator thread safe with the following code snippet:
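import threading

class thread_safe_iter:  #1
    """Wrap an iterator so that only one thread at a time can advance it."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):  #2
        with self.lock:
            return next(self.it)

    next = __next__  # keeps Python 2 style calls working

def thread_safe_generator(f):  #3
    def g(*a, **kw):
        return thread_safe_iter(f(*a, **kw))
    return g

@thread_safe_generator
def gen():
    ...  # your generator body goes here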
Let's look at the three key components of the preceding code:
1. The thread_safe_iter class makes any iterator thread safe by acquiring a lock whenever the iterator has to produce its next element.
2. When next() is called on the iterator, the calling thread acquires the lock. While the lock is held, no other thread can advance the iterator, and any other thread that calls next() has to wait until the lock is released. Once the lock is acquired, the iterator yields the next element.
3. thread_safe_generator is a Python decorator that turns any generator function it decorates into one that returns a thread-safe iterator. It takes the function, passes its output to the thread-safe iterator, and then returns the thread-safe version of the function.
You can also use the tf.data API together with an estimator, which does most of the work for you.
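To see the thread-safe wrapper from the snippet above in action, the following self-contained sketch shares one generator between several threads. The wrapper is repeated here in Python 3 form so the example runs on its own. The count_up generator and the consume_in_threads helper are hypothetical names introduced only for this demonstration:

```python
import threading


class thread_safe_iter:
    """Wrap an iterator so only one thread at a time can advance it."""
    def __init__(self, it):
        self.it = it
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self.lock:
            return next(self.it)


def thread_safe_generator(f):
    def g(*a, **kw):
        return thread_safe_iter(f(*a, **kw))
    return g


@thread_safe_generator
def count_up(n):
    # Hypothetical data source: yields 0, 1, ..., n - 1.
    for i in range(n):
        yield i


def consume_in_threads(iterator, n_threads=4):
    """Let several threads pull from one shared iterator."""
    results = []
    results_lock = threading.Lock()

    def worker():
        for item in iterator:
            with results_lock:
                results.append(item)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = consume_in_threads(count_up(1000))
```

Every element is handed to exactly one thread, so sorting the results gives back the numbers 0 to 999 without gaps or duplicates. The order in which threads receive elements is not deterministic.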
tf.data.Dataset API: If you are using the TensorFlow version of Keras, you can use the Dataset API, which optimizes data loading and processing for you. The Dataset API is the recommended way to load data into TensorFlow. It offers a wide range of ways to load data, for instance, from a CSV file with tf.data.TextLineDataset, or from TFRecord files with tf.data.TFRecordDataset.
Note: For a more comprehensive guide to the Dataset API, see https://www.tensorflow.org/get_started/datasets_quickstart.
In this example, we will use the dataset API with NumPy arrays that we have already loaded into RAM, such as the MNIST database.
First, we create two plain datasets for data and targets:
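# y_train must hold integer class labels here, because one-hot encoding is applied in the next step
dxtrain = tf.data.Dataset.from_tensor_slices(x_train)
dytrain = tf.data.Dataset.from_tensor_slices(y_train)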
The map function allows us to perform operations on data before passing it to the model. In this case, we apply one-hot encoding to our targets. However, this could be any function. By setting the num_parallel_calls argument, we can specify how many elements we want to process in parallel:
def apply_one_hot(z):
    return tf.one_hot(z, 10)

dytrain = dytrain.map(apply_one_hot, num_parallel_calls=4)
We zip the data and targets into one dataset. We instruct TensorFlow to shuffle the data when loading, keeping 200 instances in memory from which to draw samples. Finally, we make the dataset yield batches of batch size 32:
train_data = tf.data.Dataset.zip((dxtrain,dytrain)).shuffle(200).batch(32)
We can now fit a Keras model on this dataset just as we would fit it to a generator:
model.fit(train_data, epochs=10, steps_per_epoch=60000 // 32)
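The idea of shuffling from a buffer of 200 instances can be sketched in plain Python. This is a conceptual illustration only, not TensorFlow's implementation, and the function name buffered_shuffle and its seed argument are assumptions made for this example:

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in a shuffled order while holding at most
    `buffer_size` of them in memory at any time."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its place.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item


shuffled = list(buffered_shuffle(range(1000), buffer_size=200, seed=0))
```

A small buffer keeps memory use low but only mixes the data locally: the first element drawn always comes from the first 200 instances. If the data is stored in a sorted order, a larger buffer gives a more thorough shuffle.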
If you have truly large datasets, the more you can parallelize, the better. Parallelization does come with overhead costs, however, and not every problem actually features huge datasets. In these cases, refrain from trying to do too much in parallel and focus on slimming down your network, using CPUs and keeping all your data in RAM if possible.
Python is a popular language because developing code in Python is easy and fast. However, Python can be slow, which is why many production applications are written in either C or C++. Cython is Python with C data types, which significantly speeds up execution. Using this language, you can write pretty much normal Python code, and Cython converts it to fast-running C code.
Note: You can read the full Cython documentation here: http://cython.readthedocs.io. This section is a short introduction to Cython. If performance is important to your application, you should consider diving deeper.
Say you have a Python function that prints out the Fibonacci series up to a specified point. This code snippet is taken straight from the Python documentation:
from __future__ import print_function

def fib(n):
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()
Note that we have to import the print_function to make sure that print() works in the Python 3 style. To use this snippet with Cython, save it as cython_fib_8_7.pyx.
Now create a new file called 8_7_cython_setup.py:
from distutils.core import setup  #1
from Cython.Build import cythonize  #2

setup(  #3
    ext_modules=cythonize("cython_fib_8_7.pyx"),
)
The three main features of the code are these:
1. The setup function is a Python function to create modules, such as the ones you install with pip.
2. cythonize is a function to turn a pyx Python file into Cython C code.
3. We create a new module by calling setup and passing on our Cythonized code.
To run this, we now run the following command in a Terminal:
python 8_7_cython_setup.py build_ext --inplace
This will create a C file, a build file, and a compiled module. We can import this module now by running:
import cython_fib_8_7
cython_fib_8_7.fib(1000)
This will print out the Fibonacci numbers up to 1,000. Cython also comes with a handy annotation tool that shows where Cython has to fall back onto Python code, which will slow things down. Type the following command into your Terminal:
cython -a cython_fib_8_7.pyx
This will create an HTML file that looks similar to this when opened in a browser:
As you can see, Cython has to fall back on Python all the time in our script because we did not specify the types of variables. By letting Cython know what data type a variable has, we can speed up the code significantly. To define a variable with a type, we use cdef:
from __future__ import print_function

def fib(int n):
    cdef int a = 0
    cdef int b = 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()
This snippet is already better. Further optimization is certainly possible: by first calculating the numbers before printing them, we can reduce the reliance on Python print statements. Overall, Cython is a great way to keep the development speed and ease of Python and gain execution speed.
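The suggestion of calculating the numbers first and printing them afterwards could look like the following sketch. It is written in plain Python so that it runs as is. The function name fib_list is an assumption made for this example, and in a .pyx file you would add the cdef type declarations as in the typed version of fib:

```python
def fib_list(n):
    """Return all Fibonacci numbers below n as a list."""
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)  # collect instead of printing inside the loop
        a, b = b, a + b
    return result


numbers = fib_list(1000)
# A single print call replaces one call per number.
print(' '.join(str(x) for x in numbers))
```

Separating the calculation from the output also makes the function easier to reuse, since callers can decide what to do with the numbers.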
An under-appreciated way to make models run faster is to cache frequent requests in a database. You can go so far as to cache millions of predictions in a database and then look them up. This has the advantage that you can make your model as large as you like and expend a lot of computing power to make predictions.
By using a MapReduce database, looking up requests in a very large pool of possible requests and predictions is entirely possible. Of course, this requires requests to be somewhat discrete. If you have continuous features, you can round them if precision is not as important.
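A minimal sketch of this idea follows, with rounding used to make continuous features discrete. An in-memory dictionary stands in for the database here, and make_cached_predictor and slow_model are hypothetical names introduced only for illustration:

```python
def make_cached_predictor(predict, decimals=2):
    """Wrap `predict` so that requests which round to the same
    feature values are answered from a cache."""
    cache = {}  # stand-in for a database of stored predictions

    def cached_predict(features):
        # Rounding turns continuous features into a discrete lookup key.
        key = tuple(round(x, decimals) for x in features)
        if key not in cache:
            cache[key] = predict(key)
        return cache[key]

    return cached_predict


calls = []


def slow_model(features):
    # Hypothetical expensive model; records how often it is invoked.
    calls.append(features)
    return sum(features)


predictor = make_cached_predictor(slow_model, decimals=1)
first = predictor([0.12, 3.41])
second = predictor([0.09, 3.38])  # rounds to the same key as the first request
```

The second request never reaches the model because both inputs round to the same key. The choice of decimals controls the trade-off: coarser rounding yields more cache hits but less precise predictions.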