Chapter 3. Machine Learning Libraries That Use Ray

In the first chapter, we learned that Ray was created to address the need for flexible, efficient, and easy-to-use distributed computing, especially for newer machine learning systems like reinforcement learning, which are mostly written in Python. We also learned about the Ray ecosystem today.

In the last chapter, we saw Ray’s concise and intuitive but low-level API in action.

Now we’ll discuss several of the libraries built on Ray, which were themselves drivers for Ray’s creation. They include Ray RLlib for reinforcement learning, Ray Tune for hyperparameter tuning, Ray SGD for distributed training of TensorFlow and PyTorch models, and Ray Serve for model serving.

What Is Reinforcement Learning?

Reinforcement learning (RL) is a large and diverse subject. We can’t do it justice here, but we can explore the highlights and see how Ray RLlib enables RL practitioners to work efficiently. RLlib is also modular and flexible to support researchers exploring new RL algorithms and techniques.

So here goes a whirlwind tour of reinforcement learning. I’ll italicize popular terms as we encounter them. Consider Figure 3-1.

Figure 3-1. Reinforcement learning

An agent takes actions in an environment, attempting to maximize a cumulative reward. At each step, the agent observes the environment’s current state and the reward received from taking the previous action. Then the agent decides on the next action to take.

Learning the best policy that maximizes the cumulative reward is the essence of RL. Often this is done by trial and error, trying repeated episodes to determine what actions are best. The agent may or may not have prior knowledge about the environment, i.e., a model. For example, an environment that represents a physical system might be modeled with a simulation of the physics involved.

A key consideration is the exploitation versus exploration trade-off. If the agent finds that a particular action always returns a good reward, the agent may wish to exploit that action. However, even better actions may exist that the agent has not yet discovered, so exploration is required, even though most alternative actions may prove inferior. Balancing this trade-off effectively is important.
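
One simple way to balance exploration and exploitation (not something our RLlib example below uses directly, but a useful mental model) is the epsilon-greedy rule: with a small probability the agent explores a random action; otherwise it exploits the action with the best estimated value. Here is a minimal, self-contained sketch; the estimated_values dictionary of action values is a hypothetical placeholder, not part of any Ray API:

import random

def epsilon_greedy_action(estimated_values, epsilon=0.1):
    # Explore: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.choice(list(estimated_values))
    # Exploit: otherwise, pick the action with the highest estimated value.
    return max(estimated_values, key=estimated_values.get)

# Hypothetical value estimates for CartPole's two actions: 0 = push left, 1 = push right.
action = epsilon_greedy_action({0: 0.8, 1: 1.2}, epsilon=0.1)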

Another key challenge is the credit assignment problem. If we are maximizing the cumulative reward, it can be difficult to know how each particular action during a long episode contributed to that reward or subtracted from it.

A popular example environment is CartPole, part of OpenAI Gym, which simulates a cart moving left or right while trying to keep a vertical pole balanced. CartPole can be completely determined by simple physics, but we’ll use RL to learn how to balance the pole by trial and error, the same way a human would learn this task.
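
Before bringing RLlib into the picture, it helps to see the agent–environment loop in code. The following sketch uses the classic OpenAI Gym API with a random “policy” standing in for the agent; the return values of reset and step changed in later Gym releases, so treat this as illustrative of the classic API rather than a definitive recipe:

import gym

env = gym.make("CartPole-v1")
observation = env.reset()                  # Initial state of the cart and pole.
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()     # Random action: push left or right.
    observation, reward, done, info = env.step(action)
    total_reward += reward                 # +1 for every step the pole stays up.
print(f"Episode reward: {total_reward}")
env.close()

A random agent rarely keeps the pole up for more than a few dozen steps, which is exactly the gap that RL training closes.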

Figure 3-2 shows the OpenAI implementation of this environment.

Figure 3-2. CartPole

Let’s use Ray RLlib to train a simple neural network to balance poles!

Reinforcement Learning with Ray RLlib

Ray RLlib is a reinforcement learning system built on Ray. It provides implementations of many algorithms for RL. It integrates with third-party tools like TensorFlow and PyTorch for deep learning, as well as OpenAI Gym. We’ll train our neural network for CartPole with TensorFlow.

First, we’ll need to install Ray RLlib, which will also install other libraries we need, like TensorFlow:

pip install 'ray[rllib]'

Assuming we are running a new IPython session, we need to import Ray and the RLlib components, then start Ray:

import ray
import ray.rllib.agents.ppo as ppo

ray.init()

We will use a popular RL algorithm for training called proximal policy optimization (PPO). PPO is not the best-known RL algorithm, but it works very well, so it is widely used by RL experts. I won’t explain PPO here, but see the RLlib PPO documentation for more details.

Next, we specify configuration settings, starting with a default configuration object for PPO. The code comments explain most of the details:

# Specifies the OpenAI Gym environment for CartPole, V1.
SELECT_ENV = "CartPole-v1"
# Number of training runs.
N_ITER = 20
# PPO's default configuration.
config = ppo.DEFAULT_CONFIG.copy()
# Suppress too many messages.
config["log_level"] = "WARN"
# Use > 1 for more CPU cores, e.g., over a cluster.
config['num_workers'] = 8  # 8 is optimal for my laptop!
# Number of stochastic gradient descent iterations
# per training minibatch.
config['num_sgd_iter'] = 10
# The number of data records per minibatch.
config['sgd_minibatch_size'] = 250
# Use two hidden neural network layers with these sizes.
config['model']['fcnet_hiddens'] = [40, 20]
# Don't pin a CPU core to each worker (allows more workers).
config['num_cpus_per_worker'] = 0
checkpoint_dir = 'checkpoints'

We’ll train for 20 iterations, seeing progressive improvement with each one. Each iteration runs episodes, sequences of the “virtuous cycles” we discussed at the top of the chapter, where each episode continues until some stopping criterion is met, such as the pole falling over.

We’ll train a TensorFlow neural network with two fully connected hidden layers of sizes 40 and 20, respectively. In other words, the first hidden layer will have 40 units and the second will have 20. These sizes are hyperparameters. We’ll return later to how we chose them.
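
For intuition about what those sizes mean, here is a standalone Keras sketch of roughly the network shape this configuration implies. This is not how RLlib builds its models internally; the input size (CartPole’s four observation values), the output size (two actions), and the activations are assumptions for illustration only:

import tensorflow as tf

# Roughly the shape implied by fcnet_hiddens = [40, 20] for CartPole.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(40, activation="tanh", input_shape=(4,)),  # 4 observation values.
    tf.keras.layers.Dense(20, activation="tanh"),
    tf.keras.layers.Dense(2),                                        # Push left or push right.
])
model.summary()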

Next, construct a PPO trainer using the configuration and the CartPole-v1 environment:

trainer = ppo.PPOTrainer(config, SELECT_ENV)

Finally, train for N_ITER (20) episodes. While it’s running, look at the Ray Dashboard we discussed in the last chapter:

fmt = '{:3d},{:8.4f},{:8.4f},{:8.4f}'
last_checkpoint = ''
for n in range(N_ITER):
    result = trainer.train()
    min_reward  = result['episode_reward_min']
    mean_reward = result['episode_reward_mean']
    max_reward  = result['episode_reward_max']
    last_checkpoint = trainer.save(checkpoint_dir)
    print(fmt.format(n, min_reward, mean_reward, max_reward))
print(f'last checkpoint file: {last_checkpoint}')

The output of one training run is formatted into a nice table in Table 3-1.

Table 3-1. RLlib training results
Iteration  Min reward  Mean reward  Max reward
0          8.0000      21.1429      69.0000
1          8.0000      23.4556      80.0000
2          8.0000      28.1189      88.0000
3          11.0000     41.0400      127.0000
4          10.0000     42.8400      178.0000
5          11.0000     55.8900      181.0000
6          18.0000     68.5100      181.0000
7          18.0000     78.8600      188.0000
8          18.0000     85.9700      188.0000
9          25.0000     98.2600      273.0000
10         20.0000     108.9600     273.0000
11         20.0000     123.4100     379.0000
12         20.0000     134.0800     379.0000
13         20.0000     155.3600     422.0000
14         20.0000     174.5800     500.0000
15         20.0000     184.2600     500.0000
16         20.0000     202.5300     500.0000
17         35.0000     218.8900     500.0000
18         35.0000     232.4200     500.0000
19         35.0000     240.6800     500.0000

The mean reward is the most meaningful metric, because it indicates how well the cart performs in general. The maximum possible reward for each episode is 500. (One point is awarded for every time step in an episode that the pole stays balanced; CartPole-v1 caps episodes at 500 steps.) We start seeing such maximal runs early in training, but even after 20 iterations, the mean reward still has room to improve. Try running with a higher value for N_ITER. Try using larger values for the neural network layer sizes.

The script preceding Table 3-1 printed the relative path to the last checkpoint file: checkpoints/checkpoint_20/checkpoint-20. A typical last step is to take the final checkpoint and use the rllib rollout command-line tool to observe several iterations of the environment running with our trained neural network:

c='{"env":"CartPole-v1", "model":{"fcnet_hiddens":[40, 20]}}'
rllib rollout checkpoints/checkpoint_20/checkpoint-20 
  --config $c --run PPO --steps 2000

For each episode, you’ll see the total reward printed, and a pop-up window will show the cart in action. The image shown previously was taken from one of those episodes.
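
If you prefer to stay in Python instead of shelling out to rllib rollout, a loop like the following restores the checkpoint and steps the environment with the trained policy. The trainer.restore and trainer.compute_action calls were part of the RLlib trainer API in the Ray releases this chapter targets, but check the documentation for your version:

import gym

trainer.restore(last_checkpoint)            # Reload the weights we just saved.
env = gym.make(SELECT_ENV)
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = trainer.compute_action(observation)   # Ask the trained policy.
    observation, reward, done, _ = env.step(action)
    total_reward += reward
print(f"Episode reward with the trained policy: {total_reward}")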

What Did We Learn?

Notice that, except for the Ray imports and ray.init(), we didn’t use the core Ray API at all, but RLlib used it extensively behind the scenes to schedule work across our cluster or laptop CPU cores. The RLlib API is designed for flexibility and relatively easy integration with other tools, including your own applications that use RL.

While fully understanding the example code in the previous section would require a lot more background in RL, hopefully you can appreciate how concise it is to say what you want to do and let RLlib do the work. In fact, the loop we used could be performed even more concisely by the rllib train shell command instead, which is similar to the rollout command we used.

Hyperparameter Tuning with Ray Tune

Ray Tune is a hyperparameter tuning system built on Ray. Hyperparameters are the fundamental configuration choices you make about a machine learning system before training begins, in contrast to the parameters that training itself learns.

For example, when using a neural network, there are a vast number of possible architectures to choose from, such as how many layers and the properties of each layer. Naively exploring this vast space can be very compute-intensive and expensive, as you seek an architecture that works well for your problem. Ray Tune implements several algorithms that optimize this search process.
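
To get a feel for Tune’s API outside of RLlib, here is a minimal, self-contained sketch that grid-searches a single hyperparameter for a toy objective. The objective function, its parameter x, and the metric name score are hypothetical; tune.report is the metric-logging call in the Tune releases this chapter targets, and tune.run returns an analysis object we can query for the best configuration:

from ray import tune

def objective(config):
    # A toy "training" run: the score is best (smallest) near x = 3.
    score = (config["x"] - 3) ** 2
    tune.report(score=score)        # Report the metric Tune should optimize.

analysis = tune.run(
    objective,
    config={"x": tune.grid_search([0, 1, 2, 3, 4, 5])},
)
print(analysis.get_best_config(metric="score", mode="min"))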

In our CartPole example, we chose a two-layer, fully connected neural network with 40 weights in the first layer and 20 in the second. In fact, the size of the layers wasn’t an arbitrary choice; we used Tune to find good choices for these sizes that minimized the training times, as we’ll discuss shortly. Tune is flexible enough to integrate new algorithms and work with many systems, like TensorFlow, PyTorch, and, of course, RLlib.

Once the hyperparameters are selected, then the system’s parameters can be trained as appropriate, such as using labeled data for supervised learning.

Using very large layers in our CartPole neural network would give us excellent results, but increase the computation required. I used Tune to find relatively small layer sizes that still provide good results with reasonable training times.

Training a neural network with all possible size pairs would be prohibitively expensive. So, I picked from four reasonable values: 20, 40, 60, 80. I tried all 16 combinations of these four values for the two layers.

Let’s start with the necessary imports and start Ray:

import ray
from ray import tune

ray.init()

The following command does the hyperparameter search:

tune.run(
    "PPO",
    stop={"episode_reward_mean": 400},
    config={
        "env": "CartPole-v1",
        "num_gpus": 0,
        "num_workers": 6,  # Use a smaller number than before.
        "model": {
            'fcnet_hiddens': [
                tune.grid_search([20, 40, 60, 80]),
                tune.grid_search([20, 40, 60, 80])
            ]
        },
    },
)

Just by reading this code, can you tell what’s going on?

  • We’re training CartPole with PPO.

  • We specified a stopping criterion: training stops when the network achieves a mean episode reward of 400.

  • Workers are pinned to CPU cores, so we can’t request eight workers as we did before, because Tune needs at least one core for itself.1

  • We’re telling Tune to try different values from the four layer sizes for each neural network layer.

The training takes a while. It prints out a lot of information as the trials progress, ending with the table of results summarized in Table 3-2.

Table 3-2. Tune results
Layer 1  Layer 2  Iterations  Total time (s)  Mean reward
20       20       32          385.13          403.93
40       20       23          81.21           400.84
60       20       18          164.11          410.70
80       20       16          154.51          414.93
20       40       24          107.72          408.19
40       40       18          82.10           406.47
60       40       16          90.66           405.69
80       40       16          141.26          415.00
20       60       19          135.66          406.87
40       60       18          122.97          401.28
60       60       17          106.22          408.06
80       60       16          123.32          412.71
20       80       23          127.48          403.85
40       80       16          120.14          404.57
60       80       16          137.49          406.99
80       80       16          94.96           410.73

As you can see, all trials (combinations of the four allowed sizes) ran until the network achieved a mean reward of 400; it’s actually easy to find layer sizes that work well for CartPole. Several combinations reached 400 in just 16 iterations, but [40, 20] finished in the least total time (81.21 seconds), even though it took 23 iterations. (The [40, 40] combination was a close second at 82.10 seconds.)
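
If you capture the object that tune.run returns (an ExperimentAnalysis in the Tune releases this chapter targets), you can query the winning configuration programmatically instead of reading it off the console. In this sketch, search_config stands for the configuration dictionary shown above, and time_total_s is the per-trial wall-clock metric Tune records; treat both names as assumptions to verify against your Ray version:

analysis = tune.run(
    "PPO",
    stop={"episode_reward_mean": 400},
    config=search_config,   # The configuration dictionary shown above.
)

# The trial that reached the stopping reward in the least wall-clock time.
print(analysis.get_best_config(metric="time_total_s", mode="min"))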

Distributed Training with Ray SGD

The newest Ray library is Ray SGD. It provides thin wrappers around the native modules in PyTorch and TensorFlow for data-parallel training via stochastic gradient descent (SGD). Ray SGD lets you use these modules for distributed training without managing the compute nodes yourself; Ray handles that chore for you.
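
The RaySGD API changed across Ray releases (and was later folded into Ray Train), so the following is only a rough sketch of the creator-function style shown in early RaySGD documentation, with a toy model and dataset as placeholders; verify the exact signatures against the docs for your Ray version:

import torch
import torch.nn as nn
from ray.util.sgd.torch import TorchTrainer

def model_creator(config):
    return nn.Linear(1, 1)                      # A toy one-weight model.

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=1e-2)

def data_creator(config):
    # Toy dataset: learn y = 2x.
    x = torch.arange(100, dtype=torch.float32).reshape(-1, 1)
    dataset = torch.utils.data.TensorDataset(x, 2 * x)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    return loader, loader                       # Train and validation loaders.

trainer = TorchTrainer(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2,                              # Data-parallel across two Ray workers.
)
print(trainer.train())                          # One epoch of distributed training.
trainer.shutdown()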

Model Serving with Ray Serve

Finally, Ray Serve is a model serving library that transparently scales the serving load across a cluster.

Serve can be embedded as a library in applications, eliminating the need to stand up separate services and make REST or gRPC calls. (There will still be some network overhead when Serve tasks and actors are distributed.) Serve can also run as a REST server, if you prefer.

Serve is designed to be agnostic to the model it is serving. In fact, it can “serve” any Python function you want, so you can use it for other purposes. It also provides declarative traffic routing and splitting options, which can be changed dynamically at runtime. For example, if you want to try a canary deployment to test a new model on a small subset of users, then reroute all traffic through the new model later on, Serve makes this deployment pattern easy to do.
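
As a taste of what embedding Serve looks like, here is a rough sketch that serves a plain Python function over HTTP. The Serve API has evolved significantly between Ray releases (the decorator-based deployments shown here arrived in later 1.x releases and changed again in Ray 2.x), so treat the decorator arguments and the request handling as assumptions to check against your version’s documentation:

import requests
from ray import serve

serve.start()                                   # Start Serve on the local Ray cluster.

@serve.deployment(route_prefix="/hello", num_replicas=2)
def hello(request):
    # The handler receives an HTTP request object; read an optional query parameter.
    name = request.query_params.get("name", "world")
    return f"Hello, {name}!"

hello.deploy()                                  # Make the function available over HTTP.

print(requests.get("http://127.0.0.1:8000/hello?name=Ray").text)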

The flexibility of Serve sets the stage for our next chapter, where we discuss Ray for general application development, including microservices.

What’s Next?

The popular Ray libraries for ML provide concise, expressive APIs for their target problem while leveraging Ray to handle the required distributed computing. Next, we’ll discuss considerations for why and how to use Ray for general application development, with implications for how you design and deploy conventional microservices. We’ll see that Ray’s flexible abstractions support a wide range of applications. In fact, Ray could transcend the limitations of existing platforms for serverless computing.

1 If you get errors that enough CPUs aren’t available, reduce the number for this configuration setting until it works.
