Developing an RL cycle

A basic RL cycle is shown in the following code block. It makes the agent take 10 random actions in the environment, rendering the game at each step:

import gym

# create the environment
env = gym.make("CartPole-v1")
# reset the environment before starting
env.reset()

# loop 10 times
for i in range(10):
    # take a random action
    env.step(env.action_space.sample())
    # render the game
    env.render()

# close the environment
env.close()

This leads to the following output:

Figure 2.1: Rendering of CartPole

Let's take a closer look at the code. It starts by creating a new environment named CartPole-v1, a classic game used in control theory problems. Before the environment can be used, it is initialized by calling reset(). The cycle then loops 10 times: in each iteration, env.action_space.sample() samples a random action, env.step() executes it in the environment, and render() displays the result; that is, the current state of the game, as in the preceding screenshot. Finally, the environment is closed by calling env.close().

Don't worry if the preceding code outputs deprecation warnings; they simply notify you that some functions have changed. The code will still function correctly.

This cycle is the same for every environment that uses the Gym interface, but for now, the agent can only take random actions without receiving any feedback, and feedback is essential to any RL problem.

In RL, you may see the terms state and observation used almost interchangeably, but they are not the same. We talk about a state when all the information pertaining to the environment is encoded in it, and about an observation when only a part of the actual state is visible to the agent, such as the perception of a robot. To simplify things, OpenAI Gym always uses the term observation.
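If you are curious about what the observations and actions of an environment look like, Gym exposes them through the observation_space and action_space attributes. The following is a minimal sketch that simply prints them for CartPole-v1 (the exact string representation of the spaces may vary between Gym versions):

import gym

env = gym.make("CartPole-v1")

# the observation space describes the values the agent observes,
# while the action space describes the actions it can take
print(env.observation_space)  # a Box space with 4 continuous values
print(env.action_space)       # a Discrete space with 2 actions (push left or right)

env.close()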

The following diagram shows the flow of the cycle:

Figure 2.2: Basic RL cycle according to OpenAI Gym. The environment returns the next state, a reward, a done flag, and some additional information

Indeed, the step() method returns four variables that provide information about the interaction with the environment. The preceding diagram shows the loop between the agent and the environment, as well as the variables exchanged; namely, Observation, Reward, Done, and Info. Observation is an object that represents the new observation (or state) of the environment. Reward is a floating-point number representing the reward obtained from the last action. Done is a Boolean value used in episodic tasks; that is, tasks that are limited in the number of interactions. Whenever done is True, the episode has terminated and the environment should be reset. For example, done is True when the task has been completed or the agent has died. Info, on the other hand, is a dictionary that provides extra information about the environment but that usually isn't used.
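As a quick illustration, the following sketch takes a single random action and unpacks the four values returned by step(); the values noted in the comments are only examples and will differ from run to run:

import gym

env = gym.make("CartPole-v1")
obs = env.reset()

# take one random action and unpack the four values returned by step()
new_obs, reward, done, info = env.step(env.action_space.sample())

print(new_obs)  # the new observation, e.g. a 4-element array
print(reward)   # the reward obtained from the last action, e.g. 1.0
print(done)     # True if the episode has terminated, otherwise False
print(info)     # a dictionary with extra information, usually {}

env.close()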

If you have never heard of CartPole, it's a game with the goal of balancing a pole attached to a cart that moves along a horizontal track. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole falls too far from vertical, the cart moves too far from the center, or the maximum number of timesteps is reached (200 for CartPole-v0 and 500 for CartPole-v1, giving a maximum cumulative reward of 200 or 500, respectively).
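If you want to check these limits programmatically, the environment's registered spec stores them. The following sketch assumes the standard Gym spec attributes max_episode_steps and reward_threshold, which are defined for the built-in CartPole environments:

import gym

env = gym.make("CartPole-v1")

# the time limit enforced on each episode and the average reward
# at which the task is considered solved, as stored in the spec
print(env.spec.max_episode_steps)  # maximum number of timesteps per episode
print(env.spec.reward_threshold)   # average reward needed to consider the task solved

env.close()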

We can now create a more complete algorithm that plays 10 games and prints the accumulated reward for each game using the following code:

import gym

# create and initialize the environment
env = gym.make("CartPole-v1")
env.reset()

# play 10 games
for i in range(10):
    # initialize the variables
    done = False
    game_rew = 0

    while not done:
        # choose a random action
        action = env.action_space.sample()
        # take a step in the environment
        new_obs, rew, done, info = env.step(action)
        game_rew += rew

        # when the episode is done, print its cumulative reward and reset the environment
        if done:
            print('Episode: %d, Reward:%d' % (i, game_rew))
            env.reset()

# close the environment
env.close()

The output will be similar to the following:

Episode: 0, Reward:13
Episode: 1, Reward:16
Episode: 2, Reward:23
Episode: 3, Reward:17
Episode: 4, Reward:30
Episode: 5, Reward:18
Episode: 6, Reward:14
Episode: 7, Reward:28
Episode: 8, Reward:22
Episode: 9, Reward:16

The following table shows the output of the step() method over the last four actions of a game:

Observation                                          Reward  Done   Info
[-0.05356921, -0.38150626, 0.12529277, 0.9449761 ]   1.0     False  {}
[-0.06119933, -0.57807287, 0.14419229, 1.27425449]   1.0     False  {}
[-0.07276079, -0.38505429, 0.16967738, 1.02997704]   1.0     False  {}
[-0.08046188, -0.58197758, 0.19027692, 1.37076617]   1.0     False  {}
[-0.09210143, -0.3896757, 0.21769224, 1.14312384]    1.0     True   {}

Notice that the environment's observation is encoded in a 1 x 4 array; that the reward, as we expected, is always 1; and that done is True only in the last row when the game is terminated. Also, Info, in this case, is empty.
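A table like the preceding one can be produced by printing what step() returns at every timestep. Here is a minimal sketch that does just that for a single game of random actions:

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done = False

# play one game with random actions, printing the output of step() at each timestep
while not done:
    obs, rew, done, info = env.step(env.action_space.sample())
    print(obs, rew, done, info)

env.close()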

In the upcoming chapters, we'll create agents that play CartPole by taking more intelligent actions depending on the current state of the pole. 
