A basic RL cycle is shown in the following code block. It makes the agent play 10 random moves while rendering the game at each step:
import gym

# create the environment
env = gym.make("CartPole-v1")
# reset the environment before starting
env.reset()

# loop 10 times
for i in range(10):
    # take a random action
    env.step(env.action_space.sample())
    # render the game
    env.render()

# close the environment
env.close()
This leads to the following output:
Let's take a closer look at the code. It starts by creating a new environment named CartPole-v1, a classic game used in control theory problems. Before it can be used, the environment is initialized by calling reset(). Then, the cycle loops 10 times. In each iteration, env.action_space.sample() samples a random action, env.step() executes it in the environment, and the render() method displays the result; that is, the current state of the game, as in the preceding screenshot. Finally, the environment is closed by calling env.close().
This cycle is the same for every environment that uses the Gym interface, but for now, the agent can only play random actions without having any feedback, which is essential to any RL problem.
The following diagram shows the flow of the cycle:
Indeed, the step() method returns four variables that provide information about the interaction with the environment. The preceding diagram shows the loop between the agent and the environment, as well as the variables exchanged; namely, Observation, Reward, Done, and Info. Observation is an object that represents the new observation (or state) of the environment. Reward is a float that represents the reward obtained from the last action. Done is a Boolean value used in episodic tasks; that is, tasks that are limited in the number of interactions. Whenever done is True, the episode has terminated and the environment should be reset. For example, done is True when the task has been completed or the agent has died. Info, on the other hand, is a dictionary that provides extra information about the environment but usually isn't used.
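To make this contract concrete, here is a minimal toy environment (the class name and its countdown dynamics are invented for illustration; it is not part of Gym) whose step() method returns the same four values, consumed by the same loop structure we use for CartPole:

```python
class CountdownEnv:
    """Toy environment: the episode lasts a fixed number of steps, +1 reward each."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        obs = self.t                    # new observation (or state)
        reward = 1.0                    # reward obtained from the last action
        done = self.t >= self.horizon   # True when the episode has terminated
        info = {}                       # extra information, usually unused
        return obs, reward, done, info


env = CountdownEnv()
obs = env.reset()
done = False
total_rew = 0.0
while not done:
    obs, rew, done, info = env.step(0)
    total_rew += rew
print(total_rew)  # 3.0
```

The agent-environment loop is identical to the Gym one; only the environment's internals differ.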
If you have never heard of CartPole, it's a game with the goal of balancing a pole attached to a horizontal cart. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole leans too far from vertical, when the cart moves too far from the center, or when the agent manages to keep the pole balanced for 500 timesteps (the episode cap of CartPole-v1, giving a maximum cumulative reward of 500).
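In the classic CartPole formulation, "too unbalanced" means the pole tilts more than about 12 degrees from vertical or the cart drifts more than 2.4 units from the center. A small sketch of that check (the helper name is ours; the thresholds are the standard ones from the classic formulation):

```python
import math

# standard CartPole termination thresholds (classic formulation)
X_LIMIT = 2.4                      # cart position bound
THETA_LIMIT = 12 * math.pi / 180   # pole angle bound, ~0.209 radians


def is_done(obs):
    """Return True if the observation [x, x_dot, theta, theta_dot] is terminal."""
    x, _, theta, _ = obs
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT


print(is_done([0.0, 0.0, 0.0, 0.0]))    # False: balanced at the center
print(is_done([0.0, 0.0, 0.3, 0.0]))    # True: pole tilted past ~0.209 rad
print(is_done([-2.5, 0.0, 0.0, 0.0]))   # True: cart out of bounds
```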
We can now create a more complete algorithm that plays 10 games and prints the accumulated reward for each game using the following code:
import gym

# create and initialize the environment
env = gym.make("CartPole-v1")
env.reset()

# play 10 games
for i in range(10):
    # initialize the variables
    done = False
    game_rew = 0

    while not done:
        # choose a random action
        action = env.action_space.sample()
        # take a step in the environment
        new_obs, rew, done, info = env.step(action)
        game_rew += rew

        # when the episode is done, print its cumulative reward and reset the environment
        if done:
            print('Episode: %d, Reward:%d' % (i, game_rew))
            env.reset()
The output will be similar to the following:
Episode: 0, Reward:13
Episode: 1, Reward:16
Episode: 2, Reward:23
Episode: 3, Reward:17
Episode: 4, Reward:30
Episode: 5, Reward:18
Episode: 6, Reward:14
Episode: 7, Reward:28
Episode: 8, Reward:22
Episode: 9, Reward:16
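Since each timestep yields a reward of +1, the cumulative reward equals the episode length, so averaging the scores gives a rough measure of how long random play survives. As a quick sanity check (the list simply copies the ten values from the sample output above):

```python
# episode rewards copied from the sample run above
rewards = [13, 16, 23, 17, 30, 18, 14, 28, 22, 16]

# with +1 per timestep, mean reward == mean episode length
mean_rew = sum(rewards) / len(rewards)
print(mean_rew)  # 19.7
```

Random play rarely lasts long; the agents we build later will push this average toward the 500-step cap.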
The following table shows the output of the step() method over the last four actions of a game:
| Observation | Reward | Done | Info |
|---|---|---|---|
| [-0.05356921, -0.38150626, 0.12529277, 0.9449761 ] | 1.0 | False | {} |
| [-0.06119933, -0.57807287, 0.14419229, 1.27425449] | 1.0 | False | {} |
| [-0.07276079, -0.38505429, 0.16967738, 1.02997704] | 1.0 | False | {} |
| [-0.08046188, -0.58197758, 0.19027692, 1.37076617] | 1.0 | False | {} |
| [-0.09210143, -0.3896757, 0.21769224, 1.14312384] | 1.0 | True | {} |
Notice that the environment's observation is encoded in a 1 x 4 array; that the reward, as expected, is always 1.0; and that done is True only in the last row, where the game terminates. Also, Info, in this case, is empty.
In the upcoming chapters, we'll create agents that play CartPole by taking more intelligent actions depending on the current state of the pole.