The partially observable Markov decision process

Back in Chapter 5, Introducing DRL, we learned that a Markov Decision Process (MDP) is used to define the state/model from which an agent calculates an action/value. In the case of Q-learning, we saw how a table or grid could hold the entire MDP for an environment such as the Frozen Pond or GridWorld. These types of RL are model-based, meaning they completely model every state in the environment (every square in a grid game, for instance). In most complex games and environments, however, mapping every physical or visual state becomes a partially observable problem, or what we refer to as a partially observable Markov decision process (POMDP).

A POMDP defines a process in which an agent never has a complete view of its environment, but instead learns to choose actions based on a derived general policy. This is demonstrated well in the Crawler example, where we can see the agent learning to move using only limited information: the direction to its target. The following table outlines the definitions of the Markov models we generally use for RL:

                            Control over state transitions?
All states observable?      No                         Yes
Yes                         Markov Chain               MDP
No                          Hidden Markov Model        POMDP

Since we provide our agent with control over its states in the form of actions, the Markov models we study are the MDP and POMDP. These processes are also often referred to as on-model or off-model: if an RL algorithm is completely aware of the state of the environment, we call it a model-based (on-model) process; conversely, a POMDP is an off-model process, or what we will refer to as a policy-based method. Policy-based algorithms provide better generalization and can learn in environments with an unknown or infinite number of observable states. Examples of partially observable environments are the Hallway, VisualHallway, and, of course, the Crawler.
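
To make the distinction concrete, the following is a minimal, hypothetical Python sketch (it is not ML-Agents code, and the names are invented for illustration). The first half keeps an explicit value entry for every state of a tiny, fully observable grid; the second half maps a partial observation, here just a direction-to-target like the Crawler receives, straight to an action through a parameterized policy:

import random

# Fully observable, tabular flavor: every state of a tiny 4 x 4 grid is
# enumerated, so the agent can store a value for each (state, action) pair.
ACTIONS = ["up", "down", "left", "right"]
q_table = {(x, y): {a: 0.0 for a in ACTIONS} for x in range(4) for y in range(4)}

def greedy_action(state):
    """Look up the best known action for a fully observed state."""
    values = q_table[state]
    return max(values, key=values.get)

# Partially observable, policy-based flavor: the agent never sees the whole
# world. Here the observation is just a normalized direction-to-target, and a
# toy parameterized policy maps it directly to a continuous action.
weights = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]

def policy(observation):
    """Map a partial observation (dx, dy to target) to a movement action."""
    dx, dy = observation
    return (weights[0] * dx, weights[1] * dy)

print(greedy_action((0, 0)))   # tabular lookup over a known state
print(policy((0.6, -0.8)))     # action computed from a partial observation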

Markov models provide a foundation for many aspects of machine learning, and you may encounter their use in more advanced deep learning methods known as deep probabilistic programming. Deep PPL, as it is referred to, is a combination of variational inference and deep learning methods.

Model-free methods typically use an experience buffer to store a set of experiences from which they later learn a general policy. This buffer is defined by a few hyperparameters, called time_horizon, batch_size, and buffer_size; a simplified sketch of how they fit together follows the list. Definitions of each of these parameters, extracted from the ML-Agents documentation, are given here:

  • time_horizon: This corresponds to how many steps of experience to collect per agent before adding them to the experience buffer. When this limit is reached before the end of an episode, a value estimate is used to predict the overall expected reward from the agent's current state. As such, this parameter trades off between a less biased, but higher variance estimate (long time horizon), and a more biased, but less varied estimate (short time horizon). In cases where there are frequent rewards within an episode, or episodes are prohibitively large, a smaller number can be more ideal. This number should be large enough to capture all the important behavior within a sequence of an agent's actions:
    • Typical range: 32 – 2,048
  • buffer_size: This corresponds to how many experiences (agent observations, actions, and rewards obtained) should be collected before we update the model or do any learning. This should be a multiple of batch_size. Typically, a larger buffer_size parameter corresponds to more stable training updates.
    • Typical range: 2,048 – 409,600
  • batch_size: This is the number of experiences used for one iteration of a gradient descent update. This should always be a fraction of the buffer_size parameter. If you are using a continuous action space, this value should be large (in the order of thousands). If you are using a discrete action space, this value should be smaller (in the order of tens).

    • Typical range (continuous): 512 – 5,120

    • Typical range (discrete): 32 – 512
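
Before changing these values, it helps to see how they fit together. The following is a simplified Python sketch of an experience buffer; it is not the ML-Agents implementation, and the value_estimate, collect_trajectory, maybe_update, and fake_step helpers are hypothetical stand-ins:

import random

# Hypothetical values chosen for illustration; see the typical ranges above.
TIME_HORIZON = 64   # steps collected per agent before a trajectory is cut
BUFFER_SIZE = 2048  # experiences gathered before any learning happens
BATCH_SIZE = 256    # experiences per gradient-descent minibatch

buffer = []  # the experience buffer: (observation, action, reward) tuples

def value_estimate(observation):
    """Stand-in critic: predicts the remaining reward when a trajectory is
    cut off at TIME_HORIZON before the episode has actually ended."""
    return 0.0  # a real agent would query its value network here

def collect_trajectory(env_step):
    """Collect at most TIME_HORIZON steps, bootstrapping with a value
    estimate if the episode does not finish in time."""
    trajectory = []
    for _ in range(TIME_HORIZON):
        obs, action, reward, done = env_step()
        trajectory.append((obs, action, reward))
        if done:
            return trajectory
    # Episode cut short: append the critic's guess of the remaining return.
    trajectory.append((obs, action, value_estimate(obs)))
    return trajectory

def maybe_update(train_on_batch):
    """Once the buffer is full, learn from it in BATCH_SIZE minibatches."""
    if len(buffer) < BUFFER_SIZE:
        return
    random.shuffle(buffer)
    for start in range(0, BUFFER_SIZE, BATCH_SIZE):
        train_on_batch(buffer[start:start + BATCH_SIZE])
    buffer.clear()

def fake_step():
    """Toy environment step: fixed reward, episode ends 5% of the time."""
    return (0.0, 0.0), (0.0, 0.0), 0.1, random.random() < 0.05

while len(buffer) < BUFFER_SIZE:
    buffer.extend(collect_trajectory(fake_step))
maybe_update(lambda batch: print("training on", len(batch), "experiences"))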

We can see how these values are set by looking at the CrawlerDynamicLearning brain configuration, and then alter them to see the effect this has on training. Open the editor to the CrawlerDynamicTarget scene, along with a properly configured Python window, and follow this exercise:

  1. Open the trainer_config.yaml file located in the ML-Agents/ml-agents/config folder.
  2. Scroll down to the CrawlerDynamicLearning brain configuration section:
CrawlerDynamicLearning:
    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2024
    buffer_size: 20240
    gamma: 0.995
    max_steps: 1e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
  3. Note the lines showing the time_horizon, batch_size, and buffer_size parameters. If you recall from our earlier Hallway/VisualHallway examples, the time_horizon parameter was only 32 or 64. Since those examples used a discrete action space, we could set a much lower value for time_horizon.
  4. Double the time_horizon, batch_size, and buffer_size values, as shown in the following code excerpt:
    time_horizon: 2000
    batch_size: 4048
    buffer_size: 40480
  5. Essentially, what we are doing here is doubling the number of experiences the agent uses to build a policy of the environment around it, giving it a larger snapshot of experiences to train against (see the quick calculation after these steps).
  6. Run the agent in training as you have done so many times before.
  7. Let the agent train for as long as you ran the previous base sample. This will give you a good comparison of training performance.
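
To see what doubling these values means for the update schedule, here is a rough back-of-the-envelope calculation in Python. The update_schedule helper, and the assumption that each buffer is consumed in num_epoch passes of batch_size minibatches, are illustrative; the exact bookkeeping inside the ML-Agents trainer may differ:

# Rough illustration of how the buffer hyperparameters relate to each policy
# update; treat the numbers as ratios, not trainer internals.
def update_schedule(time_horizon, batch_size, buffer_size, num_epoch=3):
    trajectories_per_update = buffer_size / time_horizon  # experience slices gathered
    minibatches_per_epoch = buffer_size / batch_size      # minibatches in one pass
    gradient_steps = num_epoch * minibatches_per_epoch    # gradient steps per update
    return trajectories_per_update, minibatches_per_epoch, gradient_steps

print(update_schedule(1000, 2024, 20240))  # original values -> (20.24, 10.0, 30.0)
print(update_schedule(2000, 4048, 40480))  # doubled values  -> (20.24, 10.0, 30.0)

Notice that the ratios are unchanged; what doubles is the raw number of experiences behind each policy update, and each update now arrives after twice as many environment steps. That is exactly the trade-off described next: smoother learning curves, but slower progress over the same number of iterations.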

One thing that will become immediately obvious is how much more stably the agent trains, meaning its mean reward progresses more steadily and jumps around less. Recall that we want to avoid jumps, spikes, or wobbles in training, as these can indicate poor convergence on the part of the network's optimization method. More gradual changes are generally better and indicate good training performance. By doubling time_horizon and its associated parameters, we doubled the number of experiences the agent learns from between updates. This, in turn, has the effect of stabilizing training, but you likely noticed that the agent took longer to train over the same number of iterations.
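
If you want to put a number on "jumps around less" rather than eyeballing the reward curves, one simple option is to compare the rolling standard deviation of the mean reward reported at each summary interval. The helper below is an illustrative sketch, not part of ML-Agents, and the two reward series are made up for the example:

import statistics

def rolling_std(rewards, window=5):
    """Rolling standard deviation of a mean-reward series; lower values
    indicate smoother, more stable training progress."""
    return [
        statistics.pstdev(rewards[i:i + window])
        for i in range(len(rewards) - window + 1)
    ]

# Made-up mean-reward curves purely for illustration.
baseline_run = [0.1, 0.5, 0.2, 0.9, 0.4, 1.2, 0.7, 1.6]
doubled_run = [0.1, 0.3, 0.5, 0.6, 0.8, 0.9, 1.1, 1.3]

print(statistics.mean(rolling_std(baseline_run)))  # higher: noisier training
print(statistics.mean(rolling_std(doubled_run)))   # lower: steadier training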

Partially observable RL algorithms are classed as policy-based, model-free, or off-model, and are a foundation for PPO. In the next section, we will look at improvements to RL that better handle the additional complexities of continuous action spaces.
