Creating a DAgger loop

It's now time to set up the core of the DAgger algorithm. The outline was already given as pseudocode in The DAgger algorithm section, but let's take a more in-depth look at how it works:

  1. Initialize the dataset, which is composed of two lists, X and y, that will hold the visited states and the expert's target actions. We also initialize the environment:
    import numpy as np
    import tensorflow as tf

    from ple.games.flappybird import FlappyBird
    from ple import PLE

    X = []
    y = []

    env = FlappyBird()
    env = PLE(env, fps=30, display_screen=False)
    env.init()
  2. Iterate over the DAgger iterations. At the beginning of every DAgger iteration, we have to reinitialize the learner's network variables (because, on every iteration, we retrain the learner from scratch on the new aggregated dataset), reset the environment, and run a number of random actions with the no_op helper (sketched after this list). Starting each game with a few random actions adds a stochastic component to the otherwise deterministic environment, which results in a more robust policy:
    for it in range(dagger_iterations):
        sess.run(tf.global_variables_initializer())
        env.reset_game()
        no_op(env)

        game_rew = 0
        rewards = []

  3. Collect new data by interacting with the environment. As we said previously, in the first iterations the expert chooses the actions, by calling expert_policy, but in the following iterations the learner progressively takes control. The learned policy is executed by the learner_policy function (minimal sketches of expert_policy and learner_policy are given after this list). The dataset is built by appending the current state of the game to X (the input variable) and the action that the expert would have taken in that state to y (the output variable). When the game is over, the game is reset and game_rew is set back to 0. The code is as follows:
        for _ in range(step_iterations):
            state = flappy_game_state(env)

            if np.random.rand() < (1 - it/5):
                action = expert_policy(state)
            else:
                action = learner_policy(state)

            action = 119 if action == 1 else None

            rew = env.act(action)
            rew += env.act(action)

            X.append(state)
            y.append(expert_policy(state))
            game_rew += rew

            if env.game_over():
                env.reset_game()
                no_op(env)

                rewards.append(game_rew)
                game_rew = 0

Note that each action is performed twice. This is done to reduce the number of actions per second from 30 to 15, as required by the environment.

  4. Train the new policy on the aggregated dataset. The pipeline is standard: the dataset is shuffled and divided into mini-batches of size batch_size. Then, the optimization is repeated for a number of epochs equal to train_epochs, running p_opt on each mini-batch. This is done with the following code:
        n_batches = int(np.floor(len(X)/batch_size))

        shuffle = np.arange(len(X))
        np.random.shuffle(shuffle)
        shuffled_X = np.array(X)[shuffle]
        shuffled_y = np.array(y)[shuffle]

        ep_loss = []
        for _ in range(train_epochs):
            for b in range(n_batches):
                p_start = b*batch_size
                tr_loss, _ = sess.run([p_loss, p_opt], feed_dict={
                    obs_ph: shuffled_X[p_start:p_start+batch_size],
                    act_ph: shuffled_y[p_start:p_start+batch_size]})

                ep_loss.append(tr_loss)
        print('Ep:', it, np.mean(ep_loss), 'Test:', np.mean(test_agent(learner_policy)))
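
The loop relies on a few helper functions (no_op, flappy_game_state, expert_policy, and learner_policy) that are defined elsewhere in the chapter. The following is a minimal sketch of how they could look, not the exact implementation: max_steps is an arbitrary choice, and expert_sess, expert_obs_ph, expert_action, and p_logits are hypothetical names for the expert's restored session, its input placeholder, its greedy-action op, and the learner's output logits:

    def no_op(env, max_steps=30):
        # Execute a random number of random actions (flap or do nothing) so
        # that every game starts from a slightly different state.
        # max_steps is an arbitrary choice, not the value used in the chapter.
        for _ in range(np.random.randint(1, max_steps)):
            env.act(119 if np.random.rand() < 0.5 else None)

    def flappy_game_state(env):
        # Flatten the dictionary returned by PLE's getGameState() into a
        # NumPy array that the networks can consume (no normalization here).
        state = env.getGameState()
        return np.array([state[k] for k in sorted(state)], dtype=np.float32)

    def expert_policy(state):
        # The expert is assumed to be a network trained beforehand and restored
        # in its own session; expert_sess, expert_obs_ph, and expert_action are
        # hypothetical names for that session, its input placeholder, and its
        # greedy-action op. The function returns 0 (do nothing) or 1 (flap).
        act = expert_sess.run(expert_action, feed_dict={expert_obs_ph: [state]})
        return int(np.squeeze(act))

    def learner_policy(state):
        # The learner returns the greedy action of the network currently being
        # trained; p_logits is a hypothetical name for its output logits, while
        # sess and obs_ph are the same session and placeholder used in the loop.
        logits = sess.run(p_logits, feed_dict={obs_ph: [state]})
        return int(np.argmax(logits, axis=1)[0])

With these definitions in place, the data collection part of the loop runs end to end, as long as the learner's graph (obs_ph, act_ph, p_loss, p_opt, and sess) has been built beforehand.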

test_agent tests learner_policy on a few games to understand how well the learner is performing.
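
test_agent is also defined outside this section. As a reference, here is a minimal sketch that simply replays a few complete games with the given policy, reusing the global env, and returns the total reward of each game; num_games and the step cap are arbitrary choices, not the values used in the chapter:

    def test_agent(policy, num_games=5):
        # Play a few games with the given policy (for example, learner_policy)
        # on the global env and return the total reward obtained in each game.
        games_rew = []
        for _ in range(num_games):
            env.reset_game()
            no_op(env)

            game_rew = 0
            steps = 0
            # the step cap is just a safety limit for very good policies
            while not env.game_over() and steps < 10000:
                state = flappy_game_state(env)
                action = 119 if policy(state) == 1 else None

                # act twice, as in the training loop
                game_rew += env.act(action)
                game_rew += env.act(action)
                steps += 1

            games_rew.append(game_rew)

        return games_rew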
