Policy gradients in AlphaGo

For AlphaGo, using policy gradients, the network was set up to play games against itself. It received a reward of 0 at every time step until the final one, where the game is either won or lost, giving a reward of 1 or -1. This final reward is then applied to every time step of the game, and the network is trained using policy gradients in the same way as in our Tic-tac-toe example. To prevent overfitting, games were played against a randomly selected previous version of the network. If the network constantly plays against its current self, the risk is that it could end up with very niche strategies that would not work against varied opponents, a local minimum of sorts.
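A minimal sketch of this self-play loop might look like the following. Here play_game, train_policy_gradient, current_network, num_iterations, and snapshot_interval are all hypothetical stand-ins for illustration, not the actual AlphaGo training code:

import copy
import random

# Pool of earlier snapshots of the network to draw opponents from
opponent_pool = [copy.deepcopy(current_network)]

for iteration in range(num_iterations):
    # Play against a randomly selected previous version of the network
    opponent = random.choice(opponent_pool)

    # play_game returns the states seen and moves chosen by current_network,
    # plus the final result from its perspective: +1 for a win, -1 for a loss
    states, actions, final_reward = play_game(current_network, opponent)

    # The final reward is applied to every time step of the game
    rewards = [final_reward] * len(actions)

    # Standard policy-gradient update, as in the Tic-tac-toe example
    train_policy_gradient(current_network, states, actions, rewards)

    # Periodically add the current weights to the opponent pool
    if (iteration + 1) % snapshot_interval == 0:
        opponent_pool.append(copy.deepcopy(current_network))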

Building the initial supervised learning network that predicted the most likely moves by human players allowed AlphaGo to massively reduce the breadth of the search it needs to perform in MCTS, giving a much more accurate evaluation per rollout. The problem is that running a large, many-layered neural network is very slow compared to simply selecting a random action. In a Monte-Carlo rollout, we need to select around 100 moves on average, and we want to do this on the order of hundreds of thousands of times to evaluate a position. This makes using the network this way impractical; we need to find a way to reduce our computation time.
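To get a feel for the scale of the problem (the timings here are rough assumptions for illustration, not measurements from the paper): 100 moves per rollout across 100,000 rollouts is about 10 million move selections for a single position evaluation. At a few milliseconds per forward pass of a large network, that works out to hours of computation; at roughly a microsecond per random move, it is on the order of seconds.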

If we always take the best move selected by our network, instead of sampling a move according to the probabilities of its output, then our network is deterministic. Given a position on the board, the result reached from that position will also be deterministic: when played out using the best moves from the network, the position is either a winning one for white or black, or a draw. This result is the value of the position under the network's optimal policy. Because the result is deterministic, we can train a new deep neural network to learn the value of a position directly. If it performs well, a position can be evaluated accurately with just one pass of the neural network, rather than one pass for every move of a rollout.

A final supervised network is created using the same structure as the previous networks, except that this time the final output, rather than being a probability distribution over moves across the board, is a single node representing the expected result of the game: a win for white, a win for black, or a draw.

The loss function for this network is the mean squared error between its output and the result achieved by the reinforcement learning network. It was found after training that the value network could achieve a mean squared error of just 0.226 and 0.234 on the training and test sets, respectively. This indicated that it could learn the result with good accuracy.
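A simplified sketch of how such a value network might be trained is shown below. It uses Keras purely for illustration and small dense layers rather than the actual AlphaGo architecture; positions.npy and results.npy are placeholder files standing in for a dataset of encoded board positions paired with the result reached when the RL network plays each position out deterministically:

import numpy as np
import tensorflow as tf

# positions: encoded board positions; results: +1 (white win), -1 (black win),
# or 0 (draw), as achieved by the RL network's deterministic play
positions = np.load("positions.npy")
results = np.load("results.npy")

value_network = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=positions.shape[1:]),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    # A single output node in [-1, 1] for the expected result of the game
    tf.keras.layers.Dense(1, activation="tanh"),
])

# Mean squared error between the prediction and the achieved result
value_network.compile(optimizer="adam", loss="mse")
value_network.fit(positions, results, epochs=10, batch_size=128)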

To recap, at this point, AlphaGo has three differently trained deep neural networks:

  • SL: This is a network trained using supervised learning to predict the probability of a human move from a board position.
  • RL: This network initially used the weights from the SL network, but was then further trained using reinforcement learning to choose the best move from a given position.
  • V: This is a network, again trained with supervised learning, that learns the expected result of a position when it is played out using the RL network. It provides the value of the state.

For the real games against Lee Sedol, AlphaGo used a variant of the MCTS-UCT that we introduced earlier. When a rollout was simulated from an MCTS leaf, rather than selecting moves randomly, they were selected using another, much smaller, single-layer network. This network, called the fast rollout policy, used a softmax classifier across all possible moves, where the input was the 3 x 3 color pattern around the move and a collection of handcrafted features, such as the liberty count. In our example, this corresponds to the following line:

move = random.choice(list(move_states.keys()))

This could be replaced with something like this:

probability_of_move = fast_rollout_policy.run(board_state)
# Sample a move index according to the probabilities given by the network
move = np.random.choice(len(probability_of_move), p=probability_of_move)

This small network was used to run the Monte-Carlo rollout. The SL network would almost certainly have been better, but would have been prohibitively slow.
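Putting this together, a single Monte-Carlo rollout from a leaf might look roughly like the following sketch, where is_terminal, apply_move, and result_for are hypothetical helpers standing in for the real game logic:

import numpy as np

def simulate_rollout(board_state):
    """Play a game to the end from board_state using the fast rollout policy."""
    state = board_state
    while not is_terminal(state):
        # Probability distribution over all possible moves from the small network
        probability_of_move = fast_rollout_policy.run(state)
        # Sample a move index according to those probabilities
        move = np.random.choice(len(probability_of_move), p=probability_of_move)
        state = apply_move(state, move)
    # +1 for a win, -1 for a loss, 0 for a draw, from the rollout player's view
    return result_for(state)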

When evaluating the success value of a rollout from a leaf, the score was determined using a combination of the result from the fast rollout policy and the score given by the V network. A mixing parameter λ was used to determine the relative weights of these:

value(s) = (1 - λ)V(s) + λf

Here, s is the state of the leaf and f is the result of the rollout using the fast rollout policy. After experimenting with a wide range of values for λ, it was found that 0.5 yielded the best results, suggesting that the two methods of evaluation are complementary.
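In code, the leaf evaluation could be combined along these lines; value_network.run is a hypothetical interface mirroring fast_rollout_policy.run, and simulate_rollout is the rollout sketch from earlier:

LAMBDA = 0.5  # mixing parameter; 0.5 was found to give the best results

def evaluate_leaf(leaf_state):
    # Score from the value network, one forward pass per position
    v = value_network.run(leaf_state)
    # Result of a rollout played out with the fast rollout policy
    f = simulate_rollout(leaf_state)
    # Weighted combination of the two estimates
    return (1 - LAMBDA) * v + LAMBDA * f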

The five-game series between Lee Sedol and AlphaGo started on March 9, 2016, in front of a large audience, with a $1,000,000 prize for the winner. Lee Sedol was very confident in the buildup, declaring, "I have heard that Google DeepMind's AI is surprisingly strong and getting stronger, but I am confident that I can win at least this time." Sadly for him, AlphaGo proceeded to win the first three games, each forcing a resignation. At this point, with the competition decided, he came back to win the fourth game, but lost the fifth, leaving the series at 4-1.

This was very significant progress on the part of AI, marking the first time that an AI had beaten a top human player at such a complex game. It raises all kinds of questions, such as in which other domains it might be possible to develop AI that can outperform the best humans. The match's full significance for humanity remains to be seen.
