Deep learning in Monte Carlo Tree Search

Even with MCTS-UCT, computers still could not come close to beating the best Go players. In 2016, however, a team from Google DeepMind developed an AI they called AlphaGo. It defeated Lee Sedol, one of the world's top Go players, in a five-game series, winning 4-1. They achieved this by using three improvements over the standard MCTS-UCT approach.

If we were to think about why MCTS is so inaccurate, an intuitive answer is that the moves used in the evaluation are selected at random, when we know that some moves are much more likely to be played than others. In Go, when there is a battle for control of a corner, the moves around that area are much better candidates for selection than moves on the opposite side of the board. If we had a good way of selecting which moves are likely to be played, we could massively reduce the breadth of our search and, by extension, increase the accuracy of our MCTS evaluations. If we go back to the preceding chess position, although every legal move can potentially be played, an opponent with even a little chess skill will only play the winning move, so evaluating the other moves is simply wasted CPU cycles.

This is where deep learning can help us. We can use the pattern-recognition abilities of a neural network to give us a rough estimate of the probability of a move being played from a given position. For AlphaGo, a 13-layer convolutional network with ReLU activation functions was used. The input to the network was the 19 x 19 board state, and its output was another 19 x 19 softmax layer representing the probability of a move being played on each square of the board. The network was trained on a large database of expert-level human Go games: it would be given a single position as input and the move played from that position as the target, with the loss being the cross-entropy between the network's output distribution and the human move that was made. Given enough training, the network learned to predict human moves with 57 percent accuracy against a test set. The use of a test set here is particularly important because overfitting is a big worry: unless the network can generalize its understanding to previously unseen positions, it is useless.
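To make this concrete, the following is a minimal PyTorch sketch of such a policy network. It is not AlphaGo's exact architecture: the real network used many input feature planes rather than the single-channel board encoding assumed here, and the layer widths, the PolicyNetwork name, and the toy training step are all illustrative assumptions.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A 13-layer convolutional policy network: board in, move logits out."""
    def __init__(self, board_size=19, channels=64, num_layers=13):
        super().__init__()
        layers = [nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, 1, kernel_size=1))  # one logit per board square
        self.conv = nn.Sequential(*layers)

    def forward(self, board):
        # board: (batch, 1, 19, 19) tensor encoding the position
        return self.conv(board).flatten(start_dim=1)  # (batch, 361) move logits

net = PolicyNetwork()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()  # softmax plus log-likelihood of the target move

# One training step on a (position, human move) pair from the expert games.
position = torch.zeros(1, 1, 19, 19)      # placeholder for an encoded position
human_move = torch.tensor([3 * 19 + 16])  # index of the square actually played
optimizer.zero_grad()
loss = loss_fn(net(position), human_move)
loss.backward()
optimizer.step()

# At play time, a softmax over the logits gives the move probabilities.
move_probs = torch.softmax(net(position), dim=1)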

If we wanted to implement something similar in our preceding Tic-tac-toe example, we would simply replace the move = random.choice(moves) line in the monte_carlo_sample method, or in the UCT version, with a move chosen by a trained neural network. This technique will work for any discrete game, provided you have a large training set of example games.
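As a rough sketch of what that replacement might look like, the helper below samples a rollout move in proportion to the network's output. The policy_network callable here is a hypothetical wrapper that returns the probability the trained network assigns to a single move from the current position; the exact interface will depend on how you encode the board for your network.

import random

def policy_guided_choice(board_state, moves, policy_network):
    """Sample a rollout move from the network's move probabilities instead of
    picking uniformly at random."""
    probabilities = [policy_network(board_state, move) for move in moves]
    total = sum(probabilities)
    if total == 0:
        return random.choice(moves)  # fall back to uniform if all moves get zero weight
    weights = [p / total for p in probabilities]
    return random.choices(moves, weights=weights, k=1)[0]

# Inside monte_carlo_sample (or the UCT rollout), instead of:
#     move = random.choice(moves)
# we would write something like:
#     move = policy_guided_choice(board_state, moves, policy_network)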

If you do not have a database of example games, there is another approach you can use. If you have an agent that plays with even a tiny degree of skill, you can use that agent to generate the initial collection of example games. A good approach, for instance, is to generate example positions and moves using the min-max or MCTS-UCT algorithms, and then train a network to play moves from that collection. This is a good way to get a network playing well enough that it can at least explore the space of the game with plausible moves, as opposed to completely random ones.
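A minimal sketch of that bootstrapping step is given below. The initial_state, available_moves, apply_move, and mcts_uct_move arguments are hypothetical stand-ins for whatever game interface and MCTS-UCT player your own implementation exposes.

def generate_training_games(initial_state, available_moves, apply_move,
                            mcts_uct_move, num_games=1000):
    """Let a weak MCTS-UCT agent play whole games and record every
    (position, chosen move) pair as a training example."""
    examples = []
    for _ in range(num_games):
        board_state = initial_state
        while available_moves(board_state):
            move = mcts_uct_move(board_state)            # move the UCT agent would play
            examples.append((board_state, move))
            board_state = apply_move(board_state, move)  # advance the game
    return examples

# The recorded (position, move) pairs are then used as inputs and targets to
# train the policy network, exactly as with the human expert games above.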

If we implement such a neural network and use it to select moves during the Monte Carlo rollouts, our evaluations will be much more accurate; but we will still suffer from the problem that MCTS evaluates averages, when what we actually care about is the best outcome achievable from the moves we make. This is where reinforcement learning can be introduced to improve our agent.
