Upper confidence bounds applied to trees

To recap, Min-Max gives us the actual best move in a position, given perfect information; MCTS, on the other hand, only gives an average value, though it allows us to work with much larger state spaces that cannot be evaluated with Min-Max. Is there a way we could improve MCTS so that it converges to the Min-Max result given enough evaluations? Yes, Monte Carlo Tree Search with Upper Confidence bounds applied to Trees (UCT) does exactly this. The idea behind it is to treat MCTS like a multiarmed bandit problem. In the multiarmed bandit problem, we have a group of slot machines—one-armed bandits—each with an unknown payout distribution and, therefore, an unknown average amount of money received per play. The payout for each machine is random, but the mean payout may vary significantly between machines. How should we determine which slot machines to play?

There are two factors that need to be considered when choosing a slot machine. The first is the obvious one, the exploitative value: the expected return that the given slot machine will output. To maximize the payout, we would always play the machine with the highest expected payout. The second is the explorative value: we want our choice of machine to increase the information we have about the payoffs of the different machines.

Say we play machine A three times and get payoffs of 13, 10, and 7, for an average payoff of 10. We also have machine B, which we have played once, getting a payoff of 9. In this case, it might still be preferable to play machine B: though its average payoff is lower, 9 versus 10, the fact that we have only played it once means the lower payout may have just been bad luck. If we play it again and get a payout of 13, our average for machine B rises to 11, and we should then switch to playing that machine for the best payout.

The multiarmed bandit problem has been widely studied within mathematics. If we can reframe our MCTS evaluation to look like a multiarmed bandit problem, we can take advantage of these well-developed theories. One way of thinking about it is, rather than seeing the problem as one of maximizing reward, to think of it as one of minimizing regret. Regret here is defined as the difference between the reward we get for the machine we play and the maximum possible reward we would have received had we known the best machine from the beginning. Suppose we follow a policy p(a) that chooses an action, and receives a reward, at each time step. The regret after t plays, given r* as the expected reward of the best possible action, is as follows:

$$\text{regret}(t) = t\,r^{*} - \sum_{i=1}^{t} r_{p(a_i)}$$

If we were to choose a policy of always picking the machine with the highest observed average reward, that machine may not be the true best one, so our regret would increase linearly with each play. Similarly, if we took a policy of always exploring in order to find the best machine, our regret would also increase linearly. What we want is a policy p(a) whose regret grows sublinearly with the number of plays.
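
To make the definition of regret concrete, here is a minimal sketch in Python (not taken from the book's code; the function name cumulative_regret is purely illustrative) that computes the regret after each play, given the rewards we actually received and the true mean reward r* of the best machine:

def cumulative_regret(observed_rewards, best_mean_reward):
    # Regret after t plays: t * r_star minus the total reward actually received.
    regrets = []
    total_reward = 0.
    for t, reward in enumerate(observed_rewards, start=1):
        total_reward += reward
        regrets.append(t * best_mean_reward - total_reward)
    return regrets

# Example: the best machine pays 0.6 on average, but we kept playing a worse one.
print(cumulative_regret([0.4, 0.5, 0.3, 0.4], best_mean_reward=0.6))

If we keep playing a suboptimal machine like this, the regret grows roughly linearly with the number of plays, which is exactly the linear growth described above.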

The best theoretical solution is to perform the search based on confidence intervals. A confidence interval is the range within which we expect the true mean to lie, with some probability. The guiding principle is optimism in the face of uncertainty: if we don't know something, we want to find it out. The confidence interval represents our uncertainty about the true mean of a given random variable. Selecting the option whose sample mean plus confidence interval is highest encourages us to explore the space of possibilities while also exploiting it at the same time.

For an i.i.d. random variable x in the range 0 to 1, over n samples, the probability that the true mean $E[x]$ is greater than the sample mean $\bar{x}_n$ plus a constant u is bounded by Hoeffding's inequality (Hoeffding, Wassily (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association):

$$P\left(E[x] > \bar{x}_n + u\right) \leq e^{-2nu^2}$$

We want to use this equation to find the upper confidence bound for each machine. $E[x]$, $\bar{x}_n$, and n are statistics we already have; what we need is a value for u. To find it, call the probability on the left side p and set it equal to the right side:

$$p = e^{-2nu^2}$$

We can rearrange this to express u in terms of n and p:

$$\ln p = -2nu^2$$
$$u^2 = \frac{-\ln p}{2n}$$
$$u = \sqrt{\frac{-\ln p}{2n}}$$

Now we want to choose a value for p so that our precision increases over time. If we set $p = n^{-4}$, then as n approaches infinity, the probability of the true mean falling outside our bound tends toward 0. Substituting this in, we can simplify u down to:

$$u = \sqrt{\frac{-\ln\left(n^{-4}\right)}{2n}} = \sqrt{\frac{2\ln n}{n}}$$
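
As a quick numerical check (not from the original text), we can see how this exploration term shrinks as the number of samples grows:

import math

# u = sqrt(2 * ln(n) / n) shrinks toward 0 as the sample count n grows,
# so the confidence bound tightens the more we sample.
for n in (10, 100, 1000, 10000):
    print(n, math.sqrt(2 * math.log(n) / n))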

The sample mean plus u is our upper confidence bound, and it gives us the UCB1 (Upper Confidence Bound) algorithm. We can substitute in the values from the multiarmed bandit problem we saw earlier, where $r_i$ is the sum of the rewards received from machine i, $n_i$ is the number of plays of machine i, and n is the total number of plays across all machines:

$$UCB1_i = \frac{r_i}{n_i} + \sqrt{\frac{2 \ln n}{n_i}}$$

We will always want to choose the machine that gives the highest score for this equation. If we do so, our regret will scale logarithmically with the number of plays, which is the theoretical best we can do. Using this equation for our action choice means we will try a range of machines early on, but the more we play a single machine, the more its exploration term shrinks relative to the others, which eventually encourages us to try a different machine.

It's also good to remember that an assumption at the beginning of this series of equations was that the values (x in the early equations, and r when we apply the result to the multiarmed bandit problem) lie in the range 0 to 1. If we are not working in this range, we need to scale our input. We have made no assumptions about the nature of the distribution, though; it could be Gaussian, binomial, and so on.
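
To make this concrete, here is a small sketch (not from the book's code) that applies the UCB1 equation to machines A and B from the earlier example, scaling the payouts into the 0 to 1 range by dividing by an assumed maximum payout of 20:

import math

def ucb1_score(total_reward, plays, total_plays):
    # Sample mean plus the UCB1 exploration term.
    return total_reward / plays + math.sqrt(2 * math.log(total_plays) / plays)

# Machine A: payouts 13, 10 and 7 over 3 plays; machine B: payout 9 over 1 play.
# Divide by an assumed maximum payout of 20 to bring rewards into the 0 to 1 range.
reward_a, plays_a = (13 + 10 + 7) / 20., 3
reward_b, plays_b = 9 / 20., 1
total_plays = plays_a + plays_b

print("A:", ucb1_score(reward_a, plays_a, total_plays))
print("B:", ucb1_score(reward_b, plays_b, total_plays))

Machine B ends up with the higher UCB1 score despite its lower sample mean, because its large exploration term reflects how little we know about it, matching the intuition from the earlier example.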

Now that we have an optimal solution to the problem of sampling from a set of unknown distributions, how do we apply it to MCTS? The simplest way is to treat only the first moves from the current board state as bandits, or slot machines. Though this would improve the estimation at the top level a little, every move beneath that would still be completely random, meaning the $r_i$ estimates would be very inaccurate.

Alternatively, we could treat every move at every branch of the tree as a multiarmed bandit problem. The issue with this is that if our tree is very deep, then as our evaluation goes deeper, we will reach positions we have never encountered before, so we would have no samples for the range of moves we need to choose between. We would also be keeping statistics for a huge range of positions, most of which would never be used.

The compromise solution, known as Upper Confidence bounds applied to Trees (UCT), is as follows. We perform successive rollouts from the current board state. At each branch of the tree, where we have a range of actions to choose from, if we have sample statistics for every potential move, we use the UCB1 algorithm to choose which action to take for the rollout. If we do not have sample statistics for every move, we choose the move randomly.

How do we decide which sample statistics to keep? For each rollout, we keep new statistics for the first position we encounter that we do not have previous statistics for. After the rollout is complete, we update the statistics for every position we are keeping track of. This way, we ignore all the positions deeper down the rollout. After x evaluations, we should have exactly x nodes of our tree, growing by one with each rollout. What's more, the nodes we keep track of are likely to be around the paths we are using the most, allowing us to increase our top-level evaluation accuracy by increasing the accuracy of the moves we evaluate further down the tree.

The steps are as follows:

  1. Start a rollout from the current board state. When you select a move, do the following:
    1. If you have statistics for every move from the current position, use the UCB1 algorithm to choose the move.
    2. Otherwise, choose the move randomly. If this is the first randomly chosen position, add it to the list of positions we are keeping statistics for.
  2. Run the rollout until you hit a terminal state, which will give you the result of this rollout.
  3. Update the statistics for every position you are keeping statistics for that you passed through in the rollout.
  4. Repeat until you reach the maximum number of rollouts. In the following diagram, the statistics for each position are shown in the square boxes:
    [Figure: the MCTS-UCT tree during a rollout, with the statistics for each tracked position shown in square boxes]
  5. The preceding diagram illustrates how this happens. In position A, statistics have been collected for all four possible moves. Because of this, the UCB1 algorithm can be used to select the move, balancing exploitative and explorative value. In the diagram, the leftmost move is chosen. This leads us to position B; here only two of the three possible moves have statistics collected for them, so the move for this rollout is selected randomly. By chance, the rightmost move is selected; the remaining moves are then selected randomly until we reach the final position C, where noughts wins. This result is then applied back to the tree, as shown in the following diagram:
    [Figure: the tree after the rollout, with the updated statistics for each tracked position]
  6. We update the statistics for every position we passed through that already had statistics, so the 1/2 in the first diagram now becomes 2/3. We also add statistics for the first position we encountered with no stats; here, it is the rightmost position in the second row, which now has a score of 1/1 because the nought player won. If this branch is selected again and we get to position D, the UCB1 algorithm can be used to select the move there, rather than making a random selection as before.
  7. Here is what this looks like in Python for our Tic-Tac-Toe game:
    import collections
    import math
    import random

    def upper_confidence_bounds(payout, samples_for_this_machine, log_total_samples):
        return (payout / samples_for_this_machine
                + math.sqrt((2 * log_total_samples) / samples_for_this_machine))

First, we need a method that calculates UCB1; this is the UCB1 equation in Python. The one difference is that here we take log_total_samples as an input, because it allows us to do a small optimization later on:

def monte_carlo_tree_search_uct(board_state, side, number_of_rollouts):
    state_results = collections.defaultdict(float)
    state_samples = collections.defaultdict(float)

Declare the method and the two dictionaries, namely state_results and state_samples. They will keep track of our statistics for the different board states we will encounter during the rollouts:

    for _ in range(number_of_rollouts):
        current_side = side
        current_board_state = board_state
        first_unvisited_node = True
        rollout_path = []
        result = 0

The main loop is what we go through for each rollout. At the beginning of each rollout, we need to initialize the variables that will track our progress within it. first_unvisited_node keeps track of whether we have yet created a new statistics-tracking node for this rollout. On encountering the first state for which we have no statistics, we create the new statistics node, adding it to the state_results and state_samples dictionaries, and then set the variable to False. rollout_path keeps track of each node we visit in this rollout for which we are keeping statistics. Once we obtain the result at the end of the rollout, we will update the statistics of all the states along the path:

        while result == 0:
            move_states = {move: apply_move(current_board_state, move, current_side)
                           for move in available_moves(current_board_state)}

            if not move_states:
                result = 0
                break

The while result == 0 statement puts us into the loop for a rollout; this will run until one side or the other wins. In each iteration of the rollout, we first construct a dictionary, move_states, mapping each available move to the state that move will put us into. If there are no moves to make, we are in a terminal state and the game is a draw, so we record that as the result and break out of the rollout loop:

            if all((state in state_samples) for state in move_states.values()):
                log_total_samples = math.log(sum(state_samples[s]
                                                 for s in move_states.values()))
                # Choose the (move, state) pair with the highest UCB1 score.
                move, state = max(move_states.items(),
                                  key=lambda item: upper_confidence_bounds(
                                      state_results[item[1]],
                                      state_samples[item[1]],
                                      log_total_samples))
            else:
                move = random.choice(list(move_states.keys()))

Now we need to choose which move we are going to take at this step of the rollout. As specified by the MCTS-UCT algorithm, if we have statistics for every possible move, we choose the move with the best upper_confidence_bounds score; otherwise, we make the selection randomly:

            current_board_state = move_states[move]

Now that we have selected our move, we can update current_board_state to the state that the move puts us in:

            if first_unvisited_node:
                rollout_path.append((current_board_state, current_side))
                if current_board_state not in state_samples:
                    first_unvisited_node = False

Now we need to check whether we have hit the end of our MCTS-UCT tree. We will add every node we visit up to the first previously unvisited node to rollout_path. We will update the statistics of all these nodes once we get our result from this rollout:

            current_side = -current_side
            result = has_winner(current_board_state)

We are at the end of our rollout loop, so switch the sides for the next iteration and check whether anyone has won in the current state. If so, it will cause us to break out of the rollout loop when we pop back to the while result == 0 statement:

        for path_board_state, path_side in rollout_path:
            state_samples[path_board_state] += 1.
            # Scale the result from the -1 to 1 range into the 0 to 1 range
            # expected by UCB1, from the perspective of this node's side.
            state_results[path_board_state] += result * path_side / 2. + .5

Now that we have completed a single rollout and left the rollout loop, we need to update our statistics with the result. rollout_path contains path_board_state and path_side for each node we want to update, so we go through every entry in it. There are two final points to make. First, the results from our game are between -1 and 1, but the UCB1 algorithm expects payouts between 0 and 1; the expression result * path_side / 2. + .5 handles this. Second, we also need to flip the result to represent the side each node is for: a good move for my opponent is the opposite of a good move for me:

    move_states = {move: apply_move(board_state, move, side) for move in available_moves(board_state)}

    move = max(move_states, key=lambda x: state_results[move_states[x]] / state_samples[move_states[x]])

    return state_results[move_states[move]] / state_samples[move_states[move]], move

Finally, once we have done the required number of rollouts, we can choose the best move from the current state based on the best expected payout. There is no longer any need to use UCB1 here; because this is the final decision, there is no value in doing any extra exploration, and the best move is simply the one with the best mean payout.
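
As a quick illustration of how the finished function might be called (this snippet is not from the book; it assumes the board is represented, as in the book's earlier Tic-Tac-Toe code, as a 3x3 tuple of tuples with 0 for an empty square, 1 for crosses, and -1 for noughts):

# Hypothetical usage: an empty board with the crosses player (side = 1) to move.
board_state = ((0, 0, 0),
               (0, 0, 0),
               (0, 0, 0))

expected_payout, best_move = monte_carlo_tree_search_uct(board_state, side=1,
                                                         number_of_rollouts=1000)
print(best_move, expected_payout)

The returned move can then be applied to the board and play handed over to the opponent.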

This is the MCTS-UCT algorithm. There are many different variants of it with different advantages in specific situations, but they all share this core logic. MCTS-UCT gives us a general way to judge moves in games with vast search spaces, such as Go. It isn't limited to games of perfect information either; it can often perform well in games with partially observed states, such as poker, and, even more generally, in any problem we can reconfigure to fit it; for example, it has been used as the basis for automated theorem proving.
