I'm trying to debug my MCTS implementation for TicTacToe (it doesn't block obvious wins for the opponent). I was wondering what the algorithm should do if it expands to a node which is a game over state. Should it continue to "simulate" that node and back-propogate the results up the tree or just ignore it if the node is chosen.
- The game over states are the only states that give you a score in tic tac toe: win, lose, or draw. Back propogating that information is important.candied_orange– candied_orange2016-12-10 14:39:05 +00:00Commented Dec 10, 2016 at 14:39
- @CandiedOrange I understand that but I was wondering whether there was a difference between reaching a terminal state through random playout (not reflected in the game tree) vs through expansion of the tree.Amja– Amja2016-12-10 17:00:19 +00:00Commented Dec 10, 2016 at 17:00
1 Answer
The basic variant of MCTS doesn't require a special management for terminal states.
It just updates the score / number of visits depending on the result (win / lose) and proceeds with back-propagation.
It's a sort of istantaneous playout/simulation and cannot be skipped (allowing MCTS to converge to the game-theoretical value).
The algorithm can be improved (e.g. MONTE-CARLO TREE SEARCH SOLVER) using large scores (say +∞ / -∞) and ad hoc backpropagation / selection schemes:
A special provision is then taken when backing such proven values up the tree. There are three cases to consider as shown in Fig. (we use the negamax formulation, alternating signs between levels).
First, when a simulation backs up a proven loss (−∞) from a child
cto a parentp, the parent nodepbecomes, and is labelled as, a proven win (∞), that is, the position is won for the player atpbecause the move played leads to a win (left backup diagram in the figure).When backing up a proven win (∞) from
ctop, one must, however, also look at the other children ofpto determinep’s value. In the second case, when all child nodes ofpare also a proven win (∞), then the value ofpbecomes a proven loss (-∞), because all moves lead to a position lost forp(middle backup diagram in the figure).However, the third case occurs if there exists at least one child with a value different value from a proven win. Then we cannot label
pas a proven loss. Insteadpgets updates as if a simulation win (instead of a proven win) were being backed up from nodec(right backup diagram in the figure;vanduindicate non-proven values). Non-proven values are backed up as in regular MCTS.
1) Further details in Monte-Carlo Tree Search in Lines of Action by Mark H.M. Winands, Yngvi Bjornsson and Jahn-Takeshi Saito.
