蒙特卡洛树搜索（Monte Carlo Tree Search，简称MCTS）是一种用于决策过程的模拟和搜索算法，特别适用于解决复杂的顺序决策问题。以下是MCTS的几个核心步骤：

选择（Selection）：从根节点开始，根据某种策略（如UCB1公式），选择一条路径直到达到一个未扩展的节点。
拓展（Expansion）：在当前选中的节点上添加一个新的子节点。
模拟（Simulation）：从新扩展的节点开始进行随机模拟（即走子），直到游戏结束，并记录结果。
反向传播（Backpropagation）：将模拟的结果回传，更新沿途节点的统计信息。

值得一提的是，MCTS通过不断重复上述过程，逐渐构建出一个代表不同行动及其结果概率的搜索树。最终，算法会选择具有最高胜率的行动作为推荐的行动。

此外，MCTS与传统的博弈树搜索相比，它不需要构建整个博弈树，而是通过模拟来评估局面，这使得它能够处理更大规模的搜索空间。然而，MCTS也存在一些局限性，比如对于非零和游戏的适应性、在面对随机性或不完全信息时的表现等问题。尽管如此，MCTS仍然是现代人工智能领域的一个重要工具，尤其是在棋类游戏中的应用，如AlphaGo/Zero等著名程序就是基于MCTS的变体。

Background

包含决策过程（decision theory）、博弈论（game theory）、Monte Carlo 以及 bandit-based methods。

Decision Theory

本质是马尔可夫决策决策过程（Markov Decision Processes，简记为 MDPs），通过一系列状态-动作-回报系列完成对于决策过程的刻画。

部分观测的MDP（Partially Observable MDPs，简记为POMDP），每次获得的状态只是整体状态的一部。

Game Theory

Monte Carlo Methods

Bandit-Based Methods

多臂老虎机问题（Multi-armed Bandit problems）是知名的系列决策问题。该问题的难点在于平衡探索与开发（Exploration and Exploitation），以推荐系统为例，在已经知道用户兴趣点之后需要进行“开发”，但是现有的兴趣点可能不是用户最感兴趣的，或者该兴趣点会随着时间改变，因此需要进行“探索”，不断试探用户新的感兴趣内容。在Spin Glass求基态问题中，表现为是在这个结构中继续降低能量，还是选择升高温度跳出当前局域范围进行探索。Bandit 算法用于解决该问题。

一种有效的方案是基于上置信界（upper confidence bound 简记UCB）。最简单的UCB算法为UCB1。

细致平衡和这个感觉类似，这两者是否有关系？

Monte Carlo Tree Search

MCTS依靠于两个基本概念：动作的真实回报，可以依靠随机模拟近似；这些回报可以有效的调整策略去接近最优策略。

Algorithm

基础算法通过迭代构建树，从根节点出发搜索子节点，然后选取最好的子节点作为下一次搜索的根节点。在这个迭代过程中有4个典型步骤： 1) 选择：从根节点开始递归应用子节点选择策略遍历整个树，直到达到收敛节点。 2) 探索：产生动作，将更多节点添加进树中。 3) 模拟评估：通过策略模拟给出一个节点的评估分数。 4) 反向传播：通树结构反向传播模拟评估结果。

采用以下两个直接的策略： 1) Tree policy: 从搜索树中选择或者探索一个叶子节点。如何平衡探索与贪心选择，是这个算法的关键。 2) Default policy: 模拟评估非终止状态的分数。这里一直没有明白，算法并没有对该模型的知识，如何评估一个局面的好坏呢？如果已经对这个问题已经有了很好的了解，那直接贪心就可以了，为什么还要有MCTS？

伪代码为：

v₀是根节点s₀的评分，其中v_l是叶子节点s_l的评分，Δ是从探索节点返回的值。

上面展示了一次过程的示意图。

Development

Tree policy

Upper Confidence Bound for Trees (UCT)算法是Tree policy中的一种，用于解决exploration–exploitation dilemma.

在子节点j去选择最大化： $$\begin{align} \mathrm{UCT}=\bar{X}_j+2 C_p \sqrt{\frac{2 \ln n}{n_j}} \label{UCB} \end{align}$$

其中 n 是当前节点被遍历过的次数，n_j是子节点j被遍历过的次数，并且限制C_p > 0。如果超过一个子节点有相同的最大值，通常随机选择。X_i, t和X̄_j是选择j的平均回报，值在范围[0, 1]上。通常设置n_j = 0用于鼓励探索未曾探索过的区域。C_p可以用来调节探索的倾向，有很多文献对这个的选择有很多讨论。

$\eqref{UCB}$存在这样的一种思想，如果子节点都探索过，倾向于选择收益高的节点；如果子节点存在没有探索过的节点，倾向于选择没有探索过的节点。

Default policy

In the simplest case, this default policy is uniformly random.

最简单的处理方案是随机产生。

看这部分算法原本目的是想要了解这部分内容，但是现在发现MCTS的核心思想在于如何平衡探索和开发。这部分策略本质上是评价函数，如何构建评价函数，从而可以对状态进行评分。这其实是强化学习的任务。MCTS的任务是在知道这部分内容的前提下，将评估函数很好的与实际应用结合。

该函数的设计本质上是一个很难的问题。综述中给出了如下的设计提升方案： * Rule-Based Simulation Policy * Contextual Monte Carlo Search * Fill the Board * Learning a Simulation Policy * Using History Heuristics * Evaluation Function * Simulation Balancing * Last Good Reply * Patterns

Code

以上式算法的伪代码。

下面分析GitHub上关于MCTS的实现，代码地址。

首先导入相关的包：

import logging
import math

import numpy as np

EPS = 1e-8

log = logging.getLogger(__name__)

接下来创建名为MCTS的类：

class MCTS():
    """
    This class handles the MCTS tree.
    """

    def __init__(self, game, nnet, args):
        self.game = game
        self.nnet = nnet
        self.args = args
        # 以下内容全部用字典表示，key是状态对应的字符串
        # 存储状态的Q值
        self.Qsa = {}  # stores Q values for s,a (as defined in the paper)
        # 存储已经探索过状态-动作对的次数
        self.Nsa = {}  # stores #times edge s,a was visited
        # 存储已经探索过状态的次数
        self.Ns = {}  # stores #times board s was visited
        # 存储初始策略
        self.Ps = {}  # stores initial policy (returned by neural net)

        # 存储游戏终止状态
        self.Es = {}  # stores game.getGameEnded ended for board s
        # 存储该状态下接下来合法移动
        self.Vs = {}  # stores game.getValidMoves for board s

以下search与getActionProb函数，均是MCTS类中的函数。

search函数执行一次搜索过程，从一个节点开始，不断探索子节点，直到达到终态。每一步选择最大UCB的过程。

class MCTS():
    def __init__(self, game, nnet, args):
        ...

    def search(self, canonicalBoard):
        """
        This function performs one iteration of MCTS. It is recursively called
        till a leaf node is found. The action chosen at each node is one that
        has the maximum upper confidence bound as in the paper.

        Once a leaf node is found, the neural network is called to return an
        initial policy P and a value v for the state. This value is propagated
        up the search path. In case the leaf node is a terminal state, the
        outcome is propagated up the search path. The values of Ns, Nsa, Qsa are
        updated.

        NOTE: the return values are the negative of the value of the current
        state. This is done since v is in [-1,1] and if v is the value of a
        state for the current player, then its value is -v for the other player.

        Returns:
            v: the negative of the value of the current canonicalBoard
        """

        # 将状态转化为字符串的形式
        s = self.game.stringRepresentation(canonicalBoard)

        # 如果状态不在之前是否记录表中，则重新检测是否为终态。
        if s not in self.Es:
            # 是终态返回 1 or -1（表示胜负结果），不是终态返回 0
            self.Es[s] = self.game.getGameEnded(canonicalBoard, 1)

        # 如果已经终止，直接返回结果
        if self.Es[s] != 0:
            # terminal node
            return -self.Es[s]

        # 如果不在策略的记录表中，产生初始策略
        if s not in self.Ps:
            # leaf node
            # 神经网络评估该状态价值
            self.Ps[s], v = self.nnet.predict(canonicalBoard)
            # 产生所有合法移动
            valids = self.game.getValidMoves(canonicalBoard, 1)
            # 遮盖不合法移动
            self.Ps[s] = self.Ps[s] * valids  # masking invalid moves
            sum_Ps_s = np.sum(self.Ps[s])
            if sum_Ps_s > 0:
                self.Ps[s] /= sum_Ps_s  # renormalize
            else:
                # if all valid moves were masked make all valid moves equally probable

                # NB! All valid moves may be masked if either your NNet architecture is insufficient or you've get overfitting or something else.
                # If you have got dozens or hundreds of these messages you should pay attention to your NNet and/or training process.   
                log.error("All valid moves were masked, doing a workaround.")
                self.Ps[s] = self.Ps[s] + valids
                self.Ps[s] /= np.sum(self.Ps[s])

            self.Vs[s] = valids
            self.Ns[s] = 0
            return -v

        # 给予当前状态的合法移动
        valids = self.Vs[s]
        # 初始化当前最好UCB，当前最好选择
        cur_best = -float('inf')
        best_act = -1

        # pick the action with the highest upper confidence bound
        # 选择UCB最高的动作，返回所有可能的动作状态数
        for a in range(self.game.getActionSize()):
            if valids[a]:   #检测是否合法
                # 计算UCB
                if (s, a) in self.Qsa:
                    u = self.Qsa[(s, a)] + self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s]) / (
                            1 + self.Nsa[(s, a)])
                else:
                    u = self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s] + EPS)  # Q = 0 ?

                if u > cur_best:
                    cur_best = u
                    best_act = a

        # 选择子节点并更新状态
        a = best_act
        next_s, next_player = self.game.getNextState(canonicalBoard, 1, a)
        next_s = self.game.getCanonicalForm(next_s, next_player)

        # 子节点进行迭代
        v = self.search(next_s)

        # 更新X与Nsa
        if (s, a) in self.Qsa:
            self.Qsa[(s, a)] = (self.Nsa[(s, a)] * self.Qsa[(s, a)] + v) / (self.Nsa[(s, a)] + 1)
            self.Nsa[(s, a)] += 1

        else:
            self.Qsa[(s, a)] = v
            self.Nsa[(s, a)] = 1

        self.Ns[s] += 1
        return -v

getActionProb为MCTS的每一步搜索

class MCTS():
    def __init__(self, game, nnet, args):
        ...

    # 温度相当于给予不同动作一个概率，如果temp=1则直接贪心选择 
    def getActionProb(self, canonicalBoard, temp=1):
        """
        This function performs numMCTSSims simulations of MCTS starting from
        canonicalBoard.

        Returns:
            probs: a policy vector where the probability of the ith action is
                   proportional to Nsa[(s,a)]**(1./temp)
        """
        # 进行多次模拟
        for i in range(self.args.numMCTSSims):
            self.search(canonicalBoard)

        # 将状态编码为字符串
        s = self.game.stringRepresentation(canonicalBoard)

        # 遍历当前状态下，所有合法动作已经探索过的次数
        counts = [self.Nsa[(s, a)] if (s, a) in self.Nsa else 0 for a in range(self.game.getActionSize())]

        # 如果温度为0则贪心的返回
        if temp == 0:
            bestAs = np.array(np.argwhere(counts == np.max(counts))).flatten()
            bestA = np.random.choice(bestAs)
            probs = [0] * len(counts)
            probs[bestA] = 1
            return probs

        # 温度非0,给予每个动作一个归一化的被选择几率值
        counts = [x ** (1. / temp) for x in counts]
        counts_sum = float(sum(counts))
        probs = [x / counts_sum for x in counts]
        return probs

renyixiong blog

A Survey of Monte Carlo Tree Search Methods