【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)

【原文作者及来源:Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676):354-359.

【此译文由COCO主要完成,对MarkDown编辑器正在熟悉过程中,因此,文章中相关公式存在问题,请见谅】

【原文】A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.

【翻译】人工智能的长期目标是后天自主学习,并且在一些具有挑战性的领域中实现超人的算法。最近,AlphaGo成为第一个在围棋中击败人类世界冠军的程序。AlphaGo的树搜索使用深度神经网络来评估棋局和选定下棋位置。神经网络是利用对人类专业棋手的移动进行监督学习,同时通过自我博弈进行强化学习来进行训练的。在这里,我们引入了一种没有人类的数据、指导或超越游戏规则的领域知识的、基于强化学习的算法。AlphaGo成为了自己的老师:神经网络被训练用来预测AlphaGo自己的落子选择和胜负。这种神经网络提高了树搜索的强度,从而提高了落子选择的质量和在下一次迭代中的自我博弈能力。从零开始,我们的新程序“AlphaGo Zero”取得了“超人”的成绩,以100比0战胜了的此前公布的AlphaGo版本(指代和李世石对弈的AlphaGo)。
【原文】Much progress towards artificial intelligence has been made using supervised learning systems that are trained to replicate the decisions of human experts . However, expert data sets are often expensive, unreliable or simply unavailable. Even when reliable data sets are available, they may impose a ceiling on the performance of systems trained in this manner . By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities, and to operate in domains where human expertise is lacking. Recently, there has been rapid progress towards this goal, using deep neural networks trained by reinforcement learning. These systems have outperformed humans in computer games, such as Atari  and 3D virtual environments . However, the most challenging domains in terms of human intellect—such as the game of Go, widely viewed as a grand challenge for artificial intelligence —require a precise and sophisticated looka head in vast search spaces. Fully general methods have not previously achieved human ­level performance in these domains.
【翻译】使用监督学习系统来做出与人类棋手一样的决策使人工智能取得了很大进展。然而,人类棋手的数据集通常是昂贵的、不可靠的或根本不可用的。即使在可靠的数据集可用时,人类的认知局限也可能对以这种方式训练的系统的性能施加上限。相比之下,强化学习系统是通过自己的经验训练的,原则上他们能够超越人的能力,并在缺乏人类知识的领域中运作。近年来,利用强化学习训练的深层神经网络在这一目标上取得了快速的进展。这些系统在电脑游戏如Atari和3D虚拟环境上已经超过了人类。但是,在人类智力方面最具挑战性的领域,如围棋领域,使用完全通用的方法没有办法实现与人类相媲美的性能。因为围棋被广泛视为是人工智能的一大挑战——它需要在庞大的搜索空间上进行精确和复杂的前瞻(预判,也就是我们所说的看几步棋)。
【原文】AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015. AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation. The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy­gradient reinforcement learning. The value network was trained to predict the winner of games played by the policy network against itself. Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree. A subsequent version, which we refer to as AlphaGo Lee, used a similar approach , and defeated Lee Sedol, the winner of international titles, in March 2016.
【翻译】AlphaGo是第一个在围棋比赛中实现超人表现的程序。之前发布的我们称之为AlphaGo Fan的版本,在2015年10月击败了欧洲冠军樊麾(法国国家围棋队总教练)。AlphaGo Fan使用两个深层神经网络:一个是策略网络,来输出下一步落子的概率;另一个是价值网络,来输出对棋局的评估,也就是落子的胜率。策略网络最初是通过监督学习来精确预测人类专业棋手的落子,随后又通过策略梯度强化学习对系统进行了增强。价值网络通过使用策略网络进行自我博弈来预测谁是赢家从而完成训练。一旦经过训练,这些网络结合蒙特卡洛树搜索(MCTS)提供对未来局势的预测。运用策略网络来缩小高概率落子的搜索过程,运用价值网络结合蒙特卡洛快速走子策略来评估树中的落子位置。随后开发的版本,我们称之为AlphaGo Lee,用类似的方法,在2016年3月击败具有国际冠军头衔的Lee Sedol(曾获18项国际冠军)。
【原文】Our program, AlphaGo Zero, differs from AlphaGo Fan and AlphaGo Lee in several important aspects. First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data. Second, it uses only the black and white stones from the board as input features. Third, it uses a single neural network, rather than separate policy and value networks. Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning. Further technical differences in the search algorithm, training procedure and network architecture are described in Methods.

【翻译】我们现在的程序AlphaGo Zero,与AlphaGo Fan和AlphaGo Lee存在以下几点的差异。首先,它完全由自我博弈强化学习进行训练,从刚开始的随机博弈开始,就没有任何监督或使用人类的数据。第二,它只使用棋盘上的黑白棋作为输入特征。第三,它使用单一的神经网络,而不是分离的策略网络和价值网络。最后,它使用了一个简化版搜索树,这个搜索树依靠单一的神经网络进行棋局评价和落子采样,不执行任何蒙特卡洛rollout。为了实现上述结果,我们引入一个新的强化学习算法,在训练过程中完成前向搜索,从而达到速的提高以及精确、稳定的学习过程。在搜索算法、训练过程和网络架构方面更多的技术差异在方法中进行了描述。

【原文】Reinforcement learning in AlphaGo Zero,Our new method uses a deep neural network    with parameters θ. This neural network takes as an input the raw board representation S of the position and its history, and outputs both move probabilities and a value, . The vector of move probabilities p represents the probability of selecting each move a (including pass) . The value v is a scalar evaluation, estimating the probability of the current player winning from position s. This neural network combines the roles of both policy network and value network into a single architecture. The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities(see Methods).

【翻译】我们在AlphaGo Zero的强化学习中,法使用一个参数为θ的深度神经网络。该神经网络将棋局和其历史的原始图作为输入,输出落子概率和价值 。落子概率向量p代表选择每个落子动作a(包括放弃行棋)的概率, 。价值v是标量评估,估计当前玩家在棋局状态为s时获胜的概率。这个神经网络将策略网络和价值网络合并成一个单一的体系结构。神经网络包括许多残差块、批量归一化和整流器非线性的卷积层。
【原文】The neural network in AlphaGo Zero is trained from games of self­play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network  . The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network  ; MCTS may therefore be viewed as a powerful policy improvement operator20,21. Self­play with search—using the improved MCTS­based policy to select each move, then using the game winner z as a sample of the value—may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators Article reSeArcH19 OcTObER2017 | VOL 550 | NATURE | 355repeatedly in a policy iteration procedure22,23: the neural network’s parameters are updated to make the move probabilities and value more closely match the improved search probabilities and self­play winner (π, z); these new parameters are used in the next iteration of self­play to make the search even stronger. Figure 1 illustrates the self­play training pipeline.

【翻译】AlphaGo Zero的神经网络是通过新的强化学习算法利用自我博弈训练出来的。在每一个棋局s,通过神经网络 的指导来执行蒙特卡洛搜索。MCTS搜索输出每次落子的概率分布π。经过搜索后的落子概率通常比神经网络 输出的落子概率p更强,因此MCTS被看作是一个强大的策略改进算法。带有搜索的自我博弈——采用改进的以MCTS为基础的策略来选择的每一次落子,然后用游戏的赢家z作为价值的样本——可以被看作是一个强有力的策略评估运算符。我们采用的强化学习算法的主要思想是在策略迭代过程中反复地利用这些搜索算子(文章research19 october2017 |卷550 |自然| 355);神经网络的参数被更新,使移动概率值 更紧密地与改进的搜索概率和自我博弈的赢家(π,z)相配;这些新的参数用于下一次的自我博弈迭代,以使搜索更强大。图1展示了自我博弈的训练流程。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第1张图片

【原文】Figure 1 | Self-play reinforcement learning in AlphaGo Zero.
【翻译】图一 AlphaGo Zero中的自我博弈清华学习

【原文】a, The program plays a game s1, ..., sT against itself. In each position st, an MCTS αθ is executed (see Fig. 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at∼πt. The terminal position st is scored according to the rules of the game to compute the game winner z. 
【翻译】a.这个程序进行自我博弈s1, ..., sT。在每个棋局st,执行一个使用最新的神经网络fθ 的MCTS αθ(见图2)。根据MCTS计算的搜索概率来选择落子,at∼πt。根据游戏规则在最终的棋局st记分,来计算比赛的胜出者z。
【原文】b, Neural network training in AlphaGo Zero. The neural network takes the raw board position st as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector pt, representing a probability distribution over moves, and a scalar value vt, representing the probability of the current player winning in position st. The neural network parameters θ are updated to maximize the similarity of the policy vector pt to the search probabilities πt, and to minimize the error between the predicted winner V t and the game winner z (see equation (1)). The new parameters are used in the next iteration of self­play as in a.
【翻译】b,AlphaGo Zero中的神经网络训练。神经网络以原始棋盘状态st作为输入,通过参数为θ的多个卷积层,输出代表落子概率分布的向量pt,和一个表示当前玩家在棋局状态st处胜率的标量值vt。神经网络参数θ朝着使策略矢量pt与搜索概率πt相似度最大化的方向更新,同时最大限度地减少预测赢家vt和游戏赢家z之间的误差(见公式(1))。如a所示,在下一次迭代中使用新的参数。
【原文】The MCTS uses the neural network f θ to guide its simulations (see Fig. 2). Each edge (s, a) in the search tree stores a prior probability P(s, a), a visit count N(s, a), and an action value Q(s, a). Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound Q(s, a) +U(s, a), where U(s, a) ∝P(s, a) / (1 +N(s, a)) (refs 12, 24), until a leaf node s′ is encountered. This leaf position is expanded and evaluated only once by the network to generate both prior probabilities and evaluation Each edge (s, a) traversed in the simulation is updated to increment its visit count N(s, a), and to update its action value to the mean evaluation over  these  simulations,  ,where s, a→s′ indicates that a simulation eventually reached s′after taking move a from position s.

【翻译】MCTS采用神经网络 来指导它的模拟(见图2)。搜索树中的每个边(s,a)存储先验概率p(s,a)、访问次数n(s,a)和一个动作价值Q(s,a)。每次模拟从根开始,反复选择落子,使置信上限Q(s,a)+ U(s,a)最大化,其中U(s,a)∝P(s,a)/(1 + N(s,a))(参考文献12, 24),直到遇到叶节点s′。叶子的位置被扩展,通过网络对该叶子的棋局进行扩展和评估,产生先验概率和价值 。在模拟中的每条边(s,a)被更新,访问数量N(s,a)增加,并且将其动作值更新为对这些模拟的平均评价, ,其中s,a→s’表示在从位置s移动a之后,模拟最终达到s’。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第2张图片

【原文】Figure 2 | MCTS in AlphaGo Zero.

   【翻译】图二  AlphaGo Zero的MCTS搜索

【原文】a, Each simulation traverses the tree by selecting the edge with maximum action value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed). 
【翻译】a,每个模拟通过选择具有最大动作值Q,加上取决于存储的先验概率p和该边的访问计数n的一个置信区间上限u(当遍历的时候递增)的边来对树进行遍历。
【原文】b, The leaf node is expanded and the associated position s is evaluated by the neural network (P(s, ·),V(s)) =fθ(s); the vector of P values are stored in the outgoing edges from s. 
【翻译】b、叶节点的扩展和对应棋局s的评价是由神经网络(P(S,·)、V(S))= Fθ(S)完成的;p值的向量存储在从s出发的外向边中。
【原文】c, Action value Q is updated to track the mean of all evaluations V in the subtree below that action. 
【翻译】c,更新动作价值Q,来跟踪那个落子动作下面的子树中所有评价V的平均值。
【原文】d, Once the search is complete, search probabilities π are returned, proportional to N1/τ, where N is the visit count of each move from the root state and τ is a parameter controlling temperature.
【翻译】d,一旦搜索完成后,返回搜索概率π,与N1 /τ成正比,其中N是从根开始的每个落子的访问次数,τ是温度控制参数。
【原文】MCTS may be viewed as a self­play algorithm that, given neural network parameters θ and a root position s, computes a vector of search probabilities recommending moves to play, π=αθ(s), proportional to the exponentiated visit count for each move, πa∝N(s, a)1/τ, where τ is a temperature parameter.
【翻译】蒙特卡洛可以看作是一个自我博弈算法,给出了神经网络参数θ和根的棋局状态s,计算搜索概率推荐的移动向量π=αθ(s),它与每次落子动作的访问计数的指数成正比,π∝N(S,A)1 /τ,其中τ是温度参数。

【原文】The neural network is trained by a self­play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights θ0. At each subsequent iteration i≥ 1, games of self­play are generated (Fig. 1a). At each time­step t, an MCTS search   is executed using the previous iteration of neural network θ−fi1 and a move is played by sampling the search probabilities πt. A game terminates at step T when both players pass, when the search value drops below a resignation threshold or when the game exceeds a maximum length; the game is then scored to give a final reward of rT∈  {−1,+1} (see Methods for details). The data for each time­step t is stored as , where zt=±rT is the game winner from the perspective of the current player at step t. In parallel (Fig. 1b), new network parameters θi are trained from data sampled uniformly among all time­steps of the last iteration(s) of self­play. The neural network is adjusted to minimize the error between the predicted value v and the self­play winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean­squared error and cross­entropy losses, respectively:


where c is a parameter controlling the level of L weight regularization (to prevent overfitting).

【翻译】神经网络通过自我强化学习进行训练,该强化学习算法使用MCTS计算每个落子动作。首先,神经网络初始化为随机权重θ0 。在随后的每次迭代i≥1时,产生了自我博弈(图1a)。在每一个时间步t,利用上一次迭代的神经网络 执行MCTS搜索 ,并且通过概率分布πt 进行采样来落子。当双方放弃行棋时,或者当搜索值低于阈值,或者当比赛超过最大长度时,比赛终止于步骤T;然后为比赛计分,给予奖励 rT∈  {−1,+1}(见方法细节)。每个时间步t的数据存储为 ,其中zt=±rT 是在步骤t从当前玩家的视角来看的赢家。并行地(图1b),新的网络参数θi 利用数据 进行训练,数据是从自我博弈的上一次迭代的所有时间步中均匀取样的。调整神经网络 ,使预测值v和自我博弈的赢家z之间的误差最小,并且最大限度地提高神经网络移动概率p与搜索概率π的相似度。具体来说,通过使用对均方误差和交叉熵损耗求和的损失函数l,利用梯度下降来调整参数θ:


其中,c是一个控制L2权重正则化水平的参数(防止过拟合)。

【原文】Empirical analysis of AlphaGo Zero training
We applied our reinforcement learning pipeline to train our program AlphaGo Zero. Training started from completely random behavior and continued without human intervention for approximately three days. Over the course of training, 4.9 million games of self­play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini­batches of 2,048 positions. The neural network contained 20 residual blocks.
AlphaGo Zero训练的实验分析
【翻译】应用我们的强化学习流程来训练AlphaGo Zero。训练从完全随机的落子开始,在没有人工干预的情况下持续大约三天。在训练过程中, 每次MCTS使用1600次模拟,生成了490万场自我博弈,每次落子使用约0.4s的思考时间。使用大小为2048的700000个小批量更新参数。神经网络包含20个残差块。

【原文】Figure 3a shows the performance of AlphaGo Zero during self­play reinforcement learning, as a function of training time, on an Elo scale 25. Learning progressed smoothly throughout training, and did not suffer from the oscillations or catastrophic forgetting that have been suggested in previous literature Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36 h. In comparison, AlphaGo Lee was trained over several months. After 72 h, we evaluated AlphaGo Zero against the exact version of AlphaGo Lee that defeated Lee Sedol, under the same 2   h time controls and match conditions that were used in the man–machine match in Seoul (see Methods). AlphaGo Zero used a single machine with 4 tensor processing units (TPUs) 29, whereas AlphaGo Lee was distributed over many machines and used 48 TPUs. AlphaGo Zero defeated AlphaGo Lee by 100 games to 0 (see Extended Data Fig. 1 and Supplementary Information).

【翻译】图3a显示了以训练时间为横轴,使用ELO评分规则时AlphaGo Zero在自我博弈强化学习期间的性能。在整个训练期间学习进展顺利,并没有遭受在相关文献中提及的振荡或灾难性的遗忘。令人惊讶的是,在仅训练36小时之后,AlphaGo Zero就超过了AlphaGo Lee的性能,因为AlphaGo Lee训练了几个月。训练72小时后,我们评估AlphaGo Zero,让他和在首尔打败过李世石的AlphaGo Lee使用2小时控制时间和比赛环境下进行比赛。AlphaGo Zero使用具有4个TPU的单机,而AlphaGo Lee则是分布在许多机器上,并且使用48个TPU。AlphaGo Zero以100比0击败AlphaGo Lee(参见扩展数据图1和补充资料)。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第3张图片

【原文】Figure 3 | Empirical evaluation of AlphaGo Zero.

                                                                           【翻译】 图三 AlphaGo Zero的实证评价

【原文】a, Performance of self­play reinforcement learning. The plot shows the performance of each MCTS player αθi from each iteration i of reinforcement learning in AlphaGo Zero. Elo ratings were computed from evaluation games between different players, using 0.4 s of thinking time per move (see Methods). For comparison, a similar player trained by supervised learning from human data, using the KGS dataset, is also shown. 
【翻译】a,自我博弈强化学习的表现。图中显示了AlphaGo Zero强化学习在每次迭代i中MCTS $\alpha_{\theta_{i}}$αθi 的表现。通过与不同玩家的比赛,来评估ELO评级。在比赛中每次落子的思考时间为0.4秒(见方法)。为了对比,我们也展示出了使用KGS数据,由人类经验数据进行监督学习训练的模型。
【原文】b, Prediction accuracy on human professional moves. The plot shows the accuracy of the neural network θfi, at each iteration of self­play i, in predicting human professional moves from the GoKifu dataset. The accuracy measures the probability to the human move. The accuracy of a neural network trained by supervised learning is also shown. 

【翻译】b、对人类棋手落子的预测精度。该图显示了在每一次自我博弈迭代i中,神经网络 根据KGS数据集预测人类棋手落子的准确性。通过监督学习的神经网络的训练精度也显示在图中。

【原文】c, Mean­squared error (MSE) of human professional game outcomes. The plot shows the MSE of the neural networkθfi, at each iteration of self­play i, in predicting the outcome of human professional games from the GoKifu dataset. The MSE is between the actual outcome z∈   {− 1, +1} and the neural network value v, scaled by a factor of 14 to the range of 0–1. The MSE of a neural network trained by supervised learning is also shown.

【翻译】c,在人类职业比赛结果上的均方误差(MSE)。该图显示了在每一次自我博弈迭代i中,神经网络 从gokifu数据中预测人类职业比赛结果的MSE。MSE是在实际结果z∈   {− 1, +1} 和神经网络的价值v,按1/4的比例缩小到0 - 1的范围之间。图中还显示出经过监督学习训练的神经网络的MSE。


【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第4张图片


【原文】To assess the merits of self­play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved state­of­the­art prediction accuracy compared to previous work12,30–33 (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self­learned player performed much better overall, defeating the human­trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

【翻译】为了评估自我博弈强化学习相对于使用人类数据进行学习的优势,我们训练了第二个神经网络(使用相同的架构)来预测在KGS服务器数据上人类专业棋手的落子动作,取得了与以前的工作(12,30–33)相比更准确的预测精度(当前和以前的结果分别参见扩展数据表1 和2)。监督学习在一开始获得了非常好的性能,并且更好地预测了人类棋手的动作(图3)。但值得注意的是,虽然监督学习取得了较高的落子预测精度,但是总体而言,这个自学的棋手表现更好,在经过24小时的训练后击败了用人类数据进行训练的程序。这表明,AlphaGo Zero可以学习到完全与人类不同的技能。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第5张图片

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第6张图片

【原文】To separate the contributions of architecture and algorithm, we compared the performance of the neural network architecture in AlphaGo Zero with the previous neural network architecture used in AlphaGo Lee (see Fig. 4). Four neural networks were created, using either separate policy and value networks, as were used in AlphaGo Lee, or combined policy and value networks, as used in AlphaGo Zero; and using either the convolutional network architecture from AlphaGo Lee or the residual network architecture from AlphaGo Zero. Each network was trained to minimize the same loss function (equation (1)), using a fixed dataset of self­play games generated by AlphaGo Zero after 72 h of self­play training. Using a residual network was more accurate, achieved lower error and improved performance in AlphaGo by over 600 Elo. Combining policy and value together into a single network slightly reduced the move prediction accuracy, but reduced the value error and boosted playing performance in AlphaGo by around another 600 Elo. This is partly due to improved computational efficiency, but more importantly the dual objective regularizes the network to a common representation that supports multiple use cases.
【翻译】为了将结构和算法的贡献分离,我们将AlphaGo Zero使用的神经网络体系结构的性能与AlphaGo Lee使用的神经网络结构进行了比较(见图4)。我们创建了四个神经网络,就像在AlphaGo Lee中那样,使用独立的策略网络和价值网络;或者使用AlphaGo Lee使用的卷积网络架构或AlphaGo Zero使用的残差网络架构。训练网络时都最大限度地减少相同的损失函数(方程(1)),使用的数据集是AlphaGo Zero在72小时的自我博弈训练后产生的固定数据集。利用残差网络更准确,使AlphaGo 达到较低的错误率和性能的改进,达到了超过600Elo。将策略和价值合成一个单一的网络会轻微地降低落子预测精度,但同时降低了价值误差,并且使AlphaGo的性能提高大约600Elo。这是由于提高了计算效率,但更重要的是具有双重目的的网络成为支持多个案例的通用表示。
什麽是ELO?
ELO等级分制度是指由匈牙利裔美国物理学家 Arpad Elo创建的一个衡量各类对弈活动水平的评价方法,是当今对弈水平评估的公认的权威方法。
ELO怎麽产生的?

最早, ELO等级分制度是基于统计学的一个评估棋手水平的方法. 之后被广泛用于国际象棋、围棋、足球、篮球等运动。线上游戏英雄联盟、魔兽世界内的竞技对战系统也採用此分级制度. 现在不少Destiny网站也使用此统计系统.

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第7张图片

【原文】Figure 4 | Comparison of neural network architectures in AlphaGo Zero and AlphaGo Lee

                                               【翻译】    图4  AlphaGo Zero和AlphaGo Lee中神经网络结构的比较

【原文】Comparison of neural network architectures using either separate (sep) or combined policy and value (dual) networks, and using either convolutional (conv) or residual (res) networks. The combinations ‘dual–res’ and ‘sep–conv’ correspond to the neural network architectures used in AlphaGo Zero and AlphaGo Lee, respectively. Each network was trained on a fixed dataset generated by a previous run of AlphaGo Zero. 
【翻译】使用单独的(SEP)或联合的策略和价值(dual)网络的神经网络结构比较,以及使用卷积(conv)或残差(res)网络的比较。 ‘dual–res’和‘sep–conv’ 的组合分别与AlphaGo Zero 和 AlphaGo Lee中使用的神经网络结构相对应。每个网络在一个固定的数据集上进行训练,这个数据集是由AlphaGo Zero以前的运行产生的。
【原文】a, Each trained network was combined with AlphaGo Zero’s search to obtain a different player. Elo ratings were computed from evaluation games between these different players, using 5 s of thinking time per move. 
b, Prediction accuracy on human professional moves (from the GoKifu dataset) for each network architecture.
c ,MSE of human professional game outcomes (from the GoKifu dataset) for each network architecture.
【翻译】a,每个训练过的网络与AlphaGo Zero的搜索相结合,来获得不同的程序。通过这些不同的程序之间的比赛来计算ELO评级。在比赛中,每次落子使用5秒的思考时间。
b, 每个网络架构对专业人类棋手的落子预测精度(使用gokifu数据集)。
c, 每个网络架构在人类专业职业比赛结果的MSE(使用gokifu数据集)。
【原文】Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self­play training process. This included not only fundamental elements of human Go knowledge, but also non­standard strategies beyond the scope of traditional Go knowledge. 
【翻译】AlphaGo Zero学习到的知识
AlphaGo Zero在自我博弈训练过程中发现了围棋的新境界。这不仅包括人类围棋知识的基本要素,而且还包括超出传统围棋知识范围之外的非标准策略。

【原文】Figure 5 shows a timeline indicating when professional joseki (corner sequences) were discovered (Fig. 5a and Extended Data Fig. 2); ultimately AlphaGo Zero preferred new joseki variants that were previously unknown (Fig. 5b and Extended Data Fig. 3). Figure 5c shows several fast self­play games played at different stages of training (see Supplementary Information). Tournament length games played at regular intervals throughout training are shown in Extended Data Fig. 4 and in the Supplementary Information. AlphaGo Zero rapidly progressed from entirely random moves towards a sophisticated understanding of Go concepts, including fuseki (opening), tesuji (tactics), life­and­death, ko (repeated board situations), yose (endgame), capturing races, sente (initiative), shape, influence and territory, all discovered from first principles. Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training.

【翻译】图5显示了专业的定式(位于边角的序列上)被发现的时间(图5A和扩展的数据如图2所示);最终AlphaGo Zero使用了新的定式变种(图5B和扩展数据图3)。图5c显示了在不同的训练阶段进行的几次快速自我博弈的进行情况(参见补充信息)。在整个训练过程中定期进行的比赛长度在扩展数据图4和补充信息中显示。(在训练过程中一般游戏长度中都有一些间隔,这些间隔都显示在了额外数据4和补充信息中。)AlphaGo Zero迅速从“一块白板”走向成熟,对围棋概念有了深奥理解,包括布局(开放),手筋(战术),活和死,劫(重复的棋盘情况),官子(残局),提子比赛,森特(主动)(初始)、形态(成型)、影响和领土(占领),都能在第一时间迅速掌握。令人惊讶的是,shicho抓住了整个棋盘的序列——在人类学习围棋中比较早被人类掌握的围棋知识点,却在AlphaGo Zero训练比较晚的时候才掌握到。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第8张图片

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第9张图片

【原文】Figure 5 | Go knowledge learned by AlphaGo Zero.

                                                                 【翻译】  图5  AlphaGo Zero学习的围棋知识

【原文】a, Five human Joseki (common corner sequences) discovered during AlphaGo Zero training. The associated timestamps indicate the first time each sequence occurred (taking account of rotation and reflection) during self­play training. Extended Data Figure 2 provides the frequency of occurence over training for each sequence.
【翻译】a,在AlphaGo Zero训练过程中的五个常见的角点序列。在自我博弈训练期间,相关的时间段显示了每个序列第一次形成的时间(考虑旋转和反射)。扩展数据图2提供了每个序列在训练中出现的频率。
【原文】b, Five joseki favoured at different stages of self­play training. Each displayed corner sequence was played with the greatest frequency, among all corner sequences, during an iteration of self­play training. The timestamp of that iteration is indicated on the timeline. At 10 h a weak corner move was preferred. At 47 h the 3–3 invasion was most frequently played. This joseki is also common in human professional play however AlphaGo Zero later discovered and preferred a new variation. Extended Data Figure 3 provides the frequency of occurence over time for all five sequences and the new variation.
【翻译】b,五定式在自我博弈训练的不同阶段被青睐的程度。在自我博弈训练的一次迭代中,在所有的角序列中,每一个显示的角序列都出现的频率最高。该迭代的时间戳在时间轴上表示。在10小时时,弱角移动是首选。在47小时时,3 - 3的入侵是最经常发生的。这个定式在人类职业比赛中也常见。不过AlphaGo Zero随后发现并偏向于这个新变化。扩展数据图3提供了所有五个序列和新变化随时间变化的频率。

【原文】c, The first 80 moves of three self­play games that were played at different stages of training, using 1,600 simulations (around 0.4 s) per search. At 3 h, the game focuses greedily on capturing stones, much like a human beginner. At 19 h, the game exhibits the fundamentals of life­and­death, influence and territory.At 70 h, the game is remarkably balanced, involving multiple battles and a complicated ko fight, eventually resolving into a half­point win for white. See Supplementary Information for the full game.

【翻译】C,在不同训练阶段进行的三个自我博弈的前80步,每次搜索使用1600次模拟(大约0.4秒)。在3小时后,游戏专注于吃对方的棋子,就像人类初级棋手一样。在19小时时,游戏展现了死活、影响力和占领的基本方面,在70小时时,游戏非常平衡,包括多场战斗和复杂的劫战斗,最终白方以半目赢得胜利。有关完整游戏见补充信息。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第10张图片

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第11张图片

【原文】Final performance of AlphaGo Zero
We subsequently applied our reinforcement learning pipeline to a second instance of AlphaGo Zero using a larger neural network and over a longer duration. Training again started from completely random behaviour and continued for approximately 40 days. 
Over the course of training, 29 million games of self­play were generated. Parameters were updated from 3.1 million mini­batches of 2,048 positions each. The neural network contained 40 residual blocks. The learning curve is shown in Fig. 6a. Games played at regular intervals throughout training are shown in Extended Data Fig. 5 and in the Supplementary Information.
【翻译】AlphaGo Zero的最后的表现
随后,我们使用更大的神经网络,在一个较长的时间将我们的强化学习流程应用到AlphaGo Zero的第二个实例。训练又从完全随机的行为开始,持续了大约40天。
在训练过程中,产生了2900万场自我博弈。参数大小为2048的310万个小批量中更新。神经网络包含40个残差块。学习曲线显示在图6a, 在扩展的数据图5中和补充信息中显示了在训练期间定期进行的比赛。
【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第12张图片
【原文】Figure 6 | Performance of AlphaGo Zero.
【翻译】图6  AlphaGo Zero的表现
【原文】a, Learning curve for AlphaGo Zero using a larger 40­block residual network over 40 days. The plotshows the performance of each player αθi from each iteration i of our reinforcement learning algorithm. Elo ratings were computed from evaluation games between different players, using 0.4 s per search (see Methods). 
【翻译】a, 使用大型的40块残差网络,训练超过40天的AlphaGo Zero的学习曲线。该学习曲线展示了在我们的强化学习算法中,每次迭代i中 的表现。利用不同玩家的比赛计算ELO评级,在游戏中每次搜索使用0.4秒(见方法)。
【原文】b, Final performance of AlphaGo Zero. AlphaGo Zero was trained for 40 days using a 40­block residual neural network. The plot shows the results of a tournament between: AlphaGo Zero, AlphaGo Master (defeated top human professionals 60–0 in online games), Alpha ee (defeated Lee Sedol), AlphaGo Fan (defeated Fan Hui), as well as previous Go programs Crazy Stone, Pachi and GnuGo. Each program was given 5 s of thinking time per move. AlphaGo Zero and AlphaGo Master played on a single machine on the Google Cloud; AlphaGo Fan and AlphaGo Lee were distributed over many machines. The raw neural network from AlphaGo Zero is also included, which directly selects the move a with maximum probability pa, without using MCTS. Programs were evaluated on an Elo scale25: a 200­point gap corresponds to a 75% probability of winning.
【翻译】b, AlphaGo Zero的最终性能。AlphaGo Zero 使用40块残差神经网络训练40天。该图显示了AlphaGo Zero、AlphaGo Master(在在线游戏上以60–0击败人体专业顶级选手)、Alpha Lee(击败Lee Sedol)、AlphaGo Fan(击败樊麾),以及以前的围棋程序Crazy Stone,Pachi和gnugo之间的比赛。允许每个程序每次移动使用5秒的思考时间。AlphaGo Zero 和 AlphaGo Master在谷歌云上的单机进行;AlphaGo Fan和AlphaGo Lee分别分布在多台机器上。AlphaGo Zero的原神经网络也包括在内,它没有使用MCTS,直接选择最大概率为 的移动。程序以ELO 模式评价:200点的gap相当于75%的胜率。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第13张图片

【原文】We evaluated the fully trained AlphaGo Zero using an internal tournament against AlphaGo Fan, AlphaGo Lee and several previous Go programs. We also played games against the strongest existing program, AlphaGoMaster—a program based on the algorithm and architecture presented in this paper but using human data and features (see Methods)—which defeated the strongest human professional players 60–0 in online games in January2017 34 . In our evaluation, all programs were allowed 5 s of thinking time per move; AlphaGo Zero and AlphaGo Master each played on a single machine with 4 TPUs; AlphaGo Fan and AlphaGo Lee were distributed over 176 GPUsand 48 TPUs, respectively. We also included a player based solely on the raw neural network of AlphaGo Zero; this player simply selected the move with maximum probability.

【翻译】我们通过内部比赛对AlphaGo Fan,AlphaGo Lee和几个以前的Go程序评估了全面训练的AlphaGo Zero。我们还让其对战现有最强的程序,AlphaGo Master——一个基于本文的算法和架构但利用人类数据和特征的算法(见方法)的程序,于2017年1月在网络游戏上击败了人类最强的职业选手60–0。在我们的评估中,所有程序都只允许使用5秒时间思考每次落子;AlphaGo Zero和AlphaGo Master每个在使用4个TPU的单一机器上进行;AlphaGo Fan和AlphaGo Lee分别分布在176个GPU和48个TPU上。我们还引入一个完全基于AlphaGo Zero原始神经网络的程序,该程序以最大的概率来选择落子。

【原文】Figure 6b shows the performance of each program on an Elo scale. The raw neural network, without using any lookahead, achieved an Elo rating of 3,055. AlphaGo Zero achieved a rating of 5,185, compared to 4,858 for AlphaGo Master, 3,739 for AlphaGo Lee and 3,144 for AlphaGo Fan.

【翻译】图6b显示了在Elo量表上每个程序的性能。没有使用任何前向搜索的原始神经网络,Elo评级为3,055。相比之下,AlphaGo Zero达到了5185的等级, AlphaGo Master达到了4858 等级,AlphaGo Lee达到了3739和AlphaGo Fan 达到了3144。

【原文】Finally, we evaluated AlphaGo Zero head to head against AlphaGo Master in a 100­game match with 2­h time controls. AlphaGo Zero won by 89 games to 11 (see Extended Data Fig. 6 and Supplementary Information).

【翻译】最后,我们使用具有两小时控制时间的100场比赛对AlphaGo Zero和AlphaGo Master进行评估。AlphaGo Zero以89比11赢得了比赛(参见扩展数据图6和补充资料)。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第14张图片

【原文】Conclusion

Our results comprehensively demonstrate that a pure reinforcement learning approach is fully feasible, even in the most challenging of domains: it is possible to train to superhuman level, without humanexamples or guidance, given no knowledge of the domain beyond basic rules. Furthermore, a pure reinforcement learning approach requires just a few more hours to train, and achieves much better asymptotic performance, comparedto training on human expert data. Using this approach, AlphaGo Zero defeated the strongest previous versions of AlphaGo, which were trained from human data using handcrafted features, by a large margin

【翻译】讨论

我们的研究结果证明,即便是在最具挑战性的领域中,单纯使用强化学习的方法也是完全可行的:没有人类实例或指导,没有基本规则之外的领域知识,训练达到超人的性能是完全可能的。此外,与通过人类棋手数据进行训练相比,单纯的强化学习方法只需要训练几个小时,并且可以取得更好的渐近性能。使用这种方法,AlphaGo Zero打败了AlphaGo 先前最强的版本,那个版本使用手工制作的特征,利用人类数据进行大幅度训练。

【原文】Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
【翻译】人类从几千年来进行的围棋比赛中积累了大量的知识,并提取其精华写入模式、谚语和书籍。然而在短短几天内,从零开始的AlphaGo Zero能够重新发现很多围棋知识以及新的策略,为这古老的游戏提供了新的见解。
【原文】METHODS
Reinforcement learning. Policy iteration is a classic algorithm that generates a sequence of improving policies, by alternating between policy evaluation—estimating the value function of the current policy—and policy improvement—using the current value function to generate a better policy. A simple approach to policy evaluation is to estimate the value function from the outcomes of sampled trajectories. A simple approach to policy improvement is to select actions greedily with respect to the value function. In large state spaces, approximations are necessary to evaluate each policy and to represent its improvement.
【翻译】方法
强化学习。策略迭代是一种经典算法,它通过估计当前策略下的价值函数的“策略评估”,和利用当前价值函数产生更好策略的“策略改善”,来产生一系列改进策略。策略评估的一个简单方法是从采样轨迹的结果中估计值函数。策略完善的一个简单方法是利用价值函数贪婪地选择动作。在大的状态空间中,近似对评估每个策略并表示其改进是必要的。
【原文】Classification­based reinforcement learning improves the policy using a simple Monte Carlo search. Many rollouts are executed for each action; the action with the maximum mean value provides a positive training example, while all other actions provide negative training examples; a policy is then trained to classify actions as positive or negative, and used in subsequent rollouts. This may be viewed as a precursor to the policy component of AlphaGo Zero’s training algorithm when τ→ 0.
【翻译】基于分类的强化学习利用简单的蒙特卡洛搜索对策略进行了改进。每个落子动作执行许多次rollout;具有最大平均价值的动作提供了一个积极的训练实例,而所有其他的动作提供了负面的训练样例;策略是训练用来对动作的正面或负面进行分类,并用于后续的rollout。这可以看作是τ→0时,对AlphaGo Zero训练算法的策略组成的前身。
【原文】A more recent instantiation, classification­based modified policy iteration (CBMPI), also performs policy evaluation by regressing a value function towards truncated rollout values, similar to the value component of AlphaGo Zero; this achieved state­of­the­art results in the game of Tetris. However, this previous work was limited to simple rollouts and linear function approximation using handcrafted features.
【翻译】最近的一个实例,是基于分类的改进策略迭代(CBMPI),也通过截断的rollout对价值函数进行回归,从而执行策略评估,这类似于AlphaGo Zero的价值部分;这在俄罗斯方块游戏中达到了最先进的水平。然而,这以前的工作仅限于简单的rollout和使用手工特征的线性函数的近似。
【原文】The AlphaGo Zero self­play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self­play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self­play game outcome respectively.
【翻译】AlphaGo Zero自我博弈算法同样可以理解为一个近似的策略迭代计划,其中,MCTS用于策略改善和策略评价。策略改善从一个神经网络策略开始,执行基于该神经网络策略的MCTS,然后将搜索策略(更强)回归到神经网络的函数空间中。策略评估应用于(更强大的)搜索策略:自我博弈的结果也被投射回神经网络的函数空间。这些投射步骤是通过训练神经网络参数来匹配搜索概率和自我博弈结果而实现的。
【原文】Guo et al. 7 also project the output of MCTS into a neural network, either by regressing a value network towards the search value, or by classifying the action selected by MCTS. This approach was used to train a neural network for playing Atari games; however, the MCTS was fixed—there was no policy iteration—and did not make any use of the trained networks.
【翻译】郭等人也将MCTS的输出投影到神经网络中,或者通过搜索价值对价值网络进行回归,或者对通过MCTS选择的落子动作进行分类。这种方法通过训练神经网络来玩Atari游戏;然而,MCTS是固定的,没有策略迭代,并且没有使用任何训练过的网络。
【原文】Self-play reinforcement learning in games. Our approach is most directly applicable to Zero­sum games of perfect information. We follow the formalism of alternating Markov games described in previous work12, noting that algorithms based on value or policy iteration extend naturally to this setting39.
【翻译】游戏中的自我博弈强化学习。我们的方法最直接地适用于完全信息的零和博弈。我们遵循在先前的工作12中描述的交替马尔可夫游戏的形式,指出基于价值或自然延伸到此设置的策略迭代的算法39。
【原文】Self­play reinforcement learning has previously been applied to the game of Go. NeuroGo40,41 used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal­difference learning42 to predict territory in games of self­play, building on previous work43. A related approach, RLGO44, represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal­difference learning to predict the winner in games of self­play. Both NeuroGo and RLGO achieved a weak amateur level of play.
【翻译】自我博弈强化学习先前就已被应用到围棋中。NeuroGo40,41使用神经网络来表示价值函数,使用基于关于连接性,疆域和眼的围棋知识的成熟架构。该神经网络是通过时间差分学习42进行训练来预测依赖以前的工作43建立的自我博弈的疆域。另一个相关的方法,RLGO44,所代表的是价值函数而不是特征的线性组合,详尽列举所有棋子的3×3特征;它通过时间差分学习进行训练来预测自我博弈的赢家。NeuroGo和RLGO都达到了业余段位。
【原文】MCTS may also be viewed as a form of self­play reinforcement learning45. The nodes of the search tree contain the value function for the positions encountered during search; these values are updated to predict the winner of simulated games of self­play. MCTS programs have previously achieved strong amateur level in Go46,47, but used substantial domain expertise: a fast rollout policy, based on handcrafted features13,48, that evaluates positions by running simulations until the end of the game; and a tree policy, also based on handcrafted features, that selects moves within the search tree47.
【翻译】MCTS也可以看作是一种自我博弈强化学习45。搜索树的节点包含搜索过程中遍历的棋局的价值函数;更新这些值来预测自我博弈的赢家。MCTS程序以前在围棋领域达到了较强的业余段位水平46,47,但使用了大量的领域专业知识:基于手工特征的快速走棋策略,模拟运行直到比赛结束来评价棋局;树策略,也是基于手工制作特征的,在搜索树47中选择落子动作。
【原文】Self­play reinforcement learning approaches have achieved high levels of performance in other games: chess49-51, checkers52, backgammon53, othello54, Scrabble55 and most recently poker56. In all of these examples, a value function was trained by regression54-56 or temporal­difference learning49-53 from training data generated by self­play. The trained value function was used as an evaluation function in an alpha–beta search49-54, a simple Monte Carlo search 55,57or counterfactual regret minimization56. However, these methods used handcrafted input features49-53,56 or handcrafted feature templates54,55. In addition, the learning process used supervised learning to initialize weights58, hand­selected weights for piece values49,51,52, handcrafted restrictions on the action space56 or used pre­existing computer programs as training opponents49,50, or to generate game records51.
【翻译】自我博弈强化学习方法在其他游戏上取得了高性能:国际象棋49-51,西洋棋52, 西洋双陆棋53, 奥赛罗54, 拼字游戏55 和最近的纸牌56。在所有这些例子中,价值函数是利用时间差分学习49-53,通过回归54-56或利用自我博弈生成的数据进行训练的进行训练的。受过训练的价值函数在α-β搜索49-54、简单的蒙特卡洛搜索55,57或者假设遗憾最小化56中作为评价函数。然而,这些方法使用手工输入特征49-53,56或者手工特征范本54,55。此外,学习过程使用的监督学习来初始化权重58、为piece value手工选择权重49,51,52、在动作空间的手工限制56、或使用之前的计算机程序作为训练对手49,50、或生成的游戏记录51。
【原文】Many of the most successful and widely used reinforcement learning methods were first introduced in the context of Zero­sum games: temporal­difference learning was first introduced for a checkers­playing program59, while MCTS was introduced for the game of Go13. However, very similar algorithms have subsequently proven highly effective in video games6-8,10, robotics60, industrial control61-63 and online recommendation systems64,65.
【翻译】许多最成功和使用最广泛的强化学习方法在零和博弈的内容中第一次做了介绍:时间差学习首先利用跳棋程序介绍的,MCTS是利用围棋介绍的。然而,随后在电子游戏,机器人,工业控制和在线推荐中,非常相似的算法得到了效果很好的证实。
【原文】AlphaGo versions. We compare three distinct versions of AlphaGo:
【翻译】AlphaGo版本。我们比较了三个不同的AlphaGo版本:
【原文】(1) AlphaGo Fan is the previously published program12 that played against Fan Hui in October 2015. This program was distributed over many machines using 176 GPUs.
【翻译】(1)AlphaGo Fan是先前公布在2015年10月与樊麾交手的程序。这个程序分布在多台机器上,使用了176个GPU。
【原文】(2) AlphaGo Lee is the program that defeated Lee Sedol 4–1 in March 2016.It was previously unpublished, but is similar in most regards to AlphaGo Fan12.However, we highlight several key differences to facilitate a fair comparison. First, he value network was trained from the outcomes of fast games of self­play by AlphaGo, rather than games of self­play by the policy network; this procedure was iterated several times—an initial step towards the tabula rasa algorithm presented in this paper. Second, the policy and value networks were larger than those described in the original paper—using12 convolutional layers of 256 planes—and were trained for more iterations. This player was also distributed over many machines using 48 TPUs, rather than GPUs, enabling it to evaluate neural networks faster during search.
【翻译】(2)AlphaGo Lee是在2016年3月以4:1击败Lee Sedol的程序。这个程序以前未公布,但它在大多数方面与AlphaGo Fan12是相似的。然而,为了有一个公平的比较,我们强调几个关键性的差异。首先,价值网络是利用AlphaGo自我博弈的快速游戏结果进行训练的,而不是利用策略网络的自我博弈游戏进行训练的;这个过程反复了几次——初步提出了tabula rasa算法。其次,策略网络和价值网络都比原创论文中描写的大——使用具有256个特征平面的12个卷积层——并且在训练中进行了更多次的迭代。这个程序也分布在很多机器上,使用48个TPU,而不是GPU,使搜索过程中神经网络的评估速度更快。
( 【原文】3) AlphaGo Master is the program that defeated top human players by 60–0 in January 201734. It was previously unpublished, but uses the same neural network architecture, reinforcement learning algorithm, and MCTS algorithm as described in this paper. However, it uses the same handcrafted features and rollouts as AlphaGo Lee12 and training was initialized by supervised learning from human data.
【翻译】(3)AlphaGo Master是在2017年一月以60:0击败人类头号玩家34的程序。这是以前未公开的程序,但使用了本文中提到过的相同的神经网络结构、强化学习算法和MCTS算法。但是,它和AlphaGo Lee12使用相同的手工特征和rollout,并且通过对人类数据的监督学习进行初始化训练的。
【原文】(4) AlphaGo Zero is the program described in this paper. It learns from self­play reinforcement learning, starting from random initial weights, without using rollouts, with no human supervision and using only the raw board history as input features. It uses just a single machine in the Google Cloud with 4 TPUs (AlphaGo Zero could also be distributed, but we chose to use the simplest possible search algorithm).
【翻译】(4)AlphaGo Zero是本文描述的程序。它可以通过自我博弈强化学习来学习,从随机的初始权重开始,不使用rollout,没有人监督并且只使用原始棋盘作为输入特征。它仅在谷歌云上使用单一的机器,并且只使用了4个TPU(AlphaGo Zero也可以是分布式的,但我们选择了使用尽可能简单的搜索算法)。
【原文】Domain knowledge. Our primary contribution is to demonstrate that superhu ­man performance can be achieved without human domain knowledge. To clarify his contribution, we enumerate the domain knowledge that AlphaGo Zero uses, explicitly or implicitly, either in its training procedure or its MCTS; these are the items of knowledge that would need to be replaced for AlphaGo Zero to learn a different (alternating Markov) game.
【翻译】领域知识。我们的主要贡献是,证明了在没有人类领域知识的情况下,也可以达到超越人类的性能。为了分清各自的贡献,我们列举了AlphaGo Zero在训练过程中和在使用MCTS的过程中显式或隐式地使用过的领域知识;以下就是如果让AlphaGo Zero学习一个不同的(交替马尔可夫)游戏,它需要替换的知识列表:
【原文】(1) AlphaGo Zero is provided with perfect knowledge of the game rules. These are used during MCTS, to simulate the positions resulting from a sequence of moves, and to score any simulations that reach a terminal state. Games terminate when both players pass or after 19 × 19 × 2 = 722 moves. In addition, the player is provided with the set of legal moves in each position.
【翻译】(1)需要提供给AlphaGo Zero完整的比赛规则。这些都是在MCTS过程中需要使用到的,来从一个落子动作序列中模拟出棋局,并通过给模拟记分来达到终局。当双方棋手放弃行棋后或在进行了19×19×2 = 722步后比赛结束。此外,在每个棋局的合法走子位置也提供给了AlphaGo。
【原文】(2) AlphaGo Zero uses Tromp–Taylor scoring66 during MCTS simulations and self­play training. This is because human scores (Chinese, Japanese or Korean rules) are not well­defined if the game terminates before territorial boundaries are resolved. However, all tournament and evaluation games were scored using Chinese rules.
【翻译】(2)AlphaGo Zero在MCTS模拟和自我博弈训练期间使用Tromp–Taylor评分66。这是因为如果比赛在疆域边界被解决之前终止,人类的分数(如在中国、日本或韩国的规则中)没有很好地定义。然而,所有比赛和评价比赛都是按照中国规则评分的。
【原文】(3) The input features describing the position are structured as a 19 × 19 image; that is, the neural network architecture is matched to the grid­structure of the board.
【翻译】(3)输入特征描述了棋局被组织成一个19×19大小的图像;也就是说,神经网络的结构与棋盘的栅格结构相匹配。
【原文】(4) The rules of Go are invariant under rotation and reflection; this knowledge has been used in AlphaGo Zero both by augmenting the dataset during training to include rotations and reflections of each position, and to sample random rotations or reflections of the position during MCTS (see Search algorithm). Aside from komi, the rules of Go are also invariant to colour transposition; this knowledge is exploited by representing the board from the perspective of the current player (see Neural network architecture).
【翻译】(4)围棋规则在旋转和对称下是不变的;这种知识AlphaGo Zero中已经通过在训练中增加数据集,来将每个棋局的旋转和对称包括进去、以及在MCTS过程中对位置的旋转和对称进行随机采样(见搜索算法)来进行使用了。由于贴目规则的存在,围棋规则在棋子颜色调换后也是不变的;这些知识的开发是通过从当前玩家的角度来对棋局进行表示的(见神经网络结构)。
【原文】AlphaGo Zero does not use any form of domain knowledge beyond the points listed above. It only uses its deep neural network to evaluate leaf nodes and to select moves (see ‘Search algorithm’). It does not use any rollout policy or tree policy, and the MCTS is not augmented by any other heuristics or domain­specific rules. No legal moves are excluded—even those filling in the player’s own eyes (a standard heuristic used in all previous programs67). 
【翻译】AlphaGo Zero不使用任何除了以上列出的任何形式的领域知识。它只使用深度神经网络来评估叶节点并选择落子(参见“搜索算法”)。它不使用任何rollout策略或Tree策略,并且MCTS不增加任何其他启发式或与具体领域相关的规则。合法的落子没有一个被排除在外——即使是那些填充在本方眼中的棋(所有以前的程序使用的标准启发式算法67)。
【原文】The algorithm was started with random initial parameters for the neural network. The neural network architecture (see ‘Neural network architecture’) is based on the current state of the art in image recognition4,18, and hyper parameters for training were chosen accordingly (see ‘Self­play training pipeline’). MCTS search parameters were selected by Gaussian process optimization68, so as to optimize self­play performance of AlphaGo Zero using a neural network trained in a preliminary run. For the larger run (40 blocks, 40 days), MCTS search parameters were re­optimized using the neural network trained in the smaller run (20 blocks, 3 days). The training algorithm was executed autonomously without human intervention.
【翻译】该算法以神经网络的随机初始参数作为起始点。神经网络结构(参见‘Neural network architecture’)基于图像识别的当前状态,并且为了训练选择相应的超参数(见‘Self­play training pipeline’)。用高斯过程优化选择MCTS搜索参数,从而优化在初步运行中使用神经网络训练的AlphaGo Zero的自我博弈性能。对于较大的规模(40块,40天),MCTS的搜索参数利用通过更小的规模进行训练的神经网络进行了重新优化(20块,3天)。训练算法无需人工干预,自主执行。
【原文】Self-play training pipeline. AlphaGo Zero’s self­play training pipeline consists of three main components, all executed asynchronously in parallel. Neural network parameters θ i are continually optimized from recent self­play data; AlphaGo Zero players  α θ i are continually evaluated; and the best performing player so far,  α θ ∗ , is used to generate new self­play data.
【翻译】自我博弈训练流水级。AlphaGo Zero自我博弈训练流程由三个主要部分组成,所有部分都并行地异步执行。神经网络参数 不断通过当前的自我博弈数据进行优化;AlphaGo Zero 棋手 不断被评估;并且到目前为止表现最好的 ,用于生成新的自我博弈数据。
【原文】Optimization. Each neural network θ f i is optimized on the Google Cloud using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. The batch­size is 32 per worker, for a total mini­batch size of 2,048. Each mini­batch of data is sampled uniformly at random from all positions of the most recent 500,000 games of self­play. Neural network parameters are optimized by stochastic gradient descent with momentum and learning rate annealing, using the loss in equation (1). The learning rate is annealed according to the standard schedule in Extended Data Table 3. The momentum parameter is set to 0.9. The cross­entropy and MSE losses are weighted equally (this is reasonable because rewards are unit scaled, r ∈ {− 1, + 1}) and the L2 regularization parameter is set to c = 10  4 . The optimization process produces a new checkpoint every 1,000 training steps. This checkpoint is evaluated by the evaluator and it may be used for generating the next batch of self­play games, as we explain next.
【翻译】优化.每一个神经网络 在谷歌云使用64个GPU和19个CPU参数服务器通过tensorflow进行优化。每个GPU的小批量数量为32,总规模为2048。每个小批量的数据都是从最近的500000场自我博弈的所有棋局中随机抽取的。使用方程(1)中的损失,利用动量损失和学习率退火对神经网络参数进行优化。根据扩展数据表3中的标准调度对学习率进行退火。动量参数设置为0.9。交叉熵和均方误差损失权重相等(这是合理的,因为奖励被处理为 r ∈ {− 1, + 1}并且L2正则化参数设置为c=10-4。优化过程中每1000个训练步骤就会产生一个新的检查点。这个检查点由评价器评估,它可以用来生成下一批自我博弈,我们将在下文进行解释。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第15张图片

【原文】Evaluator. To ensure we always generate the best quality data, we evaluate each new neural network checkpoint against the current best network θ ∗ f before using it for data generation. The neural network θ f i is evaluated by the performance of an MCTS search  α θ i that uses θ f i to evaluate leaf positions and prior probabilities (see Search algorithm). Each evaluation consists of 400 games, using an MCTS with 1,600 simulations to select each move, using an infinitesimal temperature τ→ 0 (that is, we deterministically select the move with maximum visit count, to give the strongest possible play). If the new player wins by a margin of > 55% (to avoid selecting on noise alone) then it becomes the best player  α θ∗ , and is subsequently used for self­play generation, and also becomes the baseline for subsequent comparisons.
【翻译】评估器。为了确保我们总是产生质量最好的数据,我们在使用神经网络生成数据前,会利用目前最好的网络 对每一个新的神经网络进行比较来做评估。MCTS搜索 通过 产生叶子节点棋局的评价和先验概率(见搜索算法),神经网络 的性能由 进行评估。每一个评价需要进行400次比赛,使用具有1600次模拟的MCTS来选择每一次落子,用无穷小的温度τ→0(也就是我们通过选择具有最大访问数的落子,来给出最可能的落子)。如果新的玩家获胜比率>55%(避免单独因为噪声进行选择),那它就成为了最好的 ,并随后用于进行自我博弈,也成为后来比较的基准。
【原文】Self-play. The best current player  α θ ∗ , as selected by the evaluator, is used to generate data. In each iteration,  α θ ∗ plays 25,000 games of self­play, using 1,600 simulations of MCTS to select each move (this requires approximately 0.4 s per search). For the first 30 moves of each game, the temperature is set to τ = 1; this selects moves proportionally to their visit count in MCTS, and ensures a diverse set of positions are encountered. For the remainder of the game, an infinitesimal temperature is used, τ→ 0. Additional exploration is achieved by adding Dirichlet noise to the prior probabilities in the root node s 0 , specifically P(s, a) = (1 − ε)p a + εη a , where η ∼ Dir(0.03) and ε = 0.25; this noise ensures that all moves may be tried, but the search may still overrule bad moves. In order to save computation, clearly lost games are resigned. The resignation threshold v resign is selected automatically to keep the fraction of false positives (games that could have been won if AlphaGo had not resigned) below 5%. To measure false positives, we disable resignation in 10% of self­play games and play until termination.
【翻译】自我博弈。通过评估而被选择的目前最好的棋手 ,用于数据的生成。在每次迭代中, 进行25000场自我博弈,用1600次MCTS模拟来选择每次落子(每次搜索需要使用大约0.4秒选择落子)。在每场比赛的前30次落子中,温度τ设置为1:选择落子位置的概率与MCTS中的访问次数成比例,并保证遇到一组不同的棋局。在剩下的比赛中使用一个无限小的温度τ→0。为了保证有额外的探索,在根节点 的先验概率中加入Dirichlet噪声  P(s, a) = (1 − ε)p a + εη a ,其中 η ∼ Dir(0.03) 并且ε= 0.25;这种噪声确保所有落子都被尝试,但同时也会选择到不是很好的落子动作。为了节省运算量,显然会输的游戏被丢弃。丢弃阈值 v resign是自动选择的,以便保持误报率(AlphaGo如果没有丢弃游戏本来可以赢的概率)低于5%。为了测量误报率,我们在10%的自我博弈中没有执行丢弃。
【原文】Supervised learning. For comparison, we also trained neural network parameters θ SL by supervised learning. The neural network architecture was identical to AlphaGo Zero. Mini­batches of data (s, π, z) were sampled at random from the KGS dataset, setting π a = 1 for the human expert move a. Parameters were optimized by stochastic gradient descent with momentum and learning rate annealing, using the same loss as in equation (1), but weighting the MSE component by a factor of 0.01. The learning rate was annealed according to the standard schedule in Extended Data Table 3. The momentum parameter was set to 0.9, and the L2 regularization parameter was set to c = 10 −4
【翻译】监督学习。作为对比,我们还通过监督学习训练了一个神经网络参数 。神经网络结构和AlphaGo Zero的是相同的。小批次数据 (s, π, z)  从KGS数据集中随机采样,将人类棋手的落子a设置为π a = 1 。参数和方程(1)一样,使用相同的损失,由动量和学习率退火的随机梯度下降算法进行优化,但通过一个参数0.01来衡量MSE。按照标准进度对学习速率进行的退火显示在扩展数据表3中。动量参数设置为0.9,L2正则化参数设置为c = 10-4。

【原文】By using a combined policy and value network architecture, and by using a low weight on the value component, it was possible to avoid overfitting to the values (a problem described in previous work 12 ). After 72 h the move prediction accuracy exceeded the state of the art reported in previous work 12,30–33 , reaching 60.4% on the KGS test set; the value prediction error was also substantially better than previously reported 12 . The validation set was composed of professional games from GoKifu. Accuracies and MSEs are reported in Extended Data Table 1 and Extended Data Table 2, respectively.

【翻译】通过使用策略网络和价值网络相结合的结构,并通过在价值组件中使用低权重,可以避免过拟合(在以往的工作中描述的问题)。72小时后,落子预测精度超过了在以往工作中得到的最好的程序,在KGS测试集上达到60.4%的预测率;价值预测误差也大大少于先前的报道。验证集是由GoKifu的专业比赛组成的。精度和均方误差分别显示在在扩展数据表1和表2中。

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第16张图片

【论文翻译】Mastering the game of Go without human knowledge (无师自通---在不借助人类知识的情况下学会围棋)_第17张图片

【原文】Search algorithm. AlphaGo Zero uses a much simpler variant of the asynchronous policy and value MCTS algorithm (APV­MCTS) used in AlphaGo Fan and AlphaGo Lee. 
Each node s in the search tree contains edges (s, a) for all legal actions  ∈A a s ( ) . Each edge stores a set of statistics,

【翻译】搜索算法。相对于使用异步策略和价值MCTS算法(APV­MCTS MCTS算法)的AlphaGo Fan和AlphaGo Lee,AlphaGo Zero使用了一个更简单的变种。

搜索树中的每个节点包含了边 ,其中,所有合法的动作 。每条边存储一组统计数据,


【原文】where N(s, a) is the visit count, W(s, a) is the total action value, Q(s, a) is the mean action value and P(s, a) is the prior probability of selecting that edge. Multiple simulations are executed in parallel on separate search threads. The algorithm proceeds by iterating over three phases (Fig. 2a–c), and then selects a move to play (Fig. 2d).

【翻译】其中 N(s, a)是访问计数,W(s, a) 是总动作值,Q(s, a) 是平均动作值,P(s, a) 是选择该条边的先验概率。在不同的搜索线程上并行执行多个模拟。该算法通过迭代超过三个阶段来进行(图2a - C),然后选择要进行的落子动作(图2d)。

【原文】Select (Fig. 2a). The selection phase is almost identical to AlphaGo Fan12 ; we recapitulate here for completeness. The first in­tree phase of each simulation begins at the root node of the search tree, s 0 , and finishes when the simulation reaches a leaf node s L at time­step L. At each of these time­steps, t < L, an action is selected  according to the statistics in the search tree,  ,using a variant of the PUCT algorithm24,
【翻译】选择(图2a)。选择阶段与AlphaGo Fan几乎是相同的;为了保持完整性,我们在这里进行复述。每次模拟的第一阶段:树内阶段从搜索树的根节点 开始,并当仿真在时间步L到达叶节点 时结束。在那个时间步t<L,根据搜索树的数据选择落子动作 , 使用PUCT算法的一个变种,
  【原文】where c puct is a constant determining the level of exploration; this search control strategy initially prefers actions with high prior probability and low visit count, but asymptotically prefers actions with high action value.
【翻译】其中  cpuct是一个确定探索水平的常数;这个搜索控制策略最初偏向于具有高先验概率和低访问次数的动作,但后来渐渐偏向价值更高的动作 。
【原文】Expand and evaluate (Fig. 2b). The leaf node s L is added to a queue for neural network evaluation, (d i (p), v) = f θ (d i (s L )), where d i is a dihedral reflection or rotation selected uniformly at random from i in [1..8]. Positions in the queue are evaluated by the neural network using a mini­batch size of 8; the search thread is locked until evaluation completes. The leaf node is expanded and each edge (s L , a) is initialized to {N(s L , a) = 0, W(s L , a) = 0, Q(s L , a) = 0, P(s L , a) = p a }; the value v is then backed up.
【翻译】扩展和评估(图2b)。叶节点 被加入神经网络评价队列 ,其中 是从i∈[ 1..8 ]中随机选择的反射或旋转。队列中的棋局通过使用小批量为8的神经网络进行评估;搜索线程被加锁直到评估完成。叶节点被扩展并且每条边 被初始化为 ;然后价值v被回传。
【原文】Backup (Fig. 2c). The edge statistics are updated in a backward pass through each step t ≤ L. The visit counts are incremented, N(s t , a t ) = N(s t , a t ) + 1, and the action value is updated to the mean value,   ,We use virtual loss to ensure each thread evaluates different nodes 12,69
【翻译】回传(图2c)。在每一步t≤L,回溯更新边的统计数据。访问次数增加 ,并且动作价值更新为平均值 。我们使用虚拟损失以确保每个线程评估不同的节点。
【原文】Play (Fig. 2d). At the end of the search AlphaGo Zero selects a move a to play in the root position s0, proportional to its exponentiated visit count,  , where τ is a temperature parameter that controls the level of exploration. The search tree is reused at subsequent time­steps: the child node corresponding to the played action becomes the new root node; the subtree below this child is retained along with all its statistics, while the remainder of the tree is discarded. AlphaGo Zero resigns if its root value and best child value are lower than a threshold value v resign
【翻译】执行(图2)。AlphaGo Zero搜索的末尾,在根 处选择落子a,与它的访问计数的幂次成比例 , 其中,τ是控制探索水平的温度参数。搜索树在随后的时间步中被重复使用:与选择的落子动作一致的子节点成为新的根节点;这个子节点下面的子树与所有它的数据一起被保留,但是树的其他部分被丢弃。如果根的价值和价值最大的儿子的价值比价值阈值 小,则AlphaGo Zero就会放弃。
【原文】Compared to the MCTS in AlphaGo Fan and AlphaGo Lee, the principal differences are that AlphaGo Zero does not use any rollouts; it uses a single neural network instead of separate policy and value networks; leaf nodes are always expanded, rather than using dynamic expansion; each search thread simply waits for the neural network evaluation, rather than performing evaluation and backup asynchronously; and there is no tree policy. A transposition table was also used in the large (40 blocks, 40 days) instance of AlphaGo Zero.
【翻译】与AlphaGo Fan和 AlphaGo Lee的MCTS相比,主要的差异是AlphaGo Zero不使用任何rollout;它使用单一的神经网络代替单独的策略和价值网络;叶节点始终被扩展,而不是动态地扩展;每个搜索线程只是等待神经网络的评价,而不是进行评价以及同步的回溯;并且没有树策略。转换表也被用在大型(40块,40天)的AlphaGo Zero实例中。
【原文】Neural network architecture. The input to the neural network is a 19 × 19 × 17 image stack comprising 17 binary feature planes. Eight feature planes, X t , consist of binary values indicating the presence of the current player’s stones ( = X 1 t I if intersection i contains a stone of the player’s colour at time­step t; 0 if the intersection is empty, contains an opponent stone, or if t < 0). A further 8 feature planes, Y t , represent the corresponding features for the opponent’s stones. The final feature plane, C, represents the colour to play, and has a constant value of either 1 if black is to play or 0 if white is to play. These planes are concatenated together to give input features s t = [X t , Y t , X t−1 , Y t−1 ,..., X t−7 , Y t−7 , C]. History features X t , Y t are necessary, because Go is not fully observable solely from the current stones, as repetitions are forbidden; similarly, the colour feature C is necessary, because the komi is not observable.
【翻译】神经网络体系结构。神经网络的输入是一个包括17个二进制特征平面的19×19×17图像块。八个特征平面 ,由指示当前玩家棋子的二进制值组成,(如果在时间步t时交叉点i处有当前玩家的棋子,则 ;如果交叉点为空,或者含有对手的棋子,或者 t < 0,则为0)。另外8个特征平面 ,代表对方棋子的相应特征。最后一个特征平面C表示将要下的棋子颜色,并且具有一个常数值:如果要下的是黑子,则为1,如果要下的是白子,则为0。这些平面被连接在一起给出输入特征 。历史特征 是必要的,因为围棋不仅要对当前的棋子进行完整的观察,因为重复是禁止的;同样,颜色特征C是必要的,因为贴目是观察不到的。
【原文】The input features are processed by a residual tower that consists of a single convolutional block followed by either 19 or 39 residual blocks4.
The convolutional block applies the following modules:
【翻译】输入特征 由一个由卷积块组成的残差塔进行处理,这个卷积块由19个或39个残差块组成4。
卷积块应用在以下模块中:
【原文】(1) A convolution of 256 filters of kernel size 3 × 3 with stride 1
(2) Batch normalization 
(3) A rectifier nonlinearity
【翻译】(1)具有256个滤波器的核的大小为3×3,步长为1的卷积层
(2)批量标准化
(3)非线性整流层
【原文】Each residual block applies the following modules sequentially to its input:
【翻译】每个剩余块按照其输入顺序应用到以下模块:
【原文】(1) A convolution of 256 filters of kernel size 3 × 3 with stride 1
(2) Batch normalization
(3) A rectifier nonlinearity
(4) A convolution of 256 filters of kernel size 3 × 3 with stride 1
(5) Batch normalization
(6) A skip connection that adds the input to the block
(7) A rectifier nonlinearity
【翻译】(1)具有256个滤波器的核的大小为3×3,步长为1的卷积层
(2)批量标准化
(3) 非线性整流层
(4)具有256个滤波器的核的大小为3×3,步长为1的卷积层
(5)批量标准化
(6)向块中的输入添加一个跳连接
(7) 非线性整流层
【原文】The output of the residual tower is passed into two separate‘heads’ for computing the policy and value. The policy head applies the following modules:
【翻译】残塔的输出传入两个单独的用于计算策略和价值“头”。策略头应用于以下模块:
【原文】(1) A convolution of 2 filters of kernel size 1 × 1 with stride 1
(2) Batch normalization
(3) A rectifier nonlinearity
(4) A fully connected linear layer that outputs a vector of size 19 
2 + 1 = 362,corresponding to logit probabilities for all intersections and the pass move
【翻译】(1)具有2个滤波器的核的大小为1×1,步长为1的卷积层
(2)批量标准化
(3) 非线性整流层
(4)输出大小为192 + 1 = 362向量的全连接线性层,对应于所有交叉点和双方放弃行棋的对数概率。
    【原文】 The value head applies the following modules:
(1) A convolution of 1 filter of kernel size 1 × 1 with stride 1
(2) Batch normalization
(3) A rectifier nonlinearity
(4) A fully connected linear layer to a hidden layer of size 256
(5) A rectifier nonlinearity
(6) A fully connected linear layer to a scalar
(7) A tanh nonlinearity outputting a scalar in the range [− 1, 1]
【翻译】价值头应用以下模块:
(1)具有1个滤波器的核的大小为1×1,步长为1的卷积层
(2)批量标准化
(3) 非线性整流层
(4)到一个大小为256的隐藏层的一个全连接线性层。
(5) 非线性整流层
(6)一个到标量的全连接线性层。
(7)双曲正切在范围[ 1, 1 ]非线性输出标量
【原文】The overall network depth, in the 20­ or 40­block network, is 39 or 79 parameterized layers, respectively, for the residual tower, plus an additional 2 layers for the policy head and 3 layers for the value head. 
【翻译】总体网络深度:在20或40块的网络,分别有39或79个参数层;对于残塔,额外加上2层策略头和3层价值头。
【原文】We note that a different variant of residual networks was simultaneously applied to computer Go 33 and achieved an amateur dan­level performance; however, this was restricted to a single-headed policy network trained solely by supervised learning. 
【翻译】我们注意到,残差网络的不同变种同时应用于计算机围棋33,并达到了业余段位;然而,这仅限于一个通过监督学习训练的单一头的策略网络。
【原文】Neural network architecture comparison. Figure 4 shows the results of a comparison between network architectures. Specifically, we compared four different neural networks:
(1) dual–res: the network contains a 20­block residual tower, as described above, followed by both a policy head and a value head. This is the architecture used in AlphaGo Zero.
(2) sep–res: the network contains two 20­block residual towers. The first tower is followed by a policy head and the second tower is followed by a value head.
(3) dual–conv: the network contains a non­residual tower of 12 convolutional blocks, followed by both a policy head and a value head.
(4) sep–conv: the network contains two non­residual towers of 12 convolutional blocks. The first tower is followed by a policy head and the second tower is followed by a value head. This is the architecture used in AlphaGo Lee.
【翻译】神经网络体系结构比较。图4显示了网络架构之间比较的结果。具体地说,我们比较了四种不同的神经网络:
(1)dual–res:网络包含一个20块的残差塔,如上所述,跟着策略头和值头。这是AlphaGo Zero使用的结构。
(2)sep–res:网络包含两个20块的残塔。第一个塔后面是一个策略头,第二个塔后面是一个价值头。
(3)dual–conv:网络包含一个具有12个卷积块的非残差塔,跟着一个策略头和一个价值头。
(4)sep–conv::网络包含两个具有12个卷积块的非残差塔。第一个塔后面是一个策略头,第二个塔后面是一个价值头。这是用在AlphaGo Lee中的结构。
【原文】Each network was trained on a fixed dataset containing the final 2 million games of self­play data generated by a previous run of AlphaGo Zero, using stochastic gradient descent with the annealing rate, momentum and regularization hyper parameters described for the supervised learning experiment; however, cross­entropy and MSE components were weighted equally, since more data was available.
【翻译】每个网络在一个固定的包含最后200万场自我博弈数据的数据集上进行训练,这个自我博弈数据由以前运行的、为监督学习进行实验的、采用随机梯度下降与退火速度、动量和正则化描述超参数的AlphaGo Zero产生的。然而,交叉熵和均方误差的权重是一样的,因为更多的数据是可用的。
【原文】Evaluation. We evaluated the relative strength of AlphaGo Zero (Figs 3a, 6) by measuring the Elo rating of each player. We estimate the probability that player a will defeat player b by a logistic function , and estimate the ratings e(·) by Bayesian logistic regression, computed by the BayesElo program 25 using the standard constant c elo = 1/400.
【翻译】评价.我们通过测量每个玩家的ELO等级评估AlphaGo Zero的相对强度(图3A,6)。我们通过一个逻辑函数对玩家a将击败玩家b的概率 进行估计,并且通过贝叶斯逻辑回归估计价值 ,由使用标准常数 c elo = 1/400的bayeselo程序25进行计算。
【原文】Elo ratings were computed from the results of a 5 s per move tournament between AlphaGo Zero, AlphaGo Master, AlphaGo Lee and AlphaGo Fan. The raw neural network from AlphaGo Zero was also included in the tournament. The Elo ratings of AlphaGo Fan, Crazy Stone, Pachi and GnuGo were anchored to the tournament values from previous work 12 , and correspond to the players reported in that work. The results of the matches of AlphaGo Fan against Fan Hui and AlphaGo Lee against Lee Sedol were also included to ground the scale to human references, as otherwise the Elo ratings of AlphaGo are unrealistically high due to self­play bias.
【翻译】通过AlphaGo Zero, AlphaGo Master, AlphaGo Lee 和AlphaGo Fan的对弈计算ELO评级,每次落子花费5秒的时间。AlphaGo Zero的神经网络也被列入比赛。在之前的工作中,AlphaGo Fan、Crazy Stone、Pachi和GnuGo的ELO等级也被包括进去,并且与之前说过的棋手相对应。AlphaGo Fan对战樊麾和AlphaGo Lee对阵Lee Sedol的结果也被包括进来,为人类的引用打下基础,否则由于自我博弈的偏差,AlphaGo 的elo评级偏高。
【原文】The Elo ratings in Figs 3a, 4a, 6a were computed from the results of evaluation games between each iteration of player α θ i during self­play training. Further evaluations were also performed against baseline players with Elo ratings anchored to the previously published values 12 .
【翻译】图3a,4a,6a的ELO评级是根据在自我博弈训练中棋手 在每次迭代中的比赛结果计算出来的。进一步的评估也被固定到先前公布的ELO等级上。
【原文】We measured the head­to­head performance of AlphaGo Zero against AlphaGo Lee, and the 40­block instance of AlphaGo Zero against AlphaGo Master, using the same player and match conditions that were used against Lee Sedol in Seoul, 2016. Each player received 2 h of thinking time plus 3 byoyomi periods of 60 s per move. All games were scored using Chinese rules with a komi of 7.5 points.
【翻译】我们让AlphaGo Zero 直面对战AlphaGo Lee来计算性能,和使用40块的AlphaGo Zero 对战使用被用来在汉城于2016年对战Lee Sedol的版本和比赛条件的AlphaGo Master。每个玩家拥有2小时的思考时间加上每次移动有3个60秒的读秒时间。所有比赛均采用中国规则和7目半的贴目规则。
Data availability. The datasets used for validation and testing are the GoKifu dataset (available from http://gokifu.com/) and the KGS dataset (available from https://u­go.net/gamerecords/).

你可能感兴趣的:(AlphaGo,强化学习,人工智能)