
REINFORCEMENT LEARNING USING QUANTUM BOLTZMANN MACHINES


Abstract. We investigate whether quantum annealers with select chip layouts can outperform classical computers in reinforcement learning tasks. We associate a transverse field Ising spin Hamiltonian with a layout of qubits similar to that of a deep Boltzmann machine (DBM) and use simulated quantum annealing (SQA) to numerically simulate quantum sampling from this system. We design a reinforcement learning algorithm in which the set of visible nodes representing the states and actions of an optimal policy are the first and last layers of the deep network. In the absence of a transverse field, our simulations show that DBMs are trained more effectively than restricted Boltzmann machines (RBM) with the same number of nodes. We then develop a framework for training the network as a quantum Boltzmann machine (QBM) in the presence of a significant transverse field for reinforcement learning. This method also outperforms the reinforcement learning method that uses RBMs.


1. Introduction

Recent theoretical extensions of the quantum adiabatic theorem [1, 2, 3, 4, 5] suggest the possibility of using quantum devices with manufactured spins [6, 7] as samplers of the instantaneous steady states of quantum systems. With this motivation, we consider reinforcement learning as the computational task of interest, and design a method of reinforcement learning consisting of sampling from a layout of quantum bits similar to that of a deep Boltzmann machine (DBM) (see Fig. 1b for a graphical representation). We use simulated quantum annealing (SQA) to demonstrate the advantage of reinforcement learning using deep Boltzmann machines and quantum Boltzmann machines over their classical counterpart, for small problem instances.

Reinforcement learning ([8], known also as neurodynamic programming [9]) is an area of optimal control theory at the intersection of approximate dynamic programming and machine learning. It has been used successfully for many applications, in fields such as engineering [10, 11], sociology [12, 13], and economics [14, 15].

It is important to differentiate between reinforcement learning and common streams of research in machine learning. For instance, in supervised learning, the learning is facilitated by training samples provided by a source external to the agent and the computer. In reinforcement learning, the training samples are provided only by the interaction of the agent itself with the environment. For example, in a motion planning problem in an uncharted territory, it is desired that the agent learns in the fastest possible way to navigate correctly, with the fewest blind decisions required to be made. This is known as the dilemma of exploration versus exploitation; that is, neither exploration nor exploitation can be pursued exclusively without facing a penalty or failing at the task. The goal is hence not only to design an algorithm that eventually converges to an optimal policy, but for it to be able to generate good policies early in the learning process. We refer the reader to [8, Ch. 1.1] for a thorough introduction to use cases and problem scenarios addressed by reinforcement learning.

The core idea in reinforcement learning is defining an operator on the Banach space of real-valued functions on the set of states of a system such that a fixed point of the operator carries information about an optimal policy of actions for a finite or infinite number of decision epochs. A numerical method for computing this fixed point is to explore this function space by travelling in a direction that minimizes the distance between two consecutive applications of the contraction mapping operator [9].

This optimization task, called learning in the context of reinforcement learning, can be performed by locally parametrizing the above function space using a set of auxiliary variables, and applying a gradient method to these variables. One approach for such a parametrization, due to [16], is to use the weights of a restricted Boltzmann machine (RBM) (see Fig. 1a) as the parameters, and the free energy of the RBM as an approximator for the elements in the function space. The descent direction is then calculated in terms of the expected values of the nodes of the RBM.
Figure 1. (a) The general RBM layout used in RBM-based reinforcement learning. The visible layer on the left consists of state and action nodes, and is connected to the hidden layer, forming a complete bipartite graph. (b) The general DBM layout used in DBM-based reinforcement learning. The visible nodes on the left represent states, and the visible nodes on the right represent actions. The training process captures the correlations between states and actions in the weights of the edges between the nodes.
Figure 2. (a) A 3×5 maze. W denotes a wall, R is a positive real number representing a reward, and P is a real number representing a penalty. (b) The previous maze with two additional stochastic rewards. (c) The set of all optimal actions for each cell of the maze in (a). An optimal traversal policy is any selection from among these actions. (d) Samples of conditional state transition probabilities for the windy maze problem with no obstacles (left) and in the presence of a wall (right).

It follows from the universal approximation theorem [17] that RBMs can approximate any joint distribution over binary variables [18, 19]. However, in the context of reinforcement learning, RBMs are not necessarily the best choice for approximating Q-functions relating to Markov decision processes because RBMs may require an exponential number of hidden variables with respect to the number of visible variables in order to approximate the desired joint distribution [18, 19]. On the other hand, DBMs have the potential to model higher-order dependencies than RBMs, and are more robust than deep belief networks [20].

One may, therefore, consider replacing the RBM with other graphical models and investigating the performance of the models in the learning process. Except in the case of RBMs, calculating statistical data from the nodes of a graphical model amounts to sampling from a Boltzmann distribution, creating a bottleneck in the learning procedure. Therefore, any improvement in the efficiency of Boltzmann distribution sampling is beneficial for reinforcement learning and machine learning in general.

As we explain in what follows, DBMs are good candidates for reinforcement learning tasks. Moreover, an important advantage of a DBM layout for a quantum annealing system is that the proximity and couplings of the qubits in the layout are similar to those of a sequence of bipartite blocks in D-Wave Systems’ devices [21], and it is therefore feasible that such layouts could be manufactured in the near future. In addition, embedding Boltzmann machines in larger quantum annealer architectures is problematic when excessively large weights and biases are needed to emulate logical nodes of the Boltzmann machine using chains and clusters of physical qubits. These are the reasons why, instead of attempting to embed a Boltzmann machine structure on an existing quantum annealing system as in [22, 23, 24, 25], we work under the assumption that the network itself is the native connectivity graph of a near-future quantum annealer, and, using numerical simulations, we attempt to understand its applicability to reinforcement learning.
We also refer the reader to current trends in machine learning using quantum circuits, specifically, [26] and [27] for reinforcement learning, and [28] and [29] for training quantum Boltzmann machines with applications in deep learning and tomography. To the best of our knowledge, the present paper complements the literature on quantum machine learning as the first proposal on reinforcement learning using adiabatic quantum computation.

Quantum Monte Carlo (QMC) numerical simulations have been found to be useful in simulating time-dependent quantum systems. Simulated quantum annealing (SQA) [30, 31], one of the many flavours of QMC methods, is based on the Suzuki–Trotter expansion of the path integral representation of the Hamiltonian of Ising spin models in the presence of a transverse field driver Hamiltonian. Even though the efficiency of SQA for finding the ground state of an Ising model is topologically obstructed [32], we consider the samples generated by SQA to be good approximations of the Boltzmann distribution of the quantum Hamiltonian [33]. Experimental studies have shown similarities in the behaviour of SQA and that of quantum annealing [34, 35] and its physical realization by D-Wave Systems [36, 37].

We expect that when SQA is set such that the final strength of the transverse field is negligible, the distribution of the samples approaches the classical limit one expects to observe in the absence of the transverse field. Another classical algorithm which can be used to obtain samples from the Boltzmann distribution is conventional simulated annealing (SA), which is based on thermal annealing. Note that this algorithm can be used to create Boltzmann distributions from the Ising spin model only in the absence of a transverse field. It should, therefore, be possible to use SA or SQA to approximate the Boltzmann distribution of a classical Boltzmann machine. However, unlike in the case of SA, it is possible to use SQA not only to approximate the Boltzmann distribution of a classical Boltzmann machine, but also that of a graphical model in which the energy operator is a quantum Hamiltonian in the presence of a transverse field. These graphical models, called quantum Boltzmann machines (QBM), were first introduced in [38].

We use SQA simulations to provide evidence that a quantum annealing device that approximates the distribution of a DBM or a QBM may improve the learning process compared to a reinforcement learning method that uses classical RBM techniques. Other studies have shown that SQA is more efficient than thermal SA [30, 31]. Therefore, our method, used in conjunction with SQA, can also be viewed as a quantum-inspired approach for reinforcement learning.

What distinguishes our work from current trends in quantum machine learning is that (i) we consider the use of quantum annealing in reinforcement learning applications rather than frequently studied classification or recognition problems; (ii) using SQA-based numerical simulations, we assume that the connectivity graph of a DBM directly maps to the native layout of a feasible quantum annealer; and (iii) the results of our experiments using SQA to simulate the sampling of an entangled system of spins suggest that using quantum annealers in reinforcement learning tasks can offer an advantage over thermal sampling.


2. Preliminaries

2.1. Adiabatic Evolution of Open Quantum Systems. The evolution of a quantum system under a slowly changing time-dependent Hamiltonian is characterized by the quantum adiabatic theorem (QAT). QAT has a long history going back to the work of Born and Fock [39]. Colloquially, QAT states that a system remains close to its instantaneous steady state, provided there is a gap between the eigenenergy of the steady state and the rest of the Hamiltonian’s spectrum at every point in time and the evolution is sufficiently slow. This result motivated [40] and [41] to introduce the closely related paradigms of quantum computing known as quantum annealing (QA) and adiabatic quantum computation (AQC).

QA and AQC, in turn, inspired efforts in the manufacturing of physical realizations of adiabatic evolution via quantum hardware ([6]). In reality, the manufactured chips operate at nonzero temperature and are not isolated from their environment. Therefore, the existing adiabatic theory did not describe the behaviour of these machines. A contemporary investigation in quantum adiabatic theory was thus initiated to study adiabaticity in open quantum systems ([1, 2, 3, 4, 5]). These references prove adiabatic theorems to various degrees of generality and under a variety of assumptions about the system.

In fact, [2] develops an adiabatic theory for equations of the form
$$\dot{x}(s) = \frac{1}{\epsilon}\, L(s)\, x(s),$$
where L is a family of linear operators on a Banach space and L(s) is the generator of a contraction semigroup for every s. This provides a general framework that encompasses many adiabatic theorems, including that of classical stochastic systems, all the way to quantum evolutions of open systems generated by Lindbladians. The manifold of instantaneous stationary states is identical to ker(L(s)), and [2] characterizes the dynamics of the system relative to this manifold.

In the work of [2], it was then proven that ψ(s) is parallel-transported along ker(L(s)), under assumptions on L that include the following:

(i) L is the generator of a contraction semigroup;

(ii) L has closed and complementary range and kernel.

This stream of research suggests promising opportunities to use quantum annealers to sample from the Gibbs state of a quantum Hamiltonian using adiabatic evolution. In this paper, the transverse field Ising model (TFIM) has been the centre of attention. In practice, due to additional complications in quantum annealing (e.g., level crossings and gap closure), the samples gathered from the quantum annealer are far from the Gibbs state of the final Hamiltonian. In fact, [42] suggests that the distribution of the samples would correspond more closely to an instantaneous Hamiltonian at an intermediate point in time, called the freeze-out point. Therefore, our goal is to investigate the applicability of sampling from a TFIM with a significant transverse field to free energy-based reinforcement learning.


2.2. Simulated Quantum Annealing. Simulated quantum annealing (SQA) methods are a class of quantum-inspired algorithms that perform discrete optimization by classically simulating quantum tunnelling phenomena (see [43, p. 422] for an introduction). The algorithm used in this paper is a single spin-flip version of quantum Monte Carlo numerical simulation based on the Suzuki–Trotter formula, and uses the Metropolis acceptance probabilities. The SQA algorithm simulates the quantum annealing phenomena of an Ising spin model with a transverse field, that is,
$$\mathcal{H} = -\sum_{\langle i, j \rangle} J_{ij}\, \sigma^z_i \sigma^z_j - \sum_{i} h_i\, \sigma^z_i - \Gamma \sum_{i} \sigma^x_i,$$

where σ^z and σ^x denote Pauli operators and Γ is the strength of the transverse field.


The initial value of the transverse field, Γ = 20.00, is empirically chosen to be well above the coupling strengths created during the training. Each spin is replicated 25 times to represent the Trotter slices in the extra dimension. The simulation is set to iterate over all replications of all spins one time per sweep, and the number of sweeps is set to 300, which appears to be large enough for the sizes of Ising models constructed during our experiments. For each instance of input, the SQA algorithm is run 150 times. After termination, the configuration of each replica, as well as the configuration of the entire classical Ising model of one dimension higher, is returned.

Although the SQA algorithm does not follow the dynamics of a physical quantum annealer explicitly, it is used to simulate this process, as it captures major quantum phenomena such as tunnelling and entanglement [34]. In [34], for example, it is shown that quantum Monte Carlo simulations can be used to understand the tunnelling behaviour in quantum annealers. As mentioned previously, it readily follows from the results of [33] that the limiting distribution of SQA is the Boltzmann distribution of the effective Hamiltonian. This makes SQA a candidate classical algorithm for sampling from Boltzmann distributions of classical and quantum Hamiltonians. The former is achieved by setting the final transverse field Γf to a negligible value, and the latter by constructing an effective Hamiltonian of the system of one dimension higher, representing the quantum Hamiltonian with non-negligible Γf. Alternatively, a classical Monte Carlo simulation used to sample from the Boltzmann distribution of the classical Ising Hamiltonian is the SA algorithm, based on thermal fluctuations of classical spin systems.
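As a rough illustration of the sampling routine described above, the following is a minimal single-spin-flip path-integral Monte Carlo sketch, not the authors' implementation: the dictionary-based interface, the inverse temperature `beta`, and the Metropolis acceptance rule are assumptions, while the number of replicas (25), the number of sweeps (300), and the transverse-field schedule (20.00 down to 0.01) mirror the values stated in the text.

```python
import math
import random

def sqa_sweeps(J, h, n_spins, beta=2.0, n_replicas=25,
               gamma_start=20.0, gamma_end=0.01, n_sweeps=300):
    """Single-spin-flip simulated quantum annealing (path-integral Monte Carlo).

    J : dict mapping a pair (i, j) to the coupling strength J_ij
    h : dict mapping a spin index i to the local field h_i
    Returns the final configuration of every Trotter replica.
    """
    spins = [[random.choice([-1, 1]) for _ in range(n_spins)]
             for _ in range(n_replicas)]
    neighbours = {i: [] for i in range(n_spins)}
    for (i, j), w in J.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    for sweep in range(n_sweeps):
        # Transverse field decreasing linearly from gamma_start to gamma_end.
        gamma = gamma_start + (gamma_end - gamma_start) * sweep / (n_sweeps - 1)
        # Ferromagnetic coupling between adjacent Trotter slices of the same spin.
        j_perp = 0.5 / beta * math.log(1.0 / math.tanh(beta * gamma / n_replicas))
        for k in range(n_replicas):
            for i in range(n_spins):
                s = spins[k][i]
                # Intra-slice (classical) energy change of flipping spin i,
                # with couplings scaled down by the number of slices.
                d_classical = 2.0 * s * (
                    sum(w * spins[k][j] for j, w in neighbours[i]) + h.get(i, 0.0)
                ) / n_replicas
                # Inter-slice (quantum) energy change.
                up = spins[(k + 1) % n_replicas][i]
                down = spins[(k - 1) % n_replicas][i]
                d_quantum = 2.0 * s * j_perp * (up + down)
                d_total = d_classical + d_quantum
                # Metropolis acceptance.
                if d_total <= 0 or random.random() < math.exp(-beta * d_total):
                    spins[k][i] = -s
    return spins
```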
2.3.1. Maze Traversal as a Markov Decision Process. Maze traversal is a problem typically used to develop and benchmark reinforcement learning algorithms [46]. A maze is structured as a two-dimensional grid of r rows and c columns in which a decision-making agent is free to move up, down, left, or right, or to stand still. During the maze traversal, the agent encounters obstacles (e.g., walls), rewards (e.g., goals), and penalties (negative rewards, e.g., a pit). Each cell of the maze can contain either a deterministic or stochastic reward, a wall, a pit, or a neutral value. Fig. 2a and Fig. 2b show examples of two mazes. Fig. 2c shows the corresponding solutions to the maze in Fig. 2a.

The goal of the reinforcement learning algorithm in the maze traversal problem is for the agent to learn the optimal action to take in each cell of the maze by maximizing the total reward, that is, finding a route across the maze that avoids walls and pits while favouring rewards. This problem can be modelled as an MDP determined by the following components:

The state of the system is the agent’s position within the maze. The position state s takes values in the set of states
$$S = \{(i, j) : 1 \le i \le r,\ 1 \le j \le c\}.$$

The action a takes values in the set A of admissible moves: up, down, left, right, or standing still.

These actions will guide the agent through the maze. An action that would lead the agent into a wall (W) or outside of the maze boundary is treated as an inadmissible action. Each action can be viewed as an endomorphism on the set of states,
$$a : S \to S.$$


The transition kernel determines the probability of the agent moving from one state to another given a particular choice of action. In the simplest case, the probability of transition from s to a(s) is one:
$$\Pr\big(s' \mid s, a\big) = \begin{cases} 1 & \text{if } s' = a(s), \\ 0 & \text{otherwise.} \end{cases}$$

We call the maze clear if the associated transition kernel is as above, as opposed to the windy maze, in which there is a nonzero probability that, if the action a is taken at state s, the next state will differ from a(s).

The immediate reward r(s, a) that the agent gains from taking an action a in state s is the value contained in the destination state. Moving into a cell containing a reward returns the favourable value R, moving into a cell containing a penalty returns the unfavourable value P, and moving into a cell with no reward returns a neutral value in the interval (P, R).

A discount factor γ for future rewards is a non-negative constant smaller than 1. In our experiments, this discount factor is set to γ = 0.8. The discount factor is a feature of the problem rather than a free parameter of an implementation. For example, in a financial application scenario, the discount factor might be a function of the risk-free interest rate.

The immediate reward for moving into a cell with a stochastic reward is given by a random variable R. If an agent has prior knowledge of this distribution, then it should be able to treat the cell as one with a deterministic reward value of E[R]. This allows us to find the set of all optimal policies in each maze instance. This policy information is denoted by π* : S → 2^A, associating with each state s ∈ S a set of optimal actions π*(s) ⊆ A.

In our maze model, the neutral value is set to 100, the reward R = 200, and the penalty P = 0. In our experiments, the stochastic reward R is simulated by drawing a sample from the Bernoulli distribution 200 · Ber(0.5); hence, it has the expected value E[R] = 100, which is identical to the neutral value. Therefore, the solutions depicted in Fig. 2c are solutions to the maze of Fig. 2b as well.
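To make the maze MDP concrete, here is a minimal environment sketch under the conventions just described (neutral value 100, R = 200, P = 0, stochastic reward 200 · Ber(0.5)); the grid encoding, cell symbols, and class interface are illustrative assumptions, not the authors' code.

```python
import random

NEUTRAL, R_REWARD, P_PENALTY = 100, 200, 0

class Maze:
    """A grid maze with walls ('W'), a reward ('R'), a pit ('P'),
    stochastic rewards ('S'), and neutral cells ('.')."""

    def __init__(self, grid):
        self.grid = grid                      # list of strings, r rows by c columns
        self.rows, self.cols = len(grid), len(grid[0])
        self.actions = ['up', 'down', 'left', 'right', 'stay']
        self.moves = {'up': (-1, 0), 'down': (1, 0),
                      'left': (0, -1), 'right': (0, 1), 'stay': (0, 0)}

    def admissible(self, state, action):
        """An action leading into a wall or off the grid is inadmissible."""
        i, j = state
        di, dj = self.moves[action]
        ni, nj = i + di, j + dj
        return 0 <= ni < self.rows and 0 <= nj < self.cols and self.grid[ni][nj] != 'W'

    def step(self, state, action):
        """Deterministic ('clear') transition kernel: move to a(s) and collect
        the immediate reward of the destination cell."""
        if not self.admissible(state, action):
            return state, None
        i, j = state
        di, dj = self.moves[action]
        ni, nj = i + di, j + dj
        cell = self.grid[ni][nj]
        if cell == 'R':
            reward = R_REWARD
        elif cell == 'P':
            reward = P_PENALTY
        elif cell == 'S':
            reward = R_REWARD * random.randint(0, 1)   # 200 * Ber(0.5)
        else:
            reward = NEUTRAL
        return (ni, nj), reward

# A 3x5 maze in the spirit of Fig. 2a (the exact layout here is only schematic).
maze = Maze(['..W.R',
             '..WP.',
             '.....'])
```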

2.4. Value Iteration. Bellman [47] writes V(π, s) recursively in the following manner using the monotone convergence theorem:

$$V(\pi, s) = \mathbb{E}\big[r(s, \pi(s))\big] + \gamma \sum_{s' \in S} \Pr\big(s' \mid s, \pi(s)\big)\, V(\pi, s').$$

Value iteration methods based on this recursion depend heavily on the cardinality of both S and A, and suffer from the curse of dimensionality [47, 48]. Moreover, the value iteration method requires having full knowledge of the transition probabilities, as well as the distribution of the immediate rewards.
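For completeness, a minimal value iteration sketch over the finite maze MDP follows, reusing the `Maze` class sketched earlier and assuming the clear transition kernel and γ = 0.8; it illustrates the fixed-point computation rather than reproducing the paper's code.

```python
def value_iteration(maze, gamma=0.8, tol=1e-6):
    """Iterate V(s) <- max_a [ r(s, a) + gamma * V(a(s)) ] to a fixed point."""
    states = [(i, j) for i in range(maze.rows) for j in range(maze.cols)
              if maze.grid[i][j] != 'W']
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            values = []
            for a in maze.actions:
                if not maze.admissible(s, a):
                    continue
                s2, r = maze.step(s, a)
                if maze.grid[s2[0]][s2[1]] == 'S':
                    r = 100            # use the expected reward E[R] = 100
                values.append(r + gamma * V[s2])
            best = max(values)         # 'stay' is always admissible
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```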

2.5. Q-functions. For a stationary policy π, the Q-function (also known as the action–value function) is defined as a mapping of a pair (s, a) to the expected value of the reward of the Markov chain that begins with taking action a at initial state s and continuing according to π [8]:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(s_k, a_k) \,\middle|\, s_0 = s,\ a_0 = a\right].$$

In what follows, we explain the case in which the parameters θ comprise the weights of a Boltzmann machine. Let us begin by introducing clamped Boltzmann machines, which are of particular importance in the case of reinforcement learning.
2.7. Clamped Boltzmann Machines. A classical Boltzmann machine is a type of stochastic neural network with two sets V and H of visible and hidden nodes, respectively. Both visible and hidden nodes represent binary random variables. We use the same notation for a node and the binary random variable it represents. The interactions between the variables represented by their respective nodes are specified by real-valued weighted edges of the underlying undirected graph. A GBM, as opposed to models such as RBMs and DBMs, allows weights between any two nodes.
The energy of the classical Boltzmann machine is
$$E(v, h) = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, h \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, h\, h',$$

where w^{vh} and w^{hh'} denote the weights between the visible and hidden, and between the hidden and hidden, nodes of the Boltzmann machine, respectively, and the energy is defined as a function of binary vectors v and h corresponding to the visible and hidden variables, respectively.

A clamped GBM is a neural network whose underlying graph is the subgraph obtained by removing the visible nodes, for which the effect of a fixed assignment v of the visible binary variables contributes as constant coefficients to the associated energy

$$E_v(h) = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, h \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, h\, h'.$$
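As a concrete reading of the clamped energy above, here is a minimal sketch; the dictionary-based weight layout and the encoding of the node values are assumptions made for illustration.

```python
def clamped_energy(h, v, w_vh, w_hh):
    """Energy of a clamped Boltzmann machine for one hidden configuration.

    v    : dict of fixed visible (state and action) node values
    h    : dict of hidden node values
    w_vh : dict mapping (visible index, hidden index) -> weight
    w_hh : dict mapping (hidden index, hidden index) -> weight
    """
    energy = -sum(w * v[i] * h[j] for (i, j), w in w_vh.items())
    energy -= sum(w * h[j] * h[k] for (j, k), w in w_hh.items())
    return energy
```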

A clamped quantum Boltzmann machine (QBM) has the same underlying graph as a clamped GBM, but instead of a binary random variable, a qubit is associated to each node of the network. The energy function is substituted by the quantum Hamiltonian

$$\mathcal{H}_{v} = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, \sigma^{z}_{h} \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, \sigma^{z}_{h} \sigma^{z}_{h'} \;-\; \Gamma \sum_{h \in H} \sigma^{x}_{h}.$$


2.8. Reinforcement Learning Using Clamped Boltzmann Machines. In this section, we explain how a general Boltzmann machine (GBM) can be used to provide a Q-function approximator in a Q-learning method. To the best of our knowledge, this derivation has not been previously given, although it can be readily derived from the ideas presented in [16] and [38]. Following [16], the goal is to use the negative free energy of a Boltzmann machine to approximate the Q-function through the relationship
$$Q(s, a) \approx -F(s, a),$$

where F(s, a) is the equilibrium free energy of the clamped Boltzmann machine obtained by fixing the assignments of state s and action a on the state nodes and action nodes, respectively, of the Boltzmann machine. In reinforcement learning, the visible nodes of the GBM are partitioned into two subsets of state nodes S and action nodes A.

The parameters θ, to be trained according to a TD(0) update rule (see Sec. 2.6), are the weights in a Boltzmann machine. For every weight w, the update rule is
$$\Delta w = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, \frac{\partial\, (-F(s_1, a_1))}{\partial w},$$

where ε is the learning rate, (s1, a1) is the current state–action pair, and (s2, a2) is the subsequent state–action pair.
Here, the expectation values appearing in the derivatives of the negative free energy are estimated by averaging over the spin configurations c of the classical Ising model of one dimension higher. The above argument holds in the absence of the transverse field, that is, for the classical Boltzmann machine. In this case, the TD(0) update rule is given by
$$\Delta w^{vh} = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, v\,\langle h \rangle,$$
$$\Delta w^{hh'} = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, \langle h\, h' \rangle,$$

where ⟨·⟩ denotes the expectation value under the Boltzmann distribution of the clamped Boltzmann machine.

The update rule for the weights of the RBM is (17), the visible–hidden update, alone. Moreover, in the case of RBMs, the equilibrium free energy F(s, a) and its derivatives with respect to the weights can be calculated without the need for Boltzmann distribution sampling, according to the closed formula
$$F(s, a) = -\sum_{h \in H} \Big(\sum_{s_i \in S} w^{s_i h} s_i + \sum_{a_j \in A} w^{a_j h} a_j\Big) \langle h \rangle \;+\; \sum_{h \in H} \Big(\langle h \rangle \log \langle h \rangle + (1 - \langle h \rangle) \log (1 - \langle h \rangle)\Big),$$
$$\langle h \rangle = \sigma\Big(\sum_{s_i \in S} w^{s_i h} s_i + \sum_{a_j \in A} w^{a_j h} a_j\Big).$$

Here, σ denotes the sigmoid function. Note that, in the general case, since the hidden nodes of a clamped Boltzmann machine are not independent, the calculation of the free energy is intractable.
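The closed-form evaluation available for RBMs can be sketched as follows; the array-based interface, the inverse temperature `beta`, and the concatenated state–action encoding are assumptions, and the TD(0) step mirrors the generic update described above rather than reproducing the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_negative_free_energy(v, W, beta=1.0):
    """Negative free energy of an RBM with visible units clamped to v.

    v : binary visible vector (concatenated state and action encoding)
    W : weight matrix of shape (n_visible, n_hidden)
    Returns -F(v) and the hidden activations <h>.
    """
    pre = beta * (v @ W)                   # effective field on each hidden unit
    h_mean = sigmoid(pre)                  # <h_j> = sigma(sum_i w_ij v_i)
    # Equivalent to the energy term plus the hidden-node entropy term:
    # -F = (1/beta) * sum_j log(1 + exp(beta * sum_i w_ij v_i)).
    return np.sum(np.logaddexp(0.0, pre)) / beta, h_mean

def td0_update(W, v1, v2, reward, lr=0.01, gamma=0.8):
    """One TD(0) step with Q(s, a) approximated by -F(s, a)."""
    q1, h1 = rbm_negative_free_energy(v1, W)
    q2, _ = rbm_negative_free_energy(v2, W)
    td_error = reward + gamma * q2 - q1
    # d(-F)/dw_ij = v_i * <h_j> for a clamped RBM.
    W += lr * td_error * np.outer(v1, h1)
    return W
```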

3. Algorithms

In this section, we present the details of classical reinforcement learning using RBM, a semi-classical approach based on a DBM (using SA and SQA), and a quantum reinforcement learning approach (using SQA or quantum annealing). All of the algorithms are based on the Q-learning TD(0) method presented in the previous section. Pseudo-code for these methods is provided in Algorithms 1, 2, and 3 below.

3.1. Reinforcement Learning Using RBMs. The RBM reinforcement learning algorithm is due to Sallans and Hinton [16]. This algorithm uses the update rule (17), with v representing the state or action encoding, to update the weights of an RBM, and (21) to calculate the expected values ⟨h⟩ of the random variables associated with the hidden nodes. As explained in Sec. 2.8, the main advantage of RBMs is that they admit explicit formulas for the hidden-node activations, given the values of the visible nodes. Moreover, only for RBMs can the entropy portion of the free energy (19) be written in terms of the activations of the hidden nodes. More-complicated network architectures do not possess this property, so there is a need for a Boltzmann distribution sampler.


In Algorithm 1, we recall the steps of the classical reinforcement learning algorithm using an RBM with a graphical model similar to that shown in Fig. 1a. We set the initial Boltzmann machine weights using Gaussian zero-mean values with a standard deviation of 1.00, as is common practice for implementing Boltzmann machines [50]. Consequently, this initializes an approximation of a Q-function and of the policy derived from it.
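A plausible way of reading a policy off the free-energy approximation of the Q-function is greedy selection over admissible actions, reusing the `Maze` and `rbm_negative_free_energy` sketches above; the `encode` helper, which maps a state–action pair to a binary visible vector, is hypothetical.

```python
def greedy_policy(maze, state, W, encode):
    """Pick the admissible action with the largest approximate Q(s, a) = -F(s, a)."""
    best_action, best_q = None, float('-inf')
    for action in maze.actions:
        if not maze.admissible(state, action):
            continue
        q, _ = rbm_negative_free_energy(encode(state, action), W)
        if q > best_q:
            best_action, best_q = action, q
    return best_action
```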

3.2. Reinforcement Learning Using DBMs. Since we are interested in the dependencies between states and actions, we consider a DBM architecture that has a layer of states connected to the first layer of hidden nodes, followed by multiple hidden layers, and a layer of actions connected to the final layer of hidden nodes (see Fig. 1). We demonstrate the advantages of this deep architecture trained using SQA and the derivation in Sec. 2.8 of the temporal-difference gradient method for reinforcement learning using general Boltzmann machines (GBM).

In Algorithm 2, we summarize the DBM-RL method. Here, the graphical model of the Boltzmann machine is similar to that shown in Fig. 1b. The initialization of the weights of the DBM is performed in a similar fashion to the previous algorithm.

According to lines 4 and 5 of Algorithm 2, the samples from the SA or SQA algorithm are used to approximate the free energy of the classical DBM at the points (s1, a1) and (s2, a2) using (19).

If SQA is used, averages are taken over each replica of each run; hence, there are 3750 samples of configurations of the hidden nodes for each state–action pair. The strength of the transverse field is scheduled to decrease linearly from Γ = 20.00 to Γf = 0.01.
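A rough sketch of how the clamped free energy could be estimated from sampled hidden-node configurations follows, as an average energy minus a temperature-scaled empirical-entropy term; the plug-in entropy estimator and the inverse temperature `beta` are assumptions about implementation detail, and `energy_fn` could be, for instance, a partial application of the `clamped_energy` sketch given earlier.

```python
import math
from collections import Counter

def estimate_free_energy(samples, energy_fn, beta=2.0):
    """Estimate F = <E> - (1/beta) * H from sampled hidden configurations.

    samples   : list of hashable hidden configurations (e.g., tuples of +/-1),
                such as the 3750 configurations per state-action pair
    energy_fn : callable returning the clamped energy of one configuration
    """
    n = len(samples)
    avg_energy = sum(energy_fn(c) for c in samples) / n
    counts = Counter(samples)
    entropy = -sum((m / n) * math.log(m / n) for m in counts.values())
    return avg_energy - entropy / beta
```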

3.3. Reinforcement Learning Using QBMs. The last algorithm is QBM-RL, presented in Algorithm 3. The initialization is performed as in Algorithms 1 and 2. However, according to lines 4 and 5, the samples from the SQA algorithm are used to approximate the free energy of a QBM at the points (s1, a1) and (s2, a2) by computing the free energy corresponding to an effective classical Ising spin model of one dimension higher representing the quantum Ising spin model of the QBM, via (16).
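To give a feel for the quantum case, the following sketch evaluates the energy of the effective classical Ising model of one dimension higher (the Suzuki–Trotter replication of the clamped QBM) for one sampled configuration; the replica coupling uses the standard w⁺ = (1/2β) log coth(βΓ/M) expression, and the interface and β value are assumptions.

```python
import math

def effective_energy(replicas, J, h_eff, gamma, beta=2.0):
    """Energy of one configuration of the replicated (d+1)-dimensional Ising model.

    replicas : list of M spin configurations (lists of +/-1), one per Trotter slice
    J        : dict of couplings between hidden nodes of the clamped model
    h_eff    : dict of effective local fields on the hidden nodes
               (contributions of the clamped visible nodes)
    gamma    : transverse-field strength
    """
    m = len(replicas)
    w_plus = 0.5 / beta * math.log(1.0 / math.tanh(beta * gamma / m))
    energy = 0.0
    for k, slice_k in enumerate(replicas):
        # Classical terms, scaled down by the number of Trotter slices.
        for (i, j), w in J.items():
            energy -= w * slice_k[i] * slice_k[j] / m
        for i, hi in h_eff.items():
            energy -= hi * slice_k[i] / m
        # Ferromagnetic coupling between neighbouring slices of the same spin.
        next_slice = replicas[(k + 1) % m]
        for i in range(len(slice_k)):
            energy -= w_plus * slice_k[i] * next_slice[i]
    return energy
```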

A subsequent state s2 is obtained from the state–action pair (s1, a1) using the transition kernel outlined in Sec. 2, and a corresponding action a2 is chosen via the policy. Another SQA sampling is performed for this pair in a similar fashion to the above.

In Fig. 3a and Fig. 3b, the selection of (s1, a1) is performed by sweeping across the set of state–action pairs. In Fig. 3d, the selection of (s1, a1) and s2 is performed by sweeping over S × A × S. In Fig. 3c, the selection of s1, a1, and s2 is performed uniformly at random.

We experiment with a variety of learning-rate schedules, including exponential, harmonic, and linear; however, we found that for the training of both RBMs and DBMs, an adaptive learning-rate schedule performed best (for information on adaptive subgradient methods, see [51]). In our experiments, the initial learning rate is set to 0.01.

In all of our studied algorithms, training terminates when a desired number of training samples have been processed, after which the updated policy is returned.

4. Numerical Results

We study the performance of temporal-difference reinforcement learning algorithms (explained in detail in Sec. 3) using Boltzmann machines. We generalize the method introduced in [16], and compare the policies obtained from these algorithms to the optimal policy using a fidelity measure, which we define in (25).
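Since the defining equation (25) is not reproduced in this copy, the snippet below shows one plausible reading of such a fidelity measure, namely the fraction of states at which the learned policy selects an optimal action; this is an assumption made for illustration, not necessarily the paper's exact definition.

```python
def fidelity(learned_policy, optimal_actions, states):
    """Fraction of states whose learned action lies in the optimal-action set."""
    hits = sum(1 for s in states if learned_policy(s) in optimal_actions[s])
    return hits / len(states)
```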

Fig. 3a and Fig. 3b show the fidelity of the generated policies obtained from various reinforcement learning experiments on two clear 3×5 mazes. In Fig. 3a, the maze includes one reward, one wall, and one pit, and in Fig. 3b, the maze additionally includes two stochastic rewards. In these experiments, the training samples are generated by sweeping over the maze. Each sweep iterates over the maze elements in the same order. This explains the periodic behaviour of the fidelity curves (cf. Fig. 3c).

The curves labelled ‘QBM-RL’ represent the fidelity of reinforcement learning using QBMs. Sampling from the QBM is performed using SQA. All other experiments use classical Boltzmann machines as their graphical model. In the experiment labelled ‘RBM-RL’, the graphical model is an RBM, trained classically using formula (21). The remaining curve is labelled ‘DBM-RL’ for classical reinforcement learning using a DBM. In these experiments, sampling from configurations of the DBM is performed with SQA (with Γf = 0.01). The fidelity results of DBM-RL coincide closely with those obtained by sampling configurations of the DBM using SA; therefore, we have not included them. Fig. 3c regenerates the results of Fig. 3a using uniform random sampling (i.e., without sweeping through the maze).

In Fig. 4, we report the effect of maze size on avℓ for RBM-RL, DBM-RL, and QBM-RL for varying maze sizes.

Figure 3. Comparison of RBM-RL, DBM-RL, and QBM-RL training results. Every underlying RBM has 16 hidden nodes and every DBM has two layers of eight hidden nodes. The shaded areas indicate the standard deviation of each training algorithm. (a) The fidelity curves for the three algorithms run on the maze in Fig. 2a. (b) The fidelity curves for the maze in Fig. 2b. (c) The fidelity curves of the three algorithms corresponding to the same experiment as that of (a), except that the training is performed using uniformly generated training samples rather than sweeping across the maze. (d) The fidelity curves corresponding to a windy maze similar to Fig. 2a.

We plot avℓ for each algorithm with ℓ = 500, 250, and 10 as a function of maze size, for a family of problems with one deterministic reward, two stochastic rewards, one pit, and n − 2 walls. We use nine n × 5 mazes in this experiment, indexed by various values of n. In addition to the avℓ plots, we include a dotted-line plot depicting the fidelity of a completely random policy. The fidelity of the random policy is given by the average probability of choosing an optimal action at each state when admissible actions are generated uniformly at random.

Note that the fidelity of the random policy increases as the maze size increases. This is due to the fact that maze rows containing a wall have, on average, more admissible optimal actions than the top and bottom rows of the maze.
Figure 4. A comparison between the performance of RBM-RL, DBM-RL, and QBM-RL as the size of the maze grows. All Boltzmann machines have 20 hidden nodes. (a) The schematics of an n × 5 maze with one deterministic reward, two stochastic rewards, one pit, and n − 2 walls. (b) The scaling of the average fidelity of each algorithm run on each instance of the n × 5 maze. The dotted line is the average fidelity of uniformly randomly generated actions.

5. Discussion

The fidelity curves in Fig. 3 show that DBM-RL outperforms RBM-RL with respect to the number of training samples. Therefore, we expect that in conjunction with a high-performance sampler of Boltzmann distributions (e.g., a quantum or a quantum-inspired oracle taken as such), DBM-RL improves the performance of reinforcement learning. QBM-RL is not only on par with DBM-RL, but actually slightly improves upon it by taking advantage of sampling in the presence of a significant transverse field.

This is a positive result for the potential of sampling from a quantum device in machine learning, as we do not expect quantum annealing to obtain the Boltzmann distribution of a classical Hamiltonian [42, 52, 53]. However, given the discussion in Sec. 2.1, a quantum annealer viewed as an open system coupled to a heat bath could be a better choice of sampler from its instantaneous Hamiltonian in earlier stages of the annealing process, compared to a sampler of the problem Hamiltonian at the end of the evolution. Therefore, these experiments address whether a quantum Boltzmann machine with a transverse field Ising Hamiltonian can perform at least as well as a classical Boltzmann machine.

In each experiment, the fidelity curves for DBM-RL produced using SQA with Γf = 0.01 match the ones produced using SA. This is consistent with our expectation that using SQA with Γ → 0 produces samples from the same distribution as SA, namely, the Boltzmann distribution of the classical Ising Hamiltonian with no transverse field.

The best algorithm in our experiments is evidently QBM-RL using SQA. Here, the final transverse field is Γf = 2.00, corresponding to one-third of the anneal for a quantum annealing algorithm that evolves along the convex linear combination of the initial and final Hamiltonians with constant speed. This is consistent with ideas found in [38] on sampling at freeze-out [42].

Fig. 3c shows that, whereas the maze can be solved with fewer training samples using ordered sweeps of the maze, the periodic behaviour of the fidelity curves is due to this periodic choice of training samples. This effect disappears once the training samples are chosen uniformly at random.

Fig. 3d shows that the improvement in the learning of the DBM-RL and QBM-RL algorithms persists in the case of more-complicated transition kernels. The same ordering of fidelity curves discussed earlier is observed: QBM-RL outperforms DBM-RL, and DBM-RL outperforms RBM-RL.

One can observe from Fig. 4 that, as the maze size increases and the complexity of the reinforcement learning task increases, avℓ decreases for each algorithm. The RBM algorithm, while always outperformed by DBM-RL and QBM-RL, shows a much faster decay in average fidelity as a function of maze size compared to both DBM-RL and QBM-RL. For larger mazes, the RBM algorithm fails to capture maze traversal knowledge, and approaches the avℓ of a random action allocation (the dotted line), whereas the DBM-RL and QBM-RL algorithms continue to be trained well. DBM-RL and QBM-RL are capable of training the agent to traverse larger mazes, whereas the RBM algorithm, utilizing the same number of hidden nodes and a larger number of weights, fails to converge to an output that is better than a random policy.
The runtime and computational resources needed to compare DBM-RL and QBM-RL with RBM-RL have not been investigated here. We expect that in view of [19], the size of RBM needed to solve larger maze problems will grow exponentially. Thus, it would be interesting to research the extrapolation of the asymptotic complexity and size of the DBM-RL and QBM-RL algorithms with the aim of attaining a quantum advantage. Applying the algorithms described in this paper to tasks that have larger state and action spaces, as well as to more-complicated environments, will allow us to demonstrate the scalability and usefulness of the DBM-RL and QBM-RL approaches. The experimental results shown in Fig. 4 represent only a rudimentary attempt to investigate this matter, yet the results are promising. However, this experiment does not provide a practical characterization of the scaling of our approach, and further investigation is needed.

Acknowledgements

We would like to thank Hamed Karimi, Helmut Katzgraber, Murray Thom, Matthias Troyer, and Ehsan Zahedinejad, as well as the referees and editorial board of Quantum Information and Computation, for reviewing this work and providing many helpful suggestions. The idea of using SQA to run experiments involving measurements with a nonzero transverse field was communicated in person by Mohammad Amin. We would also like to thank Marko Bucyk for editing this manuscript.

References

[1] M. S. Sarandy and D. A. Lidar, "Adiabatic approximation in open quantum systems," Phys. Rev. A, vol. 71, p. 012331, 2005.

[2] J. E. Avron, M. Fraas, G. M. Graf, and P. Grech, "Adiabatic theorems for generators of contracting evolutions," Commun. Math. Phys., vol. 314, no. 1, pp. 163–191, 2012.

[3] T. Albash, S. Boixo, D. A. Lidar, and P. Zanardi, "Quantum adiabatic Markovian master equations," New J. Phys., vol. 14, no. 12, p. 123016, 2012.

[4] S. Bachmann, W. De Roeck, and M. Fraas, "The Adiabatic Theorem for Many-Body Quantum Systems," arXiv:1612.01505, 2016.

[5] L. C. Venuti, T. Albash, D. A. Lidar, and P. Zanardi, "Adiabaticity in open quantum systems," Phys. Rev. A, vol. 93, p. 032118, 2016.

[6] M. W. Johnson, M. H. S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, E. M. Chapple, C. Enderud, J. P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov, C. Rich, M. C. Thom, E. Tolkacheva, C. J. S. Truncik, S. Uchaikin, J. Wang, B. Wilson, and G. Rose, "Quantum annealing with manufactured spins," Nature, vol. 473, pp. 194–198, 2011.

[7] J. Kelly, R. Barends, A. G. Fowler, A. Megrant, E. Jeffrey, T. C. White, D. Sank, J. Y. Mutus, B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, I. C. Hoi, C. Neill, P. J. J. O'Malley, C. Quintana, P. Roushan, A. Vainsencher, J. Wenner, A. N. Cleland, and J. M. Martinis, "State preservation by repetitive error detection in a superconducting quantum circuit," Nature, vol. 519, pp. 66–69, 2015.

[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[9] D. Bertsekas and J. Tsitsiklis, Neuro-dynamic Programming. Athena Scientific, 1996.

[10] V. Derhami, E. Khodadadian, M. Ghasemzadeh, and A. M. Z. Bidoki, "Applying reinforcement learning for web pages ranking algorithms," Appl. Soft Comput., vol. 13, no. 4, pp. 1686–1692, 2013.

[11] S. Syafiie, F. Tadeo, and E. Martinez, "Model-free learning control of neutralization processes using reinforcement learning," Engineering Applications of Artificial Intelligence, vol. 20, no. 6, pp. 767–782, 2007.

[12] I. Erev and A. E. Roth, "Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria," Am. Econ. Rev., pp. 848–881, 1998.

[13] H. Shteingart and Y. Loewenstein, "Reinforcement learning and human behavior," Current Opinion in Neurobiology, vol. 25, pp. 93–98, 2014.

[14] T. Matsui, T. Goto, K. Izumi, and Y. Chen, "Compound reinforcement learning: theory and an application to finance," in European Workshop on Reinforcement Learning, pp. 321–332, Springer, 2011.

[15] Z. Sui, A. Gosavi, and L. Lin, "A reinforcement learning approach for inventory replenishment in vendor-managed inventory systems with consignment inventory," Engineering Management Journal, vol. 22, no. 4, pp. 44–53, 2010.

[16] B. Sallans and G. E. Hinton, "Reinforcement learning with factored states and actions," JMLR, vol. 5, pp. 1063–1088, 2004.

[17] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[18] J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel, "On the representational efficiency of restricted Boltzmann machines," in Advances in Neural Information Processing Systems, pp. 2877–2885, 2013.

[19] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.

[20] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann Machines," in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, pp. 448–455, 2009.

[21] R. Harris, M. W. Johnson, T. Lanting, A. J. Berkley, J. Johansson, P. Bunyk, E. Tolkacheva, E. Ladizinsky, N. Ladizinsky, T. Oh, F. Cioata, I. Perminov, P. Spear, C. Enderud, C. Rich, S. Uchaikin, M. C. Thom, E. M. Chapple, J. Wang, B. Wilson, M. H. S. Amin, N. Dickson, K. Karimi, B. Macready, C. J. S. Truncik, and G. Rose, "Experimental investigation of an eight-qubit unit cell in a superconducting optimization processor," Phys. Rev. B, vol. 82, p. 024511, 2010.

[22] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Quantum-assisted learning of graphical models with arbitrary pairwise connectivity," arXiv:1609.02542, 2016.

[23] S. H. Adachi and M. P. Henderson, "Application of Quantum Annealing to Training of Deep Neural Networks," arXiv:1510.06356, 2015.

[24] M. Denil and N. de Freitas, "Toward the implementation of a quantum RBM," in NIPS 2011 Deep Learning and Unsupervised Feature Learning Workshop, 2011.

[25] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Estimation of effective temperatures in quantum annealers for sampling applications: A case study with possible applications in deep learning," Phys. Rev. A, vol. 94, p. 022308, 2016.

[26] V. Dunjko, J. M. Taylor, and H. J. Briegel, "Quantum-enhanced machine learning," Phys. Rev. Lett., vol. 117, p. 130501, 2016.

[27] D. Dong, C. Chen, H. Li, and T. J. Tarn, "Quantum reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, pp. 1207–1220, 2008.

[28] N. Wiebe, A. Kapoor, and K. M. Svore, "Quantum deep learning," Quantum Inf. Comput., vol. 16, no. 7-8, pp. 541–587, 2016.

[29] M. Kieferova and N. Wiebe, "Tomography and Generative Data Modeling via Quantum Boltzmann Training," arXiv:1612.05204, 2016.

[30] E. Crosson and A. W. Harrow, "Simulated Quantum Annealing Can Be Exponentially Faster than Classical Simulated Annealing," arXiv:1601.03030, 2016.

[31] B. Heim, T. F. Rønnow, S. V. Isakov, and M. Troyer, "Quantum versus classical annealing of Ising spin glasses," Science, vol. 348, no. 6231, pp. 215–217, 2015.

[32] M. B. Hastings and M. H. Freedman, "Obstructions to classically simulating the quantum adiabatic algorithm," Quantum Information & Computation, vol. 13, pp. 1038–1076, 2013.

[33] S. Morita and H. Nishimori, "Convergence theorems for quantum annealing," J. Phys. A: Mathematical and General, vol. 39, no. 45, p. 13903, 2006.

[34] S. V. Isakov, G. Mazzola, V. N. Smelyanskiy, Z. Jiang, S. Boixo, H. Neven, and M. Troyer, "Understanding quantum tunneling through quantum Monte Carlo simulations," arXiv:1510.08057, 2015.

[35] T. Albash, T. F. Rønnow, M. Troyer, and D. A. Lidar, "Reexamining classical and quantum models for the D-Wave One processor," arXiv:1409.3827, 2014.

[36] L. T. Brady and W. van Dam, "Quantum Monte Carlo simulations of tunneling in quantum adiabatic optimization," Phys. Rev. A, vol. 93, no. 3, p. 032304, 2016.

[37] S. W. Shin, G. Smith, J. A. Smolin, and U. Vazirani, "How 'Quantum' is the D-Wave Machine?," arXiv:1401.7087, 2014.

[38] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, "Quantum Boltzmann machine," arXiv:1601.02036, 2016.

[39] M. Born and V. Fock, "Beweis des Adiabatensatzes," Zeitschrift für Physik, vol. 51, pp. 165–180, 1928.

[40] T. Kadowaki and H. Nishimori, "Quantum annealing in the transverse Ising model," Phys. Rev. E, vol. 58, pp. 5355–5363, 1998.

[41] E. Farhi, J. Goldstone, S. Gutmann, and M. Sipser, "Quantum Computation by Adiabatic Evolution," arXiv:quant-ph/0001106, 2000.

[42] M. H. Amin, "Searching for quantum speedup in quasistatic quantum annealers," Phys. Rev. A, vol. 92, no. 5, p. 052323, 2015.

[43] A. Brabazon, M. O'Neill, and S. McGarraghy, Natural Computing Algorithms. Springer-Verlag Berlin Heidelberg, 2015.

[44] R. Martonak, G. E. Santoro, and E. Tosatti, "Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model," Phys. Rev. B, vol. 66, no. 9, p. 094203, 2002.

[45] S. Yuksel, "Control of stochastic systems." Course lecture notes, Queen's University (Kingston, ON, Canada), retrieved in May 2016.

[46] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224, Morgan Kaufmann, 1990.

[47] R. Bellman, "Dynamic programming and Lagrange multipliers," Proceedings of the National Academy of Sciences, vol. 42, no. 10, pp. 767–769, 1956.

[48] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[49] M. Suzuki, "Relationship between d-dimensional quantal spin systems and (d+1)-dimensional Ising systems: equivalence, critical exponents and systematic approximants of the partition function and spin correlations," Progr. Theor. Phys., vol. 56, no. 5, pp. 1454–1469, 1976.

[50] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.

[51] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," JMLR, vol. 12, pp. 2121–2159, 2011.

[52] Y. Matsuda, H. Nishimori, and H. G. Katzgraber, "Ground-state statistics from annealing algorithms: quantum versus classical approaches," New J. Phys., vol. 11, no. 7, p. 073021, 2009.

[53] L. C. Venuti, T. Albash, M. Marvian, D. Lidar, and P. Zanardi, "Relaxation versus adiabatic quantum steady-state preparation," Phys. Rev. A, vol. 95, p. 042302, 2017.

[54] E. Farhi and A. W. Harrow, "Quantum Supremacy through the Quantum Approximate Optimization Algorithm," arXiv:1602.07674, 2016.

[55] F. Abtahi and I. Fasel, "Deep belief nets as function approximators for reinforcement learning," Frontiers in Computational Neuroscience, 2011.

[56] S. Elfwing, E. Uchibe, and K. Doya, "Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state spaces," Value and Reward Based Learning in Neurobots, p. 30, 2015.

[57] R. Martonak, G. E. Santoro, and E. Tosatti, "Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model," Phys. Rev. B, vol. 66, no. 9, p. 094203, 2002.

[58] M. Otsuka, J. Yoshimoto, and K. Doya, "Free-energy-based reinforcement learning in a partially observable environment," ESANN 2010 proceedings, European Symposium on Artificial Neural Networks – Computational Intelligence and Machine Learning, 2010.

[59] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, J. M. Martinis, and H. Neven, "Characterizing quantum supremacy in near-term devices," arXiv:1608.00263v2, 2016.

[60] J. Raymond, S. Yarkoni, and E. Andriyash, "Global warming: Temperature estimation in annealers," arXiv:1606.00919, 2016.

[61] P. M. Long and R. Servedio, "Restricted Boltzmann machines are hard to approximately evaluate or simulate," in Proceedings of the 27th International Conference on Machine Learning, pp. 703–710, 2010.

[62] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cogn. Sci., vol. 9, no. 1, pp. 147–169, 1985.

[63] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari, "Convergence results for single-step on-policy reinforcement-learning algorithms," Machine Learning, vol. 38, no. 3, pp. 287–308, 2000.

[64] N. Fremaux, H. Sprekeler, and W. Gerstner, "Reinforcement learning using a continuous time actor-critic framework with spiking neurons," PLoS Comput. Biol., vol. 9, no. 4, p. e1003024, 2013.

[65] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.

[66] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

E-mail address, Daniel Crawford: [email protected]

E-mail address, Anna Levit: [email protected]

E-mail address, Navid Ghadermarzy: [email protected]

E-mail address, Jaspreet S. Oberoi: [email protected]

E-mail address, Pooya Ronagh: [email protected]

(Daniel Crawford, Anna Levit, Jaspreet S. Oberoi, Pooya Ronagh) 1QB Information Technologies (1QBit) (Navid Ghadermarzy) Department of Mathematics, University of British Columbia (Jaspreet S. Oberoi) School of Engineering Science, Simon Fraser University

(Pooya Ronagh) Institute for Quantum Computing and Department of Physics and Astronomy, Uni-versity of Waterloo
