
REINFORCEMENT LEARNING USING QUANTUM BOLTZMANN MACHINES


Abstract. We investigate whether quantum annealers with select chip layouts can outperform classical computers in reinforcement learning tasks. We associate a transverse field Ising spin Hamiltonian with a layout of qubits similar to that of a deep Boltzmann machine (DBM) and use simulated quantum annealing (SQA) to numerically simulate quantum sampling from this system. We design a reinforcement learning algorithm in which the set of visible nodes representing the states and actions of an optimal policy are the first and last layers of the deep network. In the absence of a transverse field, our simulations show that DBMs are trained more effectively than restricted Boltzmann machines (RBM) with the same number of nodes. We then develop a framework for training the network as a quantum Boltzmann machine (QBM) in the presence of a significant transverse field for reinforcement learning. This method also outperforms the reinforcement learning method that uses RBMs.


1. Introduction

Recent theoretical extensions of the quantum adiabatic theorem [1, 2, 3, 4, 5] suggest the possibility of using quantum devices with manufactured spins [6, 7] as samplers of the instantaneous steady states of quantum systems. With this motivation, we consider reinforcement learning as the computational task of interest, and design a method of reinforcement learning consisting of sampling from a layout of quantum bits similar to that of a deep Boltzmann machine (DBM) (see Fig. 1b for a graphical representation). We use simulated quantum annealing (SQA) to demonstrate the advantage of reinforcement learning using deep Boltzmann machines and quantum Boltzmann machines over their classical counterpart, for small problem instances.

Reinforcement learning ([8], known also as neurodynamic programming [9]) is an area of optimal control theory at the intersection of approximate dynamic programming and machine learning. It has been used successfully for many applications, in fields such as engineering [10, 11], sociology [12, 13], and economics [14, 15].

It is important to differentiate between reinforcement learning and common streams of research in machine learning. For instance, in supervised learning, the learning is facilitated by training samples provided by a source external to the agent and the computer. In reinforcement learning, the training samples are provided only by the interaction of the agent itself with the environment. For example, in a motion planning problem in an uncharted territory, it is desired that the agent learns in the fastest possible way to navigate correctly, with the fewest blind decisions required to be made. This is known as the dilemma of exploration versus exploitation; that is, neither exploration nor exploitation can be pursued exclusively without facing a penalty or failing at the task. The goal is hence not only to design an algorithm that eventually converges to an optimal policy, but for it to be able to generate good policies early in the learning process. We refer the reader to [8, Ch. 1.1] for a thorough introduction to use cases and problem scenarios addressed by reinforcement learning.

The core idea in reinforcement learning is defining an operator on the Banach space of real-valued functions on the set of states of a system such that a fixed point of the operator carries information about an optimal policy of actions for a finite or infinite number of decision epochs. A numerical method for computing this fixed point is to explore this function space by travelling in a direction that minimizes the distance between two consecutive applications of the contraction mapping operator [9].

This optimization task, called learning in the context of reinforcement learning, can be performed by locally parametrizing the above function space using a set of auxiliary variables, and applying a gradient method to these variables. One approach for such a parametrization, due to [16], is to use the weights of a restricted Boltzmann machine (RBM) (see Fig. 1a) as the parameters, and the free energy of the RBM as an approximator for the elements in the function space. The descent direction is then calculated in terms of the expected values of the nodes of the RBM.
Figure 1. (a) The general RBM layout used in RBM-based reinforcement learning. The visible layer on the left consists of state and action nodes, and is connected to the hidden layer, forming a complete bipartite graph. (b) The general DBM layout used in DBM-based reinforcement learning. The visible nodes on the left represent states, and the visible nodes on the right represent actions. The training process captures the correlations between states and actions in the weights of the edges between the nodes.
Figure 2. (a) A 3×5 maze. W denotes a wall, R is a positive real number representing a reward, and P is a real number representing a penalty. (b) The previous maze with two additional stochastic rewards. (c) The set of all optimal actions for each cell of the maze in (a). An optimal traversal policy is any selection from among these actions. (d) Samples of conditional state transition probabilities for the windy maze problem with no obstacles (left) and in the presence of a wall (right).

It follows from the universal approximation theorem [17] that RBMs can approximate any joint distribution over binary variables [18, 19]. However, in the context of reinforcement learning, RBMs are not necessarily the best choice for approximating Q-functions relating to Markov decision processes because RBMs may require an exponential number of hidden variables with respect to the number of visible variables in order to approximate the desired joint distribution [18, 19]. On the other hand, DBMs have the potential to model higher-order dependencies than RBMs, and are more robust than deep belief networks [20].

One may, therefore, consider replacing the RBM with other graphical models and investigating the performance of the models in the learning process. Except in the case of RBMs, calculating statistical data from the nodes of a graphical model amounts to sampling from a Boltzmann distribution, creating a bottleneck in the learning procedure. Therefore, any improvement in the efficiency of Boltzmann distribution sampling is beneficial for reinforcement learning and machine learning in general.

As we explain in what follows, DBMs are good candidates for reinforcement learning tasks. Moreover, an important advantage of a DBM layout for a quantum annealing system is that the proximity and couplings of the qubits in the layout are similar to those of a sequence of bipartite blocks in D-Wave Systems’ devices [21], and it is therefore feasible that such layouts could be manufactured in the near future. In addition, embedding Boltzmann machines in larger quantum annealer architectures is problematic when excessively large weights and biases are needed to emulate logical nodes of the Boltzmann machine using chains and clusters of physical qubits. These are the reasons why, instead of attempting to embed a Boltzmann machine structure on an existing quantum annealing system as in [22, 23, 24, 25], we work under the assumption that the network itself is the native connectivity graph of a near-future quantum annealer, and, using numerical simulations, we attempt to understand its applicability to reinforcement learning.
We also refer the reader to current trends in machine learning using quantum circuits, specifically, [26] and [27] for reinforcement learning, and [28] and [29] for training quantum Boltzmann machines with applications in deep learning and tomography. To the best of our knowledge, the present paper complements the literature on quantum machine learning as the first proposal on reinforcement learning using adiabatic quantum computation.

Quantum Monte Carlo (QMC) numerical simulations have been found to be useful in simulating time-dependent quantum systems. Simulated quantum annealing (SQA) [30, 31], one of the many flavours of QMC methods, is based on the Suzuki–Trotter expansion of the path integral representation of the Hamiltonian of Ising spin models in the presence of a transverse field driver Hamiltonian. Even though the efficiency of SQA for finding the ground state of an Ising model is topologically obstructed [32], we consider the samples generated by SQA to be good approximations of the Boltzmann distribution of the quantum Hamiltonian [33]. Experimental studies have shown similarities in the behaviour of SQA and that of quantum annealing [34, 35] and its physical realization by D-Wave Systems [36, 37].

We expect that when SQA is set such that the final strength of the transverse field is negligible, the distribution of the samples approaches the classical limit one expects to observe in the absence of the transverse field. Another classical algorithm which can be used to obtain samples from the Boltzmann distribution is conventional simulated annealing (SA), which is based on thermal annealing. Note that this algorithm can be used to create Boltzmann distributions from the Ising spin model only in the absence of a transverse field. It should, therefore, be possible to use SA or SQA to approximate the Boltzmann distribution of a classical Boltzmann machine. However, unlike in the case of SA, it is possible to use SQA not only to approximate the Boltzmann distribution of a classical Boltzmann machine, but also that of a graphical model in which the energy operator is a quantum Hamiltonian in the presence of a transverse field. These graphical models, called quantum Boltzmann machines (QBM), were first introduced in [38].

We use SQA simulations to provide evidence that a quantum annealing device that approximates the distribution of a DBM or a QBM may improve the learning process compared to a reinforcement learning method that uses classical RBM techniques. Other studies have shown that SQA is more efficient than thermal SA [30, 31]. Therefore, our method, used in conjunction with SQA, can also be viewed as a quantum-inspired approach for reinforcement learning.

What distinguishes our work from current trends in quantum machine learning is that (i) we consider the use of quantum annealing in reinforcement learning applications rather than frequently studied classification or recognition problems; (ii) using SQA-based numerical simulations, we assume that the connectivity graph of a DBM directly maps to the native layout of a feasible quantum annealer; and (iii) the results of our experiments using SQA to simulate the sampling of an entangled system of spins suggest that using quantum annealers in reinforcement learning tasks can offer an advantage over thermal sampling.


2. Preliminaries

2.1. Adiabatic Evolution of Open Quantum Systems. The evolution of a quantum system under a slowly changing time-dependent Hamiltonian is characterized by the quantum adiabatic theorem (QAT). QAT has a long history going back to the work of Born and Fock [39]. Colloquially, QAT states that a system remains close to its instantaneous steady state, provided there is a gap between the eigenenergy of the steady state and the rest of the Hamiltonian’s spectrum at every point in time and the evolution is sufficiently slow. This result motivated [40] and [41] to introduce the closely related paradigms of quantum computing known as quantum annealing (QA) and adiabatic quantum computation (AQC).

QA and AQC, in turn, inspired efforts in the manufacturing of physical realizations of adiabatic evolution via quantum hardware ([6]). In reality, the manufactured chips operate at nonzero temperature and are not isolated from their environment. Therefore, the existing adiabatic theory did not describe the behaviour of these machines. A contemporary investigation in quantum adiabatic theory was thus initiated to study adiabaticity in open quantum systems ([1, 2, 3, 4, 5]). These references prove adiabatic theorems to various degrees of generality and under a variety of assumptions about the system.

In fact, [2] develops an adiabatic theory for equations of the form
$$\dot{x}(s) = \frac{1}{\epsilon}\, L(s)\, x(s),$$
where L is a family of linear operators on a Banach space and L(s) is the generator of a contraction semigroup for every s. This provides a general framework that encompasses many adiabatic theorems, including that of classical stochastic systems, all the way to quantum evolutions of open systems generated by Lindbladians. The manifold of instantaneous stationary states is identical to ker(L(s)), and [2] characterizes the dynamics of the system relative to this manifold.

In the work of [2], it was then proven that ψ(s) is parallel-transported along ker(L(s)), under assumptions on L that include the following:

(i) L is the generator of a contraction semigroup;

(ii) L has closed and complementary range and kernel.

This stream of research suggests promising opportunities to use quantum annealers to sample from the Gibbs state of a quantum Hamiltonian using adiabatic evolution. In this paper, the transverse field Ising model (TFIM) has been the centre of attention. In practice, due to additional complications in quantum annealing (e.g., level crossings and gap closure), the samples gathered from the quantum annealer are far from the Gibbs state of the final Hamiltonian. In fact, [42] suggests that the distribution of the samples would correspond more closely to an instantaneous Hamiltonian at an intermediate point in time, called the freeze-out point. Therefore, our goal is to investigate the applicability of sampling from a TFIM with a significant transverse field to free energy-based reinforcement learning.


2.2. Simulated Quantum Annealing. Simulated quantum annealing (SQA) methods are a class of quantum-inspired algorithms that perform discrete optimization by classically simulating quantum tunnelling phenomena (see [43, p. 422] for an introduction). The algorithm used in this paper is a single spin-flip version of quantum Monte Carlo numerical simulation based on the Suzuki–Trotter formula, and uses the Metropolis acceptance probabilities. The SQA algorithm simulates the quantum annealing phenomena of an Ising spin model with a transverse field, that is,
$$\mathcal{H} = -\sum_{\langle i, j \rangle} J_{ij}\, \sigma^z_i \sigma^z_j - \sum_{i} h_i\, \sigma^z_i - \Gamma \sum_{i} \sigma^x_i,$$

where σ^z and σ^x denote Pauli operators and Γ is the strength of the transverse field.


The initial value of the transverse field, Γ = 20.00, is empirically chosen to be well above the coupling strengths created during the training. Each spin is replicated 25 times to represent the Trotter slices in the extra dimension. The simulation is set to iterate over all replications of all spins one time per sweep, and the number of sweeps is set to 300, which appears to be large enough for the sizes of Ising models constructed during our experiments. For each instance of input, the SQA algorithm is run 150 times. After termination, the configuration of each replica, as well as the configuration of the entire classical Ising model of one dimension higher, is returned.

Although the SQA algorithm does not follow the dynamics of a physical quantum annealer explicitly, it is used to simulate this process, as it captures major quantum phenomena such as tunnelling and entanglement [34]. In [34], for example, it is shown that quantum Monte Carlo simulations can be used to understand the tunnelling behaviour in quantum annealers. As mentioned previously, it readily follows from the results of [33] that the limiting distribution of SQA is the Boltzmann distribution of the effective Hamiltonian. This makes SQA a candidate classical algorithm for sampling from Boltzmann distributions of classical and quantum Hamiltonians. The former is achieved by setting the final transverse field Γf to a negligible value, and the latter by constructing an effective Hamiltonian of the system of one dimension higher, representing the quantum Hamiltonian with non-negligible Γf. Alternatively, a classical Monte Carlo simulation used to sample from the Boltzmann distribution of the classical Ising Hamiltonian is the SA algorithm, based on thermal fluctuations of classical spin systems.
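As a rough illustration of the sampling routine described above, the following is a minimal single-spin-flip path-integral Monte Carlo sketch, not the authors' implementation: the dictionary-based interface, the inverse temperature `beta`, and the Metropolis acceptance rule are assumptions, while the number of replicas (25), the number of sweeps (300), and the transverse-field schedule (20.00 down to 0.01) mirror the values stated in the text.

```python
import math
import random

def sqa_sweeps(J, h, n_spins, beta=2.0, n_replicas=25,
               gamma_start=20.0, gamma_end=0.01, n_sweeps=300):
    """Single-spin-flip simulated quantum annealing (path-integral Monte Carlo).

    J : dict mapping a pair (i, j) to the coupling strength J_ij
    h : dict mapping a spin index i to the local field h_i
    Returns the final configuration of every Trotter replica.
    """
    spins = [[random.choice([-1, 1]) for _ in range(n_spins)]
             for _ in range(n_replicas)]
    neighbours = {i: [] for i in range(n_spins)}
    for (i, j), w in J.items():
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    for sweep in range(n_sweeps):
        # Transverse field decreasing linearly from gamma_start to gamma_end.
        gamma = gamma_start + (gamma_end - gamma_start) * sweep / (n_sweeps - 1)
        # Ferromagnetic coupling between adjacent Trotter slices of the same spin.
        j_perp = 0.5 / beta * math.log(1.0 / math.tanh(beta * gamma / n_replicas))
        for k in range(n_replicas):
            for i in range(n_spins):
                s = spins[k][i]
                # Intra-slice (classical) energy change of flipping spin i,
                # with couplings scaled down by the number of slices.
                d_classical = 2.0 * s * (
                    sum(w * spins[k][j] for j, w in neighbours[i]) + h.get(i, 0.0)
                ) / n_replicas
                # Inter-slice (quantum) energy change.
                up = spins[(k + 1) % n_replicas][i]
                down = spins[(k - 1) % n_replicas][i]
                d_quantum = 2.0 * s * j_perp * (up + down)
                d_total = d_classical + d_quantum
                # Metropolis acceptance.
                if d_total <= 0 or random.random() < math.exp(-beta * d_total):
                    spins[k][i] = -s
    return spins
```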
2.3.1. Maze Traversal as a Markov Decision Process. Maze traversal is a problem typically used to develop and benchmark reinforcement learning algorithms [46]. A maze is structured as a two-dimensional grid of r rows and c columns in which a decision-making agent is free to move up, down, left, or right, or to stand still. During the maze traversal, the agent encounters obstacles (e.g., walls), rewards (e.g., goals), and penalties (negative rewards, e.g., a pit). Each cell of the maze can contain either a deterministic or stochastic reward, a wall, a pit, or a neutral value. Fig. 2a and Fig. 2b show examples of two mazes. Fig. 2c shows the corresponding solutions to the maze in Fig. 2a.

The goal of the reinforcement learning algorithm in the maze traversal problem is for the agent to learn the optimal action to take in each cell of the maze by maximizing the total reward, that is, finding a route across the maze that avoids walls and pits while favouring rewards. This problem can be modelled as an MDP determined by the following components:

The state of the system is the agent’s position within the maze. The position state s takes values in the set of states
$$S = \{(i, j) : 1 \le i \le r,\ 1 \le j \le c\}.$$

The action a takes values in the set A of admissible moves: up, down, left, right, or standing still.

These actions will guide the agent through the maze. An action that would lead the agent into a wall (W) or outside of the maze boundary is treated as an inadmissible action. Each action can be viewed as an endomorphism on the set of states,
$$a : S \to S.$$


The transition kernel determines the probability of the agent moving from one state to another given a particular choice of action. In the simplest case, the probability of transition from s to a(s) is one:
$$\Pr\big(s' \mid s, a\big) = \begin{cases} 1 & \text{if } s' = a(s), \\ 0 & \text{otherwise.} \end{cases}$$

We call the maze clear if the associated transition kernel is as above, as opposed to the windy maze, in which there is a nonzero probability that, if the action a is taken at state s, the next state will differ from a(s).

The immediate reward r(s, a) that the agent gains from taking an action a in state s is the value contained in the destination state. Moving into a cell containing a reward returns the favourable value R, moving into a cell containing a penalty returns the unfavourable value P, and moving into a cell with no reward returns a neutral value in the interval (P, R).

A discount factor γ for future rewards is a non-negative constant smaller than 1. In our experiments, this discount factor is set to γ = 0.8. The discount factor is a feature of the problem rather than a free parameter of an implementation. For example, in a financial application scenario, the discount factor might be a function of the risk-free interest rate.

The immediate reward for moving into a cell with a stochastic reward is given by a random variable R. If an agent has prior knowledge of this distribution, then it should be able to treat the cell as one with a deterministic reward value of E[R]. This allows us to find the set of all optimal policies in each maze instance. This policy information is denoted by π* : S → 2^A, associating with each state s ∈ S a set of optimal actions π*(s) ⊆ A.

In our maze model, the neutral value is set to 100, the reward R = 200, and the penalty P = 0. In our experiments, the stochastic reward R is simulated by drawing a sample from the Bernoulli distribution 200 · Ber(0.5); hence, it has the expected value E[R] = 100, which is identical to the neutral value. Therefore, the solutions depicted in Fig. 2c are solutions to the maze of Fig. 2b as well.
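To make the maze MDP concrete, here is a minimal environment sketch under the conventions just described (neutral value 100, R = 200, P = 0, stochastic reward 200 · Ber(0.5)); the grid encoding, cell symbols, and class interface are illustrative assumptions, not the authors' code.

```python
import random

NEUTRAL, R_REWARD, P_PENALTY = 100, 200, 0

class Maze:
    """A grid maze with walls ('W'), a reward ('R'), a pit ('P'),
    stochastic rewards ('S'), and neutral cells ('.')."""

    def __init__(self, grid):
        self.grid = grid                      # list of strings, r rows by c columns
        self.rows, self.cols = len(grid), len(grid[0])
        self.actions = ['up', 'down', 'left', 'right', 'stay']
        self.moves = {'up': (-1, 0), 'down': (1, 0),
                      'left': (0, -1), 'right': (0, 1), 'stay': (0, 0)}

    def admissible(self, state, action):
        """An action leading into a wall or off the grid is inadmissible."""
        i, j = state
        di, dj = self.moves[action]
        ni, nj = i + di, j + dj
        return 0 <= ni < self.rows and 0 <= nj < self.cols and self.grid[ni][nj] != 'W'

    def step(self, state, action):
        """Deterministic ('clear') transition kernel: move to a(s) and collect
        the immediate reward of the destination cell."""
        if not self.admissible(state, action):
            return state, None
        i, j = state
        di, dj = self.moves[action]
        ni, nj = i + di, j + dj
        cell = self.grid[ni][nj]
        if cell == 'R':
            reward = R_REWARD
        elif cell == 'P':
            reward = P_PENALTY
        elif cell == 'S':
            reward = R_REWARD * random.randint(0, 1)   # 200 * Ber(0.5)
        else:
            reward = NEUTRAL
        return (ni, nj), reward

# A 3x5 maze in the spirit of Fig. 2a (the exact layout here is only schematic).
maze = Maze(['..W.R',
             '..WP.',
             '.....'])
```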

2.4. Value Iteration. Bellman [47] writes V(π, s) recursively in the following manner using the monotone convergence theorem:

$$V(\pi, s) = \mathbb{E}\big[r(s, \pi(s))\big] + \gamma \sum_{s' \in S} \Pr\big(s' \mid s, \pi(s)\big)\, V(\pi, s').$$

Value iteration methods based on this recursion depend heavily on the cardinality of both S and A, and suffer from the curse of dimensionality [47, 48]. Moreover, the value iteration method requires having full knowledge of the transition probabilities, as well as the distribution of the immediate rewards.
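For completeness, a minimal value iteration sketch over the finite maze MDP follows, reusing the `Maze` class sketched earlier and assuming the clear transition kernel and γ = 0.8; it illustrates the fixed-point computation rather than reproducing the paper's code.

```python
def value_iteration(maze, gamma=0.8, tol=1e-6):
    """Iterate V(s) <- max_a [ r(s, a) + gamma * V(a(s)) ] to a fixed point."""
    states = [(i, j) for i in range(maze.rows) for j in range(maze.cols)
              if maze.grid[i][j] != 'W']
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            values = []
            for a in maze.actions:
                if not maze.admissible(s, a):
                    continue
                s2, r = maze.step(s, a)
                if maze.grid[s2[0]][s2[1]] == 'S':
                    r = 100            # use the expected reward E[R] = 100
                values.append(r + gamma * V[s2])
            best = max(values)         # 'stay' is always admissible
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```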

2.5. Q-functions. For a stationary policy π, the Q-function (also known as the action–value function) is defined as a mapping of a pair (s, a) to the expected value of the reward of the Markov chain that begins with taking action a at initial state s and continuing according to π [8]:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r(s_k, a_k) \,\middle|\, s_0 = s,\ a_0 = a\right].$$

In what follows, we explain the case in which the parameters θ comprise the weights of a Boltzmann machine. Let us begin by introducing clamped Boltzmann machines, which are of particular importance in the case of reinforcement learning.
2.7. Clamped Boltzmann Machines. A classical Boltzmann machine is a type of stochastic neural network with two sets V and H of visible and hidden nodes, respectively. Both visible and hidden nodes represent binary random variables. We use the same notation for a node and the binary random variable it represents. The interactions between the variables represented by their respective nodes are specified by real-valued weighted edges of the underlying undirected graph. A GBM, as opposed to models such as RBMs and DBMs, allows weights between any two nodes.
The energy of the classical Boltzmann machine is
$$E(v, h) = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, h \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, h\, h',$$

where w^{vh} and w^{hh'} denote the weights between the visible and hidden, and between the hidden and hidden, nodes of the Boltzmann machine, respectively, and the energy is defined as a function of binary vectors v and h corresponding to the visible and hidden variables, respectively.

A clamped GBM is a neural network whose underlying graph is the subgraph obtained by removing the visible nodes, for which the effect of a fixed assignment v of the visible binary variables contributes as constant coefficients to the associated energy

$$E_v(h) = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, h \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, h\, h'.$$
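As a concrete reading of the clamped energy above, here is a minimal sketch; the dictionary-based weight layout and the encoding of the node values are assumptions made for illustration.

```python
def clamped_energy(h, v, w_vh, w_hh):
    """Energy of a clamped Boltzmann machine for one hidden configuration.

    v    : dict of fixed visible (state and action) node values
    h    : dict of hidden node values
    w_vh : dict mapping (visible index, hidden index) -> weight
    w_hh : dict mapping (hidden index, hidden index) -> weight
    """
    energy = -sum(w * v[i] * h[j] for (i, j), w in w_vh.items())
    energy -= sum(w * h[j] * h[k] for (j, k), w in w_hh.items())
    return energy
```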

A clamped quantum Boltzmann machine (QBM) has the same underlying graph as a clamped GBM, but instead of a binary random variable, a qubit is associated to each node of the network. The energy function is substituted by the quantum Hamiltonian

$$\mathcal{H}_{v} = -\sum_{v \in V,\ h \in H} w^{vh}\, v\, \sigma^{z}_{h} \;-\; \sum_{\{h, h'\} \subseteq H} w^{hh'}\, \sigma^{z}_{h} \sigma^{z}_{h'} \;-\; \Gamma \sum_{h \in H} \sigma^{x}_{h}.$$


2.8. Reinforcement Learning Using Clamped Boltzmann Machines. In this section, we explain how a general Boltzmann machine (GBM) can be used to provide a Q-function approximator in a Q-learning method. To the best of our knowledge, this derivation has not been previously given, although it can be readily derived from the ideas presented in [16] and [38]. Following [16], the goal is to use the negative free energy of a Boltzmann machine to approximate the Q-function through the relationship
$$Q(s, a) \approx -F(s, a),$$

where F(s, a) is the equilibrium free energy of the clamped Boltzmann machine obtained by fixing the assignments of state s and action a on the state nodes and action nodes, respectively, of the Boltzmann machine. In reinforcement learning, the visible nodes of the GBM are partitioned into two subsets of state nodes S and action nodes A.

The parameters θ, to be trained according to a TD(0) update rule (see Sec. 2.6), are the weights in a Boltzmann machine. For every weight w, the update rule is
$$\Delta w = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, \frac{\partial\, (-F(s_1, a_1))}{\partial w},$$

where ε is the learning rate, (s1, a1) is the current state–action pair, and (s2, a2) is the subsequent state–action pair.
Here, the expectation values appearing in the derivatives of the negative free energy are estimated by averaging over the spin configurations c of the classical Ising model of one dimension higher. The above argument holds in the absence of the transverse field, that is, for the classical Boltzmann machine. In this case, the TD(0) update rule is given by
$$\Delta w^{vh} = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, v\,\langle h \rangle,$$
$$\Delta w^{hh'} = \varepsilon\,\big(r(s_1, a_1) + \gamma\, Q(s_2, a_2) - Q(s_1, a_1)\big)\, \langle h\, h' \rangle,$$

where ⟨·⟩ denotes the expectation value under the Boltzmann distribution of the clamped Boltzmann machine.

The update rule for the weights of the RBM is (17), the visible–hidden update, alone. Moreover, in the case of RBMs, the equilibrium free energy F(s, a) and its derivatives with respect to the weights can be calculated without the need for Boltzmann distribution sampling, according to the closed formula
$$F(s, a) = -\sum_{h \in H} \Big(\sum_{s_i \in S} w^{s_i h} s_i + \sum_{a_j \in A} w^{a_j h} a_j\Big) \langle h \rangle \;+\; \sum_{h \in H} \Big(\langle h \rangle \log \langle h \rangle + (1 - \langle h \rangle) \log (1 - \langle h \rangle)\Big),$$
$$\langle h \rangle = \sigma\Big(\sum_{s_i \in S} w^{s_i h} s_i + \sum_{a_j \in A} w^{a_j h} a_j\Big).$$

Here, σ denotes the sigmoid function. Note that, in the general case, since the hidden nodes of a clamped Boltzmann machine are not independent, the calculation of the free energy is intractable.
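The closed-form evaluation available for RBMs can be sketched as follows; the array-based interface, the inverse temperature `beta`, and the concatenated state–action encoding are assumptions, and the TD(0) step mirrors the generic update described above rather than reproducing the paper's exact equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_negative_free_energy(v, W, beta=1.0):
    """Negative free energy of an RBM with visible units clamped to v.

    v : binary visible vector (concatenated state and action encoding)
    W : weight matrix of shape (n_visible, n_hidden)
    Returns -F(v) and the hidden activations <h>.
    """
    pre = beta * (v @ W)                   # effective field on each hidden unit
    h_mean = sigmoid(pre)                  # <h_j> = sigma(sum_i w_ij v_i)
    # Equivalent to the energy term plus the hidden-node entropy term:
    # -F = (1/beta) * sum_j log(1 + exp(beta * sum_i w_ij v_i)).
    return np.sum(np.logaddexp(0.0, pre)) / beta, h_mean

def td0_update(W, v1, v2, reward, lr=0.01, gamma=0.8):
    """One TD(0) step with Q(s, a) approximated by -F(s, a)."""
    q1, h1 = rbm_negative_free_energy(v1, W)
    q2, _ = rbm_negative_free_energy(v2, W)
    td_error = reward + gamma * q2 - q1
    # d(-F)/dw_ij = v_i * <h_j> for a clamped RBM.
    W += lr * td_error * np.outer(v1, h1)
    return W
```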

3. Algorithms

In this section, we present the details of classical reinforcement learning using RBM, a semi-classical approach based on a DBM (using SA and SQA), and a quantum reinforcement learning approach (using SQA or quantum annealing). All of the algorithms are based on the Q-learning TD(0) method presented in the previous section. Pseudo-code for these methods is provided in Algorithms 1, 2, and 3 below.

3.1. Reinforcement Learning Using RBMs. The RBM reinforcement learning algorithm is due to Sallans and Hinton [16]. This algorithm uses the update rule (17), with v representing the state or action encoding, to update the weights of an RBM, and (21) to calculate the expected values ⟨h⟩ of the random variables associated with the hidden nodes. As explained in Sec. 2.8, the main advantage of RBMs is that they admit explicit formulas for the hidden-node activations, given the values of the visible nodes. Moreover, only for RBMs can the entropy portion of the free energy (19) be written in terms of the activations of the hidden nodes. More-complicated network architectures do not possess this property, so there is a need for a Boltzmann distribution sampler.


In Algorithm 1, we recall the steps of the classical reinforcement learning algorithm using an RBM with a graphical model similar to that shown in Fig. 1a. We set the initial Boltzmann machine weights using Gaussian zero-mean values with a standard deviation of 1.00, as is common practice for implementing Boltzmann machines [50]. Consequently, this initializes an approximation of a Q-function and of the policy derived from it.
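A plausible way of reading a policy off the free-energy approximation of the Q-function is greedy selection over admissible actions, reusing the `Maze` and `rbm_negative_free_energy` sketches above; the `encode` helper, which maps a state–action pair to a binary visible vector, is hypothetical.

```python
def greedy_policy(maze, state, W, encode):
    """Pick the admissible action with the largest approximate Q(s, a) = -F(s, a)."""
    best_action, best_q = None, float('-inf')
    for action in maze.actions:
        if not maze.admissible(state, action):
            continue
        q, _ = rbm_negative_free_energy(encode(state, action), W)
        if q > best_q:
            best_action, best_q = action, q
    return best_action
```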

3.2. Reinforcement Learning Using DBMs. Since we are interested in the dependencies between states and actions, we consider a DBM architecture that has a layer of states connected to the first layer of hidden nodes, followed by multiple hidden layers, and a layer of actions connected to the final layer of hidden nodes (see Fig. 1). We demonstrate the advantages of this deep architecture trained using SQA and the derivation in Sec. 2.8 of the temporal-difference gradient method for reinforcement learning using general Boltzmann machines (GBM).

In Algorithm 2, we summarize the DBM-RL method. Here, the graphical model of the Boltzmann machine is similar to that shown in Fig. 1b. The initialization of the weights of the DBM is performed in a similar fashion to the previous algorithm.

According to lines 4 and 5 of Algorithm 2, the samples from the SA or SQA algorithm are used to approximate the free energy of the classical DBM at the points (s1, a1) and (s2, a2) using (19).

If SQA is used, averages are taken over each replica of each run; hence, there are 3750 samples of configurations of the hidden nodes for each state–action pair. The strength of the transverse field is scheduled to decrease linearly from Γ = 20.00 to Γf = 0.01.
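A rough sketch of how the clamped free energy could be estimated from sampled hidden-node configurations follows, as an average energy minus a temperature-scaled empirical-entropy term; the plug-in entropy estimator and the inverse temperature `beta` are assumptions about implementation detail, and `energy_fn` could be, for instance, a partial application of the `clamped_energy` sketch given earlier.

```python
import math
from collections import Counter

def estimate_free_energy(samples, energy_fn, beta=2.0):
    """Estimate F = <E> - (1/beta) * H from sampled hidden configurations.

    samples   : list of hashable hidden configurations (e.g., tuples of +/-1),
                such as the 3750 configurations per state-action pair
    energy_fn : callable returning the clamped energy of one configuration
    """
    n = len(samples)
    avg_energy = sum(energy_fn(c) for c in samples) / n
    counts = Counter(samples)
    entropy = -sum((m / n) * math.log(m / n) for m in counts.values())
    return avg_energy - entropy / beta
```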

3.3. Reinforcement Learning Using QBMs. The last algorithm is QBM-RL, presented in Algorithm 3. The initialization is performed as in Algorithms 1 and 2. However, according to lines 4 and 5, the samples from the SQA algorithm are used to approximate the free energy of a QBM at the points (s1, a1) and (s2, a2) by computing the free energy corresponding to an effective classical Ising spin model of one dimension higher representing the quantum Ising spin model of the QBM, via (16).
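To give a feel for the quantum case, the following sketch evaluates the energy of the effective classical Ising model of one dimension higher (the Suzuki–Trotter replication of the clamped QBM) for one sampled configuration; the replica coupling uses the standard w⁺ = (1/2β) log coth(βΓ/M) expression, and the interface and β value are assumptions.

```python
import math

def effective_energy(replicas, J, h_eff, gamma, beta=2.0):
    """Energy of one configuration of the replicated (d+1)-dimensional Ising model.

    replicas : list of M spin configurations (lists of +/-1), one per Trotter slice
    J        : dict of couplings between hidden nodes of the clamped model
    h_eff    : dict of effective local fields on the hidden nodes
               (contributions of the clamped visible nodes)
    gamma    : transverse-field strength
    """
    m = len(replicas)
    w_plus = 0.5 / beta * math.log(1.0 / math.tanh(beta * gamma / m))
    energy = 0.0
    for k, slice_k in enumerate(replicas):
        # Classical terms, scaled down by the number of Trotter slices.
        for (i, j), w in J.items():
            energy -= w * slice_k[i] * slice_k[j] / m
        for i, hi in h_eff.items():
            energy -= hi * slice_k[i] / m
        # Ferromagnetic coupling between neighbouring slices of the same spin.
        next_slice = replicas[(k + 1) % m]
        for i in range(len(slice_k)):
            energy -= w_plus * slice_k[i] * next_slice[i]
    return energy
```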

A subsequent state s2 is obtained from the state–action pair (s1, a1) using the transition kernel outlined in Sec. 2, and a corresponding action a2 is chosen via the policy. Another SQA sampling is performed for this pair in a similar fashion to the above.

In Fig. 3a and Fig. 3b, the selection of (s1, a1) is performed by sweeping across the set of state–action pairs. In Fig. 3d, the selection of (s1, a1) and s2 is performed by sweeping over S × A × S. In Fig. 3c, the selection of s1, a1, and s2 is performed uniformly at random.

We experiment with a variety of learning-rate schedules, including exponential, harmonic, and linear; however, we found that for the training of both RBMs and DBMs, an adaptive learning-rate schedule performed best (for information on adaptive subgradient methods, see [51]). In our experiments, the initial learning rate is set to 0.01.

In all of our studied algorithms, training terminates when a desired number of training samples have been processed, after which the updated policy is returned.

4. Numerical Results

We study the performance of temporal-difference reinforcement learning algorithms (explained in detail in Sec. 3) using Boltzmann machines. We generalize the method introduced in [16], and compare the policies obtained from these algorithms to the optimal policy using a fidelity measure, which we define in (25).
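Since the defining equation (25) is not reproduced in this copy, the snippet below shows one plausible reading of such a fidelity measure, namely the fraction of states at which the learned policy selects an optimal action; this is an assumption made for illustration, not necessarily the paper's exact definition.

```python
def fidelity(learned_policy, optimal_actions, states):
    """Fraction of states whose learned action lies in the optimal-action set."""
    hits = sum(1 for s in states if learned_policy(s) in optimal_actions[s])
    return hits / len(states)
```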

Fig. 3a and Fig. 3b show the fidelity of the generated policies obtained from various reinforcement learning experiments on two clear 3×5 mazes. In Fig. 3a, the maze includes one reward, one wall, and one pit, and in Fig. 3b, the maze additionally includes two stochastic rewards. In these experiments, the training samples are generated by sweeping over the maze. Each sweep iterates over the maze elements in the same order. This explains the periodic behaviour of the fidelity curves (cf. Fig. 3c).

The curves labelled ‘QBM-RL’ represent the fidelity of reinforcement learning using QBMs. Sampling from the QBM is performed using SQA. All other experiments use classical Boltzmann machines as their graphical model. In the experiment labelled ‘RBM-RL’, the graphical model is an RBM, trained classically using formula (21). The remaining curve is labelled ‘DBM-RL’ for classical reinforcement learning using a DBM. In these experiments, sampling from configurations of the DBM is performed with SQA (with Γf = 0.01). The fidelity results of DBM-RL coincide closely with those obtained by sampling configurations of the DBM using SA; therefore, we have not included them. Fig. 3c regenerates the results of Fig. 3a using uniform random sampling (i.e., without sweeping through the maze).

In Fig. 4, we report the effect of maze size on avℓ for RBM-RL, DBM-RL, and QBM-RL for varying maze sizes.

Figure 3. Comparison of RBM-RL, DBM-RL, and QBM-RL training results. Every underlying RBM has 16 hidden nodes and every DBM has two layers of eight hidden nodes. The shaded areas indicate the standard deviation of each training algorithm. (a) The fidelity curves for the three algorithms run on the maze in Fig. 2a. (b) The fidelity curves for the maze in Fig. 2b. (c) The fidelity curves of the three algorithms corresponding to the same experiment as that of (a), except that the training is performed using uniformly generated training samples rather than sweeping across the maze. (d) The fidelity curves corresponding to a windy maze similar to Fig. 2a.

We plot avℓ for each algorithm with ℓ = 500, 250, and 10 as a function of maze size, for a family of problems with one deterministic reward, two stochastic rewards, one pit, and n − 2 walls. We use nine n × 5 mazes in this experiment, indexed by various values of n. In addition to the avℓ plots, we include a dotted-line plot depicting the fidelity of a completely random policy. The fidelity of the random policy is given by the average probability of choosing an optimal action at each state when admissible actions are generated uniformly at random.

Note that the fidelity of the random policy increases as the maze size increases. This is due to the fact that maze rows containing a wall have, on average, more admissible optimal actions than the top and bottom rows of the maze.
Figure 4. A comparison between the performance of RBM-RL, DBM-RL, and QBM-RL as the size of the maze grows. All Boltzmann machines have 20 hidden nodes. (a) The schematics of an n × 5 maze with one deterministic reward, two stochastic rewards, one pit, and n − 2 walls. (b) The scaling of the average fidelity of each algorithm run on each instance of the n × 5 maze. The dotted line is the average fidelity of uniformly randomly generated actions.

5. Discussion

The fidelity curves in Fig. 3 show that DBM-RL outperforms RBM-RL with respect to the number of training samples. Therefore, we expect that in conjunction with a high-performance sampler of Boltzmann distributions (e.g., a quantum or a quantum-inspired oracle taken as such), DBM-RL improves the performance of reinforcement learning. QBM-RL is not only on par with DBM-RL, but actually slightly improves upon it by taking advantage of sampling in the presence of a significant transverse field.

This is a positive result for the potential of sampling from a quantum device in machine learning, as we do not expect quantum annealing to obtain the Boltzmann distribution of a classical Hamiltonian [42, 52, 53]. However, given the discussion in Sec. 2.1, a quantum annealer viewed as an open system coupled to a heat bath could be a better choice of sampler from its instantaneous Hamiltonian in earlier stages of the annealing process, compared to a sampler of the problem Hamiltonian at the end of the evolution. Therefore, these experiments address whether a quantum Boltzmann machine with a transverse field Ising Hamiltonian can perform at least as well as a classical Boltzmann machine.

In each experiment, the fidelity curves for DBM-RL produced using SQA with Γf = 0.01 match the ones produced using SA. This is consistent with our expectation that using SQA with Γ → 0 produces samples from the same distribution as SA, namely, the Boltzmann distribution of the classical Ising Hamiltonian with no transverse field.

The best algorithm in our experiments is evidently QBM-RL using SQA. Here, the final transverse field is Γf = 2.00, corresponding to one-third of the anneal for a quantum annealing algorithm that evolves along the convex linear combination of the initial and final Hamiltonians with constant speed. This is consistent with ideas found in [38] on sampling at freeze-out [42].

Fig. 3c shows that, whereas the maze can be solved with fewer training samples using ordered sweeps of the maze, the periodic behaviour of the fidelity curves is due to this periodic choice of training samples. This effect disappears once the training samples are chosen uniformly at random.

Fig. 3d shows that the improvement in the learning of the DBM-RL and QBM-RL algorithms persists in the case of more-complicated transition kernels. The same ordering of fidelity curves discussed earlier is observed: QBM-RL outperforms DBM-RL, and DBM-RL outperforms RBM-RL.

One can observe from Fig. 4 that, as the maze size increases and the complexity of the reinforcement learning task increases, avℓ decreases for each algorithm. The RBM algorithm, while always outperformed by DBM-RL and QBM-RL, shows a much faster decay in average fidelity as a function of maze size compared to both DBM-RL and QBM-RL. For larger mazes, the RBM algorithm fails to capture maze traversal knowledge, and approaches the avℓ of a random action allocation (the dotted line), whereas the DBM-RL and QBM-RL algorithms continue to be trained well. DBM-RL and QBM-RL are capable of training the agent to traverse larger mazes, whereas the RBM algorithm, utilizing the same number of hidden nodes and a larger number of weights, fails to converge to an output that is better than a random policy.
The runtime and computational resources needed to compare DBM-RL and QBM-RL with RBM-RL have not been investigated here. We expect that in view of [19], the size of RBM needed to solve larger maze problems will grow exponentially. Thus, it would be interesting to research the extrapolation of the asymptotic complexity and size of the DBM-RL and QBM-RL algorithms with the aim of attaining a quantum advantage. Applying the algorithms described in this paper to tasks that have larger state and action spaces, as well as to more-complicated environments, will allow us to demonstrate the scalability and usefulness of the DBM-RL and QBM-RL approaches. The experimental results shown in Fig. 4 represent only a rudimentary attempt to investigate this matter, yet the results are promising. However, this experiment does not provide a practical characterization of the scaling of our approach, and further investigation is needed.

Acknowledgements

We would like to thank Hamed Karimi, Helmut Katzgraber, Murray Thom, Matthias Troyer, and Ehsan Zahedinejad, as well as the referees and editorial board of Quantum Information and Computation, for reviewing this work and providing many helpful suggestions. The idea of using SQA to run experiments involving measurements with a nonzero transverse field was communicated in person by Mohammad Amin. We would also like to thank Marko Bucyk for editing this manuscript.

References

[1] M. S. Sarandy and D. A. Lidar, "Adiabatic approximation in open quantum systems," Phys. Rev. A, vol. 71, p. 012331, 2005.

[2] J. E. Avron, M. Fraas, G. M. Graf, and P. Grech, "Adiabatic theorems for generators of contracting evolutions," Commun. Math. Phys., vol. 314, no. 1, pp. 163–191, 2012.

[3] T. Albash, S. Boixo, D. A. Lidar, and P. Zanardi, "Quantum adiabatic Markovian master equations," New J. Phys., vol. 14, no. 12, p. 123016, 2012.

[4] S. Bachmann, W. De Roeck, and M. Fraas, "The Adiabatic Theorem for Many-Body Quantum Systems," arXiv:1612.01505, 2016.

[5] L. C. Venuti, T. Albash, D. A. Lidar, and P. Zanardi, "Adiabaticity in open quantum systems," Phys. Rev. A, vol. 93, p. 032118, 2016.

[6] M. W. Johnson, M. H. S. Amin, S. Gildert, T. Lanting, F. Hamze, N. Dickson, R. Harris, A. J. Berkley, J. Johansson, P. Bunyk, E. M. Chapple, C. Enderud, J. P. Hilton, K. Karimi, E. Ladizinsky, N. Ladizinsky, T. Oh, I. Perminov, C. Rich, M. C. Thom, E. Tolkacheva, C. J. S. Truncik, S. Uchaikin, J. Wang, B. Wilson, and G. Rose, "Quantum annealing with manufactured spins," Nature, vol. 473, pp. 194–198, 2011.

[7] J. Kelly, R. Barends, A. G. Fowler, A. Megrant, E. Jeffrey, T. C. White, D. Sank, J. Y. Mutus, B. Campbell, Y. Chen, Z. Chen, B. Chiaro, A. Dunsworth, I. C. Hoi, C. Neill, P. J. J. O'Malley, C. Quintana, P. Roushan, A. Vainsencher, J. Wenner, A. N. Cleland, and J. M. Martinis, "State preservation by repetitive error detection in a superconducting quantum circuit," Nature, vol. 519, pp. 66–69, 2015.

[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.

[9] D. Bertsekas and J. Tsitsiklis, Neuro-dynamic Programming. Athena Scientific, 1996.

[10] V. Derhami, E. Khodadadian, M. Ghasemzadeh, and A. M. Z. Bidoki, "Applying reinforcement learning for web pages ranking algorithms," Appl. Soft Comput., vol. 13, no. 4, pp. 1686–1692, 2013.

[11] S. Syafiie, F. Tadeo, and E. Martinez, "Model-free learning control of neutralization processes using reinforcement learning," Engineering Applications of Artificial Intelligence, vol. 20, no. 6, pp. 767–782, 2007.

[12] I. Erev and A. E. Roth, "Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria," Am. Econ. Rev., pp. 848–881, 1998.

[13] H. Shteingart and Y. Loewenstein, "Reinforcement learning and human behavior," Current Opinion in Neurobiology, vol. 25, pp. 93–98, 2014.

[14] T. Matsui, T. Goto, K. Izumi, and Y. Chen, "Compound reinforcement learning: theory and an application to finance," in European Workshop on Reinforcement Learning, pp. 321–332, Springer, 2011.

[15] Z. Sui, A. Gosavi, and L. Lin, "A reinforcement learning approach for inventory replenishment in vendor-managed inventory systems with consignment inventory," Engineering Management Journal, vol. 22, no. 4, pp. 44–53, 2010.

[16] B. Sallans and G. E. Hinton, "Reinforcement learning with factored states and actions," JMLR, vol. 5, pp. 1063–1088, 2004.

[17] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.

[18] J. Martens, A. Chattopadhya, T. Pitassi, and R. Zemel, "On the representational efficiency of restricted Boltzmann machines," in Advances in Neural Information Processing Systems, pp. 2877–2885, 2013.

[19] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631–1649, 2008.

[20] R. Salakhutdinov and G. E. Hinton, "Deep Boltzmann Machines," in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, pp. 448–455, 2009.

[21] R. Harris, M. W. Johnson, T. Lanting, A. J. Berkley, J. Johansson, P. Bunyk, E. Tolkacheva, E. Ladizinsky, N. Ladizinsky, T. Oh, F. Cioata, I. Perminov, P. Spear, C. Enderud, C. Rich, S. Uchaikin, M. C. Thom, E. M. Chapple, J. Wang, B. Wilson, M. H. S. Amin, N. Dickson, K. Karimi, B. Macready, C. J. S. Truncik, and G. Rose, "Experimental investigation of an eight-qubit unit cell in a superconducting optimization processor," Phys. Rev. B, vol. 82, p. 024511, 2010.

[22] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Quantum-assisted learning of graphical models with arbitrary pairwise connectivity," arXiv:1609.02542, 2016.

[23] S. H. Adachi and M. P. Henderson, "Application of Quantum Annealing to Training of Deep Neural Networks," arXiv:1510.06356, 2015.

[24] M. Denil and N. de Freitas, "Toward the implementation of a quantum RBM," in NIPS 2011 Deep Learning and Unsupervised Feature Learning Workshop, 2011.

[25] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Estimation of effective temperatures in quantum annealers for sampling applications: A case study with possible applications in deep learning," Phys. Rev. A, vol. 94, p. 022308, 2016.

[26] V. Dunjko, J. M. Taylor, and H. J. Briegel, "Quantum-enhanced machine learning," Phys. Rev. Lett., vol. 117, p. 130501, 2016.

[27] D. Dong, C. Chen, H. Li, and T. J. Tarn, "Quantum reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 38, pp. 1207–1220, 2008.

[28] N. Wiebe, A. Kapoor, and K. M. Svore, "Quantum deep learning," Quantum Inf. Comput., vol. 16, no. 7-8, pp. 541–587, 2016.

[29] M. Kieferova and N. Wiebe, "Tomography and Generative Data Modeling via Quantum Boltzmann Training," arXiv:1612.05204, 2016.

[30] E. Crosson and A. W. Harrow, "Simulated Quantum Annealing Can Be Exponentially Faster than Classical Simulated Annealing," arXiv:1601.03030, 2016.

[31] B. Heim, T. F. Rønnow, S. V. Isakov, and M. Troyer, "Quantum versus classical annealing of Ising spin glasses," Science, vol. 348, no. 6231, pp. 215–217, 2015.

[32] M. B. Hastings and M. H. Freedman, "Obstructions to classically simulating the quantum adiabatic algorithm," Quantum Information & Computation, vol. 13, pp. 1038–1076, 2013.

[33] S. Morita and H. Nishimori, "Convergence theorems for quantum annealing," J. Phys. A: Mathematical and General, vol. 39, no. 45, p. 13903, 2006.

[34] S. V. Isakov, G. Mazzola, V. N. Smelyanskiy, Z. Jiang, S. Boixo, H. Neven, and M. Troyer, "Understanding quantum tunneling through quantum Monte Carlo simulations," arXiv:1510.08057, 2015.

[35] T. Albash, T. F. Rønnow, M. Troyer, and D. A. Lidar, "Reexamining classical and quantum models for the D-Wave One processor," arXiv:1409.3827, 2014.

[36] L. T. Brady and W. van Dam, "Quantum Monte Carlo simulations of tunneling in quantum adiabatic optimization," Phys. Rev. A, vol. 93, no. 3, p. 032304, 2016.

[37] S. W. Shin, G. Smith, J. A. Smolin, and U. Vazirani, "How 'Quantum' is the D-Wave Machine?," arXiv:1401.7087, 2014.

[38] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, "Quantum Boltzmann machine," arXiv:1601.02036, 2016.

[39] M. Born and V. Fock, "Beweis des Adiabatensatzes," Zeitschrift für Physik, vol. 51, pp. 165–180, 1928.

[40] T. Kadowaki and H. Nishimori, "Quantum annealing in the transverse Ising model," Phys. Rev. E, vol. 58, pp. 5355–5363, 1998.

[41] E. Farhi, J. Goldstone, S. Gutmann, and M. Sipser, "Quantum Computation by Adiabatic Evolution," arXiv:quant-ph/0001106, 2000.

[42] M. H. Amin, "Searching for quantum speedup in quasistatic quantum annealers," Phys. Rev. A, vol. 92, no. 5, p. 052323, 2015.

[43] A. Brabazon, M. O'Neill, and S. McGarraghy, Natural Computing Algorithms. Springer-Verlag Berlin Heidelberg, 2015.

[44] R. Martonak, G. E. Santoro, and E. Tosatti, "Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model," Phys. Rev. B, vol. 66, no. 9, p. 094203, 2002.

[45] S. Yuksel, "Control of stochastic systems." Course lecture notes, Queen's University (Kingston, ON, Canada), retrieved in May 2016.

[46] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Proceedings of the Seventh International Conference on Machine Learning, pp. 216–224, Morgan Kaufmann, 1990.

[47] R. Bellman, "Dynamic programming and Lagrange multipliers," Proceedings of the National Academy of Sciences, vol. 42, no. 10, pp. 767–769, 1956.

[48] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[49] M. Suzuki, "Relationship between d-dimensional quantal spin systems and (d+1)-dimensional Ising systems: equivalence, critical exponents and systematic approximants of the partition function and spin correlations," Progr. Theor. Phys., vol. 56, no. 5, pp. 1454–1469, 1976.

[50] G. Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, p. 926, 2010.

[51] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," JMLR, vol. 12, pp. 2121–2159, 2011.

[52] Y. Matsuda, H. Nishimori, and H. G. Katzgraber, "Ground-state statistics from annealing algorithms: quantum versus classical approaches," New J. Phys., vol. 11, no. 7, p. 073021, 2009.

[53] L. C. Venuti, T. Albash, M. Marvian, D. Lidar, and P. Zanardi, "Relaxation versus adiabatic quantum steady-state preparation," Phys. Rev. A, vol. 95, p. 042302, 2017.

[54] E. Farhi and A. W. Harrow, "Quantum Supremacy through the Quantum Approximate Optimization Algorithm," arXiv:1602.07674, 2016.

[55] F. Abtahi and I. Fasel, "Deep belief nets as function approximators for reinforcement learning," Frontiers in Computational Neuroscience, 2011.

[56] S. Elfwing, E. Uchibe, and K. Doya, "Scaled free-energy based reinforcement learning for robust and efficient learning in high-dimensional state spaces," Value and Reward Based Learning in Neurobots, p. 30, 2015.

[57] R. Martonak, G. E. Santoro, and E. Tosatti, "Quantum annealing by the path-integral Monte Carlo method: The two-dimensional random Ising model," Phys. Rev. B, vol. 66, no. 9, p. 094203, 2002.

[58] M. Otsuka, J. Yoshimoto, and K. Doya, "Free-energy-based reinforcement learning in a partially observable environment," ESANN 2010 proceedings, European Symposium on Artificial Neural Networks – Computational Intelligence and Machine Learning, 2010.

[59] S. Boixo, S. V. Isakov, V. N. Smelyanskiy, R. Babbush, N. Ding, Z. Jiang, J. M. Martinis, and H. Neven, "Characterizing quantum supremacy in near-term devices," arXiv:1608.00263v2, 2016.

[60] J. Raymond, S. Yarkoni, and E. Andriyash, "Global warming: Temperature estimation in annealers," arXiv:1606.00919, 2016.

[61] P. M. Long and R. Servedio, "Restricted Boltzmann machines are hard to approximately evaluate or simulate," in Proceedings of the 27th International Conference on Machine Learning, pp. 703–710, 2010.

[62] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cogn. Sci., vol. 9, no. 1, pp. 147–169, 1985.

[63] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari, "Convergence results for single-step on-policy reinforcement-learning algorithms," Machine Learning, vol. 38, no. 3, pp. 287–308, 2000.

[64] N. Fremaux, H. Sprekeler, and W. Gerstner, "Reinforcement learning using a continuous time actor-critic framework with spiking neurons," PLoS Comput. Biol., vol. 9, no. 4, p. e1003024, 2013.

[65] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, 2016.

[66] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.

E-mail address, Daniel Crawford: [email protected]

E-mail address, Anna Levit: [email protected]

E-mail address, Navid Ghadermarzy: [email protected]

E-mail address, Jaspreet S. Oberoi: [email protected]

E-mail address, Pooya Ronagh: [email protected]

(Daniel Crawford, Anna Levit, Jaspreet S. Oberoi, Pooya Ronagh) 1QB Information Technologies (1QBit) (Navid Ghadermarzy) Department of Mathematics, University of British Columbia (Jaspreet S. Oberoi) School of Engineering Science, Simon Fraser University

(Pooya Ronagh) Institute for Quantum Computing and Department of Physics and Astronomy, Uni-versity of Waterloo
