如莫

A Minimal Introduction to Deep Reinforcement Learning (Part 1): A Brief History of Reinforcement Learning

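[Abstract] This article introduces the origins, development, main schools, and applications of reinforcement learning. The theory and techniques of reinforcement learning were proposed and studied early on, and they belong to behaviorism, one of the three major schools of artificial intelligence. Reinforcement learning was at one time a mainstream of AI research. Over the past decade or so, with the rise of connectionism built on deep learning, reinforcement learning combined with deep learning has gained greatly in perception and representation ability, and it has matched or exceeded human-level performance on problems in certain domains. In Go, AlphaGo, based on reinforcement learning and Monte Carlo tree search, defeated the world's top professional players. In video games, an agent based on deep reinforcement learning exceeded the average human level in 29 Atari games. In real-time strategy games, AlphaStar and OpenAI Five reached the level of top human players in StarCraft II and Dota 2 respectively. In Texas hold'em, an AI player named Pluribus defeated 12 of the world's top professional players over 12 days of play.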

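1. A brief history of reinforcement learning
- 1.1 The law of effect in experimental animal psychology
- 1.2 Early explorations of trial-and-error learning in computing
- 1.3 Trial-and-error learning and optimal control
- 1.4 Classification of reinforcement learning
  - 1.4.1 Value-based reinforcement learning algorithms
  - 1.4.2 Policy-based reinforcement learning algorithms and actor-critic algorithms
2. What reinforcement learning can do
- 2.1 Using reinforcement learning for control
- 2.2 Using reinforcement learning to play board and card games
- 2.3 Using reinforcement learning to optimize logistics
- 2.4 Using reinforcement learning to control nuclear fusion reactions
- 2.5 Using reinforcement learning for routing protocols
- 2.6 Using reinforcement learning to control autonomous vehicles
References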

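1. A brief history of reinforcement learning

1.1 The law of effect in experimental animal psychology

Reinforcement learning is learning from feedback: an agent perceives its environment, acts according to the environment's state, receives feedback from the environment, and uses that feedback to adjust its policy. Researchers noticed this simple idea of learning from feedback long ago. In 1911 the psychologist Edward Thorndike summarized it as the "law of effect" [50]: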

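"Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond."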

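Alongside the law of effect there is the "law of exercise":

"The amount of practice affects how firm the connection between stimulus and response is. The more practice, the tighter the connection, the more clearly the chick knows what action to take, and the faster it escapes; the less practice, the looser the connection, and the harder it is for the chick to find the exit."

Later, in the field of animal learning, the term "reinforcement" came into use in Pavlov's research on conditioned reflexes. Here "reinforcement" also covers weakening, that is, the omission or termination of a stimulus event.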

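1.2 Early explorations of trial-and-error learning in computing

This idea of learning from feedback, also called trial-and-error learning, drew the attention of researchers in computing. In a 1948 report [51], Turing described a "pleasure-pain" machine that acts out the law of effect: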

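"When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent."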

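1.3 Trial-and-error learning and optimal control

Trial-and-error learning fits naturally with the idea of optimal control. The goal of optimal control is to maximize or minimize some measure of a dynamical system's behavior over time [53]. From the control point of view, the sequence of actions in trial-and-error learning is the result of sequential decision making. In 1957 Bellman introduced the discrete stochastic version of the optimal control problem and formalized it as a Markov decision process (MDP) [52]. In 1960 Ronald Howard developed the policy iteration method for MDPs [54], which laid a solid foundation for modern reinforcement learning theory and algorithms.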
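To make policy iteration concrete, here is a minimal sketch on a tiny MDP. The two-state, two-action MDP, its rewards, and the discount factor are invented for illustration and do not come from the cited works; the loop simply alternates policy evaluation and greedy policy improvement until the policy stops changing.

```python
# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(P, policy, gamma, tol=1e-10):
    """Iteratively compute the state values of a fixed deterministic policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def policy_iteration(P, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: 0 for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)
        new_policy = {
            s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P
        }
        if new_policy == policy:
            return policy, V
        policy = new_policy


policy, V = policy_iteration(P, GAMMA)
```

In this toy problem the stable policy chooses action 1 in both states. Note that the method needs the full transition table `P`, which is exactly the environment model that dynamic programming methods depend on.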

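1.4 Classification of reinforcement learning

Figure 1. Classification of reinforcement learning methods

There are three basic classes of methods for handling Markov decision processes: dynamic programming (DP), Monte Carlo (MC) methods, and temporal-difference (TD) learning. Dynamic programming has a rigorous, clear mathematical foundation and has been studied in depth, but it requires a complete and accurate model of the environment. Here the environment model, or simply the model, generally means the environment's state-transition function: how the environment moves to the next state after the agent executes an action in the current state. Monte Carlo and temporal-difference methods need no environment model, so they are model-free reinforcement learning methods. Given enough interaction data, Monte Carlo methods can accurately estimate the values of states and actions and thus converge to an effective policy, but they are hard to apply as step-by-step incremental updates. Temporal-difference methods drive the value-function update with the difference between the return estimates of successive states; they can update incrementally but are less stable. In addition, when a large number of high-quality samples of interaction with the environment are available, the policy can be learned directly by supervised learning, that is, imitation learning, for example behavior cloning. This article focuses on model-free reinforcement learning methods.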
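The difference between the Monte Carlo and temporal-difference updates can be sketched in a few lines. The three-state episode, the step size, and the function names below are assumptions made for illustration; the point is only that the MC update waits for the full return, while TD(0) updates after every step by bootstrapping from the next state's current estimate.

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte Carlo: usable only once the episode has ended.

    episode is a list of (state, reward received after leaving that state).
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G              # full return from this state onward
        V[state] += alpha * (G - V[state])


def td0_update(V, state, reward, next_state, done, alpha, gamma):
    """TD(0): incremental update from a single transition."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])


# Hypothetical episode A -> B -> C -> terminal with a reward of 1 at the end.
episode = [("A", 0.0), ("B", 0.0), ("C", 1.0)]

V_mc = {"A": 0.0, "B": 0.0, "C": 0.0}
mc_update(V_mc, episode, alpha=0.5, gamma=1.0)

V_td = {"A": 0.0, "B": 0.0, "C": 0.0}
states = [s for s, _ in episode]
for i, (state, reward) in enumerate(episode):
    done = i == len(episode) - 1
    next_state = None if done else states[i + 1]
    td0_update(V_td, state, reward, next_state, done, alpha=0.5, gamma=1.0)
```

After this single episode the MC estimate has moved every visited state toward the observed return, whereas TD(0) has only changed the state next to the reward; the information reaches earlier states over later episodes.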

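The agent's goal in reinforcement learning is to find an optimal policy that maximizes the expected cumulative reward. There are two workable ways to look for one. The first is to estimate accurately the expected return of every action in every state, after which the best action follows from a greedy policy; this is called value-based (value-function-based) reinforcement learning. The second is to optimize the policy function directly from the reward signal given by the environment; this is called policy-based (policy-function-based) reinforcement learning.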

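1.4.1 Value-based reinforcement learning algorithms

The core of value-based reinforcement learning is an accurate estimate of the value function over state-action pairs, and the temporal-difference idea plays the central role in these methods. The idea can be traced back to Minsky's 1954 doctoral thesis [57], which argued that expected return acts as a secondary reinforcer, able to produce a stimulus similar to that of a primary reinforcer (such as food or pain). In 1959 Arthur Samuel developed an algorithm containing the temporal-difference idea for his checkers program [58]. In the 1980s and 1990s, Sutton, Barto, and colleagues built a psychological model of classical conditioning on the temporal-difference idea [55][56], and developed the actor-critic architecture to solve the cart-pole balancing problem [59]. In 1988 Sutton separated temporal difference from control and studied it as a general prediction method [60]. In his 1989 doctoral thesis Watkins proposed the well-known Q-learning algorithm [61], which brought temporal difference and optimal control together and greatly advanced reinforcement learning research. Later, TD-Gammon [89], the temporal-difference backgammon program developed by Gerry Tesauro in 1992, was a great success and brought more attention to the field.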

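Value-based algorithms, with Q-learning as their representative, use a value table to store an estimate of the future return for each action in each state; when new interaction data arrive, the table is updated with the temporal-difference method. Algorithms based on such table updates are called tabular reinforcement learning algorithms. Tabular algorithms are very useful for theoretical proofs and for simple problems with discrete state and action spaces, but their drawback is equally clear: when the numbers of states and actions are huge, or the spaces are continuous, a table cannot meet the demands on memory or on the time needed to fill it. A workable approach, when facing a new state, is to generalize from previously experienced states and make a reasonable decision on that basis. But how similar is the new state to earlier ones, and how much should the decision draw on them? These questions have to be answered. In effect, this is the problem of extending experience from a limited subset of states to approximate a much larger set, known as the generalization problem. Generalization is usually handled with function approximation.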
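As an illustration of the tabular approach described above, the sketch below runs Q-learning on a hypothetical five-state chain. The environment, the hyperparameters, and the helper names are assumptions chosen for the example, not taken from Watkins's thesis; the table `Q` is updated with the temporal-difference target after every transition.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics; reward 1 only on reaching the last state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily (random ties)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]   # the value table
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # TD target bootstraps from the best action in the next state.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q


Q = q_learning()
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

The table here has only 5 x 2 entries. The same code becomes impractical once the state space is large or continuous, which is the limitation that motivates function approximation.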

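Reinforcement learning with function approximation first appeared in Samuel's checkers program [58][62]. Following Shannon's suggestion [63], Samuel approximated the value function underlying the playing policy with a linear combination of features. To express interactions between features, function approximators based on polynomial bases and Fourier bases [64] were gradually developed. Coarse coding [65] and tile coding [66][67][68][69][70] both compute the value function from overlapping features that cover the state space. The radial basis function (RBF) [71][72] is a natural extension of coarse coding and tile coding: a set of Gaussian functions serves as the features of a linear or nonlinear approximator. RBF networks are highly expressive, but they are computationally expensive and rather sensitive to parameter settings. Artificial neural networks (ANNs) were also widely used early on for nonlinear function approximation in reinforcement learning [73][74][75][76]. With the unprecedented progress of deep neural networks in recent years, the emergence of efficient ANN models is one of the main reasons modern reinforcement learning has achieved such impressive results.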
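A minimal sketch of linear function approximation over Gaussian RBF features follows. The centers, width, step size, and target value are arbitrary assumptions for illustration; the update is a plain gradient step on the squared error between the prediction and a given target.

```python
import math


def rbf_features(x, centers, width):
    """Gaussian radial basis features of a scalar state x."""
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]


def predict(weights, features):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def sgd_update(weights, features, target, alpha):
    """One gradient step that moves the prediction toward the target."""
    error = target - predict(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]


centers = [0.0, 0.5, 1.0]      # assumed feature centers
weights = [0.0, 0.0, 0.0]
phi = rbf_features(0.5, centers, width=0.25)
for _ in range(200):
    weights = sgd_update(weights, phi, target=2.0, alpha=0.1)
value = predict(weights, phi)
```

Because neighboring centers also respond to `x = 0.5`, updating this one state shifts the estimates of nearby states as well, which is the generalization effect that a table lacks.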

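Artificial neural networks have a long research history and wide application, and the deep neural networks (DNNs) [18][19] of the last decade or two have shown remarkable representational power. An ANN is a network of interconnected units that function somewhat like the neurons of humans and animals. The input-output behavior of these units is generally nonlinear: an activation function is applied on top of a linear combination of the input features, which gives the network a nonlinear representation of the input signal. An ANN can have several layers of units; a layer whose inputs are the outputs of other layers and whose outputs are the inputs of other layers is called a hidden layer. Cybenko proved in 1989 [77] that an ANN with one hidden layer containing enough sigmoidal units can approximate any continuous function to arbitrary precision on a compact region of the input space. Later theory and practice showed, however, that for many of the complex functions in AI tasks, a multi-layer network is the easier way to achieve the approximation [78]. ANNs therefore developed toward more layers, where high-level abstractions are hierarchical compositions of many lower-level ones; this is deep learning. Gradient backpropagation [79], together with methods that counter overfitting [80] and vanishing gradients [81], makes it possible to train deeper networks. Convolutional neural networks (CNNs) [18] and deep residual networks [20] in vision, and recurrent neural networks (RNNs) [83][85] and the Transformer architecture [85] in natural language processing, are all strong deep-learning perception models.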
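The unit structure described above (a linear combination followed by an activation function, stacked into a hidden layer and an output layer) can be written out as a small forward pass. The weights and the choice of `tanh` are arbitrary assumptions for illustration, and the sketch leaves out training entirely.

```python
import math


def dense(inputs, weights, biases, activation=None):
    """One layer: weighted sum per unit, optionally passed through an activation."""
    outputs = []
    for row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def mlp_forward(x, params):
    """Network with a single hidden layer and one linear output unit."""
    hidden = dense(x, params["W1"], params["b1"], activation=math.tanh)
    return dense(hidden, params["W2"], params["b2"])[0]


# Hand-picked weights, purely illustrative.
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[1.0, -1.0]],
    "b2": [0.5],
}
y = mlp_forward([1.0, 0.0], params)
```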

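In reinforcement learning, ANNs are used mainly to approximate the value function and the policy function. In 1982 Barto and colleagues used a two-layer neural network to learn a nonlinear control policy [86], and pointed out that the first layer serves to learn suitable feature representations automatically. In 1990 Barto, Sutton, Watkins, and others argued in a publication that ANNs can play an important role in function approximation for sequential decision problems [87]. Williams's REINFORCE algorithm [88], developed in 1992, uses an ANN as the agent's policy function and trains it with backpropagation. Tesauro's TD-Gammon program [89], developed in 1992, plays backgammon automatically and showed the great potential of ANN function approximation in reinforcement learning. The DQN (deep Q-network) method [31], proposed by Mnih et al. in 2015, fits the Q-value function with a deep neural network and stabilizes training with an experience replay buffer and a target network. DQN uses a convolutional neural network to process the raw frames of video games and reached the level of human players on Atari games. After DQN, combining reinforcement learning with deep learning became the new paradigm of reinforcement learning research, and a large number of strong deep reinforcement learning (DRL) algorithms emerged [90][91][92][93][94][95][96][97].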
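The two stabilizing devices mentioned for DQN, the experience replay buffer and the target network, can be sketched without any deep-learning library. In the sketch below `target_q` is a stand-in callable that plays the role of the frozen target network; it, the class name, and the toy transitions are assumptions for illustration rather than the published implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniform random minibatch, which breaks up temporal correlation."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def dqn_targets(batch, target_q, gamma):
    """Regression targets r + gamma * max_a Q_target(s', a) for a minibatch."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


# Stand-in for a frozen target network: returns Q-values for two actions.
def target_q(state):
    return [0.0, float(state)]


buffer = ReplayBuffer(capacity=3)
for t in range(4):                       # the first transition is evicted
    buffer.push(t, 1, 1.0, t + 1, False)

batch = [(0, 1, 1.0, 2, False), (2, 0, 0.5, 3, True)]
targets = dqn_targets(batch, target_q, gamma=0.9)
```

In a full agent the online network would be regressed toward these targets on sampled minibatches, and the target network's weights would be refreshed from the online network only occasionally.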

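1.4.2 Policy-based reinforcement learning algorithms and actor-critic algorithms

Although value-based deep reinforcement learning algorithms achieved breakthroughs in areas such as video games, they still struggle with tasks that have continuous action spaces. Policy-based methods suit continuous-action environments well, for example mechanical control problems.

Policy-based reinforcement learning learns a parameterized policy function directly; a value function can help in learning the policy but is not required. The earliest policy-based algorithm is probably Williams's REINFORCE [88], developed in 1992. REINFORCE is a Monte Carlo policy gradient algorithm: it uses Monte Carlo sampling to estimate the gradient of the expected long-term cumulative reward with respect to the policy parameters, and updates the parameters by gradient ascent. Each update in REINFORCE needs the total return of the whole task, so it can only happen after the task ends, and every decision in the episode is reinforced with the same return (the return of the whole episode). Later improvements consider only the return from the current step onward, which speeds up convergence and reduces the number of interactions with the environment. The return that REINFORCE uses is an unbiased estimate, but the algorithm has to wait until one task, or even many tasks, have finished before it can update the policy. Around 2000, Marbach, Tsitsiklis, Sutton, and others derived the policy gradient theorem [98][99]: the gradient of the expected return with respect to the policy parameters can be expressed in terms of the action-value function and the gradient of the policy, with REINFORCE as a special case. The policy gradient theorem unifies the value function and the policy function and gives rise to the actor-critic agent architecture, on which most of the strong deep reinforcement learning agents of recent years are built. Any algorithm in which the policy function is parameterized and updated by gradients is called a policy gradient algorithm. Since 2016 many strong policy gradient algorithms have been proposed [100][101][102][97]. Among them, proximal policy optimization (PPO) [102] constrains the policy update and thereby stabilizes the parameter updates of policy gradient methods. OpenAI Five [33], a multi-agent game AI built on PPO, reached the level of top human players in the multiplayer real-time strategy game Dota 2.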
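Two of the ideas above can be written down compactly: the "return from the current step onward" (often called reward-to-go) that replaces the whole-episode return, and the clipped surrogate objective used by PPO. The function names are assumptions, and the clip range of 0.2 is a commonly used setting taken here as an assumed default; the sketch computes the quantities only and does not differentiate or train a policy.

```python
def rewards_to_go(rewards, gamma):
    """Discounted return from each step to the end of the episode."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns


def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    ratios are new-policy / old-policy action probabilities per sample.
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


returns = rewards_to_go([1.0, 1.0, 1.0], gamma=1.0)
objective = ppo_clipped_objective(ratios=[1.5, 0.5], advantages=[1.0, -1.0])
```

With reward-to-go, each action is credited only with rewards that come after it. With clipping, a sample whose probability ratio has already moved far from 1 contributes no further incentive to move in the same direction, which is how the policy change is kept small.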

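2. What reinforcement learning can do

Reinforcement learning is an effective route to intelligent decision making. Any decision-related problem is a candidate for a reinforcement learning solution, so its range of application is wide. Only a few simple examples are given here.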

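2.1 Using reinforcement learning for control

Figure 2. Using reinforcement learning to control a simulated spacecraft landing

Figure 2 shows reinforcement learning controlling the landing of a simulated spacecraft. The agent (the spacecraft) observes the coordinates of the landing pad and its own coordinates; its task is to control the magnitude and direction of engine thrust so that it comes to rest in the designated area at a safe speed and attitude.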

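Figure 3. Using reinforcement learning to control a StarCraft game agent

Figure 3 shows a frame from a StarCraft game played by an agent controlled with reinforcement learning. The game contains many controllable units. If each unit is treated as a separate agent with its own observation and action spaces, the task becomes a multi-agent reinforcement learning problem, and issues such as reward assignment within the team, team cooperation, and exploration have to be considered.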

2.2 Using reinforcement learning to play board and card games

Figure 4. Go

Figure 5. Mahjong

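2.3 Using reinforcement learning to optimize logistics

Figure 6. Using reinforcement learning to optimize logistics transportation

Figure 6 shows the logistics transportation simulator on the Jidi (及第) platform of the Group Decision Intelligence Lab at the Institute of Automation, Chinese Academy of Sciences. It is a logistics transportation environment developed in-house by Jidi. Its transportation network consists of a set of nodes and a set of directed edges; the network in this environment has 10 nodes and 13 edges. Each node has a production quantity, a demand quantity, and an inventory level; when the inventory would exceed its upper bound, it is held at the bound. The control objective is to meet the demand of every node in the network while minimizing the cost of the whole logistics system.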

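2.4 Using reinforcement learning to control nuclear fusion reactions

Figure 7. Using reinforcement learning to control a nuclear fusion reaction

Figure 7 is an illustration from the paper on controlling nuclear fusion with reinforcement learning that DeepMind published in Nature in early 2022. The paper strongly challenged the common impression that reinforcement learning only works in games or simulated environments, and showed that it can also play a role in real industrial control.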

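2.5 Using reinforcement learning for routing protocols

Figure 8. Using reinforcement learning to optimize routing protocols

Littman and other pioneers began studying reinforcement learning for routing control many years ago, but representation and computing power were limited at the time and the results were modest. Now that deep learning has given reinforcement learning agents a "bigger brain", some researchers are once again trying reinforcement learning on the packet routing problem.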

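2.6 Using reinforcement learning to control autonomous vehicles

Figure 9. Using reinforcement learning to control an autonomous vehicle

References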

[18] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[C]. Proceedings of the IEEE. 1998.
[19] Goodfellow I, Bengio Y, Courville A. Deep learning[M]. MIT Press. 2016.
[20] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]. IEEE Conference of Computer Vision and Pattern Recognition. 2016: 770-778.
[21] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]. The 3rd International Conference on Learning Representations. 2015: 1–14.
[22] Girshick R. Fast R-CNN[C]. Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440–1448.
[23] Redmon J, Divvala S K, Girshick R B. You only look once: unified, real-time object detection[C]. IEEE Conference on Computer Vision and Pattern Recognition. 2016: 779-788.
[24] Mikolov T, Corrado G, Chen K, et al. Efficient estimation of word representations in vector space[C]. International Conference on Learning Representations. 2013: 1–12.
[25] Sutskever I, Vinyals O, V.Le Q. Sequence to sequence learning with neural networks[C]. Proceedings of the 27th International Conference on Neural Information Processing Systems. 2014, 2: 3104–3112.
[26] Devlin J, Chang M, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies. 2019, 1: 4171–4186.
[27] Howard J, Ruder S. Universal language model fine-tuning for text classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
[28] Purwins H, Li B, Virtanen T, et al. Deep learning for audio signal processing. Journal of Selected Topics of Signal Processing. 2019, 13(2): 206–219.
[29] Luo H, Zhang S, Lei M. Simplified self-attention for transformer-based end-to-end speech recognition[C]. Proceedings of IEEE Spoken Language Technology Workshop. 2021: 75–81.
[30] Silver D, Huang A, Maddison C J, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature. 2016, 529(7587): 484–489.
[31] Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature. 2015, 518(7540): 529–533.
[32] Vinyals O, Babuschkin I, Czarnecki W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature. 2019, 575(7782): 350–354.
[33] OpenAI Five: https://openai.com/five/.
[34] Brown N, Sandholm T. Superhuman AI for multiplayer poker[J]. Science. 2019, 365(6456): 885–890.
[35] Wolpert H D, Tumer K. Optimal payoff functions for members of collectives[J]. Modeling Complexity in Economic and Social Systems. 2002: 355–369.
[36] Agogino K A, Tumer K. Unifying temporal and structural credit assignment problems[C]. The Third International Joint Conference on Autonomous Agents and Multiagent Systems. IEEE Computer Society, 2004: 980–987.
[37] Thien D, Kumar A, and Lau C H. Credit assignment for collective multiagent RL with global rewards[C]. Advances in Neural Information Processing Systems. 2018: 8102–8113.
[38] Tan M. Multi-agent reinforcement learning: Independent vs. Cooperative agents[C]. The 10th international conference on machine learning. 1993: 330–337.
[39] Lauer M, Riedmiller M. Distributed reinforcement learning in multi-agent networks[C]. The 7th International Conference on Machine Learning. 2000.
[40] Ye D, Liu Z, Sun M, et al. Mastering Complex Control in MOBA Games with Deep Reinforcement Learning[C]. The 34th AAAI Conference on Artificial Intelligence. 2020: 6672-6679.
[41] RoboCup: http://robocup.drct-caa.org.cn/.
[42] 魏志鹏. RoboCup救援仿真中异构多智能体协作策略[D]. 南京: 南京邮电大学, 2016.
[43] Chen L, Qin S, Chen K, et al. Efficient role assignment with priority in Robocup3D[C]. Chinese Control and Decision Conference (CCDC). 2020: 2697-2702.
[44] 刘转. 基于多智能体系统的兵棋推演模型研究[D]. 武汉: 华中科技大学, 2016.
[45] 张瑶, 马亚辉. 体系对抗中的智能策略生成[J]. 中国信息化, 2018, 291(07):52-55.
[46] 强化学习在多智能体对抗中的应用研究[D]. 北京: 中国运载火箭技术研究院, 2019.
[47] 聂凯, 曾科军, 孟庆海,等. 人机对抗智能技术最新进展及军事应用[J]. 兵器装备工程学报, 2021, 42(6): 6-26.
[48] 李琛, 黄炎焱, 张永亮,等. Actor-Critic框架下的多智能体决策方法及其在兵棋上的应用[J]. 系统工程与电子技术. 2021, 43(3): 754-762.
[49] Busoniu L, Babuska R, Schutter B. A comprehensive survey of multiagent reinforcement learning[J]. IEEE Transactions on Systems, Man & Cybernetics (Part C Applications & Reviews). 2008, 38(2): 156-172.
[50] Thorndike E L. Animal intelligence[M]. Hafner, Darien, CT. 1911.
[51] Turing A M. Intelligent machinery[M]. Oxford University Press. 2004: 410-432.
[52] Bellman R E. A Markov decision process[J]. Journal of Mathematics and Mechanics. 1957: 679-684.
[53] Bellman R E. Dynamic programming[M]. Princeton University Press, Princeton, 1957.
[54] Howard Ronald. Dynamic programming and Markov process[M]. MIT Press, Cambridge, 1960.
[55] Sutton R S. Single channel theory: A neural theory of learning[J]. Brain Theory Newsletter. 1978(4): 72-75.
[56] Sutton R S. A unified theory of expectation in classical and instrumental conditioning[D]. California: Stanford University, 1978.
[57] Minsky M L. Theory of neural-analog reinforcement systems and its application to the brain-model problem[D]. New Jersey: Princeton University. 1954.
[58] Samuel A L. Some studies in machine learning using the game of checkers[J]. IBM Journal on Research and Development. 1959, 3(3): 210-229.
[59] Barto A G, Sutton R S, and Anderson C W. Neuronlike elements that can solve difficult learning control problem[J]. IEEE Transactions on System, Man, and Cybernetics, 1983, 13(5): 835-846.
[60] Sutton R S. Learning to predict by the method of temporal difference[J]. Machine Learning. 1988, 3(1): 9-44.
[61] Watkins C J C H. Learning from delayed rewards[D]. Cambridge: University of Cambridge, 1989.
[62] Samuel A L. Some studies in machine learning using the game of checkers II—Recent progress[J]. IBM Journal on Research and Development. 1967, 11(6): 601-617.
[63] Shannon C E. Programming a computer for playing chess[J]. Philosophical Magazine and Journal of Science. 1950, 41(314): 255-275.
[64] Konidaris G D, Osentoski S, and Thomas P S. Value function approximation in reinforcement learning using the Fourier basis[C]. The 25th Conference of the Association for the Advancement of Artificial Intelligence, 2011: 380-385.
[65] Waltz W G, Fu K S. A heuristic approach to reinforcement control systems[J]. IEEE Transactions on Automatic Control. 1965, 10(4): 390-398.
[66] Albus J S. A theory of cerebellar function[J]. Mathematical Biosciences. 1971, 10(1-2): 1307-1323.
[67] Albus J S. Brain, behavior, and robotics[M]. Byte Book, Peterborough, 1981.
[68] Shewchuk J, Dean T. Towards learning time-varying functions with high input dimensionality[C]. The 5th IEEE International Symposium on Intelligent Control. 1990: 383-388.
[69] Lin C S, Kim H. CMAC-based adaptive critic self-learning control[J]. IEEE Transactions on Neural Networks. 1991, 2(5): 661-670.
[70] Miller W T, Scalera S M, and Kim A. Neural network control of dynamic balance for a biped walking robot[C]. The 8th Yale Workshop on Adaptive and Learning Systems. 1994: 156-161.
[71] Powell M J D. Radial basis functions for multivariable interpolation: A review[M]. Algorithms for approximation. Clarendon Press, Oxford, 1987: 143-167.
[72] Poggio T, Girosi F. Regularization algorithms for learning that are equivalent to multilayer networks[J]. Science. 1990, 247(4945): 978-982.
[73] Riedmiller M. Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method[C]. The European Conference on Machine Learning. 2005: 317–328.
[74] Munos R, Szepesvari C. Finite-time bounds for fitted value iteration[J]. Journal of Machine Learning Research. 2008, 9(3): 815-857.
[75] Antos A, Munos R, Szepesvari C. Fitted Q-iteration in continuous action-space MDPs[J]. Advances in Neural Information Processing Systems.2008, 53(2): 556-564.
[76] Lange S, Riedmiller M. Deep Auto-Encoder Neural Networks in Reinforcement Learning[C]. International Joint Conference on Neural Networks. 2010.
[77] Cybenko G. Approximation by superpositions of a sigmoidal function[J]. Mathematics of Control, Signal and Systems. 1989, 2(4): 303-314.
[78] Bengio Y. Learning deep architectures for AI[J]. Foundations and Trends in Machine Learning. 2009, 2(1): 1-27.
[79] Rumelhart D E, Hinton G E, and Williams R J. Learning Representations by Back Propagating Errors[J]. Nature. 1986, 323(6088):533-536.
[80] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: A simple way to prevent neural networks from overfitting[J]. Journal of Machine Learning Research. 2014, 15(1): 1929-1958.
[81] Ioffe S, Szegedy C. Batch Normalization: Accelerating deep network training by reducing internal covariate shift[J]. JMLR.org. 2015.
[82] Cho K. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. Computer Science. 2014, https://arxiv.org/abs/1406.1078.
[83] Giles C L, Kuhn G M, and Williams R J. Dynamic recurrent neural networks: theory and applications [J] IEEE Transactions on Neural Networks. 1994, 5(2): 153-156.
[84] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation. 1997, 9(8):1735-1780.
[85] Vaswani A, Shazeer N, Parmar N. Attention Is All You Need[C]. Advances in Neural Information Processing Systems. 2017: 5999–6009.
[86] Barto A G, Anderson C W, and Sutton R S. Synthesis of nonlinear control surfaces by a layered associative search network[J]. Biological Cybernetics. 1982, 43(3): 175-185.
[87] Miller T, Sutton R S, and Werbos P J. Neural networks for control. Cambridge, MA, MIT Press, 1990: 5-58.
[88] Williams R J, Baird L C. A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming[C]. The 6th Yale Workshop on Adaptive and Learning Systems. 1990: 96-101.
[89] Tesauro G. TD-Gammon, a self-teaching backgammon program, achieves master-level play[J]. Neural Computation. 1994, 6(2): 215-219.
[90] Hasselt H V, Guez A, and Silver D. Deep reinforcement learning with double q-learning[C]. The 30th AAAI Conference on Artificial Intelligence, 2016.
[91] Wang Z, Schaul T, Hessel M, et al. Dueling network architectures for deep reinforcement learning[C]. The 33rd International Conference on Machine Learning, ICML. 2015: 2939-2947.
[92] Fortunato M, Azar M G, Piot B, et al. Noisy networks for exploration[C]. The 6th International Conference on Learning Representations, ICLR. 2017.
[93] Bellemare M G, Dabney W, and Munos R. A distributional perspective on reinforcement learning[C]. The 34th International Conference on Machine Learning. 2017: 449–458.
[94] Dabney W, Rowland M, Bellemare M G, et al. Distributional reinforcement learning with quantile regression[C]. The 32nd AAAI Conference on Artificial Intelligence, 2018.
[95] Hessel M, Modayil M, Hasselt H V, et al. Rainbow: Combining improvements in deep reinforcement learning[C]. In 32nd AAAI Conference on Artificial Intelligence, 2018.
[96] Fujimoto S, Hasselt H V, Meger D. Addressing function approximation error in actor-critic methods[C]. The 35th International Conference on Machine Learning, ICML. 2018: 2587–2601.
[97] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]. The 35th International Conference on Machine Learning, ICML 2018: 2976–2989.
[98] Marbach P. Tsitsiklis J N. Simulation-based optimization of Markov reward processes[R]. LIDS-P-2411. 1998.
[99] Sutton R S, McAllester D A, Singh S P, et al. Mansour. Policy gradient methods for reinforcement learning with function approximation[C]. Advances in Neural Information Processing Systems. 2000: 1057–1063.
[100] Mnih V, Badia A P, Mirza L, et al. Asynchronous methods for deep reinforcement learning[C]. The 33rd International Conference on Machine Learning, ICML. 2016: 2850–2869.
[101] Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization[C]. International Conference on Machine Learning, ICML. 2015: 1889–1897.
[102] Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint. arXiv:1707.06347, 2017.
[103] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT Press, Cambridge, 1998.
[104] Foster D J, Wilson M A. Reverse replay of behavioural sequences in hippocampal place cells during the awake state[J]. Nature. 2006, 440(7084): 680–683.
[105] Singer AC, Frank L M. Rewarded outcomes enhance reactivation of experience in the hippocampus[J]. Neuron. 2009, 64(6): 910–921.
[106] McNamara C G, et al. Dopaminergic neurons promote hippocampal reactivation and spatial memory persistence[J]. Nature neuroscience. 2014.
[107] White A, Modayil J, Sutton R S. Surprise and curiosity for big data robotics[C]. Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence. 2014.
[108] Schaul T, Quan J, Antonoglou I, et al. Prioritized experience replay[C]. The 4th International Conference on Learning Representations, ICLR. 2016.
[109] Atherton L A, Dupret D, Mellor J R. Memory trace replay: the shaping of memory consolidation by neuromodulation[J]. Trends in Neurosciences. 2015, 38(9):560–570.
[110] Olafsdottir H F, Barry C, Saleem A B, et al. Hippocampal place cells construct reward related sequences through unexplored space[J]. Elife. 2015, 4: e06063.
[111] 安波, 史忠植. 多智能体系统研究的历史、现状及挑战.[J] 中国计算机学会通讯. 2014, 10(9): 8-14.
[112] 丁明刚. 基于多智能体强化学习的足球机器人决策策略研究[D]. 合肥: 合肥工业大学, 2017.
[113] 梅乐. 基于多智能体的多机器人跟踪控制研究[D]. 合肥: 合肥工业大学, 2016.
[114] 段勇, 徐心和. 基于多智能体强化学习的多机器人协作策略研究[J]. 系统工程理论与实践. 2014, 34(5):1305-1310.
[115] 汪伟, 张效义, 胡赟鹏. 基于无线传感网的分布式协调调制识别算法[J]. 计算机应用研究. 2014, 31(5): 1524-604.
[116] 高岐. 基于语义的家居多智能体协作决策方法的研究与实现[D]. 重庆: 重庆邮电大学, 2019.
[117] 刘鑫, 于振中, 郑为凑, 等. 多机器人远程监控系统的多智能体控制结构[J]. 计算机工程, 2014(2).
[118] 许少伦, 严正, 冯冬涵,等. 基于多智能体的电动汽车充电协同控制策略[J]. 电力自动化设备, 2014, 034(011):7-13,21.
[119] 杨涛. 基于多智能体的区域交通信号控制系统研究[D]. 成都: 西华大学, 2016.
[120] 盖翔. 基于多智能体的能量路由器调度优化方法研究[D]. 沈阳: 东北大学, 2015.
[121] Yu C, Lan J, Guo Z, et al. DROM: Optimizing the Routing in Software-Defined Networks with Deep Reinforcement Learning. IEEE Access. 2018, 6(c): 64533–64539.
[122] 黄林. 基于深度强化学习的无线传感器网络调度与路由优化[D]. 武汉: 华中科技大学, 2019.
[123] Armbrust M, Fox A, Griffith R, et al. A review of cloud computing[J]. Communication of the ACM. 2010, 53(4): 211-225.
[124] Rong J, Qin T, An B, et al. Pricing optimization for selling reusable resources[C]. International Joint Conference on Autonomous Agents and Multi-Agent Systems. 2017.
[125] Panait L, Luke S. Cooperative multi-agent: the state of the art[J]. Journal of Autonomous Agents and Multi-Agent Systems. 2005, 11(3): 387-434.
[126] Huhns M N. Distributed artificial intelligence[M]. Pitman Publishing Ltd… London, England.
[127] Bound A H, Grasser. Readings in distributed artificial intelligence[M]. Morgan Kaufmann Publisher. San Mateo, CA.
[128] Vinyals O, Babuschkin I, Czarnecki W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature. 2019, 575(7782): 350–354
[129] Jaderberg M, Czarnecki W M, Dunning I. Human-level performance in 3D multiplayer games with population based reinforcement learning[J]. Science, 865(May): 859–865.
[130] 韩伟. 电子市场环境下的多智能体学习与协商[D]. 上海: 华东师范大学, 2006.
[131] 梁天新, 杨小平, 王良,等. 基于强化学习的金融交易系统研究与发展[J]. 软件学报, 2019, 030(003): 845-864.
[132] Littman M L. Markov games as a framework for multi-agent reinforcement learning[C]. Machine Learning Proceedings. 1994: 157-163.
[133] Kaelbling L P, Littman M L, Cassandra A R. Planning and acting in partially observable stochastic domains[J]. Artificial Intelligence. 1998: 99-134.
[134] Smith T, Simmons R. Heuristic search value iteration for POMDPs[C]. Proceedings of Uncertainty in Artificial Intelligence. 2004.
[135] Silver D, Veness J. Monte-Carlo planning in large POMDPs[C]. Advances in Neural Information Processing Systems. 2010.
[136] Bernstein D S, Givan R, Immerman N, et al. The complexity of decentralized control of Markov decision processes[J]. Mathematics of Operations Research. 2002, 27(4), 819-840.
[137] Nair R, Tambe M, Yokoo M, et al. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings[C]. International Joint Conference on Artificial Intelligence. 2003: 705–711.
[138] Nair R, Roth M, Yohoo M. Communication for improving policy computation in distributed POMDPs[C]. International Joint Conference on Autonomous Agents and Multi Agent Systems. 2004: 1098–1105.
[139] Hu J, Wellman M P. Multiagent reinforcement learning: theoretical framework and algorithm[C]. The International Conference on Machine Learning. 1999: 242-250.
[140] Hu J, Wellman M P. Nash Q-Learning for general-sum stochastic games[J]. The journal of Machine Learning Research. 2003, 4: 1039-1069.
[141] Heinrich J, Silver D. Deep reinforcement learning from self-play in imperfect-information games[C]. NIPS Deep Reinforcement Learning Workshop. 2016.
[142] Greenwald A, Hall K, Serrano R. Correlated Q-learning[C]. The International Conference on Machine Learning. 2003.
[143] Könönen, V. Asymmetric multiagent reinforcement learning[C]. IEEE/WIC International Conference on Intelligent Agent Technology. 2004: 336-342.
[144] Wang X, Sandholm T. Reinforcement learning to play an optimal Nash equilibrium in team Markov games[J]. Robotics & Autonomous, 2002: 1571-1578.
[145] Chalkiadakis G, Boutilier C. Sequential decision making in repeated coalition formation under uncertainty[C]. International Joint Conference on Autonomous Agents and Multiagent Systems. 2008: 342-349.
[146] Albrecht S V, Stone P. Autonomous Agents Modelling Other Agents: A Comprehensive Survey and Open Problems[J]. Artificial Intelligence. 2017, 258: 66-95.
[147] Hernandez-Leal P, Kartal B, Taylor M E. A survey and critique of multiagent deep reinforcement learning[J]. Autonomous Agents and Multi-Agent Systems. 2019, 33(6), 750–797.
[148] Tampuu A, Matiisen T, Kodelja D, et al. Multiagent cooperation and competition with deep reinforcement learning[C]. PLOS ONE. 2017, 12(4).
[149] Bansal T, Pachocki J, Sidor S, et al. Emergent Complexity via Multi-Agent Competition[C]. International Conference on Machine Learning, 2018.
[150] Gullapalli V, Barto A G. Shaping as a method for accelerating reinforcement learning[C]. IEEE international symposium on intelligent control. 1992: 554–559.
[151] Raghu M, Irpan A, Andreas J, et al. Can Deep Reinforcement Learning solve Erdos-Selfridge-Spencer Games?[C]. The 35th International Conference on Machine Learning, 2018.
[152] Spencer J. Randomization, derandomization and antirandomization: three games[J]. Theoretical Computer Science. 1994, 131(2): 415–429.
[153] Lazaridou A, Peysakhovich A, Baroni M. Multi-Agent Cooperation and the Emergence of (Natural) Language[J]. International Conference on Learning Representations, 2017.
[154] Fudenberg D, Tirole J. Game Theory[M]. MIT Press, 1991.
[155] Maaten L, Hinton G. Visualizing data using t-SNE[J]. Journal of machine learning research. 2008, 9: 2579–2605.
[156] Zahavy T, Ben-Zrihem N, Mannor S. Graying the black box: Understanding DQNs[C]. International Conference on Machine Learning. 2016:1899–1908.
[157] Mordatch I, Abbeel P. Emergence of grounded compositional language in multi-agent populations[C]. The 32nd AAAI Conference on Artificial Intelligence. 2018.
[158] Foerster J N, Assael Y M, Freitas, N, et al. Learning to communicate with deep multi-agent reinforcement learning[C]. Advances in Neural Information Processing Systems. 2016: 2145–2153.
[159] Kraemer L, Banerjee B. Multi-agent reinforcement learning as a rehearsal for decentralized planning[J]. Neurocomputing. 2016: 82–94.
[160] Lowe R, Wu Y, Tamar A. Multi-agent actor-critic for mixed cooperative-competitive environments[C]. Proceedings of Conference on Neural Information Processing Systems. 2017, 6379–6390.
[161] Sukhbaatar S, Szlam A, Fergus R. Learning multiagent communication with backpropagation[C]. Advances in Neural Information Processing Systems. 2016: 2252–2260.
[162] Peng P, Yuan Q, Wen Y, et al. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games[C]. Advances in Neural Information Processing Systems. 2017.
[163] Schuster M, Paliwal K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing. 1997, 45 (11): 2673–2681.
[164] He H, Boyd-Graber J, Kwok K. Opponent modeling in deep reinforcement learning[C]. The 33rd International Conference on Machine Learning. 2016: 2675–2684.
[165] Hong Z-W, Su S-Y, Shann T-Y, et al. A deep policy inference Q-network for multi-agent systems[C]. International Conference on Autonomous Agents and Multiagent Systems. 2018.
[166] Jaderberg M, Mnih V, Czarnecki W M. Reinforcement Learning with Unsupervised Auxiliary Tasks[C]. International Conference on Learning Representations. 2017.
[167] Sutton R S, Modayil J, Delp T. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction[C]. The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2. 2011: 761–768.
[168] Raileanu R, Denton E, Szlam A, et al. Modeling others using oneself in multi-agent reinforcement learning[C]. International Conference on Machine Learning. 2018.
[169] Li S, Wu Y, Cui X. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient[C]. AAAI Conference on Artificial Intelligence. 2019.
[170] Morimoto J, Doya L. Robust reinforcement learning[J]. Neural computation. 2005, 17 (2): 335–359.
[171] Pinto L, Davidson J, Sukthankar R. Robust adversarial reinforcement learning[C]. The 34th International Conference on Machine Learning. 2017: 2817–2826.
[172] Sunehag P, Lever G, Gruslys A, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward[C]. The 17th International Conference on Autonomous Agents and MultiAgent Systems, 2018: 2085–2087.
[173] Rashid T, Samvelyan M, Witt C S, et al. Qmix: Monotonic value function factorization for deep multi-agent reinforcement learning[C]. International Conference on Machine Learning. 2018: 4292–4301.
[174] Son K, Kim D, Kang W J, et al. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]. International Conference on Machine Learning. 2019: 5887–5896.
[175] Wang J, Ren Z, Liu T, et al. QPLEX: Duplex Dueling Multi-Agent Q-Learning[C]. International Conference on Learning Representations. 2021.
[176] Varshavskaya P, Kaelbling L P, Rus D. Efficient distributed reinforcement learning through agreement[C]. In Distributed Autonomous Robotic Systems. 2009: 367–378.
[177] Kar S, Moura J M F, Poor H V. QD-learning: A collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations[J]. IEEE Transactions on Signal Processing. 2013, 61(7):1848–1862.
[178] Kar S, Moura J M F, Poor H V. Distributed reinforcement learning in multi-agent networks. IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). 2013: 296–299.
[179] Zhang K-Q, Yang Z-R, Liu H, et al. Fully decentralized multi-agent reinforcement learning with networked agents[C]. The 35th International Conference on Machine Learning, 2018: 5872–5881.
[180] Macua S V, Chen J-S, Zazo S, et al. Distributed policy evaluation under multiple behavior strategies[J]. IEEE Transactions on Automatic Control. 2015, 60(5):1260–1274.
[181] Wai H-T, Yang Z-R, Wang Z-R, et al. Multi-agent reinforcement learning via double averaging primal-dual optimization[C]. In Advances in Neural Information Processing Systems. 2018: 9649–9660.
[182] Zhang K-Q, Yang Z-R, Basar T. Networked multi-agent reinforcement learning in continuous spaces[C]. In 2018 IEEE Conference on Decision and Control (CDC), 2018: 2771–2776.
[183] Suttle W, Yang Z-R, Zhang K-Q, et al. A multi-agent off-policy actor-critic algorithm for distributed reinforcement learning[J]. arXiv preprint. arXiv:1903.06372, 2019.
[184] Bertsekas D P, Tsitsiklis J. Neuro-dynamic programming[M]. Athena Scientific, Belmont, 1996.
[185] Puterman M L. Markov decision processes—Discrete stochastic dynamic programming[M]. John Wiley & Sons, Inc., New York, 1994.
[186] Ross S M. Stochastic processes, second edition[M]. John Wiley & Sons Inc. 1995.
[187] Wiering M A. Explorations in efficient reinforcement learning[D]. Amsterdam: Universiteit van Amsterdam, 1999.
[188] Dayan P. The convergence of TD(λ) for general λ. Machine Learning. 1992, 8(3):341–362.
[189] Degris T, White M, Sutton R S. Linear off-policy actor-critic[J]. The 29th International Conference on Machine Learning. 2019.
[190] Nair A, Srinivasan P, Blackwell S, et al. Massively parallel methods for deep reinforcement learning[J]. CoRR, vol. abs/1507.04296. 2015: 1–14.
[191] Clemente A V, Castejón H N, Chandra A. Efficient parallel methods for deep reinforcement learning[J]. CoRR, vol. abs/1705.04862. 2017: 1–9.
[192] Espeholt L, Soyer H, Munos R, et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures[C]. The 35th International Conference on Machine Learning. 2018, 4: 2263–2284.
[193] Luo M, Yao J, Liaw R, Liang E. IMPACT: Importance weighted asynchronous architectures with clipped target networks[C]. International Conference on Learning Representations. 2020: 1–14.
[194] Han S, Sung Y. Dimension-wise importance sampling weight clipping for sample-efficient reinforcement learning[C]. International Conference on Machine Learning. 2019: 4572–4584.
[195] Schulman J, Moritz P, Levine S. High-dimensional continuous control using generalized advantage estimation[C]. The 4th International Conference on Learning Representations. 2016: 1–14.
[196] Todorov E, Erez T, Tassa Y. MuJoCo: A physics engine for model-based control[C]. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2012: 5026–5033.
[197] Kingma D P, Ba J L. Adam: A method for stochastic optimization[C]. International Conference on Learning Representations. 2015: 1–15.
[198] Foerster J N, Farquhar G, Afouras T, et al. Counterfactual multi-agent policy gradients[C]. AAAI Conference on Artificial Intelligence. 2018.
[199] Wolpert D H, Tumer K. Optimal payoff functions for members of collectives. In Modeling Complexity in Economic and Social Systems[M]. World Scientific. 2002: 355–369.
[200] Konda V R, Tsitsiklis J N. Actor-critic algorithms[C]. Advances in Neural Information Processing Systems. 2000: 1008–1014.
[201] Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators[J]. Neural Networks. 1989, 2: 359–366.
[202] Cybenko G. Approximation by superpositions of a sigmoidal function[J]. Mathematics of Control, Signals, and Systems. 1989, 2: 303–314.
[203] Hoshen Y. VAIN: Attentional multi-agent predictive modeling[C]. Advances in Neural Information Processing Systems. 2017: 2701–2711.
[204] Jiang J, Lu Z. Learning attentional communication for multi-agent cooperation[C]. Advances in Neural Information Processing Systems. 2018: 7254–7264.
[205] Singh A, Jain S, Sukhbaatar S. Learning when to communicate at scale in multiagent cooperative and competitive tasks[C]. Proceedings of the International Conference on Learning Representations. 2019.
[206] Das A, Gervet T, Romoff J. TarMAC: Targeted multi-agent communication[C]. International Conference on Machine Learning. 2019: 1538–1546.
[207] Zhang C, Lesser V. Coordinating multi-agent reinforcement learning with limited communication[C]. Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems. 2013: 1101–1108.
[208] Tishby N, Pereira F C, Bialek W. The information bottleneck method[C]. The 37th Annual Allerton Conf. on Communication, Control, and Computing. 1999: 368–377.
[209] Alemi A A, Fischer I, Dillon J V, et al. Deep variational information bottleneck[C]. 5th International Conference on Learning Representations. 2017: 1–19.
[210] Torrieri D J. Statistical theory of passive location systems[J]. IEEE Transactions on Aerospace and Electronic Systems. 1984, 20(2): 183–198.
[211] Al-Jazzar S O, Jaradat Y. AOA-based drone localization using wireless sensor doublets[J]. Physical Communication. 2020(42).
[212] Ho K C, Lu X, Kovavisaruch L. Source localization using TDOA and FDOA measurements in the presence of receiver location errors: analysis and solution[J]. IEEE Transactions on Signal Processing. 2007, 55(2): 684–696.
[213] Ma F, Guo F, Yang L. Low-complexity TDOA and FDOA localization: A compromise between two-step and DPD methods[J]. Digital Signal Processing. 2020, 96.
[214] Amar A, Weiss A J. Direct position determination in the presence of model errors-known waveforms[J]. Digital Signal Processing. 2006, 16(1): 52–83.
[215] Weiss A J. Direct position determination of narrowband radio transmitters[C]. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2004, 2.
[216] Tirer T, Weiss A J. High resolution direct position determination of radio frequency sources[J]. IEEE Signal Processing Letters. 2016, 23(2): 192–196.
[217] Krzysztof B, Stefanski J. Bad geometry effect in the TDOA systems[J]. Polish Journal of Environmental Studies. 2007, 16: 11–13.
[218] Martin-Escalona I, Barcelo-Arroyo F. Impact of geometry on the accuracy of the passive-TDOA algorithm[C]. Proceedings of the IEEE International Symposium on Personal, Indoor and Mobile Radio Communications. 2008: 1–6.
[219] Sun B. Analysis of the influence of station placement on the position precision of passive area positioning system based on TDOA[J]. Fire Control & Command Control. 2011, 36: 129–132.
[220] Wang B, Xue L. Station arrangement strategy of TDOA location system based on genetic algorithm[J]. Systems Engineering and Electronics. 2009, 31: 2125–2128.
[221] Zhou G. Analysis of the influence of base station layout on location accuracy based on TDOA[J]. Command Control and Simulation, 2017, 39: 119–126.
[222] Shi Y, Eberhart R C. A modified particle swarm optimization[C]. Proceedings of the IEEE Congress on Evolutionary Computation. 1999: 69–73.
[223] Kennedy J, Eberhart R C. Swarm Intelligence[M]. San Mateo, CA, USA: Morgan Kaufmann, 2001.
[224] Tan M. Multi-agent reinforcement learning: Independent vs. cooperative agents[C]. Proceedings of the 10th International Conference on Machine Learning. 1993: 330–337.
[225] Matignon L, Laurent G, Fort-Piat N. Hysteretic Q-learning: An algorithm for decentralized reinforcement learning in cooperative multi-agent teams[C]. IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS’07. 2007: 64–69.
[226] Matignon L, Laurent G, Fort-Piat N. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems[J]. The Knowledge Engineering Review. 2012, 27(1): 1–31.
[227] Pennesi P, Paschalidis I. A distributed actor-critic algorithm and applications to mobile sensor network coordination problems[J]. IEEE Transactions on Automatic Control. 2010, 55(2):492–497.
[228] Prasad H, Prashanth L A, Bhatnagar S. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games[C]. Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. 2015, 3: 1371–1379.
[229] Nedich A, Olshevsky A, Shi W. A geometrically convergent method for distributed optimization over time-varying graphs[C]. IEEE 55th Conference on Decision and Control (CDC). 2016: 1023–1029.
[230] Xiao L, Boyd S, Lall S. A scheme for robust distributed sensor fusion based on average consensus[C]. Proceedings of the 4th International Symposium on Information Processing in Sensor Networks. 2005: 63–70.
[231] Borkar V S. Stochastic approximation: A dynamical systems viewpoint[M]. Hindustan Book Agency (India). 2008: 74–75.
[232] Fernández F, Veloso M. Probabilistic policy reuse in a reinforcement learning agent[C]. Proceedings of American Association for Artificial Intelligence. 2006.
[233] Wiering M, Otterlo M V. Reinforcement Learning: State-of-the-Art[M]. Springer, Berlin. 2012: 143–173.
[234] Bou Ammar H, Eaton E, Taylor M E. An automated measure of MDP similarity for transfer in reinforcement learning[C]. Proceedings of AAAI Workshop Technical Report. 2014, Vol. WS-14-07: 31–37.
[235] García J, Fernández F. Probabilistic policy reuse for safe reinforcement learning[J]. ACM Transactions on Autonomous and Adaptive Systems. 2019, 13(3): 14.1–14.24.
[236] Sunmola F T, Wyatt J L. Model transfer for Markov decision tasks via parameter matching[C]. Proceedings of the 25th Workshop of the UK Planning and Scheduling Special Interest Group. 2006.
[237] Brunskill E, Li LH. PAC-inspired Option Discovery in Lifelong Reinforcement Learning[C]. Proceedings of the 31st International Conference on Machine Learning. 2014, 32: 316–324.
[238] Fernández F, Veloso M. Learning domain structure through probabilistic policy reuse in reinforcement learning[J]. Progress in Artificial Intelligence. 2013: 13–27.
[239] Yang T, Hao J, Meng Z, et al. Efficient deep reinforcement learning via adaptive policy transfer[C]. Proceedings of International Joint Conference on Artificial Intelligence. 2020: 3094–3100.
[240] Sutton R S, Precup D, Singh S P. Intra-Option learning about temporally abstract actions[C]. Proceedings of International Conference on Machine Learning. 1998: 556–564.
[241] Li S, Gu F, Zhu G, Zhang C. Context-aware policy reuse[C]. Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS. 2019, 2: 989–997.
[242] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate[C]. Proceedings of International Conference on Learning Representations. 2015.
[243] Brown T, Mann B, Ryder N. Language models are few-shot learners[C]. Proceedings of Advances in Neural Information Processing Systems. 2020, 33: 1877–1901.
[244] Bello I, Zoph B, Le Q. Attention augmented convolutional networks[C]. Proceedings of the IEEE International Conference on Computer Vision. 2019: 3285–3294.
[245] Pan S J, Yang Q. A survey on transfer learning[J]. IEEE Transactions on Knowledge and Data Engineering. 2010, 22(10): 1345–1359.
