Humans learn best from feedback; we are encouraged to take actions that lead to positive results and deterred by decisions with negative consequences. This reinforcement process can be applied to computer programs, allowing them to solve more complex problems than classical programming can.
Deep Reinforcement Learning in Action teaches you the fundamental concepts and terminology of deep reinforcement learning, along with the practical skills and techniques you’ll need to implement it in your own projects.
Key features
• Structuring problems as Markov Decision Processes
• Popular algorithms such as Deep Q-Networks, Policy Gradient methods, and Evolutionary Algorithms, and the intuitions that drive them
• Applying reinforcement learning algorithms to real-world problems
Audience
You’ll need intermediate Python skills and a basic understanding of deep learning.
Deep reinforcement learning is a form of machine learning in which AI agents learn optimal behavior from their own raw sensory input. The system perceives the environment, interprets the results of its past decisions, and uses this information to optimize its behavior for maximum long-term return. Deep reinforcement learning famously contributed to the success of AlphaGo, but that’s not all it can do!
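That perceive-act-learn loop is easy to see in code. The sketch below shows a minimal agent-environment loop, assuming the classic OpenAI Gym API (the same CartPole environment the book uses in chapter 4); the random placeholder policy is an illustrative assumption, not the book's method, and a real agent would replace it with a trained network.

```python
# Minimal sketch of the agent-environment loop, assuming the classic OpenAI Gym
# API (reset() returns an observation; step() returns obs, reward, done, info).
# Newer Gymnasium releases use a slightly different interface.
import gym

env = gym.make('CartPole-v0')

for episode in range(5):
    state = env.reset()          # perceive the initial environment state
    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()            # placeholder: act randomly
        state, reward, done, info = env.step(action)  # observe the consequence
        total_reward += reward   # accumulate return; a learning agent would use
                                 # this signal to improve its policy
    print(f"Episode {episode}: return = {total_reward}")
```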
Alexander Zai is a Machine Learning Engineer at Amazon AI, working on MXNet, which powers a suite of AWS machine learning products. Brandon Brown is a machine learning and data analysis blogger at outlace.com, committed to providing clear teaching on difficult topics for newcomers.
Part 1—Foundations
1 What is reinforcement learning?
1.1 The “deep” in deep reinforcement learning
1.2 Reinforcement learning
1.3 Dynamic programming versus Monte Carlo
1.4 The reinforcement learning framework
1.5 What can I do with reinforcement learning?
1.6 Why deep reinforcement learning?
1.7 Our didactic tool: String diagrams
1.8 What’s next?
Summary
2 Modeling reinforcement learning problems: Markov decision processes
2.1 String diagrams and our teaching methods
2.2 Solving the multi-arm bandit
2.2.1 Exploration and exploitation
2.2.2 Epsilon-greedy strategy
2.2.3 Softmax selection policy
2.3 Applying bandits to optimize ad placements
2.3.1 Contextual bandits
2.3.2 States, actions, rewards
2.4 Building networks with PyTorch
2.4.1 Automatic differentiation
2.4.2 Building models
2.5 Solving contextual bandits
2.6 The Markov property
2.7 Predicting future rewards: Value and policy functions
2.7.1 Policy functions
2.7.2 Optimal policy
2.7.3 Value functions
Summary
3 Predicting the best states and actions: Deep Q-networks
3.1 The Q function
3.2 Navigating with Q-learning
3.2.1 What is Q-learning?
3.2.2 Tackling Gridworld
3.2.3 Hyperparameters
3.2.4 Discount factor
3.2.5 Building the network
3.2.6 Introducing the Gridworld game engine
3.2.7 A neural network as the Q function
3.3 Preventing catastrophic forgetting: Experience replay
3.3.1 Catastrophic forgetting
3.3.2 Experience replay
3.4 Improving stability with a target network
3.4.1 Learning instability
3.5 Review
Summary
4 Learning to pick the best policy: Policy gradient methods
4.1 Policy function using neural networks
4.1.1 Neural network as the policy function
4.1.2 Stochastic policy gradient
4.1.3 Exploration
4.2 Reinforcing good actions: The policy gradient algorithm
4.2.1 Defining an objective
4.2.2 Action reinforcement
4.2.3 Log probability
4.2.4 Credit assignment
4.3 Working with OpenAI Gym
4.3.1 CartPole
4.3.2 The OpenAI Gym API
4.4 The REINFORCE algorithm
4.4.1 Creating the policy network
4.4.2 Having the agent interact with the environment
4.4.3 Training the model
4.4.4 The full training loop
4.4.5 Chapter conclusion
Summary
5 Tackling more complex problems with actor-critic methods
5.1 Combining the value and policy function
5.2 Distributed training
5.3 Advantage actor-critic
5.4 N-step actor-critic
Summary
Part 2—Above and beyond
6 Alternative optimization methods: Evolutionary algorithms
6.1 A different approach to reinforcement learning
6.2 Reinforcement learning with evolution strategies
6.2.1 Evolution in theory
6.2.2 Evolution in practice
6.3 A genetic algorithm for CartPole
6.4 Pros and cons of evolutionary algorithms
6.4.1 Evolutionary algorithms explore more
6.4.2 Evolutionary algorithms are incredibly sample intensive
6.4.3 Simulators
6.5 Evolutionary algorithms as a scalable alternative
6.5.1 Scaling evolutionary algorithms
6.5.2 Parallel vs. serial processing
6.5.3 Scaling efficiency
6.5.4 Communicating between nodes
6.5.5 Scaling linearly
6.5.6 Scaling gradient-based approaches
Summary
7 Distributional DQN: Getting the full story
7.1 What’s wrong with Q-learning?
7.2 Probability and statistics revisited
7.2.1 Priors and posteriors
7.2.2 Expectation and variance
7.3 The Bellman equation
7.3.1 The distributional Bellman equation
7.4 Distributional Q-learning
7.4.1 Representing a probability distribution in Python
7.4.2 Implementing the Dist-DQN
7.5 Comparing probability distributions
7.6 Dist-DQN on simulated data
7.7 Using distributional Q-learning to play Freeway
Summary
8 Curiosity-driven exploration
8.1 Tackling sparse rewards with predictive coding
8.2 Inverse dynamics prediction
8.3 Setting up Super Mario Bros.
8.4 Preprocessing and the Q-network
8.5 Setting up the Q-network and policy function
8.6 Intrinsic curiosity module
8.7 Alternative intrinsic reward mechanisms
Summary
9 Multi-agent reinforcement learning
9.1 From one to many agents
9.2 Neighborhood Q-learning
9.3 The 1D Ising model
9.4 Mean field Q-learning and the 2D Ising model
9.5 Mixed cooperative-competitive games
Summary
10 Interpretable reinforcement learning: Attention and relational models
10.1 Machine learning interpretability with attention and relational biases
10.1.1 Invariance and equivariance
10.2 Relational reasoning with attention
10.2.1 Attention models
10.2.2 Relational reasoning
10.2.3 Self-attention models
10.3 Implementing self-attention for MNIST
10.3.1 Transformed MNIST
10.3.2 The relational module
10.3.3 Tensor contractions and Einstein notation
10.3.4 Training the relational module
10.4 Multi-head attention and relational DQN
10.5 Double Q-learning
10.6 Training and attention visualization
10.6.1 Maximum entropy learning
10.6.2 Curriculum learning
10.6.3 Visualizing attention weights
Summary
11 In conclusion: A review and roadmap
11.1 What did we learn?
11.2 The uncharted topics in deep reinforcement learning
11.2.1 Prioritized experience replay
11.2.2 Proximal policy optimization (PPO)
11.2.3 Hierarchical reinforcement learning and the options framework
11.2.4 Model-based planning
11.2.5 Monte Carlo tree search (MCTS)
11.3 The end
Appendix—Mathematics, deep learning, PyTorch
A.1 Linear algebra
A.2 Calculus
A.3 Deep learning
A.4 PyTorch
Reference list
Index
Symbols
Numerics