Machine Learning (Tom M. Mitchell) Reading Notes, Part 14: Chapter 13

1. Introduction (about machine learning)


2. Concept Learning and the General-to-Specific Ordering

3. Decision Tree Learning

4. Artificial Neural Networks


5. Evaluating Hypotheses

6. Bayesian Learning

7. Computational Learning Theory


8. Instance-Based Learning

9. Genetic Algorithms

10. Learning Sets of Rules

11. Analytical Learning

12. Combining Inductive and Analytical Learning

13. Reinforcement Learning


13. Reinforcement Learning

Reinforcement learning addresses the question of how an autonomous agent that senses and acts in its environment can learn to choose optimal actions to achieve its goals. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. The task of the agent is to learn from this indirect, delayed reward to choose sequences of actions that produce the greatest cumulative reward. This chapter focuses on an algorithm called Q learning that can acquire optimal control strategies from delayed rewards, even when the agent has no prior knowledge of the effects of its actions on the environment. Reinforcement learning algorithms are related to dynamic programming algorithms frequently used to solve optimization problems.

13.1  INTRODUCTION

This general setting for robot learning is summarized in Figure 13.1.

[Figure 13.1]

The problem of learning a control policy to choose actions is similar in some respects to the function approximation problems discussed in other chapters. The target function to be learned in this case is a control policy, π : S -> A, that outputs an appropriate action a from the set A, given the current state s from the set S. However, this reinforcement learning problem differs from other function approximation tasks in several important respects:

Delayed reward: The trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.

Exploration: The learner faces a tradeoff in choosing whether to favor exploration of unknown states and actions (to gather new information) or exploitation of states and actions that it has already learned will yield high reward (to maximize its cumulative reward).

Partially observable states: In many practical situations sensors provide only partial information. For example, a robot with a forward-pointing camera cannot see what is behind it. In such cases, it may be necessary for the agent to consider its previous observations together with its current sensor data when choosing actions.

Life-long learning: Robot learning often requires that the robot learn several related tasks within the same environment, using the same sensors. This setting raises the possibility of using previously obtained experience or knowledge to reduce sample complexity when learning new tasks.
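
The exploration/exploitation tradeoff described above is often handled with a simple randomized rule. The sketch below uses ε-greedy selection, a common heuristic that is an assumption here rather than something taken from these notes. The function name `epsilon_greedy` and the dictionary layout of `q_values` are likewise hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick an action for the current state.

    q_values: dict mapping each available action to its estimated value.
    With probability epsilon a random action is tried (exploration);
    otherwise the highest-valued action is chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Hypothetical value estimates for one state.
estimates = {"left": 1.0, "right": 5.0}
print(epsilon_greedy(estimates, epsilon=0.1))
```

A larger `epsilon` gathers more information about untried actions at the cost of short-term reward, which is the tradeoff named in the text.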

13.2  THE LEARNING TASK 

Here we define one quite general formulation of the problem, based on Markov decision processes. This formulation follows the problem illustrated in Figure 13.1.

[Image 2]

[Image 3]
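
As a rough illustration of the quantity the agent tries to maximize, the sketch below computes a discounted cumulative reward r_0 + γ r_1 + γ² r_2 + ... for a finite reward sequence. It assumes the usual discounted formulation with a discount factor 0 ≤ γ < 1. The function name `discounted_return` and the sample rewards are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum gamma**i * r_i over a finite sequence of immediate rewards."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Two steps with no reward, then a reward of 100 on the third step.
print(discounted_return([0, 0, 100], gamma=0.9))  # approximately 81
```

Rewards that arrive later are weighted less, so a policy that reaches the same reward in fewer steps gets a higher value.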

An example:

[Image 4]

13.3  Q LEARNING

13.3.1 The Q Function 

[Image 5]

13.3.2  An Algorithm for Learning Q

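Since

V*(s) = max_a' Q(s, a')

equation (13.4) can be rewritten as

Q(s, a) = r(s, a) + γ max_a' Q(δ(s, a), a')

and further: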

[Image 6]

This gives the following algorithm:

[Image 7]
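
A minimal sketch of this kind of tabular Q learning for a deterministic world, assuming the training rule Q̂(s, a) ← r + γ max_a' Q̂(s', a'). The four-state chain world, the helper names `delta` and `reward`, and the purely random action choice are assumptions made for illustration, not the example from the book.

```python
import random

def q_learning(states, actions, delta, reward, goal,
               gamma=0.9, episodes=500, seed=0):
    """Tabular Q learning for a deterministic MDP with an absorbing goal state."""
    rng = random.Random(seed)
    # Initialize every table entry Q-hat(s, a) to zero.
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        s = rng.choice(states)        # start each episode in a random state
        while s != goal:
            a = rng.choice(actions)   # explore by picking actions at random
            r = reward(s, a)          # immediate reward
            s_next = delta(s, a)      # observed next state
            # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
            q[(s, a)] = r + gamma * max(q[(s_next, a2)] for a2 in actions)
            s = s_next
    return q

# Hypothetical chain world: states 0..3, where state 3 is the goal.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]
GOAL = 3

def delta(s, a):
    """Deterministic transition: move one step, staying inside the chain."""
    return max(s - 1, 0) if a == "left" else min(s + 1, GOAL)

def reward(s, a):
    """Reward 100 for entering the goal state, 0 otherwise."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

q = q_learning(STATES, ACTIONS, delta, reward, GOAL)
# Greedy policy: pi(s) = argmax_a Q-hat(s, a)
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES if s != GOAL}
print(policy)
```

With γ = 0.9 the learned values for moving right should approach 81, 90 and 100 in states 0, 1 and 2, so the greedy policy moves right everywhere.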

13.3.3  An Illustrative Example

[Image 8]

13.3.4  Convergence

[Image 9]

13.4  NONDETERMINISTIC REWARDS AND ACTIONS

[Image 10]

To summarize, we have simply redefined V and Q in the nondeterministic case to be the expected values of the quantities previously defined for the deterministic case.
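
For the nondeterministic case, the training rule is usually softened into a running average so that a single noisy sample does not overwrite the estimate. The sketch below assumes the revised rule Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_a' Q̂_{n-1}(s', a')] with α_n = 1 / (1 + visits_n(s, a)). The class name `NondeterministicQ` and its method names are hypothetical.

```python
from collections import defaultdict

class NondeterministicQ:
    """Q table updated with a decaying learning rate per (state, action) pair."""

    def __init__(self, actions, gamma=0.9):
        self.actions = list(actions)
        self.gamma = gamma
        self.q = defaultdict(float)     # Q-hat(s, a), zero by default
        self.visits = defaultdict(int)  # visit counts per (s, a)

    def update(self, s, a, r, s_next):
        """Blend the old estimate with the newly observed sample."""
        self.visits[(s, a)] += 1
        alpha = 1.0 / (1 + self.visits[(s, a)])
        target = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * target
        return self.q[(s, a)]

# Hypothetical usage: the same transition observed with two different rewards.
agent = NondeterministicQ(["a", "b"])
agent.update("s", "a", 10, "t")
agent.update("s", "a", 0, "t")
print(agent.q[("s", "a")])
```

Because α_n shrinks as a pair is visited more often, later samples move the estimate less and less.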

13.5  TEMPORAL DIFFERENCE LEARNING

[Image 11]
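
One way to picture the idea behind temporal difference methods is the n-step lookahead estimate, which sums n observed rewards before falling back on the current estimate Q̂. The sketch below assumes Q^(n)(s_t, a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a). The function name `n_step_estimate` and its argument layout are made up for illustration.

```python
def n_step_estimate(rewards, q_next_values, gamma=0.9):
    """n-step lookahead estimate of Q(s_t, a_t).

    rewards: the n observed rewards r_t, ..., r_{t+n-1}.
    q_next_values: current estimates Q-hat(s_{t+n}, a) for every action a.
    """
    n = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** n) * max(q_next_values)

# One-step and two-step estimates built from hypothetical values.
print(n_step_estimate([0], [100, 50]))
print(n_step_estimate([0, 0], [100]))
```

With a single reward this reduces to the one-step target used in ordinary Q learning.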

13.8  SUMMARY AND FURTHER READING 

[Image 12]

[Image 13]
