Summary of "Reinforcement Learning: An Introduction", Chapter 3: "Finite Markov Decision Processes"

A new student has joined our group and I need to help him get started with RL, so I chose to begin with Silver's course.

For myself, I am adding the requirement of carefully reading "Reinforcement Learning: An Introduction".

I did not read it very carefully before; this time I hope to be more thorough and also write a brief summary of the corresponding knowledge points.



When applying RL to real problems, the existing algorithms are on the whole good enough; the main work is designing a state and a reward that capture the essence of the problem (the action is usually fairly clear-cut):

Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science. 

State representation: it should satisfy the Markov property as far as possible; the current state should summarize the useful information in the history (in general it cannot be the immediate sensations, nor will it be the complete history of all past sensations);

Action: task specific; the granularity must be appropriate.

Reward design: it must reflect our goal. Typically, an action that advances our goal gets a positive reward, and an undesired action can get a negative one ("To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment 'jerkiness' of the motion");

Reward design must reflect the goal. Exercise 3.5 is a good example: with a reward of +1 for escaping from the maze, a reward of zero at all other times, and an undiscounted return G_t, the return is the same no matter how many time steps the escape takes. The agent then has no idea what you want it to learn: it can wander around inside as long as it eventually gets out. For a maze task like this, a negative penalty reward should instead be given at every time step, so that the agent learns to leave the maze as quickly as possible in order to obtain a larger return. (Using discounting would also work, though the effect is probably less pronounced.)
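To make this concrete, here is a minimal sketch (my own illustration, not from the book) that computes the return from the start of maze episodes of different lengths under the two reward schemes; the episode lengths and reward values are made up.

```python
def escape_return(steps, step_reward, escape_reward, gamma=1.0):
    """Return G_0 for an episode that escapes the maze on its `steps`-th step.

    `step_reward` is received on every step before escaping,
    `escape_reward` on the final (escaping) step."""
    g = 0.0
    for k in range(steps):
        r = escape_reward if k == steps - 1 else step_reward
        g += (gamma ** k) * r
    return g

for steps in (5, 50, 500):
    # Scheme A: +1 only on escape, undiscounted -> identical return for any episode length.
    g_a = escape_return(steps, step_reward=0.0, escape_reward=1.0)
    # Scheme B: -1 on every non-terminal step -> shorter escapes get strictly larger returns.
    g_b = escape_return(steps, step_reward=-1.0, escape_reward=0.0)
    print(f"{steps:3d} steps: +1-on-escape return = {g_a:+.1f}, -1-per-step return = {g_b:+.1f}")
```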

Should reward design worry about the absolute magnitudes of the rewards or only their relative differences? Exercises 3.9/3.10 give a good example. For a continuing RL task, only the relative differences between rewards matter (it is fine if they are all negative): adding a constant c to all the rewards adds a constant, v_c, to the values of all states, and thus does not affect the relative values of any states under any policies, where v_c = c * Σ_{k=0..∞} γ^k = c / (1 − γ). For an episodic RL task, both the relative differences and the absolute magnitudes matter: here v_c = c * Σ_{k=0..K} γ^k, and states visited later in an episode can only accumulate a few of the γ^k terms.
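A small numeric check of that argument (a sketch with made-up numbers, not from the book):

```python
# Shifting every reward by a constant c, with discount gamma.
gamma, c = 0.9, 2.0

# Continuing task: every state's value is shifted by the full geometric series
# c * sum_{k=0..inf} gamma^k = c / (1 - gamma), identical for all states,
# so relative values (and greedy behaviour) are unchanged.
v_c_continuing = c / (1 - gamma)

# Episodic task: a state with only K+1 reward terms left before termination
# is shifted by c * sum_{k=0..K} gamma^k, which depends on K,
# so the shift is NOT uniform across states.
def v_c_episodic(K, gamma=gamma, c=c):
    return c * sum(gamma ** k for k in range(K + 1))

print(v_c_continuing)                      # 20.0
print(v_c_episodic(1), v_c_episodic(10))   # 3.8 vs ~13.72
```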



Reward hypothesis: the most fundamental starting point of RL

    what we want/ what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal!

    maximization of the reward signal is one of the most distinctive features of RL

    might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ...

    The reward hypothesis is the most fundamental starting point of RL: the reward must be designed so that maximizing reward corresponds to maximizing our goal! The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved. (The reward should not encode how to reach the goal; telling the agent "how to do it" is essentially human heuristics getting in the way, whereas a goal-directed reward often lets the agent learn strategies a human would not think of; AlphaGo is a good example.) For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game.



episodic tasks and continuing tasks 



discounted return: G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
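The definition translates directly into code; a small sketch for a finite reward sequence (truncated after the last reward):

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    for a finite list of rewards (R_{t+1}, R_{t+2}, ...)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81*2 = 2.62
```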

    why discount

    The undiscounted formulation is appropriate for episodic tasks (for a task that does terminate, the subsequent rewards are all actually obtained, so leaving them undiscounted seems more reasonable); the discounted formulation is appropriate for continuing tasks.



markov property

    Even when the state signal is non-Markov, it is still appropriate to think of the state in reinforcement learning as an approximation to a Markov state. We conclude that the inability to have access to a perfect Markov state representation is probably not a severe problem for a reinforcement learning agent. (This matches experience: as long as the state representation is not too poor, things usually work reasonably well.)


MDP

    dynamics / model

    transition graph: state nodes and action nodes

    


value function

    Vπ(s) = Eπ[G_t | S_t = s]: the expected return depends on what actions the agent will take; accordingly, value functions are defined with respect to particular policies.

    V*(s) = max_π Vπ(s)

    Qπ(s,a) and Q*(s,a)

    Bellman expectation equation for Vπ: models the relationship between a state and the next state. An expectation appears because there are multiple possible actions, weighted by the policy π(a|s).

    Bellman expectation equation for Qπ: models the relationship between (state, action) and (next_state, next_action). An expectation appears because there are multiple possible next states, determined by the model/dynamics.

    Bellman optimality equation: because we are after optimality, the expectation that was determined by the policy π(a|s) is replaced by a max over actions, while the expectation determined by the model/dynamics must remain (the equations are written out below).
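For reference, the standard forms of these equations in the book's p(s', r | s, a) notation:

```latex
% Bellman expectation equations for a fixed policy \pi:
v_\pi(s)   = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_\pi(s') \bigr]
q_\pi(s,a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') q_\pi(s', a') \Bigr]

% Bellman optimality equations: the expectation over \pi becomes a max over actions,
% while the expectation over the dynamics p(s', r \mid s, a) stays.
v_*(s)   = \max_a \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma v_*(s') \bigr]
q_*(s,a) = \sum_{s', r} p(s', r \mid s, a) \Bigl[ r + \gamma \max_{a'} q_*(s', a') \Bigr]
```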

    Computing π* from V*(s) and Q*(s,a) is easy.

    V*/Q* is unique, but there may be more than one π* (see the sketch after this list).
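A minimal sketch (illustrative names and numbers, not from the book) of reading a greedy policy off Q*; ties in the argmax are exactly why π* need not be unique even though Q* is:

```python
import numpy as np

def greedy_policy_from_q(q_star):
    """q_star: array of shape (num_states, num_actions) holding Q*(s, a).
    Returns one deterministic optimal policy: pi*(s) = argmax_a Q*(s, a)."""
    return np.argmax(q_star, axis=1)

q_star = np.array([[1.0, 1.0],    # state 0: two actions tie, either choice is optimal
                   [0.2, 0.7]])   # state 1: action 1 is strictly better
print(greedy_policy_from_q(q_star))  # [0 1]; picking action 1 in state 0 would be equally optimal
```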




tabular and approximation

    Solving the Bellman optimality equation exactly faces three difficulties: (1) the dynamics may be unknown; (2) there may not be enough computational resources; (3) the state may not be Markov. In general, (1) and (2) are the core issues; as mentioned earlier, a non-Markov state can still be handled reasonably well. For board games it is usually (2), insufficient computation, that is the bottleneck; for most other real-world problems it is (1), unknown dynamics.

    For problems with many states and actions, approximation is indispensable: approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.



Monte Carlo methods

    The value functions Vπ and Qπ can be estimated from experience using Monte Carlo methods (averaging over many random samples of actual returns).
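A minimal first-visit Monte Carlo sketch (the trajectory format and names are my own assumptions: each episode is a list of (state, reward) pairs collected under a fixed policy):

```python
import numpy as np
from collections import defaultdict

def mc_state_values(episodes, gamma):
    """First-visit Monte Carlo estimate of V_pi.

    Each episode is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`. V_pi(s) is estimated by averaging
    the returns observed from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backwards so g always holds the discounted return from the current step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_return[state] = g  # keeps being overwritten until the earliest visit
        for state, g0 in first_visit_return.items():
            returns[state].append(g0)
    return {s: float(np.mean(gs)) for s, gs in returns.items()}

# Toy usage with two made-up trajectories under some fixed policy:
episodes = [[("A", 0.0), ("B", 1.0)],
            [("A", 0.0), ("B", 0.0), ("B", 1.0)]]
print(mc_state_values(episodes, gamma=0.9))  # V(A) = 0.855, V(B) = 0.95
```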







Below is the content from Silver's course, "Lecture 2: Markov Decision Processes", that I think one should know:
Slide 2: understand the differences among Markov Process / Markov Reward Process / Markov Decision Process
Markov Process: S / P
Markov Reward Process: S / P / R
Markov Decision Process: S / P / R / A
Slide 4: Markov property:
The future is independent of the past given the present
a state S_t is Markov iff P[S_{t+1} | S_t] = P[S_{t+1} | S_t, ..., S_1]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future
Slide 6: Markov process (Markov chain)
is a memoryless random process, 
i.e. a sequence of random states S1, S2, ... with the Markov property.
states S plus the transition probability matrix P between states
Slide 8: episode
Slide 10: Markov reward process (MRP):
a Markov process (Markov chain) with values.
states S + transition probabilities P[s'|s] + a reward function R (and the discount factor γ)
Slide 12: return G_t:
is the total discounted reward from time-step t
computed on one sampled episode
Slide 13: why discount:
mathematically convenient: finite returns / convergence
the present matters more than the future: the future is uncertain / humans prefer immediate reward
but an undiscounted MRP is also possible if all sequences terminate
Slide 14: value function:
the value function v(s) gives the long-term value of state s
it is the expected return G_t, i.e. an average over many sampled episodes
Slides 19/22/23: Bellman equation for MRPs
v = R + γPv   ==>   v = (I − γP)^(−1) R
Because we are computing the value function (the expected return G_t), all possible next states s' of the current state s are averaged over;
there is no nonlinear operation such as max/min, so it is a linear equation, unlike the nonlinear problem that arises when computing v*.
The direct solution is O(n^3); for large MRPs there are many iterative methods: DP / MC evaluation / TD learning (a small numpy sketch of the direct solve follows below).
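A minimal numpy sketch of the closed-form solve, on a made-up 3-state MRP (the numbers are illustrative, not the student MRP from the slides):

```python
import numpy as np

P = np.array([[0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])   # row-stochastic transition matrix, state 2 absorbing
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward in each state
gamma = 0.9

# v = R + gamma * P v  rearranges to  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```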
Slide 24: Markov decision process (MDP):
a Markov reward process with decisions. It is an environment in which all states are Markov.
states + actions + P[s'|s,a] + R (and γ)
Slide 28: state-value function and action-value function (for one specific policy π)
vπ(s) = Eπ[G_t | S_t = s]
qπ(s, a) = Eπ[G_t | S_t = s, A_t = a]
Slides 30-36: Bellman Expectation Equation (for one specific policy π) in an MDP
Because only expectations are computed, it is still a linear equation, unlike the nonlinear problem of the Bellman Optimality Equation.
Slide 37: optimal state-value function and optimal action-value function (of the optimal policy π*)
max over all π
Slide 41: finding an optimal policy from the optimal action-value function
There is always a deterministic optimal policy for any MDP
the optimal value function is unique, but the optimal policy need not be!
Slides 43-47: Bellman Optimality Equation (of the optimal policy π*) for an MDP
Because optimality requires a max operation, the equation is nonlinear, unlike the linear system of the Bellman Expectation Equation.
Slide 48: Solving the Bellman Optimality Equation
a nonlinear equation generally has no closed-form solution
such nonlinear problems are usually solved with iterative methods:
Value Iteration; Policy Iteration; Q-learning; Sarsa (a small value-iteration sketch follows below)
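As an illustration of the iterative approach, a minimal value-iteration sketch for a tabular MDP (the 2-state, 2-action MDP at the bottom is made up, not the student MDP from the slides):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P: array (A, S, S) with P[a, s, s'] = transition probability.
    R: array (A, S) with R[a, s] = expected immediate reward for action a in state s.
    Repeats the Bellman optimality backup
        v(s) <- max_a [ R[a, s] + gamma * sum_s' P[a, s, s'] * v(s') ]
    until convergence, then reads off a greedy policy."""
    num_actions, num_states, _ = P.shape
    v = np.zeros(num_states)
    while True:
        q = R + gamma * (P @ v)      # shape (A, S): action values under the current v
        v_new = q.max(axis=0)        # the max over actions is the nonlinear step
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=0)        # one deterministic optimal policy
    return v_new, policy

P = np.array([[[0.9, 0.1], [0.0, 1.0]],    # action 0
              [[0.2, 0.8], [0.5, 0.5]]])   # action 1
R = np.array([[1.0, 0.0],                  # R[a=0, s]
              [0.0, 2.0]])                 # R[a=1, s]
print(value_iteration(P, R, gamma=0.9))
```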

Also, make sure you can work out how the values in the student MRP/MDP diagrams are computed. One number on slide 39 is wrong; see whether you can find it. (The answer is in the comments.)

