A new student has joined the group and needs to be brought up to speed on RL, so we are starting from Silver's course.
For myself, I am adding the requirement of reading "Reinforcement Learning: An Introduction" carefully.
I did not read it very carefully before; this time I hope to do better and write a short summary of the corresponding knowledge points as I go.
When applying RL to real problems, the existing algorithms are on the whole good enough; the main work is designing a state/reward that captures the essence of the problem (the action space is usually clear-cut):
Of course, the particular states and actions vary greatly from task to task, and how they are represented can strongly affect performance. In reinforcement learning, as in other kinds of learning, such representational choices are at present more art than science.
State representation: it should satisfy the Markov property as far as possible; the current state should summarize the useful information in the history (in general it will be neither the immediate sensations alone, nor the complete history of all past sensations).
Action: task specific; the granularity must be appropriate.
Reward design: it must reflect our goal. Typically, an action that moves us toward the goal gets a positive reward, and an action we do not want to see can get a negative one (To encourage smooth movements, on each time step a small, negative reward can be given as a function of the moment-to-moment "jerkiness" of the motion).
Reward design must reflect the goal. Exercise 3.5 gives a good example: with a reward of +1 for escaping from the maze, a reward of zero at all other times, and an undiscounted return G_t, the return is exactly the same no matter how many time steps the escape takes. The agent then has no signal about what we actually want it to learn: it can wander around inside as long as it eventually gets out. For a maze task, each time step should instead carry a small negative reward, so that escaping sooner yields a larger return and the agent learns to leave the maze as quickly as possible. (Using a discount would also work, though the effect is probably less pronounced.)
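A toy comparison of the two reward schemes, with made-up episode lengths just to make the returns concrete:

    # +1 only on escape with no discount gives the same return whether the agent
    # escapes in 10 steps or in 1000; -1 per step makes the faster escape better.

    def undiscounted_return(rewards):
        return sum(rewards)

    # scheme 1: reward +1 on escaping, 0 at all other times
    fast = [0] * 9 + [1]        # escapes after 10 steps
    slow = [0] * 999 + [1]      # escapes after 1000 steps
    print(undiscounted_return(fast), undiscounted_return(slow))   # 1 1  -> no pressure to hurry

    # scheme 2: reward -1 on every step until escape
    fast = [-1] * 10
    slow = [-1] * 1000
    print(undiscounted_return(fast), undiscounted_return(slow))   # -10 -1000 -> faster escape wins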
Should reward design worry about the absolute size of the rewards, or only their relative size? Exercises 3.9/3.10 give a good example. For a continuing RL task, only the relative differences between rewards matter (it is fine if they are all negative): adding a constant c to all the rewards adds a constant V_c to the values of all states, and thus does not affect the relative values of any states under any policy, where V_c = c * Σ_{k=0}^{∞} γ^k = c / (1 − γ). For an episodic RL task, both the relative differences and the absolute sizes matter: here V_c = c * Σ_{k=0}^{K} γ^k depends on the number of steps K remaining in the episode, so states visited later can only accumulate a few of the γ^k terms.
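A small numeric check of the two cases (γ and c below are chosen arbitrarily for illustration):

    gamma, c = 0.9, 5.0

    # continuing task: adding c to every reward shifts every state's value by the
    # same constant V_c = c / (1 - gamma), so relative values are unchanged
    V_c_continuing = c * sum(gamma ** k for k in range(10_000))   # truncated infinite sum
    print(V_c_continuing, c / (1 - gamma))                        # both ~ 50.0

    # episodic task: the shift depends on how many steps K remain before the
    # episode ends, so states near termination gain much less than early states
    def V_c_episodic(K):
        return c * sum(gamma ** k for k in range(K + 1))

    print(V_c_episodic(50), V_c_episodic(2))   # ~49.8 vs 13.55 -> relative values change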
Reward hypothesis: the most fundamental starting point of RL.
What we want / what we mean by goals can be well thought of as the maximization of the expected cumulative sum of a scalar reward signal.
Maximization of the reward signal is one of the most distinctive features of RL.
It might at first appear limiting, but in practice it has proved to be flexible and widely applicable, for example, ....
The reward hypothesis is the most fundamental starting point of RL: the reward you design must make maximizing reward correspond to maximizing our goal! The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved (the reward should not encode how to reach the goal; telling the agent "how to do it" is really human heuristics sneaking in, whereas a goal-directed reward often lets the agent learn strategies humans would not think of, AlphaGo being a good example). For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game.
episodic tasks and continuing tasks
discounted return: G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}
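A small helper (my own sketch, not the book's code) that computes this return for a finite list of observed rewards:

    def discounted_return(rewards, gamma):
        """rewards = [R_{t+1}, R_{t+2}, ...]; returns G_t."""
        g = 0.0
        for r in reversed(rewards):   # backwards: G_t = R_{t+1} + gamma * G_{t+1}
            g = r + gamma * g
        return g

    print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71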
why discount
The undiscounted formulation is appropriate for episodic tasks (for a task that does terminate, all the later rewards are actually realized, so leaving them undiscounted seems more reasonable); the discounted formulation is appropriate for continuing tasks (without discounting, the infinite sum of rewards would generally not converge).
markov property
Even when the state signal is non-Markov, it is still appropriate to think of the state in reinforcement learning as an approximation to a Markov state. We conclude that the inability to have access to a perfect Markov state representation is probably not a severe problem for a reinforcement learning agent. (This matches experience: as long as the state representation is not too poor, things usually work out reasonably well.)
MDP
dynamics/model
transition graph: state nodes and action nodes
value function
Vπ(s) = Eπ[G_t | S_t = s]: the expected return depends on what actions the agent will take; accordingly, value functions are defined with respect to particular policies.
V*(s) = max_π Vπ(s)
Qπ(s,a), Q*(s,a)
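For reference, the standard definitions behind these symbols, written in LaTeX with the book's notation:

    Q_\pi(s,a) = E_\pi[\, G_t \mid S_t = s,\, A_t = a \,]
    Q_*(s,a)   = \max_\pi Q_\pi(s,a)
    V_*(s)     = \max_a Q_*(s,a)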
Bellman expectation equation for Vπ: models the relation between a state and its successor state; the expectation appears because several actions are possible, weighted by the policy π(a|s).
Bellman expectation equation for Qπ: models the relation between (state, action) and (next_state, next_action); the expectation appears because several next states are possible, weighted by the model/dynamics.
Bellman optimality equation: since we are computing the optimum, the expectation determined by the policy π(a|s) is replaced by a max over actions, while the expectation determined by the model/dynamics must stay. (The equations are written out below.)
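Written out in LaTeX with the book's notation, these are the equations the three points above describe:

    V_\pi(s)   = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a) \,[\, r + \gamma V_\pi(s') \,]
    Q_\pi(s,a) = \sum_{s',r} p(s',r \mid s,a) \,[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, Q_\pi(s',a') \,]
    V_*(s)     = \max_a \sum_{s',r} p(s',r \mid s,a) \,[\, r + \gamma V_*(s') \,]
    Q_*(s,a)   = \sum_{s',r} p(s',r \mid s,a) \,[\, r + \gamma \max_{a'} Q_*(s',a') \,]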
Computing π* from V*(s) or Q*(s,a) is easy.
V*/Q* is unique, but there may be more than one π*.
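A minimal sketch of both points, with a hypothetical tabular Q* stored in a dict: acting greedily with respect to Q* yields an optimal policy, and ties among maximizing actions are exactly where multiple optimal policies come from.

    q_star = {                                      # made-up values, 2 states x 2 actions
        ("s0", "left"): 1.0, ("s0", "right"): 3.0,
        ("s1", "left"): 2.0, ("s1", "right"): 2.0,  # tie in s1
    }
    actions = ("left", "right")

    def greedy_actions(q, state):
        best = max(q[(state, a)] for a in actions)
        return [a for a in actions if q[(state, a)] == best]

    print(greedy_actions(q_star, "s0"))   # ['right']
    print(greedy_actions(q_star, "s1"))   # ['left', 'right'] -> more than one optimal policy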
tabular and approximation
Solving the Bellman optimality equation faces three major difficulties: (1) the dynamics may be unknown; (2) there may not be enough computational resources; (3) the Markov property may not hold. Usually (1) and (2) are the core issues; as noted above, even without the Markov property the problem can often still be solved reasonably well. For board games, (2) is typically the bottleneck (not enough computation); for most other real-world problems it is (1), the unknown dynamics.
For problems with many states and actions, approximation is indispensable: approximately solving the Bellman optimality equation, using actual experienced transitions in place of knowledge of the expected transitions.
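A minimal sketch of what "actual experienced transitions in place of the expected transitions" looks like: a tabular Q-learning update. The environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the hyperparameters are assumptions made for illustration only, not something taken from the book.

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        q = defaultdict(float)                            # tabular Q(s, a), default 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy behaviour policy over the current estimates
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: q[(s, x)])
                s2, r, done = env.step(a)                 # one sampled transition
                # the sampled target r + gamma * max_a' Q(s', a') stands in for the
                # expectation over p(s', r | s, a) in the Bellman optimality equation
                target = r if done else r + gamma * max(q[(s2, x)] for x in actions)
                q[(s, a)] += alpha * (target - q[(s, a)])
                s = s2
        return q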
Monte Carlo methods
The value functions Vπ and Qπ can be estimated from experience using Monte Carlo methods (averaging over many random samples of actual returns).
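A minimal sketch of first-visit Monte Carlo estimation of Vπ; generate_episode() (returning a list of (state, reward) pairs collected by following π) is an assumed helper, not something defined in the book.

    from collections import defaultdict

    def mc_first_visit_v(generate_episode, gamma=1.0, n_episodes=1000):
        """Average the returns observed after the first visit to each state."""
        totals, counts = defaultdict(float), defaultdict(int)
        for _ in range(n_episodes):
            episode = generate_episode()         # [(S_0, R_1), (S_1, R_2), ...]
            # compute G_t for every t by scanning the episode backwards
            g, returns = 0.0, [0.0] * len(episode)
            for t in reversed(range(len(episode))):
                g = episode[t][1] + gamma * g
                returns[t] = g
            # first-visit: record the return only the first time a state appears
            seen = set()
            for t, (s, _) in enumerate(episode):
                if s not in seen:
                    seen.add(s)
                    totals[s] += returns[t]
                    counts[s] += 1
        return {s: totals[s] / counts[s] for s in totals}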