[update 20200712]
OpenAI's site is a very good reference: spinningup
Plan
Reinforcement Learning Overview
This overview is largely based on this article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.
[update 0710] After watching Hung-yi Lee's DRL lectures, I realized that the relationship between the replay buffer in TD-based Q-learning and the on/off-policy distinction is mainly a matter of data distribution. The tuples in the buffer are experiences rather than trajectories, so they have nothing to do with which policy is currently being trained. However, the distribution of tuples in the whole buffer differs from the distribution you would get by collecting data with the current policy, and sampling from the replay buffer is usually uniform, so if the replay buffer has to be classified at all, it corresponds to off-policy. For MC-style trajectories, using trajectories generated by pi to train pi' falls even more clearly into off-policy territory.
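To make the distribution argument concrete, here is a minimal replay-buffer sketch (my own illustration, not taken from any particular library): it stores single (s, a, r, s', done) tuples and samples them uniformly, so a batch mixes data from many historical policies rather than following the current policy's distribution.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores individual (s, a, r, s', done) transitions, not whole trajectories."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def push(self, state, action, reward, next_state, done):
        # The tuple is stored regardless of which policy produced it.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: the batch mixes transitions from many past policies,
        # so its distribution differs from data collected by the current policy.
        return random.sample(self.buffer, batch_size)
```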
[source:https://www.quora.com/Why-is-Q-Learning-deemed-to-be-off-policy-learning]
The main criterion: when updating Q, is the policy that Q evaluates the same as the policy currently interacting with the environment? In Sarsa it is; in Q-learning, the update of Q essentially evaluates the greedy policy rather than the current behavior policy. The question is whether the a' in Q(s',a') is produced by the current actor from s', or is some approximation like the max function in Q-learning. When a replay buffer is used, or when a' is generated by a target actor, the method is called off-policy; otherwise it is on-policy. This is my current understanding after browsing a lot of material. [update 0413] See the figure below; I now have a new understanding: remember that the Q-function is based on TD, so consecutive actions have a temporal ordering. In other words, when training Q, we need to know which Q(?) differs from the current Q by exactly one reward r. If the ? here matches the action the current policy would output, then we are training Q to follow the current policy, so it is on-policy. Otherwise, as in Q-learning, the current policy makes an epsilon-greedy choice while the training of Q assumes the next step is totally greedy, so Q does not match the current policy, hence off-policy.
When a replay buffer is involved, the a' it provides is the action of some historical policy, which differs from the action the current actor would return, so Q is not being trained into an evaluation function of the current policy, hence off-policy.
Summary: whether a method is on-policy or off-policy depends on whether, when training Q, the a' in Q(s',a') matches the action suggested by the current actor. In other words: are we training Q to be the evaluation function of the current policy?
The next question: how should we choose between the two? The answer below is worth a look, especially its interpretation of 'take action'. In short, off-policy Q-learning learns the optimal policy directly, but it can be unstable and hard to converge; Sarsa is more conservative, so it is worth considering when the cost of training is high.
One last doubt: today I realized that TD inherently carries a temporal ordering: Q(s,a) for the sequence a -> a' -> a'' and Q(s,a) for a -> a'' -> a' may simply not have the same value. The idea behind TD needs further thought.
On-policy vs. Off-policy
An on-policy agent learns the value based on its current action a derived from the current policy, whereas its off-policy counterpart learns it based on the action a* obtained from another policy.
The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.
The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a′′. It estimates the return for state-action pairs assuming the current policy continues to be followed.
The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
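To make the difference concrete, here is a minimal tabular sketch (my own illustration; the learning rate alpha, discount gamma and the epsilon-greedy helper are assumptions, not from the source). The only difference between the two updates is which a' enters the TD target.

```python
import numpy as np

def epsilon_greedy(Q, s, n_actions, eps=0.1):
    # Behavior policy used by both agents to interact with the environment.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the target assumes the greedy action in s', no matter
    # what the behavior policy will actually do next.
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: a_next is the action the current (epsilon-greedy) policy
    # actually takes in s', so Q is trained to evaluate the policy being followed.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```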
f(state) -> action.
What action to take now?
Remarks
[Update 20200719]
Policy gradient is only one kind of Policy Search. PG optimizes each action's probability via gradients, while other methods search and optimize directly in policy space. One resource seen so far: ICML2015 Tutorial. TO CHECK MORE ON POLICY SEARCH
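For the gradient-based branch, here is a minimal REINFORCE-style sketch (my own illustration, assuming a tabular softmax policy with logits `theta` of shape (n_states, n_actions)): it estimates the gradient that pushes up the log-probability of each taken action in proportion to the return that followed it.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """theta: (n_states, n_actions) logits of a softmax policy.
    trajectory: list of (state, action, reward) tuples from one episode."""
    grad = np.zeros_like(theta)
    G = 0.0
    # Walk the episode backwards to accumulate the return from each time step.
    for s, a, r in reversed(trajectory):
        G = r + gamma * G
        probs = softmax(theta[s])
        # Gradient of log pi(a|s) w.r.t. the logits of state s: one_hot(a) - probs.
        dlog = -probs
        dlog[a] += 1.0
        grad[s] += dlog * G
    return grad

# Gradient ascent step on the expected return (learning rate is a free choice):
# theta += 0.01 * reinforce_gradient(theta, trajectory)
```

Direct policy search methods would instead perturb or search over `theta` itself (e.g., evolutionary strategies) without computing this gradient.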
f(state, action)->expected action value
Action-value function: How good is it if a particular action is taken?
DQN tends to overestimate the Q value because of the greedy target max_a Q(s,a). Variations/tips (a Double DQN sketch follows after this list):
Multi-step: combine MC with TD, use not only one transition but multiple consecutive transitions.
Noisy Exploration: noise on the action (epsilon-greedy) or noise on the parameters (add noise to theta before each episode, which gives state-dependent exploration).
Distributional Q: Q(s,a) is no longer a scalar expected value but a distribution (several bins). Distributional Reinforcement Learning with Quantile Regression (QR-DQN): instead of returning the expected value of an action, it returns a distribution; quantiles can then be used to identify the 'best' action. C51: use the distributional Bellman equation instead of considering only the EXPECTATION of future rewards.
Rainbow = DQN + double + dueling + noisy + prioritized + distributional + multi-step
Hindsight Experience Replay: DQN with goals added to the input. Especially useful for sparse-reward settings, like some 1/0 games. Can also be combined with DDPG.
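For example, the 'double' ingredient above targets the overestimation issue: the online network selects the next action and the target network evaluates it. A minimal sketch of the target computation (my own illustration; `next_q_online` and `next_q_target` are assumed to be (batch, n_actions) arrays of Q-values for the next states under the online and target networks):

```python
import numpy as np

def dqn_target(rewards, next_q_target, dones, gamma=0.99):
    # Vanilla DQN target: the max is taken over the same network that
    # evaluates the value, which is the source of the upward bias.
    return rewards + gamma * (1.0 - dones) * next_q_target.max(axis=1)

def double_dqn_target(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    # Double DQN: the online network picks the action, the target network
    # evaluates it, which reduces the overestimation of the max operator.
    best_actions = next_q_online.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * evaluated
```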
DQN for Continuous actions
sampling actions to approximate the argmax (a sketch follows after this list)
gradient ascent to solve the argmax (ddpg)
modify the network structure to make the argmax optimization easy (CHECK THE PAPER: are we solving an optimization problem using DL?)
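A minimal sketch of the first option, sampling (my own illustration; `q_function` and the action bounds are assumptions): approximate argmax_a Q(s, a) by scoring a set of random candidate actions and keeping the best one.

```python
import numpy as np

def sampled_argmax(q_function, state, action_low, action_high, n_samples=64):
    """Approximate argmax_a Q(state, a) over a continuous (box) action space
    by uniformly sampling candidate actions and keeping the best-scoring one."""
    dim = len(action_low)
    candidates = np.random.uniform(action_low, action_high, size=(n_samples, dim))
    q_values = np.array([q_function(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]
```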
For Actor-Critic and DDPG check Actor-Critic, DDPG and GAN
For more on model-based methods, check Model-based Reinforcement Learning
Some other research directions:
In many real applications, the reward is very sparse: consider a robot task of inserting a bar into a hole; most of the time it fails and gets 0 reward. To deal with this:
Divide the goal into sub-goals that may not be directly related to the final goal. The sub-goals can then be divided again at the next level, forming a hierarchy.
I would say it's the same as in the context of DL
When I get to this point, I should pick up some game theory along the way!