A Casual Overview of Reinforcement Learning

[update 20200712]

OpenAI's website is a good reference: spinningup

[Figure 1]


Plan

  1. Finish Hung-yi Lee's RL video lectures.
  2. Implement the algorithms one by one, based on the OpenAI Spinning Up tips.
  3. In the meantime, master PyTorch/TF and deep learning basics.
  4. When time allows, keep an eye on the research frontier.

 


Reinforcement Learning Overview

This overview is largely based on this article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

On-Policy vs Off-Policy

[update 0710] After watching Hung-yi Lee's DRL lectures I realized that the relationship between the replay buffer in TD-based Q-learning and on/off-policy is mainly a matter of distribution. The tuples in the buffer are experiences rather than trajectories, so they have nothing to do with which policy is currently being trained. However, the distribution of tuples over the whole buffer differs from the distribution of data collected with the current policy, and sampling from the replay buffer is usually uniform, so if the replay buffer has to be assigned to one side, it corresponds to off-policy. For MC-style trajectories, using trajectories generated by pi to train pi' falls even more clearly under off-policy.

[source:https://www.quora.com/Why-is-Q-Learning-deemed-to-be-off-policy-learning]

The main criterion: when Q is updated, is the policy that Q evaluates the same as the policy currently interacting with the environment? In SARSA it is; in Q-learning, the update to Q essentially evaluates \pi^* rather than the current \pi. The question is whether the a' in Q(s',a') is produced by the current actor given s', or is an approximation like the max function in Q-learning. When a replay buffer is used, or when a' is generated by a target actor, the method is called off-policy; otherwise it is on-policy. This is my current understanding after reading through a lot of material.

[update 0413] See the figure below; I now have a new understanding: remember that the Q function is TD-based, so consecutive actions are ordered. In other words, when training Q we need to know which Q(s',a') differs from the current Q(s,a) by r. If the a' here matches the action the current policy would output, we are training Q to follow the current policy, so it is on-policy. Otherwise, as in Q-learning, the current policy makes an epsilon-greedy choice but Q is trained as if the next step were totally greedy, so Q no longer matches the current policy and it is off-policy.

When a replay buffer is involved, the a' it provides comes from some historical policy and differs from the action the current actor would return, so Q is not being trained to evaluate the current policy; hence it is off-policy.

Summary: whether a method is on- or off-policy depends on whether, when training Q, the a' in Q(s',a') is the same as the action the current actor suggests. In other words, whether we are training Q to be the evaluation function of the current policy.

 

[Figure 2]

The next question: how should one choose between the two? The answer in the figure below is worth a look, especially its interpretation of 'take action'. In short, off-policy Q-learning learns the optimal policy directly but may be unstable and hard to converge; SARSA is more conservative, so it is worth considering when mistakes during training are costly.

[Figure 3]

One last question: today I realized that TD is inherently order-dependent: Q(s,a) for a->a'->a'' and Q(s,a) for a->a''->a' may simply not have the same value. The idea behind TD needs further thought.

On-policy vs. Off-policy

An on-policy agent learns the value based on its current action a derived from the current policy, whereas its off-policy counterpart learns it based on the action a* obtained from another policy.

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a′′. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
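
To make the distinction concrete, here is a minimal tabular sketch (my own illustration, not from the quoted answer): the two update rules differ only in which action forms the TD target.

```python
# Minimal tabular sketch: SARSA vs Q-learning differ only in the TD target.
import numpy as np

n_states, n_actions, alpha, gamma, eps = 10, 4, 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def sarsa_update(s, a, r, s_next):
    a_next = epsilon_greedy(s_next)          # a'' comes from the current (behavior) policy
    target = r + gamma * Q[s_next, a_next]   # on-policy target
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(s, a, r, s_next):
    target = r + gamma * Q[s_next].max()     # greedy a', regardless of the behavior policy
    Q[s, a] += alpha * (target - Q[s, a])    # off-policy target
```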

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10000 times and checking the expected reward of f(s); then you can do gradient ascent on it.
  2. Trust Region Policy Optimization (TRPO). An on-policy method which does PG with a big step, but not too big: the update stays within the trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO it is an on-policy method, but it treats that constraint as a penalty (regularization). Popularized by OpenAI. Does PG while adjusting the gradients smartly to avoid performance issues and instability. It can be implemented and solved more easily than TRPO with similar performance, so it is preferred over TRPO (a minimal sketch of the penalty form follows this list).
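
As a rough illustration of the penalty view in item 3, here is a hedged PyTorch sketch of a KL-penalized surrogate loss. The names (`logp_new`, `logp_old`, `adv`) and the fixed `beta` coefficient are my own simplifications; actual PPO implementations usually use the clipped objective or an adaptive penalty.

```python
import torch

def ppo_penalty_loss(logp_new, logp_old, adv, beta=1.0):
    """logp_new/logp_old: log-probs of the sampled actions under the new and old policies;
    adv: advantage estimates. All tensors have shape (batch,)."""
    ratio = torch.exp(logp_new - logp_old)   # importance weight pi_new / pi_old
    surrogate = ratio * adv                  # lets old samples be reused
    kl = logp_old - logp_new                 # rough per-sample estimate of KL(old || new)
    return -(surrogate - beta * kl).mean()   # maximize surrogate, penalize divergence
```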

Remarks

  1. Why we invented TRPO/PPO: each time the policy is updated, all previous samples become outdated, and it is too costly to regenerate all samples on every policy update. PPO allows reusing the old experiences, moving from on-policy toward off-policy.
  2. Rewards should be centered around 0. Since PG is based on sampling, if all rewards are positive, the actions that happen not to be sampled see their probabilities pushed down more and more (see the sketch after this list).
  3. For a given (s,a), only the discounted reward obtained afterward should be considered.
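
A small sketch illustrating remarks 2 and 3: compute the discounted reward-to-go at each step and center it with a simple mean baseline. The names are illustrative; in practice a learned V(s) baseline is the usual choice.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    # Remark 3: each step only gets credit for the discounted rewards that follow it.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = rewards_to_go([0.0, 0.0, 1.0, 0.0, 1.0])
centered = rtg - rtg.mean()   # Remark 2: crude centering; a learned baseline works better
```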

[Update 20200719]

Policy gradient is only one kind of policy search. PG optimizes each action via the gradient, while other methods search and optimize directly in policy space. A resource seen so far: ICML2015 Tutorial. TO CHECK MORE ON PS

[Figure 4]

 

Q-Learning (Value Iteration)

f(state, action)->expected action value

Action-value function: how good is it if a particular action is taken?

DQN tends to overestimate the Q value because of the greedy setting max_a Q(s,a) in the target. Variations/tips:

  1. Double-DQN is DQN with a target network used for evaluation: separate the Q which selects the action from the Q which computes the value used in the Bellman update (see the sketch after this list).
  2. Dueling DQN is DQN with separate output heads for V(s) and A(s,a), combined as Q(s,a)=V(s)+A(s,a) (see the sketch after this list).
    1. Advantage: an update to V(s) changes Q(s,a) for every action, even actions that were not sampled. In practice,
      1. some normalization should be done: keep sum(A)=0. HOW TO IMPLEMENT? It is simply done by subtracting avg(A).
      2. also add constraints on A, so that the network will not simply set V(s) to 0.
  3. Prioritized Replay: prefer to use samples having large TD error.
  4. Multi-step: combine MC with TD, use not only one transition but multiple consecutive transitions.

  5. Noisy Exploration: noise on the action (epsilon-greedy) or noise on the parameters (noise on theta before each episode; state-dependent exploration).

  6. Distributional Q: Q(s,a) is no longer a scalar expected value but a distribution (several bins). Distributional Reinforcement Learning with Quantile Regression (QR-DQN): instead of returning the expected value of an action, it returns a distribution; quantiles can then be used to identify the 'best' action. C51: use the distributional Bellman equation instead of considering only the EXPECTATION of future rewards.

  7. Rainbow = DQN + double + dueling + noisy + prioritized + distributional + multi-step.

  8. Hindsight Experience Replay: DQN with goals added to the input. Especially useful when rewards are sparse, as in some 1/0 games. Can also be combined with DDPG.
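
To make points 1 and 2 above concrete, here is a hedged PyTorch sketch of a dueling head with the mean-subtracted advantage, plus a Double-DQN style target where the online network selects a' and the target network evaluates it. Network sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)           # V(s)
        self.a_head = nn.Linear(hidden, n_actions)   # A(s, a)

    def forward(self, obs):
        h = self.body(obs)
        v, a = self.v_head(h), self.a_head(h)
        return v + a - a.mean(dim=1, keepdim=True)   # keep the advantages zero-mean

def double_dqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    a_next = q_online(s_next).argmax(dim=1, keepdim=True)    # selection: online network
    q_next = q_target(s_next).gather(1, a_next).squeeze(1)   # evaluation: target network
    return r + gamma * (1.0 - done) * q_next
```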

DQN for Continuous actions

  1. sample candidate actions and take the best one (see the sketch after this list)

  2. use gradient ascent to solve the argmax (DDPG)

  3. modify the network structure to make the optimization easy (CHECK THE PAPER: are we solving an optimization problem using DL?)
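
A minimal sketch of option 1: approximate argmax_a Q(s,a) over a continuous action space by scoring a batch of uniformly sampled candidate actions. The `q_net(s, a)` interface and all names are assumptions made for illustration.

```python
import torch

def sampled_argmax(q_net, s, action_low, action_high, n_samples=64):
    """s: a (1, obs_dim) state tensor; action_low/action_high: (action_dim,) bounds."""
    dim = action_low.shape[0]
    candidates = action_low + (action_high - action_low) * torch.rand(n_samples, dim)
    s_rep = s.expand(n_samples, -1)                  # repeat the state for each candidate
    scores = q_net(s_rep, candidates).squeeze(-1)    # Q(s, a) for every sampled action
    return candidates[scores.argmax()]               # best candidate found by sampling
```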

Hybrid

For Actor-Critic and DDPG check Actor-Critic, DDPG and GAN

  1. DDPG
  2. A3C. Asynchronous: several agents are trained in parallel. Actor-Critic: policy gradient and Q-learning are combined. Also check Soft Actor-Critic.
  3. TD3

Model-based vs Model-free

  • Model: a world model, i.e. structured information about the environment. It exploits the structure of the environment for planning; to some extent the state transition probabilities are known.
  • Model-free methods see the environment as a black box that only provides states and rewards as numbers. No extra information can be exploited.

For more on model-based methods, check Model-based Reinforcement Learning

Other topics

Some other research directions:

Sparse Reward

In many real applications, the reward is very sparse: consider a robot task of inserting a bar into a hole; most of the time it fails and gets 0 reward. To deal with this,

  • Reward shaping: manually add extra reward to guide the agent (domain knowledge is exploited); see the sketch after this list.
  • Curriculum Learning: learn from simple to hard, step by step. First train on easy data, for example tuples that carry rewards, then add harder data, such as experience with sparse rewards.
    • Reverse Curriculum Generation: first sample states near the goal, then states further and further away.
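
As a toy illustration of reward shaping for the bar-insertion example, one might add a dense, hand-designed bonus based on the distance to the hole; the names and the linear distance penalty below are hypothetical.

```python
import numpy as np

def shaped_reward(env_reward, bar_tip_pos, hole_pos, weight=0.1):
    # Dense shaping term: the closer the bar tip is to the hole, the larger the bonus.
    distance = np.linalg.norm(np.asarray(bar_tip_pos) - np.asarray(hole_pos))
    return env_reward + weight * (-distance)
```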

Hierarchical RL

Divide the goal into sub-goals that may not be directly related to the final goal. Then the sub-goals can be divided again at the next level, forming a hierarchy.

Imitation Learning

  • Behavior cloning: the same as supervised learning (a minimal sketch follows this list). Problems:
    1. Experience is limited. Try data aggregation: expert -> pi_1 -> trajectories -> expert labels -> pi_2 -> ... Not a good solution.
    2. Some parts of the demonstration should be cloned, others should not, but the learner does not know which.
    3. Data mismatch. TO BE CLARIFIED: the distributions of the training data and the testing data are not the same. Even if the learner has learnt 99% from the expert, the resulting reward could be very different due to the nature of RL/MDPs.
  • Inverse RL: more interesting than behavior cloning. Instead of learning by cloning expert actions, first infer the reward function from the expert's actions, then optimize over that function.
    1. How to learn the reward: again a GAN-like setup. Update the reward function so that the teacher's actions always score better; update the actor to obtain a better reward.
    2. Different ways to demonstrate: first-person or third-person. In the third-person case, add feature-extraction layers to the network to make third-person experiences resemble first-person ones.
    3. Advantage: usually not many demonstrations are needed.
    4. CHECK THE LINK WITH STRUCTURED LEARNING
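
A minimal sketch of behavior cloning as plain supervised learning, fitting a small policy network to expert (state, action) pairs with a cross-entropy loss; the network size and data shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # 8-dim obs, 4 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def bc_step(expert_states, expert_actions):
    """expert_states: (batch, 8) float tensor; expert_actions: (batch,) long tensor."""
    logits = policy(expert_states)           # predict action logits
    loss = loss_fn(logits, expert_actions)   # match the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```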

Meta-learning/Transfer-learning

I would say it's the same as in the context of DL

Multi-agent

When I get to this topic, I will take the chance to learn some game theory as well!
