Notes 01: Q-learning

Article link

PS: a quick aside on RL terminology:
(You’ll see in papers that the RL process is called the Markov Decision Process (MDP).)

Comparing Monte Carlo and Temporal Difference Learning

  1. Monte Carlo methods learn only after a full episode has finished, updating the value estimate toward the actual return G_t.
    The corresponding update rule:


    $$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$
  2. Temporal Difference Learning: learning at each step. The TD target, R_{t+1} + γ V(S_{t+1}), is what the estimate is moved toward after every single step.
    The corresponding update rule:


    $$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
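
To make the contrast concrete, here is a minimal Python sketch of the two update rules. The state-value table `V`, the learning rate `alpha`, and the discount `gamma` are illustrative placeholders, not from the article:

```python
alpha, gamma = 0.1, 0.99   # hypothetical learning rate and discount factor
V = {}                     # state-value table; missing states default to 0.0

def mc_update(episode):
    """Monte Carlo: wait until the episode finishes, then move every
    visited state's value toward its actual return G_t."""
    G = 0.0
    for state, reward in reversed(episode):   # episode = [(S_t, R_{t+1}), ...]
        G = reward + gamma * G                # accumulate the return G_t backwards
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)        # V(S_t) += alpha * [G_t - V(S_t)]

def td_update(state, reward, next_state):
    """Temporal Difference: update after every single step, bootstrapping
    from the current estimate of the next state's value."""
    v = V.get(state, 0.0)
    td_target = reward + gamma * V.get(next_state, 0.0)  # R_{t+1} + gamma * V(S_{t+1})
    V[state] = v + alpha * (td_target - v)    # V(S_t) += alpha * [TD target - V(S_t)]
```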

on-policy and off-policy

... ...

value-based and policy-based

  • “Value-based method”: it means that it finds its optimal policy indirectly by training a value-function or action-value function that will tell us what’s the value of each state or each state-action pair.

  • "Policy-based method": it directly trains the policy to learn which action to take given a state, without going through a value function.

TD approach

“Uses a TD approach”: updates its action-value function at each step.

state-value function and action-value function

... ...

Q-learning

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:
Explanations of all the terms above can be found at the article link; they are only noted down here for reference.

The Bellman equation used to update the Q-table:


$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
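
As a sketch of how this update looks in code (the Q-table layout and the names `q_table`, `alpha`, `gamma`, and the `done` flag are assumptions for illustration):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # hypothetical learning rate and discount factor
q_table = defaultdict(float)      # (state, action) -> Q value, default 0.0

def q_learning_update(state, action, reward, next_state, done, actions):
    """One off-policy TD step of the Bellman update above."""
    # At a terminal state there is no future value to bootstrap from.
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next        # R_{t+1} + gamma * max_a Q(S_{t+1}, a)
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error  # Q(S_t, A_t) += alpha * TD error
```

Note that the bootstrap term takes the greedy max over actions regardless of which action the behaviour policy (e.g. epsilon-greedy) actually takes next; this is exactly why Q-learning is off-policy.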

In practice, when designing an environment for an experiment, the rewards, penalties, and termination conditions all need to be given simulated values; the numbers at these special locations are fixed in advance, so these parameters can be treated as known. What we actually need to do is train the Q-table so that, following the corresponding policy through the different states and actions, the final cumulative reward is as large as possible.
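
For example, once the Q-table has converged, acting greedily with respect to it is what realizes that maximal cumulative reward (a hypothetical helper, reusing the `q_table` layout sketched above):

```python
def greedy_policy(q_table, state, actions):
    """Pick the action with the highest learned Q value in this state."""
    return max(actions, key=lambda a: q_table[(state, a)])
```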
