Silver-Slides Chapter 4 - Monte Carlo Methods (MC) and Temporal-Difference Learning (TD)

Chapter 4 - MC-TD

Introduction

Last lecture:
	Planning by dynamic programming
	Solve a known MDP
This lecture:
	Model-free prediction
	Estimate the value function of an unknown MDP
Next lecture:
	Model-free control
	Optimise the value function of an unknown MDP

Monte-Carlo Learning

  • Monte-Carlo RL

    These are our first learning methods for estimating value functions and discovering optimal policies. Unlike the previous chapter, here we do not assume complete knowledge of the environment.

    MC methods learn directly from episodes of experience
    MC is model-free: no knowledge of MDP transitions / rewards
    MC learns from complete episodes: no bootstrapping
    MC uses the simplest possible idea: value = mean return
    Caveat: can only apply MC to episodic MDPs
      All episodes must terminate

    The clearest difference between Monte-Carlo RL and the DP methods of the previous lecture is that DP requires the full transition probability matrix, whereas Monte-Carlo RL does not. That is why it is called model-free.

    Monte-Carlo policy evaluation uses empirical mean return instead of expected return

  • Policy Evaluation

    • First-Visit

    • Every-Visit

    The difference between the two can be seen from a concrete procedure:

    The procedure above is First-Visit MC. Note the line "Unless $S_t$ appears in $S_0, S_1, \ldots, S_{t-1}$": for the current time step $t$ and state $S_t$, only the return $G$ at the first occurrence of $S_t$ in the episode is used. In contrast, Every-Visit MC averages the return $G$ at every occurrence of $S_t$ into $V(S_t)$.

    We can also see that we generate the episodes ourselves; once enough episodes have been generated, the empirical mean return approaches the expected return.

The main idea of Monte-Carlo RL is that the value function can be computed without knowing the model. The tool is the Monte Carlo method itself: generate many state sequences, and in each episode use the mean return to estimate the value. Taking Blackjack as an example, each episode is one complete game; the state sequence of each game is recorded and used to update $V(S_t)$.
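
Below is a minimal first-visit MC prediction sketch. The episode format, a list of (state, reward) pairs with the reward received on leaving each state, is an assumption of this example rather than something from the slides:

    def first_visit_mc(episodes, gamma=1.0):
        # episodes: list of trajectories, each a list of (state, reward) pairs,
        # where reward is the reward received on leaving that state (assumed format)
        returns = {}                              # state -> list of sampled returns
        for episode in episodes:
            # index of the first visit to each state in this episode
            first_visit = {}
            for i, (s, _) in enumerate(episode):
                first_visit.setdefault(s, i)
            # walk backwards so G accumulates the discounted return from each step
            G = 0.0
            for t in range(len(episode) - 1, -1, -1):
                state, reward = episode[t]
                G = reward + gamma * G
                if first_visit[state] == t:       # first-visit: count this return only once
                    returns.setdefault(state, []).append(G)
        # value = mean return (empirical mean instead of expected return)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

Every-visit MC differs only in dropping the first_visit check and appending G at every time step.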

Temporal-Difference Learning

  • Introduction

    If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.

    1. Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment’s dynamics.
    2. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

    TD methods combine the sampling of Monte Carlo with the bootstrapping of DP.

  • Temporal-Difference Learning

    TD methods learn directly from episodes of experience
    TD is model-free: no knowledge of MDP transitions / rewards
    TD learns from incomplete episodes, by bootstrapping
    TD updates a guess towards a guess

  • TD(0)

    # TD(0) update (undiscounted): move V(oldState) toward the one-step target reward + V(state)
    states[oldState] += alpha * (reward + states[state] - states[oldState])
    

    TD(0) updates the value based on an existing estimate, which is why we call it a bootstrapping method.
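
    As a sketch only, a full tabular TD(0) prediction loop might look as follows; env_step is a hypothetical helper that samples one transition under the policy being evaluated, and the alpha/gamma defaults are arbitrary:

    from collections import defaultdict

    def td0_prediction(env_step, start_state, num_episodes, alpha=0.1, gamma=1.0):
        # env_step(state) is a hypothetical sampler returning (reward, next_state, done)
        V = defaultdict(float)                    # value estimates, initialised to 0
        for _ in range(num_episodes):
            state, done = start_state, False
            while not done:
                reward, next_state, done = env_step(state)
                # bootstrapped target: one real reward plus the current guess at V(next_state)
                target = reward if done else reward + gamma * V[next_state]
                V[state] += alpha * (target - V[state])
                state = next_state
        return V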

  • Differences between TD, MC, and DP

    1. TD and MC updates are sample updates: they look ahead to a sampled successor state and use the value of that successor plus the reward along the way to update the value of the original state. DP updates are expected updates: they are computed from the complete probability distribution over all possible successor states.
    2. TD vs. MC: TD updates a value from the value of the next state, whereas MC must wait until a terminal state is reached and uses the actual return accumulated along the way. In the Driving Home example, MC updates the time estimate for each state using the actual time it took to get home (the final outcome), while TD revises the estimate immediately from the next state: if the highway is congested right now, you immediately re-estimate that getting home will take another 50 minutes.

    MC: samples

    DP: bootstraps

    TD: samples & bootstraps

  • Advantages of TD

    1. Compared with DP:

      They do not require a model of the environment, of its reward and next-state probability distributions.

    2. Compared with MC:

      • They are naturally implemented in an online, fully incremental fashion.

        TD can learn before knowing the final outcome: TD can learn online after every step, whereas MC must wait until the end of the episode, when the return is known.

        TD can learn without the final outcome: TD can learn from incomplete sequences, while MC can only learn from complete sequences; TD works in continuing (non-terminating) environments, while MC only works for episodic (terminating) environments.

      • TD methods are much less susceptible to these problems because they learn from each transition regardless of what subsequent actions are taken.

      • MC has high variance, zero bias:
        good convergence properties (even with function approximation); not very sensitive to initial value; very simple to understand and use.
        TD has low variance, some bias:
        usually more efficient than MC; TD(0) converges to $v_{\pi}(s)$ (but not always with function approximation); more sensitive to initial value.

      • TD exploits Markov property: Usually more effective in Markov environments
        MC does not exploit Markov property: Usually more effective in non-Markov environments

  • Batch MC and TD

    Batch Monte Carlo methods always find the estimates that minimize mean-squared error on the training set, whereas batch TD(0) always finds the estimates that would be exactly correct for the maximum-likelihood model of the Markov process.

    Take the AB example: batch TD(0) builds the maximum-likelihood estimate of the underlying Markov process from the 8 episodes. If that estimated Markov model were exactly right, the value function computed from it would also be exactly right. This estimate is called the certainty-equivalence estimate, because it is equivalent to assuming that the estimate of the underlying process was known with certainty rather than being approximated.

    This is also why TD methods converge faster than MC methods. In batch form, TD(0) is faster than Monte Carlo methods because it computes the true certainty-equivalence estimate. Nonbatch TD(0) computes neither the certainty-equivalence estimate nor the minimum mean-squared-error estimate, but it moves roughly in that direction: nonbatch TD(0) may be faster than constant-$\alpha$ MC because it is moving toward a better estimate, even though it is not getting all the way there.
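
    The AB example can be reproduced in a few lines. The sketch below assumes the standard eight-episode data set (one episode A,0,B,0; six episodes B,1; one episode B,0) from Sutton & Barto, Example 6.4, and contrasts the batch MC estimate with the certainty-equivalence estimate that batch TD(0) converges to:

    # assumed data: each episode is a list of (state, reward) steps
    episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

    # batch MC: average the (undiscounted) return observed from each state
    returns = {"A": [], "B": []}
    for ep in episodes:
        G = 0
        for state, reward in reversed(ep):
            G += reward
            returns[state].append(G)
    V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
    print(V_mc)   # {'A': 0.0, 'B': 0.75} -- minimises MSE on the training set

    # certainty-equivalence: fit the maximum-likelihood Markov model
    # (A -> B with probability 1 and reward 0; from B the expected reward is 6/8)
    # and solve it exactly; this is what batch TD(0) converges to
    V_ce = {"B": 6 / 8}
    V_ce["A"] = 0 + V_ce["B"]
    print(V_ce)   # {'A': 0.75, 'B': 0.75}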

TD($\lambda$)

  • n-Step TD

    n-Step TD unifies the Monte Carlo (MC) methods and the one-step temporal-difference (TD) methods.

    That is, n-step TD sits between the two, looking ahead n steps:

    n-Step Return:
    $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})$

    n-Step Learning:
    $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$
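
    A minimal sketch of computing the n-step return for one time step. The indexing is an assumption of this example: rewards[k] holds $R_{k+1}$ and values[k] holds the current estimate $V(S_k)$:

    def n_step_return(rewards, values, t, n, gamma=1.0):
        # rewards[k] = R_{k+1}, values[k] = current estimate V(S_k)  (indexing assumed)
        T = len(rewards)                          # episode terminates after T steps
        steps = min(n, T - t)                     # truncate if the episode ends first
        G = sum(gamma ** k * rewards[t + k] for k in range(steps))
        if t + n < T:
            G += gamma ** n * values[t + n]       # bootstrap from the estimate at S_{t+n}
        return G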

  • $\lambda$-return algorithm - Forward View of TD($\lambda$)

    The $\lambda$-return combines all of the n-step returns $G_t^{(n)}$, weighting each by $(1-\lambda)\lambda^{n-1}$.

    $\lambda$-return:
    $G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
    The $\lambda$-return gives us an alternative way of moving smoothly between Monte Carlo and one-step TD methods that can be compared with the n-step TD way of Chapter 7.

    $\lambda$-return update of the value function:
    $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\lambda} - V(S_t) \right)$

    As we can see, the $\lambda$-return algorithm looks many steps into the future: we decide how to update each state by looking forward to future rewards and states. However, like MC, it can only be computed from complete episodes.
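
    A sketch of the forward-view computation, reusing the hypothetical n_step_return helper above; the leftover weight $\lambda^{T-t-1}$ goes to the full (MC) return, as in the episodic form of the $\lambda$-return:

    def lambda_return(rewards, values, t, lam, gamma=1.0):
        # uses the hypothetical n_step_return helper defined earlier
        T = len(rewards)
        G_lambda = 0.0
        for n in range(1, T - t):                 # n-step returns that still bootstrap
            G_lambda += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        # all n >= T - t give the full return, so it collects the remaining weight
        G_lambda += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
        return G_lambda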

  • TD($\lambda$)

    Forward view provides theory; Backward view provides mechanism
    Update online, every step, from incomplete sequences
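
    In the standard formulation, the mechanism the backward view provides is an eligibility trace per state. A sketch, assuming accumulating traces and the same hypothetical env_step sampler as in the TD(0) example:

    from collections import defaultdict

    def td_lambda_backward(env_step, start_state, num_episodes,
                           alpha=0.1, gamma=1.0, lam=0.9):
        # env_step(state) is a hypothetical sampler returning (reward, next_state, done)
        V = defaultdict(float)
        for _ in range(num_episodes):
            E = defaultdict(float)                # eligibility traces, reset each episode
            state, done = start_state, False
            while not done:
                reward, next_state, done = env_step(state)
                delta = (reward if done else reward + gamma * V[next_state]) - V[state]
                E[state] += 1.0                   # accumulating trace for the visited state
                for s in E:                       # push the TD error back to all traced states
                    V[s] += alpha * delta * E[s]
                    E[s] *= gamma * lam           # traces decay every step
                state = next_state
        return V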
