Artificial Stupidity Study Notes: Reinforcement Learning (4) Temporal-Difference Methods

In the previous two chapters we studied the dynamic programming (DP) method and the Monte Carlo (MC) method. DP is built on state transitions, and its state-value estimates are bootstrapped: the update of the current state's value depends on the already-known value estimates of other states. MC needs no environment model and its state-value estimates are independent of one another, but it relies on episodic tasks. To obtain a method that needs no environment model, is not restricted to episodic tasks, and can also be applied to continuing tasks, we arrive at temporal-difference learning.

Temporal-difference learning (TD learning) combines ideas from dynamic programming and Monte Carlo methods and is a core idea of reinforcement learning. The name refers to learning from the difference between value estimates at successive time steps.

TD learning of the policy's state value vπ

One-step TD learning: TD(0)

Initialize V(s) arbitrarily, ∀s ∈ S+
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   A ← action given by π for S
   Take action A, observe R, S′
   V(S) ← V(S) + α[R + γV(S′) − V(S)]
   S ← S′
  Until S is terminal
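
To make the update concrete, here is a minimal Python sketch of TD(0) policy evaluation. The environment interface (env.reset() returning a state, env.step(a) returning (next_state, reward, done)) and the policy(s) function are illustrative assumptions, not part of the listing above.

from collections import defaultdict

def td0_evaluate(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) evaluation of a fixed policy pi.
    Assumed interface: env.reset() -> state,
    env.step(a) -> (next_state, reward, done), policy(state) -> action."""
    V = defaultdict(float)                       # V(s) starts at 0 for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                        # A <- action given by pi for S
            s_next, r, done = env.step(a)        # take A, observe R, S'
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])      # V(S) <- V(S) + alpha[R + gamma V(S') - V(S)]
            s = s_next
    return V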

Multi-step (n-step) TD learning

Input: the policy π to be evaluated
Initialize V(s) arbitrarily, ∀s ∈ S
Parameters: step size α ∈ (0,1], a positive integer n
All store and access operations (for St and Rt) can take their index mod n

Repeat (for each episode):
  Initialize and store S0 ≠ terminal
  T ← ∞
  For t = 0, 1, 2, ⋯:
   If t < T, then:
    Take an action according to π(·|St)
    Observe and store the next reward as Rt+1 and the next state as St+1
    If St+1 is terminal, then T ← t + 1
   τ ← t − n + 1 (τ is the time whose state's estimate is being updated)
   If τ ≥ 0:
    G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} Ri
    If τ + n < T, then: G ← G + γ^n V(Sτ+n)
    V(Sτ) ← V(Sτ) + α[G − V(Sτ)]
  Until τ = T − 1

That is, the update of V(S0) is computed from the rewards along the way together with V(Sn); the update of V(S1) from the rewards together with V(Sn+1); and so on.
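
The same procedure can be sketched in Python as follows, under the same assumed env/policy interface as the TD(0) sketch; for readability it stores the whole episode rather than using the mod-n ring buffer of the pseudocode.

from collections import defaultdict

def n_step_td_evaluate(env, policy, n=3, num_episodes=1000, alpha=0.1, gamma=0.99):
    """n-step TD evaluation; S[t] holds S_t and R[t] holds R_t (R[0] unused)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        S, R = [env.reset()], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(S[t]))
                S.append(s_next); R.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                  # bootstrap from V(S_{tau+n})
                    G += gamma ** n * V[S[tau + n]]
                V[S[tau]] += alpha * (G - V[S[tau]])
            if tau == T - 1:
                break
            t += 1
    return V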

On-policy TD learning of the policy's action value qπ: Sarsa

One-step TD learning

Initialize Q(s,a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Choose A from S using policy derived from Q (e.g. ε-greedy)
  Repeat (for each step of episode):
   Take action A, observe R, S′
   Choose A′ from S′ using policy derived from Q (e.g. ε-greedy)
   Q(S,A) ← Q(S,A) + α[R + γQ(S′,A′) − Q(S,A)]
   S ← S′; A ← A′
  Until S is terminal
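
A minimal Python sketch of one-step Sarsa control follows, again under the assumed Gym-like env interface and a finite action list actions; epsilon_greedy is a hypothetical helper, not something defined in the listing above.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """With probability eps pick a random action, otherwise argmax_a Q(s,a)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                       # Q(s,a) = 0 initially, incl. terminal
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)        # take A, observe R, S'
            if done:
                target = r                       # Q(terminal, .) = 0
            else:
                a_next = epsilon_greedy(Q, s_next, actions, eps)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next            # S <- S'; A <- A'
    return Q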

Multi-step (n-step) TD learning

Initialize Q(s,a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0,1],
  small ε > 0
  a positive integer n
All store and access operations (for St and Rt) can take their index mod n

Repeat (for each episode):
  Initialize and store S0 ≠ terminal
  Select and store an action A0 ∼ π(·|S0)
  T ← ∞
  For t = 0, 1, 2, ⋯:
   If t < T, then:
    Take action At
    Observe and store the next reward as Rt+1 and the next state as St+1
    If St+1 is terminal, then:
     T ← t + 1
    else:
     Select and store an action At+1 ∼ π(·|St+1)
   τ ← t − n + 1 (τ is the time whose estimate is being updated)
   If τ ≥ 0:
    G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} Ri
    If τ + n < T, then: G ← G + γ^n Q(Sτ+n, Aτ+n)
    Q(Sτ, Aτ) ← Q(Sτ, Aτ) + α[G − Q(Sτ, Aτ)]
    If π is being learned, then ensure that π(·|Sτ) is ε-greedy with respect to Q
  Until τ = T − 1
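
A Python sketch of the on-policy n-step Sarsa loop, under the same assumptions as the earlier sketches (episode buffers instead of the mod-n trick, and a fixed ε-greedy policy maintained implicitly through Q):

import random
from collections import defaultdict

def n_step_sarsa(env, actions, n=3, num_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)

    def eps_greedy(s):                           # the policy, epsilon-greedy in Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        S, R = [env.reset()], [0.0]              # R[t] holds R_t (R[0] unused)
        A = [eps_greedy(S[0])]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next); R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(eps_greedy(s_next))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                  # bootstrap from Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]
                Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q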

Off-policy TD learning of the policy's action value qπ: Q-learning

Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning using the following update rule:

Q(St,At)←Q(St,At)+α[Rt+1+γmaxa Q(St+1,a)−Q(St,At)]

One-step TD learning

Initialize Q(s,a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   Choose A from S using policy derived from Q (e.g. ε-greedy)
   Take action A, observe R, S′
   Q(S,A) ← Q(S,A) + α[R + γ maxa Q(S′,a) − Q(S,A)]
   S ← S′
  Until S is terminal
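
A minimal Python sketch of one-step Q-learning under the same assumed env interface. The action is chosen ε-greedily (the behavior policy), while the update bootstraps with max_a Q (the greedy target policy), which is what makes the method off-policy.

import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:            # epsilon-greedy behavior policy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q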

One-step TD learning: Double Q-learning

Initialize Q1(s,a) and Q2(s,a), ∀s ∈ S, a ∈ A(s), arbitrarily
Initialize Q1(terminal, ·) = Q2(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   Choose A from S using policy derived from Q1 and Q2 (e.g. ε-greedy)
   Take action A, observe R, S′
   With 0.5 probability:
    Q1(S,A) ← Q1(S,A) + α[R + γ Q2(S′, argmaxa Q1(S′,a)) − Q1(S,A)]
   Else:
    Q2(S,A) ← Q2(S,A) + α[R + γ Q1(S′, argmaxa Q2(S′,a)) − Q2(S,A)]
   S ← S′
  Until S is terminal
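
A Python sketch of Double Q-learning under the same assumptions. The point of the coin flip is that the table whose argmax selects the action is never the table that evaluates it, which counters the maximization bias of plain Q-learning.

import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)
            else:                                # greedy with respect to Q1 + Q2
                a = max(actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
            s_next, r, done = env.step(a)
            if random.random() < 0.5:            # update Q1, evaluate with Q2
                A_tab, B_tab = Q1, Q2
            else:                                # update Q2, evaluate with Q1
                A_tab, B_tab = Q2, Q1
            if done:
                target = r
            else:
                a_star = max(actions, key=lambda x: A_tab[(s_next, x)])
                target = r + gamma * B_tab[(s_next, a_star)]
            A_tab[(s, a)] += alpha * (target - A_tab[(s, a)])
            s = s_next
    return Q1, Q2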

Off-policy TD learning of the policy's action value qπ (by importance sampling): Sarsa

Multi-step (n-step) TD learning

Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s,a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0,1],
  small ε > 0
  a positive integer n
All store and access operations (for St and Rt) can take their index mod n

Repeat (for each episode):
  Initialize and store S0 ≠ terminal
  Select and store an action A0 ∼ μ(·|S0)
  T ← ∞
  For t = 0, 1, 2, ⋯:
   If t
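
A Python sketch of off-policy n-step Sarsa with importance sampling, under the same assumed env interface; pi(a, s) and mu(a, s) are assumed to return the target-policy and behavior-policy probabilities of action a in state s. The importance-sampling product below runs over every behavior-chosen action that actually enters the return (A_{τ+1} up to A_{min(τ+n, T−1)}); some textbook listings truncate this product one action earlier.

import random
from collections import defaultdict

def off_policy_n_step_sarsa(env, actions, pi, mu, n=3, num_episodes=1000,
                            alpha=0.1, gamma=0.99):
    Q = defaultdict(float)

    def sample(policy, s):                       # draw an action from a policy
        return random.choices(actions, weights=[policy(a, s) for a in actions])[0]

    for _ in range(num_episodes):
        S, R = [env.reset()], [0.0]
        A = [sample(mu, S[0])]                   # behavior policy mu generates the data
        T, t = float('inf'), 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next); R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(sample(mu, s_next))
            tau = t - n + 1
            if tau >= 0:
                rho = 1.0                        # importance-sampling correction toward pi
                for i in range(tau + 1, min(tau + n, T - 1) + 1):
                    rho *= pi(A[i], S[i]) / mu(A[i], S[i])
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]
                Q[(S[tau], A[tau])] += alpha * rho * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q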

Off-policy TD learning of the policy's action value qπ (without importance sampling): Tree Backup Algorithm

The idea of the Tree Backup algorithm is to back up an expected action value at every step. Taking the expectation means that every possible action a is evaluated (weighted by its target-policy probability), not just the action that was actually taken.
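
To make that expectation concrete, here is a minimal Python sketch of how the n-step tree-backup return can be computed backwards from stored transitions; S, A, R are episode lists indexed by time step, Q maps (state, action) pairs to estimates, and pi(a, s) returning the target-policy probability is an assumed interface.

def tree_backup_return(S, A, R, Q, pi, actions, tau, t, T, gamma):
    """Tree-backup return for (S[tau], A[tau]), built backwards from step t+1."""
    if t + 1 >= T:
        G = R[T]                                 # episode ended: last reward only
    else:
        # leaf: expected action value of S_{t+1} under the target policy
        G = R[t + 1] + gamma * sum(pi(a, S[t + 1]) * Q[(S[t + 1], a)]
                                   for a in actions)
    # walk back from step min(t, T-1) to tau+1: keep the taken action's branch
    # as the recursive return, every other action at its current Q estimate
    for k in range(min(t, T - 1), tau, -1):
        expected_others = sum(pi(a, S[k]) * Q[(S[k], a)]
                              for a in actions if a != A[k])
        G = R[k] + gamma * expected_others + gamma * pi(A[k], S[k]) * G
    return G

The resulting G is then used like the n-step Sarsa return, Q(Sτ, Aτ) ← Q(Sτ, Aτ) + α[G − Q(Sτ, Aτ)]; no importance-sampling ratio is needed because only target-policy probabilities appear.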

Multi-step (n-step) TD learning

Initialize Q(s,a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0,1],
  small ε > 0
  a positive integer n
All store and access operations (for St and Rt) can take their index mod n

Repeat (for each episode):
  Initialize and store S0 ≠ terminal
  Select and store an action A0 ∼ π(·|S0)
  Q0 ← Q(S0, A0)
  T ← ∞
  For t = 0, 1, 2, ⋯:
   If t

Off-policy TD learning of the policy's action value qπ: Q(σ)

Q(σ) unifies Sarsa with importance sampling, Expected Sarsa, and the Tree Backup algorithm, applying importance-sampling corrections where they are needed.
When σ = 1, it becomes Sarsa with importance sampling.
When σ = 0, it becomes the Tree Backup algorithm, which uses expected action values.
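
A minimal Python sketch of the Q(σ) return, built backwards like the tree-backup return above; pi(a, s) and mu(a, s) are assumed probability interfaces, σ is taken as a constant in [0, 1] rather than a per-step value, and the control-variate form of the recursion used here is one common way to write it.

def q_sigma_return(S, A, R, Q, pi, mu, actions, tau, t, T, gamma, sigma):
    """Q(sigma) return for (S[tau], A[tau]): sigma=1 leans on the sampled action
    (importance-weighted, Sarsa-like), sigma=0 on the full expectation (tree backup)."""
    G = Q[(S[t + 1], A[t + 1])] if t + 1 < T else 0.0
    for k in range(min(t + 1, T), tau, -1):
        if k == T:
            G = R[T]                             # terminal step: last reward
        else:
            rho = pi(A[k], S[k]) / mu(A[k], S[k])      # importance ratio for A_k
            v_bar = sum(pi(a, S[k]) * Q[(S[k], a)] for a in actions)
            G = (R[k]
                 + gamma * (sigma * rho + (1 - sigma) * pi(A[k], S[k]))
                         * (G - Q[(S[k], A[k])])
                 + gamma * v_bar)
    return G

Setting sigma = 0 makes the rho term drop out and the recursion collapses to the tree-backup return; setting sigma = 1 gives the importance-weighted Sarsa return, written with Q(Sk, Ak) and the expected value v_bar acting as a control variate.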

Multi-step (n-step) TD learning

Input: behavior policy μ such that μ(a|s) > 0, ∀s ∈ S, a ∈ A
Initialize Q(s,a) arbitrarily, ∀s ∈ S, ∀a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0,1],
  small ε > 0
  a positive integer n
All store and access operations (for St and Rt) can take their index mod n

Repeat (for each episode):
  Initialize and store S0 ≠ terminal
  Select and store an action A0 ∼ μ(·|S0)
  Q0 ← Q(S0, A0)
  T ← ∞
  For t = 0, 1, 2, ⋯:
   If t
Summary: the Monte Carlo approach simulates (or experiences) a complete episode and, only after the episode ends, uses the returns observed along it to estimate the value of each visited state. Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the newly reached state to update the estimate of the state it just left.
In this sense, the Monte Carlo method can be viewed as TD learning with the maximum possible number of steps.
