Reinforcement Learning Exercise 7.1

Exercise 7.1 In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don't change from step to step. Show that the n-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don't change), generalizing the earlier result.

Here, according to equation (7.2), the n-step error (which we denote $\delta_t$) is:

$$\delta_t = G_{t:t+n} - V_{t+n-1}(S_t)$$
For $G_{t:t+n}$ we have:

$$G_{t:t+n} = \begin{cases} R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}) & n \geq 1 \text{ and } 0 \leq t < T-n \\ R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T & t+n \geq T \end{cases}$$
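The two cases can be sketched in code. This is a minimal illustration of my own (the list layout, function name, and sample numbers are not from the book): `rewards[i]` stands for $R_{i+1}$, and a single fixed value table is used since the estimates don't change from step to step.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n} per the two cases above.

    rewards[i] is R_{i+1} (reward for the transition out of S_i),
    values[i] is the unchanging estimate V(S_i), and T = len(rewards)
    is the terminal time (the terminal state's value is zero).
    """
    T = len(rewards)
    horizon = min(t + n, T)  # truncate the reward sum at termination
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, horizon))
    if t + n < T:            # bootstrap only when the episode hasn't ended
        g += gamma ** n * values[t + n]
    return g


# First case (0 <= t < T - n): bootstraps with gamma^n * V(S_{t+n}).
print(n_step_return([1.0, 2.0, 3.0], [0.5, 0.2, 0.1], t=0, n=2, gamma=0.9))
# Second case (t + n >= T): plain discounted sum of the remaining rewards.
print(n_step_return([1.0, 2.0, 3.0], [0.5, 0.2, 0.1], t=1, n=5, gamma=0.9))
```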
Then, for $t+n \geq T$, the Monte Carlo error is:

$$\begin{aligned} G_t - V_{t+n}(S_t) &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T - V_{t+n}(S_t) \\ &= G_{t:t+n} - V_{t+n}(S_t) \end{aligned}$$

Because the value estimates don't change from step to step, $V_{t+x}(s) = V_t(s)$ for any $x > 0$; in particular $V_{t+n}(S_t) = V_{t+n-1}(S_t)$, so:

$$G_t - V_{t+n}(S_t) = G_{t:t+n} - V_{t+n-1}(S_t) = \delta_t$$

Similarly, for $n \geq 1$ and $0 \leq t < T-n$, let $k$ be the smallest integer such that $t+(k+1)n \geq T$ (so $k \geq 1$). Since the value estimates don't change, we may drop the time subscripts on $V$. Splitting the return after the first $n$ rewards and adding and subtracting the bootstrap term $\gamma^n V(S_{t+n})$, the Monte Carlo error unrolls $n$ steps at a time:

$$\begin{aligned} G_t - V(S_t) &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1}R_T - V(S_t) \\ &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n V(S_{t+n}) - V(S_t) \\ & \quad + \gamma^n \bigl[ R_{t+n+1} + \gamma R_{t+n+2} + \cdots + \gamma^{T-(t+n)-1}R_T - V(S_{t+n}) \bigr] \\ &= \delta_t + \gamma^n \bigl[ G_{t+n} - V(S_{t+n}) \bigr] \\ &= \delta_t + \gamma^n \delta_{t+n} + \gamma^{2n} \bigl[ G_{t+2n} - V(S_{t+2n}) \bigr] \\ & \ \ \vdots \\ &= \delta_t + \gamma^n \delta_{t+n} + \cdots + \gamma^{(k-1)n}\delta_{t+(k-1)n} + \gamma^{kn} \bigl[ G_{t+kn} - V(S_{t+kn}) \bigr] \end{aligned}$$

By the choice of $k$ we have $t+kn+n \geq T$, so the last bracket is exactly the terminal case treated above: $G_{t+kn} - V(S_{t+kn}) = G_{t+kn:t+kn+n} - V(S_{t+kn}) = \delta_{t+kn}$. Hence the Monte Carlo error is a discounted sum of $n$-step errors:

$$G_t - V_{t+n}(S_t) = \sum_{p=0}^{k}\gamma^{pn}\delta_{t+pn}$$
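The final identity is easy to check numerically. The sketch below is my own (the episode length, random rewards, and value table are made up for illustration): it compares the Monte Carlo error against the discounted sum of $n$-step errors for every start time $t$.

```python
import random

def n_step_error(rewards, values, t, n, gamma):
    """delta_t = G_{t:t+n} - V(S_t), with rewards[i] = R_{i+1} and V(S_T) = 0."""
    T = len(rewards)
    horizon = min(t + n, T)
    g = sum(gamma ** (i - t) * rewards[i] for i in range(t, horizon))
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g - values[t]

random.seed(0)
gamma, n, T = 0.9, 3, 11
rewards = [random.uniform(-1, 1) for _ in range(T)]  # rewards[i] = R_{i+1}
values = [random.uniform(-1, 1) for _ in range(T)]   # fixed estimates V(S_i)

for t in range(T):
    mc_error = sum(gamma ** (i - t) * rewards[i] for i in range(t, T)) - values[t]
    # k is the smallest integer with t + (k+1)n >= T (k = 0 in the terminal case)
    k = max(0, -(-(T - t - n) // n))
    td_sum = sum(gamma ** (p * n) * n_step_error(rewards, values, t + p * n, n, gamma)
                 for p in range(k + 1))
    assert abs(mc_error - td_sum) < 1e-9, (t, mc_error, td_sum)
print("identity verified for all t")
```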
