Reinforcement Learning Exercise 5.5

Exercise 5.5 Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1-p$. Let the reward be $+1$ on all transitions, and let $\gamma = 1$. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?

For the first-visit estimator, only the return following the first visit to the state in each episode is used. The observed episode first visits the nonterminal state at $t = 0$, and the return following that visit is the full episode return:
$$V(S_{\text{nonterminal}}) = G_0 = 10$$
For the every-visit estimator, the returns following every visit to the state are averaged. The state is visited at each step $t = 0, 1, \dots, 9$, and with $\gamma = 1$ and a reward of $+1$ per step, the return following the visit at time $t$ is $G_t = 10 - t$:
$$\begin{aligned} V(S_{\text{nonterminal}}) &= \frac{1}{10}\sum_{t=0}^{9} G_t \\ &= \frac{10 + 9 + \cdots + 1}{10} \\ &= \frac{55}{10} \\ &= 5.5 \end{aligned}$$
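To double-check the arithmetic, here is a minimal Python sketch (the helper name `return_from` is just for illustration) that computes both Monte Carlo estimates directly from the observed episode of 10 steps with reward $+1$ per transition:

```python
# Minimal sketch: first-visit and every-visit Monte Carlo estimates
# for the single nonterminal state, from one observed 10-step episode
# with reward +1 on every transition and gamma = 1.

rewards = [1] * 10  # the observed episode: 10 transitions, each with reward +1
gamma = 1.0

# Return G_t following the visit at time t (the state is visited at t = 0..9).
def return_from(t):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

returns = [return_from(t) for t in range(len(rewards))]  # [10, 9, ..., 1]

first_visit_estimate = returns[0]                   # only the first visit counts
every_visit_estimate = sum(returns) / len(returns)  # average over all visits

print(first_visit_estimate, every_visit_estimate)   # 10.0 5.5
```

Running the sketch prints 10.0 for the first-visit estimate and 5.5 for the every-visit estimate, matching the derivation above.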
