Reinforcement Learning Exercise 3.18

Exercise 3.18 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:
*(Backup diagram: the state $s$ at the root, with the actions $a$ available in $s$ as leaf nodes.)*
Give the equation corresponding to this intuition and diagram for the value at the root node, $v_\pi(s)$, in terms of the value at the expected leaf node, $q_\pi(s, a)$, given $S_t = s$. This equation should include an expectation conditioned on following the policy, $\pi$. Then give a second equation in which the expected value is written out explicitly in terms of $\pi(a \mid s)$ such that no expected value notation appears in the equation.
By the law of total expectation, conditioning on the action taken at time $t$,

$$
\begin{aligned}
v_\pi(s) &= \mathbb E_\pi \left[ G_t \mid S_t = s \right] \\
&= \sum_{a \in \mathcal A} \mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right] P(A_t = a \mid S_t = s).
\end{aligned}
$$

Since $P(A_t = a \mid S_t = s) = \pi(a \mid s)$, this becomes

$$
v_\pi(s) = \sum_{a \in \mathcal A} \mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right] \pi(a \mid s).
$$
By the definition of the action-value function,
$$
\mathbb E_\pi \left[ G_t \mid S_t = s, A_t = a \right] = q_\pi(s, a),
$$
so

$$
v_\pi(s) = \sum_{a \in \mathcal A} q_\pi(s, a)\, \pi(a \mid s).
$$

Keeping the expectation notation explicit, the same relationship reads $v_\pi(s) = \mathbb E_\pi \left[ q_\pi(s, A_t) \mid S_t = s \right]$, which is the first equation the exercise asks for; the sum over $\pi(a \mid s)$ above is the second, with no expected value notation remaining.
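As a quick numerical sanity check, here is a minimal Python sketch (the action values in `q` and the policy probabilities in `policy` are made-up toy numbers, not from the book) that computes $v_\pi(s)$ both as the explicit sum $\sum_a \pi(a \mid s)\, q_\pi(s, a)$ and as an empirical average of $q_\pi(s, A_t)$ with $A_t \sim \pi(\cdot \mid s)$, and confirms the two agree.

```python
import random

# Toy action-value estimates q_pi(s, a) for one state s with three actions.
# These numbers are purely illustrative.
q = {"left": 1.0, "stay": 0.5, "right": -2.0}

# A made-up stochastic policy pi(a | s) over the same three actions.
policy = {"left": 0.2, "stay": 0.5, "right": 0.3}

# Exact value: v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
v_exact = sum(policy[a] * q[a] for a in policy)

# Monte Carlo estimate of the same expectation:
# sample A_t ~ pi(. | s) many times and average q_pi(s, A_t).
random.seed(0)
actions, probs = zip(*policy.items())
samples = random.choices(actions, weights=probs, k=100_000)
v_mc = sum(q[a] for a in samples) / len(samples)

print(f"exact   v_pi(s) = {v_exact:.4f}")  # 0.2*1.0 + 0.5*0.5 + 0.3*(-2.0) = -0.15
print(f"sampled v_pi(s) ~ {v_mc:.4f}")     # should be close to -0.15
```

The two printed values should match up to sampling noise, illustrating that the weighted sum over $\pi(a \mid s)$ and the expectation over actions drawn from the policy are the same quantity.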
