Chapter 2: Markov Decision Processes

2.1 Markov Decision Processes (Part 1)

Markov Decision Process (MDP)

[Figure 1]

  1. Markov Decision Processes can model many real-world problems. They formally describe the framework of reinforcement learning
  2. Under MDP, the environment is fully observable.
    1. Optimal control primarily deals with continuous MDPs
    2. Partially observable problems can be converted into MDPs

Markov Property

  1. The history of states: $h_{t}=\{ s_{1},s_{2},s_{3},...,s_{t} \}$

  2. State $s_{t}$ is Markovian if and only if:

    $p(s_{t+1}|s_{t})=p(s_{t+1}|h_{t})$

    $p(s_{t+1}|s_{t},a_{t})=p(s_{t+1}|h_{t},a_{t})$

  3. “The future is independent of the past given the present”

Markov Process/Markov Chain

[Figure 2]

  1. The state transition matrix $P$ specifies $p(s_{t+1}=s'|s_{t}=s)$:
    $P=\begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \cdots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \cdots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots \\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \cdots & P(s_{N}|s_{N}) \end{bmatrix}$

Example of MP

[Figure 3]

  1. Sample episodes starting from $s_{3}$ (a small sampling sketch follows after this list):
    1. $s_{3},s_{4},s_{5},s_{6},s_{6}$
    2. $s_{3},s_{2},s_{3},s_{2},s_{1}$
    3. $s_{3},s_{4},s_{4},s_{5},s_{5}$
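To make the sampling concrete, here is a minimal Python sketch that draws episodes from a Markov chain given its transition matrix. The 7-state matrix below is an illustrative assumption, not the exact numbers from the figure.

```python
import numpy as np

np.random.seed(0)

n_states = 7
# Assumed 7-state transition matrix (each row sums to 1)
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
])

def sample_episode(P, start, length):
    """Sample a state trajectory of the given length from the chain."""
    states = [start]
    for _ in range(length - 1):
        states.append(np.random.choice(len(P), p=P[states[-1]]))
    return states

# Three sample episodes starting from s3 (index 2 with 0-based indexing)
for _ in range(3):
    print([f"s{s + 1}" for s in sample_episode(P, start=2, length=5)])
```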

Markov Reward Process (MRP)

  1. A Markov Reward Process is a Markov chain plus a reward function
  2. Definition of Markov Reward Process (MRP)
    1. $S$ is a (finite) set of states ($s \in S$)
    2. $P$ is the dynamics/transition model that specifies $P(s_{t+1}=s'|s_{t}=s)$
    3. $R$ is a reward function $R(s_{t}=s)=E[r_{t}|s_{t}=s]$
    4. Discount factor $\gamma \in [0,1]$
  3. If there is a finite number of states, $R$ can be represented as a vector

Example of MRP

[Figure 4]

Reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$.

Return and Value function

  1. Definition of Horizon

    1. Maximum number of time steps in each episode
    2. Can be infinite; otherwise the process is called a finite Markov (reward) process
  2. Definition of Return

    1. Discounted sum of rewards from time step $t$ to the horizon
      $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}$
  3. Definition of the state value function $V_{t}(s)$ for an MRP

    1. Expected return from $t$ in state $s$
      $V_{t}(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}|s_{t}=s]$

    2. Present value of future rewards

Why Discount Factor γ

  1. Avoid infinite returns in cyclic Markov processes
  2. Uncertainty about the future may not be fully represented
  3. If the reward is financial, immediate rewards may earn more interest than delayed rewards
  4. Animal/human behaviour shows preference for immediate reward
  5. It is sometimes possible to use undiscounted Markov reward processes (i.e. $\gamma = 1$), e.g. if all sequences terminate.
    1. $\gamma = 0$: only care about the immediate reward
    2. $\gamma = 1$: future rewards are weighted the same as the immediate reward

Example of MRP

[Figure 5]

  1. Reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$
  2. Sample returns $G$ for 4-step episodes with $\gamma = 1/2$
    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0$
  3. How do we compute the value function, e.g., the value of state $s_{4}$, $V(s_{4})$?

Compute the Value of a Markov Reward Process

  1. Value function: expected return from starting in state $s$
    $V(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+...+\gamma^{T-t-1}R_{T}|s_{t}=s]$

  2. The MRP value function satisfies the following Bellman equation:
    $V(s)=\underbrace{R(s)}_{\text{Immediate reward}}+\underbrace{\gamma \sum_{s'\in S}P(s'|s)V(s')}_{\text{Discounted sum of future rewards}}$

  3. Practice: derive the Bellman equation for $V(s)$

    1. Hint: $V(s)=E[R_{t+1}+\gamma E[R_{t+2}+\gamma R_{t+3}+...]|s_{t}=s]$

Understanding Bellman equation

  1. The Bellman equation describes the recursive relation between the value of a state and the values of its successor states
    $V(s)=R(s)+\gamma \sum_{s'\in S}P(s'|s)V(s')$

[Figure 6]

Matrix Form of Bellman Equation for MRP

Therefore, we can express the value function in matrix form:

$\begin{bmatrix} V(s_{1})\\ V(s_{2})\\ \vdots \\ V(s_{N}) \end{bmatrix}=\begin{bmatrix} R(s_{1})\\ R(s_{2})\\ \vdots \\ R(s_{N}) \end{bmatrix}+\gamma \begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \cdots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \cdots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots \\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \cdots & P(s_{N}|s_{N}) \end{bmatrix}\begin{bmatrix} V(s_{1})\\ V(s_{2})\\ \vdots \\ V(s_{N}) \end{bmatrix}$

i.e., $V = R + \gamma P V$.

  1. Analytic solution for the value of an MRP: $V=(I-\gamma P)^{-1}R$ (a NumPy sketch follows below)
    1. The matrix inverse takes $O(N^{3})$ time for $N$ states
    2. Only feasible for small MRPs
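A minimal NumPy sketch of the analytic solution, reusing the reward vector $R = [5, 0, 0, 0, 0, 0, 10]$ from the example; the transition matrix `P` below is an assumed stand-in for the one in the figure.

```python
import numpy as np

# Assumed 7-state transition matrix (same illustrative chain as in the sampling sketch)
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
])
R = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])  # reward vector from the example
gamma = 0.5

# Solve (I - γP) V = R directly instead of forming the inverse explicitly;
# still O(N^3), but numerically more stable than computing the inverse.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(np.round(V, 3))
```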

Iterative Algorithm for Computing the Value of an MRP

  1. Iterative methods for large MRPs:
    1. Dynamic Programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning

Monte Carlo Algorithm for Computing the Value of an MRP

Algorithm 1 Monte Carlo simulation to calculate MRP value function

  1. $i\leftarrow 0,\; G_{t}\leftarrow 0$

  2. while $i \neq N$ do

  3. generate an episode, starting from state $s$ and time $t$

  4. using the generated episode, calculate the return $g=\sum_{i=t}^{H-1}\gamma^{i-t}r_{i}$

  5. $G_{t}\leftarrow G_{t}+g,\; i\leftarrow i+1$

  6. end while

  7. $V_{t}(s)\leftarrow G_{t}/N$

  8. For example: to calculate $V(s_{4})$ we can generate many trajectories and then take the average of the returns (a code sketch follows after this list):

    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0$
    4. more trajectories
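A minimal Python sketch of the Monte Carlo estimate above; it assumes `P` and `R` are the NumPy arrays from the earlier analytic-solution sketch, and the episode length and number of episodes are illustrative choices.

```python
import numpy as np

def mc_value(P, R, start, gamma=0.5, horizon=4, n_episodes=10000, seed=0):
    """Estimate V(start) by averaging the returns of sampled episodes."""
    rng = np.random.default_rng(seed)
    total_return = 0.0
    for _ in range(n_episodes):
        s, g = start, 0.0
        for step in range(horizon):
            g += (gamma ** step) * R[s]      # accumulate discounted reward of visited state
            s = rng.choice(len(P), p=P[s])   # transition to the next state
        total_return += g
    return total_return / n_episodes         # V_t(s) ≈ G_t / N

# Estimate V(s4) (index 3) as the average return over many sampled trajectories:
# print(mc_value(P, R, start=3))
```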

Iterative Algorithm for Computing the Value of an MRP

Algorithm 1 Iterative Algorithm to calculate MRP value function

  1. for all states $s \in S$: $V'(s)\leftarrow 0$, $V(s)\leftarrow \infty$
  2. while $||V-V'|| > \epsilon$ do
  3. $V \leftarrow V'$
  4. for all states $s \in S$: $V'(s)=R(s)+\gamma \sum_{s'\in S}P(s'|s)V(s')$
  5. end while
  6. return $V'(s)$ for all $s \in S$
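A minimal NumPy sketch of this iterative backup; as before, `P` and `R` are assumed to be the arrays from the analytic-solution sketch.

```python
import numpy as np

def mrp_value_iterative(P, R, gamma=0.5, eps=1e-6):
    """Iteratively apply V'(s) = R(s) + γ Σ_{s'} P(s'|s) V(s') until convergence."""
    V_new = np.zeros(len(R))
    V = np.full(len(R), np.inf)
    while np.max(np.abs(V - V_new)) > eps:
        V = V_new.copy()
        V_new = R + gamma * P @ V   # Bellman backup for all states at once
    return V_new

# print(np.round(mrp_value_iterative(P, R), 3))
```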

Markov Decision Process (MDP)

  1. Markov Decision Process is Markov Reward Process with decisions.

  2. Definition of MDP

    1. $S$ is a finite set of states

    2. $A$ is a finite set of actions

    3. $P^{a}$ is the dynamics/transition model for each action

      $P(s_{t+1}=s'|s_{t}=s,a_{t}=a)$

    4. $R$ is a reward function $R(s_{t}=s,a_{t}=a)=E[r_{t}|s_{t}=s,a_{t}=a]$

    5. Discount factor $\gamma \in [0,1]$

  3. An MDP is a tuple: $(S,A,P,R,\gamma)$

Policy in MDP

  1. A policy specifies what action to take in each state

  2. Given a state, the policy specifies a distribution over actions

  3. Policy: $\pi(a|s)=P(a_{t}=a|s_{t}=s)$

  4. Policies are stationary (time-independent), $A_{t}\sim \pi(a|s)$ for any $t > 0$

  5. Given an MDP $(S,A,P,R,\gamma)$ and a policy $\pi$

  6. The state sequence $S_{1},S_{2},...$ is a Markov process $(S,P^{\pi})$

  7. The state and reward sequence $S_{1},R_{1},S_{2},R_{2},...$ is a Markov reward process $(S,P^{\pi},R^{\pi},\gamma)$ where (see the sketch below)
    $P^{\pi}(s'|s)=\sum_{a\in A}\pi(a|s)P(s'|s,a)$

$R^{\pi}(s)=\sum_{a\in A}\pi(a|s)R(s,a)$
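A minimal sketch of how a policy induces an MRP from an MDP. The array shapes — `P_a` of shape (|A|, |S|, |S|), `R_a` of shape (|S|, |A|), and `pi` of shape (|S|, |A|) — are assumed conventions for this sketch, not something fixed by the lecture.

```python
import numpy as np

def induced_mrp(P_a, R_a, pi):
    """Return (P^π, R^π) of the MRP induced by policy pi on the MDP (P_a, R_a)."""
    # P^π(s'|s) = Σ_a π(a|s) P(s'|s,a)
    P_pi = np.einsum("sa,ast->st", pi, P_a)
    # R^π(s) = Σ_a π(a|s) R(s,a)
    R_pi = np.einsum("sa,sa->s", pi, R_a)
    return P_pi, R_pi
```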

Comparison of MP/MRP and MDP

[Figure 8]

Value function for MDP

  1. The state-value function $v^{\pi}(s)$ of an MDP is the expected return starting from state $s$ and following policy $\pi$
    $v^{\pi}(s)=E_{\pi}[G_{t}|s_{t}=s]$

  2. The action-value function $q^{\pi}(s,a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$
    $q^{\pi}(s,a)=E_{\pi}[G_{t}|s_{t}=s,A_{t}=a]$

  3. We have the following relation between $v^{\pi}(s)$ and $q^{\pi}(s,a)$:
    $v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

Bellman Expectation Equation

  1. The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,
    $v^{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v^{\pi}(s_{t+1})|s_{t}=s]$

  2. The action-value function can similarly be decomposed,
    $q^{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q^{\pi}(s_{t+1},A_{t+1})|s_{t}=s,A_{t}=a]$

Bellman Expectation Equation for $V^{\pi}$ and $Q^{\pi}$

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')$

Thus
$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Backup Diagram for $V^{\pi}$

[Figure 9]

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

Backup Diagram for $Q^{\pi}$

[Figure 10]

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Policy Evaluation

  1. Evaluate the value of each state under a given policy $\pi$: compute $v^{\pi}(s)$
  2. Also called (value) prediction

Example: Navigate the boat

[Figure 11]

Example: Policy Evaluation

[Figure 12]

  1. Two actions: Left and Right

  2. For all actions, reward: +5 in $s_{1}$, +10 in $s_{7}$, and 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$

  3. Let's take a deterministic policy $\pi(s)=$ Left and $\gamma=0$ for every state $s$; then what is the value of the policy?

    1. $v^{\pi}= [5, 0, 0, 0, 0, 0, 10]$
  4. Iteration: $v_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$

  5. $R = [5, 0, 0, 0, 0, 0, 10]$

  6. Practice 1: deterministic policy $\pi(s)=$ Left and $\gamma=0.5$ for every state $s$; what are the state values under the policy?

  7. Practice 2: stochastic policy $P(\pi(s)=\text{Left})=0.5$, $P(\pi(s)=\text{Right})=0.5$ and $\gamma=0.5$ for every state $s$; what are the state values under the policy? (A code sketch for both practices follows below.)

  8. Iteration: $v_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$
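A minimal sketch for Practice 1 and Practice 2. The Left/Right dynamics (deterministic one-step moves, staying in place at the two ends) are an assumption about the chain in the figure, so treat the printed numbers as illustrative.

```python
import numpy as np

n = 7
R = np.array([5.0, 0, 0, 0, 0, 0, 10.0])   # reward depends only on the state

# P_a[a, s, s']: assumed transition model for a = 0 (Left) and a = 1 (Right)
P_a = np.zeros((2, n, n))
for s in range(n):
    P_a[0, s, max(s - 1, 0)] = 1.0          # Left
    P_a[1, s, min(s + 1, n - 1)] = 1.0      # Right

def evaluate(pi, gamma=0.5, eps=1e-8):
    """Iterative policy evaluation; pi has shape (|S|, |A|)."""
    P_pi = np.einsum("sa,ast->st", pi, P_a)  # P^π(s'|s)
    v = np.zeros(n)
    while True:
        v_new = R + gamma * P_pi @ v         # v_k(s) = r(s) + γ Σ_{s'} P^π(s'|s) v_{k-1}(s')
        if np.max(np.abs(v_new - v)) < eps:
            return v_new
        v = v_new

pi_left = np.tile([1.0, 0.0], (n, 1))        # deterministic Left policy
pi_rand = np.full((n, 2), 0.5)               # uniform Left/Right policy
print(np.round(evaluate(pi_left), 3))        # Practice 1
print(np.round(evaluate(pi_rand), 3))        # Practice 2
```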

2.2 Markov Decision Processes (Part 2)

Decision Making in Markov Decision Processes (MDP)

  1. Prediction (evaluate a given policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$ and policy $\pi$, or MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$
    2. Output: value function $v^{\pi}$
  2. Control (search for the optimal policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$
    2. Output: optimal value function $v^{*}$ and optimal policy $\pi^{*}$
  3. Prediction and control in an MDP can be solved by dynamic programming.

Dynamic programming

Dynamic programming is a very general solution method for problems which have two properties:

  1. Optimal substructure
    1. Principle of optimality applies
    2. Optimal solution can be decomposed into subproblems
  2. Overlapping subproblems
    1. Subproblems recur many times
    2. Solutions can be cached and reused

Markov decision processes satisfy both properties

  1. Bellman equation gives recursive decomposition
  2. Value function stores and reuses solutions

Policy evaluation on MDP

  1. Objective: evaluate a given policy $\pi$ for an MDP

  2. Output: the value function under the policy, $v^{\pi}$

  3. Solution: iteration on the Bellman expectation backup

  4. Algorithm: synchronous backup

    1. At each iteration $t+1$,

      update $v_{t+1}(s)$ from $v_{t}(s')$ for all states $s\in S$, where $s'$ is a successor state of $s$
      $v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$

  5. Convergence: $v_{1}\rightarrow v_{2}\rightarrow ...\rightarrow v^{\pi}$

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy:
$v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$
Or, in the form of the induced MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$:
$v_{t+1}(s)=R^{\pi}(s)+\gamma\sum_{s'\in S}P^{\pi}(s'|s)v_{t}(s')$

Evaluating a Random Policy in the Small Gridworld

Example 4.1 in the Sutton RL textbook

[Figure 13]

  1. Undiscounted episodic MDP ($\gamma=1$)
  2. Nonterminal states 1, …, 14
  3. Two terminal states (the two shaded squares)
  4. Actions leading out of the grid leave the state unchanged, e.g., $P(7|7,\text{right})=1$
  5. Reward is -1 until a terminal state is reached
  6. Transitions are deterministic given the action, e.g., $P(6|5,\text{right})=1$
  7. Uniform random policy $\pi(l|\cdot)=\pi(r|\cdot)=\pi(u|\cdot)=\pi(d|\cdot)=0.25$ (a code sketch follows after this list)
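A minimal sketch of this evaluation for the 4x4 gridworld of Example 4.1 (uniform random policy, γ = 1, reward -1 per step, two terminal corners); the grid layout follows the textbook description above.

```python
import numpy as np

rows, cols = 4, 4
terminals = {(0, 0), (3, 3)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic move; actions leading off the grid leave the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < rows and 0 <= c < cols else state

V = np.zeros((rows, cols))
while True:
    V_new = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            if (r, c) in terminals:
                continue
            # Bellman expectation backup under π(a|s) = 0.25 for every action
            V_new[r, c] = sum(0.25 * (-1.0 + V[step((r, c), a)]) for a in actions)
    if np.max(np.abs(V_new - V)) < 1e-4:
        break
    V = V_new
print(np.round(V, 1))  # approaches the values shown in the textbook figure
```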

A live demo on policy evaluation

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

  1. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Optimal Value Function

  1. The optimal state-value function $v^{*}(s)$ is the maximum value function over all policies
    $v^{*}(s)=\max_{\pi}\, v^{\pi}(s)$

  2. The optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}\, v^{\pi}(s)$

  3. An MDP is "solved" when we know the optimal value function

  4. There exists a unique optimal value function, but there can be multiple optimal policies (e.g., when two actions have the same optimal value)

Finding Optimal Policy

  1. An optimal policy can be found by maximizing over $q^{*}(s,a)$,
    $\pi^{*}(a|s)=\begin{cases} 1, & \text{if } a=\arg\max_{a\in A}\, q^{*}(s,a) \\ 0, & \text{otherwise} \end{cases}$

  2. There is always a deterministic optimal policy for any MDP

  3. If we know q ∗ ( s , a ) q^{*}(s,a) q(s,a), we immediately have the optimal policy

Policy Search

  1. One option is to enumerate all deterministic policies and search for the best one
  2. The number of deterministic policies is $|A|^{|S|}$
  3. Other approaches such as policy iteration and value iteration are more efficient

MDP Control

  1. Compute the optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}\, v^{\pi}(s)$

  2. The optimal policy for an MDP in an infinite-horizon problem (agent acts forever) is

    1. Deterministic
    2. Stationary (does not depend on the time step)
    3. Unique? Not necessarily; there may be state-actions with identical optimal values

Improving a Policy through Policy Iteration

  1. Iterate through the two steps:

    1. Evaluate the policy $\pi$ (compute $v^{\pi}$ given the current $\pi$)

    2. Improve the policy by acting greedily with respect to $v^{\pi}$
      $\pi'=\text{greedy}(v^{\pi})$

[Figure 14]

Policy Improvement

  1. Compute the state-action value of a policy $\pi_{i}$:
    $q^{\pi_{i}}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi_{i}}(s')$

  2. Compute the new policy $\pi_{i+1}$ for all $s\in S$ following (see the sketch after the figure below)
    $\pi_{i+1}(s)=\arg\max_{a}\, q^{\pi_{i}}(s,a)$

[Figure 15]
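A minimal sketch of the full policy-iteration loop built from these two steps. The array shapes — `P` of shape (|A|, |S|, |S|) and `R` of shape (|S|, |A|) — follow the same assumed conventions as the earlier sketches, and the evaluation step solves the linear system exactly rather than iterating.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate exact policy evaluation and greedy policy improvement."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)        # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve v = R^π + γ P^π v for the current policy
        P_pi = P[pi, np.arange(n_states)]     # P^π(s'|s)
        R_pi = R[np.arange(n_states), pi]     # R^π(s)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q^π
        q = R.T + gamma * P @ v               # q[a, s] = R(s,a) + γ Σ_{s'} P(s'|s,a) v(s')
        pi_new = np.argmax(q, axis=0)
        if np.array_equal(pi_new, pi):        # policy stable -> optimal
            return v, pi
        pi = pi_new
```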

Monotonic Improvement in Policy

  1. Consider a deterministic policy $a=\pi(s)$

  2. We improve the policy through
    $\pi'(s)=\arg\max_{a}\, q^{\pi}(s,a)$

  3. This improves the value from any state $s$ over one step,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}\, q^{\pi}(s,a)\geq q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  4. It therefore improves the value function, $v^{\pi'}(s)\geq v^{\pi}(s)$:

    $v^{\pi}(s)\leq q^{\pi}(s,\pi'(s))=E_{\pi'}[R_{t+1}+\gamma v^{\pi}(S_{t+1})|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma q^{\pi}(S_{t+1},\pi'(S_{t+1}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}q^{\pi}(S_{t+2},\pi'(S_{t+2}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+...|S_{t}=s]=v^{\pi'}(s)$

  5. If improvements stop,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}\, q^{\pi}(s,a)=q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  6. then the Bellman optimality equation has been satisfied,

    $v^{\pi}(s)=\max_{a\in A}\, q^{\pi}(s,a)$

  7. Therefore $v^{\pi}(s)=v^{*}(s)$ for all $s\in S$, so $\pi$ is an optimal policy

Bellman Optimality Equation

1. The optimal value functions satisfy the Bellman optimality equations:
$v^{*}(s)=\max_{a}\, q^{*}(s,a)$

$q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')$

thus
$v^{*}(s)=\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

$q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\,\max_{a'}\, q^{*}(s',a')$

Value Iteration: turning the Bellman Optimality Equation into an update rule

1. If we know the solution to the subproblems $v^{*}(s')$, which is optimal,

2. then the optimal $v^{*}(s)$ can be found by iterating over the following Bellman optimality backup rule,
$v(s)\leftarrow\max_{a\in A}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

3. The idea of value iteration is to apply these updates iteratively

Algorithm of Value Iteration

  1. Objective: find the optimal policy π \pi π

  2. Solution: iteration on the Bellman optimality backup

  3. Value iteration algorithm (a code sketch follows after this list):

    1. initialize $k=1$ and $v_{0}(s)=0$ for all states $s$

    2. for $k=1:H$

      1. for each state $s$:
        $q_{k+1}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{k}(s')$

        $v_{k+1}(s)=\max_{a}\, q_{k+1}(s,a)$

      2. $k\leftarrow k+1$

    3. To retrieve the optimal policy after value iteration:
      $\pi(s)=\arg\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{k+1}(s')\right)$
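A minimal NumPy sketch of this algorithm; it stops on a tolerance rather than a fixed horizon `H`, which is an implementation choice, and the array conventions for `P` and `R` are the same assumptions as in the earlier sketches.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Value iteration: repeatedly apply the Bellman optimality backup, then extract a policy."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                    # v_0(s) = 0 for all states
    while True:
        q = R.T + gamma * P @ v               # q_{k+1}(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) v_k(s')
        v_new = np.max(q, axis=0)             # v_{k+1}(s) = max_a q_{k+1}(s,a)
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    # Policy extraction: π(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s'|s,a) v(s') ]
    pi = np.argmax(R.T + gamma * P @ v_new, axis=0)
    return v_new, pi
```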

Example: Shortest Path

[Figure 16]

After the optimal values are reached, we run policy extraction to retrieve the optimal policy.

Demo of policy iteration and value iteration

[Figure 17]

1. Policy iteration: iteration of policy evaluation and policy improvement (update)

2. Value iteration

3. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Policy iteration and value iteration on FrozenLake

https://github.com/cuhkrlcourse/RLexample/tree/master/MDP

Difference between Policy Iteration and Value Iteration

1. Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.

2. Value iteration consists of finding the optimal value function + one policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e., it has converged).

3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of $v(s)$ after just one sweep of all states, regardless of convergence).

Summary for Prediction and Control in MDP

[Figure 18]

End

Optional Homework 1 is available at https://github.com/cuhkrlcourse/ierg6130-assignment
