Policy Gradient

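This chapter covers policy-based methods.
Problems addressed:
The reinforcement learning algorithms covered so far are value-based: they choose actions according to Q values and V values. They have the following drawbacks.
First, they handle continuous action spaces poorly.
Second, they handle problems with restricted state observations poorly: two states that differ in the real environment can end up with the same feature description after modeling.
Third, they cannot represent stochastic policies. The optimal policy produced by a value-based method is usually deterministic, because it picks the action with the largest action value. For some problems the optimal policy is stochastic, and value-based learning cannot find it.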

Theoretical Derivation

Next, we use a neural network to approximate the policy $\pi$ and the action-value function $q$:
$$\hat{q}(s,a,w) \approx q_{\pi}(s,a)$$
$$\pi_{\theta}(s,a) = P(a|s,\theta)\approx \pi(a|s)$$
One way to write $\pi_{\theta}(s,a)$ is as a softmax over linear scores, where $\phi(s,a)$ is the feature vector of the state-action pair:
$$\pi_{\theta}(s,a) = \frac{e^{\phi(s,a)^T\theta}}{\sum\limits_b e^{\phi(s,b)^T\theta}}$$
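The softmax policy above can be sketched in a few lines of plain Python. The feature vectors and the parameter values below are made-up numbers for illustration only, and `softmax_policy` is a hypothetical helper name, not something from the original code.

```python
import math

def softmax_policy(phi, theta):
    """Return pi_theta(s, .) given one feature vector phi(s, a) per action."""
    # Linear score phi(s, a)^T theta for every action.
    scores = [sum(f * t for f, t in zip(feat, theta)) for feat in phi]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical features for a single state with three actions.
phi_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.2]
probs = softmax_policy(phi_s, theta)
print(probs)
```

The action with the largest score gets the largest probability, but every action keeps a nonzero probability, which is what makes the policy stochastic.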
Given an objective function $J(\theta)$ to optimize, the gradient with respect to $\theta$ can always be written as
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log \pi_{\theta}(s,a)\, Q_{\pi}(s,a)]$$
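For a softmax policy with linear features, the score function $\nabla_{\theta}\log \pi_{\theta}(s,a)$ has the closed form $\phi(s,a) - \sum_b \pi_{\theta}(s,b)\phi(s,b)$. This is a standard result that the text does not state, so the sketch below checks it against central finite differences. The features and parameters are hypothetical values.

```python
import math

def softmax_policy(phi, theta):
    scores = [sum(f * t for f, t in zip(feat, theta)) for feat in phi]
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    z = sum(exps)
    return [e / z for e in exps]

def score(phi, theta, a):
    """Closed form: phi(s, a) minus the policy-weighted mean feature."""
    probs = softmax_policy(phi, theta)
    dim = len(theta)
    mean_feat = [sum(p * feat[i] for p, feat in zip(probs, phi)) for i in range(dim)]
    return [phi[a][i] - mean_feat[i] for i in range(dim)]

def numeric_score(phi, theta, a, eps=1e-6):
    """Central finite differences of log pi_theta(s, a)."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        diff = math.log(softmax_policy(phi, up)[a]) - math.log(softmax_policy(phi, down)[a])
        grad.append(diff / (2 * eps))
    return grad

phi_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical features
theta = [0.5, -0.2]                            # hypothetical parameters
closed = score(phi_s, theta, a=2)
numeric = numeric_score(phi_s, theta, a=2)
print(closed, numeric)
```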

There are two possible optimization objectives.
First objective: the average reward per time step
$$\rho(\pi) = \lim\limits_{n\to\infty}\frac{1}{n}\mathbb{E}\{r_1+r_2+\dots+r_n|\pi\}=\sum\limits_s d^\pi(s)\sum\limits_a\pi(s,a)R_s^a$$
$$d^{\pi}(s) = \lim\limits_{t\to\infty}Pr\{s_t=s|s_0,\pi\}$$
Here $d^{\pi}(s)$ is the stationary distribution of the Markov chain under policy $\pi$.
In this case the state-action value $Q$ is defined as:
$$Q^{\pi}(s,a)=\sum\limits_{t=1}^{\infty}\mathbb{E}\{r_t-\rho(\pi)|s_0=s,a_0=a,\pi\},\quad \forall s\in S, a\in A$$
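As a concrete illustration of $d^\pi$ and $\rho(\pi)$, the sketch below finds the stationary distribution of a two-state chain by repeatedly applying the transition matrix, then computes the average reward. The matrix `P_pi` (transitions already averaged over the policy) and the vector `r_pi` (the expected one-step reward $\sum_a\pi(s,a)R_s^a$ in each state) are invented numbers, not taken from any environment in this article.

```python
def stationary_distribution(P, iters=1000):
    """Power iteration: push a distribution through the chain until it settles."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
    return d

# Hypothetical state-to-state transition matrix under a fixed policy pi.
P_pi = [[0.9, 0.1],
        [0.5, 0.5]]
# Hypothetical expected one-step reward in each state under pi.
r_pi = [1.0, 0.0]

d = stationary_distribution(P_pi)
rho = sum(d[s] * r_pi[s] for s in range(len(d)))
print(d, rho)
```

For this chain the balance condition $0.1\,d(0) = 0.5\,d(1)$ gives $d = (5/6, 1/6)$, so $\rho = 5/6$.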

Second objective: the expected return from the start state
$$\rho(\pi) =V^\pi(s_0)=\mathbb{E}\{\sum\limits_{t=1}^{\infty}\gamma^{t-1}r_t|s_0,\pi\}$$
The $Q$ value is defined as:
$$Q^{\pi}(s,a) = \mathbb{E}\{\sum\limits_{k=1}^{\infty}\gamma^{k-1}r_{t+k}|s_t=s,a_t=a,\pi\}$$
Here $\gamma\in[0,1]$, and we define $d^\pi(s)=\sum\limits_{t=0}^{\infty}\gamma^t Pr\{s_t=s|s_0,\pi\}$
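The discounted weighting $d^\pi(s)$ can be approximated by truncating the infinite sum. The sketch below assumes a small hypothetical chain `P_pi` (transitions under a fixed policy) and $\gamma = 0.9$. Note that this $d^\pi$ is not a probability distribution: its entries sum to $1/(1-\gamma)$.

```python
def discounted_visitation(P, s0, gamma, steps=2000):
    """Approximate d(s) = sum_t gamma^t Pr(s_t = s | s_0) by truncating the sum."""
    n = len(P)
    dist = [1.0 if s == s0 else 0.0 for s in range(n)]  # Pr(s_t = . | s_0) at t = 0
    d = [0.0] * n
    weight = 1.0  # gamma^t
    for _ in range(steps):
        d = [d[s] + weight * dist[s] for s in range(n)]
        dist = [sum(dist[s] * P[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= gamma
    return d

# Hypothetical transition matrix under a fixed policy.
P_pi = [[0.9, 0.1],
        [0.5, 0.5]]
d = discounted_visitation(P_pi, s0=0, gamma=0.9)
print(d, sum(d))
```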
Whichever form is used, the gradient reduces to the same expression:
$$\frac{\partial \rho}{\partial \theta}=\sum_s d^\pi(s)\sum_a\frac{\partial\pi(s,a)}{\partial\theta}Q^\pi(s,a)$$
The proof follows.
proof:
We start with the average-reward case.
$$\begin{aligned}
\frac{\partial V^\pi(s)}{\partial \theta}&=\frac{\partial}{\partial \theta}\sum\limits_{a}\pi(s,a)Q^\pi(s,a) \quad \forall s\in S \\
&=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^\pi(s,a)\right] \\
&=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\left[R_s^a-\rho(\pi)+\sum\limits_{s'}P_{ss'}^aV^\pi(s')\right]\right] \\
&=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\left[-\frac{\partial \rho}{\partial \theta}+\sum\limits_{s'}P_{ss'}^a\frac{\partial V^\pi(s')}{\partial \theta}\right]\right]
\end{aligned}$$
therefore, since $\sum_a\pi(s,a)=1$, solving for $\frac{\partial \rho}{\partial \theta}$ gives
$$\frac{\partial \rho}{\partial \theta}=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\sum\limits_{s'}P_{ss'}^a\frac{\partial V^\pi(s')}{\partial \theta}\right]-\frac{\partial V^\pi(s)}{\partial \theta}$$
Weighting both sides by $d^\pi(s)$ and summing over $s$:
$$\sum\limits_{s}d^{\pi}(s)\frac{\partial \rho}{\partial \theta}=\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\pi(s,a)\sum\limits_{s'}P_{ss'}^a\frac{\partial V^\pi(s')}{\partial \theta}-\sum\limits_{s}d^{\pi}(s)\frac{\partial V^\pi(s)}{\partial \theta}$$
Because $d^\pi$ is stationary, $\sum_s d^\pi(s)\sum_a\pi(s,a)P_{ss'}^a=d^\pi(s')$, so
$$\begin{aligned}
\sum\limits_{s}d^{\pi}(s)\frac{\partial \rho}{\partial \theta}&=\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\sum\limits_{s'}d^{\pi}(s')\frac{\partial V^\pi(s')}{\partial \theta}-\sum\limits_{s}d^{\pi}(s)\frac{\partial V^\pi(s)}{\partial \theta} \\
&=\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)
\end{aligned}$$
Since $\sum_s d^\pi(s)=1$, we get
$$\frac{\partial \rho}{\partial \theta}=\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)$$
This completes the derivation for the average-reward case.

Next we derive the gradient for the start-state objective:
$$\begin{aligned}
\frac{\partial V^\pi(s)}{\partial \theta}&=\frac{\partial}{\partial \theta}\sum\limits_{a}\pi(s,a)Q^\pi(s,a) \quad \forall s\in S \\
&=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^\pi(s,a)\right] \\
&=\sum\limits_{a}\left[\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\left[R_s^a+\sum\limits_{s'} \gamma P_{ss'}^aV^\pi(s')\right]\right]
\end{aligned}$$
Since $R_s^a$ and $P_{ss'}^a$ do not depend on $\theta$, only $V^\pi(s')$ remains to be differentiated. Unrolling that term repeatedly gives
$$\frac{\partial V^\pi(s)}{\partial \theta}=\sum\limits_{x}\sum\limits_{k=0}^{\infty}\gamma^k Pr(s \to x,k,\pi)\sum\limits_{a}\frac{\partial \pi(x,a)}{\partial \theta}Q^\pi(x,a)$$
where $Pr(s \to x,k,\pi)$ is the probability of going from state $s$ to state $x$ in $k$ steps under policy $\pi$. Then
$$\begin{aligned}
\frac{\partial \rho}{\partial \theta}&=\frac{\partial}{\partial \theta}\mathbb{E}\{\sum\limits_{t=1}^{\infty}\gamma^{t-1}r_t|s_0,\pi\}=\frac{\partial}{\partial \theta}V^{\pi}(s_0) \\
&=\sum\limits_{s}\sum\limits_{k=0}^{\infty}\gamma^k Pr(s_0 \to s,k,\pi)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)
\end{aligned}$$
For the start-state objective we defined $d^\pi(s)=\sum\limits_{t=0}^{\infty}\gamma^t Pr\{s_t=s|s_0,\pi\}$,
so
$$\frac{\partial \rho}{\partial \theta}=\sum\limits_{s}d^{\pi}(s)\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)$$
Both cases can be written in the following form, where the expectation is over states weighted by $d^\pi$:
$$\frac{\partial \rho}{\partial \theta}=\mathbb{E}_\pi\left[\sum\limits_{a}\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)\right]$$
For the proof, see the referenced paper.
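The start-state form of the result can be checked numerically on a tiny problem. The sketch below is only a sanity check under stated assumptions: the two-state, two-action MDP (`P`, `R`), the tabular softmax policy, and all helper names are invented for illustration. It compares the formula $\sum_s d^\pi(s)\sum_a \frac{\partial\pi(s,a)}{\partial\theta}Q^\pi(s,a)$ with a finite-difference gradient of $V^\pi(s_0)$.

```python
import math

GAMMA = 0.9
S0 = 0
# Hypothetical MDP: P[s][a][s2] are transition probabilities, R[s][a] expected rewards.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]

def policy(theta):
    """Tabular softmax policy: pi[s][a] from logits theta[s][a]."""
    pi = []
    for logits in theta:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        pi.append([e / z for e in exps])
    return pi

def evaluate(pi, sweeps=2000):
    """Iterative policy evaluation; returns (V, Q) for the discounted objective."""
    V = [0.0, 0.0]
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2))
              for a in range(2)] for s in range(2)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V, Q

def discounted_visitation(pi, steps=2000):
    """d(s) = sum_t gamma^t Pr(s_t = s | s_0 = S0, pi), truncated."""
    dist = [1.0 if s == S0 else 0.0 for s in range(2)]
    d, weight = [0.0, 0.0], 1.0
    for _ in range(steps):
        d = [d[s] + weight * dist[s] for s in range(2)]
        dist = [sum(dist[s] * pi[s][a] * P[s][a][s2]
                    for s in range(2) for a in range(2)) for s2 in range(2)]
        weight *= GAMMA
    return d

def theorem_gradient(theta):
    """Gradient from the formula, one entry per logit theta[s][b]."""
    pi = policy(theta)
    _, Q = evaluate(pi)
    d = discounted_visitation(pi)
    # Tabular softmax: d pi(s,a) / d theta[s][b] = pi(s,a) * (1[a == b] - pi(s,b)).
    return [[d[s] * sum(pi[s][a] * ((1.0 if a == b else 0.0) - pi[s][b]) * Q[s][a]
                        for a in range(2)) for b in range(2)] for s in range(2)]

def numeric_gradient(theta, eps=1e-5):
    """Central finite differences of V(S0) with respect to each logit."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (evaluate(policy(up))[0][S0]
                          - evaluate(policy(down))[0][S0]) / (2 * eps)
    return grad

theta = [[0.2, -0.1], [0.0, 0.4]]
analytic = theorem_gradient(theta)
numeric = numeric_gradient(theta)
print(analytic)
print(numeric)
```

The two printed gradients should agree up to finite-difference error.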

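Algorithm

Input: N complete Monte Carlo episodes, step size $\alpha$
Output: the parameters $\theta$ of the policy function

  1. for each Monte Carlo episode:
    a. Use the Monte Carlo method to compute the return $v_t$ (the sampled state value) at every time step $t$ of the episode
    b. For every time step $t$ of the episode, update the policy parameters $\theta$ by gradient ascent:
    $\theta=\theta+\alpha\nabla_\theta \log\pi_\theta(s_t,a_t)v_t$
  2. Return the policy parameters $\theta$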
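The two inner steps can be sketched without any deep learning library. The example below makes simplifying assumptions that are not in the original: the policy is a softmax over one logit per action and ignores the state, and the single episode (actions and rewards) is written by hand instead of sampled from an environment.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def discounted_returns(rewards, gamma):
    """Step a: v_t = r_t + gamma * r_{t+1} + ..., computed backwards."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_episode(theta, actions, rewards, alpha, gamma):
    """Step b: theta <- theta + alpha * grad log pi(a_t) * v_t for every t."""
    theta = list(theta)
    returns = discounted_returns(rewards, gamma)
    for a_t, v_t in zip(actions, returns):
        probs = softmax(theta)
        for b in range(len(theta)):
            # Gradient of log softmax w.r.t. logit b: 1[b == a_t] - pi(b).
            grad_log = (1.0 if b == a_t else 0.0) - probs[b]
            theta[b] += alpha * grad_log * v_t
    return theta

# Hand-written episode: action 1 earned reward twice, action 0 earned nothing.
theta = reinforce_episode([0.0, 0.0], actions=[1, 1, 0],
                          rewards=[1.0, 1.0, 0.0], alpha=0.1, gamma=0.5)
print(discounted_returns([1.0, 1.0, 1.0], 0.5))  # [1.75, 1.5, 1.0]
print(softmax(theta))
```

After the update the probability of action 1 rises above 0.5, because the steps where it was chosen carried positive returns.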

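Compared with the gradient formula, the algorithm above replaces $Q$ with the Monte Carlo return $v_t$, which is an unbiased estimate of $Q$. The log term is what the code implements with cross_entropy; see the code for details.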
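To see why a cross-entropy loss stands in for the log term, note that cross entropy against a one-hot target equals $-\log\pi_\theta(s,a)$ for the chosen action. The plain-Python sketch below uses hypothetical logits and a hypothetical return `v`. It also shows the gradient of `v * cross_entropy` with respect to the logits, which is the negative of the ascent direction, so minimizing this loss performs the gradient ascent step.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, action):
    """Cross entropy against a one-hot target reduces to -log pi(action)."""
    return -log_softmax(logits)[action]

def loss_gradient(logits, action, v):
    """Gradient of v * cross_entropy w.r.t. the logits: v * (pi - onehot)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [v * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(probs)]

logits = [0.3, -1.2]   # hypothetical network outputs for one state
action, v = 0, 2.0     # hypothetical chosen action and return
loss = v * cross_entropy(logits, action)
grad = loss_gradient(logits, action, v)
print(loss, grad)
```

With a positive return, the gradient entry for the chosen action is negative, so a descent step on the loss raises that action's logit.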

Code

The idea is fairly simple. The PyTorch code follows, adapted from a TensorFlow implementation.

# -*- coding: utf-8 -*-
"""
Created on Sun Dec  8 14:21:25 2019

@author: asus
"""

import gym
import torch
import numpy as np
import torch.nn.functional as F

# Hyper Parameters
GAMMA = 0.95 # discount factor
LEARNING_RATE=0.01

class softmax_network(torch.nn.Module):
    def __init__(self, env):
        super(softmax_network, self).__init__()
        self.state_dim = env.observation_space.shape[0]
        self.action_dim = env.action_space.n
        self.fc1 = torch.nn.Linear(self.state_dim, 20)
        self.fc1.weight.data.normal_(0, 0.6)
        self.fc2 = torch.nn.Linear(20, self.action_dim)
        self.fc2.weight.data.normal_(0, 0.6)  # initialize the output layer weights
        
    def create_softmax_network(self, state_input):
        self.h_layer = F.relu(self.fc1(state_input))
        self.softmax_input = self.fc2(self.h_layer)
        all_act_prob = F.softmax(self.softmax_input, dim=1)  # softmax over the action dimension
        return all_act_prob
    
    def forward(self, state_input, acts, vt):
        self.h_layer = F.relu(self.fc1(state_input))
        self.softmax_input = self.fc2(self.h_layer)
#        print(self.softmax_input)
        # per-sample cross entropy equals -log pi(a_t | s_t) for the chosen action
        neg_log_prob = F.cross_entropy(self.softmax_input, acts, reduction='none')
        
#        print("vt:", vt)
#        print("neg_log_prob:", neg_log_prob)
        loss = (neg_log_prob * vt).sum()
        return loss
        
class Policy_Gradient():
    def __init__(self, env):
        # init some parameters
        self.time_step = 0
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        self.model = softmax_network(env)
        self.optimizer = torch.optim.Adam(params=self.model.parameters(), lr=LEARNING_RATE)
        
    def choose_action(self, observation):
        prob_weights = self.model.create_softmax_network(torch.FloatTensor(observation[np.newaxis, :]))
        action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.detach().numpy().ravel())  # select action w.r.t the actions prob
        return action

    def store_transition(self, s, a, r):
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):

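        # discounted return v_t for every time step, computed backwards
        discounted_ep_rs = np.zeros(len(self.ep_rs), dtype=np.float64)
        
        running_add = 0
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * GAMMA + self.ep_rs[t]
            discounted_ep_rs[t] = running_add

        # normalize the returns to reduce variance
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= (np.std(discounted_ep_rs) + 1e-8)  # guard against zero std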
#        print(discounted_ep_rs)
        # train on episode
        # returns must stay floating point; casting them to integers would truncate them
        loss = self.model(torch.FloatTensor(np.vstack(self.ep_obs)), 
                          torch.LongTensor(self.ep_as),
                          torch.FloatTensor(discounted_ep_rs))
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []    # empty episode data
# Hyper Parameters
ENV_NAME = 'CartPole-v0'
EPISODE = 3000 # Episode limitation
STEP = 3000 # Step limitation in an episode
TEST = 10 # The number of experiment test every 100 episode

def main():
  # initialize OpenAI Gym env and policy gradient agent
  # (classic gym API: reset() returns the observation, step() returns 4 values)
  env = gym.make(ENV_NAME)
  agent = Policy_Gradient(env)

  for episode in range(EPISODE):
    # initialize task
    state = env.reset()
    # Train
    for step in range(STEP):
      action = agent.choose_action(state) # sample an action from the current policy
      next_state,reward,done,_ = env.step(action)
      agent.store_transition(state, action, reward)
      state = next_state
      if done:
        #print("stick for ",step, " steps")
        agent.learn()
        break

    # Test every 100 episodes
    if episode % 100 == 0:
      total_reward = 0
      for i in range(TEST):
        state = env.reset()
        for j in range(STEP):
#          env.render()
          action = agent.choose_action(state) # actions are still sampled from the policy during evaluation
          state,reward,done,_ = env.step(action)
          total_reward += reward
          if done:
            break
      ave_reward = total_reward/TEST
      print ('episode: ',episode,'Evaluation Average Reward:',ave_reward)

if __name__ == '__main__':
  main()
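The listing above is written against the classic gym interface, where `env.reset()` returns only the observation and `env.step()` returns four values. Newer gym releases (0.26 and later) and gymnasium return `(observation, info)` from `reset()` and five values from `step()`. The helpers below are a hypothetical compatibility sketch, not part of the original code. They assume observations are never themselves 2-tuples ending in a dict, and the stub environment exists only to make the example runnable.

```python
def reset_env(env):
    """Return only the observation from env.reset(), for either API style."""
    result = env.reset()
    # Newer API: (observation, info). Classic API: observation only.
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]
    return result

def step_env(env, action):
    """Return (observation, reward, done, info) for either API style."""
    result = env.step(action)
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    return result

class StubNewStyleEnv:
    """Stand-in that mimics the newer return shapes (illustration only)."""
    def reset(self):
        return [0.0, 0.0], {}
    def step(self, action):
        return [0.1, 0.0], 1.0, False, True, {}

env = StubNewStyleEnv()
state = reset_env(env)
next_state, reward, done, info = step_env(env, 0)
print(state, next_state, reward, done)
```

In the training loop, `state = reset_env(env)` and `next_state, reward, done, _ = step_env(env, action)` would replace the direct calls.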

Reference: https://www.cnblogs.com/pinard/p/10137696.html
