cs285-lec5-policy gradient

Table of Contents

  • Summary
  • Mathematical derivation
    • How to compute $J(\theta)$
    • How to optimize $J(\theta)$
  • Comparing policy gradient with maximum likelihood in supervised learning
  • partial observability
  • Pros and cons
    • Pros
    • Cons
  • reduce the variance
    • Causality
    • Reducing variance with the discount factor $\gamma$
    • Reducing variance by subtracting a baseline
  • on-policy -> off-policy
    • The on-policy property
    • Extending to off-policy
  • Code implementation
  • Automatically adjusting the learning rate $\alpha$
  • Questions
    • Why does policy gradient have high variance?
    • How to calculate $\log \pi_\theta(a|s)$?

Summary

Following the general RL algorithm framework, the overall steps of policy gradient are:

  1. Sample $\{ s_i, a_i \}$ from $\pi_\theta(a|s)$ (run the policy to generate samples)
  2. Estimate the return: $J(\theta)=E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]\approx\frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})$
  3. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \Big( \big(\sum_{t' = t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i, t'}) \big) - b \Big)$, which incorporates causality, the discount factor $\gamma$, and a subtracted baseline $b$.
  4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Sergey describes this procedure as a formalization of trial and error.

Mathematical derivation

Broadly speaking, RL aims to learn a policy. There are two traditional families of methods: policy gradient and value-based. The former models the policy directly, while value-based methods obtain the policy indirectly.

The RL objective is: $\theta^* = \arg\max_{\theta}\underbrace{E_{\tau \sim p_\theta(\tau)}\big[\sum_t r(s_t, a_t)\big]}_{J(\theta)}$

How to compute $J(\theta)$

In practice, the expectation $E$ can be approximated using the results of multiple sampled trajectories:
$$J(\theta)=E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]\approx\frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})$$
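
A minimal numpy sketch of this Monte Carlo estimate; `sample_trajectory` is a hypothetical stand-in for rolling out the current policy once and returning its per-step rewards.

import numpy as np

def estimate_return(sample_trajectory, num_trajectories=100):
    """Monte Carlo estimate of J(theta) = E[sum_t r(s_t, a_t)]."""
    returns = []
    for _ in range(num_trajectories):
        rewards = sample_trajectory()     # [r_1, ..., r_T] for one rollout
        returns.append(np.sum(rewards))   # sum_t r(s_{i,t}, a_{i,t})
    return np.mean(returns)               # average over the N trajectories

print(estimate_return(lambda: np.ones(5)))   # trivial example: reward 1 at every step -> J is about 5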

How to optimize $J(\theta)$

In practice, besides computing $J$, we also need to optimize it, which requires the gradient of $J$ with respect to $\theta$.

First, abbreviate $J$: $J(\theta) = E_{\tau \sim p_\theta(\tau)}\big[\underbrace{r(\tau)}_{\sum_{t=1}^T r(s_t, a_t)}\big] = \int p_\theta(\tau) r(\tau) d\tau$

J J J θ \theta θ的导数,有: ∇ θ J ( θ ) = ∫ ∇ θ p θ ( τ ) r ( τ ) d τ \nabla_\theta J(\theta)=\int \nabla_\theta p_\theta(\tau)r(\tau)d\tau θJ(θ)=θpθ(τ)r(τ)dτ,这个需要求 ∇ θ p θ ( τ ) \nabla_\theta p_\theta(\tau) θpθ(τ),但是 p p p是未知的

p p p做一下数学变换,有:
p θ ( τ ) ∇ θ log ⁡ p θ ( τ ) = p θ ( τ ) ∇ θ p θ ( τ ) p θ ( τ ) = ∇ θ p θ ( τ ) p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau) pθ(τ)θlogpθ(τ)=pθ(τ)pθ(τ)θpθ(τ)=θpθ(τ)
代入到 ∇ θ J ( θ ) \nabla_\theta J(\theta) θJ(θ)中,有:
∇ θ J ( θ ) = ∫ ∇ θ p θ ( τ ) r ( τ ) d τ = ∫ p θ ( τ ) ∇ θ log ⁡ p θ ( τ ) r ( τ ) d τ \begin{aligned} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau) r(\tau) d\tau \\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) r(\tau) d\tau \\ \end{aligned} θJ(θ)=θpθ(τ)r(τ)dτ=pθ(τ)θlogpθ(τ)r(τ)dτ
因为 p ( τ ) p(\tau) p(τ)对某个数字的积分可以写成期望,所以进一步有:
∇ θ J ( θ ) = ∫ ∇ θ p θ ( τ ) r ( τ ) d τ = ∫ p θ ( τ ) ∇ θ log ⁡ p θ ( τ ) r ( τ ) d τ = E τ ∼ p θ ( τ ) [ ∇ θ log ⁡ p θ ( τ ) r ( τ ) ] \begin{aligned} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau) r(\tau) d\tau \\ &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) r(\tau) d\tau \\ &= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau) r(\tau)] \end{aligned} θJ(θ)=θpθ(τ)r(τ)dτ=pθ(τ)θlogpθ(τ)r(τ)dτ=Eτpθ(τ)[θlogpθ(τ)r(τ)]

We still cannot compute $\log p_\theta(\tau)$ directly. Note that:
$$p_\theta(\tau) = p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t = 1}^T \pi_\theta(a_t | s_t)\, p(s_{t+1}|s_t, a_t)$$
Taking the log of both sides:
$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t = 1}^T \big[\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\big]$$

Now substitute $\log p_\theta(\tau)$ into $\nabla_\theta J(\theta)$:
$$\begin{aligned} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] \\ &= E_{\tau \sim p_\theta(\tau)}\Bigg[\nabla_\theta \Big[ \underbrace{\log p(s_1)}_{\text{gradient w.r.t. } \theta \text{ is } 0} + \sum_{t = 1}^T \big[\log \pi_\theta(a_t|s_t) + \underbrace{\log p(s_{t+1}|s_t, a_t)}_{\text{gradient w.r.t. } \theta \text{ is } 0}\big] \Big] r(\tau)\Bigg] \\ &= E_{\tau \sim p_\theta(\tau)}\Bigg[ \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_t|s_t) \Big) \Big( \sum_{t = 1}^T r(s_t, a_t) \Big) \Bigg] \end{aligned}$$
As with $J$ itself, the expectation is approximated by averaging over sampled trajectories:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big) \Big( \sum_{t = 1}^T r(s_{i,t}, a_{i,t}) \Big)$$
With $\nabla_\theta J(\theta)$ computed this way, $\theta$ is updated by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

For example, the REINFORCE algorithm iterates (a minimal sketch follows the list):

  1. Sample trajectories $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$
  2. $\nabla_\theta J(\theta) \approx \sum_i \Big(\sum_t \nabla_\theta\log \pi_\theta(a_t^i|s_t^i) \Big) \Big( \sum_t r(s_t^i, a_t^i) \Big)$
  3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
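
As a concrete illustration, here is a minimal numpy sketch of these three steps on a toy 3-armed bandit (trajectories of length one) with a softmax policy; the reward means and hyperparameters are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
true_mean_rewards = np.array([1.0, 2.0, 0.5])   # toy bandit: mean reward per action

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(3)          # policy parameters (one logit per action)
alpha, N = 0.05, 64          # learning rate and trajectories per update

for it in range(200):
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(N):
        a = rng.choice(3, p=probs)                     # step 1: sample a ~ pi_theta
        r = true_mean_rewards[a] + rng.normal(0, 1.0)  # observe a noisy reward
        grad_log_pi = -probs.copy()                    # grad_theta log softmax(theta)[a]
        grad_log_pi[a] += 1.0
        grad += grad_log_pi * r                        # step 2: grad log pi * reward
    theta += alpha * grad / N                          # step 3: gradient ascent

print("learned action probabilities:", softmax(theta))   # should favor action index 1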

Comparing policy gradient with maximum likelihood in supervised learning

The gradient of the supervised-learning (maximum likelihood) objective is:
$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t})$$
i.e., the gradient of the data log-likelihood $\log P(X\mid\theta)$, where the model is the policy $\pi_\theta$.

while the policy gradient is:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big) \Big( \sum_{t = 1}^T r(s_{i,t}, a_{i,t}) \Big)$$
The only difference is whether the reward appears as a weight.

In supervised learning the labels guarantee that every $(s_t, a_t)$ pair is good, so no reward is needed and the optimization always increases the $\log \pi$ terms.
In reinforcement learning there are no labels, so the reward is needed as a weight; depending on its sign, the optimization may increase the $\log \pi$ terms or decrease them.

The core of policy gradient therefore lies in how the $\log \pi_\theta(a_t|s_t)$ terms are chosen and weighted.

partial observability

Under partial observability, the agent only sees observations rather than states. In RL, only the states are assumed to satisfy the Markov property; observations do not necessarily satisfy it.

However, because the policy gradient derivation does not rely on the Markov property, it can be applied to partially observed problems.

That is, for partial observations the policy gradient takes the same form:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|o_{i,t}) \Big) \Big( \sum_{t = 1}^T r(s_{i,t}, a_{i,t}) \Big)$$

Pros and cons

Pros

  1. Better convergence properties: it converges (at least to a local optimum)
  2. It works with continuous action spaces

Cons

  1. High variance: it is hard to give credit to the actions that bring more future reward
  2. Sample inefficiency: being on-policy, it must sample new data from the current policy after every update

reduce the variance

Causality

Causality: the policy decision made at time $t$ cannot affect rewards received before time $t$.

Start from the policy gradient estimator:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big) \Big( \sum_{t = 1}^T r(s_{i,t}, a_{i,t}) \Big)$$

By causality, the total reward $\sum_{t = 1}^T r(s_t, a_t)$ multiplying each $\nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t})$ can be replaced by the rewards from time $t$ onward, the "reward-to-go":
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \underbrace{ \Big( \sum_{t' = t}^T r(s_{i,t'}, a_{i,t'}) \Big)}_{\text{reward-to-go},\ \hat{Q}_{i,t}}$$

After this replacement the reward sums become smaller in magnitude, so the quantities being averaged are smaller and the variance of the estimator drops as well.

Reducing variance with the discount factor $\gamma$

Looking too far into the future leads to excessively high variance (the further ahead we look, the more the policy fits noise). Theoretically the discount factor should start at $t = 1$, that is:
$$\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big) \Big( \sum_{t=1}^T \gamma^{t-1} r(s_{i,t}, a_{i,t}) \Big) \\ &\approx \frac{1}{N} \sum_{i = 1}^N \sum_{t = 1}^T \gamma^{t-1} \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big( \sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) \Big) \end{aligned}$$

In practice we do not use this first form, because the extra $\gamma^{t-1}$ in front of the gradient term means that later rewards and steps matter less and less.

So we will use:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big( \sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'}) \Big)$$
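
A small numpy sketch of computing the discounted reward-to-go $\hat{Q}_{i,t} = \sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})$ for one trajectory; the reward array in the example is arbitrary.

import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """Return Q_hat[t] = sum_{t' >= t} gamma^(t'-t) * rewards[t'] for each t."""
    q = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the already-discounted tail.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

print(discounted_reward_to_go(np.array([1.0, 0.0, 2.0]), gamma=0.9))   # [2.62, 1.8, 2.0]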

Reducing variance by subtracting a baseline

Another way to reduce the variance of policy gradient is to subtract a baseline (with a baseline, actions whose reward is below average get a negative weight in the gradient, and those above average get a positive one):
$$\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i = 1}^N \nabla_\theta \log p_\theta(\tau_i)\, [r(\tau_i) - b] \\ b &= \frac{1}{N}\sum_{i = 1}^N r(\tau_i) \end{aligned}$$
Subtracting a constant $b$ does not change the expectation, yet it can reduce the variance, because:
$$\begin{aligned} E[\nabla_\theta \log p_\theta(\tau)\, b] &= \int p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\, b \;d\tau \\ &= \int \nabla_\theta p_\theta(\tau)\, b \;d\tau \\ &= b \nabla_\theta \int p_\theta(\tau) \;d\tau \\ &= b\nabla_\theta 1 \\ &= 0 \end{aligned}$$
Notes:

  1. Setting the baseline to the average reward is not optimal, but it already works quite well (see the sketch after this list)
  2. The step from the first to the second line uses the identity $p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$
  3. $\int p_\theta(\tau) \;d\tau$ is the integral of a probability distribution, which equals 1
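
A minimal sketch of the average-return baseline above, assuming per-trajectory returns $r(\tau_i)$ have already been computed; the resulting values are what multiply $\nabla_\theta \log p_\theta(\tau_i)$.

import numpy as np

def baseline_advantages(trajectory_returns):
    """Subtract the batch-average return b from each trajectory's return r(tau_i)."""
    returns = np.asarray(trajectory_returns, dtype=float)
    b = returns.mean()          # b = (1/N) * sum_i r(tau_i)
    return returns - b          # positive if above average, negative if below

print(baseline_advantages([10.0, 2.0, 6.0]))   # b = 6.0 -> [ 4., -4.,  0.]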

Now derive the optimal baseline (rarely used in practice, but worth working out; a numerical sketch follows). Write out the variance:
$$\mathrm{Var}[x] = E[x^2] - E[x]^2$$
Write $\nabla_\theta J$ in expectation form:
$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, (r(\tau) - b)]$$
So the variance is:
$$\begin{aligned} \mathrm{Var} &= E_{\tau \sim p_\theta(\tau)}\Big[\big(\underbrace{\nabla_\theta \log p_\theta(\tau)}_{\text{abbreviated as } g(\tau)}\,(r(\tau) - b)\big)^2\Big] - \underbrace{E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,(r(\tau) - b)\big]^2}_{\text{unchanged by } b\text{, so it equals the original expectation}} \\ &= E_{\tau \sim p_\theta(\tau)}\big[g^2(\tau)(r(\tau) - b)^2\big] - E_{\tau \sim p_\theta(\tau)}\big[g(\tau) r(\tau)\big]^2 \end{aligned}$$
Differentiate the variance with respect to $b$:
$$\begin{aligned} \frac{\partial \mathrm{Var}}{\partial b} &= \frac{\partial}{\partial b}E\big[\underbrace{g^2(\tau) r^2(\tau)}_{\text{derivative w.r.t. } b \text{ is } 0} - 2b\, g^2(\tau) r(\tau) + b^2 g^2(\tau)\big] \\ &= -2E[g^2(\tau) r(\tau)] + 2b\,E[g^2(\tau)] \end{aligned}$$
Setting the gradient to zero gives:
$$b = \frac{E[g^2(\tau) r(\tau)]}{E[g^2(\tau)]}$$
This is still essentially an expected/average reward, just reweighted by $g^2(\tau)$, which ties $b$ to the policy: different policy parameters lead to a different optimal $b$.
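
A small numpy sketch of this formula, assuming (for simplicity) a one-parameter policy so that each $g(\tau_i) = \nabla_\theta \log p_\theta(\tau_i)$ is a scalar; with a parameter vector the same expression is usually applied per dimension. The numbers are arbitrary.

import numpy as np

def optimal_baseline(grad_log_probs, returns):
    """b = E[g(tau)^2 r(tau)] / E[g(tau)^2], estimated from samples."""
    g = np.asarray(grad_log_probs, dtype=float)   # g(tau_i), here one scalar per trajectory
    r = np.asarray(returns, dtype=float)          # r(tau_i)
    return np.mean(g**2 * r) / np.mean(g**2)

# Trajectories with larger |g| pull the baseline toward their returns.
print(optimal_baseline(grad_log_probs=[0.1, 2.0, -1.0], returns=[5.0, 1.0, 3.0]))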

Looking ahead to actor-critic methods, which have lower variance, we can also use a value function as the baseline; as long as the baseline depends only on the state (and not on the action), the estimator stays unbiased:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \Big( \Big(\sum_{t' = t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i, t'}) \Big) - \hat{V}^\pi_\phi(s_{i,t}) \Big)$$
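
One common way to obtain such a baseline, sketched here as an assumption rather than something spelled out in this lecture, is to fit $\hat{V}^\pi_\phi(s)$ by regressing onto the observed reward-to-go; below is a simple least-squares version on hand-made state features.

import numpy as np

def fit_value_baseline(state_features, reward_to_go):
    """Least-squares fit of V_hat(s) ~ w^T phi(s) to Monte Carlo reward-to-go targets."""
    X = np.asarray(state_features, dtype=float)   # shape (num_steps, feature_dim)
    y = np.asarray(reward_to_go, dtype=float)     # shape (num_steps,)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy features [1, t] for 3 time steps
rtg = np.array([3.0, 2.0, 1.0])
w = fit_value_baseline(phi, rtg)
print(phi @ w)   # fitted V_hat at each step; advantages would be rtg - phi @ w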

on-policy -> off-policy

The on-policy property

Policy gradient is an on-policy algorithm: in $J(\theta)=E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$, the expectation over $\tau \sim p_\theta(\tau)$ means that every update needs fresh samples drawn from the trajectory distribution induced by the current policy.

Since on-policy sampling is very inefficient, consider an off-policy version of policy gradient.

Extending to off-policy

If samples drawn from a different distribution can be used, the on-policy method becomes off-policy.

This requires the concept of importance sampling (IS).

Importance sampling rewrites an expectation under $p$ as an expectation under $q$:
$$\begin{aligned} E_{x \sim p(x)}[f(x)] &= \int p(x)f(x) dx \\ &= \int \frac{q(x)}{q(x)} p(x)f(x) dx \\ &= \int q(x) \frac{p(x)}{q(x)} f(x) dx \\ &= E_{x \sim q(x)}\Big[\frac{p(x)}{q(x)}f(x)\Big] \end{aligned}$$
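
A quick numpy sanity check of this identity: estimate $E_{x \sim p}[f(x)]$ using only samples from $q$, weighted by $p(x)/q(x)$. The particular Gaussians and $f$ here are arbitrary illustration choices.

import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

f = lambda x: x ** 2
x_q = rng.normal(loc=0.0, scale=2.0, size=200_000)                   # samples from q = N(0, 2)
weights = gaussian_pdf(x_q, 1.0, 1.0) / gaussian_pdf(x_q, 0.0, 2.0)  # p(x)/q(x) with p = N(1, 1)
print(np.mean(weights * f(x_q)))   # estimates E_{x~p}[x^2] = 1^2 + 1 = 2 using only q-samples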

Applying the same trick to $J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]$: suppose we have no trajectories sampled from $p_\theta(\tau)$, but we do have trajectories sampled from $\bar{p}_\theta(\tau)$. Then:
$$J(\theta) = E_{\tau \sim \bar{p}_\theta(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}r(\tau)\Big]$$
For the ratio $\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}$:
$$\begin{aligned} \frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)} &= \frac{p(s_1) \prod_{t = 1}^T \pi_\theta(a_t | s_t)\, p(s_{t+1}|s_t, a_t)}{p(s_1) \prod_{t = 1}^T \bar\pi_\theta(a_t | s_t)\, p(s_{t+1}|s_t, a_t)} \\ &= \frac{\prod_{t = 1}^T \pi_\theta(a_t | s_t)}{\prod_{t = 1}^T \bar\pi_\theta(a_t | s_t)} \end{aligned}$$
The unknown initial-state and transition probabilities cancel, so the ratio only involves the two policies.

Using this property to evaluate new parameters $\theta'$ with samples from the old policy $\pi_\theta$:
$$\begin{aligned} J(\theta') &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)\Big] \\ \nabla_{\theta'}J(\theta') &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{\nabla_{\theta'}p_{\theta'}(\tau)}{p_\theta(\tau)}r(\tau)\Big] \\ &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}\nabla_{\theta'}\log p_{\theta'}(\tau)\,r(\tau)\Big] \end{aligned}$$
Where:

  1. The second line holds because only $p_{\theta'}(\tau)$ depends on $\theta'$, so the gradient operator can be moved directly onto it
  2. The third line uses the identity $p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$ (applied with $\theta'$)
  3. Note that this final $\nabla_{\theta'}J(\theta')$ has almost the same form as $\nabla_{\theta}J(\theta)$; the only difference is the extra importance weight $\frac{p_{\theta'}(\tau)}{p_\theta(\tau)}$ in front

The resulting gradient estimators are as follows.

For on-policy:
$$\begin{aligned} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\,r(\tau)] \\ &\approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t})\, \hat{Q}_{i,t} \end{aligned}$$
where $\hat{Q}_{i,t} = \sum_{t' = t}^T r(s_{i,t'}, a_{i, t'})$ is the sum of future rewards (the reward-to-go).

Similarly, for off-policy, with per-timestep importance weights (a sketch follows below):
$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N}\sum_{i = 1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(a_{i,t}|s_{i,t})}{\pi_{\theta}(a_{i,t}|s_{i,t})}\nabla_{\theta'}\log \pi_{\theta'}(a_{i,t} | s_{i,t})\, \hat{Q}_{i,t}$$
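
A hedged PyTorch sketch of this off-policy estimator for a discrete-action policy: the per-step ratio is computed from log-probabilities stored at rollout time under $\pi_\theta$, and `q_values` stands in for $\hat{Q}_{i,t}$; all tensors below are placeholders for real rollout data.

import torch
import torch.nn.functional as F

# Placeholder rollout data collected under the behavior policy pi_theta.
states = torch.randn(128, 4)                   # (N*T) x Ds
actions = torch.randint(0, 3, (128,))          # (N*T,) discrete actions
behavior_log_probs = -torch.rand(128) - 0.2    # stored log pi_theta(a_t|s_t) from rollout time
q_values = torch.randn(128)                    # reward-to-go estimates Q_hat_{i,t}

new_policy = torch.nn.Linear(4, 3)             # pi_theta' as a logits network (placeholder)
logits = new_policy(states)
new_log_probs = F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)

# Per-step importance weight pi_theta'(a|s) / pi_theta(a|s), detached so that it acts as a
# constant weight and the resulting gradient matches the estimator above.
ratios = (new_log_probs - behavior_log_probs).exp().detach()

loss = -(ratios * new_log_probs * q_values).mean()   # minimizing this ascends J(theta')
loss.backward()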

Code implementation

In an implementation, assembling $\nabla_\theta \log \pi_\theta$ term by term is too inefficient. The trick is to define a pseudo-loss whose automatic-differentiation gradient equals $\nabla_{\theta} J(\theta)$, for example:
$$\tilde{J}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \sum_{t=1}^T \underbrace{\log \pi_\theta(a_{i,t}|s_{i,t})}_{\text{cross entropy / squared error}}\hat{Q}_{i,t}$$

The implementation looks very similar to supervised learning; the examples below use discrete actions (TensorFlow 1.x pseudocode, where `policy.predictions` and `variables` stand for the policy network's forward pass and its trainable variables).
supervised learning:

'''
Given
actions - (N * T) * Da tensor of actions
states - (N * T) * Ds tensor of states
'''
logits = policy.predictions(states)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)
gradients = tf.gradients(loss, variables)

policy gradient:

'''
Given
actions - (N * T) * Da tensor of actions
states - (N * T) * Ds tensor of states
q_values - (N * T) * 1 tensor of estimated state-action values
'''
logits = policy.predictions(states)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, q_values)
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)
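
For reference, a hedged sketch of the same weighted pseudo-loss in a modern framework (PyTorch); `policy_net` and the tensors are placeholders, not part of the lecture code:

import torch
import torch.nn.functional as F

policy_net = torch.nn.Linear(4, 3)          # placeholder policy network: states -> logits
states = torch.randn(128, 4)                # (N*T) x Ds
actions = torch.randint(0, 3, (128,))       # (N*T,) discrete actions
q_values = torch.randn(128)                 # (N*T,) estimated reward-to-go

logits = policy_net(states)
negative_likelihoods = F.cross_entropy(logits, actions, reduction="none")   # -log pi(a|s)
loss = (negative_likelihoods * q_values).mean()    # weighted pseudo-loss (= -J_tilde)
loss.backward()                                    # autodiff produces the policy gradient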

Note that policy gradient still differs from supervised learning:

  1. The gradient has very high variance
  2. The gradient will be noisy
  3. The batch size needs to be large
  4. The learning rate is hard to tune (Adam still works)

Automatically adjusting the learning rate $\alpha$

Policy gradient also suffers from unstable, oscillating updates when the learning rate is too large, so there are methods that adjust $\alpha$ automatically; see the paper Trust Region Policy Optimization (ICML 2015).

Questions

Why does policy gradient have high variance?

Because policy gradient optimizes directly on sampled trajectories. The expected total reward is estimated as:
$$J(\theta)=E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]\approx\frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})$$
And to optimize $J(\theta)$, we need to estimate $\nabla_\theta J(\theta)$ as:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i = 1}^N \Big(\sum_{t = 1}^T \nabla_\theta\log \pi_\theta(a_{i,t}|s_{i,t}) \Big) \Big( \sum_{t = 1}^T r(s_{i,t}, a_{i,t}) \Big)$$

This estimator is built from individual sampled trajectories, so it usually has low bias but high variance (a small numerical illustration follows).
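
To make this concrete, here is a small numpy experiment (not from the lecture) that repeatedly forms the gradient estimate for a toy two-action softmax policy from small batches and reports how much it fluctuates; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])                 # logits of a softmax policy over 2 actions
probs = np.exp(theta) / np.exp(theta).sum()
mean_rewards = np.array([1.0, 3.0])          # toy per-action mean rewards

def grad_estimate(batch_size):
    """One policy-gradient estimate from `batch_size` single-step trajectories."""
    g = np.zeros(2)
    for _ in range(batch_size):
        a = rng.choice(2, p=probs)
        r = mean_rewards[a] + rng.normal(0, 2.0)   # noisy reward
        grad_log_pi = -probs.copy()
        grad_log_pi[a] += 1.0
        g += grad_log_pi * r
    return g / batch_size

estimates = np.array([grad_estimate(batch_size=10) for _ in range(2000)])
print("mean of estimates:", estimates.mean(axis=0))   # close to the true gradient
print("std of estimates: ", estimates.std(axis=0))    # large relative to the mean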

How to calculate $\log \pi_\theta(a|s)$?
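
The notes leave this question open. A common answer, stated here as an assumption consistent with the code above: for discrete actions, take the log-softmax of the policy network's logits at the chosen action; for continuous actions, evaluate the log-density of the policy's (e.g. Gaussian) output distribution at the chosen action. A PyTorch sketch:

import torch

# Discrete actions: log pi(a|s) is the log-softmax of the logits at the taken action.
logits = torch.randn(5, 3)                        # placeholder policy outputs for 5 states
actions = torch.randint(0, 3, (5,))
log_pi_discrete = torch.distributions.Categorical(logits=logits).log_prob(actions)

# Continuous actions: log pi(a|s) is the log-density of (e.g.) a diagonal Gaussian policy.
mean = torch.randn(5, 2)                          # placeholder policy mean for 2-D actions
std = torch.ones(5, 2) * 0.5
taken_actions = torch.randn(5, 2)
log_pi_continuous = torch.distributions.Normal(mean, std).log_prob(taken_actions).sum(dim=-1)

print(log_pi_discrete.shape, log_pi_continuous.shape)   # torch.Size([5]) torch.Size([5])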
