Following the general RL algorithm framework, policy gradient iterates three steps: generate samples by running the policy, estimate the return from those samples, and improve the policy.
Sergey describes this loop as a formalization of trial and error.
Overall, the goal of RL is to learn a policy. There are two traditional families of methods: policy gradient and value-based. The former models the policy directly, while value-based methods do so indirectly.
The RL objective is:

$$\theta^* = \arg\max_{\theta} \underbrace{E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}$$
In practice, the expectation $E$ can be approximated by averaging over multiple sampled trajectories:
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t})$$
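As a concrete illustration, here is a minimal numpy sketch of this Monte Carlo estimate; the `rewards` data is a hypothetical placeholder rather than output from a real environment:

```python
import numpy as np

# Hypothetical sampled data: rewards[i][t] stands in for r(s_{i,t}, a_{i,t})
# of trajectory i at time step t.
rewards = [np.random.randn(50) for _ in range(8)]   # N = 8 trajectories, T = 50 steps each

# J(theta) ~= 1/N * sum_i sum_t r(s_{i,t}, a_{i,t})
J_estimate = np.mean([traj.sum() for traj in rewards])
print(J_estimate)
```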
In practice, besides evaluating $J$, we also need to optimize it, which requires the gradient of $J$ with respect to $\theta$.
First, write $J$ more compactly:

$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\underbrace{r(\tau)}_{\sum_{t=1}^T r(s_t, a_t)}\Big] = \int p_\theta(\tau)\, r(\tau)\, d\tau$$
Differentiating with respect to $\theta$ gives $\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$. This requires $\nabla_\theta p_\theta(\tau)$, but $p_\theta(\tau)$ itself is unknown.
Apply the log-derivative identity to $p_\theta$:
$$p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau) = p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} = \nabla_\theta p_\theta(\tau)$$
Substituting into $\nabla_\theta J(\theta)$:
$$\begin{aligned} \nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau \\ &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau \end{aligned}$$
Since an integral weighted by $p_\theta(\tau)$ can be written as an expectation, we further have:
$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)]$$
But $\log p_\theta(\tau)$ still cannot be computed directly. Note that:
$$\begin{aligned} p_\theta(\tau) &= p_\theta(s_1, a_1, \dots, s_T, a_T) \\ &= p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \end{aligned}$$
Taking the log of both sides:
$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^T \big[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big]$$
Substituting $\log p_\theta(\tau)$ back into $\nabla_\theta J(\theta)$:
$$\begin{aligned} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] \\ &= E_{\tau \sim p_\theta(\tau)}\Bigg[\nabla_\theta \Big[\underbrace{\log p(s_1)}_{\text{gradient w.r.t. } \theta \text{ is } 0} + \sum_{t=1}^T \big[\log \pi_\theta(a_t \mid s_t) + \underbrace{\log p(s_{t+1} \mid s_t, a_t)}_{\text{gradient w.r.t. } \theta \text{ is } 0}\big]\Big] r(\tau)\Bigg] \\ &= E_{\tau \sim p_\theta(\tau)}\Bigg[\Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^T r(s_t, a_t)\Big)\Bigg] \end{aligned}$$
As with $J$ itself, approximate the expectation with an average over $N$ sampled trajectories:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big)$$
Once $\nabla_\theta J(\theta)$ is computed, $\theta$ is updated by gradient ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
For example, the REINFORCE algorithm iterates: (1) sample trajectories $\{\tau_i\}$ by running $\pi_\theta(a_t \mid s_t)$; (2) estimate $\nabla_\theta J(\theta)$ with the expression above; (3) update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$. A minimal sketch of one such update appears below.
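The sketch below assumes a linear softmax policy $\pi_\theta(a \mid s) = \mathrm{softmax}(\theta^\top s)$, for which $\nabla_\theta \log \pi_\theta(a \mid s) = s\,\big(e_a - \pi_\theta(\cdot \mid s)\big)^\top$ with $e_a$ the one-hot vector of $a$; the trajectory format and all names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(trajectories, theta):
    """grad J(theta) ~= 1/N sum_i (sum_t grad log pi(a_t|s_t)) * (sum_t r_t),
    for a linear softmax policy with theta of shape (Ds, Da)."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:        # one sampled trajectory
        total_reward = float(np.sum(rewards))            # r(tau) = sum_t r(s_t, a_t)
        for s, a in zip(states, actions):
            probs = softmax(theta.T @ s)                 # pi_theta(. | s)
            one_hot = np.eye(theta.shape[1])[a]
            grad += np.outer(s, one_hot - probs) * total_reward   # grad log pi * r(tau)
    return grad / len(trajectories)

# One gradient-ascent step (alpha is a hypothetical step size):
# theta = theta + alpha * reinforce_gradient(trajectories, theta)
```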
For comparison, the maximum log-likelihood objective of supervised learning has gradient:
$$\begin{aligned} \nabla_\theta J_{\mathrm{ML}}(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log P(X_i \mid \theta) \\ &= \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \end{aligned}$$
while the policy gradient is:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big)$$
The only difference is the reward weighting at the end.
In supervised learning the labels guarantee that every $(s_t, a_t)$ pair is good, so no reward is needed: optimization always increases the $\log \pi$ term.
In reinforcement learning there are no labels, so the reward is required; the optimization does not simply increase the $\log \pi$ term, it may also decrease it for low-reward actions.
The core of policy gradient therefore lies in how the $\log \pi_\theta(a_t \mid s_t)$ terms are selected and weighted.
The partial observability problem: only observations $o_t$, not states $s_t$, are available. In RL the states are assumed to satisfy the Markov property, but observations need not.
Because the policy gradient derivation never uses the Markov property, it can be applied directly to partially observed problems.
That is, with partial observations the estimator is the same, conditioning the policy on $o_t$:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big)$$
The causality principle: the action taken at time $t$ cannot affect rewards received before time $t$.
Recall the policy gradient estimator:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big)$$
By causality, the total reward $\sum_{t=1}^T r(s_t, a_t)$ in $\nabla_\theta J(\theta)$ can be replaced by only the rewards from time $t$ onward, the reward-to-go:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid o_{i,t})\, \underbrace{\Big(\sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})\Big)}_{\text{reward-to-go},\ \hat{Q}_{i,t}}$$
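The reward-to-go $\hat{Q}_{i,t}$ for one trajectory can be computed with a reverse cumulative sum; a minimal numpy sketch:

```python
import numpy as np

def reward_to_go(rewards):
    """Q_hat[t] = sum_{t'=t}^{T} r[t'], computed as a reverse cumulative sum."""
    return np.cumsum(rewards[::-1])[::-1]

print(reward_to_go(np.array([1.0, 0.0, 2.0])))   # [3. 2. 2.]
```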
After this replacement, each reward sum is smaller in magnitude, so the terms being averaged are smaller and the variance of the gradient estimate decreases (the expectation is unchanged).
Considering the distant future also drives the variance up (the further ahead we look, the more the policy overfits to noise), so a discount factor $\gamma$ is introduced. Theoretically the discount should start at $t = 1$, that is:
$$\begin{aligned} \nabla_\theta J(\theta) &\approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^T \gamma^{t-1} r(s_{i,t}, a_{i,t})\Big) \\ &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \gamma^{t-1}\, \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\Big) \end{aligned}$$
In practice this form is not used, because the extra $\gamma^{t-1}$ factor makes later steps and their rewards matter less.
So we will use:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\Big)$$
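The discounted reward-to-go can be computed in a single backward pass; a minimal sketch (the example rewards are arbitrary):

```python
import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    """Q_hat[t] = sum_{t'=t}^{T} gamma^(t'-t) * r[t'], computed backwards in O(T)."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

print(discounted_reward_to_go([1.0, 0.0, 2.0], gamma=0.5))   # [1.5 1.  2. ]
```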
Another way to reduce the variance of policy gradient is to subtract a baseline: actions with below-average reward then receive a negative gradient, while those above average receive a positive one, i.e.:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log p_\theta(\tau_i)\, [r(\tau_i) - b], \qquad b = \frac{1}{N} \sum_{i=1}^N r(\tau_i)$$
Subtracting a constant $b$ does not change the expectation but can reduce the variance, because:
$$\begin{aligned} E[\nabla_\theta \log p_\theta(\tau)\, b] &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, b\, d\tau \\ &= \int \nabla_\theta p_\theta(\tau)\, b\, d\tau \\ &= b\, \nabla_\theta \int p_\theta(\tau)\, d\tau \\ &= b\, \nabla_\theta 1 \\ &= 0 \end{aligned}$$
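A minimal numpy sketch of the baseline-subtracted estimator; `grad_log_p` and `returns` are hypothetical per-trajectory quantities assumed to be already computed:

```python
import numpy as np

# grad_log_p[i] = grad_theta log p_theta(tau_i), flattened into a vector;
# returns[i] = r(tau_i). Both are placeholders here.
grad_log_p = np.random.randn(16, 100)   # N = 16 trajectories, 100 policy parameters
returns = np.random.randn(16)

b = returns.mean()                                              # b = 1/N sum_i r(tau_i)
grad_J = np.mean(grad_log_p * (returns - b)[:, None], axis=0)   # 1/N sum_i grad log p * (r - b)
```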
Next, derive the optimal baseline (rarely used in practice, but worth deriving). Write out the variance:
$$\mathrm{Var}[x] = E[x^2] - E[x]^2$$
Writing the gradient in expectation form:
$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, (r(\tau) - b)]$$
So the variance is:
$$\begin{aligned} \mathrm{Var} &= E_{\tau \sim p_\theta(\tau)}\Big[\big(\underbrace{\nabla_\theta \log p_\theta(\tau)}_{\text{write as } g(\tau)}\, (r(\tau) - b)\big)^2\Big] - \underbrace{E_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, (r(\tau) - b)\big]^2}_{\text{this expectation does not depend on } b\text{, so it equals the original } E} \\ &= E_{\tau \sim p_\theta(\tau)}\big[g(\tau)^2\, (r(\tau) - b)^2\big] - E_{\tau \sim p_\theta(\tau)}\big[g(\tau)\, r(\tau)\big]^2 \end{aligned}$$
Taking the derivative of the variance with respect to $b$:
$$\begin{aligned} \frac{\partial \mathrm{Var}}{\partial b} &= \frac{\partial}{\partial b} E\Big[g(\tau)^2 \big(\underbrace{r(\tau)^2}_{\text{derivative w.r.t. } b \text{ is } 0} - 2 b\, r(\tau) + b^2\big)\Big] \\ &= -2 E[g(\tau)^2\, r(\tau)] + 2 b\, E[g(\tau)^2] \end{aligned}$$
Setting this derivative to zero gives:
$$b = \frac{E[g(\tau)^2\, r(\tau)]}{E[g(\tau)^2]}$$
This is essentially the expected (average) reward, reweighted by $g(\tau)^2$, which makes $b$ depend on the policy: different policy parameters yield different baselines.
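A minimal sketch of this optimal baseline, computed per parameter dimension (the sample arrays are hypothetical placeholders):

```python
import numpy as np

# g[i] = grad_theta log p_theta(tau_i) (flattened), r[i] = r(tau_i); placeholders here.
g = np.random.randn(16, 100)   # N = 16 trajectories, 100 policy parameters
r = np.random.randn(16)

b_opt = np.mean(g**2 * r[:, None], axis=0) / np.mean(g**2, axis=0)   # b = E[g^2 r] / E[g^2]
grad_J = np.mean(g * (r[:, None] - b_opt), axis=0)                   # baseline-corrected estimate
```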
Looking ahead to actor-critic methods, which have lower variance, we can use the value function as a baseline; as long as the baseline depends only on the state, the estimator stays unbiased:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\Big(\sum_{t'=t}^T \gamma^{t'-t} r(s_{i,t'}, a_{i,t'})\Big) - \hat{V}^\pi_\phi(s_{i,t})\Big)$$
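A minimal sketch of such a state-dependent baseline, fitting a linear value function to Monte Carlo reward-to-go targets by least squares (placeholder data; in practice $\hat{V}^\pi_\phi$ would be a neural network critic):

```python
import numpy as np

states = np.random.randn(512, 8)   # state features for all visited (i, t) pairs (placeholder)
q_hat = np.random.randn(512)       # discounted reward-to-go targets (placeholder)

phi, *_ = np.linalg.lstsq(states, q_hat, rcond=None)   # linear regression for V_phi(s) = phi . s
advantage = q_hat - states @ phi                       # (Q_hat - V_hat), the weight on grad log pi
```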
Policy gradient is an on-policy algorithm: in $J(\theta) = E_{\tau \sim p_\theta(\tau)}[\sum_t r(s_t, a_t)]$, the notation $\tau \sim p_\theta(\tau)$ means trajectories must be sampled from the distribution induced by the current policy, so fresh samples are needed from the new distribution after every update.
Because on-policy learning is very sample-inefficient, consider an off-policy version of policy gradient.
If samples drawn from a different distribution can be reused, the on-policy algorithm becomes off-policy.
This relies on importance sampling (IS), which rewrites the expectation as:
$$\begin{aligned} E_{x \sim p(x)}[f(x)] &= \int p(x) f(x)\, dx \\ &= \int \frac{q(x)}{q(x)} p(x) f(x)\, dx \\ &= \int q(x) \frac{p(x)}{q(x)} f(x)\, dx \\ &= E_{x \sim q(x)}\Big[\frac{p(x)}{q(x)} f(x)\Big] \end{aligned}$$
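A quick numerical check of this identity: estimate $E_{x \sim p}[x^2]$ for $p = \mathcal{N}(0,1)$ using only samples from $q = \mathcal{N}(1,1)$ (the true value is $1$):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(loc=1.0, scale=1.0, size=100_000)        # x ~ q = N(1, 1)
w = np.exp(-x**2 / 2) / np.exp(-(x - 1)**2 / 2)         # p(x) / q(x); normalizers cancel
print(np.mean(w * x**2))                                # importance-sampled estimate, ~= 1.0
```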
Applying this to $J(\theta) = E_{\tau \sim p_\theta(\tau)}[r(\tau)]$: suppose we have no trajectories sampled from $p_\theta(\tau)$, but do have trajectories sampled from $\bar{p}_\theta(\tau)$; then:
$$J(\theta) = E_{\tau \sim \bar{p}_\theta(\tau)}\Big[\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)} r(\tau)\Big]$$
For the ratio $\frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)}$, the initial-state and transition probabilities cancel:
$$\begin{aligned} \frac{p_\theta(\tau)}{\bar{p}_\theta(\tau)} &= \frac{p(s_1) \prod_{t=1}^T \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_1) \prod_{t=1}^T \bar{\pi}_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} \\ &= \frac{\prod_{t=1}^T \pi_\theta(a_t \mid s_t)}{\prod_{t=1}^T \bar{\pi}_\theta(a_t \mid s_t)} \end{aligned}$$
Using this property to evaluate new parameters $\theta'$ with samples collected under $\theta$:
$$\begin{aligned} J(\theta') &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} r(\tau)\Big] \\ \nabla_{\theta'} J(\theta') &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{\nabla_{\theta'} p_{\theta'}(\tau)}{p_\theta(\tau)} r(\tau)\Big] \\ &= E_{\tau \sim p_\theta(\tau)}\Big[\frac{p_{\theta'}(\tau)}{p_\theta(\tau)} \nabla_{\theta'} \log p_{\theta'}(\tau)\, r(\tau)\Big] \end{aligned}$$
The final forms of $\nabla_\theta J(\theta)$ are as follows. For on-policy:
$$\begin{aligned} \nabla_\theta J(\theta) &= E_{\tau \sim p_\theta(\tau)}[\nabla_\theta \log p_\theta(\tau)\, r(\tau)] \\ &\approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t} \end{aligned}$$
where $\hat{Q}_{i,t} = \sum_{t'=t}^T r(s_{i,t'}, a_{i,t'})$ is the sum of future rewards (the reward-to-go).
Similarly, for off-policy:
$$\nabla_{\theta'} J(\theta') \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \frac{\pi_{\theta'}(a_{i,t} \mid s_{i,t})}{\pi_\theta(a_{i,t} \mid s_{i,t})}\, \nabla_{\theta'} \log \pi_{\theta'}(a_{i,t} \mid s_{i,t})\, \hat{Q}_{i,t}$$
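A minimal sketch of this off-policy estimator over a flattened batch of $N \cdot T$ samples; all arrays are hypothetical placeholders assumed to come from rollouts under the old policy $\pi_\theta$:

```python
import numpy as np

# logp_old[k] = log pi_theta(a_k|s_k) (behavior policy), logp_new[k] = log pi_theta'(a_k|s_k),
# grad_logp_new[k] = grad_theta' log pi_theta'(a_k|s_k) flattened, q_hat[k] = reward-to-go.
logp_old = np.random.randn(512)
logp_new = logp_old + 0.01 * np.random.randn(512)
grad_logp_new = np.random.randn(512, 100)
q_hat = np.random.randn(512)

ratio = np.exp(logp_new - logp_old)                                # pi_theta'(a|s) / pi_theta(a|s)
grad_J = np.mean((ratio * q_hat)[:, None] * grad_logp_new, axis=0)
```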
In implementation, computing $\nabla_\theta \log \pi_\theta$ by hand is inefficient; instead, define a surrogate loss whose gradient equals $\nabla_\theta J(\theta)$ and let automatic differentiation handle the rest, for example:
$$\tilde{J}(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \underbrace{\log \pi_\theta(a_{i,t} \mid s_{i,t})}_{\text{cross entropy / squared error}}\, \hat{Q}_{i,t}$$
The code looks very similar to supervised learning; the examples below assume discrete actions.
supervised learning:
```python
import tensorflow as tf   # TF1-style graph code, as in the lecture slides

'''
Given
actions - (N * T) * Da tensor of actions
states - (N * T) * Ds tensor of states
'''
logits = policy.predictions(states)            # unnormalized action log-probabilities (placeholder model)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
loss = tf.reduce_mean(negative_likelihoods)    # average negative log-likelihood
gradients = tf.gradients(loss, variables)      # d loss / d policy parameters
```
policy gradient:
```python
'''
Given
actions - (N * T) * Da tensor of actions
states - (N * T) * Ds tensor of states
q_values - (N * T) * 1 tensor of estimated state-action values
'''
logits = policy.predictions(states)            # unnormalized action log-probabilities (placeholder model)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, q_values)   # weight each sample by Q_hat
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = tf.gradients(loss, variables)      # d loss / d policy parameters
```
Note, however, that policy gradient is still not the same as supervised learning: the actions are sampled from the policy itself rather than given as ground-truth labels, and the gradient estimate has high variance.
Policy gradient also suffers from unstable, oscillating updates when the learning rate $\alpha$ is too large; adaptive step-size methods address this, see the paper: Trust Region Policy Optimization (ICML 2015).
To summarize: policy gradient estimates everything directly from sampled trajectories. The expected total reward is approximated as:
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t})$$
And to optimize $J(\theta)$, we compute $\nabla_\theta J(\theta)$ as:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \Big(\sum_{t=1}^T \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^T r(s_{i,t}, a_{i,t})\Big)$$
This method uses every single sample, so usually it has low bias but high variance.