This is a study note.
Value-based RL tends to pick the state or action with the highest value: it iteratively computes the optimal value function Q and improves the policy based on that optimal value function.
Policy-based RL is usually divided into stochastic policies and deterministic policies. It needs no value function: the policy assigns a probability distribution over actions and, given the current state, executes actions according to that distribution. The idea is to parameterize the policy as $\pi_{\theta}(s)$ and search for the optimal parameters $\theta$ that maximize the expected cumulative return, $\max E\big[\sum_{t=0}^{k}R(s_t)\mid\pi_{\theta}\big]$; the resulting $\pi_{\theta}(s)$ is then the optimal policy.
Differences:
1) Whereas value-based methods parameterize the value function, policy-based methods parameterize the policy directly, which makes the policy $\pi_{\theta}(s)$ simpler, more efficient, and easier to converge.
2) Value-based methods suit discrete action spaces. A continuous action space can be discretized, but the discretization step is hard to choose, and a small change in the value function can cause a large change in the policy. Policy-based methods suit continuous action spaces: instead of computing a probability for every action, the action is sampled, e.g. from a normal distribution (see the sketch after this list).
3) Policy-based methods usually use stochastic policies, which build the exploration ($\varepsilon$) into the learned policy itself.
4) Policy-based methods rely on gradient-based optimization and easily get stuck in local optima.
5) Evaluating an individual policy is not very effective for policy-based methods, and the estimates have high variance.
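A minimal sketch of the continuous-action case from point 2, assuming a hypothetical linear Gaussian policy (`state_dim`, `action_dim`, and the linear parameterization are illustrative choices, not from the note):

```python
import numpy as np

# Hypothetical linear Gaussian policy for a continuous action space:
# a ~ N(mu_theta(s), sigma^2), so actions are sampled instead of enumerated.
state_dim, action_dim = 4, 2
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # policy parameters theta (mean head)
log_std = np.zeros(action_dim)                           # log standard deviation (also learnable)

def sample_action(state):
    mu = W @ state                                   # state-dependent mean
    std = np.exp(log_std)
    action = mu + std * rng.normal(size=action_dim)  # sample from N(mu, std^2)
    # log pi_theta(a|s), needed later for the policy-gradient estimate
    logp = -0.5 * np.sum(((action - mu) / std) ** 2 + 2 * log_std + np.log(2 * np.pi))
    return action, logp

action, logp = sample_action(rng.normal(size=state_dim))
```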
Trajectory: $\tau: s_1,a_1,r_1,\dots,s_t,a_t,r_t$
Trajectory return: $R(\tau)=\sum_{t=0}^{k}R(s_t,a_t)$
Objective function: $l(\theta)=E\big[\sum_{t=0}^{k}R(s_t,a_t)\mid\pi_{\theta}\big]=\sum_{\tau}p(\tau\mid\theta)R(\tau)$
where $p(\tau\mid\theta)$ is the probability distribution over trajectories.
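For intuition, a made-up two-trajectory example: if $p(\tau_1\mid\theta)=0.3$, $R(\tau_1)=1$ and $p(\tau_2\mid\theta)=0.7$, $R(\tau_2)=2$, then $l(\theta)=0.3\cdot1+0.7\cdot2=1.7$.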
Gradient ascent on $l(\theta)$ (we are maximizing the expected return): $\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}l(\theta)$
$$
\begin{aligned}
\nabla_{\theta}l(\theta) &= \nabla_{\theta}\sum_{\tau}p(\tau\mid\theta)R(\tau) = \sum_{\tau}\nabla_{\theta}p(\tau\mid\theta)R(\tau) \\
&= \sum_{\tau}\frac{p(\tau\mid\theta)}{p(\tau\mid\theta)}\nabla_{\theta}p(\tau\mid\theta)R(\tau) \\
&= \sum_{\tau}p(\tau\mid\theta)\frac{\nabla_{\theta}p(\tau\mid\theta)}{p(\tau\mid\theta)}R(\tau) \\
&= \sum_{\tau}p(\tau\mid\theta)\nabla_{\theta}\log p(\tau\mid\theta)\,R(\tau)
\end{aligned}
$$
Here $\nabla_{\theta}\log p(\tau\mid\theta)=\frac{1}{p(\tau\mid\theta)}\nabla_{\theta}p(\tau\mid\theta)$, which follows from the identity $\frac{d\log f(x)}{dx}=\frac{1}{f(x)}\frac{df(x)}{dx}$.
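A quick numerical sanity check of this identity, using an arbitrary illustrative $f$ (not from the note):

```python
import numpy as np

# Check d/dx log f(x) = f'(x) / f(x) numerically for an illustrative f(x) = x^2 + 1.
f = lambda x: x ** 2 + 1.0
df = lambda x: 2.0 * x
x, h = 1.5, 1e-6
lhs = (np.log(f(x + h)) - np.log(f(x - h))) / (2 * h)  # central difference of log f
rhs = df(x) / f(x)
assert abs(lhs - rhs) < 1e-6
```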
The problem thus reduces to computing the expectation of $\nabla_{\theta}\log p(\tau\mid\theta)R(\tau)$, which can be estimated by the empirical average over sampled trajectories:
$$\nabla_{\theta}\overline{R}_{\theta}\approx\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}\mid\theta)\,R(\tau^{n})$$

or equivalently, in the notation of the objective above, with $m$ sampled trajectories:

$$\nabla_{\theta}l(\theta)\approx\frac{1}{m}\sum_{n=1}^{m}\nabla_{\theta}\log p(\tau^{n}\mid\theta)\,R(\tau^{n})$$
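A minimal sketch of this empirical average, assuming we already have, for each of $N$ sampled trajectories, its return $R(\tau^{n})$ and the gradient of its log-probability (obtaining the latter is question 1 below); the arrays here are placeholder values:

```python
import numpy as np

# Monte Carlo estimate of the policy gradient from N sampled trajectories.
# grad_logp[n] stands for grad_theta log p(tau^n | theta); returns[n] for R(tau^n).
def policy_gradient_estimate(grad_logp, returns):
    N = len(returns)
    return sum(g * r for g, r in zip(grad_logp, returns)) / N

# Placeholder usage with a 3-dimensional theta and N = 2 trajectories:
grad_logp = [np.array([0.1, -0.2, 0.3]), np.array([0.0, 0.5, -0.1])]
returns = [1.0, -0.5]
grad_estimate = policy_gradient_estimate(grad_logp, returns)
```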
1. So how do we compute $\nabla_{\theta}\log p(\tau\mid\theta)$?
From the definition of a trajectory, its probability is:
$$
\begin{aligned}
p(\tau\mid\theta)&=p(s_1)\,p(a_1\mid s_1,\theta)\,p(r_1,s_2\mid s_1,a_1)\,p(a_2\mid s_2,\theta)\,p(r_2,s_3\mid s_2,a_2)\cdots p(a_T\mid s_T,\theta)\,p(r_T,s_{T+1}\mid s_T,a_T)\\
&=p(s_1)\prod_{t=1}^{T}p(a_t\mid s_t,\theta)\,p(r_t,s_{t+1}\mid s_t,a_t)
\end{aligned}
$$
Taking the logarithm and then the gradient with respect to $\theta$:
$$\log p(\tau\mid\theta)=\log p(s_1)+\sum_{t=1}^{T}\big[\log p(a_t\mid s_t,\theta)+\log p(r_t,s_{t+1}\mid s_t,a_t)\big]$$

Since $\log p(s_1)$ and the dynamics terms $\log p(r_t,s_{t+1}\mid s_t,a_t)$ do not depend on $\theta$, their gradients vanish, leaving

$$\nabla_{\theta}\log p(\tau\mid\theta)=\sum_{t=1}^{T}\nabla_{\theta}\log p(a_t\mid s_t,\theta)$$
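For a concrete per-step term, the sketch below computes $\nabla_{\theta}\log p(a_t\mid s_t,\theta)$ for an illustrative linear-softmax policy over discrete actions (this parameterization is an assumption for the example, not the only choice):

```python
import numpy as np

# grad_theta log pi_theta(a|s) for an illustrative linear-softmax policy:
# pi_theta(a|s) = softmax(theta @ s)[a], with theta of shape (n_actions, state_dim).
def grad_log_softmax_policy(theta, s, a):
    logits = theta @ s
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # d log pi(a|s) / d theta[k, :] = (1{k == a} - pi(k|s)) * s
    grad = -np.outer(probs, s)
    grad[a] += s
    return grad

# Summing this over the steps of a trajectory gives grad_theta log p(tau | theta).
```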
2. How do we finally compute the policy gradient?
$$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}\overline{R}_{\theta_{old}}$$
The estimate is computed as follows:
$$
\begin{aligned}
\nabla_{\theta}\overline{R}_{\theta}&\approx\frac{1}{N}\sum_{n=1}^{N}\nabla_{\theta}\log p(\tau^{n}\mid\theta)\,R(\tau^{n})\\
&=\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\sum_{t=1}^{T}\nabla_{\theta}\log p(a_t^{n}\mid s_t^{n},\theta)\\
&=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}R(\tau^{n})\,\nabla_{\theta}\log p(a_t^{n}\mid s_t^{n},\theta)
\end{aligned}
$$
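Putting the pieces together, a minimal REINFORCE-style sketch of this update, assuming a hypothetical Gym-style environment (`env.reset()` returning a state vector, `env.step(a)` returning `(next_state, reward, done)`) and the same illustrative linear-softmax policy as above; all hyperparameters are placeholders:

```python
import numpy as np

# One policy-gradient update: theta_new = theta_old + alpha * grad R_bar(theta).
# env is a hypothetical Gym-style environment; n_trajectories, horizon, alpha are placeholders.
def reinforce_update(theta, env, n_trajectories=10, horizon=100, alpha=1e-2, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    grad = np.zeros_like(theta)
    for _ in range(n_trajectories):
        s = env.reset()
        steps, total_return = [], 0.0
        for _ in range(horizon):
            logits = theta @ s
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            a = rng.choice(len(probs), p=probs)      # sample a_t ~ pi_theta(.|s_t)
            steps.append((s, a))
            s, r, done = env.step(a)
            total_return += r                        # accumulate R(tau^n)
            if done:
                break
        # R(tau^n) * sum_t grad_theta log pi_theta(a_t^n | s_t^n)
        for s_t, a_t in steps:
            logits = theta @ s_t
            p = np.exp(logits - logits.max())
            p /= p.sum()
            g = -np.outer(p, s_t)
            g[a_t] += s_t                            # grad_theta log pi(a_t|s_t) for a softmax policy
            grad += total_return * g
    grad /= n_trajectories
    return theta + alpha * grad                      # gradient ascent on the expected return
```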
Reference: Hung-yi Lee (李宏毅), Reinforcement Learning lectures