Deterministic Policy Gradient Algorithms
Paper link
DPG
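Notes
Motivation
The policy gradient algorithms proposed first were stochastic.
"Stochastic" here refers to a stochastic policy \(\pi_\theta(a|s)=P[a|s;\theta]\). A stochastic policy can be problematic in high-dimensional continuous action spaces: because it has to account for the effect of every action available in the current state, it needs more (s,a) samples to form an accurate estimate.
A deterministic policy \(a=\mu_\theta(s)\) was previously considered infeasible; the reasons are still to be filled in here. (One obvious reason is insufficient exploration.)
Going against that conventional wisdom, this paper proposes the deterministic policy gradient, or DPG.
The paper works off-policy: a stochastic behavior policy selects the actions, and a deterministic target policy is learned.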
policy gradient
\[\begin{align}J(\pi_\theta)=&\int_S \rho^\pi(s)\int_A \pi_\theta (s,a)r(s,a)dads\\=&E_{s\sim \rho^\pi ,a\sim \pi_\theta}[r(s,a)]\end{align}\]
\(\rho^\pi(s') = \int_S \sum_{t=1}^ {\infty} \gamma^{t-1}p_1(s)p(s\to s',t,\pi)ds\)
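\(p_1(s)\) is the probability (density) that the initial state is s.
\(p(s\to s',t,\pi)\) is the probability density of reaching \(s'\) from state s after t time steps under policy \(\pi\).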
stochastic policy gradient
policy gradient theorem:
\[\begin{align} \nabla_\theta J(\pi_\theta)=&\int_S \rho^\pi(s)\int_A \nabla_\theta \pi_\theta (s,a)Q^\pi(s,a)dads\\=&E_{s\sim \rho^\pi ,a\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)Q^\pi(s,a)]\end{align}\]
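As a sanity check on the score-function form of the theorem, the sketch below estimates the gradient by Monte Carlo on a one-step, single-state problem, where \(Q^\pi(s,a)=r(a)\). The Gaussian policy \(\pi_\theta=\mathcal{N}(\theta,\sigma^2)\), the reward \(r(a)=-(a-2)^2\), and all function names are assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def reward(a):
    # Hypothetical one-step reward, maximised at a = 2.
    return -(a - 2.0) ** 2

def score_function_gradient(theta, sigma, n_samples, rng):
    # Sample actions from the Gaussian policy N(theta, sigma^2).
    a = theta + sigma * rng.standard_normal(n_samples)
    # grad_theta log pi_theta(a) for a Gaussian with mean theta.
    score = (a - theta) / sigma**2
    # Monte Carlo estimate of E[grad log pi * Q], with Q = r in one step.
    return np.mean(score * reward(a))

rng = np.random.default_rng(0)
estimate = score_function_gradient(theta=0.0, sigma=0.5, n_samples=200_000, rng=rng)

# Closed form: J(theta) = -((theta - 2)^2 + sigma^2), so dJ/dtheta = -2 (theta - 2).
exact = -2.0 * (0.0 - 2.0)
print(estimate, exact)
```

The estimate should land close to the closed-form value, but it is noisy: the estimator averages over sampled actions, which is the cost that the deterministic gradient later avoids.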
stochastic Actor-Critic algorithm
The critic estimates the action-value function with TD learning: \(Q^w(s,a)\approx Q^\pi(s,a)\)
\[\begin{align} \nabla_\theta J(\pi_\theta)\approx&\int_S \rho^\pi(s)\int_A \nabla_\theta \pi_\theta (s,a)Q^w(s,a)dads\\=&E_{s\sim \rho^\pi ,a\sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s,a)Q^w(s,a)]\end{align}\]
Off-policy AC
behavior policy \(\beta(a|s)\neq \pi_\theta(a|s)\)
\[\begin{align}J_\beta(\pi_\theta)=&\int_S \rho^\beta(s)V^\pi(s)ds\\ =&\int_S \int_A \rho^\beta(s) \pi_\theta (s,a)Q^\pi(s,a)dads\end{align}\]
\[\begin{align}\nabla_\theta J_\beta(\pi_\theta)\approx&\int_S \int_A \rho^\beta(s)\nabla_\theta \pi_\theta (s,a)Q^\pi(s,a)dads\\=&E_{s\sim \rho^\beta ,a\sim \beta}[\frac{\pi_\theta(a|s)}{\beta(a|s)} \nabla_\theta \log \pi_\theta(s,a)Q^\pi(s,a)]\end{align}\]
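To illustrate the role of the ratio \(\pi_\theta(a|s)/\beta(a|s)\), the sketch below draws actions from a behavior policy and reweights them to recover the target policy's gradient. It uses a one-step, single-state problem with Gaussian \(\pi_\theta\) and \(\beta\) and a quadratic reward; these choices and all names are assumptions for illustration only.

```python
import numpy as np

def gaussian_log_pdf(a, mean, std):
    # Log density of N(mean, std^2) evaluated at a.
    return -0.5 * ((a - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2.0 * np.pi)

def reward(a):
    # Hypothetical one-step reward, so Q(a) = r(a).
    return -(a - 2.0) ** 2

def off_policy_gradient(theta, sigma, beta_mean, beta_std, n_samples, rng):
    # Actions come from the behavior policy beta, not from pi_theta.
    a = beta_mean + beta_std * rng.standard_normal(n_samples)
    # Importance ratio pi_theta(a) / beta(a), computed in log space.
    ratio = np.exp(gaussian_log_pdf(a, theta, sigma)
                   - gaussian_log_pdf(a, beta_mean, beta_std))
    # grad_theta log pi_theta(a) for a Gaussian with mean theta.
    score = (a - theta) / sigma**2
    return np.mean(ratio * score * reward(a))

rng = np.random.default_rng(0)
estimate = off_policy_gradient(theta=0.0, sigma=0.5, beta_mean=0.5,
                               beta_std=1.0, n_samples=400_000, rng=rng)

# Same closed form as on-policy: dJ/dtheta = -2 (theta - 2).
exact = -2.0 * (0.0 - 2.0)
print(estimate, exact)
```

The behavior policy here is deliberately wider than the target policy, which keeps the ratio bounded; a narrower \(\beta\) would make the weights, and hence the variance, blow up.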
DPG
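Model-free RL algorithms are usually based on GPI (generalised policy iteration: interleaving policy evaluation with policy improvement).
In a continuous action space, policy improvement by greedily finding the global maximum of Q is impractical, so instead of solving for the global maximum directly, the policy is moved in the direction of increasing Q.
In formulas, for a deterministic policy \(a = \mu(s)\):
The usual policy improvement step:
\(\mu^{k+1}(s)=\underset{a}{\arg\max}\,Q^{\mu^k}(s,a)\)
Since finding this global maximum is infeasible, we instead move \(\theta\) along the gradient of Q: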
\[\theta^{k+1}=\theta^{k}+\alpha E_{s\sim\rho^{\mu^k}}[\nabla_\theta Q^{\mu^k}(s,\mu_\theta(s))]\]
Applying the chain rule:
\[\theta^{k+1}=\theta^{k}+\alpha E_{s\sim\rho^{\mu^k}}[\nabla_\theta \mu_\theta(s)\nabla_a Q^{\mu^k}(s,a)|_{a=\mu_\theta(s)}]\]
One concern: when the policy changes, \(\rho^\mu\) changes too, so it is not obvious that this update yields a policy improvement. The paper proves that it does.
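The chain-rule form of the update can be checked numerically. The sketch below uses a hypothetical scalar actor \(\mu_\theta(s)=\theta s\) and a hypothetical critic \(Q(s,a)=-(a-s)^2\), compares the chain-rule gradient with a finite-difference gradient, and then runs a few gradient-ascent steps. The states are a fixed sample, so the change in \(\rho^\mu\) is deliberately ignored here.

```python
import numpy as np

def q(s, a):
    # Hypothetical action-value function, maximised at a = s.
    return -(a - s) ** 2

def dq_da(s, a):
    # Analytic gradient of q with respect to the action.
    return -2.0 * (a - s)

def mu(theta, s):
    # Hypothetical deterministic policy, linear in the state.
    return theta * s

def dmu_dtheta(theta, s):
    # Gradient of mu with respect to theta.
    return s

states = np.array([-1.0, 0.5, 2.0])  # fixed stand-in for samples from rho
theta = 0.3

# Chain rule: grad_theta mu * grad_a Q evaluated at a = mu_theta(s).
chain_rule = np.mean(dmu_dtheta(theta, states) * dq_da(states, mu(theta, states)))

# Central finite difference of the averaged Q(s, mu_theta(s)).
eps = 1e-6
finite_diff = np.mean(q(states, mu(theta + eps, states))
                      - q(states, mu(theta - eps, states))) / (2.0 * eps)

# Gradient ascent on theta; the greedy solution for this Q is theta = 1.
alpha = 0.1
for _ in range(200):
    theta += alpha * np.mean(dmu_dtheta(theta, states)
                             * dq_da(states, mu(theta, states)))
print(chain_rule, finite_diff, theta)
```

No maximisation over actions is performed at any point; the policy simply follows \(\nabla_a Q\) toward the greedy action.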
Deterministic Policy Gradient Theorem
\[\begin{align}J(\mu_\theta)=&\int_S \rho^\mu(s) r(s,\mu_\theta(s))ds\\=&E_{s\sim \rho^\mu}[r(s,\mu_\theta(s))]\end{align}\]
\[\begin{align}\nabla_\theta J(\mu_\theta)=&\int_S \rho^\mu(s) \nabla_\theta \mu_\theta (s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}ds\\=&E_{s\sim \rho^\mu}[\nabla_\theta \mu_\theta(s) \nabla_a Q^\mu(s,a)|_{a=\mu_\theta(s)}]\end{align}\]
DPG is a special (limiting) case of SPG
SPG: \(\pi_{\mu_\theta,\sigma}\), DPG: \(\mu_\theta\)
\[\lim_{\sigma \to 0}\nabla_\theta J(\pi_{\mu_\theta,\sigma})=\nabla_\theta J(\mu_\theta)\]
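A one-step, single-state example makes the limit concrete. It is chosen here for illustration and is not taken from the paper: take \(\mu_\theta=\theta\), \(\pi_{\mu_\theta,\sigma}=\mathcal{N}(\theta,\sigma^2)\) and \(Q(a)=r(a)=\sin a\). Then
\[\begin{align*}J(\pi_{\mu_\theta,\sigma})&=E_{\epsilon\sim\mathcal{N}(0,1)}[\sin(\theta+\sigma\epsilon)]=e^{-\sigma^2/2}\sin\theta\\ \nabla_\theta J(\pi_{\mu_\theta,\sigma})&=e^{-\sigma^2/2}\cos\theta\ \xrightarrow{\sigma\to 0}\ \cos\theta\\ \nabla_\theta J(\mu_\theta)&=\nabla_\theta \mu_\theta\,\nabla_a Q(a)|_{a=\theta}=\cos\theta\end{align*}\]
so the stochastic gradient converges to the deterministic one as the policy variance vanishes.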
on-policy DPG
\[\begin{align*} \delta_t &= r_t+ \gamma Q^w(s_{t+1},a_{t+1})-Q^w(s_t,a_t)\\ w_{t+1} &= w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)\\ \theta_{t+1} &= \theta_{t} + \alpha_\theta \nabla_\theta \mu_{\theta}(s_t)\nabla_a Q^w(s_t,a_t)|_{a=\mu_\theta(s_t)}\end{align*}\]
off-policy DPG
\[\begin{align*}J_{\beta}(\mu_\theta)=&\int_S \rho^\beta(s)V^\mu(s)ds\\ =&\int_S\rho^\beta(s)Q^\mu (s,\mu_\theta(s))ds \end{align*}\]
\[\begin{align*}\nabla_\theta J_{\beta}(\mu_\theta)\approx& \int_S \rho^\beta(s) \nabla_\theta \mu_\theta(a|s)Q^\mu(s,a)ds\\ =&E_{s\sim \rho^\beta} [\nabla_\theta \mu_\theta(s)\nabla_a Q^\mu (s,a)|_{a=\mu_\theta(s)}] \end{align*}\]
\[\begin{align*} \delta_t &= r_t+ \gamma Q^w(s_{t+1},\mu_\theta(s_{t+1}))-Q^w(s_t,a_t)\\ w_{t+1} &= w_t + \alpha_w \delta_t \nabla_w Q^w(s_t,a_t)\\ \theta_{t+1} &= \theta_{t} + \alpha_\theta \nabla_\theta \mu_{\theta}(s_t)\nabla_a Q^w(s_t,a_t)|_{a=\mu_\theta(s_t)}\end{align*}\]
No importance sampling (IS) is needed, because the gradient contains no integral over actions.
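A single off-policy update can be sketched as follows. The scalar actor \(\mu_\theta(s)=\theta s\), the critic features, and every name below are hypothetical choices for illustration. The TD target bootstraps from \(\mu_\theta(s_{t+1})\), as in the paper's Q-learning-style critic, while \(a_t\) is whatever the behavior policy happened to pick.

```python
import numpy as np

def mu(theta, s):
    # Hypothetical deterministic actor: a = theta * s.
    return theta * s

def features(s, a):
    # Hypothetical critic features, so Q^w(s, a) = w . features(s, a).
    return np.array([s * a, a * a, s * s, 1.0])

def q(w, s, a):
    return float(w @ features(s, a))

def dq_da(w, s, a):
    # Analytic action-gradient of Q^w for the features above.
    return w[0] * s + 2.0 * w[1] * a

def off_policy_dpg_step(theta, w, s, a, r, s_next, gamma, alpha_w, alpha_theta):
    # TD error: bootstrap from the target policy's action at s_next.
    delta = r + gamma * q(w, s_next, mu(theta, s_next)) - q(w, s, a)
    # Critic step: grad_w Q^w(s, a) is the feature vector (linear critic).
    w_new = w + alpha_w * delta * features(s, a)
    # Actor step: grad_theta mu = s, grad_a Q^w taken at a = mu_theta(s).
    # No importance ratio appears anywhere in the update.
    theta_new = theta + alpha_theta * s * dq_da(w, s, mu(theta, s))
    return theta_new, w_new, delta

theta, w = 0.5, np.array([1.0, -0.5, 0.0, 0.0])
# a = 3.0 is an exploratory behavior action; mu_theta(s) would be 1.0.
theta_new, w_new, delta = off_policy_dpg_step(
    theta, w, s=2.0, a=3.0, r=1.0, s_next=1.0,
    gamma=0.9, alpha_w=0.1, alpha_theta=0.01)
print(theta_new, w_new, delta)
```

The behavior action enters only through the critic's TD error; the actor step evaluates \(\nabla_a Q^w\) at \(\mu_\theta(s_t)\), which is why no correction for \(\beta\) is required there.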