Model-based RL

Note: the following is based on CS598.

1. Estimate Model

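Given a dataset $D$ of transitions $(s, a, r, s')$, we estimate the model $\hat M = (\hat P, \hat R)$ by maximum likelihood, which in the tabular case means empirical transition frequencies and empirical mean rewards. Let $n$ denote the number of samples for each state-action pair $(s, a)$.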
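For a tabular MDP the maximum-likelihood estimate reduces to counting. The following is a minimal sketch, assuming integer-indexed states and actions and a dataset of `(s, a, r, s_next)` tuples; the name `estimate_model` and the choice to leave unvisited pairs at zero are illustrative assumptions, not something specified in the notes.

```python
def estimate_model(dataset, num_states, num_actions):
    """Maximum-likelihood (empirical-frequency) estimate of a tabular MDP.

    dataset: iterable of (s, a, r, s_next) tuples with integer states/actions.
    Returns (P_hat, R_hat, counts), where P_hat[s][a][s2] is the empirical
    transition frequency and R_hat[s][a] is the empirical mean reward.
    """
    counts = [[0] * num_actions for _ in range(num_states)]
    reward_sum = [[0.0] * num_actions for _ in range(num_states)]
    next_counts = [[[0] * num_states for _ in range(num_actions)]
                   for _ in range(num_states)]
    for s, a, r, s_next in dataset:
        counts[s][a] += 1
        reward_sum[s][a] += r
        next_counts[s][a][s_next] += 1

    P_hat = [[[0.0] * num_states for _ in range(num_actions)]
             for _ in range(num_states)]
    R_hat = [[0.0] * num_actions for _ in range(num_states)]
    for s in range(num_states):
        for a in range(num_actions):
            n = counts[s][a]
            if n == 0:
                continue  # unvisited pair: left at zero (an arbitrary choice)
            R_hat[s][a] = reward_sum[s][a] / n
            P_hat[s][a] = [c / n for c in next_counts[s][a]]
    return P_hat, R_hat, counts
```

The analysis in this post assumes every pair has the same count $n$; the `counts` table returned here makes it easy to check whether a given dataset satisfies that.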

2. Analysis of Certainty-Equivalence RL

2.1 Naive analysis

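By Hoeffding's inequality, for i.i.d. random variables $X_1, \dots, X_n$ taking values in an interval of length $B$, with probability at least $1-\delta$,

\begin{align*} \left| \frac{1}{n}\sum_{i=1}^{n} X_i - \mathbb{E}[X_1] \right| \leq B \sqrt{\frac{1}{2n} \ln \frac{2}{\delta}} \end{align*}

Assume rewards lie in $[0, R_{max}]$. Splitting the failure probability $\delta$ evenly between rewards and transitions, and then evenly across the $|S \times A|$ reward events and the $|S \times A \times S|$ transition events, a union bound gives, with probability at least $1-\delta$:

\begin{align*} \max_{s,a} |\hat R(s,a) - R(s,a)| &\leq R_{max} \sqrt{\frac{1}{2n} \ln \frac{4|S \times A|}{\delta}} \\ \max_{s,a,s'} |\hat P(s'|s,a) - P(s'|s,a)| &\leq \sqrt{\frac{1}{2n} \ln \frac{4|S \times A \times S|}{\delta}} \end{align*}

So, treating $P(s,a)$ as an $|S|$-dimensional vector, we have:

\begin{align*} \max_{s,a} ||\hat P(s,a) - P(s,a)||_1 \leq |S| \sqrt{\frac{1}{2n} \ln \frac{4|S \times A \times S|}{\delta}} \end{align*}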
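To get a feel for the magnitudes, the Hoeffding-plus-union-bound quantities can be evaluated numerically. This is a small sketch, assuming rewards in $[0, R_{max}]$ and the even split of $\delta$ described above; the function name and the exact constants inside the logarithms are assumptions of this sketch.

```python
import math

def naive_error_bounds(n, num_states, num_actions, delta, r_max=1.0):
    """Return (eps_R, eps_P) for the naive analysis.

    eps_R bounds max_{s,a} |R_hat - R|; eps_P bounds the L1 error of each
    transition row, obtained as |S| times the per-entry bound.
    """
    sa = num_states * num_actions
    # delta/2 spread over |S x A| reward events (two-sided Hoeffding).
    eps_r = r_max * math.sqrt(math.log(4 * sa / delta) / (2 * n))
    # delta/2 spread over |S x A x S| transition-entry events.
    eps_entry = math.sqrt(math.log(4 * sa * num_states / delta) / (2 * n))
    # L1 norm of an |S|-dimensional vector is at most |S| times its max entry.
    eps_p = num_states * eps_entry
    return eps_r, eps_p

eps_r, eps_p = naive_error_bounds(n=1000, num_states=10, num_actions=2, delta=0.1)
print(eps_r, eps_p)
```

Both quantities shrink as $1/\sqrt{n}$, while `eps_p` grows linearly in the number of states, which is the factor the later refinement targets.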

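  • Lemma 1 (Simulation Lemma)

If $\max_{s,a} |\hat R(s,a) - R(s,a)| \leq \epsilon_R$ and $\max_{s,a} ||\hat P(s,a) - P(s,a)||_1 \leq \epsilon_P$, then for any policy $\pi$, we have

\begin{align*} ||V^\pi_{\hat M} - V^\pi_M||_{\infty} \leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2} \end{align*}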

Proof: Fix a state $s$ and let $a = \pi(s)$. Write $P(s,a)$ for the vector $P(\cdot|s,a)$, so that $P(s,a) \cdot V = \sum_{s'} P(s'|s,a) V(s')$.
\begin{align*}
| V^{\pi}_{\hat M}(s) - V^{\pi}_{M}(s) | &= | \hat R(s,a) + \gamma \hat P(s,a) \cdot V^\pi_{\hat M} - R(s,a) - \gamma P(s,a) \cdot V^\pi_M| \\
&\leq |\hat R(s,a)-R(s,a)| + \gamma | \hat P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_M| \\
&\leq \epsilon_R + \gamma |\hat P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_{\hat M} + P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_M| \\
&\leq \epsilon_R +\gamma |(\hat P(s,a) - P(s,a)) \cdot V^\pi_{\hat M}| +\gamma | P(s,a) \cdot (V^\pi_{\hat M} - V^\pi_M) | \\
&\leq \epsilon_R + \gamma |(\hat P(s,a) - P(s,a)) \cdot (V^\pi_{\hat M} - \frac{R_{max}}{2(1-\gamma)} \mathbf{1} )| + \gamma ||V^\pi_{\hat M} - V^\pi_M||_{\infty} \\
&\leq \epsilon_R + \gamma ||\hat P(s,a) - P(s,a)||_{1} \times ||V^\pi_{\hat M} - \frac{R_{max}}{2(1-\gamma)} \mathbf{1} ||_{\infty} + \gamma ||V^\pi_{\hat M} - V^\pi_M||_{\infty} \\
&\leq \epsilon_R + \gamma \frac{ \epsilon_P R_{max}}{2(1-\gamma)} + \gamma ||V^\pi_{\hat M} - V^\pi_M||_{\infty}
\end{align*}
The fifth line uses $(\hat P(s,a) - P(s,a)) \cdot \mathbf{1} = 0$, since both vectors sum to 1, so a constant can be subtracted for free. The last line uses $V^\pi_{\hat M}(s) \in [0, \frac{R_{max}}{1-\gamma}]$, which puts every entry within $\frac{R_{max}}{2(1-\gamma)}$ of the midpoint. Since the bound holds for every $s$, taking the maximum over $s$ on the left-hand side gives
\begin{align*}
(1-\gamma) ||V^\pi_{\hat M} - V^\pi_M||_{\infty} &\leq \epsilon_R + \gamma \frac{ \epsilon_P R_{max}}{2(1-\gamma)} \\
||V^\pi_{\hat M} - V^\pi_M||_{\infty} &\leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}
\end{align*}
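The Simulation Lemma can be sanity-checked numerically. The sketch below builds two random tabular MDPs with rewards in $[0, R_{max}]$ (the second one merely stands in for an arbitrary estimate $\hat M$), evaluates a fixed deterministic policy in both by iterative policy evaluation, and compares the value gap with the bound. All function names here are hypothetical helpers written for this illustration.

```python
import random

def random_mdp(n_states, n_actions, r_max, rng):
    """Random tabular MDP: P[s][a] is a distribution over next states,
    R[s][a] is a reward in [0, r_max]."""
    P, R = [], []
    for _ in range(n_states):
        rows, rewards = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
            rewards.append(rng.uniform(0.0, r_max))
        P.append(rows)
        R.append(rewards)
    return P, R

def policy_value(P, R, policy, gamma, iters=1000):
    """Iterative policy evaluation for a deterministic policy (list of actions)."""
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [R[s][policy[s]]
             + gamma * sum(p * v for p, v in zip(P[s][policy[s]], V))
             for s in range(len(R))]
    return V

def simulation_gap_and_bound(M, M_hat, policy, gamma, r_max):
    """Return (||V_hat - V||_inf, Simulation Lemma bound) for one policy."""
    (P, R), (P_hat, R_hat) = M, M_hat
    pairs = [(s, a) for s in range(len(R)) for a in range(len(R[0]))]
    eps_r = max(abs(R_hat[s][a] - R[s][a]) for s, a in pairs)
    eps_p = max(sum(abs(x - y) for x, y in zip(P_hat[s][a], P[s][a]))
                for s, a in pairs)
    V = policy_value(P, R, policy, gamma)
    V_hat = policy_value(P_hat, R_hat, policy, gamma)
    gap = max(abs(x - y) for x, y in zip(V_hat, V))
    bound = eps_r / (1 - gamma) + gamma * eps_p * r_max / (2 * (1 - gamma) ** 2)
    return gap, bound

rng = random.Random(0)
M = random_mdp(5, 3, 1.0, rng)
M_hat = random_mdp(5, 3, 1.0, rng)  # stands in for an arbitrary estimate
policy = [rng.randrange(3) for _ in range(5)]
gap, bound = simulation_gap_and_bound(M, M_hat, policy, gamma=0.9, r_max=1.0)
print(gap, bound)
```

In experiments of this kind the observed gap tends to sit well below the bound, since the lemma is a worst-case statement.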

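  • Lemma 2 (Evaluation error to decision loss)

For all $s \in S$, $V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) \leq 2 \sup_{\pi} ||V^\pi_{\hat M} - V^\pi_M||_{\infty}$, where $\pi^*_{\hat M}$ is an optimal policy of $\hat M$.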

Proof:
\begin{align*}
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) &= V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_M}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \\
&\leq V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_{\hat M}}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \qquad (\pi^*_{\hat M} \text{ maximizes } V_{\hat M}) \\
&\leq || V^*_M - V^{\pi^*_M}_{\hat M}||_{\infty} + || V^{\pi^*_{\hat M}}_{\hat M} - V^{\pi^*_{\hat M}}_M||_{\infty} \\
&\leq 2\left[\frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}\right] \qquad \text{(Simulation Lemma)}
\end{align*}
Here $V^*_M = V^{\pi^*_M}_M$, so each norm compares one fixed policy across the two models. Thus, substituting the values of $\epsilon_R$ and $\epsilon_P$ from the Hoeffding bounds,
\begin{align*}
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) &= \tilde{O} \left(\frac{|S|}{\sqrt{n}(1 - \gamma)^2}\right) \qquad \forall s \in S.
\end{align*}
Here $\tilde{O}(\cdot)$ suppresses poly-logarithmic dependences on $|S|$ and $|A|$.
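The decision-loss bound can also be illustrated end to end: estimate transitions from samples, plan in the estimated model, and measure how much value the resulting policy loses in the true model. This is a rough sketch under simplifying assumptions: rewards are treated as known (so $\epsilon_R = 0$), the two-state MDP is made up for illustration, and all helper names are hypothetical.

```python
import random

def q_values(P, R, V, gamma):
    """One-step lookahead Q[s][a] = R[s][a] + gamma * P[s][a] . V."""
    return [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(len(R[s]))] for s in range(len(R))]

def value_iteration(P, R, gamma, iters=1000):
    V = [0.0] * len(R)
    for _ in range(iters):
        V = [max(q) for q in q_values(P, R, V, gamma)]
    return V

def policy_value(P, R, policy, gamma, iters=1000):
    V = [0.0] * len(R)
    for _ in range(iters):
        Q = q_values(P, R, V, gamma)
        V = [Q[s][policy[s]] for s in range(len(R))]
    return V

def empirical_transitions(P, n, rng):
    """Draw n next states per (s, a) from P and return empirical frequencies."""
    n_states = len(P)
    P_hat = []
    for s in range(n_states):
        rows = []
        for a in range(len(P[s])):
            draws = rng.choices(range(n_states), weights=P[s][a], k=n)
            rows.append([draws.count(s2) / n for s2 in range(n_states)])
        P_hat.append(rows)
    return P_hat

def certainty_equivalence_loss(P, R, n, gamma, r_max, rng):
    """Return (max_s loss of the plan-in-M_hat policy, 2 x simulation bound)."""
    P_hat = empirical_transitions(P, n, rng)
    V_hat = value_iteration(P_hat, R, gamma)
    # Greedy policy with respect to the estimated model.
    pi_hat = [max(range(len(q)), key=q.__getitem__)
              for q in q_values(P_hat, R, V_hat, gamma)]
    V_star = value_iteration(P, R, gamma)
    V_pi = policy_value(P, R, pi_hat, gamma)
    loss = max(vs - vp for vs, vp in zip(V_star, V_pi))
    eps_p = max(sum(abs(x - y) for x, y in zip(P_hat[s][a], P[s][a]))
                for s in range(len(P)) for a in range(len(P[s])))
    # eps_R = 0 here because rewards are assumed known.
    bound = 2 * gamma * eps_p * r_max / (2 * (1 - gamma) ** 2)
    return loss, bound

# A small hand-written MDP (2 states, 2 actions), purely illustrative.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
R = [[0.0, 0.1], [1.0, 0.3]]
loss, bound = certainty_equivalence_loss(P, R, n=200, gamma=0.9, r_max=1.0,
                                         rng=random.Random(0))
print(loss, bound)
```

Note that `bound` here uses the realized $\ell_1$ error rather than its high-probability upper bound, so it checks the deterministic part of the argument only.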

2.2 Improving $|S|$ to $\sqrt{|S|}$

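For any vector $v \in \mathbb{R}^{|S|}$, we have

\begin{align*} ||v||_1 = \max_{u \in \{-1,1\}^{|S|}} u^\top v \end{align*}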
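The identity between the $\ell_1$ norm and a maximum over sign vectors is easy to confirm by brute force on a small vector. A minimal sketch, with a hypothetical function name; the enumeration is exponential in the dimension and is meant only as a check.

```python
from itertools import product

def l1_via_sign_vectors(v):
    """Maximize u . v over all sign vectors u in {-1, +1}^len(v)."""
    return max(sum(u_i * v_i for u_i, v_i in zip(u, v))
               for u in product((-1, 1), repeat=len(v)))

v = [0.5, -2.0, 1.5]
print(l1_via_sign_vectors(v), sum(abs(x) for x in v))
```

The maximum is attained by choosing each sign to match the sign of the corresponding entry, which is why the two printed values agree.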

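So for any fixed $(s,a)$ and any fixed $u \in \{-1,1\}^{|S|}$, $u^\top \hat P(s,a)$ is the average of $n$ i.i.d. random variables bounded in $[-1,1]$, with expectation $u^\top P(s,a)$. By Hoeffding's inequality, with probability at least $1 - \frac{\delta}{2|S \times A| \cdot 2^{|S|}}$, we have

\begin{align*} |u^\top (\hat P(s,a) - P(s,a))| \leq 2\sqrt{\frac{1}{2n} \ln \frac{4|S \times A| \cdot 2^{|S|}}{\delta}} \end{align*}

So, by a union bound over all $(s,a)$ and all $2^{|S|}$ sign vectors $u$, with probability at least $1 - \delta/2$, we have

\begin{align*} \max_{s,a} ||\hat P(s,a) - P(s,a)||_1 \leq 2\sqrt{\frac{1}{2n} \ln \frac{4|S \times A| \cdot 2^{|S|}}{\delta}} \end{align*}

So, since $\ln 2^{|S|} = |S| \ln 2$, we get $\epsilon_P = \tilde{O}(\sqrt{|S|/n})$, and

\begin{align*} V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) = \tilde{O} \left(\frac{\sqrt{|S|}}{\sqrt{n}(1 - \gamma)^2}\right) \qquad \forall s \in S. \end{align*}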
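To see how the two $\ell_1$ bounds compare, they can be evaluated side by side. The sketch below assumes the Hoeffding-plus-union-bound forms used in this section: the naive bound is $|S|$ times a per-entry bound over $|S \times A \times S|$ events, and the refined bound takes a union over $|S \times A| \cdot 2^{|S|}$ events with variables in $[-1, 1]$. Function names and exact constants are assumptions of this sketch.

```python
import math

def l1_bound_naive(n, num_states, num_actions, delta):
    """|S| times the per-entry Hoeffding bound (union over |S x A x S| events)."""
    sas = num_states * num_actions * num_states
    return num_states * math.sqrt(math.log(4 * sas / delta) / (2 * n))

def l1_bound_sign_vectors(n, num_states, num_actions, delta):
    """Bound from a union over |S x A| * 2^|S| (pair, sign-vector) events."""
    sa = num_states * num_actions
    # ln(4 |S x A| 2^|S| / delta), expanded to avoid forming 2^|S| explicitly.
    log_term = math.log(4 * sa / delta) + num_states * math.log(2)
    # Factor 2 is the width of the interval [-1, 1].
    return 2 * math.sqrt(log_term / (2 * n))

for num_states in (2, 10, 100, 1000):
    naive = l1_bound_naive(10_000, num_states, 5, 0.1)
    improved = l1_bound_sign_vectors(10_000, num_states, 5, 0.1)
    print(num_states, round(naive, 4), round(improved, 4))
```

For very small state spaces the constants can make the second bound slightly larger than the first; the gain is in the scaling, roughly $\sqrt{|S|}$ instead of $|S|$, which dominates as $|S|$ grows.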
