Bandit UCB Derivation

A derivation of Upper-Confidence-Bound Action Selection from the bandit algorithms in Chapter 2 of Reinforcement Learning by Richard S. Sutton and Andrew G. Barto.

Preliminaries

Markov Inequality

For any r.v. (random variable) X and constant a > 0,

P(|X| \geq a) \leq \frac{E[|X|]}{a}

Prf:

P(|X| \geq a)=\int _{|x| \geq a}{f(x)}dx \leq \int_{|x|\geq a}{\frac{|x|}{a}f(x)dx}\leq \int{\frac{|x|}{a}f(x)dx}= \frac{1}{a}\int{|x|f(x)dx}=\frac{1}{a}E[|X|]
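
As a quick numeric illustration (not part of the derivation), a Monte Carlo check of Markov's inequality; the Exponential(1) distribution here is an arbitrary choice of nonnegative X:

```python
import random

# Monte Carlo sanity check of Markov's inequality, using an Exponential(1)
# variable as an arbitrary illustrative choice of nonnegative X.
random.seed(0)
n = 100_000
a = 3.0
samples = [random.expovariate(1.0) for _ in range(n)]

p_tail = sum(x >= a for x in samples) / n  # estimates P(|X| >= a)
bound = (sum(samples) / n) / a             # estimates E[|X|] / a

assert p_tail <= bound  # Markov: P(|X| >= a) <= E[|X|] / a
print(p_tail, bound)
```

The bound is loose here (the true tail is e^{-3} ≈ 0.05 while E[X]/a = 1/3), which is typical of Markov's inequality.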

Chebyshev's Inequality

Let X have mean \mu and variance \sigma^2. Then for any a>0,

P(|X-\mu|\geq a)\leq \frac{\sigma^2}{a^2}

Prf:

P(|X-\mu|\geq a)=\int_{|x-\mu|\geq a}{f(x)dx}\leq \int{\frac{|x-\mu|^2}{a^2}f(x)dx}=\frac{1}{a^2}Var(X)=\frac{\sigma^2}{a^2}
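
A corresponding numeric check for Chebyshev's inequality; Uniform(0, 1), with \mu = 1/2 and \sigma^2 = 1/12, is again an arbitrary illustrative choice:

```python
import random

# Monte Carlo sanity check of Chebyshev's inequality on Uniform(0, 1),
# which has mu = 1/2 and sigma^2 = 1/12 (illustrative choice).
random.seed(0)
n = 100_000
mu, var = 0.5, 1.0 / 12.0
a = 0.4

samples = [random.random() for _ in range(n)]
p_tail = sum(abs(x - mu) >= a for x in samples) / n  # P(|X - mu| >= a)
bound = var / a ** 2                                 # sigma^2 / a^2

assert p_tail <= bound  # Chebyshev: P(|X - mu| >= a) <= sigma^2 / a^2
print(p_tail, bound)
```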

Chernoff's Inequality

For any r.v. X and constants a>0 and t>0,

P(X\geq a) \leq \frac{E(e^{tX})}{e^{ta}}

Prf:

P(X\geq a)=P(e^{tX}\geq e^{ta})\leq \frac{E(e^{tX})}{e^{ta}}, by Markov's Inequality.
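
Since the bound holds for every t > 0, one usually optimizes over t. A small numeric sketch for a standard normal X (for which E[e^{tX}] = e^{t^2/2}, a standard fact), where the optimum is t = a:

```python
import math

# For standard normal X, Chernoff gives P(X >= a) <= e^{t^2/2 - t a} for
# every t > 0; minimizing over t yields t = a and bound e^{-a^2/2}.
a = 2.0
ts = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
bounds = [math.exp(t * t / 2 - t * a) for t in ts]
best = min(bounds)

assert abs(best - math.exp(-a * a / 2)) < 1e-12  # optimum attained at t = a
print(best)  # ~0.135, vs. the true tail P(X >= 2) ~ 0.0228
```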

Hoeffding Lemma

Lemma: If E[X]=0 and a \leq X \leq b, then for any \lambda \in \mathbb{R},

E[e^{\lambda X}]\leq e^{\frac{1}{8}\lambda^2(b-a)^2}.

Tools:

1. Jensen's Inequality:

If f is a convex function and 0 \leq \lambda_1, \lambda_2 \leq 1 with \lambda_1+\lambda_2 = 1, then

f(\lambda_1 x_1 + \lambda_2 x_2)\leq \lambda_1 f(x_1) + \lambda_2 f(x_2).

2. Taylor's Theorem:

If all derivatives of f exist at the point a, then

f(x) = f(a) + f'(a)(x-a) + \frac{1}{2}f''(a)(x-a)^2+\dots + \frac{1}{k!}f^{(k)}(a)(x-a)^k+\dots

Prf:

1. f(x)=e^{\lambda x} is a convex function, so for any \alpha \in (0, 1),

f(\alpha a + (1- \alpha)b)\leq \alpha f(a) + (1-\alpha)f(b)=\alpha e^{\lambda a} + (1-\alpha)e^{\lambda b}

For\ x \in [a, b],\ let\ \alpha = \frac{b-x}{b-a},\ so\ that\ x=\alpha a + (1-\alpha)b.\ Then:

$$f(x)=e^{\lambda x}\leq \frac{b-x}{b-a}e^{\lambda a} + \frac{x-a}{b-a}e^{\lambda b}.$$

Taking expectations (we can always arrange E[X]=0 by shifting X, i.e., by choosing a and b accordingly):

$$E[e^{\lambda X}]\leq E\Big[\frac{b-X}{b-a}\Big]e^{\lambda a} + E\Big[\frac{X-a}{b-a}\Big]e^{\lambda b}=\frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b},\quad \text{since } E[X]=0.$$

We need to find a function \varphi\ s.t.

E[e^{\lambda X}]\leq \frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b}=e^{\varphi (t)}|_{t=?}

2. Let

\varphi(t)=-\theta t + \ln(1-\theta + \theta e^t),\ where\ \theta = -\frac{a}{b-a} > 0\ (note\ a<0<b\ since\ E[X]=0).

Then:

e^{\varphi(t)}|_{t=\lambda(b-a)}=e^{-\theta t}\times [1-\theta + \theta e^t]|_{t=\lambda(b-a)}= e^{-\theta \lambda (b-a)}[1- \theta + \theta e^{\lambda (b-a)}]|_{\theta = -\frac{a}{b-a}}= e^{\lambda a}\Big[\frac{b}{b-a}-\frac{a}{b-a}e^{\lambda(b-a)}\Big]=\frac{b}{b-a}e^{\lambda a}-\frac{a}{b-a}e^{\lambda b}.

So:

E(e^{\lambda X})\leq e^{\varphi(\lambda(b-a))}\leq e^{\frac{1}{8}\lambda^2(b-a)^2}\ [by\ Taylor's\ Theorem,\ step\ 3]

3.

\varphi(t) = \varphi(0) + \varphi'(0)\,t + \frac{1}{2} \varphi''(\xi)\,t^2\ for\ some\ \xi\ between\ 0\ and\ t,\ by\ Taylor's\ Theorem\ (Lagrange\ remainder\ form).

\varphi(0) = 0,

\varphi'(t)= -\theta + \frac{\theta e^t}{1 -\theta + \theta e^t},

\varphi'(0)=0,

\varphi''(t)=\frac{(1-\theta)\theta e^t}{(1-\theta +\theta e^t)^2}.

Since\ \frac{xy}{(x+y)^2}\leq \frac{1}{4},\ set\ x=1-\theta,\ y=\theta e^t:

\varphi''(t)\leq \frac{1}{4},\quad \varphi(t)\leq 0+0+\frac{1}{8}t^2,\quad E(e^{\lambda X})\leq e^{\varphi(t)}|_{t=\lambda(b-a)}\leq e^{\frac{1}{8}\lambda^2(b-a)^2}

So\ E[e^{\lambda X}]\leq e^{\frac{1}{8}\lambda^2(b-a)^2}.

This proves Hoeffding's Lemma.
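
The lemma can be spot-checked numerically. A sketch, assuming a Rademacher variable (X = ±1 with probability 1/2, so E[X]=0, a=-1, b=1) as the test case, for which E[e^{\lambda X}] = \cosh(\lambda):

```python
import math

# Numeric check of Hoeffding's Lemma for a Rademacher variable:
# E[e^{lam X}] = cosh(lam) <= e^{(1/8) lam^2 (b-a)^2} = e^{lam^2 / 2}.
checked = []
for lam in [0.1, 0.5, 1.0, 2.0, 5.0]:
    mgf = math.cosh(lam)                      # exact E[e^{lam X}]
    bound = math.exp(lam ** 2 * 2 ** 2 / 8)   # lemma bound with b - a = 2
    assert mgf <= bound
    checked.append((lam, mgf, bound))
print(checked)
```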

Hoeffding Bound

Let the random variables X_1, X_2,\dots,X_n be independent with E[X_i]=\mu and a \leq X_i \leq b for each i=1,2,\dots,n, where a, b are constants. Then for any \varepsilon \geq 0,

$$P(|\frac{1}{n}\sum_{i=1}^{n}{X_i}-\mu|>\varepsilon)\leq 2\times e^{\frac{-2 n \varepsilon^2}{(b-a)^2}}$$

Prf:

1.

z_i = X_i - E[X_i],\quad z = \frac{1}{n} \sum_{i=1}^{n}{z_i}=\frac{1}{n}\sum_{i=1}^{n}{X_i} -\mu

For\ arbitrary\ \lambda>0,\ by\ Chernoff's\ Inequality:\ P(z\geq \varepsilon)\leq \frac{E[e^{\lambda z}]}{e^{\lambda \varepsilon}} = \frac{E[e^{\lambda \frac{1}{n} \sum_{i=1}^{n}{z_i}}]}{e^{\lambda \varepsilon}}

Because X_1, X_2,\dots,X_n are independent, so are z_1,z_2,\dots,z_n; hence the expectation factors:

=\frac{\prod_{i=1}^{n}{E[e^{\lambda \frac{1}{n} z_i}]}}{e^{\lambda \varepsilon}} = e^{-\lambda \varepsilon} \prod_{i=1}^{n}{E[e^{\frac{1}{n}\lambda z_i}]}

And:

E[z_i]=E[X_i - E[X_i]]=0,\quad E[\tfrac{1}{n}z_i]=\tfrac{1}{n}E[z_i]=0,

and\ since\ a \leq X_i \leq b,\ we\ have\ a-\mu \leq z_i \leq b-\mu,\ an\ interval\ of\ width\ b-a.

By Hoeffding's Lemma applied to \frac{1}{n}z_i (whose range has width \frac{b-a}{n}),\ E[e^{\lambda \frac{1}{n} z_i}]\leq e^{\frac{1}{8}\lambda^2(\frac{b-a}{n})^2}=e^{\frac{1}{8 n^2}\lambda^2(b-a)^2}

P(z\geq \varepsilon)\leq e^{-\lambda \varepsilon} \times e^{\frac{1}{8n}\lambda^2(b-a)^2}\ (for\ arbitrary\ \lambda > 0)

To get the tightest bound, minimize the exponent over \lambda. Construct the function g(\lambda) = \frac{1}{8n}\lambda^2(b-a)^2-\lambda \varepsilon.

{g}'(\lambda)=\lambda \frac{(b-a)^2}{4n} - \varepsilon=0\ \Rightarrow\ \lambda=\frac{4n\varepsilon}{(b-a)^2},

{g}''(\lambda) = \frac{(b-a)^2}{4n} > 0,

so g(\lambda) attains its minimum at \lambda = \frac{4n\varepsilon}{(b-a)^2}, with g\big(\frac{4n\varepsilon}{(b-a)^2}\big)=-\frac{2n\varepsilon^2}{(b-a)^2}.
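
This minimization is easy to verify numerically; the constants below are arbitrary illustrative values:

```python
# Numeric check that lam* = 4 n eps / (b-a)^2 minimizes
# g(lam) = lam^2 (b-a)^2 / (8n) - lam * eps, with minimum -2 n eps^2 / (b-a)^2.
n, eps, a, b = 100, 0.1, 0.0, 1.0  # arbitrary illustrative constants

def g(lam):
    return lam ** 2 * (b - a) ** 2 / (8 * n) - lam * eps

lam_star = 4 * n * eps / (b - a) ** 2
g_min = -2 * n * eps ** 2 / (b - a) ** 2

assert abs(g(lam_star) - g_min) < 1e-12          # minimum value matches
assert all(g(lam_star + d) >= g_min              # nearby points are no lower
           for d in (-10, -1, -0.1, 0.1, 1, 10))
print(lam_star, g_min)
```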

P[z\geq \varepsilon]\leq e^{-\frac{2n\varepsilon^2}{(b-a)^2}},\ and\ applying\ the\ same\ argument\ to\ -z_i\ gives\ P[z\leq -\varepsilon]\leq e^{-\frac{2n\varepsilon^2}{(b-a)^2}}.

So: P[|z|\geq \varepsilon]\leq 2 e^{-\frac{2n \varepsilon^2}{(b-a)^2}}.
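
The two-sided bound can be sanity-checked by simulation; Bernoulli(0.5) samples (so a = 0, b = 1) and the sample size below are illustrative choices:

```python
import math
import random

# Monte Carlo check of the two-sided Hoeffding bound for Bernoulli(0.5).
random.seed(0)
n, eps, mu = 50, 0.2, 0.5
trials = 20_000

deviations = 0
for _ in range(trials):
    mean = sum(random.random() < mu for _ in range(n)) / n
    if abs(mean - mu) > eps:
        deviations += 1

p_hat = deviations / trials              # estimates P(|mean - mu| > eps)
bound = 2 * math.exp(-2 * n * eps ** 2)  # 2 e^{-2 n eps^2 / (b-a)^2}, b-a = 1
assert p_hat <= bound
print(p_hat, bound)
```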

Upper Confidence Bound (UCB)

Def (confidence interval): A 1-\delta confidence interval for a parameter p is an interval [\bar{p} - \epsilon, \bar{p} + \epsilon]\ s.t.

P(p \in [\bar{p}-\epsilon, \bar{p} + \epsilon])\geq 1 - \delta. \Leftrightarrow P(|\bar{p}-p|\leq \epsilon)\geq 1-\delta.

Suppose we toss a coin N times with unknown heads probability p, and use the N outcomes X_1,\dots,X_N \in \{0,1\} to estimate the range of p: \bar{p}=\frac{1}{N}(X_1+X_2+\dots+X_N).

We want: P(|\bar{p}-p|\geq \epsilon)\leq \delta \Leftrightarrow P(|\bar{p}-p|\leq \epsilon)\geq 1-\delta. By the Hoeffding Bound with a=0, b=1, n=N: P(|\bar{p}-p|\geq \epsilon)\leq 2 e^{-2N\epsilon^2}=\delta.

So: \epsilon = \sqrt{\frac{\ln(\frac{2}{\delta})}{2N}};\ \epsilon\ depends\ on\ both\ N\ and\ \delta.

P(p \in [\bar{p}-\sqrt{\tfrac{\ln(\frac{2}{\delta})}{2N}},\ \bar{p}+\sqrt{\tfrac{\ln(\frac{2}{\delta})}{2N}}])\geq 1 - \delta.
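
A worked example of this half-width formula; N = 1000 flips and \delta = 0.05 (95% confidence) are illustrative values:

```python
import math

# Half-width of the Hoeffding confidence interval: eps = sqrt(ln(2/delta)/(2N)).
def hoeffding_halfwidth(n: int, delta: float) -> float:
    return math.sqrt(math.log(2 / delta) / (2 * n))

eps = hoeffding_halfwidth(1000, 0.05)
print(eps)  # with 1000 flips at 95% confidence, the interval is p_bar +/- eps
```

Note the 1/\sqrt{N} rate: quadrupling the number of flips only halves the interval width.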

Bandit UCB Action Selection

A_t= \arg\max_a\Big[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\Big],\ where\ Q_t(a)\ is\ the\ estimated\ action\ value\ and\ q^*(a)\ is\ the\ true\ action\ value.

 We use Q_t(a) \approx q^*(a).

Prf:

1.


Where Q_t(a)=\frac{\sum_{i=1}^{t-1}{R_i \times 1_{A_i=a}}}{\sum_{i=1}^{t-1} 1_{A_i=a}}=\frac{R_1^a+R_2^a+\dots+R_{N_t(a)}^a}{N_t(a)},\ with\ R_j^a\ the\ j\text{-th}\ reward\ received\ from\ action\ a\ and\ N_t(a)=\sum_{i=1}^{t-1} 1_{A_i=a}.

2. Hoeffding Bound

Assume L\leq R_i^a\leq M and define d=M-L. Then, by the one-sided Hoeffding Bound, P(q^*(a)-Q_t(a)\geq \varepsilon)\leq e^{-\frac{2N_t(a)\varepsilon^2}{d^2}}=\delta.

So: \varepsilon=\sqrt{\frac{d^2 \ln(\frac{1}{\delta})}{2N_t(a)}}=\frac{d}{\sqrt{2}}\sqrt{\frac{\ln(\frac{1}{\delta})}{N_t(a)}},

Set \delta=t^{-4};\ then\ \ln(\tfrac{1}{\delta})=4\ln t\ and\ \varepsilon=d\sqrt{2}\sqrt{\frac{\ln{t}}{N_t(a)}}.

3.

P\Big(q^*(a)-Q_t(a)\geq c\sqrt{\frac{\ln{t}}{N_t(a)}}\Big)\leq t^{-4} \Rightarrow q^*(a)\leq Q_t(a)+c\sqrt{\frac{\ln{t}}{N_t(a)}}\ with\ confidence\ 1-t^{-4},\ where\ c=d\sqrt{2}.

4.

A_t= \arg\max_a\Big[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\Big]. As t grows, this upper bound becomes more accurate. When N_t(a)=0, the bonus term is infinite, so action a is guaranteed to be selected (every action gets tried at least once). The constant c controls the degree of exploration: the larger c is, the more likely rarely-selected actions are to be chosen.
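
The selection rule can be sketched in code. This is a minimal illustration, not the book's implementation: the Bernoulli arms, the arm means, c = 2, and all helper names are assumptions made for the example.

```python
import math
import random

# Minimal UCB action selection on a hypothetical Bernoulli bandit.
def ucb_run(true_means, steps, c=2.0, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k    # N_t(a)
    values = [0.0] * k  # Q_t(a), kept as an incremental sample average
    for t in range(1, steps + 1):
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            a = untried[0]  # N_t(a) = 0 => bonus is infinite, so try it first
        else:
            a = max(range(k), key=lambda i: values[i]
                    + c * math.sqrt(math.log(t) / counts[i]))
        r = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
    return counts, values

counts, values = ucb_run([0.2, 0.5, 0.8], steps=5000)
assert counts[2] == max(counts)  # the best arm ends up pulled the most
print(counts, [round(v, 2) for v in values])
```

Raising c shifts pulls toward the less-sampled arms; lowering it concentrates them on the current best estimate, matching the discussion above.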

