对于非负随机变量 X X X 及 a > 0 a>0 a>0,有:
P ( X ≥ a ) ≤ E [ X ] a P(X \geq a) \leq \frac{E[X]}{a} P(X≥a)≤aE[X]
Proof:
E [ X ] = ∫ 0 + ∞ x f ( x ) d x ≥ ∫ a + ∞ a f ( x ) d x = a ⋅ P ( x ≥ a ) E[X] = \int_{0}^{ +\infty } xf(x) \, dx \geq \int_{a}^{ +\infty } af(x) \, dx = a \cdot P(x \geq a) E[X]=∫0+∞xf(x)dx≥∫a+∞af(x)dx=a⋅P(x≥a)
对于离散情况,把积分变为求和即可。
考虑独立随机变量 X i ∈ [ a i , b i ] ( i = 1 , 2 , . . . , n ) X_{i} \in [a_{i}, b_{i}] \, (i=1,2,...,n) Xi∈[ai,bi](i=1,2,...,n) 的和 S n = ∑ i = 1 n X i S_{n} = \sum_{i=1}^n X_{i} Sn=∑i=1nXi,则对任意 s > 0 s > 0 s>0 有:
P ( ∣ S n − E [ S n ] ∣ ≥ t ) ≤ e − 2 t 2 ∑ i = 1 n ( b i − a i ) 2 P(|S_{n} - E[S_{n}]| \geq t) \leq e^{ \frac {-2t^2} {\sum_{i=1}^{n} (b_{i} - a_{i})^2} } P(∣Sn−E[Sn]∣≥t)≤e∑i=1n(bi−ai)2−2t2
Proof:
考虑独立随机变量 X i ∈ [ a , b ] ( i = 1 , 2 , . . . , n ) X_{i} \in [a, b] \, (i=1,2,...,n) Xi∈[a,b](i=1,2,...,n), E [ X ] = 0 E[X] = 0 E[X]=0,则对任意 s > 0 s > 0 s>0 有:
E [ e s X ] ≤ e 1 8 s 2 ( b − a ) 2 E[e^{sX}] \leq e^{\frac{1}{8} s^2 (b-a)^2} E[esX]≤e81s2(b−a)2
Proof:
注意到 e s X e^{sX} esX 是关于 X X X 的一个凸函数,则有:
e s X ≤ b − X b − a e s a + X − a b − a e s b e^{sX} \leq \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb} esX≤b−ab−Xesa+b−aX−aesb
对上式两边取期望,结合条件 E [ X ] = 0 E[X] = 0 E[X]=0 有:
E [ e s X ] ≤ b − E [ X ] b − a e s a + E [ X ] − a b − a e s b = b b − a e s a − a b − a e s b E[e^{sX}] \leq \frac{b-E[X]}{b-a} e^{sa} + \frac{E[X]-a}{b-a} e^{sb} = \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} E[esX]≤b−ab−E[X]esa+b−aE[X]−aesb=b−abesa−b−aaesb
令 θ = − a b − a \theta = - \frac{a}{b-a} θ=−b−aa, u = s ( b − a ) u=s(b-a) u=s(b−a),由于 E [ X ] = 0 E[X] = 0 E[X]=0 ,故 a ≤ 0 ≤ b a \leq 0 \leq b a≤0≤b, θ ≥ 0 \theta \geq 0 θ≥0, u ≥ 0 u \geq 0 u≥0,得:
b b − a e s a − a b − a e s b \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} b−abesa−b−aaesb
= − a b − a e s a ( e s ( b − a ) − b a ) = − a b − a e − s ( b − a ) − a b − a ( e s ( b − a ) − b − a a − 1 ) = -\frac{a}{b-a} e^{sa} (e^{s(b-a)} - \frac{b}{a}) = -\frac{a}{b-a} e^{-s(b-a) \frac{-a}{b-a}} (e^{s(b-a)} - \frac{b-a}{a} - 1) =−b−aaesa(es(b−a)−ab)=−b−aae−s(b−a)b−a−a(es(b−a)−ab−a−1)
= θ e − θ u ( e u + 1 θ − 1 ) = e − θ u ( θ e u + 1 − θ ) = \theta e^{- \theta u} (e^u + \frac{1}{\theta} - 1) = e^{- \theta u} (\theta e^u + 1 - \theta) =θe−θu(eu+θ1−1)=e−θu(θeu+1−θ)
令 ψ ( u ) = e − θ u ( θ e u + 1 − θ ) \psi (u) = e^{- \theta u} (\theta e^u +1 - \theta) ψ(u)=e−θu(θeu+1−θ),由于 u ≥ 0 u \geq 0 u≥0,故 ψ ( u ) > 0 \psi (u) >0 ψ(u)>0。
令 ϕ ( u ) = ln [ ψ ( u ) ] = − θ u + ln ( θ e u + 1 − θ ) \phi (u) = \ln [\psi (u)] = - \theta u + \ln (\theta e^u + 1 - \theta) ϕ(u)=ln[ψ(u)]=−θu+ln(θeu+1−θ),
根据泰勒公式, ∃ v ∈ [ 0 , u ] \exists v \in [0,u] ∃v∈[0,u],使得:
ϕ ( u ) = ϕ ( 0 ) + u ⋅ ϕ ′ ( 0 ) + u 2 2 ! ⋅ ϕ ′ ′ ( v ) ( 1 ) \phi (u) = \phi (0) + u \cdot \phi ' (0) + \frac{u^2}{2!} \cdot \phi '' (v) \quad (1) ϕ(u)=ϕ(0)+u⋅ϕ′(0)+2!u2⋅ϕ′′(v)(1)
其中:
ϕ ′ ( u ) = − θ + θ e u θ e u + 1 − θ \phi ' (u) = - \theta + \frac {\theta e^u} {\theta e^u + 1 - \theta} ϕ′(u)=−θ+θeu+1−θθeu
ϕ ′ ′ ( u ) = θ e u ( 1 − θ ) ( θ e u + 1 − θ ) 2 \phi '' (u) = \frac {\theta e^u (1 - \theta)} {(\theta e^u + 1 - \theta)^2} ϕ′′(u)=(θeu+1−θ)2θeu(1−θ)
可得:
ϕ ( 0 ) = 0 \phi (0) = 0 ϕ(0)=0
ϕ ′ ( 0 ) = 0 \phi ' (0) = 0 ϕ′(0)=0
ϕ ′ ′ ( u ) = 1 2 + ( θ 1 − θ e u + 1 − θ θ e − u ) ≤ 1 4 \phi '' (u) = \frac {1} {2 + (\frac{\theta}{1 - \theta} e^u + \frac{1 - \theta}{\theta} e^{-u})} \leq \frac{1}{4} ϕ′′(u)=2+(1−θθeu+θ1−θe−u)1≤41
由 (1) 式得:
ϕ ( u ) ≤ 1 8 u 2 = 1 8 s 2 ( b − a ) 2 \phi (u) \leq \frac{1}{8} u^2 = \frac{1}{8} s^2 (b-a)^2 ϕ(u)≤81u2=81s2(b−a)2
故:
E [ e s X ] ≤ b b − a e s a − a b − a e s b = ψ ( u ) = e ϕ ( u ) ≤ e 1 8 s 2 ( b − a ) 2 E[e^{sX}] \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} = \psi (u) = e^{\phi (u)} \leq e^{\frac{1}{8} s^2 (b-a)^2} E[esX]≤b−abesa−b−aaesb=ψ(u)=eϕ(u)≤e81s2(b−a)2
Hoeffding 引理得证。
下面证明 Hoeffding 不等式。
首先证明不等式 P ( S n − E [ S n ] ≥ t ) ≤ e − 2 t 2 ∑ i = 1 n ( b i − a i ) 2 P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac {-2t^2} {\sum_{i=1}^{n} (b_{i} - a_{i})^2}} P(Sn−E[Sn]≥t)≤e∑i=1n(bi−ai)2−2t2。
对于任意 s > 0 s>0 s>0,有:
P ( S n − E [ S n ] ≥ t ) = P ( e s ( S n − E [ S n ] ) ≥ e s t ) P(S_{n} - E[S_{n}] \geq t) = P(e^{s(S_{n} - E[S_{n}])} \geq e^{st}) P(Sn−E[Sn]≥t)=P(es(Sn−E[Sn])≥est)
由于随机变量 e s ( S n − E [ S n ] ) > 0 e^{s(S_{n} - E[S_{n}])} > 0 es(Sn−E[Sn])>0,则由 Markov 不等式得:
P ( e s ( S n − E [ S n ] ) ≥ e s t ) ≤ E [ e s ( S n − E [ S n ] ) ] e s t = e − s t ⋅ ∏ i = 1 n E [ e s ( X i − E [ X i ] ) ] P(e^{s(S_{n} - E[S_{n}])} \geq e^{st}) \leq \frac {E[e^{s(S_{n} - E[S_{n}])}]} {e^{st}} = e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}] P(es(Sn−E[Sn])≥est)≤estE[es(Sn−E[Sn])]=e−st⋅i=1∏nE[es(Xi−E[Xi])]
对于随机变量 X i − E [ X i ] X_{i} - E[X_{i}] Xi−E[Xi],有 E [ X i − E [ X i ] ] = 0 E[X_{i} - E[X_{i}]] = 0 E[Xi−E[Xi]]=0,且 X i − E [ X i ] ∈ [ a i − E [ X i ] , b i − E [ X i ] ] X_{i} - E[X_{i}] \in [a_{i} - E[X_{i}], b_{i} - E[X_{i}]] Xi−E[Xi]∈[ai−E[Xi],bi−E[Xi]]。
则由 Hoeffding 引理得:
e − s t ⋅ ∏ i = 1 n E [ e s ( X i − E [ X i ] ) ] ≤ e − s t ⋅ ∏ i = 1 n e s 2 ( b i − a i ) 2 8 = e − s t + 1 8 s 2 ∑ i = 1 n ( b i − a i ) 2 e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}] \leq e^{-st} \cdot \prod_{i=1}^{n} e^{\frac {s^2 (b_{i} - a_{i})^2} {8}} =e^{-st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2} e−st⋅i=1∏nE[es(Xi−E[Xi])]≤e−st⋅i=1∏ne8s2(bi−ai)2=e−st+81s2∑i=1n(bi−ai)2
令 g ( s ) = − s t + 1 8 s 2 ∑ i = 1 n ( b i − a i ) 2 g(s) = -st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2 g(s)=−st+81s2∑i=1n(bi−ai)2,则:
g ′ ( s ) = − t + 1 4 s ∑ i = 1 n ( b i − a i ) 2 g'(s) = -t + \frac{1}{4} s \sum_{i=1}^{n} (b_{i} - a_{i})^2 g′(s)=−t+41si=1∑n(bi−ai)2
求解 g ′ ( s ) = 0 g'(s) = 0 g′(s)=0,得:
s = 4 t ∑ i = 1 n ( b i − a i ) 2 s = \frac {4t} {\sum_{i=1}^{n} (b_{i} - a_{i})^2} s=∑i=1n(bi−ai)24t
故:
P ( S n − E [ S n ] ≥ t ) ≤ e − 2 t 2 ∑ i = 1 n ( b i − a i ) 2 P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac {-2t^2} {\sum_{i=1}^{n} (b_{i} - a_{i})^2}} P(Sn−E[Sn]≥t)≤e∑i=1n(bi−ai)2−2t2
不等式 P ( E [ S n ] − S n ≥ t ) ≤ e − 2 t 2 ∑ i = 1 n ( b i − a i ) 2 P(E[S_{n}] -S_{n} \geq t) \leq e^{\frac {-2t^2} {\sum_{i=1}^{n} (b_{i} - a_{i})^2}} P(E[Sn]−Sn≥t)≤e∑i=1n(bi−ai)2−2t2 的证明同理。
Hoeffding 不等式得证。
定理:
对二类分类问题,当假设空间是有限个函数的集合 F = { f 1 , f 1 , . . . , f d } F = \{f_{1}, f_{1}, ..., f_{d}\} F={f1,f1,...,fd} 时,对任意一个函数 f ∈ F f \in F f∈F,至少以概率 1 − δ 1 - \delta 1−δ,以下不等式成立:
R ( f ) ≤ R ^ ( f ) + ε ( d , N , δ ) ( 2 ) R(f) \leq \hat R(f) + \varepsilon (d, N, \delta) \quad (2) R(f)≤R^(f)+ε(d,N,δ)(2)
其中,不等式左侧 R ( f ) R(f) R(f) 为泛化误差:
R ( f ) = E [ L ( Y , f ( X ) ) ] R(f) = E[L(Y, f(X))] R(f)=E[L(Y,f(X))]
不等式右侧即为泛化误差的上界:
R ^ ( f ) = 1 N ∑ i = 1 N L ( y i , f ( x i ) ) \hat R(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_{i}, f(x_{i})) R^(f)=N1i=1∑NL(yi,f(xi))
ε ( d , N , δ ) = 1 2 N ( ln d + ln 1 δ ) ( 3 ) \varepsilon (d, N, \delta) = \sqrt {\frac{1}{2N} (\ln {d} + \ln {\frac{1}{\delta}})} \quad (3) ε(d,N,δ)=2N1(lnd+lnδ1)(3)
在泛化误差上界中,第1项是训练误差;
第2项 ε ( d , N , δ ) \varepsilon (d, N, \delta) ε(d,N,δ) 是 N N N 的单调递减函数,当 N → + ∞ N \to +\infty N→+∞ 时趋于 0 0 0;同时它也是 ln d \sqrt {\ln d} lnd 阶的函数,假设空间 F F F 包含的函数越多,其值越大。
Proof:
对任意函数 f ∈ F f \in F f∈F, R ^ ( f ) \hat R(f) R^(f) 是 N N N 个独立的随机变量 L ( Y , f ( X ) ) 的 样 本 均 值 , L(Y, f(X)) 的样本均值, L(Y,f(X))的样本均值,R(f)$ 是随机变量 L ( Y , f ( X ) ) L(Y, f(X)) L(Y,f(X)) 的期望值,则:
P ( R ( f ) − R ^ ( f ) ≥ ε ) = P ( N ⋅ R ( f ) − N ⋅ R ^ ( f ) ≥ N ⋅ ε ) P(R(f) - \hat R(f) \geq \varepsilon) = P(N \cdot R(f) - N \cdot \hat R(f) \geq N \cdot \varepsilon) P(R(f)−R^(f)≥ε)=P(N⋅R(f)−N⋅R^(f)≥N⋅ε)
= P ( N ⋅ E [ L ( Y , f ( X ) ) ] − ∑ i = 1 N L ( y i − f ( x i ) ) ≥ N ⋅ ε ) = P(N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i} - f(x_{i})) \geq N \cdot \varepsilon) =P(N⋅E[L(Y,f(X))]−i=1∑NL(yi−f(xi))≥N⋅ε)
令 S n = ∑ i = 1 N L ( y i , f ( x i ) ) S_{n} = \sum_{i=1}^{N} L(y_{i}, f(x_{i})) Sn=∑i=1NL(yi,f(xi)),则:
E [ S n ] = E [ ∑ i = 1 N L ( y i , f ( x i ) ) ] = N ⋅ E [ L ( Y , f ( X ) ] E[S_{n}] = E[\sum_{i=1}^{N} L(y_{i}, f(x_{i}))] = N \cdot E[L(Y, f(X)] E[Sn]=E[i=1∑NL(yi,f(xi))]=N⋅E[L(Y,f(X)]
如果损失函数取值于区间 [ 0 , 1 ] [0, 1] [0,1],即对所有 i i i, [ a i , b i ] = [ 0 , 1 ] [a_{i}, b_{i}] = [0, 1] [ai,bi]=[0,1],则由 Hoeffding 不等式可知,对任意 ε > 0 \varepsilon > 0 ε>0,以下不等式成立:
P ( N ⋅ E [ L ( Y , f ( X ) ) ] − ∑ i = 1 N L ( y i − f ( x i ) ) ≥ N ⋅ ε ) P(N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i} - f(x_{i})) \geq N \cdot \varepsilon) P(N⋅E[L(Y,f(X))]−i=1∑NL(yi−f(xi))≥N⋅ε)
= P ( E [ S n ] − S n ≥ N ⋅ ε ) ≤ e − 2 ( N ε 2 ) N ( 1 − 0 ) 2 = e − 2 N ε 2 = P(E[S_{n}] - S_{n} \geq N \cdot \varepsilon) \leq e^{\frac {-2(N \varepsilon ^ 2)} {N(1-0)^2}} = e^{-2N \varepsilon ^ 2} =P(E[Sn]−Sn≥N⋅ε)≤eN(1−0)2−2(Nε2)=e−2Nε2
由于 F = { f 1 , f 1 , . . . , f d } F = \{f_{1}, f_{1}, ..., f_{d}\} F={f1,f1,...,fd} 是一个有限集合,故:
P ( ∃ f ∈ F : R ( f ) − R ^ ( f ) ≥ ε ) = P ( ⋃ f ∈ F { R ( f ) − R ^ ( f ) ≥ ε } ) P(\exists f \in F: \, R(f) - \hat R(f) \geq \varepsilon) = P(\bigcup_{f \in F} \{ R(f) - \hat R(f) \geq \varepsilon \}) P(∃f∈F:R(f)−R^(f)≥ε)=P(f∈F⋃{R(f)−R^(f)≥ε})
≤ ∑ f ∈ F P ( R ( f ) − R ^ ( f ) ≥ ε ) ≤ d ⋅ e − 2 N ε 2 \leq \sum_{f \in F} P(R(f) - \hat R(f) \geq \varepsilon) \leq d \cdot e^{-2N \varepsilon ^ 2} ≤f∈F∑P(R(f)−R^(f)≥ε)≤d⋅e−2Nε2
或者,等价的, ∀ f ∈ F \forall f \in F ∀f∈F,有:
P ( R ( f ) − R ^ ( f ) < ε ) ≥ 1 − d ⋅ e − 2 N ε 2 P(R(f) - \hat R(f) < \varepsilon) \geq 1 - d \cdot e^{-2N \varepsilon ^ 2} P(R(f)−R^(f)<ε)≥1−d⋅e−2Nε2
令:
δ = d ⋅ e − 2 N ε 2 ( 4 ) \delta = d \cdot e^{-2N \varepsilon ^ 2} \quad (4) δ=d⋅e−2Nε2(4)
可得:
P ( R ( f ) < R ^ ( f ) + ε ) ≥ 1 − δ P(R(f) < \hat R(f) + \varepsilon) \geq 1 - \delta P(R(f)<R^(f)+ε)≥1−δ
即:至少以概率 1 − δ 1 - \delta 1−δ,有 R ( f ) < R ^ ( f ) + ε R(f) < \hat R(f) + \varepsilon R(f)<R^(f)+ε,其中 ε \varepsilon ε 由式 (4) 得到,即为式 (3)。
故不等式 (2) 得证,即定理得证。
李航【统计学习方法】第一版,1.6.2