Proof of the Generalization Error Bound for Binary Classification

Contents

  • Markov's Inequality
  • Hoeffding's Inequality
    • Hoeffding's Lemma
  • Generalization Error Bound for Binary Classification
  • References


Markov's Inequality


For a non-negative random variable $X$ and any $a > 0$:
$$P(X \geq a) \leq \frac{E[X]}{a}$$

Proof: let $f$ be the probability density of $X$. Then
$$E[X] = \int_{0}^{+\infty} x f(x) \, dx \geq \int_{a}^{+\infty} x f(x) \, dx \geq \int_{a}^{+\infty} a f(x) \, dx = a \cdot P(X \geq a)$$
Dividing both sides by $a$ gives the claim. In the discrete case, replace the integrals with sums.
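As a quick sanity check of Markov's inequality, the following sketch compares an empirical tail probability with the bound $E[X]/a$. The choice of an Exponential(1) distribution, the helper name `markov_check`, and the sample size are illustrative assumptions, not part of the original proof.

```python
import random

def markov_check(a, n_samples=100_000, seed=0):
    """Return (empirical P(X >= a), Markov bound E[X]/a) for X ~ Exponential(1)."""
    rng = random.Random(seed)
    samples = [rng.expovariate(1.0) for _ in range(n_samples)]
    mean = sum(samples) / n_samples                   # sample estimate of E[X]
    tail = sum(x >= a for x in samples) / n_samples   # empirical P(X >= a)
    return tail, mean / a

for a in (1.0, 2.0, 4.0):
    tail, bound = markov_check(a)
    print(f"a={a}: P(X>=a) ~ {tail:.4f} <= E[X]/a ~ {bound:.4f}")
```

Because the samples are non-negative, the inequality also holds exactly for the empirical distribution, so the comparison never fails because of sampling noise.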


Hoeffding's Inequality


Let $X_{i} \in [a_{i}, b_{i}] \, (i = 1, 2, \ldots, n)$ be independent random variables with sum $S_{n} = \sum_{i=1}^{n} X_{i}$. Then for any $t > 0$, each one-sided deviation satisfies:
$$P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}, \qquad P(E[S_{n}] - S_{n} \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$$
and therefore the two-sided deviation satisfies:
$$P(|S_{n} - E[S_{n}]| \geq t) \leq 2 e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$$

Proof:

Hoeffding's Lemma


Let $X$ be a random variable with $X \in [a, b]$ and $E[X] = 0$. Then for any $s > 0$:
$$E[e^{sX}] \leq e^{\frac{1}{8} s^2 (b-a)^2}$$

Proof:

Since $e^{sx}$ is a convex function of $x$, for $X \in [a, b]$ we have:
$$e^{sX} \leq \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb}$$
Taking expectations on both sides and using $E[X] = 0$:
$$E[e^{sX}] \leq \frac{b-E[X]}{b-a} e^{sa} + \frac{E[X]-a}{b-a} e^{sb} = \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb}$$
Let $\theta = -\frac{a}{b-a}$ and $u = s(b-a)$. Since $E[X] = 0$, we have $a \leq 0 \leq b$, so $0 \leq \theta \leq 1$ and $u \geq 0$. If $a = 0$ or $b = 0$, then $X = 0$ almost surely and the lemma is trivial, so assume $a < 0 < b$, i.e. $0 < \theta < 1$. Then:
$$\begin{aligned}
\frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb}
&= -\frac{a}{b-a} e^{sa} \left(e^{s(b-a)} - \frac{b}{a}\right) \\
&= -\frac{a}{b-a} e^{-s(b-a) \cdot \frac{-a}{b-a}} \left(e^{s(b-a)} - \frac{b-a}{a} - 1\right) \\
&= \theta e^{-\theta u} \left(e^{u} + \frac{1}{\theta} - 1\right) \\
&= e^{-\theta u} (\theta e^{u} + 1 - \theta)
\end{aligned}$$
Let $\psi(u) = e^{-\theta u} (\theta e^{u} + 1 - \theta)$. Since $0 < \theta < 1$ and $u \geq 0$, we have $\theta e^{u} + 1 - \theta > 0$ and hence $\psi(u) > 0$.
Let
$$\phi(u) = \ln[\psi(u)] = -\theta u + \ln(\theta e^{u} + 1 - \theta)$$
By Taylor's theorem, there exists $v \in [0, u]$ such that:
$$\phi(u) = \phi(0) + u \cdot \phi'(0) + \frac{u^2}{2!} \cdot \phi''(v) \quad (1)$$
where:
$$\phi'(u) = -\theta + \frac{\theta e^{u}}{\theta e^{u} + 1 - \theta}$$
$$\phi''(u) = \frac{\theta e^{u} (1 - \theta)}{(\theta e^{u} + 1 - \theta)^2}$$
It follows that:
$$\phi(0) = 0$$
$$\phi'(0) = 0$$
and, since $x + \frac{1}{x} \geq 2$ for every $x > 0$:
$$\phi''(u) = \frac{1}{2 + \left(\frac{\theta}{1 - \theta} e^{u} + \frac{1 - \theta}{\theta} e^{-u}\right)} \leq \frac{1}{4}$$
Substituting into (1) gives:
$$\phi(u) \leq \frac{1}{8} u^2 = \frac{1}{8} s^2 (b-a)^2$$
Therefore:
$$E[e^{sX}] \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} = \psi(u) = e^{\phi(u)} \leq e^{\frac{1}{8} s^2 (b-a)^2}$$
This proves Hoeffding's lemma.
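The two key facts in the lemma's proof, $\psi(u) \leq e^{u^2/8}$ and $\phi''(u) \leq \frac{1}{4}$, can be spot-checked numerically. The sketch below evaluates both on a grid; the grid ranges, the tolerance, and the helper names `psi`, `phi_second`, and `lemma_holds` are illustrative assumptions. A grid check is only a sanity test and does not replace the proof.

```python
import math

def psi(theta, u):
    """psi(u) = exp(-theta*u) * (theta*exp(u) + 1 - theta), the bound on E[exp(sX)]."""
    return math.exp(-theta * u) * (theta * math.exp(u) + 1.0 - theta)

def phi_second(theta, u):
    """Second derivative of phi(u) = ln psi(u)."""
    denom = theta * math.exp(u) + 1.0 - theta
    return theta * math.exp(u) * (1.0 - theta) / denom ** 2

def lemma_holds(theta, u, tol=1e-12):
    """Check psi(u) <= exp(u^2/8) and phi''(u) <= 1/4 at one grid point."""
    return (psi(theta, u) <= math.exp(u * u / 8.0) * (1.0 + tol)
            and phi_second(theta, u) <= 0.25 + tol)

grid_ok = all(
    lemma_holds(i / 20.0, j / 10.0)
    for i in range(0, 21)      # theta in {0, 0.05, ..., 1}
    for j in range(0, 101)     # u in {0, 0.1, ..., 10}
)
print("Hoeffding's lemma holds on the grid:", grid_ok)
```

The bound on $\phi''$ is tight at $\theta = \frac{1}{2}$, $u = 0$, which is why a small tolerance is used in the comparison.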

We now prove Hoeffding's inequality.

We first prove the one-sided inequality $P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$.
For any $s > 0$:
$$P(S_{n} - E[S_{n}] \geq t) = P(e^{s(S_{n} - E[S_{n}])} \geq e^{st})$$
Since the random variable $e^{s(S_{n} - E[S_{n}])}$ is positive, Markov's inequality, together with the independence of the $X_{i}$, gives:
$$P(e^{s(S_{n} - E[S_{n}])} \geq e^{st}) \leq \frac{E[e^{s(S_{n} - E[S_{n}])}]}{e^{st}} = e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}]$$
The random variable $X_{i} - E[X_{i}]$ satisfies $E[X_{i} - E[X_{i}]] = 0$ and $X_{i} - E[X_{i}] \in [a_{i} - E[X_{i}], b_{i} - E[X_{i}]]$, an interval of width $b_{i} - a_{i}$.
Hoeffding's lemma therefore gives:
$$e^{-st} \cdot \prod_{i=1}^{n} E[e^{s(X_{i} - E[X_{i}])}] \leq e^{-st} \cdot \prod_{i=1}^{n} e^{\frac{s^2 (b_{i} - a_{i})^2}{8}} = e^{-st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2}$$
Let $g(s) = -st + \frac{1}{8} s^2 \sum_{i=1}^{n} (b_{i} - a_{i})^2$. Then:
$$g'(s) = -t + \frac{1}{4} s \sum_{i=1}^{n} (b_{i} - a_{i})^2$$
Solving $g'(s) = 0$ gives:
$$s = \frac{4t}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}$$
Since $g$ is a convex quadratic in $s$, this value minimizes $g$, and substituting it back gives $g(s) = \frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}$. Hence:
$$P(S_{n} - E[S_{n}] \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$$
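The optimization over $s$ can be checked with a few lines of arithmetic. In this sketch the interval widths and the value of $t$ are made-up numbers, and `g` and `optimal_s` are hypothetical helper names; it only confirms that the stationary point of $g$ reproduces the exponent $-2t^2 / \sum_{i}(b_i - a_i)^2$.

```python
def g(s, t, widths):
    """Exponent g(s) = -s*t + (s^2 / 8) * sum((b_i - a_i)^2)."""
    c = sum(w * w for w in widths)
    return -s * t + s * s * c / 8.0

def optimal_s(t, widths):
    """Minimizer s* = 4t / sum((b_i - a_i)^2), obtained from g'(s) = 0."""
    return 4.0 * t / sum(w * w for w in widths)

widths = [1.0, 2.0, 0.5]    # hypothetical interval widths b_i - a_i
t = 1.5
s_star = optimal_s(t, widths)
closed_form = -2.0 * t * t / sum(w * w for w in widths)
print(g(s_star, t, widths), closed_form)   # the two values should agree
```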
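The inequality $P(E[S_{n}] - S_{n} \geq t) \leq e^{\frac{-2t^2}{\sum_{i=1}^{n} (b_{i} - a_{i})^2}}$ is proved in the same way, by applying the argument to the variables $-X_{i}$. Combining the two one-sided bounds with the union bound gives the two-sided bound, with an extra factor of 2 on the right-hand side.

This proves Hoeffding's inequality.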
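For Bernoulli variables the tail probability can be computed exactly, which makes it easy to see how loose or tight the one-sided bound is. The sketch below assumes $X_i \sim \text{Bernoulli}(p)$ with values in $[0, 1]$, so that $\sum_i (b_i - a_i)^2 = n$; the parameters $n = 100$, $p = 0.25$ and the function names are illustrative choices.

```python
import math

def binomial_upper_tail(n, p, k_min):
    """Exact P(S_n >= k_min) for S_n ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(max(k_min, 0), n + 1))

def hoeffding_one_sided(n, t):
    """One-sided bound exp(-2 t^2 / n) for n variables taking values in [0, 1]."""
    return math.exp(-2.0 * t * t / n)

n, p = 100, 0.25                      # n * p = 25 is exact in floating point
for t in (5, 10, 20):
    k_min = math.ceil(n * p + t)      # S_n - E[S_n] >= t  <=>  S_n >= n*p + t
    exact = binomial_upper_tail(n, p, k_min)
    print(f"t={t}: exact tail {exact:.3e} <= bound {hoeffding_one_sided(n, t):.3e}")
```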


Generalization Error Bound for Binary Classification


Theorem:

For a binary classification problem, when the hypothesis space is a finite set of functions $F = \{f_{1}, f_{2}, \ldots, f_{d}\}$, then for any function $f \in F$ the following inequality holds with probability at least $1 - \delta$ (where $0 < \delta < 1$):
$$R(f) \leq \hat R(f) + \varepsilon(d, N, \delta) \quad (2)$$
Here the left-hand side $R(f)$ is the generalization error:
$$R(f) = E[L(Y, f(X))]$$
The right-hand side is the generalization error bound, where $\hat R(f)$ is the training error on $N$ samples:
$$\hat R(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_{i}, f(x_{i}))$$
$$\varepsilon(d, N, \delta) = \sqrt{\frac{1}{2N} \left(\ln d + \ln \frac{1}{\delta}\right)} \quad (3)$$
In the bound, the first term is the training error.
The second term $\varepsilon(d, N, \delta)$ is a monotonically decreasing function of $N$ and tends to $0$ as $N \to +\infty$. It is also of order $\sqrt{\ln d}$: the more functions the hypothesis space $F$ contains, the larger its value.
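To get a feel for the size of the second term, equation (3) can be evaluated directly. The function name `epsilon` and the sample values of $d$, $N$, and $\delta$ below are illustrative assumptions.

```python
import math

def epsilon(d, N, delta):
    """epsilon(d, N, delta) = sqrt((ln d + ln(1/delta)) / (2N)), i.e. equation (3)."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

# The term shrinks as N grows and grows only slowly (like sqrt(ln d)) with d.
for N in (100, 1_000, 10_000):
    print(N, round(epsilon(d=1000, N=N, delta=0.05), 4))
```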

Proof:

For any function $f \in F$, $\hat R(f)$ is the sample mean of the $N$ independent random variables $L(y_{i}, f(x_{i}))$, and $R(f)$ is the expectation of the random variable $L(Y, f(X))$. Then:
$$\begin{aligned}
P(R(f) - \hat R(f) \geq \varepsilon) &= P(N \cdot R(f) - N \cdot \hat R(f) \geq N \cdot \varepsilon) \\
&= P\left(N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i}, f(x_{i})) \geq N \cdot \varepsilon\right)
\end{aligned}$$
Let $S_{n} = \sum_{i=1}^{N} L(y_{i}, f(x_{i}))$ (so $n = N$). Then:
$$E[S_{n}] = E\left[\sum_{i=1}^{N} L(y_{i}, f(x_{i}))\right] = N \cdot E[L(Y, f(X))]$$
If the loss function takes values in the interval $[0, 1]$, i.e. $[a_{i}, b_{i}] = [0, 1]$ for all $i$, then by the one-sided Hoeffding inequality, for any $\varepsilon > 0$:
$$\begin{aligned}
P\left(N \cdot E[L(Y, f(X))] - \sum_{i=1}^{N} L(y_{i}, f(x_{i})) \geq N \cdot \varepsilon\right)
&= P(E[S_{n}] - S_{n} \geq N \cdot \varepsilon) \\
&\leq e^{\frac{-2 (N \varepsilon)^2}{N (1-0)^2}} = e^{-2N \varepsilon^2}
\end{aligned}$$
Since $F = \{f_{1}, f_{2}, \ldots, f_{d}\}$ is a finite set, the union bound gives:
$$\begin{aligned}
P(\exists f \in F: \, R(f) - \hat R(f) \geq \varepsilon) &= P\left(\bigcup_{f \in F} \{R(f) - \hat R(f) \geq \varepsilon\}\right) \\
&\leq \sum_{f \in F} P(R(f) - \hat R(f) \geq \varepsilon) \leq d \cdot e^{-2N \varepsilon^2}
\end{aligned}$$
Equivalently, taking complements:
$$P(\forall f \in F: \, R(f) - \hat R(f) < \varepsilon) \geq 1 - d \cdot e^{-2N \varepsilon^2}$$
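The union-bound step can be illustrated with a small simulation. Everything in this sketch is a made-up toy setting rather than part of the original proof: $X$ is uniform on $[0, 1]$, the label is $Y = 1[X > 0.5]$ flipped with probability `noise`, and the hypothesis space consists of nine threshold classifiers $f_c(x) = 1[x > c]$, whose risk under the 0-1 loss has the closed form used in `true_risk`.

```python
import math
import random

def true_risk(c, noise):
    """Risk of f_c(x) = 1[x > c] when X ~ U[0,1] and Y = 1[X > 0.5] flipped w.p. noise."""
    return noise + (1.0 - 2.0 * noise) * abs(c - 0.5)

def violation_rate(thresholds, N, eps, noise=0.1, trials=2000, seed=0):
    """Estimate P(exists f in F: R(f) - R_hat(f) >= eps) by simulation."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(N)]
        ys = [int(x > 0.5) ^ int(rng.random() < noise) for x in xs]
        for c in thresholds:
            emp = sum(int(x > c) != y for x, y in zip(xs, ys)) / N   # training error
            if true_risk(c, noise) - emp >= eps:
                violations += 1
                break                                                # count each trial once
    return violations / trials

thresholds = [j / 10.0 for j in range(1, 10)]      # d = 9 hypothetical classifiers
N, eps = 50, 0.2
bound = len(thresholds) * math.exp(-2.0 * N * eps ** 2)
print(f"empirical {violation_rate(thresholds, N, eps):.4f} <= union bound {bound:.4f}")
```

All classifiers are evaluated on the same sample, so their deviations are dependent; the union bound does not require independence across hypotheses, which is why it still applies here.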
Let:
$$\delta = d \cdot e^{-2N \varepsilon^2} \quad (4)$$
Then, for any $f \in F$:
$$P(R(f) < \hat R(f) + \varepsilon) \geq 1 - \delta$$
That is, with probability at least $1 - \delta$ we have $R(f) < \hat R(f) + \varepsilon$, where $\varepsilon$ is obtained by solving (4): taking logarithms gives $\ln \delta = \ln d - 2N \varepsilon^2$, so $\varepsilon = \sqrt{\frac{1}{2N} \left(\ln d + \ln \frac{1}{\delta}\right)}$, which is exactly (3).
This proves inequality (2), and hence the theorem.
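Equation (4) can also be solved for $N$ instead of $\varepsilon$, which answers the practical question of how many samples make the bound hold at a given accuracy and confidence. The sketch below is an illustration of that rearrangement; the function name `required_samples` and the example values are assumptions.

```python
import math

def required_samples(d, eps, delta):
    """Smallest integer N with d * exp(-2 N eps^2) <= delta."""
    return math.ceil((math.log(d) + math.log(1.0 / delta)) / (2.0 * eps ** 2))

N = required_samples(d=1000, eps=0.05, delta=0.05)
# Second printed value is d * exp(-2 N eps^2), which should not exceed delta.
print(N, 1000 * math.exp(-2.0 * N * 0.05 ** 2))
```

The required sample size grows only logarithmically in $d$ and $1/\delta$, but quadratically in $1/\varepsilon$.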


References


Li Hang, 统计学习方法 (Statistical Learning Methods), 1st edition, Section 1.6.2.
