Famous Theorems in Machine Learning: The No Free Lunch Theorem Explained

The No Free Lunch Theorem

Theorem (No Free Lunch): Let $\mathcal{A}$ be any learning algorithm for binary classification over a domain $\mathcal{X}$, with the $0\text{-}1$ loss as its loss function. Let $n$ be any training-set size satisfying $n \le |\mathcal{X}|/2$. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X}\times\{0,1\}$ such that:

(1) there exists a function $f:\mathcal{X}\rightarrow \{0,1\}$ with $L_\mathcal{D}(f)=0$;

(2) for a sample $\mathcal{S}\sim\mathcal{D}^n$, the probability inequality $P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8:\ \mathcal{S}\sim\mathcal{D}^n)\ge 1/7$ holds.

Proof:
(1) Let $C$ denote a subset of $\mathcal{X}$ of size $2n$. The key idea of the proof is that seeing only half of the points in $C$ gives no information about the labels of the other half. Let $\mathcal{H}$ denote the set of all possible functions from $C$ to the label set $\{0,1\}$, and let $T$ be the cardinality of this set, i.e. $\mathcal{H}=\{f_1,\cdots,f_T\}$ with $T=2^{2n}$. For each function in $\mathcal{H}$, let $\mathcal{D}_i$ be the distribution over $C\times\{0,1\}$ given by
$$\mathcal{D}_i(\{(x,y)\})=\begin{cases}1/(2n) & \text{if } y=f_i(x)\\ 0 & \text{otherwise}\end{cases}$$
It follows that for the function $f_i$, under the distribution $\mathcal{D}_i$ we have $L_{\mathcal{D}_i}(f_i)=0$.
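The construction above can be checked numerically on a tiny domain. The sketch below is not part of the original proof: it takes $n=2$, $C=\{0,1,2,3\}$, enumerates all $T=2^{2n}$ labeling functions $f_i$ and all $k=(2n)^n$ sample sequences, and evaluates one arbitrarily chosen deterministic learner ("memorize the sample, predict 0 elsewhere") to confirm that $\max_i \mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]\ge 1/4$, the bound proved in part (2) below.

```python
import itertools

# Tiny-domain check of the proof's core claim: for a deterministic learner A,
# max_i E_{S ~ D_i^n}[L_{D_i}(A(S))] >= 1/4.
# Assumptions: n = 2, C = {0,1,2,3}; the "memorize, default to 0" learner
# is one hypothetical choice, not the only one the theorem covers.

n = 2
C = list(range(2 * n))
labelings = list(itertools.product([0, 1], repeat=len(C)))  # T = 2^{2n} functions f_i
sequences = list(itertools.product(C, repeat=n))            # k = (2n)^n sample sequences

def learner(sample):
    """Memorize labels seen in the sample; predict 0 on unseen points."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

worst = 0.0
for f in labelings:                   # each f_i induces a distribution D_i on C x {0,1}
    total = 0.0
    for seq in sequences:             # D_i^n is uniform over the k labeled sequences
        h = learner([(x, f[x]) for x in seq])
        # 0-1 loss of h under D_i (uniform over C, labels given by f_i)
        total += sum(h(x) != f[x] for x in C) / len(C)
    worst = max(worst, total / len(sequences))

print(worst)  # 0.5625 for this learner, comfortably above the guaranteed 1/4
```

The maximum is attained at the all-ones labeling, where every unseen point is misclassified; any learner, however clever, must exceed $1/4$ on some $\mathcal{D}_i$.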
(2) The key step is to show that for any learning algorithm $\mathcal{A}$,
$$\max_{i\in[T]}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]\ge \frac{1}{4}$$
First, construct a training set by drawing $n$ samples from $C\times\{0,1\}$, where samples may repeat; there are therefore $k=(2n)^n$ possible sample sequences. Denote these sequences by $\mathcal{S}_1,\mathcal{S}_2,\cdots,\mathcal{S}_k$, and let $\mathcal{S}_j^i=((x_1,f_i(x_1)),\cdots,(x_n,f_i(x_n)))$ denote the sequence $\mathcal{S}_j$ labeled by the function $f_i$. Then
$$\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]=\frac{1}{k}\sum_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))$$
Since $\text{maximum}\ge\text{average}\ge\text{minimum}$, we have
$$\begin{aligned}\max_{i\in[T]}\frac{1}{k}\sum_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))&\ge\frac{1}{T}\sum_{i=1}^T\frac{1}{k}\sum_{j=1}^k L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\\&=\frac{1}{k}\sum_{j=1}^k\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\\&\ge\min_{j\in[k]}\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\end{aligned}$$
Now fix $j\in[k]$, write $\mathcal{S}_j=\{x_1,\cdots,x_n\}$, and let $C\backslash\mathcal{S}_j=\{v_1,\cdots,v_p\}$, where $p\ge n$ is the number of points not sampled. For every function $h:C\rightarrow\{0,1\}$ and every $i\in[T]$,
$$\begin{aligned}L_{\mathcal{D}_i}(h)&=\frac{1}{2n}\sum_{x\in C}\mathbb{I}(h(x)\ne f_i(x))\\&=\frac{1}{2n}\left(\sum_{l=1}^n\mathbb{I}(h(x_l)\ne f_i(x_l))+\sum_{r=1}^p\mathbb{I}(h(v_r)\ne f_i(v_r))\right)\\&\ge\frac{1}{2n}\sum_{r=1}^p\mathbb{I}(h(v_r)\ne f_i(v_r))\\&\ge\frac{1}{2p}\sum_{r=1}^p\mathbb{I}(h(v_r)\ne f_i(v_r))\end{aligned}$$
where the last step uses $p\ge n$. Therefore
$$\begin{aligned}\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))&\ge\frac{1}{T}\sum_{i=1}^T\frac{1}{2p}\sum_{r=1}^p\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\\&=\frac{1}{2p}\sum_{r=1}^p\frac{1}{T}\sum_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\\&\ge\frac{1}{2}\min_{r\in[p]}\frac{1}{T}\sum_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\end{aligned}$$
For any fixed $r\in[p]$, since $\mathcal{H}$ contains every possible labeling of $C$, the functions can be partitioned into $T/2$ pairs $a,b\in[T]$ that agree everywhere on $C$ except at $v_r$; for each such pair $\mathcal{S}_j^a=\mathcal{S}_j^b$ (since $v_r\notin\mathcal{S}_j$), so
$$\mathbb{I}(\mathcal{A}(\mathcal{S}_j^a)(v_r)\ne f_a(v_r))+\mathbb{I}(\mathcal{A}(\mathcal{S}_j^b)(v_r)\ne f_b(v_r))=1$$
and hence $\sum_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))=T/2$. Combining the chains above,
$$\begin{aligned}\max_{i\in[T]}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}_i^n}[L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}))]&\ge\min_{j\in[k]}\frac{1}{T}\sum_{i=1}^T L_{\mathcal{D}_i}(\mathcal{A}(\mathcal{S}_j^i))\\&\ge\frac{1}{2}\min_{r\in[p]}\frac{1}{T}\sum_{i=1}^T\mathbb{I}(\mathcal{A}(\mathcal{S}_j^i)(v_r)\ne f_i(v_r))\\&=\frac{1}{2}\cdot\frac{1}{T}\cdot\frac{T}{2}=\frac{1}{4}\end{aligned}$$
Next we need a modified version of Markov's inequality. Given a random variable $X\in[0,1]$ and a constant $a\in[0,1)$,
$$\begin{aligned}\mathbb{E}[X]&=\int_0^1 xp(x)dx\\&=\int_0^a xp(x)dx+\int_a^1 xp(x)dx\\&\le a\int_0^a p(x)dx+\int_a^1 p(x)dx\\&=a(1-P\{X\ge a\})+P\{X\ge a\}\\&=a+(1-a)P\{X\ge a\}\end{aligned}$$
which rearranges into the reversed Markov inequality
$$P\{X\ge a\}\ge\frac{\mathbb{E}[X]-a}{1-a}$$
Taking $\mathcal{D}$ to be a distribution $\mathcal{D}_i$ achieving the maximum above (so that the expected loss is at least $1/4$) and applying this inequality with $a=1/8$ gives
$$\begin{aligned}P(L_\mathcal{D}(\mathcal{A}(\mathcal{S}))\ge 1/8:\ \mathcal{S}\sim\mathcal{D}^n)&\ge\frac{\mathbb{E}_{\mathcal{S}\sim\mathcal{D}^n}[L_\mathcal{D}(\mathcal{A}(\mathcal{S}))]-1/8}{1-1/8}\\&\ge\frac{1/4-1/8}{1-1/8}=\frac{1}{7}\end{aligned}$$
$\blacksquare$
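The reversed Markov inequality used in the last step is easy to sanity-check numerically. The sketch below is illustrative only: the discrete distribution is an arbitrary choice, and it verifies $P\{X\ge a\}\ge(\mathbb{E}[X]-a)/(1-a)$ for $X\in[0,1]$ with $a=1/8$, the value used in the proof.

```python
# Sanity check of the reversed Markov bound: for X in [0,1] and a in [0,1),
#   P{X >= a} >= (E[X] - a) / (1 - a).
# The point masses below are an arbitrary illustrative distribution on [0,1].
values = [0.0, 0.1, 0.3, 0.8, 1.0]
probs  = [0.2, 0.2, 0.2, 0.2, 0.2]

a = 1 / 8
mean = sum(v * p for v, p in zip(values, probs))          # E[X]
tail = sum(p for v, p in zip(values, probs) if v >= a)    # P{X >= a}
bound = (mean - a) / (1 - a)

print(tail >= bound)  # True
```

Here $\mathbb{E}[X]=0.44$, the bound is $(0.44-0.125)/0.875=0.36$, and the actual tail probability is $0.6$, so the inequality holds with room to spare; in the proof the same bound with $\mathbb{E}[X]\ge 1/4$ yields the $1/7$ guarantee.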
