This post is a write-up of two papers, "A theory of learning from different domains" and "Analysis of Representations for Domain Adaptation" (see the references at the end). They show theoretically that the bound on the target-domain error is determined in part by the source-domain error, which makes them a useful guide for domain adaptation.
First, some basic setup. Let $\mathcal{D}_S, f_S$ denote the distribution on the source domain and the labeling function on that domain (here $f_S$ is assumed to label a binary classification problem, so it takes values in $[0,1]$). Likewise, write $\mathcal{D}_T, f_T$ for the target domain.
A hypothesis is a function used for classification, $h:\mathcal{X}\rightarrow\{0,1\}$. We can then define the error between $h$ and $f$ as:
$$
\epsilon_S(h,f)=\mathbb{E}_{\mathbf{x}\sim\mathcal{D}_S}[|h(\mathbf{x})-f(\mathbf{x})|]
$$
which is the error between $h$ and $f$ on the source domain. In particular, when $f=f_S$, i.e., the true labeling function, we write $\epsilon_S(h)=\epsilon_S(h,f_S)$; the target-domain error $\epsilon_T(h)=\epsilon_T(h,f_T)$ is defined the same way.
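Since this error is simply the expected disagreement between $h$ and $f$ under $\mathcal{D}_S$, it is easy to estimate by Monte Carlo. A minimal sketch, where the Gaussian source distribution and the two threshold functions are illustrative assumptions:

```python
# Monte Carlo estimate of eps_S(h, f) = E_{x ~ D_S}[|h(x) - f(x)|].
# The Gaussian D_S and the two threshold functions are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)  # x ~ D_S

f = lambda x: (x > 0.0).astype(int)  # "true" labeling function f_S
h = lambda x: (x > 0.3).astype(int)  # a hypothesis with a shifted threshold

# For {0,1}-valued functions, |h(x) - f(x)| is 1 exactly when they disagree,
# so the error is just the disagreement rate.
eps_S = np.mean(np.abs(h(x) - f(x)))
print(f"eps_S(h, f) ~= {eps_S:.4f}")  # about Phi(0.3) - 0.5 ~= 0.118
```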
Next we introduce the most important tool, the $\mathcal{H}$-divergence. A divergence is a weakened form of distance: it need not satisfy all the axioms of a metric (for example, it may fail to be symmetric). The $\mathcal{H}$-divergence between distributions $\mathcal{D}$ and $\mathcal{D}'$, defined relative to a hypothesis space $\mathcal{H}$, is:
$$
d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')=2\sup_{h\in\mathcal{H}}\left|\Pr_{x\sim\mathcal{D}}[h(x)=1]-\Pr_{x\sim\mathcal{D}'}[h(x)=1]\right|
$$
Intuitively, this divergence searches the hypothesis space $\mathcal{H}$ for a function $h$ that makes $\Pr_{x\sim\mathcal{D}}[h(x)=1]$ as large as possible while making $\Pr_{x\sim\mathcal{D}'}[h(x)=1]$ as small as possible; in other words, we measure the distance between $\mathcal{D}$ and $\mathcal{D}'$ by this worst-case discrepancy. The maximizing $h$ can be read as the classifier that best distinguishes the two distributions $\mathcal{D}$ and $\mathcal{D}'$.
Moreover, this divergence can be estimated from data:
Lemma 1. Let $\mathcal{H}$ be a hypothesis space on $\mathcal{X}$ with VC dimension $d$. If $\mathcal{U}$ and $\mathcal{U}'$ are samples of size $m$ from $\mathcal{D}$ and $\mathcal{D}'$ respectively and $\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')$ is the empirical $\mathcal{H}$-divergence between the samples, then for any $\delta\in(0,1)$, with probability at least $1-\delta$,
$$
d_{\mathcal{H}}(\mathcal{D},\mathcal{D}')\leq\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')+4\sqrt{\frac{d\log(2m)+\log\left(\frac{2}{\delta}\right)}{m}}
$$
This is essentially the standard VC bound, where $d$ is the VC dimension of $\mathcal{H}$ and $m$ is the sample size. Clearly, when $d$ is finite, the deviation term vanishes as the sample size goes to infinity, as the quick evaluation below shows.
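Here is a quick evaluation of the deviation term in Lemma 1, under assumed values of $d$, $m$, and $\delta$:

```python
# Deviation term of Lemma 1: 4 * sqrt((d*log(2m) + log(2/delta)) / m).
# The VC dimension d and confidence delta are assumed values.
import numpy as np

def lemma1_term(d, m, delta):
    return 4 * np.sqrt((d * np.log(2 * m) + np.log(2 / delta)) / m)

for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(lemma1_term(d=10, m=m, delta=0.05), 3))
# Shrinks roughly like sqrt(log(m)/m); note the bound is vacuous for small m,
# since the H-divergence itself never exceeds 2.
```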
Lemma 2. The divergence can be computed from the samples:
$$
\hat{d}_{\mathcal{H}}(\mathcal{U},\mathcal{U}')=2\left(1-\min_{h\in\mathcal{H}}\left[\frac{1}{m}\sum_{\mathbf{x}:\,h(\mathbf{x})=0} I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:\,h(\mathbf{x})=1} I[\mathbf{x}\in\mathcal{U}']\right]\right)
$$
where $I[\mathbf{x}\in\mathcal{U}]$ equals 1 when $\mathbf{x}\in\mathcal{U}$ and 0 otherwise; the sums simply count how many points of each sample fall on each side of $h$.
In fact, one can see directly that this statistic estimates exactly the probability gap appearing in the $\mathcal{H}$-divergence:
$$
1-\left[\frac{1}{m}\sum_{\mathbf{x}:\,h(\mathbf{x})=0} I[\mathbf{x}\in\mathcal{U}]+\frac{1}{m}\sum_{\mathbf{x}:\,h(\mathbf{x})=1} I[\mathbf{x}\in\mathcal{U}']\right]=\Pr_{x\sim\mathcal{D}}[h(x)=1]-\Pr_{x\sim\mathcal{D}'}[h(x)=1]
$$
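This suggests a practical recipe: label points from $\mathcal{U}$ as 1 and points from $\mathcal{U}'$ as 0, train a classifier to tell them apart, and plug its error into Lemma 2 (this is the idea behind the "proxy A-distance" used in the papers). A minimal sketch, with a logistic-regression learner standing in for the minimization over all of $\mathcal{H}$ and synthetic Gaussian samples as assumed data:

```python
# A sketch of Lemma 2: estimate the empirical H-divergence with a domain
# classifier. LogisticRegression is a stand-in for the (intractable) min over
# all of H; the Gaussian samples below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def empirical_h_divergence(U, U_prime):
    """U, U_prime: arrays of shape (m, n_features) sampled from D and D'."""
    m = len(U)
    X = np.vstack([U, U_prime])
    y = np.concatenate([np.ones(m), np.zeros(len(U_prime))])  # U -> 1, U' -> 0
    h = LogisticRegression().fit(X, y)  # approximates the minimizing h
    pred = h.predict(X)
    # err = (1/m) * #{x in U with h(x)=0} + (1/m) * #{x in U' with h(x)=1}
    err = (np.sum(pred[:m] == 0) + np.sum(pred[m:] == 1)) / m
    return 2 * (1 - err)

rng = np.random.default_rng(0)
U = rng.normal(0.0, 1.0, size=(500, 2))        # sample from D
U_prime = rng.normal(2.0, 1.0, size=(500, 2))  # sample from a shifted D'
print(empirical_h_divergence(U, U_prime))      # close to 2: easily separable
```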
Definition 1. The symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$ is the set of hypotheses
$$
g\in\mathcal{H}\Delta\mathcal{H}\ \ \Longleftrightarrow\ \ g(\mathbf{x})=h(\mathbf{x})\oplus h'(\mathbf{x})\ \text{ for some } h,h'\in\mathcal{H}
$$
where $\oplus$ denotes XOR: $g(\mathbf{x})=1$ exactly when $h(\mathbf{x})\neq h'(\mathbf{x})$.
Intuitively, $g$ tells us whether two hypotheses agree. The benefit of this construction is that functions in this set express the probability that two hypotheses disagree, which is exactly the error between the two hypotheses. Finding the largest gap, between the two domains, in the disagreement of any pair of hypotheses therefore gives the value of the $\mathcal{H}\Delta\mathcal{H}$-divergence:
$$
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|
$$
The derivation is Lemma 3:
Lemma 3. For any hypotheses $h,h'\in\mathcal{H}$,
$$
\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\leq\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)
$$
Proof:
$$
\begin{aligned}
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)
&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{x\sim\mathcal{D}_S}[h(x)\oplus h'(x)=1]-\Pr_{x\sim\mathcal{D}_T}[h(x)\oplus h'(x)=1]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\Pr_{x\sim\mathcal{D}_S}[h(x)\neq h'(x)]-\Pr_{x\sim\mathcal{D}_T}[h(x)\neq h'(x)]\right|\\
&=2\sup_{h,h'\in\mathcal{H}}\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|\;\geq\;2\left|\epsilon_S(h,h')-\epsilon_T(h,h')\right|
\end{aligned}
$$
This completes the proof.
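As a numerical sanity check of the definition, here is a brute-force sketch that approximates $d_{\mathcal{H}\Delta\mathcal{H}}$ over a small finite family of 1-D threshold classifiers standing in for $\mathcal{H}$; the Gaussian domains are assumed toy data:

```python
# Brute-force approximation of
# d_{HdH}(D_S, D_T) = 2 sup_{h,h'} |eps_S(h,h') - eps_T(h,h')|
# over a finite grid of 1-D threshold classifiers standing in for H.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=5_000)  # samples from D_S
xt = rng.normal(1.0, 1.0, size=5_000)  # samples from D_T (shifted mean)

thresholds = np.linspace(-3.0, 4.0, 29)  # finite stand-in for H

def disagreement(x, t1, t2):
    # eps(h, h') = Pr[h(x) != h'(x)] for threshold classifiers h(x) = 1[x > t]
    return np.mean((x > t1) != (x > t2))

gaps = [abs(disagreement(xs, t1, t2) - disagreement(xt, t1, t2))
        for t1, t2 in combinations(thresholds, 2)]
print("d_HdH estimate:", 2 * max(gaps))  # maximized by an interval between the means
```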
With the lemmas above we can now prove an important theorem. It tells us that if we find an $h$ making the source-domain error as small as possible, the target-domain error is also controlled, up to the divergence between the domains and the constant $\lambda$ below.
Theorem 1. Let $\mathcal{U}_S$ and $\mathcal{U}_T$ be unlabeled samples of size $m'$ drawn from $\mathcal{D}_S$ and $\mathcal{D}_T$ respectively. Then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:
$$
\epsilon_T(h)\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\left(\frac{2}{\delta}\right)}{m'}}+\lambda
$$
Proof: The proof uses Lemma 1 and Lemma 3 above, together with the triangle inequality for classification error, e.g. $\epsilon_T(h,f_T)\leq\epsilon_T(f_T,h^*)+\epsilon_T(h,h^*)$. Here $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_S(h)+\epsilon_T(h))$ is the ideal joint hypothesis, and $\lambda=\epsilon_S(h^*)+\epsilon_T(h^*)$ denotes its combined error.
$$
\begin{aligned}
\epsilon_T(h) &\leq \epsilon_T(h^*)+\epsilon_T(h,h^*)\\
&=\epsilon_T(h^*)+\epsilon_T(h,h^*)+\epsilon_S(h,h^*)-\epsilon_S(h,h^*)\\
&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\left|\epsilon_T(h,h^*)-\epsilon_S(h,h^*)\right|\\
(\text{Lemma 3})\quad&\leq\epsilon_T(h^*)+\epsilon_S(h,h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)\\
(\text{triangle inequality})\quad&\leq\epsilon_T(h^*)+\epsilon_S(h)+\epsilon_S(h^*)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)\\
&=\epsilon_S(h)+\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)+\lambda\\
(\text{Lemma 1})\quad&\leq\epsilon_S(h)+\frac{1}{2}\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{U}_S,\mathcal{U}_T)+4\sqrt{\frac{2d\log(2m')+\log\left(\frac{2}{\delta}\right)}{m'}}+\lambda
\end{aligned}
$$
Line 1 uses the triangle inequality; line 5 uses the triangle inequality $\epsilon_S(h,h^*)\leqslant\epsilon_S(h,f_S)+\epsilon_S(h^*,f_S)$; line 4 uses Lemma 3; and the last step uses VC theory (Lemma 1) to estimate $\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}$ from the unlabeled samples, with $d$ the VC dimension of $\mathcal{H}$. The factor $2d$ appears because the VC dimension of $\mathcal{H}\Delta\mathcal{H}$ is at most twice that of $\mathcal{H}$.
This completes the proof.
The essence of this bound is that the $\mathcal{H}\Delta\mathcal{H}$-divergence ties together the error gap between the two domains:
$$
|\epsilon_S-\epsilon_T|\approx\frac{1}{2}d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S,\mathcal{D}_T)
$$
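Plugging assumed numbers into Theorem 1 shows how the terms of the bound trade off; everything below (the errors, the divergence, $\lambda$, $d$, $m'$, $\delta$) is an illustrative assumption:

```python
# Evaluating the right-hand side of Theorem 1; every number here
# (errors, divergence, d, m', delta) is an assumed value for illustration.
import numpy as np

eps_S_hat = 0.05  # source error of h (assumed)
d_hat     = 0.30  # empirical HdH-divergence between U_S and U_T (assumed)
lam       = 0.02  # combined error of the ideal joint hypothesis (assumed)
d, m_prime, delta = 10, 100_000, 0.05  # VC dim, unlabeled sample size, confidence

complexity = 4 * np.sqrt((2 * d * np.log(2 * m_prime) + np.log(2 / delta)) / m_prime)
bound = eps_S_hat + 0.5 * d_hat + complexity + lam
print(f"eps_T(h) <= {bound:.3f}")  # ~0.42, dominated here by the VC term
# Shrinking the source error, the domain divergence, or lambda all tighten
# the guarantee on the target domain.
```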
The second paper extends the DA error analysis to distributions induced by a representation. Assume there is a representation function $\mathcal{R}$ that maps each domain into a representation space, i.e., it maps $\mathcal{X}$ to $\mathcal{Z}$. Once $\mathcal{R}$ is fixed, the induced domain is determined as well, because any event in the feature space can be pulled back through the preimage $\mathcal{R}^{-1}$ to $\mathcal{X}$, which is where the domain lives:
$$
\begin{aligned}
\Pr_{\tilde{\mathcal{D}}}[B]&\stackrel{\mathrm{def}}{=}\Pr_{\mathcal{D}}\left[\mathcal{R}^{-1}(B)\right]\\
\tilde{f}(\mathbf{z})&\stackrel{\mathrm{def}}{=}\mathbb{E}_{\mathcal{D}}[f(\mathbf{x})\mid\mathcal{R}(\mathbf{x})=\mathbf{z}]
\end{aligned}
$$
Simply put, $B$ is an event in the feature space, and $\Pr_{\tilde{\mathcal{D}}}[B]$ measures its probability directly on the representation. The function $\tilde{f}(\mathbf{z})$ is the mean of $f(\mathbf{x})$ over all $\mathbf{x}$ represented by $\mathbf{z}$: each $f(\mathbf{x})$ is a label, and their average serves as the label of the representation $\mathbf{z}$.
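A toy sketch may make this concrete. Assume a representation $\mathcal{R}$ that coarsens $\mathcal{X}=[0,1]$ into four bins (all concrete choices here are illustrative): the induced probability of a feature-space event is the probability of its preimage, and $\tilde{f}$ averages the labels within each bin:

```python
# A toy sketch of the induced distribution and the induced labeling function:
# R coarsens X = [0,1] into 4 bins (an assumed representation), and
# f_tilde(z) = E[f(x) | R(x) = z] is the average label inside each bin.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10_000)         # x ~ D
f = (x > 0.6).astype(float)                # true labels f(x)
R = lambda x: np.floor(x * 4).astype(int)  # representation R: X -> Z = {0,1,2,3}
z = R(x)

# Pr_{D~}[B] for the feature-space event B = {2, 3}:
# by definition it is Pr_D[R^{-1}(B)] = Pr(x >= 0.5)
print("Pr[B] =", np.isin(z, [2, 3]).mean())

# f_tilde(z) per bin; bin 2 straddles the true boundary x = 0.6,
# so its induced label is fractional (about 0.6), not 0 or 1.
for zv in np.unique(z):
    print(f"f_tilde({zv}) = {f[z == zv].mean():.2f}")
```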
In the DA problem we write $\mathcal{D}_S$ for the source-domain distribution and $\tilde{\mathcal{D}}_S$ for the induced source distribution on the feature space, i.e., the distribution obtained by pushing $\mathcal{D}_S$ through the representation $\mathcal{R}$, exactly as in the definitions above.
The errors extend to the representation setting in the same way: we simply sample $\mathbf{z}$ from $\tilde{\mathcal{D}}_S$. With $h$ now denoting an arbitrary classifier on $\mathcal{Z}$, the source-domain error of $h$ is:
$$
\begin{aligned}
\epsilon_S(h)&=\mathbb{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left[\mathbb{E}_{y\sim\tilde{f}_S(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathbb{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}\left|\tilde{f}_S(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}
$$
and likewise the target-domain error:
$$
\begin{aligned}
\epsilon_T(h)&=\mathbb{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left[\mathbb{E}_{y\sim\tilde{f}_T(\mathbf{z})}[y\neq h(\mathbf{z})]\right]\\
&=\mathbb{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_T}\left|\tilde{f}_T(\mathbf{z})-h(\mathbf{z})\right|
\end{aligned}
$$
That is, $\epsilon_S(h)=\epsilon_S(h,\tilde{f}_S)$ and $\epsilon_T(h)=\epsilon_T(h,\tilde{f}_T)$.
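Continuing the same toy setup, $\epsilon_S(h)=\mathbb{E}_{\mathbf{z}\sim\tilde{\mathcal{D}}_S}|\tilde{f}_S(\mathbf{z})-h(\mathbf{z})|$ can be estimated for a hypothesis $h$ defined on $\mathcal{Z}$; again, every concrete choice is an assumption:

```python
# eps_S(h) = E_{z ~ D~_S} |f~_S(z) - h(z)|, continuing the binning toy above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10_000)
f = (x > 0.6).astype(float)                # true labels
R = lambda x: np.floor(x * 4).astype(int)  # representation: 4 bins
z = R(x)

f_tilde = {zv: f[z == zv].mean() for zv in np.unique(z)}  # induced labels

h = lambda z: (z >= 2).astype(float)  # a hypothesis acting on Z
eps_S = np.mean([abs(f_tilde[zv] - hv) for zv, hv in zip(z, h(z))])
print(f"eps_S(h) ~= {eps_S:.3f}")
# ~0.1: bin 2 straddles the true boundary, so even the best hypothesis on Z
# pays |f~(2) - 1| there; a coarse representation contributes irreducible
# error of exactly this kind (this is what the lambda term captures).
```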
We now extend Theorem 1 to the representation setting.
Theorem 2. Let $\mathcal{R}$ be a fixed representation function from $\mathcal{X}$ to $\mathcal{Z}$ and $\mathcal{H}$ be a hypothesis space of VC dimension $d$. If a random labeled sample of size $m$ is generated by applying $\mathcal{R}$ to a $\mathcal{D}_S$-i.i.d. sample labeled according to $f$, then with probability at least $1-\delta$, for every $h\in\mathcal{H}$:
$$
\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda
$$
where $e$ is the base of the natural logarithm.
Proof:
Let $h^*=\operatorname{argmin}_{h\in\mathcal{H}}(\epsilon_T(h)+\epsilon_S(h))$, and write $\lambda_T=\epsilon_T(h^*)$, $\lambda_S=\epsilon_S(h^*)$, and $\lambda=\lambda_T+\lambda_S$.
$$
\begin{aligned}
\epsilon_T(h)&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]\\
&=\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]+\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]+\left|\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]-\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]\right|\\
&\leq\lambda_T+\Pr_{\tilde{\mathcal{D}}_S}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&\leq\lambda_T+\lambda_S+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)\\
&\leq\lambda+\epsilon_S(h)+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)
\end{aligned}
$$
where $\mathcal{Z}_h=\{\mathbf{z}\in\mathcal{Z}:h(\mathbf{z})=1\}$ and $\Delta$ denotes the symmetric difference of sets, so $\Pr_{\tilde{\mathcal{D}}_T}[\mathcal{Z}_h\Delta\mathcal{Z}_{h^*}]$ can be read as $\epsilon_T(h,h^*)$, the probability that $h$ and $h^*$ disagree.
The first inequality comes from the triangle inequality $\epsilon_T(h,f_T)\leqslant\epsilon_T(h^*,f_T)+\epsilon_T(h^*,h)$.
The fifth line comes from the triangle inequality $\epsilon_S(h^*,h)\leqslant\epsilon_S(h^*,f_S)+\epsilon_S(h,f_S)$ (note the reference function here is the source labeling $f_S$).
Finally, by Vapnik-Chervonenkis theory (V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998),
$$
\epsilon_S(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}
$$
Therefore
$$
\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)+\lambda
$$
Similarly, estimating $d_{\mathcal{H}}\left(\tilde{\mathcal{D}}_S,\tilde{\mathcal{D}}_T\right)$ empirically (Lemma 1) from unlabeled samples of size $m'$ drawn from each induced distribution, the bound can be written further as:
$$
\epsilon_T(h)\leq\hat{\epsilon}_S(h)+\sqrt{\frac{4}{m}\left(d\log\frac{2em}{d}+\log\frac{4}{\delta}\right)}+\lambda+\hat{d}_{\mathcal{H}}\left(\tilde{\mathcal{U}}_S,\tilde{\mathcal{U}}_T\right)+4\sqrt{\frac{d\log(2m')+\log\left(\frac{4}{\delta}\right)}{m'}}
$$
This completes the proof.
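As a closing illustration, the two sample-complexity terms in this final bound can be evaluated under assumed values of $d$, $m$, $m'$, and $\delta$ (all numbers below are made up):

```python
# The two sample-complexity terms in the final bound: a VC term from the
# labeled source sample (size m) and a Lemma-1 term from the unlabeled
# samples (size m'). All values of d, delta, m, m' are assumptions.
import numpy as np

d, delta = 10, 0.05
m, m_prime = 5_000, 50_000  # labeled source sample / unlabeled samples

labeled = np.sqrt((4 / m) * (d * np.log(2 * np.e * m / d) + np.log(4 / delta)))
unlabeled = 4 * np.sqrt((d * np.log(2 * m_prime) + np.log(4 / delta)) / m_prime)
print(f"labeled term ~= {labeled:.3f}, unlabeled term ~= {unlabeled:.3f}")
# Unlabeled data is usually cheap, so the divergence estimate can be made
# tight even when labeled source data is the scarce resource.
```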
References:
[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, J. Wortman Vaughan. A theory of learning from different domains. Machine Learning, 2010.
[2] S. Ben-David, J. Blitzer, K. Crammer, F. Pereira. Analysis of Representations for Domain Adaptation. NIPS, 2006.
[3] V. Vapnik. Statistical Learning Theory. John Wiley, New York, 1998.