Deep Cocktail Network: Multi-source Unsupervised Domain Adaptation with Category Shift
SUMMARY@ 2020/5/12
Inspired by the distribution weighted combining rule in [33], the target distribution can be represented as the weighted combination of the multi-source distributions.
An ideal target predictor can be obtained by integrating all source predictions based on the corresponding source distribution weights.
domain discriminator: the multi-way adversarial adaptation implicitly reduces the domain shifts among those sources.
feature extractor and the category classifier
This paper focuses on the problem of multi-source domain adaptation, where there is category shift between diverse sources.
Category shift is a new protocol in MDA, where domain shift and categorical disalignment co-exist among the sources.
This paper addresses domain shift and category shift together.
Suppose the classifier for each source domain is known
Vanilla MDA: samples from diverse sources share a same category set
Category Shift: categories from different sources might be also different
$N$ different underlying source distributions $\{p_{\mathbf{s}_j}(x,y)\}_{j=1}^N$
1 target distribution $p_t(x,y)$, no labels
training set ensemble: $N+1$ datasets
testing set: drawn from the target distribution
the target domain is labeled by the union of all categories in those sources:
$$\mathcal{C}_{t}=\bigcup\limits_{j=1}^{N} \mathcal{C}_{s_{j}}$$
The uncommon classes are unified as a negative category called “unknown”.
In contrast, category shift considers the specific disaligned categories among multiple sources, enriching the classes available for transfer.
$N$ source-specific discriminators: $\left\{D_{s_j}\right\}_{j=1}^{N}$
Given an image $x$ from source $j$ or the target domain, the domain discriminator $D$ receives the features $F(x)$ and classifies whether $x$ comes from source $j$ or the target.
For each target instance $x^t$, the domain discriminator $D$ yields the $N$ source-specific discriminative results $\left\{D_{s_j}(F(x^t))\right\}_{j=1}^{N}$.
target-source perplexity scores:
$$\mathcal{S}_{cf}\left(x^{t}; F, D_{s_{j}}\right)=-\log \left(1-D_{s_{j}}\left(F\left(x^{t}\right)\right)\right)+\alpha_{s_{j}}$$
$\alpha_{s_{j}}$ is the source-specific concentration constant, obtained by averaging the source-$j$ discriminator losses over $X_{s_j}$.
the supplementary uses a different score and a different $\alpha$:
$$\alpha_{s_{j}}=\frac{1}{N_{T}} \sum_{i}^{N_{T}}\left(1-D_{s_{j}}\left(F\left(x_{i}^{s_{j}}\right)\right)\right)^{2}$$
$N_T$ denotes how many times the target samples have been visited while training the model; $x_{i}^{s_{j}}$ denotes the source-$j$ instance that comes along with the coupled target instances in the adversarial learning.
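A minimal numpy sketch of the perplexity score, assuming the discriminator output $D_{s_j}(F(x^t))$ is a probability in $(0,1)$ and $\alpha_{s_j}$ has been pre-computed (the function name is ours):

```python
import numpy as np

def perplexity_score(d_out, alpha):
    """Target-source perplexity score S_cf = -log(1 - D_sj(F(x^t))) + alpha_sj.

    d_out : discriminator output D_sj(F(x^t)) in (0, 1)
    alpha : source-specific concentration constant alpha_sj
    The closer d_out is to 1 (the more the target instance resembles
    source j), the larger the score."""
    return -np.log(1.0 - d_out) + alpha
```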
a multi-output net composed of $N$ source-specific predictors $\left\{C_{s_j}\right\}_{j=1}^{N}$; each predictor is a softmax classifier.
For an image from source $j$: only the output of $C_{s_j}$ is activated and provides the gradient for training.
For a target image $x^t$ instead, all source-specific predictors provide $N$ categorization results $\{C_{s_j}(F(x^t))\}^N_{j=1}$ to the target classification operator.
For each target feature $F(x^t)$, the target classification operator takes each source perplexity score $\mathcal{S}_{cf}\left(x^{t}; F, D_{s_{j}}\right)$ to re-weight the corresponding source-specific prediction $C_{s_j}(F(x^t))$.
the confidence that $x^t$ belongs to class $c$ is
$$\mathrm{Confidence}\left(c \mid x^{t}\right):=\sum_{j:\, c \in \mathcal{C}_{s_{j}}} \frac{\mathcal{S}_{cf}\left(x^{t}; F, D_{s_{j}}\right)}{\sum\limits_{k:\, c \in \mathcal{C}_{s_{k}}} \mathcal{S}_{cf}\left(x^{t}; F, D_{s_{k}}\right)}\, C_{s_{j}}\left(c \mid F\left(x^{t}\right)\right), \quad \text{where } c\in\bigcup_{j=1}^{N} \mathcal{C}_{s_{j}} \tag{2}$$
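Eq. 2 can be sketched directly in numpy; for each global class, only the sources whose category set contains it contribute, and their perplexity scores are normalized over exactly that subset (all names below are illustrative):

```python
import numpy as np

def target_confidence(scores, preds, class_sets, num_classes):
    """Eq. (2): weight each source prediction by its normalized perplexity score.

    scores      : list of N perplexity scores S_cf(x^t; F, D_sj)
    preds       : list of N softmax vectors C_sj(c | F(x^t)), each indexed
                  by that source's own class set
    class_sets  : list of N lists of global class ids (the sets C_sj)
    num_classes : size of the union of all C_sj
    """
    conf = np.zeros(num_classes)
    for c in range(num_classes):
        # normalize only over the sources whose category set contains c
        holders = [j for j, cs in enumerate(class_sets) if c in cs]
        denom = sum(scores[j] for j in holders)
        for j in holders:
            local = class_sets[j].index(c)  # position of c inside C_sj
            conf[c] += scores[j] / denom * preds[j][local]
    return conf
```

With two sources holding class sets {0,1} and {1,2}, only the shared class 1 gets a score-weighted mixture; the private classes are taken from their single holder.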
$$h_{\lambda}(x)=\sum_{i=1}^{k} \frac{\lambda_{i} D_{i}(x)}{\sum_{j=1}^{k} \lambda_{j} D_{j}(x)} h_{i}(x)$$
note that this hypothesis has a one-dimensional output, $h_i(x)\in \mathbb{R}$.
The ideal target classifier presents as the weighted combination of source classifiers.
Note that here each source classifier $C_{s_j}$ produces a multi-output softmax result.
$$C_{t}\left(c \mid x^{t}\right)=\sum_{j:\, c \in \mathcal{C}_{s_j}} \frac{\lambda_{j} \mathcal{D}_{s_{j}}\left(x^{t}\right)}{\sum_{k:\, c \in \mathcal{C}_{s_{k}}} \lambda_{k} \mathcal{D}_{s_{k}}\left(x^{t}\right)} C_{s_{j}}\left(c \mid F\left(x^{t}\right)\right)$$
as the probability that $x^t$ comes from source $j$ increases, $D_{s_{j}}\left(F\left(x^{t}\right)\right)\rightarrow 1$ and $\mathcal{D}_{s_{j}}\left(x^{t}\right)\rightarrow 1$,
so $\lambda_{j} \mathcal{D}_{s_{j}}\left(x^{t}\right) \propto \mathcal{S}_{cf}\left(x^{t}; F, D_{s_{j}}\right)=-\log \left(1-D_{s_{j}}\left(F\left(x^{t}\right)\right)\right)+\alpha_{s_{j}}$
take all source images to jointly train the feature extractor $F$ and the category classifier $C$
pseudo labels for the target: those networks and the target classification operator then predict categories for all target images and annotate those with high confidence.
Since the domain discriminator has not been trained yet, we use uniform simplex weights as the perplexity scores for the target classification operator.
Finally, we obtain the pre-trained feature extractor and category classifier via further fine-tuning them with sources and the pseudo-labeled target images.
In object recognition, we initialize DCTN the same way as DAN (start with an AlexNet model pretrained on ImageNet 2012 and fine-tune it).
In terms of digit recognition, we perform DCTN learning from scratch.
ref: the ADDA paper, Adversarial Discriminative Domain Adaptation
$$\mathcal{L}_{\mathrm{adv}_{M}}=-\mathcal{L}_{\mathrm{adv}_{D}}$$
$$\min _{D} \mathcal{L}_{\mathrm{adv}_{D}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, M_{s}, M_{t}\right), \qquad \min _{M_{s}, M_{t}} \mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)$$
change 1: early in training the discriminator converges quickly, causing the generator's gradient to vanish; the fix changes the generator objective, splitting the optimization into two independent objectives, one for the generator and one for the discriminator:
$$\mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)=-\mathbb{E}_{\mathbf{x}_{t} \sim \mathbf{X}_{t}}\left[\log D\left(M_{t}\left(\mathbf{x}_{t}\right)\right)\right] \tag{**}$$
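The vanishing-gradient motivation for (\*\*) can be checked numerically. Treating the discriminator output $d = D(M_t(x))$ as the variable, the sketch below (function name is ours) compares the gradient magnitudes of the saturating loss $\log(1-d)$ and the non-saturating loss $-\log d$ when the discriminator wins early ($d \to 0$):

```python
def generator_grad_magnitudes(d):
    """Gradient magnitudes w.r.t. d = D(M_t(x)) for the two generator losses:
    saturating  log(1 - d): |d/dd| = 1 / (1 - d)  -> 1 as d -> 0 (weak signal)
    non-saturating (-log d): |d/dd| = 1 / d       -> inf as d -> 0 (strong signal)
    """
    grad_saturating = 1.0 / (1.0 - d)
    grad_nonsaturating = 1.0 / d
    return grad_saturating, grad_nonsaturating
```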
change 2: in the setting where both distributions are changing, this objective leads to oscillation: when the mapping converges to its optimum, the discriminator can simply flip the sign of its prediction in response.
Tzeng et al. instead proposed the domain confusion objective, under which the mapping is trained using a cross-entropy loss function against a uniform distribution
This loss ensures that the adversarial discriminator views the two domains identically.
"Confuse" means keeping the discriminator half-convinced: the marginal distributions of source and target after the mapping should be as close as possible, so that each sample is judged to come from source or target with probability near one half (equivalently, every sample's true domain label is treated as equally likely to be 1 or 0; minimizing the cross-entropy against this uniform target makes the mapped source and target distributions nearly indistinguishable, so source and target can be regarded as mapped into very similar domains, and the DA task is done).
$$\mathcal{L}_{\mathrm{adv}_{M}}\left(\mathbf{X}_{s}, \mathbf{X}_{t}, D\right)= -\sum_{d \in\{s, t\}} \mathbb{E}_{\mathbf{x}_{d} \sim \mathbf{X}_{d}}\left[\frac{1}{2} \log D\left(M_{d}\left(\mathbf{x}_{d}\right)\right) +\frac{1}{2} \log \left(1-D\left(M_{d}\left(\mathbf{x}_{d}\right)\right)\right)\right] \tag{*}$$
Note: equation (\*) is not actually used for ADDA's results; it only appears in the related work, and ADDA itself still uses (\*\*).
(\*) was proposed in Simultaneous Deep Transfer Across Domains and Tasks.
The ADDA paper changes the generator's objective to (\*\*).
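The domain-confusion objective (\*) is just a cross-entropy of the discriminator outputs against the uniform $(\frac{1}{2},\frac{1}{2})$ target; a minimal numpy sketch, assuming `d_outs` are discriminator probabilities in $(0,1)$ (the function name is ours):

```python
import numpy as np

def confusion_loss(d_outs):
    """Cross-entropy of discriminator outputs against the uniform (1/2, 1/2)
    target: minimized when every D(M_d(x)) = 0.5, i.e. the discriminator is
    maximally confused about the domain."""
    d = np.asarray(d_outs)
    return float(np.mean(-(0.5 * np.log(d) + 0.5 * np.log(1.0 - d))))
```

The loss attains its minimum $\log 2$ exactly at $D = 0.5$ and grows as the discriminator becomes confident in either direction.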
in this paper: minmax adversarial domain adaptation
$$\min _{F} \max _{D} V(F, D; \bar{C})=\mathcal{L}_{adv}(F, D)+\mathcal{L}_{cls}(F, \bar{C})\tag{4}$$
The optimization based on Eq. 4 works well for $D$ but not for $F$.
Since the feature extractor learns the mapping from the multiple sources and the target, the domain distributions keep changing simultaneously during the adversary, which causes an oscillation that spoils the feature extractor.
When the source and target feature mappings share their architecture, domain confusion can replace the adversarial objective and learns the mapping $F$ stably.
multidomain confusion loss
$$\mathcal{L}_{adv}(F, D)=\frac{1}{N} \sum_{j}^{N} \mathbb{E}_{x \sim X_{s_{j}}} \mathcal{L}_{cf}\left(x; F, D_{s_{j}}\right) +\mathbb{E}_{x \sim X_{t}} \mathcal{L}_{cf}\left(x; F, D_{s_{j}}\right) \tag{6}$$
where
$$\mathcal{L}_{cf}\left(x; F, D_{s_{j}}\right)= \frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right) \tag{7}$$
i.e.
$$\mathcal{L}_{adv}(F, D)=\frac{1}{N} \sum_{j}^{N} \mathbb{E}_{x \sim X_{s_{j}}} \Big[\frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right)\Big] +\frac{1}{N} \sum_{j}^{N}\mathbb{E}_{x \sim X_{t}} \Big[\frac{1}{2} \log D_{s_{j}}(F(x))+\frac{1}{2} \log \left(1-D_{s_{j}}(F(x))\right)\Big]$$
Differences from (\*):
no negative sign;
multi-source, so there are $N$ discriminators, each discriminating between one source and the target;
in (\*) the source and target mappings differ, whereas here the feature extractor $F$ is shared;
this paper directly uses (\*) (rather, its negation, since (\*) is negative) as the loss function shared by the discriminator and the generator, measuring the target against each source;
cross entropy measures the discrepancy between two distributions; note that cross entropy is always positive.
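The multidomain confusion loss of Eqs. 6 and 7 can be sketched in numpy, assuming each $D_{s_j}$ outputs probabilities in $(0,1)$ and the expectations are approximated by minibatch means (function names are ours):

```python
import numpy as np

def l_cf(d_out):
    """Eq. (7): L_cf(x; F, D_sj) = 1/2 log D_sj(F(x)) + 1/2 log(1 - D_sj(F(x)))."""
    return 0.5 * np.log(d_out) + 0.5 * np.log(1.0 - d_out)

def multidomain_confusion(d_on_source, d_on_target):
    """Eq. (6): average over the N source-specific discriminators of the
    confusion term on that source's batch plus the term on the target batch.
    d_on_source[j] / d_on_target[j]: outputs of D_sj on source-j / target features."""
    n = len(d_on_source)
    return sum(np.mean(l_cf(np.asarray(ds))) + np.mean(l_cf(np.asarray(dt)))
               for ds, dt in zip(d_on_source, d_on_target)) / n
```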
samples from different sources are sometimes useless for improving the adaptation to the target, and as training proceeds, increasingly redundant source samples drag down the whole model's performance
minibatch: sample a batch of size $M$ from the target and from each source domain
Each source-target discriminator $D_{s_j}$'s loss is viewed as the degree to which it can distinguish the $M$ target samples $x^t_i$ from the $j$th source's $M$ samples:
$$\sum_i^M - \log D_{s_{j}}\left(F(x_i^{s_j})\right) - \log \left(1-D_{s_{j}}\left(F(x_i^{t})\right)\right)$$
This is the cross-entropy loss in the original-GAN form. The larger it is, the worse the discriminator separates the $M$ source samples and $M$ target samples by whether they come from source $j$ or the target domain, i.e. source $j$'s discriminator performs poorly.
$$j^*= \arg\max_{j}\Big\{ \sum_i^M - \log D_{s_{j}}\left(F(x_i^{s_j})\right) - \log \left(1-D_{s_{j}}\left(F(x_i^{t})\right)\right) \Big\}_{j=1}^N$$
we use the source-$j^*$ and the target samples in the minibatch to train the feature extractor
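This hard domain batch mining step can be sketched in numpy: compute the per-source GAN discriminator loss on the coupled minibatches and pick the source whose discriminator currently separates its samples from the target worst (the function name is ours):

```python
import numpy as np

def hardest_source(d_src_outs, d_tgt_outs):
    """Pick j* = argmax_j sum_i [-log D_sj(F(x_i^sj)) - log(1 - D_sj(F(x_i^t)))].

    d_src_outs[j]: outputs of D_sj on the j-th source's M minibatch samples
    d_tgt_outs[j]: outputs of D_sj on the M target minibatch samples
    Returns the index of the most confused discriminator and all N losses."""
    losses = [np.sum(-np.log(np.asarray(ds)) - np.log(1.0 - np.asarray(dt)))
              for ds, dt in zip(d_src_outs, d_tgt_outs)]
    return int(np.argmax(losses)), losses
```

A well-separated source (outputs near 1 on source samples, near 0 on target samples) yields a small loss and is skipped, so training concentrates on the source the target currently resembles most.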
Algorithm 1 below iterates these updates to find the best feature extractor:
$$\mathcal{L}_{adv}^{s_{j^*}}(F, D)=\sum_{i}^{M} \mathcal{L}_{cf}\left(x_i^{s_{j^*}}; F, D_{s_{j^*}}\right) +\mathcal{L}_{cf}\left(x_i^t; F, D_{s_{j^*}}\right)$$
$$\min _{F} \max _{D} V(F, D; \bar{C})=\mathcal{L}_{adv}^{s_{j^*}}(F, D)+\mathcal{L}_{cls}(F, \bar{C})$$
(Eq. 4 with the hard-mined adversarial loss)
Aided by the multi-way adversary, DCTN obtains good domain-invariant features, yet these are not guaranteed to be classifiable in the target domain.
auto-labeling strategy: annotate target samples, then jointly train our feature extractor and multi-source category classifier with source and target images by their (pseudo-) labels
classification losses from multiple source images and target images with pseudo labels
$$\min _{F, C} \mathcal{L}_{cls}(F, C)=\sum_{j}^{N} \mathbb{E}_{(x, y) \sim\left(X_{s_{j}}, Y_{s_{j}}\right)}\left[\mathcal{L}\left(C_{s_{j}}(F(x)), y\right)\right] +\mathbb{E}_{\left(x^{t}, \hat{y}\right) \sim\left(X_{t}^{p}, Y_{t}^{p}\right)}\left[\sum_{\hat{s}:\, \hat{y} \in \mathcal{C}_{\hat{s}}} \mathcal{L}\left(C_{\hat{s}}\left(F\left(x^{t}\right)\right), \hat{y}\right)\right] \tag{8}$$
apply the target classification operator to assign pseudo labels; the samples with confidence higher than a preset threshold are selected into $X^p_t$.
given a target instance $x^t$ with pseudo-labeled class $\hat y$, we find the sources $\hat s$ that include this class ($\hat y \in \mathcal{C}_{\hat s}$), then update the network via the sum of those sources' classification losses
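The auto-labeling selection step can be sketched in numpy, assuming `confidences` is the matrix of Eq. 2 outputs over all target samples (the function name and return format are ours):

```python
import numpy as np

def select_pseudo_labeled(confidences, threshold):
    """Keep target samples whose top confidence from the target classification
    operator exceeds a preset threshold; return (sample index, class) pairs
    forming the pseudo-labeled set X_t^p / Y_t^p.
    confidences: (num_target, num_classes) array from Eq. (2)."""
    picked = []
    for i, conf in enumerate(confidences):
        c = int(np.argmax(conf))
        if conf[c] > threshold:
            picked.append((i, c))
    return picked
```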
baseline
multi-source: two shallow methods
single-source models -> multi-source: conventional (TCA, GFK) / deep
Since those methods operate in the single-source setting, we introduce two MDA standards for different purposes
source only
split all categories into two non-overlapping class sets and define them as the private classes
DAN also suffers negative transfer gains in most situations, which indicates that the transferability of DAN is crippled under category shift.
In contrast, DCTN reduces the performance drops compared to the model in the vanilla setting, and obtains positive transfer gains in all situations. This reveals that DCTN can resist the negative transfer caused by the category shift.
visualize the DCTN activations before and after adaptation.
The adversarial-only model excludes the pseudo labels and updates the category classifier with source samples.
The pseudo-only model forbids the adversary and categorizes target samples with averaged multi-source results.
without domain batch mining technique
despite the frequent deviation, the classification loss, adversarial loss and testing error gradually converge.