During the last decade, the field of computer vision has made huge progress. This progress is mostly due to the undeniable effectiveness of Convolutional Neural Networks (CNNs). CNNs allow for very precise predictions if trained with high-quality annotated training data. In a classification setting, for instance, you would often pick one of the standard network architectures (ResNet, VGG, etc.) and train it on your dataset, which would likely lead to very good performance.
On the other hand, if you do not have a huge manually annotated dataset for your particular problem at your disposal, CNNs also let you build on the work of others who have already trained a network for a similar problem, via transfer learning. In this case, you would take a network that is pretrained on a large dataset and fine-tune some of its upper layers using your own small annotated dataset.
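To make this concrete, here is a minimal PyTorch sketch of such fine-tuning. It is only an illustration, not a recipe from any of the papers discussed below: the number of target classes, the choice to retrain only the final layer, and the random stand-in batch are assumptions, and a recent torchvision version is assumed for the weights API.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pretrained on ImageNet (the "large dataset").
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained layers ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the final classification layer for a hypothetical 5-class target problem.
model.fc = nn.Linear(model.fc.in_features, 5)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a stand-in batch from the small annotated dataset.
images = torch.randn(8, 3, 224, 224)   # placeholder for real images
labels = torch.randint(0, 5, (8,))     # placeholder for real labels
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```

In practice you would often unfreeze a few of the upper convolutional blocks as well, depending on how much labeled data you have.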
Both of these approaches assume that your training data (be it large or small) is representative of the underlying distribution. However, if the inputs at test time differ significantly from the training data, the model might not perform very well. Let us assume, for instance, that you are an autonomous vehicle engineer and you want to segment the images captured by the car’s camera in order to know what lies ahead (buildings, trees, other cars, pedestrians, traffic lights, etc.). You have nice human-generated annotations for your NYC dataset, and you train a large network using these annotations. You test your self-driving car on the streets of Manhattan and everything seems to work fine. Then you test the same system in Paris and suddenly things go very wrong: the car is no longer able to detect traffic lights, the cars look very different (there are no yellow cabs in Paris), and the streets are not as straight anymore.
The reason your model did not perform very well in these scenarios is that the problem domain changed. In this particular case, the domain of the input data changed while the task domain (the labels) remained the same. On other occasions, you may want to use data from the same domain (drawn from the same underlying distribution) to accomplish a new task. The input and task domains may also differ at the same time. In these cases, domain adaptation comes to your rescue. Domain adaptation is a sub-discipline of machine learning that deals with scenarios in which a model trained on a source distribution is used in the context of a different (but related) target distribution. In general, domain adaptation uses labeled data from one or more source domains to solve new tasks in a target domain. How closely the source and target domains are related usually determines how successful the adaptation will be.
There are multiple approaches to domain adaptation. In “shallow” (not deep) domain adaptation, two methods are commonly used: reweighting the source samples and training on the reweighted samples, and learning a shared space in which the distributions of the source and target datasets are matched. While these techniques may also be applicable in the context of deep learning, the deep features learned by Deep Neural Networks (DNNs) usually give rise to more transferable representations (lower layers generally learn highly transferable features, while transferability sharply decreases in the higher layers; see e.g. this paper by Donahue et al.). In deep domain adaptation, we try to make use of this property of DNNs.
The following summary is mostly based on this review paper by Wang et al. and this review by Wilson et al. In these works, the authors distinguish between different types of domain adaptation, depending on the complexity of the task, the amount of labeled/unlabeled data available, and the difference in input feature space. They specifically define domain adaptation as a problem in which the task space is the same and the difference lies solely in the divergence of the input domains. Based on this definition, domain adaptation can be either homogeneous (the input feature spaces are the same, but the data distributions differ) or heterogeneous (the feature spaces and their dimensionalities may differ).
Domain adaptation may also occur in one step (one-step domain adaptation), or through multiple steps, traversing one or more domains in the process (multi-step domain adaptation). In this post, we will only discuss one-step domain adaptation, as this is the most common type of domain adaptation.
Depending on the data you have available from the target domain, domain adaptation can further be classified into supervised (you do have labeled data from the target domain, albeit an amount which is too small for training a whole model), semi-supervised (you have both labeled and unlabeled data), and unsupervised (you do not have any labeled data from the target domain).
How can we determine whether a model trained in a source domain can be adapted to our target domain? It turns out that this question is not easy to answer, and task relatedness is still a topic of active research. We can define two tasks as similar if they use the same features to make a decision. Another possibility is to define two tasks as similar if their parameter vectors (i.e., their classification boundaries) are close (see this paper by Xue et al.). Ben-David et al., on the other hand, propose that two tasks are F-related if the data for both tasks can be generated from a fixed probability distribution using a set of transformations F.
Despite these theoretical deliberations, in practice you will probably have to try domain adaptation on your own datasets to see whether a model from the source task provides any benefit for your target task. Often, task relatedness can be judged by simple reasoning: images taken from different viewing angles or under different lighting conditions, or, in the medical field, images acquired with different devices, and so on.
There are three basic techniques for one-step domain adaptation: divergence-based, adversarial-based, and reconstruction-based approaches.
Divergence-based domain adaptation works by minimizing some divergence criterion between the source and target data distributions, thus achieving a domain-invariant feature representation. If we find such a feature representation, the classifier will be able to perform equally well on both domains. This of course assumes that such a representation exists, which in turn assumes that the tasks are related in some way.
The four most frequently used divergence measures are Maximum Mean Discrepancy (MMD), Correlation Alignment (CORAL), Contrastive Domain Discrepancy (CDD), and the Wasserstein metric.
MMD is the statistic behind a two-sample hypothesis test that checks whether two samples come from the same distribution by comparing the means of the features after mapping them into a Reproducing Kernel Hilbert Space (RKHS). If the means differ, the distributions are likely different as well. This is often accomplished by using the kernel embedding trick and comparing the samples with a Gaussian kernel. The intuition is that if two distributions are identical, the average similarity between samples within each distribution should equal the average similarity between mixed samples from both distributions. An example of using MMD for domain adaptation is this paper by Rozantsev et al., in which a two-stream architecture is used whose weights are not shared but are pushed towards similar feature representations by a combination of classification, weight-regularization, and domain-discrepancy (MMD) losses, as in the figure below.
Two-stream architecture by Rozantsev et al.
The setting may be supervised, semi-supervised, or even unsupervised (no classification loss in the target domain).
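To make the MMD criterion more concrete, here is a minimal sketch of a Gaussian-kernel MMD loss in PyTorch. It is not the exact loss used by Rozantsev et al., just an illustration of the general idea; the kernel bandwidth, feature dimension, and batch sizes are arbitrary assumptions.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian kernel values between rows of x and rows of y."""
    sq_dists = torch.cdist(x, y) ** 2            # squared Euclidean distances
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd_loss(source_features, target_features, sigma=1.0):
    """(Biased) estimate of the squared MMD in the RKHS induced by the Gaussian kernel."""
    k_ss = gaussian_kernel(source_features, source_features, sigma).mean()
    k_tt = gaussian_kernel(target_features, target_features, sigma).mean()
    k_st = gaussian_kernel(source_features, target_features, sigma).mean()
    return k_ss + k_tt - 2 * k_st

# Usage: add the MMD term to the supervised task loss so that the feature
# distributions of the two domains are pulled together during training.
src = torch.randn(32, 128)   # hypothetical deep features of a source batch
tgt = torch.randn(32, 128)   # hypothetical deep features of a target batch
loss = mmd_loss(src, tgt)    # + classification loss on the labeled source data
```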
CORAL is similar to MMD; however, it tries to align the second-order statistics (correlations) of the source and target distributions instead of the means, using a linear transformation. This paper by Sun et al. uses CORAL in the context of deep learning by constructing a differentiable CORAL loss from the Frobenius norm of the difference between the source and target covariance matrices.
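A sketch of such a differentiable CORAL loss, following the general recipe of matching covariance matrices with a squared Frobenius norm, could look like this (the feature dimension and batch sizes are again assumptions for illustration):

```python
import torch

def coral_loss(source_features, target_features):
    """Squared Frobenius distance between the source and target covariance matrices."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)      # center the features
        return (x.t() @ x) / (x.size(0) - 1)     # d x d covariance matrix

    d = source_features.size(1)
    diff = covariance(source_features) - covariance(target_features)
    return (diff ** 2).sum() / (4 * d * d)       # scale by 1 / (4 d^2)

src = torch.randn(32, 128)   # hypothetical source batch of deep features
tgt = torch.randn(32, 128)   # hypothetical target batch of deep features
loss = coral_loss(src, tgt)  # added to the classification loss during training
```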
CDD builds on MMD but also makes use of the label distributions by looking at conditional distributions. This ensures that the joint domain features still retain predictive power w.r.t. the labels. Minimizing the CDD minimizes the intra-class discrepancy while maximizing the inter-class discrepancy, which requires labels in both the source and target domains. To get rid of this constraint, Kang et al. propose to estimate the missing target labels using clustering, in an iterative procedure that jointly optimizes the target labels and the feature representations: the target labels are found by clustering, whereupon the CDD is minimized to adapt the features.
Finally, the feature and label distributions in the source and target domains can be aligned by considering the optimal transport problem and its corresponding distance, the Wasserstein distance. This is proposed in DeepJDOT (Damodaran et al.): the authors minimize the discrepancy between the joint distributions of deep features and labels through optimal transport.
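As a rough illustration of the optimal-transport idea (not DeepJDOT’s actual objective, which also couples labels and classifier predictions in the transport cost), the sketch below computes an entropic-regularized transport plan between two feature batches with Sinkhorn iterations; the batch sizes, feature dimension, regularization strength, and iteration count are arbitrary assumptions.

```python
import torch

def sinkhorn_plan(cost, epsilon=0.1, n_iters=50):
    """Entropic-regularized optimal transport plan between two uniform batches."""
    n, m = cost.shape
    K = torch.exp(-cost / epsilon)                # Gibbs kernel
    a = torch.full((n,), 1.0 / n)                 # uniform source weights
    b = torch.full((m,), 1.0 / m)                 # uniform target weights
    v = torch.full((m,), 1.0 / m)
    for _ in range(n_iters):                      # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]            # transport plan (n x m)

src_feat = torch.randn(16, 64)                    # hypothetical source features
tgt_feat = torch.randn(16, 64)                    # hypothetical target features
cost = torch.cdist(src_feat, tgt_feat) ** 2       # ground cost between samples
plan = sinkhorn_plan(cost)
transport_loss = (plan * cost).sum()              # Wasserstein-like alignment term
```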
The second technique, adversarial-based domain adaptation, tries to achieve domain adaptation through adversarial training.
One approach is to generate synthetic target data that are somehow related to the source domain (e.g. by retaining the labels) using Generative Adversarial Networks (GANs). These synthetic data are then used to train the target model.
The CoGAN model tries to achieve this by using two generator/discriminator pairs, one for the source and one for the target distribution. Some of the weights of the generators and discriminators are shared in order to learn a domain-invariant feature space. In this way, labeled target data can be generated and further used in a task such as classification.
CoGAN architecture by Liu et al.
In another approach, Yoo et al. try to learn a source/target converter network by using two discriminators: one ensures that the target data are realistic, the other preserves the relatedness between the source and target domains. The generator is conditioned on the source data. This approach only requires unlabeled data in the target domain.
We can also get rid of generators altogether and perform domain adaptation in one go if we use a so-called domain confusion loss in addition to the classification loss for the current task. The domain confusion loss is similar to the discriminator in GANs in that it tries to match the distributions of the source and target domains in order to “confuse” the high-level classification layers. Perhaps the most famous example of such a network is the Domain-Adversarial Neural Network (DANN) by Ganin et al. This network is trained with two losses, the classification loss and the domain confusion loss, and contains a gradient reversal layer to match the feature distributions. The classification loss is minimized for the source samples and the domain confusion loss for all samples, while the gradient reversal layer makes the feature extractor maximize the domain confusion loss; this ensures that the samples from the two domains become mutually indistinguishable for the domain classifier.
Domain-adversarial neural network architecture by Ganin et al.
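The gradient reversal layer itself is only a few lines of code. Below is one common way to implement it in PyTorch; it is a sketch of the mechanism rather than Ganin et al.’s original code, and the surrounding feature extractor, label classifier, and domain classifier (as well as the lambda value) are assumed modules named only for illustration.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the feature extractor.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)

# Usage inside a DANN-style forward pass (feature_extractor, label_classifier and
# domain_classifier are assumed to be ordinary nn.Module instances):
#   features      = feature_extractor(images)
#   class_logits  = label_classifier(features)
#   domain_logits = domain_classifier(grad_reverse(features, lambd=0.5))
# Minimizing both losses trains the domain classifier normally, while the reversed
# gradient pushes the feature extractor towards domain-invariant features.
```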
The third technique, reconstruction-based domain adaptation, uses an auxiliary reconstruction task to create a shared representation for both domains. For instance, the Deep Reconstruction Classification Network (DRCN) tries to solve two tasks simultaneously: (i) classification of the source data, and (ii) reconstruction of the unlabeled target data. This ensures that the network learns not only to discriminate correctly but also to preserve information about the target data. In the paper, the authors also mention that the reconstruction pipeline learns to transform source images into images that resemble the target dataset, suggesting that a common representation is learned for both.
The DRCN architecture by Ghifary et al.
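Conceptually, the training objective combines a supervised classification loss on the source data with an unsupervised reconstruction loss on the target data. The sketch below shows one such training step; the tiny encoder, classifier, and decoder modules, the input size, and the trade-off weight are all assumptions made for illustration, not the architecture from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical modules: a shared encoder, a label classifier and a decoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())
classifier = nn.Linear(256, 10)
decoder = nn.Linear(256, 28 * 28)

ce_loss, mse_loss = nn.CrossEntropyLoss(), nn.MSELoss()
params = list(encoder.parameters()) + list(classifier.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

source_images = torch.randn(16, 1, 28, 28)    # labeled source batch (stand-in)
source_labels = torch.randint(0, 10, (16,))
target_images = torch.randn(16, 1, 28, 28)    # unlabeled target batch (stand-in)

optimizer.zero_grad()
classification = ce_loss(classifier(encoder(source_images)), source_labels)
reconstruction = mse_loss(decoder(encoder(target_images)), target_images.flatten(1))
loss = classification + 0.5 * reconstruction  # 0.5 is an assumed trade-off weight
loss.backward()
optimizer.step()
```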
Another possibility is to use so-called cycle GANs. Cycle GANs are inspired by the concept of dual learning in machine translation, where two opposite language translators (A→B, B→A) are trained simultaneously and the feedback signals in the loop consist of the corresponding language models and the mutual BLEU scores. The same can be done with images using the im2im framework. In this paper, a mapping from one image domain to another is learned without using any paired image samples. This is accomplished by simultaneously training two GANs that generate the images in the two domains, respectively. To ensure consistency, a cycle consistency loss is introduced: the transformation from one domain to the other and back again should result in roughly the same image as the input. Thus, the full loss of the two paired networks is the sum of the GAN losses of both discriminators and the cycle consistency loss.
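The cycle-consistency term itself is straightforward. Here is a minimal sketch, where G_ab and G_ba stand for hypothetical generator modules mapping domain A to B and back; the adversarial losses and the weighting factor are omitted.

```python
import torch.nn as nn

def cycle_consistency_loss(real_a, real_b, G_ab, G_ba):
    """L1 penalty for translating an image to the other domain and back again."""
    l1 = nn.L1Loss()
    cycle_a = G_ba(G_ab(real_a))   # A -> B -> A
    cycle_b = G_ab(G_ba(real_b))   # B -> A -> B
    return l1(cycle_a, real_a) + l1(cycle_b, real_b)

# This term is added to the adversarial losses of the two discriminators,
# usually with a weighting factor, to form the full training objective.
```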
Finally, GANs can also be used in an encoder-decoder setting by conditioning them on an image from the other domain. In the paper by Isola et al., conditional GANs are used to translate images from one domain to another by conditioning both the discriminator and the generator output on the input image. This can be achieved with a simple encoder-decoder architecture or, alternatively, with a U-Net architecture with skip connections.
Several results from the conditional GAN by Isola et al.
Deep domain adaptation allows us to transfer the knowledge learned by a DNN on a source task to a new, related target task. It has been successfully applied in tasks such as image classification and style transfer. In some sense, deep domain adaptation brings us closer to human-level performance in terms of the amount of training data required for a new computer vision task. Therefore, I think that progress in this area will be crucial to the entire field of computer vision, and I hope that it will eventually lead us to effective and simple knowledge reuse across visual tasks.