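Recent progress in self-supervised learning (Naiyan Wang)
Yann LeCun, Self-supervised learning: can machines learn like humans? 110-slide PPT + video
What are some of the newer ideas in self learning? (Xiaolong Wang)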
ICML workshop
LeCun IJCAI18 ppt / Zisserman ICML19 ppt
my MacBook :)
Multi-task Self-Supervised Visual Learning (ICCV)
Vision is one of the most promising domains for unsupervised learning. Unlabeled images and video are available in practically unlimited quantities, and the most prominent present image models—neural networks—are data starved, easily memorizing even random labels for large image collections. Yet unsupervised algorithms are still not very effective for training neural networks: they fail to adequately capture the visual semantics needed to solve real-world tasks like object detection or geometry estimation the way strongly-supervised methods do. For most vision problems, the current state-of-the-art approach begins by training a neural network on ImageNet or a similarly large dataset which has been hand-annotated.
How might we better train neural networks without manual labeling? Neural networks are generally trained via backpropagation on some objective function. Without labels, however, what objective function can measure how good the network is? Self-supervised learning answers this question by proposing various tasks for networks to solve, where performance is easy to measure, i.e., performance can be captured with an objective function like those seen in supervised learning. Ideally, these tasks will be difficult to solve without understanding some form of image semantics, yet any labels necessary to formulate the objective function can be obtained automatically. In the last few years, a considerable number of such tasks have been proposed [1, 2, 6, 7, 8, 17, 20, 21, 23, 25, 26, 27, 28, 29, 31, 39, 40, 42, 43, 46, 47], such as asking a neural network to colorize grayscale images, fill in image holes, solve jigsaw puzzles made from image patches, or predict movement in videos. Neural networks pre-trained with these tasks can be re-trained to perform well on standard vision tasks (e.g. image classification, object detection, geometry estimation) with less manually-labeled data than networks which are initialized randomly. However, they still perform worse in this setting than networks pre-trained on ImageNet.
Related Work
Two categories: use auxiliary information / use raw pixels.
video & image
TextTopicNet: Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces
Split-brain Auto-encoder
Supervision comes from predicting one group of channels from another, e.g. L / ab, RGB / D.
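A minimal sketch of the cross-channel idea for the RGB-D case, assuming an image stored as nested lists of per-pixel channel tuples. The helper names `split_channels` and `cross_channel_pairs` are hypothetical, and the L / ab case would additionally need a colour-space conversion that is left out here.

```python
from typing import List, Sequence, Tuple

Pixel = Sequence[float]
Image = List[List[Pixel]]


def split_channels(image: Image, first: Sequence[int], second: Sequence[int]):
    """Split an H x W x C image into two images holding disjoint channel groups."""
    a = [[tuple(px[c] for c in first) for px in row] for row in image]
    b = [[tuple(px[c] for c in second) for px in row] for row in image]
    return a, b


def cross_channel_pairs(image: Image, first: Sequence[int], second: Sequence[int]):
    """Build two (input, target) pairs: first -> second and second -> first.

    No manual label is needed; each half of the data supervises the other.
    """
    a, b = split_channels(image, first, second)
    return [(a, b), (b, a)]


# Toy 1 x 2 RGB-D image: channels 0..2 are RGB, channel 3 is depth.
rgbd = [[(1, 2, 3, 9), (4, 5, 6, 8)]]
rgb_to_d, d_to_rgb = cross_channel_pairs(rgbd, (0, 1, 2), (3,))
```

Each pair would be fed to its own sub-network, and the two learned representations combined afterwards.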
Unsupervised Visual Representation Learning by Context Prediction
relative position
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. (self-supervision)
1. Limited labeled data
Recently, new computer vision methods have leveraged large datasets of millions of labeled examples to learn rich, high-performance visual representations.
Yet efforts to scale these methods to truly Internet-scale datasets (i.e. hundreds of billions of images) are hampered by the sheer expense of the human annotation required.
A natural way to address this difficulty would be to employ unsupervised learning, which aims to use data without any annotation.
2. Motivation: context in the text domain
This converts an apparently unsupervised problem (finding a good similarity metric between words) into a “self-supervised” one: learning a function from a given word to the words surrounding it.
Here the context prediction task is just a “pretext” to force the model to learn a good word embedding, which, in turn, has been shown to be useful in a number of real tasks, such as semantic word similarity.
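The word-to-surrounding-words formulation can be sketched as follows. This only shows how the (word, context) training pairs fall out of raw text with no annotation; the function name `context_pairs` and the `window` parameter are illustrative assumptions, not taken from the paper.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours.

    The "labels" (context words) come for free from the text itself.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = context_pairs(["the", "cat", "sat"], window=1)
# pairs == [("the", "cat"), ("cat", "the"), ("cat", "sat"), ("sat", "cat")]
```

A model trained to predict the second element of each pair from the first is solving the pretext task; the embedding it learns along the way is the actual goal.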
3. Our paper
Our underlying hypothesis is that doing well on this task requires understanding scenes and objects, i.e. a good visual representation for this task will need to extract objects and their parts in order to reason about their relative spatial location. (the role of the pretext task)
“Objects,” after all, consist of multiple parts that can be detected independently of one another, and which occur in a specific spatial configuration (if there is no specific configuration of the parts, then it is “stuff” [1]).
We demonstrate that the resulting visual representation is good for both object detection, providing a significant boost on PASCAL VOC 2007 compared to learning from scratch, as well as for unsupervised object discovery / visual data mining. This means, surprisingly, that our representation generalizes across images, despite being trained using an objective function that operates on a single image at a time. That is, instance-level supervision appears to improve performance on category-level tasks.
Related Work
1. Generative models
Problem: Generative models have shown promising performance on smaller datasets such as handwritten digits [25, 24, 48, 30, 46], but none have proven effective for high-resolution natural images. (as of 2016)
2. Unsupervised learning
Problem: We believe that current reconstruction-based algorithms struggle with low-level phenomena, like stochastic textures, making it hard to even measure whether a model is generating well.
Text domain: context prediction
Various pretext tasks: However, such a task would be trivial, since discriminating low-level color statistics and lighting would be enough. To make the task harder and more high-level, in this paper, we instead classify between multiple possible configurations of patches sampled from the same image, which means they will share lighting and color statistics, as shown on Figure 2.
Another line of work in unsupervised learning from images aims to ...
Our work
Avoiding trivial solution
When designing a pretext task, care must be taken to ensure that the task forces the network to extract the desired information (high-level semantics, in our case), without taking “trivial” shortcuts. In our case, low-level cues like boundary patterns or textures continuing between patches could potentially serve as such a shortcut. Hence, for the relative prediction task, it was important to include a gap between patches.
However, even these precautions are not enough: we were surprised to find that, for some images, another trivial solution exists. We traced the problem to an unexpected culprit: chromatic aberration.
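A rough sketch of sampling one training example for the relative-position task with a gap between patches. It assumes the eight-neighbour layout around a centre patch; the patch size, gap, and jitter defaults are illustrative values, and the names `sample_patch_pair` and `NEIGHBOUR_OFFSETS` are hypothetical. It does not address chromatic aberration.

```python
import random

# (row, col) offsets of the 8 neighbours around the centre patch; index = class label.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(height, width, patch=8, gap=4, jitter=1, rng=random):
    """Return (centre_box, neighbour_box, label); boxes are (top, left, size).

    The gap keeps edges and textures from continuing across the two patches,
    and the jitter makes exact alignment cues unreliable.
    """
    step = patch + gap
    margin = step + jitter
    if height < 2 * margin + patch or width < 2 * margin + patch:
        raise ValueError("image too small for this patch/gap/jitter setting")
    top = rng.randint(margin, height - margin - patch)
    left = rng.randint(margin, width - margin - patch)
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    n_top = top + d_row * step + rng.randint(-jitter, jitter)
    n_left = left + d_col * step + rng.randint(-jitter, jitter)
    return (top, left, patch), (n_top, n_left, patch), label


centre, neighbour, label = sample_patch_pair(64, 64, rng=random.Random(0))
```

The network sees the two cropped patches and has to predict `label`, so the supervision is derived entirely from where the patches were cut.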
Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles
By following the principles of self-supervision, we build a convolutional neural network (CNN) that can be trained to solve Jigsaw puzzles as a pretext task, which requires no manual labeling, and then later repurposed to solve object classification and detection.
We show that the CFN [context-free network] includes fewer parameters than AlexNet while preserving the same semantic learning capabilities.
1. Vision: limited labeled data
However, as manually labeled data can be costly, unsupervised learning methods are gaining momentum.
2. Self-supervision
... have explored a novel paradigm for unsupervised learning called self-supervised learning. The main idea is to exploit different labelings that are freely available besides or within visual data, and to use them as intrinsic reward signals to learn general-purpose features.
The features obtained with these approaches have been successfully transferred to classification and detection tasks, and their performance is very encouraging when compared to features trained in a supervised manner.
We introduce a novel self-supervised task, the Jigsaw puzzle reassembly problem (see Fig. 1), which builds features that yield high performance when transferred to detection and classification tasks.
3. Our work
We argue that solving Jigsaw puzzles can be used to teach a system that an object is made of parts and what these parts are. The association of each separate puzzle tile to a precise object part might be ambiguous. However, when all the tiles are observed, the ambiguities might be eliminated more easily because the tile placement is mutually exclusive. This argument is supported by our experimental validation. Training a Jigsaw puzzle solver takes about 2.5 days compared to 4 weeks of [10]. Also, there is no need to handle chromatic aberration or to build robustness to pixelation. Moreover, the features are highly transferable to detection and classification and yield the highest performance to date for an unsupervised method.
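A minimal sketch of how a Jigsaw training example could be generated, assuming a 3 x 3 grid and a small fixed set of permutations whose index serves as the class label. Here the permutation set is drawn at random for simplicity, which is an assumption of this sketch; all function names are hypothetical.

```python
import random


def split_into_tiles(image, grid=3):
    """Cut an H x W image (list of rows) into grid*grid tiles, row-major order."""
    tile_h, tile_w = len(image) // grid, len(image[0]) // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * tile_h:(r + 1) * tile_h]
            tiles.append([row[c * tile_w:(c + 1) * tile_w] for row in rows])
    return tiles


def make_permutation_set(n_tiles=9, size=10, seed=0):
    """Pick `size` distinct tile orderings; the index into this list is the label."""
    rng = random.Random(seed)
    perms = set()
    while len(perms) < size:
        order = list(range(n_tiles))
        rng.shuffle(order)
        perms.add(tuple(order))
    return sorted(perms)


def make_puzzle(image, permutations, rng=random):
    """Return (shuffled_tiles, label) for one image; no manual annotation needed."""
    label = rng.randrange(len(permutations))
    tiles = split_into_tiles(image)
    return [tiles[i] for i in permutations[label]], label


image = [[r * 6 + c for c in range(6)] for r in range(6)]  # toy 6 x 6 "image"
permutations = make_permutation_set()
shuffled, label = make_puzzle(image, permutations, random.Random(1))
```

Because the network receives all nine tiles at once and must output a single permutation index, it can use the mutual exclusivity of tile placements that the paragraph above points to.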
Related Work
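1. Representation learning
transfer learning / pre-training (this framing seems acceptable too; see the experiments later)
2. Unsupervised learning
Three categories: probabilistic, direct mapping (autoencoders), and manifold learning ones
3. Self-supervised learning
1 Incomplete supervision: few labeled samples; handled by active learning, semi-supervised learning, transfer learning
2 Inexact supervision: coarse-grained labels; handled by multi-instance learning
3 Inaccurate supervision: noisy labels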
A brief introduction to weakly supervised learning