https://www.zdaiot.com/DeepLearningApplications/Few-shot%20Learning/Few-shot%20learning%20with%20graph%20neural%20networks/ \ https://mp.weixin.qq.com/s/YVMhqhURqGmJ5D26pXPjQg FSL GNN
https://research.aimultiple.com/few-shot-learning/
https://neptune.ai/blog/understanding-few-shot-learning-in-computer-vision
https://medium.com/quick-code/understanding-few-shot-learning-in-machine-learning-bede251a0f67
https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html
Common practice for machine learning applications is to feed as much data as the model can take. This is because in most machine learning applications feeding more data enables the model to predict better (it needs lots of examples to drive lots of iterations of stochastic gradient descent and gradually refine the weights in the model).
Traditional ML models cannot discriminate between classes that are not present in the training dataset.
Traditionally, a neural network learns to predict multiple classes. This poses a problem when we need to add/remove new classes to the data. In this case, we have to update the neural network and retrain it on the whole dataset. Also, deep neural networks need a large volume of data to train on.
In the real world, you can rarely build or find a dataset with that many samples.
Labeling additional samples is a time-consuming and expensive task.
However, accruing enough data to increase the accuracy of the models is sometimes unrealistic. For example, in large business settings, labeling samples becomes costly and difficult to manage. Likewise, some rare pathologies might lack enough images to be used in a training set. This is exactly the type of problem that can be solved by building an FSL classifier.
Few-shot learning aims to build accurate machine learning models with less training data. As the size of the input data is a factor that determines resource costs (e.g. time costs, computational costs etc.), companies can reduce data analysis/machine learning (ML) costs by using few-shot learning.
Few-shot learning techniques enable ML models to separate two classes that are not present in the training data, and in some applications they can even separate more than two unseen classes.
Here are some situations that are driving their increased adoption:
Few-shot learning (FSL), also referred to as low-shot learning (LSL) in a few sources, is a type of machine learning problem in which the training dataset contains limited information.
Few-shot learning aims for ML models to predict the correct class of instances when only a small number of examples is available in the training dataset.
Few-Shot Learning (FSL) aims at recognizing target classes for which only a few training samples are available.
Few-Shot Learning is a sub-area of machine learning. It’s about classifying new data when you have only a few training samples with supervised information.
Few-Shot Learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase, so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase. An effective approach to the Few-Shot Learning problem is to learn a common representation for various tasks and train task-specific classifiers on top of this representation.
Employing an object categorization model can still give appropriate results even without many training samples.
This approach is based on the idea that whenever there is insufficient data to fit the parameters of the algorithm (that is, to build a reliable model) while avoiding underfitting or overfitting, more data should be added.
Exploit prior knowledge about the structure and variability of the data, which enables the construction of viable models from few examples.
DATA AUGMENTATION
GAN: new images of birds can be produced from different perspectives if there are enough examples available in the training set.
Generative models can be constructed for families of data classes:
Pen-stroke models
Neural statistician
New examples for the training set can be synthesized:
Analogies
End-to-end
If two entities participate in a relation, any sentence that contains those two entities might express that relation.
In distant supervision, we make use of an already existing database, such as Freebase or a domain-specific database, to collect examples for the relation we want to extract. We then use these examples to automatically generate our training data. For example, Freebase contains the fact that Barack Obama and Michelle Obama are married. We take this fact, and then label each pair of “Barack Obama” and “Michelle Obama” that appears in the same sentence as a positive example for our marriage relation. This way we can easily generate a large amount of (possibly noisy) training data.
Weaknesses:
– Uses the relation between two entities in a knowledge base as the semantic relation between the two entity mentions in text
– Introduces noise into the labels
– Few instances for rare relations
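The labeling heuristic behind distant supervision, and the noise it introduces, can be sketched in a few lines. This is only an illustration: the function name `distant_label`, the triple format and the example sentences are all made up for the sketch, and real systems match entity mentions with an entity linker rather than plain substring tests.

```python
def distant_label(sentences, kb_facts):
    """Tag every sentence that mentions both entities of a KB fact with that fact's relation."""
    labeled = []
    for sentence in sentences:
        for head, relation, tail in kb_facts:
            # Assumption: naive substring matching stands in for entity linking.
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


# Hypothetical knowledge-base triple and made-up sentences, for illustration only.
kb_facts = [("Barack Obama", "spouse", "Michelle Obama")]
sentences = [
    "Barack Obama is married to Michelle Obama.",         # expresses the relation
    "Barack Obama and Michelle Obama visited a school.",  # noisy positive
    "Barack Obama gave a speech.",                        # only one entity: skipped
]

training_examples = distant_label(sentences, kb_facts)
for example in training_examples:
    print(example)
```

The second sentence is labeled as a positive "spouse" example even though it does not express the relation, which is the labeling noise listed among the weaknesses.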
Multiple Instance Learning (MIL) was proposed as a variation of supervised learning for problems with incomplete knowledge about the labels of training examples. In supervised learning, every training instance is assigned a discrete or real-valued label. In comparison, in MIL the labels are only assigned to bags of instances. The goal of MIL is to classify unseen bags or instances using the labeled bags as training data.
Multiple-instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag is labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.
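The bag-labeling rule for the binary case can be written down directly. The sketch below is illustrative only: `bag_label` and `predict_bag` are hypothetical helper names, and max pooling over instance scores is just one common way to turn instance-level scores into a bag-level prediction.

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff at least one instance is positive."""
    return 1 if any(label == 1 for label in instance_labels) else 0


def predict_bag(instance_scores, threshold=0.5):
    """Bag-level prediction from instance scores via max pooling (an assumed choice)."""
    return 1 if max(instance_scores) >= threshold else 0


bags = {
    "bag_a": [0, 0, 1],  # one positive instance -> positive bag
    "bag_b": [0, 0, 0],  # all instances negative -> negative bag
}
labels = {name: bag_label(instances) for name, instances in bags.items()}
print(labels)
```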
Because data is scarce, the parameter space can be far too large for the few available samples, so it’s quite easy to overfit on Few-Shot Learning samples. To overcome overfitting issues, the parameter space can be limited.
To overcome this problem we should limit the parameter space and use regularization and proper loss functions. The model will then generalize better from the limited number of training samples.
On the other hand, we can enhance model performance by guiding it through the extensive parameter space. If we use a standard optimization algorithm, it might not give reliable results because of the small amount of training data.
That is why, at the parameter level, we train our model to find the best route through the parameter space to give optimal prediction results (meta-learning).
The first and most obvious step in an FSL task is to gain experience from other, similar problems. This is why Few-Shot Learning is characterized as a Meta-Learning problem.
In a traditional classification problem, we try to learn how to classify from the training data, and evaluate using test data.
In Meta-Learning, we learn how to learn to classify given a set of training data. We use what is learned on one set of classification problems to solve other, unrelated sets.
In the Meta-Learning paradigm, we have a set of tasks. An algorithm is learning to learn if its performance at each task improves with experience and with the number of tasks.
Difficulties:
The problems of learning a good feature representation and choosing an appropriate distance function.
Metric-based methods learn a distance function over objects: they classify query samples based on their similarity to the support samples.
A Siamese Neural Network is a class of neural network architectures that contain two or more identical subnetworks. “Identical” here means that they have the same configuration with the same parameters and weights. Parameter updating is mirrored across both sub-networks. The network measures the similarity of its inputs by comparing their feature vectors, which makes it useful in many applications.
SNNs learn a similarity function. Thus, we can train one to see if two images are the same (which we will do here). This enables us to classify new classes of data without training the network again.
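A minimal sketch of the weight-sharing idea, under loud assumptions: the embedding is a single tanh layer, the weights are hand-picked and untrained, and the similarity is defined here as exp(-Euclidean distance). None of these choices come from a specific Siamese network paper; they only show that both inputs pass through the same parameters.

```python
import math


def embed(x, weights):
    """One tanh layer; both branches call this with the SAME weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def similarity(a, b, weights):
    """Similarity in (0, 1]: exp(-Euclidean distance between the two embeddings)."""
    ea, eb = embed(a, weights), embed(b, weights)
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))
    return math.exp(-distance)


shared_weights = [[0.5, -0.2], [0.1, 0.3]]  # hypothetical, untrained weights
x1, x2 = [1.0, 2.0], [3.0, -1.0]
print(similarity(x1, x1, shared_weights))  # identical inputs
print(similarity(x1, x2, shared_weights))  # different inputs
```

Because the two branches share weights, the score is symmetric in its arguments and identical inputs always get the maximum score. In a trained network the weights would be learned, for example with a contrastive or cross-entropy loss on same/different pairs.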
The main advantages of Siamese Networks are:
The downsides of the Siamese Networks can be:
In this work, we employ ideas from metric learning based on deep neural features and from recent advances that augment neural networks with external memories. Our framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types.
There is some external memory and an attention mechanism which is used to access that memory. It’s a network that learns how to learn a classifier from only a very small number of examples…
For each episode, Matching Networks apply the following procedure:
a). Each image from the support and the query set is fed to a CNN that outputs embeddings for them,
b). Each query image is classified using the softmax of the cosine similarity between its embedding and the support-set embeddings,
c). The Cross-Entropy Loss on the resulting classification is backpropagated through the CNN.
This way, Matching Networks learn to compute image embeddings. This approach allows MN to classify images with no specific prior knowledge of classes. Everything is done simply by comparing different instances of the classes.
Since the classes are different in every episode, Matching Networks compute features of the images that are relevant to discriminate between classes. On the contrary, in the case of a standard classification, the algorithm learns the features that are specific to each class.
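The classification step of an episode can be sketched as follows, assuming the CNN embeddings have already been computed (the toy 2-D vectors below stand in for them). The names `cosine` and `matching_predict` are made up for this sketch, cosine similarity is used so that closer support items receive larger softmax weights, and the training step (backpropagating the cross-entropy loss through the CNN) is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_predict(query_emb, support_embs, support_labels):
    """Softmax over similarities to every support item, summed per class."""
    sims = [cosine(query_emb, s) for s in support_embs]
    top = max(sims)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in sims]
    total = sum(weights)
    probs = {}
    for w, label in zip(weights, support_labels):
        probs[label] = probs.get(label, 0.0) + w / total
    return probs


# Toy 2-way 2-shot episode with hand-made embeddings (illustrative only).
support_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["cat", "cat", "dog", "dog"]
probs = matching_predict([0.8, 0.2], support_embs, support_labels)
print(probs)
```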
Prototypical Networks learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Compared to recent approaches for few-shot learning, they reflect a simpler inductive bias that is beneficial in this limited-data regime, and achieve excellent results. We provide an analysis showing that some simple design decisions can yield substantial improvements over recent approaches involving complicated architectural choices and meta-learning.
The PN process is essentially the same, but the query image embeddings are not compared to every image embedding from the support set.
In PN, you need to form class prototypes. They are basically class embeddings formed by averaging the embeddings of images from this class. The query image embeddings are then compared only to these class prototypes.
PN uses Euclidean distance instead of cosine distance.
addresses the Few-shot Learning paradigm.
The approach is based on the idea that there exists an embedding in which points cluster around a single prototype representation for each class.
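A small sketch of prototype-based classification, again assuming the embeddings are already given (toy 2-D vectors here). The helper names are hypothetical; the softmax over negative squared Euclidean distances follows the usual formulation of Prototypical Networks.

```python
import math


def class_prototypes(support_embs, support_labels):
    """Average the support embeddings of each class."""
    grouped = {}
    for emb, label in zip(support_embs, support_labels):
        grouped.setdefault(label, []).append(emb)
    return {
        label: [sum(dim) / len(embs) for dim in zip(*embs)]
        for label, embs in grouped.items()
    }


def proto_predict(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    neg_dists = {
        label: -sum((q - p) ** 2 for q, p in zip(query_emb, proto))
        for label, proto in protos.items()
    }
    top = max(neg_dists.values())  # subtract the max for numerical stability
    exps = {label: math.exp(d - top) for label, d in neg_dists.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy 2-way 2-shot episode with hand-made embeddings (illustrative only).
support_embs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
support_labels = ["cat", "cat", "dog", "dog"]
protos = class_prototypes(support_embs, support_labels)
probs = proto_predict([0.7, 0.3], protos)
print(protos)
print(probs)
```

The query is compared to one prototype per class rather than to every support image, so the cost per query grows with the number of classes, not with the number of support samples.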
This paper addresses:
RC (relation classification) with rare instances per class and noisy labels
Uses a prototypical network to model RC as an FSL problem, addressing diversity and noise in prototypical networks
RN was built on the PN concept but added big changes to the algorithm.
The distance function was not defined in advance but learned by the algorithm. RN has its own relation module that does this.
The relation module is put on the top of the embedding module, which is the part that computes embeddings and class prototypes from input images.
The relation module is fed with the concatenation of the embedding of a query image with each class prototype, and it outputs a relation score for each pair. Applying a softmax to the relation scores, we get a prediction.
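The data flow through the relation module can be sketched with a tiny MLP. Everything concrete here is an assumption made for illustration: the layer sizes, the random untrained weights, the sigmoid on the output and the helper names. With untrained weights the predicted class is meaningless; the point is only that the comparison function is a learnable network rather than a fixed distance.

```python
import math
import random


def relation_score(query_emb, prototype, w_hidden, w_out):
    """Tiny MLP: concat(query, prototype) -> ReLU hidden layer -> sigmoid score."""
    pair = query_emb + prototype  # list concatenation of the two embeddings
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pair))) for row in w_hidden]
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def relation_predict(query_emb, protos, w_hidden, w_out):
    """Softmax over the relation scores of the query against each class prototype."""
    scores = {c: relation_score(query_emb, p, w_hidden, w_out) for c, p in protos.items()}
    total = sum(math.exp(s) for s in scores.values())
    return {c: math.exp(s) / total for c, s in scores.items()}


rng = random.Random(0)
emb_dim, hidden_dim = 2, 4  # hypothetical sizes
w_hidden = [[rng.uniform(-1, 1) for _ in range(2 * emb_dim)] for _ in range(hidden_dim)]
w_out = [rng.uniform(-1, 1) for _ in range(hidden_dim)]

protos = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}  # toy class prototypes
probs = relation_predict([0.7, 0.3], protos, w_hidden, w_out)
print(probs)
```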
The model-based approach depends on a model designed specifically for fast learning: a model that updates its parameters rapidly with a few training steps. This rapid parameter update can be achieved by its internal architecture or controlled by another meta-learner model.
MetaNet is a meta-learning model with an architecture and training process designed for rapid generalization across tasks.
The rapid generalization of MetaNet relies on “fast weights”.
The idea is to use one neural network to predict the parameters of another neural network; the generated weights are called fast weights. In comparison, the ordinary SGD-based weights are named slow weights.
In MetaNet, loss gradients are used as meta information to populate models that learn fast weights. Slow and fast weights are combined to make predictions in neural networks.
Memory-Augmented Neural Networks (MANNs) use external memory storage to facilitate the learning process of neural networks.
With an explicit storage buffer, it is easier for the network to rapidly incorporate new information and not to forget in the future. Note that recurrent neural networks with only internal memory such as vanilla RNN or LSTM are not MANNs.
Gradient-based optimization is neither designed to cope with a small number of training samples, nor to converge within a small number of optimization steps. Is there a way to adjust the optimization algorithm so that the model can become good at learning from a few examples? This is what optimization-based meta-learning algorithms aim for.
We need to build a meta-learner and a base-learner. The meta-learner is a model that learns across episodes, whereas the base-learner is a model that is initialized and trained inside each episode by the meta-learner.
The meta-learner acquires prior experience from training the base model and learns the common feature representations of all tasks.
Imagine an episode of meta-training with some classification task defined by a support set of N * K images and a query set of Q images:
MAML, short for Model-Agnostic Meta-Learning (Finn, et al. 2017), is a fairly general optimization algorithm, compatible with any model that learns through gradient descent.
MAML provides a good initialization of a meta-learner’s parameters to achieve optimal fast learning on a new task with only a small number of gradient steps while avoiding overfitting that may happen when using a small dataset.
To achieve good generalization across a variety of tasks, we would like to find the optimal θ∗ (the model parameters) so that task-specific fine-tuning is more efficient.
The meta-optimization step in MAML relies on second derivatives. To make the computation less expensive, a modified version of MAML omits second derivatives, resulting in a simplified and cheaper implementation, known as First-Order MAML (FOMAML).
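The difference between the two meta-gradients can be seen on a toy problem where everything is available in closed form. The sketch assumes a single scalar parameter theta, a task loss (theta - c)^2 with a task-specific target c, one inner gradient step with step size alpha, and the same loss reused for the outer evaluation. These choices and the function names are illustrative assumptions, not part of the MAML paper.

```python
def inner_step(theta, c, alpha):
    """One gradient step on the task loss (theta - c)**2."""
    return theta - alpha * 2.0 * (theta - c)


def maml_grad(theta, c, alpha):
    """Full MAML meta-gradient: differentiates through the inner step.

    d(theta')/d(theta) = 1 - 2 * alpha comes from the second derivative of the loss.
    """
    theta_prime = inner_step(theta, c, alpha)
    return 2.0 * (theta_prime - c) * (1.0 - 2.0 * alpha)


def fomaml_grad(theta, c, alpha):
    """First-order approximation: treats d(theta')/d(theta) as 1."""
    theta_prime = inner_step(theta, c, alpha)
    return 2.0 * (theta_prime - c)


theta, c, alpha = 0.0, 1.0, 0.1
print(maml_grad(theta, c, alpha))
print(fomaml_grad(theta, c, alpha))
```

On this toy loss the two gradients differ only by the constant factor (1 - 2 * alpha), so they point in the same direction. For real networks the omitted term is a Hessian-vector product, which is what makes full MAML expensive.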
Reptile is a remarkably simple meta-learning optimization algorithm. It is similar to MAML in many ways, given that both rely on meta-optimization through gradient descent and both are model-agnostic.
Different from MAML: the batch version samples multiple tasks instead of one within each iteration.
Reptile works by repeatedly sampling a task, training on it with several gradient descent steps, and then moving the model weights towards the newly trained parameters.
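A toy sketch of this loop, assuming a single scalar parameter and tasks whose loss is (theta - center)^2. The task family, step sizes, iteration count and function names are all illustrative assumptions. The three numbered comments mark the steps of each iteration.

```python
import random


def sgd_steps(theta, center, lr, steps):
    """Run a few gradient steps on the toy task loss (theta - center)**2."""
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - center)
    return theta


def reptile(task_centers, theta=0.0, meta_lr=0.1, inner_lr=0.1,
            inner_steps=5, iterations=200, seed=0):
    rng = random.Random(seed)
    for _ in range(iterations):
        center = rng.choice(task_centers)                          # 1) sample a task
        adapted = sgd_steps(theta, center, inner_lr, inner_steps)  # 2) train on it
        theta += meta_lr * (adapted - theta)                       # 3) move towards the adapted weights
    return theta


theta = reptile([1.0, 2.0, 3.0])
print(theta)
```

No second derivatives appear anywhere: the meta-update only needs the difference between the adapted weights and the current initialization. On this toy family the initialization ends up inside the range spanned by the task targets, from where every task is reachable in a few steps.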
ML models use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.
In many applications, the data is graph-structured. For example, in drug discovery, the goal is to predict whether a given molecule is a potential candidate for a new drug, where the input molecules are represented by graphs. In a recommender system, the interactions between the users and the items are represented by a graph, and such non-Euclidean data is crucial in designing a better system.
Graph Neural Networks (GNNs), a generalization of deep neural networks to graph data, have been widely used in various domains, ranging from drug discovery to recommender systems. However, GNNs in such applications are limited when few samples are available. Meta-learning has been an important framework for addressing the lack of samples in machine learning, and in recent years researchers have started to apply meta-learning to GNNs.
The main goal of GNNs is to learn effective representations of graphs. Such representations map the vertices, edges, and/or graphs to a low-dimensional space, so that the structural relationships in the graph are reflected by the geometric relationships in the representations.
The main challenge in applying meta-learning to graph-structured data is to determine the type of representation that is shared across tasks, and devise an effective training strategy.
N-way-K-Shot: N stands for the number of classes, and K for the number of samples from each class to train on.
During each training epoch, we first sample a dataset $D = (D_{train}, D_{test}) \in D^{meta-train}$ and then sample mini-batches out of $D_{train}$ to update $\theta$ for $T$ rounds.
We will train our Meta-Learning algorithm on a batch of training tasks TRAIN. Training experience gained from attempting to solve TRAIN tasks will be used to solve the TEST task.
The whole Meta-Training process will have a finite number of episodes. We form an episode like this:
From TRAIN, we sample N classes and K support-set images per class, along with Q query images. This way, we form a classification task that’s similar to our ultimate TEST task.
At the end of each episode, the parameters of the model are updated to maximize the accuracy on the Q images from the query set. This is where our model learns the ability to solve an unseen classification problem.
The overall efficiency of the model is measured by its accuracy on the TEST classification task.
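Forming one N-way-K-shot episode can be sketched as below. The dataset layout (a dict from class name to a list of image identifiers), the function name `sample_episode` and the choice to draw the query images per class are assumptions made for this sketch.

```python
import random


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample N classes, then K support and n_query query items for each class."""
    classes = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        # Draw support and query items together so that they never overlap.
        items = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


# Hypothetical TRAIN split: 5 classes with 10 image identifiers each.
train_split = {
    f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)
}
rng = random.Random(0)
support, query = sample_episode(train_split, n_way=3, k_shot=2, n_query=4, rng=rng)
print(len(support), len(query))
```

A meta-training run would call `sample_episode` once per episode, fit or condition the model on `support`, and update its parameters from the loss on `query`.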
Continual relation learning
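Distance-based classifiers have high variance in few-shot learning.
1.1 Ensemble learning: cooperation (easier learning and regularization) vs. diversity (a collection of weak learners making diverse predictions often performs better together than a single strong one)
1.1.1 Ensemble problems: as the number of models grows, the ensemble's performance starts to decline, because the models become too similar to each other.
1.1.2 Ensemble learning tricks (focus on increasing the diversity among models)
- Randomly drop some of the networks at each iteration;
- Use dropout in each network;
- Apply different transformations and deformations to the images fed to the networks => data augmentation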