Few-Shot Learning Record

Table of Contents

  • Why FSL
    • Traditional models
    • Real World => FSL
  • Challenges for NLP
  • FSL Function
  • Where to choose FSL
  • Define FSL
  • Approach
    • Data-level Approach
      • Distant Supervision
      • Multiple-instance learning
    • Parameter-level Approach
      • Meta-learning
        • Metric-based
          • Siamese Network
          • Matching Networks
          • Prototypical Networks
            • Hybrid Attention-Based Prototypical Networks
          • Relation Networks
        • Model-based
          • Meta Networks
          • Memory-Augmented Neural Networks
        • Gradient-Based | Optimization-Based
          • MAML
          • FOMAML
          • Reptile
        • Few-shot Learning: Prior knowledge about learning
  • FSL with GNN
  • FSL Task Definition
  • IDEAL
  • Purpose of FSL
  • Meta-learning
    • Metric-based

https://www.zdaiot.com/DeepLearningApplications/Few-shot%20Learning/Few-shot%20learning%20with%20graph%20neural%20networks/ \ https://mp.weixin.qq.com/s/YVMhqhURqGmJ5D26pXPjQg FSL GNN

https://research.aimultiple.com/few-shot-learning/
https://neptune.ai/blog/understanding-few-shot-learning-in-computer-vision
https://medium.com/quick-code/understanding-few-shot-learning-in-machine-learning-bede251a0f67
https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html

Why FSL

Traditional models

Common practice for machine learning applications is to feed as much data as the model can take. This is because in most machine learning applications feeding more data enables the model to predict better (it needs lots of examples to drive lots of iterations of stochastic gradient descent and gradually refine the weights in the model).

Traditional ML models cannot discriminate between classes that are not present in the training dataset.

Traditionally, a neural network learns to predict a fixed set of classes. This poses a problem when we need to add or remove classes: we have to update the network and retrain it on the whole dataset. Deep neural networks also need a large volume of data to train on.

Real World => FSL

In the real world, you can rarely build or find a dataset with that many samples.
Labeling additional samples is a time-consuming and expensive task.

Sometimes accruing enough data to increase the accuracy of a model is unrealistic. For example, in large business settings, labeling samples becomes costly and difficult to manage, and some rare pathologies might lack enough images to be used in a training set. This is exactly the type of problem that can be solved by building an FSL classifier.

Few-shot learning aims to build accurate machine learning models with less training data. As the amount of input data is a factor that determines resource costs (e.g. time and computational costs), companies can reduce data analysis and machine learning (ML) costs by using few-shot learning.

Few-shot learning techniques enable ML models to separate two classes that are not present in the training data, and in some applications they can even separate more than two unseen classes.

Challenges for NLP

  • More diversity
  • More noise

FSL Function

  • Test bed for learning like a human: Humans can spot the difference between handwritten characters after seeing a few examples, whereas computers need large amounts of data to classify what they “see”. Few-shot learning is a test bed where computers are expected to learn from a few examples, as humans do.
  • Learning for rare cases: With few-shot learning, machines can learn rare cases. For example, when classifying images of animals, a model trained with few-shot learning techniques can correctly classify an image of a rare species after being exposed to a small amount of prior information.
  • Reducing data collection effort and computational costs: As few-shot learning requires less data to train a model, the costs of data collection and labeling are reduced. A smaller training set also means less data to process, which can significantly reduce computational costs.

Where to choose FSL

Here are some situations that are driving the increased adoption of FSL:

  • When supervised data is scarce, machine learning models often fail to generalize reliably.
  • When working with a huge dataset, correctly labeling the data can be costly.
  • When several samples are available, adding specific features for every task is strenuous and difficult to implement.

Define FSL

Few-shot learning (FSL), also referred to as low-shot learning (LSL) in a few sources, is a type of machine learning problem where the training dataset contains limited information.
Few-shot learning aims for ML models to predict the correct class of instances when only a small number of examples are available in the training dataset.
Few-shot learning aims at recognizing target classes for which only a few samples are available for training.

Few-Shot Learning is a sub-area of machine learning. It’s about classifying new data when you have only a few training samples with supervised information.

Few-Shot Learning is an example of meta-learning, where a learner is trained on several related tasks during the meta-training phase so that it can generalize well to unseen (but related) tasks with just a few examples during the meta-testing phase. An effective approach to the Few-Shot Learning problem is to learn a common representation for various tasks and train task-specific classifiers on top of this representation.

For example, an object categorization model built this way can still give appropriate results even without many training samples.

Approach

Data-level Approach

This approach is based on the idea that whenever there is insufficient data to fit the parameters of the algorithm (that is, to build a reliable model) while avoiding underfitting or overfitting, more data should be added.

It exploits prior knowledge about the structure and variability of the data, which enables the construction of viable models from few examples.

  • Data augmentation

  • GAN: for example, new images of birds can be produced from different perspectives if there are enough examples available in the training set.

  • Generative models can be constructed for families of data classes:
    Pen-stroke models
    Neural statistician

  • New examples for the training set can be synthesized:
    Analogies
    End-to-end
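
As a rough illustration of the data augmentation idea, the sketch below produces several variants of one tiny "image" by random flips and one-pixel shifts. The `augment` helper, the list-of-rows image format, and the choice of transformations are assumptions made for this example, not something the notes above prescribe.

```python
import random

def augment(image, rng):
    """Return a randomly flipped and shifted copy of a 2-D image (list of rows)."""
    out = [row[:] for row in image]           # copy so the original is untouched
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]      # horizontal flip
    shift = rng.randint(-1, 1)                # shift columns by -1, 0 or 1 pixel
    if shift > 0:
        out = [[0] * shift + row[:-shift] for row in out]    # shift right, pad with 0
    elif shift < 0:
        out = [row[-shift:] + [0] * (-shift) for row in out]  # shift left, pad with 0
    return out

rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]
augmented = [augment(image, rng) for _ in range(4)]  # four extra training samples
```

Each augmented copy keeps the label of the original image, so one labeled sample becomes several.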

Distant Supervision

If two entities participate in a relation, any sentence that contains those two entities might express that relation.

In distant supervision, we make use of an already existing database, such as Freebase or a domain-specific database, to collect examples for the relation we want to extract. We then use these examples to automatically generate our training data. For example, Freebase contains the fact that Barack Obama and Michelle Obama are married. We take this fact and label each pair of “Barack Obama” and “Michelle Obama” that appears in the same sentence as a positive example for our marriage relation. This way we can easily generate a large amount of (possibly noisy) training data.
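
The labeling rule can be sketched in a few lines. This is a toy version under stated assumptions: the `kb_facts` dictionary stands in for a knowledge base such as Freebase, entity matching is plain substring search, and the relation name `married_to` and the sentences are made up for illustration.

```python
def distant_label(sentences, kb_facts):
    """Label a sentence with a relation when it mentions both entities of a KB fact."""
    examples = []
    for sentence in sentences:
        for (head, tail), relation in kb_facts.items():
            if head in sentence and tail in sentence:
                examples.append((sentence, head, tail, relation))
    return examples

# Hypothetical knowledge base with a single fact.
kb_facts = {("Barack Obama", "Michelle Obama"): "married_to"}
sentences = [
    "Barack Obama is married to Michelle Obama.",
    "Barack Obama and Michelle Obama attended the ceremony.",  # noisy positive
    "Barack Obama gave a speech.",                             # only one entity
]
examples = distant_label(sentences, kb_facts)
```

The second sentence is labeled `married_to` even though it does not express the relation, which is exactly the kind of label noise this method introduces.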

Weaknesses:
– It uses the relation of two entities in a knowledge base as the semantic relation of two entity mentions in text
– It introduces noise into the labels
– It yields few instances for rare relations

Multiple-instance learning

Multiple-instance learning (MIL) was proposed as a variation of supervised learning for problems with incomplete knowledge about the labels of training examples. In supervised learning, every training instance is assigned a discrete or real-valued label. In MIL, by comparison, labels are only assigned to bags of instances. The goal of MIL is to classify unseen bags or instances using the labeled bags as training data.

Instead of receiving a set of individually labeled instances, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag is labeled negative if all the instances in it are negative, and positive if at least one instance in it is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.
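
The binary bag-labeling rule is easy to state in code. In this small sketch, `bag_label` applies the rule to known instance labels, and `predict_bag` is one common way (an assumption here, not the only option) to turn instance-level scores into a bag prediction by taking the maximum score.

```python
def bag_label(instance_labels):
    """A bag is positive (1) if at least one instance is positive, else negative (0)."""
    return int(any(instance_labels))

def predict_bag(instance_scores, threshold=0.5):
    """Predict the bag label from per-instance scores via max pooling."""
    return int(max(instance_scores) >= threshold)

positive_bag = bag_label([0, 0, 1])
negative_bag = bag_label([0, 0, 0])
predicted = predict_bag([0.1, 0.2, 0.9])
```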

Parameter-level Approach

Because data is scarce while the parameter space is often high-dimensional and extensive, it is quite easy to overfit on few-shot learning samples.

To overcome this problem, we can limit the parameter space and use regularization and proper loss functions, so that the model generalizes from the limited number of training samples.

On the other hand, we can enhance model performance by guiding it through the extensive parameter space. A standard optimization algorithm might not give reliable results here because of the small amount of training data.

That is why, at the parameter level, we train our model to find the best route through the parameter space to give optimal prediction results (meta-learning).

Meta-learning

The first and most obvious step in an FSL task is to gain experience from other, similar problems. This is why Few-Shot Learning is characterized as a Meta-Learning problem.

In a traditional classification problem, we try to learn how to classify from the training data, and evaluate using test data.
In Meta-Learning, we learn how to learn to classify given a set of training data. We use what is learned on one set of classification problems to solve other, unrelated sets.

In the Meta-Learning paradigm, we have a set of tasks. An algorithm is learning to learn if its performance at each task improves with experience and with the number of tasks.

Difficulties:
learning a good feature representation and choosing an appropriate distance function.

Metric-based

Metric-based methods learn a distance function over objects. They classify query samples based on their similarity to the support samples.

Siamese Network

A Siamese Neural Network (SNN) is a class of neural network architectures that contain two or more identical subnetworks. “Identical” here means that they have the same configuration with the same parameters and weights. Parameter updating is mirrored across the subnetworks. The network finds the similarity of its inputs by comparing their feature vectors, which makes it useful in many applications.
SNNs learn a similarity function. Thus, we can train one to decide whether two images are the same. This enables us to classify new classes of data without training the network again.
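
To make the weight sharing concrete, here is a minimal sketch in which both inputs pass through the same `embed` function with the same `weights`. The single tanh layer, the Euclidean distance, and the contrastive loss with a margin are assumptions chosen for illustration; the notes above do not fix a particular architecture or loss.

```python
import math

def embed(x, weights):
    """Shared sub-network: one linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def distance(a, b, weights):
    """Euclidean distance between the embeddings of two inputs (same weights for both)."""
    ea, eb = embed(a, weights), embed(b, weights)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(ea, eb)))

def contrastive_loss(d, same, margin=1.0):
    """Pull same-class pairs together; push different-class pairs beyond the margin."""
    return d ** 2 if same else max(0.0, margin - d) ** 2

weights = [[0.5, -0.2], [0.1, 0.3]]              # hypothetical, untrained weights
d_same = distance([1.0, 2.0], [1.0, 2.0], weights)
d_diff = distance([1.0, 2.0], [-1.0, 0.5], weights)
loss = contrastive_loss(d_diff, same=False)
```

At test time a new class needs no retraining: a query is assigned to the class of the support example with the smallest distance.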

The main advantages of Siamese Networks are:

  • More robust to class imbalance: With the aid of one-shot learning, a few images per class are sufficient for a Siamese Network to recognize those images in the future.
  • Good for ensembling with the best classifier: Because its learning mechanism differs somewhat from classification, simply averaging it with a classifier can do much better than averaging two correlated supervised models (e.g. a GBM and an RF classifier).
  • Learning from semantic similarity: A Siamese Network focuses on learning embeddings (in the deeper layers) that place the same classes/concepts close together. Hence, it can learn semantic similarity.

The downsides of the Siamese Networks can be:

  • Needs more training time than normal networks: Since a Siamese Network learns from pairs, and the number of pairs grows quadratically with the number of samples, it is slower than normal classification-style (pointwise) learning.
  • Doesn’t output probabilities: Since training involves pairwise learning, it doesn’t output the probabilities of the prediction, but the distance from each class.

Matching Networks

The Matching Networks paper employs ideas from metric learning based on deep neural features and from advances that augment neural networks with external memories. The framework learns a network that maps a small labelled support set and an unlabelled example to its label, obviating the need for fine-tuning to adapt to new class types.

There is some external memory and an attention mechanism that is used to access the memory. It is a network that learns how to learn a classifier from only a very small number of examples.

For each episode, Matching Networks apply the following procedure:
a) Each image from the support set and the query set is fed to a CNN that outputs an embedding for it.
b) Each query image is classified using the softmax of the cosine similarity between its embedding and the support-set embeddings.
c) The cross-entropy loss on the resulting classification is backpropagated through the CNN.

This way, Matching Networks learn to compute image embeddings. This approach allows MN to classify images with no specific prior knowledge of the classes. Everything is done simply by comparing different instances of the classes.
Since the classes are different in every episode, Matching Networks compute features of the images that are relevant for discriminating between classes. In standard classification, by contrast, the algorithm learns features that are specific to each class.
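
Step (b) can be sketched as follows, assuming the embeddings have already been produced by the CNN. The function names and the toy 2-D embeddings are made up for this example; the attention weight of each support sample is added to the probability of that sample's class.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def matching_predict(query, support, labels):
    """Softmax over cosine similarities, accumulated per class label."""
    sims = [cosine(query, s) for s in support]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]   # subtract max for numerical stability
    total = sum(exps)
    probs = {}
    for e, label in zip(exps, labels):
        probs[label] = probs.get(label, 0.0) + e / total
    return probs

# Toy embeddings standing in for CNN outputs (2-way, 2-shot).
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
probs = matching_predict([1.0, 0.2], support, labels)
prediction = max(probs, key=probs.get)
```

During training, the cross-entropy between `probs` and the true query label would be backpropagated into the embedding network (step c), which is omitted here.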

Prototypical Networks

Prototypical Networks learn a metric space in which classification can be performed by computing distances to prototype representations of each class. Compared to other approaches for few-shot learning, they reflect a simpler inductive bias that is beneficial in this limited-data regime, and achieve excellent results. The authors provide an analysis showing that some simple design decisions can yield substantial improvements over approaches involving complicated architectural choices and meta-learning.

The PN process is essentially the same as in Matching Networks, but the query image embeddings are not compared to every image embedding from the support set.

In PN, you need to form class prototypes. They are basically class embeddings formed by averaging the embeddings of the images from each class. The query image embeddings are then compared only to these class prototypes.

PN uses Euclidean distance instead of cosine distance.

The approach addresses the few-shot learning paradigm and is based on the idea that there exists an embedding in which points cluster around a single prototype representation for each class.
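
A minimal sketch of the prototype computation and the distance-based prediction, assuming embeddings are already available. The helper names and the toy vectors are illustrative; the softmax is taken over negative squared Euclidean distances to the prototypes.

```python
import math

def prototypes(support, labels):
    """Average the support embeddings of each class to get one prototype per class."""
    sums, counts = {}, {}
    for emb, label in zip(support, labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, value in enumerate(emb):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def proto_predict(query, protos):
    """Softmax over negative squared Euclidean distances to the class prototypes."""
    logits = {label: -sum((q - p) ** 2 for q, p in zip(query, proto))
              for label, proto in protos.items()}
    top = max(logits.values())
    exps = {label: math.exp(v - top) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
protos = prototypes(support, labels)
probs = proto_predict([0.8, 0.3], protos)
prediction = max(probs, key=probs.get)
```

Compared with comparing against every support sample, the query is compared against only N prototypes, one per class.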

Hybrid Attention-Based Prototypical Networks

This paper addresses:
Relation classification (RC) with few instances per class and noisy labels.
It uses prototypical networks to model RC as an FSL problem, addressing diversity and noise in prototypical networks.

  • Introducing two levels of attention:
    Feature level: selects the most useful features for computing prototypes
    Instance level: selects the most useful instances in the support set based on the given query
  • Analyzing robustness to noise:
    Compared to the vanilla prototypical network, their approach is more robust to noise in labels

Relation Networks

RN builds on the PN concept but makes significant changes to the algorithm.

The distance function is not defined in advance but learned by the algorithm. RN has its own relation module that does this.

The relation module is placed on top of the embedding module, which is the part that computes embeddings and class prototypes from input images.

The relation module is fed the concatenation of the embedding of a query image with each class prototype, and it outputs a relation score for each pair. Applying a softmax to the relation scores, we get a prediction.
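
The data flow through the relation module can be sketched as below. The module here is a tiny two-layer network with hand-picked weights `w1` and `w2`, which is purely an assumption for illustration: with these particular weights it happens to compute a negative L1 distance, whereas in a real Relation Network the weights are learned and the score need not correspond to any fixed distance.

```python
import math

def relation_score(query, proto, w1, w2):
    """Score a (query, prototype) pair with a small ReLU network on their concatenation."""
    pair = query + proto                       # list concatenation
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pair))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))

def relation_predict(query, protos, w1, w2):
    """Softmax over the relation scores of the query with every class prototype."""
    scores = {label: relation_score(query, p, w1, w2) for label, p in protos.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical weights for 2-D embeddings (stand-ins for learned parameters).
w1 = [[1, 0, -1, 0], [-1, 0, 1, 0], [0, 1, 0, -1], [0, -1, 0, 1]]
w2 = [-1, -1, -1, -1]
protos = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
probs = relation_predict([0.9, 0.1], protos, w1, w2)
prediction = max(probs, key=probs.get)
```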

Model-based

Model-based meta-learning depends on a model designed specifically for fast learning: a model that updates its parameters rapidly with a few training steps. This rapid parameter update can be achieved by its internal architecture or controlled by another meta-learner model.

Meta Networks

Meta Networks (MetaNet) is a meta-learning model with an architecture and training process designed for rapid generalization across tasks.

The rapid generalization of MetaNet relies on “fast weights”.
One neural network is used to predict the parameters of another neural network, and the generated weights are called fast weights. In comparison, the ordinary SGD-based weights are named slow weights.

In MetaNet, loss gradients are used as meta information to populate models that learn fast weights. Slow and fast weights are combined to make predictions in neural networks.

Memory-Augmented Neural Networks

Memory-Augmented Neural Networks (MANNs) use external memory storage to facilitate the learning process of neural networks.

With an explicit storage buffer, it is easier for the network to rapidly incorporate new information and not forget it in the future. Note that recurrent neural networks with only internal memory, such as a vanilla RNN or an LSTM, are not MANNs.
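
One way to picture the external memory is as a list of slots that is read by content: the read key is compared with every slot, and the result is a similarity-weighted average. The sketch below assumes cosine-similarity addressing with a softmax, which is a common choice but not something these notes specify; writing is simplified to appending a new slot.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def read(memory, key):
    """Content-based read: softmax over similarities, then a weighted sum of the slots."""
    sims = [cosine(key, slot) for slot in memory]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    value = [sum(w * slot[i] for w, slot in zip(weights, memory))
             for i in range(len(memory[0]))]
    return value, weights

memory = []                                # the explicit storage buffer
memory.append([1.0, 0.0, 0.0])             # write: store new information in a new slot
memory.append([0.0, 1.0, 0.0])
value, weights = read(memory, [0.9, 0.1, 0.0])
```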

Gradient-Based | Optimization-Based

Gradient-based optimization is neither designed to cope with a small number of training samples, nor to converge within a small number of optimization steps. Is there a way to adjust the optimization algorithm so that the model can be good at learning with a few examples? This is what optimization-based meta-learning algorithms aim for.

We need to build a meta-learner and a base-learner. The meta-learner is a model that learns across episodes, whereas the base-learner is a model that is initialized and trained inside each episode by the meta-learner.

The meta-learner acquires prior experience from training the base-learner and learns the common feature representations of all tasks.

Imagine an episode of meta-training with some classification task defined by a support set of N * K images and a query set of Q images:

  1. We choose a meta-learner model,
  2. Episode is started,
  3. We initialize the base-learner (typically a CNN classifier),
  4. We train it on the support-set (the exact algorithm used to train the base-learner is defined by the meta-learner),
  5. Base-learner predicts the classes on the query set,
  6. Meta-learner parameters are trained on the loss resulting from the classification error,
  7. From this point, the pipeline may differ based on your choice of meta-learner.
  • Why LSTM?
    The meta-learner is modeled as an LSTM, because:
    There is a similarity between the gradient-based update in backpropagation and the cell-state update in an LSTM.
    Knowing a history of gradients benefits the gradient update; think about how momentum works.

MAML

MAML, short for Model-Agnostic Meta-Learning (Finn, et al. 2017), is a fairly general optimization algorithm, compatible with any model that learns through gradient descent.

MAML provides a good initialization of a meta-learner’s parameters to achieve optimal fast learning on a new task with only a small number of gradient steps, while avoiding the overfitting that may happen when using a small dataset.

To achieve good generalization across a variety of tasks, we would like to find the optimal θ∗ (model parameters) so that the task-specific fine-tuning is more efficient.
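
The search for θ∗ can be illustrated on a deliberately tiny problem. The sketch assumes a scalar model y = θ·x with mean squared error, one inner gradient step per task, and two made-up regression tasks with slopes 1 and 3; because the model is scalar, the second-derivative term of the meta-gradient can be written out by hand instead of relying on automatic differentiation.

```python
def grad(theta, xs, ys):
    """Gradient of the mean squared error of y_hat = theta * x with respect to theta."""
    return 2.0 * sum(x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def hessian(xs):
    """Second derivative of the same loss with respect to theta."""
    return 2.0 * sum(x * x for x in xs) / len(xs)

def maml_step(theta, tasks, inner_lr=0.05, outer_lr=0.05):
    meta_grad = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        adapted = theta - inner_lr * grad(theta, xs_s, ys_s)   # inner (task) update
        # Chain rule through the inner update: d(adapted)/d(theta) = 1 - inner_lr * hessian.
        meta_grad += grad(adapted, xs_q, ys_q) * (1.0 - inner_lr * hessian(xs_s))
    return theta - outer_lr * meta_grad / len(tasks)           # outer (meta) update

def make_task(slope):
    """Hypothetical task: (support set, query set) for the line y = slope * x."""
    support_x, query_x = [1.0, 2.0], [1.5, 2.5]
    return ((support_x, [slope * x for x in support_x]),
            (query_x, [slope * x for x in query_x]))

tasks = [make_task(1.0), make_task(3.0)]
theta = 0.0
for _ in range(100):
    theta = maml_step(theta, tasks)
```

In this symmetric toy setup the learned initialization settles midway between the two task slopes, the point from which one gradient step adapts equally well to either task.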


FOMAML

The meta-optimization step in MAML relies on second derivatives. To make the computation less expensive, a modified version of MAML omits second derivatives, resulting in a simplified and cheaper implementation, known as First-Order MAML (FOMAML).
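
A first-order sketch on a toy problem, under the same kind of assumptions one would make for a hand-written MAML example: a scalar model y = θ·x, mean squared error, one inner step, and two made-up tasks with slopes 1 and 3. The only point of interest is the meta-gradient line, where the gradient at the adapted parameters is used directly and the second-derivative factor is dropped.

```python
def grad(theta, xs, ys):
    """Gradient of the mean squared error of y_hat = theta * x with respect to theta."""
    return 2.0 * sum(x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def fomaml_step(theta, tasks, inner_lr=0.05, outer_lr=0.05):
    meta_grad = 0.0
    for (xs_s, ys_s), (xs_q, ys_q) in tasks:
        adapted = theta - inner_lr * grad(theta, xs_s, ys_s)   # inner (task) update
        # First-order approximation: treat d(adapted)/d(theta) as 1,
        # so no second derivative of the support loss is needed.
        meta_grad += grad(adapted, xs_q, ys_q)
    return theta - outer_lr * meta_grad / len(tasks)           # outer (meta) update

def make_task(slope):
    """Hypothetical task: (support set, query set) for the line y = slope * x."""
    support_x, query_x = [1.0, 2.0], [1.5, 2.5]
    return ((support_x, [slope * x for x in support_x]),
            (query_x, [slope * x for x in query_x]))

tasks = [make_task(1.0), make_task(3.0)]
theta = 0.0
for _ in range(100):
    theta = fomaml_step(theta, tasks)
```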

Reptile

Reptile is a remarkably simple meta-learning optimization algorithm. It is similar to MAML in many ways, given that both rely on meta-optimization through gradient descent and both are model-agnostic.

Unlike MAML, Reptile needs no second derivatives. Its batched version samples multiple tasks instead of one within each iteration.

Reptile works by repeatedly:

  1. sampling a task,
  2. training on it by multiple gradient descent steps,
  3. and then moving the model weights towards the new parameters.
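
The three steps map directly onto a short loop. This sketch assumes a scalar model y = θ·x with mean squared error and two made-up tasks (slopes 1 and 3); the step sizes and the number of inner steps are arbitrary choices for the example.

```python
import random

def grad(theta, xs, ys):
    """Gradient of the mean squared error of y_hat = theta * x with respect to theta."""
    return 2.0 * sum(x * (theta * x - y) for x, y in zip(xs, ys)) / len(xs)

def reptile(tasks, rng, iterations=200, inner_steps=5, inner_lr=0.05, meta_lr=0.1):
    theta = 0.0
    for _ in range(iterations):
        xs, ys = rng.choice(tasks)                 # 1. sample a task
        adapted = theta
        for _ in range(inner_steps):               # 2. several gradient steps on it
            adapted -= inner_lr * grad(adapted, xs, ys)
        theta += meta_lr * (adapted - theta)       # 3. move towards the new parameters
    return theta

xs = [1.0, 2.0]
tasks = [(xs, [1.0 * x for x in xs]),              # hypothetical task with slope 1
         (xs, [3.0 * x for x in xs])]              # hypothetical task with slope 3
theta = reptile(tasks, random.Random(0))
```

Because one task is sampled per iteration, θ fluctuates between the two task optima rather than converging to a single point; averaging the update over several sampled tasks, as in the batched version, smooths this out.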

Few-shot Learning: Prior knowledge about learning

ML models use prior knowledge to constrain the learning algorithm to choose parameters that generalize well from few examples.

  • Techniques that learn a good parameter initialization for few-shot learning (sometimes described as hyperparameter tuning) are:
    MAML
    FOMAML
    Reptile
  • Learning update rules can also encourage good performance with small datasets:
    LSTMs
    Reinforcement learning
    Optimization rules
  • Sequence methods take entire dataset and test example and predict the value of the test label:
    Memory-augmented NN
    SNAIL

FSL with GNN

In many applications, the data is graph-structured. For example, in drug discovery, the goal is to predict whether a given molecule is a potential candidate for a new drug, where the input molecules are represented by graphs. In a recommender system, the interactions between the users and the items are represented by a graph, and such non-Euclidean data is crucial in designing a better system.

Graph Neural Networks (GNNs), a generalization of deep neural networks to graph data, have been widely used in various domains, ranging from drug discovery to recommender systems. However, GNNs in such applications are limited when there are few available samples. Meta-learning has been an important framework for addressing the lack of samples in machine learning, and in recent years researchers have started to apply meta-learning to GNNs.

The main goal of GNNs is to learn effective representations of graphs. Such representations map the vertices, edges, and/or graphs to a low-dimensional space, so that the structural relationships in the graph are reflected by the geometric relationships in the representations.

The main challenge in applying meta-learning to graph-structured data is to determine the type of representation that is shared across tasks, and devise an effective training strategy.

FSL Task Definition

N-way-K-Shot: N stands for the number of classes, and K for the number of samples from each class to train on.

During each training epoch, we first sample a dataset $D = (D_{train}, D_{test}) \in D^{meta-train}$ and then sample mini-batches out of $D_{train}$ to update $\theta$ for $T$ rounds.

  1. A training (support) set that consists of:
    • N class labels
    • K labeled images for each class (a small amount, less than ten samples per class)
  2. Q query images

We want to classify the Q query images among the N classes. The N * K samples in the training set are the only examples that we have. The main problem here is not enough training data.

We will train our Meta-Learning algorithm on a batch of training tasks, TRAIN. The experience gained from attempting to solve the TRAIN tasks will be used to solve the TEST task.

The whole Meta-Training process will have a finite number of episodes. We form an episode like this:
From TRAIN, we sample N classes and K support-set images per class, along with Q query images. This way, we form a classification task that’s similar to our ultimate TEST task.
At the end of each episode, the parameters of the model are trained to maximize the accuracy on the Q images from the query set. This is where our model learns the ability to solve an unseen classification problem.
The overall efficiency of the model is measured by its accuracy on the TEST classification task.
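
Forming one episode can be sketched as follows. The `sample_episode` helper and the dictionary-of-lists dataset layout are assumptions for this example, and the query images are drawn per class here (so the episode holds N * n_query query images in total), which is one possible reading of "Q query images".

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Sample an N-way K-shot episode: disjoint support and query sets."""
    classes = rng.sample(sorted(dataset), n_way)            # pick N classes
    support, query = [], []
    for label in classes:
        items = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]     # K support samples per class
        query += [(x, label) for x in items[k_shot:]]       # query samples per class
    return support, query

# Hypothetical TRAIN data: 5 classes with 10 image identifiers each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)}
support, query = sample_episode(dataset, n_way=3, k_shot=2, n_query=4,
                                rng=random.Random(0))
```

A training loop would call `sample_episode` once per episode, fit or condition the model on `support`, and update its parameters from the loss on `query`.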

IDEAL

Continual relation learning

Purpose of FSL

  • Improving the generalization capabilities of deep neural networks and removing the need for huge sets of annotations is of utmost importance.

Meta-learning

  • Transfer knowledge across tasks in order to improve generalization

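Metric-based

  1. Distance-based classifiers have high variance in few-shot learning
    1.1 Ensemble learning: cooperation (easier learning and regularization) vs. diversity (a collection of weak learners making diverse predictions often performs better together than a single strong one)
    1.1.1 Ensemble problems: as the number of models grows, the ensemble's performance starts to degrade because the models are too similar to one another.
    1.1.2 Ensemble learning tricks (focus on increasing model diversity):
    - randomly drop a few of the networks at each iteration;
    - use dropout in each network;
    - apply different transformations and deformations to the images fed to each network => data augmentation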
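
The combination step of such an ensemble can be as simple as averaging the class probabilities of its members. The sketch below assumes each member (for instance, a distance-based classifier trained with a different augmentation or dropout mask) already outputs a probability per class; the numbers are made up.

```python
def ensemble_predict(member_probs):
    """Average the class-probability dicts produced by the ensemble members."""
    classes = member_probs[0].keys()
    n = len(member_probs)
    return {c: sum(p[c] for p in member_probs) / n for c in classes}

# Hypothetical outputs of three diverse members for one query image.
members = [
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.4, "dog": 0.6},   # this member disagrees
    {"cat": 0.7, "dog": 0.3},
]
averaged = ensemble_predict(members)
prediction = max(averaged, key=averaged.get)
```

Averaging reduces the variance of any single member's prediction, but only as long as the members make sufficiently different errors.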

