Author list
YAQING WANG, Hong Kong University of Science and Technology and Baidu Research
QUANMING YAO∗, 4Paradigm Inc.
JAMES T. KWOK, Hong Kong University of Science and Technology
LIONEL M. NI, Hong Kong University of Science and Technology
Machine learning has been highly successful in data-intensive applications, but is often hampered when the data set is small. Recently, Few-Shot Learning (FSL) was proposed to tackle this problem. Using prior knowledge, FSL can rapidly generalize to new tasks containing only a few samples with supervised information.
In this paper, we conduct a thorough survey to fully understand FSL. Starting from a formal definition of FSL, we distinguish FSL from several relevant machine learning problems. We then point out that the core issue of FSL is that the empirical risk minimizer is unreliable. Based on how prior knowledge can be used to handle this core issue, we categorize FSL methods from three perspectives: (i) data, which uses prior knowledge to augment the supervised experience; (ii) model, which uses prior knowledge to reduce the size of the hypothesis space; and (iii) algorithm, which uses prior knowledge to alter the search for the best hypothesis in the given hypothesis space.
With this taxonomy, we review and discuss the pros and cons of each category. Promising directions, in the aspects of FSL problem setups, techniques, applications, and theories, are also proposed to provide insights for future research.
current AI techniques cannot rapidly generalize from a few examples. The aforementioned successful AI applications rely on learning from large-scale data.
In contrast, humans are capable of learning new tasks rapidly by utilizing what they learned in the past.
examples:
a child who learned how to add can rapidly transfer his knowledge to learn multiplication given a few examples (e.g., 2 × 3 = 2 + 2 + 2 and 1 × 3 = 1 + 1 + 1).
Another example is that given a few photos of a stranger, a child can easily identify the same person from a large number of photos.
In order to learn from a limited number of examples with supervised information, a new machine learning paradigm called Few-Shot Learning (FSL) [35, 36] is proposed.
Of course, FSL can also help advance robotics [26], which develops machines that can replicate human behaviors. Examples include one-shot imitation [147], multi-armed bandits [33], visual navigation [37], and continuous control [156].
although ResNet [55] outperforms humans on ImageNet, each class needs to have sufficient labeled images which can be laborious to collect.
Examples include
Many related machine learning approaches have been proposed:
Consider a learning task $T$. FSL deals with a data set $D = \{D_{train}, D_{test}\}$ consisting of a training set $D_{train} = \{(x_i, y_i)\}_{i=1}^{I}$, where $I$ is small, and a testing set $D_{test} = \{x_{test}\}$. Let $p(x, y)$ be the ground-truth joint probability distribution of input $x$ and output $y$, and $\hat{h}$ be the optimal hypothesis from $x$ to $y$. FSL learns to discover $\hat{h}$ by fitting $D_{train}$ and testing on $D_{test}$. $\theta$ denotes all the parameters used by $h$.
An FSL algorithm is an optimization strategy that searches the hypothesis space $\mathcal{H}$ in order to find the $\theta$ that parameterizes the best $h^*$. The FSL performance is measured by a loss function $\ell(\hat{y}, y)$ defined over the prediction $\hat{y} = h(x; \theta)$ and the observed output $y$.
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on T, as measured by P, improves with E.
examples:
consider an image classification task (T ): a machine learning program can improve its classification accuracy (P ) through E obtained by training on a large number of labeled images (e.g., the ImageNet data set [73]).
FSL is a special case of machine learning, which aims to obtain good learning performance given the limited supervised information provided in the training set $D_{train}$.
Few-Shot Learning (FSL) is a type of machine learning problems (specified by E, T and P), where E contains only a limited number of examples with supervised information for the target T .
few-shot classification learns classifiers given only a few labeled examples of each class.
example applications:
$D_{train}$ contains $I = KN$ examples from $N$ classes, each with $K$ examples.
estimates a regression function $h$ given only a few input-output example pairs sampled from that function, where output $y_i$ is the observed value of the dependent variable $y$, and $x_i$ is the input recording the observed value of the independent variable $x$.
aims to find a policy given only a few trajectories consisting of state-action pairs.
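To make the few-shot classification setup above concrete, here is a small sketch (the helper `sample_episode` is hypothetical, not from the survey) of sampling one $N$-way $K$-shot episode: $N$ classes, $K$ support examples each (the support set plays the role of $D_{train}$ with $I = KN$), plus query examples standing in for $D_{test}$.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1, q_queries=5):
    # dataset maps class label -> list of examples.
    # The support set plays the role of D_train (I = N * K examples);
    # the query set plays the role of D_test.
    classes = random.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in classes:
        picked = random.sample(dataset[label], k_shot + q_queries)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy dataset: 10 classes with 20 examples each (ints stand in for images).
data = {c: list(range(c * 100, c * 100 + 20)) for c in range(10)}
support, query = sample_episode(data)
print(len(support), len(query))  # 5 25
```

With `n_way=5, k_shot=1` this is the common 5-way 1-shot setting: the learner sees a single labeled example per class.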
One typical type of FSL method is Bayesian learning [35, 76]. It combines the provided training set $D_{train}$ with some prior probability distribution which is available before $D_{train}$ is given [15].
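A minimal numeric sketch of this idea, assuming a Beta-Bernoulli model (the numbers are illustrative, not from the survey): with only three observations the pure empirical estimate is extreme, while combining the data with a prior keeps the estimate reasonable.

```python
# Beta-Bernoulli example: with only I = 3 coin flips, the empirical
# estimate (MLE) is extreme, while combining D_train with a Beta(5, 5)
# prior yields a moderated posterior-mean estimate.
flips = [1, 1, 1]                      # tiny D_train: three heads
mle = sum(flips) / len(flips)          # empirical estimate: 1.0
alpha, beta = 5, 5                     # prior belief: coin is roughly fair
post_mean = (alpha + sum(flips)) / (alpha + beta + len(flips))
print(mle, round(post_mean, 2))        # 1.0 0.62
```

The prior acts as the "knowledge available before $D_{train}$ is given": it dominates when data is scarce and washes out as $I$ grows.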
When there is only one example with supervised information in E, FSL is called one-shot learning [14, 35, 138]. When E does not contain any example with supervised information for the target T, FSL becomes a zero-shot learning (ZSL) problem [78].
ZSL requires E to contain information from other modalities (such as attributes, WordNet, and word embeddings used in rare object recognition tasks), so as to transfer some supervised information and make learning possible.
only a small number of samples have supervised information.
this can be further classified into the following:
weakly supervised learning with incomplete supervision mainly uses unlabeled data as additional information in E, while FSL leverages various kinds of prior knowledge, such as pre-trained models and supervised data from other domains or modalities, and is not restricted to using unlabeled data. Therefore, FSL becomes a weakly supervised learning problem only when the prior knowledge is unlabeled data and the task is classification or regression.
Imbalanced learning learns from experience E with a skewed distribution for y. This happens when some values of y are rarely taken, as in fraud detection and catastrophe anticipation applications. It trains and tests so as to choose among all possible y's. In contrast, FSL trains and tests for y with only a few examples, while possibly taking the other y's as prior knowledge for learning.
It can be used in applications such as cross-domain recommendation, WiFi localization across time periods, space and mobile devices.
Domain adaptation [11] is a type of transfer learning in which the source/target tasks are the same but the source/target domains are different.
example:
in sentiment analysis, the source domain data contains customer comments on movies, while the target domain data contains customer comments on daily goods.
Meta-learning [59] improves P of the new task T using the provided data set and the meta-knowledge extracted across tasks by a meta-learner. Specifically, the meta-learner gradually learns generic information (meta-knowledge) across tasks, and the learner generalizes the meta-learner for a new task T using task-specific information.
the meta-learner is taken as prior knowledge to guide each specific FSL task.
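As a toy illustration of this two-level structure (a Reptile-style sketch, not from the survey, under the assumption that the meta-knowledge is a shared parameter initialization): the learner adapts to each task with a few gradient steps, and the meta-learner nudges the initialization toward each task's adapted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    # Each task is a 1-D regression problem y = a * x with its own slope a.
    a = rng.uniform(1.0, 3.0)
    x = rng.uniform(-1.0, 1.0, size=20)
    return x, a * x

def adapt(w, x, y, lr=0.5, steps=20):
    # Learner: a few gradient steps on this task's squared error,
    # starting from the shared (meta-learned) initialization w.
    for _ in range(steps):
        w -= lr * 2.0 * np.mean((w * x - y) * x)
    return w

# Meta-learner: the meta-knowledge is the shared initialization w_meta,
# nudged toward each task's adapted weight (Reptile-style outer update).
w_meta = 0.0
for _ in range(500):
    x, y = sample_task()
    w_meta += 0.1 * (adapt(w_meta, x, y) - w_meta)

# w_meta drifts toward the mean slope of the task distribution (about 2),
# so a new task can then be fit from only a few gradient steps.
```

The inner loop uses only task-specific data; the outer loop distills generic information across tasks into the initialization, which is exactly the prior knowledge handed to each new FSL task.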
we illustrate the core issue of FSL based on error decomposition in supervised machine learning [17, 18].
Given a hypothesis h, we want to minimize its expected risk R, which is the loss measured with respect to p(x,y). Specifically,
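The formula itself does not survive in this excerpt; reconstructing the standard definition in the surrounding notation, the expected risk is the loss averaged under the true distribution $p(x, y)$:

```latex
R(h) = \int \ell\big(h(x;\theta), y\big)\, dp(x, y) = \mathbb{E}\big[\ell\big(h(x;\theta), y\big)\big].
```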
Since $p(x, y)$ is unknown, the empirical risk (the average loss over the $I$ samples of the training set $D_{train}$) is used instead:
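Written out (reconstructed in the same notation, since the formula is missing from this excerpt), the empirical risk over the $I$ training samples is:

```latex
R_I(h) = \frac{1}{I} \sum_{i=1}^{I} \ell\big(h(x_i;\theta), y_i\big).
```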
The empirical risk $R_I(h)$ is usually used as a proxy for the expected risk $R(h)$, and learning proceeds by empirical risk minimization (possibly with some regularization).
For better illustration, we define:
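The definitions themselves are missing from this excerpt; the standard ones, consistent with the surrounding text, are:

```latex
\hat{h} = \arg\min_{h} R(h), \qquad
h^* = \arg\min_{h \in \mathcal{H}} R(h), \qquad
h_I = \arg\min_{h \in \mathcal{H}} R_I(h),
```

i.e., $\hat{h}$ minimizes the expected risk over all functions, $h^*$ is the best hypothesis within $\mathcal{H}$, and $h_I$ is the empirical risk minimizer within $\mathcal{H}$.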
Assuming the three are independent, the total error can be decomposed as:
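The decomposition itself is missing here; in the standard form (with the expectation taken over the random choice of $D_{train}$, and with $\hat{h}$, $h^*$, $h_I$ the overall optimal hypothesis, the best hypothesis in $\mathcal{H}$, and the empirical risk minimizer in $\mathcal{H}$, respectively), it reads:

```latex
\mathbb{E}\big[R(h_I) - R(\hat{h})\big]
  = \underbrace{\mathbb{E}\big[R(h^*) - R(\hat{h})\big]}_{\text{approximation error}}
  + \underbrace{\mathbb{E}\big[R(h_I) - R(h^*)\big]}_{\text{estimation error}}.
```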
The first term on the right-hand side is the approximation error, and the second is the estimation error.
In summary, the total error is affected by $\mathcal{H}$ (the hypothesis space) and $I$ (the number of examples in the training set). In other words, to reduce the total error, one can work from three perspectives: (i) data, which provides $D_{train}$; (ii) model, which determines $\mathcal{H}$; and (iii) algorithm, which searches $\mathcal{H}$ for the best $h$.
The estimation error can be reduced by increasing the number of samples [17, 18, 41]. Hence, when there is sufficient training data with supervised information, the estimation error is small.
this is the core issue of FSL supervised learning:
the empirical risk minimizer $h_I$ is no longer reliable.
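A quick numerical illustration of this unreliability (not from the survey): for squared loss with constant hypotheses $h(x) = c$, the empirical risk minimizer is simply the sample mean, and its expected error scales like $1/I$, so a small $I$ leaves $h_I$ far from $h^*$ on average.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_sq_error(n_samples, n_trials=2000):
    # The sample mean is the empirical risk minimizer h_I for squared loss
    # over constant hypotheses. Measure how far it lands from the true mean
    # (0 here), averaged over many random draws of D_train.
    means = rng.normal(0.0, 1.0, size=(n_trials, n_samples)).mean(axis=1)
    return float(np.mean(means ** 2))

few, many = avg_sq_error(5), avg_sq_error(500)
print(few > many)  # True: the error shrinks roughly like 1/I
```

With $I = 5$ the minimizer's average squared error is roughly a hundred times larger than with $I = 500$, which is exactly why FSL must bring in prior knowledge rather than rely on $D_{train}$ alone.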
To alleviate the problem of having an unreliable empirical risk minimizer $h_I$ in FSL supervised learning, prior knowledge must be used. Based on which aspect is enhanced using prior knowledge, existing FSL works can be categorized into the following (Figure 2).