Key phrases:
- a zero-shot learning approach
- the natural language understanding (NLU) domain
- a given utterance
- domains at runtime
- utterances and domains
- the same embedding space
- a domain-specific embedding
- a set of attributes that characterize the domain
virtual assistants: Alexa, Cortana and the Google Assistant
a small and relatively fixed number of domains

Domains are groupings of mutually related user intents, and predicting the right domain for a given utterance could be treated as a multi-class classification problem. With the Alexa Skills Kit, the Cortana Skills Kit, and Actions on Google, the number of domains is growing exponentially.
- non-experts
- heterogeneous
- overlapping output label spaces
- (training a model) from scratch for every new domain
- infeasible
- at regular intervals
- the interim period
Learning a function that projects any domain into a dense vector enables continuous extensibility to new domains:
- Learn a function that generates a domain embedding for any domain from the attributes of the domain (e.g., its sample utterances).
- Learn a function that generates an utterance embedding for any incoming utterance.
- Train the two functions to use the same embedding space.
- At runtime, list the domains whose embeddings are most similar to the utterance embedding.
This paper deals with the case where novel classes (i.e., domains) are added after the model has been trained. Standard neural network classifiers learn unique parameters per training class $y \in Y^{train}$ and therefore cannot predict new classes at test time. Concretely, a standard network uses a score function with one parameter vector per training class:

$$s(x, y; \theta, f_x) = h_x(x; \theta_x, f_x) \cdot \theta_y^T$$

Here $f_x$ is a function that extracts input attributes from the input $x$, and $\theta_x$ denotes the network parameters excluding the last layer; the score is linear in the per-class parameters $\theta_y$.
In the zero-shot framework, $f_x(x)$ and $f_y(y)$ are attributes, and $h_x$ and $h_y$ are dense embeddings: $h_y(y; \theta_y, f_y)$ is the embedding of class $y$, computed from the class attributes $f_y(y)$, and $\theta_y$ is now a set of parameters shared across all classes.
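Combining these definitions, the zero-shot scorer replaces the per-class parameter vector $\theta_y$ with a class embedding computed from class attributes. Consistent with the score function above, it can be written as:

```latex
s(x, y; \theta, f_x, f_y) = h_x(x; \theta_x, f_x) \cdot h_y(y; \theta_y, f_y)^T
```

Because $\theta_y$ is shared, a score can be computed for any class whose attributes $f_y(y)$ are available, including classes never seen during training.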
The framework consists of three components. The first maps the attributes of an input utterance, $f_x(x)$, to a dense embedding $h_x(x)$; the input attributes include all utterance-specific contextual features. We use 300-dimensional pre-trained word embeddings to initialize the lookup layer. This is followed by a mean pooling layer and then an affine layer with a tanh nonlinear activation function. LSTM-based architectures are an alternative.
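A minimal sketch of this utterance encoder, using a toy vocabulary of random "pretrained" vectors (the real model would load 300-dimensional embeddings; `utterance_encoder`, `W`, and `b` are names invented for this sketch):

```python
import numpy as np

def utterance_encoder(tokens, word_vectors, W, b):
    """Encode an utterance as h_x(x): mean-pooled word embeddings
    followed by an affine layer with a tanh nonlinearity."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    pooled = np.mean(vecs, axis=0)     # mean pooling over token embeddings
    return np.tanh(W @ pooled + b)     # affine + tanh -> dense utterance embedding

# toy usage with random stand-ins for pretrained 300-d vectors
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=300) for w in ["play", "some", "music"]}
W, b = rng.normal(size=(100, 300)), np.zeros(100)
h_x = utterance_encoder(["play", "some", "music"], vocab, W, b)
print(h_x.shape)  # (100,)
```

The embedding dimensionality (100 here) is an arbitrary choice for the sketch.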
The second component maps each domain to a dense embedding. For each domain $y$ we extract the following attributes $f_y(y)$:
- Category metadata: developer-provided metadata such as the domain category.
- Mean-pooled word embeddings of the domain's sample utterances.
- Gazetteer attributes: we have a number of in-house gazetteers. Gazetteer-firing patterns are noisy, and some gazetteers are badly constructed, so instead of using raw matches against the gazetteers as feature values, we normalize them by applying TF-IDF.
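The notes do not spell out the exact normalization, so the following is only one plausible sketch, treating each gazetteer as a "document" so that tokens firing in many gazetteers get a low IDF (function and variable names are hypothetical):

```python
import math

def tfidf_gazetteer_features(tokens, gazetteers):
    """Normalize raw gazetteer matches with TF-IDF so that noisy,
    over-firing gazetteers contribute weaker feature values.

    gazetteers: dict mapping gazetteer name -> set of entries
    """
    n = len(gazetteers)
    feats = {}
    for name, entries in gazetteers.items():
        score = 0.0
        for t in set(tokens):
            if t in entries:
                # document frequency: how many gazetteers contain this token
                df = sum(1 for g in gazetteers.values() if t in g)
                score += math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF
        feats[name] = score
    return feats

gaz = {
    "artists": {"madonna", "prince"},
    "cities":  {"paris", "prince"},   # 'prince' fires in both -> lower IDF
}
# 'artists' outscores 'cities' because 'madonna' fires only there
print(tfidf_gazetteer_features(["play", "madonna", "prince"], gaz))
```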
The third component produces a similarity score between the input and the output. We define the scorer as a vector dot product; alternatives include cosine distance, Euclidean distance, or a scorer that is a trainable neural network in itself, jointly trained as part of the larger network.
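A small sketch of the dot-product scorer and two of the alternatives (negative Euclidean distance is used so that higher always means more similar; names are illustrative):

```python
import numpy as np

def dot_score(h_x, h_y):
    # the scorer used here: s(x, y) = h_x(x) . h_y(y)
    return float(h_x @ h_y)

def cosine_score(h_x, h_y):
    # alternative: cosine similarity (dot product of unit vectors)
    return float(h_x @ h_y / (np.linalg.norm(h_x) * np.linalg.norm(h_y)))

def neg_euclidean_score(h_x, h_y):
    # alternative: negative Euclidean distance
    return -float(np.linalg.norm(h_x - h_y))

u = np.array([1.0, 2.0, 0.0])
d = np.array([2.0, 4.0, 0.0])   # same direction, larger magnitude
print(dot_score(u, d), cosine_score(u, d), neg_euclidean_score(u, d))
```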
$$D^{train} = \{(x_i, y_i)\}_{i=1}^{N}$$

denotes the available training data, with $y_i \in Y^{train}$ for all $i$. With the score function defined above, we could define a probability distribution over the training classes $Y^{train}$ using a softmax layer, similar to a maximum entropy model:

$$P(y \mid x) = \frac{\exp s(x, y)}{\sum_{\hat{y} \in Y^{train}} \exp s(x, \hat{y})}$$

Minimizing a cross-entropy loss then yields optimal parameters $\theta_x$ and $\theta_y$. However, normalizing over the training classes $Y^{train}$ when the model will be evaluated on the test classes $Y^{test}$ is not well motivated.
Here $[x]_{+}$ is the hinge function, equal to $x$ when $x > 0$ and $0$ otherwise. This objective function tries to maximize the margin between the correct class and all the other classes. As training progresses, the model starts choosing the hardest, most confusable cases as negative samples. We find that this training strategy significantly speeds up convergence compared to purely random sampling, though sampling from the normalized output distribution adds a fixed time cost. Maximizing the margin against the best incorrect class implies that the margin against the other incorrect classes is maximized as well.
Here $\hat{y}_i = \arg\max_{y \neq y_i} s(x_i, y)$ is the highest-scoring incorrect prediction under the current model, and $y_i$ denotes the ground truth.
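A minimal sketch of this objective for a single example, with the hardest negative chosen as the highest-scoring incorrect class (the margin value of 1.0 and the function names are assumptions of the sketch, not taken from the paper):

```python
import numpy as np

def hinge(x):
    # [x]_+ : equal to x when x > 0, else 0
    return np.maximum(x, 0.0)

def max_margin_loss(scores, gold, margin=1.0):
    """Margin loss against the hardest negative.

    scores: s(x_i, y) for every y in Y_train; gold: index of y_i.
    """
    neg = np.delete(scores, gold)
    y_hat = np.max(neg)              # score of the hardest incorrect class
    return float(hinge(margin + y_hat - scores[gold]))

scores = np.array([2.0, 3.5, 0.5])           # s(x_i, y) over three classes
print(max_margin_loss(scores, gold=1))       # 1 + 2.0 - 3.5 < 0 -> 0.0
print(max_margin_loss(scores, gold=0))       # 1 + 3.5 - 2.0 -> 2.5
```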
For each training pair $(x_i, y_i)$, the partial gradients during training involve the input embedding $h_x(x_i)$, all the output embeddings $h_y(y)$, and the resulting scores $s(x_i, y)$ for every $y \in Y^{train}$. Equation 5 is used to compute the gradients of the score function that update the parameters.
At test time we are given examples

$$D^{test} = \{(x_i, y_i)\}_{i=1}^{\hat{N}}$$

where $y_i \in Y^{test}$. We compute $f_y(y)$ for every $y \in Y^{test}$, follow the same procedure as before to compute the scores for all the test classes $Y^{test}$, and predict the best class: $\arg\max_{y \in Y^{test}} s(x_i, y)$.
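Test-time prediction then reduces to embedding the novel domains and taking an argmax over their scores; a minimal sketch (the domain names and all vectors here are made up for illustration):

```python
import numpy as np

def zero_shot_predict(h_x, test_domains):
    """Predict argmax over y in Y_test of s(x, y) for domains unseen in training.

    test_domains: dict mapping domain name -> embedding h_y(y),
    computed from the domain's attributes f_y(y).
    """
    return max(test_domains, key=lambda y: float(h_x @ test_domains[y]))

# two novel domains whose embeddings were produced by h_y at test time
domains = {"Recipes": np.array([0.9, 0.1]), "Podcasts": np.array([0.1, 0.9])}
utt = np.array([0.8, 0.3])                   # h_x for an incoming utterance
print(zero_shot_predict(utt, domains))       # -> Recipes
```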
Many NLP problems can similarly be cast as attribute learning problems for better generalization and extension to novel classes.
Baselines:
- Naïve Bayes: models $P(\text{utterance} \mid \text{domain})$ with features being word unigrams in the utterance.
- Trigram language model: a trigram language model per domain models $P(\text{utterance} \mid \text{domain})$; Kneser-Ney smoothing is applied for n-gram backoff.
- K-NN: uses intent embeddings from a classifier trained on data excluding the zero-shot partition.

We also compared the zero-shot model to an n-gram-based maximum entropy model baseline for intent classification within a domain.
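A toy sketch of the unigram Naïve Bayes baseline (add-one smoothing and all counts here are this sketch's assumptions; the trigram baseline additionally uses Kneser-Ney smoothing, which is not shown):

```python
import math
from collections import Counter

def nb_log_score(tokens, domain_counts, vocab_size):
    """log P(utterance | domain) under a unigram Naive Bayes model
    with add-one smoothing."""
    total = sum(domain_counts.values())
    return sum(math.log((domain_counts[t] + 1) / (total + vocab_size))
               for t in tokens)

# toy per-domain unigram counts
music = Counter({"play": 5, "song": 3})
weather = Counter({"weather": 6, "today": 2})
V = 6  # hypothetical vocabulary size
utt = ["play", "song"]
best = max(["music", "weather"],
           key=lambda d: nb_log_score(utt, {"music": music, "weather": weather}[d], V))
print(best)  # -> music
```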
Personal reading notes: when running the code, debug it with the future-work directions in mind, then revise and organize it. Skim the paper to get the model architecture, the novel ideas, and the future directions, and think about how to fold those directions into the model and into my own writing; keep refining this reading routine to get through papers faster. There is no need to read word by word: focus on the architecture, the formula derivations, the innovations, and the baseline models, then run the code and understand its structure.