Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling (Translation)

Abstract

An attention-based neural network model is proposed for building a joint model of intent detection and slot filling.

1. Introduction

Intent detection: intent detection can be treated as a semantic utterance classification problem; popular approaches include support vector machines and deep neural networks.

Slot filling: slot filling is treated as a sequence labeling problem; popular approaches include maximum entropy Markov models, conditional random fields (CRFs), and RNNs.

Joint models for intent detection and slot filling have also been proposed in [8, 9].

The attention mechanism introduced in [12] enables the encoder-decoder architecture to learn to align and decode simultaneously.

2. Background

2.1 RNN for Slot Filling

An RNN simply predicts the corresponding slot label at each time step; this works because in the slot filling task the input words and the labels are aligned one-to-one.
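Because the input and label sequences are aligned, a slot tagger can emit one label per time step. Below is a minimal PyTorch sketch; the vocabulary size, label set size, and layer widths are made-up illustration values, not the paper's setup:

```python
import torch
import torch.nn as nn

class RNNSlotTagger(nn.Module):
    """Unidirectional LSTM tagger: emits one slot-label distribution per input token."""
    def __init__(self, vocab_size, num_slots, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_slots)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))      # h: (batch, seq_len, hidden_dim)
        return self.out(h)                        # logits: (batch, seq_len, num_slots)

# toy usage: a batch of 2 utterances, 5 tokens each
tagger = RNNSlotTagger(vocab_size=1000, num_slots=20)
slot_logits = tagger(torch.randint(0, 1000, (2, 5)))   # shape (2, 5, 20)
```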

2.2 Encoder-Decoder

This architecture can handle variable-length sequences; the attention mechanism introduced in [12] lets the encoder-decoder model learn a soft alignment and decode at the same time.

3. Proposed Methods

3.1 Encoder-Decoder with Aligned Inputs


Figure 2: Encoder-decoder model for the joint intent detection and slot filling task. (a) With no aligned inputs. (b) With aligned inputs. (c) With aligned inputs and attention. The encoder is a bidirectional LSTM (BLSTM); the last state of the backward encoder RNN is used to initialize the decoder RNN state.

The encoder-decoder model for joint intent detection and slot filling is shown in Figure 2, using LSTM cells. The forward and backward RNNs produce hidden states fh_i and bh_i at each time step, and the final encoder state at step i is their concatenation, h_i = [fh_i, bh_i].

The last states of the forward and backward encoder RNNs carry information about the entire sequence. We use the last state of the backward encoder RNN to initialize the state of the decoder, which is a unidirectional RNN. Here c_i is the context vector, h_i is the aligned encoder hidden state, and s_i is the decoder state at time step i.
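A minimal PyTorch sketch of this encoder step, assuming a single-layer bidirectional LSTM (PyTorch already returns the per-step forward and backward states concatenated, and exposes the last forward/backward states separately):

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 128, 128
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

x = torch.randn(2, 6, embed_dim)      # embedded input: (batch=2, T=6, embed_dim)
h, (h_n, c_n) = encoder(x)

# h:   (2, 6, 2 * hidden_dim)  -> per-step states h_i = [fh_i, bh_i]
# h_n: (2, 2, hidden_dim)      -> h_n[0] is the last forward state, h_n[1] the last backward state
decoder_init_state = h_n[1]           # last backward encoder state initializes the decoder
```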

The context vector c_i is computed as a weighted sum of the encoder states h = (h_1, ..., h_T):

$$
c_i = \sum_{j=1}^{T} \alpha_{i,j}\, h_j
$$

$$
\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{T} \exp(e_{i,k})}, \qquad e_{i,k} = g(s_{i-1}, h_k)
$$

That is, the attention weights α_{i,j} are simply a softmax over the scores e_{i,k}.

Here g is a feed-forward neural network. At each decoder time step, the aligned encoder state h_i is also fed as an explicit input. The context vector c_i provides additional information to the decoder and can be viewed as a continuous bag of weighted features (h_1, ..., h_T).
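The following sketch implements the attention above with g as a small feed-forward scoring network (a Bahdanau-style parameterization; the exact form of g and the layer sizes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """c_i = sum_j alpha_{i,j} h_j, with alpha_{i,j} a softmax over e_{i,k} = g(s_{i-1}, h_k)."""
    def __init__(self, dec_dim, enc_dim, attn_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim)   # projects the previous decoder state
        self.W_h = nn.Linear(enc_dim, attn_dim)   # projects each encoder state
        self.v = nn.Linear(attn_dim, 1)           # scalar score per encoder position

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, dec_dim); enc_states: (batch, T, enc_dim)
        e = self.v(torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_states)))  # (batch, T, 1)
        alpha = torch.softmax(e, dim=1)                      # attention weights over time
        context = (alpha * enc_states).sum(dim=1)            # (batch, enc_dim)
        return context, alpha.squeeze(-1)

# toy usage: decoder state of width 128 attending over 7 encoder states of width 256
attn = Attention(dec_dim=128, enc_dim=256)
c_i, weights = attn(torch.randn(2, 128), torch.randn(2, 7, 256))   # c_i: (2, 256), weights: (2, 7)
```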

Intent detection and slot filling share the same encoder. For intent detection, the model shares the initial decoder state s_0, which encodes information about the entire source sequence, and its context vector c_intent indicates the part of the source sequence that the intent decoder attends to.
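To make this concrete, here is a compact sketch of the aligned-inputs encoder-decoder joint model (Figure 2c): a shared BiLSTM encoder, a slot decoder that at each step consumes the aligned encoder state h_i plus an attention context c_i, and an intent classifier built from the shared initial state s_0 and its own context c_intent. It reuses the Attention module sketched above; sharing one attention module, dropping the previous-label input y_{i-1}, and the layer sizes are simplifications for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class JointEncoderDecoder(nn.Module):
    """Aligned-inputs encoder-decoder (Fig. 2c): shared BiLSTM encoder,
    attention-based slot decoder, intent classifier on [s_0; c_intent]."""
    def __init__(self, vocab_size, num_slots, num_intents, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        enc_dim = 2 * hidden_dim
        self.attn = Attention(dec_dim=hidden_dim, enc_dim=enc_dim)   # sketched earlier
        self.slot_cell = nn.LSTMCell(2 * enc_dim, hidden_dim)        # input: [h_i; c_i]
        self.slot_out = nn.Linear(hidden_dim, num_slots)
        self.intent_out = nn.Linear(hidden_dim + enc_dim, num_intents)

    def forward(self, tokens):                            # tokens: (batch, T)
        h, (h_n, _) = self.encoder(self.embed(tokens))    # h: (batch, T, 2*hidden_dim)
        s_0 = h_n[1]                                      # last backward encoder state
        s, c_state = s_0, torch.zeros_like(s_0)
        slot_logits = []
        for i in range(h.size(1)):                        # step over the aligned encoder states
            c_i, _ = self.attn(s, h)
            s, c_state = self.slot_cell(torch.cat([h[:, i], c_i], dim=-1), (s, c_state))
            slot_logits.append(self.slot_out(s))
        c_intent, _ = self.attn(s_0, h)                   # intent attends from the shared s_0
        intent_logits = self.intent_out(torch.cat([s_0, c_intent], dim=-1))
        return torch.stack(slot_logits, dim=1), intent_logits

model = JointEncoderDecoder(vocab_size=1000, num_slots=20, num_intents=5)
slots, intent = model(torch.randint(0, 1000, (2, 6)))    # slots: (2, 6, 20), intent: (2, 5)
```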

3.2 Attention-Based RNN Model


Figure 3: Attention-based RNN model for the joint task. A bidirectional RNN reads the source sequence in both the forward and backward directions; slot label dependencies are modeled by the forward RNN. At each time step, the concatenated forward and backward hidden states are used to predict the slot label. If attention is used, the context vector c_i provides information from parts of the input sequence and is used together with the time-aligned hidden state h_i to predict the slot label.

For sequence labeling with a bidirectional RNN, the hidden state at each time step carries information about the whole sentence, but some information is inevitably lost as it propagates forward and backward. Therefore, for slot label prediction, rather than relying only on the hidden state h_i at each step, we examine whether a context vector c_i provides additional supporting information, especially for long-term dependencies that are not fully captured by the hidden state.

Slot label dependencies are modeled by the forward RNN, similar to the encoder in the encoder-decoder architecture. The hidden state at each time step is the concatenation of the forward and backward states, h_i = [fh_i, bh_i], so each h_i contains information about the whole input sequence. h_i is combined with the context vector c_i to produce the slot label distribution, where c_i is computed as a weighted average of the RNN hidden states h = (h_1, ..., h_T).

For the joint model, we reuse the BiRNN hidden states h to produce the intent class distribution. If attention is not used, we apply mean pooling over the hidden states h across time and perform intent classification with logistic regression [17]. If attention is used, we instead take a weighted average of the hidden states h.

Compared with the attention-based encoder-decoder model with aligned inputs, the attention-based RNN model is more computationally efficient: the encoder-decoder model reads the input sequence twice, while the attention-based RNN reads it only once.
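A sketch of the attention-based RNN joint model described in this section: a single BiLSTM pass, a per-token slot head that can concatenate h_i with a context vector c_i, and an intent head that uses either mean pooling or an attention-weighted average of h. The dot-product slot attention and the learned intent query vector are illustrative choices rather than the paper's exact parameterization, and explicit slot-label dependency modeling is omitted.

```python
import torch
import torch.nn as nn

class AttentionBiRNNJoint(nn.Module):
    """BiLSTM read once; one slot label per token and one intent per utterance."""
    def __init__(self, vocab_size, num_slots, num_intents,
                 embed_dim=128, hidden_dim=128, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.birnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        feat = 2 * hidden_dim
        self.slot_out = nn.Linear(2 * feat if use_attention else feat, num_slots)
        self.intent_query = nn.Parameter(torch.randn(feat))   # query for intent attention
        self.intent_out = nn.Linear(feat, num_intents)

    def forward(self, tokens):                                # tokens: (batch, T)
        h, _ = self.birnn(self.embed(tokens))                 # h: (batch, T, 2*hidden_dim)
        if self.use_attention:
            # per-token context c_i: weighted average of all hidden states (dot-product scores)
            alpha = torch.softmax(torch.bmm(h, h.transpose(1, 2)), dim=-1)   # (batch, T, T)
            c = torch.bmm(alpha, h)                                          # (batch, T, 2*hidden_dim)
            slot_logits = self.slot_out(torch.cat([h, c], dim=-1))
            # intent: attention-weighted average of the hidden states
            w = torch.softmax(torch.matmul(h, self.intent_query), dim=-1).unsqueeze(-1)
            intent_logits = self.intent_out((w * h).sum(dim=1))
        else:
            slot_logits = self.slot_out(h)
            intent_logits = self.intent_out(h.mean(dim=1))    # mean pooling over time [17]
        return slot_logits, intent_logits

model = AttentionBiRNNJoint(vocab_size=1000, num_slots=20, num_intents=5)
slot_logits, intent_logits = model(torch.randint(0, 1000, (2, 6)))   # (2, 6, 20), (2, 5)
```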

4. Experiments

4.1 Data

The ATIS dataset is used.

4.2 Training Procedure

- LSTM cell size: 128
- Number of LSTM layers: 1
- Batch size: 16
- Word embedding dimension: 128
- Dropout rate: 0.5
- Optimizer: Adam
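A minimal training-step sketch wiring up these settings, reusing the AttentionBiRNNJoint sketch from Section 3.2; the equal weighting of the two losses and the toy data sizes are assumptions, and the 0.5 dropout would be applied inside the model (omitted there for brevity).

```python
import torch
import torch.nn as nn

# hyperparameters listed above
BATCH_SIZE, EMBED_DIM, HIDDEN_DIM, DROPOUT = 16, 128, 128, 0.5   # DROPOUT: configure inside the model

model = AttentionBiRNNJoint(vocab_size=1000, num_slots=20, num_intents=5,
                            embed_dim=EMBED_DIM, hidden_dim=HIDDEN_DIM)
optimizer = torch.optim.Adam(model.parameters())
slot_loss_fn = nn.CrossEntropyLoss()
intent_loss_fn = nn.CrossEntropyLoss()

def train_step(tokens, slot_labels, intent_labels):
    # tokens, slot_labels: (BATCH_SIZE, T); intent_labels: (BATCH_SIZE,)
    slot_logits, intent_logits = model(tokens)
    loss = (slot_loss_fn(slot_logits.reshape(-1, slot_logits.size(-1)), slot_labels.reshape(-1))
            + intent_loss_fn(intent_logits, intent_labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy batch
loss = train_step(torch.randint(0, 1000, (BATCH_SIZE, 6)),
                  torch.randint(0, 20, (BATCH_SIZE, 6)),
                  torch.randint(0, 5, (BATCH_SIZE,)))
```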

4.3 Independent Training Model: Slot Filling

Table 1: Single-task performance.


Table 2: Comparison with previous approaches: independent training model results on ATIS slot filling.


4.4 Independent Training Model: Intent Detection

Table 3 compares the intent classification error rate of our intent model with previous approaches.


4.5 Joint Model

Table 4 shows the performance of our joint training model on intent detection and slot filling, compared with previously reported results.


5. Conclusions

We explored strategies for utilizing explicit alignment information in attention-based encoder-decoder neural network models.

We further proposed an attention-based bidirectional RNN model for joint intent detection and slot filling.

6. References

[1] P. Haffner, G. Tur, and J. H. Wright, “Optimizing SVMs for complex call classification,” in Acoustics, Speech, and Signal Processing (ICASSP’03), 2003 IEEE International Conference on, vol. 1. IEEE, 2003, pp. I-632.
[2] R. Sarikaya, G. E. Hinton, and B. Ramabhadran, “Deep belief nets for natural language call-routing,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5680–5683.
[3] A. McCallum, D. Freitag, and F. C. Pereira, “Maximum entropy Markov models for information extraction and segmentation,” in ICML, vol. 17, 2000, pp. 591–598.
[4] C. Raymond and G. Riccardi, “Generative and discriminative algorithms for spoken language understanding,” in INTERSPEECH, 2007, pp. 1605–1608.
[5] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken language understanding using long short-term memory neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 189–194.
[6] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tür, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 530–539, 2015.
[7] B. Liu and I. Lane, “Recurrent neural network structured output prediction for spoken language understanding,” in Proc. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions, 2015.
[8] D. Guo, G. Tur, W.-t. Yih, and G. Zweig, “Joint semantic utterance classification and slot filling with recursive neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 554–559.
[9] P. Xu and R. Sarikaya, “Convolutional neural network based triangular CRF for joint intent detection and slot filling,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 78–83.
[10] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[11] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
[12] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[13] T. Mikolov, S. Kombrink, L. Burget, J. H. Černocký, and S. Khudanpur, “Extensions of recurrent neural network language model,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5528–5531.
[14] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[15] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in Neural Information Processing Systems, 2015, pp. 649–657.
[18] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, “The ATIS spoken language systems pilot corpus,” in Proceedings, DARPA Speech and Natural Language Workshop, 1990, pp. 96–101.
[19] G. Tur, D. Hakkani-Tür, and L. Heck, “What is left to be understood in ATIS?” in Spoken Language Technology Workshop (SLT), 2010 IEEE. IEEE, 2010, pp. 19–24.
[20] M. Jeong and G. Geunbae Lee, “Triangular-chain conditional random fields,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 7, pp. 1287–1302, 2008.
[21] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[22] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 2342–2350.
[23] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[24] B. Peng and K. Yao, “Recurrent neural networks with external memory for language understanding,” arXiv preprint arXiv:1506.00195, 2015.
[25] G. Kurata, B. Xiang, B. Zhou, and M. Yu, “Leveraging sentence-level information with encoder LSTM for natural language understanding,” arXiv preprint arXiv:1601.01530, 2016.
[26] G. Tur, D. Hakkani-Tür, L. Heck, and S. Parthasarathy, “Sentence simplification for spoken language understanding,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on. IEEE, 2011, pp. 5628–5631.
