Tutorial: http://deeplearning.net/tutorial/rnnslu.html
Related papers: Grégoire Mesnil, Xiaodong He, Li Deng and Yoshua Bengio - Investigation of Recurrent Neural Network Architectures and Learning Methods for Spoken Language Understanding
Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu and Geoffrey Zweig - Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding
Slot filling (spoken language understanding) assigns a label to every word in a sentence; it is a classification problem.
The dataset is a small DARPA dataset: ATIS (Airline Travel Information System). The example below uses the Inside Outside Beginning (IOB) representation.
Input (words) | show | flights | from | Boston | to | New | York | today |
---|---|---|---|---|---|---|---|---|
Output (labels) | O | O | O | B-dept | O | B-arr | I-arr | B-date |
The training set contains 4978 sentences and 56590 words; the test set contains 893 sentences and 9198 words; the average sentence length is 15. The number of classes (different slots) is 128, including the O label (NULL).
Following the Microsoft Research people, words that appear only once in the training set are marked as <UNK>, and the same token is used to represent unseen words in the test set; following Ronan Collobert and colleagues, sequences of digits are replaced with DIGIT, e.g. 1984 becomes DIGITDIGITDIGITDIGIT.
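As a rough illustration of these two substitutions (not the tutorial's actual preprocessing code; train_sentences and test_sentences are hypothetical lists of token lists):

import re
from collections import Counter

def replace_digits(token):
    return re.sub(r'\d', 'DIGIT', token)       # e.g. '1984' -> 'DIGITDIGITDIGITDIGIT'

# count tokens on the digit-normalized training set
counts = Counter(replace_digits(t) for sent in train_sentences for t in sent)

def normalize(token):
    token = replace_digits(token)
    return token if counts.get(token, 0) > 1 else '<UNK>'   # rare or unseen -> <UNK>

train_sentences = [[normalize(t) for t in sent] for sent in train_sentences]
test_sentences = [[normalize(t) for t in sent] for sent in test_sentences]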
The training set is split 80%/20% into a training set and a validation set. At F1 scores around 95%, the impact of this smaller training set on performance can exceed 0.6%; see the Microsoft Research paper for details.
Three evaluation metrics are used in the experiments: Precision, Recall and F1 score. The first two are the standard precision/recall definitions; in pattern recognition one usually fixes the recall level and compares precision. The F1 score is computed from Precision and Recall as F1 = 2 * Precision * Recall / (Precision + Recall).
The tutorial uses the conlleval PERL script (a script that computes the above metrics) to evaluate performance.
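For reference, the three metrics relate as follows; this is just the standard per-count definition, not the conlleval script itself (conlleval additionally evaluates at the chunk level):

def precision_recall_f1(tp, fp, fn):
    # standard definitions from true-positive / false-positive / false-negative counts
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 correct slots, 10 spurious, 20 missed
print(precision_recall_f1(90, 10, 20))   # (0.9, 0.818..., 0.857...)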
Each word corresponds to one token. In the ATIS vocabulary, each token is associated with an index. Each sentence is an array of indexes (int32). Each set (train, validation, test) is a list of arrays of indexes (a Python list). Python dictionaries are used to map the indexes back to the words.
>>> sentence
array([383, 189, 13, 193, 208, 307, 195, 502, 260, 539, 7, 60, 72, 8, 350, 384], dtype=int32)
>>> map(lambda x: index2word[x], sentence)
['please', 'find', 'a', 'flight', 'from', 'miami', 'florida', 'to', 'las', 'vegas', '<UNK>', 'arriving', 'before', 'DIGIT', "o'clock", 'pm']
map(function, iterable, ...) takes a function as its first argument and an iterable as its second. It applies the function to every element of the iterable and returns a list containing the result for each element.
>>> def add100(x):
...     return x + 100
...
>>> hh = [11, 22, 33]
>>> map(add100, hh)
[111, 122, 133]
In this example the function is add100 and the iterable is hh: 100 is added to every element of hh and the results are returned as a list.
lambda x: index2word[x] is a one-line (anonymous) function, so it plays the role of the function argument of map. Using lambda avoids defining a named function with def, which keeps the code shorter. So map(lambda x: index2word[x], sentence) looks up, for every index in sentence, the corresponding word via index2word[x].
In the same way, the labels are recovered from their indexes:
>>> labels
array([126, 126, 126, 126, 126, 48, 50, 126, 78, 123, 81, 126, 15, 14, 89, 89], dtype=int32)
>>> map(lambda x: index2label[x], labels)
['O', 'O', 'O', 'O', 'O', 'B-fromloc.city_name', 'B-fromloc.state_name', 'O', 'B-toloc.city_name', 'I-toloc.city_name', 'B-toloc.state_name', 'O', 'B-arrive_time.time_relative', 'B-arrive_time.time', 'I-arrive_time.time', 'I-arrive_time.time']
A context window extracts, for every word in the sentence, the word itself together with the surrounding words, according to the given window size (1, 3, 5, ...). The result is a list of lists of indexes.
def contextwin(l, win):
    '''
    win :: int corresponding to the size of the window
    given a list of indexes composing a sentence

    l :: array containing the word indexes

    it will return a list of list of indexes corresponding
    to context windows surrounding each word in the sentence
    '''
    assert (win % 2) == 1
    assert win >= 1
    l = list(l)

    lpadded = win // 2 * [-1] + l + win // 2 * [-1]
    out = [lpadded[i:(i + win)] for i in range(len(l))]

    assert len(out) == len(l)
    return out
The index -1 is used as padding when a word at the beginning or end of the sentence has no word before or after it. The example below makes this clear; 0, 1, 2, 3, 4 are word indexes.
>>> x
array([0, 1, 2, 3, 4], dtype=int32)
>>> contextwin(x, 3)
[[-1, 0, 1],
 [ 0, 1, 2],
 [ 1, 2, 3],
 [ 2, 3, 4],
 [ 3, 4, -1]]
>>> contextwin(x, 7)
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1, 0, 1, 2, 3, 4],
 [-1, 0, 1, 2, 3, 4, -1],
 [ 0, 1, 2, 3, 4, -1, -1],
 [ 1, 2, 3, 4, -1, -1, -1]]
In the matrices above, every row is the context window of one word. We now have the context windows: a matrix of indexes, where every element is an index corresponding to a word (index2word recovers the word). The next step is to turn the indexes into embeddings (a real-valued vector for each word).
import theano, numpy
from theano import tensor as T

# nv :: size of our vocabulary
# de :: dimension of the embedding space
# cs :: context window size
nv, de, cs = 1000, 50, 5

embeddings = theano.shared(0.2 * numpy.random.uniform(-1.0, 1.0,
    (nv + 1, de)).astype(theano.config.floatX))  # add one for PADDING at the end

idxs = T.imatrix()  # as many columns as words in the context window and as many lines as words in the sentence
x = embeddings[idxs].reshape((idxs.shape[0], de * cs))
x is a matrix of shape (number of words in the sentence, de * cs). Example: with 5 words in the sentence, de = 50 and cs = 7 (note the window size is 7 here, not the 5 set in the snippet above), the resulting matrix has shape (5, 50*7) = (5, 350).
>>> sample
array([0, 1, 2, 3, 4], dtype=int32)
>>> csample = contextwin(sample, 7)
>>> csample
[[-1, -1, -1, 0, 1, 2, 3],
 [-1, -1, 0, 1, 2, 3, 4],
 [-1, 0, 1, 2, 3, 4, -1],
 [ 0, 1, 2, 3, 4, -1, -1],
 [ 1, 2, 3, 4, -1, -1, -1]]
>>> f = theano.function(inputs=[idxs], outputs=x)
>>> f(csample)
array([[-0.08088442, 0.08458307, 0.05064092, ..., 0.06876887, -0.06648078, -0.15192257],
       [-0.08088442, 0.08458307, 0.05064092, ..., 0.11192625, 0.08745284, 0.04381778],
       [-0.08088442, 0.08458307, 0.05064092, ..., -0.00937143, 0.10804889, 0.1247109 ],
       [ 0.11038255, -0.10563177, -0.18760249, ..., -0.00937143, 0.10804889, 0.1247109 ],
       [ 0.18738101, 0.14727569, -0.069544 , ..., -0.00937143, 0.10804889, 0.1247109 ]], dtype=float32)
>>> f(csample).shape
(5, 350)
This matrix is the sequence of context-window word embeddings. Each row is a vector corresponding to one word of the sentence, so it is now straightforward to apply an RNN to it.
An (Elman) recurrent neural network (E-RNN) takes as input the current input (time t) and the previous hidden layer state (time t-1). In the matrix of the previous section, every row corresponds to a time step t: row 0 corresponds to t = 0, row 1 to t = 1, and so on.
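A minimal numpy sketch of one Elman step (the same update as the Theano recurrence function further below), using toy dimensions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

de, cs, nh, nc = 50, 5, 100, 128                 # toy dimensions
x_t = np.random.randn(de * cs)                   # context-window embedding at time t
h_tm1 = np.zeros(nh)                             # previous hidden state (h0 at t = 0)
wx, wh = np.random.randn(de * cs, nh), np.random.randn(nh, nh)
w, bh, b = np.random.randn(nh, nc), np.zeros(nh), np.zeros(nc)

h_t = sigmoid(np.dot(x_t, wx) + np.dot(h_tm1, wh) + bh)   # new hidden state
scores = np.dot(h_t, w) + b
s_t = np.exp(scores) / np.exp(scores).sum()                # softmax over the nc classes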
The parameters to be learned in the E-RNN are: the word embeddings, the initial hidden state, the two weight matrices that project the input at time t and the previous hidden layer state at time t-1, the softmax output layer and the biases (see self.params in the code below).
The hyperparameters define the architecture: the hidden layer dimension, the number of classes, the vocabulary size, the embedding dimension and the context window size (nh, nc, ne, de, cs in the constructor below).
class RNNSLU(object):
    ''' elman neural net model '''
    def __init__(self, nh, nc, ne, de, cs):
        '''
        nh :: dimension of the hidden layer
        nc :: number of classes
        ne :: number of word embeddings in the vocabulary
        de :: dimension of the word embeddings
        cs :: word window context size
        '''
        # parameters of the model
        self.emb = theano.shared(name='embeddings',
                                 value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                 (ne+1, de))
                                 # add one for padding at the end
                                 .astype(theano.config.floatX))
        self.wx = theano.shared(name='wx',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (de * cs, nh))
                                .astype(theano.config.floatX))
        self.wh = theano.shared(name='wh',
                                value=0.2 * numpy.random.uniform(-1.0, 1.0,
                                (nh, nh))
                                .astype(theano.config.floatX))
        self.w = theano.shared(name='w',
                               value=0.2 * numpy.random.uniform(-1.0, 1.0,
                               (nh, nc))
                               .astype(theano.config.floatX))
        self.bh = theano.shared(name='bh',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))
        self.b = theano.shared(name='b',
                               value=numpy.zeros(nc,
                               dtype=theano.config.floatX))
        self.h0 = theano.shared(name='h0',
                                value=numpy.zeros(nh,
                                dtype=theano.config.floatX))

        # bundle
        self.params = [self.emb, self.wx, self.wh, self.w,
                       self.bh, self.b, self.h0]
Note that ne plays the same role as nv above (the size of our vocabulary, 1000).
emb has shape (ne+1, de): each row corresponds to one word of the vocabulary and tells how that word is mapped to a de-dimensional vector. The emb matrix is itself learned during training.
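Note that the padding index -1 used by contextwin simply selects the last row of emb, which is exactly the extra row added by (ne+1); a small numpy check:

import numpy as np

ne, de = 1000, 50
emb = 0.2 * np.random.uniform(-1.0, 1.0, (ne + 1, de))   # one extra row for padding

# index -1 (the context-window padding) picks that extra last row
assert np.allclose(emb[-1], emb[ne])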
The input is built from the embedding matrix:
idxs = T.imatrix()
x = self.emb[idxs].reshape((idxs.shape[0], de * cs))
y_sentence = T.ivector('y_sentence')  # labels
The recurrence is implemented with Theano's scan:
def recurrence(x_t, h_tm1):
    h_t = T.nnet.sigmoid(T.dot(x_t, self.wx)
                         + T.dot(h_tm1, self.wh) + self.bh)
    s_t = T.nnet.softmax(T.dot(h_t, self.w) + self.b)
    return [h_t, s_t]

[h, s], _ = theano.scan(fn=recurrence,
                        sequences=x,
                        outputs_info=[self.h0, None],
                        n_steps=x.shape[0])

p_y_given_x_sentence = s[:, 0, :]
y_pred = T.argmax(p_y_given_x_sentence, axis=1)
Theano computes the gradients automatically, and the log-likelihood is maximized. (This differs slightly from the tutorial code, where the cost and training step are also defined on the last word of the context window, e.g. p_y_given_x_lastword = s[-1, 0, :], with -1 taking the last element; here everything is done over the whole sentence.)
from collections import OrderedDict  # needed for the parameter updates

lr = T.scalar('lr')

sentence_nll = -T.mean(T.log(p_y_given_x_sentence)
                       [T.arange(x.shape[0]), y_sentence])
sentence_gradients = T.grad(sentence_nll, self.params)
sentence_updates = OrderedDict((p, p - lr * g)
                               for p, g in
                               zip(self.params, sentence_gradients))
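The indexing [T.arange(x.shape[0]), y_sentence] picks, for each word, the predicted probability of its correct label; a numpy equivalent with made-up numbers:

import numpy as np

# p[i, j] = predicted probability of label j for word i (3 words, 4 labels)
p = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.2, 0.6, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
y = np.array([0, 1, 3])                    # correct label of each word

per_word = p[np.arange(p.shape[0]), y]     # [0.7, 0.6, 0.7]
nll = -np.mean(np.log(per_word))           # sentence negative log-likelihood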
The classification and training functions:
self.classify = theano.function(inputs=[idxs], outputs=y_pred)
self.sentence_train = theano.function(inputs=[idxs, y_sentence, lr],
                                      outputs=sentence_nll,
                                      updates=sentence_updates)
To keep the word embeddings (50-dimensional vectors) on the unit sphere, they are normalized after every update:
self.normalize = theano.function(inputs=[],
                                 updates={self.emb:
                                          self.emb /
                                          T.sqrt((self.emb**2)
                                          .sum(axis=1))
                                          .dimshuffle(0, 'x')})
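A numpy equivalent of that update (dimshuffle(0, 'x') just adds a broadcastable axis, like [:, None]):

import numpy as np

emb = np.random.uniform(-1.0, 1.0, (1001, 50))
emb = emb / np.sqrt((emb ** 2).sum(axis=1))[:, None]        # divide each row by its L2 norm

assert np.allclose(np.sqrt((emb ** 2).sum(axis=1)), 1.0)    # every row now has unit norm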
Training uses stochastic gradient descent with one sentence per minibatch: the parameters are updated once per sentence, and the word embeddings are normalized after every update.
Early stopping: train for a given number of epochs, evaluate the F1 score on the validation set after every epoch, and keep the best model.
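A rough sketch of how such a training loop could look; rnn is an RNNSLU instance, and names like train_lex, train_y, valid_lex, valid_y, evaluate_f1, n_epochs, lr and cs are placeholders, not the tutorial's exact code:

import numpy

best_f1 = -numpy.inf
for epoch in range(n_epochs):
    for sentence, labels in zip(train_lex, train_y):      # one sentence = one minibatch
        cwords = contextwin(sentence, cs)
        words = numpy.asarray(cwords, dtype='int32')
        rnn.sentence_train(words, labels, lr)             # one SGD update per sentence
        rnn.normalize()                                   # re-project embeddings onto the unit sphere

    f1 = evaluate_f1(rnn, valid_lex, valid_y)             # placeholder: conlleval F1 on the validation set
    if f1 > best_f1:                                      # keep the parameters of the best epoch
        best_f1 = f1
        best_params = [p.get_value().copy() for p in rnn.params]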
The hyperparameters (see the constructor arguments above, plus the learning rate) are selected with KISS random search.
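KISS random search simply draws each hyperparameter independently at random and keeps the configuration with the best validation F1; a hedged sketch (the ranges below are illustrative, not the tutorial's exact ones):

import random

def sample_config():
    return {'lr': 10 ** random.uniform(-2, 0),        # learning rate on a log scale
            'win': random.choice([3, 5, 7, 9]),       # context window size
            'nh': random.choice([100, 200]),          # number of hidden units
            'de': random.choice([50, 100]),           # embedding dimension
            'seed': random.randint(0, 10**6)}

configs = [sample_config() for _ in range(20)]         # train one model per config, keep the best F1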
The point of the tutorial is to use an RNN for word embedding, i.e. to learn how to turn words into good vectors: during training the emb parameter matrix is updated again and again, and the learned embeddings are the final result.
Word similarity is measured both with the L2 (Euclidean) distance and with the cosine distance; the two give the same results (the embeddings are normalized to unit length). The k nearest neighbours are listed in the table below; a small sketch of how such neighbours can be computed follows the table.
atlanta | back | ap80 | but | aircraft | business | a | august | actually | cheap |
---|---|---|---|---|---|---|---|---|---|
phoenix | live | ap57 | if | plane | coach | people | september | provide | weekday |
denver | lives | ap | up | service | first | do | january | prices | weekdays |
tacoma | both | connections | a | airplane | fourth | but | june | stop | am |
columbus | how | tomorrow | now | seating | thrift | numbers | december | number | early |
seattle | me | before | amount | stand | tenth | abbreviation | november | flight | sfo |
minneapolis | out | earliest | more | that | second | if | april | there | milwaukee |
pittsburgh | other | connect | abbreviation | on | fifth | up | july | serving | jfk |
ontario | plane | thrift | restrictions | turboprop | third | serve | jfk | thank | shortest |
montreal | service | coach | mean | mean | twelfth | database | october | ticket | bwi |
philadelphia | fare | today | interested | amount | sixth | passengers | may | are | lastest |
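A minimal sketch of the neighbour lookup, assuming emb is the learned embedding matrix as a numpy array (e.g. obtained via self.emb.get_value()), index2word is the dictionary used earlier, and word2index is its hypothetical inverse:

import numpy as np

def nearest_words(word, k=10):
    # k nearest neighbours of `word` in the embedding space (cosine distance)
    E = emb / np.sqrt((emb ** 2).sum(axis=1))[:, None]   # L2-normalize the rows
    sims = E.dot(E[word2index[word]])                    # cosine similarity to every word
    best = np.argsort(-sims)[1:k + 1]                    # skip the word itself
    return [index2word[i] for i in best]

# e.g. nearest_words('atlanta') should list other city names, as in the table above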