Anyone who has worked on NLP tasks should know what a language model is. Simply put, if we treat a sentence s as a sequence of n words w, then the probability of generating the sentence is:
$P(s) = p(w_1, w_2, \ldots, w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1,w_2)\cdots p(w_n|w_1,\ldots,w_{n-1})$
Clearly, the right-hand side is very hard to compute: it requires estimating n conditional distributions over ever-longer histories. We therefore usually make simplifying assumptions, for example that each word depends only on the word immediately before it. The expression then simplifies to:
$P(s) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_2)\cdots p(w_n|w_{n-1})$
If each word depends on the two preceding words:
$P(s) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1,w_2)\cdots p(w_n|w_{n-2},w_{n-1})$
These are called 2-gram (bigram) and 3-gram (trigram) models respectively; 4-gram, 5-gram and so on are defined analogously. The more preceding words the model conditions on, the more accurate it is, and the higher the training cost.
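To make the 2-gram factorization concrete before moving to the RNN, here is a minimal count-based sketch (the toy corpus and function name are made up for illustration) that estimates $p(w_i|w_j)$ by counting adjacent word pairs:

from collections import defaultdict

# Toy tokenized corpus (made up for illustration)
corpus = [["i", "like", "cats"], ["i", "like", "dogs"]]
unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sent in corpus:
    for prev, cur in zip(sent[:-1], sent[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def bigram_prob(cur, prev):
    # p(cur | prev) = count(prev, cur) / count(prev)
    return float(bigram_counts[(prev, cur)]) / unigram_counts[prev]

print(bigram_prob("cats", "like"))  # 0.5 on this toy corpus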
Here, for simplicity, we train a 2-gram model with an RNN, i.e. we use an RNN to estimate the conditional probability $p(w_i|w_j)$ for every pair of words.
Let the number of training samples (sentences) be $N_s$ and the size of the target vocabulary be $N_w$. Each word is represented one-hot (i.e. by its index in the vocabulary), so the number of input and output layer nodes is $N_i = N_o = N_w$. The hidden layer has $N_h$ nodes (typically around one tenth of $N_i$). The hidden layer uses the tanh activation and the output layer uses softmax. The hidden layer has a recurrent edge: its output at each time step, multiplied by the weight matrix W, becomes part of the input at the next time step.
Weight matrices:
U: Input Layer -> Hidden(t) Layer, of size ($N_h \times N_i$)
V: Hidden(t) Layer -> Output Layer, of size ($N_o \times N_h$)
W: Hidden(t-1) Layer -> Hidden(t) Layer, of size ($N_h \times N_h$)
$f_h$: hidden-layer activation function
$f_o$: output-layer activation function
$s_k(t)$: output of the k-th hidden-layer node at time t
$o_k(t)$: output of the k-th output-layer node at time t
$y_k$: target output of the k-th output-layer node (i.e. the learning signal)
As shown in the figure above, the network can be unrolled indefinitely in time: at every time step t it has an input vector $x(t)$, a hidden-layer output (state) vector $s(t)$, and an output vector $o(t)$.
The input vector $x(t)$ and the expected output vector $y(t)$ are structured as follows (assuming the input sentence consists of 4 words):
As you can see, y is simply x shifted by one time step, i.e. $y(t) = x(t+1)$.
Therefore the loss at time t can be computed from $y(t)$ and $o(t)$; the detailed computation is given in the sections below.
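Since the figure itself is not reproduced here, a tiny sketch of the one-step shift (with a made-up sentence of four words plus the start/end tokens) may help:

# Made-up tokenized sentence with the start/end tokens added
tokens = ["SENTENCE_START", "what", "are", "you", "doing", "SENTENCE_END"]
x = tokens[:-1]   # network input at t = 0 .. T-2
y = tokens[1:]    # expected output at the same steps: y(t) = x(t+1)
for t in range(len(x)):
    print("t=%d  x=%-14s y=%s" % (t, x[t], y[t]))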
We implement the model with Python's NumPy package, in a class named RNNNumpy. Note: if the source file contains non-ASCII comments and PyDev reports an encoding error when running it, add "# coding=utf-8" as the first line of the .py file.
Read the file line by line, split each line into sentences, and add a SENTENCE_START and a SENTENCE_END token at the beginning and end of every sentence.
import csv
import itertools
import nltk
import numpy as np

# Special tokens added to the start and end of every sentence (see above)
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

with open('data/script.txt', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    # Skip the header row
    reader.next()
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))
vocabulary_size: the size of the vocabulary, i.e. $N_w$. Here it is taken to be the number of distinct words in the tokenized training corpus;
tokenized_sentences: the tokenized sentences, each represented as a list of words. Tokenization here uses NLTK's standard English word tokenizer; to handle Chinese text, run the corpus through a dedicated Chinese tokenizer first;
word_freq: the number of times each word appears in the corpus
# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
vocabulary_size = len(word_freq.items())
vocab: stores only the $N_w - 1$ most frequent words of the vocabulary (effectively dropping the least frequent word);
word_to_index: the mapping from word to index
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
# Store all (distinct) words of the vocabulary in index_to_word
index_to_word = [x[0] for x in vocab]
# Append the token used for out-of-vocabulary (unknown) words
unknown_token = "UNKNOWN_TOKEN"
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]
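As a quick, hypothetical illustration of the two mappings built above, a tokenized sentence is first cleaned of out-of-vocabulary words and then converted to indices:

# Hypothetical sentence; assume "frobnicate" is not in the vocabulary
sent = ["SENTENCE_START", "hello", "frobnicate", "SENTENCE_END"]
sent = [w if w in word_to_index else unknown_token for w in sent]
indices = [word_to_index[w] for w in sent]
print(indices)  # e.g. [0, 42, 7999, 1] -- the exact numbers depend on the corpus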
X_train: the input x in the formulas. For each sentence we take the first T-1 elements of its word-index list, so each row of X_train holds that sentence's word indices from time t=0 to t=T-2, where T is the sentence length;
y_train: the training targets y in the formulas. For each sentence we take the last T-1 elements of its word-index list (i.e. drop the first element), so each row of y_train holds that sentence's word indices from time t=1 to t=T-1;
# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])
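Printing one training pair should show the one-step shift described earlier (using the variables built above):

# The words of the first training example, recovered through index_to_word
print("x: %s" % " ".join(index_to_word[i] for i in X_train[0]))
print("y: %s" % " ".join(index_to_word[i] for i in y_train[0]))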
model = RNNNumpy(vocabulary_size, hidden_dim=_HIDDEN_DIM)
import sys
from datetime import datetime

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=1, evaluate_loss_after=5):
    """Train the RNN model with SGD.

    Parameters
    ----------
    model : the model to be trained
    X_train : input x
    y_train : expected output values
    learning_rate : learning rate
    nepoch : number of training epochs
    evaluate_loss_after : loss-evaluation interval; the loss is estimated once every evaluate_loss_after epochs
    """
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    # Each pass of this loop is one full iteration (epoch) over the training data
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
            print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
            # Adjust the learning rate if the loss increases:
            # if the current loss is larger than the previous one, halve the learning rate
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print "Setting learning rate to %f" % learning_rate
            sys.stdout.flush()
            # ADDED! Saving model parameters
            save_model_parameters_theano("./data/rnn-theano-%d-%d-%s.npz" % (model.hidden_dim, model.word_dim, time), model)
        # For each training example: run one SGD step per example
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
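train_with_sgd calls model.sgd_step, which is not listed in this section. A minimal sketch of what such a method needs to do (one plain gradient-descent update using the bptt method defined further below) could look like:

def sgd_step(self, x, y, learning_rate):
    # Compute the gradients for this example with BPTT
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Move each parameter matrix a small step against its gradient
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW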
The helper functions live in utils.py; they are straightforward, so we will not explain them in detail here:
def softmax(x):
    xt = np.exp(x - np.max(x))
    return xt / np.sum(xt)

def save_model_parameters_theano(outfile, model):
    U, V, W = model.U.get_value(), model.V.get_value(), model.W.get_value()
    np.savez(outfile, U=U, V=V, W=W)
    print "Saved model parameters to %s." % outfile

def load_model_parameters_theano(path, model):
    npzfile = np.load(path)
    U, V, W = npzfile["U"], npzfile["V"], npzfile["W"]
    model.hidden_dim = U.shape[0]
    model.word_dim = U.shape[1]
    model.U.set_value(U)
    model.V.set_value(V)
    model.W.set_value(W)
    print "Loaded model parameters from %s. hidden_dim=%d word_dim=%d" % (path, U.shape[0], U.shape[1])
def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
    """Instantiate the RNN class.

    Parameters
    ----------
    word_dim : dimension of the input word vectors; for one-hot vectors this equals the number of input-layer nodes
    hidden_dim : number of hidden-layer nodes
    bptt_truncate : how many time steps the error is propagated back during BPTT
    """
    # Assign instance variables
    self.word_dim = word_dim
    self.hidden_dim = hidden_dim
    self.bptt_truncate = bptt_truncate
    # Randomly initialize the network parameters
    self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
    self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
    self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))
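As a quick sanity check of the parameter shapes right after construction (the numbers below are arbitrary):

np.random.seed(10)
toy_model = RNNNumpy(word_dim=8000, hidden_dim=100)
print(toy_model.U.shape)  # (100, 8000) -> (N_h, N_i)
print(toy_model.V.shape)  # (8000, 100) -> (N_o, N_h)
print(toy_model.W.shape)  # (100, 100)  -> (N_h, N_h)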
The forward-propagation equations are as follows:
$f_h(x) = \tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$
$f_o(x, z) = \mathrm{softmax}(x) = \dfrac{e^x}{z}$
Hidden-layer input at time t: $z^{(h)}(t) = U \cdot x(t) + W \cdot s(t-1)$
Hidden-layer output at time t: $s(t) = f_h(z^{(h)}(t))$
Output-layer input at time t: $z^{(o)}(t) = V \cdot s(t)$
Network output at time t: $o(t) = f_o(z^{(o)}(t), z)$, where $z = \sum_{k=1}^{N_o} e^{z^{(o)}_k(t)}$
Output of output-layer node i at time t: $o_i(t) = \dfrac{e^{z^{(o)}_i(t)}}{\sum_{k=1}^{N_o} e^{z^{(o)}_k(t)}} = p(w_{t+1} = w_i \mid w_t)$
def forward_propagation(self, x):
    """Forward-propagation algorithm.

    Parameters
    ----------
    x : list of word indices for the input sentence; its length equals the number of words in the sentence, and each word is the network input at one time step t

    Returns
    -------
    [o, s]
        o : the network outputs at every time step, of size T x N_o; the element o[t][i] is $p(w_{t+1}=w_i|w_t)$, i.e. the probability that the next word is $w_i$ given the input word $w_t$ at time t
        s : the hidden-layer outputs at every time step, of size T x N_h; the element s[t][i] is the output of hidden node i at time t
    """
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]
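forward_propagation returns a full probability distribution over the vocabulary at every time step. If all you want is the single most likely next word at each step, a small (hypothetical) helper can take the argmax of each row of o:

def predict(self, x):
    # Perform forward propagation and return the index of the highest-probability word at each time step
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)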
We use the cross-entropy loss here:
$L_t = \mathrm{CEH}(y(t), o(t)) = E_y[-\log(o(t))] = -\sum_{i=0}^{N_w-1} y_i(t)\log(o_i(t))$
Because $y(t)$ is a one-hot vector, only the term of the correct word survives, so the loss at time t reduces to $-\log(o_{y_t}(t))$; summing over all time steps of all sentences gives the total loss computed below:
def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L
def calculate_loss(self, x, y):
    """Average loss function.

    Parameters
    ----------
    x : matrix of word-index lists; the number of rows equals the number of input sentences, and each row holds the word-index list of one sentence
    y : training targets, i.e. the expected output word-index lists

    Returns
    -------
    avg loss : the average loss over all training examples
    """
    # Divide the total loss by the number of training examples
    N = np.sum((len(y_i) for y_i in y))
    return self.calculate_total_loss(x,y)/N
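A useful sanity check (assuming the model, X_train and y_train built earlier): a model that predicts every word uniformly at random has a cross-entropy loss of $\log(N_w)$, so the loss of the freshly initialized network should be close to that value:

# Expected loss for uniformly random predictions over the vocabulary
print("Expected loss for random predictions: %f" % np.log(vocabulary_size))
# Loss of the untrained model on a small slice of the training data
print("Actual loss: %f" % model.calculate_loss(X_train[:100], y_train[:100]))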
Let $\delta^{(o)}_k(t)$ denote the derivative of the loss at time t with respect to the input $z^{(o)}_k(t)$ of output-layer node k:
$$
\begin{aligned}
\delta^{(o)}_k(t) &= \frac{\partial L_t}{\partial z^{(o)}_k(t)} \\
&= \frac{\partial L_t}{\partial o_k(t)}\,\frac{\partial o_k(t)}{\partial z^{(o)}_k(t)} \\
&= L_t'\,f_o'(z^{(o)}_k(t)) \\
&= \Big[-\sum_{i=0}^{N_w-1} \mathbf{1}\{y_i(t)=w_i\}\log(o_k(t))\Big]'\;\Big[\frac{e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}}\Big]' \\
&= \Big[-\frac{1}{o_k(t)}\Big]\,\frac{\big[e^{z^{(o)}_k(t)}\big]'\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)} - e^{z^{(o)}_k(t)}\big[\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}\big]'}{\big[\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}\big]^2} \\
&= \Big[-\frac{1}{o_k(t)}\Big]\,\frac{e^{z^{(o)}_k(t)}\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)} - e^{z^{(o)}_k(t)}\,e^{z^{(o)}_k(t)}}{\big[\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}\big]^2} \\
&= \Big[-\frac{1}{o_k(t)}\Big]\,\Big[\frac{e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}}\Big]\,\frac{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)} - e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}} \\
&= \Big[-\frac{1}{o_k(t)}\Big]\,\big[o_k(t)\big]\,\big[1 - o_k(t)\big] \\
&= o_k(t) - 1
\end{aligned}
$$
Note that, from Section 3.1.2 above (the forward-propagation formulas), $o_k(t) = \dfrac{e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}}$.
This result holds for the node k corresponding to the target word (where $y_k(t)=1$); for the other nodes a similar computation simply gives $o_k(t)$. Written as a vector:
$\delta^{(o)}(t) = o(t) - y(t)$
which corresponds exactly to the code:
delta_o[np.arange(len(y)), y] -= 1.
Now differentiate with respect to the input $z^{(h)}_k(t)$ of hidden-layer node k:
$$
\begin{aligned}
\delta^{(h)}_k(t) &= \frac{\partial L_t}{\partial z^{(h)}_k(t)} \\
&= \sum_j \frac{\partial L_t}{\partial o_j(t)}\,\frac{\partial o_j(t)}{\partial z^{(o)}_j(t)}\,\frac{\partial z^{(o)}_j(t)}{\partial z^{(h)}_k(t)} \\
&= \sum_j \frac{\partial L_t}{\partial o_j(t)}\,\frac{\partial o_j(t)}{\partial z^{(o)}_j(t)}\,\frac{\partial z^{(o)}_j(t)}{\partial s_k(t)}\,\frac{\partial s_k(t)}{\partial z^{(h)}_k(t)} \\
&= \sum_j \delta^{(o)}_j(t)\,v_{jk}\,f_h'(z^{(h)}_k(t)) \\
&= \sum_j \delta^{(o)}_j(t)\,v_{jk}\,\tanh'(z^{(h)}_k(t)) \\
&= \sum_j \delta^{(o)}_j(t)\,v_{jk}\,\big[1-\tanh(z^{(h)}_k(t))^2\big] \\
&= \sum_j \delta^{(o)}_j(t)\,v_{jk}\,\big[1-s_k(t)^2\big] \\
&= \big[\delta^{(o)}(t)\cdot V_k\big]\bullet\big[1-s_k(t)^2\big]
\end{aligned}
$$
In vector form:
$\delta^{(h)}(t) = \big[\delta^{(o)}(t)\cdot V\big]\bullet\big[1-s(t)^2\big]$
where the small dot ($\cdot$) denotes the ordinary matrix-vector product and the large dot ($\bullet$) denotes the entrywise (Hadamard) product.
The corresponding code is:
delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
Similarly, unrolling the hidden layer one more step back in time gives:
$\delta^{(h)}(t-1) = \big[\delta^{(h)}(t)\cdot W\big]\bullet\big[1-s(t-1)^2\big]$ (note that the weight matrix here is W!)
The corresponding code:
delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
Next, derive the gradients with respect to V, W, and U separately:
$$
\begin{aligned}
\frac{\partial L_t}{\partial V} &= \frac{\partial L_t}{\partial o(t)}\,\frac{\partial o(t)}{\partial z^{(o)}(t)} \otimes \frac{\partial z^{(o)}(t)}{\partial V} \\
&= \delta^{(o)}(t) \otimes s(t) \\
&= \big[o(t)-y(t)\big] \otimes s(t)
\end{aligned}
$$
For one sentence we need to sum this derivative over every time step (word):
$\Delta V = \sum_t \frac{\partial L_t}{\partial V} = \sum_t \big[o(t)-y(t)\big] \otimes s(t)$
*The $\otimes$ here denotes the outer product.
The corresponding code is:
dLdV += np.outer(delta_o[t], s[t].T)
$$
\begin{aligned}
\frac{\partial L_{T_s}}{\partial W} &= \sum_{t=T_s-T_r}^{T_s} \frac{\partial L_{T_s}}{\partial z^{(h)}(t)}\,\frac{\partial z^{(h)}(t)}{\partial W} \\
&= \sum_{t=T_s-T_r}^{T_s} \delta^{(h)}(t) \otimes s(t-1)
\end{aligned}
$$
$$
\frac{\partial L_{T_s}}{\partial U} = \sum_{t=T_s-T_r}^{T_s} \delta^{(h)}(t) \otimes x(t)
$$
where $T_s$ is the current output time step and $T_r$ is bptt_truncate.
See the second (inner) for loop in the code. Note that since every component of x(t) is either 0 or 1, the multiplication is omitted in the code: the gradient is simply added to the column of U selected by the non-zero element (dLdU[:, x[bptt_step]]).
Note that the code contains two reversed for loops, with the following meaning: the outermost loop iterates over each time step t (i.e. each word of the current sentence), while the second loop iterates over the bptt_truncate (i.e. $T_r$ in the formulas above) backward steps through the hidden layer. bptt_truncate is a predefined parameter giving the maximum depth to which the error can be propagated back in time during BP. The larger this depth, the closer the gradient is to the true one, but the computational cost and the likelihood of unstable (oscillating) training both increase.
def bptt(self, x, y):
    """BPTT algorithm.

    Parameters
    ----------
    x : list of word indices for the input sentence; its length equals the number of words in the sentence, and each word is the network input at one time step t
    y : training targets, i.e. the expected output word-index list

    Returns
    -------
    dLdU : derivative of the loss with respect to the matrix U
    dLdV : derivative of the loss with respect to the matrix V
    dLdW : derivative of the loss with respect to the matrix W
    """
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]
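Because BPTT is easy to get wrong, it is worth checking the analytical gradients against numerical ones on a tiny model. The sketch below (the function name and thresholds are my own choices; a similar check exists in the reference project) compares each bptt gradient with the centered finite difference $(L(\theta+h)-L(\theta-h))/2h$:

def gradient_check(model, x, y, h=0.001, error_threshold=0.01):
    # Gradients computed analytically with BPTT
    bptt_gradients = model.bptt(x, y)
    for pidx, pname in enumerate(['U', 'V', 'W']):
        parameter = getattr(model, pname)
        it = np.nditer(parameter, flags=['multi_index'])
        while not it.finished:
            ix = it.multi_index
            original_value = parameter[ix]
            # Numerical gradient: (L(p+h) - L(p-h)) / (2h)
            parameter[ix] = original_value + h
            gradplus = model.calculate_total_loss([x], [y])
            parameter[ix] = original_value - h
            gradminus = model.calculate_total_loss([x], [y])
            estimated_gradient = (gradplus - gradminus) / (2.0 * h)
            parameter[ix] = original_value
            backprop_gradient = bptt_gradients[pidx][ix]
            # Relative error between the two estimates
            relative_error = np.abs(backprop_gradient - estimated_gradient) / (np.abs(backprop_gradient) + np.abs(estimated_gradient) + 1e-12)
            if relative_error > error_threshold:
                print("Gradient check failed for %s[%s]: relative error %f" % (pname, str(ix), relative_error))
                return
            it.iternext()
    print("Gradient check passed.")

# Use a tiny vocabulary and full back-propagation depth so the check runs quickly and exactly
check_model = RNNNumpy(word_dim=100, hidden_dim=10, bptt_truncate=1000)
gradient_check(check_model, [0, 1, 2, 3], [1, 2, 3, 4])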
Introduction to language models
Reference project:
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/