
RNN Study Notes (5): RNN Code Implementation

1. A Brief Overview of Language Models (LM)

Anyone who has worked on NLP tasks should know what a language model is. Simply put, if we treat a sentence s as a sequence of n words w, then the probability of generating the sentence is:
$P(s) = p(w_1, w_2, \ldots, w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1,w_2)\cdots p(w_n|w_1,\ldots,w_{n-1})$
Obviously, the right-hand side is very hard to compute directly, since it requires estimating n joint conditional distributions. We therefore usually make simplifying assumptions, e.g., that each word depends only on the word immediately before it. The expression then simplifies to:
$P(s) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_2)\cdots p(w_n|w_{n-1})$
If each word depends on the two preceding words:
$P(s) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1,w_2)\cdots p(w_n|w_{n-2},w_{n-1})$
These are called 2-gram (bigram) and 3-gram (trigram) models, respectively; 4-gram, 5-gram, and so on are defined analogously. The more preceding words the model conditions on, the more accurate it is, but the higher the training cost.
For simplicity, we will use an RNN to train a 2-gram model here, i.e., use the RNN to estimate the conditional probability $p(w_i|w_j)$ for every pair of words.
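
To make the 2-gram factorization above concrete, here is a minimal count-based sketch (not part of the RNN code; the corpus format and function names are made up for illustration). The RNN below learns the same kind of conditional probabilities, but with a dense hidden representation instead of raw counts.

from collections import defaultdict

def train_bigram_counts(corpus):
    # corpus: a list of tokenized sentences, each already wrapped in
    # SENTENCE_START / SENTENCE_END tokens as in Section 3.0.1
    unigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for sent in corpus:
        for w1, w2 in zip(sent[:-1], sent[1:]):
            unigram_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return unigram_counts, bigram_counts

def sentence_prob(sent, unigram_counts, bigram_counts):
    # P(s) = prod_t p(w_t | w_{t-1}); the leading p(w_1) term is covered by
    # conditioning the first real word on SENTENCE_START
    p = 1.0
    for w1, w2 in zip(sent[:-1], sent[1:]):
        if unigram_counts[w1] == 0:
            return 0.0
        p *= float(bigram_counts[(w1, w2)]) / unigram_counts[w1]
    return p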

2. RNN Network Structure

The number of training samples (sentences) is $N_s$ and the target vocabulary contains $N_w$ words. Each word is represented as a one-hot vector (i.e., by its index in the vocabulary), so the number of input (output) layer nodes is $N_i = N_o = N_w$; the number of hidden nodes is $N_h$ (typically around one tenth of $N_i$). The hidden-layer activation is tanh and the output-layer activation is softmax. The hidden layer has a recurrent edge: its output at each time step, multiplied by the weight matrix W, becomes part of the hidden-layer input at the next time step.
Weight matrices:
U: Input Layer -> Hidden(t) Layer, of size $N_h \times N_i$
V: Hidden(t) Layer -> Output Layer, of size $N_o \times N_h$
W: Hidden(t-1) Layer -> Hidden(t) Layer, of size $N_h \times N_h$
$f_h$: hidden-layer activation function
$f_o$: output-layer activation function
$s_k(t)$: output of the k-th hidden node at time t
$o_k(t)$: output of the k-th output node at time t
$y_k$: target output of the k-th output node (i.e., the teaching signal)
(Figure 1: the RNN network unrolled over time)
As shown in the figure above, the network can be unrolled indefinitely in time: at each time step t it has an input vector $x(t)$, a hidden-state vector $s(t)$, and an output vector $o(t)$.
The input vector $x(t)$ and the expected output vector $y(t)$ are arranged as follows (assuming the input sentence consists of 4 words):

As you can see, y is simply x shifted by one time step, i.e., $y(t) = x(t+1)$.
Therefore the loss at time t can be computed from $y(t)$ and $o(t)$; the details are given in the sections below.
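
Since the figure illustrating this arrangement did not survive, here is a minimal sketch of the same idea (the example word indices are made up for illustration):

# A hypothetical 4-word sentence, already mapped to vocabulary indices and
# wrapped in SENTENCE_START / SENTENCE_END tokens (indices 0 and 1 here)
sentence = [0, 45, 12, 7, 93, 1]     # START, w1, w2, w3, w4, END

x = sentence[:-1]    # [0, 45, 12, 7, 93]  -> network input at t = 0 .. T-2
y = sentence[1:]     # [45, 12, 7, 93, 1]  -> target output at the same steps

# y[t] == x[t+1]: at every time step the network is trained to predict
# the next word of the sentence.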

3. Python Code Walkthrough

We implement the model with Python's NumPy package; the class is named RNNNumpy. If your copy of the code contains non-ASCII comments and PyDev reports an encoding error, add "# coding=utf-8" at the top of the .py file.

3.0 Key Code in the Main Program

3.0.1 Reading the File

Read the file line by line, where each line is one sentence, and add SENTENCE_START and SENTENCE_END tokens at the beginning and end of every sentence.

import csv
import itertools
import nltk

sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

with open('data/script.txt', 'rb') as f:
    reader = csv.reader(f, skipinitialspace=True)
    reader.next()  # skip the header line
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].decode('utf-8').lower()) for x in reader])
    # Prepend SENTENCE_START and append SENTENCE_END to every sentence
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print "Parsed %d sentences." % (len(sentences))

3.0.2 Tokenization and Word-Frequency Statistics

vocabulary_size: the size of the vocabulary, i.e., $N_w$. Here it is the number of distinct words in the tokenized training corpus;
tokenized_sentences: the tokenized sentences, each represented as a list of words. Tokenization here uses nltk's English word tokenizer; to handle Chinese you would first need to segment the corpus with a dedicated Chinese tokenizer;
word_freq: the number of times each word occurs in the corpus

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
vocabulary_size = len(word_freq.items())

3.0.3 Removing Low-Frequency Words and Mapping Words to Indices

vocab: keeps only the $N_w - 1$ most frequent words of the vocabulary (i.e., the least frequent words are dropped)
word_to_index: the mapping from word to index

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
# Store all (distinct) vocabulary words in index_to_word
index_to_word = [x[0] for x in vocab]
# Append the unknown-word (OOV) token
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

3.0.4 Replacing Out-of-Vocabulary Words with the Unknown Token

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

3.0.5 Generating the Training Data

X_train: the input x in the formulas. For each sentence we take the first T-1 elements of its word-index list, so each row of X_train stores the word indices of one sentence from t=0 to t=T-2, where T is the length of that sentence;
y_train: the target y in the formulas. For each sentence we take the last T-1 elements (i.e., everything except the first element), so each row of y_train stores the word indices of that sentence from t=1 to t=T-1.

# Create the training data
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences])
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences])

3.0.6 Constructing the Model

model = RNNNumpy(vocabulary_size, hidden_dim=_HIDDEN_DIM)
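
A quick sanity check after construction (a sketch, assuming X_train from Section 3.0.5 and _HIDDEN_DIM are already defined): run forward propagation on one training sentence and inspect the output shapes.

o, s = model.forward_propagation(X_train[0])
# o has shape (T, vocabulary_size): one probability distribution over the
# vocabulary per time step; s has shape (T+1, _HIDDEN_DIM), with one extra
# row for the initial hidden state.
print o.shape, s.shape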

3.0.7 Training the Model

def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=1, evaluate_loss_after=5):
    """Train an RNN model with the SGD algorithm.

    Parameters
    ----------
    model : the model to be trained
    X_train : the input x
    y_train : the expected output values
    learning_rate : the learning rate
    nepoch : the number of epochs (full passes over the training data)
    evaluate_loss_after : evaluate the loss once every `evaluate_loss_after` epochs
    """
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    # Each pass through this outer loop is one full iteration (epoch) over the training data
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
            print "%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss)
            # Adjust the learning rate if the loss increases:
            # if the current loss is larger than the previous one, halve the learning rate
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5  
                print "Setting learning rate to %f" % learning_rate
            sys.stdout.flush()
            # ADDED! Saving model parameters (note: this helper expects Theano shared variables)
            save_model_parameters_theano("./data/rnn-theano-%d-%d-%s.npz" % (model.hidden_dim, model.word_dim, time), model)
        # For each training example...
        # Run one round of SGD over all training examples
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
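
A typical call looks like the following (a sketch; the hyperparameter values are illustrative, not necessarily the ones the author used):

train_with_sgd(model, X_train, y_train,
               learning_rate=0.005, nepoch=50, evaluate_loss_after=5)
# The loss is printed every 5 epochs, and the learning rate is halved
# automatically whenever the evaluated loss goes up.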

3.0.8 Utility Functions

The utility functions live in utils.py; they should be self-explanatory, so we will not go into detail:

def softmax(x):
    xt = np.exp(x - np.max(x))
    return xt / np.sum(xt)

def save_model_parameters_theano(outfile, model):
    U, V, W = model.U.get_value(), model.V.get_value(), model.W.get_value()
    np.savez(outfile, U=U, V=V, W=W)
    print "Saved model parameters to %s." % outfile

def load_model_parameters_theano(path, model):
    npzfile = np.load(path)
    U, V, W = npzfile["U"], npzfile["V"], npzfile["W"]
    model.hidden_dim = U.shape[0]
    model.word_dim = U.shape[1]
    model.U.set_value(U)
    model.V.set_value(V)
    model.W.set_value(W)
    print "Loaded model parameters from %s. hidden_dim=%d word_dim=%d" % (path, U.shape[0], U.shape[1])

3.1 The RNN Implementation Class RNNNumpy

3.1.1 Initialization

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        """ Instantiates the RNN class. Parameters ---------- word_dim : 输入词向量的维度,如果是one hot,显然应该等于Input Layer 的结点数 hidden_dim : 隐层的结点数 bptt_truncate:BPTT反向传播的时间范围 """

        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate

        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

3.1.2 Forward Propagation

The forward-propagation equations are:
$f_h(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
$f_o(x, z) = \mathrm{softmax}(x) = \dfrac{e^{x}}{z}$
Hidden-layer input at time t: $z^{(h)}(t) = U x(t) + W s(t-1)$
Hidden-layer output at time t: $s(t) = f_h\big(z^{(h)}(t)\big)$
Output-layer input at time t: $z^{(o)}(t) = V s(t)$
Network output at time t: $o(t) = f_o\big(z^{(o)}(t), z\big)$, where $z = \sum_{k=1}^{N_o} e^{z^{(o)}_k(t)}$
Output of output-layer node i at time t: $o_i(t) = \dfrac{e^{z^{(o)}_i(t)}}{\sum_{k=1}^{N_o} e^{z^{(o)}_k(t)}} = p(w_{t+1} = w_i \mid w_t)$

    def forward_propagation(self, x):
        """ forward propagation algorithm. Parameters ---------- x : 输入句子对应的词向量list,其长度等于句子中的word数目,每一个word的词向量作为一个时刻t的网络输入 Returns ------- [o, s]: o:每一时刻的网络输出值,其大小为T×N_o 其中的一个元素o[t][i]含义为$p(w_{t+1}=w_i|w_t)$,即已知t时刻输入word $w_t$,的前提下,下一个单词为$w_i$的概率 s:每一时刻Hidden Layer节点的输出值,其长度为T×N_h 其中的一个元素s[t][i]表示t时刻隐层第i个节点的输出值 """
        # The total number of time steps
        T = len(x)
        # During forward propagation we save all hidden states in s because need them later.
        # We add one additional element for the initial hidden, which we set to 0
        s = np.zeros((T + 1, self.hidden_dim))
        s[-1] = np.zeros(self.hidden_dim)
        # The outputs at each time step. Again, we save them for later.
        o = np.zeros((T, self.word_dim))
        # For each time step...
        for t in np.arange(T):
            # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
            s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
            o[t] = softmax(self.V.dot(s[t]))
        return [o, s]
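
Once forward propagation works, predicting the most likely next word at each time step is just an argmax over every row of o. A minimal sketch of such a helper (the reference project defines a method like this, but the exact implementation here is an assumption):

    def predict(self, x):
        # Perform forward propagation and return, for each time step, the index
        # of the word with the highest predicted probability
        o, s = self.forward_propagation(x)
        return np.argmax(o, axis=1)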

3.1.3 Loss Function

Here we use the cross-entropy loss function:
$L_t = \mathrm{CEH}\big(y(t), o(t)\big) = -E_y\big[\log o(t)\big] = -\sum_{i=0}^{N_w-1} y_i(t)\,\log o_i(t)$
Because

$y_i(t) = \begin{cases} 1 & \text{if } w_i \text{ is the target word at time } t \\ 0 & \text{otherwise} \end{cases}$

we have $L_t = -\sum_{i=0}^{N_w-1} \mathbb{1}\{y_i(t)=1\}\,\log o_i(t)$; clearly only one term of the sum is non-zero at each time step.
The loss of an input sentence is the sum of the losses at all of its time steps:
$L_s = \sum_{t=0}^{T_s-1} L_t$
Finally, we average over all sentences:
$L = E[L_s] = -\dfrac{1}{N_s}\sum_{j=0}^{N_s-1}\sum_{t=0}^{T_s-1}\sum_{i=0}^{N_w-1} \mathbb{1}\{y_i(t)=1\}\,\log o_i^{(j)}(t)$
$= \dfrac{1}{N_s}\sum_{j=0}^{N_s-1}\sum_{t=0}^{T_s-1} L_t^{(j)}$
Here $L_t^{(j)}$ denotes the loss of the j-th sentence at time step t. (Note that the code below actually divides the total loss by the total number of words across all sentences rather than by $N_s$; this only changes the scale of the reported loss.)

    def calculate_total_loss(self, x, y):
        L = 0
        # For each sentence...
        for i in np.arange(len(y)):
            o, s = self.forward_propagation(x[i])
            # We only care about our prediction of the "correct" words
            correct_word_predictions = o[np.arange(len(y[i])), y[i]]
            # Add to the loss based on how off we were
            L += -1 * np.sum(np.log(correct_word_predictions))
        return L

    def calculate_loss(self, x, y):
        """ loss function. Parameters ---------- x : 输入句子对应的词向量list 矩阵,矩阵的行数等于输入句子的数目,而每一行存储的内容为各个句子对应的word index列表 y : 训练值,即期望的输出word index 列表 Returns avg loss:所有训练样本平均loss ------- """
        # Divide the total loss by the total number of target words
        N = np.sum([len(y_i) for y_i in y])
        return self.calculate_total_loss(x,y)/N
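
A useful sanity check before training (a sketch, assuming model, X_train, y_train, and vocabulary_size are defined as above): with randomly initialized weights every word should be predicted with roughly equal probability, so the per-word loss should be close to ln(N_w).

# Expected loss for random (uniform) predictions: -ln(1/N_w) = ln(N_w)
print "Expected loss for random predictions: %f" % np.log(vocabulary_size)
# Actual loss of the untrained model, evaluated on a small subset to keep it fast
print "Actual loss: %f" % model.calculate_loss(X_train[:100], y_train[:100])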

3.1.4 Backpropagation and the BPTT Algorithm

Let $\delta^{(o)}_k(t)$ denote the derivative of the loss at time t with respect to the input $z^{(o)}_k(t)$ of output-layer node k. For the node corresponding to the target word (where $y_k(t)=1$, so that $L_t = -\log o_k(t)$):
$\delta^{(o)}_k(t) = \dfrac{\partial L_t}{\partial z^{(o)}_k(t)} = \dfrac{\partial L_t}{\partial o_k(t)}\dfrac{\partial o_k(t)}{\partial z^{(o)}_k(t)}$
$= \left[-\dfrac{1}{o_k(t)}\right]\dfrac{\partial}{\partial z^{(o)}_k(t)}\left[\dfrac{e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}}\right]$
$= \left[-\dfrac{1}{o_k(t)}\right]\dfrac{e^{z^{(o)}_k(t)}\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)} - e^{z^{(o)}_k(t)}\,e^{z^{(o)}_k(t)}}{\left[\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}\right]^2}$
$= \left[-\dfrac{1}{o_k(t)}\right] o_k(t)\,\big[1 - o_k(t)\big]$
$= o_k(t) - 1$
Recall from Section 3.1.2 that $o_k(t) = \dfrac{e^{z^{(o)}_k(t)}}{\sum_{j=1}^{N_o} e^{z^{(o)}_j(t)}}$.
For all other output nodes a similar calculation gives $\delta^{(o)}_i(t) = o_i(t)$, so in vector form:
$\delta^{(o)}(t) = o(t) - y(t)$
i.e., 1 is subtracted only from the component corresponding to the target word, which is exactly what the code does:

delta_o[np.arange(len(y)), y] -= 1.

Differentiating with respect to the input $z^{(h)}_k(t)$ of hidden node k:
$\delta^{(h)}_k(t) = \dfrac{\partial L_t}{\partial z^{(h)}_k(t)}$
$= \sum_j \dfrac{\partial L_t}{\partial o_j(t)}\dfrac{\partial o_j(t)}{\partial z^{(o)}_j(t)}\dfrac{\partial z^{(o)}_j(t)}{\partial s_k(t)}\dfrac{\partial s_k(t)}{\partial z^{(h)}_k(t)}$
$= \sum_j \delta^{(o)}_j(t)\, v_{jk}\, f_h'\big(z^{(h)}_k(t)\big)$
$= \sum_j \delta^{(o)}_j(t)\, v_{jk}\, \big[1 - \tanh^2\big(z^{(h)}_k(t)\big)\big]$
$= \sum_j \delta^{(o)}_j(t)\, v_{jk}\, \big[1 - s_k(t)^2\big]$
In vector form:
$\delta^{(h)}(t) = \big[V^{\top}\delta^{(o)}(t)\big] \circ \big[1 - s(t)^{2}\big]$
where $V^{\top}\delta^{(o)}(t)$ is an ordinary matrix-vector product and $\circ$ denotes the entrywise (Hadamard) product.
The corresponding code is:

delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))

Similarly, unrolling the hidden layer one more step back in time gives:
$\delta^{(h)}(t-1) = \big[W^{\top}\delta^{(h)}(t)\big] \circ \big[1 - s(t-1)^{2}\big]$ (note that the weight matrix is now W!)
The corresponding code is:

delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)

With the deltas in hand, we can compute the gradients with respect to V, W, and U. For V:
$\dfrac{\partial L_t}{\partial V} = \dfrac{\partial L_t}{\partial o(t)}\dfrac{\partial o(t)}{\partial z^{(o)}(t)}\dfrac{\partial z^{(o)}(t)}{\partial V} = \delta^{(o)}(t) \otimes s(t) = \big[o(t) - y(t)\big] \otimes s(t)$
For a whole sentence, the gradients of every time step (word) are summed:
$\Delta V = \sum_t \dfrac{\partial L_t}{\partial V} = \sum_t \delta^{(o)}(t) \otimes s(t)$
Here $\otimes$ denotes the outer product.
The corresponding code is:

dLdV += np.outer(delta_o[t], s[t].T)

For W and U, the error has to be propagated back through the truncated window of at most $T_r$ = bptt_truncate steps before the current time step t:
$\dfrac{\partial L_t}{\partial W} = \sum_{\tau = t - T_r}^{t} \dfrac{\partial L_t}{\partial z^{(h)}(\tau)}\dfrac{\partial z^{(h)}(\tau)}{\partial W} = \sum_{\tau = t - T_r}^{t} \delta^{(h)}(\tau) \otimes s(\tau - 1)$
$\dfrac{\partial L_t}{\partial U} = \sum_{\tau = t - T_r}^{t} \delta^{(h)}(\tau) \otimes x(\tau)$
See the second for loop in the code below. Note that because the entries of $x(\tau)$ are 0 or 1 (one-hot), the code skips the explicit multiplication and simply adds delta_t to the column of dLdU selected by the word index.

Note that the code contains two nested backward for loops:
The outer loop iterates over each time step t (i.e., over each word of the current sentence), while the inner loop iterates over at most bptt_truncate (the $T_r$ in the formulas) steps back through the hidden layer. bptt_truncate is a predefined parameter that limits how far back in time the error may be propagated. A larger depth is closer to full (untruncated) BPTT, but increases both the computational cost and the risk of unstable training.

    def bptt(self, x, y):
        """ BPTT algorithm. Parameters ---------- x : 输入句子对应的词向量list,其长度等于句子中的word数目,每一个word的词向量作为一个时刻t的网络输入 y : 训练值,即期望的输出word index 列表 Returns dLdU:损失函数对U矩阵的导数 dLdV:损失函数对V矩阵的导数 dLdW:损失函数对W矩阵的导数 ------- """
        T = len(y)
        # Perform forward propagation
        o, s = self.forward_propagation(x)
        # We accumulate the gradients in these variables
        dLdU = np.zeros(self.U.shape)
        dLdV = np.zeros(self.V.shape)
        dLdW = np.zeros(self.W.shape)
        delta_o = o
        delta_o[np.arange(len(y)), y] -= 1.
        # For each output backwards...
        for t in np.arange(T)[::-1]:
            dLdV += np.outer(delta_o[t], s[t].T)
            # Initial delta calculation
            delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
            # Backpropagation through time (for at most self.bptt_truncate steps)
            for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
                # print "Backpropagation step t=%d bptt step=%d " % (t, bptt_step)
                dLdW += np.outer(delta_t, s[bptt_step-1])              
                dLdU[:,x[bptt_step]] += delta_t
                # Update delta for next step
                delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
        return [dLdU, dLdV, dLdW]
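
A standard way to verify the gradients returned by bptt is numerical gradient checking: perturb each parameter by a small h and compare the finite-difference estimate with the analytic gradient. Here is a minimal sketch (an illustrative helper, not part of the class above; run it only with a tiny vocabulary, since it loops over every single parameter):

def gradient_check(model, x, y, h=0.001, error_threshold=0.01):
    # Analytic gradients from BPTT (same order as the parameter list below)
    bptt_gradients = model.bptt(x, y)
    for pidx, pname in enumerate(["U", "V", "W"]):
        parameter = getattr(model, pname)
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            original_value = parameter[ix]
            # Central finite difference: (L(p+h) - L(p-h)) / (2h)
            parameter[ix] = original_value + h
            gradplus = model.calculate_total_loss([x], [y])
            parameter[ix] = original_value - h
            gradminus = model.calculate_total_loss([x], [y])
            estimated_gradient = (gradplus - gradminus) / (2 * h)
            parameter[ix] = original_value
            backprop_gradient = bptt_gradients[pidx][ix]
            # Relative error between the numerical and the analytic gradient
            relative_error = np.abs(backprop_gradient - estimated_gradient) / \
                (np.abs(backprop_gradient) + np.abs(estimated_gradient) + 1e-12)
            if relative_error > error_threshold:
                print "Gradient check failed: parameter=%s ix=%s" % (pname, str(ix))
                print "Estimated: %f, BPTT: %f" % (estimated_gradient, backprop_gradient)
                return
            it.iternext()
    print "Gradient check passed for U, V and W."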

4. Step by Step in MATLAB

5. References

Introduction to language models
Reference project
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
