Language modeling is the task of predicting which word comes next.
More formally: given a sequence of words $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(t)}$, compute the probability distribution of the next word $\mathbf{x}^{(t+1)}$:
$$p(\mathbf{x}^{(t+1)}\mid\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(t)})$$
where $\mathbf{x}^{(t+1)}$ can be any word in the vocabulary $V=\{w_1,\dots,w_{|V|}\}$. A system that does this is called a language model. Equivalently, a language model assigns a probability to a piece of text:
$$p(\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(T)})=p(\mathbf{x}^{(1)})\times p(\mathbf{x}^{(2)}\mid\mathbf{x}^{(1)})\times\cdots\times p(\mathbf{x}^{(T)}\mid\mathbf{x}^{(T-1)},\dots,\mathbf{x}^{(1)})=\prod_{t=1}^T p(\mathbf{x}^{(t)}\mid\mathbf{x}^{(t-1)},\dots,\mathbf{x}^{(1)})$$
Language modeling is a benchmark task that helps us measure our progress on understanding language.
Language models are also a subcomponent of many NLP tasks, especially those that involve generating text or estimating the probability of text:
• Predictive typing
• Speech recognition
• Handwriting recognition
• Spelling/grammar correction
• Authorship identification
• Machine translation
• Summarization
• Dialogue
• etc.
An n-gram is a chunk of n consecutive words. Depending on n there are different kinds of n-grams; for the sentence "the students opened their __" we have:
unigram: “the” “students” “opened” “their”
bigram:“the students” “students opened” “opened their”
trigram: “the students opened” “students opened their”
4-gram: “the students opened their”
The larger n is, the more context each n-gram captures, so the model can in principle be more accurate.
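For concreteness, here is a small Python sketch (not from the original notes) of extracting these n-grams from the example phrase; `extract_ngrams` is just an illustrative name.

```python
def extract_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the students opened their".split()
print(extract_ngrams(tokens, 1))  # unigrams: ('the',), ('students',), ...
print(extract_ngrams(tokens, 2))  # bigrams:  ('the', 'students'), ...
print(extract_ngrams(tokens, 3))  # trigrams: ('the', 'students', 'opened'), ...
print(extract_ngrams(tokens, 4))  # 4-gram:   ('the', 'students', 'opened', 'their')
```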
The idea of an n-gram language model is to collect statistics on how frequent different n-grams are and use them to predict the next word (in essence, just counting).
First, the n-gram language model makes the simplifying assumption that the next word $\mathbf{x}^{(t+1)}$ depends only on the preceding n-1 words:
$$p(\mathbf{x}^{(t+1)}\mid\mathbf{x}^{(t)},\mathbf{x}^{(t-1)},\dots,\mathbf{x}^{(1)}) = p(\mathbf{x}^{(t+1)}\mid\mathbf{x}^{(t)},\mathbf{x}^{(t-1)},\dots,\mathbf{x}^{(t-n+2)})$$
Then, by the definition of conditional probability,
$$p(\mathbf{x}^{(t+1)}\mid\mathbf{x}^{(t)},\mathbf{x}^{(t-1)},\dots,\mathbf{x}^{(t-n+2)})=\frac{p(\mathbf{x}^{(t+1)},\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})}{p(\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})}$$
The numerator is the probability of an n-gram and the denominator is the probability of the corresponding (n-1)-gram. By counting occurrences in a large corpus, we obtain a statistical approximation of both:
$$\frac{p(\mathbf{x}^{(t+1)},\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})}{p(\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})} \approx \frac{\mathrm{count}(\mathbf{x}^{(t+1)},\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})}{\mathrm{count}(\mathbf{x}^{(t)},\dots,\mathbf{x}^{(t-n+2)})}$$
e.g. Suppose we are learning a 4-gram language model and the text so far is:
$$\text{as the proctor started the clock, the students opened their \_\_}$$
With a 4-gram model we discard everything except the last three words, "students opened their", and estimate
$$p(w\mid\text{students opened their}) = \frac{\mathrm{count}(\text{students opened their } w)}{\mathrm{count}(\text{students opened their})}$$
Suppose that in the corpus "students opened their" occurred 1000 times, "students opened their books" occurred 400 times, and "students opened their exams" occurred 100 times. Then:
$$p(\text{books}\mid\text{students opened their}) = \frac{\mathrm{count}(\text{students opened their books})}{\mathrm{count}(\text{students opened their})} = 0.4$$
$$p(\text{exams}\mid\text{students opened their}) = \frac{\mathrm{count}(\text{students opened their exams})}{\mathrm{count}(\text{students opened their})} = 0.1$$
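A minimal sketch of this counting estimate. `NgramLM` and its fields are hypothetical names rather than any standard API; it assumes n ≥ 2 and sentences that are already tokenized.

```python
from collections import Counter

class NgramLM:
    """Count-based n-gram LM: p(w | prev n-1 words) = count(prev + w) / count(prev). Assumes n >= 2."""
    def __init__(self, n):
        self.n = n
        self.ngram_counts = Counter()    # counts of n-grams
        self.context_counts = Counter()  # counts of (n-1)-gram contexts

    def train(self, sentences):
        for tokens in sentences:
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngram_counts[ngram] += 1
                self.context_counts[ngram[:-1]] += 1

    def prob(self, context, word):
        context = tuple(context[-(self.n - 1):])   # keep only the last n-1 words
        denom = self.context_counts[context]
        if denom == 0:
            return 0.0   # unseen context; see the sparsity discussion that follows
        return self.ngram_counts[context + (word,)] / denom

# On the lecture's numbers this would give p(books | students opened their) = 400 / 1000 = 0.4.
```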
Sparsity problems. Consider the estimate again:
$$p(w\mid\text{students opened their}) = \frac{\mathrm{count}(\text{students opened their } w)}{\mathrm{count}(\text{students opened their})}$$
First, the numerator: "students opened their w" may never occur in the corpus, in which case the estimated probability of w is 0.
Partial solution: add a small value $\delta$ to the count of every word $w\in V$; this is called smoothing.
Second, the denominator: "students opened their" itself may never occur, in which case we cannot compute a probability for any w at all.
Partial solution: condition on the shorter context "opened their" instead; this is called backoff.
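A hedged sketch of both fixes, building on the hypothetical `NgramLM` above: add-δ smoothing for the numerator problem and a simple backoff for the denominator problem. Real systems use more careful schemes (e.g. Katz backoff or Kneser-Ney); this is only illustrative.

```python
def prob_add_delta(lm, vocab, context, word, delta=0.01):
    """Add-delta smoothing: pretend every word in V was seen delta extra times after this context."""
    context = tuple(context[-(lm.n - 1):])
    numer = lm.ngram_counts[context + (word,)] + delta
    denom = lm.context_counts[context] + delta * len(vocab)
    return numer / denom

def prob_backoff(lms, context, word):
    """Backoff: try the longest-order model first; if its context was never seen, fall back to a shorter one."""
    for lm in lms:   # e.g. [4-gram LM, trigram LM, bigram LM]
        ctx = tuple(context[-(lm.n - 1):])
        if lm.context_counts[ctx] > 0:
            return lm.ngram_counts[ctx + (word,)] / lm.context_counts[ctx]
    return 0.0   # even the shortest context was never seen
```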
Storage problem: we must store a count for every n-gram observed in the corpus, so the model grows as n (or the corpus) grows.
This creates a tension: we would like a larger n so that the model is more accurate, but increasing n makes both the sparsity problem and the storage problem worse.
A fixed-window neural language model uses the same example:
$$\text{as the proctor started the clock, the students opened their \_\_}$$
With a window size of 4, the model uses only "the students opened their" to predict the next word.
The inputs $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\mathbf{x}^{(3)},\mathbf{x}^{(4)}$ are one-hot vectors. Each is mapped to a word embedding, and the embeddings are concatenated into a single vector $\mathbf{e}=[\mathbf{e}^{(1)};\mathbf{e}^{(2)};\mathbf{e}^{(3)};\mathbf{e}^{(4)}]$, which is passed through a hidden layer, $\mathbf{h}=f(\mathbf{W}\mathbf{e}+\mathbf{b}_1)$. Finally, a softmax output layer produces the probability distribution over the vocabulary, $\hat{\mathbf{y}}=\mathrm{softmax}(\mathbf{U}\mathbf{h}+\mathbf{b}_2)\in\mathbb{R}^{|V|}$.
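A minimal NumPy sketch of this forward pass (names follow the formulas above; tanh is assumed for the nonlinearity $f$, and the sizes and random initialization are purely illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10_000, 100, 200        # vocab size, embedding dim, hidden dim
window = 4

E  = rng.normal(size=(d, V)) * 0.01           # embedding matrix
W  = rng.normal(size=(h, window * d)) * 0.01  # hidden-layer weights over the concatenated embeddings
b1 = np.zeros(h)
U  = rng.normal(size=(V, h)) * 0.01           # output-layer weights
b2 = np.zeros(V)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def forward(word_ids):
    """word_ids: the `window` word indices, e.g. for 'the students opened their'."""
    e = np.concatenate([E[:, i] for i in word_ids])   # e = [e1; e2; e3; e4]
    hdn = np.tanh(W @ e + b1)                         # h = f(We + b1)
    return softmax(U @ hdn + b2)                      # y_hat, a distribution over V

y_hat = forward([17, 42, 7, 99])   # arbitrary example word indices
print(y_hat.shape)                 # (10000,)
```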
Improvements over the n-gram LM:
• No sparsity problem;
• No need to store all observed n-grams.
Remaining problems:
• The fixed window is never large enough, and enlarging the window means enlarging $\mathbf{W}$, so the window cannot grow very much;
• Because $\mathbf{e}$ is built by concatenation, each word in the window is multiplied by a different block of $\mathbf{W}$, so we lose weight sharing, an important property in machine learning. For example:
$$\mathbf{W}\mathbf{e} = [\mathbf{w}_1,\mathbf{w}_2,\mathbf{w}_3,\mathbf{w}_4]\begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\\ \mathbf{e}_3\\ \mathbf{e}_4\end{bmatrix}$$
Different blocks of $\mathbf{W}$ correspond to different word positions, i.e. there is no symmetry in how the inputs are processed.
Core idea of the RNN: apply the same weights $\mathbf{W}$ over and over. In contrast to the fixed-window neural model above, an RNN applies the same $\mathbf{W}$ to every word of the input sequence.
As shown in the figure, the input at each step is a one-hot word vector $\mathbf{x}^{(t)}$, which is first embedded into a dense vector $\mathbf{e}^{(t)}=\mathbf{E}\mathbf{x}^{(t)}$. The hidden state at step t is computed from the current embedding $\mathbf{e}^{(t)}$ and the previous hidden state $\mathbf{h}^{(t-1)}$: $\mathbf{h}^{(t)}=\sigma(\mathbf{W}_h\mathbf{h}^{(t-1)}+\mathbf{W}_e\mathbf{e}^{(t)}+\mathbf{b}_1)$. Finally, a softmax over the hidden state gives the output distribution $\hat{\mathbf{y}}^{(t)}=\mathrm{softmax}(\mathbf{U}\mathbf{h}^{(t)}+\mathbf{b}_2)\in\mathbb{R}^{|V|}$. An output can be read off at any time step, depending on what you want the model to predict, and the input sequence can have any length.
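A minimal NumPy sketch of unrolling this recurrence over a sequence of word indices (names follow the formulas; tanh stands in for $\sigma$, and the sizes are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 10_000, 100, 200

E   = rng.normal(size=(d, V)) * 0.01   # embedding matrix
W_h = rng.normal(size=(h, h)) * 0.01   # hidden-to-hidden weights (reused at every step)
W_e = rng.normal(size=(h, d)) * 0.01   # embedding-to-hidden weights (reused at every step)
b1  = np.zeros(h)
U   = rng.normal(size=(V, h)) * 0.01
b2  = np.zeros(V)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def rnn_lm_forward(word_ids):
    """Run the RNN over the sequence, returning one next-word distribution per step."""
    h_t = np.zeros(h)                                  # initial hidden state h^(0)
    outputs = []
    for i in word_ids:                                 # the same W_h, W_e are applied at every step
        e_t = E[:, i]                                  # e^(t) = E x^(t)
        h_t = np.tanh(W_h @ h_t + W_e @ e_t + b1)      # h^(t)
        outputs.append(softmax(U @ h_t + b2))          # y_hat^(t)
    return outputs

dists = rnn_lm_forward([17, 42, 7, 99])
print(len(dists), dists[0].shape)   # 4 (10000,)
```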
RNN advantages:
• Can process input sequences of any length;
• The computation at step t can, in principle, use information from many steps back;
• The model size does not grow with the length of the input;
• The same weights are applied at every time step, so there is symmetry in how inputs are processed.
RNN disadvantages:
• Recurrent computation is slow;
• In practice, it is difficult to access information from many steps back.
Training an RNN-LM: take a corpus of text, i.e. a sequence of words $\mathbf{x}^{(1)},\mathbf{x}^{(2)},\dots,\mathbf{x}^{(T)}$. At every step t, the RNN-LM computes the output distribution $\hat{\mathbf{y}}^{(t)}$, i.e. it predicts the next word given all preceding words.
The loss at step t is the cross-entropy between the predicted distribution $\hat{\mathbf{y}}^{(t)}$ and the true next word $\mathbf{y}^{(t)}$ (the one-hot vector for $\mathbf{x}^{(t+1)}$):
$$J^{(t)}(\theta)=CE(\mathbf{y}^{(t)},\hat{\mathbf{y}}^{(t)})=-\sum_{w\in V}\mathbf{y}_w^{(t)}\log\hat{\mathbf{y}}_w^{(t)}=-\log\hat{\mathbf{y}}_{\mathbf{x}_{t+1}}^{(t)}$$
The overall loss is the average over the whole training set:
$$J(\theta)=\frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta)=\frac{1}{T}\sum_{t=1}^{T}-\log\hat{\mathbf{y}}_{\mathbf{x}_{t+1}}^{(t)}$$
However, computing the loss and gradients over the entire corpus at once is too expensive; in practice, they are computed over one sentence or one document at a time.
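A sketch of computing this average cross-entropy for one sentence, reusing the hypothetical `rnn_lm_forward` above: at each step the distribution is scored against the actual next word.

```python
import numpy as np

def sentence_loss(word_ids):
    """Average cross-entropy of predicting word_ids[t+1] from word_ids[:t+1]."""
    dists = rnn_lm_forward(word_ids[:-1])              # predictions y_hat^(1..T-1)
    losses = [-np.log(dists[t][word_ids[t + 1]])       # -log y_hat_{x_{t+1}}
              for t in range(len(dists))]
    return float(np.mean(losses))

print(sentence_loss([17, 42, 7, 99, 3]))
```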
Backpropagation for RNNs relies on the multivariable chain rule: given a multivariable function $f(x,y)$ where $x(t)$ and $y(t)$ are each functions of a single variable $t$,
$$\frac{d}{dt}f\big(x(t),y(t)\big)=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}$$
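A quick worked check (added here for concreteness, not from the lecture): with $f(x,y)=xy$, $x(t)=t^2$, $y(t)=t^3$, both routes agree.
$$\frac{d}{dt}f\big(x(t),y(t)\big)=\frac{d}{dt}t^5=5t^4,
\qquad
\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}
= y\cdot 2t + x\cdot 3t^2 = 2t^4 + 3t^4 = 5t^4.$$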
So the gradient of the loss with respect to the repeated weight $\mathbf{W}_h$ is obtained by applying the multivariable chain rule and summing the gradient contributions from each time step at which $\mathbf{W}_h$ is used:
$$\frac{\partial J^{(t)}}{\partial \mathbf{W}_h} =\sum_{i=1}^t \left.\frac{\partial J^{(t)}}{\partial \mathbf{W}_h}\right|_{(i)}$$
This sum is computed by propagating gradients backwards over the time steps $i=t,\dots,0$ and accumulating them; the algorithm is called backpropagation through time (BPTT).
The standard evaluation metric for language models is perplexity (lower is better):
$$\text{perplexity}= \prod_{t=1}^T\left(\frac{1}{P_{LM}(\mathbf{x}^{(t+1)}\mid\mathbf{x}^{(t)},\dots,\mathbf{x}^{(1)})}\right)^{1/T}$$
This is equal to the exponential of the cross-entropy loss:
$$\text{perplexity}=\prod_{t=1}^T\left(\frac{1}{\hat{\mathbf{y}}_{\mathbf{x}_{t+1}}^{(t)}}\right)^{1/T}= \exp\!\left(\frac{1}{T}\sum_{t=1}^{T}-\log\hat{\mathbf{y}}_{\mathbf{x}_{t+1}}^{(t)}\right)=\exp(J(\theta))$$
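A sketch of evaluating perplexity as the exponential of the average per-token cross-entropy, again reusing the hypothetical `rnn_lm_forward` above.

```python
import numpy as np

def perplexity(sentences):
    """Corpus perplexity = exp(average per-token cross-entropy)."""
    total_loss, total_tokens = 0.0, 0
    for word_ids in sentences:
        dists = rnn_lm_forward(word_ids[:-1])
        for t, dist in enumerate(dists):
            total_loss += -np.log(dist[word_ids[t + 1]])
            total_tokens += 1
    return float(np.exp(total_loss / total_tokens))

print(perplexity([[17, 42, 7, 99, 3], [5, 8, 13, 21]]))
```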