NNLM (Neural Network Language Model) is the language model proposed by Bengio et al. in the 2003 paper "A Neural Probabilistic Language Model".
It uses a three-layer structure: a projection layer, a hidden layer, and an output layer. Viewed end to end, we feed in one-hot representations of the context words and want the conditional probability of the next word, so what the network has to do is fit a function that maps one-hot vectors to the corresponding probability distribution. We can break the network structure in the figure above into two parts:
The first part is a linear projection layer. To compute the probability of $w_n$, the one-hot vectors of $w_1, w_2, \dots, w_{n-1}$ are fed in one by one and each is multiplied by an embedding matrix $C_{m \times V}$, where $m$ is the dimension of the embedding vectors and $V$ is the size of the vocabulary; the matrix $C$ is itself learned during training. This step is simply a lookup that maps each one-hot vector to its word vector.
Example: a text of N words with a vocabulary of size V.
Word vector W: a one-hot vector of size [100000, 1]; W(t) denotes the one-hot vector of the t-th word.
Embedding matrix C: of size [m × V] with V = 100000; Google used m = 300 in its tests.
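A minimal numpy sketch of this projection step, using toy sizes instead of V = 100000 and m = 300, shows that multiplying the embedding matrix by a one-hot vector is just a column lookup:

```python
import numpy as np

V, m = 8, 4                      # toy sizes; the example above uses V = 100000, m = 300
C = np.random.randn(m, V)        # embedding matrix C, learned during training

t = 3                            # index of some word in the vocabulary
one_hot = np.zeros(V)
one_hot[t] = 1.0

# Multiplying C by a one-hot vector simply selects the t-th column of C,
# so the projection layer is equivalent to an embedding lookup.
via_matmul = C @ one_hot
via_lookup = C[:, t]
assert np.allclose(via_matmul, via_lookup)
```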
Let $h$ be the number of hidden units.
The projection layer yields the input vector $x_{(n-1)m \times 1}$, i.e., the concatenation of the word vectors of the previous $n-1$ words.
The input-to-hidden weight matrix (the hidden layer weights) is $H_{h \times (n-1)m}$.
The weight matrix applied directly to the input vector $x$ (the direct input-to-output connections) is $W_{V \times (n-1)m}$.
The hidden-to-output weight matrix (the hidden-to-output weights) is $U_{V \times h}$.
The hidden layer bias is $d_{h \times 1}$.
The output layer bias is $b_{V \times 1}$.
Output: $y = b + Wx + U\tanh(d + Hx)$
When there are no direct connections from the input feature vectors to the output layer, the matrix $W$ is set to 0.
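A shape-level sketch of the forward pass $y = b + Wx + U\tanh(d + Hx)$, with randomly initialized toy parameters standing in for the learned ones:

```python
import numpy as np

n, m, h, V = 5, 4, 6, 8                  # toy sizes: context n-1, embedding m, hidden h, vocab V
x = np.random.randn((n - 1) * m)         # concatenated context word vectors, shape ((n-1)*m,)
H = np.random.randn(h, (n - 1) * m)      # input-to-hidden weights
d = np.random.randn(h)                   # hidden layer bias
U = np.random.randn(V, h)                # hidden-to-output weights
W = np.random.randn(V, (n - 1) * m)      # direct input-to-output weights
b = np.random.randn(V)                   # output bias

y = b + W @ x + U @ np.tanh(d + H @ x)   # one unnormalized score per vocabulary word
# Setting W to a zero matrix is equivalent to removing the direct input-to-output connections.
```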
Once the hidden layer has produced $y$, a softmax turns the scores into probabilities:
$$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_{i} e^{y_i}}$$
where the sum in the denominator runs over all $V$ words in the vocabulary.
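For completeness, a numerically stable softmax over the scores $y$ (the shift by $\max(y)$ does not change the result):

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))    # subtract the max for numerical stability
    return e / e.sum()

y = np.random.randn(8)           # toy scores from the forward pass (V = 8)
p = softmax(y)                   # p[i] = estimated P(w_t = i-th word | context)
assert np.isclose(p.sum(), 1.0)
```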
The training objective (maximized) is the regularized average log-likelihood:
$$L = \frac{1}{T}\sum_{t} \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) + R(\theta)$$
where $R(\theta)$ is a regularization term (weight decay in the paper).
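The log-likelihood part of this objective can be sketched as below; `probs` and `targets` are assumed names for the (T, V) matrix of softmax outputs and the indices of the actual next words, and $R(\theta)$ is left out:

```python
import numpy as np

def mean_log_likelihood(probs, targets):
    # average log-probability assigned to the word that actually followed each context
    T = len(targets)
    return np.mean(np.log(probs[np.arange(T), targets]))

probs = np.full((3, 8), 1.0 / 8)             # uniform predictions over a toy V = 8 vocabulary
targets = np.array([2, 5, 1])
print(mean_log_likelihood(probs, targets))   # log(1/8) ≈ -2.079
```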
The parameters $\theta$ to be updated:
$$\theta = (b, d, W, U, H, C)$$
Backpropagation with stochastic gradient ascent:
$$\theta \leftarrow \theta + \varepsilon \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1})}{\partial \theta}$$
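A sketch of that update as stochastic gradient ascent; `params` and `grads` are assumed to be dicts keyed by parameter name, with `grads` holding the backpropagated gradients of $\log\hat{P}$ for the current example:

```python
def sgd_ascent_step(params, grads, eps):
    # move each parameter in the direction that increases log P(w_t | context)
    for name in params:              # name in ('b', 'd', 'W', 'U', 'H', 'C')
        params[name] += eps * grads[name]
```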
The paper sets $V = 17964$.
Initial learning rate $\varepsilon_0 = 10^{-3}$.
Learning-rate schedule $\varepsilon_t = \dfrac{\varepsilon_0}{1 + rt}$, where $t$ is the number of parameter updates performed so far and $r = 10^{-8}$ is a decrease factor "chosen heuristically" (see the code sketch after these settings).
The n-gram context is extended to $n = 5$.
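A small sketch of the learning-rate schedule with the values reported in the paper:

```python
def learning_rate(t, eps0=1e-3, r=1e-8):
    # eps_t = eps_0 / (1 + r * t), t = number of parameter updates performed so far
    return eps0 / (1.0 + r * t)

print(learning_rate(0))             # 0.001
print(learning_rate(10_000_000))    # ≈ 0.000909: the decay is very gradual
```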
The results do not allow to say whether the direct connections from input to output are useful or not, but suggest that on a smaller corpus at least, better generalization can be obtained without the direct input-to-output connections, at the cost of longer training: without direct connections the network took twice as much time to converge (20 epochs instead of 10), albeit to a slightly lower perplexity.
A reasonable interpretation is that direct input-to-output connections provide a bit more capacity and faster learning of the “linear” part of the mapping from word features to log-probabilities.
In short, compute was limited at the time, so keeping the direct connections was a simple, brute-force way to speed up training.
Random initialization of the word features was done (similarly to initialization of neural network weights), but we suspect that better results might be obtained with a knowledge-based initialization.
The feature vectors associated with each word are learned, but they could be initialized using prior knowledge of semantic features.
In short: the word vectors are randomly initialized, like the other weights.