Word2Vec Study Notes (4): The Negative Sampling Model

    The previous note covered the Hierarchical Softmax model; this one explains how CBOW and Skip-gram work under the Negative Sampling model. Compared with Hierarchical Softmax, Negative Sampling no longer relies on a Huffman tree, which substantially improves training performance.

1. Negative Sampling

    In negative sampling, given a word $w$, how do we generate its negative sample set $NEG(w)$? For a word $w$ with context $context(w)$, the word $w$ itself is the positive example and every other word is a negative example. But there are far too many negative examples, so how should we choose among them? Words occur in the corpus $\mathcal{C}$ with different frequencies, and when sampling we want high-frequency words to be selected with higher probability and low-frequency words with lower probability. This is a weighted sampling problem.
Assign to each word $w$ in the dictionary $\mathcal{D}$ a segment of length:
$$len(w) = \frac{counter(w)}{\sum_{u \in \mathcal{D}} counter(u)} \qquad (1)$$
The denominator in Eq. (1) is simply a normalization. Word2Vec implements the sampler as follows: let $l_0 = 0$ and $l_k = \sum_{j=1}^{k} len(w_j)$ for $k = 1, 2, \dots, N$, where $w_j$ is the $j$-th word of the dictionary $\mathcal{D}$. The points $\{l_j\}_{j=0}^{N}$ then form a non-equidistant partition of the interval $[0,1]$, with subintervals $I_j = (l_{j-1}, l_j]$. On top of this, add an equidistant partition: Word2Vec chooses $M = 10^8$ and places $M$ points evenly on $[0,1]$, which yields a mapping from each of the $M$ equidistant points to the interval $I_j$ (and hence to the word $w_j$) that contains it, as shown in the figure below:
[Figure 1: the mapping between the $M$ equidistant points and the non-equidistant partition $\{l_j\}$]
Figure adapted from: http://www.cnblogs.com/neopenx/p/4571996.html , an excellent post that is well worth reading.

    To select a negative sample, draw a random integer in $[m_0, m_{M-1}]$ (i.e., pick one of the $M$ equidistant points) and map it through the partition $I$ to a word. If the word drawn for $w_i$ happens to be $w_i$ itself, skip it and draw again. The size of the negative sample set $NEG(w)$ defaults to 5 in the Word2Vec source code.
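As a concrete illustration, here is a minimal Python sketch of the table-based sampler described above. The function and variable names are my own, not from the Word2Vec source; note that the actual Word2Vec code raises each count to the 0.75 power before normalizing, whereas this sketch follows Eq. (1) literally. For quick experiments a much smaller `table_size` than $10^8$ works fine.

```python
import numpy as np

def build_unigram_table(counts, table_size=10**8):
    """Fill a table of `table_size` slots so that word w owns a share of
    slots proportional to len(w) = counter(w) / sum_u counter(u)."""
    total = float(sum(counts.values()))
    table = []
    cum = 0.0        # running cumulative probability, i.e. l_k
    i = 0            # index of the next equidistant point m_i to assign
    for w, c in counts.items():
        cum += c / total
        while i < table_size and i / table_size < cum:
            table.append(w)
            i += 1
    return table

def sample_negatives(table, w, size=5, rng=None):
    """Draw `size` negative samples for the positive word w; if a draw
    happens to be w itself, skip it and draw again (as described above)."""
    rng = rng or np.random.default_rng()
    neg = []
    while len(neg) < size:
        u = table[rng.integers(len(table))]
        if u != w:
            neg.append(u)
    return neg
```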

2. CBOW

    Assume the negative sample set $NEG(w)$ for word $w$ has already been drawn. Define a label $L$ as follows: for $\forall \widetilde{w} \in \mathcal{D}$,
$$L^w(\widetilde{w}) = \begin{cases} 1, & \widetilde{w} = w; \\ 0, & \widetilde{w} \ne w. \end{cases}$$
For a given positive example $(context(w), w)$, we want to maximize:
$$g(w) = \prod_{u \in \{w\} \cup NEG(w)} p(u \mid context(w))$$
where
$$p(u \mid context(w)) = \begin{cases} \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u), & L^w(u) = 1 \\ 1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u), & L^w(u) = 0 \end{cases}$$
Since $L^w(u) \in \{0,1\}$, the two cases combine into a single expression (a product of powers, one of which is always 1):
$$p(u \mid context(w)) = \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)^{L^w(u)} \cdot \big(1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big)^{1 - L^w(u)}$$
Here is why we maximize $g(w)$:
$$\begin{aligned} g(w) &= \prod_{u \in \{w\} \cup NEG(w)} p(u \mid context(w)) \\ &= \prod_{u \in \{w\} \cup NEG(w)} \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)^{L^w(u)} \big(1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big)^{1 - L^w(u)} \\ &= \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^w) \prod_{u \in NEG(w)} \big(1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big) \end{aligned}$$
The factor in front of the product is the probability of the positive sample, and the product over $NEG(w)$ shrinks as the probabilities of the negative samples grow. Maximizing $g(w)$ therefore pushes the model toward predicting the positive word and away from the sampled negatives.
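To make the role of $g(w)$ concrete, here is a small sketch that evaluates it for one training pair, assuming we already have the context vector `x_w` and a parameter vector `theta[u]` for every word $u$; these names are illustrative, not taken from the Word2Vec source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def g_of_w(x_w, theta, w, neg_samples):
    """g(w) = sigma(x_w^T theta^w) * prod_{u in NEG(w)} (1 - sigma(x_w^T theta^u))."""
    prob = sigmoid(x_w @ theta[w])              # positive-sample factor
    for u in neg_samples:
        prob *= 1.0 - sigmoid(x_w @ theta[u])   # negative-sample factors
    return prob
```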

Likewise, over the whole corpus, let:
$$\mathcal{G} = \prod_{w \in \mathcal{C}} g(w)$$
Taking this as the overall objective, we maximize its log-likelihood:
$$\mathcal{L} = \log \mathcal{G} = \sum_{w \in \mathcal{C}} \log g(w) = \sum_{w \in \mathcal{C}} \sum_{u \in \{w\} \cup NEG(w)} \Big\{ L^w(u) \log \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u) + [1 - L^w(u)] \log \big[1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big] \Big\}$$
As in the previous derivation, write
$$L(w,u) = L^w(u) \log \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u) + [1 - L^w(u)] \log \big[1 - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big]$$
Then compute $\frac{\partial L(w,u)}{\partial \boldsymbol{x}_w}$ and $\frac{\partial L(w,u)}{\partial \boldsymbol{\theta}^u}$ (the derivation is omitted here):
$$\frac{\partial L(w,u)}{\partial \boldsymbol{x}_w} = \big[L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big] \boldsymbol{\theta}^u \qquad \frac{\partial L(w,u)}{\partial \boldsymbol{\theta}^u} = \big[L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big] \boldsymbol{x}_w$$
This gives the following update rules:
$$\boldsymbol{\theta}^u := \boldsymbol{\theta}^u + \eta \big[L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big] \boldsymbol{x}_w$$
$$\boldsymbol{v}(\widetilde{w}) := \boldsymbol{v}(\widetilde{w}) + \eta \sum_{u \in \{w\} \cup NEG(w)} \big[L^w(u) - \sigma(\boldsymbol{x}_w^T \boldsymbol{\theta}^u)\big] \boldsymbol{\theta}^u$$
where $\widetilde{w} \in context(w)$ and $\eta$ is the learning rate.
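The update rules translate directly into one stochastic-gradient step per training pair. Below is an illustrative sketch (not the Word2Vec implementation itself): `v` maps each word to its input vector $\boldsymbol{v}(w)$, `theta` maps each word to its output vector $\boldsymbol{\theta}^u$, and $\boldsymbol{x}_w$ is taken as the average of the context vectors. Note that the accumulated correction `e` is computed with the pre-update `theta[u]` before `theta[u]` itself is changed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_ns_step(v, theta, context, w, neg_samples, eta=0.025):
    """One CBOW negative-sampling update for the pair (context(w), w)."""
    x_w = np.mean([v[c] for c in context], axis=0)   # x_w from the context words
    e = np.zeros_like(x_w)                           # accumulated update for x_w
    for u in [w] + list(neg_samples):
        label = 1.0 if u == w else 0.0               # L^w(u)
        g = eta * (label - sigmoid(x_w @ theta[u]))  # eta * [L^w(u) - sigma(x_w^T theta^u)]
        e += g * theta[u]                            # contribution to the v(w~) update
        theta[u] += g * x_w                          # update theta^u
    for c in context:                                # apply the accumulated update
        v[c] += e                                    # to every context word w~
```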
