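A Mathematical Analysis of word2vec

    This article skips the preliminaries and consists purely of mathematical derivations. It is recommended to first read 《word2vec中的数学原理详解》 (a detailed explanation of the mathematics in word2vec).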

I. Logistic Regression

    For background on logistic regression, see 《逻辑回归算法分析》 (an analysis of the logistic regression algorithm).

    The sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$

         $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$

         $[\log\sigma(x)]' = \frac{\sigma'(x)}{\sigma(x)} = 1 - \sigma(x)$

         $[\log(1 - \sigma(x))]' = \frac{-\sigma'(x)}{1 - \sigma(x)} = -\sigma(x)$
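
    The three derivative identities above can be checked numerically. The following is a minimal sketch using central finite differences; the helper names (`numeric_derivative`, `check_identities`) and the test points are illustrative assumptions, not part of the original article.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))


def numeric_derivative(f, x: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def check_identities(x: float) -> dict:
    """Absolute error of each closed-form derivative at the point x."""
    s = sigmoid(x)
    return {
        "sigma'": abs(numeric_derivative(sigmoid, x) - s * (1.0 - s)),
        "[log sigma]'": abs(
            numeric_derivative(lambda t: math.log(sigmoid(t)), x) - (1.0 - s)
        ),
        "[log(1 - sigma)]'": abs(
            numeric_derivative(lambda t: math.log(1.0 - sigmoid(t)), x) + s
        ),
    }


if __name__ == "__main__":
    # Arbitrary test points; every reported error should be close to zero.
    for x in (-2.0, 0.0, 1.5):
        print(x, check_identities(x))
```

    The last two identities are the ones reused in every gradient derivation below, which is why the results all take the form "label minus prediction".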

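    Logistic regression solves binary classification problems: define the log-likelihood function and maximize it with gradient ascent. In fact, the algorithm at the core of word2vec is essentially logistic regression.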


II. CBOW

    CBOW predicts the current word from its context words and propagates the prediction error back to every context word, so that the prediction becomes more accurate.

    Notation:

    1. $p^w$: the path from the root node to the leaf node corresponding to $w$;

    2. $l^w$: the number of nodes on the path $p^w$;

    3. $p_1^w, p_2^w, \cdots, p_{l^w}^w$: the $l^w$ nodes on the path $p^w$, where $p_1^w$ is the root node and $p_{l^w}^w$ is the node corresponding to $w$;

    4. $d_2^w, d_3^w, \cdots, d_{l^w}^w \in \{0, 1\}$: the Huffman code of the word $w$, consisting of $l^w - 1$ bits; $d_j^w$ is the code bit of the $j$-th node on the path $p^w$ (the root node has no code bit);

    5. $\theta_1^w, \theta_2^w, \cdots, \theta_{l^w - 1}^w \in \mathbb{R}^m$: the vectors of the non-leaf nodes on the path $p^w$; $\theta_j^w$ is the vector of the $j$-th non-leaf node on the path $p^w$;

    6. $Label(p_j^w) = 1 - d_j^w, \ j = 2, 3, \cdots, l^w$: the classification label of the $j$-th node on the path $p^w$ (the root node is not classified).

1. Hierarchical Softmax

    Likelihood to maximize: $\prod_{w \in C} p(w|Context(w))$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$

    Conditional probability: $p(w|Context(w)) = \prod_{j=2}^{l^w} p(d_j^w|X_w, \theta_{j-1}^w)$, where:

         $p(d_j^w|X_w, \theta_{j-1}^w) = \begin{cases} \sigma(X_w^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(X_w^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$ Note: in the Huffman tree used by word2vec, code 0 denotes the positive class and code 1 denotes the negative class.

         $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$
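
    The conditional probability above can be evaluated directly as a product of binary decisions along the Huffman path. The sketch below assumes a toy tree with three leaves (codes `[0]`, `[1, 0]`, `[1, 1]`) and made-up vectors; the function names and all numbers are illustrative only.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """X_w: component-wise average of the context word vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hs_probability(
    context_vectors: List[List[float]],
    code: List[int],
    thetas: List[List[float]],
) -> float:
    """p(w | Context(w)) for a word with Huffman bits `code`.

    thetas[k] is the vector of the k-th inner node on the path, so it pairs
    with code[k] (theta_{j-1} with d_j in the article's notation).
    """
    x_w = mean_vector(context_vectors)
    prob = 1.0
    for d, theta in zip(code, thetas):
        q = sigmoid(dot(x_w, theta))
        prob *= q if d == 0 else 1.0 - q  # code 0 is the positive class
    return prob


if __name__ == "__main__":
    context_vectors = [[0.1, -0.2], [0.3, 0.4]]  # hypothetical v(u)
    root, child = [0.2, 0.1], [-0.3, 0.5]        # hypothetical inner nodes
    leaves = {"A": ([0], [root]), "B": ([1, 0], [root, child]),
              "C": ([1, 1], [root, child])}
    probs = {name: hs_probability(context_vectors, code, path)
             for name, (code, path) in leaves.items()}
    print(probs, sum(probs.values()))
```

    Because every inner node splits its probability mass between its two children, the probabilities of all leaves sum to 1 without any explicit normalization over the vocabulary.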


    Written as a single expression: $p(d_j^w|X_w, \theta_{j-1}^w) = [\sigma(X_w^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(X_w^T \theta_{j-1}^w)]^{d_j^w}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{j=2}^{l^w} p(d_j^w|X_w, \theta_{j-1}^w)$

        $= \sum_{w \in C} \log \prod_{j=2}^{l^w} \left\{ [\sigma(X_w^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(X_w^T \theta_{j-1}^w)]^{d_j^w} \right\}$

        $= \sum_{w \in C} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)] \right\}$


    To simplify differentiation, let: $\mathcal{L}(w, j) = (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)]$

     The gradient of $\mathcal{L}(w, j)$ with respect to $\theta_{j-1}^w$:

         $\frac{\partial \mathcal{L}(w, j)}{\partial \theta_{j-1}^w} = \frac{\partial}{\partial \theta_{j-1}^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(X_w^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(X_w^T \theta_{j-1}^w)] \right\}$

                 $= (1 - d_j^w)[1 - \sigma(X_w^T \theta_{j-1}^w)] X_w - d_j^w [\sigma(X_w^T \theta_{j-1}^w)] X_w$

                 $= \left\{ (1 - d_j^w)[1 - \sigma(X_w^T \theta_{j-1}^w)] - d_j^w [\sigma(X_w^T \theta_{j-1}^w)] \right\} X_w$

                 $= [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] X_w$

    Hence the update of $\theta_{j-1}^w$ can be written as:

         $\theta_{j-1}^w := \theta_{j-1}^w + \eta [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] X_w$


    Since $\theta_{j-1}^w$ and $X_w$ play symmetric roles in $\mathcal{L}(w, j)$, the gradient of $\mathcal{L}(w, j)$ with respect to $X_w$ is:

         $\frac{\partial \mathcal{L}(w, j)}{\partial X_w} = [1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)] \theta_{j-1}^w$

     Use $\frac{\partial \mathcal{L}(w, j)}{\partial X_w}$ to update the context word vectors $v(u), u \in Context(w)$. Strictly speaking, because $X_w$ is an average, $\frac{\partial X_w}{\partial v(u)}$ carries a factor $\frac{1}{|Context(w)|}$; the update here applies the gradient with respect to $X_w$ to each $v(u)$ directly:

         $v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, j)}{\partial X_w}, \quad u \in Context(w)$
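
    The two closed-form gradients of $\mathcal{L}(w, j)$ can be compared against finite differences as a sanity check. This is a minimal sketch for a single inner node; the vectors and helper names are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def log_lik_term(x: List[float], theta: List[float], d: int) -> float:
    """L(w, j) for one inner node: (1 - d) log(q) + d log(1 - q)."""
    q = sigmoid(dot(x, theta))
    return (1 - d) * math.log(q) + d * math.log(1.0 - q)


def analytic_gradients(
    x: List[float], theta: List[float], d: int
) -> Tuple[List[float], List[float]]:
    """Closed-form gradients: (with respect to theta, with respect to x)."""
    coeff = 1 - d - sigmoid(dot(x, theta))
    return [coeff * xi for xi in x], [coeff * ti for ti in theta]


def numeric_gradient(
    f: Callable[[List[float]], float], v: List[float], h: float = 1e-6
) -> List[float]:
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(v)):
        up = v[:i] + [v[i] + h] + v[i + 1:]
        down = v[:i] + [v[i] - h] + v[i + 1:]
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def max_gradient_error(x: List[float], theta: List[float], d: int) -> float:
    """Largest absolute gap between analytic and numeric gradients."""
    grad_theta, grad_x = analytic_gradients(x, theta, d)
    num_theta = numeric_gradient(lambda t: log_lik_term(x, t, d), theta)
    num_x = numeric_gradient(lambda z: log_lik_term(z, theta, d), x)
    pairs = zip(grad_theta + grad_x, num_theta + num_x)
    return max(abs(a - n) for a, n in pairs)


if __name__ == "__main__":
    x_w = [0.2, -0.1, 0.4]    # hypothetical X_w
    theta = [0.3, 0.5, -0.2]  # hypothetical inner-node vector
    for d in (0, 1):
        print(d, max_gradient_error(x_w, theta, d))
```

    The same scalar coefficient $1 - d_j^w - \sigma(X_w^T \theta_{j-1}^w)$ multiplies $X_w$ in one gradient and $\theta_{j-1}^w$ in the other, which is the symmetry used in the derivation.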


    Taking the sample $(Context(w), w)$ as an example, the training pseudocode is:

     $e = 0$

     $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$

     FOR $j = 2 : l^w$ DO
     {
          $q = \sigma(X_w^T \theta_{j-1}^w)$

          $g = \eta [1 - d_j^w - q]$

          $e := e + g \theta_{j-1}^w$

          $\theta_{j-1}^w := \theta_{j-1}^w + g X_w$
     }

     FOR $u \in Context(w)$ DO
     {
          $v(u) := v(u) + e$
     }
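
    The pseudocode above translates almost line for line into Python. The following is a minimal sketch using plain lists, not the original word2vec C implementation; the data layout (`vectors` as a dict, `thetas` as a list indexed by inner-node id) and all toy values are assumptions made for illustration.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def context_mean(vectors: Dict[str, List[float]], context: List[str]) -> List[float]:
    """X_w: average of the context word vectors."""
    m = len(vectors[context[0]])
    return [sum(vectors[u][i] for u in context) / len(context) for i in range(m)]


def cbow_hs_log_prob(vectors, thetas, context, code, inner_ids) -> float:
    """log p(w | Context(w)) along the Huffman path of w."""
    x_w = context_mean(vectors, context)
    total = 0.0
    for d, node in zip(code, inner_ids):
        q = sigmoid(dot(x_w, thetas[node]))
        total += math.log(q) if d == 0 else math.log(1.0 - q)
    return total


def cbow_hs_step(vectors, thetas, context, code, inner_ids, eta) -> None:
    """One in-place gradient-ascent update for the sample (Context(w), w)."""
    x_w = context_mean(vectors, context)
    e = [0.0] * len(x_w)
    for d, node in zip(code, inner_ids):
        theta = thetas[node]
        q = sigmoid(dot(x_w, theta))
        g = eta * (1 - d - q)
        for i in range(len(x_w)):
            e[i] += g * theta[i]    # accumulate with theta before its update
            theta[i] += g * x_w[i]
    for u in context:               # every context word receives the same e
        for i in range(len(e)):
            vectors[u][i] += e[i]


if __name__ == "__main__":
    vectors = {"a": [0.1, -0.2], "b": [0.3, 0.4]}  # hypothetical v(u)
    thetas = [[0.2, 0.1], [-0.3, 0.5]]             # hypothetical inner nodes
    args = (vectors, thetas, ["a", "b"], [0, 1], [0, 1])
    print("before:", cbow_hs_log_prob(*args))
    cbow_hs_step(*args, eta=0.1)
    print("after: ", cbow_hs_log_prob(*args))
```

    With a small learning rate, the log-probability of the sample should be higher after the step than before it, which is what gradient ascent on the log-likelihood is meant to achieve.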


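    The meaning of the pseudocode above deserves some explanation. It can of course be understood directly from the derivation of the gradients, but the derivation does not convey what the steps actually mean. Similar passages later in this article are not explained again.

    1. $\sigma(X_w^T\theta_{j-1}^w)$

        Given the context, a classification is performed at each step along the Huffman path of the current word: at the parent node with vector $\theta_{j-1}^w$, predict the class label of its child node $p_j^w$. The predicted label is a real number in $(0, 1)$ rather than a hard label in $\{0, 1\}$, and the gap between this value and the true label (0 or 1) is the prediction error. Equivalently, $\sigma(X_w^T\theta_{j-1}^w)$ can be read as the probability that the child node $p_j^w$ belongs to the positive class.

    2. $1 - d_j^w - q$

        $1 - d_j^w$ is the true class label of the child node $p_j^w$, so $1 - d_j^w - q$ is the error between the true label and the predicted label.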

    3. $e := e + g\theta_{j-1}^w$

        This is the key step. Recall the objective: we maximize the log-likelihood $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$, so the optimization uses gradient ascent (machine learning more commonly uses gradient descent). That is why $e$ is updated by addition (gradient descent would subtract).

        When $g > 0$, the contribution $g\theta_{j-1}^w$ added to $e$ points along $\theta_{j-1}^w$. Considered on its own, once it is added to each $v(u)$ it moves $X_w$ in the direction of $\theta_{j-1}^w$, so $X_w^T\theta_{j-1}^w$ increases and $\sigma(X_w^T\theta_{j-1}^w)$ increases with it: the predicted label moves closer to 1 (the positive class), i.e. the predicted probability of the positive class grows. Why should $\sigma(X_w^T\theta_{j-1}^w)$ increase? Since $0 < q < 1$, the factor $1 - d_j^w - q$ can be positive only when $d_j^w = 0$, and $d_j^w = 0$ denotes the positive class with label 1, so the optimization should push $\sigma(X_w^T\theta_{j-1}^w)$ towards 1.

        Likewise, when $g < 0$, the contribution $g\theta_{j-1}^w$ points against $\theta_{j-1}^w$. Once it is added to each $v(u)$, $X_w^T\theta_{j-1}^w$ decreases and $\sigma(X_w^T\theta_{j-1}^w)$ decreases with it: the predicted label moves closer to 0 (the negative class), i.e. the predicted probability of the positive class shrinks. Again, since $0 < q < 1$, the factor $1 - d_j^w - q$ can be negative only when $d_j^w = 1$, and $d_j^w = 1$ denotes the negative class with label 0, so the optimization should push $\sigma(X_w^T\theta_{j-1}^w)$ towards 0.

    4. $\theta_{j-1}^w := \theta_{j-1}^w + gX_w$

        The same reasoning applies with the roles of $\theta_{j-1}^w$ and $X_w$ exchanged: the update moves $\theta_{j-1}^w$ along $X_w$ when $g > 0$ and against it when $g < 0$.

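    5. FOR $j = 2 : l^w$ DO

        The loop reflects the fact that the context predicts a word located at a leaf node (the current word sits at a leaf), and that leaf can only be reached by walking the word's Huffman path. The loop therefore accumulates the error of every classification made along the path (the root node itself is not classified).

    Summary: given the context words, traverse the Huffman path of the current word, accumulate the binary classification error at every node on the path (the root node excluded), and propagate the accumulated error back to every context word (the auxiliary vectors of the nodes on the path are updated along the way).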


2. Negative Sampling

    For the word $w$ with negative-sample subset $NEG(w)$, define the label of each sample $\tilde{w} \in \{w\} \cup NEG(w)$:

         $L^w(\tilde{w}) = \begin{cases} 1, & w = \tilde{w} \\ 0, & w \neq \tilde{w} \end{cases}$

    Likelihood to maximize: $\prod_{w \in C} p(w|Context(w))$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \log p(w|Context(w))$

    Conditional probability: $p(w|Context(w)) = \prod_{u \in \{w\} \cup NEG(w)} p(u|Context(w))$, where:

         $p(u|Context(w)) = \begin{cases} \sigma(X_w^T \theta^u), & L^w(u) = 1 \\ 1 - \sigma(X_w^T \theta^u), & L^w(u) = 0 \end{cases}$

         $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$


    Written as a single expression: $p(u|Context(w)) = [\sigma(X_w^T \theta^u)]^{L^w(u)} \cdot [1 - \sigma(X_w^T \theta^u)]^{1 - L^w(u)}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w)} p(u|Context(w))$

        $= \sum_{w \in C} \log \prod_{u \in \{w\} \cup NEG(w)} [\sigma(X_w^T \theta^u)]^{L^w(u)} \cdot [1 - \sigma(X_w^T \theta^u)]^{1 - L^w(u)}$

        $= \sum_{w \in C} \sum_{u \in \{w\} \cup NEG(w)} \left\{ L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)] \right\}$
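
    For a single sample, the inner sum of this log-likelihood is just a handful of binary cross-entropy terms: one for the positive sample and one per negative sample. The sketch below evaluates it for toy data; the function name, the dict-based layout and the values are illustrative assumptions, and the negative samples are simply given rather than drawn from any distribution.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def ns_log_likelihood(
    x_w: List[float],
    thetas: Dict[str, List[float]],
    target: str,
    negatives: Sequence[str],
) -> float:
    """Sum over u in {w} + NEG(w) of the per-sample log-likelihood term."""
    total = 0.0
    for u in [target] + list(negatives):
        label = 1 if u == target else 0  # L^w(u)
        q = sigmoid(dot(x_w, thetas[u]))
        total += label * math.log(q) + (1 - label) * math.log(1.0 - q)
    return total


if __name__ == "__main__":
    x_w = [0.2, 0.1]  # hypothetical averaged context vector
    thetas = {"w": [0.2, 0.1], "n1": [-0.3, 0.5], "n2": [0.4, 0.2]}
    print(ns_log_likelihood(x_w, thetas, "w", ["n1", "n2"]))
```

    The cost of one sample grows with the number of negative samples rather than with the vocabulary size or the depth of a Huffman tree.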


    To simplify differentiation, let: $\mathcal{L}(w, u) = L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)]$

     The gradient of $\mathcal{L}(w, u)$ with respect to $\theta^u$:

         $\frac{\partial \mathcal{L}(w, u)}{\partial \theta^u} = \frac{\partial}{\partial \theta^u} \left\{ L^w(u) \cdot \log[\sigma(X_w^T \theta^u)] + [1 - L^w(u)] \cdot \log[1 - \sigma(X_w^T \theta^u)] \right\}$

                 $= L^w(u) \cdot [1 - \sigma(X_w^T \theta^u)] X_w - [1 - L^w(u)] \cdot [\sigma(X_w^T \theta^u)] X_w$

                 $= \left\{ L^w(u) \cdot [1 - \sigma(X_w^T \theta^u)] - [1 - L^w(u)] \cdot [\sigma(X_w^T \theta^u)] \right\} X_w$

                 $= [L^w(u) - \sigma(X_w^T \theta^u)] X_w$

    Hence the update of $\theta^u$ can be written as:

         $\theta^u := \theta^u + \eta [L^w(u) - \sigma(X_w^T \theta^u)] X_w$


    Since $\theta^u$ and $X_w$ play symmetric roles in $\mathcal{L}(w, u)$, the gradient of $\mathcal{L}(w, u)$ with respect to $X_w$ is:

         $\frac{\partial \mathcal{L}(w, u)}{\partial X_w} = [L^w(u) - \sigma(X_w^T \theta^u)] \theta^u$

     Use $\frac{\partial \mathcal{L}(w, u)}{\partial X_w}$ to update the context word vectors $v(\tilde{u}), \tilde{u} \in Context(w)$ (the context word is written $\tilde{u}$ here because $u$ already indexes the samples):

         $v(\tilde{u}) := v(\tilde{u}) + \eta \sum_{u \in \{w\} \cup NEG(w)} \frac{\partial \mathcal{L}(w, u)}{\partial X_w}, \quad \tilde{u} \in Context(w)$


    Taking the sample $(Context(w), w)$ as an example, the training pseudocode is:

     $e = 0$

     $X_w = \frac{\sum_{u \in Context(w)} v(u)}{|Context(w)|}$

     FOR $u \in \{w\} \cup NEG(w)$ DO
     {
          $q = \sigma(X_w^T \theta^u)$

          $g = \eta [L^w(u) - q]$

          $e := e + g \theta^u$

          $\theta^u := \theta^u + g X_w$
     }

     FOR $u \in Context(w)$ DO
     {
          $v(u) := v(u) + e$
     }
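
    As with Hierarchical Softmax, the pseudocode maps directly onto a short Python function. This is a sketch under stated assumptions: plain Python lists, a dict `thetas` keyed by word, negative samples supplied by the caller, and made-up toy values. It is not the original C implementation.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def context_mean(vectors: Dict[str, List[float]], context: Sequence[str]) -> List[float]:
    m = len(vectors[context[0]])
    return [sum(vectors[u][i] for u in context) / len(context) for i in range(m)]


def cbow_ns_objective(vectors, thetas, context, target, negatives) -> float:
    """Per-sample log-likelihood over {w} + NEG(w)."""
    x_w = context_mean(vectors, context)
    total = 0.0
    for u in [target] + list(negatives):
        q = sigmoid(dot(x_w, thetas[u]))
        total += math.log(q) if u == target else math.log(1.0 - q)
    return total


def cbow_ns_step(vectors, thetas, context, target, negatives, eta) -> None:
    """One in-place update for the sample (Context(w), w)."""
    x_w = context_mean(vectors, context)
    e = [0.0] * len(x_w)
    for u in [target] + list(negatives):
        label = 1 if u == target else 0  # L^w(u)
        theta = thetas[u]
        g = eta * (label - sigmoid(dot(x_w, theta)))
        for i in range(len(x_w)):
            e[i] += g * theta[i]         # uses theta before its update
            theta[i] += g * x_w[i]
    for u in context:
        for i in range(len(e)):
            vectors[u][i] += e[i]


if __name__ == "__main__":
    vectors = {"a": [0.1, -0.2], "b": [0.3, 0.4]}  # hypothetical v(u)
    thetas = {"w": [0.2, 0.1], "n1": [-0.3, 0.5], "n2": [0.4, 0.2]}
    args = (vectors, thetas, ["a", "b"], "w", ["n1", "n2"])
    print("before:", cbow_ns_objective(*args))
    cbow_ns_step(*args, eta=0.1)
    print("after: ", cbow_ns_objective(*args))
```

    The only structural difference from the Hierarchical Softmax step is the set being looped over: samples with labels $L^w(u)$ instead of path nodes with labels $1 - d_j^w$.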


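    Summary: given the context words, draw one set of negative samples for the current word (the set also includes the current word itself, which is the positive sample), iterate over the samples, accumulate the context's prediction error on each sample, and propagate the accumulated error back to every context word (the sample vectors are updated along the way).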


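III. Skip-gram

    Skip-gram predicts the context words from the current word and propagates the prediction error back to the current word, so that the prediction becomes more accurate. However, word2vec does not actually train this way. It still follows the CBOW idea: each individual context word (note the difference: CBOW merges the context into $\sum_{u \in Context(w)} v(u)$) is used to predict the current word, and the prediction error is propagated back to that context word.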

1. Hierarchical Softmax

    Likelihood to maximize: $\prod_{w \in C} p(Context(w)|w)$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \log p(Context(w)|w)$

    Conditional probability: $p(Context(w)|w) = \prod_{u \in Context(w)} p(u|w)$, where:

         $p(u|w) = \prod_{j=2}^{l^u} p(d_j^u|v(w), \theta_{j-1}^u)$

         $p(d_j^u|v(w), \theta_{j-1}^u) = \begin{cases} \sigma(v(w)^T \theta_{j-1}^u), & d_j^u = 0 \\ 1 - \sigma(v(w)^T \theta_{j-1}^u), & d_j^u = 1 \end{cases}$


    Written as a single expression: $p(d_j^u|v(w), \theta_{j-1}^u) = [\sigma(v(w)^T \theta_{j-1}^u)]^{1 - d_j^u} \cdot [1 - \sigma(v(w)^T \theta_{j-1}^u)]^{d_j^u}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{j=2}^{l^u} p(d_j^u|v(w), \theta_{j-1}^u)$

        $= \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{j=2}^{l^u} [\sigma(v(w)^T \theta_{j-1}^u)]^{1 - d_j^u} \cdot [1 - \sigma(v(w)^T \theta_{j-1}^u)]^{d_j^u}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l^u} \left\{ (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)] \right\}$

    To simplify differentiation, let: $\mathcal{L}(w, u, j) = (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)]$

     The gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta_{j-1}^u$:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^u} = \frac{\partial}{\partial \theta_{j-1}^u} \left\{ (1 - d_j^u) \cdot \log[\sigma(v(w)^T \theta_{j-1}^u)] + d_j^u \cdot \log[1 - \sigma(v(w)^T \theta_{j-1}^u)] \right\}$

                 $= (1 - d_j^u)[1 - \sigma(v(w)^T \theta_{j-1}^u)] v(w) - d_j^u [\sigma(v(w)^T \theta_{j-1}^u)] v(w)$

                 $= \left\{ (1 - d_j^u)[1 - \sigma(v(w)^T \theta_{j-1}^u)] - d_j^u [\sigma(v(w)^T \theta_{j-1}^u)] \right\} v(w)$

                 $= [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] v(w)$

    Hence the update of $\theta_{j-1}^u$ can be written as:

         $\theta_{j-1}^u := \theta_{j-1}^u + \eta [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] v(w)$

    Since $\theta_{j-1}^u$ and $v(w)$ play symmetric roles in $\mathcal{L}(w, u, j)$, the gradient of $\mathcal{L}(w, u, j)$ with respect to $v(w)$ is:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)} = [1 - d_j^u - \sigma(v(w)^T \theta_{j-1}^u)] \theta_{j-1}^u$

     Use $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)}$ to update the current word vector $v(w)$:

         $v(w) := v(w) + \eta \sum_{u \in Context(w)} \sum_{j=2}^{l^u} \frac{\partial \mathcal{L}(w, u, j)}{\partial v(w)}$


    Taking the sample $(w, Context(w))$ as an example, the training pseudocode is:

     $e = 0$

     FOR $u \in Context(w)$ DO
     {
          FOR $j = 2 : l^u$ DO
          {
               $q = \sigma(v(w)^T \theta_{j-1}^u)$

               $g = \eta [1 - d_j^u - q]$

               $e := e + g \theta_{j-1}^u$

               $\theta_{j-1}^u := \theta_{j-1}^u + g v(w)$
          }
     }

     $v(w) := v(w) + e$
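
    The pseudocode above, in which the current word predicts each context word, can be sketched in Python as follows. Each context word is represented here only by its Huffman path, given as a pair `(code, inner_ids)`; this representation, the function names and the toy values are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

Path = Tuple[Sequence[int], Sequence[int]]  # (code bits, inner-node ids)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def skipgram_hs_objective(v_w, thetas, context_paths: Sequence[Path]) -> float:
    """Sum over context words u of log p(u | w)."""
    total = 0.0
    for code, inner_ids in context_paths:
        for d, node in zip(code, inner_ids):
            q = sigmoid(dot(v_w, thetas[node]))
            total += math.log(q) if d == 0 else math.log(1.0 - q)
    return total


def skipgram_hs_step(v_w, thetas, context_paths: Sequence[Path], eta: float) -> None:
    """One in-place update for the sample (w, Context(w))."""
    e = [0.0] * len(v_w)
    for code, inner_ids in context_paths:      # loop over u in Context(w)
        for d, node in zip(code, inner_ids):   # loop over the path of u
            theta = thetas[node]
            g = eta * (1 - d - sigmoid(dot(v_w, theta)))
            for i in range(len(v_w)):
                e[i] += g * theta[i]
                theta[i] += g * v_w[i]
    for i in range(len(v_w)):                  # v(w) is updated once, at the end
        v_w[i] += e[i]


if __name__ == "__main__":
    v_w = [0.2, 0.1]                           # hypothetical v(w)
    thetas = [[0.2, 0.1], [-0.3, 0.5]]         # hypothetical inner nodes
    paths = [([0], [0]), ([1, 0], [0, 1])]     # two hypothetical context words
    print("before:", skipgram_hs_objective(v_w, thetas, paths))
    skipgram_hs_step(v_w, thetas, paths, eta=0.1)
    print("after: ", skipgram_hs_objective(v_w, thetas, paths))
```

    Note that $e$ is initialized once outside both loops, so $v(w)$ stays fixed while all context words are processed and receives a single accumulated update.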


    Note that word2vec does not actually train with the procedure above. It still follows the CBOW idea: each context word is used to predict the current word. The analysis is as follows:

    Likelihood to maximize: $\prod_{w \in C} \prod_{u \in Context(w)} p(w|u)$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log p(w|u)$

    Conditional probability: $p(w|u) = \prod_{j=2}^{l^w} p(d_j^w|v(u), \theta_{j-1}^w)$, where:

         $p(d_j^w|v(u), \theta_{j-1}^w) = \begin{cases} \sigma(v(u)^T \theta_{j-1}^w), & d_j^w = 0 \\ 1 - \sigma(v(u)^T \theta_{j-1}^w), & d_j^w = 1 \end{cases}$


    Written as a single expression: $p(d_j^w|v(u), \theta_{j-1}^w) = [\sigma(v(u)^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(v(u)^T \theta_{j-1}^w)]^{d_j^w}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{j=2}^{l^w} p(d_j^w|v(u), \theta_{j-1}^w)$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{j=2}^{l^w} [\sigma(v(u)^T \theta_{j-1}^w)]^{1 - d_j^w} \cdot [1 - \sigma(v(u)^T \theta_{j-1}^w)]^{d_j^w}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{j=2}^{l^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)] \right\}$


    To simplify differentiation, let: $\mathcal{L}(w, u, j) = (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)]$

     The gradient of $\mathcal{L}(w, u, j)$ with respect to $\theta_{j-1}^w$:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial \theta_{j-1}^w} = \frac{\partial}{\partial \theta_{j-1}^w} \left\{ (1 - d_j^w) \cdot \log[\sigma(v(u)^T \theta_{j-1}^w)] + d_j^w \cdot \log[1 - \sigma(v(u)^T \theta_{j-1}^w)] \right\}$

                 $= (1 - d_j^w)[1 - \sigma(v(u)^T \theta_{j-1}^w)] v(u) - d_j^w [\sigma(v(u)^T \theta_{j-1}^w)] v(u)$

                 $= \left\{ (1 - d_j^w)[1 - \sigma(v(u)^T \theta_{j-1}^w)] - d_j^w [\sigma(v(u)^T \theta_{j-1}^w)] \right\} v(u)$

                 $= [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] v(u)$

    Hence the update of $\theta_{j-1}^w$ can be written as:

         $\theta_{j-1}^w := \theta_{j-1}^w + \eta [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] v(u)$


    Since $\theta_{j-1}^w$ and $v(u)$ play symmetric roles in $\mathcal{L}(w, u, j)$, the gradient of $\mathcal{L}(w, u, j)$ with respect to $v(u)$ is:

         $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)} = [1 - d_j^w - \sigma(v(u)^T \theta_{j-1}^w)] \theta_{j-1}^w$

     Use $\frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)}$ to update the context word vectors $v(u), u \in Context(w)$:

         $v(u) := v(u) + \eta \sum_{j=2}^{l^w} \frac{\partial \mathcal{L}(w, u, j)}{\partial v(u)}$


    Taking the sample $(w, Context(w))$ as an example, the training pseudocode is:

     FOR $u \in Context(w)$ DO
     {
          $e = 0$

          FOR $j = 2 : l^w$ DO
          {
               $q = \sigma(v(u)^T \theta_{j-1}^w)$

               $g = \eta [1 - d_j^w - q]$

               $e := e + g \theta_{j-1}^w$

               $\theta_{j-1}^w := \theta_{j-1}^w + g v(u)$
          }

          $v(u) := v(u) + e$
     }
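
    A Python sketch of this second procedure is given below. Compared with the first Skip-gram pseudocode, the path being walked belongs to the current word $w$, and $e$ is reset and applied once per context word. The data layout and toy values are illustrative assumptions, and this is not the original C implementation.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def skipgram_hs_objective(vectors, thetas, context, code, inner_ids) -> float:
    """Sum over context words u of log p(w | u), along the path of w."""
    total = 0.0
    for u in context:
        for d, node in zip(code, inner_ids):
            q = sigmoid(dot(vectors[u], thetas[node]))
            total += math.log(q) if d == 0 else math.log(1.0 - q)
    return total


def skipgram_hs_step(
    vectors: Dict[str, List[float]],
    thetas: List[List[float]],
    context: Sequence[str],
    code: Sequence[int],
    inner_ids: Sequence[int],
    eta: float,
) -> None:
    """One in-place update for the sample (w, Context(w))."""
    for u in context:
        v_u = vectors[u]
        e = [0.0] * len(v_u)                  # reset for every context word
        for d, node in zip(code, inner_ids):  # Huffman path of w
            theta = thetas[node]
            g = eta * (1 - d - sigmoid(dot(v_u, theta)))
            for i in range(len(v_u)):
                e[i] += g * theta[i]
                theta[i] += g * v_u[i]
        for i in range(len(v_u)):             # update this context word only
            v_u[i] += e[i]


if __name__ == "__main__":
    vectors = {"a": [0.1, -0.2], "b": [0.3, 0.4]}  # hypothetical v(u)
    thetas = [[0.2, 0.1], [-0.3, 0.5]]             # hypothetical inner nodes
    args = (vectors, thetas, ["a", "b"], [0, 1], [0, 1])
    print("before:", skipgram_hs_objective(*args))
    skipgram_hs_step(*args, eta=0.1)
    print("after: ", skipgram_hs_objective(*args))
```

    Each pass of the outer loop is a CBOW step whose context consists of the single word $u$, which is the sense in which Skip-gram training here follows the CBOW idea.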


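    Summary: for each context word (which is used to predict the current word), traverse the Huffman path of the current word, accumulate the binary classification error at every node on the path (the root node excluded), and propagate the accumulated error back to that context word (the auxiliary vectors of the nodes on the path are updated along the way).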


2. Negative Sampling

    For a word $w$ with negative-sample subset $NEG(w)$, define the label of each sample $\tilde{w} \in \{w\} \cup NEG(w)$ (the label $L^u(z)$ used for a context word $u$ is defined in the same way, with $u$ in place of $w$):

         $L^w(\tilde{w}) = \begin{cases} 1, & w = \tilde{w} \\ 0, & w \neq \tilde{w} \end{cases}$

    Likelihood to maximize: $\prod_{w \in C} p(Context(w)|w)$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \log p(Context(w)|w)$

    Conditional probability: $p(Context(w)|w) = \prod_{u \in Context(w)} p(u|w)$, where:

         $p(u|w) = \prod_{z \in \{u\} \cup NEG(u)} p(z|w)$

         $p(z|w) = \begin{cases} \sigma(v(w)^T \theta^z), & L^u(z) = 1 \\ 1 - \sigma(v(w)^T \theta^z), & L^u(z) = 0 \end{cases}$


    Written as a single expression: $p(z|w) = [\sigma(v(w)^T \theta^z)]^{L^u(z)} \cdot [1 - \sigma(v(w)^T \theta^z)]^{1 - L^u(z)}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)} p(z|w)$

        $= \sum_{w \in C} \log \prod_{u \in Context(w)} \prod_{z \in \{u\} \cup NEG(u)} [\sigma(v(w)^T \theta^z)]^{L^u(z)} \cdot [1 - \sigma(v(w)^T \theta^z)]^{1 - L^u(z)}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{z \in \{u\} \cup NEG(u)} \left\{ L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)] \right\}$


    To simplify differentiation, let: $\mathcal{L}(w, u, z) = L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)]$

     The gradient of $\mathcal{L}(w, u, z)$ with respect to $\theta^z$:

         $\frac{\partial \mathcal{L}(w, u, z)}{\partial \theta^z} = \frac{\partial}{\partial \theta^z} \left\{ L^u(z) \cdot \log[\sigma(v(w)^T \theta^z)] + [1 - L^u(z)] \cdot \log[1 - \sigma(v(w)^T \theta^z)] \right\}$

                 $= L^u(z) \cdot [1 - \sigma(v(w)^T \theta^z)] v(w) - [1 - L^u(z)] \cdot [\sigma(v(w)^T \theta^z)] v(w)$

                 $= \left\{ L^u(z) \cdot [1 - \sigma(v(w)^T \theta^z)] - [1 - L^u(z)] \cdot [\sigma(v(w)^T \theta^z)] \right\} v(w)$

                 $= [L^u(z) - \sigma(v(w)^T \theta^z)] v(w)$

    Hence the update of $\theta^z$ can be written as:

         $\theta^z := \theta^z + \eta [L^u(z) - \sigma(v(w)^T \theta^z)] v(w)$


    Since $\theta^z$ and $v(w)$ play symmetric roles in $\mathcal{L}(w, u, z)$, the gradient of $\mathcal{L}(w, u, z)$ with respect to $v(w)$ is:

         $\frac{\partial \mathcal{L}(w, u, z)}{\partial v(w)} = [L^u(z) - \sigma(v(w)^T \theta^z)] \theta^z$

     Use $\frac{\partial \mathcal{L}(w, u, z)}{\partial v(w)}$ to update the current word vector $v(w)$:

         $v(w) := v(w) + \eta \sum_{u \in Context(w)} \sum_{z \in \{u\} \cup NEG(u)} \frac{\partial \mathcal{L}(w, u, z)}{\partial v(w)}$


    Taking the sample $(w, Context(w))$ as an example, the training pseudocode is:

     $e = 0$

     FOR $u \in Context(w)$ DO
     {
          FOR $z \in \{u\} \cup NEG(u)$ DO
          {
               $q = \sigma(v(w)^T \theta^z)$

               $g = \eta [L^u(z) - q]$

               $e := e + g \theta^z$

               $\theta^z := \theta^z + g v(w)$
          }
     }

     $v(w) := v(w) + e$
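
    The pseudocode above can be sketched in Python as follows. The negative samples of each context word are passed in as a dict `negatives`; that layout, the function names and the toy values are assumptions for illustration, and how the negatives are drawn is outside the scope of this sketch.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def skipgram_ns_objective(v_w, thetas, context, negatives) -> float:
    """Sum over u in Context(w) and z in {u} + NEG(u) of the log-likelihood term."""
    total = 0.0
    for u in context:
        for z in [u] + list(negatives[u]):
            q = sigmoid(dot(v_w, thetas[z]))
            total += math.log(q) if z == u else math.log(1.0 - q)
    return total


def skipgram_ns_step(
    v_w: List[float],
    thetas: Dict[str, List[float]],
    context: Sequence[str],
    negatives: Dict[str, Sequence[str]],
    eta: float,
) -> None:
    """One in-place update for the sample (w, Context(w))."""
    e = [0.0] * len(v_w)
    for u in context:
        for z in [u] + list(negatives[u]):
            label = 1 if z == u else 0  # L^u(z)
            theta = thetas[z]
            g = eta * (label - sigmoid(dot(v_w, theta)))
            for i in range(len(v_w)):
                e[i] += g * theta[i]
                theta[i] += g * v_w[i]
    for i in range(len(v_w)):           # v(w) is updated once, at the end
        v_w[i] += e[i]


if __name__ == "__main__":
    v_w = [0.2, 0.1]                    # hypothetical v(w)
    thetas = {"u1": [0.2, 0.1], "u2": [0.1, 0.3],
              "n1": [-0.3, 0.5], "n2": [0.4, 0.2]}
    negatives = {"u1": ["n1"], "u2": ["n2"]}
    args = (v_w, thetas, ["u1", "u2"], negatives)
    print("before:", skipgram_ns_objective(*args))
    skipgram_ns_step(*args, eta=0.1)
    print("after: ", skipgram_ns_objective(*args))
```

    Here the positive sample in each inner loop is the context word $u$ itself, and every $\theta^z$ is trained against the single vector $v(w)$.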


    Likewise, word2vec does not actually train with the procedure above either. It again follows the CBOW idea: each context word is used to predict the current word. The analysis is as follows:

    Likelihood to maximize: $\prod_{w \in C} \prod_{u \in Context(w)} p(w|u)$

    Log-likelihood to maximize: $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log p(w|u)$

    Conditional probability: $p(w|u) = \prod_{z \in \{w\} \cup NEG(w)} p(z|u)$, where:

         $p(z|u) = \begin{cases} \sigma(v(u)^T \theta^z), & L^w(z) = 1 \\ 1 - \sigma(v(u)^T \theta^z), & L^w(z) = 0 \end{cases}$


    Written as a single expression: $p(z|u) = [\sigma(v(u)^T \theta^z)]^{L^w(z)} \cdot [1 - \sigma(v(u)^T \theta^z)]^{1 - L^w(z)}$. Substituting into the log-likelihood gives:

     $\mathcal{L} = \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{z \in \{w\} \cup NEG(w)} p(z|u)$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \log \prod_{z \in \{w\} \cup NEG(w)} [\sigma(v(u)^T \theta^z)]^{L^w(z)} \cdot [1 - \sigma(v(u)^T \theta^z)]^{1 - L^w(z)}$

        $= \sum_{w \in C} \sum_{u \in Context(w)} \sum_{z \in \{w\} \cup NEG(w)} \left\{ L^w(z) \cdot \log[\sigma(v(u)^T \theta^z)] + [1 - L^w(z)] \cdot \log[1 - \sigma(v(u)^T \theta^z)] \right\}$
