



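In NLP, the most basic problem is how to represent the meaning of a word or a sentence. The methods introduced below each have their own strengths and weaknesses, and each one improves on those before it.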


WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.


  1. Can find synonyms.


  1. Missing new words (impossible to keep up to date).
  2. Subjective.
  3. Requires human labor to create and adapt.

One Hot Encoding

Discrete representation.


  1. Dimension is extremely high.
  2. Hard to compute accurate word similarity (all vectors are orthogonal).
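To make the orthogonality point concrete, here is a minimal sketch in plain Python. The toy vocabulary and the helper names `one_hot` and `dot` are illustrative assumptions, not part of the original notes.

```python
# Toy vocabulary (hypothetical); real vocabularies are far larger.
vocab = ["hotel", "motel", "banking"]

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

hotel, motel = one_hot("hotel", vocab), one_hot("motel", vocab)
sim_same = dot(hotel, hotel)   # a word compared with itself
sim_diff = dot(hotel, motel)   # two different words, however related in meaning
```

Any two distinct words get an inner product of zero, so the representation carries no notion of similarity, and the vector length equals the vocabulary size.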

Bag of Words

Co-occurrence of words with variable window size.


  1. Dimension is extremely high and grows with the dictionary, which affects downstream ML models.
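As a rough illustration of window-based co-occurrence counting, the sketch below builds a small count matrix. The function name `cooccurrence`, the toy sentence, and the window size are assumptions chosen for the example.

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(center, tokens[j])] += 1
    return counts

tokens = "i like deep learning i like nlp".split()  # toy corpus
counts = cooccurrence(tokens, window=1)
vocab = sorted(set(tokens))
# Dense |V| x |V| matrix: its size grows quadratically with the vocabulary.
matrix = [[counts[(r, c)] for c in vocab] for r in vocab]
```

Even for this seven-token corpus the matrix is mostly zeros, which hints at why the representation becomes unwieldy as the dictionary grows.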


A neural probabilistic language model.

Distributional similarity based representations. Represent a word by means of its neighbors: the context is enough to help understand what a word means.

We will build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.

Distributional similarity & Distributed representation (dense vector)

There are certain differences between the two. Distributional similarity emphasizes that the meaning of a word can be inferred from its context. Distributed representation is the opposite of one-hot encoding and emphasizes that the vector representation is dense (non-sparse).


  1. Can compute accurate word similarity.


  1. What is learned is an association-based word vector rather than a sense-level semantic vector, so polysemy cannot be handled (one vector per word instead of one per meaning).

Loss Function

Softmax function: the standard map from $R^V$ to a probability distribution. The exponential in the numerator makes every value positive, and the denominator normalizes so that all probabilities sum to 1.

$$p_i = \frac{\exp(u_i)}{\sum_{j} \exp(u_j)}$$

After obtaining the probability distribution over center/context words, we still need cross entropy to get the loss.

$$L(\hat y, y) = -\sum_{j=1}^V y_j \log(\hat y_j)$$
According to this formula, the loss is 0 when the prediction is perfect.
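The two formulas above can be sketched in a few lines of plain Python. The function names and the example scores are assumptions for illustration; subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Map real-valued scores to a probability distribution."""
    m = max(scores)                           # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_j y_j * log(y_hat_j), with y a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(y_hat, y) if t > 0)

probs = softmax([2.0, 1.0, 0.1])              # toy scores
loss = cross_entropy(probs, [1, 0, 0])        # true class is index 0
perfect = cross_entropy([1.0, 0.0, 0.0], [1, 0, 0])
```

With a one-hot label, the loss reduces to the negative log probability assigned to the true class, which is why a perfect prediction gives zero.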


$$J = 1 - p(w_{-t} \mid w_t)$$

$w_{-t}$ denotes the context of $w_t$ (the minus sign means everything except that word).

$$p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}$$

$o$ is the outside (or output) word index, $c$ is the center word index. $v_c$ and $u_o$ are the center and outside vectors of indices $c$ and $o$. Softmax uses word $c$ to obtain the probability of word $o$.

According to this formula, each word in the text is represented by two vectors: one for when it is a center word and another for when it is a context word.

$$
\begin{aligned}
\frac{\partial}{\partial v_c} \log p(o \mid c)
&= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)} \\
&= \underbrace{\frac{\partial}{\partial v_c} \log \exp(u_o^T v_c)}_{\text{①}} - \underbrace{\frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^T v_c)}_{\text{②}} \\
&= u_o - \sum_{x=1}^V p(x \mid c)\, u_x
\end{aligned}
$$
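① is the observation: which context word actually occurred (the true label). ② is the expectation: the average of all outside vectors $u_x$ weighted by the model's predicted probabilities $p(x \mid c)$ (the prediction). So in effect we want to minimize the difference between what was observed and what the model predicts.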
$$
\begin{aligned}
\text{②} &= \frac{\partial}{\partial v_c} \log \sum_{w=1}^V \exp(u_w^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \frac{\partial}{\partial v_c} \sum_{x=1}^V \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \frac{\partial}{\partial v_c} \exp(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \exp(u_x^T v_c) \frac{\partial}{\partial v_c}(u_x^T v_c) \\
&= \frac{1}{\sum_{w=1}^V \exp(u_w^T v_c)} \sum_{x=1}^V \exp(u_x^T v_c)\, u_x \\
&= \sum_{x=1}^V \frac{\exp(u_x^T v_c)}{\sum_{w=1}^V \exp(u_w^T v_c)}\, u_x \\
&= \sum_{x=1}^V p(x \mid c)\, u_x
\end{aligned}
$$
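One way to sanity-check the result $u_o - \sum_x p(x \mid c)\, u_x$ is to compare it with a finite-difference gradient of $\log p(o \mid c)$. The sketch below does this on random toy vectors; the sizes, the seed, and the helper names are assumptions made only for the check.

```python
import math
import random

random.seed(0)
V, d = 5, 3                                    # toy vocabulary size and dimension
U = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]  # outside vectors u_w
v_c = [random.uniform(-1, 1) for _ in range(d)]                    # center vector
o = 2                                          # index of the observed outside word

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_p(word, v):
    """log p(word | c) under the softmax over u_w^T v."""
    scores = [dot(u, v) for u in U]
    return scores[word] - math.log(sum(math.exp(s) for s in scores))

# Analytic gradient: u_o - sum_x p(x|c) u_x
p = [math.exp(log_p(x, v_c)) for x in range(V)]
analytic = [U[o][k] - sum(p[x] * U[x][k] for x in range(V)) for k in range(d)]

# Central finite differences on each coordinate of v_c
eps = 1e-6
numeric = []
for k in range(d):
    plus, minus = list(v_c), list(v_c)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((log_p(o, plus) - log_p(o, minus)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the derivation is right, `max_err` should be tiny, limited only by floating-point error in the finite differences.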

When we optimize with SGD, each window contains at most $2m+1$ words, so $\nabla_{\theta} J_{t}(\theta)$ is very sparse.


There are two training algorithms: the skip-gram model and the continuous bag-of-words model.


Skip-gram (SG): predict context words given the target (position independent).

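In this example, "into" is the target (center word), while "problems turning" and "banking crises" are the output context words. Suppose the sentence contains $T$ words in total. We define the window size (the radius of the predicted context) as $m$; in this example $m=2$.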


Each center word is paired with each of its context words to form a training example, which is fed to the word2vec model.
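A minimal sketch of how such (center, context) training pairs could be generated for the example sentence. The function name `skipgram_pairs` is an assumption; windows are simply truncated at the sentence boundaries.

```python
def skipgram_pairs(tokens, m):
    """Return (center, context) pairs for every position t and offset j in [-m, m], j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

tokens = "problems turning into banking crises".split()
pairs = skipgram_pairs(tokens, m=2)
# Context words paired with the center word "into"
into_pairs = [ctx for center, ctx in pairs if center == "into"]
```

For "into" with $m=2$ this yields the four context words from the example, two on each side.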

Objective Function

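Given the current center word, maximize the probability of its context words. $\theta$ denotes the parameters to optimize. We are given a text sequence of length $T$ and a window size $m$.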

$$J'(\theta) = \prod_{t=1}^T \prod_{j=-m,\, j \ne 0}^m p(w_{t+j} \mid w_t; \theta)$$

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^T \sum_{j=-m,\, j \ne 0}^m \log p(w_{t+j} \mid w_t)$$

Training Process


This figure looks fancy at first glance, but it lays out the workflow clearly. $d$ is the dimension of the vectors and $V$ is the vocabulary size.

$W$ in the figure is the center word matrix, which stores the center-word vector of each word as a column, $W \in R^{d \times V}$. Each training step has a single center word, so it can be represented by a one-hot vector $w_t$. Multiplying the two gives the vector of the current center word, $v_c \in R^{d \times 1}$: $v_c = W \cdot w_t$.

$W'$ in the figure is the context word matrix, which stores the context-word vector of each word as a row, $W' \in R^{V \times d}$. Multiplying this matrix by the center word vector gives an intermediate result $v_{tmp} = W' \cdot v_c$. Applying softmax to it gives the probability of each word being a context word; this vector of size $V$ is written $p(x \mid c)$, or $y_{pred}$: $p(x \mid c) = softmax(v_{tmp})$. We want the entries of $y_{pred}$ at the indices of the true context words (4 of them in the example above) to be large and the entries at all other indices to be small.

$W$ and $W'$ are both learned during training.
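The forward pass described above can be sketched with plain Python lists. This assumes $W$ is stored as $d$ rows of length $V$ (one column per center word) and $W'$ as $V$ rows of length $d$; the sizes, seed, and helper `matvec` are illustrative assumptions.

```python
import math
import random

random.seed(1)
d, V = 3, 4                                    # toy sizes
W = [[random.uniform(-1, 1) for _ in range(V)] for _ in range(d)]        # d x V
W_prime = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]  # V x d

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

c = 2                                          # index of the center word
w_t = [1 if i == c else 0 for i in range(V)]   # one-hot center word
v_c = matvec(W, w_t)                           # selects column c of W
v_tmp = matvec(W_prime, v_c)                   # one score per vocabulary word
z = sum(math.exp(s) for s in v_tmp)
y_pred = [math.exp(s) / z for s in v_tmp]      # p(x|c) for every word x
```

Multiplying by a one-hot vector is just a column lookup, which is why practical implementations index into the embedding table instead of doing the full product.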


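As mentioned earlier, each word has two vector representations, $v$ (as center word) and $u$ (as context word). Concatenating these vectors (they can also be summed) gives the training parameter $\theta \in R^{2Vd}$. Note that $\theta$ here is one very long vector, not a matrix.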


Continuous Bag of Words (CBOW): predict the target word from its bag-of-words context.

Objective Function

Maximize the probability of the center word given its context words. $\theta$ represents all variables we will optimize. The total number of words is $T$ and the window size is $m$.

$$J'(\theta) = \prod_{t=1}^T \prod_{j=-m,\, j \ne 0}^m p(w_t \mid w_{t+j}; \theta)$$

We use negative log likelihood to turn the objective function into a loss function.

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^T \sum_{j=-m,\, j \ne 0}^m \log p(w_t \mid w_{t+j})$$

Training Process



When computing the hidden layer output, instead of directly copying the input vector of a single context word, the CBOW model takes the average of the vectors of the input context words and uses the product of the input→hidden weight matrix and the average vector as the output. $W$ in the figure is the context word matrix, which stores the context-word vector of each word as a column, $W \in R^{d \times V}$. If a training step considers only one context word, it can be represented by a one-hot vector $x_t$, and multiplying the two gives the vector of that context word, $v_{context} \in R^{d \times 1}$: $v_{context} = W \cdot x_t$. When the context contains several words, however, the average of their vectors is usually used as the input: $v_{context} = \frac{1}{2m} \sum_{j=-m,\, j \ne 0}^{m} W \cdot x_j$.
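A small sketch of the averaging step, assuming the embedding table is held as one $d$-dimensional list per word (i.e. the columns of $W$). The function name and the toy numbers are hypothetical.

```python
def average_context(W_cols, context_ids):
    """Average the context vectors; W_cols[i] is the d-dim vector (column of W) of word i."""
    d = len(W_cols[0])
    n = len(context_ids)
    return [sum(W_cols[i][k] for i in context_ids) / n for k in range(d)]

# Toy embedding table with V = 4 words and d = 2 (made-up values)
W_cols = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
v_context = average_context(W_cols, context_ids=[0, 1, 2, 3])  # 2m = 4 context words
```

Because the vectors are averaged, the order of the context words does not matter, which is the "bag of words" aspect of CBOW.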




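Hierarchical softmax encodes the output softmax layer of the language model as a tree hierarchy, in which each leaf represents a word in the vocabulary and each inner node represents the relative probabilities of its children.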

The example path from the root to $w_2$ is highlighted in the figure. $p^w$ is the path from the root to the leaf $w$. $l(w)$ is the number of nodes on that path (including the root and the leaf); in the figure, for example, $l(w_2)$ is 4. $n(w, j)$ is the $j$-th node on the path to leaf $w$, and its context word vector is $u_{n(w, j)}$. $d_j^w \in \{0,1\}$ is the code of the $j$-th node on $p^w$, and $\theta_j^w$ is the vector of the $j$-th node on $p^w$.

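In this model there is no output vector representation for words; instead, each inner node of the tree has its own vector. In effect the flat output layer is removed, because the matrix computation from the hidden layer to the output layer is too expensive.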

With a Huffman tree, the time complexity drops from $O(|V|)$ to $O(\log_2|V|)$. In addition, because of how a Huffman tree is built, high-frequency words get shorter codes, which speeds up training further.
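To see why frequent words end up with shorter codes, the sketch below builds a Huffman tree with Python's `heapq` and records only the code length (tree depth) of each word. The word counts and the function name are made up for illustration.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: code length} for a Huffman tree built from word frequencies."""
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {w: depth + 1 for w, depth in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

freqs = {"the": 50, "cat": 20, "sat": 15, "crises": 10, "banking": 5}  # toy counts
lengths = huffman_code_lengths(freqs)
```

The code length is the number of sigmoid evaluations needed for that word, so a frequent word such as "the" costs far fewer operations per update than a rare one.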

Loss Function

with SG

$$P(w_o \mid w_c) = \prod_{j=1}^{l(w_o)-1} \sigma\left( [\![ n(w_o, j+1) = \text{leftChild}(n(w_o,j)) ]\!] \cdot \boldsymbol{u}_{n(w_o,j)}^\top \boldsymbol{v}_c\right)$$
where $[\![ x ]\!]$ is 1 if $x$ is true and -1 otherwise.

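We take the inner product of the center word vector with the vector of every non-leaf node on the path from the root to the predicted context word. Since $\sigma(x)+\sigma(-x) = 1$, the conditional probabilities of generating each word in the vocabulary given the center word $w_c$ also sum to 1.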
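The normalization claim can be checked numerically on a tiny tree. The sketch assumes a complete binary tree with four leaves, random inner-node vectors, and hand-written paths where +1 means "turn left" and -1 means "turn right", following the sign convention of the formula above; all names are illustrative.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
d = 3
v_c = [random.uniform(-1, 1) for _ in range(d)]          # center word vector
# One vector per inner node of the toy tree.
u = {n: [random.uniform(-1, 1) for _ in range(d)] for n in ("root", "L", "R")}
# Path to each leaf: (inner node, +1 if the path turns left, -1 if right).
paths = {
    "w1": [("root", +1), ("L", +1)],
    "w2": [("root", +1), ("L", -1)],
    "w3": [("root", -1), ("R", +1)],
    "w4": [("root", -1), ("R", -1)],
}

def hs_prob(word):
    """P(word | center) as a product of sigmoids along the root-to-leaf path."""
    p = 1.0
    for node, sign in paths[word]:
        p *= sigmoid(sign * sum(a * b for a, b in zip(u[node], v_c)))
    return p

probs = {w: hs_prob(w) for w in paths}
total = sum(probs.values())
```

At every inner node the left and right branch probabilities add up to one, so the leaf probabilities form a valid distribution without ever computing a sum over the whole vocabulary.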

with CBOW

Negative Sampling



$$weight(w) = \frac{count(w)^{0.75}}{\sum_{i=1}^V count(i)^{0.75}}$$
$$P(w) = U(w)^{0.75} / Z$$
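A short sketch of the effect of the 0.75 power, reading $U(w)$ as the unigram distribution and $Z$ as the normalizer. The two word counts are made-up numbers and `noise_distribution` is a hypothetical helper name.

```python
from collections import Counter

def noise_distribution(counts, power=0.75):
    """P(w) proportional to count(w)^power, normalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())
    return {w: x / z for w, x in weights.items()}

counts = Counter({"the": 1000, "bank": 10})               # toy counts
total = sum(counts.values())
unigram = {w: c / total for w, c in counts.items()}       # plain relative frequency
smoothed = noise_distribution(counts)                     # count^0.75 weighting
```

Compared with the raw unigram frequencies, the smoothed distribution gives rare words a noticeably higher chance of being drawn as negatives.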

Loss Function

with SG

Our new objective function:

$$\log \sigma(u_{o}^T \cdot v_c) + \sum_{k=1}^K E_{j \sim P(w)} \log \sigma(-u_j^T \cdot v_c)$$

Loss function:

$$J_{neg}(o, v_c, U) = -\log \sigma(u_o^T v_c) - \sum_{k=1}^K \log \sigma(-u_k^T \cdot v_c)$$

This maximizes the probability that the real outside word appears and minimizes the probability that random words appear around the center word.
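The loss $J_{neg}$ can be evaluated directly on toy vectors to see this behavior. The vectors below are hand-picked assumptions: in the "good" case the true outside vector aligns with the center vector and the negatives point away, and the "bad" case reverses that.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_loss(u_o, v_c, negatives):
    """J_neg = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)."""
    loss = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        loss -= math.log(sigmoid(-dot(u_k, v_c)))
    return loss

v_c = [1.0, 0.0]
good = neg_sampling_loss(u_o=[3.0, 0.0], v_c=v_c, negatives=[[-3.0, 0.0]] * 2)
bad = neg_sampling_loss(u_o=[-3.0, 0.0], v_c=v_c, negatives=[[3.0, 0.0]] * 2)
```

Each term only touches the center vector and $K+1$ outside vectors, which is what makes this much cheaper than the full softmax over $V$ words.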

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and minimize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take θ to be the parameters of the model, and in our case it is V and U.)
$$
\begin{aligned}
\theta &= \arg\max_\theta \prod_{(w,c) \in D} P(D=1 \mid w,c,\theta) \prod_{(w,c) \in \widetilde D} P(D=0 \mid w,c,\theta) \\
&= \arg\max_\theta \prod_{(w,c) \in D} P(D=1 \mid w,c,\theta) \prod_{(w,c) \in \widetilde D} \left(1-P(D=1 \mid w,c,\theta)\right) \\
&= \arg\max_\theta \sum_{(w,c) \in D} \log P(D=1 \mid w,c,\theta) + \sum_{(w,c) \in \widetilde D} \log\left(1-P(D=1 \mid w,c,\theta)\right) \\
&= \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+\exp(-u_w^T v_c)} + \sum_{(w,c) \in \widetilde D} \log\left(1-\frac{1}{1+\exp(-u_w^T v_c)}\right) \\
&= \arg\max_\theta \sum_{(w,c) \in D} \log \frac{1}{1+\exp(-u_w^T v_c)} + \sum_{(w,c) \in \widetilde D} \log \frac{1}{1+\exp(u_w^T v_c)}
\end{aligned}
$$
$\widetilde D$ stands for a "false" corpus, for example one made up of unnatural sentences.

Consider a pair $(w, c)$ of word and context. $P(D = 1 \mid w,c,\theta)$ denotes the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D = 0 \mid w,c,\theta)$ is the probability that $(w, c)$ did not come from the corpus data.

$$P(D = 1 \mid w,c,\theta) = \sigma(u_o^T v_c) = \frac{1}{1+\exp(-u_o^T v_c)}$$
$$J' = \prod_{t=1}^T \prod_{j = -m,\, j \ne 0}^m P(D=1 \mid w^{(t)}, w^{(t+j)})$$
If we simply maximize this joint probability, all word vectors end up equal with values going to infinity, which is a meaningless optimum. So we need negative sampling. Let $P$ be the event that the context word $w_o$ appears in the window of the center word. We sample $K$ words that do not appear in this window according to the distribution $P(w)$; these are the noise words (negative examples). Let $N_k$ be the event that noise word $w_k$ does not appear in the window of the center word. This gives the following model:
$$\prod_{t=1}^T \prod_{j = -m,\, j \ne 0}^m P(w^{(t+j)} \mid w^{(t)})$$
$$P(w^{(t+j)} \mid w^{(t)}) = P(D=1 \mid w^{(t)},w^{(t+j)}) \prod_{k=1,\, w_k \sim P(w)}^K P(D=0 \mid w^{(t)}, w_k)$$
Suppose the word at time step $t$ of the text sequence has index $i_t$ in the vocabulary and noise word $w_k$ has index $h_k$. The log loss can then be written as
$$
\begin{aligned}
-\log P(w^{(t+j)} \mid w^{(t)}) &= -\log P(D=1 \mid w^{(t)},w^{(t+j)}) - \sum_{k=1,\, w_k \sim P(w)}^K \log P(D=0 \mid w^{(t)}, w_k) \\
&= -\log \sigma(u_{i_{t+j}}^T v_{i_t}) - \sum_{k=1,\, w_k \sim P(w)}^K \log\left(1-\sigma(u_{h_k}^T v_{i_t})\right) \\
&= -\log \sigma(u_{i_{t+j}}^T v_{i_t}) - \sum_{k=1,\, w_k \sim P(w)}^K \log \sigma(-u_{h_k}^T v_{i_t})
\end{aligned}
$$


$$E = -\log \sigma({v'_{w_{pos}}}^T h) - \sum_{w_j \in W_{neg}} \log \sigma(-{v'_{w_j}}^T h)$$


$$v_{w_j}^{'(t+1)} = v_{w_j}^{'(t)} - \eta \left(\sigma(v_{w_j}^{'(t)T} h) - t_j\right) h$$
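The update rule can be tried on a single toy vector. The sketch reads $t_j$ as the label of word $w_j$ (1 for the positive word, 0 for a sampled negative), which is how this update is usually stated but is an assumption here since the notes do not define it; the numbers and function name are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_output_vector(v_out, h, t_j, eta=0.1):
    """One step of v' <- v' - eta * (sigma(v'^T h) - t_j) * h."""
    score = sum(a * b for a, b in zip(v_out, h))
    err = sigmoid(score) - t_j               # prediction error for this word
    return [a - eta * err * b for a, b in zip(v_out, h)]

def score(v_out, h):
    return sum(a * b for a, b in zip(v_out, h))

h = [0.5, -1.0]                              # hidden-layer output (toy values)
v = [0.2, 0.1]                               # current output vector of word w_j
v_pos = update_output_vector(v, h, t_j=1)    # treated as the positive word
v_neg = update_output_vector(v, h, t_j=0)    # treated as a negative sample
```

The same rule moves the vector toward $h$ for the positive word and away from $h$ for a negative sample, and only the $K+1$ sampled vectors are updated per training pair.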

with CBOW


We use negative sampling for approximate training. For each pair of center word and context word, we randomly sample K noise words (K=5).

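import random

def get_negatives(all_contexts, sampling_weights, K):
    """Sample K noise-word indices per context word for every center word."""
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
        context_set = set(contexts)
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # Draw a large batch of candidate indices at once, weighted by
                # sampling_weights; a large k amortizes the sampling cost.
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            # A noise word must not be one of the context words.
            if neg not in context_set:
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives

# counter, idx_to_token and all_contexts are assumed to be defined by the
# corpus preprocessing step, which is not shown here.
sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)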





  1. An unsupervised learning method, so it can be applied to training data without enough labels.


  1. Missing new words (impossible to keep up to date).

Training Algorithms

Distributed Memory (PV-DM)

PV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a center word based on an average of both context word-vectors and the full document's doc-vector. The doc-vector acts as a memory that remembers what is missing from the current context, or as the topic of the paragraph. Despite the somewhat odd name, PV-DM corresponds to the CBOW mode of word2vec: it predicts the probability of a word given the context and the document vector.


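In this figure, the authors seem to say only that preceding words are used to predict the following word; in this example, "the cat sat" is the preceding context of "on". This is somewhat misleading: the loss function in the original paper still includes the context both before and after a center word.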



Distributed Bag of Words (PV-DBOW)

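PV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training a neural network on the synthetic task of predicting a target word just from the full document's doc-vector. Despite the somewhat odd name, PV-DBOW corresponds to the SG mode of word2vec. In each iteration, a window is sampled from the text, then a random word is sampled from that window as the prediction target, and the model predicts it with the paragraph vector as the only input.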





  • dm: 0 = DBOW; 1 = DMPV. Selects the training mode.
  • vector_size: Dimensionality of the feature vectors.
  • window: The maximum distance between the current and predicted word within a sentence.
  • min_count: Ignores all words with total frequency lower than this.
  • sample: this is the sub-sampling threshold to downsample frequent words; 10e-5 is usually good for DBOW, and 10e-6 for DMPV.
  • hs: 1 turns on hierarchical softmax; this is rarely turned on, as negative sampling is in general better.
  • negative: number of negative samples; 5 is a good value.
  • dm_mean (optional): If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.
  • dm_concat (optional): If 1, use concatenation of context vectors rather than sum/average; Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
  • dbow_words (optional): If set to 1 trains word-vectors (in skip-gram fashion) simultaneous with DBOW doc-vector training; If 0, only trains doc-vectors (faster).


class TaggedDocument(namedtuple('TaggedDocument', 'words tags')):
    """
    A single document, made up of `words` (a list of unicode string tokens)
    and `tags` (a list of tokens). Tags may be one or more unicode string
    tokens, but typical practice (which will also be most memory-efficient) is
    for the tags list to include a unique integer id as the only tag.

    Replaces "sentence as a list of words" from Word2Vec.
    """

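Many people wonder why doc2vec, an unsupervised method, requires a tags option. The documentation shows that this parameter only needs a unique identifier for each document. We can also pass the class labels instead; doc2vec will still simply treat them as document tags rather than as supervised targets. Note that the tags must be passed as a list.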



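import matplotlib.colors
import matplotlib.pyplot as plt
import numpy as np
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.manifold import TSNE

# xtrain (a DataFrame with "ngram" and "Label" columns) and ytrain are assumed
# to be prepared earlier; they are not defined in this snippet.

def getVec(model, tagged_docs, epochs=20):
  """Infer one vector per tagged document."""
  sents = tagged_docs.values
  regressors = [model.infer_vector(doc.words, epochs=epochs) for doc in sents]
  return np.array(regressors)

def plotVec(ax, x, y, title="title"):
  """Scatter-plot 2-D points colored by class label."""
  scatter = ax.scatter(x[:, 0], x[:, 1], c=y, 
             cmap=matplotlib.colors.ListedColormap(["red", "blue", "yellow"]))
  ax.set_title(title)
  ax.legend(*scatter.legend_elements(), loc=0, title="Classes")

xtrain_tagged = xtrain.apply(
    lambda r: TaggedDocument(words=r["ngram"], tags=[r["Label"]]), axis=1
)

model_dm = Doc2Vec(dm=1, vector_size=30, negative=5, hs=0, min_count=2, sample=0)
# The vocabulary must be built before the first call to train().
model_dm.build_vocab(xtrain_tagged.values)
for epoch in range(10):
    sents = xtrain_tagged.values
    model_dm.train(sents, total_examples=len(sents), epochs=1)
    model_dm.alpha -= 0.002  # manually decay the learning rate after each pass
    model_dm.min_alpha = model_dm.alpha
xtrain_vec = getVec(model_dm, xtrain_tagged)
xtrain_tsne = TSNE(n_components=2, metric="cosine").fit_transform(xtrain_vec)
# ax1 is created here so the snippet runs on its own.
fig, ax1 = plt.subplots()
plotVec(ax1, xtrain_tsne, ytrain, title="training")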


