I split the solutions to Assignment 1 into four parts, covering questions 1, 2, 3, and 4 respectively. This part contains the solution to question 3.
3. word2vec (40 points + 5 bonus)
(a). (3 points) Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skipgram, and word prediction is made with the softmax function found in word2vec models

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{W} \exp(u_w^T v_c)} \tag{4}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector is one), derive the gradients with respect to $v_c$.
Hint: It will be helpful to use notation from question 2. For instance, letting $\hat{y}$ be the vector of softmax predictions for every word, $y$ as the expected word vector, and the loss function

$$J_{\text{softmax-CE}}(o, v_c, U) = CE(y, \hat{y}) \tag{5}$$

where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
Solution: Let the word vectors have dimension $n_{dim}$ and be column vectors, so $v_c$ has shape $n_{dim} \times 1$ and $U$ has shape $n_{dim} \times W$. Write $\theta = U^T v_c$, so that $\hat{y} = \mathrm{softmax}(\theta)$. By the result of question 2(b),

$$\frac{\partial J_{\text{softmax-CE}}}{\partial v_c} = \frac{\partial J_{\text{softmax-CE}}}{\partial \theta} \cdot \frac{\partial \theta}{\partial v_c} = U(\hat{y} - y)$$

where $y$ is the one-hot label vector for the expected word $o$.
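As a quick sanity check, this result translates directly into numpy. Below is a minimal sketch with illustrative names (not the assignment's official API), storing $U$ as an $n_{dim} \times W$ matrix whose columns are the $u_w$:

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax for a 1-D score vector."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def grad_vc_softmax_ce(vc, U, o):
    """dJ_softmax-CE/dv_c = U (y_hat - y); vc: (n_dim,), U: (n_dim, W)."""
    y_hat = softmax(U.T.dot(vc))   # predicted distribution, shape (W,)
    y = np.zeros_like(y_hat)
    y[o] = 1.0                     # one-hot label for the expected word o
    return U.dot(y_hat - y)        # shape (n_dim,)
```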
(b). (3 points) As in the previous part, derive gradients for the "output" word vectors $u_w$ (including $u_o$).
Solution: As in (a), let $\theta = U^T v_c$. Then:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U_{ij}} = \sum_k \frac{\partial J_{\text{softmax-CE}}}{\partial \theta_k} \frac{\partial \theta_k}{\partial U_{ij}} = \sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}}$$

Since $\theta_k = u_k^T v_c$, we have

$$\frac{\partial \theta_k}{\partial U_{ij}} = \begin{cases} v_i & j = k \\ 0 & j \neq k \end{cases}$$

where $v_i$ denotes the $i$-th element of $v_c$. Therefore:

$$\sum_k (\hat{y} - y)\big|_k \frac{\partial \theta_k}{\partial U_{ij}} = (\hat{y} - y)\big|_j \, v_i = v_c (\hat{y} - y)^T \big|_{i,j}$$

and so:

$$\frac{\partial J_{\text{softmax-CE}}}{\partial U} = v_c (\hat{y} - y)^T$$
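Continuing the numpy sketch from part (a) (names again illustrative), the gradient with respect to $U$ is just an outer product, which matches $U$'s $(n_{dim}, W)$ shape:

```python
def grad_U_softmax_ce(vc, U, o):
    """dJ_softmax-CE/dU = v_c (y_hat - y)^T, shape (n_dim, W)."""
    y_hat = softmax(U.T.dot(vc))
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return np.outer(vc, y_hat - y)
```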
(c). (6 points) Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector $v_c$, and the expected output word is $o$. Assume that $K$ negative samples (words) are drawn, and they are $1, \dots, K$, respectively for simplicity of notation ($o \notin \{1, \dots, K\}$). Again, for a given word $o$, denote its output vector as $u_o$. The negative sampling loss function in this case is

$$J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^T v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^T v_c)) \tag{6}$$

where $\sigma(\cdot)$ is the sigmoid function.
After you’ve done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our code.
Solution: Let $S$ denote the set of the $K$ sampled indices.

$$\frac{\partial J_{\text{neg-sample}}}{\partial v_c} = -\frac{\partial \log\sigma(u_o^T v_c)}{\partial v_c} - \sum_{i \in S} \frac{\partial \log\sigma(-u_i^T v_c)}{\partial v_c} = [\sigma(u_o^T v_c) - 1]\, u_o - \sum_{i \in S} [\sigma(-u_i^T v_c) - 1]\, u_i$$

$$\frac{\partial J_{\text{neg-sample}}}{\partial u_w} = \begin{cases} [\sigma(u_o^T v_c) - 1]\, v_c & w = o \\ [1 - \sigma(-u_w^T v_c)]\, v_c & w \in S \\ 0 & w \neq o \text{ and } w \notin S \end{cases}$$

Equation (6) is much faster to compute than equation (5) because the softmax-CE loss sums over all $W$ words in the vocabulary, while negative sampling only touches the $K + 1$ vectors $u_o, u_1, \dots, u_K$, so

$$\frac{\text{runtime of softmax-CE}}{\text{runtime of negative sampling loss}} = \frac{O(W)}{O(K)}$$

(I am not sure this is the most precise way to put it; corrections are welcome.)
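A minimal numpy sketch of these two gradients (illustrative names again; `neg` plays the role of the sample set $S$). Note the loop only touches $K + 1$ output vectors, which is exactly where the $O(W)/O(K)$ speed-up comes from:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_grads(vc, U, o, neg):
    """Gradients of J_neg-sample w.r.t. v_c and U; neg holds the K sampled indices."""
    so = sigmoid(U[:, o].dot(vc))
    grad_vc = (so - 1.0) * U[:, o]
    grad_U = np.zeros_like(U)
    grad_U[:, o] = (so - 1.0) * vc
    for i in neg:
        si = sigmoid(-U[:, i].dot(vc))
        grad_vc -= (si - 1.0) * U[:, i]   # = + (1 - si) * u_i
        grad_U[:, i] += (1.0 - si) * vc   # accumulate: samples may repeat in practice
    return grad_vc, grad_U
```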
(d). (8 points) Derive gradients for all of the word vectors for skip-gram and CBOW given the previous parts and given a set of context words $[word_{c-m}, \dots, word_{c-1}, word_c, word_{c+1}, \dots, word_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors for $word_k$ as $v_k$ and $u_k$ respectively.

Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \dots)$ or $J_{\text{neg-sample}}(o, v_c, \dots)$ cost functions in this part - you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form $\frac{\partial F(o, v_c)}{\partial \dots}$.
Recall that for skip-gram, the cost for a context centered around $c$ is

$$J_{\text{skip-gram}}(word_{c-m \dots c+m}) = \sum_{-m \le j \le m,\, j \neq 0} F(w_{c+j}, v_c) \tag{7}$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.
CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$ defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

$$\hat{v} = \sum_{-m \le j \le m,\, j \neq 0} v_{c+j} \tag{8}$$

then the CBOW cost is

$$J_{\text{CBOW}}(word_{c-m \dots c+m}) = F(w_c, \hat{v}) \tag{9}$$

Note: To be consistent with the $\hat{v}$ notation such as for the code portion, for skip-gram $\hat{v} = v_c$.
Solution: Let $v_k$ and $u_k$ be the input and output vectors of word $k$, respectively.

For skip-gram:

$$\frac{\partial J_{\text{skip-gram}}(word_{c-m \dots c+m})}{\partial v_k} = \sum_{-m \le j \le m,\, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_k}$$

$$\frac{\partial J_{\text{skip-gram}}(word_{c-m \dots c+m})}{\partial u_k} = \sum_{-m \le j \le m,\, j \neq 0} \frac{\partial F(w_{c+j}, v_c)}{\partial u_k}$$

where $w_{c+j}$ is the one-hot vector of the $j$-th word from the center. Note that each term $F(w_{c+j}, v_c)$ depends only on the input vector $v_c$, so $\frac{\partial J_{\text{skip-gram}}}{\partial v_k} = 0$ for $k \neq c$.

For CBOW:

$$\frac{\partial J_{\text{CBOW}}(word_{c-m \dots c+m})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial v_k} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \cdot \frac{\partial \hat{v}}{\partial v_k} = \begin{cases} \dfrac{\partial F(w_c, \hat{v})}{\partial \hat{v}} & k \in \{c-m, \dots, c-1, c+1, \dots, c+m\} \\ 0 & \text{otherwise} \end{cases}$$

$$\frac{\partial J_{\text{CBOW}}(word_{c-m \dots c+m})}{\partial u_k} = \frac{\partial F(w_c, \hat{v})}{\partial u_k}$$
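This structure maps directly onto code: skip-gram accumulates one call to $F$ per context word into the center column, while CBOW makes a single call and scatters $\partial F / \partial \hat{v}$ back to every context column. A minimal sketch, assuming a black-box `F_cost_and_grads(target, v_hat, U)` that returns the cost together with the gradients w.r.t. its predicted vector and $U$ (all names here are illustrative, not the assignment's official signatures):

```python
def skipgram_grads(c, context_idx, V, U, F_cost_and_grads):
    """V, U: (n_dim, vocab) matrices whose columns are the input/output vectors."""
    cost, grad_V, grad_U = 0.0, np.zeros_like(V), np.zeros_like(U)
    for t in context_idx:                  # indices c+j for -m <= j <= m, j != 0
        dc, dvc, dU = F_cost_and_grads(t, V[:, c], U)
        cost += dc
        grad_V[:, c] += dvc                # only the center column is nonzero
        grad_U += dU
    return cost, grad_V, grad_U

def cbow_grads(c, context_idx, V, U, F_cost_and_grads):
    v_hat = V[:, context_idx].sum(axis=1)  # equation (8)
    cost, dv_hat, grad_U = F_cost_and_grads(c, v_hat, U)
    grad_V = np.zeros_like(V)
    for t in context_idx:
        grad_V[:, t] += dv_hat             # each context column receives dF/dv_hat
    return cost, grad_V, grad_U
```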
P.S. This answer feels rather simple; why is it worth 8 points?

(e)(f)(g)(h). See the code; omitted here.

Attached below is the plot produced after training, i.e. the figure that appears after running q3_run.py; there is a discussion on reddit about how to judge whether this plot looks reasonable: