(1)Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.
[A]True
[B]False
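Answer: B
Explanation: Note the difference from one-hot vectors. A one-hot vector has one dimension per vocabulary word, whereas an embedding is a dense vector of much lower dimension.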
(2)What is t-SNE?
[A]A linear transformation that allows us to solve analogies on word vectors.
[B]A non-linear dimensionality reduction technique.
[C]A supervised learning algorithm for learning word embeddings.
[D]An open-source sequence modeling library.
Answer: B
Explanation: t-SNE is a non-linear dimensionality reduction algorithm.
(3)Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.
x(input text) | y(happy?) |
---|---|
I’m feeling wonderful today! | 1 |
I’m bummed my cat is ill | 0 |
Really enjoying this! | 1 |
Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label y=1.
[A]True
[B]False
Answer: A
Explanation: Words with positive sentiment have similar embedding vectors.
(4)Which of these equations do you think should hold for a good word embedding?(Check all that apply)
[A] $e_{boy}-e_{girl} \approx e_{brother}-e_{sister}$
[B] $e_{boy}-e_{girl} \approx e_{sister}-e_{brother}$
[C] $e_{boy}-e_{brother} \approx e_{girl}-e_{sister}$
[D] $e_{boy}-e_{brother} \approx e_{sister}-e_{girl}$
Answer: A,C
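As a rough illustration of why A and C hold while B does not, the sketch below uses hypothetical 2-dimensional toy vectors. The values are invented for this example (one axis standing in for gender, the other for "is a sibling"); real embeddings are learned and satisfy these relations only approximately.

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 ~ gender, axis 1 ~ "is a sibling".
# These numbers are made up for illustration, not learned from data.
emb = {
    "boy":     np.array([ 1.0, 0.0]),
    "girl":    np.array([-1.0, 0.0]),
    "brother": np.array([ 1.0, 1.0]),
    "sister":  np.array([-1.0, 1.0]),
}

def diff(a, b):
    """Return the difference vector e_a - e_b."""
    return emb[a] - emb[b]

# Option A: both word pairs differ along the same (gender) direction.
option_a = np.allclose(diff("boy", "girl"), diff("brother", "sister"))
# Option C: both word pairs differ along the same (sibling) direction.
option_c = np.allclose(diff("boy", "brother"), diff("girl", "sister"))
# Option B flips the sign on one side, so the two differences point in opposite directions.
option_b = np.allclose(diff("boy", "girl"), diff("sister", "brother"))

print(option_a, option_c, option_b)
```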
(5)Let $E$ be an embedding matrix, and let $o_{1234}$ be a one-hot vector, corresponding to word 1234. Then to get the embedding of word 1234, why don’t we call $E^T * o_{1234}$ in Python?
[A]It is computationally wasteful.
[B]The correct formula is $E^T * e_{1234}$
[C]This doesn’t handle unknown words.
[D]None of the above: Calling the Python snippet as described above is fine.
Answer: A
Explanation: The one-hot vector is high-dimensional and almost all of its entries are 0, so multiplying the embedding matrix by $o_{1234}$ is very inefficient. Nearly every multiplication is by zero, while a direct index lookup returns the same vector.
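A minimal NumPy sketch of the point behind answer A. It assumes the embedding matrix is stored with one row per word (shape `vocab_size × embed_dim`), so that the transposed product matches the formula in the question; `embed_dim = 300` is an arbitrary illustrative choice.

```python
import numpy as np

vocab_size, embed_dim = 10000, 300  # embed_dim is an assumed, illustrative value
rng = np.random.default_rng(0)

# Assumed layout: row i of E is the embedding of word i.
E = rng.standard_normal((vocab_size, embed_dim))

# One-hot vector for word 1234: all zeros except a single 1.
o_1234 = np.zeros(vocab_size)
o_1234[1234] = 1.0

# Matrix-vector product: about vocab_size * embed_dim multiply-adds,
# almost all of them multiplications by zero.
via_matmul = E.T @ o_1234

# Direct lookup: reads a single row, no arithmetic at all.
via_lookup = E[1234]

same = np.allclose(via_matmul, via_lookup)
print(same)
```

Both expressions give the same vector, which is why embedding layers in practice are implemented as lookups rather than matrix products.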
(6)When learning word embeddings, we create an artificial task of estimating $P(target|context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
[A]True
[B]False
Answer: B
Explanation: The error lies in the word “artificial”.
(7)In the word2vec algorithm, you estimate $P(t|c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
[A] $c$ is the one word that comes immediately before $t$.
[B] $c$ is the sequence of all the words in the sentence before $t$.
[C] $c$ is a sequence of several words immediately before $t$.
[D] $c$ and $t$ are chosen to be nearby words.
Answer: D
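One way to picture "nearby words" is the skip-gram style sampling sketched below: for each context word, a target is drawn at random from within a fixed window around it. The function name `sample_pairs`, the window size of 2, and the example sentence are all assumptions made for illustration.

```python
import random

def sample_pairs(tokens, window=2, seed=0):
    """For each context word c, pick one target t within +/- `window` positions."""
    rng = random.Random(seed)
    pairs = []
    for i, c in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens) - 1, i + window)
        # Every position in the window except the context word itself.
        candidates = [j for j in range(lo, hi + 1) if j != i]
        if candidates:
            j = rng.choice(candidates)
            pairs.append((c, tokens[j]))
    return pairs

# Arbitrary example sentence (all tokens are distinct).
sentence = "I want a glass of orange juice".split()
pairs = sample_pairs(sentence)

for c, t in pairs:
    print(f"context={c!r} -> target={t!r}")
```

Note that the target may come before or after the context word, which is what separates option D from options A to C.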
(8)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:
$$P(t|c)=\frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$
Which of these statements are correct? Check all that apply.
[A] $\theta_t$ and $e_c$ are both 500 dimensional vectors.
[B] $\theta_t$ and $e_c$ are both 10000 dimensional vectors.
[C] $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
[D]After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
Answer: A,C
Explanation: The question states that the embeddings are 500-dimensional, so $\theta_t$ and $e_c$ are both 500-dimensional.
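Option D is somewhat debatable; for details see:
Why does word2vec use 2 representations for each word?
Word2Vec哪个矩阵是词向量? (Which Word2Vec matrix holds the word vectors?)
word2Vec的CBOW,SKIP-gram为什么有2组词向量? (Why do word2vec's CBOW and skip-gram have two sets of word vectors?)
In my view, both the $\theta$ vectors and the $e$ vectors can serve as word vectors. They differ only in how they represent a word and in which features they capture, so their numerical values differ as well.
The difference in representation is like describing a circle as having radius 1 versus having area $\pi$: the descriptions differ, but they denote the same circle. It can also be understood as vectors expressed in vector spaces with different bases.
The difference in captured features means that, for the same word, the two vectors pick up different features. Take “juice”: one vector might capture that it is a liquid, the other that it is made from fruit.
If anything here is wrong, corrections are welcome.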
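To make the shapes in question (8) concrete, here is a small NumPy sketch of the softmax. The array names `theta` and `E`, the 0.01 initialisation scale, and the context index 42 are assumptions for illustration; no training is performed.

```python
import numpy as np

vocab_size, embed_dim = 10000, 500
rng = np.random.default_rng(0)

# Two separate parameter matrices, one row per word, each row 500-dimensional.
# In training, both would be updated by the optimizer (option C).
theta = 0.01 * rng.standard_normal((vocab_size, embed_dim))  # output-side vectors theta_t
E = 0.01 * rng.standard_normal((vocab_size, embed_dim))      # input-side embeddings e_c

def p_t_given_c(c):
    """Softmax over all 10000 candidate targets for context word index c."""
    logits = theta @ E[c]            # theta_t^T e_c for every t, shape (vocab_size,)
    logits = logits - logits.max()   # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

p = p_t_given_c(42)  # 42 is an arbitrary context word index
print(p.shape, p.sum())
```

Because `theta` and `E` are distinct parameters, nothing in the objective forces `theta[w]` to equal `E[w]` for the same word `w`, which is the crux of the debate about option D.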
(9)Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:
$$\min \sum_{i=1}^{10000}\sum_{j=1}^{10000} f\left(X_{ij}\right)\left(\theta_i^T e_j + b_i + b_j' - \log X_{ij}\right)^2$$
Which of these statements are correct? Check all that apply.
[A] $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
[B] $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
[C] $X_{ij}$ is the number of times word i appears in the context of word j.
[D]The weighting function $f(.)$ must satisfy $f(0)=0$
Answer: B,C,D
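The sketch below evaluates the GloVe objective on a tiny toy problem to show why $f(0)=0$ matters: $\log X_{ij}$ is undefined when $X_{ij}=0$, and the zero weight removes those terms. The specific form of `f` (with `x_max = 100` and `alpha = 0.75`) follows the original GloVe paper and is not given in the question; the vocabulary size of 5, the dimension of 3, and the random counts are invented for the example.

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function as in the GloVe paper (an assumption here); f(0) = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(theta, e, b, b_prime, X):
    """Sum over i, j of f(X_ij) * (theta_i^T e_j + b_i + b'_j - log X_ij)^2."""
    # Where X_ij = 0 the log is undefined; substitute log(1) = 0 there,
    # which is harmless because f(0) = 0 gives those terms zero weight.
    log_X = np.log(np.where(X > 0, X, 1.0))
    residual = theta @ e.T + b[:, None] + b_prime[None, :] - log_X
    return np.sum(f(X) * residual ** 2)

rng = np.random.default_rng(0)
V, d = 5, 3  # toy sizes instead of 10000 and 500

# Made-up co-occurrence counts, including some zeros.
X = rng.integers(0, 4, size=(V, V)).astype(float)

# Random initialisation (option B).
theta = 0.1 * rng.standard_normal((V, d))
e = 0.1 * rng.standard_normal((V, d))
b = np.zeros(V)
b_prime = np.zeros(V)

loss = glove_loss(theta, e, b, b_prime, X)
print(loss)
```

Random initialisation is also needed for a structural reason: the gradient with respect to $\theta_i$ is proportional to $e_j$ and vice versa, so if both started at 0 they would stay at 0 and only the biases would learn.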
(10)You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
[A] $m_1 \gg m_2$
[B] $m_1 \ll m_2$
Answer: A