CS224N notes_chapter3_Deeper Look at Word Vectors

Lecture 3: Deeper Look at Word Vectors

Negative Sampling

First, let's review the skip-gram model:
$$p(w_{t+j} \mid w_t) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}$$
The numerator is cheap to compute, but the denominator requires a sum over the entire vocabulary. Is there a way to simplify this?
We can use negative sampling. First, we make a small change to the loss function:
$$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{j \sim P(w)} \left[\log \sigma(-u_j^T v_c)\right], \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
To compute the second term of $J_t(\theta)$, we only sample a few words (usually fewer than 10) from the vocabulary according to their unigram frequency $U(w)$, using $P(w) = U(w)^{3/4}/Z$, where $Z$ is a normalizing constant. The $3/4$ power makes less frequent words be sampled more often.
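As a sanity check, here is a minimal NumPy sketch of this objective for one (center, outside) pair, instead of the full softmax above. The vocabulary size, vector dimension, number of negatives k, and unigram counts are all made-up illustration values, and the loss is returned negated so it can be minimized.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy setup (illustrative values): vocabulary of V words, d-dim vectors.
V, d, k = 1000, 50, 5                     # vocab size, vector dim, number of negatives
U = rng.normal(scale=0.1, size=(V, d))    # "outside" vectors u_w
Vc = rng.normal(scale=0.1, size=(V, d))   # "center" vectors v_w
counts = rng.integers(1, 100, size=V)     # unigram counts U(w) (made up)

# Sampling distribution P(w) ∝ U(w)^(3/4): flattens the unigram distribution
# so that less frequent words are sampled a bit more often.
probs = counts ** 0.75
probs = probs / probs.sum()

def neg_sampling_loss(center, outside):
    """Negated J_t(θ) for one (center, outside) pair with k sampled negatives."""
    negatives = rng.choice(V, size=k, p=probs)               # j ~ P(w); a real
    # implementation would also avoid sampling the true outside word.
    pos = np.log(sigmoid(U[outside] @ Vc[center]))           # log σ(u_o^T v_c)
    neg = np.log(sigmoid(-U[negatives] @ Vc[center])).sum()  # Σ log σ(-u_j^T v_c)
    return -(pos + neg)                                      # minimize the negative

print(neg_sampling_loss(center=3, outside=17))
```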
Word2vec captures co-occurrence of words one window at a time. Could we capture co-occurrence counts directly?

Co-occurrence matrix X

We could use a co-occurrence matrix X.

  • Two options: window vs. full document.
  • Window: similar to word2vec -> captures both syntactic and semantic information.
  • A word-document co-occurrence matrix will give general topics -> Latent Semantic Analysis.

Example:

  • I like deep learning.
  • I like NLP.
  • I enjoy flying.

Window size: 1

| counts | I | like | enjoy | deep | learning | NLP | flying | . |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
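A small sketch of how this matrix can be built from the three sentences above, assuming naive whitespace tokenization with the period as its own token:

```python
import numpy as np

corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]

vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)

for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        # Count every word within `window` positions of w (excluding w itself).
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[idx[w], idx[tokens[j]]] += 1

print(X)   # matches the table above
```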

Problems with the co-occurrence matrix

  • Increases in size with the vocabulary
  • Very high dimensional
  • Sparse

-> The resulting models are less robust.

How do we get low-dimensional vectors?
Apply SVD to X.
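A minimal sketch, using NumPy's SVD on the toy matrix from the table above and keeping an arbitrary target dimension of k = 2; the low-dimensional word vectors are the rows of U scaled by the leading singular values.

```python
import numpy as np

# Window-1 co-occurrence matrix from the table above (rows/cols in vocab order).
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

k = 2                                             # target dimension (arbitrary)
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # full SVD, then truncate
word_vectors = U[:, :k] * S[:k]                   # one k-dim vector per word

for w, vec in zip(vocab, word_vectors):
    print(f"{w:10s} {vec}")
```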

Hacks to X

  • Problem: function words (the, he, has) are too frequent. Two fixes:
    • Cap the counts: min(X, t), with t ≈ 100 (see the sketch after this list)
    • Or ignore the function words entirely
  • Use ramped windows that weight closer words more
  • Use Pearson correlations instead of raw counts, and set negative values to 0
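An illustrative sketch of two of these hacks: capping raw counts at t, and a ramped window that weights a context word by 1/distance (one common choice of ramp, not the only one).

```python
import numpy as np

def cap_counts(X, t=100):
    """Hack 1: clip raw counts so overly frequent function words dominate less."""
    return np.minimum(X, t)

def ramped_cooccurrence(tokens, vocab, window=5):
    """Hack 2: ramped window -- closer context words contribute more (weight 1/distance)."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                X[idx[w], idx[tokens[j]]] += 1.0 / abs(i - j)
    return X
```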

Problems with SVD

  • Computational cost is high for matrices with millions of words.
  • Hard to incorporate new words.
| Count based | Direct prediction |
| --- | --- |
| LSA, HAL, COALS | SG/CBOW, NNLM, RNN |
| Fast training | Scales with corpus size |
| Efficient usage of statistics | Inefficient usage of statistics |
| Primarily used to capture word similarity | Generates improved performance on other tasks |
| Disproportionate importance given to large counts | Can capture complex patterns beyond word similarity |

GloVe: Global Vectors model

$$J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(P_{ij}) \left(u_i^T v_j - \log P_{ij}\right)^2, \qquad f(x) = \min(2x, 1)$$
Finally, each word's vector is the sum of its two learned vectors: $X = U + V$.
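A minimal sketch of this objective, summing only over nonzero co-occurrence counts (log P_ij is undefined at zero) and using the weighting f(x) = min(2x, 1) as written above; the vectors and counts are made-up toy values. (The GloVe paper itself uses a capped power weighting; the simpler form from the note is kept here.)

```python
import numpy as np

def glove_loss(U, V, P):
    """J(theta) = 1/2 * sum_ij f(P_ij) * (u_i^T v_j - log P_ij)^2 over nonzero P_ij."""
    f = lambda x: min(2 * x, 1.0)             # weighting function as written above
    loss = 0.0
    rows, cols = np.nonzero(P)                # log P_ij only defined for P_ij > 0
    for i, j in zip(rows, cols):
        loss += 0.5 * f(P[i, j]) * (U[i] @ V[j] - np.log(P[i, j])) ** 2
    return loss

# Toy usage with made-up vectors and counts for an 8-word vocabulary:
rng = np.random.default_rng(0)
W, d = 8, 4
U_toy = rng.normal(scale=0.1, size=(W, d))              # u_i vectors
V_toy = rng.normal(scale=0.1, size=(W, d))              # v_j vectors
P_toy = rng.integers(0, 5, size=(W, W)).astype(float)   # toy co-occurrence counts
print(glove_loss(U_toy, V_toy, P_toy))
```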

How to evaluate word vectors

  • Intrinsic
    • Eval on a specific subtask
    • Fast to compute
    • Helps to understand the system
    • Not clear if really helpful unless correlation to a real task is established
  • Extrinsic
    • Eval on a real task
    • Can take a long time to compute accuracy
    • Unclear whether the subsystem is the problem, or its interaction with other subsystems
    • If replacing exactly one subsystem with another improves accuracy -> winning!

An example of intrinsic evaluation: word vector analogies.
a:b :: c:?
man:woman :: king:?
(That is, man is to woman as king is to queen.)
$$d = \arg\max_i \frac{(x_b - x_a + x_c)^T x_i}{\lVert x_b - x_a + x_c \rVert}$$
That is, we pick the word whose vector has the highest cosine similarity with $x_b - x_a + x_c$. Typical analogy categories include:
  • city - in - state
  • capital - world
  • verb - past - tense
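A minimal sketch of this analogy lookup, assuming a hypothetical `vectors` dict that maps words to NumPy arrays; the three query words are excluded from the argmax.

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Return the word d maximizing cosine similarity with x_b - x_a + x_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):                       # exclude the query words themselves
            continue
        score = (vec / np.linalg.norm(vec)) @ target
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Hypothetical usage, assuming `vectors` maps words to np.ndarray embeddings:
# analogy(vectors, "man", "woman", "king")   # expected: "queen"
```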
