Course Homepage: http://web.stanford.edu/class/cs224n/
2021.2.25
A word’s meaning is given by the words that frequently appear close by.
“You shall know a word by the company it keeps.”——J. R. Firth
Loss function:
Finally:
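For reference, a reconstruction of the standard skip-gram likelihood and its average negative log-likelihood loss (my own restatement, with $T$ the corpus length, $m$ the window size, and $\theta$ all the word vectors):

$$L(\theta)=\prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\left(w_{t+j} \mid w_{t} ; \theta\right), \qquad
J(\theta)=-\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\left(w_{t+j} \mid w_{t} ; \theta\right)$$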
The above is the training method; next is the prediction method:
Use the dot product of the vectors to calculate the similarity.
Now we need to calculate the partial derivative:
The numerator is easy, so look into the denominator:
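Concretely (my restatement of the softmax prediction, where $v_c$ is the center-word vector, $u_o$ the outside-word vector, and $V$ the vocabulary):

$$P(o \mid c)=\frac{\exp \left(u_{o}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}$$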
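The quantity in question, as I reconstruct it from the softmax above:

$$\frac{\partial}{\partial v_{c}} \log P(o \mid c)=\frac{\partial}{\partial v_{c}}\left(u_{o}^{T} v_{c}-\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)\right)$$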
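(My reconstruction of this step.) Differentiating the log of the numerator just gives $u_o$:

$$\frac{\partial}{\partial v_{c}}\, u_{o}^{T} v_{c}=u_{o}$$

The log of the denominator, $\log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)$, is the part that needs work.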
Use the chain rule:
Then move the derivative inside the sum:
Use the chain rule again:
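Spelling out the three steps above in one place (my reconstruction; the three lines correspond to the chain rule on the log, moving the derivative inside the sum, and the chain rule on the exponential):

$$\begin{aligned}
\frac{\partial}{\partial v_{c}} \log \sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)
&=\frac{1}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \cdot \frac{\partial}{\partial v_{c}} \sum_{x \in V} \exp \left(u_{x}^{T} v_{c}\right) \\
&=\frac{1}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \sum_{x \in V} \frac{\partial}{\partial v_{c}} \exp \left(u_{x}^{T} v_{c}\right) \\
&=\frac{1}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)} \sum_{x \in V} \exp \left(u_{x}^{T} v_{c}\right) u_{x}
\end{aligned}$$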
Finally:
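Putting the two pieces together (again my reconstruction):

$$\frac{\partial}{\partial v_{c}} \log P(o \mid c)=u_{o}-\sum_{x \in V} \frac{\exp \left(u_{x}^{T} v_{c}\right)}{\sum_{w \in V} \exp \left(u_{w}^{T} v_{c}\right)}\, u_{x}$$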
Written in a simpler form:
It means the gradient is the “difference” between the expected context word and the actual context word that showed up.
This difference gives us the slope, i.e., the direction in which we should move the word vectors.
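Recognizing the fraction as the softmax probability $P(x \mid c)$:

$$\frac{\partial}{\partial v_{c}} \log P(o \mid c)=u_{o}-\sum_{x \in V} P(x \mid c)\, u_{x}$$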
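A tiny NumPy sketch of this gradient, purely my own illustration: `U` is a hypothetical matrix of outside-word vectors (one row per vocabulary word), `v_c` the center vector, and `o` the index of the observed context word.

```python
import numpy as np

def skipgram_grad_center(U, v_c, o):
    """Gradient of log P(o | c) with respect to the center vector v_c.

    U   : (V, d) array of outside/context word vectors (one row per word)
    v_c : (d,) center word vector
    o   : index of the observed outside word
    """
    scores = U @ v_c              # dot-product similarities, shape (V,)
    scores -= scores.max()        # stabilize the softmax numerically
    probs = np.exp(scores)
    probs /= probs.sum()          # P(x | c) for every word x
    expected = probs @ U          # expected context vector, sum_x P(x|c) u_x
    return U[o] - expected        # "observed minus expected"
```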
Exercise one
Solution of Exercise one
Negative Sampling: train binary logistic regressions for a true pair (center word and a word in its context window) versus a few noise pairs (the center word paired with a random word).
Here, the capital Z is often used as a normalization term:
So this is saying: if you want the probability distribution over words, you take the count of each word in the vocabulary raised to the three-quarters power, sum these numbers over the whole vocabulary to get the total, and divide by that total. So we get a probability distribution.
(Note: in this class, Z denotes the normalization term that turns things into probabilities.)
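A hedged reconstruction of the per-pair objective (minimized form), with $\sigma$ the logistic function, $u_o$ the true outside word, $v_c$ the center word, and $K$ the set of sampled noise words:

$$J_{\text{neg}}(\theta)=-\log \sigma\left(u_{o}^{T} v_{c}\right)-\sum_{k \in K} \log \sigma\left(-u_{k}^{T} v_{c}\right)$$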
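My reconstruction of the sampling distribution from the description below, with $U(w)$ the unigram count of word $w$:

$$P(w)=\frac{U(w)^{3/4}}{Z}, \qquad Z=\sum_{w' \in V} U\left(w'\right)^{3/4}$$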
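A small NumPy sketch of that sampling distribution (the function name is mine):

```python
import numpy as np

def negative_sampling_distribution(counts):
    """Turn raw word counts into the 3/4-power unigram distribution.

    counts : (V,) array of raw corpus counts for each vocabulary word
    """
    weights = counts ** 0.75      # dampen frequent words, boost rare ones
    Z = weights.sum()             # normalization term ("Z" in the lecture)
    return weights / Z            # a proper probability distribution

# Example: sample 5 negative words according to this distribution
counts = np.array([1000, 200, 50, 5, 1], dtype=float)
P = negative_sampling_distribution(counts)
negatives = np.random.choice(len(counts), size=5, p=P)
```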
Reference
$X_{i,j}$: the number of times word $j$ appears in the context of center word $i$.
$X_i$: the total count over the full context of word $i$, i.e. $X_i=\sum_k X_{i,k}$.
$P_{i,k}=\frac{X_{i,k}}{X_{i}}$: the frequency with which word $k$ appears in the context of word $i$.
$\text{ratio}_{i,j,k}=\frac{P_{i,k}}{P_{j,k}}$: for two different center words $i$ and $j$, the ratio of how often word $k$ appears in their contexts.
Now we want to find a function $g$ such that:

$$\frac{P_{i, k}}{P_{j, k}}=\text{ratio}_{i, j, k}=g\left(w_{i}, w_{j}, w_{k}\right)$$

The following derivation is very terse: roughly, $g$ should depend on the difference $w_i - w_j$ (to capture the contrast between the two center words), and taking the exponential of its dot product with $w_k$ turns a ratio of probabilities into a difference of dot products.
Finally:
$$\begin{aligned}
\frac{P_{i, k}}{P_{j, k}} &=\exp \left(\left(w_{i}-w_{j}\right)^{T} w_{k}\right) \\
\frac{P_{i, k}}{P_{j, k}} &=\exp \left(w_{i}^{T} w_{k}-w_{j}^{T} w_{k}\right) \\
\frac{P_{i, k}}{P_{j, k}} &=\frac{\exp \left(w_{i}^{T} w_{k}\right)}{\exp \left(w_{j}^{T} w_{k}\right)}
\end{aligned}$$

$$\begin{array}{c}
P_{i, j}=\exp \left(w_{i}^{T} w_{j}\right) \\
\log \left(X_{i, j}\right)-\log \left(X_{i}\right)=w_{i}^{T} w_{j} \\
\log \left(X_{i, j}\right)=w_{i}^{T} w_{j}+b_{i}+b_{j}
\end{array}$$
The loss function is:
$$J=\sum_{i, j}^{N}\left(w_{i}^{T} w_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2}$$

Add the weighting term:

$$J=\sum_{i, j}^{N} f\left(X_{i, j}\right)\left(v_{i}^{T} v_{j}+b_{i}+b_{j}-\log \left(X_{i, j}\right)\right)^{2}, \qquad
f(x)=\begin{cases}
\left(x / x_{\max }\right)^{0.75}, & \text{if } x<x_{\max } \\
1, & \text{if } x \ge x_{\max }
\end{cases}$$
Here $f(x)$ means that the more frequently $x$ appears (the larger the co-occurrence count), the higher the weight it gets, capped at 1.
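A small NumPy sketch of the weighting function and the weighted least-squares objective above. This is my own illustration: the function names are mine, `x_max = 100` is the value from the GloVe paper (not from these notes), and a single vector set `W` is used for both words in a pair, as in the notes.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): grows with the co-occurrence count, capped at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, b, X):
    """Weighted least-squares GloVe objective for a co-occurrence matrix X.

    W : (V, d) word vectors, b : (V,) biases, X : (V, V) co-occurrence counts
    """
    i, j = np.nonzero(X)                         # only pairs that actually co-occur
    inner = np.sum(W[i] * W[j], axis=1)          # w_i^T w_j for each pair
    diff = inner + b[i] + b[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)
```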
There is little new knowledge in this lecture.
Use the window of words to classify the center word:
With this bigger vector, we get:
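A minimal sketch of the idea, with names and shapes that are my own assumptions: concatenate the vectors of all words in the window into one long vector and feed it to a softmax classifier.

```python
import numpy as np

def window_classify(word_vectors, window_ids, W, b):
    """Classify the center word of a window via its concatenated context vectors.

    word_vectors : (V, d) embedding matrix
    window_ids   : list of 2m+1 word indices (center word in the middle)
    W, b         : softmax classifier parameters, shapes (C, (2m+1)*d) and (C,)
    """
    x = np.concatenate([word_vectors[i] for i in window_ids])  # the "bigger vector"
    scores = W @ x + b
    scores -= scores.max()            # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()        # distribution over the C classes
```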
“In 2019, deep learning is still a kind of craft.”
I have written up the matrix derivation skills here.
zilch n. nothing; something of no value; (also a personal name)
Notation n. a system of symbols; musical score; a method of counting
Orthogonality n. the property of being orthogonal
Determinant n. a determining factor; (matrix) determinant; adj. decisive
A very basic derivation of SVD (非常基础的SVD推导)
Matrix derivative techniques (矩阵求导术)