Word2Vec (CBOW + Skip-gram): Concepts and Code Implementation

Word2Vec

Word2vec attempts to determine the meaning of a word from its neighboring words (the context), thus resolving the context-loss issue of context-free representations such as one-hot encoding.

The two major architectures for word2vec are continuous bag-of-words (CBOW) and skip-gram (SG).

Continuous bag-of-words (CBOW)
(Figure 1: the CBOW architecture)

  1. Input layer: the one-hot encoded vectors of the context words, where $V$ is the size of the vocabulary and $C$ is the number of context words.

  2. Initialize a weight matrix $W_{V \times N}$ and left-multiply it with each input one-hot vector to obtain the $N$-dimensional vectors $\omega_1, \omega_2, \ldots, \omega_C$, where $N$ is chosen according to the needs of the task.

  3. Sum the vectors $\omega_1, \omega_2, \ldots, \omega_C$ and take their average as the hidden-layer vector $h$.

  4. Initialize another weight matrix $W'_{N \times V}$, multiply it by the hidden-layer vector $h$, and pass the result through a softmax activation to obtain a $V$-dimensional vector $y$, whose elements give the probability distribution over the vocabulary.

  5. The word corresponding to the largest element of $y$ is the predicted center word (target word). It is compared with the one-hot vector of the true label; the smaller the error, the better, and the two weight matrices are updated according to this error. A minimal sketch of this forward pass is shown after the list.
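To make these steps concrete, here is a minimal NumPy sketch of the CBOW forward pass. The vocabulary size, embedding dimension, and context indices below are made-up illustrative values, not from the original post.

# Minimal CBOW forward pass (illustrative sketch)
import numpy as np

V, N, C = 10, 4, 2                      # assumed vocabulary size, embedding dim, context size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))             # input->hidden weights, one row per vocabulary word
W_prime = rng.normal(size=(N, V))       # hidden->output weights

context_idx = [3, 7]                    # assumed indices of the context words

# Steps 2-3: multiplying a one-hot vector by W is just a row lookup; average the context rows
h = W[context_idx].mean(axis=0)         # hidden-layer vector, shape (N,)

# Step 4: project to vocabulary space and apply softmax
u = h @ W_prime                         # scores, shape (V,)
y = np.exp(u - u.max())
y /= y.sum()                            # probability distribution over the vocabulary

# Step 5: the most probable word is the predicted center word
predicted = int(np.argmax(y))
print(predicted, y[predicted])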

Before training, a loss function (usually the cross-entropy loss) is defined, and gradient descent is used to update $W$ and $W'$. After training, multiplying the one-hot vector of each input word by the matrix $W$ (i.e., taking the corresponding row of $W$) yields its distributed representation, also called the word embedding. A sketch of one training step is given below.
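The following self-contained sketch shows one gradient-descent step with the cross-entropy loss, and how an embedding is read off from $W$ afterwards. The sizes, learning rate, and word indices are illustrative assumptions, not values from the original post.

# One CBOW training step with cross-entropy loss (illustrative sketch)
import numpy as np

V, N, C, lr = 10, 4, 2, 0.05
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # input->hidden weights (the word embeddings)
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden->output weights

context_idx, target = [3, 7], 5               # assumed context and center-word indices

# Forward pass (as in the steps above)
h = W[context_idx].mean(axis=0)
u = h @ W_prime
y = np.exp(u - u.max())
y /= y.sum()

# Cross-entropy loss for the true center word
loss = -np.log(y[target])

# Backward pass: gradient of the loss w.r.t. the scores is (y - one_hot(target))
e = y.copy()
e[target] -= 1.0
dW_prime = np.outer(h, e)                     # gradient w.r.t. W'
dh = W_prime @ e                              # gradient w.r.t. the hidden vector h

# Gradient-descent updates
W_prime -= lr * dW_prime
W[context_idx] -= lr * dh / C                 # each context row receives 1/C of the gradient

# After training, row i of W is the embedding of word i
embedding_of_word_3 = W[3]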

Skip-gram (SG)
(Figure 2: the Skip-gram architecture)

In Skip-gram, the model iterates over the words in the corpus and predicts their neighbors (i.e., the context): given a word, it predicts the context in which that word is likely to occur. By training on a large corpus, the weight matrix from the input layer to the hidden layer is learned, and its rows serve as the word embeddings. A small sketch of the (center, context) training pairs Skip-gram is trained on is shown below.
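As an illustration of what Skip-gram trains on, the sketch below generates (center word, context word) pairs from one tokenized sentence. The sentence and the window size of 2 are arbitrary assumptions, not part of the original post.

# Generate (center, context) training pairs for Skip-gram (illustrative sketch)
sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
window = 2

pairs = []
for pos, center in enumerate(sentence):
    # consider words up to `window` positions to the left and right of the center word
    for offset in range(-window, window + 1):
        ctx_pos = pos + offset
        if offset == 0 or ctx_pos < 0 or ctx_pos >= len(sentence):
            continue
        pairs.append((center, sentence[ctx_pos]))

print(pairs[:5])
# e.g. [('alice', 'was'), ('alice', 'beginning'), ('was', 'alice'), ...]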

# Python program to generate word vectors using Word2Vec
 
# importing all necessary modules
import warnings
warnings.filterwarnings(action='ignore')

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

import gensim
from gensim.models import Word2Vec

# sent_tokenize/word_tokenize require the 'punkt' tokenizer data
nltk.download('punkt', quiet=True)
 
# Read the 'alice.txt' file
with open("alice.txt", "r", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")
 
data = []
 
# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []
     
    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())
 
    data.append(temp)
 
# Create CBOW model (sg=0, the default, selects CBOW)
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)


# Print results
print("Cosine similarity between 'alice' " +
               "and 'wonderland' - CBOW : ",
    model1.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
                 "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))



# Create Skip Gram model (sg=1 selects Skip-gram)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                window=5, sg=1)
 
# Print results
print("Cosine similarity between 'alice' " +
          "and 'wonderland' - Skip Gram : ",
    model2.wv.similarity('alice', 'wonderland'))
     
print("Cosine similarity between 'alice' " +
            "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))

# output
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.9845774
Cosine similarity between 'alice' and 'machines' - CBOW :  0.94986236
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.7097177
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.81548774
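Beyond pairwise similarity, the trained gensim models can also be queried for the learned vectors and their nearest neighbours. This short follow-up is not part of the original code; it uses gensim's standard KeyedVectors interface, and its output depends on the training corpus.

# Inspect the trained CBOW model (illustrative follow-up)
vec = model1.wv['alice']                         # the 100-dimensional vector learned for 'alice'
print(vec.shape)                                 # (100,)
print(model1.wv.most_similar('alice', topn=5))   # 5 most similar words with cosine scores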

References

  1. Getting started with Word2vec
  2. Word2Vec
  3. Python | Word Embedding using Word2Vec
  4. Data (alice.txt) can be downloaded here
