2022-11-07 Learning word2vec

[Casual notes, personal understanding]

1. word2vec

Originally used for language processing [turning Chinese, English, etc. into something a computer can work with, i.e., word vectors].

The model can be trained in several ways [PyTorch, TensorFlow, Python's gensim library, etc.].

What it gives you: you can look up the word vector of a given word, and compare the similarity of two words or two sentences.

How it works:

Skip-gram: skip-gram is one of word2vec's two training methods; the core idea is to predict the surrounding (context) words from the center word.

CBOW: CBOW is the other training method; the core idea is to predict the center word from the surrounding words.
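To make the difference concrete, here is a tiny sketch (my own illustration, not from the references below) of the training samples the two objectives build from the same window:

# Toy example: samples produced by skip-gram vs. CBOW for one window (C = 1)
tokens = "jack like dog".split()
C = 1  # one context word on each side

for idx in range(C, len(tokens) - C):
    center = tokens[idx]
    context = tokens[idx - C:idx] + tokens[idx + 1:idx + C + 1]
    skip_gram_pairs = [(center, w) for w in context]  # center predicts each context word
    cbow_sample = (context, center)                   # context words jointly predict the center
    print(skip_gram_pairs, cbow_sample)
    # [('like', 'jack'), ('like', 'dog')] (['jack', 'dog'], 'like')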


2. Code

(1) PyTorch

(2) gensim

References: Word2Vec的PyTorch实现(乞丐版) - mathor (wmathor.com) https://wmathor.com/index.php/archives/1443/

Pytorch实现word2vec(Skip-gram训练方式)_我唱歌比较走心的博客-CSDN博客 https://blog.csdn.net/Delusional/article/details/114477987

(1) PyTorch

# Import the required packages

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Text preprocessing
sentences = ["jack like dog", "jack like cat", "jack like animal",
             "dog cat animal", "banana apple cat dog like", "dog fish milk like",
             "dog cat animal like", "jack like apple", "apple like", "jack like banana",
             "apple banana jack movie book music like", "cat dog hate", "cat dog like"]
 
word_sequence = " ".join(sentences).split()  # ['jack', 'like', 'dog', 'jack', 'like', 'cat', 'jack', 'like', 'animal', ...]
vocab = list(set(word_sequence))  # build the vocabulary (deduplicate)
word2idx = {w: i for i, w in enumerate(vocab)}  # {'apple': 0, 'fish': 1, ...}. Note: this mapping is NOT fixed!! 'apple' will not always map to 0. Real implementations sort by word frequency when assigning indices (see the frequency-sorted sketch after this block).

# Model hyperparameters
batch_size = 8
embedding_size = 2  # the word vectors are 2-dimensional
C = 2  # window size: 2 context words on each side
voc_size = len(vocab)  # vocabulary size

# Build skip-gram training pairs
skip_grams = []
print(word2idx)
for idx in range(C, len(word_sequence) - C):
    center = word2idx[word_sequence[idx]]  # center word
 
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # indices of the 2 words left and 2 words right of the center
    context = [word2idx[word_sequence[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])  # each (center word, context word) pair is one training sample
 
 
def make_data(skip_grams):
    input_data = []
    output_data = []
    for i in range(len(skip_grams)):
        # input_data: one-hot vector of the center word; output_data: index of the context word
        input_data.append(np.eye(voc_size)[skip_grams[i][0]])
        output_data.append(skip_grams[i][1])
    return input_data, output_data
 
 
print(skip_grams)
input_data, output_data = make_data(skip_grams)
print(input_data)
print(output_data)
input_data, output_data = torch.Tensor(np.array(input_data)), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, shuffle=True)
"""
skip_grams: [[10, 2],[9, 8], [11, 5], ..., [11, 7], [11, 10], [11, 0]]
input_data: [array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]),...]
output_data: [2, 0, 2, 0, 0, 10, 0, 11, 10, 2, 11, 2, 2, 0, 2, 0, 0, 11, 0, 8, 11, 2, 8, 10, 2, 0, 10,...]
"""

Building the model:

# Build the model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W: projects the one-hot input to the hidden (embedding) layer
        self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))
        # V: output-layer weights
        self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))
 
    def forward(self, X):
        # X : [batch_size, voc_size] one-hot
        # torch.mm only works for 2-D matrices; torch.matmul works for tensors of any rank
        hidden_layer = torch.matmul(X, self.W)  # hidden_layer : [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.V)  # output_layer : [batch_size, voc_size]
        return output_layer
 
 
model = Word2Vec().to(device)
criterion = nn.CrossEntropyLoss().to(device)  # multi-class cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # Adam optimizer

# Training loop
for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
 
 
# Plot each word in 2-D to see how close the learned vectors are to each other
for i, label in enumerate(vocab):
    W, V = model.parameters()
    # W is the word-vector (embedding) matrix; each row is one word's vector
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show() 
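Once training is done, the rows of W are the word vectors, so comparing two words is just a cosine similarity between two rows. A small sketch of that lookup (my own addition, assuming the training loop above has already run):

# Compare two learned word vectors (my own addition)
import torch.nn.functional as F

def word_vec(word):
    return model.W.data[word2idx[word]]  # row of W = embedding of that word

sim = F.cosine_similarity(word_vec("cat").unsqueeze(0),
                          word_vec("dog").unsqueeze(0)).item()
print("cosine(cat, dog) =", sim)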

Full code:

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data
 
dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
# Text preprocessing
sentences = ["jack like dog", "jack like cat", "jack like animal",
             "dog cat animal", "banana apple cat dog like", "dog fish milk like",
             "dog cat animal like", "jack like apple", "apple like", "jack like banana",
             "apple banana jack movie book music like", "cat dog hate", "cat dog like"]
 
word_sequence = " ".join(sentences).split()  # ['jack', 'like', 'dog', 'jack', 'like', 'cat', 'jack', 'like', 'animal', ...]
vocab = list(set(word_sequence))  # build the vocabulary (deduplicate)
word2idx = {w: i for i, w in enumerate(vocab)}  # {'apple': 0, 'fish': 1, ...}. Note: this mapping is NOT fixed!
 
# Model hyperparameters
batch_size = 8
embedding_size = 2  # the word vectors are 2-dimensional
C = 2  # window size: 2 context words on each side
voc_size = len(vocab)  # vocabulary size
 
# Build skip-gram training pairs
skip_grams = []
print(word2idx)
for idx in range(C, len(word_sequence) - C):
    center = word2idx[word_sequence[idx]]  # center word
 
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # indices of the 2 words left and 2 words right of the center
    context = [word2idx[word_sequence[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])  # each (center word, context word) pair is one training sample
 
 
def make_data(skip_grams):
    input_data = []
    output_data = []
    for i in range(len(skip_grams)):
        # input_data: one-hot vector of the center word; output_data: index of the context word
        input_data.append(np.eye(voc_size)[skip_grams[i][0]])
        output_data.append(skip_grams[i][1])
    return input_data, output_data
 
 
print(skip_grams)
input_data, output_data = make_data(skip_grams)
print(input_data)
print(output_data)
input_data, output_data = torch.Tensor(np.array(input_data)), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, shuffle=True)
"""
skip_grams: [[10, 2],[9, 8], [11, 5], ..., [11, 7], [11, 10], [11, 0]]
input_data: [array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]),...]
output_data: [2, 0, 2, 0, 0, 10, 0, 11, 10, 2, 11, 2, 2, 0, 2, 0, 0, 11, 0, 8, 11, 2, 8, 10, 2, 0, 10,...]
"""
 
 
# Build the model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))  # one-hot -> embedding
        self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))  # embedding -> output scores
 
    def forward(self, X):
        # X : [batch_size, voc_size] one-hot
        # torch.mm only works for 2-D matrices; torch.matmul works for tensors of any rank
        hidden_layer = torch.matmul(X, self.W)  # hidden_layer : [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.V)  # output_layer : [batch_size, voc_size]
        return output_layer
 
 
model = Word2Vec().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
 
# Training loop
for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
 
 
# Plot each word in 2-D to see how close the learned vectors are to each other
for i, label in enumerate(vocab):
    W, V = model.parameters()
    # W is the word-vector (embedding) matrix; each row is one word's vector
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
 
 
# https://wmathor.com/index.php/archives/1443/ 

Alternatively, you can use nn.Linear() directly. The only differences are that the input has to be one-hot encoded (you cannot pass a raw index the way nn.Embedding() allows), and that bias=False must be set, because we only want to train a weight matrix, not a bias.
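A minimal sketch of that nn.Linear variant (my own rewrite of the Word2Vec class above, not code from the reference):

# Same model as above, expressed with nn.Linear(bias=False) instead of raw weight matrices
import torch.nn as nn

class Word2VecLinear(nn.Module):
    def __init__(self, voc_size, embedding_size):
        super().__init__()
        # note: nn.Linear stores weight as [out_features, in_features], so the word
        # vectors are the columns of input_to_hidden.weight
        self.input_to_hidden = nn.Linear(voc_size, embedding_size, bias=False)   # one-hot -> embedding
        self.hidden_to_output = nn.Linear(embedding_size, voc_size, bias=False)  # embedding -> scores

    def forward(self, one_hot_x):                     # [batch_size, voc_size]
        hidden = self.input_to_hidden(one_hot_x)      # [batch_size, embedding_size]
        return self.hidden_to_output(hidden)          # [batch_size, voc_size]

The next block is a fuller implementation: nn.Embedding for the two weight matrices, the text8 corpus, and negative sampling instead of a full softmax.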

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data as tud
 
from collections import Counter
import numpy as np
import random
 
import scipy.spatial
from sklearn.metrics.pairwise import cosine_similarity
 
random.seed(1)
np.random.seed(1)
torch.manual_seed(1)
 
C = 3  # context window: 3 words on each side
K = 15  # number of negative (noise) samples per positive word
epoch = 2
MAX_VOCAB_SIZE = 10000
EMBEDDING_SIZE = 100
batch_size = 32
lr = 0.2
 
# Read and preprocess the text data
with open('./text8/text8.train.txt') as f:
    text = f.read()  # raw text
 
text = text.lower().split()  # lowercase and split into a word list
vocab_dict = dict(Counter(text).most_common(MAX_VOCAB_SIZE - 1))  # word -> count for the most frequent words
vocab_dict['<UNK>'] = len(text) - np.sum(list(vocab_dict.values()))  # everything rarer is mapped to '<UNK>'
print(len(vocab_dict))
word2idx = {word: i for i, word in enumerate(vocab_dict.keys())}
idx2word = {i: word for i, word in enumerate(vocab_dict.keys())}
word_counts = np.array([count for count in vocab_dict.values()], dtype=np.float32)
word_freqs = word_counts / np.sum(word_counts)
word_freqs = word_freqs ** (3. / 4.)  # raise the frequencies to the 0.75 power, as in the original paper
 
 
# Dataset feeding (center, positive, negative) triples to the DataLoader
class WordEmbeddingDataset(tud.Dataset):
    def __init__(self, text, word2idx, word_freqs):
        super(WordEmbeddingDataset, self).__init__()
        # encode words as indices; anything outside the top-10000 vocabulary maps to the '<UNK>' index
        self.text_encoded = [word2idx.get(word, word2idx['<UNK>']) for word in text]
        self.text_encoded = torch.LongTensor(self.text_encoded)  # nn.Embedding expects LongTensor input
        self.word2idx = word2idx
        self.word_freqs = torch.Tensor(word_freqs)
 
    def __len__(self):
        return len(self.text_encoded)  # total number of tokens in the corpus
 
    def __getitem__(self, idx):
        """
        返回以下数据:
        1.中心词
        2.这个单词附近的positive word
        3.随机采样的K个单词作为negative word
        """
        center_words = self.text_encoded[idx]  # 取得中心词
        pos_indices = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # 取得中心左右各C个词的索引
        pos_indices = [i % len(self.text_encoded) for i in pos_indices]  # 避免索引越界
        pos_words = self.text_encoded[pos_indices]  # tensor(list)
 
        # torch.multinomial作用是对self.word_freqs做K * pos_words.shape[0]次取值,输出的是self.word_freqs对应的下标
        # 取样方式采用有放回的采样,并且self.word_freqs数值越大,取样概率越大
        # 每采样一个正确的单词(positive word),就采样K个错误的单词(negative word),pos_words.shape[0]是正确单词数量
        neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)
 
        # resample until none of the sampled negatives is actually one of the context words
        while len(set(pos_words.tolist()) & set(neg_words.tolist())) > 0:
            neg_words = torch.multinomial(self.word_freqs, K * pos_words.shape[0], True)
        return center_words, pos_words, neg_words
 
 
dataset = WordEmbeddingDataset(text, word2idx, word_freqs)
dataloader = tud.DataLoader(dataset, batch_size, shuffle=True)
print(next(iter(dataset)))
 
 
class EmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(EmbeddingModel, self).__init__()
        self.vocab_size = vocab_size
        self.embed_size = embed_size
 
        self.in_embed = nn.Embedding(self.vocab_size, self.embed_size)
        self.out_embed = nn.Embedding(self.vocab_size, self.embed_size)
 
    def forward(self, input_labels, pos_labels, neg_labels):
        """
        :param input_labels: center words, [batch_size]
        :param pos_labels: positive words, [batch_size, (window_size * 2)]
        :param neg_labels: negative words, [batch_size, (window_size * 2 * K)]
        :return: loss, [batch_size]
        """
        input_embedding = self.in_embed(input_labels)  # [batch_size, embed_size]
        pos_embedding = self.out_embed(pos_labels)  # [batch_size, (window * 2), embed_size]
        neg_embedding = self.out_embed(neg_labels)  # [batch_size, (window * 2 * K), embed_size]
 
        # squeeze removes a size-1 dimension; unsqueeze adds one
 
        input_embedding = input_embedding.unsqueeze(2)  # [batch_size, embed_size, 1], add a trailing dimension
        # bmm multiplies two batched 3-D tensors: (b, m, n) x (b, n, k) -> (b, m, k)
        pos_dot = torch.bmm(pos_embedding, input_embedding)  # [batch_size, (window * 2), 1]
        pos_dot = pos_dot.squeeze(2)  # [batch_size, (window * 2)]
 
        neg_dot = torch.bmm(neg_embedding, -input_embedding)  # [batch_size, (window * 2 * K), 1]
        neg_dot = neg_dot.squeeze(2)  # [batch_size, (window * 2 * K)]
 
        log_pos = F.logsigmoid(pos_dot).sum(1)  # .sum() would collapse to one scalar; .sum(1) keeps one value per sample
        log_neg = F.logsigmoid(neg_dot).sum(1)
 
        # negative-sampling objective: loss = -( log sigmoid(u_pos . v_c) + sum_k log sigmoid(-u_neg_k . v_c) )
        loss = log_pos + log_neg
 
        return -loss
 
    def input_embedding(self):
        return self.in_embed.weight.detach().numpy()
 
 
model = EmbeddingModel(MAX_VOCAB_SIZE, EMBEDDING_SIZE)
print(model)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
 
# Train the model
for e in range(epoch):
    for i, (input_labels, pos_labels, neg_labels) in enumerate(dataloader):
        input_labels = input_labels.long()
        pos_labels = pos_labels.long()
        neg_labels = neg_labels.long()
 
        optimizer.zero_grad()
        loss = model(input_labels, pos_labels, neg_labels).mean()
        loss.backward()
        optimizer.step()
 
        if i % 100 == 0:
            print('epoch', e, 'iteration', i, loss.item())
 
embedding_weights = model.input_embedding()
torch.save(model.state_dict(), "embedding-{}.th".format(EMBEDDING_SIZE))
 
 
def find_nearest(word):
    index = word2idx[word]
    embedding = embedding_weights[index]
    # cosine distance to every word in the vocabulary; smaller distance = more similar
    cos_dis = np.array([scipy.spatial.distance.cosine(e, embedding) for e in embedding_weights])
    return [idx2word[i] for i in cos_dis.argsort()[:10]]
 
 
for word in ["two", 'america', "computer"]:
    print(word, find_nearest(word)) 
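The cosine_similarity import from sklearn above is never actually used; a small sketch (my own addition) of how it could do the same job as find_nearest with the trained embedding_weights:

# Alternative nearest-word lookup via sklearn (my own addition; find_nearest above uses scipy instead)
def find_nearest_sklearn(word, topn=10):
    vec = embedding_weights[word2idx[word]].reshape(1, -1)  # [1, embed_size]
    sims = cosine_similarity(vec, embedding_weights)[0]     # similarity to every word, [vocab_size]
    best = np.argsort(-sims)[:topn]                         # most similar first
    return [idx2word[i] for i in best]

print(find_nearest_sklearn("two"))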


(2) gensim

Reference:

python-word2vec模块使用详解_这是一个死肥宅的博客-CSDN博客

pip install --upgrade gensim

[1] sentences is the training corpus; it can be supplied in two formats:

a. Text file:
tokenize each article, remove stop words, join the tokens with spaces, and write them to a txt file (one article per line), e.g.:

你 好 我 是 网 络 咸 鱼

b. Nested lists (each inner list is one tokenized article); a sketch of both loading styles follows below:

[ [你] , [好], …]
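A minimal sketch of both loading styles (my own example; corpus_tokenized.txt is a placeholder file name, and the parameter names follow gensim 4.x):

# Two ways to build sentences for gensim's Word2Vec (illustrative only)
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# a. text file: one tokenized, space-separated article per line
sentences_from_file = LineSentence('corpus_tokenized.txt')  # placeholder path

# b. nested lists: each inner list is one tokenized article/sentence
sentences_from_list = [["你", "好"], ["我", "是", "网", "络", "咸", "鱼"]]

# min_count=1 so the tiny toy corpus is not filtered away;
# vector_size was called `size` before gensim 4.0
model = Word2Vec(sentences_from_list, vector_size=100, window=5, min_count=1)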

[2] Parameters:

def __init__(self, sentences=None, size=100, alpha=0.025, window=5, min_count=5,
                 max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,
                 sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                 trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, compute_loss=False, callbacks=()):

# sg=1 selects the skip-gram algorithm (more sensitive to rare words); the default sg=0 is CBOW.
# size is the dimensionality of the word vectors (not the number of layers); very large values cost memory and slow training, 100-200 is typical. (In gensim 4.0+ this parameter is called vector_size, and iter is called epochs.)
# window is the maximum distance between the current word and the predicted word within a sentence; window=3 means looking at 3-b words before the target and b words after it (b is random in 0-3).
# min_count filters the vocabulary: words with frequency below min_count are ignored; the default is 5.
# negative and sample can be tuned based on training results; sample down-samples very frequent words to the given threshold, default 1e-3.
# negative: if > 0, negative sampling is used, and this sets how many noise words are drawn.
# hs=1 uses hierarchical softmax; with the default hs=0 and negative != 0, negative sampling is used instead.
# workers is the number of threads; it only helps if Cython is installed, otherwise training runs on a single core.

The parameters to focus on are sentences, size, window, and min_count.

sentences is a list (or an iterable of tokenized sentences); size is the dimensionality of the output vectors.

window: the maximum distance between the current word and the predicted word within a sentence.

min_count: used when building the vocabulary; words that appear fewer than min_count times are dropped. The default is 5.

[3] Model training:

import multiprocessing
import gensim
from gensim.models.word2vec import LineSentence

model_dir = './'  # placeholder: directory holding train_word.txt and the saved models

dim = 300
embedding_size = dim
model = gensim.models.Word2Vec(LineSentence(model_dir + 'train_word.txt'),
                               size=embedding_size,  # vector_size in gensim 4.0+
                               window=5,
                               min_count=10,
                               workers=multiprocessing.cpu_count())

model.save(model_dir + "word2vec_gensim"+str(embedding_size)+".w2v")
model.wv.save_word2vec_format(model_dir + "word2vec_gensim_300d.txt", binary=False)

[4] Saving and loading the model:

model.save("filename")    # saved in the same directory as the .py file; the file is a binary format (not human-readable)
model.wv.save_word2vec_format("filename", binary=True)  # or binary=False for plain text

  1. model2.wv.save_word2vec_format('word2vec.vector')

  2. model2.wv.save_word2vec_format('word2vec.bin')

# A model saved this way can be opened as text, or as binary depending on the binary flag. However, this format only keeps the word vectors and drops the training state (during training word2vec organizes words in a Huffman-tree-like structure), so a model saved like this cannot be trained further.

PS:

model.train(more_sentences)   # continue training with additional sentences

If you try to continue training a model loaded from ..wv.save_word2vec_format, you get the error below (a sketch of both cases follows):
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'train'
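A hedged sketch of the two cases (my own example; file names and more_sentences are placeholders). Note that in current gensim, train() also needs total_examples and epochs:

from gensim.models import Word2Vec, KeyedVectors

more_sentences = [["再", "来", "一些", "新", "句子"]]  # placeholder extra training data

# Case 1: full model saved with model.save() -> can keep training
full_model = Word2Vec.load("word2vec_gensim300.w2v")   # placeholder file name
full_model.build_vocab(more_sentences, update=True)    # extend the vocabulary first
full_model.train(more_sentences,
                 total_examples=len(more_sentences),
                 epochs=full_model.epochs)

# Case 2: vectors saved with wv.save_word2vec_format() -> lookup only
vectors_only = KeyedVectors.load_word2vec_format("word2vec_gensim_300d.txt", binary=False)
# vectors_only.train(...)  -> AttributeError, as shown above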
 

# Loading
gensim.models.Word2Vec.load("model_file")  # for models saved with model.save
model = gensim.models.KeyedVectors.load_word2vec_format('model_file')  # for files saved with model.wv.save_word2vec_format

Each way of saving has its matching way of loading.
 

Find the words most similar to a given word:
model.most_similar("word", topn=10)    # topn limits the result to the n most similar words

Compute the similarity of two words:
model.similarity("word1","word2")  

Compute the similarity of two lists of words:

list1 = [u'今天', u'我', u'很', u'开心']
list2 = [u'空气',u'清新', u'善良', u'开心']
list3 = [u'国家电网', u'再次', u'宣告', u'破产', u'重新']
list_sim1 =  model.n_similarity(list1, list2)
list_sim2 = model.n_similarity(list1, list3)

Get the vector of a word:
model['word']
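In recent gensim versions these lookups all live on model.wv; a consolidated sketch (my own, assuming a trained model whose vocabulary contains the queried words):

# Common queries via the KeyedVectors object model.wv (words are placeholders)
vec = model.wv["word"]                                   # the word's vector
neighbors = model.wv.most_similar("word", topn=10)       # 10 nearest words
sim = model.wv.similarity("word1", "word2")              # similarity of two words
list_sim = model.wv.n_similarity(["今天", "开心"], ["空气", "开心"])  # similarity of two token lists
print(neighbors[:3], sim, list_sim)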



I'll add more here as I run into other things!!!!

PS:

Takeaway:

The biggest value of word2vec is getting word vectors! The same idea can be applied to protein sequences: get the vectors first, then build other models on top and feed the vectors into them!!!!
