Training Word Vectors with PyTorch

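Overview

A common way to train word vectors is the word2vec model provided by the gensim library. This post first gives a short example of that model, and then tries to implement word-vector training in PyTorch.

Data: the word.txt file

The file contains the following: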

pen pencil pencilcase ruler book bag comic-book post-card newspaper schoolbag eraser crayon sharpener storybook notebook Chinese-book English-book math book magazine dictionary foot head face hair nose mouth eye ear arm hand finger leg tail
red blue yellow green white black pink purple orange brown
cat dog pig duck rabbit horse elephant ant fish bird eagle beaver snake mouse squirrel kangaroo monkey panda bear lion tiger fox zebra deer giraffe goose hen turkey lamb sheep-goat cow donkey squid lobster shark seal sperm whale killer-whale
friend boy girl mother father sister brother uncle man woman Mr Miss lady mom dad parents grandparents grandma grandmother grandpa grandfather aunt cousin son daughter baby kid classmate queen visitor neighbour principal university-student pen-pal tourist people robot
teacher student doctor nurse driver farmer singer writer actor actress artist TV-reporter engineer accountant policeman salesperson cleaner baseball-player assistant police
rice bread beef milk water egg fish tofu cake hot-dog hamburger French-fries cookie biscuit jam noodles meat chicken pork mutton vegetable salad soup-ice icecream Coke juice tea coffee-breakfast lunch dinner supper meal
apple banana pear orange watermelon grape eggplant green-beans tomato potato peach strawberry cucumber onion carrot cabbage
jacket shirt Tshirt skirt dress jeans pants socks shoes sweater coat raincoat shorts sneakers slippers sandals boots hat cap sunglasses tie scarf gloves trousers cloth 
bike bus train boat ship yacht car taxi jeep van plane airplane subway underground motor-cycle
window door desk chair bed computer board fan light teachers-desk picture wall floor curtain trash-bin closet mirror end-table football soccer present walkman lamp phone sofa shelf fridge table TV airconditioner key lock photo chart plate knife fork spoon chopsticks pot gift-toy doll ball balloon kite jigsaw-puzzle box umbrella zipper violin yoyo nest hole tube toothbrush menu ecard email traffic-light money medicine
home room bedroom bathroom living-room kitchen classroom school park library post-office police-office hospital cinema bookstore farm zoo garden-study playground canteen teachers-office library gym washroom art-room computer-room music-room TV-room flat company factory fruit-stand pet-shop nature-park theme-park science-museum Great-Wall supermarket bank country village city hometown bus-stop
sports science Moral-Education Social-Studies-Chinese math PE English
China PRC America USA UK England Canada CAN Australia New-York London Sydney Moscow-Cairo
cold warm cool snowy sunny hot rainy windy cloudy weather-report
river lake stream forest path road house-bridge building rain cloud sun mountain sky-rainbow wind air moon
flower grass tree seed sprout plant rose leaf
Monday Tuesday Wednesday Thursday Friday Saturday Sunday weekend
Jan Feb Mar April May June July Aug Sept Oct Nov Dec
spring summer fall autumn winter
south north east west left right
have-a-fever hurt have-a-cold have-a-toothache have-a-headache have-a-sore-throat
one two three four five six seven eight nine ten-eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty-sixty seventy eighty ninety fortytwo hundred one a-hundred-and-thirtysix first second third fourth fifth eighth ninth twelfth twentieth thirtieth fortieth fiftieth sixtieth seventieth eightieth ninetieth fiftysixth
big small long tall short young old strong thin active quiet nice kind strict smart funny tasty sweet salty sour fresh favourite clean tired excited angry happy bored sad taller shorter stronger older younger bigger heavier longer thinner smaller good fine great heavy new fat happy right hungry cute little lovely beautiful colourful pretty cheap expensive juicy tender healthy ill helpful high easy proud sick better higher
in on under near behind next-to over in-front-of
I we you he she it they my our your his her
play swim skate fly jump walk run climb fight swing eat sleep like have turn buy take live teach go study learn sing dance row do do-homework do-housework watch-TV read-books cook-meals water-flowers sweep-floor clean-bedroom make-bed set-table wash-clothes do-dishes use-a-computer do-morning-exercises eat-breakfast eat-dinner go-to-school have-English-class play-sports getup climb-mountains go-shopping play-piano visit-grandparents go-hiking fly-kites make-a-snowman plant-trees draw-pictures cook-dinner read-a-book answer-phone listen-to-music clean-room write-a-letter write-an-email drink-water take-pictures watch-insects pick-up-leaves do-an-experiment catch-butterflies count-insects collect-insects collect-leaves write-a-report play-chess have-a-picnic get-to ride-a-bike play-violin make-kites collect-stamps meet welcome thank love work drink taste smell feed shear milk look guess help pass show use clean open close put paint tell kick bounce ride stop wait find drive fold send wash shine become feel think meet fall leave wake-up put-on take-off hang-up wear go-home go-to-bed play-computer-games play-chess empty-trash put-away clothes get-off take-a-trip read-a-magazine go-to-cinema go-straight

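The data set is small, but it is enough for a simple example. Each category of words is placed on its own line.

Training word2vec with gensim

First install gensim with pip install gensim. Once it is installed, the training code is as follows: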
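
To make the "one category per line" layout concrete, here is a minimal sketch that treats each line as one sentence and collects the vocabulary. The inline sample_text is a shortened stand-in for word.txt, and the helper name read_sentences is made up for this illustration.

```python
def read_sentences(text):
    """Split raw text into one list of tokens per non-empty line."""
    return [line.split() for line in text.strip().split("\n") if line.strip()]

# Shortened stand-in for word.txt (three of the categories above)
sample_text = """red blue yellow green
cat dog pig duck
spring summer fall autumn winter"""

sentences = read_sentences(sample_text)
vocab = sorted({word for sentence in sentences for word in sentence})

print(len(sentences))  # 3 lines, so 3 sentences
print(len(vocab))      # 13 distinct words
```

Because each line is one sentence, context windows never cross from one category into the next.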

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence('word.txt')
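# Read the word.txt file shown above; this yields an iterable with one list of words per line
model = Word2Vec(sentences, size=16, window=5, min_count=0, workers=4, iter=5000)
# Parameters (gensim 3.x names; gensim 4.x renamed size to vector_size and iter to epochs):
# size: each word is represented by a 16-dimensional vector
# window: window size 5, i.e. the maximum distance between the current word and a predicted word is 5
# min_count: minimum word frequency; words below it are ignored. The default is 5; set to 0 here because the data set is small
# workers: number of worker threads, 4 here
# iter: number of passes over the corpus, 5000 here
model.save('gensim_16.mdl')
# Save the model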

Because the data set is small, training finishes in about a minute. The test code is as follows:

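model = Word2Vec.load('gensim_16.mdl')
# Load the model (the same file name that was passed to save)
items = model.wv.most_similar('bear')
# wv provides many utility methods; this one returns words sorted by similarity to the given word, highest first
for i, item in enumerate(items):
    print(i, item[0], item[1])

print(model.wv.similarity('bear', 'tiger'))
# Compute the similarity between two words
Inspecting the word vectors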

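model.wv.index2word lists the label (the word) for every vector in the model (in gensim 4.x the attribute is named index_to_key). The result looks like this: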

['book',
 'math',
 'orange',
 'fish',
 'milk',
...
]

So all word vectors can be obtained with model[model.wv.index2word] (model.wv[...] is the non-deprecated form). The result looks like this:

array([[ 2.1858058 , -1.1265628 ,  0.7986337 , ..., -3.3885555 ,
        -5.0689073 , -2.3837712 ],
       [ 2.5849087 ,  0.6549566 ,  1.0028977 , ..., -1.8795928 ,
        -4.4294124 , -4.1221085 ],
       [ 0.9784559 , -4.1107635 ,  0.8471646 , ..., -3.7726424 ,
        -0.33898747, -3.4206762 ],
       ...,
       [ 2.0379307 , -1.7257718 ,  0.98616403, ..., -2.5776517 ,
        -0.8687243 ,  1.4909588 ],
       [ 1.8207592 , -1.4406224 ,  0.66797787, ..., -2.2530203 ,
        -0.6574308 ,  1.4921187 ],
       [ 1.2744113 , -1.1354392 ,  0.6139609 , ..., -1.8367131 ,
        -0.59694195,  1.073009  ]], dtype=float32)
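Dimensionality reduction and visualization

We can reduce the word vectors above to two dimensions and plot them to see how the vectors are distributed. The code is as follows: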

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, init='pca', n_iter=5000)
# Reduce the data to 2 dimensions with t-SNE, initialized from PCA, for 5000 iterations
embed_two = tsne.fit_transform(model[model.wv.index2word])
# Convert the word vectors to their 2-d form
labels = model.wv.index2word
# The label of each vector is the word itself

plt.figure(figsize=(15, 12))
for i, label in enumerate(labels[:80]):
    # Show the 2-d positions of the first 80 word vectors
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
    # Annotate the point with its word
# plt.savefig('word.png')

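The result:

Figure: word vectors after dimensionality reduction

The word vectors form several clusters: for example, the color words mostly gather in a block in the middle, while the animals gather at the bottom.

References: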
https://blog.csdn.net/zhl493722771/article/details/82781675
https://blog.csdn.net/qq_27586341/article/details/90025288

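Defining and training a model in PyTorch

The previous part called gensim's word2vec model directly. Next we try to train word vectors with PyTorch. First we need to choose a training scheme; there are generally two:

CBOW (Continuous Bag-of-Words): predict the current word from its context words
Skip-Gram: predict the context words from the current word

For example, take one category of data: [a, b, c, d, e]. With CBOW, the model's input/output pairs [(x1, y1), (x2, y2), ...] could be [(a, c), (b, c), (d, c), (e, c)]: the inputs a, b, d, and e all map to the target c. In the same way, the inputs a, c, d, and e can all map to the target b, and so on, so many inputs predict one word. Skip-Gram trains on the reverse relation, so the same case gives [(c, a), (c, b), (c, d), (c, e)].

Here we train with Skip-Gram. The steps are as follows:

Imports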
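
The two directions can be sketched in a few lines. This follows the simplified pairwise view used in the paragraph above (a full CBOW model combines the context vectors into a single input rather than feeding them one by one); the function name make_pairs and its mode argument are invented for the illustration.

```python
def make_pairs(words, center_index, mode):
    """Return (input, target) pairs for one center word.

    mode="cbow":     each context word is an input, the center word is the target
    mode="skipgram": the center word is the input, each context word is a target
    """
    center = words[center_index]
    context = [w for k, w in enumerate(words) if k != center_index]
    if mode == "cbow":
        return [(w, center) for w in context]
    if mode == "skipgram":
        return [(center, w) for w in context]
    raise ValueError("mode must be 'cbow' or 'skipgram'")

words = ["a", "b", "c", "d", "e"]
cbow_pairs = make_pairs(words, 2, "cbow")
skipgram_pairs = make_pairs(words, 2, "skipgram")

print(cbow_pairs)      # [('a', 'c'), ('b', 'c'), ('d', 'c'), ('e', 'c')]
print(skipgram_pairs)  # [('c', 'a'), ('c', 'b'), ('c', 'd'), ('c', 'e')]
```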
import torch
from torch import nn
import matplotlib.pyplot as plt
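Initialization

We first select the GPU if one is available, then define the context window size. It is set to 2 here, meaning each word in the data set is paired with the two words before it and the two words after it. A few other settings are initialized as well: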

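device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use the GPU when available, otherwise fall back to the CPU

context_size = 2
# Context size: the maximum distance between the current word and a related word
lr = 1e-3
batch_size = 64
li_loss = []
# Loss recorded after each epoch
Data preprocessing

Read the word data set from above and build the input and output data according to the window size: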

with open("word.txt", "r", encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

set_words = set()
# Set of all distinct words
for words in lines:
    for word in words.split():
        set_words.add(word)
word_to_id = {word: i for i, word in enumerate(set_words)}
# Map word -> id
id_to_word = {word_to_id[word]: word for word in word_to_id}
# Map id -> word
word_size = len(set_words)
# Vocabulary size

train_x = []
train_y = []
for words in lines:
    li_words = words.split()
    # All words on the current line
    for i, word in enumerate(li_words):
        for j in range(-context_size, context_size + 1):
            # Pair each word with the words inside its context window
            if i + j < 0 or i + j > len(li_words) - 1 or li_words[i + j] == word:
                # Skip positions outside the line and the current word itself
                continue
            train_x.append(word_to_id[word])
            train_y.append(word_to_id[li_words[i + j]])
            # Skip-Gram training data: the input is the current word, the output is each word inside its context window

Here is a look at part of the input and output data:

print("init:", lines[0].split()[:10])
print("x:", [ id_to_word[each] for each in train_x[:10]])
print("y:", [ id_to_word[each] for each in train_y[:10]])

# Output:
# init: ['pen', 'pencil', 'pencilcase', 'ruler', 'book', 'bag', 'comic-book', 'post-card', 'newspaper', 'schoolbag']
# x: ['pen', 'pen', 'pencil', 'pencil', 'pencil', 'pencilcase', 'pencilcase', 'pencilcase', 'pencilcase', 'ruler']
# y: ['pencil', 'pencilcase', 'pen', 'pencilcase', 'ruler', 'pen', 'pencil', 'ruler', 'book', 'pencil']

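For the first word, pen, the targets are pencil and pencilcase; for pencil they are pen, pencilcase, and ruler; and so on. Each word only predicts the two words before it and the two words after it, and positions outside the line are ignored.

Defining the model

The model uses an embedding layer to learn the word-vector table, followed by a fully connected layer and a log-softmax that computes the log-probability of every word in the vocabulary: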
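
As a rough check on how much training data this windowing produces, the sketch below counts the pairs generated for one line. It assumes the line has no repeated words inside a window (the preprocessing code skips those), and both helper names are made up for this example.

```python
def count_pairs(n_words, context_size):
    """Count skip-gram pairs for a line of n_words distinct words."""
    total = 0
    for i in range(n_words):
        left = min(i, context_size)                 # neighbours available on the left
        right = min(n_words - 1 - i, context_size)  # neighbours available on the right
        total += left + right
    return total

def closed_form(n_words, context_size):
    """Same count in closed form, valid when n_words > context_size."""
    return 2 * context_size * n_words - context_size * (context_size + 1)

print(count_pairs(5, 2))   # 14: the words contribute 2, 3, 4, 3, 2 pairs
print(closed_form(5, 2))   # 14
print(count_pairs(10, 2))  # 34
```

The per-word counts 2, 3, 4 match the sample output above, where pen appears twice in x, pencil three times, and pencilcase four times.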

class EmbedWord(nn.Module):
    def __init__(self, word_size, context_size):
        # context_size is accepted to mirror the settings above but is not used by the model
        super(EmbedWord, self).__init__()
        self.embedding = nn.Embedding(word_size, 16)
        # Represent each word with a 16-dimensional vector
        self.linear = nn.Linear(16, word_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)
        # Normalize over the vocabulary dimension

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)
        x = self.log_softmax(x)
        return x

model = EmbedWord(word_size, context_size).to(device)
Defining the optimizer and loss
loss_fun = nn.NLLLoss()
# Use NLL (negative log-likelihood) as the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
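
As a sanity check on what a log-softmax followed by an NLL loss computes for a single sample, here is a pure-Python sketch. The scores are made-up numbers and the helper names are illustrative only; PyTorch's own implementation works on batched tensors and by default averages the loss over the batch.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def nll(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]

scores = [2.0, 1.0, 0.1]  # made-up linear-layer output for a 3-word vocabulary
log_probs = log_softmax(scores)
loss = nll(log_probs, target=0)

print(round(sum(math.exp(lp) for lp in log_probs), 6))  # probabilities sum to 1.0
print(round(loss, 4))                                   # roughly 0.417
```

The same quantity is what nn.CrossEntropyLoss computes when it is applied directly to the raw linear outputs, so that loss is an alternative to the LogSoftmax plus NLLLoss pairing used here.
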
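Training

The word-vector table has many parameters to learn, so we train for 50000 epochs here (100000 or more is recommended for better results) and print the training status every 50 epochs: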

model.train()
for epoch in range(len(li_loss), 50000):
    if epoch % 2000 == 0 and epoch > 0:
        optimizer.param_groups[0]['lr'] /= 1.05
        # Lower the learning rate slightly every 2000 epochs
    for batch in range(0, len(train_x), batch_size):
        # Step through the pairs in mini-batches, including the final partial batch
        word = torch.tensor(train_x[batch: batch + batch_size]).long().to(device)
        label = torch.tensor(train_y[batch: batch + batch_size]).long().to(device)
        out = model(word)
        loss = loss_fun(out, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    li_loss.append(loss.item())
    # Store the last batch's loss as a plain float so the autograd graph is not kept alive

    if epoch % 50 == 0 and epoch > 0:
        print('epoch: {}, Loss: {}, lr: {}'.format(epoch, loss.item(), optimizer.param_groups[0]['lr']))
        plt.plot(li_loss[-500:])
        plt.show()

        with torch.no_grad():
            for w in range(5):
                # Every 50 epochs, check the top predictions for the first 5 words
                pred = model(torch.tensor(w).long().to(device))
                print("{} -> ".format(id_to_word[w]), end="\t")
                for i, each in enumerate((-pred).argsort()[:10]):
                    print("{}:{}".format(i, id_to_word[int(each)]), end="   ")
                print()
Dimensionality reduction and visualization

This code is essentially the same as before:
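
To see how far the manual decay in the loop above actually lowers the learning rate, the schedule can be written as a small function. This is only a sketch of the arithmetic; lr_at_epoch is a name invented here, and its defaults copy the values used in this post.

```python
def lr_at_epoch(epoch, base_lr=1e-3, step=2000, factor=1.05):
    """Learning rate in effect during `epoch`, given one division by
    `factor` at every positive multiple of `step`."""
    return base_lr / factor ** (epoch // step)

for epoch in (0, 1999, 2000, 10000, 49999):
    print(epoch, lr_at_epoch(epoch))
```

Over 50000 epochs the rate is divided 24 times, ending at roughly 3.1e-4, so the decay is quite gentle. PyTorch's torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=1 / 1.05), stepped once per epoch, should give a comparable schedule without editing param_groups by hand.
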

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

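result = model(torch.tensor([i for i in range(word_size)]).long().to(device))
# Model outputs for every word; only needed for the alternative commented out below
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
embed_two = tsne.fit_transform(model.embedding.weight.cpu().detach().numpy())
# Reduce the word vectors to 2 dimensions to inspect their layout
# embed_two = tsne.fit_transform(result.cpu().detach().numpy())
labels = [id_to_word[i] for i in range(min(200, word_size))]
# Only look at the first 200 words
plt.figure(figsize=(15, 12))
for i, label in enumerate(labels):
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
# plt.savefig('word_vectors_tsne.png')

The visualization:

Figure: word vectors after dimensionality reduction

Looking closely, part of the data is indeed clustered well, but compared with the gensim result above it is still somewhat worse.

Complete code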
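
Besides comparing plots, one way to compare the two models is to rank nearest neighbours by cosine similarity, which is also what gensim's most_similar ranks by. The sketch below is pure Python over a list of vectors; the toy vectors are invented purely for illustration, and the function names are not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word_ids, query, topn=3):
    """Rank all other words by cosine similarity to `query`, highest first."""
    q = vectors[word_ids[query]]
    scored = [(word, cosine(q, vectors[idx]))
              for word, idx in word_ids.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-d vectors, not taken from a trained model
toy_word_ids = {"cat": 0, "dog": 1, "red": 2, "blue": 3}
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

for word, score in most_similar(toy_vectors, toy_word_ids, "cat"):
    print(word, round(score, 3))
```

With the trained PyTorch model, the inputs would be model.embedding.weight.detach().cpu().tolist() and word_to_id, so the neighbours of a word such as bear can be compared with the gensim output.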
import torch
from torch import nn
import matplotlib.pyplot as plt

# ----------------------------
# Initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

context_size = 2
# Context size: the maximum distance between the current word and a related word
lr = 1e-3
batch_size = 64
li_loss = []

# ----------------------------
# Data preprocessing
with open("word.txt", "r", encoding="utf-8") as f:
    lines = f.read().strip().split("\n")

set_words = set()
# Set of all distinct words
for words in lines:
    for word in words.split():
        set_words.add(word)
word_to_id = {word: i for i, word in enumerate(set_words)}
# Map word -> id
id_to_word = {word_to_id[word]: word for word in word_to_id}
# Map id -> word
word_size = len(set_words)
# Vocabulary size (computed here because set_words must exist first)

train_x = []
train_y = []
for words in lines:
    li_words = words.split()
    # All words on the current line
    for i, word in enumerate(li_words):
        for j in range(-context_size, context_size + 1):
            # Pair each word with the words inside its context window
            if i + j < 0 or i + j > len(li_words) - 1 or li_words[i + j] == word:
                # Skip positions outside the line and the current word itself
                continue
            train_x.append(word_to_id[word])
            train_y.append(word_to_id[li_words[i + j]])
            # Skip-Gram training data: the input is the current word, the output is each word inside its context window

# ----------------------------
# Model definition
class EmbedWord(nn.Module):
    def __init__(self, word_size, context_size):
        # context_size is accepted to mirror the settings above but is not used by the model
        super(EmbedWord, self).__init__()
        self.embedding = nn.Embedding(word_size, 16)
        # Represent each word with a 16-dimensional vector
        self.linear = nn.Linear(16, word_size)
        self.log_softmax = nn.LogSoftmax(dim=-1)
        # Normalize over the vocabulary dimension

    def forward(self, x):
        x = self.embedding(x)
        x = self.linear(x)
        x = self.log_softmax(x)
        return x

model = EmbedWord(word_size, context_size).to(device)

# ----------------------------
# Optimizer and loss
loss_fun = nn.NLLLoss()
# Use NLL (negative log-likelihood) as the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

# ----------------------------
# Training
model.train()
for epoch in range(len(li_loss), 50000):
    if epoch % 2000 == 0 and epoch > 0:
        optimizer.param_groups[0]['lr'] /= 1.05
        # Lower the learning rate slightly every 2000 epochs
    for batch in range(0, len(train_x), batch_size):
        # Step through the pairs in mini-batches, including the final partial batch
        word = torch.tensor(train_x[batch: batch + batch_size]).long().to(device)
        label = torch.tensor(train_y[batch: batch + batch_size]).long().to(device)
        out = model(word)
        loss = loss_fun(out, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    li_loss.append(loss.item())
    # Store the last batch's loss as a plain float so the autograd graph is not kept alive

    if epoch % 50 == 0 and epoch > 0:
        print('epoch: {}, Loss: {}, lr: {}'.format(epoch, loss.item(), optimizer.param_groups[0]['lr']))
        plt.plot(li_loss[-500:])
        plt.show()

        with torch.no_grad():
            for w in range(5):
                # Every 50 epochs, check the top predictions for the first 5 words
                pred = model(torch.tensor(w).long().to(device))
                print("{} -> ".format(id_to_word[w]), end="\t")
                for i, each in enumerate((-pred).argsort()[:10]):
                    print("{}:{}".format(i, id_to_word[int(each)]), end="   ")
                print()
            
# ----------------------------
# Dimensionality reduction and visualization
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

result = model(torch.tensor([i for i in range(word_size)]).long().to(device))
# Model outputs for every word; only needed for the alternative commented out below
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
embed_two = tsne.fit_transform(model.embedding.weight.cpu().detach().numpy())
# Reduce the word vectors to 2 dimensions to inspect their layout
# embed_two = tsne.fit_transform(result.cpu().detach().numpy())
labels = [id_to_word[i] for i in range(min(200, word_size))]
# Only look at the first 200 words
plt.figure(figsize=(15, 12))
for i, label in enumerate(labels):
    x, y = embed_two[i, :]
    plt.scatter(x, y)
    plt.annotate(label, (x, y), ha='center', va='top')
# plt.savefig('word_vectors_tsne.png')

References:
https://blog.csdn.net/weixin_40759186/article/details/87857361
https://my.oschina.net/earnp/blog/1113897
