Previous post: NLP [04] Implementing Word2Vec in TensorFlow (with annotated code)
Next post: NLP [06] RCNN: Principles and Text-Classification Practice (with annotated code)
Full code download: https://github.com/ttjjlw/NLP/tree/main/Word_vector%E8%AF%8D%E5%90%91%E9%87%8F/glove
For the theory behind GloVe word vectors, please refer to my third post.
pytorch
data        the training data
dataset.py  converts raw_data into the train_corpus format
huffman.py  builds the Huffman tree
glove.py    the GloVe model
tools.py    builds the co-occurrence matrix; also visualization and word-similarity utilities
main.py     program entry point; run main.py directly to obtain the GloVe word vectors
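To make the co-occurrence-matrix step concrete, here is a minimal standalone sketch of what `get_cooccurrence_matrix` produces. This is an illustrative helper, not the repo's actual implementation; like the original GloVe paper, it weights each pair by the inverse of the distance between the two words.

```python
from collections import defaultdict

def build_cooccurrence(corpus, window_size=2):
    """Count co-occurrences within a symmetric window.

    `corpus` is a list of token lists. Each (center, context) pair is
    credited 1/distance, so nearer words contribute more.
    """
    coo = defaultdict(float)
    for sentence in corpus:
        for i, center in enumerate(sentence):
            start = max(0, i - window_size)
            for j in range(start, i):
                context = sentence[j]
                weight = 1.0 / (i - j)
                # the matrix is symmetric: update both directions
                coo[(center, context)] += weight
                coo[(context, center)] += weight
    return dict(coo)

corpus = [["i", "like", "nlp"], ["i", "like", "glove"]]
matrix = build_cooccurrence(corpus, window_size=2)
print(matrix[("i", "like")])  # adjacent in both sentences -> 2.0
```

The repo's version additionally filters words below `min_count` and maps tokens to vocabulary ids, but the windowed counting logic is the same idea.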
def train(args):
    corpus_preprocessor = CorpusPreprocess(args.train_data_path, args.min_count)
    coo_matrix = corpus_preprocessor.get_cooccurrence_matrix(args.windows_size)
    vocab = corpus_preprocessor.get_vocab()
    glove = Glove(vocab, args)
    print(glove)
    if os.path.isfile(args.embed_path_pkl):
        glove.load_state_dict(torch.load(args.embed_path_pkl))
        print('Loaded model {}'.format(args.embed_path_pkl))
    if use_gpu:
        glove.cuda()
    optimizer = torch.optim.Adam(glove.parameters(), lr=args.learning_rate)
    train_data = TrainData(coo_matrix, args)
    data_loader = DataLoader(train_data,
                             batch_size=args.batch_size,
                             shuffle=True,
                             num_workers=2,
                             pin_memory=True)
    steps = 0
    for epoch in range(args.epoches):
        print(f"currently epoch is {epoch + 1}, all epoch is {args.epoches}")
        avg_epoch_loss = 0
        for i, batch_data in enumerate(data_loader):
            c = batch_data['c']          # center-word ids
            s = batch_data['s']          # context-word ids
            X_c_s = batch_data['X_c_s']  # co-occurrence counts
            W_c_s = batch_data["W_c_s"]  # GloVe weights f(X)
            if use_gpu:
                c = c.cuda()
                s = s.cuda()
                X_c_s = X_c_s.cuda()
                W_c_s = W_c_s.cuda()
            W_c_s_hat = glove(c, s)
            loss = loss_func(W_c_s_hat, X_c_s, W_c_s)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # use .item() so we accumulate a float, not the computation graph
            avg_epoch_loss += loss.item() / len(train_data)
            if steps % 1000 == 0:
                print(f"Steps {steps}, loss is {loss.item()}")
            steps += 1
        print(f"Epoches {epoch + 1}, complete!, avg loss {avg_epoch_loss}.\n")
    save_word_vector(args.embed_path_txt, corpus_preprocessor, glove)
    torch.save(glove.state_dict(), args.embed_path_pkl)
The core part of the code is excerpted above. It mainly consists of two steps:
1. Build the co-occurrence matrix:
corpus_preprocessor = CorpusPreprocess(args.train_data_path, args.min_count)
coo_matrix = corpus_preprocessor.get_cooccurrence_matrix(args.windows_size)
2. Feed the (center, context) pairs into the model for training:
W_c_s_hat = glove(c, s)
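The `loss_func` called in `train()` is defined elsewhere in the repo and not shown here. A plausible sketch, following the GloVe paper's weighted least-squares objective f(X) * (score - log X)^2 (the names `weight_func` and `loss_func` below are illustrative assumptions, not necessarily the repo's exact signatures):

```python
import torch

def weight_func(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X): grows with the count but is capped at 1,
    so very frequent pairs do not dominate the loss."""
    w = (x / x_max) ** alpha
    return torch.clamp(w, max=1.0)

def loss_func(w_hat, x, weights):
    """Weighted least-squares loss.

    w_hat   : model score w_c . w_s + b_c + b_s, one scalar per pair
    x       : raw co-occurrence counts X_c_s
    weights : precomputed f(X), i.e. W_c_s in the training loop
    """
    return torch.mean(weights * (w_hat - torch.log(x)) ** 2)
```

Because f(X) = 0 when X = 0, only pairs that actually co-occur contribute, which is why the training data can be built from the sparse co-occurrence matrix alone.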
Processing raw corpus file 1
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\USER\AppData\Local\Temp\jieba.cache
Loading model cost 1.110 seconds.
Prefix dict has been built successfully.
Glove(
(c_weight): Embedding(50216, 128)
(c_biase): Embedding(50216, 1)
(s_weight): Embedding(50216, 128)
(s_biase): Embedding(50216, 1)
)
currently epoch is 1, all epoch is 3
Steps 0, loss is 65.82528686523438
Steps 1000, loss is 30.690044403076172
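The printed model structure above (two embedding tables plus per-word biases, vocabulary 50216, dimension 128) suggests a module along these lines. This is a sketch inferred from the printout, not the repo's exact `glove.py`; the constructor arguments are assumptions:

```python
import torch
import torch.nn as nn

class Glove(nn.Module):
    """Center/context embeddings with per-word bias terms,
    matching the printed structure (c_weight, c_biase, s_weight, s_biase)."""

    def __init__(self, vocab_size=50216, embed_dim=128):
        super().__init__()
        self.c_weight = nn.Embedding(vocab_size, embed_dim)  # center vectors
        self.c_biase = nn.Embedding(vocab_size, 1)           # center biases
        self.s_weight = nn.Embedding(vocab_size, embed_dim)  # context vectors
        self.s_biase = nn.Embedding(vocab_size, 1)           # context biases

    def forward(self, c, s):
        # score = w_c . w_s + b_c + b_s, one scalar per (center, context) pair;
        # training pushes this score toward log X_c_s
        return (torch.sum(self.c_weight(c) * self.s_weight(s), dim=1)
                + self.c_biase(c).squeeze(1)
                + self.s_biase(s).squeeze(1))
```

After training, the final word vector is often taken as the sum (or average) of the center and context embeddings, as recommended in the GloVe paper.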