Author homepage: 文火冰糖(王文兵)'s blog (文火冰糖的硅基工坊) on CSDN
Article URL: https://blog.csdn.net/HiWangWenBing/article/details/121723021
Contents
Chapter 1: Overview of gensim
Chapter 2: gensim.models.word2vec parameters in detail
Chapter 3: Building a word vector model with gensim.models.word2vec
3.0 Prerequisites
3.1 Corpus
3.2 Create and train the model
3.3 Predicting similar words
Chapter 1: Overview of gensim
Gensim is an open-source, third-party Python toolkit for learning, without supervision, latent topic-vector representations from raw, unstructured text.
The Word2Vec model is the word-vector model of the Gensim library.
Chapter 2: gensim.models.word2vec parameters in detail
The constructor signature (gensim 4.x; older gensim 3.x releases named vector_size as size and epochs as iter):
class gensim.models.word2vec.Word2Vec(
sentences=None,
corpus_file=None,
vector_size=100,
alpha=0.025,
window=5,
min_count=5,
max_vocab_size=None,
sample=0.001,
seed=1,
workers=3,
min_alpha=0.0001,
sg=0,
hs=0,
negative=5,
ns_exponent=0.75,
cbow_mean=1,
hashfxn=<built-in function hash>,
epochs=5,
null_word=0,
trim_rule=None,
sorted_vocab=1,
batch_words=10000,
compute_loss=False,
callbacks=(),
max_final_vocab=None)
Parameter description (the most commonly used parameters):
sentences: an iterable of tokenized sentences, where each sentence is a list of word strings; used to build the vocabulary and train the model.
corpus_file: path to a corpus file in LineSentence format; pass either sentences or corpus_file, not both.
vector_size: dimensionality of the word vectors (default 100).
alpha / min_alpha: initial and final learning rate; the rate decays linearly over training.
window: maximum distance between the current word and the predicted word within a sentence.
min_count: ignore all words whose total frequency is lower than this value.
sg: training algorithm; 0 for CBOW (default), 1 for skip-gram.
hs: 1 uses hierarchical softmax; 0 combined with negative > 0 uses negative sampling.
negative: number of "noise words" drawn per positive example for negative sampling (commonly 5-20).
cbow_mean: in CBOW mode, 0 sums the context word vectors, 1 (default) averages them.
workers: number of worker threads used to train the model.
epochs: number of passes over the corpus during training.
seed: random seed for vector initialization.
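As a minimal sketch of how these parameters combine, the snippet below trains a skip-gram model with negative sampling (the toy corpus and parameter values are illustrative assumptions, not part of the original example):

from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens.
sentences = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "extends", "machine", "learning"],
]

# Skip-gram (sg=1) with negative sampling (hs=0, negative=5):
# 50-dimensional vectors, keeping every token (min_count=1).
model = Word2Vec(
    sentences,
    vector_size=50,
    window=3,
    min_count=1,
    sg=1,
    hs=0,
    negative=5,
    epochs=10,
    seed=42,
    workers=1,  # with a fixed seed, one thread makes runs more reproducible
)
print(model.wv.most_similar("learning", topn=2))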
Chapter 3: Building a word vector model with gensim.models.word2vec
3.0 Prerequisites
Install gensim (for example, pip install gensim). This example needs only one import:
from gensim.models import Word2Vec
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".lower().split()
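If attached punctuation is unwanted, a regex tokenizer is a common fix (a sketch using Python's standard re module; this preprocessing step is not part of the original post):

import re

text = "In effect, we conjure the spirits of the computer with our spells."
# [a-z]+ keeps only alphabetic runs, so "effect," becomes "effect".
tokens = re.findall(r"[a-z]+", text.lower())
print(tokens)  # ['in', 'effect', 'we', 'conjure', ...]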
3.2 Create and train the model
Wrapping raw_text in a list makes the whole passage a single "sentence"; the model builds its vocabulary and trains immediately on construction (min_count=0 keeps every token):
model = Word2Vec([raw_text], window=5, min_count=0, vector_size=100)
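A trained model can be persisted and reloaded later (a minimal sketch; the file name word2vec.model is an arbitrary choice):

# Save the full model, which allows further training after reloading.
model.save("word2vec.model")

loaded = Word2Vec.load("word2vec.model")
# Continue training on new sentences; tokens not already in the
# vocabulary are ignored unless build_vocab(update=True) is called first.
loaded.train([["conjure", "new", "spells"]], total_examples=1, epochs=1)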
3.3 Predicting similar words
(1) Get the word vector for a single word
model.wv.get_vector("we")
array([-8.2371626e-03, 9.3018124e-03, -2.0378912e-04, -1.9672457e-03, 4.6009533e-03, -4.1048718e-03, 2.7483397e-03, 6.9529405e-03, 6.0647726e-03, -7.5193373e-03, 9.3864016e-03, 4.6757250e-03, 3.9595906e-03, -6.2362696e-03, 8.4568849e-03, -2.1459276e-03, 8.8368189e-03, -5.3625666e-03, -8.1349388e-03, 6.8205344e-03, 1.6731464e-03, -2.1995250e-03, 9.5159588e-03, 9.4903978e-03, -9.7708460e-03, 2.5059620e-03, 6.1574611e-03, 3.8693496e-03, 2.0194747e-03, 4.3256412e-04, 6.8311812e-04, -3.8289619e-03, -7.1381810e-03, -2.1045576e-03, 3.9239591e-03, 8.8271257e-03, 9.2626950e-03, -5.9751221e-03, -9.4050728e-03, 9.7564282e-03, 3.4208333e-03, 5.1657772e-03, 6.2864725e-03, -2.8053685e-03, 7.3280791e-03, 2.8254921e-03, 2.8643315e-03, -2.3794267e-03, -3.1234692e-03, -2.3632357e-03, 4.2710570e-03, 8.2289553e-05, -9.5984712e-03, -9.6682198e-03, -6.1445762e-03, -1.2618728e-04, 1.9983812e-03, 9.4273640e-03, 5.5828230e-03, -4.2890343e-03, 2.7802799e-04, 4.9645198e-03, 7.7032396e-03, -1.1378536e-03, 4.3263095e-03, -5.8062747e-03, -8.0820709e-04, 8.1010396e-03, -2.3662101e-03, -9.6660787e-03, 5.7865614e-03, -3.9302218e-03, -1.2270809e-03, 9.9810772e-03, -2.2439670e-03, -4.7674584e-03, -5.3300112e-03, 6.9841221e-03, -5.7071578e-03, 2.1063576e-03, -5.2589145e-03, 6.1209816e-03, 4.3569636e-03, 2.6094934e-03, -1.4887219e-03, -2.7490708e-03, 8.9987572e-03, 5.2161841e-03, -2.1613305e-03, -9.4713038e-03, -7.4321763e-03, -1.0670737e-03, -7.8357977e-04, -2.5633539e-03, 9.6833659e-03, -4.6015202e-04, 5.8634020e-03, -7.4515464e-03, -2.5067476e-03, -5.5492264e-03], dtype=float32)
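get_vector returns a numpy array of length vector_size (100 here). Indexing the KeyedVectors object is equivalent, and gensim 4.x can also return a unit-length copy:

vec = model.wv["we"]             # same as model.wv.get_vector("we")
print(vec.shape)                 # (100,)
unit = model.wv.get_vector("we", norm=True)  # L2-normalized copy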
(2) Get the words most strongly related to a given word
model.wv.similar_by_word("processes", topn=5)
[('study', 0.1669149100780487), ('effect,', 0.16261565685272217), ('abstract', 0.1388479620218277), ('we', 0.1315048635005951), ('directed', 0.11603082716464996)]
Note:
Because the input text is so small, the similarity scores are low and the neighbors are not very meaningful. Also notice the token 'effect,' in the results: whitespace splitting left the punctuation attached to the word.
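KeyedVectors supports several other similarity queries beyond similar_by_word (all are gensim 4.x API calls; the query words are chosen from the toy corpus above):

# Cosine similarity between two in-vocabulary words.
print(model.wv.similarity("processes", "abstract"))

# most_similar accepts positive/negative word lists (the word-analogy form).
print(model.wv.most_similar(positive=["we"], topn=3))

# Pick the word that fits least well with the others.
print(model.wv.doesnt_match(["we", "study", "abstract", "processes"]))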