NLP from Scratch - News Text Classification
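As artificial intelligence keeps advancing, natural language processing is becoming increasingly important, and many people have started learning it. This post shows how to compute word similarity with word2vec.
word2vec maps the words in a text to numeric vectors that a computer can work with. Once the vectors have been trained on a corpus, they can be used to compute the similarity between words.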
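Similarity between two word vectors is usually measured with cosine similarity, which is also what gensim's similarity queries compute. The following is a minimal sketch in plain Python; the 3-dimensional vectors are made up for illustration and stand in for real word2vec vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from a trained model.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1.0 (same direction)
print(cosine_similarity(vec_a, vec_c))  # 0.0 (orthogonal)
```

A value near 1 means the two vectors point in almost the same direction, which word2vec treats as "the two words appear in similar contexts".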
Versions used:
jieba 0.42.1
gensim 4.1.2
#coding:utf-8
import jieba
from gensim.models import Word2Vec
import gensim.models.word2vec as w2v
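Download the novel 斗破苍穹 and segment it with jieba. The segmentation here is fairly rough and could be refined further. The segmented text is saved to a file for word2vec to read.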
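# Read the raw novel (GB18030-encoded) and segment it with jieba
with open('斗破苍穹.txt', encoding='gb18030') as f:
    document = f.read()
document_cut = jieba.cut(document)
# Join tokens with spaces so they can later be split on whitespace
result = ' '.join(document_cut)
print("type", type(result))
# Save the segmented text as UTF-8
with open('斗破苍穹_seg.txt', 'w', encoding="utf-8") as f2:
    f2.write(result)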
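As one possible refinement of the rough segmentation above, punctuation and whitespace tokens can be dropped before the text is saved. This is only a sketch of what "refining" might look like, not part of the original pipeline: `is_noise` and `clean_tokens` are hypothetical helpers, and the hard-coded token list stands in for the output of `jieba.cut`.

```python
import unicodedata

def is_noise(token):
    """True if the token is empty or made up only of whitespace, punctuation, or symbols."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in ("P", "S")
        for ch in token
    )

def clean_tokens(tokens):
    """Keep only tokens containing at least one non-punctuation, non-space character."""
    return [tok for tok in tokens if not is_noise(tok)]

# Stand-in for list(jieba.cut(line)); the tokens jieba actually produces may differ.
tokens = ["萧炎", ",", "斗帝", " ", "。", "药老", "!"]
print(" ".join(clean_tokens(tokens)))  # 萧炎 斗帝 药老
```

If this filter is used, it is probably best applied line by line, since filtering the whole document at once would also remove the newline tokens and collapse the novel into a single very long line.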
Read the segmented text, train a model with gensim's Word2Vec, and save the model.
model_file_name = '斗破苍穹.model'
# Train the model to produce word vectors
sentences = w2v.LineSentence('斗破苍穹_seg.txt')
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.save(model_file_name)
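`LineSentence` treats each line of the file as one sentence and splits it on whitespace, which is why the segmented text was saved with spaces between tokens. The simplified stand-in below shows the kind of input `Word2Vec` receives. `SimpleLineSentence` is a hypothetical class written for illustration; it ignores gensim's handling of very long lines, and the demo file content is made up.

```python
import os
import tempfile

class SimpleLineSentence:
    """Re-iterable stream of token lists, one per non-empty line of a text file."""

    def __init__(self, path, encoding="utf-8"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Re-open the file on every pass so the object can be iterated repeatedly.
        with open(self.path, encoding=self.encoding) as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Write a tiny made-up segmented file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_seg.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("萧炎 是 主角\n\n药老 是 老师\n")
    sentences = list(SimpleLineSentence(demo_path))

print(sentences)
```

The stand-in is a class with `__iter__` rather than a one-shot generator because training makes several passes over the corpus (one to build the vocabulary, then one per epoch), so the input has to be iterable more than once.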
Let's go straight to the results and look at the 15 words most similar to 萧炎 and 斗帝.
model = Word2Vec.load(model_file_name)
for vec in ['斗帝', '萧炎']:
    print('--%s-- similarity' % vec)
    print(model.wv.similar_by_word(vec, topn=15))
    print('\n')

--斗帝-- similarity
[('斗圣', 0.8141199350357056), ('半圣', 0.8010677695274353), ('斗灵', 0.7803141474723816), ('斗宗', 0.7673341035842896), ('斗师', 0.7628400325775146), ('斗皇', 0.7628328800201416), ('斗王', 0.7606483697891235), ('顶尖', 0.7596161365509033), ('别的', 0.7553380131721497), ('超级', 0.7388392686843872), ('斗尊', 0.7323582172393799), ('一流', 0.7282006740570068), ('王族', 0.7209020853042603), ('十大', 0.7201776504516602), ('前十', 0.716377317905426)]

--萧炎-- similarity
[('前者', 0.7002381682395935), ('他', 0.6438902616500854), ('韩枫', 0.6365240216255188), ('后者', 0.6304334998130798), ('药老', 0.6298486590385437), ('紫研', 0.5926578044891357), ('萧厉', 0.5817351937294006), ('云山', 0.5741239190101624), ('柳擎', 0.5692581534385681), ('苏千', 0.543209433555603), ('唐震', 0.5430344343185425), ('凤清儿', 0.5383116006851196), ('小医仙', 0.5327429175376892), ('韩闲', 0.5258689522743225), ('古元', 0.5253770351409912)]