Reference: "A complete guide to training word2vec on the Chinese Wikipedia corpus under Windows!" (URL at the end of this post)
Step 1: Download a Chinese Weibo corpus
Original data: https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/simplifyweibo_4_moods/intro.ipynb
(This dataset is quite noisy and gave poor results, so I later crawled some data myself for training.)
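If you do start from that dataset, here is a minimal sketch for dumping its text into weibo.txt for the cleaning step below. The filename simplifyweibo_4_moods.csv and the review column are assumptions based on the repo's intro notebook; adjust to whatever you actually downloaded.

import pandas as pd
# assumed filename / column name; check against the repo's intro notebook
df = pd.read_csv('simplifyweibo_4_moods.csv')
with open('weibo.txt', 'w', encoding='utf-8') as f:
    for text in df['review'].astype(str):
        f.write(text.replace('\n', ' ') + '\n')  # one weibo per line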
Step 2: Install dependencies
We need a few libraries: numpy, scipy, and gensim. gensim depends on scipy, and scipy in turn depends on numpy. Install them with pip from the Windows command line:
pip install numpy
pip install scipy
pip install gensim
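A quick sanity check that everything imports (a trivial sketch; your version numbers will differ):

import numpy, scipy, gensim
print(numpy.__version__, scipy.__version__, gensim.__version__)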
Step 3: Segment the Chinese text; segmentation uses jieba.
1. First, denoise the Weibo text: pre_data.py
import re

def filter(s):  # note: this shadows Python's built-in filter()
    # characters to strip; you can customize this set
    r1 = u'[a-zA-Z0-9’!"#$%&\'()*+,-./:;<=>?@,。?★、…【】《》?“”‘’![\\]^_`{|}~]+'
    # cleanr = re.compile('<.*?>')
    # sentence = re.sub(cleanr, ' ', s)  # strip HTML tags
    sentence = re.sub(r1, '', s)
    return sentence

# remove blank lines as well as the characters above
def clearBlankLine(f1, f2):
    file1 = open(f1, 'r', encoding='utf-8')  # input file, possibly with blank lines
    file2 = open(f2, 'w', encoding='utf-8')  # output file without blank lines
    try:
        for line in file1.readlines():
            line = filter(line)
            if line == '\n':  # line reduced to a bare newline: drop it
                line = line.strip("\n")
            print(line)
            file2.write(line)
    finally:
        file1.close()
        file2.close()

# s = '[href="http://app.weibo.com/t/feed/9ksdit"]今天真开心'
# print(filter(s))

f1 = 'weibo.txt'   # input text
f2 = 'result.txt'  # output text
clearBlankLine(f1, f2)
This produces the cleaned text: result.txt
2. Segmentation. The stopword list is one I compiled myself; you can also download whichever list you need online. I did not define a custom dictionary, but you can add your own if needed (a sketch of the dictionary format follows at the end of this step).
'''add a custom dictionary (optional)'''
import jieba
# jieba.enable_parallel(4)  # parallel segmentation; not supported on Windows
# jieba.load_userdict('抑郁情感词典.txt')  # load a custom dictionary

'''load the stopword list'''
stopwords = [line.rstrip() for line in open('./data/stopwords.txt', encoding='utf-8')]

'''segmentation'''
def seg_sentence(sentence):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr

inputs = open('result.txt', 'r', encoding='utf-8')
outputs = open('weibo_seg.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)
    print(line_seg)
    if line_seg == '\n':
        line_seg = line_seg.strip("\n")
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
3. The result is a segmented text file: weibo_seg.txt
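If you do define a custom dictionary (see the note at the start of this step), jieba expects one entry per line: the word, then an optional frequency and an optional part-of-speech tag, separated by spaces. A minimal sketch, where userdict.txt and its entries are made-up examples:

import jieba
# userdict.txt, one entry per line: word [frequency] [POS tag], e.g.
#   情绪低落 10 n
#   网抑云 5
jieba.load_userdict('userdict.txt')
print('/'.join(jieba.cut('最近情绪低落')))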
Step 4: Segmentation done; train the word vectors
1. Training code: model.py
2. Training produces the model file: model-2-5.model
3. Test the model
model.py
# # build the word2vec model
import logging
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.models import word2vec

if __name__ == '__main__':
    # set up logging
    program = "model.py"
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % program)

    infile = 'weibo_seg.txt'  # train on the segmented text from step 3
    sentences = LineSentence(infile)
    # note: in gensim >= 4.0 these parameters are named vector_size and epochs
    model = Word2Vec(sentences, size=200, window=5, min_count=1, iter=5)
    model.save(u'model-2-5.model')

    # test the word-vector model
    model = word2vec.Word2Vec.load("model-2-5.model")
    result = model.similar_by_word('抑郁')  # '抑郁' = "depression"
    print(result)
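A few more ways to poke at the trained model, as a sketch. These are standard gensim KeyedVectors calls; whether a particular word is in the vocabulary depends on your own corpus ('难过' here is just an assumed example):

from gensim.models import word2vec
model = word2vec.Word2Vec.load("model-2-5.model")
print(model.wv.most_similar('抑郁', topn=5))  # five nearest neighbours
print(model.wv.similarity('抑郁', '难过'))    # cosine similarity between two words
print(model.wv['抑郁'].shape)                 # the raw 200-dimensional vector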
By the way, does anyone know how to embed code properly here? I just pasted it in directly; I hope it isn't too painful to read.
References:
(1) https://www.jianshu.com/p/98d84854f7a3
(2) https://www.jianshu.com/p/af02db32fac2 (author: 石晓文的学习日记; copyright belongs to the author, so please get the author's permission and credit the source before reposting)
(3) https://www.jianshu.com/p/83b742994946
A more detailed write-up, with some result screenshots for reference: https://www.cnblogs.com/gaofighting/p/9105614.html