1. Download the Wikipedia dump
https://dumps.wikimedia.org/zhwiki/latest/
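The file needed in step 2 is zhwiki-latest-pages-articles.xml.bz2 (several gigabytes). Downloading it in a browser is fine; if you prefer to script it, here is a minimal Python sketch (any download tool such as wget works just as well):

# Download the latest zhwiki articles dump (several GB, this takes a while).
import urllib.request

url = 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
urllib.request.urlretrieve(url, 'zhwiki-latest-pages-articles.xml.bz2')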
2. Preprocess the dump: convert the compressed file into a plain .txt file
Add a script file process.py with the following code:
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python process.py <input .xml.bz2 dump> <output .txt>")
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    space = ' '
    i = 0
    output = open(outp, 'w', encoding='utf-8')
    # WikiCorpus parses the compressed dump; each article becomes a list of tokens.
    # (lemmatize=False is for gensim < 4.0; drop the argument on newer versions.)
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # get_texts() yields lists of unicode tokens, so join them directly
        output.write(space.join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Run the script:
python process.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh1.text
3. The extracted text is in Traditional Chinese; convert it to Simplified Chinese with the opencc tool (download the build that matches your platform).
Step 1: copy the .txt file you want to convert into the opencc-1.0.1-win64 directory.
Step 2: in cmd, change into your opencc-1.0.1-win64 directory.
Step 3: run the command opencc -i wiki_texts.txt -o test.txt -c t2s.json
  Here wiki_texts.txt is the name of your original file and test.txt is the name of the Simplified Chinese output file.
Step 4: copy the test.txt file back into your Python working directory. (A pure-Python alternative to the opencc command line is sketched right after these steps.)
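If you would rather stay inside Python, the same Traditional-to-Simplified conversion can be done with the opencc Python package instead of the Windows binary. A minimal sketch, assuming pip install opencc-python-reimplemented; the input/output file names are just the ones used elsewhere in this tutorial:

# Traditional -> Simplified conversion in Python (alternative to the opencc CLI).
from opencc import OpenCC  # assumes: pip install opencc-python-reimplemented

cc = OpenCC('t2s')  # t2s profile: Traditional Chinese -> Simplified Chinese
with open('wiki.zh1.text', 'r', encoding='utf-8') as fin, \
        open('wiki.zh.jian.text', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))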
4. Open the file and check that it is now in Simplified Chinese (this can be slow for such a large file, so be patient). A quicker spot-check is sketched below.
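Rather than opening the multi-gigabyte file in an editor, you can print just its first few lines. A minimal sketch, assuming the converted file is named test.txt as in step 3:

# Print the first few lines of the converted file to confirm it is Simplified Chinese.
with open('test.txt', 'r', encoding='utf-8') as f:
    for _ in range(3):
        print(f.readline()[:200])  # only show the first 200 characters of each line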
5. Word-segment the Simplified Chinese text
Add a script file Testjieba.py with the following code:
import codecs
import jieba

# input: Simplified Chinese text from step 3, one article per line
# (rename your converted file to match, or change this name)
f = codecs.open('wiki.zh.jian.text', 'r', encoding="utf8")
# output: the same text with words separated by spaces
target = codecs.open("zh.jian.wiki.seg-1.3g.txt", 'w', encoding="utf8")
print('open files')

line_num = 1
line = f.readline()
while line:
    print('---- processing', line_num, 'article ----------------')
    line_seg = " ".join(jieba.cut(line))  # jieba.cut returns a generator of words
    target.writelines(line_seg)
    line_num = line_num + 1
    line = f.readline()

f.close()
target.close()
Run the script:
python Testjieba.py
6. Train the word2vec model
Add a script file word2vec_model.py with the following code:
import logging          # logging
import os.path          # path handling
import sys
import multiprocessing  # use all CPU cores for training

from gensim.models import Word2Vec               # the word2vec model
from gensim.models.word2vec import LineSentence  # iterates over a file with one space-separated sentence per line

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print("Usage: python word2vec_model.py <segmented .txt> <model output> <vector output>")
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # min_count=5 drops words that occur fewer than 5 times;
    # size=400 is the dimensionality of the word vectors (called vector_size in gensim >= 4.0)
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    model.save(outp1)                                    # full model, can be trained further
    model.wv.save_word2vec_format(outp2, binary=False)   # plain-text word vectors
Run the script (this is slow; be patient):
python word2vec_model.py zh.jian.wiki.seg.txt wiki.zh.text.model wiki.zh.text.vector
# zh.jian.wiki.seg.txt is the segmented file from step 5 (make sure this matches the name you actually produced, e.g. zh.jian.wiki.seg-1.3g.txt); wiki.zh.text.model is the name under which the model is saved; wiki.zh.text.vector is the name of the generated word-vector file
Note: a shallow neural network is trained here; the requirement on the resulting vectors is that words with similar meanings get similar word vectors.
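For example, once training finishes you can load the saved model and look at a single word vector. A minimal sketch; '苹果' is only an illustrative word and must have survived the min_count filter:

# Load the model saved in step 6 and inspect one word vector.
from gensim.models import Word2Vec

model = Word2Vec.load('wiki.zh.text.model')
vec = model.wv['苹果']   # raises KeyError if the word is not in the vocabulary
print(vec.shape)          # (400,) -- matches size=400 above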
7. A quick test
Add the following code to test.py and run it:
from gensim.models import Word2Vec

en_wiki_word2vec_model = Word2Vec.load('wiki.zh.text.model')

testwords = ['苹果', '数学', '学术', '白痴', '篮球']
for word in testwords:
    # query the trained vectors for the most similar words
    res = en_wiki_word2vec_model.wv.most_similar(word)
    print(word)
    print(res)
Note: the larger the corpus, the better the results.
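Besides most_similar, the trained vectors support a few other standard gensim KeyedVectors queries. A small sketch; the example words are only illustrative:

# A few more queries on the trained word vectors.
from gensim.models import Word2Vec

model = Word2Vec.load('wiki.zh.text.model')
print(model.wv.similarity('苹果', '篮球'))             # cosine similarity between two words
print(model.wv.doesnt_match(['数学', '学术', '篮球']))  # the word least like the others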