0. If the steps below are too much trouble, you can download the pre-built wiki.zh.text.model model directly:
https://download.csdn.net/download/luolinll1212/10640451
1. Download the Chinese Wikipedia dump from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 and install gensim:
pip install gensim
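The scripts in the following steps use gensim 3.x keyword names (size=, iter=, lemmatize=). In gensim 4.x, Word2Vec renamed size to vector_size and iter to epochs, and WikiCorpus dropped lemmatize. As a minimal sketch, the check below reports which gensim major version is installed and picks matching Word2Vec keywords. The helper names installed_major and word2vec_kwargs are hypothetical, not part of gensim.

```python
from importlib import metadata


def installed_major(package="gensim"):
    """Return the installed major version of a package, or None if unknown."""
    try:
        return int(metadata.version(package).split(".")[0])
    except (metadata.PackageNotFoundError, ValueError):
        return None


def word2vec_kwargs(major, dim=256, epochs=10):
    """Pick the Word2Vec keyword names that match the gensim major version."""
    if major is not None and major >= 4:
        # gensim 4.x names
        return {"vector_size": dim, "epochs": epochs}
    # gensim 3.x names, as used in the scripts of this post
    return {"size": dim, "iter": epochs}


if __name__ == "__main__":
    major = installed_major()
    print("gensim major version:", major)
    print("Word2Vec keywords:", word2vec_kwargs(major))
```

The returned dict can be unpacked into the constructor call, for example Word2Vec(sentences, window=10, min_count=5, **word2vec_kwargs(major)).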
2. Create process_wiki.py with the following code:
"""Convert a Wikipedia XML dump into plain text, one article per line.

Usage: python process_wiki.py <input.xml.bz2> <output.txt>
"""
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(__doc__)  # print the usage text from the module docstring
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    # The author ran this on Windows 7, where the default file encoding
    # is not UTF-8, so the encoding is set explicitly.
    output = open(outp, 'w', encoding="utf-8")
    # output = open(outp, 'w')
    # lemmatize=False is the gensim 3.x API; gensim 4.x removed this argument.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # each article becomes one line of space-separated tokens
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
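Run:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
The command line has 4 parts (python, the script name, and two file names), so sys.argv holds 3 entries, which is what the script checks for.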
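Once the conversion finishes, a quick sanity check on the output can catch an empty or mis-encoded file before training. The sketch below assumes the output holds one article per line with space-separated tokens, which is how process_wiki.py writes it. preview_corpus is a hypothetical helper name, not part of gensim.

```python
def preview_corpus(path, max_articles=3, max_tokens=10):
    """Return (article_count, previews) for a one-article-per-line text file."""
    count = 0
    previews = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue  # ignore blank lines
            count += 1
            if len(previews) < max_articles:
                # keep only the first few tokens of the first few articles
                previews.append(tokens[:max_tokens])
    return count, previews


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        total, samples = preview_corpus(sys.argv[1])
        print("articles:", total)
        for sample in samples:
            print(" ".join(sample))
```

Run it as python preview_corpus.py wiki.zh.text (the script file name is also an assumption). The file is streamed line by line, so memory use stays low even for the full dump.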
3. Create train_word2vec_model.py with the following code:
"""Train a word2vec model on a segmented text corpus.

Usage: python train_word2vec_model.py <input> <model_out> <vector_out>
"""
import logging
import multiprocessing
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print(__doc__)  # print the usage text from the module docstring
        sys.exit(1)
    input_dir, outp1, outp2 = sys.argv[1:4]

    # size= and iter= are the gensim 3.x names; gensim 4.x renamed them
    # to vector_size= and epochs=.
    model = Word2Vec(PathLineSentences(input_dir),
                     size=256, window=10, min_count=5,
                     workers=multiprocessing.cpu_count(), iter=10)
    # save the full model, then the plain-text word vectors
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
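Run:
python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector
The command line has 5 parts (python, the script name, and three file names), so sys.argv holds 4 entries, which is what the script checks for. The input file name suggests that wiki.zh.text was first converted to Simplified Chinese and word-segmented; that step is not shown here.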
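The wiki.zh.text.vector file written with binary=False is plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word with the word and its values. As a rough sketch of how to inspect it without gensim, the code below reads that format with the standard library and compares two words by cosine similarity. load_text_vectors and cosine are hypothetical helper names, and the limit parameter is an assumption added to keep memory use small.

```python
import math


def load_text_vectors(path, limit=None):
    """Load a word2vec text file into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocabulary size> <vector dimension>"
        _, dim = (int(x) for x in f.readline().split())
        for line in f:
            # Split from the right so exactly `dim` numeric fields are taken.
            parts = line.rstrip("\n").rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(x) for x in parts[1:]]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For real use, gensim's own KeyedVectors.load_word2vec_format(path, binary=False) loads the same file and offers most_similar queries; the pure-Python reader above is only meant to show what the file contains.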
4. After completing the steps above, four files are generated.
5. Put these four files into the model folder and run.
Reposted from: https://blog.csdn.net/luolinll1212/article/details/82291622