Training word2vec with gensim to generate wiki.zh.text.model

0. If you find these steps tedious, you can download a pre-generated wiki.zh.text.model directly:

     https://download.csdn.net/download/luolinll1212/10640451

1. Download the Chinese Wikipedia dump from https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2 and install gensim:

      pip install gensim
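Before running the long extraction step, it can help to confirm that the download is a readable bz2 archive. The following is a minimal sketch using only the standard library; the helper name `peek_dump` is made up for illustration and is not part of gensim.

```python
import bz2
import os


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a bz2-compressed text file."""
    lines = []
    # "rt" mode decompresses on the fly, so the whole dump is never held in memory
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n_lines:
                break
    return lines


if __name__ == "__main__":
    dump = "zhwiki-latest-pages-articles.xml.bz2"
    # only peek if the dump has actually been downloaded to the working directory
    if os.path.exists(dump):
        for line in peek_dump(dump):
            print(line)
```

If the file is truncated or is not a bz2 archive, `bz2.open` raises an error while reading, which is cheaper to discover here than partway through step 2.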

2. Create process_wiki.py with the following code:


   
   
   
   
    """Convert a Wikipedia dump into plain text, one article per line.

    Usage: python %(program)s <dump.xml.bz2> <output_text_file>
    """
    import logging
    import os.path
    import sys

    from gensim.corpora import WikiCorpus

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 3:
            print(globals()['__doc__'] % locals())
            sys.exit(1)
        inp, outp = sys.argv[1:3]
        space = " "
        i = 0

        # I am on Windows 7, so the output file must be opened as utf-8 explicitly
        output = open(outp, 'w', encoding="utf-8")
        # output = open(outp, 'w')
        # lemmatize=False is a gensim 3.x argument; gensim 4.0 removed it
        wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
        for text in wiki.get_texts():
            # write each article as one line of space-separated tokens
            output.write(space.join(text) + "\n")
            i = i + 1
            if i % 10000 == 0:
                logger.info("Saved " + str(i) + " articles")

        output.close()
        logger.info("Finished Saved " + str(i) + " articles")

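Run: python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text. The command line has 4 tokens in total (python, the script name, the input file, and the output file), so len(sys.argv) is 3.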
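Once the extraction finishes, a quick look at the output confirms that wiki.zh.text holds one article per line as space-separated tokens. This is a small sketch, not part of the original scripts; `preview_corpus` is a hypothetical helper name.

```python
from itertools import islice


def preview_corpus(path, n_articles=3, n_tokens=10):
    """Return the first n_tokens tokens of each of the first n_articles lines."""
    previews = []
    with open(path, encoding="utf-8") as f:
        # islice stops after n_articles lines, so a multi-gigabyte file is not read in full
        for line in islice(f, n_articles):
            previews.append(line.split()[:n_tokens])
    return previews
```

Calling `preview_corpus("wiki.zh.text")` returns a short list of token lists that can be printed and inspected by eye.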

3. Create train_word2vec_model.py with the following code:


   
   
   
   
    """Train a word2vec model on a text corpus with one sentence or article per line.

    Usage: python %(program)s <input_dir_or_file> <output_model> <output_vectors>
    """
    import logging
    import multiprocessing
    import os.path
    import sys

    from gensim.models import Word2Vec
    from gensim.models.word2vec import PathLineSentences

    if __name__ == '__main__':
        program = os.path.basename(sys.argv[0])
        logger = logging.getLogger(program)
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logger.info("running %s" % ' '.join(sys.argv))

        # check and process input arguments
        if len(sys.argv) < 4:
            print(globals()['__doc__'] % locals())
            sys.exit(1)
        input_dir, outp1, outp2 = sys.argv[1:4]

        # size= and iter= are the gensim 3.x names;
        # gensim 4.0 renamed them to vector_size= and epochs=
        model = Word2Vec(PathLineSentences(input_dir),
                         size=256, window=10, min_count=5,
                         workers=multiprocessing.cpu_count(), iter=10)
        # save the full model
        model.save(outp1)
        # save only the word vectors in plain-text word2vec format
        model.wv.save_word2vec_format(outp2, binary=False)

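Run: python train_word2vec_model.py wiki.zh.text.jian.seg.utf-8 wiki.zh.text.model wiki.zh.text.vector. The command line has 5 tokens in total, so len(sys.argv) is 4. The input name wiki.zh.text.jian.seg.utf-8 suggests that the wiki.zh.text from step 2 was first converted to Simplified Chinese and word-segmented; those steps are not shown in this post.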
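The wiki.zh.text.vector file written by save_word2vec_format(..., binary=False) is plain text: a header line with the vocabulary size and vector dimensionality, then one line per word with the word followed by its values. The sketch below reads that format with the standard library only, to show what the file contains; `load_text_vectors` and `cosine` are illustrative names, not gensim functions.

```python
import math


def load_text_vectors(path, limit=None):
    """Read a word2vec text-format file into a dict of word -> list of floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # header line: "<vocabulary size> <vector dimensionality>"
        vocab_size, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:
                continue  # skip lines that do not match the declared dimensionality
            vectors[word] = [float(v) for v in values]
            if limit is not None and len(vectors) >= limit:
                break
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

For real use, gensim's own `KeyedVectors.load_word2vec_format(path, binary=False)` reads the same file and is much faster; the pure-Python version is only meant to make the file layout concrete.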

4. After completing the steps above, 4 files are generated:

[Image 1 of the original post]

5. Put these 4 files into the model folder and run:

[Image 2 of the original post]

Reposted from: https://blog.csdn.net/luolinll1212/article/details/82291622
