Processing Chinese and English Wikipedia Corpora

Wikipedia provides official download links for its dumps: https://dumps.wikimedia.org/zhwiki/latest/ (Chinese); the English dumps live at https://dumps.wikimedia.org/enwiki/latest/.

Chinese wiki dump processed in this article: zhwiki-latest-pages-articles.xml.bz2

English wiki dump processed in this article: enwiki-latest-pages-articles.xml.bz2


1. Data extraction: convert *.xml.bz2 to editable txt

# process_wiki.py
# -*- coding: utf-8 -*-
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    inp = "enwiki-latest-pages-articles.xml.bz2"
    i = 0
    output_file = "wiki_english_%07d.txt" % i

    output = open(output_file, 'w', encoding="utf-8")
    # Parse the compressed dump; gensim strips the wiki markup.
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # get_texts() yields each article as a list of tokens;
        # join with spaces so English words stay separated.
        output.write(" ".join(text) + "\n")
        i = i + 1
        if i % 10000 == 0:
            # Roll over to a new output file every 10,000 articles.
            output.close()
            output_file = "wiki_english_%07d.txt" % i
            output = open(output_file, 'w', encoding="utf-8")
            print("Saved " + str(i) + " articles")
    output.close()
    print("Finished, saved " + str(i) + " articles")



2. Traditional-to-Simplified Chinese conversion

Use the OpenCC tool: https://code.google.com/archive/p/opencc/downloads

-i: input file
-o: output file
-c: configuration file; zht2zhs.ini converts Traditional Chinese to Simplified Chinese
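For example, assuming the text extracted in step 1 is in wiki_zh.txt (filenames here are illustrative), a typical invocation looks like:

opencc -i wiki_zh.txt -o wiki_zh_simplified.txt -c zht2zhs.ini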


3. Character encoding conversion

iconv -c -t UTF-8 < input_file > output_file
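Here -c silently discards characters that cannot be converted and -t specifies the target encoding. Continuing with the illustrative filenames above:

iconv -c -t UTF-8 < wiki_zh_simplified.txt > wiki_zh_utf8.txt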


4. Word segmentation

https://github.com/fxsjy/jieba

pip install jieba
python -m jieba input_file > cut_file
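Besides the command-line entry point, jieba can be called from Python. A minimal sketch (filenames are illustrative) that segments the corpus line by line:

# cut_corpus.py
import jieba

with open("wiki_zh_utf8.txt", encoding="utf-8") as fin, \
     open("wiki_zh_cut.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        # jieba.cut returns a generator of word tokens
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")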

Alternatively, use FoolNLTK:

https://github.com/rockyzhengwu/FoolNLTK

pip install foolnltk
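Per the FoolNLTK README, its basic segmentation interface is fool.cut; a minimal sketch:

import fool

text = "一个傻子在北京"
print(fool.cut(text))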

Or jieba_fast:

https://github.com/deepcs233/jieba_fast

pip install jieba_fast
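jieba_fast reimplements jieba's hot paths in C while keeping the same API, so per its README it can usually be swapped in as a drop-in replacement:

import jieba_fast as jieba

print("/".join(jieba.cut("今天天气不错")))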



