Sogou 800GB Data Preprocessing

Project location: node2:/home/disk1/xukaituo/expriments/ngram-2016-11/

Step 1. Convert the encoding

iconv -f gbk -t utf-8//IGNORE filename > new_format_file
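The same conversion can be sketched in Python (a hypothetical helper, not part of the original pipeline); `errors='ignore'` plays the role of iconv's `//IGNORE` suffix, silently dropping bytes that are not valid GBK:

```python
# -*- coding: utf-8 -*-
# Sketch of the iconv step: decode GBK, drop undecodable bytes
# (like //IGNORE), then re-encode as UTF-8.
def gbk_to_utf8(src_path, dst_path):
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for raw_line in src:
            text = raw_line.decode('gbk', errors='ignore')
            dst.write(text.encode('utf-8'))
```

For an 800GB corpus, reading line by line like this (or streaming through iconv) matters: loading whole files into memory is not an option.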

Step 2. Remove non-Chinese characters

#!/usr/bin/env python
# coding: utf-8

import codecs
import re
import sys

def remove_non_Chinese_word(input_file, output_file):
    """Strip every character outside the CJK range U+4E00..U+9FA5."""
    re_non_chinese = ur"[^\u4e00-\u9fa5]+"
    with codecs.open(input_file, 'r', 'utf-8') as inputf:
        with codecs.open(output_file, 'w', 'utf-8') as outputf:
            for line in inputf:
                # Delete every run of non-Chinese characters
                # (including the original line break).
                new_line = re.sub(re_non_chinese, u"", line)
                outputf.write(new_line + '\n')


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python 0-filter_non_chinese.py input-file output-file"
        sys.exit()
    remove_non_Chinese_word(sys.argv[1], sys.argv[2])

Step 3. Delete blank lines (lines with no Chinese characters become empty after Step 2)

sed -i '/^$/d' filename
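The effect of the sed command can be mirrored in Python (an illustrative helper, not part of the original pipeline); note that `/^$/` matches only completely empty lines, not whitespace-only ones:

```python
def drop_blank_lines(lines):
    # Mirror of `sed '/^$/d'`: keep only lines that have content
    # once the trailing newline is set aside.
    return [line for line in lines if line.rstrip('\n') != '']
```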

Step 4. Word segmentation

Use the LTP word segmentation tool:
[1] GitHub: https://github.com/HIT-SCIR/ltp
[2] Docs: http://ltp.readthedocs.io/zh_CN/latest/api.html#id2
[3] Model: https://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569
Partial bash script:

cd /home/disk1/xukaituo/expriments/ngram-2016-11/utils
CWSTOOL=/home/disk1/xukaituo/projects/Chinese-word-segmentation
1-Chinese-word-segmentor/cws ${CWSTOOL}/ltp_data/cws.model $2 $3
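When the corpus is split into many shards, the segmenter can be driven from a small wrapper. The sketch below only assembles the command line from the paths in the script above; `run_cws` and the idea of looping over shards are assumptions, not part of the original setup:

```python
import os
import subprocess

CWSTOOL = '/home/disk1/xukaituo/projects/Chinese-word-segmentation'

def cws_command(input_file, output_file):
    # Assemble the same invocation as the bash script above.
    model = os.path.join(CWSTOOL, 'ltp_data', 'cws.model')
    return ['1-Chinese-word-segmentor/cws', model, input_file, output_file]

def run_cws(input_file, output_file):
    # Hypothetical runner; assumes the cws binary has been built.
    subprocess.check_call(cws_command(input_file, output_file))
```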

Segmentation program that calls the LTP API:

// cws.cc

// Copyright 2016 ASLP(Author: Kaituo Xu)

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "segment_dll.h"

int main(int argc, char *argv[])
{
    try {
        if (argc < 4) {
            std::cerr << "cws [model path] [input file path] [output file path]" << std::endl;
            return 1;
        }

        void *engine = segmentor_create_segmentor(argv[1]);
        std::ifstream input(argv[2]);
        std::ofstream output(argv[3], std::ofstream::app);

        if (!engine || !input || !output) {
            return -1;
        }

        std::string line;
        while (getline(input, line)) {
            std::vector<std::string> words;
            int len = segmentor_segment(engine, line, words);
            for (int i = 0; i < len; ++i) {
                output << words[i] << " ";
            }
            output << std::endl;
        }

        segmentor_release_segmentor(engine);
        return 0;

    } catch(const std::exception &e) {
        std::cerr << e.what();
        return -1;
    }
}

Step 5. Compress data not currently in use to save disk space

# Compress a file with `gzip`
gzip filename
# Decompress
gzip -d filename.gz

After compression the original file is removed and `.gz` is appended to the name by default; after decompression, the `.gz` file is removed.
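The same round trip can be sketched with Python's standard gzip module (unlike the gzip CLI, this version does not delete the source file on its own):

```python
import gzip
import shutil

def gzip_file(path):
    # path -> path.gz (original is kept, unlike `gzip path`)
    with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)

def gunzip_file(gz_path):
    # path.gz -> path
    with gzip.open(gz_path, 'rb') as src, open(gz_path[:-3], 'wb') as dst:
        shutil.copyfileobj(src, dst)
```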
