Python Chinese Text Classification Code Example

Dataset

The dataset is THUCNews, built by Tsinghua University by filtering historical data from the Sina News RSS feeds between 2005 and 2011. I consolidated it into a single txt file — dataSet.txt.
Link: dataset download address

Extraction code: rvs9

Segmenting Sentences

The stop word list stopwords.txt, along with the complete code, can be found on my GitHub — complete code.

import jieba


def seg_sentence(sentence, stopwords_path):
    """
    Segment a sentence with jieba and drop stop words.
    """

    def stopwordslist(filepath):
        """
        Build the stop word set (closure).
        """
        with open(filepath, 'r', encoding='utf-8') as f:
            return {line.strip() for line in f}

    sentence_seged = jieba.cut(sentence.strip())
    stopwords = stopwordslist(stopwords_path)  # load the stop word list from this path
    outstr = ''  # the return value is a string
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word + " "
    return outstr
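
A quick sanity check (a minimal sketch; the example sentence is my own and assumes stopwords.txt sits in the working directory):

# Hypothetical usage: segment one sentence and print the space-joined tokens
print(seg_sentence("这是一个中文分词的简单例子", 'stopwords.txt'))
# With typical stop words removed, the output is roughly: 中文 分词 简单 例子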

Segmenting the Whole File

def tokenFile(file_path, write_path):
    """
    Segment every line of the corpus and store the result in write_path.
    :param file_path: input file, one "label\ttext" pair per line
    :param write_path: output file, same format with the text segmented
    :return: None
    """
    with open(write_path, 'w', encoding='utf-8') as w:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                # note: seg_sentence reloads the stop word list on every call
                token_sen = seg_sentence(line.split('\t')[1], 'stopwords.txt')
                w.write(line.split('\t')[0] + "\t" + token_sen + "\n")
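
Wiring it up on the consolidated file (a minimal sketch; the output file name dataSet_token.txt is my own choice, not fixed by the original post):

# Hypothetical usage: dataSet.txt in, a segmented copy out
write_path = 'dataSet_token.txt'
tokenFile('dataSet.txt', write_path)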

Building the Dataset from the Segmented File

def constructDataset(path):
    """
    path: file path
    rtype: label_list and corpus_list
    """
    label_list = []
    corpus_list = []
    with open(path, 'r', encoding='utf-8') as p:
        for line in p:
            line = line.strip()  # drop the trailing newline
            label_list.append(line.split('\t')[0])
            corpus_list.append(line.split('\t')[1])
    return label_list, corpus_list

Splitting into Training and Test Sets

from sklearn.model_selection import train_test_split

label, data = constructDataset(write_path)
x_train, x_test, y_train, y_test = train_test_split(data, label, test_size=0.25,
                                                    random_state=42)
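
Optionally, a quick check that the split keeps the news categories reasonably balanced (a sketch using only the standard library; not part of the original pipeline):

from collections import Counter

# Count how many examples of each category land in each split
print(Counter(y_train))
print(Counter(y_test))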

Generating Feature Vectors with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(analyzer='word', max_features=5000)
tfidf_vect.fit(x_train)  # learn the vocabulary and IDF weights on the training set only
xtrain_tfidf = tfidf_vect.transform(x_train)
xtest_tfidf = tfidf_vect.transform(x_test)
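
To see what the vectorizer produced (the attributes are standard scikit-learn; the printed numbers depend on your data):

# Each row is one document, each column one of the (at most) 5000 vocabulary terms
print(xtrain_tfidf.shape)
print(len(tfidf_vect.vocabulary_))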

Naive Bayes Classification

from sklearn.naive_bayes import MultinomialNB

mnb_count = MultinomialNB()
mnb_count.fit(xtrain_tfidf, y_train)
print(mnb_count.score(xtest_tfidf, y_test))  # accuracy on the test set
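
score only reports overall accuracy; for per-category precision and recall you can add the following (a sketch using scikit-learn's standard metrics, not part of the original post):

from sklearn.metrics import classification_report

y_pred = mnb_count.predict(xtest_tfidf)
print(classification_report(y_test, y_pred))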
