Short Text Classification: A Beginner's Painful Journey from 0 to 0.3 (Part 1)

Heads-up: this post takes about 5 minutes to read.

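First, the requirement:
    The boss said we need to sort our competitors' data into the finest-grained categories under our own rules, so that we can analyze and benchmark against them further.
    Each product record has more than 10 fields. The useful ones are the competitor's [category levels], [product name], [brand name], [shop name], and [company name]. After brainstorming with Leilei and Qiang, we decided to extract features from [product name] only. As for the competitor's own category levels, I simply don't care about them.
Initial approach:
    The first idea was to classify the way a human would interpret the text.
For example:
    Product name: [小米最新款9086型平衡车,冲满一次电可以跑一年] (Xiaomi's latest model 9086 self-balancing scooter, runs for a year on one full charge)
    A human would first segment the product name into words, then identify the subject, and finally decide that the item belongs to the [riding] category.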
Algorithms involved:
    (figure, image not available)
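Workflow:
    Segment the training data into words
    Apply stop-word filtering / regex filtering / a jieba custom dictionary to assist segmentation
    Build a word bank / word vectors for each category
    Segment the test data
    Score the segmented test data against the word banks using the algorithm
    Take the best-scoring category as the product's category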
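
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the post's actual code: `preprocess` and `make_toy_segmenter` are hypothetical names, and the toy forward-maximum-matching segmenter only stands in for `jieba.cut` so that the example runs without third-party packages.

```python
import re

# Runs of Chinese characters; everything else (digits, Latin, punctuation) is dropped.
CHINESE_RUN = re.compile("[\u4e00-\u9fa5]+")


def preprocess(name, stopwords, segment):
    """Keep Chinese characters only, segment, then drop stop words."""
    clean = "".join(CHINESE_RUN.findall(str(name)))
    return [w for w in segment(clean) if w.strip() and w not in stopwords]


def make_toy_segmenter(vocab, max_len=4):
    """Hypothetical stand-in for jieba.cut: greedy longest match against vocab."""
    def segment(text):
        i = 0
        while i < len(text):
            # Try the longest candidate first; fall back to a single character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    yield piece
                    i += size
                    break
    return segment


segment = make_toy_segmenter({"小米", "最新款", "平衡车"})
tokens = preprocess("小米最新款9086型平衡车", {"最新款", "型"}, segment)
print(tokens)
```

In real use one would presumably pass `jieba.cut` as the `segment` argument; the rest of the function does not depend on which segmenter is used.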
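Core of the algorithm:
    popularity (how frequent the item's words are within a category) + similarity (the share of the item's words found in that category) decide the category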
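
As a rough sketch of this scoring idea, the snippet below multiplies a popularity term by a similarity term and picks the best category. The names `score` and `classify`, the `use_share` flag, and the toy word counts are all assumptions made for illustration. With `use_share=False` popularity is a sum of raw counts; with `use_share=True` it is a sum of each word's share of the category total, which is the variant the post ends up using.

```python
def score(tokens, word_counts, use_share=True):
    """Popularity times similarity for one category."""
    if not tokens:
        return 0.0
    total = sum(word_counts.values())
    popularity = 0.0
    matched = 0
    for tok in tokens:
        count = word_counts.get(tok, 0)
        if count:
            matched += 1
            popularity += count / total if use_share else count
    similarity = matched / len(tokens)  # fraction of the item's words known to the category
    return popularity * similarity


def classify(tokens, categories, use_share=True):
    """Return the label whose word counts give the highest score."""
    return max(categories, key=lambda label: score(tokens, categories[label], use_share))


# Toy word counts (made up): a huge category and a small but relevant one.
categories = {
    "clothing": {"小米": 300, "T恤": 99700},
    "riding": {"平衡车": 50, "小米": 10, "头盔": 940},
}
tokens = ["小米", "平衡车"]
by_share = classify(tokens, categories)
by_count = classify(tokens, categories, use_share=False)
print(by_share, by_count)
```

On this toy data the raw-count variant is pulled toward the larger category, while the share-based variant is not, which matches the skew problem the post describes.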
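Training set:
    Luckily the boss had painstakingly prepared several hundred thousand already-labeled records, so I didn't have to label anything by hand!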
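Reflections and improvements:
    When building the word banks from the training data, the categories differ greatly in size: some have several hundred thousand records, others only a few thousand. The initial algorithm was therefore prone to data skew, with predictions drifting toward the categories that had more training data. Before improving the algorithm, I rather foolishly edited the word frequencies in the word banks by hand, forcibly interfering with the whole training process.
    The initial algorithm also performed a lot of I/O: the program re-read every training file for every record, which wasted so much time that classification was too slow for real use.
    Improvement: for the popularity term, I used to sum the raw frequencies of the item's words in the current category; now I sum each word's share of that category's total word count, which avoids the skew.
    Efficiency: previously every record triggered its own round of I/O and segmentation before its word vector was recorded. The improved version segments all the records to classify in one pass and holds the resulting bag of words in memory; a second pass over the records then looks each word up in that bag and computes the score. This largely eliminates the repeated I/O (thanks to Peng for the sharp optimization). If the old version performed N*M read operations (N records, M category files), the optimized one performs 1*M = M.
    In the final test, classifying 1,000 records took more than 20 minutes before the improvement and about 2 minutes after.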
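
The I/O change can be illustrated with a small sketch: read every category file once, keep the counts in memory, and score all records against that cache. The function name `load_category_counts`, the one-dict-literal-per-file layout, and the toy data are assumptions; `ast.literal_eval` is used here as a safer substitute for `eval`, which is not what the post's code does.

```python
import ast
import os
import tempfile


def load_category_counts(train_dir):
    """Read every category file exactly once and keep the counts in memory."""
    counts = {}
    for file_name in sorted(os.listdir(train_dir)):
        label = os.path.splitext(file_name)[0]
        with open(os.path.join(train_dir, file_name), encoding="utf-8") as f:
            # literal_eval parses a dict literal without executing arbitrary code
            counts[label] = ast.literal_eval(f.readline())
    return counts


# Toy data: two category files in a temporary directory (assumed layout).
toy = {"riding": {"平衡车": 50, "头盔": 20}, "phones": {"手机": 80}}
with tempfile.TemporaryDirectory() as tmp:
    for label, word_counts in toy.items():
        with open(os.path.join(tmp, label + ".txt"), "w", encoding="utf-8") as f:
            f.write(str(word_counts))
    cached = load_category_counts(tmp)  # M file reads, done once

# Any number of records can now be checked against `cached` with no further file I/O.
records = [["平衡车"], ["手机"], ["头盔", "平衡车"]]
hits = [
    [label for label, wc in cached.items() if any(w in wc for w in rec)]
    for rec in records
]
print(hits)
```

The point of the design is that the number of file reads depends only on the number of categories, not on the number of records to classify.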
Code:

First, the code before the improvement:

"""
A rant about this code: mindless for loops; if I hadn't written it myself, nobody could read it.
Every record re-reads the data, re-segments, and re-reads the training set, so most of the time goes to I/O.
A quick glance is enough; there is nothing worth studying here. The performance is poor because I didn't think it through.
"""
import os
import re

import jieba
import pandas as pd


def Predicted_Data(test_Data_path, stop_words_path, train_path, write_path, location_a, location_b):
    with open(stop_words_path, 'r', encoding='gbk') as f:
        stopwords = {line.strip() for line in f}
    pattern = re.compile(u"[\u4e00-\u9fa5]+")  # keep Chinese characters only
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data['spu名称']
    predicted_list = []
    for position in range(location_a, location_b):
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = ''.join(clean_datas)
        results = jieba.cut(clean_data)
        words_list = []
        for word in results:
            if word not in stopwords:
                words_list.append(word)
        # Predict
        predicted_dict = {}  # score per category
        catelist = os.listdir(train_path)
        for new_name in catelist:  # every record re-reads every category file (the N*M problem)
            new_path = os.path.join(train_path, new_name)
            with open(new_path, 'r') as f:
                new_contents = f.readlines()
                # eval executes the file content, so only use it on word-count files you generated yourself
                train_Datas = eval(new_contents[0])
            # splitext removes the suffix; str.strip('.txt') would strip characters instead
            label = os.path.splitext(new_name)[0]
            words_list_length = len(words_list)
            value_count = 0
            label_count = 0
            total_dict = sum(train_Datas.values())
            for key in words_list:  # similarity, popularity, share within the current category
                if key in train_Datas:
                    label_count += 1
                    value_count += train_Datas[key] / total_dict
            if words_list_length == 0:  # nothing left after filtering: avoid dividing by zero
                predicted_score = 0
            else:
                # combine word frequency and similarity into the category score
                predicted_score = value_count * (label_count / words_list_length)
            predicted_dict[label] = predicted_score
        predicted_dict = sorted(predicted_dict.items(), key=lambda item: item[1], reverse=True)  # descending by score
        print('Writing data, predicted category: %s' % predicted_dict[0][0])
        predicted_list.append(predicted_dict[0][0])
    # Write all predictions to a CSV file
    predicted_dataframe = pd.DataFrame(predicted_list)
    predicted_dataframe.to_csv(write_path, index=False, header=False, encoding='utf_8_sig')

  Below is the code after optimizing the I/O and the algorithm; both efficiency and accuracy improved.

import re
import os
import time
import jieba
import numpy as np
import pandas as pd
from collections import Counter

"""
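Custom dictionary
Assists jieba segmentation: jieba loads the custom dictionary first and uses the words found in it when segmenting the input
"""
jieba.load_userdict("custom dictionary path")  # placeholder: path to your own dictionary file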

# Build the training-set word bank
def classification(file_name, stop_words):
    with open(stop_words, "r", encoding="utf-8") as f:
        stopwords = {line.strip() for line in f}
    # Regex filter: keep Chinese characters only
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    dirty_data = pd.read_excel(file_name)
    print("Writing...")
    for i in range(len(dirty_data["label"])):
        # Remove "/" from the label so it cannot be mistaken for a path separator
        label = str(dirty_data["label"][i]).replace("/", "")
        # "training set path" is a placeholder for the word-bank directory
        train_file_name = os.path.join(r"training set path", label + ".txt")
        print(train_file_name)
        # Build the word bank: one file per category, and each product's
        # segmented words are appended to that file as one line.
        # Mode "a" creates the file when it is missing, so no existence check is needed.
        datas = dirty_data["商品名称"][i]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = "".join(clean_datas)
        words = [
            word
            for word in jieba.cut(clean_data)
            if word not in stopwords and word != "\t"
        ]
        with open(train_file_name, "a", encoding="utf-8") as file:
            file.write(" ".join(words) + "\n")


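# Convert the training-set word bank into word vectors
def result_data(total_path):
    total_file = os.listdir(total_path)
    for file in total_file:
        file_name_1 = os.path.join(total_path, file)
        with open(file_name_1, "r", encoding="utf-8") as f:
            contents = f.readlines()
        new_list = []
        for i in contents:
            for x in i.split():
                new_list.append(x)
        # Turn the word bank into a word vector.
        # Counter is a dict subclass: it counts the items of the list and
        # yields key:value pairs where the key is a word and the value its frequency.
        word_counts = Counter(new_list)
        # Store a plain dict literal; str(Counter) would wrap it in "Counter(...)"
        dicts = str(dict(word_counts))
        # "word vector output path" is a placeholder for the output directory
        file_name_2 = os.path.join("word vector output path", file)
        with open(file_name_2, "w", encoding="utf-8") as f:
            f.write(dicts)
    print("Training set built")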

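# Custom dictionary to assist jieba segmentation and improve its accuracy
jieba.load_userdict(r"custom dictionary path")

# The strings below are placeholders; replace them with real paths.
# Test set path
test_Data_path = "test data path"
# Stop-word list
stop_words_path = "stop-word list path"
# Training-set word bank path (the per-category word-count files written by result_data)
train_path = "training word bank path"
# Output path
write_path = "output path"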

# Build word vectors for the test set
def ALL_Data_Words(test_Data_path, stop_words_path, train_path):
    print("Start")
    start = time.time()
    with open(stop_words_path, "r", encoding="UTF-8") as f:
        stopwords = {line.strip() for line in f}
    # Regex: keep Chinese only.
    # In my view, non-Chinese characters carry no meaning for this task.
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data["商品名称"]
    # Read the full test set and segment it to build the bag of words
    word_list = set()
    for position in range(len(result_new_Data)):
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = "".join(clean_datas)
        results = jieba.cut(clean_data)
        for word in results:
            if word not in stopwords:
                word_list.add(word)
    # Read each category's word counts once and look up every test-set word in them
    new_words = []
    catelist = os.listdir(train_path)
    for new_name in catelist:
        new_path = os.path.join(train_path, new_name)
        with open(new_path, "r", encoding="UTF-8") as f:
            new_contents = f.readlines()
            # eval executes the file content, so only use it on files written by result_data
            train_Datas = eval(new_contents[0])
        total_num = sum(train_Datas.values())
        # splitext removes the suffix; str.strip(".txt") would strip characters instead
        label = os.path.splitext(new_name)[0]
        all_dict_words = {"label": label, "total_num": total_num}
        for data_word in word_list:
            all_dict_words[data_word] = train_Datas.get(data_word, 0)
        new_words.append(all_dict_words)
    end = time.time()
    print("Building the word bank took: " + str(round(end - start, 3)) + "s")
    return new_words

# Text classification
def Predicted_Data(new_words, test_Data_path, stop_words_path):
    print("Loading the full dictionary")
    start = time.time()
    with open(stop_words_path, "r", encoding="UTF-8") as f:
        stopwords = {line.strip() for line in f}
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data["商品名称"]
    print("Start predicting")
    write_list = []
    # Segment the test set again and build the word list of each record
    for position in range(len(result_new_Data)):
        words_list = []
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = "".join(clean_datas)
        results = jieba.cut(clean_data)
        for word in results:
            if word not in stopwords:
                words_list.append(word)
        predicted_dict = {}  # score per category
        # Score every category with its word vector, sort in descending order,
        # and take the highest-scoring category as the product's category
        for words in new_words:
            words_list_length = len(words_list)
            value_count = 0
            label_count = 0
            total_dict = words["total_num"]
            for key in words_list:  # similarity, popularity, share within the current category
                if words[key] != 0:
                    label_count += 1
                    value_count += words[key] / total_dict  # this word's share of the current category
            if words_list_length == 0:
                predicted_score = 0
            else:
                # label_count / words_list_length is the similarity:
                # if a product name yields 10 words and only 8 of them are found
                # in the category's training words, the similarity is 0.8
                predicted_score = value_count * (
                    label_count / words_list_length
                )  # combine word frequency and similarity into the category score
            label = words["label"]
            predicted_dict[label] = predicted_score
        predicted_dict = sorted(
            predicted_dict.items(), key=lambda item: item[1], reverse=True
        )  # descending by score
        write_list.append(predicted_dict[0][0])

    end = time.time()
    print("Segmenting and predicting took: " + str(round(end - start, 3)) + "s")
    return write_list

# Write the results
def write_file(test_Data_path, write_path, write_data):
    print("Writing Excel")
    # Write everything to Excel in one go with pandas.
    # to_excel no longer accepts an `encoding` argument in pandas 2.x, so it is omitted.
    data = pd.read_excel(test_Data_path)
    data["label"] = write_data
    data.to_excel(write_path, index=False)


def main_fuction(test_Data_path, stop_words_path, train_path, write_path):
    # Build the word table
    new_words = ALL_Data_Words(test_Data_path, stop_words_path, train_path)
    # Compute the classification results
    write_data = Predicted_Data(new_words, test_Data_path, stop_words_path)
    # Write to Excel
    write_file(test_Data_path, write_path, write_data)

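    Looking back, when I first touched text classification in May I was completely at a loss. It was only thanks to my teammates' patient guidance that I stuck with it and wrote this prototype. Nearly three months on, our short text classification has moved to a more mature model whose efficiency and accuracy leave this crude prototype far behind. I wrote all this to give some ideas to anyone who wants to get into text classification: bear with my rough post first, and the professional articles by the experts will be much quicker to pick up afterwards.
On the general approach to short text classification:
    Segment the text (crucial; improving it raises model accuracy)
    Apply a stop-word list (filters out meaningless words such as "啊" (ah) and "好的" (OK) so they don't interfere)
    Build word vectors (a common algorithm is TF-IDF: term frequency * inverse document frequency)
    Introduction to TF-IDF
    Train the model on the training set
    Segment the test set -> map the test data into the training set's TF-IDF vector space so that the test and training features match (in text classification, as I understand it, the "features" are simply the segmented "words")
    Run the model on the test set to produce the output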
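
To make the "map the test set into the training set's TF-IDF space" step concrete, here is a minimal pure-Python sketch. It uses the textbook form tf * log(N / df); libraries often apply smoothed variants, so the numbers will generally differ from theirs. The function names `fit_idf` and `tfidf_vector` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Inverse document frequency of every term seen in the training documents."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """TF-IDF weights of one segmented document, restricted to training terms.

    Terms that never appeared in training have no IDF and are dropped,
    which keeps test vectors in the same feature space as the training set.
    """
    counts = Counter(doc)
    total = len(doc)
    return {term: (counts[term] / total) * idf[term] for term in counts if term in idf}


# Toy, already-segmented product names (made up).
train_docs = [["小米", "平衡车"], ["小米", "手机"], ["儿童", "平衡车"]]
idf = fit_idf(train_docs)
test_vec = tfidf_vector(["小米", "手机", "新品"], idf)
print(test_vec)
```

Here "新品" is absent from the training documents, so it contributes no feature, while "手机" gets a higher weight than "小米" because it appears in fewer training documents.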

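Models you might use for text classification:
    MultinomialNB naive Bayes (multinomial)
    BernoulliNB naive Bayes (Bernoulli)
    GaussianNB naive Bayes (Gaussian)
    DecisionTreeClassifier decision tree
    SVC SVM (support vector machine)
    MLPClassifier neural network
    KNN k-nearest neighbors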
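
As an illustration of what the first model in the list computes, below is a small pure-Python sketch of multinomial naive Bayes with Laplace smoothing on word counts. It is a teaching sketch under simplifying assumptions, not any library's implementation; `train_nb`, `predict_nb`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and word counts."""
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "class_docs": Counter(labels),
        "word_counts": word_counts,
        "vocab": vocab,
        "alpha": alpha,
        "n_docs": len(docs),
    }


def predict_nb(model, tokens):
    """Return the class with the highest log prior plus log likelihood."""
    best_label, best_logp = None, -math.inf
    vocab_size = len(model["vocab"])
    for label, n_class_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        logp = math.log(n_class_docs / model["n_docs"])  # class prior
        for tok in tokens:
            if tok in model["vocab"]:  # words unseen in training are ignored
                logp += math.log(
                    (counts[tok] + model["alpha"]) / (total + model["alpha"] * vocab_size)
                )
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


# Toy, already-segmented training data (made up).
docs = [["小米", "平衡车"], ["儿童", "平衡车"], ["小米", "手机"], ["华为", "手机"]]
labels = ["riding", "riding", "phones", "phones"]
model = train_nb(docs, labels)
pred_scooter = predict_nb(model, ["新款", "平衡车"])
pred_phone = predict_nb(model, ["手机"])
print(pred_scooter, pred_phone)
```

Unlike the hand-written popularity-times-similarity score, the smoothing term keeps a word that is missing from one category from zeroing out that category's score entirely.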

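When I first started on short text classification, the code in this post implemented [segmentation], [dictionary], and [word vectors] under the guidance of the experts on my team. After a stretch of improvising, I settled down and finally understood how segmentation, word vectors, and the bag of words work.

In a follow-up post I will continue this saga, overturning all the code and algorithms used here and most of the ideas as well. It will describe in detail how a rookie repeatedly swaps packages and tunes parameters, keeps changing the trainSet data to tune the model, and might eventually train an optimal model... just kidding, since I am still training models to this day...
    If you found this rambling somewhat interesting, feel free to add me as a contact and keep in touch.

                                            Life is short, I use Python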
