Heads-up: reading this article takes about 5 minutes.
The requirement first:
The boss said we need to classify competitor product data into our own finest-grained categories so that we can benchmark against it in further analysis.
Each product record has a dozen or so fields. The useful ones are the competitor's 【各级分类】 (category levels), 【商品名称】 (product name), 【品牌名称】 (brand name), 【店铺名称】 (shop name) and 【公司名称】 (company name). After brainstorming with Leilei and Qiang, we decided to extract features from 【商品名称】 alone and classify on that; the competitor's own category levels can go to hell, I simply don't care about them.
The initial approach:
The first idea was to classify the way a human would read the product name.
For example:
Product name: 【小米最新款9086型平衡车,冲满一次电可以跑一年】 ("Xiaomi's latest model 9086 self-balancing scooter, a full charge lasts a year")
A human would first segment the product name into words, identify the subject, and finally decide it belongs to the 【骑行分类】 (riding) category.
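To make the "segment first" step concrete, here is a minimal jieba sketch on that example name; the exact tokens depend on jieba's built-in dictionary and on any custom dictionary you load:
import re
import jieba

name = "小米最新款9086型平衡车,冲满一次电可以跑一年"
# keep Chinese characters only, the same filtering the code below uses
chinese_only = "".join(re.findall(u"[\u4e00-\u9fa5]+", name))
tokens = list(jieba.cut(chinese_only))
print(tokens)  # words such as 小米 and 平衡车 should show up, depending on the dictionary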
The algorithm involved:
Flow:
Segment the training data
Use a stop-word list / regex filtering / a custom jieba dictionary to assist the segmentation
Build a word bank / word vector for each category
Segment the test data
Score the segmented test data against the word banks
Take the best-scoring category as the product's category
Core of the algorithm:
Popularity + similarity decide the category.
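In miniature, and assuming a category's word bank is simply a word-to-frequency dict, the scoring used by the (improved, proportion-based) code further down looks like this; score and banks are just illustrative names:
def score(words, bank):
    """words: the segmented product name; bank: word -> frequency dict of one category."""
    total = sum(bank.values())
    matched = [w for w in words if w in bank]
    popularity = sum(bank[w] / total for w in matched)       # each matched word's share of the category
    similarity = len(matched) / len(words) if words else 0   # fraction of the name's words found in the bank
    return popularity * similarity

# the predicted category is the best-scoring one:
# best = max(banks, key=lambda label: score(words, banks[label]))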
Training data:
Luckily the boss had painstakingly prepared several hundred thousand already-labeled records, so no manual labeling was needed.
Thoughts on and improvements to the algorithm:
When building the word banks from the training data, the categories differ wildly in size: some have hundreds of thousands of records, others only a few thousand. The initial algorithm was therefore prone to data skew, pushing predictions toward the categories with more training data. Before fixing the algorithm I was, rather clumsily, hand-tweaking word frequencies in the word banks to force the training along.
The initial algorithm also involved a lot of I/O: the program traversed every training set over and over, wasting a huge amount of time, so classification was far too slow to be usable.
Improvement 1, the "popularity" part: instead of summing the raw frequencies of the words to be classified within the current category, the algorithm now sums each word's proportion of the category's total frequency, which avoids the skew; a small numeric illustration follows.
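With made-up numbers (both the word banks and the totals below are hypothetical), the difference looks like this:
# hypothetical word banks: the big category simply has far more data than the small one
big = {"平衡车": 500, "电动": 300, "手机": 50000}
small = {"平衡车": 5, "电动": 3}
total_big, total_small = 200000, 2000    # made-up total word counts per category
words = ["平衡车", "电动"]

# old "popularity": sum of raw frequencies, the big category wins just because it is big
print(sum(big.get(w, 0) for w in words))                   # 800
print(sum(small.get(w, 0) for w in words))                 # 8

# new "popularity": sum of proportions, category size no longer dominates
print(sum(big.get(w, 0) / total_big for w in words))       # 0.004
print(sum(small.get(w, 0) / total_small for w in words))   # 0.004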
Improvement 2, efficiency: previously every record meant one I/O pass, one round of segmentation, and one recorded word vector. The improved version first segments all the data to be classified in a single pass and loads it into memory as a bag of words; a second pass over the data then looks up word vectors in that bag and does the scoring, which removes almost all of the repeated I/O (thanks to Peng for the sharp optimization). If the old approach cost N×M I/O operations, the optimized one costs roughly 1×M = M.
In the final tests, the old version took over 20 minutes to classify 1,000 records; the improved one finishes the whole run in about 2 minutes.
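The shape of that optimization, sketched: read every category's word-frequency dict from disk exactly once, then loop over the records purely in memory. The real implementation is the ALL_Data_Words / Predicted_Data pair below; load_banks here is just an illustrative name, and score is the helper sketched earlier:
import os

def load_banks(train_path):
    """One disk read per category file, done once up front."""
    banks = {}
    for name in os.listdir(train_path):
        with open(os.path.join(train_path, name), "r", encoding="utf-8") as f:
            banks[os.path.splitext(name)[0]] = eval(f.readline())  # word -> frequency dict
    return banks

# banks = load_banks("训练集词库")
# for words in all_segmented_records:   # no further I/O inside this loop
#     best = max(banks, key=lambda label: score(words, banks[label]))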
Code:
First, the code from before the improvements:
"""
A quick roast of this code: mindless for-loops; if I had not written it myself, nobody could follow it.
Every record re-reads the data, segments it, re-reads the training sets, and scores, so most of the time is wasted on I/O. Just skim it, there is nothing worth studying; the performance is terrible and it was written without much thought.
"""
def Predicted_Data(test_Data_path, stop_words_path, train_path, write_path, location_a, location_b):
    stopwords = [line.strip() for line in open(stop_words_path, 'r', encoding='gbk').readlines()]
    pattern = re.compile(u"[\u4e00-\u9fa5]+")  # keep Chinese characters only
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data['spu名称']
    predicted_list = []
    for position in range(location_a, location_b):
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = ''.join(clean_datas)
        results = jieba.cut(clean_data)
        words_list = []
        for word in results:
            if word not in stopwords:
                words_list.append(word)
        """Predict this record"""
        predicted_dict = {}  # prediction-score dictionary
        catelist = os.listdir(train_path)
        for new_name in catelist:
            new_path = train_path + '\\' + new_name
            with open(new_path, 'r') as f:
                new_contents = f.readlines()
            train_Datas = eval(new_contents[0])
            label = new_name.strip('.txt')
            words_list_length = len(words_list)
            value_count = 0
            label_count = 0
            total_dict = sum(train_Datas.values())
            for key in words_list:  # the algorithm: similarity, popularity, share of the current category
                if key not in train_Datas.keys():
                    value = 0
                    value_count = value + value_count
                else:
                    label_count += 1
                    value = train_Datas[key]
                    value_count = value_count + value / total_dict
            predicted_score = value_count * (label_count / words_list_length)  # classify by word frequency and similarity
            predicted_dict[label] = predicted_score
        predicted_dict = sorted(predicted_dict.items(), key=lambda item: item[1], reverse=True)  # sort scores descending
        # new_dict = {}
        # new_dict['label'] = predicted_dict[0][0]
        # new_dict['position'] = position
        print('Writing, predicted category: %s' % predicted_dict[0][0])
        predicted_list.append(predicted_dict[0][0])  # collected for the Excel output
    predicted_dataframe = pd.DataFrame(predicted_list)
    predicted_dataframe.to_csv(write_path, index=False, header=None, encoding='utf_8_sig')
Below is the code after optimizing the I/O and the algorithm; both efficiency and accuracy improved.
import re
import os
import time
import jieba
import numpy as np
import pandas as pd
from collections import Counter
"""
自定义词典
辅助jieba分词,jieba在分词时会先加载自定义词典,并在待分词数据中找词典中存在的词,进行分词
"""
jieba.load_userdict("自定义词典")
# 构建训练集词典
def classification(file_name, stop_words):
    stopwords = [
        line.strip() for line in open(stop_words, "r", encoding="utf-8").readlines()
    ]
    """
    Regex that keeps Chinese characters only
    """
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    dirty_data = pd.read_excel(file_name)
    print("Writing……")
    for i in range(len(dirty_data["label"])):
        train_file_name = (
            r"训练集路径"
            + "\\"
            + str(dirty_data["label"][i])
            + ".txt"
        )
        train_file_name = train_file_name.replace("/", "")
        print(train_file_name)
        """
        Build the word bank:
        one file per product category, the segmented words of each record are appended to that category's file
        """
        if os.path.exists(train_file_name):
            with open(train_file_name, "a", encoding="utf-8") as file:
                outstr = ""
                datas = dirty_data["商品名称"][i]
                clean_datas = re.findall(pattern, str(datas))
                clean_data = "".join(clean_datas)
                results = jieba.cut(clean_data)
                for word in results:
                    if word not in stopwords:
                        if word != "\t":
                            outstr += word
                            outstr += " "
                file.write(outstr + "\n")
        else:
            with open(train_file_name, "a", encoding="utf-8") as file:
                outstr = ""
                datas = dirty_data["spu名称"][i]
                clean_datas = re.findall(pattern, str(datas))
                clean_data = "".join(clean_datas)
                results = jieba.cut(clean_data)
                for word in results:
                    if word not in stopwords:
                        if word != "\t":
                            outstr += word
                            outstr += " "
                file.write(outstr + "\n")
# Convert the training-set word banks to word vectors (word -> frequency dicts)
def result_data(total_path):
    total_file = os.listdir(total_path)
    for file in total_file:
        file_name_1 = "训练集路径" + "\\" + file
        with open(file_name_1, "r", encoding="utf-8") as f:
            contents = f.readlines()
        new_list = []
        for i in contents:
            for x in i.split():
                new_list.append(x)
        """
        Turn the word list into a word vector:
        Counter (a subclass of dict) counts the items in the list
        and produces a key:value dict where the key is the word and the value its frequency
        """
        dir = Counter(new_list)
        dicts = str(dict(dir))  # serialize as a plain dict literal so it can be eval'ed back later
        file_name_2 = "训练集词向量路径" + "\\" + file
        with open(file_name_2, "w", encoding="utf-8") as f:
            f.write(dicts)
    print("Training-set word vectors built")
# Custom dictionary to assist jieba segmentation and improve its accuracy
jieba.load_userdict(r"自定义词典")
# Test-set path
test_Data_path = "测试数据"
# Stop-word list
stop_words_path = "停用词表"
# Training-set word-bank path
train_path = "训练集词库"
# Output path
write_path = "保存路径"
# Build the test-set word vectors
def ALL_Data_Words(test_Data_path, stop_words_path, train_path):
    print("Starting")
    start = time.time()
    stopwords = [
        line.strip()
        for line in open(stop_words_path, "r", encoding="UTF-8").readlines()
    ]
    """
    Regex that keeps Chinese characters only;
    for this classification task, non-Chinese characters carry no signal in my view
    """
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data["商品名称"]
    """
    Read the full test set and segment it to build the bag of words
    """
    word_list = set()
    for position in range(len(result_new_Data)):
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = "".join(clean_datas)
        results = jieba.cut(clean_data)
        for word in results:
            if word not in stopwords:
                word_list.add(word)
    """
    Read the training-set word banks and use them to build the test-set word vectors
    """
    new_words = []
    catelist = os.listdir(train_path)
    for new_name in catelist:
        new_path = train_path + "/" + new_name
        with open(new_path, "r", encoding="UTF-8") as f:
            new_contents = f.readlines()
        train_Datas = eval(new_contents[0])
        total_num = sum(train_Datas.values())
        label = os.path.splitext(new_name)[0]  # splitext, because strip(".txt") would also trim stray t/x/. characters from the label
        all_dict_words = {}
        for data_word in word_list:
            all_dict_words["label"] = label
            all_dict_words["total_num"] = total_num
            if data_word not in train_Datas.keys():
                all_dict_words[data_word] = 0
            else:
                all_dict_words[data_word] = train_Datas[data_word]
        new_words.append(all_dict_words)
    end = time.time()
    print("Building the word banks took " + str(round(end - start, 3)) + "s")
    return new_words
# Text classification
def Predicted_Data(new_words, test_Data_path, stop_words_path):
    print("Loading the full dictionaries")
    start = time.time()
    stopwords = [
        line.strip()
        for line in open(stop_words_path, "r", encoding="UTF-8").readlines()
    ]
    pattern = re.compile(u"[\u4e00-\u9fa5]+")
    new_Data = pd.read_excel(test_Data_path)
    result_new_Data = new_Data["商品名称"]
    print("Starting prediction")
    write_list = []
    """
    Segment the test set again and build each record's word vector
    """
    for position in range(len(result_new_Data)):
        words_list = []
        datas = result_new_Data[position]
        clean_datas = re.findall(pattern, str(datas))
        clean_data = "".join(clean_datas)
        results = jieba.cut(clean_data)
        for word in results:
            if word not in stopwords:
                words_list.append(word)
        predicted_dict = {}  # prediction-score dictionary
        """
        Score every category with the word vectors,
        sort descending and take the best-scoring category as the product's category
        """
        for words in new_words:
            words_list_length = len(words_list)
            value_count = 0
            label_count = 0
            total_dict = words["total_num"]
            for key in words_list:  # the algorithm: similarity, popularity, share of the current category
                if words[key] == 0:
                    value = 0
                    value_count = value + value_count
                else:
                    label_count += 1
                    value = words[key]
                    value_count = value_count + value / total_dict  # this word's share of the current category
            if words_list_length == 0:
                predicted_score = 0
            else:
                """
                label_count / words_list_length is the similarity:
                if a product name yields 10 words and only 8 of them are found in the training set,
                the similarity for that category is 0.8
                """
                predicted_score = value_count * (
                    label_count / words_list_length
                )  # classify by word frequency and similarity
            label = words["label"]
            predicted_dict[label] = predicted_score
        predicted_dict = sorted(
            predicted_dict.items(), key=lambda item: item[1], reverse=True
        )  # sort scores descending
        write_list.append(predicted_dict[0][0])
    end = time.time()
    print("Segmenting and predicting the spus took " + str(round(end - start, 3)) + "s")
    return write_list
# Output function
def write_file(write_path, write_data):
    print("Writing to Excel")
    """
    Write everything to Excel in one go with pandas
    """
    data = pd.read_excel(test_Data_path)  # relies on the module-level test_Data_path defined above
    data["label"] = write_data
    data.to_excel(write_path, encoding="utf_8_sig", index=False)
def main_fuction(test_Data_path, stop_words_path, train_path, write_path):
    """Build the category word vectors"""
    new_words = ALL_Data_Words(test_Data_path, stop_words_path, train_path)
    """Predict a category for every record"""
    write_data = Predicted_Data(new_words, test_Data_path, stop_words_path)
    """Write to Excel"""
    write_file(write_path, write_data)
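To run the whole thing, call main_fuction with the paths defined above (classification and result_data must have been run beforehand so the training-set word banks exist):
if __name__ == "__main__":
    # the four paths are the placeholder variables defined above
    main_fuction(test_Data_path, stop_words_path, train_path, write_path)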
Looking back at May, when I first touched text classification, I was completely at a loss; it was only my teammates' patient guidance that kept me going long enough to write this first rough draft of a classifier. Roughly three months on, the short-text classification has already moved to more mature models whose speed and accuracy leave this rough draft far behind. The point of writing all this is to give people who want to get into text classification a few starting ideas: if you can sit through my rough write-up, the professional articles by the real experts will be much faster to pick up.
A rough pipeline for short-text classification:
Word segmentation (crucial; better segmentation directly improves model accuracy)
Apply a stop-word list (filter out meaningless words like "啊" or "好的" so they do not add noise)
Build word vectors (the usual algorithm is TF-IDF, term frequency × inverse document frequency; see the sketch right after this list)
Train a model on the training set
Segment the test set, then map the test data into the training set's TF-IDF vector space so the test and training features stay consistent (in text classification, as I understand it, the "features" are simply the segmented words)
Run the test set through the model to get its output
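A minimal sketch of the word-vector and mapping steps with scikit-learn, where train_texts / test_texts / train_labels are hypothetical names for lists of space-joined jieba tokens and their category labels:
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["小米 平衡车", "华为 手机"]   # hypothetical, already segmented and space-joined
train_labels = ["骑行分类", "手机分类"]       # hypothetical category labels
test_texts = ["最新款 平衡车"]                # hypothetical

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn the TF-IDF vocabulary on the training set
X_test = vectorizer.transform(test_texts)        # map the test set into the SAME vector space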
Models you might use for text classification (a minimal fit/predict sketch follows the list):
MultinomialNB: naive Bayes (multinomial distribution)
BernoulliNB: naive Bayes (Bernoulli distribution)
GaussianNB: naive Bayes (Gaussian distribution)
DecisionTreeClassifier: decision tree
SVC: SVM (support vector machine)
MLPClassifier: neural network (multi-layer perceptron)
KNN: k-nearest neighbors
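Continuing from the X_train / X_test of the previous snippet, a rough fit/predict sketch with two of the models above (not a tuned setup):
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

nb = MultinomialNB().fit(X_train, train_labels)
svm = SVC().fit(X_train, train_labels)
print(nb.predict(X_test))    # one predicted category per test record
print(svm.predict(X_test))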
When I first started on short-text classification, the code in this article, written under the guidance of the team's veterans, implemented segmentation, the dictionary and the word vectors. After flailing around on my own for a while, I finally sat down and more or less understood how segmentation, word vectors and the bag-of-words model actually work.
Later I will write a follow-up on the whole short-text-classification saga. It will overturn pretty much all the code and algorithms in this article and most of its ideas, and walk through how a beginner keeps swapping libraries, tuning parameters and changing the trainSet data to tune the model, and maybe ends up training a decent one... (I am still training models as we speak.)
If the experts out there find any of this interesting, feel free to add me as a friend and keep in touch.
Life is short, I use Python.