Topic of this chapter: TF-IDF feature extraction with n-gram expansion + a Naive Bayes model.
With cross-validation the model's mean score is 0.8947.
On the test set the weighted average F1-score is 0.907060, so this classifier performs well enough to be put into practical use.
Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
entertainment | 0.901968 | 0.924083 | 0.912891 | 11012 |
technology | 0.833637 | 0.896196 | 0.863785 | 7649 |
sports | 0.947450 | 0.928812 | 0.938039 | 9201 |
military | 0.947098 | 0.880473 | 0.912571 | 4819 |
car | 0.940442 | 0.852045 | 0.894064 | 3447 |
Overall | 0.908775 | 0.906693 | 0.907060 | 36128 |
Text feature extraction in sklearn: TfidfVectorizer
TfidfVectorizer implements a bag-of-words vectorization model weighted by TF-IDF.
TF-IDF (term frequency - inverse document frequency). When working with text, how do we turn words into vectors a model can process? TF-IDF is one answer to that question: a term's importance is proportional to how often it appears in the document (TF) and inversely proportional to how often it appears across the corpus (IDF).
TF-IDF = TF*IDF
TF: term frequency. TF(w) = (number of times word w appears in the document) / (total number of words in the document)
IDF: inverse document frequency. Some words appear frequently in a document yet carry little information, e.g. is, of, that; such words also appear in a very large fraction of the documents in the corpus, and we exploit that to lower their weight. IDF(w) = log_e((total number of documents in the corpus) / (number of documents containing word w))
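To make these formulas concrete, here is a minimal hand-rolled sketch (plain Python, no sklearn) that applies exactly the TF and IDF definitions above to a toy corpus; note that sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers differ slightly from this calculation.
import math
# toy corpus: each document is a list of tokens
docs = [
    ['hello', 'world'],
    ['this', 'is', 'a', 'panda'],
]
def tf(word, doc):
    # TF(w) = (count of w in the document) / (total number of words in the document)
    return doc.count(word) / len(doc)
def idf(word, docs):
    # IDF(w) = log_e(total number of documents / number of documents containing w)
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)
def tfidf_score(word, doc, docs):
    return tf(word, doc) * idf(word, docs)
print(tfidf_score('hello', docs[0], docs))  # 0.5 * ln(2/1) ≈ 0.3466
print(tfidf_score('panda', docs[1], docs))  # 0.25 * ln(2/1) ≈ 0.1733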
Suppose we are working with news articles and want a numerical representation of each article.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
## Input format: e.g. two news articles, each segmented into words, with the words joined into one string per article and the strings placed in a list
## Article 1: hello world
## Article 2: this is a panda.
sentences = ['hello world','this is a panda.']
# list of articles -> numeric vector matrix; by default the result is a sparse matrix
vec = tfidf.fit_transform(sentences)
print(vec)
# (0, 0) 0.7071067811865476 means: document 0, vocabulary index 0, with a TF-IDF value of 0.7071067811865476
# each entry is the TF-IDF value computed for one (document, term) pair
(0, 0) 0.7071067811865476
(0, 4) 0.7071067811865476
(1, 3) 0.5773502691896257
(1, 1) 0.5773502691896257
(1, 2) 0.5773502691896257
# vec : a sparse representation of the text features, i.e. the TF-IDF value of each term in each document
# vec.toarray() gives the dense-matrix representation
vec.toarray()
array([[0.70710678, 0. , 0. , 0. , 0.70710678],
[0. , 0.57735027, 0.57735027, 0.57735027, 0. ]])
# after fit_transform(), the fitted vocabulary can be retrieved; these are all the terms in our dictionary
tfidf.get_feature_names()
['hello', 'is', 'panda', 'this', 'world']
To sum up, by passing the two articles ['hello world', 'this is a panda.'] through TF-IDF we have turned the text into numbers.
Each row represents one sentence sample, and a matrix like this can be fed straight into a model for training.
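For reference, the same term-to-column mapping is also available as a dict through the vectorizer's vocabulary_ attribute (we use it again later in this chapter); a quick sketch on the toy example above:
# term -> column index; the indices match the (row, column) pairs printed by print(vec)
print(tfidf.vocabulary_)
# expected (terms are indexed in alphabetical order): {'hello': 0, 'world': 4, 'this': 3, 'is': 1, 'panda': 2}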
# define the article categories
categories = ['entertainment','technology', 'sports','military','car']
sentences = []
labels = []
with open("../data/news.csv", 'r', encoding='utf8') as f:
    lines = f.readlines()
    for line in lines:
        splits = line.split(' ')
        feat = splits[:-1]          # all tokens except the last form the article text
        label = splits[-1]          # the last token is the category label
        if label.strip() in categories:
            sentences.append((" ".join(feat), label.strip()))
            labels.append(label.strip())  # collect every label for counting later
sentences[:2]
[('另一边 舞王 韩庚 跟随 欢乐 起舞 八十年代 迪斯科 舞步 轮番上阵 场面 精彩 歌之夜 敬请期待 浙江 卫视 2017 周五 00 畅意 100% 乳酸菌 饮品 独家 冠名 二十四 小时 第二季 水手 欢乐 出发',
'entertainment'),
('三是 改变 割裂 状况 建立 一体化 防御 体系', 'technology')]
# count the number of samples per category
from collections import Counter
Counter(labels)
Counter({'entertainment': 33341,
'technology': 23153,
'sports': 27966,
'military': 14494,
'car': 10523})
Feature extraction: vectorizing the text
X_train and y_train look like this:
['音乐大师 播出', '设计 公司 承担 四届 奥运场馆 设计']
['entertainment', 'sports']
Note: the input is a list of strings; each string represents one article whose words have already been segmented and joined with spaces.
Instantiate the vectorizer by calling TfidfVectorizer from sklearn.feature_extraction.text.
Four things matter here.
The tokenized corpus is a list of strings (one per article) and is passed to fit_transform, not to the constructor;
the keyword argument stop_words is the stop-word list, a Python list;
the keyword argument min_df ignores terms whose document frequency is below this value, int or float;
the keyword argument max_df ignores terms whose document frequency is above this value, int or float.
For the full parameter list see the official documentation: http://sklearn.apachecn.org/cn/0.19.0/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
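As an illustration (the stop-word list and thresholds below are made-up values, not the ones used in this project), such an instantiation might look like:
from sklearn.feature_extraction.text import TfidfVectorizer
example_stop_words = ['的', '了', '是']    # hypothetical stop-word list
vectorizer = TfidfVectorizer(stop_words=example_stop_words,
                             min_df=5,     # drop terms appearing in fewer than 5 documents
                             max_df=0.8)   # drop terms appearing in more than 80% of documents
# the tokenized corpus (a list of space-separated strings) is then passed to fit_transform:
# X = vectorizer.fit_transform(corpus)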
# feature vectors: the input data format
words_list,category_name_list = zip(*sentences)
print(list(words_list[:2]))
print(list(category_name_list[:2]))
['另一边 舞王 韩庚 跟随 欢乐 起舞 八十年代 迪斯科 舞步 轮番上阵 场面 精彩 歌之夜 敬请期待 浙江 卫视 2017 周五 00 畅意 100% 乳酸菌 饮品 独家 冠名 二十四 小时 第二季 水手 欢乐 出发', '三是 改变 割裂 状况 建立 一体化 防御 体系']
['entertainment', 'technology']
Text -> TF-IDF representation
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer='word',max_features=12000,min_df=10,ngram_range=(1,3))
# TF-IDF representation of the training samples
X = tfidf.fit_transform(words_list)
# n_samples : number of text rows, i.e. training samples
# max_features : vocabulary size = len(tfidf.get_feature_names())
print('n_samples = {},max_features = {}'.format(X.shape[0],X.shape[1]))
n_samples = 109477,max_features = 12000
print('feature size = ',len(tfidf.get_feature_names()))
feature size = 12000
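Because ngram_range=(1,3) is used here, every feature can be a single word or a run of 2-3 adjacent words. A minimal sketch on two made-up segmented sentences shows what the expanded feature set looks like:
from sklearn.feature_extraction.text import TfidfVectorizer
toy = ['设计 公司 承担 奥运场馆 设计', '公司 发布 智能手机']   # illustrative sentences only
demo = TfidfVectorizer(analyzer='word', ngram_range=(1, 3))
demo.fit(toy)
# unigrams, bigrams and trigrams all become features, e.g. '公司 承担 奥运场馆'
print(demo.get_feature_names())   # get_feature_names_out() in newer sklearn versions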
import numpy as np
category_count = np.unique(category_name_list)
print('class count = ',category_count)
class count = ['car' 'entertainment' 'military' 'sports' 'technology']
# inspect the vocabulary built by the fitted TfidfVectorizer
vocab_dict = tfidf.vocabulary_
print('vocabulary size:', len(vocab_dict))  # the vocabulary holds 12000 terms
vocab_dict = sorted(vocab_dict.items(), key=lambda d: d[1], reverse=True)
with open('../data/tfidf_vocab.txt','w') as f:
    for t in vocab_dict:
        word_count = '{}|{}'.format(t[0], t[1])
        f.write('{}\n'.format(word_count))
vocabulary size: 12000
A sample of the vocabulary file (term|index):
龙芯|11999
龙舟队|11998
龙舟赛|11997
龙舟 大赛|11996
龙舟|11995
龙头企业|11994
龙头|11993
齐达内|11992
齐聚|11991
鼓舞|11990
鼓掌|11989
鼓励|11988
默默|11987
We now want to build the X and y we need.
X is the vectorized text; y is the encoded label.
# represent the sample features with TF-IDF
X = tfidf.fit_transform(words_list)
X.toarray()[:2]
array([[0.17450065, 0. , 0. , ..., 0. , 0. ,
0. ],
[0. , 0. , 0. , ..., 0. , 0. ,
0. ]])
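Note that X is a SciPy sparse matrix; .toarray() is called above only to display a couple of rows, since the dense 109477 × 12000 matrix would be very large, and MultinomialNB trains on the sparse form directly. A quick inspection sketch:
# X stays sparse: only the non-zero TF-IDF entries are stored
print(type(X))            # a scipy.sparse CSR matrix
print(X.shape, X.nnz)     # matrix shape and the number of stored non-zero entries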
# encode the sample labels
categories = ['entertainment','technology', 'sports','military','car']
label = [0,1,2,3,4]
categories_label = dict(zip(categories,label))
print('categories_label = ',categories_label)
y = [categories_label[label_name] for label_name in category_name_list]
y[:2]
categories_label = {'entertainment': 0, 'technology': 1, 'sports': 2, 'military': 3, 'car': 4}
[0, 1]
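The manual dict-based encoding above works fine; an equivalent alternative (a sketch, not what the original code uses) is sklearn's LabelEncoder, which also provides the inverse mapping back to names. Note that LabelEncoder assigns codes in alphabetical order of the class names, so the codes differ from the manual mapping above.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_alt = le.fit_transform(category_name_list)   # category name -> integer code (alphabetical order)
print(le.classes_)                             # classes_[i] is the name behind code i
print(le.inverse_transform(y_alt[:2]))         # decode the first two samples back to names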
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print('X_train = ',X_train.toarray()[:2])
print('X_test = ',X_test.toarray()[:2])
print('y_train = ',y_train[:2])
print('y_test = ',y_test[:2])
X_train = [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
X_test = [[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
y_train = [0, 3]
y_test = [0, 1]
We use Naive Bayes to build the Chinese text classifier. When the data volume and variety are sufficient, Naive Bayes usually achieves quite good accuracy on this task.
y_train[:5]
[0, 3, 2, 4, 0]
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train,y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Save the model with pickle's dump method, which takes 2 arguments.
The first argument is the object to save and can be of any type; because several objects need to be saved, the first argument in the code below is a dict.
The second argument is the file object to write to, of type _io.BufferedWriter.
import pickle
with open('../model/tfidf_model.pkl','wb') as f:
    model_dict = {
        'categories_label': categories_label,
        'tfidfVectorizer': tfidf,
        'nb': clf
    }
    pickle.dump(model_dict, f)
accuracy = clf.score(X_test, y_test)
print('accuracy = ',accuracy)
accuracy = 0.9066928697962799
Instantiate a cross-validation splitter with ShuffleSplit from sklearn.model_selection.
Obtain the per-fold scores with cross_val_score from sklearn.model_selection.
Finally, print each fold's score and the mean score.
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
# the model
clf = MultinomialNB()
clf.fit(X_train, y_train)  # features: X_train, targets: y_train
# ShuffleSplit repeatedly splits the data passed to cross_val_score into training and validation folds
cv_split = ShuffleSplit(n_splits=5, test_size=0.3)
score_ndarray = cross_val_score(clf, X_test, y_test, cv=cv_split)
print(score_ndarray)
print(score_ndarray.mean())
[0.89380939 0.89316358 0.89685395 0.89620814 0.89334809]
0.8946766306854876
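By default cross_val_score uses the classifier's default scorer, which is accuracy; a small variation (a sketch reusing clf, X_test, y_test and cv_split from above) scores each fold with weighted F1 instead, which is closer to the metric reported on the test set:
from sklearn.model_selection import cross_val_score
f1_scores = cross_val_score(clf, X_test, y_test, cv=cv_split, scoring='f1_weighted')
print(f1_scores)
print(f1_scores.mean())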
from collections import Counter
tmp = dict(Counter(y_test))
print(tmp)
# invert the dict with zip (count -> label code)
tmp2 = dict(zip(tmp.values(), tmp.keys()))
print(tmp2)
# map each label code back to its category name
[categories[x[1]] for x in tmp2.items()]
{0: 11012, 1: 7649, 3: 4819, 4: 3447, 2: 9201}
{11012: 0, 7649: 1, 4819: 3, 3447: 4, 9201: 2}
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# train the model
clf = MultinomialNB()
clf.fit(X_train, y_train)
# predict on the test set
y_pred = clf.predict(X_test)
# compute the confusion matrix
matrix = confusion_matrix(y_pred=y_pred, y_true=y_test)
print(matrix)
[[10176 563 183 75 15]
[ 460 6855 140 65 129]
[ 309 268 8546 57 21]
[ 269 172 114 4243 21]
[ 68 365 37 40 2937]]
# get the category names
category = list(categories_label.keys())
# the diagonal of the matrix counts correct predictions; off-diagonal entries were predicted as another class
pd.DataFrame(matrix, columns=category, index=category)
| | entertainment | technology | sports | military | car |
|---|---|---|---|---|---|
| entertainment | 10176 | 563 | 183 | 75 | 15 |
| technology | 460 | 6855 | 140 | 65 | 129 |
| sports | 309 | 268 | 8546 | 57 | 21 |
| military | 269 | 172 | 114 | 4243 | 21 |
| car | 68 | 365 | 37 | 40 | 2937 |
# sanity check: the first row of the confusion matrix sums to the entertainment support (11012)
total = 0
for x in [10176, 563, 183, 75, 15]:
    total += x
print(total)
11012
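Dividing each row of the confusion matrix by its row sum turns the counts into fractions of each true class, and the diagonal then equals the per-class recall; a short sketch reusing matrix and category from above:
import numpy as np
import pandas as pd
# row-normalize: each row shows how the samples of one true class were distributed over the predicted classes
matrix_norm = matrix / matrix.sum(axis=1, keepdims=True)
print(pd.DataFrame(matrix_norm, columns=category, index=category).round(3))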
Now let's look at the per-class evaluation metrics; the Support column also shows how the test data is distributed across categories.
def eval_model(y_true, y_pred, labels):
    # per-class precision, recall, F1 and support
    p, r, f1, s = precision_recall_fscore_support(y_true, y_pred)
    # overall (support-weighted) averages
    tot_p = np.average(p, weights=s)
    tot_r = np.average(r, weights=s)
    tot_f1 = np.average(f1, weights=s)
    tot_s = np.sum(s)
    res1 = pd.DataFrame({
        'Label': labels,
        'Precision': p,
        'Recall': r,
        'F1': f1,
        'Support': s
    })
    res2 = pd.DataFrame({
        'Label': ['Overall'],
        'Precision': [tot_p],
        'Recall': [tot_r],
        'F1': [tot_f1],
        'Support': [tot_s]
    })
    res2.index = [999]
    res = pd.concat([res1, res2])
    return res[['Label', 'Precision', 'Recall', 'F1', 'Support']]
y_pred = clf.predict(X_test)
eval_model(y_test,y_pred, category)
| | Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|---|
| 0 | entertainment | 0.901968 | 0.924083 | 0.912891 | 11012 |
| 1 | technology | 0.833637 | 0.896196 | 0.863785 | 7649 |
| 2 | sports | 0.947450 | 0.928812 | 0.938039 | 9201 |
| 3 | military | 0.947098 | 0.880473 | 0.912571 | 4819 |
| 4 | car | 0.940442 | 0.852045 | 0.894064 | 3447 |
| 999 | Overall | 0.908775 | 0.906693 | 0.907060 | 36128 |
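As a cross-check on eval_model, sklearn's built-in classification_report produces the same per-class numbers (a sketch using y_test, y_pred and category from above):
from sklearn.metrics import classification_report
# target_names maps the integer labels 0..4 back to readable category names
print(classification_report(y_test, y_pred, target_names=category))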
import warnings
warnings.filterwarnings('ignore')
# load the stop words
with open('../data/stopwords.txt',encoding='utf-8') as f:
stopwords = [stopword.strip() for stopword in f.readlines()]
print(stopwords[:10])
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
# load the saved models
import pickle
tf_model = '../model/tfidf_model.pkl'
with open(tf_model,'rb') as f:
    model_dict = pickle.load(f)
model = model_dict['nb']                                 # the Naive Bayes model from the dict
categories_labal_dict = model_dict['categories_label']   # the category-name -> label-code mapping
tfidf_vec = model_dict['tfidfVectorizer']                # the fitted TF-IDF vectorizer
print('categories_labal_dict = ',categories_labal_dict)
categories_labal_dict = {'entertainment': 0, 'technology': 1, 'sports': 2, 'military': 3, 'car': 4}
# invert categories_labal_dict (swap keys and values)
labal_category_dict = {}
for k,v in categories_labal_dict.items():
    labal_category_dict[v] = k
print('labal_category_dict=',labal_category_dict)
# helper: convert raw segmented text into its TF-IDF representation
def get_features(x):
    return tfidf_vec.transform(x)
labal_category_dict= {0: 'entertainment', 1: 'technology', 2: 'sports', 3: 'military', 4: 'car'}
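All of the prediction examples below repeat the same steps (segment with jieba, drop short tokens and stop words, vectorize, predict). A small helper wrapping those steps might look like this sketch; predict_category is a name introduced here for illustration, not part of the original code:
import jieba
def predict_category(text):
    # segment the text, drop single-character tokens and stop words
    words = [w for w in jieba.lcut(text) if len(w) >= 2 and w not in stopwords]
    feat = get_features([" ".join(words)])   # TF-IDF representation via the loaded vectorizer
    target = model.predict(feat)[0]          # integer label from the loaded Naive Bayes model
    return labal_category_dict[target]
# e.g. predict_category("奥迪A3、宝马1系和奔驰A级一直纠缠不休的三个冤家") should return 'car'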
Headline taken from Toutiao: https://www.toutiao.com/a6714271125473346055/
import jieba
import warnings
warnings.filterwarnings('ignore')
text = "奥迪A3、宝马1系和奔驰A级一直纠缠不休的三个冤家"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])        # vectorize with the TF-IDF extractor loaded from the pickle file
target = model.predict(feat)[0]    # predict with the loaded model
print('target = ',target)
print('category_name = ',labal_category_dict[target])
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\shenfuli\AppData\Local\Temp\jieba.cache
Loading model cost 1.313 seconds.
Prefix dict has been built succesfully.
words = ['奥迪', 'A3', '宝马', '奔驰', '纠缠', '不休', '三个', '冤家']
target = 4
category_name = car
Headline taken from Toutiao news: https://www.toutiao.com/a6714188329937535496/
import jieba
import warnings
warnings.filterwarnings('ignore')
text = "谁说文物只能躺在博物馆,想买一架梦想中的战斗机开着兜风吗?"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
target = model.predict(feat)[0]
print('feat = ',feat.toarray()[0])
print('target = ',target)
print('category_name = ',labal_category_dict[target])
words = ['文物', '只能', '博物馆', '一架', '梦想', '战斗机', '开着', '兜风']
feat = [0. 0. 0. ... 0. 0. 0.]
target = 3
category_name = military
We copy a headline from Toutiao (https://www.toutiao.com/a6689675139333751299/) and predict its category.
import jieba
import warnings
warnings.filterwarnings('ignore')
text = "陈晓旭:从完美林黛玉到身家过亿后剃度出家,她戏里戏外都是传奇"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
target = model.predict(feat)[0]
print('feat = ',feat.toarray()[0])
print('target = ',target)
print('category_name = ',labal_category_dict[target])
words = ['陈晓旭', '完美', '林黛玉', '身家', '亿后', '剃度', '出家', '戏里', '戏外', '传奇']
feat = [0. 0. 0. ... 0. 0. 0.]
target = 0
category_name = entertainment
Headline taken from Toutiao: https://www.toutiao.com/a6714266792253981192/
import jieba
import warnings
warnings.filterwarnings('ignore')
text = "男女有别!国乒主力参加马来西亚T2联赛 男队站着吃自助女队吃桌餐"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
target = model.predict(feat)[0]
print('feat = ',feat)
print('target = ',target)
print('category_name = ',labal_category_dict[target])
words = ['男女有别', '国乒', '主力', '参加', '马来西亚', 'T2', '联赛', '男队', '自助', '女队', '桌餐']
feat = (0, 11805) 0.36267398552780705
(0, 9731) 0.4253487251684549
(0, 9650) 0.25562899934208005
(0, 8531) 0.3991826114621538
(0, 4411) 0.3854020264908788
(0, 3781) 0.39298079438830585
(0, 3166) 0.234671335932325
(0, 1215) 0.3237496516306487
target = 2
category_name = sports
# (0, 11805) 0.36267398552780705 : let's check which feature term index 11805 corresponds to
tfidf.get_feature_names()[11805]
'马术'
import jieba
import warnings
warnings.filterwarnings('ignore')
text = "摩托罗拉One Macro将是最新一款Android One智能手机"
words = [word for word in jieba.lcut(text) if len(word)>=2 and word not in stopwords]
print('words = ',words)
data = " ".join(words)
feat = get_features([data])
target = model.predict(feat)[0]
print('feat = ',feat)
print('target = ',target)
print('category_name = ',labal_category_dict[target])
words = ['摩托罗拉', 'One', 'Macro', '最新', '一款', 'Android', 'One', '智能手机']
feat = (0, 7016) 0.4234811702104358
(0, 6932) 0.4889135792344855
(0, 501) 0.4667627151048488
(0, 232) 0.6031250105121443
target = 1
category_name = technology