1 Problem Understanding
2 Data Analysis
3 Word Vectors + Machine Learning Models
Word vectors: the general way of turning text into numbers or vectors that a computer can actually compute with. Mapping variable-length text into a fixed-length space is the first step of text classification.
- One-hot encoding: give every word/character a unique index and encode each sentence as a 0/1 vector over those indices: a position is 1 if the corresponding word appears in the sentence and 0 if it does not.
- Bag of Words models:
- (1) Count Vectors: count how many times each word/character appears in a document (from sklearn.feature_extraction.text import CountVectorizer).
- (2) N-gram: the same idea as plain word counts, except that N-gram counts every run of N adjacent words/characters as a single unit.
- (3) TF-IDF: TF x IDF, where TF = (number of times the word appears in this document / total number of words in this document) and IDF = log_e(total number of documents / number of documents containing this word). A small sketch of these three vectorizers follows this list.
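A minimal sketch of these representations on a toy corpus (the three sentences below are invented for illustration, and it assumes a recent scikit-learn that provides get_feature_names_out). Note that scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term and L2-normalizes each row, so its numbers differ slightly from the textbook formula above.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = ['the cat sat', 'the dog sat', 'the cat ran']   # toy corpus, invented for illustration
# Count Vectors: one column per token, values are raw counts per document
count_vec = CountVectorizer()
print(count_vec.fit_transform(corpus).toarray())
print(count_vec.get_feature_names_out())
# N-gram counts: ngram_range=(1,2) counts single tokens and adjacent token pairs
ngram_vec = CountVectorizer(ngram_range=(1, 2))
print(ngram_vec.fit(corpus).get_feature_names_out())
# TF-IDF: counts re-weighted by how rare each token is across the corpus
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(corpus).toarray().round(2))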
3.1 Count Vectors + RidgeClassifier
Ridge regression has a classifier variant, RidgeClassifier; this classifier is sometimes described as a least-squares support vector machine with a linear kernel.
An introduction to how RidgeClassifier works
An introduction to RidgeClassifier's parameters
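For intuition (my own minimal sketch, not part of the original write-up): for a binary problem, RidgeClassifier recodes the labels as -1/+1, fits an ordinary ridge regression to them, and predicts with the sign of the regression output, which is why it behaves like a least-squares linear classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)  # toy data
ridge_clf = RidgeClassifier(alpha=1.0).fit(X_demo, y_demo)
# the same thing by hand: ridge regression on -1/+1 targets, thresholded at 0
ridge_reg = Ridge(alpha=1.0).fit(X_demo, np.where(y_demo == 1, 1.0, -1.0))
manual_pred = (ridge_reg.predict(X_demo) > 0).astype(int)
print((ridge_clf.predict(X_demo) == manual_pred).mean())   # expected to be 1.0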
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
# load the competition training set (tab-separated: label, text)
df = pd.read_csv(r'E:\jupyter_lab\天池\新闻文本分类\data\train\train_set.csv',sep='\t',encoding='utf8')
df.head()
# keep the 3000 most frequent tokens as count features
CountVec = CountVectorizer(max_features = 3000)
train_text = CountVec.fit_transform(df.text)
# hold out 30% of the data for validation
x_train,x_val,y_train,y_val = train_test_split(train_text,df.label,test_size=0.3,random_state=0 )
clf = RidgeClassifier()
clf.fit(x_train,y_train)
val_pre = clf.predict(x_val)
# the competition metric is macro-averaged F1
score_f1 = f1_score(y_val,val_pre,average='macro')
print('CountVectorizer + RidgeClassifier : %.4f' %score_f1 )
3.2 TF-IDF + RidgeClassifier
An introduction to TfidfVectorizer's parameters and attributes
from sklearn.feature_extraction.text import TfidfVectorizer
%%time
tfidf = TfidfVectorizer(ngram_range=(1,3),max_features=3000)
# max_features=3000: keep only the 3000 most frequent terms in the corpus
# ngram_range=(1,3): count single characters, character pairs, and character triples
train_text_tfidf = tfidf.fit_transform(df.text)
Warning: this cell takes a very long time to run!
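To make ngram_range concrete, here is a tiny illustration (the one-line document below is invented; the real texts are far longer), reusing the TfidfVectorizer imported above:
demo_vec = TfidfVectorizer(ngram_range=(1, 3))
demo_vec.fit(['2967 6758 339 2021'])   # one toy document in the same space-separated token format
print(demo_vec.get_feature_names_out())
# prints every unigram, bigram, and trigram built from adjacent tokens in the document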
Split the dataset:
x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf,df.label,test_size=0.3,random_state=0 )
%%time
clf = RidgeClassifier()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf = clf.predict(x_val_tfidf)
score_f1_tfidf = f1_score(y_val_tfidf,val_pre_tfidf,average='macro')
print('TF-IDF + RidgeClassifier : %.4f' %score_f1_tfidf )
3.3 Count Vectors | TFIDF + MultinomialNB
Naive Bayes classification (NBC) is based on Bayes' theorem plus the assumption that features are conditionally independent of each other. Given a training set, and taking independence between feature words as its premise, it learns the joint probability distribution from input to output; then, using the learned model, for an input X it returns the output Y with the largest posterior probability. MultinomialNB implements the naive Bayes algorithm for data that follows a multinomial distribution.
Naive Bayes classification (understanding the algorithm)
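A tiny worked sketch of the idea (the count matrix below is invented): MultinomialNB estimates, for each class, the probability of each token from the training counts, and then scores a new document by combining those per-token probabilities with the class prior via Bayes' theorem.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
# toy document-term count matrix: 4 documents x 3 token features, 2 classes
X_toy = np.array([[2, 1, 0],
                  [3, 0, 0],
                  [0, 1, 3],
                  [0, 2, 2]])
y_toy = np.array([0, 0, 1, 1])
nb = MultinomialNB(alpha=1.0).fit(X_toy, y_toy)   # alpha=1.0 is Laplace smoothing
print(nb.predict([[1, 0, 2]]))        # class with the largest posterior
print(nb.predict_proba([[1, 0, 2]]))  # posterior probability of each class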
from sklearn.naive_bayes import MultinomialNB
Count Vectors features
%%time
clf = MultinomialNB()
clf.fit(x_train,y_train)
val_pre_CountVec_NBC = clf.predict(x_val)
score_f1_CountVec_NBC = f1_score(y_val,val_pre_CountVec_NBC,average='macro')
print('CountVec + MultinomialNB : %.4f' %score_f1_CountVec_NBC )
TF-IDF features
%%time
clf = MultinomialNB()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_NBC = clf.predict(x_val_tfidf)
score_f1_tfidf_NBC = f1_score(y_val_tfidf,val_pre_tfidf_NBC,average='macro')
print('TF-IDF + MultinomialNB : %.4f' %score_f1_tfidf_NBC )
3.4 Comparing the results of the models
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
scores = [score_f1,score_f1_tfidf,score_f1_CountVec_NBC,score_f1_tfidf_NBC]
x_ticks = np.arange(4)
x_ticks_label = ['CountVec_RidgeClassifier','tfidf_RidgeClassifier','CountVec_NBC','tfidf_NBC']
plt.plot(x_ticks,scores)
plt.xticks(x_ticks, x_ticks_label, fontsize=8) # smaller font so the long labels fit
plt.ylabel('F1_score')
plt.show()
Summary: I ran the different classifiers on both kinds of word vectors, and overall the TF-IDF vectors are somewhat more effective than Count Vectors, so from here on the experiments use TF-IDF only. After consulting other references, I move straight on to the better-performing models.
3.5 TFIDF + LinearSVC
How SVM works
The difference between svm.LinearSVC and svm.SVC
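In short (my own summary, not taken from the linked articles): LinearSVC uses the liblinear solver with a linear kernel only and scales well to large, sparse TF-IDF matrices, while SVC uses libsvm, supports arbitrary kernels, but its training time grows roughly quadratically with the number of samples, which makes it impractical for a dataset of this size. A minimal sketch of the two ways to get a linear SVM:
from sklearn.svm import LinearSVC, SVC
# liblinear: squared hinge loss by default, one-vs-rest for multi-class, fast on sparse data
linear_clf = LinearSVC(C=1.0)
# libsvm with a linear kernel: same kind of decision boundary, but much slower to train
kernel_clf = SVC(kernel='linear', C=1.0)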
from sklearn.svm import LinearSVC
%%time
clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')
print('TF-IDF + LinearSVC : %.4f' %score_f1_tfidf_LSVC )
3.6 TFIDF + RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
%%time
clf = RandomForestClassifier()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_RFC = clf.predict(x_val_tfidf)
score_f1_tfidf_RFC = f1_score(y_val_tfidf,val_pre_tfidf_RFC,average='macro')
Extra-long runtime warning: fitting the random forest on these TF-IDF features takes far longer than any of the previous models!
print('TF-IDF + RandomForestClassifier : %.4f' %score_f1_tfidf_RFC )
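If you do rerun this, one thing worth trying (not part of the original run) is letting the forest use all CPU cores via n_jobs, which RandomForestClassifier supports and which should cut the wall-clock time substantially:
clf = RandomForestClassifier(n_jobs=-1, random_state=0)   # n_jobs=-1: train the trees on all CPU cores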
Model comparison
Comparing the results, LinearSVC wins outright on both runtime and accuracy. RandomForestClassifier fits second best, but its runtime is simply too long; next comes RidgeClassifier, and finally MultinomialNB. There are other machine-learning models I have not tried, but the ceiling for classical machine-learning models here should be around 0.92-0.93. Next, I tune the TfidfVectorizer parameters to see how much improvement that can bring.
3.7 Tuning TfidfVectorizer parameters
- ngram_range=(1,2), max_features=3000  # f1_score: 0.9207
- ngram_range=(1,2), max_features=4000  # f1_score: 0.9247
- ngram_range=(1,3), max_features=3000  # f1_score: 0.9215
- ngram_range=(1,3), max_features=4000  # f1_score: 0.9257
(1) ngram_range=(1,2), max_features=3000
%%time
tfidf_N2 = TfidfVectorizer(ngram_range=(1,2),max_features=3000)
train_text_tfidf_N2 = tfidf_N2.fit_transform(df.text)
x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_N2,df.label,test_size=0.3,random_state=0 )
clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_N2 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')
print('TF-IDF_N2 + LinearSVC : %.4f' %score_f1_tfidf_LSVC_N2 )
(2) ngram_range=(1,2), max_features=4000
%%time
tfidf_mf4000 = TfidfVectorizer(ngram_range=(1,2),max_features=4000)
train_text_tfidf_mf4000 = tfidf_mf4000.fit_transform(df.text)
x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_mf4000,df.label,test_size=0.3,random_state=0 )
clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_mf4000 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')
print('TF-IDF_LSVC_mf4000 + LinearSVC : %.4f' %score_f1_tfidf_LSVC_mf4000 )
(3) ngram_range=(1,3), max_features=4000
%%time
tfidf_N3_mf4000 = TfidfVectorizer(ngram_range=(1,3),max_features=4000)
train_text_tfidf_N3_mf4000 = tfidf_N3_mf4000.fit_transform(df.text)
Warning: very long runtime!
x_train_tfidf,x_val_tfidf,y_train_tfidf,y_val_tfidf = train_test_split(train_text_tfidf_N3_mf4000,df.label,test_size=0.3,random_state=0 )
clf = LinearSVC()
clf.fit(x_train_tfidf,y_train_tfidf)
val_pre_tfidf_LSVC = clf.predict(x_val_tfidf)
score_f1_tfidf_LSVC_N3_mf4000 = f1_score(y_val_tfidf,val_pre_tfidf_LSVC,average='macro')
print('TF-IDF_N3_mf4000 + LinearSVC : %.4f' %score_f1_tfidf_LSVC_N3_mf4000 )
Parameter tuning summary
Since Chinese words mostly consist of two characters, I also tried ngram_range=(1,2).
- ngram_range=(1,2), max_features=3000  # f1_score: 0.9207
- ngram_range=(1,2), max_features=4000  # f1_score: 0.9247
- ngram_range=(1,3), max_features=3000  # f1_score: 0.9215
- ngram_range=(1,3), max_features=4000  # f1_score: 0.9257
From the results, enlarging the N-gram window to (1,3) does improve accuracy. Also, unsurprisingly, the more feature words are used, the higher the model's score, but the runtime grows as well. The current numbers show that 4000 features still improves the model; ideally one would keep increasing max_features to see whether the gains continue and how large they are before deciding to stop at 4000, but because of the runtime cost I did not push max_features any further. A more systematic version of this search is sketched below.
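If runtime allowed, a more systematic way to explore these settings (a sketch under that assumption, not something actually run for this write-up) would be a cross-validated grid search over the vectorizer and the classifier in a single pipeline:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
param_grid = {
    'tfidf__ngram_range': [(1, 2), (1, 3)],
    'tfidf__max_features': [3000, 4000, 5000],
}
# 3-fold cross-validated macro F1 over the whole grid; very slow on the full training set
search = GridSearchCV(pipe, param_grid, scoring='f1_macro', cv=3, n_jobs=-1)
search.fit(df.text, df.label)
print(search.best_params_, search.best_score_)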
To be continued. Up next: deep learning for text classification, part 1: fasttext.