Table of Contents
1. Learning Objectives
2. Building Word Vectors and Saving the Processed Data
2.1 Preparation
2.2 Bag-of-Words Vectors
2.3 TFIDF Vectors
3. Applying the Different Word Vectors to Different Machine Learning Classifiers
3.1 Logistic Regression
3.1.1 Bag-of-Words + Logistic Regression
3.1.2 TFIDF + Logistic Regression
3.2 Ridge Classification
3.2.1 Bag-of-Words + Ridge Classification
3.2.2 TFIDF + Ridge Classification
3.3 Naive Bayes Classification
3.3.1 Bag-of-Words + Naive Bayes
3.3.2 TFIDF + Naive Bayes
3.4 SVM
3.4.1 Bag-of-Words + SVM
3.4.2 TFIDF + SVM
3.5 XGBoost
3.5.1 Bag-of-Words + XGBoost
3.5.2 TFIDF + XGBoost
4. References
1. Learning Objectives
1. Learn simple word-vector extraction methods (bag-of-words and TFIDF).
2. Learn to combine the resulting word vectors with logistic regression, ridge classification, naive Bayes, SVM, and XGBoost for classification.
Note: this exercise emphasizes the process rather than the results, so only a small data sample is used and no careful hyperparameter tuning is done. The results reported in each part below are therefore not necessarily optimal.
2. Building Word Vectors and Saving the Processed Data
2.1 Preparation
import pandas as pd
import numpy as np
data_df = pd.read_csv(r'./data/train_set.csv', sep = '\t', nrows = 5000)
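Before building any vectors, it can help to take a quick look at the sample just loaded. The check below is only a sketch; it assumes the file has the 'text' and 'label' columns that the rest of the code relies on.
# Quick sanity check on the 5000-row sample (assumes 'text' and 'label' columns)
print(data_df.shape)                    # expected: (5000, 2)
print(data_df['label'].value_counts())  # class distribution of the sample
print(data_df['text'].iloc[0][:100])    # first 100 characters of the first document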
2.2 Bag-of-Words Vectors
A bag-of-words vector is simply a count vector: for each document, it records how many times each vocabulary word occurs in that document. In sklearn, CountVectorizer can be used to generate bag-of-words vectors.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pickle
# Build the count vectors
count_vector = CountVectorizer(max_df = 0.5, min_df = 3 / data_df.shape[0],
                               max_features = None, ngram_range = (1, 1))
data = count_vector.fit_transform(data_df['text'])
X_train, X_test, y_train, y_test = train_test_split(data, data_df['label'],
                                                    stratify = data_df['label'],
                                                    random_state = 2020,
                                                    test_size = 0.2
                                                    )
# Save the split data to a file
data_count_sample = (X_train, X_test, y_train, y_test)
with open(r'./data/count_sample.pkl', 'wb') as f:
    pickle.dump(data_count_sample, f)
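Before moving on, it can be worth checking how many terms survive the max_df / min_df filtering; a quick, illustrative look at the fitted objects is enough.
# How large did the vocabulary end up, and what shape is the matrix?
print(data.shape)                      # (5000, vocabulary size)
print(len(count_vector.vocabulary_))   # number of terms kept after max_df / min_df filtering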
A brief explanation of the CountVectorizer() parameters used here: max_df can be a float in [0.0, 1.0] or an integer. Terms whose document frequency (as a proportion or an absolute count, respectively) is higher than this threshold are filtered out, since such terms are very likely stop words. min_df works the same way in the other direction: terms whose document frequency is lower than the threshold are dropped. max_features caps the number of terms kept, which effectively prevents the resulting vectors from having too many dimensions, and ngram_range specifies the range of the n-gram model. The TFIDF vectors below are constructed with the same parameters.
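To make the count-vector idea concrete, here is a minimal sketch on a made-up three-document corpus (the tokens are arbitrary word IDs invented for illustration, not taken from the competition data):
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus of space-separated word IDs, just to see the shape of the output
toy_docs = ['12 34 34 56', '34 56 56 78', '12 78 78 78']
toy_vector = CountVectorizer()
toy_counts = toy_vector.fit_transform(toy_docs)
print(toy_vector.vocabulary_)   # word -> column index, e.g. {'12': 0, '34': 1, '56': 2, '78': 3}
print(toy_counts.toarray())     # each row is one document's count vector
# e.g. [[1 2 1 0]
#       [0 1 2 1]
#       [1 0 0 3]]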
2.3 TFIDF Vectors
The core idea of TFIDF in one sentence: the more often a word appears in a given document, and the less often it appears in other documents, the more likely that word is to be representative of that document, and the higher the weight it should be given.
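Here is a minimal sketch of this effect, again on a made-up corpus: the token '12' appears in every document, while '99' appears only in the last one, so in that last document '99' receives a noticeably higher weight than '12'.
from sklearn.feature_extraction.text import TfidfVectorizer

# '12' occurs in all three documents, '99' only in the last one
toy_docs = ['12 34 34', '12 56', '12 99 99']
toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs)
print(toy_tfidf.vocabulary_)
print(toy_matrix.toarray().round(2))  # in the last row, the '99' column outweighs the '12' column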
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pickle
# Build the TFIDF vectors
tfidf_vector = TfidfVectorizer(max_df = 0.5, min_df = 3 / data_df.shape[0],
                               max_features = None, ngram_range = (1, 1))
data = tfidf_vector.fit_transform(data_df['text'])
X_train, X_test, y_train, y_test = train_test_split(data, data_df['label'],
                                                    stratify = data_df['label'],
                                                    random_state = 2020,
                                                    test_size = 0.2
                                                    )
# Save the split data to a file
data_tfidf_sample = (X_train, X_test, y_train, y_test)
with open(r'./data/tfidf_sample.pkl', 'wb') as f:
    pickle.dump(data_tfidf_sample, f)
3. Applying the Different Word Vectors to Different Machine Learning Classifiers
To keep things easy to follow, the code in each of the following subsections is kept as similar as possible; the parts differ only in the model being called and in some additional data processing.
3.1 Logistic Regression
3.1.1 Bag-of-Words + Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_lr = LogisticRegression(C = 1.0, solver = 'newton-cg',
                            multi_class = 'multinomial')
clf_lr.fit(X_train, y_train)
y_prediction = clf_lr.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.8099388797387872
Regarding the choice of solver, I found a blog post that explains the options fairly clearly [1] and share it here.
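If you want to see the effect of the solver directly, a rough comparison along the following lines works on the same bag-of-words split loaded above (the solver list and max_iter value are my own choices for illustration; 'sag' and 'saga' may emit convergence warnings on raw counts):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Rough, illustrative solver comparison on the bag-of-words split
for solver in ['newton-cg', 'lbfgs', 'sag', 'saga']:
    clf = LogisticRegression(C = 1.0, solver = solver,
                             multi_class = 'multinomial', max_iter = 200)
    clf.fit(X_train, y_train)
    print(solver, f1_score(y_test, clf.predict(X_test), average = 'macro'))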
3.1.2 TFIDF + Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_lr = LogisticRegression(C = 1.0, solver = 'newton-cg',
                            multi_class = 'multinomial')
clf_lr.fit(X_train, y_train)
y_prediction = clf_lr.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.7938627628785371
3.2 Ridge Classification
3.2.1 Bag-of-Words + Ridge Classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_rc = RidgeClassifier()
clf_rc.fit(X_train, y_train)
y_prediction = clf_rc.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.5713044356078195
3.2.2 TFIDF + Ridge Classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_rc = RidgeClassifier()
clf_rc.fit(X_train, y_train)
y_prediction = clf_rc.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.8612591130592262
3.3 Naive Bayes Classification
3.3.1 Bag-of-Words + Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)
y_prediction = clf_nb.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.8049698286580436
3.3.2 TFIDF + Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)
y_prediction = clf_nb.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.42510191518185725
3.4 SVM
The code here differs slightly from the previous sections: two extra data-processing steps are added, SVD dimensionality reduction and standardization. Leaving these two steps out does not change the results of this experiment very much, but including them noticeably shortens the training time.
3.4.1 Bag-of-Words + SVM
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
# Reduce the sparse count matrix to 120 dimensions, then standardize
svd = TruncatedSVD(n_components = 120)
svd.fit(X_train)
X_train = svd.transform(X_train)
X_test = svd.transform(X_test)
scl = StandardScaler()
scl.fit(X_train)
X_train = scl.transform(X_train)
X_test = scl.transform(X_test)
clf_svc = SVC(C = 1.0, probability = True)
clf_svc.fit(X_train, y_train)
y_prediction = clf_svc.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.6239219719523244
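The choice of 120 components above is fairly arbitrary. If you want to know how much of the count matrix's variance those components actually retain, the fitted SVD object exposes this directly; a quick check might look like the following.
import numpy as np

# Cumulative explained variance of the 120-component SVD fitted above
cum_ratio = np.cumsum(svd.explained_variance_ratio_)
print('variance kept by 120 components:', cum_ratio[-1])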
3.4.2 TFIDF + SVM
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
# Reduce the sparse TFIDF matrix to 120 dimensions, then standardize
svd = TruncatedSVD(n_components = 120)
svd.fit(X_train)
X_train = svd.transform(X_train)
X_test = svd.transform(X_test)
scl = StandardScaler()
scl.fit(X_train)
X_train = scl.transform(X_train)
X_test = scl.transform(X_test)
clf_svc = SVC(C = 1.0, probability = True)
clf_svc.fit(X_train, y_train)
y_prediction = clf_svc.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.8399423730362795
3.5 XGBoost
3.5.1 Bag-of-Words + XGBoost
import xgboost as xgb
from sklearn.metrics import f1_score
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_xgb = xgb.XGBClassifier(max_depth = 7, n_estimators = 200,
                            colsample_bytree = 0.8, subsample = 0.8,
                            nthread = 10, learning_rate = 0.1)
clf_xgb.fit(X_train, y_train)
y_prediction = clf_xgb.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.849811679301213
3.5.2 TFIDF + XGBoost
import xgboost as xgb
from sklearn.metrics import f1_score
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)
clf_xgb = xgb.XGBClassifier(max_depth = 7, n_estimators = 200,
                            colsample_bytree = 0.8, subsample = 0.8,
                            nthread = 10, learning_rate = 0.1)
clf_xgb.fit(X_train, y_train)
y_prediction = clf_xgb.predict(X_test)
f1_score(y_test, y_prediction, average = 'macro')
0.8442801532269851
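Since the ten blocks above are nearly identical, they could also be folded into one small helper. The evaluate() function below is not part of the original write-up, just a hypothetical convenience sketch built on the two pickle files saved in section 2.
import pickle
from sklearn.metrics import f1_score

def evaluate(model, sample_path):
    # Fit any sklearn-style classifier on a saved split and return its macro F1
    with open(sample_path, 'rb') as f:
        X_train, X_test, y_train, y_test = pickle.load(f)
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test), average = 'macro')

# Example: reproduce section 3.2.2 (TFIDF + Ridge Classification)
from sklearn.linear_model import RidgeClassifier
print(evaluate(RidgeClassifier(), r'./data/tfidf_sample.pkl'))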
Other methods can be found in [2].
4. References
[1] https://blog.csdn.net/CherDW/article/details/54891073
[2] https://zhuanlan.zhihu.com/p/50657430