Natural Language Processing in Practice (News Text Classification) - Task 3: Simple Word Vectors + Machine Learning Algorithms

Contents

1. Learning objectives

2. Building word vectors and saving the processed data

2.1 Preparation

2.2 Bag-of-words vectors

2.3 TF-IDF vectors

3. Applying the word vectors to different machine learning classifiers

3.1 Logistic regression

3.1.1 Bag-of-words + logistic regression

3.1.2 TF-IDF + logistic regression

3.2 Ridge classification

3.2.1 Bag-of-words + ridge classifier

3.2.2 TF-IDF + ridge classifier

3.3 Naive Bayes classification

3.3.1 Bag-of-words + naive Bayes

3.3.2 TF-IDF + naive Bayes

3.4 SVM

3.4.1 Bag-of-words + SVM

3.4.2 TF-IDF + SVM

3.5 XGBoost

3.5.1 Bag-of-words + XGBoost

3.5.2 TF-IDF + XGBoost

4. References

1. Learning objectives

1. Learn simple word-vector extraction methods (bag-of-words and TF-IDF)

2. Learn to classify using logistic regression, ridge classification, naive Bayes, SVM, and XGBoost in combination with the extracted word vectors

Note: this task focuses on the process rather than the results, so only a small data sample is used and no careful hyperparameter tuning is done. The results produced by the code in the sections below are therefore not necessarily optimal.

2. Building word vectors and saving the processed data

2.1 Preparation

import pandas as pd
import numpy as np

# Read the first 5000 rows of the tab-separated training set
data_df = pd.read_csv(r'./data/train_set.csv', sep = '\t', nrows = 5000)
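
Since the train/test splits below are stratified by label, it can help to take a quick look at the sample size and class distribution first. A minimal sketch, run right after the cell above (it only assumes the 'text' and 'label' columns used throughout this post):

# Number of documents and class distribution of the 5000-row sample
print(data_df.shape)
print(data_df['label'].value_counts())

# Peek at the beginning of the first document
print(data_df['text'].iloc[0][:100])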

2.2 Bag-of-words vectors

A bag-of-words vector is simply a count vector: the bag-of-words vector of a document records how many times each word in the vocabulary appears in that document. In sklearn, CountVectorizer can be used to generate bag-of-words vectors.
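
Before applying this to the competition data, here is a minimal sketch on a made-up three-document corpus (the corpus and variable names are purely illustrative) showing what the count vectors look like:

from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['the cat sat on the mat',
              'the dog sat on the log',
              'cats and dogs']
toy_vector = CountVectorizer()
toy_counts = toy_vector.fit_transform(toy_corpus)

print(toy_vector.vocabulary_)    # term -> column index mapping
print(toy_counts.toarray())      # one row of raw counts per document

Each row of the dense matrix is the bag-of-words vector of one document; the code below works the same way, only on the sampled news texts.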

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pickle

# Generate the bag-of-words vectors
count_vector = CountVectorizer(max_df = 0.5, min_df = 3 / data_df.shape[0], 
                               max_features = None, ngram_range = (1, 1))
data = count_vector.fit_transform(data_df['text'])

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(data, data_df['label'],
                                                    stratify = data_df['label'],
                                                    random_state = 2020,
                                                    test_size = 0.2
                                                   )

# Save the split data to a file
data_count_sample = (X_train, X_test, y_train, y_test)
with open(r'./data/count_sample.pkl', 'wb') as f:
    pickle.dump(data_count_sample, f)

A brief note on the CountVectorizer() parameters: max_df can be a float in [0, 1] or an integer. Terms whose document frequency (as a proportion of documents, or as an absolute count) is higher than this value are filtered out, since such terms are very likely stopwords. min_df works the same way in the other direction: terms whose document frequency is lower than the threshold are dropped. max_features limits the maximum number of terms kept, which effectively prevents the resulting vectors from having too many dimensions. ngram_range specifies the range of n-gram sizes used. The TF-IDF vectors below are constructed with the same parameters.
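
To make the effect of these parameters concrete, here is a minimal sketch reusing toy_corpus from the sketch above (the thresholds are arbitrary and only meant to show the filtering behaviour; max_features would additionally cap the vocabulary at the most frequent terms):

# max_df = 0.5: drop terms that appear in more than half of the documents
# ngram_range = (1, 2): keep unigrams and bigrams
v_rare = CountVectorizer(max_df = 0.5, ngram_range = (1, 2))
print(sorted(v_rare.fit(toy_corpus).vocabulary_))

# min_df = 2: keep only terms that appear in at least two documents
v_common = CountVectorizer(min_df = 2)
print(sorted(v_common.fit(toy_corpus).vocabulary_))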

2.3 TF-IDF vectors

The core idea of TF-IDF in one sentence: the more often a word occurs in a given document, and the less often it occurs in other documents, the more likely it is to be representative of that document, and so it should be given a higher weight.
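
To see this on the toy_corpus from the bag-of-words section, one can inspect the learned IDF values: a term that occurs in every document gets a low IDF, and hence a low TF-IDF weight, while a term confined to a single document gets a higher one. A minimal sketch (the exact numbers also depend on TfidfVectorizer's smoothing and normalization defaults):

from sklearn.feature_extraction.text import TfidfVectorizer

toy_tfidf = TfidfVectorizer()
toy_weights = toy_tfidf.fit_transform(toy_corpus)

# idf_ is larger for terms confined to few documents (e.g. 'mat')
# and smaller for terms spread across many documents (e.g. 'the')
for term, idx in sorted(toy_tfidf.vocabulary_.items()):
    print(term, toy_tfidf.idf_[idx])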

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pickle

# Generate the TF-IDF vectors
tfidf_vector = TfidfVectorizer(max_df = 0.5, min_df = 3 / data_df.shape[0], 
                               max_features = None, ngram_range = (1, 1))
data = tfidf_vector.fit_transform(data_df['text'])

# Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(data, data_df['label'],
                                                    stratify = data_df['label'],
                                                    random_state = 2020,
                                                    test_size = 0.2
                                                   )

# Save the split data to a file
data_tfidf_sample = (X_train, X_test, y_train, y_test)
with open(r'./data/tfidf_sample.pkl', 'wb') as f:
    pickle.dump(data_tfidf_sample, f)

3. Applying the word vectors to different machine learning classifiers

For ease of understanding, the code in each of the following parts is kept as similar as possible; the parts differ slightly only in the model being called and in some additional data processing.

3.1 Logistic regression

3.1.1 Bag-of-words + logistic regression

import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Load the saved bag-of-words features and labels
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a multinomial logistic regression classifier
clf_lr = LogisticRegression(C = 1.0, solver = 'newton-cg', \
                            multi_class = 'multinomial')
clf_lr.fit(X_train, y_train)
y_prediction = clf_lr.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.8099388797387872

As for the choice of solver, I found a blog post [1] that explains it fairly clearly and would like to share it.
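
As a rough way to compare them on this task, one can simply loop over a few solvers on the data already loaded above; a minimal sketch (only solvers that support the multinomial setting are included, and max_iter is raised just to help convergence):

# Compare several multinomial-capable solvers on the same split
for solver in ['newton-cg', 'lbfgs', 'saga']:
    clf = LogisticRegression(C = 1.0, solver = solver, \
                             multi_class = 'multinomial', max_iter = 1000)
    clf.fit(X_train, y_train)
    print(solver, f1_score(y_test, clf.predict(X_test), average = 'macro'))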

3.1.2 TF-IDF + logistic regression

import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Load the saved TF-IDF features and labels
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a multinomial logistic regression classifier
clf_lr = LogisticRegression(C = 1.0, solver = 'newton-cg', \
                            multi_class = 'multinomial')
clf_lr.fit(X_train, y_train)
y_prediction = clf_lr.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.7938627628785371

3.2 Ridge classification

3.2.1 Bag-of-words + ridge classifier

import pickle
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the saved bag-of-words features and labels
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a ridge classifier with default parameters
clf_rc = RidgeClassifier()
clf_rc.fit(X_train, y_train)
y_prediction = clf_rc.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.5713044356078195

3.2.2 TF-IDF + ridge classifier

import pickle
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score

# Load the saved TF-IDF features and labels
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a ridge classifier with default parameters
clf_rc = RidgeClassifier()
clf_rc.fit(X_train, y_train)
y_prediction = clf_rc.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.8612591130592262

3.3 Naive Bayes classification

3.3.1 Bag-of-words + naive Bayes

import pickle
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Load the saved bag-of-words features and labels
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a multinomial naive Bayes classifier
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)
y_prediction = clf_nb.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.8049698286580436

3.3.2 TF-IDF + naive Bayes

import pickle
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Load the saved TF-IDF features and labels
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a multinomial naive Bayes classifier
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)
y_prediction = clf_nb.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.42510191518185725

3.4 SVM

The code here differs slightly from the previous parts: two extra data-processing steps are added, namely dimensionality reduction with SVD and standardization. In fact, skipping these two steps does not change the results of this experiment very much, but including them noticeably shortens the training time.
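
The two preprocessing steps and the classifier can also be bundled into a single sklearn Pipeline, which keeps the fitting and transforming of the SVD and the scaler in one place; a minimal sketch, equivalent in spirit to the step-by-step code below (120 components is the same arbitrary choice used there):

from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

svm_pipeline = Pipeline([
    ('svd', TruncatedSVD(n_components = 120)),   # dimensionality reduction
    ('scale', StandardScaler()),                 # zero mean, unit variance
    ('svc', SVC(C = 1.0))
])
# svm_pipeline.fit(X_train, y_train) and svm_pipeline.predict(X_test) then apply
# the SVD and scaling fitted on the training data to the test data automatically.

After fitting, the SVD step's explained_variance_ratio_.sum() shows how much of the variance the 120 components actually retain.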

3.4.1 Bag-of-words + SVM

import pickle
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Load the saved bag-of-words features and labels
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Reduce the sparse feature matrix to 120 dimensions with truncated SVD
svd = TruncatedSVD(n_components = 120)
svd.fit(X_train)
X_train = svd.transform(X_train)
X_test = svd.transform(X_test)

# Standardize the reduced features (zero mean, unit variance)
scl = StandardScaler()
scl.fit(X_train)
X_train = scl.transform(X_train)
X_test = scl.transform(X_test)

# Train an SVM classifier
clf_svc = SVC(C = 1.0, probability = True)
clf_svc.fit(X_train, y_train)
y_prediction = clf_svc.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.6239219719523244

3.4.2 TF-IDF + SVM

import pickle
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Load the saved TF-IDF features and labels
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Reduce the sparse feature matrix to 120 dimensions with truncated SVD
svd = TruncatedSVD(n_components = 120)
svd.fit(X_train)
X_train = svd.transform(X_train)
X_test = svd.transform(X_test)

# Standardize the reduced features (zero mean, unit variance)
scl = StandardScaler()
scl.fit(X_train)
X_train = scl.transform(X_train)
X_test = scl.transform(X_test)

# Train an SVM classifier
clf_svc = SVC(C = 1.0, probability = True)
clf_svc.fit(X_train, y_train)
y_prediction = clf_svc.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.8399423730362795

3.5 XGBoost

3.5.1 Bag-of-words + XGBoost

import pickle
import xgboost as xgb
from sklearn.metrics import f1_score

# Load the saved bag-of-words features and labels
with open(r'./data/count_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a gradient-boosted tree classifier
clf_xgb = xgb.XGBClassifier(max_depth = 7, n_estimators = 200, \
                            colsample_bytree = 0.8, subsample = 0.8, \
                            nthread = 10, learning_rate = 0.1)
clf_xgb.fit(X_train, y_train)
y_prediction = clf_xgb.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.849811679301213

3.5.2 TF-IDF + XGBoost

import pickle
import xgboost as xgb
from sklearn.metrics import f1_score

# Load the saved TF-IDF features and labels
with open(r'./data/tfidf_sample.pkl', 'rb') as f:
    X_train, X_test, y_train, y_test = pickle.load(f)

# Train a gradient-boosted tree classifier
clf_xgb = xgb.XGBClassifier(max_depth = 7, n_estimators = 200, \
                            colsample_bytree = 0.8, subsample = 0.8, \
                            nthread = 10, learning_rate = 0.1)
clf_xgb.fit(X_train, y_train)
y_prediction = clf_xgb.predict(X_test)

# Evaluate with the macro-averaged F1 score
f1_score(y_test, y_prediction, average = 'macro')
0.8442801532269851

For other approaches, see [2].

4. References

[1] https://blog.csdn.net/CherDW/article/details/54891073

[2] https://zhuanlan.zhihu.com/p/50657430
