Datasets: one Chinese and one English
Chinese dataset: THUCNews
THUCNews subset: https://pan.baidu.com/s/1hugrfRu  password: qfud
English dataset: IMDB Sentiment Analysis
IMDB dataset: download and exploration
Reference links:
Text classification with movie reviews | TensorFlow
Kesci.com
THUCNews dataset: download and exploration
See the dataset and preprocessing sections of the reference blog:
Reference link
Reference code
Learn the basic concepts of recall, precision, accuracy, ROC curves, AUC, and PR curves
Reference link
import jieba
import pandas as pd
import tensorflow as tf
from collections import Counter
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
path = 'E:/机器学习/Tensorflow学习/cnews/'
# Each line of the cnews files is "label \t content"; the first column is the category label
train_data = pd.read_csv(path + 'cnews.train.txt', names=['label', 'content'], sep='\t', engine='python', encoding='UTF-8')  # (50000, 2)
test_data = pd.read_csv(path + 'cnews.test.txt', names=['label', 'content'], sep='\t', engine='python', encoding='UTF-8')  # (10000, 2)
val_data = pd.read_csv(path + 'cnews.val.txt', names=['label', 'content'], sep='\t', engine='python', encoding='UTF-8')  # (5000, 2)
# Keep only the first 50 rows of each split for a quick run
train_data = train_data.head(50)
test_data = test_data.head(50)
val_data = val_data.head(50)
# Read the stopword list
def read_stopword(filename):
    stopword = []
    with open(filename, 'r', encoding='UTF-8') as fp:
        for line in fp:
            stopword.append(line.replace('\n', ''))
    return stopword
stopword = read_stopword(path + 'stopword.txt')
# Tokenize each document with jieba and drop the stopwords
def cut_data(data, stopword):
    stopword = set(stopword)  # set lookup is O(1)
    words = []
    for content in data['content']:
        # keep tokens that are not stopwords, preserving order
        word = [w for w in jieba.cut(content) if w not in stopword]
        words.append(word)
    data['content'] = words
    return data
train_data = cut_data(train_data, stopword)
test_data = cut_data(test_data, stopword)
val_data = cut_data(val_data, stopword)
train_data.shape  # (50, 2)
# Flatten the tokenized documents into one word list
def word_list(data):
    all_word = []
    for word in data['content']:
        all_word.extend(word)
    return all_word
word_list(train_data)
# Build word vectors for each split with Word2Vec
def feature(train_data, test_data, val_data):
    content = pd.concat([train_data['content'], test_data['content'], val_data['content']], ignore_index=True)
    # Alternative: bag-of-words counts (note CountVectorizer expects raw strings,
    # so the token lists would need to be joined with spaces first)
    # count_vec = CountVectorizer(max_features=300, min_df=2)
    # count_vec.fit_transform(content)
    # train_fea = count_vec.transform(train_data['content']).toarray()
    # test_fea = count_vec.transform(test_data['content']).toarray()
    # val_fea = count_vec.transform(val_data['content']).toarray()
    # gensim < 4.0 API; gensim 4.x renamed size= to vector_size= and iter= to epochs=
    model = Word2Vec(content, size=100, min_count=1, window=10, iter=10)
    # model[words] returns one 100-dim vector per word in the document
    train_fea = train_data['content'].apply(lambda x: model[x])
    test_fea = test_data['content'].apply(lambda x: model[x])
    val_fea = val_data['content'].apply(lambda x: model[x])
    return train_fea, test_fea, val_fea
train_fea, test_fea, val_fea = feature(train_data, test_data, val_data)
all_word = []
all_word.extend(word_list(train_data))
all_word.extend(word_list(test_data))
all_word.extend(word_list(val_data))
all_word = list(set(all_word))
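The merged `all_word` list is a natural input for a frequency-ranked vocabulary, which `Counter` (imported earlier but otherwise unused) can build. A minimal sketch; `build_vocab` and the `<PAD>` slot at index 0 are illustrative conventions, not part of the original code:

```python
from collections import Counter

def build_vocab(all_word, vocab_size=5000):
    # Count word frequencies and keep the most common ones
    counts = Counter(all_word)
    words = [w for w, _ in counts.most_common(vocab_size - 1)]
    # Reserve index 0 for padding
    word_to_id = {w: i + 1 for i, w in enumerate(words)}
    word_to_id['<PAD>'] = 0
    return word_to_id

word_to_id = build_vocab(['a', 'b', 'a', 'c', 'a', 'b'])
# 'a' occurs most often, so it gets the smallest non-pad id
```

A mapping like this is what turns the token lists into the integer sequences a CNN/RNN classifier consumes.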
Before introducing the concepts above, let's first understand TP, FP, TN, and FN (true positives, false positives, true negatives, false negatives).
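These four counts can be tallied directly from paired labels; a small worked example on made-up binary labels (1 = positive):

```python
# Toy binary labels: 1 = positive, 0 = negative (illustrative data)
y_true    = [1, 1, 1, 0, 0, 0]
y_predict = [1, 1, 0, 1, 0, 0]

TP = sum(1 for t, p in zip(y_true, y_predict) if t == 1 and p == 1)  # hits
FP = sum(1 for t, p in zip(y_true, y_predict) if t == 0 and p == 1)  # false alarms
TN = sum(1 for t, p in zip(y_true, y_predict) if t == 0 and p == 0)  # correct rejections
FN = sum(1 for t, p in zip(y_true, y_predict) if t == 1 and p == 0)  # misses
print(TP, FP, TN, FN)  # 2 1 2 1
```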
Recall measures coverage: of all the actual positive examples, how many are classified as positive.
R = \frac{TP}{TP+FN}
Recall API:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_predict)
# For binary labels this is a single float; pass average=None to get each class's recall as an array
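A quick sketch on toy labels (made-up data) showing both the binary scalar and the per-class array:

```python
from sklearn.metrics import recall_score

y_test    = [0, 1, 1, 1]
y_predict = [0, 1, 0, 1]

# Binary case: a single float, TP / (TP + FN) = 2 / 3
recall = recall_score(y_test, y_predict)
# average=None: one recall value per class, here [1.0, 2/3]
per_class = recall_score(y_test, y_predict, average=None)
```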
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Accuracy API:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_predict)
Precision is the proportion of examples classified as positive that are actually positive.
P = \frac{TP}{TP+FP}
Precision API:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_predict)
F1 is the harmonic mean of precision and recall; it sits closer to the smaller of the two, so F1 is largest when P and R are close. The F1-score is often used for class-imbalanced problems, recommender systems, etc.
\frac{2}{F1} = \frac{1}{P} + \frac{1}{R}
F1 = \frac{2TP}{2TP+FP+FN}
F1 API:
from sklearn.metrics import f1_score
f1_score(y_test, y_predict)
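On the same kind of toy labels (made-up data), the harmonic-mean identity can be checked numerically:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_test    = [0, 1, 1, 1]
y_predict = [0, 1, 0, 1]

p = precision_score(y_test, y_predict)  # 2 / 2 = 1.0
r = recall_score(y_test, y_predict)     # 2 / 3
f1 = f1_score(y_test, y_predict)        # 2PR / (P + R) = 0.8
```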
Confusion matrix API:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)  # don't reuse the name confusion_matrix, which would shadow the function
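For binary labels, rows are true classes and columns are predicted classes, so the four TP/FP/TN/FN counts fall out of `ravel()`; a sketch on made-up data:

```python
from sklearn.metrics import confusion_matrix

y_test    = [0, 1, 1, 1]
y_predict = [0, 1, 0, 1]

cm = confusion_matrix(y_test, y_predict)
# Layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 1 0 1 2
```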
AUC measures the performance (generalization ability) of a machine learning algorithm on binary classification problems.
from sklearn.metrics import roc_auc_score
auc_value = roc_auc_score(y_test, y_score)  # y_score are predicted scores/probabilities, not hard labels; auc() itself takes (fpr, tpr)
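A sketch on made-up scores: with these four samples, three of the four positive/negative score pairs are ordered correctly, giving AUC = 0.75:

```python
from sklearn.metrics import roc_auc_score

y_test  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

auc_value = roc_auc_score(y_test, y_score)
print(auc_value)  # 0.75
```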
ROC stands for Receiver Operating Characteristic. The area under the ROC curve is the AUC (Area Under the Curve).
The closer the ROC curve is to the top-left corner, the better the model, i.e. the AUC approaches 1.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_predict = model.predict(x_test)
y_probs = model.predict_proba(x_test)  # the model's predicted scores
fpr, tpr, thresholds = roc_curve(y_test, y_probs[:, 1])  # use the positive-class column
roc_auc = auc(fpr, tpr)  # AUC is the area under the ROC curve
# Plot the ROC curve
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([-0.1, 1.1])
plt.ylim([-0.1, 1.1])
plt.xlabel('False Positive Rate')  # x-axis is fpr
plt.ylabel('True Positive Rate')   # y-axis is tpr
plt.title('Receiver operating characteristic example')
plt.show()
There is also the notion of a "cutoff point" (classification threshold). After a model predicts on the test samples, it can output each sample's probability of belonging to a class. For example, if t1's probability of being class P is 0.3 and we treat probabilities below 0.5 as class N, then t1 belongs to class N; that 0.5 is the cutoff point. For computing ROC, the three most important concepts are TPR, FPR, and the cutoff point.
y-axis → TPR, i.e. sensitivity (true positive rate): the proportion of all actual positive instances that are predicted positive. TPR = \frac{TP}{TP+FN}
x-axis → FPR (false positive rate, equal to 1 - specificity): the proportion of all actual negative instances that are predicted positive. FPR = \frac{FP}{TN+FP}
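Both axes fall out of the confusion matrix directly; a sketch on made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_test    = [0, 1, 1, 1]
y_predict = [0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_test, y_predict).ravel()
tpr = tp / (tp + fn)  # sensitivity, the y-axis of the ROC curve; here 2/3
fpr = fp / (tn + fp)  # 1 - specificity, the x-axis; here 0.0
```

Sweeping the cutoff point over the predicted scores produces one (fpr, tpr) pair per threshold, which is exactly what `roc_curve` traces out.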
Detailed explanation of the definition and use of ROC and AUC