If one learner's P-R curve (or ROC curve) is completely "enclosed" by another learner's curve, the latter performs better than the former.
First, recall the confusion matrix used to measure classifier performance:
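                    Predicted positive     Predicted negative
Actual positive     TP (true positive)     FN (false negative)
Actual negative     FP (false positive)    TN (true negative)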
The four entries TP, FN, FP, TN can be remembered as follows: the first letter records whether the prediction is correct, True (T) or False (F); the second letter records the predicted class, Positive (P) or Negative (N).
· Precision (P) is defined as P = TP / (TP + FP), i.e., the proportion of samples predicted positive that are truly positive;
· Recall (R) is defined as R = TP / (TP + FN), i.e., the proportion of actual positives that are correctly predicted;
In many situations we can rank the samples by the learner's prediction scores, with the samples the learner considers "most likely" to be positive at the front and those it considers "least likely" at the back. Going down this ranking and predicting each sample in turn as positive, we can compute the current recall and precision at every step. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, "P-R curve" for short.
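As a minimal sketch of this procedure (the labels and scores below are hypothetical toy values), one can walk down the ranking and treat the top-k samples as predicted positive at each step:

import numpy as np
# Hypothetical toy data: true 0/1 labels and predicted scores for 8 samples.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1])
order = np.argsort(-scores)          # rank samples from most to least likely positive
y_sorted = y_true[order]
n_pos = y_sorted.sum()
precisions, recalls = [], []
for k in range(1, len(y_sorted) + 1):
    tp = y_sorted[:k].sum()          # top-k samples are predicted positive
    precisions.append(tp / k)        # P = TP / (TP + FP)
    recalls.append(tp / n_pos)       # R = TP / (TP + FN)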
When two learners' P-R curves cross, it is generally hard to say which learner is better, so performance measures that combine precision and recall have been designed, such as BEP and the F1-score.
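For reference, the BEP (Break-Even Point) is the value at which precision equals recall on the P-R curve, and F1 is the harmonic mean of the two: F1 = 2 * P * R / (P + R).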
As with the P-R curve, we rank the samples by the learner's predictions and predict each in turn as positive, but now plot the true positive rate (TPR) on the vertical axis against the false positive rate (FPR) on the horizontal axis, giving the ROC curve. The two rates are defined as TPR = TP / (TP + FN) and FPR = FP / (FP + TN).
Note: ROC and AUC are not affected by class imbalance, because TPR is computed only from the actually-positive samples (it does not involve the true negatives), while FPR is computed only from the actually-negative samples (it does not involve the true positives).
For ROC curves, when two curves cross it is likewise hard to say which learner is better; in that case we can judge by the value of AUC (Area Under ROC Curve).
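AUC also has a ranking interpretation: it equals the probability that a randomly drawn positive sample is scored higher than a randomly drawn negative one (ties counted as 1/2). A quick check, reusing the toy y_true and scores from the sketch above:

from sklearn.metrics import roc_auc_score
pos, neg = scores[y_true == 1], scores[y_true == 0]
wins = (pos[:, None] > neg[None, :]).mean()    # correctly ranked (positive, negative) pairs
ties = (pos[:, None] == neg[None, :]).mean()
print(wins + 0.5 * ties, roc_auc_score(y_true, scores))  # the two values agree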
Below is Python code for a simple binary classification task, comparing the performance of RF, LR, GaussianNB, SVC, and KNN:
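The snippets below assume that x and y already hold the feature matrix and the 0/1 labels. As a hypothetical stand-in, any built-in binary dataset works, e.g. scikit-learn's breast-cancer data:

# Hypothetical stand-in data; any binary classification dataset will do.
from sklearn.datasets import load_breast_cancer
x, y = load_breast_cancer(return_X_y=True)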
# Split the dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# RF
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(max_depth=5, n_estimators=5)
estimator.fit(x_train, y_train)
# One row per learner; each cell stores a whole curve (array) or a scalar AUC.
import pandas as pd
algorithm = ['RF', 'LR', 'GaussianNB', 'SVC', 'KNN']
evaluation = pd.DataFrame(index=algorithm, columns=['fpr', 'tpr', 'pre', 'rec', 'auc'])
from sklearn import metrics
# Use .at (not .loc) so that a single object-dtype cell can hold an array.
fpr, tpr, _ = metrics.roc_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['RF', 'fpr'], evaluation.at['RF', 'tpr'] = fpr, tpr
evaluation.at['RF', 'auc'] = metrics.auc(fpr, tpr)
pre, rec, _ = metrics.precision_recall_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['RF', 'pre'], evaluation.at['RF', 'rec'] = pre, rec
# LR
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
estimator.fit(x_train, y_train)
fpr, tpr, _ = metrics.roc_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['LR', 'fpr'], evaluation.at['LR', 'tpr'] = fpr, tpr
evaluation.at['LR', 'auc'] = metrics.auc(fpr, tpr)
pre, rec, _ = metrics.precision_recall_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['LR', 'pre'], evaluation.at['LR', 'rec'] = pre, rec
# GaussianNB
from sklearn.naive_bayes import GaussianNB
estimator = GaussianNB()  # its only parameter is the class prior, P(Y=Ck) = mk/m
estimator.fit(x_train, y_train)
fpr, tpr, _ = metrics.roc_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['GaussianNB', 'fpr'], evaluation.at['GaussianNB', 'tpr'] = fpr, tpr
evaluation.at['GaussianNB', 'auc'] = metrics.auc(fpr, tpr)
pre, rec, _ = metrics.precision_recall_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['GaussianNB', 'pre'], evaluation.at['GaussianNB', 'rec'] = pre, rec
# SVC
from sklearn.svm import SVC
# probability=True is required so that predict_proba is available.
estimator = SVC(kernel='rbf', random_state=0, probability=True)
estimator.fit(x_train, y_train)
fpr, tpr, _ = metrics.roc_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['SVC', 'fpr'], evaluation.at['SVC', 'tpr'] = fpr, tpr
evaluation.at['SVC', 'auc'] = metrics.auc(fpr, tpr)
pre, rec, _ = metrics.precision_recall_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['SVC', 'pre'], evaluation.at['SVC', 'rec'] = pre, rec
# KNN (with a small grid search over n_neighbors)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
estimator = KNeighborsClassifier()
estimator = GridSearchCV(estimator, param_grid={"n_neighbors": [3, 4, 5, 6]}, cv=5)
estimator.fit(x_train, y_train)
fpr, tpr, _ = metrics.roc_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['KNN', 'fpr'], evaluation.at['KNN', 'tpr'] = fpr, tpr
evaluation.at['KNN', 'auc'] = metrics.auc(fpr, tpr)
pre, rec, _ = metrics.precision_recall_curve(y_test, estimator.predict_proba(x_test)[:, 1])
evaluation.at['KNN', 'pre'], evaluation.at['KNN', 'rec'] = pre, rec
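At this point evaluation holds one row per learner, so the AUC column already gives a quick numerical comparison before any plotting:

print(evaluation['auc'])  # one AUC value per learner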
# Visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.plot([0, 1], [0, 1], color='navy', lw=1, linestyle='--')  # chance diagonal
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
for name in evaluation.index:
    plt.plot(evaluation.at[name, 'fpr'], evaluation.at[name, 'tpr'], lw=2,
             label='%s (AUC = %.4f)' % (name, evaluation.at[name, 'auc']))
plt.legend()
plt.figure(figsize=(6, 4))
plt.xlim([0.0, 1.05])
plt.ylim([0, 1.05])
plt.title('Precision/Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
for name in evaluation.index:
    # Recall on the x-axis, precision on the y-axis, matching the labels above.
    plt.plot(evaluation.at[name, 'rec'], evaluation.at[name, 'pre'], lw=2, label=name)
plt.legend()
plt.show()