sklearn's built-in classification_report computes a classifier's precision, recall, and f1-score for binary and multiclass problems.
Example:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [1, 0, 2, 1, 1]
print(classification_report(y_true, y_pred))
The output is:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         1
           2       1.00      0.50      0.67         2

    accuracy                           0.20         5
   macro avg       0.33      0.17      0.22         5
weighted avg       0.40      0.20      0.27         5
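To see where these numbers come from, the class-2 row above can be checked by hand; this is just a sketch recomputing it from the definitions, cross-checked against sklearn's standard precision_score and recall_score:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 2, 2, 0]
y_pred = [1, 0, 2, 1, 1]

# Class 2: one prediction of "2", and it is correct -> precision 1.00;
# two true "2"s, only one of them found -> recall 0.50.
tp = sum(t == p == 2 for t, p in zip(y_true, y_pred))
precision_2 = tp / y_pred.count(2)
recall_2 = tp / y_true.count(2)
print(precision_2, recall_2)  # 1.0 0.5

# restricting labels=[2] makes the macro average collapse to the single class
assert precision_2 == precision_score(y_true, y_pred, labels=[2], average="macro")
assert recall_2 == recall_score(y_true, y_pred, labels=[2], average="macro")
```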
As the output shows, for this three-class problem (labels 0, 1, 2), classification_report reports precision, recall, and f1-score for each class, plus a weighted avg row in which each class is weighted by its support. Note that the function expects y_pred to carry exactly one predicted label per sample. In practice, however, we often need to evaluate the top-k results a model returns (k is typically 3 or 5). Does classification_report support that? Let's try:
y_true = [0, 5, 0, 3, 4, 2, 1, 1, 5, 4]
y_pred = [[0, 0, 2, 1, 5],
          [2, 2, 4, 1, 4],
          [4, 5, 1, 3, 5],
          [5, 4, 2, 4, 3],
          [2, 0, 0, 2, 3],
          [3, 3, 4, 1, 4],
          [1, 1, 0, 1, 2],
          [1, 4, 4, 2, 4],
          [4, 1, 3, 3, 5],
          [2, 4, 2, 2, 3]]
For this six-class scenario the model returns the top-5 labels per sample (in descending order of probability); let's ask for per-class precision:
print(classification_report(y_true, y_pred))
It raises an error:
ValueError Traceback (most recent call last)
in ()
----> 1 print(classification_report(y_true,y_pred))
E:\python35\lib\site-packages\sklearn\metrics\_classification.py in classification_report(y_true, y_pred, labels, target_names, sample_weight, digits, output_dict, zero_division)
1965 """
1966
-> 1967 y_type, y_true, y_pred = _check_targets(y_true, y_pred)
1968
1969 labels_given = True
E:\python35\lib\site-packages\sklearn\metrics\_classification.py in _check_targets(y_true, y_pred)
88 if len(y_type) > 1:
89 raise ValueError("Classification metrics can't handle a mix of {0} "
---> 90 "and {1} targets".format(type_true, type_pred))
91
92 # We can't have more than one value on y_type => The set is no more needed
ValueError: Classification metrics can't handle a mix of multiclass and multiclass-multioutput targets
Clearly, the stock classification_report cannot compute precision and related metrics when each sample carries multiple predicted labels.
To address this, we implement a method that, given k (k > 1) predictions per sample, computes precision@k, recall@k, and f1_score@k.
The implementation:
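The error originates in sklearn's target-type check: _check_targets (seen in the traceback) determines a type for each input and refuses to mix them. The public helper type_of_target in sklearn.utils.multiclass shows how each input is classified:

```python
from sklearn.utils.multiclass import type_of_target

# a flat list of labels is "multiclass";
# a 2d list with more than two distinct values is "multiclass-multioutput"
print(type_of_target([0, 5, 0, 3]))            # multiclass
print(type_of_target([[0, 0, 2], [2, 2, 4]]))  # multiclass-multioutput
```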
# y_true: 1d list of true labels
# y_pred: 2d list of predicted labels, one row per sample
# num: evaluate the top `num` predictions (num <= len(y_pred[0]))
def precision_recall_fscore_k(y_true, y_pred, num=3):
    if not isinstance(y_pred[0], list):
        y_pred = [[each] for each in y_pred]
    y_pred = [each[0:num] for each in y_pred]
    unique_label = count_unique_label(y_true, y_pred)
    # compute precision, recall, f1-score and support for each class
    res = {}
    result = ''
    for each in unique_label:
        cur_res = []
        tp_fn = y_true.count(each)  # TP + FN
        # TP + FP: samples whose top-num predictions contain this label
        tp_fp = 0
        for i in y_pred:
            if each in i:
                tp_fp += 1
        # TP: samples with this true label whose top-num predictions contain it
        tp = 0
        for i in range(len(y_true)):
            if y_true[i] == each and each in y_pred[i]:
                tp += 1
        support = tp_fn
        try:
            precision = round(tp / tp_fp, 2)
            recall = round(tp / tp_fn, 2)
            f1_score = round(2 / ((1 / precision) + (1 / recall)), 2)
        except ZeroDivisionError:
            precision = 0
            recall = 0
            f1_score = 0
        cur_res.append(precision)
        cur_res.append(recall)
        cur_res.append(f1_score)
        cur_res.append(support)
        res[str(each)] = cur_res
    title = '\t' + 'precision@' + str(num) + '\t' + 'recall@' + str(num) + '\t' + 'f1_score@' + str(num) + '\t' + 'support' + '\n'
    result += title
    for k, v in sorted(res.items()):
        cur = str(k) + '\t' + str(v[0]) + '\t' + str(v[1]) + '\t' + str(v[2]) + '\t' + str(v[3]) + '\n'
        result += cur
    # weighted averages: each class weighted by its support
    sums = len(y_true)
    weight_info = [(v[0] * v[3], v[1] * v[3], v[2] * v[3]) for k, v in sorted(res.items())]
    weight_precision = 0
    weight_recall = 0
    weight_f1_score = 0
    for each in weight_info:
        weight_precision += each[0]
        weight_recall += each[1]
        weight_f1_score += each[2]
    weight_precision /= sums
    weight_recall /= sums
    weight_f1_score /= sums
    last_line = 'avg_total' + '\t' + str(round(weight_precision, 2)) + '\t' + str(round(weight_recall, 2)) + '\t' + str(round(weight_f1_score, 2)) + '\t' + str(sums)
    result += last_line
    return result


# collect every label that appears in y_true or y_pred
def count_unique_label(y_true, y_pred):
    unique_label = []
    for each in y_true:
        if each not in unique_label:
            unique_label.append(each)
    for i in y_pred:
        for j in i:
            if j not in unique_label:
                unique_label.append(j)
    unique_label = list(set(unique_label))
    return unique_label
Running precision_recall_fscore_k:
res = precision_recall_fscore_k(y_true, y_pred, num=3)
print(res)
gives:
precision@3 recall@3 f1_score@3 support
0 0.33 0.5 0.4 2
1 0.5 1.0 0.67 2
2 0 0 0 1
3 0 0 0 1
4 0.14 0.5 0.22 2
5 0 0 0 2
avg_total 0.19 0.4 0.26 10
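As a sanity check, the class-0 row (precision@3 = 0.33, recall@3 = 0.5) can be reproduced directly from the top-k definitions used above:

```python
y_true = [0, 5, 0, 3, 4, 2, 1, 1, 5, 4]
y_pred = [[0, 0, 2, 1, 5], [2, 2, 4, 1, 4], [4, 5, 1, 3, 5],
          [5, 4, 2, 4, 3], [2, 0, 0, 2, 3], [3, 3, 4, 1, 4],
          [1, 1, 0, 1, 2], [1, 4, 4, 2, 4], [4, 1, 3, 3, 5],
          [2, 4, 2, 2, 3]]

top3 = [row[:3] for row in y_pred]
tp = sum(1 for t, preds in zip(y_true, top3) if t == 0 and 0 in preds)  # 1 hit
tp_fp = sum(1 for preds in top3 if 0 in preds)                          # 3 rows contain 0
tp_fn = y_true.count(0)                                                 # 2 true 0s
print(round(tp / tp_fp, 2), tp / tp_fn)  # 0.33 0.5
```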
OK, the function works as intended.
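Worth noting: if only recall@k is needed, it can also be recovered from the stock classification_report with a common trick: collapse each top-k list to a single label, namely the true label when it appears in the top k, otherwise the first prediction. Per-class recall on the collapsed labels equals the recall@k column above; precision does not carry over, since repeated labels within a row are counted differently. A sketch:

```python
from sklearn.metrics import recall_score

y_true = [0, 5, 0, 3, 4, 2, 1, 1, 5, 4]
y_pred = [[0, 0, 2, 1, 5], [2, 2, 4, 1, 4], [4, 5, 1, 3, 5],
          [5, 4, 2, 4, 3], [2, 0, 0, 2, 3], [3, 3, 4, 1, 4],
          [1, 1, 0, 1, 2], [1, 4, 4, 2, 4], [4, 1, 3, 3, 5],
          [2, 4, 2, 2, 3]]

k = 3
# keep the true label if it is in the top k, otherwise fall back to the top-1
collapsed = [t if t in row[:k] else row[0] for t, row in zip(y_true, y_pred)]
# per-class recall on the collapsed labels reproduces the recall@3 column
print(recall_score(y_true, collapsed, average=None, labels=[0, 1, 2, 3, 4, 5]))
```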
Discussion:
When evaluating a model's top-k output with precision, recall, and f1-score: compared with k = 1, precision at k > 1 generally falls, recall generally rises, and f1-score can move either way, depending on how precision and recall change together.
Why: (1) precision = TP / (TP + FP). With k > 1, TP can grow only a little (it is bounded by each class's support), while TP + FP typically grows much faster, because every sample now contributes up to k candidate labels; so precision usually decreases. (2) recall = TP / (TP + FN). TP + FN is fixed (it is the number of samples that truly belong to the class), while TP can only grow as k increases; so recall rises. (3) f1-score = 2 / ((1/precision) + (1/recall)), so its direction is indeterminate and depends on the joint movement of precision and recall.
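The recall trend is easy to see on the sample data above: the per-sample hit rate (the fraction of samples whose true label appears in the top k, i.e. micro-averaged recall@k) can only grow with k. A small sketch:

```python
y_true = [0, 5, 0, 3, 4, 2, 1, 1, 5, 4]
y_pred = [[0, 0, 2, 1, 5], [2, 2, 4, 1, 4], [4, 5, 1, 3, 5],
          [5, 4, 2, 4, 3], [2, 0, 0, 2, 3], [3, 3, 4, 1, 4],
          [1, 1, 0, 1, 2], [1, 4, 4, 2, 4], [4, 1, 3, 3, 5],
          [2, 4, 2, 2, 3]]

def hit_rate(k):
    # fraction of samples whose true label is among the top-k predictions
    return sum(t in row[:k] for t, row in zip(y_true, y_pred)) / len(y_true)

print([hit_rate(k) for k in (1, 3, 5)])  # [0.3, 0.4, 0.6], non-decreasing in k
```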