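Precision (P), the fraction of predicted positives that are actually positive: $P = \frac{TP}{TP + FP}$
Recall (R), the fraction of actual positives that the classifier finds: $R = \frac{TP}{TP + FN}$
F1 score, the harmonic mean of precision and recall: $F_1 = \frac{2PR}{P + R}$, where TP, FP and FN are the numbers of true positives, false positives and false negatives.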
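As a quick sanity check, the three metrics can be computed from raw counts. This is only a minimal sketch that assumes the standard definitions P = TP/(TP+FP), R = TP/(TP+FN) and F1 = 2PR/(P+R); the helper name `precision_recall_f1` and the counts are made up for illustration, and returning 0.0 for an undefined ratio is one convention among several.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts: 4 true positives, 1 false positive, 2 false negatives
p, r, f1 = precision_recall_f1(tp=4, fp=1, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.8 0.67 0.73
```

True negatives never appear in any of the three formulas, which is why only TP, FP and FN are needed.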
Binary classification
When there are only two labels:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
# Fixed example labels (one draw of np.random.randint(0, 2, size=10)), so the results below are reproducible
real = np.array([1, 0, 0, 1, 1, 0, 0, 1, 1, 1])
pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
# Direct computation
p = sum(pred*real)/sum(pred) # 0.8
r = sum(pred*real)/sum(real) # 0.67
# Using sklearn
p = precision_score(real, pred) # 0.8
r = recall_score(real, pred) # 0.67
f1 = f1_score(real, pred) # 0.73
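Multi-class classification
In a multi-class problem the classifier has to be judged across all classes together, which is where macro-averaging and micro-averaging come in. The following uses a 3-class problem as an example.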
Macro-averaging
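Macro-averaging takes the arithmetic mean of each per-class metric over all $n$ classes, giving the Macro-Precision and Macro-Recall: $P_{macro} = \frac{1}{n}\sum_{i=1}^{n} P_i$, $R_{macro} = \frac{1}{n}\sum_{i=1}^{n} R_i$. The Macro-F score is defined either as the mean of the per-class F1 values (this is what sklearn's f1_score(average='macro') computes) or as the harmonic mean of $P_{macro}$ and $R_{macro}$; the two conventions generally give different numbers.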
Micro-averaging
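Micro-averaging pools every sample in the dataset into one global confusion matrix, regardless of class, and computes the metrics from the pooled counts: $P_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}$, $R_{micro} = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}$, $F_{micro} = \frac{2 P_{micro} R_{micro}}{P_{micro} + R_{micro}}$.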
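The difference between the two is that macro-averaging gives every class the same weight, whereas micro-averaging gives every per-sample decision the same weight. As the F1 formula shows, F1 ignores the samples that the classifier correctly labels as negative (the true negatives); its value is driven mainly by the true positives. As a result, in micro-averaged metrics the classes with many samples dominate the classes with few.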
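The dominance of large classes is easy to see on a deliberately imbalanced toy example. The sketch below uses made-up data and hypothetical helper names, and computes only precision; it is an illustration under those assumptions, not a replacement for a library implementation.

```python
from collections import Counter


def macro_micro_precision(y_true, y_pred):
    """Return (macro precision, micro precision) for single-label predictions."""
    tp, fp = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1  # class p was predicted, but the prediction was wrong
    classes = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class precisions, so every class counts once
    per_class = [tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes]
    macro = sum(per_class) / len(classes)
    # Micro: pool the counts over all classes first, so every sample counts once
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return macro, micro


# Made-up imbalanced data: 9 samples of class "big", 1 of class "small";
# the classifier predicts "big" for everything.
y_true = ["big"] * 9 + ["small"]
y_pred = ["big"] * 10
macro, micro = macro_micro_precision(y_true, y_pred)
print(macro, micro)  # macro = 0.45, micro = 0.9
```

The micro score looks healthy because the large class is handled well, while the macro score is halved by the small class that is never predicted.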
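The following works through a concrete three-class example in detail.
Suppose there are 10 samples belonging to three classes A, B and C, with the following true and predicted classes:
True: A A A C B C A B B C
Predicted: A A C B A C A C B C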
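For class A: TP = 3, FP = 1, FN = 1, so P = 3/4 = 0.75, R = 3/4 = 0.75, F1 = 0.75.
For class B: TP = 1, FP = 1, FN = 2, so P = 1/2 = 0.5, R = 1/3 ≈ 0.33, F1 = 0.4.
For class C: TP = 2, FP = 2, FN = 1, so P = 2/4 = 0.5, R = 2/3 ≈ 0.67, F1 ≈ 0.57.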
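The per-class counts for this example can be tallied with a few lines of plain Python. This is a minimal sketch: the dictionary name `counts` and the variable names are my own, and the data are the ten true/predicted labels listed above.

```python
y_true = list("AAACBCABBC")
y_pred = list("AACBACACBC")

counts = {}      # class -> (TP, FP, FN)
precisions = []  # per-class precision, in class order A, B, C
for c in sorted(set(y_true)):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == c and p == c for t, p in pairs)
    fp = sum(t != c and p == c for t, p in pairs)
    fn = sum(t == c and p != c for t, p in pairs)
    counts[c] = (tp, fp, fn)
    precisions.append(tp / (tp + fp))
    print(c, "TP =", tp, "FP =", fp, "FN =", fn)

# Macro precision: mean of the per-class precisions
macro_p = sum(precisions) / len(precisions)
# Micro precision: pooled TP over pooled TP + FP
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
micro_p = total_tp / (total_tp + total_fp)
print(round(macro_p, 4), micro_p)
```

Here the macro precision is (0.75 + 0.5 + 0.5) / 3 ≈ 0.5833 and the micro precision is 6 / 10 = 0.6.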
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score
y_true = [0, 0, 0, 2, 1, 2, 0, 1, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 0, 2, 1, 2]
accuracy_score(y_true, y_pred) # Return the fraction of correctly classified samples: 0.6
accuracy_score(y_true, y_pred, normalize=False) # Return the number of correctly classified samples: 6
# Calculate precision score
precision_score(y_true, y_pred, average='macro')
precision_score(y_true, y_pred, average='micro')
precision_score(y_true, y_pred, average=None)
# Calculate recall score
recall_score(y_true, y_pred, average='macro')
recall_score(y_true, y_pred, average='micro')
recall_score(y_true, y_pred, average=None)
# Calculate f1 score
f1_score(y_true, y_pred, average='macro')
f1_score(y_true, y_pred, average='micro')
f1_score(y_true, y_pred, average=None)
# Calculate f beta score
fbeta_score(y_true, y_pred, average='macro', beta=0.5)
fbeta_score(y_true, y_pred, average='micro', beta=0.5)
fbeta_score(y_true, y_pred, average=None, beta=0.5)
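One property worth checking on this data: in a single-label multi-class problem where all classes are included, every misclassified sample counts once as a false positive and once as a false negative, so micro-averaged precision, recall and F1 all equal the accuracy. The sketch below checks that with sklearn's `precision_recall_fscore_support`, reusing the same example labels; the variable names are mine.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 0, 2, 1, 2, 0, 1, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 0, 2, 1, 2]

acc = accuracy_score(y_true, y_pred)
# With an average set, the fourth return value (support) is None
micro_p, micro_r, micro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='micro')
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')

print(acc, micro_p, micro_r, micro_f1)  # all four agree
print(macro_p, macro_r, macro_f1)       # macro values differ from the micro ones
```

The macro F1 printed here is the mean of the per-class F1 values, which is how sklearn defines it.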
The metric helper I like to use in machine learning projects
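import logging

import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels

logger = logging.getLogger(__name__)


def compute_metrics(output_mode, y_true, y_pred, labels=None):
    # Accept lists as well as arrays; the comparisons below need NumPy arrays
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    logger.info('*' * 30)
    logger.info(y_true[:10])
    logger.info(y_pred[:10])
    logger.info('*' * 30)
    metric = dict()
    # Overall macro and micro metrics
    for metric_mode in ['macro', 'micro']:
        p, r, f1, _ = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, average=metric_mode)
        metric[metric_mode] = [str(p), str(r), str(f1)]
    # Per-class metrics: (precision, recall, f1, support), one array each
    metric_each_label = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred)
    metric['each_label'] = dict()
    # labels only names the entries of the report; default to sklearn's own label order
    if labels is None:
        labels = unique_labels(y_true, y_pred)
    for idx, label in enumerate(labels):
        metric['each_label'][f'label-{label}'] = [str(metric_each_label[j][idx]) for j in range(4)]
    # Accuracy is defined differently for multi-class and multi-label classification
    if output_mode == 'classification':
        accuracy = (y_true == y_pred).mean()
    elif output_mode == 'multi-label-classification':
        # Subset accuracy: a sample counts only if every one of its labels is correct
        accuracy = (y_true == y_pred).all(axis=1).mean()
    else:
        raise ValueError(f'unknown output_mode: {output_mode}')
    metric['accuracy'] = str(accuracy)
    return metric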
# Per-class metrics implemented from scratch
import numpy as np
import pandas as pd


def compute_metrics_each_label(label_list, y_true, y_pred):
    # y_true and y_pred hold integer class indices into label_list
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    all_predicate_metric_dict = dict()
    for idx, label in enumerate(label_list):
        true_num = (y_true == idx).sum()
        pred_num = (y_pred == idx).sum()
        correct_num = np.logical_and(y_true == y_pred, y_true == idx).sum()
        # The 1e-5 terms avoid division by zero for classes that never occur
        p = correct_num / (pred_num + 1e-5)
        r = correct_num / (true_num + 1e-5)
        f1 = (2 * p * r) / (p + r + 1e-5)
        all_predicate_metric_dict[label] = [f1, p, r]
    metric_each_label_df = pd.DataFrame(
        {
            'label': list(all_predicate_metric_dict.keys()),
            'f1': [round(i[0], 3) for i in all_predicate_metric_dict.values()],
            'p': [round(i[1], 3) for i in all_predicate_metric_dict.values()],
            'r': [round(i[2], 3) for i in all_predicate_metric_dict.values()]
        }
    )
    return metric_each_label_df
# Per-class metrics with sklearn
In [140]: precision_recall_fscore_support(y_true=['1','1','2','1', 'a'], y_pred=['a','1','2','2','1'], labels=['a', '1','2', '3'])
Out[140]:
(array([0. , 0.5, 0.5, 0. ]),
array([0. , 0.33333333, 1. , 0. ]),
array([0. , 0.4 , 0.66666667, 0. ]),
array([1, 3, 1, 0]))
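Each column of the output corresponds to one entry of labels, in the same order. The four arrays are precision, recall, F1 and support.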
In [141]: precision_recall_fscore_support(y_true=['1','1','2','1', 'a'], y_pred=['a','1','2','2','1'])
Out[141]:
(array([0.5, 0.5, 0. ]),
array([0.33333333, 1. , 0. ]),
array([0.4 , 0.66666667, 0. ]),
array([3, 1, 1]))
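If labels is not provided, sklearn uses the sorted set of labels that appear in y_true or y_pred. In this example both contain the same three labels, so the result is the same as sorted(set(y_true)).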
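A small check of which labels sklearn reports when labels is omitted: as far as I can tell from sklearn's behaviour, it takes the sorted union of the labels in y_true and y_pred, so a label that occurs only in the predictions still gets its own column. The data below are made up; `zero_division=0` only silences the warning for the undefined recall of that label.

```python
from sklearn.metrics import precision_recall_fscore_support

# '2' appears only in y_pred, never in y_true
y_true = ['1', '1', '1']
y_pred = ['1', '2', '1']

p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, zero_division=0)
print(len(support), support)  # two columns: one for '1', one for '2'
print(p, r)
```

Because of this, any code that names the columns itself should build its label list from both y_true and y_pred, or pass labels explicitly.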
References
http://www.cnblogs.com/robert-dlut/p/5276927.html
https://zhuanlan.zhihu.com/p/30953081