Precision, Recall, and F1

Precision (P), the fraction of predicted positives that are actually positive:

$$P = \frac{TP}{TP + FP}$$

Recall (R), the fraction of actual positives that are predicted positive:

$$R = \frac{TP}{TP + FN}$$

F1 score, the harmonic mean of precision and recall:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$

Binary classification

When there are only two label classes:

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

real = np.random.randint(0,2, size=10)  # array([1, 0, 0, 1, 1, 0, 0, 1, 1, 1])
pred = np.random.randint(0,2, size=10)  # array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

# Compute directly from the definitions
p = sum(pred*real)/sum(pred)  # 0.8
r = sum(pred*real)/sum(real)  # 0.67

# Compute with sklearn
p = precision_score(real, pred)  # 0.8
r = recall_score(real, pred)     # 0.67
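The f1_score imported above can be checked against the harmonic-mean formula; a minimal sketch, using the example values shown in the comments (p = 0.8, r ≈ 0.67):

# F1 as the harmonic mean of the precision and recall computed above
f1_manual = 2 * p * r / (p + r)    # ≈ 0.73 for the example arrays in the comments
f1_sklearn = f1_score(real, pred)  # same value, computed by sklearn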

Multi-class classification

When the problem is multi-class, we need to judge the classifier's performance across all classes at once, which is where macro-averaging and micro-averaging come in. The examples below use a 3-class problem.

Macro-averaging

Macro-averaging is the arithmetic mean of each per-class metric over all classes, giving macro-precision, macro-recall, and the macro-F score. For n classes:

$$\text{Macro-}P = \frac{1}{n}\sum_{i=1}^{n} P_i \qquad \text{Macro-}R = \frac{1}{n}\sum_{i=1}^{n} R_i \qquad \text{Macro-}F1 = \frac{1}{n}\sum_{i=1}^{n} F1_i$$

Micro-averaging

Micro-averaging pools every prediction in the dataset, regardless of class, into one global confusion matrix, then computes the metrics from the pooled counts:

$$\text{Micro-}P = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i} \qquad \text{Micro-}R = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i} \qquad \text{Micro-}F1 = \frac{2 \cdot \text{Micro-}P \cdot \text{Micro-}R}{\text{Micro-}P + \text{Micro-}R}$$

The two differ in what they weight: macro-averaging gives every class equal weight, while micro-averaging gives every individual prediction equal weight. As the F1 formula shows, F1 ignores the samples the classifier correctly labels as negative; its value is driven mainly by the samples correctly labeled as positive. Under micro-averaging, therefore, classes with many samples dominate classes with few.
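To make the two weighting schemes concrete, here is a minimal from-scratch sketch (the helper name macro_micro is ours, and it assumes every class occurs at least once in both y_true and y_pred, so no division by zero occurs):

import numpy as np

def macro_micro(y_true, y_pred):
    # Per-class TP/FP/FN counts from the raw label arrays
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])

    # Macro: average the per-class metrics, so each class weighs equally
    macro_p = np.mean(tp / (tp + fp))
    macro_r = np.mean(tp / (tp + fn))

    # Micro: pool the counts first, so each prediction weighs equally
    micro_p = tp.sum() / (tp.sum() + fp.sum())
    micro_r = tp.sum() / (tp.sum() + fn.sum())
    return macro_p, macro_r, micro_p, micro_r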

Let's work through a concrete 3-class example in detail.
Suppose there are 10 samples belonging to classes A, B, and C, with the following true and predicted labels:

True:      A A A C B C A B B C
Predicted: A A C B A C A C B C

For class A (TP = 3, FP = 1, FN = 1):

$$P_A = \frac{3}{3+1} = 0.75 \qquad R_A = \frac{3}{3+1} = 0.75 \qquad F1_A = \frac{2 \times 0.75 \times 0.75}{0.75 + 0.75} = 0.75$$

For class B (TP = 1, FP = 1, FN = 2):

$$P_B = \frac{1}{1+1} = 0.5 \qquad R_B = \frac{1}{1+2} \approx 0.33 \qquad F1_B = \frac{2 \times 0.5 \times 1/3}{0.5 + 1/3} = 0.4$$

For class C (TP = 2, FP = 2, FN = 1):

$$P_C = \frac{2}{2+2} = 0.5 \qquad R_C = \frac{2}{2+1} \approx 0.67 \qquad F1_C = \frac{2 \times 0.5 \times 2/3}{0.5 + 2/3} \approx 0.57$$

Averaging over the three classes gives Macro-P = (0.75 + 0.5 + 0.5)/3 ≈ 0.58, Macro-R = (0.75 + 0.33 + 0.67)/3 ≈ 0.58, and Macro-F1 = (0.75 + 0.4 + 0.57)/3 ≈ 0.57; pooling the counts gives Micro-P = Micro-R = Micro-F1 = 6/10 = 0.6. The sklearn code below reproduces these numbers, with A, B, C encoded as 0, 1, 2:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, fbeta_score

y_true = [0, 0, 0, 2, 1, 2, 0, 1, 1, 2]
y_pred = [0, 0, 2, 1, 0, 2, 0, 2, 1, 2]

accuracy_score(y_true, y_pred)                   # Return the fraction of correctly classified samples: 0.6
accuracy_score(y_true, y_pred, normalize=False)  # Return the number of correctly classified samples: 6


# Calculate precision score
precision_score(y_true, y_pred, average='macro')  # 0.5833
precision_score(y_true, y_pred, average='micro')  # 0.6
precision_score(y_true, y_pred, average=None)     # [0.75, 0.5, 0.5]


# Calculate recall score
recall_score(y_true, y_pred, average='macro')  # 0.5833
recall_score(y_true, y_pred, average='micro')  # 0.6
recall_score(y_true, y_pred, average=None)     # [0.75, 0.3333, 0.6667]

# Calculate f1 score
f1_score(y_true, y_pred, average='macro')  # 0.5738
f1_score(y_true, y_pred, average='micro')  # 0.6
f1_score(y_true, y_pred, average=None)     # [0.75, 0.4, 0.5714]

# Calculate f-beta score (beta < 1 weights precision more heavily than recall)
fbeta_score(y_true, y_pred, average='macro', beta=0.5)  # 0.5770
fbeta_score(y_true, y_pred, average='micro', beta=0.5)  # 0.6
fbeta_score(y_true, y_pred, average=None, beta=0.5)     # [0.75, 0.4545, 0.5263]
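For a single-label multi-class problem, every wrong prediction is simultaneously one FP (for the predicted class) and one FN (for the true class), so the pooled FP and FN totals are equal and micro-precision, micro-recall, micro-F1, and accuracy all coincide. A quick check on the data above:

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')
assert abs(acc - micro_f1) < 1e-12  # both are 0.6 on this data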

The metrics helper I like to use in my own machine-learning work:

import logging

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

logger = logging.getLogger(__name__)


def compute_metrics(output_mode, y_true, y_pred, labels=None):
    # y_true/y_pred are numpy arrays; for multi-label classification, 2-D indicator arrays
    logger.info('*'*30)
    logger.info(y_true[:10])
    logger.info(y_pred[:10])
    logger.info('*'*30)

    metric = dict()

    # Overall macro and micro metrics
    for metric_mode in ['macro', 'micro']:
        p, r, f1, _ = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred, average=metric_mode)
        metric[metric_mode] = [str(p), str(r), str(f1)]

    # Per-class metrics
    metric_each_label = precision_recall_fscore_support(y_true=y_true, y_pred=y_pred)
    metric['each_label'] = dict()
    if not labels:
        labels = sorted(set(y_true))
    for idx, label in enumerate(labels):
        metric['each_label'][f'label-{label}'] = [str(metric_each_label[j][idx]) for j in range(4)]

    # Accuracy: computed differently for multi-class vs. multi-label classification
    if output_mode == 'classification':
        accuracy = (y_true == y_pred).mean()
    elif output_mode == 'multi-label-classification':
        # A sample counts as correct only if every one of its labels matches
        correct_num = 0
        for i in range(y_true.shape[0]):
            if (y_true[i] == y_pred[i]).all():
                correct_num += 1
        accuracy = correct_num / y_true.shape[0]

    metric['accuracy'] = str(accuracy)

    return metric
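A quick usage sketch with the 3-class data from earlier (the mode string 'classification' is the one this helper checks for):

y_true = np.array([0, 0, 0, 2, 1, 2, 0, 1, 1, 2])
y_pred = np.array([0, 0, 2, 1, 0, 2, 0, 2, 1, 2])

metric = compute_metrics('classification', y_true, y_pred)
print(metric['macro'])     # ['0.5833...', '0.5833...', '0.5738...']  (P, R, F1)
print(metric['accuracy'])  # 0.6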


# Per-class metric computation implemented from scratch
import pandas as pd

def compute_metrics_each_label(label_list, y_true, y_pred):
    # label_list maps class index -> class name; y_true/y_pred are integer numpy arrays
    all_predicate_metric_dict = dict()
    for idx, label in enumerate(label_list):
        true_num = sum([1 for i in y_true if i == idx])  # TP + FN
        pred_num = sum([1 for i in y_pred if i == idx])  # TP + FP
        correct_num = np.logical_and(y_true == y_pred, y_true == idx).astype(int).sum()  # TP
        p = correct_num/(pred_num+1e-5)  # 1e-5 guards against division by zero
        r = correct_num/(true_num+1e-5)
        f1 = (2*p*r)/(p+r+1e-5)
        all_predicate_metric_dict[label] = [f1, p, r]

    metric_each_label_df = pd.DataFrame(
        {
            'label': list(all_predicate_metric_dict.keys()),
            'f1': [round(i[0], 3) for i in all_predicate_metric_dict.values()],
            'p': [round(i[1], 3) for i in all_predicate_metric_dict.values()],
            'r': [round(i[2], 3) for i in all_predicate_metric_dict.values()]
        }
    )
    return metric_each_label_df
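Applied to the same data, with class names A/B/C for label indices 0/1/2, this reproduces the hand-computed per-class table (values differ from the exact ones only by the 1e-5 smoothing):

df = compute_metrics_each_label(['A', 'B', 'C'], y_true, y_pred)
print(df)
#   label     f1     p      r
# 0     A  0.750  0.75  0.750
# 1     B  0.400  0.50  0.333
# 2     C  0.571  0.50  0.667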

# Per-class metrics computed with sklearn
In [140]: precision_recall_fscore_support(y_true=['1','1','2','1', 'a'], y_pred=['a','1','2','2','1'], labels=['a', '1','2', '3'])
Out[140]:
(array([0. , 0.5, 0.5, 0. ]),
 array([0.        , 0.33333333, 1.        , 0.        ]),
 array([0.        , 0.4       , 0.66666667, 0.        ]),
 array([1, 3, 1, 0]))
Each output array is ordered according to the labels argument; the four arrays are precision, recall, F-score, and support.

In [141]: precision_recall_fscore_support(y_true=['1','1','2','1', 'a'], y_pred=['a','1','2','2','1'])
Out[141]:
(array([0.5, 0.5, 0. ]),
 array([0.33333333, 1.        , 0.        ]),
 array([0.4       , 0.66666667, 0.        ]),
 array([3, 1, 1]))
If labels is not provided, it defaults to the sorted set of labels appearing in y_true and y_pred.

