Pitfalls when computing accuracy in TensorFlow evaluation

When testing classification models, TensorFlow slim provides eval_image_classifier.py, which can evaluate the results of most of its classification models.

For the evaluation metrics, TensorFlow defines two: accuracy and recall at k.

'Accuracy': slim.metrics.streaming_accuracy(predictions, labels),

'Recall_5': slim.metrics.streaming_recall_at_k(logits, labels, 5),

Generally, model evaluation metrics are built from True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

(Figure: confusion matrix, borrowed from elsewhere.)

Let's start with what everyone already knows.

Recall = TP / (TP + FN)

Recall is the fraction of all actual positives that were predicted as positive.

Precision = TP / (TP + FP)

Precision is the fraction of all positive predictions that are actually positive.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Accuracy is the fraction of all test samples that were predicted correctly.
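As a quick sanity check of the three formulas, here is a minimal sketch in plain Python. The function name `binary_metrics` and the counts are made up for illustration; they do not come from slim or TensorFlow.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of actual positives that were found
    precision = tp / (tp + fp)       # share of positive predictions that are right
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all samples that are right
    return recall, precision, accuracy


# Hypothetical counts, for illustration only.
recall, precision, accuracy = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(recall, precision, accuracy)
```

With these counts, recall is 40/45, precision is 40/50, and accuracy is 85/100, so the three numbers can differ noticeably on the same predictions.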

On slim's GitHub page, the long list of models reports only top-1 and top-5 accuracy.

But what shows up in the evaluation code is accuracy and recall_5. So here is the question: can recall and accuracy really be the same thing?!

I took a close look at the TF source code.

The accuracy implementation is here: https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/ops/metrics_impl.py#L397-L459

is_correct = math_ops.cast(
    math_ops.equal(predictions, labels), dtypes.float32)
return mean(is_correct, weights, metrics_collections, updates_collections,
            name or 'accuracy')

Right, so it just checks the predictions against the labels, counts how many are correct, and takes the mean.
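The "compare, cast, then take a running mean" idea can be sketched without TensorFlow. This is only an illustration of the logic, not the library's implementation: the class `StreamingAccuracy` is a hypothetical name, weights are ignored, and predictions are assumed to already be class ids.

```python
class StreamingAccuracy:
    """Running mean of (prediction == label), accumulated batch by batch."""

    def __init__(self):
        self.total = 0.0   # number of correct predictions seen so far
        self.count = 0.0   # number of samples seen so far

    def update(self, predictions, labels):
        # 1.0 where the predicted class id equals the label, else 0.0
        is_correct = [float(p == y) for p, y in zip(predictions, labels)]
        self.total += sum(is_correct)
        self.count += len(is_correct)
        return self.total / self.count


metric = StreamingAccuracy()
metric.update([1, 2, 0, 1], [1, 2, 2, 1])   # first batch: 3 of 4 correct
final = metric.update([0, 0], [0, 1])       # second batch: 1 of 2 correct
print(final)                                # mean over all 6 samples
```

Keeping a total and a count, rather than averaging per-batch accuracies, gives the right answer even when batches have different sizes.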

recall_at_k: https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/contrib/metrics/python/ops/metric_ops.py#L2114

in_top_k = math_ops.cast(nn.in_top_k(predictions, labels, k), dtypes.float32)
return streaming_mean(in_top_k, weights, metrics_collections,
                      updates_collections, name or _at_k_name('recall', k))

It uses nn.in_top_k, which checks whether the true label is among the top 5 predictions, returns a bool, casts it to float32, and then takes the mean.
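The same logic can be sketched in plain Python. The helpers `in_top_k` and `recall_at_k` below are stand-ins written for this post, not the TensorFlow functions, and the tie handling (tied scores count as inside the top k) is a simplifying assumption.

```python
def in_top_k(scores, label, k):
    """True if `label` is among the k highest-scoring classes."""
    target = scores[label]
    higher = sum(1 for s in scores if s > target)  # classes ranked above the label
    return higher < k


def recall_at_k(logits, labels, k):
    """Mean over samples of the 0/1 in_top_k indicator."""
    hits = [float(in_top_k(row, y, k)) for row, y in zip(logits, labels)]
    return sum(hits) / len(hits)


# Hypothetical scores for 3 samples and 3 classes.
logits = [
    [0.1, 0.7, 0.2],   # row 0: class ranking 1, 2, 0
    [0.5, 0.3, 0.2],   # row 1: class ranking 0, 1, 2
    [0.2, 0.3, 0.5],   # row 2: class ranking 2, 1, 0
]
labels = [2, 0, 0]

top1 = recall_at_k(logits, labels, 1)   # only row 1 has its label ranked first
top2 = recall_at_k(logits, labels, 2)   # rows 0 and 1 have their label in the top 2
print(top1, top2)
```

With k = 1 this reduces to checking whether the highest-scoring class equals the label, which is the same comparison the accuracy metric makes.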

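Naturally I was confused. What is the difference between these two? Weren't we supposed to be computing accuracy, and isn't this recall instead? A quick search showed that plenty of people were just as confused and were complaining that TF is misleading here.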

So I went back and looked again, more carefully.

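In fact, in a multi-class classification model, TP + TN is simply the number of correct answers. That is, when computing accuracy with a batch size of 100, there are 100 true labels and 100 top-1 predictions; if 80 of the predictions are correct, then TP + TN = 80 and TP + FP + TN + FN = 80 + 20 = 100.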

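So this benchmark is actually correct. Because each sample has exactly one true label, the mean of in_top_k over the samples is the top-5 accuracy.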
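To check the 100-sample argument numerically, here is a small sketch in plain Python. It assumes single-label multi-class data and pools TP and FN over all classes (micro-averaging); the data and function names are invented for illustration.

```python
def accuracy(preds, labels):
    """Fraction of samples whose predicted class equals the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def micro_recall(preds, labels, num_classes):
    """Recall with TP and FN pooled over every class."""
    tp = fn = 0
    for c in range(num_classes):
        for p, y in zip(preds, labels):
            if y == c:            # sample truly belongs to class c
                if p == c:
                    tp += 1       # found it
                else:
                    fn += 1       # missed it
    return tp / (tp + fn)


# 100 hypothetical samples over 5 classes: the first 80 predictions are
# correct, the last 20 are shifted to the wrong class.
labels = [i % 5 for i in range(100)]
preds = [y if i < 80 else (y + 1) % 5 for i, y in enumerate(labels)]

acc = accuracy(preds, labels)
rec = micro_recall(preds, labels, num_classes=5)
print(acc, rec)
```

Every sample is counted exactly once as either a hit or a miss for its own class, so the pooled recall and the accuracy come out as the same 80 out of 100.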

Still, I personally find the way TensorFlow names these two metrics, accuracy and recall, really easy to get confused by.
