Performance metrics evaluate how well a machine learning model performs on given data under different scenarios. Choosing the right metric is essential for understanding a model's behavior and for deciding what to change to improve it. There are many types of performance metrics; in this article, we'll look at some of the most widely used ones.
Confusion Matrix
A confusion matrix is used to evaluate the performance of classification algorithms.
For binary classification, a confusion matrix has two rows and two columns. In general, the number of rows and columns equals the number of classes. In the layout used here, columns are the predicted classes and rows are the actual classes.
Now let's look at each cell of our confusion matrix:
1) True Positives (TP): the actual value is 1 and the value predicted by our classifier is also 1.
2) True Negatives (TN): the actual value is 0 and the value predicted by our classifier is also 0.
3) False Positives (FP) (Type 1 error): the actual value is 0 but the value predicted by our classifier is 1.
4) False Negatives (FN) (Type 2 error): the actual value is 1 but the value predicted by our classifier is 0.
The end goal of our classification algorithm is to maximize the true positives and true negatives (correct predictions) and to minimize the false positives and false negatives (incorrect predictions).
False negatives can be especially worrisome in medical applications. Consider an application that has to detect breast cancer in patients. Suppose a patient has cancer but our model predicts that she doesn't. This is dangerous because the patient is cancer positive and our model failed to detect it.
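The four cells described above can be counted directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch, assuming binary labels encoded as 0 and 1 with 1 as the positive class; the function name `confusion_counts` and the toy labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # true positive
        elif actual == 0 and predicted == 0:
            tn += 1  # true negative
        elif actual == 0 and predicted == 1:
            fp += 1  # false positive (Type 1 error)
        else:
            fn += 1  # false negative (Type 2 error); assumes labels are only 0 or 1
    return tp, tn, fp, fn


# Hypothetical toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Every other metric in this article for binary classification can be derived from these four counts.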
Accuracy
Accuracy is the most commonly used performance metric for classification algorithms. It is defined as the number of correct predictions divided by the total number of predictions. We can easily calculate accuracy from the confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy works well when the classes are balanced, i.e. each class has roughly the same number of samples. If the classes are imbalanced, i.e. the number of samples per class differs widely, accuracy might not be the right metric.
Why is accuracy an unreliable metric for imbalanced data?
Let's consider a binary classification problem with two classes, cats and dogs, where cats make up 90% of the samples and dogs make up 10%. Here cat is our majority class and dog is our minority class. If our model predicts every data point as a cat, it still reaches an accuracy of 90% without ever identifying a single dog.
This is especially worrisome when the cost of misclassifying the minority class is high, e.g. in credit card fraud detection, where fraudulent transactions are far fewer than non-fraudulent ones.
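The cats-and-dogs argument can be checked with a few lines of code. This is a minimal sketch, assuming a made-up dataset of 100 samples in the 90/10 split described above and a "model" that always predicts the majority class; the label encoding (0 for cat, 1 for dog) is an assumption for illustration.

```python
# Hypothetical dataset: 90 cats (label 0) and 10 dogs (label 1).
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always predicts the majority class (cat).
y_pred = [0] * len(y_true)

# Accuracy: correct predictions divided by total predictions.
correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
accuracy = correct / len(y_true)

# Number of dogs the model actually identified.
dogs_found = sum(1 for actual, predicted in zip(y_true, y_pred)
                 if actual == 1 and predicted == 1)

print(f"accuracy={accuracy:.2f}, dogs correctly identified={dogs_found}")
```

The accuracy looks good even though the model has learned nothing about the minority class, which is why the metrics in the following sections are preferred for imbalanced data.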
Recall or Sensitivity
Recall is defined as the number of correct positive predictions (true positives) divided by the total number of actual positives, i.e. the sum of true positives and false negatives. It is also called the true positive rate. The recall value ranges from 0 to 1.
Recall can be calculated from the confusion matrix as Recall = TP / (TP + FN). It is especially useful when the classes are imbalanced.
Recall answers the following question: out of all the actual positive samples, how many did we correctly predict as positive, and how many should have been predicted as positive but were incorrectly predicted as negative?
Recall is all about minimizing false negatives (Type 2 errors), so when our objective is to minimize false negatives we choose recall as a metric.
Why is recall a good metric for imbalanced data?
Let's consider an imbalanced dataset with 1100 samples in total, of which about 91% belong to the negative class. The TP, TN, FP, FN values are:
True Positives = 20
True Negatives = 800
False Positives = 200
False Negatives = 80
If we put these values into the recall formula we get recall = 20 / (20 + 80) = 0.2. This means that out of all the actual positive samples, only 20% were correctly predicted as positive, and the remaining 80% should have been predicted as positive but were incorrectly predicted as negative.
Here we can see that despite a seemingly high accuracy of 74.5%, the recall score is very low because there are more false negatives than true positives.
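The arithmetic in this example can be reproduced as follows. This is a minimal sketch that simply plugs the four counts from the example into the accuracy and recall definitions; the variable names are chosen for illustration.

```python
# Counts from the worked example above.
tp, tn, fp, fn = 20, 800, 200, 80

total = tp + tn + fp + fn            # 1100 samples
accuracy = (tp + tn) / total         # correct predictions / all predictions
recall = tp / (tp + fn)              # true positives / all actual positives

print(f"total={total} accuracy={accuracy:.3f} recall={recall:.3f}")
```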
Specificity
Specificity is defined as the number of correct negative predictions (true negatives) divided by the total number of actual negatives, i.e. the sum of true negatives and false positives. It is also called the true negative rate. The specificity value ranges from 0 to 1.
Specificity can be calculated from the confusion matrix as Specificity = TN / (TN + FP).
Specificity answers the following question: out of all the actual negative samples, how many did we correctly predict as negative, and how many should have been predicted as negative but were incorrectly predicted as positive?
Let's consider the same imbalanced dataset again. The TP, TN, FP, FN values are:
True Positives = 20
True Negatives = 800
False Positives = 200
False Negatives = 80
If we put these values into the specificity formula we get specificity = 800 / (800 + 200) = 0.8. This means that out of all the actual negative samples, 80% were correctly predicted as negative, and the remaining 20% should have been predicted as negative but were incorrectly predicted as positive.
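As a quick check of the specificity calculation, here is a minimal sketch using the same counts. It also computes the false positive rate, which is the complement of specificity and appears later as the x-axis of the ROC curve; the variable names are chosen for illustration.

```python
# Counts from the worked example above.
tp, tn, fp, fn = 20, 800, 200, 80

specificity = tn / (tn + fp)           # true negatives / all actual negatives
false_positive_rate = fp / (fp + tn)   # false positives / all actual negatives

print(f"specificity={specificity:.2f} FPR={false_positive_rate:.2f}")
```

The two values always add up to 1, because every actual negative is either a true negative or a false positive.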
Precision
Precision is defined as the number of correct positive predictions (true positives) divided by the total number of positive predictions, i.e. the sum of true positives and false positives. The precision value ranges from 0 to 1.
Precision can be calculated from the confusion matrix as Precision = TP / (TP + FP). It is especially useful when the classes are imbalanced.
Precision answers the following question: out of all the positive predictions, how many were actually positive, and how many were actually negative but incorrectly predicted as positive?
Precision is all about minimizing false positives (Type 1 errors), so when our objective is to minimize false positives we choose precision as a metric.
Why is precision a good metric for imbalanced data?
Let's consider the same imbalanced dataset again. The TP, TN, FP, FN values are:
True Positives = 20
True Negatives = 800
False Positives = 200
False Negatives = 80
If we put these values into the precision formula we get precision = 20 / (20 + 200) ≈ 0.09. This means that out of all the positive predictions, only 9% were actually positive; the remaining 91% were actually negative but were incorrectly predicted as positive.
Here we can see that despite a seemingly high accuracy of 74.5%, the precision score is very low because there are far more false positives than true positives.
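The precision figure can be verified the same way. This is a minimal sketch using the counts from the example; the variable names are chosen for illustration.

```python
# Counts from the worked example above.
tp, tn, fp, fn = 20, 800, 200, 80

precision = tp / (tp + fp)                     # true positives / all positive predictions
wrong_positive_share = fp / (tp + fp)          # share of positive predictions that were wrong

print(f"precision={precision:.2f} wrong positive predictions={wrong_positive_share:.2f}")
```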
What do different values of precision and recall mean for a classifier?
High precision (fewer false positives) + high recall (fewer false negatives):
This model predicts the classes well.
High precision (fewer false positives) + low recall (more false negatives):
This model predicts few samples as positive, but most of those positive predictions are correct.
Low precision (more false positives) + high recall (fewer false negatives):
This model predicts many samples as positive, but most of those positive predictions are incorrect.
Low precision (more false positives) + low recall (more false negatives):
This model is close to a no-skill classifier and can't predict either class properly.
F1-Score
The F1-score combines precision and recall. It is the harmonic mean of the two: F1 = 2 × (Precision × Recall) / (Precision + Recall).
The F1-score balances precision and recall, and it is high only when both are high. We always want our classifiers to have high precision and high recall, but in practice there is usually a trade-off between the two when tuning a classifier.
Suppose we have two classifiers, one with a high precision score and the other with a high recall score. If we want to summarize the performance of each classifier as a single number, we can use the F1-score, which is the harmonic mean of precision and recall rather than a simple average.
If precision = 0 and recall = 1 (100%), then F1-score = 0.
The F1-score always gives more weight to the lower of the two values: if either precision or recall is low, the F1-score is pulled down.
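To see how the harmonic mean pulls the score towards the lower value, here is a minimal sketch that compares the F1-score with the plain arithmetic mean. The function name `f1_score` and the sample precision/recall pairs are made up for illustration, and returning 0 when both inputs are 0 is an assumed convention to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumed to be 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs, made up for illustration.
pairs = [(0.0, 1.0), (0.5, 0.5), (0.9, 0.1)]

for precision, recall in pairs:
    arithmetic_mean = (precision + recall) / 2
    print(f"P={precision} R={recall} "
          f"arithmetic mean={arithmetic_mean:.2f} F1={f1_score(precision, recall):.2f}")
```

All three pairs have the same arithmetic mean, but the F1-score is high only for the balanced pair.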
There are different variants of the F1-score
a) Macro F1-score / macro-averaged F1-score
For the macro F1-score, we first calculate the F1-score for each class and then take a simple average of the individual class F1-scores.
The macro F1-score can be used when you want to know how your classifier performs across all classes, with each class counted equally regardless of its size.
b) Weighted-average F1-score
For the weighted-average F1-score, we first multiply the F1-score of each class by the number of samples in that class, then add these values and divide the sum by the total number of samples.
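The difference between the two averages shows up clearly on imbalanced classes. The following is a minimal sketch, assuming the per-class F1-scores have already been computed; the class names, scores, and sample counts are made up for illustration.

```python
# Hypothetical per-class F1-scores and sample counts, made up for illustration.
per_class_f1 = {"cat": 0.90, "dog": 0.50, "bird": 0.10}
samples_per_class = {"cat": 900, "dog": 80, "bird": 20}

# Macro F1: simple average, every class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted F1: each class's F1 is weighted by its number of samples.
total_samples = sum(samples_per_class.values())
weighted_f1 = sum(
    per_class_f1[name] * samples_per_class[name] for name in per_class_f1
) / total_samples

print(f"macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f}")
```

In this made-up case the weighted average is dominated by the large, well-predicted class, while the macro average exposes the poor performance on the small classes.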
F-beta Score
Sometimes we want to give more importance to either precision or recall when combining them into a single score. In such cases we can use the F-beta score. The F-beta score is a generalization of the F1-score in which the beta parameter controls how much weight goes to precision and how much to recall, depending on the application.
The default beta value is 1, which gives the F1-score. A smaller beta value gives more weight to precision and less weight to recall, whereas a larger beta value gives less weight to precision and more weight to recall.
For Beta = 0: the F-beta score equals precision.
For Beta = 1: we get the harmonic mean of precision and recall.
For Beta = infinity: the F-beta score equals recall.
For 0 < Beta < 1: precision is weighted more heavily than recall.
For Beta > 1: recall is weighted more heavily than precision.
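The behavior for different beta values can be checked numerically. This is a minimal sketch using the standard definition F-beta = (1 + beta²) × P × R / (beta² × P + R), which is not spelled out in the article; the function name `fbeta_score` and the sample precision and recall values are made up for illustration.

```python
def fbeta_score(precision, recall, beta):
    """F-beta score; beta < 1 favors precision, beta > 1 favors recall."""
    beta_squared = beta ** 2
    denominator = beta_squared * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention to avoid division by zero
    return (1 + beta_squared) * precision * recall / denominator


# Hypothetical classifier with high precision and low recall.
precision, recall = 0.8, 0.4

for beta in (0, 0.5, 1, 2, 1000):
    print(f"beta={beta}: F-beta={fbeta_score(precision, recall, beta):.4f}")
```

As beta grows from 0 towards very large values, the score moves from the precision value towards the recall value.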
ROC (Receiver Operating Characteristic) Curve
The ROC curve is used to visualize the performance of a binary classifier. It is built from the predicted probabilities: it plots the true positive rate against the false positive rate as the decision threshold varies. The AUC score is the area under the ROC curve. The higher the AUC score, the better the classifier performs. The AUC score ranges from 0 to 1.
The true positive rate is calculated as TPR = TP / (TP + FN). A larger true positive rate means that more positive points are classified correctly.
The false positive rate is calculated as FPR = FP / (FP + TN). A smaller false positive rate means that more negative points are classified correctly.
On the ROC plot, the x-axis is the false positive rate and the y-axis is the true positive rate:
A smaller value on the x-axis means fewer false positives and more true negatives, i.e. more negative points are classified correctly.
A larger value on the y-axis means more true positives and fewer false negatives, i.e. more positive points are classified correctly.
What do different values of the AUC score mean?
AUC < 0.5: the classifier is worse than a no-skill classifier.
AUC = 0.5: it's a no-skill classifier that cannot separate the positive and negative classes; it effectively predicts a class at random.
0.5 < AUC < 1: the classifier is better than a no-skill classifier; the closer the score is to 1, the better it separates the two classes.
AUC = 1: it's a perfect classifier that classifies all negative and positive points correctly.
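The construction of the ROC curve and its AUC can be sketched in plain Python. This is a minimal illustration, not a library implementation: it assumes binary labels (0/1) with at least one sample of each class, treats a sample as predicted positive when its score is at or above the threshold, and estimates the area with the trapezoidal rule. The function names `roc_points` and `area_under_curve` and the toy scores are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal-rule area under (x, y) points that are ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities, made up for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points)
print(f"AUC={auc:.2f}")
```

For these toy values the AUC falls between 0.5 and 1, which corresponds to a classifier that is better than no-skill but not perfect.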
Precision-Recall Curve
Precision quantifies the fraction of positive predictions that are correct, and recall quantifies the fraction of actual positives that were correctly predicted as positive.
The precision-recall curve is a plot of precision against recall, with recall on the x-axis and precision on the y-axis. A larger area under the precision-recall curve means both high precision and high recall, where high precision means few false positives among the positive predictions and high recall means a low false negative rate.
On a precision-recall plot, a no-skill classifier is a horizontal line at a height equal to the proportion of positive samples, a skillful classifier is a curve bowing towards (1,1), and a perfect classifier is a point at the coordinate (1,1).
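The points of a precision-recall curve can be computed by sweeping the decision threshold, in the same way as for the ROC curve. This is a minimal sketch, assuming binary labels (0/1) with at least one positive sample and a "score at or above the threshold means positive" rule; the function name `precision_recall_points` and the toy scores are made up for illustration.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct threshold, highest threshold first."""
    positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        # tp + fp is at least 1 here because the threshold is one of the scores.
        points.append((tp / positives, tp / (tp + fp)))
    return points


# Hypothetical labels and predicted probabilities, made up for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pr_points = precision_recall_points(y_true, scores)
no_skill_precision = sum(y_true) / len(y_true)  # height of the no-skill horizontal line

for recall, precision in pr_points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
print(f"no-skill baseline precision={no_skill_precision:.2f}")
```

Lowering the threshold never decreases recall, while precision can move up or down, which is why the curve is usually jagged rather than smooth.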
Log Loss
Log loss (logarithmic loss) is also called logistic loss or cross-entropy loss. It penalizes incorrect classifications, and it penalizes them more heavily the more confident the wrong prediction was. In general, a lower log loss goes together with higher accuracy. Log loss values range from 0 to infinity.
To calculate log loss, the classifier must output a probability between 0 and 1 for each class. These probabilities express how confident the model is that a data point belongs to a certain class. Log loss measures how far these predicted probabilities are from the actual labels.
The log loss for multi-class classification is calculated as
Log Loss = -(1/N) × Σ_{i=1..N} Σ_{j=1..M} y_ij × log(p_ij)
where,
M = number of classes
N = number of samples
y_ij = a binary indicator that is 1 if observation i belongs to class j and 0 otherwise
p_ij = the predicted probability that observation i belongs to class j
Let's say a no-skill classifier that predicts classes at random has a log loss of 0.6; then any classifier with a log loss of more than 0.6 is a poor classifier.
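The multi-class formula can be sketched in a few lines. Because y_ij is 1 only for the true class, only the probability assigned to the true class contributes for each sample. This is a minimal illustration using the natural logarithm; the function name `log_loss`, the clipping constant `eps` (an assumption added to avoid taking the log of 0), and the toy probabilities are made up for illustration.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Multi-class log loss.

    y_true: class index of each sample.
    probabilities: one list of per-class probabilities for each sample.
    eps: assumed clipping constant so that log(0) is never evaluated.
    """
    total = 0.0
    for label, class_probs in zip(y_true, probabilities):
        p_true = min(max(class_probs[label], eps), 1 - eps)
        total += -math.log(p_true)
    return total / len(y_true)


# Hypothetical 3-class example, made up for illustration.
y_true = [0, 1, 2]
confident_and_right = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
uniform_guess = [[1 / 3, 1 / 3, 1 / 3]] * 3
confident_and_wrong = [[0.05, 0.9, 0.05], [0.8, 0.1, 0.1], [0.6, 0.2, 0.2]]

print(f"confident and right: {log_loss(y_true, confident_and_right):.3f}")
print(f"uniform guess:       {log_loss(y_true, uniform_guess):.3f}")
print(f"confident and wrong: {log_loss(y_true, confident_and_wrong):.3f}")
```

A classifier that always outputs uniform probabilities over M classes has a log loss of ln(M), about 0.693 for two classes, which gives one possible no-skill reference point.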
MSE (Mean Squared Error)
Mean squared error is the average of the squared differences between the actual values and the predicted values. It is a metric used for regression analysis.
MSE gives us the mean squared distance between the actual points and the points predicted by our regression model, i.e. how far our predicted values are from the actual values.
MSE is differentiable everywhere, so its gradient is easy to compute, and it penalizes larger errors heavily.
Squaring makes every error non-negative, so positive and negative errors cannot cancel each other out.
A lower MSE means the model is more accurate and the predicted values are closer to the actual values.
MSE = (1/n) × Σ_{i=1..n} (Y_i - Y^_i)^2
where,
n = number of samples
Y = actual values
Y^ = predicted values
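The MSE definition translates directly into code. This is a minimal sketch, assuming two equal-length lists of numbers; the function name `mean_squared_error` and the toy values are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(actual)


# Hypothetical regression targets and predictions, made up for illustration.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]

mse = mean_squared_error(actual, predicted)
print(f"MSE={mse}")
```

The errors here are 1, 0, -2 and 0; after squaring, the single error of size 2 contributes four times as much as the error of size 1.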
MAE (Mean Absolute Error)
MAE is another metric used for regression analysis. It is similar to mean squared error, but it is the average of the absolute differences between the actual values and the predicted values.
Unlike MSE, MAE gives us the average absolute distance between the actual points and the points predicted by our regression model, so it is expressed in the same units as the target values.
The gradient of MAE is harder to work with because the absolute value is not differentiable at zero, and larger errors are penalized only in proportion to their size rather than quadratically.
A lower MAE means the model is more accurate and the predicted values are closer to the actual values.
MAE = (1/n) × Σ_{i=1..n} |Y_i - Y^_i|
where,
n = number of samples
Y = actual values
Y^ = predicted values
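The different ways MAE and MSE treat large errors can be seen by comparing them on two sets of predictions. This is a minimal sketch; the function names and the toy values are made up for illustration, and `mean_squared_error` is redefined here so the example is self-contained.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values, made up for illustration.
actual = [10.0, 10.0, 10.0, 10.0]
many_small_errors = [11.0, 9.0, 11.0, 9.0]    # every prediction is off by 1
one_large_error = [10.0, 10.0, 10.0, 14.0]    # a single prediction is off by 4

for name, predicted in [("many small errors", many_small_errors),
                        ("one large error", one_large_error)]:
    mae = mean_absolute_error(actual, predicted)
    mse = mean_squared_error(actual, predicted)
    print(f"{name}: MAE={mae} MSE={mse}")
```

Both sets of predictions have the same MAE, but the MSE of the set with one large error is four times higher, which illustrates why MSE is more sensitive to outliers.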
Translated from: https://medium.com/analytics-vidhya/understanding-performance-metrics-for-machine-learning-algorithms-996dd7efde1e