Model evaluation is an indispensable part of any data mining workflow. In a recent pyspark project I used MulticlassClassificationEvaluator to evaluate models and ran into quite a few pitfalls, so I am writing them down here to share.
Official reference documentation:
MulticlassClassificationEvaluator
- Init signature: MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
- Docstring:
.. note:: Experimental
Evaluator for Multiclass Classification, which expects two input
columns: prediction and label.
As the documentation shows, MulticlassClassificationEvaluator takes three parameters: predictionCol, labelCol, and metricName. The one that deserves a closer look is metricName.
metricName
According to the documentation, metricName can be set to f1, weightedPrecision, weightedRecall, or accuracy. To make these concrete, let's compare MulticlassClassificationEvaluator against sklearn.metrics.
from pyspark.sql import SparkSession
import pandas as pd
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn import metrics
spark = SparkSession.builder.master('local').appName('test').getOrCreate()
# Build a toy multiclass dataset of (prediction, label) pairs
data = [(0.0, 0.0),
        (0.0, 1.0),
        (0.0, 0.0),
        (1.0, 0.0),
        (1.0, 1.0),
        (1.0, 1.0),
        (1.0, 1.0),
        (2.0, 2.0),
        (2.0, 0.0)]
data_pd_df = pd.DataFrame(data, columns=["prediction", "label"])
data_spark_df = spark.createDataFrame(data, ["prediction", "label"])
evaluator_acc = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="f1")
evaluator_pre = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedRecall")
sklearn_acc = metrics.accuracy_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_f1 = metrics.f1_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')
sklearn_pre = metrics.precision_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')
sklearn_recall = metrics.recall_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')
print('pyspark accuracy: %.6f' % evaluator_acc.evaluate(data_spark_df))
print('pyspark f1-score: %.6f' % evaluator_f1.evaluate(data_spark_df))
print('pyspark precision: %.6f' % evaluator_pre.evaluate(data_spark_df))
print('pyspark recall: %.6f' % evaluator_recall.evaluate(data_spark_df))
print('-----------------------')
print('sklearn accuracy: %.6f' % sklearn_acc)
print('sklearn f1-score: %.6f' % sklearn_f1)
print('sklearn precision: %.6f' % sklearn_pre)
print('sklearn recall: %.6f' % sklearn_recall)
pyspark accuracy: 0.666667
pyspark f1-score: 0.661376
pyspark precision: 0.685185
pyspark recall: 0.666667
-----------------------
sklearn accuracy: 0.666667
sklearn f1-score: 0.661376
sklearn precision: 0.685185
sklearn recall: 0.666667
As the output shows, MulticlassClassificationEvaluator and sklearn.metrics produce exactly the same results. The careful reader will have noticed, though, that the agreement hinges on one crucial sklearn parameter: average='weighted'. The sklearn documentation describes this option as follows:
Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance.
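What 'weighted' means is easy to verify by hand: compute each class's recall separately, then average those per-class values weighted by each class's share of the true labels. A minimal plain-Python sketch on the first dataset above:

```python
# Data from the first example above: (prediction, label) pairs
data = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0),
        (1.0, 1.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]

classes = sorted({label for _, label in data})
total = len(data)

weighted_recall = 0.0
for c in classes:
    # support: number of true instances of class c
    support = sum(1 for _, label in data if label == c)
    # true positives for class c
    tp = sum(1 for pred, label in data if label == c and pred == c)
    recall_c = tp / support
    # weight each class's recall by its share of the true labels
    weighted_recall += (support / total) * recall_c

print('weighted recall: %.6f' % weighted_recall)   # 0.666667, matching both outputs above
```

Per class this gives recall 0.5 (support 4), 0.75 (support 4), and 1.0 (support 1), whose support-weighted average is 6/9 = 0.666667, exactly the pyspark and sklearn numbers.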
MulticlassClassificationEvaluator is not well suited to evaluating binary classifiers: sklearn's average parameter defaults to 'binary' (metrics for the positive class only), while MulticlassClassificationEvaluator always computes the weighted variant. On the same data the two conventions can therefore produce completely different numbers. A concrete example:
# Build a binary-classification dataset of (prediction, label) pairs
data = [(0.0, 0.0),
        (0.0, 0.0),
        (0.0, 0.0),
        (1.0, 0.0),
        (1.0, 1.0),
        (1.0, 0.0),
        (1.0, 1.0)]
data_pd_df = pd.DataFrame(data, columns=["prediction", "label"])
data_spark_df = spark.createDataFrame(data, ["prediction", "label"])
evaluator_acc = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="f1")
evaluator_pre = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedRecall")
sklearn_acc = metrics.accuracy_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_f1 = metrics.f1_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_pre = metrics.precision_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_recall = metrics.recall_score(data_pd_df['label'], data_pd_df['prediction'])
print('pyspark accuracy: %.6f' % evaluator_acc.evaluate(data_spark_df))
print('pyspark f1-score: %.6f' % evaluator_f1.evaluate(data_spark_df))
print('pyspark precision: %.6f' % evaluator_pre.evaluate(data_spark_df))
print('pyspark recall: %.6f' % evaluator_recall.evaluate(data_spark_df))
print('-----------------------')
print('sklearn accuracy: %.6f' % sklearn_acc)
print('sklearn f1-score: %.6f' % sklearn_f1)
print('sklearn precision: %.6f' % sklearn_pre)
print('sklearn recall: %.6f' % sklearn_recall)
pyspark accuracy: 0.714286
pyspark f1-score: 0.726190
pyspark precision: 0.857143
pyspark recall: 0.714286
-----------------------
sklearn accuracy: 0.714286
sklearn f1-score: 0.666667
sklearn precision: 0.500000
sklearn recall: 1.000000
The example above shows that for binary classification you should generally avoid MulticlassClassificationEvaluator: its numbers diverge substantially from the usual binary evaluation conventions.
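The gap between the two conventions has nothing to do with pyspark itself; it can be reproduced with sklearn alone by toggling the average parameter on the second dataset:

```python
from sklearn import metrics

# Binary data from the second example: (prediction, label) pairs
data = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0),
        (1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y_pred = [p for p, _ in data]
y_true = [l for _, l in data]

# average='binary' (sklearn's default): precision of the positive class only
binary_pre = metrics.precision_score(y_true, y_pred)
# average='weighted': the convention MulticlassClassificationEvaluator uses
weighted_pre = metrics.precision_score(y_true, y_pred, average='weighted')

print('binary precision:   %.6f' % binary_pre)     # 0.500000, the sklearn number above
print('weighted precision: %.6f' % weighted_pre)   # 0.857143, the pyspark number above
```

The weighted figure folds in the negative class (precision 1.0, support 5) alongside the positive class (precision 0.5, support 2), giving 6/7 = 0.857143, which is why the pyspark precision looks so much more flattering.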
2020-05-21, update:
If you do need the usual binary metrics in pyspark, here is a helper function I wrote; feel free to use it as a reference.
def precise_recall_f1(pred_label, test_label):
    # USER_NO is the join-key column shared by both DataFrames (defined elsewhere)
    p_label = pred_label.withColumnRenamed('prediction', 'p_label')
    d_label = test_label.withColumnRenamed('label', 'd_label')
    # p_label = p_label.withColumn('id', monotonically_increasing_id())
    # d_label = d_label.withColumn('id', monotonically_increasing_id())
    result = p_label.join(d_label, on=USER_NO, how='left')
    tp = result[(result.d_label == 1) & (result.p_label == 1)].count()
    tn = result[(result.d_label == 0) & (result.p_label == 0)].count()
    fp = result[(result.d_label == 0) & (result.p_label == 1)].count()
    fn = result[(result.d_label == 1) & (result.p_label == 0)].count()
    try:
        p = float(tp) / (tp + fp)
    except ZeroDivisionError:
        p = 0
    try:
        r = float(tp) / (tp + fn)
    except ZeroDivisionError:
        r = 0
    try:
        f1 = 2 * p * r / (p + r)
    except ZeroDivisionError:
        f1 = 0
    return p, r, f1
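For a quick sanity check, the same confusion-matrix logic can be sketched in plain Python; the pyspark function above just replaces these list comprehensions with DataFrame filters (USER_NO there is the join-key column assumed to exist in both DataFrames):

```python
def precision_recall_f1(y_pred, y_true):
    """Binary precision, recall, and F1 from raw 0/1 predictions."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    # Guard the degenerate cases instead of relying on try/except
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Binary data from the second example: predictions and labels
y_pred = [0, 0, 0, 1, 1, 1, 1]
y_true = [0, 0, 0, 0, 1, 0, 1]
print(precision_recall_f1(y_pred, y_true))   # (0.5, 1.0, 0.666...), matching the sklearn output above
```

On this data tp=2, fp=2, fn=0, giving precision 0.5, recall 1.0, and F1 0.666667, exactly the sklearn binary figures from the second example.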