Notes on using PySpark's MulticlassClassificationEvaluator

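Model evaluation is an indispensable part of any data mining workflow.
I recently worked on a PySpark project and used MulticlassClassificationEvaluator to evaluate the models. I ran into quite a few pitfalls along the way, so I am writing them down here to share.
Official reference documentation: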

MulticlassClassificationEvaluator

  • Init signature: MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
  • Docstring:
    .. note:: Experimental
    Evaluator for Multiclass Classification, which expects two input
    columns: prediction and label.

From the help text above we can see that the MulticlassClassificationEvaluator constructor takes three parameters: predictionCol, labelCol, and metricName. The one worth explaining in detail is metricName.

metricName
According to the documentation, metricName can be set to f1, weightedPrecision, weightedRecall, or accuracy.

A usage example follows.

To make it easier to understand, we compare MulticlassClassificationEvaluator with sklearn.metrics.

from pyspark.sql import SparkSession
import pandas as pd

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from sklearn import metrics

spark = SparkSession.builder.master('local').appName('test').getOrCreate()
# Create a small dataset of (prediction, label) pairs
data =  [(0.0, 0.0), 
         (0.0, 1.0), 
         (0.0, 0.0),
         (1.0, 0.0), 
         (1.0, 1.0), 
         (1.0, 1.0), 
         (1.0, 1.0), 
         (2.0, 2.0), 
         (2.0, 0.0)]
data_pd_df = pd.DataFrame(data, columns=["prediction", "label"])
data_spark_df = spark.createDataFrame(data, ["prediction", "label"])

evaluator_acc = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="f1")
evaluator_pre = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedRecall")

# sklearn expects (y_true, y_pred); average='weighted' mirrors the PySpark metrics
sklearn_acc = metrics.accuracy_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_f1 = metrics.f1_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')
sklearn_pre = metrics.precision_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')
sklearn_recall = metrics.recall_score(data_pd_df['label'], data_pd_df['prediction'], average='weighted')

print('pyspark accuracy: %.6f' % evaluator_acc.evaluate(data_spark_df))
print('pyspark f1-score: %.6f' % evaluator_f1.evaluate(data_spark_df))
print('pyspark precision: %.6f' % evaluator_pre.evaluate(data_spark_df))
print('pyspark recall: %.6f' % evaluator_recall.evaluate(data_spark_df))
print('-----------------------')
print('sklearn accuracy: %.6f' % sklearn_acc)
print('sklearn f1-score: %.6f' % sklearn_f1)
print('sklearn precision: %.6f' % sklearn_pre)
print('sklearn recall: %.6f' % sklearn_recall)

pyspark accuracy: 0.666667
pyspark f1-score: 0.661376
pyspark precision: 0.685185
pyspark recall: 0.666667
-----------------------
sklearn accuracy: 0.666667
sklearn f1-score: 0.661376
sklearn precision: 0.685185
sklearn recall: 0.666667

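As the code above shows, MulticlassClassificationEvaluator and sklearn.metrics produce identical results. Attentive readers will have noticed, though, that on the sklearn side the f1, precision, and recall calls needed one extremely important argument, average='weighted', before the two sets of results agreed.
In sklearn.metrics, the 'weighted' option of the average parameter is documented as follows: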

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance.
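In other words, the metric is computed separately for each label, and the per-label values are then averaged with weights given by each label's support (its number of true instances). Unlike the plain 'macro' average, this accounts for label imbalance.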
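
To make the weighted average concrete, here is a minimal pure-Python sketch that recomputes the numbers of the multiclass example by hand. The function name weighted_metrics is my own, made up for illustration; it is not part of PySpark or scikit-learn, and it only follows the documented definition rather than either library's actual implementation.

```python
from collections import Counter


def weighted_metrics(pairs):
    """Support-weighted (precision, recall, f1) for (prediction, label) pairs."""
    pairs = list(pairs)
    support = Counter(label for _, label in pairs)    # true instances per label
    predicted = Counter(pred for pred, _ in pairs)    # predictions per label
    correct = Counter(label for pred, label in pairs if pred == label)

    total = len(pairs)
    w_p = w_r = w_f1 = 0.0
    for label, n in support.items():
        # Per-label metrics; a label that is never predicted gets precision 0
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        # Weight each label by its share of the true instances
        weight = n / total
        w_p += weight * p
        w_r += weight * r
        w_f1 += weight * f1
    return w_p, w_r, w_f1


# Same (prediction, label) pairs as the multiclass example
data = [(0.0, 0.0), (0.0, 1.0), (0.0, 0.0),
        (1.0, 0.0), (1.0, 1.0), (1.0, 1.0),
        (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]

precision, recall, f1 = weighted_metrics(data)
print('precision: %.6f' % precision)  # 0.685185
print('recall: %.6f' % recall)        # 0.666667
print('f1-score: %.6f' % f1)          # 0.661376
```

These are the same precision, recall, and f1 values that both libraries printed for the multiclass data.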

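A pitfall

MulticlassClassificationEvaluator is not very friendly when it comes to evaluating binary classification results. In sklearn, the average parameter defaults to 'binary', whereas the f1, precision, and recall metrics of MulticlassClassificationEvaluator are all weighted. In that case the two produce completely different results, as the following example shows: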

# Create a binary dataset of (prediction, label) pairs
data =  [(0.0, 0.0), 
         (0.0, 0.0), 
         (0.0, 0.0),
         (1.0, 0.0), 
         (1.0, 1.0), 
         (1.0, 0.0), 
         (1.0, 1.0)]

data_pd_df = pd.DataFrame(data, columns=["prediction", "label"])
data_spark_df = spark.createDataFrame(data, ["prediction", "label"])

evaluator_acc = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="accuracy")
evaluator_f1 = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="f1")
evaluator_pre = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedPrecision")
evaluator_recall = MulticlassClassificationEvaluator(predictionCol="prediction", metricName="weightedRecall")

# No average argument here, so sklearn falls back to its default, average='binary'
sklearn_acc = metrics.accuracy_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_f1 = metrics.f1_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_pre = metrics.precision_score(data_pd_df['label'], data_pd_df['prediction'])
sklearn_recall = metrics.recall_score(data_pd_df['label'], data_pd_df['prediction'])

print('pyspark accuracy: %.6f' % evaluator_acc.evaluate(data_spark_df))
print('pyspark f1-score: %.6f' % evaluator_f1.evaluate(data_spark_df))
print('pyspark precision: %.6f' % evaluator_pre.evaluate(data_spark_df))
print('pyspark recall: %.6f' % evaluator_recall.evaluate(data_spark_df))
print('-----------------------')
print('sklearn accuracy: %.6f' % sklearn_acc)
print('sklearn f1-score: %.6f' % sklearn_f1)
print('sklearn precision: %.6f' % sklearn_pre)
print('sklearn recall: %.6f' % sklearn_recall)

pyspark accuracy: 0.714286
pyspark f1-score: 0.726190
pyspark precision: 0.857143
pyspark recall: 0.714286
-----------------------
sklearn accuracy: 0.714286
sklearn f1-score: 0.666667
sklearn precision: 0.500000
sklearn recall: 1.000000

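As the example above shows, only accuracy agrees; f1, precision, and recall differ substantially. For binary classification, try to avoid MulticlassClassificationEvaluator, because its results are a long way from the metrics we normally report for a binary classifier.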
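
The gap comes from which classes enter the average. As a rough sketch, the arithmetic below rebuilds both sets of precision and recall numbers from the confusion-matrix counts of the binary example, treating 1 as the positive class. The variable names are my own, and this is plain Python for illustration, not what either library runs internally.

```python
# (prediction, label) pairs from the binary example
data = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0),
        (1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

# Confusion-matrix counts with 1 as the positive class
tp = sum(1 for pred, y in data if pred == 1 and y == 1)
tn = sum(1 for pred, y in data if pred == 0 and y == 0)
fp = sum(1 for pred, y in data if pred == 1 and y == 0)
fn = sum(1 for pred, y in data if pred == 0 and y == 1)

# Positive class only: what sklearn's default average='binary' reports
pos_precision = tp / (tp + fp)
pos_recall = tp / (tp + fn)

# Negative class: the same formulas with the roles of 0 and 1 swapped
neg_precision = tn / (tn + fn)
neg_recall = tn / (tn + fp)

# Support-weighted average over both classes
n_pos, n_neg = tp + fn, tn + fp
n = n_pos + n_neg
weighted_precision = (n_pos * pos_precision + n_neg * neg_precision) / n
weighted_recall = (n_pos * pos_recall + n_neg * neg_recall) / n

print('binary precision: %.6f, weighted precision: %.6f'
      % (pos_precision, weighted_precision))
print('binary recall: %.6f, weighted recall: %.6f'
      % (pos_recall, weighted_recall))
```

With only two of the seven samples positive, the negative class dominates the weighted average, which is why the weighted precision (0.857143) sits so far above the positive-class precision (0.500000).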

What if you have to use it?

2020-5-21, continued
If you really have to, here is a function I wrote myself, for your reference.

from pyspark.sql import functions as F


def precise_recall_f1(pred_label, test_label, key_col):
    """Binary precision, recall and F1, with 1 as the positive class.

    pred_label must contain key_col and 'prediction'; test_label must
    contain key_col and 'label'. key_col is the join key shared by both
    DataFrames; it replaces the global USER_NO this function used to rely on.
    """
    p_label = pred_label.withColumnRenamed('prediction', 'p_label')
    d_label = test_label.withColumnRenamed('label', 'd_label')
    # Align each prediction with its true label via the shared key
    result = p_label.join(d_label, on=key_col, how='left')
    tp = result.filter((F.col('d_label') == 1) & (F.col('p_label') == 1)).count()
    fp = result.filter((F.col('d_label') == 0) & (F.col('p_label') == 1)).count()
    fn = result.filter((F.col('d_label') == 1) & (F.col('p_label') == 0)).count()
    # Return 0.0 instead of raising when a denominator is zero
    p = float(tp) / (tp + fp) if (tp + fp) else 0.0
    r = float(tp) / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
