Binary Classification Models: Distributed Spark Evaluation Code + Confusion Matrix

I've recently been working on a platform-level project. To keep the evaluation horizontally scalable, we ended up doing model evaluation with Spark MLlib. Spark MLlib ships with the common evaluation metrics for binary classification, multiclass classification, and clustering, so the standard metrics are straightforward to implement.

Key points:
 val metrics = new BinaryClassificationMetrics(scoreAndLabel, 100)
  Fetch the prediction column and the label column, and convert them to RDD[(Double, Double)].
  • On the second parameter of BinaryClassificationMetrics: it is a binning parameter. Your predictions may contain millions of distinct values, which would make both the computation and its result huge, so binning is introduced: the prediction column is down-sampled into bins before the curves are computed (see the sketch after this list).
  • Official docs: param: scoreAndLabels an RDD of (score, label) pairs. param: numBins if greater than 0, then the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". If 0, no down-sampling will occur. This is useful because the curve contains a point for each distinct score in the input, and this could be as large as the input itself -- millions of points or more, when thousands may be entirely sufficient to summarize the curve. After down-sampling, the curves will instead be made of approximately numBins points instead. Points are made from bins of equal numbers of consecutive points. The size of each bin is floor(scoreAndLabels.count() / numBins), which means the resulting number of bins may not exactly equal numBins. The last bin in each partition may be smaller as a result, meaning there may be an extra sample at partition boundaries
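To see the effect of the binning parameter, here is a minimal standalone sketch (toy data and an assumed existing SparkContext named sc; everything in it is illustrative only) comparing the number of ROC points with and without down-sampling:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// Toy (score, label) pairs with 100,000 distinct scores
val scoreAndLabel: RDD[(Double, Double)] = sc.parallelize(
  (1 to 100000).map(i => (i / 100000.0, if (i % 3 == 0) 1.0 else 0.0))
)

val exact  = new BinaryClassificationMetrics(scoreAndLabel)       // numBins = 0: no down-sampling
val binned = new BinaryClassificationMetrics(scoreAndLabel, 100)  // down-sample to ~100 bins

// One ROC point per distinct score (plus the (0,0) and (1,1) endpoints)
// versus roughly 100 points after binning
println(exact.roc().count())   // ~100002
println(binned.roc().count())  // ~102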
Straight to the code:

package com.tiger

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}


/**
 * Created by tiger on 2020-04-18
 */

object testHive {

  def main(args: Array[String]): Unit = {

    // The source data lives in Hive, so enable Hive support
    val sessionApp = SparkSession.builder().appName(args(0))
      .enableHiveSupport().master("local[2]").config("HADOOP_USER_NAME", "hive")
      .getOrCreate()

    // Query the Hive tables for the prediction and label columns.
    // model_code is selected as well because the filter below needs it.
    val rs: DataFrame = sessionApp.sql("select A.predict, B.realvalue, A.model_code " +
      "from predict A, lable B where A.transId = B.transId")
    val paramDf: DataFrame = sessionApp.read.json(args(1))
    paramDf.show()

    val num1 = rs.columns.indexOf("realvalue")
    val num2 = rs.columns.indexOf("predict")

    // Convert to an RDD[(Double, Double)] of (score, label) pairs
    val scoreAndLabel: RDD[(Double, Double)] = rs.rdd
      .filter(row => row.getAs[String]("model_code") == "modelcode1")
      .map(t => (t.getString(num2).toDouble, t.getString(num1).toDouble))

    println("==========scoreAndLabel=========")
    scoreAndLabel.foreach(println)

    // Spark MLlib's built-in binary classification metrics,
    // with the curves down-sampled to roughly 100 bins
    val metrics = new BinaryClassificationMetrics(scoreAndLabel, 100)

    // Precision at each threshold
    println("==========precisionByThreshold=========")
    val precision = metrics.precisionByThreshold().collect()
    println(precision.map { case (t, p) => s"$t:$p" }.mkString(","))

    // Recall at each threshold
    println("==========recallByThreshold=========")
    val recall = metrics.recallByThreshold().collect()
    println(recall.map { case (t, r) => s"$t:$r" }.mkString(","))

    // F-measure (F1) at each threshold
    println("==========fMeasureByThreshold=========")
    val f1 = metrics.fMeasureByThreshold().collect()
    println(f1.map { case (t, f) => s"$t:$f" }.mkString(","))

    // ROC curve: (false positive rate, true positive rate) points
    println("==========roc=========")
    val roc = metrics.roc().collect()
    println(roc.map { case (fpr, tpr) => s"$fpr:$tpr" }.mkString(","))

    // PR curve: (recall, precision) points
    println("==========pr=========")
    val pr = metrics.pr().collect()
    println(pr.map { case (r, p) => s"$r:$p" }.mkString(","))

    // Area under the ROC curve
    println("==========auc=========")
    println(metrics.areaUnderROC())
  }

}
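If you also need the area under the PR curve, or the list of thresholds that the *ByThreshold curves are evaluated at after binning, BinaryClassificationMetrics exposes both. A minimal sketch, reusing the metrics object from the code above:

// Area under the precision-recall curve
val auPR: Double = metrics.areaUnderPR()

// Thresholds (in descending order) used by the *ByThreshold curves
val thresholds: Array[Double] = metrics.thresholds().collect()

println(s"areaUnderPR=$auPR, ${thresholds.length} thresholds")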


 

============ Divider ======================================================

Addendum: the project had one more requirement: the binary classification evaluation also needs a dynamic confusion matrix, as shown below:

[Figure 1: dynamic confusion matrix]

The current Spark MLlib package does not provide a confusion matrix metric for binary classification. But study the confusion matrix for a moment: the precision and recall we output are both computed from it, so can we work backwards from precision and recall to recover the confusion matrix?

The answer is yes! We only need to compute two quantities first:

the total number of records and the total number of actual positives (label == 1):

val allCount = scoreAndLabel.count()

val labelCount = scoreAndLabel.filter(t => t._2 == 1.0).count()

Then, since recall = TP / (TP + FN) and precision = TP / (TP + FP):

TPnums = recall * labelCount

FNnums = labelCount - TPnums

FPnums = TPnums / precision - TPnums

TNnums = allCount - TPnums - FNnums - FPnums

This way the confusion matrix can be recomputed dynamically as the threshold behind recall and precision moves.
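Putting this together, a minimal sketch of the dynamic confusion matrix (it reuses scoreAndLabel and metrics from the code above; joining the two curves on their shared thresholds is my own way of pairing precision with recall, not something Spark provides directly):

// Totals needed for the reverse derivation
val allCount = scoreAndLabel.count()
val labelCount = scoreAndLabel.filter(t => t._2 == 1.0).count()

// Pair precision and recall at each threshold, then invert:
//   recall    = TP / (TP + FN)  =>  TP = recall * labelCount
//   precision = TP / (TP + FP)  =>  TP + FP = TP / precision
val confusionByThreshold = metrics.precisionByThreshold()
  .join(metrics.recallByThreshold())
  .map { case (threshold, (precision, recall)) =>
    val tp = recall * labelCount
    val fn = labelCount - tp
    val fp = if (precision > 0) tp / precision - tp else 0.0
    val tn = allCount - tp - fn - fp
    (threshold, (tp, fn, fp, tn))
  }

// One confusion matrix per threshold, highest threshold first
confusionByThreshold.sortByKey(ascending = false).collect().foreach {
  case (t, (tp, fn, fp, tn)) =>
    println(f"threshold=$t%.3f TP=$tp%.0f FN=$fn%.0f FP=$fp%.0f TN=$tn%.0f")
}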

 
