Source code analysis of LogisticRegression in spark.mllib

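Preface: I ran into problems while implementing multinomial logistic regression in Spark with several different optimizers, so I wrote up what I found.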

 

The code of interest is mainly LogisticRegressionModel and LogisticRegressionWithLBFGS.

Here is the source, with comments added to note the dimensions of some parameters:

import org.apache.spark.SparkContext
import org.apache.spark.annotation.Since
import org.apache.spark.ml.linalg.DenseMatrix
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.mllib.classification.impl.GLMClassificationModel
import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.mllib.linalg.BLAS.dot
import org.apache.spark.mllib.optimization._
import org.apache.spark.mllib.pmml.PMMLExportable
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.util.{DataValidators, Loader, Saveable}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

/**
 * Classification model trained using Multinomial/Binary Logistic Regression.
 *
 * @param weights Weights computed for every feature.
 * @param intercept Intercept computed for this model. (Only used in Binary Logistic Regression.
 *                  In Multinomial Logistic Regression, the intercepts will not be a single value,
 *                  so the intercepts will be part of the weights.)
 * @param numFeatures the dimension of the features.
 * @param numClasses the number of possible outcomes for k classes classification problem in
 *                   Multinomial Logistic Regression. By default, it is binary logistic regression
 *                   so numClasses will be set to 2.
 */
@Since("0.8.0")
class LogisticRegressionModel @Since("1.3.0") (
    @Since("1.0.0") override val weights: Vector,  // multinomial: (numClasses - 1) * numFeatures without bias, or (numClasses - 1) * (numFeatures + 1) with bias
    @Since("1.0.0") override val intercept: Double, // a single scalar
    @Since("1.3.0") val numFeatures: Int,
    @Since("1.3.0") val numClasses: Int)
  extends GeneralizedLinearModel(weights, intercept) with ClassificationModel with Serializable
  with Saveable with PMMLExportable {

  if (numClasses == 2) {
    require(weights.size == numFeatures,
      s"LogisticRegressionModel with numClasses = 2 was given non-matching values:" +
      s" numFeatures = $numFeatures, but weights.size = ${weights.size}")
  } else {
    val weightsSizeWithoutIntercept = (numClasses - 1) * numFeatures
    val weightsSizeWithIntercept = (numClasses - 1) * (numFeatures + 1) // with bias this equals (numClasses * numFeatures + numClasses) - numFeatures - 1
    // number of intercepts = numClasses - 1
    require(weights.size == weightsSizeWithoutIntercept || weights.size == weightsSizeWithIntercept,
      s"LogisticRegressionModel.load with numClasses = $numClasses and numFeatures = $numFeatures" +
      s" expected weights of length $weightsSizeWithoutIntercept (without intercept)" +
      s" or $weightsSizeWithIntercept (with intercept)," +
      s" but was given weights of length ${weights.size}")
  }

  private val dataWithBiasSize: Int = weights.size / (numClasses - 1) // numFeatures without bias; numFeatures + 1 with bias

  private val weightsArray: Array[Double] = weights match {   // length (numClasses - 1) * numFeatures without bias; (numClasses - 1) * (numFeatures + 1) with bias
    case dv: DenseVector => dv.values
    case _ =>
      throw new IllegalArgumentException(
        s"weights only supports dense vector but got type ${weights.getClass}.")
  }

  /**
   * Constructs a [[LogisticRegressionModel]] with weights and intercept for binary classification.
   */
  @Since("1.0.0")
  def this(weights: Vector, intercept: Double) = this(weights, intercept, weights.size, 2)

  private var threshold: Option[Double] = Some(0.5)

  /**
   * Sets the threshold that separates positive predictions from negative predictions
   * in Binary Logistic Regression. An example with prediction score greater than or equal to
   * this threshold is identified as a positive, and negative otherwise. The default value is 0.5.
   * It is only used for binary classification.
   */
  @Since("1.0.0")
  def setThreshold(threshold: Double): this.type = {
    this.threshold = Some(threshold)
    this
  }

  /**
   * Returns the threshold (if any) used for converting raw prediction scores into 0/1 predictions.
   * It is only used for binary classification.
   */
  @Since("1.3.0")
  def getThreshold: Option[Double] = threshold

  /**
   * Clears the threshold so that `predict` will output raw prediction scores.
   * It is only used for binary classification.
   */
  @Since("1.0.0")
  def clearThreshold(): this.type = {
    threshold = None
    this
  }


  override protected def predictPoint(
      dataMatrix: Vector,
      weightMatrix: Vector,
      intercept: Double) = {
    require(dataMatrix.size == numFeatures)   // the input vector must have exactly numFeatures entries

    // If dataMatrix and weightMatrix have the same dimension, it's binary logistic regression.
    if (numClasses == 2) {
      val margin = dot(weightMatrix, dataMatrix) + intercept
      val score = 1.0 / (1.0 + math.exp(-margin))
      threshold match {
        case Some(t) => if (score > t) 1.0 else 0.0
        case None => score
      }
    } else {
      /**
       * Compute and find the one with maximum margins. If the maxMargin is negative, then the
       * prediction result will be the first class.
       *
       * PS, if you want to compute the probabilities for each outcome instead of the outcome
       * with maximum probability, remember to subtract the maxMargin from margins if maxMargin
       * is positive to prevent overflow.
       */
      var bestClass = 0
      var maxMargin = 0.0
      val withBias = dataMatrix.size + 1 == dataWithBiasSize  // true when dataWithBiasSize (defined above) is numFeatures + 1, i.e. the weights include a bias term
      (0 until numClasses - 1).foreach { i =>    // i: index of the weight row; row i belongs to class i + 1
        var margin = 0.0
        dataMatrix.foreachNonZero { (index, value) =>    // index: feature index, 0 to numFeatures - 1
          margin += value * weightsArray((i * dataWithBiasSize) + index)
        }
        // Intercept is required to be added into margin.
        if (withBias) {
          margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size) // the bias sits at index numFeatures of row i
        }
        if (margin > maxMargin) {
          maxMargin = margin
          bestClass = i + 1
        }
      }
      bestClass.toDouble
    }
  }

  @Since("1.3.0")
  override def save(sc: SparkContext, path: String): Unit = {
    GLMClassificationModel.SaveLoadV1_0.save(sc, path, this.getClass.getName,
      numFeatures, numClasses, weights, intercept, threshold)
  }

  override def toString: String = {
    s"${super.toString}, numClasses = ${numClasses}, threshold = ${threshold.getOrElse("None")}"
  }
}

@Since("1.3.0")
object LogisticRegressionModel extends Loader[LogisticRegressionModel] {

  @Since("1.3.0")
  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
    val (loadedClassName, version, metadata) = Loader.loadMetadata(sc, path)
    // Hard-code class name string in case it changes in the future
    val classNameV1_0 = "org.apache.spark.mllib.classification.LogisticRegressionModel"
    (loadedClassName, version) match {
      case (className, "1.0") if className == classNameV1_0 =>
        val (numFeatures, numClasses) = ClassificationModel.getNumFeaturesClasses(metadata)
        val data = GLMClassificationModel.SaveLoadV1_0.loadData(sc, path, classNameV1_0)
        // numFeatures, numClasses, weights are checked in model initialization
        val model =
          new LogisticRegressionModel(data.weights, data.intercept, numFeatures, numClasses)
        data.threshold match {
          case Some(t) => model.setThreshold(t)
          case None => model.clearThreshold()
        }
        model
      case _ => throw new Exception(
        s"LogisticRegressionModel.load did not recognize model with (className, format version):" +
        s"($loadedClassName, $version).  Supported:\n" +
        s"  ($classNameV1_0, 1.0)")
    }
  }
}

@Since("0.8.0")
class LogisticRegressionWithSGD private[mllib] (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable {

  private val gradient = new LogisticGradient()
  private val updater = new SquaredL2Updater()
  @Since("0.8.0")
  override val optimizer = new GradientDescent(gradient, updater)
    .setStepSize(stepSize)
    .setNumIterations(numIterations)
    .setRegParam(regParam)
    .setMiniBatchFraction(miniBatchFraction)
  override protected val validators = List(DataValidators.binaryLabelValidator)

  override protected[mllib] def createModel(weights: Vector, intercept: Double) = {
    new LogisticRegressionModel(weights, intercept)
  }
}




/**
 * Train a classification model for Multinomial/Binary Logistic Regression using
 * Limited-memory BFGS. Standard feature scaling and L2 regularization are used by default.
 *
 * Earlier implementations of LogisticRegressionWithLBFGS applies a regularization
 * penalty to all elements including the intercept. If this is called with one of
 * standard updaters (L1Updater, or SquaredL2Updater) this is translated
 * into a call to ml.LogisticRegression, otherwise this will use the existing mllib
 * GeneralizedLinearAlgorithm trainer, resulting in a regularization penalty to the
 * intercept.
 *
 * @note Labels used in Logistic Regression should be {0, 1, ..., k - 1}
 * for k classes multi-label classification problem.
 */
@Since("1.1.0")
class LogisticRegressionWithLBFGS
  extends GeneralizedLinearAlgorithm[LogisticRegressionModel] with Serializable {

  this.setFeatureScaling(true)

  @Since("1.1.0")
  override val optimizer = new LBFGS(new LogisticGradient, new SquaredL2Updater)

  override protected val validators = List(multiLabelValidator)

  private def multiLabelValidator: RDD[LabeledPoint] => Boolean = { data =>
    if (numOfLinearPredictor > 1) {  // multinomial: numOfLinearPredictor = numClasses - 1
      DataValidators.multiLabelValidator(numOfLinearPredictor + 1)(data)
    } else {
      DataValidators.binaryLabelValidator(data)
    }
  }

  /**
   * Set the number of possible outcomes for k classes classification problem in
   * Multinomial Logistic Regression.
   * By default, it is binary logistic regression so k will be set to 2.
   */
  @Since("1.3.0")
  def setNumClasses(numClasses: Int): this.type = {
    require(numClasses > 1) 
    numOfLinearPredictor = numClasses - 1 // one fewer than the number of classes
    if (numClasses > 2) {
      optimizer.setGradient(new LogisticGradient(numClasses))
    }
    this
  }

  override protected def createModel(weights: Vector, intercept: Double) = {
    if (numOfLinearPredictor == 1) {  
      new LogisticRegressionModel(weights, intercept)
    } else {  // multinomial: numOfLinearPredictor = numClasses - 1
      new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1)  // the last argument is numClasses
    }
  }

  /**
   * Run Logistic Regression with the configured parameters on an input RDD
   * of LabeledPoint entries.
   *
   * If a known updater is used calls the ml implementation, to avoid
   * applying a regularization penalty to the intercept, otherwise
   * defaults to the mllib implementation. If more than two classes
   * or feature scaling is disabled, always uses mllib implementation.
   * If using ml implementation, uses ml code to generate initial weights.
   */
  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
    run(input, generateInitialWeights(input), userSuppliedWeights = false)
  }

  /**
   * Run Logistic Regression with the configured parameters on an input RDD
   * of LabeledPoint entries starting from the initial weights provided.
   *
   * If a known updater is used calls the ml implementation, to avoid
   * applying a regularization penalty to the intercept, otherwise
   * defaults to the mllib implementation. If more than two classes
   * or feature scaling is disabled, always uses mllib implementation.
   * Uses user provided weights.
   *
   * In the ml LogisticRegression implementation, the number of corrections
   * used in the LBFGS update can not be configured. So `optimizer.setNumCorrections()`
   * will have no effect if we fall into that route.
   */
  override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = {
    run(input, initialWeights, userSuppliedWeights = true)
  }

  private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean):
      LogisticRegressionModel = {
    // ml's Logistic regression only supports binary classification currently.
    if (numOfLinearPredictor == 1) {
      def runWithMlLogisticRegression(elasticNetParam: Double) = {
        // Prepare the ml LogisticRegression based on our settings
        val lr = new org.apache.spark.ml.classification.LogisticRegression()
        lr.setRegParam(optimizer.getRegParam())
        lr.setElasticNetParam(elasticNetParam)
        lr.setStandardization(useFeatureScaling)
        if (userSuppliedWeights) {
          val uid = Identifiable.randomUID("logreg-static")
          lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid,
            new DenseMatrix(1, initialWeights.size, initialWeights.toArray),
            Vectors.dense(1.0).asML, 2, false))
        }
        lr.setFitIntercept(addIntercept)
        lr.setMaxIter(optimizer.getNumIterations())
        lr.setTol(optimizer.getConvergenceTol())
        // Convert our input into a DataFrame
        val spark = SparkSession.builder().sparkContext(input.context).getOrCreate()
        val df = spark.createDataFrame(input.map(_.asML))
        // Determine if we should cache the DF
        val handlePersistence = input.getStorageLevel == StorageLevel.NONE
        // Train our model
        val mlLogisticRegressionModel = lr.train(df, handlePersistence)
        // convert the model
        val weights = Vectors.dense(mlLogisticRegressionModel.coefficients.toArray)
        createModel(weights, mlLogisticRegressionModel.intercept)
      }
      optimizer.getUpdater() match {
        case x: SquaredL2Updater => runWithMlLogisticRegression(0.0)
        case x: L1Updater => runWithMlLogisticRegression(1.0)
        case _ => super.run(input, initialWeights)
      }
    } else {
      super.run(input, initialWeights)
    }
  }
}

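The multinomial prediction logic deserves a closer look; I will come back to it further down.

What puzzled me first is that LogisticRegressionModel takes four parameters in the multinomial case: the first, weights, is a Vector, while the second, intercept, is a single Double.

The size check in the constructor shows the two accepted lengths:

val weightsSizeWithoutIntercept = (numClasses - 1) * numFeatures
val weightsSizeWithIntercept = (numClasses - 1) * (numFeatures + 1) // numFeatures + 1 inputs for each of numClasses - 1 classes; the remaining class follows from 1 minus the others

So weights comes in two variants, without and with bias. Here we consider the variant with bias.

Its length is therefore (numClasses - 1) * (numFeatures + 1) = (numClasses * numFeatures + numClasses) - numFeatures - 1.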
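
To make the two accepted lengths concrete, here is a minimal sketch in plain Scala. The object WeightSizes and its method names are hypothetical helpers introduced only for illustration; they are not part of Spark.

```scala
// Hypothetical helper (not part of Spark) mirroring the two weight-vector
// lengths that the multinomial constructor check accepts.
object WeightSizes {
  // Length when no bias term is stored.
  def withoutIntercept(numClasses: Int, numFeatures: Int): Int =
    (numClasses - 1) * numFeatures

  // Length when each of the numClasses - 1 rows carries one extra bias entry.
  def withIntercept(numClasses: Int, numFeatures: Int): Int =
    (numClasses - 1) * (numFeatures + 1)
}
```

For example, with 3 classes and 4 features the two lengths are 8 and 10, and 10 is also 3 * 4 + 3 - 4 - 1.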

 

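The program my instructor provided initializes the weights like this:

val weightsWithIntercept = Vectors.dense(new Array[Double]((numFeatures + 1) * numClass))

This can be read as a neural network with numFeatures + 1 input neurons and, judging by the array size, numClass output neurons.

The LR model is then built from these weights:

val model = new LogisticRegressionModel(
    Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - numFeatures - 1)),
    weightsWithIntercept(weightsWithIntercept.size - 1), // the last element of the array
    numFeatures, numClass
)

Note that slice(0, weightsWithIntercept.size - numFeatures - 1) excludes the element at index weightsWithIntercept.size - numFeatures - 1.

The slice therefore has weightsWithIntercept.size - numFeatures - 1 = (numClass - 1) * (numFeatures + 1) elements, which matches the weights length the model requires.

But then weights and intercept are not contiguous: the numFeatures elements of weightsWithIntercept between the end of the slice and the last element are never used.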
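
The gap can be checked with a little index arithmetic. The following is a sketch in plain Scala; the object TeacherLayout and its describe method are hypothetical names used only to illustrate the layout described above.

```scala
// Hypothetical helper (not part of Spark) describing how an array of
// (numFeatures + 1) * numClass entries is consumed by the construction above.
object TeacherLayout {
  // Returns (length of the slice passed as weights,
  //          index of the element passed as intercept,
  //          number of entries in between that are never read).
  def describe(numClass: Int, numFeatures: Int): (Int, Int, Int) = {
    val total = (numFeatures + 1) * numClass
    val sliceLen = total - numFeatures - 1   // slice(0, sliceLen) is half-open
    val interceptIndex = total - 1           // last element of the array
    val unused = interceptIndex - sliceLen   // indices sliceLen until interceptIndex
    (sliceLen, interceptIndex, unused)
  }
}
```

With 3 classes and 4 features the array has 15 entries: indices 0 to 9 become weights, index 14 becomes intercept, and indices 10 to 13 (exactly numFeatures entries) are left over.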

 

Because the weights passed to the model did not line up, I changed the initialization to:

val initialWeightsWithIntercept = Vectors.dense(new Array[Double]((numFeatures + 1) * numClass - numFeatures))

and built the model as:

val model = new LogisticRegressionModel(
      Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)), // numFeatures * numClass + numClass - numFeatures - 1 elements
      weightsWithIntercept(weightsWithIntercept.size - 1), // the last element of the array
      numFeatures, numClass)

The identity (numFeatures + 1) * numClass - numFeatures = (numFeatures + 1) * (numClass - 1) + 1 holds, so the sizes check out and weights and intercept are now contiguous.

But training these weights with LBFGS fails immediately with an error.

 

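The correct weight initialization:

Then it struck me: in Andrew Ng's machine learning course, when logistic regression is generalized to softmax regression as a generalized linear model, only numClass - 1 output units are used, so the weight matrix has (numFeatures + 1) * (numClass - 1) entries.

val initialWeightsWithIntercept = Vectors.dense(new Array[Double]((numFeatures + 1) * (numClass - 1)))

This size, (numFeatures + 1) * (numClass - 1), is the right one: there are numClass classes whose probabilities are tied together by the softmax relation and sum to 1, so it is enough to model numClass - 1 of them; the probability of the remaining class is 1 minus the sum of the others.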
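
The relation can be written out as follows. This is a sketch under stated assumptions: K stands for numClass, \tilde{x} is the feature vector with a constant 1 appended, w_k is the weight row for class k, and class 0 is taken as the reference class with margin 0, which is what the predictPoint code suggests since it starts from maxMargin = 0.0 and bestClass = 0.

```latex
\begin{aligned}
P(y = k \mid x) &= \frac{\exp\left(w_k^{\top}\tilde{x}\right)}
                        {1 + \sum_{j=1}^{K-1}\exp\left(w_j^{\top}\tilde{x}\right)},
  \qquad k = 1, \dots, K-1, \\
P(y = 0 \mid x) &= \frac{1}{1 + \sum_{j=1}^{K-1}\exp\left(w_j^{\top}\tilde{x}\right)}
                 = 1 - \sum_{k=1}^{K-1} P(y = k \mid x).
\end{aligned}
```

Only the K - 1 vectors w_1, ..., w_{K-1} appear, each with numFeatures + 1 entries, which gives the (numFeatures + 1) * (numClass - 1) total.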

The condition weights.size == (numFeatures + 1) * (numClass - 1) is satisfied as well.

With this initialization, both LBFGS and splash optimize correctly!

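Building the model:

val model = new LogisticRegressionModel(
      Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size)), // all numFeatures * numClass + numClass - numFeatures - 1 elements
      weightsWithIntercept(weightsWithIntercept.size - 1), // the last element of the array
      numFeatures, numClass)

As in Python, slice is half-open: it starts at index 0 and stops just before index size, so slice(0, weightsWithIntercept.size) returns every element. The bias entries are therefore already contained in weights.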

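Now back to the point deferred earlier, how multinomial prediction is computed:

The weight matrix has (numClass - 1) rows of (numFeatures + 1) entries. For each row index i from 0 to numClass - 2, the model computes a margin (a linear score, not yet a probability) for class i + 1; class 0 is the reference class with margin 0:

var margin = 0.0
dataMatrix.foreachNonZero { (index, value) =>    // index: feature index, 0 to numFeatures - 1
  margin += value * weightsArray((i * dataWithBiasSize) + index) // row i of the weight matrix
}
// Intercept is required to be added into margin.
if (withBias) {
  margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size) // the bias sits at index numFeatures of row i
}

margin += value * weightsArray((i * dataWithBiasSize) + index) multiplies each feature by the matching weight in row i and accumulates the products.

margin += weightsArray((i * dataWithBiasSize) + dataMatrix.size) adds 1 times the last weight of row i, which sits in column numFeatures (counting from 0).

That is, for one sample: W[c-1, f+1] * x[f+1, 1] = m[c-1, 1], where the square brackets give each variable's dimensions, x has a constant 1 appended, and m holds the margins. The predicted class is the one with the largest margin, or class 0 if no margin is positive.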
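
The margin loop above can be reproduced on plain arrays to experiment with the indexing outside Spark. This is a sketch, not the library code: the object MultinomialPredict and its predict method are hypothetical, and it uses dense Array[Double] inputs instead of mllib vectors.

```scala
// Hypothetical re-implementation (not part of Spark) of the multinomial
// branch of predictPoint, using plain arrays.
object MultinomialPredict {
  // weights is row-major with numClasses - 1 rows; row i holds the
  // coefficients for class i + 1, optionally followed by one bias entry.
  def predict(features: Array[Double], weights: Array[Double], numClasses: Int): Int = {
    val rowSize = weights.length / (numClasses - 1) // plays the role of dataWithBiasSize
    val withBias = features.length + 1 == rowSize
    var bestClass = 0
    var maxMargin = 0.0 // the reference class 0 has margin 0
    for (i <- 0 until numClasses - 1) {
      var margin = 0.0
      for (j <- features.indices) {
        margin += features(j) * weights(i * rowSize + j)
      }
      // The bias, if present, is the last entry of row i.
      if (withBias) margin += weights(i * rowSize + features.length)
      if (margin > maxMargin) {
        maxMargin = margin
        bestClass = i + 1
      }
    }
    bestClass
  }
}
```

With features (1.0, 2.0) and the two rows (1.0, 0.0, -5.0) and (0.0, 1.0, 0.5), the margins are -4.0 and 2.5, so the prediction is class 2. If every margin is negative, the result stays at class 0.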

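The second constructor parameter, intercept: Double, turns out not to be used in this branch. With Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size)), every element of weightsWithIntercept has already been passed in as weights, and inside the model these weights are treated as weights with bias, not without. The second parameter can therefore simply be set to 0.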

 

Summary

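Splitting the feature weights and the bias into two separate constructor parameters is an odd design and invites exactly this confusion... It takes the theory to work out what the code is actually doing.

So in the multinomial case intercept plays no role at all. In the binary case it does: the constructor requires weights.size == numFeatures and predictPoint adds intercept to the margin, so weightsWithIntercept has to be split into the feature weights and the bias rather than passed whole as the first argument.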
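
For the binary case, the listing shows that the constructor requires weights.size == numFeatures and that predictPoint computes dot(weights, features) + intercept followed by a sigmoid. A minimal sketch of splitting a combined array accordingly is given below. The object BinarySplit is hypothetical, and it assumes the bias is stored as the last entry of the combined array, as in the snippets above.

```scala
// Hypothetical helper (not part of Spark) for the binary case.
object BinarySplit {
  // Assumption: the bias is the last entry of weightsWithIntercept.
  def split(weightsWithIntercept: Array[Double]): (Array[Double], Double) =
    (weightsWithIntercept.init, weightsWithIntercept.last)

  // Same formula as the binary branch of predictPoint with the threshold cleared.
  def score(features: Array[Double], weights: Array[Double], intercept: Double): Double = {
    require(weights.length == features.length,
      "binary weights must have exactly numFeatures entries")
    val margin = features.zip(weights).map { case (x, w) => x * w }.sum + intercept
    1.0 / (1.0 + math.exp(-margin))
  }
}
```

Passing the whole combined array as weights would trip the length check here, just as it trips the require in the model constructor.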

 

 
