Abstract: This article walks through Spark ML's logistic regression implementation in five parts: LogisticCostFun, binaryUpdateInPlace, multinomialUpdateInPlace, LogisticAggregator, and train. Specifically,
(1) LogisticCostFun: computes the loss function and its gradient
(2) binaryUpdateInPlace: updates the gradient and loss for the binary-classification model
(3) multinomialUpdateInPlace: updates the gradient and loss for the multi-class model
(4) LogisticAggregator: aggregates the per-partition results of the distributed LR computation
(5) train: drives the training
The conclusions are as follows:
1) Spark only distributes the samples; the parameters are not distributed.
2) Training scales well as the number of samples grows.
3) Training is affected as the number of parameters grows (Spark still broadcasts all of LR's parameters to every node).
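Point 3 can be illustrated with a small, self-contained sketch (not the Spark source; the data and gradient formula are toy examples) of the pattern LogisticCostFun uses below: the full coefficient vector is broadcast to every executor, while only the samples are partitioned and reduced with treeAggregate.

import org.apache.spark.sql.SparkSession

object BroadcastGradientSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sketch").getOrCreate()
    val sc = spark.sparkContext
    // toy samples: (label, features)
    val data = sc.parallelize(Seq(
      (1.0, Array(1.0, 2.0)),
      (0.0, Array(0.5, -1.0))))
    val coefficients = Array(0.1, -0.2)       // ALL parameters live on the driver...
    val bcCoeffs = sc.broadcast(coefficients) // ...and are shipped to every node
    // per-partition gradients of the binary logistic loss, summed with treeAggregate
    val gradient = data.treeAggregate(Array.ofDim[Double](coefficients.length))(
      seqOp = { (grad, sample) =>
        val (label, x) = sample
        val margin = -x.zip(bcCoeffs.value).map { case (xi, wi) => xi * wi }.sum
        val multiplier = 1.0 / (1.0 + math.exp(margin)) - label
        x.indices.foreach(i => grad(i) += multiplier * x(i))
        grad
      },
      combOp = { (g1, g2) => g1.indices.foreach(i => g1(i) += g2(i)); g1 })
    println(gradient.mkString(", "))
    spark.stop()
  }
}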
/**
* LogisticCostFun implements Breeze's DiffFunction[T] for a multinomial (softmax) logistic loss
* function, as used in multi-class classification (it is also used in binary logistic regression).
* It returns the loss and gradient with L2 regularization at a particular point (coefficients).
* It's used in Breeze's convex optimization routines.
*/
private class LogisticCostFun(
instances: RDD[Instance],
numClasses: Int,
fitIntercept: Boolean,
standardization: Boolean,
bcFeaturesStd: Broadcast[Array[Double]],
regParamL2: Double,
multinomial: Boolean,
aggregationDepth: Int) extends DiffFunction[BDV[Double]] {
override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
val coeffs = Vectors.fromBreeze(coefficients)
val bcCoeffs = instances.context.broadcast(coeffs)
val featuresStd = bcFeaturesStd.value
val numFeatures = featuresStd.length
val numCoefficientSets = if (multinomial) numClasses else 1
val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures
val logisticAggregator = {
val seqOp = (c: LogisticAggregator, instance: Instance) => c.add(instance)
val combOp = (c1: LogisticAggregator, c2: LogisticAggregator) => c1.merge(c2)
instances.treeAggregate(
new LogisticAggregator(bcCoeffs, bcFeaturesStd, numClasses, fitIntercept,
multinomial)
)(seqOp, combOp, aggregationDepth)
}
val totalGradientMatrix = logisticAggregator.gradient
val coefMatrix = new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, coeffs.toArray)
// regVal is the sum of coefficients squares excluding intercept for L2 regularization.
val regVal = if (regParamL2 == 0.0) {
0.0
} else {
var sum = 0.0
coefMatrix.foreachActive { case (classIndex, featureIndex, value) =>
// We do not apply regularization to the intercepts
val isIntercept = fitIntercept && (featureIndex == numFeatures)
if (!isIntercept) {
// The following code will compute the loss of the regularization; also
// the gradient of the regularization, and add back to totalGradientArray.
sum += {
if (standardization) {
val gradValue = totalGradientMatrix(classIndex, featureIndex) // read the current gradient at (classIndex, featureIndex)
totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * value) // the gradient of the L2 term is regParamL2 * value
value * value // the loss contribution of the L2 term is value * value
} else {
if (featuresStd(featureIndex) != 0.0) {
// If `standardization` is false, we still standardize the data
// to improve the rate of convergence; as a result, we have to
// perform this reverse standardization by penalizing each component
// differently to get effectively the same objective function when
// the training dataset is not standardized.
val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex)) // divide the coefficient by the variance of this feature
val gradValue = totalGradientMatrix(classIndex, featureIndex)
totalGradientMatrix.update(classIndex, featureIndex, gradValue + regParamL2 * temp)
value * temp // the loss contribution is likewise divided by the variance
} else { // if the standard deviation of this feature is 0.0, no scaling is needed
0.0
}
}
}
}
}
0.5 * regParamL2 * sum // total loss of the L2 regularization term
}
bcCoeffs.destroy(blocking = false)
(logisticAggregator.loss + regVal, new BDV(totalGradientMatrix.toArray)) // return (loss, gradient)
}
}
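Putting the pieces together: with standardization = true, the value returned by calculate is the weighted average loss from the aggregator plus the L2 term, with $\lambda_2$ = regParamL2 and the intercepts excluded from the penalty:

$$f(\beta) = \frac{\sum_i w_i\,\ell(\beta, \vec{x}_i)}{\sum_i w_i} + \frac{\lambda_2}{2}\sum_{k}\sum_{j \neq \text{intercept}} \beta_{k,j}^2$$

and the returned gradient is the corresponding derivative, i.e. the averaged aggregator gradient plus $\lambda_2 \beta_{k,j}$ for the non-intercept entries.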
/** Update gradient and loss using binary loss function. */
private def binaryUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {
val localFeaturesStd = bcFeaturesStd.value
val localCoefficients = bcCoefficients.value
val localGradientArray = gradientSumArray
val margin = - {
var sum = 0.0
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
sum += localCoefficients(index) * value / localFeaturesStd(index) // sum = x * W^{T}
}
}
if (fitIntercept) sum += localCoefficients(numFeaturesPlusIntercept - 1) // sum = x * W^{T} + b
sum
}
val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label) // weight is this sample's weight
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
localGradientArray(index) += multiplier * value / localFeaturesStd(index) // accumulate the gradient
}
}
if (fitIntercept) {
localGradientArray(numFeaturesPlusIntercept - 1) += multiplier
} // accumulate the gradient of the intercept b
if (label > 0) {
// In MLUtils.log1pExp(x), if (x > 0) then log(1+exp(x)) is computed as x + log(1+exp(-x)) for numerical stability
lossSum += weight * MLUtils.log1pExp(margin) //loss = log(1+exp(margin))
} else {
lossSum += weight * (MLUtils.log1pExp(margin) - margin) //loss = log(1+exp(margin)) - margin
}
}
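In equation form, what binaryUpdateInPlace accumulates per instance (with standardized features $x_j/\sigma_j$, sample weight $w_i$ and label $y_i$) is:

$$\text{margin} = -\left(\sum_j \beta_j\frac{x_j}{\sigma_j} + b\right),\qquad \text{multiplier} = w_i\left(\frac{1}{1+e^{\text{margin}}} - y_i\right)$$

$$\text{grad}_j \mathrel{+}= \text{multiplier}\cdot\frac{x_j}{\sigma_j},\qquad \ell_i = \begin{cases} w_i\log(1+e^{\text{margin}}) & y_i = 1\\ w_i\left(\log(1+e^{\text{margin}}) - \text{margin}\right) & y_i = 0\end{cases}$$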
Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a "pivot" and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces K sets of coefficients, while using a pivot class produces K - 1 sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes. When regularization is applied, multinomial (softmax) loss will produce a result different from binary loss since the positive and negative classes do not share coefficients, while binary regression shares the coefficients between positive and negative.
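Concretely, the binary (pivot) formulation keeps a single coefficient vector $\vec{\beta}$ and models

$$P(y_i=1|\vec{x}_i,\beta)=\frac{1}{1+e^{-\vec{x}_i^T\vec{\beta}}},\qquad P(y_i=0|\vec{x}_i,\beta)=1-P(y_i=1|\vec{x}_i,\beta)$$

whereas the softmax formulation below keeps one vector $\vec{\beta}_k$ per class.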
The following is a mathematical derivation for the multinomial (softmax) loss. The probability of the multinomial outcome $y_i$ taking on any of the $K$ possible outcomes is:

$$P(y_i=0|\vec{x}_i,\beta)=\frac{e^{\vec{x}_i^T\vec{\beta}_0}}{\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}},\quad P(y_i=1|\vec{x}_i,\beta)=\frac{e^{\vec{x}_i^T\vec{\beta}_1}}{\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}},\quad \ldots,\quad P(y_i=K-1|\vec{x}_i,\beta)=\frac{e^{\vec{x}_i^T\vec{\beta}_{K-1}}}{\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}}$$
The model coefficients $\beta=(\beta_0,\beta_1,\beta_2,\ldots,\beta_{K-1})$ become a matrix of dimension $K \times (N+1)$ if the intercepts are added. If the intercepts are not added, the dimension will be $K \times N$.
Note that the coefficients in the model above lack identifiability. That is, any constant vector $\vec{c}$ can be added to all of the coefficient vectors and the probabilities remain the same:
$$\frac{e^{\vec{x}_i^T(\vec{\beta}_0+\vec{c})}}{\sum_{k=0}^{K-1} e^{\vec{x}_i^T(\vec{\beta}_k+\vec{c})}}=\frac{e^{\vec{x}_i^T\vec{\beta}_0}\, e^{\vec{x}_i^T\vec{c}}}{e^{\vec{x}_i^T\vec{c}}\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}}=\frac{e^{\vec{x}_i^T\vec{\beta}_0}}{\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}}\tag{7}$$
However, when regularization is added to the loss function, the coefficients are indeed identifiable because there is only one set of coefficients which minimizes the regularization term. When no regularization is applied, we choose the coefficients with the minimum L2 penalty for consistency and reproducibility. For further discussion see: Friedman, et al. "Regularization Paths for Generalized Linear Models via Coordinate Descent".

The loss of the objective function for a single instance of data (the regularization term is omitted here for simplicity) can be written as
$$\ell(\beta, x_i) = -\log P(y_i|\vec{x}_i,\beta) = \log\left(\sum_{k=0}^{K-1} e^{\vec{x}_i^T\vec{\beta}_k}\right)-\vec{x}_i^T\vec{\beta}_y = \log\left(\sum_{k=0}^{K-1} e^{\mathrm{margins}_k}\right)-\mathrm{margins}_y \tag{8-10}$$

where $\mathrm{margins}_k = \vec{x}_i^T\vec{\beta}_k$.
For optimization, we have to calculate the first derivative of the loss function, and a simple calculation shows that
$$\frac{\partial \ell(\beta,\vec{x}_i,w_i)}{\partial \beta_{j,k}}=x_{i,j}\cdot w_i\cdot\left(\frac{e^{\vec{x}_i\cdot\vec{\beta}_k}}{\sum_{k'=0}^{K-1} e^{\vec{x}_i\cdot\vec{\beta}_{k'}}}-I_{y=k}\right)=x_{i,j}\cdot w_i\cdot \mathrm{multiplier}_k \tag{11-12}$$

where $w_i$ is the sample weight, $I_{y=k}$ is the indicator function

$$I_{y=k}=\begin{cases}1 & y=k\\ 0 & \text{otherwise}\end{cases}$$

and

$$\mathrm{multiplier}_k=\frac{e^{\vec{x}_i\cdot\vec{\beta}_k}}{\sum_{k'=0}^{K-1} e^{\vec{x}_i\cdot\vec{\beta}_{k'}}}-I_{y=k}$$
If any of margins is larger than 709.78, the numerical computation of multiplier and loss function will suffer from arithmetic overflow. This issue occurs when there are outliers in data which are far away from the hyperplane, and this will cause the failing of training once infinity is introduced. Note that this is only a concern when max(margins) > 0.
Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can easily be rewritten into the following equivalent numerically stable formula.
$$\ell(\beta,x)=\log\left(\sum_{k=0}^{K-1} e^{\mathrm{margins}_k-\mathrm{maxMargin}}\right)-\mathrm{margins}_y+\mathrm{maxMargin}$$

Note that each term $(\mathrm{margins}_k-\mathrm{maxMargin})$ in the exponential is no greater than zero; as a result, overflow will not happen with this formula.

For the multiplier, a similar trick can be applied:

$$\mathrm{multiplier}_k=\frac{e^{\vec{x}_i\cdot\vec{\beta}_k-\mathrm{maxMargin}}}{\sum_{k'=0}^{K-1} e^{\vec{x}_i\cdot\vec{\beta}_{k'}-\mathrm{maxMargin}}}-I_{y=k}$$
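A minimal, standalone sketch of this max-margin shift (the same trick multinomialUpdateInPlace applies below; not taken from the Spark source):

// Numerically stable softmax probabilities via the max-margin shift.
def stableSoftmax(margins: Array[Double]): Array[Double] = {
  val maxMargin = margins.max
  // subtracting the max keeps every exponent <= 0, so exp never overflows
  val shifted = margins.map(m => math.exp(m - maxMargin))
  val sum = shifted.sum
  shifted.map(_ / sum)
}
// e.g. stableSoftmax(Array(1000.0, 1001.0)) ~= Array(0.269, 0.731),
// whereas the naive math.exp(1000.0) already overflows to Infinity.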
@param bcCoefficients The broadcast coefficients corresponding to the features.
@param bcFeaturesStd The broadcast standard deviation values of the features.
@param numClasses the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression.
@param fitIntercept Whether to fit an intercept term.
@param multinomial Whether to use multinomial (softmax) or binary loss
@note In order to avoid unnecessary computation during calculation of the gradient updates we lay out the coefficients in column major order during training. This allows us to perform feature standardization once, while still retaining sequential memory access for speed. We convert back to row major order when we create the model, since this form is optimal for the matrix operations used for prediction.
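A small sketch (not Spark code) of the column-major layout this note describes: with numClasses rows and the features (plus the optional intercept) as columns, the flat index of the coefficient for class j and feature i is i * numClasses + j, which is exactly the access pattern used by multinomialUpdateInPlace below.

// Column-major layout of a numClasses x (numFeatures + 1) coefficient matrix,
// flattened into a single array as used during training.
val numClasses = 3
val numFeatures = 2
val coefficients = Array.ofDim[Double](numClasses * (numFeatures + 1)) // +1 for the intercept column
// coefficient of class j for feature i:
def coef(i: Int, j: Int): Double = coefficients(i * numClasses + j)
// intercept of class j sits after all feature columns:
def intercept(j: Int): Double = coefficients(numFeatures * numClasses + j)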
/** Update gradient and loss using multinomial (softmax) loss function. */
private def multinomialUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {
// TODO: use level 2 BLAS operations
/*
Note: this can still be used when numClasses = 2 for binary
logistic regression without pivoting.
*/
val localFeaturesStd = bcFeaturesStd.value
val localCoefficients = bcCoefficients.value
val localGradientArray = gradientSumArray
// marginOfLabel is margins(label) in the formula
var marginOfLabel = 0.0
var maxMargin = Double.NegativeInfinity
val margins = new Array[Double](numClasses)
features.foreachActive { (index, value) =>
val stdValue = value / localFeaturesStd(index)
var j = 0
while (j < numClasses) {
margins(j) += localCoefficients(index * numClasses + j) * stdValue
j += 1
}
}
var i = 0
while (i < numClasses) {
if (fitIntercept) {
margins(i) += localCoefficients(numClasses * numFeatures + i)
}
if (i == label.toInt) marginOfLabel = margins(i)
if (margins(i) > maxMargin) {
maxMargin = margins(i)
}
i += 1
}
/**
* When maxMargin is greater than 0, the original formula could cause overflow.
* We address this by subtracting maxMargin from all the margins, so it's guaranteed
* that all of the new margins will be smaller than zero to prevent arithmetic overflow.
*/
val multipliers = new Array[Double](numClasses)
val sum = {
var temp = 0.0
var i = 0
while (i < numClasses) {
if (maxMargin > 0) margins(i) -= maxMargin
val exp = math.exp(margins(i))
temp += exp
multipliers(i) = exp
i += 1
}
temp
}
margins.indices.foreach { i =>
multipliers(i) = multipliers(i) / sum - (if (label == i) 1.0 else 0.0)
}
features.foreachActive { (index, value) =>
if (localFeaturesStd(index) != 0.0 && value != 0.0) {
val stdValue = value / localFeaturesStd(index)
var j = 0
while (j < numClasses) {
localGradientArray(index * numClasses + j) +=
weight * multipliers(j) * stdValue
j += 1
}
}
}
if (fitIntercept) {
var i = 0
while (i < numClasses) {
localGradientArray(numFeatures * numClasses + i) += weight * multipliers(i)
i += 1
}
}
val loss = if (maxMargin > 0) {
math.log(sum) - marginOfLabel + maxMargin
} else {
math.log(sum) - marginOfLabel
}
lossSum += weight * loss
}
LogisticAggregator computes the gradient and loss for binary or multinomial logistic (softmax) loss function, as used in classification for instances in sparse or dense vector in an online fashion.
Two LogisticAggregators can be merged together to have a summary of loss and gradient of the corresponding joint dataset.
To improve the convergence rate during the optimization process, and to prevent features with very large variances from exerting an overly large influence during model training, packages like R's GLMNET scale the features to unit variance and remove the mean in order to reduce the condition number. The model is then trained in this scaled space, but the coefficients are returned in the original scale. See page 9 in http://cran.r-project.org/web/packages/glmnet/glmnet.pdf
However, we don’t want to apply the [[org.apache.spark.ml.feature.StandardScaler]] on the training dataset, and then cache the standardized dataset since it will create a lot of overhead.
As a result, we perform the scaling implicitly when we compute the objective function (though we do not subtract the mean).
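Concretely, if $\hat{\beta}_j$ are the coefficients learned against the standardized features $x_j/\sigma_j$, the model is unchanged when the scaling is folded back into the coefficients, which is why train divides by featuresStd when building the final matrix:

$$\sum_j \hat{\beta}_j\,\frac{x_j}{\sigma_j} + b = \sum_j\left(\frac{\hat{\beta}_j}{\sigma_j}\right)x_j + b \quad\Longrightarrow\quad \beta_j = \frac{\hat{\beta}_j}{\sigma_j}$$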
private class LogisticAggregator(
bcCoefficients: Broadcast[Vector],
bcFeaturesStd: Broadcast[Array[Double]],
numClasses: Int,
fitIntercept: Boolean,
multinomial: Boolean) extends Serializable with Logging {
private val numFeatures = bcFeaturesStd.value.length
private val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures
private val coefficientSize = bcCoefficients.value.size
private val numCoefficientSets = if (multinomial) numClasses else 1
if (multinomial) {
require(numClasses == coefficientSize / numFeaturesPlusIntercept, s"The number of " +
s"coefficients should be ${numClasses * numFeaturesPlusIntercept} but was $coefficientSize")
} else {
require(coefficientSize == numFeaturesPlusIntercept, s"Expected $numFeaturesPlusIntercept " +
s"coefficients but got $coefficientSize")
require(numClasses == 1 || numClasses == 2, s"Binary logistic aggregator requires numClasses " +
s"in {1, 2} but found $numClasses.")
}
private var weightSum = 0.0
private var lossSum = 0.0
private val gradientSumArray = Array.ofDim[Double](coefficientSize)
if (multinomial && numClasses <= 2) {
logInfo(s"Multinomial logistic regression for binary classification yields separate " +
s"coefficients for positive and negative classes. When no regularization is applied, the" +
s"result will be effectively the same as binary logistic regression. When regularization" +
s"is applied, multinomial loss will produce a result different from binary loss.")
}
/** Update gradient and loss using binary loss function. */
private def binaryUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {
}
/** Update gradient and loss using multinomial (softmax) loss function. */
private def multinomialUpdateInPlace(
features: Vector,
weight: Double,
label: Double): Unit = {
}
def add(instance: Instance): this.type = {
instance match { case Instance(label, weight, features) =>
require(numFeatures == features.size, s"Dimensions mismatch when adding new instance." +
s" Expecting $numFeatures but got ${features.size}.")
require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")
if (weight == 0.0) return this
if (multinomial) {
multinomialUpdateInPlace(features, weight, label)
} else {
binaryUpdateInPlace(features, weight, label)
}
weightSum += weight
this
}
}
def merge(other: LogisticAggregator): this.type = {
require(numFeatures == other.numFeatures, s"Dimensions mismatch when merging with another " +
s"LogisticAggregator. Expecting $numFeatures but got ${other.numFeatures}.")
if (other.weightSum != 0.0) {
weightSum += other.weightSum
lossSum += other.lossSum
var i = 0
val localThisGradientSumArray = this.gradientSumArray
val localOtherGradientSumArray = other.gradientSumArray
val len = localThisGradientSumArray.length
while (i < len) {
localThisGradientSumArray(i) += localOtherGradientSumArray(i)
i += 1
}
}
this
}
def loss: Double = {
require(weightSum > 0.0, s"The effective number of instances should be " +
s"greater than 0.0, but $weightSum.")
lossSum / weightSum
}
def gradient: Matrix = {
require(weightSum > 0.0, s"The effective number of instances should be " +
s"greater than 0.0, but $weightSum.")
val result = Vectors.dense(gradientSumArray.clone())
scal(1.0 / weightSum, result)
new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept, result.toArray)
}
}
For multinomial logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercepts such that it follows the distribution of the labels.
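The reasoning behind this initialization (implemented further down, where the raw intercepts are mean-centered): with all coefficients at zero, the class probabilities depend only on the intercepts, so choosing the intercepts from the label counts makes the initial predictions match the label distribution:

$$P(y=k)=\frac{e^{b_k}}{\sum_{k'=0}^{K-1} e^{b_{k'}}}\quad\Longrightarrow\quad b_k=\log(\mathrm{count}_k+1)-\frac{1}{K}\sum_{k'=0}^{K-1}\log(\mathrm{count}_{k'}+1)$$

In the binary case this reduces to $b=\log(\mathrm{count}_1/\mathrm{count}_0)$, which is what the non-multinomial branch below sets.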
protected[spark] def train(
dataset: Dataset[_],
handlePersistence: Boolean): LogisticRegressionModel = {
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol)) // if no weight column is defined, use a constant weight of 1.0
val instances: RDD[Instance] = // convert the training dataset into an RDD[Instance]
dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
case Row(label: Double, weight: Double, features: Vector) =>
Instance(label, weight, features)
}
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) // optionally persist the instance RDD
val instr = Instrumentation.create(this, instances) // instrumentation for logging useful information during training
instr.logParams(regParam, elasticNetParam, standardization, threshold,
maxIter, tol, fitIntercept)
val (summarizer, labelSummarizer) = {
//MultivariateOnlineSummarizer computes the mean, variance, minimum, maximum, counts, and nonzero counts for instances
//MultiClassSummarizer computes the number of distinct labels and corresponding counts
val seqOp = (c: (MultivariateOnlineSummarizer, MultiClassSummarizer),
instance: Instance) =>
(c._1.add(instance.features, instance.weight), c._2.add(instance.label, instance.weight))
val combOp = (c1: (MultivariateOnlineSummarizer, MultiClassSummarizer),
c2: (MultivariateOnlineSummarizer, MultiClassSummarizer)) =>
(c1._1.merge(c2._1), c1._2.merge(c2._2))
instances.treeAggregate(
new MultivariateOnlineSummarizer, new MultiClassSummarizer
)(seqOp, combOp, $(aggregationDepth))
}
val histogram = labelSummarizer.histogram // total weight per label: Array[weightSum of class i]
val numInvalid = labelSummarizer.countInvalid // number of invalid samples, i.e. those with (label - label.toInt != 0.0 || label < 0)
val numFeatures = summarizer.mean.size // number of features
val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures // one extra dimension if an intercept is fitted
val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
case Some(n: Int) =>
require(n >= histogram.length, s"Specified number of classes $n was " +
s"less than the number of unique labels ${histogram.length}.")
n
case None => histogram.length
}
val isMultinomial = $(family) match {
case "binomial" =>
require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
s"outcome classes but found $numClasses.")
false
case "multinomial" => true
case "auto" => numClasses > 2
case other => throw new IllegalArgumentException(s"Unsupported family: $other")
}
val numCoefficientSets = if (isMultinomial) numClasses else 1 // how many sets of coefficients to train
if (isDefined(thresholds)) { // validate the thresholds array
require($(thresholds).length == numClasses, this.getClass.getSimpleName +
".train() called with non-matching numClasses and thresholds.length." +
s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
}
instr.logNumClasses(numClasses)
instr.logNumFeatures(numFeatures)
val (coefficientMatrix, interceptVector, objectiveHistory) = {
if (numInvalid != 0) { // throw an exception if there are invalid labels
val msg = s"Classification labels should be in [0 to ${numClasses - 1}]. " +
s"Found $numInvalid invalid labels."
logError(msg)
throw new SparkException(msg)
}
val isConstantLabel = histogram.count(_ != 0.0) == 1 // use the label weights to check whether only one label is present
if ($(fitIntercept) && isConstantLabel) {
logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
s"will be zeros. Training is not needed.")
val constantLabelIndex = Vectors.dense(histogram).argmax
// TODO: use `compressed` after SPARK-17471
val coefMatrix = if (numFeatures < numCoefficientSets) {
// coefMatrix (numCoefficientSets rows x numFeatures cols), stored sparsely by column
new SparseMatrix(numCoefficientSets, numFeatures,
Array.fill(numFeatures + 1)(0), Array.empty[Int], Array.empty[Double])
} else {
new SparseMatrix(numCoefficientSets, numFeatures, Array.fill(numCoefficientSets + 1)(0),
Array.empty[Int], Array.empty[Double], isTransposed = true)
}
val interceptVec = if (isMultinomial) {
Vectors.sparse(numClasses, Seq((constantLabelIndex, Double.PositiveInfinity)))
} else {
Vectors.dense(if (numClasses == 2) Double.PositiveInfinity else Double.NegativeInfinity)
}
(coefMatrix, interceptVec, Array.empty[Double])
} else {
if (!$(fitIntercept) && isConstantLabel) {
logWarning(s"All labels belong to a single class and fitIntercept=false. It's a " +
s"dangerous ground, so the algorithm may not converge.")
}
val featuresMean = summarizer.mean.toArray // per-feature means
val featuresStd = summarizer.variance.toArray.map(math.sqrt) // per-feature standard deviations
if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) {
logWarning("Fitting LogisticRegressionModel without intercept on dataset with " +
"constant nonzero column, Spark MLlib outputs zero coefficients for constant " +
"nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.")
}
val regParamL1 = $(elasticNetParam) * $(regParam) // L1 regularization parameter
val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) // L2 regularization parameter
val bcFeaturesStd = instances.context.broadcast(featuresStd) // another broadcast; this becomes a problem when the feature dimension is very large
val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept),
$(standardization), bcFeaturesStd, regParamL2, multinomial = isMultinomial,
$(aggregationDepth))
val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { // no L1 regularization
new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) //LBFGS
} else {
val standardizationParam = $(standardization)
def regParamL1Fun = (index: Int) => {
// Remove the L1 penalization on the intercept
val isIntercept = $(fitIntercept) && index >= numFeatures * numCoefficientSets // coefMatrix is stored in column-major order, so indices >= numFeatures * numCoefficientSets all correspond to intercepts
if (isIntercept) { // no L1 penalty on the intercept
0.0
} else {
if (standardizationParam) { // standard L1 penalty
regParamL1
} else { // L1 penalty adjusted for the implicit scaling
val featureIndex = index / numCoefficientSets
// If `standardization` is false, we still standardize the data
// to improve the rate of convergence; as a result, we have to
// perform this reverse standardization by penalizing each component
// differently to get effectively the same objective function when
// the training dataset is not standardized.
if (featuresStd(featureIndex) != 0.0) {
regParamL1 / featuresStd(featureIndex)
} else {
0.0
}
}
}
}
new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) //Implements the Orthant-wise Limited Memory QuasiNewton method
}
/*
The coefficients are laid out in column major order during training. Here we initialize
a column major matrix of initial coefficients.
*/
val initialCoefWithInterceptMatrix =
Matrices.zeros(numCoefficientSets, numFeaturesPlusIntercept) // initialize coefMatrix to all zeros
val initialModelIsValid = optInitialModel match { // validate the provided initial coefMatrix, if any
case Some(_initialModel) =>
val providedCoefs = _initialModel.coefficientMatrix
val modelIsValid = (providedCoefs.numRows == numCoefficientSets) &&
(providedCoefs.numCols == numFeatures) &&
(_initialModel.interceptVector.size == numCoefficientSets) &&
(_initialModel.getFitIntercept == $(fitIntercept))
if (!modelIsValid) {
logWarning(s"Initial coefficients will be ignored! Its dimensions " +
s"(${providedCoefs.numRows}, ${providedCoefs.numCols}) did not match the " +
s"expected size ($numCoefficientSets, $numFeatures)")
}
modelIsValid
case None => false
}
if (initialModelIsValid) {
val providedCoef = optInitialModel.get.coefficientMatrix
providedCoef.foreachActive { (classIndex, featureIndex, value) =>
// We need to scale the coefficients since they will be trained in the scaled space
initialCoefWithInterceptMatrix.update(classIndex, featureIndex,
value * featuresStd(featureIndex))
} // the scaling multiplies by the standard deviation because training happens in the standardized space: (value * std) * (x / std) = value * x
if ($(fitIntercept)) {
optInitialModel.get.interceptVector.foreachActive { (classIndex, value) =>
initialCoefWithInterceptMatrix.update(classIndex, numFeatures, value)
}
}
} else if ($(fitIntercept) && isMultinomial) {
val rawIntercepts = histogram.map(c => math.log(c + 1)) // add 1 for smoothing
val rawMean = rawIntercepts.sum / rawIntercepts.length
rawIntercepts.indices.foreach { i =>
initialCoefWithInterceptMatrix.update(i, numFeatures, rawIntercepts(i) - rawMean)
}
} else if ($(fitIntercept)) {
initialCoefWithInterceptMatrix.update(0, numFeatures,
math.log(histogram(1) / histogram(0)))
}
val states = optimizer.iterations(new CachedDiffFunction(costFun),
new BDV[Double](initialCoefWithInterceptMatrix.toArray))
val arrayBuilder = mutable.ArrayBuilder.make[Double]
var state: optimizer.State = null
while (states.hasNext) {
state = states.next()
arrayBuilder += state.adjustedValue //adjustedValue f(x) + r(x), where r is any regularization added to the objective
}
bcFeaturesStd.destroy(blocking = false)
if (state == null) {
val msg = s"${optimizer.getClass.getName} failed."
logError(msg)
throw new SparkException(msg)
}
val allCoefficients = state.x.toArray.clone() // extract the final coefMatrix from the optimizer state
val allCoefMatrix = new DenseMatrix(numCoefficientSets, numFeaturesPlusIntercept,
allCoefficients)
val denseCoefficientMatrix = new DenseMatrix(numCoefficientSets, numFeatures,
new Array[Double](numCoefficientSets * numFeatures), isTransposed = true)
val interceptVec = if ($(fitIntercept) || !isMultinomial) {
Vectors.zeros(numCoefficientSets)
} else {
Vectors.sparse(numCoefficientSets, Seq())
}
// separate intercepts and coefficients from the combined matrix
allCoefMatrix.foreachActive { (classIndex, featureIndex, value) =>
val isIntercept = $(fitIntercept) && (featureIndex == numFeatures)
if (!isIntercept && featuresStd(featureIndex) != 0.0) {
denseCoefficientMatrix.update(classIndex, featureIndex,
value / featuresStd(featureIndex)) // scale back to the original feature space
}
if (isIntercept) interceptVec.toArray(classIndex) = value
}
if ($(regParam) == 0.0 && isMultinomial) {
/*When no regularization is applied, the multinomial
coefficients lack identifiability because we do not use
a pivot class.
We can add any constant value to the coefficients and
get the same likelihood.
So here, we choose the mean centered coefficients for reproducibility.*/
val denseValues = denseCoefficientMatrix.values
val coefficientMean = denseValues.sum / denseValues.length
denseCoefficientMatrix.update(_ - coefficientMean)
}
// TODO: use `denseCoefficientMatrix.compressed` after SPARK-17471
val compressedCoefficientMatrix = if (isMultinomial) {
denseCoefficientMatrix
} else {
val compressedVector = Vectors.dense(denseCoefficientMatrix.values).compressed
compressedVector match {
case dv: DenseVector => denseCoefficientMatrix
case sv: SparseVector =>
new SparseMatrix(1, numFeatures, Array(0, sv.indices.length), sv.indices, sv.values,
isTransposed = true)
}
}
// center the intercepts when using multinomial algorithm
if ($(fitIntercept) && isMultinomial) {
val interceptArray = interceptVec.toArray
val interceptMean = interceptArray.sum / interceptArray.length
(0 until interceptVec.size).foreach { i => interceptArray(i) -= interceptMean }
}
(compressedCoefficientMatrix, interceptVec.compressed, arrayBuilder.result())
}
}
if (handlePersistence) instances.unpersist()
val model = copyValues(new LogisticRegressionModel(uid, coefficientMatrix, interceptVector,
numClasses, isMultinomial))
// TODO: implement summary model for multinomial case
val m = if (!isMultinomial) {
val (summaryModel, probabilityColName) = model.findSummaryModelAndProbabilityCol()
val logRegSummary = new BinaryLogisticRegressionTrainingSummary(
summaryModel.transform(dataset),
probabilityColName,
$(labelCol),
$(featuresCol),
objectiveHistory)
model.setSummary(Some(logRegSummary))
} else {
model
}
instr.logSuccess(m)
m
}
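For completeness, a minimal example of driving this training path through the public API (local toy data; the parameter values are arbitrary):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("lr-example").getOrCreate()
import spark.implicits._

// toy training data: (label, features)
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.1, 0.1)),
  (1.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)         // regParam
  .setElasticNetParam(0.0)   // pure L2, so the LBFGS branch of train() is taken
  .setFamily("auto")         // resolves to binomial here since there are 2 classes

val model = lr.fit(training) // invokes the train() method shown above
println(s"coefficients: ${model.coefficientMatrix} intercepts: ${model.interceptVector}")

spark.stop()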
References
Spark LogisticRegression