1 package org.apache.spark.mllib.optimization 2
3 import org.jblas.DoubleMatrix 4
5 /**
6
7 * Class used to compute the gradient for a loss function, given a single data point. 8
9 */
10
11 abstract class Gradient extends Serializable { 12
13 /**
14
15 * Compute the gradient and loss given the features of a single data point. 16
17 * @param data - Feature values for one data point. Column matrix of size dx1 18
19 * where d is the number of features. 20
21 * @param label - Label for this data item. 22
23 * @param weights - Column matrix containing weights for every feature. 24
25 * @return A tuple of 2 elements. The first element is a column matrix containing the computed 26
27 * gradient and the second element is the loss computed at this data point. 28
29 */
30
31 def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix): 32
33 (DoubleMatrix, Double) 34
35 }
可以从上面的注释上看出compute的参数data是一个样本的特征(d*1维度),label就是一个double型变量,该数据点(a single data point)的标签,weights就是特征变量的回归系数也是d*1维度,该函数返回2个东西,第1个是该样本点下计算的梯度,第2个该样本点下的损失loss
1 /**
2
3 * Compute gradient and loss for a logistic loss function, as used in binary classification. 4
5 * See also the documentation for the precise formulation. 6
7 */
8
9 class LogisticGradient extends Gradient { 10
11 override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix): 12
13 (DoubleMatrix, Double) = { 14
15 val margin: Double = -1.0 * data.dot(weights) 16
17 val gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label 18
19 val gradient = data.mul(gradientMultiplier) 20
21 val loss =
22
23 if (label > 0) { 24
25 math.log(1 + math.exp(margin)) 26
27 } else { 28
29 math.log(1 + math.exp(margin)) - margin 30
31 } 32
33 (gradient, loss) 34
35 } 36
37 }
我们知道对于log-loss的表达式loss=-[y*log(g(wx))+(1-y)*log(1-g(wx))], 其中g(wx)=1/(1+exp(-wx)),二分类(0,1),对这个loss进行求w偏导,d(loss)/d(w)=[g(wx)-y] * x (为书写方便,用d代表偏导的符号了),具体的表达式推导请移步http://www.cnblogs.com/kobedeshow/p/3340240.html
1 /**
2
3 * Compute gradient and loss for a Least-squared loss function, as used in linear regression. 4
5 * This is correct for the averaged least squares loss function (mean squared error) 6
7 * L = 1/n ||A weights-y||^2 8
9 * See also the documentation for the precise formulation. 10
11 */
12
13 class LeastSquaresGradient extends Gradient { 14
15 override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix): 16
17 (DoubleMatrix, Double) = { 18
19 val diff: Double = data.dot(weights) - label 20
21 val loss = diff * diff 22
23 val gradient = data.mul(2.0 * diff) 24
25 (gradient, loss) 27 } 28
29 }
1 /**
2
3 * Compute gradient and loss for a Hinge loss function, as used in SVM binary classification. 4
5 * See also the documentation for the precise formulation. 6
7 * NOTE: This assumes that the labels are {0,1} 8
9 */
10
11 class HingeGradient extends Gradient { 12
13 override def compute(data: DoubleMatrix, label: Double, weights: DoubleMatrix): 14
15 (DoubleMatrix, Double) = { 16
17 val dotProduct = data.dot(weights) 18
19 // Our loss function with {0, 1} labels is max(0, 1 - (2y – 1) (f_w(x))) 20
21 // Therefore the gradient is -(2y - 1)*x
22
23 val labelScaled = 2 * label - 1.0
24
25 if (1.0 > labelScaled * dotProduct) { 26
27 (data.mul(-labelScaled), 1.0 - labelScaled * dotProduct) 28
29 } else { 30
31 (DoubleMatrix.zeros(1, weights.length), 0.0) 32
33 } 35 } 37 }
hinge-loss的二分类(-1,1)的表达式是max(0,1- y * f(x)),代码中映射到(0,1),变成max(0, 1 - (2y – 1) (f(x))),这时候当样本错分的时候(也就是labelScaled * dotProduct<1),梯度是data.mul(-labelScaled),损失是1-labelScaled * dotProduct
1 /**
2
3 * Class used to perform steps (weight update) using Gradient Descent methods. 4
5 * For general minimization problems, or for regularized problems of the form 6
7 * min L(w) + regParam * R(w), 8
9 * the compute function performs the actual update step, when given some 10
11 * (e.g. stochastic) gradient direction for the loss L(w), 12
13 * and a desired step-size (learning rate). 14
15 * 16
17 * The updater is responsible to also perform the update coming from the 18
19 * regularization term R(w) (if any regularization is used). 20
21 */
22
23 abstract class Updater extends Serializable { 24
25 /**
26
27 * Compute an updated value for weights given the gradient, stepSize, iteration number and 28
29 * regularization parameter. Also returns the regularization value regParam * R(w) 30
31 * computed using the *updated* weights. 32
33 * @param weightsOld - Column matrix of size dx1 where d is the number of features. 34
35 * @param gradient - Column matrix of size dx1 where d is the number of features. 36
37 * @param stepSize - step size across iterations 38
39 * @param iter - Iteration number 40
41 * @param regParam - Regularization parameter 42
43 * 44
45 * @return A tuple of 2 elements. The first element is a column matrix containing updated weights, 46
47 * and the second element is the regularization value computed using updated weights. 48
49 */
50
51 def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, stepSize: Double, iter: Int, 52
53 regParam: Double): (DoubleMatrix, Double) 54
55 }
compute的参数weightsOld是更新前的变量回归系数(d*1维),gradient是根据指定的损失函数计算出的当前梯度,stepSize 是步长也就是学习速率,iter 迭代次数,regParam 是正则参数值,该函数返回2个东西,第1个是更新后的回归系数,第2个是更新后的regParam * R(w) 值。
1 /**
2
3 * A simple updater for gradient descent *without* any regularization. 4
5 * Uses a step-size decreasing with the square root of the number of iterations. 6
7 */
8
9 class SimpleUpdater extends Updater { 10
11 override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, 12
13 stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = { 14
15 val thisIterStepSize = stepSize / math.sqrt(iter) 16
17 val step = gradient.mul(thisIterStepSize) 18
19 (weightsOld.sub(step), 0) 20
21 } 22
23 }
对于梯度下降算法,w -= a*gradient,a是学习率对应代码里面的thisIterStepSize(相当于一开始步长很大,随迭代次数,增加而减小),式子中的a*gradient对应着step,最后,weightsNew=weightsOld.sub(step)
1 /**
2
3 * Updater for L1 regularized problems. 4
5 * R(w) = ||w||_1 6
7 * Uses a step-size decreasing with the square root of the number of iterations. 8
9 * Instead of subgradient of the regularizer, the proximal operator for the 10
11 * L1 regularization is applied after the gradient step. This is known to 12
13 * result in better sparsity of the intermediate solution. 14
15 * The corresponding proximal operator for the L1 norm is the soft-thresholding 16
17 * function. That is, each weight component is shrunk towards 0 by shrinkageVal. 18
19 * If w > shrinkageVal, set weight component to w-shrinkageVal. 20
21 * If w < -shrinkageVal, set weight component to w+shrinkageVal. 22
23 * If -shrinkageVal < w < shrinkageVal, set weight component to 0. 24
25 * Equivalently, set weight component to signum(w) * max(0.0, abs(w) - shrinkageVal) 26
27 */
28
29 class L1Updater extends Updater { 30
31 override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, 32
33 stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = { 34
35 val thisIterStepSize = stepSize / math.sqrt(iter) 36
37 val step = gradient.mul(thisIterStepSize) 38
39 // Take gradient step
40
41 val newWeights = weightsOld.sub(step) 42
43 // Apply proximal operator (soft thresholding)
44
45 val shrinkageVal = regParam * thisIterStepSize 46
47 (0 until newWeights.length).foreach { i =>
48
49 val wi = newWeights.get(i) 50
51 newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal)) 52
53 } 54
55 (newWeights, newWeights.norm1 * regParam) 56
57 } 58
59 }
加了正则项之后,前几步都一样,然后关键是对后面的处理(后面的理论暂时还不太理解,可以参考http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/),还是说代码步骤吧,变量shrinkageVal =regParam * thisIterStepSize(注意:要*thisIterStepSize,因为w -= a*gradient 里面的gradient包括L(w)还包括正则的R(w)),然后对加正则前更新的newWeights,上遍历每一个元素,直接对该元素赋值newWeights.put(i, signum(wi) * max(0.0, abs(wi) - shrinkageVal)),对应着代码注释的红体部分。
1 /**
2
3 * Updater for L2 regularized problems. 4
5 * R(w) = 1/2 ||w||^2 6
7 * Uses a step-size decreasing with the square root of the number of iterations. 8
9 */
10
11 class SquaredL2Updater extends Updater { 12
13 override def compute(weightsOld: DoubleMatrix, gradient: DoubleMatrix, 14
15 stepSize: Double, iter: Int, regParam: Double): (DoubleMatrix, Double) = { 16
17 val thisIterStepSize = stepSize / math.sqrt(iter) 18
19 val step = gradient.mul(thisIterStepSize) 20
21 // add up both updates from the gradient of the loss (= step) as well as 22
23 // the gradient of the regularizer (= regParam * weightsOld)
24
25 val newWeights = weightsOld.mul(1.0 - thisIterStepSize * regParam).sub(step) 26
27 (newWeights, 0.5 * pow(newWeights.norm2, 2.0) * regParam) 28
29 } 30
31 }
L2正则项加入后,损失函数变为loss1=loss+1/2 *regParam* ||w||^2,按梯度下降的更新公式:w=w-学习速率 * (d(loss1)/d(w));后面的d(loss1)=d(loss1)/d(w) + d(1/2*regParam*||w||^2) / d(w)了,那么更新公式变成了w=w-学习速率*d(loss)/d(w)-学习速率*d(1/2*regParam*||w|| ^2)/d(w)=(1-学习速率*regParam)*w-学习速率*d(loss)/d(w),这个也就对应了第25行代码的意思
1 package org.apache.spark.mllib.optimization 2
3 import org.apache.spark.Logging 4
5 import org.apache.spark.rdd.RDD 6
7 import org.jblas.DoubleMatrix 8
9 import scala.collection.mutable.ArrayBuffer 10
11 /**
12
13 * Class used to solve an optimization problem using Gradient Descent. 14
15 * @param gradient Gradient function to be used. 16
17 * @param updater Updater to be used to update weights after every iteration. 18
19 */
20
21 class GradientDescent(var gradient: Gradient, var updater: Updater) 22
23 extends Optimizer with Logging 24
25 { 26
27 private var stepSize: Double = 1.0
28
29 private var numIterations: Int = 100
30
31 private var regParam: Double = 0.0
32
33 private var miniBatchFraction: Double = 1.0
34
35 /**
36
37 * Set the initial step size of SGD for the first step. Default 1.0. 38
39 * In subsequent steps, the step size will decrease with stepSize/sqrt(t) 40
41 */
42
43 def setStepSize(step: Double): this.type = { 44
45 this.stepSize = step 46
47 this
48
49 } 50
51 /**
52
53 * Set fraction of data to be used for each SGD iteration. 54
55 * Default 1.0 (corresponding to deterministic/classical gradient descent) 56
57 */
58
59 def setMiniBatchFraction(fraction: Double): this.type = { 60
61 this.miniBatchFraction = fraction 62
63 this
64
65 } 66
67 /**
68
69 * Set the number of iterations for SGD. Default 100. 70
71 */
72
73 def setNumIterations(iters: Int): this.type = { 74
75 this.numIterations = iters 76
77 this
78
79 } 80
81 /**
82
83 * Set the regularization parameter. Default 0.0. 84
85 */
86
87 def setRegParam(regParam: Double): this.type = { 88
89 this.regParam = regParam 90
91 this
92
93 } 94
95 /**
96
97 * Set the gradient function (of the loss function of one single data example) 98
99 * to be used for SGD. 100
101 */
102
103 def setGradient(gradient: Gradient): this.type = { 104
105 this.gradient = gradient 106
107 this
108
109 } 110
111 /**
112
113 * Set the updater function to actually perform a gradient step in a given direction. 114
115 * The updater is responsible to perform the update from the regularization term as well, 116
117 * and therefore determines what kind or regularization is used, if any. 118
119 */
120
121 def setUpdater(updater: Updater): this.type = { 122
123 this.updater = updater 124
125 this
126
127 } 128
129 def optimize(data: RDD[(Double, Array[Double])], initialWeights: Array[Double]) 130
131 : Array[Double] = { 132
133 val (weights, stochasticLossHistory) = GradientDescent.runMiniBatchSGD( 134
135 data, 136
137 gradient, 138
139 updater, 140
141 stepSize, 142
143 numIterations, 144
145 regParam, 146
147 miniBatchFraction, 148
149 initialWeights) 150
151 weights 152
153 } 154
155 }
该类的输入有2个参数,第一个是前面都是gradient对应了用户需要选哪个损失函数计算梯度,第二个updater 对应了用户选择哪一种正则方式,程序开头都设置了stepSize,numIterations,regParam,miniBatchFraction的默认值最后一个函数optimize,输入RDD数据,跟初始的回归系数weight,返回weights权重
1 // Top-level method to run gradient descent.
2
3 object GradientDescent extends Logging { 4
5 /**
6
7 * Run stochastic gradient descent (SGD) in parallel using mini batches. 8
9 * In each iteration, we sample a subset (fraction miniBatchFraction) of the total data 10
11 * in order to compute a gradient estimate. 12
13 * Sampling, and averaging the subgradients over this subset is performed using one standard 14
15 * spark map-reduce in each iteration. 16
17 * 18
19 * @param data - Input data for SGD. RDD of the set of data examples, each of 20
21 * the form (label, [feature values]). 22
23 * @param gradient - Gradient object (used to compute the gradient of the loss function of 24
25 * one single data example) 26
27 * @param updater - Updater function to actually perform a gradient step in a given direction. 28
29 * @param stepSize - initial step size for the first step 30
31 * @param numIterations - number of iterations that SGD should be run. 32
33 * @param regParam - regularization parameter 34
35 * @param miniBatchFraction - fraction of the input data set that should be used for 36
37 * one iteration of SGD. Default value 1.0. 38
39 * 40
41 * @return A tuple containing two elements. The first element is a column matrix containing 42
43 * weights for every feature, and the second element is an array containing the 44
45 * stochastic loss computed for every iteration. 46
47 */
48
49 def runMiniBatchSGD( 50
51 data: RDD[(Double, Array[Double])], 52
53 gradient: Gradient, 54
55 updater: Updater, 56
57 stepSize: Double, 58
59 numIterations: Int, 60
61 regParam: Double, 62
63 miniBatchFraction: Double, 64
65 initialWeights: Array[Double]) : (Array[Double], Array[Double]) = { 66
67 val stochasticLossHistory = new ArrayBuffer[Double](numIterations) 68
69 val nexamples: Long = data.count() 70
71 val miniBatchSize = nexamples * miniBatchFraction 72
73 // Initialize weights as a column vector
74
75 var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*) 76
77 var regVal = 0.0
78
79 for (i <- 1 to numIterations) { 80
81 // Sample a subset (fraction miniBatchFraction) of the total data 82
83 // compute and sum up the subgradients on this subset (this is one map-reduce)
84
85 val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i).map { 86
87 case (y, features) =>
88
89 val featuresCol = new DoubleMatrix(features.length, 1, features:_*) 90
91 val (grad, loss) = gradient.compute(featuresCol, y, weights) 92
93 (grad, loss) 94
95 }.reduce((a, b) => (a._1.addi(b._1), a._2 + b._2)) 96
97 /**
98
99 * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration 100
101 * and regVal is the regularization value computed in the previous iteration as well. 102
103 */
104
105 stochasticLossHistory.append(lossSum / miniBatchSize + regVal) 106
107 val update = updater.compute( 108
109 weights, gradientSum.div(miniBatchSize), stepSize, i, regParam) 110
111 weights = update._1 112
113 regVal = update._2 114
115 } 116
117 logInfo("GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses %s".format( 118
119 stochasticLossHistory.takeRight(10).mkString(", "))) 120
121 (weights.toArray, stochasticLossHistory.toArray) 122
123 } 124
125 }
该object进行了整个的优化过程,输出是回归系数跟每次迭代的loss,这里实现的是minibatch-sgd的并行,前面的var weights = new DoubleMatrix(initialWeights.length, 1, initialWeights:_*),这个操作是把array型的搞成矩阵型的d*1维矩阵。关键代码for (i <- 1 to numIterations) 里面的,首先data是spark的RDD数据类型,data.sample方法第一个参数指是否又放回的抽样,第二个是抽样比例,第三个是随机种子,data.sample返回抽样后的RDD,然后RDD.map,RDD.reduce操作就是一个完整的map-reduce操作。接着,把得到的gradientSum除以miniBatchSize,扔到updater里面去更新梯度,关于minibatch-sgd的并行策略可以参考我之前的文章《常见数据挖掘算法的Map-Reduce策略(2)》里面的algorithm3。