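The previous post walked through the application workflows of three different machine learning algorithms on insurance data, mainly through code listings and a comparison of results. In this post I explain in detail what each parameter that appears in that code means.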
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// SparkSessionCreate and Preproessing are helper objects defined elsewhere in the project
object ScalaLR {
  def main(args: Array[String]): Unit = {
    val ss: SparkSession = SparkSessionCreate.createSession()
    import ss.implicits._

    // Define the hyperparameter candidates
    val numFolds = 10
    val MaxIter: Seq[Int] = Seq(1000)
    val Regparam: Seq[Double] = Seq(0.001)
    val Tol: Seq[Double] = Seq(1e-6)
    val ElasticNetParam: Seq[Double] = Seq(0.001)

    // Create the linear regression (LR) estimator
    val model = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    // Build the ML pipeline: string indexers, then the vector assembler, then the model
    val pipeline = new Pipeline()
      .setStages((Preproessing.stringIndexerStages :+ Preproessing.assembler) :+ model)

    // Before cross-validation, build a paramGrid that lists the parameter values to search over
    val paramGrid: Array[ParamMap] = new ParamGridBuilder()
      .addGrid(model.maxIter, MaxIter)
      .addGrid(model.regParam, Regparam)
      .addGrid(model.tol, Tol)
      .addGrid(model.elasticNetParam, ElasticNetParam)
      .build()

    // Set up cross-validation for model tuning; choose the settings yourself, here numFolds = 10
    val cv: CrossValidator = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new RegressionEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(numFolds)

    // With the cross-validator in place, train the LR model
    val cvModel = cv.fit(Preproessing.trainData)

    // With the fitted model, predict on the training and validation data and evaluate the results
    val trainPredictionsAndLabels: RDD[(Double, Double)] = cvModel
      .transform(Preproessing.trainData)
      .select("label", "prediction")
      .map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd
    val validPredictionsAndLabels: RDD[(Double, Double)] = cvModel
      .transform(Preproessing.validationData)
      .select("label", "prediction")
      .map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd
    val trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels)
    val validRegressionMetrics: RegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)

    // Extract the best model found by cross-validation
    val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]

    // Collect the results on the training and validation sets
    val results =
      "=====================================================================================\r\n" +
      s"Param trainSample: ${Preproessing.trainSample}\r\n" +
      s"TrainData count : ${Preproessing.trainData.count}\r\n" +
      s"ValidationData count : ${Preproessing.validationData.count}\r\n" +
      s"TestData count : ${Preproessing.testData.count}\r\n" +
      "\r\n===================================================================================\r\n" +
      s"Param maxIter = ${MaxIter.mkString(",")}\r\n" +
      s"Param numFolds = ${numFolds}\r\n" +
      "\r\n===================================================================================\r\n" +
      s"Train data MSE = ${trainRegressionMetrics.meanSquaredError}\r\n" +
      s"Train data RMSE = ${trainRegressionMetrics.rootMeanSquaredError}\r\n" +
      s"Train data R-squared = ${trainRegressionMetrics.r2}\r\n" +
      s"Train data MAE = ${trainRegressionMetrics.meanAbsoluteError}\r\n" +
      s"Train data Explained variance = ${trainRegressionMetrics.explainedVariance}\r\n" +
      "\r\n===================================================================================\r\n" +
      s"Validation data MSE = ${validRegressionMetrics.meanSquaredError}\r\n" +
      s"Validation data RMSE = ${validRegressionMetrics.rootMeanSquaredError}\r\n" +
      s"Validation data R-squared = ${validRegressionMetrics.r2}\r\n" +
      s"Validation data MAE = ${validRegressionMetrics.meanAbsoluteError}\r\n" +
      s"Validation data explained variance = ${validRegressionMetrics.explainedVariance}\r\n" +
      "\r\n===================================================================================\r\n" +
      s"CV params explained : ${cvModel.explainParams}\r\n" +
      s"LR params explained : ${bestModel.stages.last.asInstanceOf[LinearRegressionModel].explainParams}\r\n" +
      "\r\n==================================THE END==========================================="
    println(results)

    // Predict on the test set and write the result out as a single CSV file
    println("Run this prediction on test set")
    cvModel.transform(Preproessing.testData)
      .select("id", "prediction")
      .withColumnRenamed("prediction", "loss")
      .coalesce(1)
      .write.format("com.databricks.spark.csv")
      .save("file:\\C:\\Users\\PC\\Desktop\\墨菲斯文件备份\\Word文档\\学习资料\\spark\\书\\机器学习\\数据\\output\\res_LR.csv")
  }
}
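—paramGrid: the parameter grid that specifies which parameter values to try. ParamGridBuilder produces an Array[ParamMap], where each ParamMap is a set of (parameter, value) pairs; every stage in an ML pipeline shares this same parameter API.
—numFolds: the number of folds k used in cross-validation (>= 2). The training data is split into k parts and each part serves once as the validation set. A larger value gives a more reliable estimate of model quality but costs proportionally more compute.
—maxIter: the maximum number of iterations (>= 0).
—regParam: the regularization parameter (>= 0), of type Double.
—tol: the convergence tolerance for iterative algorithms (>= 0). A smaller value demands a tighter convergence before the optimizer stops, at the cost of more iterations.
—elasticNetParam: the ElasticNet mixing parameter, in the range [0, 1]. A value of 0 gives a pure L2 penalty and a value of 1 gives a pure L1 penalty.
—FeaturesCol: the name of the features column.
—LabelCol: the name of the label column.
—Pipeline: chains multiple Transformers and Estimators together into a single ML workflow.
—CrossValidator: performs k-fold cross-validation, fitting the estimator for every ParamMap in the grid and selecting the best one according to the evaluator.
—Estimator: an algorithm that can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator: it trains on a DataFrame and produces a model.
—Evaluator: measures how well a fitted model performs on held-out data and returns an evaluation metric.
—RegressionEvaluator: the Evaluator for regression problems. Its default metric is RMSE.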
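To make regParam, elasticNetParam, maxIter and tol more concrete, here is a minimal plain-Scala sketch that runs without Spark. The object name ParamSketch and its two functions are hypothetical names chosen for illustration. The penalty function follows the elastic-net form described in the Spark documentation, regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is elasticNetParam. The descend function is a toy gradient descent on f(w) = (w - 3)^2 that only illustrates how maxIter and tol act as stopping rules; it is not the solver Spark actually uses.

```scala
object ParamSketch {
  /** Elastic-net penalty: regParam * (alpha * L1 norm + (1 - alpha) / 2 * squared L2 norm). */
  def penalty(weights: Seq[Double], regParam: Double, alpha: Double): Double = {
    val l1 = weights.map(w => math.abs(w)).sum
    val l2Squared = weights.map(w => w * w).sum
    regParam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
  }

  /** Toy gradient descent on f(w) = (w - 3)^2, starting from w = 0.
    * Stops when the step is no larger than tol or when maxIter is reached.
    * Returns the final w and the number of iterations used. */
  def descend(maxIter: Int, tol: Double, stepSize: Double = 0.1): (Double, Int) = {
    var w = 0.0
    var iter = 0
    var change = Double.MaxValue
    while (iter < maxIter && change > tol) {
      val next = w - stepSize * 2.0 * (w - 3.0) // gradient of (w - 3)^2 is 2 * (w - 3)
      change = math.abs(next - w)
      w = next
      iter += 1
    }
    (w, iter)
  }

  def main(args: Array[String]): Unit = {
    val weights = Seq(1.0, -2.0)
    println(s"pure L2 penalty: ${penalty(weights, regParam = 0.001, alpha = 0.0)}")
    println(s"pure L1 penalty: ${penalty(weights, regParam = 0.001, alpha = 1.0)}")
    println(s"converged run:   ${descend(maxIter = 1000, tol = 1e-6)}")
    println(s"truncated run:   ${descend(maxIter = 5, tol = 1e-6)}")
  }
}
```

With a generous maxIter the loop ends because the step drops below tol; with maxIter = 5 it is cut off early and the result is further from the minimum at w = 3. That is the same accuracy versus run time trade-off these two parameters control in the real model.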
=====================================================================================
Param trainSample: 1.0
TrainData count : 140977
ValidationData count : 47341
TestData count : 125546
===================================================================================
Param maxIter = 1000
Param numFolds = 10
===================================================================================
Train data MSE = 4523266.93398241
Train data RMSE = 2126.797342010378
Train data R-squared = -0.16181596223081596
Train data MAE = 1358.4888709703798
Train data Explained variance = 8415946.47720863
===================================================================================
Validation data MSE = 4651416.497204879
Validation data RMSE = 2156.714282700627
Validation data R-squared = -0.19498670604587942
Validation data MAE = 1358.6436775990019
Validation data explained variance = 8486835.155155173
===================================================================================
CV params explained : estimator: estimator for selection (current: pipeline_c5ad4ff638f1)
estimatorParamMaps: param maps for the estimator (current: [Lorg.apache.spark.ml.param.ParamMap;@17228435)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (current: regEval_1d803bd7fa7f)
numFolds: number of folds for cross validation (>= 2) (default: 3, current: 10)
seed: random seed (default: -1191137437)
LR params explained : aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.001)
featuresCol: features column name (default: features, current: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label, current: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 1000)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.001)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto' (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (>= 0) (default: 1.0E-6, current: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)
==================================THE END===========================================
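—RMSE: root mean squared error, the square root of MSE, expressed in the same units as the label.
—MSE: mean squared error, the average squared distance between the fitted values and the data points. The smaller the MSE, the closer the fit is to the data.
—MAE: mean absolute error. It measures the average magnitude of the errors in a set of predictions without considering their direction: it is the average over the sample of the absolute differences between prediction and observation, with every individual difference weighted equally.
—R^2: R-squared is a statistical measure of how close the data are to the fitted regression line; the larger it is, the better the model fits the data. Its maximum is 1 (100%), and it becomes negative when the model fits worse than always predicting the mean of the label, which is the case in the output above (-0.16 on the training data and -0.19 on the validation data).
—Explained variance: in statistics, explained variation measures the proportion of the variation in a data set that a model accounts for. The value printed above is not normalized to that 0-to-1 scale; it is reported in squared units of the label.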
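The metric definitions above can be sketched in a few lines of plain Scala. This is an illustrative sketch using the standard textbook formulas, not Spark's RegressionMetrics implementation; the object name MetricsSketch and the sample numbers are made up for the example.

```scala
object MetricsSketch {
  /** Mean squared error: the average of the squared residuals. */
  def mse(labels: Seq[Double], preds: Seq[Double]): Double =
    labels.zip(preds).map { case (y, p) => (y - p) * (y - p) }.sum / labels.size

  /** Root mean squared error: the square root of MSE, in the units of the label. */
  def rmse(labels: Seq[Double], preds: Seq[Double]): Double =
    math.sqrt(mse(labels, preds))

  /** Mean absolute error: the average of the absolute residuals. */
  def mae(labels: Seq[Double], preds: Seq[Double]): Double =
    labels.zip(preds).map { case (y, p) => math.abs(y - p) }.sum / labels.size

  /** R^2 = 1 - SS_res / SS_tot; negative when the model is worse than predicting the mean. */
  def r2(labels: Seq[Double], preds: Seq[Double]): Double = {
    val mean = labels.sum / labels.size
    val ssRes = labels.zip(preds).map { case (y, p) => (y - p) * (y - p) }.sum
    val ssTot = labels.map(y => (y - mean) * (y - mean)).sum
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    val labels = Seq(1.0, 2.0, 3.0, 4.0)
    val closePreds = Seq(1.0, 2.0, 3.0, 5.0)    // only the last prediction is off by 1
    val reversedPreds = Seq(4.0, 3.0, 2.0, 1.0) // worse than predicting the mean 2.5
    println(s"close:    MSE=${mse(labels, closePreds)} RMSE=${rmse(labels, closePreds)} " +
      s"MAE=${mae(labels, closePreds)} R2=${r2(labels, closePreds)}")
    println(s"reversed: MSE=${mse(labels, reversedPreds)} R2=${r2(labels, reversedPreds)}")
  }
}
```

The second set of predictions has a residual sum of squares larger than the total sum of squares, so its R^2 is negative, the same situation as in the LR output above.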
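The parameters in a Spark ML workflow should be set sensibly according to the number of features in the data set and the resources you have available. With limited resources, set parameters such as the number of iterations and the number of folds to smaller values to keep the run time manageable, at some cost in accuracy; with plenty of compute, raise them to get more precise results. The output metrics above also give a direct way to judge whether a model is good or bad, and you can use them to pick the parameter setting that best suits your needs.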
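As a rough illustration of that cost trade-off, the sketch below counts how many model fits a cross-validated grid search needs: one fit per parameter combination per fold, plus one final refit of the best combination on all of the training data. The object name CvCostSketch and its functions are hypothetical, and the round-robin fold assignment is a simplification for illustration (Spark assigns rows to folds randomly).

```scala
object CvCostSketch {
  /** Number of parameter combinations: the product of the candidate counts per parameter. */
  def gridSize(candidatesPerParam: Seq[Int]): Int = candidatesPerParam.product

  /** Total fits: one per (combination, fold), plus one final refit of the best combination. */
  def totalFits(gridSize: Int, numFolds: Int): Int = gridSize * numFolds + 1

  /** Validation row indices for each fold, assigning rows to folds round-robin. */
  def foldIndices(n: Int, numFolds: Int): Seq[Seq[Int]] =
    (0 until numFolds).map(fold => (0 until n).filter(_ % numFolds == fold))

  def main(args: Array[String]): Unit = {
    // The grid in this post has one candidate for each of its four parameters
    val single = gridSize(Seq(1, 1, 1, 1))
    println(s"grid size $single, fits with 10 folds: ${totalFits(single, 10)}")

    // A hypothetical wider grid: 3 maxIter values, 3 regParam values, 1 tol, 2 elasticNetParam
    val wider = gridSize(Seq(3, 3, 1, 2))
    println(s"grid size $wider, fits with 10 folds: ${totalFits(wider, 10)}")

    // Each row lands in exactly one validation fold
    foldIndices(10, 3).zipWithIndex.foreach { case (rows, fold) =>
      println(s"fold $fold validates on rows ${rows.mkString(",")}")
    }
  }
}
```

Because the cost grows with both the grid size and numFolds, shrinking either one is the most direct way to cut run time when resources are tight.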
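A follow-up post will cover the principles behind the three algorithms used in the code, along with a comparative analysis. Stay tuned.
If you have questions, feel free to add the author on WeChat: ljelzl416108, to discuss and learn about big data and machine learning together!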