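This technical column series is a summary and distillation of the day-to-day work of the author (秦凯新). It draws cases from real commercial environments to summarize and share, and offers tuning advice for commercial applications, cluster capacity planning, and related topics. Please keep following this blog series. Copyright notice: reposting is prohibited; studying is welcome. QQ email address: [email protected]. For any business inquiries, feel free to get in touch.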
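1 Fire It Up! Model Selection
- Model selection can be done for a single Estimator, such as logistic regression or a decision tree.
- Model selection can also tune an entire Pipeline at once, which avoids tuning each element of the Pipeline separately.
- Estimator: the algorithm or Pipeline to tune.
- ParamMap: the set of parameters to choose from (the "parameter grid"); multiple parameters are supported, such as the number of iterations and the regularization parameter.
- Evaluator: the metric that measures how well a fitted model does on held-out test data.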
2 Model Validation
- ML currently supports cross-validation (CrossValidator) and train-validation split (TrainValidationSplit).
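3 Model Training Workflow
- Split the input data into training and test sets.
- For each (training, test) pair, iterate over the parameter grid: fit the Estimator with each ParamMap, then use the Evaluator to measure the fitted model's performance.
- Select the best-performing set of parameters to produce the final model.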
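The workflow above can be sketched in plain Scala without Spark. This is only an illustration of the loop, not Spark's implementation: `selectBest`, the type parameters `P` (a parameter setting) and `M` (a fitted model), and the toy data are all hypothetical names introduced here.

```scala
// Generic selection loop: score every parameter setting on every
// (training, test) pair and keep the setting with the best average score.
def selectBest[P, M](
    grid: Seq[P],
    splits: Seq[(Seq[Double], Seq[Double])],
    fit: (P, Seq[Double]) => M,
    score: (M, Seq[Double]) => Double // larger is better
): P =
  grid.maxBy { p =>
    val scores = splits.map { case (train, test) => score(fit(p, train), test) }
    scores.sum / scores.size
  }

// Toy usage: the "model" is the constant prediction p * mean(train),
// and the score is the negative mean squared error on the test part.
val splits = Seq((Seq(2.0, 2.0), Seq(2.0)), (Seq(2.0), Seq(2.0, 2.0)))
val best = selectBest[Double, Double](
  grid = Seq(0.5, 1.0),
  splits = splits,
  fit = (p, train) => p * train.sum / train.size,
  score = (m, test) => -test.map(t => (m - t) * (m - t)).sum / test.size
)
println(best) // 1.0
```

In Spark ML the same roles are played by the ParamMap grid, the Estimator's fit, and the Evaluator.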
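4 Evaluator
- RegressionEvaluator is used for regression problems.
- BinaryClassificationEvaluator is used for binary classification; its default metric is areaUnderROC (AUC).
- MulticlassClassificationEvaluator is used for multiclass problems.
- The default metric used to choose the best ParamMap can be overridden with the evaluator's setMetricName method.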
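As a small sketch of overriding the default metric: the metric names used below ("areaUnderPR", "mae", "accuracy") are ones these evaluators accept in recent Spark 2.x releases as far as I know, but the set of supported names varies by version, so check it against the Spark release you run.

```scala
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator, RegressionEvaluator}

// Binary classification: the default is areaUnderROC; switch to area under the PR curve.
val binaryEval = new BinaryClassificationEvaluator().setMetricName("areaUnderPR")

// Regression: the default is rmse; switch to mean absolute error.
val regressionEval = new RegressionEvaluator().setMetricName("mae")

// Multiclass: the default is f1; switch to accuracy.
val multiclassEval = new MulticlassClassificationEvaluator().setMetricName("accuracy")

// isLargerBetter tells the tuner whether to maximize or minimize the metric.
println(s"${binaryEval.getMetricName}: larger is better = ${binaryEval.isLargerBetter}")
println(s"${regressionEval.getMetricName}: larger is better = ${regressionEval.isLargerBetter}")
```

An evaluator configured this way is passed to setEvaluator on CrossValidator or TrainValidationSplit in place of the default one.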
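5 Hands-On: ML Cross-Validation with a Pipeline
5.1 CrossValidator (k-fold cross-validation)
- CrossValidator first splits the dataset into a number of folds (for example, 3); each fold gives one training set and one test set, so there are 3 training sets and 3 test sets.
- With 3-fold cross-validation, each pair uses 2/3 of the data for training and 1/3 for testing.
- To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric of the 3 models produced by calling fit on the Estimator with the 3 different (training, test) pairs.
- After identifying the best ParamMap, CrossValidator finally refits the Estimator using the best ParamMap and the entire dataset.
An example:
Take 2-fold cross-validation with a parameter grid over two parameters: hashingTF.numFeatures with 3 values and lr.regParam with 2 values. How many models get trained? (3×2)×2=12, that is, 12 models are trained during the search, so the cost is clearly high.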
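The arithmetic above can be checked with a few lines of plain Scala. `cvSearchFits` is a hypothetical helper written for this illustration; it is not part of Spark.

```scala
// Hypothetical helper: number of fit() calls made while searching the grid,
// i.e. (product of the per-parameter value counts) * number of folds.
def cvSearchFits(gridSizes: Seq[Int], numFolds: Int): Int =
  gridSizes.product * numFolds

val searchFits = cvSearchFits(Seq(3, 2), numFolds = 2) // (3 * 2) * 2 = 12
val totalFits = searchFits + 1 // plus the final refit on the entire dataset

println(s"search fits = $searchFits, total fits = $totalFits")
// search fits = 12, total fits = 13
```

Counting the final refit with the best ParamMap, the Estimator is fitted 13 times in total for this grid.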
5.2 CrossValidator in Practice
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row
Prepare the training data, in the format (id, text, label)
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0),
(4L, "b spark who", 1.0),
(5L, "g d a y", 0.0),
(6L, "spark fly", 1.0),
(7L, "was mapreduce", 0.0),
(8L, "e spark program", 1.0),
(9L, "a e c l", 0.0),
(10L, "spark compile", 1.0),
(11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")
1 Configure an ML pipeline with three stages in total: tokenizer, hashingTF, and lr
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
- For reference: val tokenized = tokenizer.transform(training)
- For reference: tokenized.show()
- For reference: scala> tokenized.rdd.foreach(println)
[0,a b c d e spark,1.0,WrappedArray(a, b, c, d, e, spark)]
[1,b d,0.0,WrappedArray(b, d)]
[2,spark f g h,1.0,WrappedArray(spark, f, g, h)]
[3,hadoop mapreduce,0.0,WrappedArray(hadoop, mapreduce)]
[4,b spark who,1.0,WrappedArray(b, spark, who)]
[5,g d a y,0.0,WrappedArray(g, d, a, y)]
[6,spark fly,1.0,WrappedArray(spark, fly)]
[7,was mapreduce,0.0,WrappedArray(was, mapreduce)]
[8,e spark program,1.0,WrappedArray(e, spark, program)]
[9,a e c l,0.0,WrappedArray(a, e, c, l)]
[10,spark compile,1.0,WrappedArray(spark, compile)]
[11,hadoop software,0.0,WrappedArray(hadoop, software)]
2 Configure an ML HashingTF
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
3 Configure an ML LogisticRegression; the label, features, and prediction columns can keep their default names.
val lr = new LogisticRegression().setMaxIter(10)
4 Build the pipeline
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
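5 Use ParamGridBuilder to build the parameter grid to search over: hashingTF.numFeatures has three values and lr.regParam has two, so the grid gives CrossValidator 3*2=6 parameter settings to choose from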
val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures, Array(10, 100, 1000)).addGrid(lr.regParam, Array(0.1, 0.01)).build()
Array({
hashingTF_a4b3e2e4efc2-numFeatures: 10,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 10,
logreg_3f15efefe425-regParam: 0.01
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 100,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 100,
logreg_3f15efefe425-regParam: 0.01
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 1000,
logreg_3f15efefe425-regParam: 0.1
}, {
hashingTF_a4b3e2e4efc2-numFeatures: 1000,
logreg_3f15efefe425-regParam: 0.01
})
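6 CrossValidator cross-validation
Here the entire Pipeline is treated as a single Estimator.
This lets us choose the parameters of all Pipeline stages jointly.
A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
The Evaluator here is a BinaryClassificationEvaluator, whose default metric is areaUnderROC (AUC).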
val cv = new CrossValidator().setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator).setEstimatorParamMaps(paramGrid).setNumFolds(2)
7 Fit the CrossValidator on the training data and build the test set
val cvModel = cv.fit(training)
val test = spark.createDataFrame(Seq(
(4L, "spark i j k"),
(5L, "l m n"),
(6L, "mapreduce spark"),
(7L, "apache hadoop")
)).toDF("id", "text")
8 Predict on the test set and show the results
val allresult=cvModel.transform(test)
allresult.show
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
| id| text| words| features| rawPrediction| probability|prediction|
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
| 4| spark i j k| [spark, i, j, k]|(10,[5,6,9],[1.0,...|[0.52647041270060...|[0.62865951622023...| 0.0|
| 5| l m n| [l, m, n]|(10,[5,6,8],[1.0,...|[-0.6393098371808...|[0.34540256830050...| 1.0|
| 6|mapreduce spark|[mapreduce, spark]|(10,[3,5],[1.0,1.0])|[-0.6753938557453...|[0.33729012038845...| 1.0|
| 7| apache hadoop| [apache, hadoop]|(10,[1,5],[1.0,1.0])|[-0.9696913340282...|[0.27494203016056...| 1.0|
+---+---------------+------------------+--------------------+--------------------+--------------------+----------+
9 Predict on the test set, detailed output
val allresult=cvModel.transform(test)
allresult.rdd.foreach(println)
[4,spark i j k,WrappedArray(spark, i, j, k),(10,[5,6,9],[1.0,1.0,2.0]),[0.5264704127006035,-0.5264704127006035],[0.6286595162202399,0.37134048377976003],0.0]
[5,l m n,WrappedArray(l, m, n),(10,[5,6,8],[1.0,1.0,1.0]),[-0.6393098371808272,0.6393098371808272],[0.3454025683005015,0.6545974316994986],1.0]
[6,mapreduce spark,WrappedArray(mapreduce, spark),(10,[3,5],[1.0,1.0]),[-0.6753938557453469,0.6753938557453469],[0.3372901203884568,0.6627098796115432],1.0]
[7,apache hadoop,WrappedArray(apache, hadoop),(10,[1,5],[1.0,1.0]),[-0.9696913340282707,0.9696913340282707],[0.2749420301605646,0.7250579698394354],1.0]
10 Predict on the test set, selected columns only
cvModel.transform(test).select("id", "text", "probability", "prediction").collect().foreach {
case Row(id: Long, text: String, prob: Vector, prediction: Double) => println(s"($id, $text) --> prob=$prob, prediction=$prediction")
}
(4, spark i j k) --> prob=[0.6286595162202399,0.37134048377976003], prediction=0.0
(5, l m n) --> prob=[0.3454025683005015,0.6545974316994986], prediction=1.0
(6, mapreduce spark) --> prob=[0.3372901203884568,0.6627098796115432], prediction=1.0
(7, apache hadoop) --> prob=[0.2749420301605646,0.7250579698394354], prediction=1.0
11 Inspect the parameter values of the best model
val bestModel= cvModel.bestModel.asInstanceOf[PipelineModel]
val lrModel=bestModel.stages(2).asInstanceOf[LogisticRegressionModel]
lrModel.getRegParam
res22: Double = 0.1
lrModel.numFeatures
res24: Int = 10
lrModel.getMaxIter
res25: Int = 10
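Besides the best model, it can help to see how every ParamMap in the grid scored. The sketch below is a self-contained variant of the example above that pairs each ParamMap with its fold-averaged metric from `avgMetrics`. The local SparkSession settings and the app name are assumptions made for illustration; in spark-shell the existing `spark` can be used instead. The printed values depend on the random fold assignment, so no output is shown.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("cv-avg-metrics").getOrCreate()

// Same toy training data as in the walkthrough above.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0), (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0), (3L, "hadoop mapreduce", 0.0),
  (4L, "b spark who", 1.0), (5L, "g d a y", 0.0),
  (6L, "spark fly", 1.0), (7L, "was mapreduce", 0.0),
  (8L, "e spark program", 1.0), (9L, "a e c l", 0.0),
  (10L, "spark compile", 1.0), (11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

val cvModel = cv.fit(training)

// avgMetrics(i) is the metric (areaUnderROC here) averaged over the folds
// for getEstimatorParamMaps(i); print the settings from best to worst.
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics)
  .sortBy { case (_, auc) => -auc }
  .foreach { case (params, auc) => println(f"$auc%.4f  <- $params") }
```

The first entry printed corresponds to the ParamMap that bestModel was refit with.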
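5.3 TrainValidationSplit
- Besides CrossValidator, Spark also provides TrainValidationSplit for hyperparameter tuning.
- TrainValidationSplit evaluates each combination of parameters only once, as opposed to k times with CrossValidator. It is therefore less expensive, but it will not produce reliable results when the training dataset is not large enough.
- Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into two parts using the trainRatio parameter. For example, with trainRatio=0.75, TrainValidationSplit produces one training set and one test set, with 75% of the data used for training and 25% for validation.
- Like CrossValidator, TrainValidationSplit finally fits the Estimator using the best parameters and the entire dataset.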
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
1 Test data (from the data directory of the Spark distribution)
val data = spark.read.format("libsvm").load("/data/mllib/sample_linear_regression_data.txt")
-9.490009878824548 1:0.4551273600657362 2:0.36644694351969087 3:-0.38256108933468047 4:-0.4458430198517267 5:0.33109790358914726 6:0.8067445293443565 7:-0.2624341731773887 8:-0.44850386111659524 9:-0.07269284838169332 10:0.5658035575800715
0.2577820163584905 1:0.8386555657374337 2:-0.1270180511534269 3:0.499812362510895 4:-0.22686625128130267 5:-0.6452430441812433 6:0.18869982177936828 7:-0.5804648622673358 8:0.651931743775642 9:-0.6555641246242951 10:0.17485476357259122
-4.438869807456516 1:0.5025608135349202 2:0.14208069682973434 3:0.16004976900412138 4:0.505019897181302 5:-0.9371635223468384 6:-0.2841601610457427 7:0.6355938616712786 8:-0.1646249064941625 9:0.9480713629917628 10:0.42681251564645817
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)
2 Choose the model
val lr = new LinearRegression().setMaxIter(10)
3 Use ParamGridBuilder to build a grid of parameters to search over; TrainValidationSplit tries every combination of values and uses the evaluator to determine the best model
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.1, 0.01)).addGrid(lr.fitIntercept).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).build()
4 The Estimator is a simple linear regression model; 80% of the data is used for training and 20% for validation
val trainValidationSplit = new TrainValidationSplit().setEstimator(lr).setEvaluator(new RegressionEvaluator).setEstimatorParamMaps(paramGrid).setTrainRatio(0.8)
5 Run TrainValidationSplit and select the best parameters
val model = trainValidationSplit.fit(training)
6 Predict on the test data; the model uses the best parameters found in the previous step
val allresult = model.transform(test)
allresult.rdd.take(5).foreach(println)
[-23.51088409032297,(10,[0,1,2,3,4,5,6,7,8,9],[-0.4683538422180036,0.1469540185936138,0.9113612952591796,-0.9838482669789823,0.4506466371133697,0.6456121712599778,0.8264783725578371,0.562664168655115,-0.8299281852090683,0.40690300256653256]),-1.6659388625179559]
[-21.432387764165806,(10,[0,1,2,3,4,5,6,7,8,9],[-0.4785033857256795,0.520350718059089,-0.2988515012130126,-0.46260150057299754,0.5394344995663083,0.39320468081626836,0.1890560923345248,0.13123799325264507,0.43613839380760355,0.39541998419731494]),0.3400877302576284]
[-12.977848725392104,(10,[0,1,2,3,4,5,6,7,8,9],[-0.5908891529017144,-0.7678208242918028,0.8512434510178621,-0.14910196410347298,0.6250260229199651,0.5393378705290228,-0.9573580597625002,-0.864881502860934,0.4175735160503429,0.4872169215922426]),-0.02335359093652395]
[-11.827072996392571,(10,[0,1,2,3,4,5,6,7,8,9],[0.9409739656166973,0.17053032210347996,-0.5735271206214345,0.2713064952443933,-0.11725988807909005,0.34413389399753047,-0.2987734110474076,-0.5436538528015331,-0.06578668798680076,0.7901644743575837]),2.5642684021108417]
[-10.945919657782932,(10,[0,1,2,3,4,5,6,7,8,9],[0.7669971723591666,0.38702771863552776,-0.6664311930513411,-0.2817072090916286,-0.16955916900934387,-0.9425831315444453,0.5685476711649924,-0.20782258743798265,0.015213591474494637,0.8183723865760859]),-0.1631314487734783]
scala> allresult.show
+--------------------+--------------------+--------------------+
|               label|            features|          prediction|
+--------------------+--------------------+--------------------+
|  -23.51088409032297|(10,[0,1,2,3,4,5,...| -1.6659388625179559|
| -21.432387764165806|(10,[0,1,2,3,4,5,...|  0.3400877302576284|
| -12.977848725392104|(10,[0,1,2,3,4,5,...|-0.02335359093652395|
| -11.827072996392571|(10,[0,1,2,3,4,5,...|  2.5642684021108417|
| -10.945919657782932|(10,[0,1,2,3,4,5,...| -0.1631314487734783|
|  -10.58331129986813|(10,[0,1,2,3,4,5,...|   2.517790654691453|
| -10.288657252388708|(10,[0,1,2,3,4,5,...| -0.9443474180536754|
|  -8.822357870425154|(10,[0,1,2,3,4,5,...|  0.6872889429113783|
|  -8.772667465932606|(10,[0,1,2,3,4,5,...|  -1.485408580416465|
|  -8.605713514762092|(10,[0,1,2,3,4,5,...|   1.110272909026478|
|  -6.544633229269576|(10,[0,1,2,3,4,5,...|  3.0454559778611285|
|  -5.055293333055445|(10,[0,1,2,3,4,5,...|  0.6441174575094268|
|  -5.039628433467326|(10,[0,1,2,3,4,5,...|  0.9572366607107066|
|  -4.937258492902948|(10,[0,1,2,3,4,5,...|  0.2292114538379546|
|  -3.741044592262687|(10,[0,1,2,3,4,5,...|   3.343205816009816|
|  -3.731112242951253|(10,[0,1,2,3,4,5,...| -2.6826413698701064|
|  -2.109441044710089|(10,[0,1,2,3,4,5,...| -2.1930034039595445|
| -1.8722161156986976|(10,[0,1,2,3,4,5,...| 0.49547270330052423|
| -1.1009750789589774|(10,[0,1,2,3,4,5,...| -0.9441633113006601|
|-0.48115211266405217|(10,[0,1,2,3,4,5,...| -0.6756196573079968|
+--------------------+--------------------+--------------------+
only showing top 20 rows
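To see how each parameter combination scored, rather than only the final predictions, the fitted model's `validationMetrics` can be paired with the grid. The sketch below is self-contained and therefore makes some assumptions: it uses a small synthetic dataset in place of sample_linear_regression_data.txt, and a local SparkSession with an arbitrary app name. The printed values depend on the random split, so no output is shown.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("tvs-metrics").getOrCreate()

// Assumption: synthetic stand-in data, label = 3*x1 - 2*x2 plus a small offset.
val rows = (0 until 200).map { i =>
  val x1 = (i % 20) / 20.0
  val x2 = (i % 7) / 7.0
  (3.0 * x1 - 2.0 * x2 + 0.01 * (i % 3), Vectors.dense(x1, x2))
}
val data = spark.createDataFrame(rows).toDF("label", "features")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345)

val lr = new LinearRegression().setMaxIter(10)
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val evaluator = new RegressionEvaluator() // default metric: rmse (smaller is better)
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)

val model = tvs.fit(training)

// validationMetrics(i) is the validation RMSE for getEstimatorParamMaps(i);
// there is one entry per grid point (2 * 2 * 3 = 12). Print the three best.
model.getEstimatorParamMaps.zip(model.validationMetrics)
  .sortBy { case (_, rmse) => rmse }
  .take(3)
  .foreach { case (params, rmse) => println(f"$rmse%.4f  <- $params") }

// RMSE of the refit best model on the held-out test split.
println(s"test rmse = ${evaluator.evaluate(model.transform(test))}")
```

Because each of the 12 combinations is fitted once, the search costs 12 fits plus the final refit, compared with 12 × k fits for k-fold cross-validation over the same grid.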
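6 Closing Remarks
That brings us to the end. Working through the detailed comparison and analysis stirred up a lot of feelings; this article took real effort to write, so please make good use of it.
秦凯新, Shenzhen, 2018-11-18 15:46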