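1) Logistic regression
Logistic regression is a popular method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of the outcomes. In spark.ml, logistic regression can predict a binary outcome using binomial logistic regression, or a multiclass outcome using multinomial logistic regression. Use the family parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.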
Multinomial logistic regression can be used for binary classification by setting the family param to "multinomial". It will produce two sets of coefficients and two intercepts.
When fitting a LogisticRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
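(1) Binomial logistic regression
For more background and details about the implementation of binomial logistic regression, refer to the documentation of logistic regression in spark.mllib.
Example code
The following example shows how to train binomial and multinomial logistic regression models for binary classification with elastic net regularization. elasticNetParam corresponds to α and regParam corresponds to λ.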
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for logistic regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// We can also use the multinomial family for binary classification
val mlr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
.setFamily("multinomial")
val mlrModel = mlr.fit(training)
// Print the coefficients and intercepts for logistic regression with multinomial family
println(s"Multinomial coefficients: ${mlrModel.coefficientMatrix}")
println(s"Multinomial intercepts: ${mlrModel.interceptVector}")
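To make the roles of α (elasticNetParam) and λ (regParam) concrete, the sketch below computes the elastic net penalty in plain Scala, assuming the standard form λ(α‖w‖₁ + (1−α)/2·‖w‖₂²). The object name ElasticNetPenalty and its penalty helper are illustrative only and are not part of the Spark API; Spark applies this term internally during optimization.

```scala
object ElasticNetPenalty {
  // Assumed penalty form: lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)
  // alpha = 1 gives a pure L1 (lasso) penalty, alpha = 0 a pure L2 (ridge) penalty.
  def penalty(w: Seq[Double], lambda: Double, alpha: Double): Double = {
    val l1 = w.map(x => math.abs(x)).sum
    val l2Squared = w.map(x => x * x).sum
    lambda * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared)
  }
}
```

With the settings used in the example (regParam = 0.3, elasticNetParam = 0.8), most of the penalty weight falls on the L1 term, which tends to push many coefficients to exactly zero.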
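The spark.ml implementation of logistic regression also supports extracting a summary of the model over the training set. Note that the predictions and metrics stored as DataFrames in LogisticRegressionSummary are annotated @transient and hence are only available on the driver.
Example code
LogisticRegressionTrainingSummary provides a summary for a LogisticRegressionModel. In the case of binary classification, certain additional metrics are available, e.g. the ROC curve. The binary summary can be accessed via the binarySummary method. See BinaryLogisticRegressionTrainingSummary.
Continuing the earlier example: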
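import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.max
// Needed for the $"..." column syntax used below (already in scope in spark-shell)
import spark.implicits._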
// Extract the summary from the returned LogisticRegressionModel instance trained in the earlier
// example
val trainingSummary = lrModel.binarySummary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
// Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
val roc = trainingSummary.roc
roc.show()
println(s"areaUnderROC: ${trainingSummary.areaUnderROC}")
// Set the model threshold to maximize F-Measure
val fMeasure = trainingSummary.fMeasureByThreshold
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
.select("threshold").head().getDouble(0)
lrModel.setThreshold(bestThreshold)
(2) Multinomial logistic regression
Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J, where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term, then a length-K vector of intercepts is available.
Multinomial coefficients are available as coefficientMatrix and intercepts are available as interceptVector.
The coefficients and intercept methods on a logistic regression model trained with the multinomial family are not supported. Use coefficientMatrix and interceptVector instead.
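As a rough illustration of how a K×J coefficient matrix and a length-K intercept vector turn into class probabilities, here is a plain-Scala softmax sketch. SoftmaxSketch and probabilities are hypothetical names, and nested arrays stand in for Spark's Matrix and Vector types.

```scala
object SoftmaxSketch {
  // coefficients: K rows (one per class), each with J entries (one per feature)
  // intercepts:   length-K array; x: length-J feature vector
  def probabilities(
      coefficients: Array[Array[Double]],
      intercepts: Array[Double],
      x: Array[Double]): Array[Double] = {
    // Linear margin for each class: w_k . x + b_k
    val margins = coefficients.zip(intercepts).map { case (row, b) =>
      row.zip(x).map { case (w, xi) => w * xi }.sum + b
    }
    // Subtract the largest margin before exponentiating for numerical stability
    val maxMargin = margins.max
    val exps = margins.map(m => math.exp(m - maxMargin))
    val total = exps.sum
    exps.map(_ / total)
  }
}
```

The predicted class is the index of the largest probability, and the K probabilities always sum to 1.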
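Example code
The following example shows how to train a multiclass logistic regression model with elastic net regularization, and how to extract the multiclass training summary for evaluating the model.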
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val training = spark
.read
.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for multinomial logistic regression
println(s"Coefficients: \n${lrModel.coefficientMatrix}")
println(s"Intercepts: \n${lrModel.interceptVector}")
val trainingSummary = lrModel.summary
// Obtain the objective per iteration
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(println)
// for multiclass, we can inspect metrics on a per-label basis
println("False positive rate by label:")
trainingSummary.falsePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
println(s"label $label: $rate")
}
println("True positive rate by label:")
trainingSummary.truePositiveRateByLabel.zipWithIndex.foreach { case (rate, label) =>
println(s"label $label: $rate")
}
println("Precision by label:")
trainingSummary.precisionByLabel.zipWithIndex.foreach { case (prec, label) =>
println(s"label $label: $prec")
}
println("Recall by label:")
trainingSummary.recallByLabel.zipWithIndex.foreach { case (rec, label) =>
println(s"label $label: $rec")
}
println("F-measure by label:")
trainingSummary.fMeasureByLabel.zipWithIndex.foreach { case (f, label) =>
println(s"label $label: $f")
}
val accuracy = trainingSummary.accuracy
val falsePositiveRate = trainingSummary.weightedFalsePositiveRate
val truePositiveRate = trainingSummary.weightedTruePositiveRate
val fMeasure = trainingSummary.weightedFMeasure
val precision = trainingSummary.weightedPrecision
val recall = trainingSummary.weightedRecall
println(s"Accuracy: $accuracy\nFPR: $falsePositiveRate\nTPR: $truePositiveRate\n" +
s"F-measure: $fMeasure\nPrecision: $precision\nRecall: $recall")
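2) Decision tree classifier
Decision trees are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on decision trees.
Example code
The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and the categorical features, adding metadata to the DataFrame which the decision tree algorithm can recognize.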
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
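3) Random forest classifier
Random forests are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on random forests.
Example code
The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and the categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.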
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setNumTrees(10)
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
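4) Gradient-boosted tree classifier
Gradient-boosted trees (GBTs) are a popular classification and regression method using ensembles of decision trees. More information about the spark.ml implementation can be found in the section on GBTs.
Example code
The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use two feature transformers to prepare the data; these help index categories for the label and the categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.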
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
.setInputCol("label")
.setOutputCol("indexedLabel")
.fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTClassifier()
.setLabelCol("indexedLabel")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
.setFeatureSubsetStrategy("auto")
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictedLabel")
.setLabels(labelIndexer.labels)
// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1.0 - accuracy}")
val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
println(s"Learned classification GBT model:\n ${gbtModel.toDebugString}")
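5) Multilayer perceptron classifier
The multilayer perceptron classifier (MLPC) is a classifier based on the feedforward artificial neural network. MLPC consists of multiple layers of nodes. Each layer is fully connected to the next layer in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by taking a linear combination of the inputs with the node's weights w and bias b and applying an activation function. An MLPC with K+1 layers can be written in matrix form.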
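For reference, the matrix form given in the Spark MLlib guide for an MLPC with K+1 layers can be written as below, together with the sigmoid activation that guide describes for intermediate layers and the softmax it describes for the output layer. The notation (w_k, b_k, f_k, z_i) follows that guide rather than anything defined elsewhere in these notes.

```latex
% MLPC with K+1 layers, input x, weights w_k, biases b_k, activations f_k
y(\mathbf{x}) = f_K\bigl(\dots f_2\bigl(\mathbf{w}_2^{T} f_1(\mathbf{w}_1^{T}\mathbf{x} + b_1) + b_2\bigr) \dots + b_K\bigr)

% Intermediate layers: sigmoid (logistic) activation
f(z_i) = \frac{1}{1 + e^{-z_i}}

% Output layer: softmax over the N output nodes
f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}
```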
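The number of nodes N in the output layer corresponds to the number of classes. MLPC employs backpropagation for learning the model. We use the logistic loss function for optimization and L-BFGS as the optimization routine.
Example code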
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4
// and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setSeed(1234L)
.setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute accuracy on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")
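The layer-by-layer mapping described for MLPC can be sketched as a forward pass in plain Scala. This is a minimal illustration, assuming sigmoid hidden layers and a softmax output layer as described in the Spark guide; MlpForward and its weight layout (one row of weights per output node) are assumptions for the sketch, not Spark's internal representation.

```scala
object MlpForward {
  type Matrix = Array[Array[Double]] // one row of weights per output node

  // Linear combination of inputs with weights, plus bias, for every node in a layer
  private def affine(w: Matrix, b: Array[Double], x: Array[Double]): Array[Double] =
    w.zip(b).map { case (row, bias) =>
      row.zip(x).map { case (wi, xi) => wi * xi }.sum + bias
    }

  private def sigmoid(z: Array[Double]): Array[Double] =
    z.map(v => 1.0 / (1.0 + math.exp(-v)))

  private def softmax(z: Array[Double]): Array[Double] = {
    val m = z.max // subtract the max for numerical stability
    val e = z.map(v => math.exp(v - m))
    val s = e.sum
    e.map(_ / s)
  }

  // layers: (weights, biases) for each layer after the input layer.
  // Sigmoid is applied on all but the last layer, softmax on the last.
  def forward(layers: Seq[(Matrix, Array[Double])], x: Array[Double]): Array[Double] = {
    val hidden = layers.init.foldLeft(x) { case (a, (w, b)) => sigmoid(affine(w, b, a)) }
    val (wOut, bOut) = layers.last
    softmax(affine(wOut, bOut, hidden))
  }
}
```

For the layers Array(4, 5, 4, 3) used in the example, such a forward pass would take three (weights, biases) pairs of shapes 5×4, 4×5 and 3×4 and return three class probabilities.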
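6) Linear Support Vector Machine
A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. LinearSVC in Spark ML supports binary classification with a linear SVM. Internally, it optimizes the hinge loss using the OWLQN optimizer.
Example code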
import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val lsvc = new LinearSVC()
.setMaxIter(10)
.setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)
// Print the coefficients and intercept for linear svc
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")
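The hinge loss that LinearSVC optimizes can be illustrated for a single instance as follows. This is a plain-Scala sketch under the usual definition max(0, 1 − y·(w·x + b)) with labels mapped from {0, 1} to {−1, +1}; HingeLoss and its loss helper are illustrative names, not Spark API.

```scala
object HingeLoss {
  // label01 is the 0.0 / 1.0 label as stored in the DataFrame;
  // it is mapped to -1 / +1 before computing the margin.
  def loss(w: Array[Double], b: Double, x: Array[Double], label01: Double): Double = {
    val y = 2.0 * label01 - 1.0
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
    math.max(0.0, 1.0 - y * margin)
  }
}
```

Points classified correctly with a margin of at least 1 contribute zero loss, so only points near or on the wrong side of the hyperplane influence the fit.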
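7) One-vs-Rest classifier
OneVsRest is an example of a machine learning reduction for performing multiclass classification given a base classifier that can perform binary classification efficiently. It is also known as "One-vs-All".
OneVsRest is implemented as an Estimator. For the base classifier, it takes instances of Classifier and creates a binary classification problem for each of the k classes. The classifier for class i is trained to predict whether the label is i or not, distinguishing class i from all other classes.
Predictions are done by evaluating each binary classifier, and the index of the most confident classifier is output as the label.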
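The prediction rule described above amounts to an argmax over the per-class confidence scores. The sketch below shows this in plain Scala, modelling each binary classifier as a hypothetical scoring function from a feature array to a confidence value; OneVsRestSketch is an illustrative name and does not reflect how Spark's OneVsRestModel is implemented internally.

```scala
object OneVsRestSketch {
  // classifiers(i) returns the confidence that the label of x is class i.
  // The predicted label is the index of the most confident classifier.
  def predict(classifiers: Seq[Array[Double] => Double], x: Array[Double]): Int = {
    val scores = classifiers.map(c => c(x))
    scores.indexOf(scores.max)
  }
}
```

If two classifiers tie, this sketch returns the lower index; how ties are resolved in practice is an implementation detail not covered here.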
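Example code
The example below demonstrates how to load the Iris dataset, parse it as a DataFrame, and perform multiclass classification using OneVsRest. The test error is calculated to measure the algorithm's accuracy.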
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// load data file.
val inputData = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
// generate the train/test split.
val Array(train, test) = inputData.randomSplit(Array(0.8, 0.2))
// instantiate the base classifier
val classifier = new LogisticRegression()
.setMaxIter(10)
.setTol(1E-6)
.setFitIntercept(true)
// instantiate the One Vs Rest Classifier.
val ovr = new OneVsRest().setClassifier(classifier)
// train the multiclass model.
val ovrModel = ovr.fit(train)
// score the model on test data.
val predictions = ovrModel.transform(test)
// obtain evaluator.
val evaluator = new MulticlassClassificationEvaluator()
.setMetricName("accuracy")
// compute the classification error on test data.
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${1 - accuracy}")
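8) Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between every pair of features.
Naive Bayes can be trained very efficiently. With a single pass over the training data, it computes the conditional probability distribution of each feature given each label. For prediction, it applies Bayes' theorem to compute the conditional probability distribution of each label given an observation.
MLlib supports both multinomial naive Bayes and Bernoulli naive Bayes.
Input data: these models are typically used for document classification. Within that context, each observation is a document and each feature represents a term. A feature's value is the frequency of the term (in multinomial naive Bayes) or a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes). Feature values must be non-negative. The model type is selected with an optional parameter, "multinomial" or "bernoulli", with "multinomial" as the default. For document classification, the input feature vectors should usually be sparse vectors. Since the training data is only used once, it is not necessary to cache it.
Additive smoothing can be used by setting the parameter λ (default 1.0).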
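To show what additive smoothing with λ does, the sketch below computes smoothed log conditional probabilities for one class of a multinomial naive Bayes model in plain Scala. It assumes the common estimate (count_j + λ) / (total + λ·J), where J is the number of features; NaiveBayesSmoothing and logTheta are illustrative names, not Spark API.

```scala
object NaiveBayesSmoothing {
  // termCounts(j): total count of term j over all documents of a single class.
  // Returns log P(term j | class) with additive (Laplace) smoothing.
  def logTheta(termCounts: Array[Double], lambda: Double = 1.0): Array[Double] = {
    val denom = termCounts.sum + lambda * termCounts.length
    termCounts.map(c => math.log((c + lambda) / denom))
  }
}
```

With λ > 0, a term that never appears in a class still receives a small nonzero probability, so a single unseen term does not drive the class posterior to zero.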
Example code
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
// Train a NaiveBayes model.
val model = new NaiveBayes()
.fit(trainingData)
// Select example rows to display.
val predictions = model.transform(testData)
predictions.show()
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
1) Linear regression
The interface for working with linear regression models and model summaries is similar to the logistic regression case.
When fitting a LinearRegressionModel without an intercept on a dataset with a constant nonzero column using the "l-bfgs" solver, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
Example code
import org.apache.spark.ml.regression.LinearRegression
// Load training data
val training = spark.read.format("libsvm")
.load("data/mllib/sample_linear_regression_data.txt")
val lr = new LinearRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
// Print the coefficients and intercept for linear regression
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
// Summarize the model over the training set and print out some metrics
val trainingSummary = lrModel.summary
println(s"numIterations: ${trainingSummary.totalIterations}")
println(s"objectiveHistory: [${trainingSummary.objectiveHistory.mkString(",")}]")
trainingSummary.residuals.show()
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
println(s"r2: ${trainingSummary.r2}")
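2) Generalized linear regression
Contrasted with linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models where the response variable Yi follows some distribution from the exponential family of distributions. Spark's GeneralizedLinearRegression interface allows for flexible specification of GLMs, which can be used for various types of prediction problems including linear regression, Poisson regression, logistic regression, and others. Currently in spark.ml, only a subset of the exponential family distributions is supported.
Note: Spark currently only supports up to 4096 features through its GeneralizedLinearRegression interface, and will throw an exception if this constraint is exceeded. See the advanced section for more details. For linear and logistic regression, models with more features can still be trained using the LinearRegression and LogisticRegression estimators.
GLMs require exponential family distributions that can be written in their "canonical" or "natural" form, also known as natural exponential family distributions.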
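For reference, the natural exponential family form referred to here is usually written as below, following the notation of the Spark MLlib guide, where θ is the parameter of interest and τ is a dispersion parameter. The symbols h, A and d are the generic functions that define a particular family and are not defined elsewhere in these notes.

```latex
f_Y(y \mid \theta, \tau) = h(y, \tau)\,\exp\!\left(\frac{\theta \cdot y - A(\theta)}{d(\tau)}\right)
```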
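(2) Available families
Example code
The following example demonstrates training a GLM with a Gaussian response and identity link function and extracting model summary statistics.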
import org.apache.spark.ml.regression.GeneralizedLinearRegression
// Load training data
val dataset = spark.read.format("libsvm")
.load("data/mllib/sample_linear_regression_data.txt")
val glr = new GeneralizedLinearRegression()
.setFamily("gaussian")
.setLink("identity")
.setMaxIter(10)
.setRegParam(0.3)
// Fit the model
val model = glr.fit(dataset)
// Print the coefficients and intercept for generalized linear regression model
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
// Summarize the model over the training set and print out some metrics
val summary = model.summary
println(s"Coefficient Standard Errors: ${summary.coefficientStandardErrors.mkString(",")}")
println(s"T Values: ${summary.tValues.mkString(",")}")
println(s"P Values: ${summary.pValues.mkString(",")}")
println(s"Dispersion: ${summary.dispersion}")
println(s"Null Deviance: ${summary.nullDeviance}")
println(s"Residual Degree Of Freedom Null: ${summary.residualDegreeOfFreedomNull}")
println(s"Deviance: ${summary.deviance}")
println(s"Residual Degree Of Freedom: ${summary.residualDegreeOfFreedom}")
println(s"AIC: ${summary.aic}")
println("Deviance Residuals: ")
summary.residuals().show()
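3) Decision tree regression
Decision trees are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on decision trees.
Example code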
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Automatically identify categorical features, and index them.
// Here, we treat features with > 4 distinct values as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a DecisionTree model.
val dt = new DecisionTreeRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
// Chain indexer and tree in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, dt))
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]
println(s"Learned regression tree model:\n ${treeModel.toDebugString}")
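4) Random forest regression
Random forests are a popular family of classification and regression methods. More information about the spark.ml implementation can be found in the section on random forests.
Example code
The following example loads a dataset in LibSVM format, splits it into training and test sets, trains on the first dataset, and then evaluates on the held-out test set. We use a feature transformer to index categorical features, adding metadata to the DataFrame which the tree-based algorithms can recognize.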
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a RandomForest model.
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, rf))
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
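5) Gradient-boosted tree regression
Gradient-boosted trees (GBTs) are a popular regression method using ensembles of decision trees. More information about the spark.ml implementation can be found in the section on GBTs.
Example code
Note: for this example dataset, GBTRegressor actually only needs 1 iteration, but that will not be true in general.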
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}
// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
// Train a GBT model.
val gbt = new GBTRegressor()
.setLabelCol("label")
.setFeaturesCol("indexedFeatures")
.setMaxIter(10)
// Chain indexer and GBT in a Pipeline.
val pipeline = new Pipeline()
.setStages(Array(featureIndexer, gbt))
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
// Make predictions.
val predictions = model.transform(testData)
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
println(s"Learned regression GBT model:\n ${gbtModel.toDebugString}")
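6) Survival regression
In spark.ml, we implement the Accelerated Failure Time (AFT) model, which is a parametric survival regression model for censored data. It describes a model for the log of survival time, so it is often called a log-linear model for survival analysis. Unlike a proportional hazards model designed for the same purpose, the AFT model is easier to parallelize because each instance contributes to the objective function independently.
Given the values of the covariates x', for random lifetimes t_i of subjects i = 1, …, n, with possible right-censoring, the AFT model specifies a likelihood function over the observations.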
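For reference, the likelihood under the AFT model is commonly written as below, following the form in the Spark MLlib guide. Here δ_i indicates whether the event occurred for subject i (that is, the observation is uncensored), β and σ are the regression coefficients and scale parameter, and f_0 and S_0 are the baseline density and survivor functions; these symbols come from that guide and are not defined elsewhere in these notes.

```latex
L(\beta, \sigma) = \prod_{i=1}^{n}
  \left[ \frac{1}{\sigma}\, f_0\!\left( \frac{\log t_i - x'\beta}{\sigma} \right) \right]^{\delta_i}
  S_0\!\left( \frac{\log t_i - x'\beta}{\sigma} \right)^{1 - \delta_i}
```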
When fitting an AFTSurvivalRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior is different from R survival::survreg.
Example code
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression
val training = spark.createDataFrame(Seq(
(1.218, 1.0, Vectors.dense(1.560, -0.605)),
(2.949, 0.0, Vectors.dense(0.346, 2.158)),
(3.627, 0.0, Vectors.dense(1.380, 0.231)),
(0.273, 1.0, Vectors.dense(0.520, 1.151)),
(4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")
val quantileProbabilities = Array(0.3, 0.6)
val aft = new AFTSurvivalRegression()
.setQuantileProbabilities(quantileProbabilities)
.setQuantilesCol("quantiles")
val model = aft.fit(training)
// Print the coefficients, intercept and scale parameter for AFT survival regression
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale: ${model.scale}")
model.transform(training).show(false)
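7) Isotonic regression
We implement a pool adjacent violators algorithm which uses an approach to parallelizing isotonic regression. The training input is a DataFrame which contains three columns: label, features, and weight. Additionally, the IsotonicRegression algorithm has one optional parameter called isotonic, defaulting to true. This argument specifies whether the isotonic regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
Training returns an IsotonicRegressionModel that can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function, which determines the prediction rules.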
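The piecewise linear prediction can be sketched in plain Scala as follows. The sketch assumes the rules described in the Spark MLlib guide: an input that exactly matches a training feature returns its associated prediction, an input below or above all training features returns the first or last prediction, and anything in between is linearly interpolated from the two closest features. IsotonicPredict is an illustrative name, and the arrays play the role of the model's boundaries and predictions.

```scala
object IsotonicPredict {
  // boundaries:  feature values in strictly increasing order
  // predictions: fitted values associated with each boundary
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    val idx = java.util.Arrays.binarySearch(boundaries, x)
    if (idx >= 0) {
      predictions(idx) // exact match with a boundary
    } else {
      val insertAt = -idx - 1 // position where x would be inserted
      if (insertAt == 0) predictions.head // below all boundaries
      else if (insertAt == boundaries.length) predictions.last // above all boundaries
      else {
        // Linear interpolation between the two closest boundaries
        val (x0, x1) = (boundaries(insertAt - 1), boundaries(insertAt))
        val (y0, y1) = (predictions(insertAt - 1), predictions(insertAt))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

Because the fitted predictions are monotone, the interpolated values are monotone as well.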
Example code
import org.apache.spark.ml.regression.IsotonicRegression
// Loads data.
val dataset = spark.read.format("libsvm")
.load("data/mllib/sample_isotonic_regression_libsvm_data.txt")
// Trains an isotonic regression model.
val ir = new IsotonicRegression()
val model = ir.fit(dataset)
println(s"Boundaries in increasing order: ${model.boundaries}\n")
println(s"Predictions associated with the boundaries: ${model.predictions}\n")
// Makes predictions.
model.transform(dataset).show()
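Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.
The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances.
Users can find more information about the decision tree algorithm in the MLlib Decision Tree guide. This API differs from the original MLlib Decision Tree API in several ways.
The Pipelines API for decision trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities); for regression, users can get the biased sample variance of the prediction.
Ensembles of trees (random forests and gradient-boosted trees) are described below in the tree ensembles section.
1) Inputs and outputs
We list the input and output (prediction) column types here. All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
The DataFrame API supports two major tree ensemble algorithms: random forests and gradient-boosted trees (GBTs). Both use spark.ml decision trees as their base models.
Users can find more information about ensemble algorithms in the MLlib Ensemble guide.
In this section, we demonstrate the DataFrame API for ensembles.
This API differs from the original MLlib ensembles API in several ways.
1) Random forests
Random forests are ensembles of decision trees. Random forests combine many decision trees in order to reduce the risk of overfitting. The spark.ml implementation supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features.
For more information on the algorithm itself, see the spark.mllib documentation on random forests.
(1) Inputs and outputs
We list the input and output (prediction) column types here. All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
2) Gradient-boosted trees (GBTs)
Gradient-boosted trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. The spark.ml implementation supports GBTs for binary classification and for regression, using both continuous and categorical features.
For more information on the algorithm itself, see the spark.mllib documentation on GBTs.
(1) Inputs and outputs
We list the input and output (prediction) column types here. All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.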