Synced from Buracag's blog.
This section covers the regression algorithms implemented in Spark ML. The demos include linear regression, generalized linear regression, decision tree regression, random forest regression, and gradient-boosted tree regression, among others.
Linear regression works much like logistic regression, so here is the example code directly:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 10:11
# @Author : buracagyang
# @File : linear_regression_with_elastic_net.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("LinearRegressionWithElasticNet").getOrCreate()

    training = spark.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")
    lr = LinearRegression(maxIter=20, regParam=0.01, elasticNetParam=0.6)
    lrModel = lr.fit(training)

    # Coefficients and intercept
    # print("Coefficients: %s" % str(lrModel.coefficients))
    # print("Intercept: %s" % str(lrModel.intercept))

    # Model evaluation metrics over the training set
    trainingSummary = lrModel.summary
    print("numIterations: %d" % trainingSummary.totalIterations)
    print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
    trainingSummary.residuals.show()
    print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
    print("r2: %f" % trainingSummary.r2)

    spark.stop()
```
The results are as follows (such a low $r^2$...):

```
numIterations: 8
objectiveHistory: [0.49999999999999994, 0.4931849471651455, 0.4863393782275527, 0.48633557300904495, 0.48633543657045664, 0.4863354337756586, 0.4863354337094886, 0.4863354337092549]
+-------------------+
| residuals|
+-------------------+
| -10.9791066152217|
| 0.9208605107751258|
| -4.608888656779226|
|-20.424003572582702|
|-10.316545030639608|
+-------------------+
only showing top 5 rows
RMSE: 10.163110
r2: 0.027836
```
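The summary metrics above are all computed on the training data. As a minimal sketch (reusing the same SparkSession and data file; the split ratio and seed are arbitrary choices, not part of the original example), generalization error can be estimated on a held-out split with RegressionEvaluator:

```python
# A minimal sketch: estimate out-of-sample error instead of relying only on
# the in-sample training summary. Split ratio and seed are arbitrary choices.
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

data = spark.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=2019)

model = LinearRegression(maxIter=20, regParam=0.01, elasticNetParam=0.6).fit(train)
predictions = model.transform(test)  # adds a "prediction" column

evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
print("Test RMSE: %f" % evaluator.evaluate(predictions))
print("Test r2: %f" % evaluator.evaluate(predictions, {evaluator.metricName: "r2"}))
```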
In contrast to linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models in which the response variable $Y_i$ follows some distribution from the exponential family. Spark's GeneralizedLinearRegression interface allows flexible specification of GLMs for various types of prediction problems, including linear regression, Poisson regression, logistic regression, and others. Currently spark.ml supports only a subset of the exponential-family distributions, listed below.
| Family | Response type | Supported links |
|---|---|---|
| Gaussian | continuous | Identity*, Log, Inverse |
| Binomial | binary | Logit*, Probit, CLogLog |
| Poisson | count | Log*, Identity, Sqrt |
| Gamma | continuous | Inverse*, Identity, Log |
| Tweedie | zero-inflated continuous | Power link function |

*Canonical link
Note: Spark currently supports at most 4096 features through its GeneralizedLinearRegression interface and throws an exception if this constraint is exceeded. For linear and logistic regression specifically, models with more features can still be trained using the LinearRegression and LogisticRegression estimators (so GLR is of little use for high-dimensional feature sets...).
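As a hedged sketch of how the family/link pairs from the table are selected (the estimators below are only constructed, not fit; a Poisson fit would additionally require non-negative count-valued labels):

```python
# A minimal sketch of specifying other family/link pairs from the table above.
from pyspark.ml.regression import GeneralizedLinearRegression

# Poisson regression with its canonical Log link (labels must be counts)
poisson = GeneralizedLinearRegression(family="poisson", link="log", maxIter=10, regParam=0.3)

# Gamma regression with a Log link (Inverse is the canonical one)
gamma = GeneralizedLinearRegression(family="gamma", link="log", maxIter=10, regParam=0.3)

# Tweedie regression: the power link is set via linkPower, not `link`
tweedie = GeneralizedLinearRegression(family="tweedie", variancePower=1.5, linkPower=0.0)
```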
GLMs require exponential-family distributions that can be written in "canonical" or "natural" form, i.e., natural exponential-family distributions. A natural exponential-family distribution has the form:
$$f_Y(y|\theta, \tau) = h(y, \tau)\exp\left(\frac{\theta \cdot y - A(\theta)}{d(\tau)}\right) \tag{1}$$
where $\theta$ is the parameter of interest and $\tau$ is the dispersion parameter. In a GLM the response variable $Y_i$ is assumed to be drawn from a natural exponential-family distribution:
$$Y_i \sim f(\cdot|\theta_i, \tau) \tag{2}$$
where the parameter of interest $\theta_i$ is related to the expected value $\mu_i$ of the response variable by:
$$\mu_i = A'(\theta_i) \tag{3}$$
Here, $A'(\theta_i)$ is determined by the chosen form of the distribution. GLMs also allow specification of a link function, which defines the relationship between the expected value $\mu_i$ of the response variable and the linear predictor $\eta_i$:
$$g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta} \tag{4}$$
Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link function $g(\mu)$ is said to be the "canonical" link function:
$$\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i \tag{5}$$
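As a concrete instance of equation (5), take the Gaussian family: there $A(\theta) = \theta^2/2$, so $\mu = A'(\theta) = \theta$ and the canonical link is the identity, $g(\mu) = \mu$. For the Poisson family, $A(\theta) = e^{\theta}$, so $\mu = e^{\theta}$ and the canonical link is $g(\mu) = \log\mu$, which is why Log is starred for Poisson in the table above.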
A GLM finds the regression coefficients $\vec{\beta}$ that maximize the likelihood function:
$$\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) = \prod_{i=1}^{N} h(y_i, \tau) \exp\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right) \tag{6}$$
where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$ by:
$$\theta_i = A'^{-1}(g^{-1}(\vec{x_i} \cdot \vec{\beta})) \tag{7}$$
The example code is as follows:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 13:12
# @Author : buracagyang
# @File : generalized_linear_regression_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.sql import SparkSession
from pyspark.ml.regression import GeneralizedLinearRegression

if __name__ == "__main__":
    spark = SparkSession.builder.appName("GeneralizedLinearRegressionExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")
    glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
    model = glr.fit(dataset)

    # Coefficients and intercept
    print("Coefficients: " + str(model.coefficients))
    print("Intercept: " + str(model.intercept))

    # Summary statistics (on this dataset the coefficients fail the significance tests...)
    summary = model.summary
    print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
    print("T Values: " + str(summary.tValues))
    print("P Values: " + str(summary.pValues))
    print("Dispersion: " + str(summary.dispersion))
    print("Null Deviance: " + str(summary.nullDeviance))
    print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
    print("Deviance: " + str(summary.deviance))
    print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
    print("AIC: " + str(summary.aic))
    print("Deviance Residuals: ")
    summary.residuals().show()

    spark.stop()
```
Decision tree regression works just like decision tree classification, so here is the example code directly:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 13:57
# @Author : buracagyang
# @File : decision_tree_regression_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("DecisionTreeRegressionExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    # Automatically identify categorical features and index them. maxCategories is set
    # so that features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, dt])
    model = pipeline.fit(trainingData)

    predictions = model.transform(testData)
    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    treeModel = model.stages[1]
    print(treeModel)

    spark.stop()
```
The results are as follows:

```
+----------+-----+--------------------+
|prediction|label| features|
+----------+-----+--------------------+
| 0.0| 0.0|(692,[100,101,102...|
| 0.0| 0.0|(692,[121,122,123...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows
Root Mean Squared Error (RMSE) on test data = 0.297044
DecisionTreeRegressionModel (uid=DecisionTreeRegressor_4089a3fc367ac7a943d9) of depth 2 with 5 nodes
```
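As a small follow-up sketch (using the treeModel printed above), the learned splits can be inspected directly:

```python
# A minimal sketch: dump the learned tree structure, one line per node
# (split feature, threshold, and leaf predictions).
print(treeModel.toDebugString)
```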
Random forest regression is similar to decision tree regression; the example code is as follows:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 14:05
# @Author : buracagyang
# @File : random_forest_regressor_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("RandomForestRegressorExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    # Automatically identify categorical features and index them. maxCategories is set
    # so that features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    rf = RandomForestRegressor(featuresCol="indexedFeatures")

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, rf])
    model = pipeline.fit(trainingData)

    predictions = model.transform(testData)
    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    rfModel = model.stages[1]
    print(rfModel)

    spark.stop()
```
The results are as follows:

```
+----------+-----+--------------------+
|prediction|label| features|
+----------+-----+--------------------+
| 0.4| 0.0|(692,[100,101,102...|
| 0.0| 0.0|(692,[121,122,123...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.1| 0.0|(692,[124,125,126...|
| 0.15| 0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows
Root Mean Squared Error (RMSE) on test data = 0.141421
RandomForestRegressionModel (uid=RandomForestRegressor_4dc1b1ad32480cc89ddc) with 20 trees
```
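Beyond the RMSE, the fitted forest exposes a few useful attributes; a minimal sketch using the rfModel from above:

```python
# A minimal sketch: inspect the fitted forest. featureImportances is a sparse
# vector with one weight per feature, summing to 1; trees holds the individual
# DecisionTreeRegressionModel instances.
print(rfModel.featureImportances)
print("trees in the forest: %d" % len(rfModel.trees))
```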
The example code is as follows (note that for this sample dataset GBTRegressor effectively needs only one iteration, but in general more are required):
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 14:11
# @Author : buracagyang
# @File : gradient_boosted_tree_regressor_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("GradientBoostedTreeRegressorExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, gbt])
    model = pipeline.fit(trainingData)

    predictions = model.transform(testData)
    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    gbtModel = model.stages[1]
    print(gbtModel)

    spark.stop()
```
The results are as follows:

```
+----------+-----+--------------------+
|prediction|label| features|
+----------+-----+--------------------+
| 0.0| 0.0|(692,[100,101,102...|
| 0.0| 0.0|(692,[121,122,123...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[124,125,126...|
| 0.0| 0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows
Root Mean Squared Error (RMSE) on test data = 0.297044
GBTRegressionModel (uid=GBTRegressor_46239bc64cc2f59b1fb5) with 10 trees
```
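Since in practice the number of boosting iterations and the tree depth matter far more than in this toy example, here is a hedged sketch of tuning them with CrossValidator (the grid values are illustrative, not recommendations; it reuses pipeline, gbt, evaluator, and the data splits from above):

```python
# A minimal sketch: grid-search GBT hyper-parameters by 3-fold cross-validation.
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

grid = (ParamGridBuilder()
        .addGrid(gbt.maxIter, [10, 50])
        .addGrid(gbt.maxDepth, [3, 5])
        .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, seed=2019)
cvModel = cv.fit(trainingData)  # picks the best grid point by average RMSE
print("RMSE of best model on test data = %g" % evaluator.evaluate(cvModel.transform(testData)))
```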
spark.ml implements the accelerated failure time (AFT) model, a parametric survival regression model for censored data. It models the logarithm of the survival time, so it is often called a log-linear model for survival analysis. Unlike the proportional hazards model designed for the same purpose, the AFT model is easier to parallelize, because each instance contributes to the objective function independently.
Given the values of the covariates $x'$, for random lifetimes $t_i$ of subjects $i = 1, \ldots, n$, possibly subject to right-censoring, the likelihood function under the AFT model is:
$$L(\beta,\sigma)=\prod_{i=1}^n\left[\frac{1}{\sigma}f_{0}\left(\frac{\log{t_{i}}-x^{\prime}\beta}{\sigma}\right)\right]^{\delta_{i}}S_{0}\left(\frac{\log{t_{i}}-x^{\prime}\beta}{\sigma}\right)^{1-\delta_{i}} \tag{8}$$
where $\delta_i$ is the indicator of whether the event occurred, i.e., whether the observation is uncensored. Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{\prime}\beta}{\sigma}$, the log-likelihood function takes the form:
$$\iota(\beta,\sigma)=\sum_{i=1}^{n}\left[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}\right] \tag{9}$$
where $S_{0}(\epsilon_{i})$ is the baseline survival function and $f_{0}(\epsilon_{i})$ is the corresponding density function.
The most commonly used AFT model is based on Weibull-distributed survival times. A Weibull distribution for the lifetime corresponds to an extreme value distribution for the log of the lifetime, and the $S_{0}(\epsilon)$ function is:
$$S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}}) \tag{10}$$
with the $f_{0}(\epsilon_{i})$ function taking the form:
$$f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}}) \tag{11}$$
The log-likelihood function for an AFT model with Weibull lifetime distribution is:
$$\iota(\beta,\sigma)= -\sum_{i=1}^n\left[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}\right] \tag{12}$$
Since minimizing the negative log-likelihood is equivalent to maximizing the posterior probability, the loss function used for optimization is $-\iota(\beta,\sigma)$. The gradients with respect to $\beta$ and $\log\sigma$ are, respectively:
$$\frac{\partial (-\iota)}{\partial \beta} = \sum_{i=1}^{n}\left[\delta_{i}-e^{\epsilon_{i}}\right]\frac{x_{i}}{\sigma} \qquad \frac{\partial (-\iota)}{\partial (\log\sigma)} = \sum_{i=1}^{n}\left[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}\right] \tag{13}$$
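For intuition, equations (12) and (13) are easy to transcribe into plain NumPy (a hypothetical helper, not Spark's internal code; Spark feeds these quantities to L-BFGS):

```python
# A minimal NumPy sketch of the Weibull AFT negative log-likelihood (eq. 12)
# and its gradients (eq. 13). Hypothetical helper for intuition only.
import numpy as np

def aft_loss_and_gradients(beta, log_sigma, X, t, delta):
    sigma = np.exp(log_sigma)
    eps = (np.log(t) - X @ beta) / sigma                           # epsilon_i
    loss = np.sum(delta * log_sigma - delta * eps + np.exp(eps))   # -iota
    grad_beta = (delta - np.exp(eps)) @ X / sigma                  # d(-iota)/d(beta)
    grad_log_sigma = np.sum(delta + (delta - np.exp(eps)) * eps)   # d(-iota)/d(log sigma)
    return loss, grad_beta, grad_log_sigma
```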
The AFT model can be cast as a convex optimization problem, i.e., the task of minimizing a convex function $-\iota(\beta,\sigma)$ that depends on the coefficient vector $\beta$ and the log of the scale parameter $\log\sigma$. The optimization algorithm underlying the implementation is L-BFGS. The implementation matches the result of R's survival::survreg function.
When an AFTSurvivalRegressionModel is fit without an intercept on a dataset with constant nonzero columns, Spark MLlib outputs zero coefficients for those columns. This behavior differs from R's survival::survreg.
The example code is as follows:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 14:38
# @Author : buracagyang
# @File : aft_survival_regression.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("AFTSurvivalRegressionExample").getOrCreate()

    training = spark.createDataFrame([
        (1.218, 1.0, Vectors.dense(1.560, -0.605)),
        (2.949, 0.0, Vectors.dense(0.346, 2.158)),
        (3.627, 0.0, Vectors.dense(1.380, 0.231)),
        (0.273, 1.0, Vectors.dense(0.520, 1.151)),
        (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])

    quantileProbabilities = [0.3, 0.6]
    aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles")
    model = aft.fit(training)

    print("Coefficients: " + str(model.coefficients))
    print("Intercept: " + str(model.intercept))
    print("Scale: " + str(model.scale))
    model.transform(training).show(truncate=False)

    spark.stop()
```
The results are as follows:

```
Coefficients: [-0.49631114666506776,0.1984443769993409]
Intercept: 2.6380946151
Scale: 1.54723455744
+-----+------+--------------+-----------------+---------------------------------------+
|label|censor|features |prediction |quantiles |
+-----+------+--------------+-----------------+---------------------------------------+
|1.218|1.0 |[1.56,-0.605] |5.718979487634966|[1.1603238947151588,4.995456010274735] |
|2.949|0.0 |[0.346,2.158] |18.07652118149563|[3.667545845471802,15.789611866277884] |
|3.627|0.0 |[1.38,0.231] |7.381861804239103|[1.4977061305190853,6.447962612338967] |
|0.273|1.0 |[0.52,1.151] |13.57761250142538|[2.7547621481507067,11.859872224069784]|
|4.199|0.0 |[0.795,-0.226]|9.013097744073846|[1.8286676321297735,7.872826505878384] |
+-----+------+--------------+-----------------+---------------------------------------+
```
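The quantiles column can be checked by hand: for the Weibull AFT model the $p$-quantile of the lifetime is $\exp(x^{\prime}\beta + \text{intercept}) \cdot (-\log(1-p))^{\sigma}$, i.e., the prediction scaled by a factor that depends only on $p$ and the scale. A small sketch verifying the first output row above:

```python
# A minimal check of the quantiles column against the Weibull quantile formula.
# The prediction and scale are copied from the first output row above.
import math

prediction, scale = 5.718979487634966, 1.54723455744
for p in [0.3, 0.6]:
    print(prediction * (-math.log(1.0 - p)) ** scale)
# prints ~1.16032 and ~4.99546, matching the quantiles column
```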
Isotonic regression belongs to the family of regression algorithms. It is defined as follows: given a finite set of real numbers $Y = \{y_1, y_2, ..., y_n\}$ representing the observed responses and $X = \{x_1, x_2, ..., x_n\}$ the unknown response values to be fitted, find a function that minimizes:
$$f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2 \tag{14}$$
subject to the complete order $x_1\le x_2\le \cdots \le x_n$, where the $w_i$ are positive weights. The resulting function is called the isotonic regression, and it is unique. It can be viewed as a least-squares problem under an order constraint. Essentially, isotonic regression is the monotone function that best fits the original data points.
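A minimal sketch of the sequential pool-adjacent-violators idea with unit weights (a hypothetical helper; Spark's actual implementation is parallelized and weighted):

```python
# A minimal sketch of the pool adjacent violators algorithm (PAVA) with unit
# weights: pool neighboring points whenever monotonicity is violated.
def pava(y):
    blocks = []  # each block: [pooled mean, number of points pooled]
    for v in y:
        blocks.append([float(v), 1])
        # merge backwards while the increasing constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    # expand the pooled blocks back to one fitted value per input point
    return [value for value, weight in blocks for _ in range(weight)]

print(pava([1, 3, 2, 2, 5]))  # [1.0, 2.333..., 2.333..., 2.333..., 5.0]
```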
Spark implements the pool adjacent violators algorithm (PAVA), using an approach that parallelizes isotonic regression. The training input is a DataFrame with three columns: label, features, and weight. In addition, the IsotonicRegression algorithm has an optional parameter called isotonic (defaulting to true), which indicates whether the isotonic regression is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
Training returns an IsotonicRegressionModel, which can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function, so the prediction rule is:

- If the prediction input exactly matches a training feature, the associated prediction is returned; if several predictions share that feature, one of them is returned (which one is undefined).
- If the prediction input is lower or higher than all training features, the prediction for the lowest or highest feature, respectively, is returned.
- If the prediction input falls between two training features, the interpolated value is calculated from the predictions of the two closest features.
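A small sketch of that piecewise-linear rule over the fitted boundaries/predictions arrays (a hypothetical helper that mirrors, rather than reproduces, what the model's transform does):

```python
# A minimal sketch of the piecewise-linear prediction rule over the model's
# fitted `boundaries` / `predictions` arrays. Hypothetical helper for intuition.
import bisect

def predict(boundaries, predictions, x):
    if x <= boundaries[0]:
        return predictions[0]       # below the fitted range: clamp low
    if x >= boundaries[-1]:
        return predictions[-1]      # above the fitted range: clamp high
    j = bisect.bisect_right(boundaries, x)
    x0, x1 = boundaries[j - 1], boundaries[j]
    y0, y1 = predictions[j - 1], predictions[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)  # linear interpolation
```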
The example code is as follows:
```python
# -*- coding: utf-8 -*-
# @Time : 2019/8/7 14:56
# @Author : buracagyang
# @File : isotonic_regression_example.py
# @Software : PyCharm
"""
Describe:
"""
from __future__ import print_function

from pyspark.ml.regression import IsotonicRegression
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("IsotonicRegressionExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_isotonic_regression_libsvm_data.txt")

    model = IsotonicRegression().fit(dataset)
    print("Boundaries in increasing order: %s\n" % str(model.boundaries))
    print("Predictions associated with the boundaries: %s\n" % str(model.predictions))

    model.transform(dataset).show(5)

    spark.stop()
```
The results are as follows:

```
Boundaries in increasing order: [0.01,0.17,0.18,0.27,0.28,0.29,0.3,0.31,0.34,0.35,0.36,0.41,0.42,0.71,0.72,0.74,0.75,0.76,0.77,0.78,0.79,0.8,0.81,0.82,0.83,0.84,0.85,0.86,0.87,0.88,0.89,1.0]
Predictions associated with the boundaries: [0.15715271294117644,0.15715271294117644,0.189138196,0.189138196,0.20040796,0.29576747,0.43396226,0.5081591025000001,0.5081591025000001,0.54156043,0.5504844466666667,0.5504844466666667,0.563929967,0.563929967,0.5660377366666667,0.5660377366666667,0.56603774,0.57929628,0.64762876,0.66241713,0.67210607,0.67210607,0.674655785,0.674655785,0.73890872,0.73992861,0.84242733,0.89673636,0.89673636,0.90719021,0.9272055075,0.9272055075]
+----------+--------------+-------------------+
| label| features| prediction|
+----------+--------------+-------------------+
|0.24579296|(1,[0],[0.01])|0.15715271294117644|
|0.28505864|(1,[0],[0.02])|0.15715271294117644|
|0.31208567|(1,[0],[0.03])|0.15715271294117644|
|0.35900051|(1,[0],[0.04])|0.15715271294117644|
|0.35747068|(1,[0],[0.05])|0.15715271294117644|
+----------+--------------+-------------------+
only showing top 5 rows
```
Related posts in this series:
【Spark】Pipelines
【Spark】Feature Engineering 1 - Extractors
【Spark】Feature Engineering 2 - Transformers
【Spark】Classification and Regression Algorithms - Classification
【Spark】Classification and Regression Algorithms - Regression
【Spark】Cluster Analysis
【Spark】Collaborative Filtering
【Spark】Frequent Pattern Mining
【Spark】Model Selection and Tuning