【Spark】Classification and Regression Algorithms - Regression

Also published on Buracag's blog

This section covers the regression algorithms implemented in Spark ML. The example demos include linear regression, generalized linear regression, decision tree regression, random forest regression, gradient-boosted tree regression, and more.

Table of Contents

  • 1. Linear regression
  • 2. Generalized linear regression
  • 3. Decision tree regression
  • 4. Random forest regression
  • 5. Gradient-boosted tree regression
  • 6. Survival regression
  • 7. Isotonic regression

1. Linear regression

Similar to logistic regression; here is the example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 10:11
# @Author   : buracagyang
# @File     : linear_regression_with_elastic_net.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("LinearRegressionWithElasticNet").getOrCreate()

    training = spark.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")

    lr = LinearRegression(maxIter=20, regParam=0.01, elasticNetParam=0.6)
    lrModel = lr.fit(training)

    # Coefficients and intercept
    # print("Coefficients: %s" % str(lrModel.coefficients))
    # print("Intercept: %s" % str(lrModel.intercept))

    # Model evaluation metrics from the training summary
    trainingSummary = lrModel.summary
    print("numIterations: %d" % trainingSummary.totalIterations)
    print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
    trainingSummary.residuals.show()
    print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
    print("r2: %f" % trainingSummary.r2)

    spark.stop()

The results are as follows (such a low $r^2$...):

numIterations: 8
objectiveHistory: [0.49999999999999994, 0.4931849471651455, 0.4863393782275527, 0.48633557300904495, 0.48633543657045664, 0.4863354337756586, 0.4863354337094886, 0.4863354337092549]
+-------------------+
|          residuals|
+-------------------+
|  -10.9791066152217|
| 0.9208605107751258|
| -4.608888656779226|
|-20.424003572582702|
|-10.316545030639608|
+-------------------+
only showing top 5 rows

RMSE: 10.163110
r2: 0.027836
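
Before moving on, note that the fitted model can also score data directly; a small continuation sketch (assumes `lrModel` and `training` from the example above are still in scope, i.e., run before spark.stop()):

# Continuation sketch: transform() appends a "prediction" column.
predictions = lrModel.transform(training)
predictions.select("label", "prediction").show(5)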

2. Generalized linear regression

In contrast to linear regression, where the output is assumed to follow a Gaussian distribution, generalized linear models (GLMs) are specifications of linear models in which the response variable $Y_i$ follows some distribution from the exponential family of distributions. Spark's GeneralizedLinearRegression interface allows flexible specification of GLMs, which can be used for various types of prediction problems, including linear regression, Poisson regression, logistic regression, and others. Currently, spark.ml supports only a subset of the exponential family distributions, listed below.

Family     Response Type              Supported Links
Gaussian   Continuous                 Identity*, Log, Inverse
Binomial   Binary                     Logit*, Probit, CLogLog
Poisson    Count                      Log*, Identity, Sqrt
Gamma      Continuous                 Inverse*, Identity, Log
Tweedie    Zero-inflated continuous   Power link function

* canonical link

Note: Spark currently supports up to 4096 features through its GeneralizedLinearRegression interface and will throw an exception if this constraint is exceeded. Even so, for linear and logistic regression, models with more features can be trained using the LinearRegression and LogisticRegression estimators (so the GLM interface is of little use for high-dimensional feature sets...).

GLMs require exponential family distributions that can be written in their "canonical" or "natural" form, i.e., natural exponential family distributions. A natural exponential family distribution has the form:
$$f_Y(y|\theta, \tau) = h(y, \tau)\exp\left(\frac{\theta \cdot y - A(\theta)}{d(\tau)}\right) \tag{1}$$
where $\theta$ is the parameter of interest and $\tau$ is the dispersion parameter. In GLMs, the response variable $Y_i$ is assumed to be drawn from a natural exponential family distribution:
$$Y_i \sim f(\cdot|\theta_i, \tau) \tag{2}$$
where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by:
$$\mu_i = A'(\theta_i) \tag{3}$$
Here, $A'(\theta_i)$ is defined by the form of the distribution selected. GLMs also allow specification of a link function, which defines the relationship between the expected value of the response variable $\mu_i$ and the linear predictor $\eta_i$:
$$g(\mu_i) = \eta_i = \vec{x_i}^T \cdot \vec{\beta} \tag{4}$$
Often, the link function is chosen such that $A' = g^{-1}$, which yields a simplified relationship between the parameter of interest $\theta$ and the linear predictor $\eta$. In this case, the link function $g(\mu)$ is said to be the "canonical" link function:
$$\theta_i = A'^{-1}(\mu_i) = g(g^{-1}(\eta_i)) = \eta_i \tag{5}$$
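For example, the Poisson family has $A(\theta) = e^{\theta}$, so $\mu = A'(\theta) = e^{\theta}$ and the canonical link is $g(\mu) = (A')^{-1}(\mu) = \log\mu$, which is why Log is the canonical Poisson link in the table above.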
A GLM finds the regression coefficients $\vec{\beta}$ that maximize the likelihood function:
$$\max_{\vec{\beta}} \mathcal{L}(\vec{\theta}|\vec{y},X) = \prod_{i=1}^{N} h(y_i, \tau) \exp\left(\frac{y_i\theta_i - A(\theta_i)}{d(\tau)}\right) \tag{6}$$
where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$ by:
$$\theta_i = A'^{-1}(g^{-1}(\vec{x_i} \cdot \vec{\beta})) \tag{7}$$

Example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 13:12
# @Author   : buracagyang
# @File     : generalized_linear_regression_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.sql import SparkSession
from pyspark.ml.regression import GeneralizedLinearRegression


if __name__ == "__main__":
    spark = SparkSession.builder.appName("GeneralizedLinearRegressionExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")

    glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
    model = glr.fit(dataset)

    # Coefficients and intercept
    print("Coefficients: " + str(model.coefficients))
    print("Intercept: " + str(model.intercept))
    
    # The model doesn't pass the significance checks on this data...
    summary = model.summary
    print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
    print("T Values: " + str(summary.tValues))
    print("P Values: " + str(summary.pValues))
    print("Dispersion: " + str(summary.dispersion))
    print("Null Deviance: " + str(summary.nullDeviance))
    print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
    print("Deviance: " + str(summary.deviance))
    print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
    print("AIC: " + str(summary.aic))
    print("Deviance Residuals: ")
    summary.residuals().show()

    spark.stop()
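
The example above uses the Gaussian family with the identity link; switching to another family from the table only changes the constructor arguments. A minimal self-contained sketch (not from the original post; the toy count data is hypothetical) fitting a Poisson GLM with its canonical log link:

from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PoissonGLMSketch").getOrCreate()

# Hypothetical count data; Poisson labels must be non-negative.
df = spark.createDataFrame([
    (1.0, Vectors.dense(0.0, 1.0)),
    (2.0, Vectors.dense(1.0, 1.0)),
    (4.0, Vectors.dense(2.0, 1.0)),
    (6.0, Vectors.dense(3.0, 1.0))], ["label", "features"])

# The family/link pair must be one of the supported combinations in the table above.
glr = GeneralizedLinearRegression(family="poisson", link="log", maxIter=10)
model = glr.fit(df)
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))

spark.stop()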

3. Decision tree regression

Similar to decision tree classification; example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 13:57
# @Author   : buracagyang
# @File     : decision_tree_regression_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("DecisionTreeRegressionExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    # Automatically identify categorical features and index them. Set maxCategories so
    # that features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    dt = DecisionTreeRegressor(featuresCol="indexedFeatures")

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, dt])

    model = pipeline.fit(trainingData)
    predictions = model.transform(testData)

    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    treeModel = model.stages[1]
    print(treeModel)

    spark.stop()

The results are as follows:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[121,122,123...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.297044
DecisionTreeRegressionModel (uid=DecisionTreeRegressor_4089a3fc367ac7a943d9) of depth 2 with 5 nodes
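
The one-line model summary above can be expanded into the full learned tree; a small continuation sketch (assumes the fitted `model` from the example above is still in scope):

# Continuation sketch: print every node with its split condition and prediction.
treeModel = model.stages[1]
print(treeModel.toDebugString)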

4. Random forest regression

Similar to decision tree regression; example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 14:05
# @Author   : buracagyang
# @File     : random_forest_regressor_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("RandomForestRegressorExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    # Automatically identify categorical features and index them. Set maxCategories so
    # that features with > 4 distinct values are treated as continuous.
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    rf = RandomForestRegressor(featuresCol="indexedFeatures")

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, rf])
    model = pipeline.fit(trainingData)

    predictions = model.transform(testData)

    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    rfModel = model.stages[1]
    print(rfModel)

    spark.stop()

The results are as follows:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.4|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[121,122,123...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.1|  0.0|(692,[124,125,126...|
|      0.15|  0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.141421
RandomForestRegressionModel (uid=RandomForestRegressor_4dc1b1ad32480cc89ddc) with 20 trees
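
For random forests it is often useful to inspect which features drive the predictions; a small continuation sketch (assumes the fitted `model` from the example above):

# Continuation sketch: per-feature importance scores of the fitted forest.
rfModel = model.stages[1]
print(rfModel.featureImportances)  # SparseVector; larger values mean more informative features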

5. Gradient-boosted tree regression

Example code follows (note that for this sample dataset the GBTRegressor really needs only one iteration, but in general more are required):

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 14:11
# @Author   : buracagyang
# @File     : gradient_boosted_tree_regressor_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("GradientBoostedTreeRegressorExample").getOrCreate()

    data = spark.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt")

    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
    (trainingData, testData) = data.randomSplit([0.7, 0.3], seed=2019)
    gbt = GBTRegressor(featuresCol="indexedFeatures", maxIter=10)

    # Train via a Pipeline
    pipeline = Pipeline(stages=[featureIndexer, gbt])
    model = pipeline.fit(trainingData)

    predictions = model.transform(testData)

    predictions.select("prediction", "label", "features").show(5)

    evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
    rmse = evaluator.evaluate(predictions)
    print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

    gbtModel = model.stages[1]
    print(gbtModel)

    spark.stop()

The results are as follows:

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[100,101,102...|
|       0.0|  0.0|(692,[121,122,123...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
|       0.0|  0.0|(692,[124,125,126...|
+----------+-----+--------------------+
only showing top 5 rows

Root Mean Squared Error (RMSE) on test data = 0.297044
GBTRegressionModel (uid=GBTRegressor_46239bc64cc2f59b1fb5) with 10 trees
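
Since maxIter was fixed at 10 above, a natural next step is tuning it; a hedged sketch of 3-fold cross-validation over the same pipeline (assumes `pipeline`, `gbt`, `evaluator`, and `trainingData` from the example are in scope):

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Grid over the boosting iterations and tree depth of the GBTRegressor above.
paramGrid = (ParamGridBuilder()
             .addGrid(gbt.maxIter, [10, 50])
             .addGrid(gbt.maxDepth, [2, 5])
             .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=3)
cvModel = cv.fit(trainingData)  # cvModel.bestModel holds the best pipeline by RMSE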

6. Survival regression

spark.ml implements the Accelerated Failure Time (AFT) model, a parametric survival regression model for censored data. It models the logarithm of the survival time, so it is often referred to as a log-linear model for survival analysis. Unlike the proportional hazards model designed for the same purpose, the AFT model is easier to parallelize, because each instance contributes to the objective function independently.

Given the values of the covariates $x'$, for subjects $i = 1, \ldots, n$ with random lifetimes $t_i$, possibly right-censored, the likelihood function under the AFT model is:
$$L(\beta,\sigma)=\prod_{i=1}^n\left[\frac{1}{\sigma}f_{0}\left(\frac{\log t_{i}-x'\beta}{\sigma}\right)\right]^{\delta_{i}}S_{0}\left(\frac{\log t_{i}-x'\beta}{\sigma}\right)^{1-\delta_{i}} \tag{8}$$
where $\delta_i$ indicates whether the event occurred, i.e., whether the observation is uncensored. Using $\epsilon_{i}=\frac{\log t_{i}-x'\beta}{\sigma}$, the log-likelihood function takes the form:
$$\iota(\beta,\sigma)=\sum_{i=1}^{n}\left[-\delta_{i}\log\sigma+\delta_{i}\log f_{0}(\epsilon_{i})+(1-\delta_{i})\log S_{0}(\epsilon_{i})\right] \tag{9}$$
where $S_{0}(\epsilon_{i})$ is the baseline survivor function and $f_{0}(\epsilon_{i})$ is the corresponding density function.

The most commonly used AFT model is based on a Weibull distribution of the survival time. The Weibull distribution for lifetimes corresponds to the extreme value distribution for the log of the lifetime, whose survivor function is:
$$S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}}) \tag{10}$$
and whose density function is:
$$f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}}) \tag{11}$$
The log-likelihood function for the AFT model with a Weibull distribution of lifetime is then:
$$\iota(\beta,\sigma)= -\sum_{i=1}^n\left[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}\right] \tag{12}$$
Since minimizing the negative log-likelihood is equivalent to maximizing the posterior probability, the loss function we optimize is $-\iota(\beta,\sigma)$. The gradients with respect to $\beta$ and $\log\sigma$ are, respectively:
$$\frac{\partial (-\iota)}{\partial \beta} = \sum_{i=1}^{n}\left[\delta_{i}-e^{\epsilon_{i}}\right]\frac{x_{i}}{\sigma}, \qquad \frac{\partial (-\iota)}{\partial (\log\sigma)} = \sum_{i=1}^{n}\left[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}\right] \tag{13}$$

The AFT model can thus be formulated as a convex optimization problem, i.e., the task of minimizing the convex function $-\iota(\beta,\sigma)$ with respect to the coefficient vector $\beta$ and the log of the scale parameter $\log\sigma$. The optimization algorithm underlying the implementation is L-BFGS, and the implementation matches the results of R's survival::survreg.
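
To make the optimized objective concrete, here is a small NumPy sketch (for illustration only, not Spark's internal code) of the Weibull AFT negative log-likelihood $-\iota(\beta,\sigma)$ from Eq. (12):

import numpy as np

def aft_weibull_neg_log_likelihood(beta, log_sigma, X, log_t, delta):
    """-iota(beta, sigma) from Eq. (12). X: n x p covariate matrix,
    log_t: log lifetimes, delta: 1 if the event occurred (uncensored), else 0."""
    sigma = np.exp(log_sigma)
    eps = (log_t - X @ beta) / sigma  # epsilon_i as defined in the text
    return np.sum(delta * log_sigma - delta * eps + np.exp(eps))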

Note: when fitting an AFTSurvivalRegressionModel without an intercept on a dataset with a constant nonzero column, Spark MLlib outputs zero coefficients for the constant nonzero columns. This behavior differs from R's survival::survreg.

Example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 14:38
# @Author   : buracagyang
# @File     : aft_survival_regression.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("AFTSurvivalRegressionExample").getOrCreate()

    training = spark.createDataFrame([
        (1.218, 1.0, Vectors.dense(1.560, -0.605)),
        (2.949, 0.0, Vectors.dense(0.346, 2.158)),
        (3.627, 0.0, Vectors.dense(1.380, 0.231)),
        (0.273, 1.0, Vectors.dense(0.520, 1.151)),
        (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
    quantileProbabilities = [0.3, 0.6]
    aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities, quantilesCol="quantiles")

    model = aft.fit(training)

    print("Coefficients: " + str(model.coefficients))
    print("Intercept: " + str(model.intercept))
    print("Scale: " + str(model.scale))
    model.transform(training).show(truncate=False)

    spark.stop()

The results are as follows:

Coefficients: [-0.49631114666506776,0.1984443769993409]
Intercept: 2.6380946151
Scale: 1.54723455744
+-----+------+--------------+-----------------+---------------------------------------+
|label|censor|features      |prediction       |quantiles                              |
+-----+------+--------------+-----------------+---------------------------------------+
|1.218|1.0   |[1.56,-0.605] |5.718979487634966|[1.1603238947151588,4.995456010274735] |
|2.949|0.0   |[0.346,2.158] |18.07652118149563|[3.667545845471802,15.789611866277884] |
|3.627|0.0   |[1.38,0.231]  |7.381861804239103|[1.4977061305190853,6.447962612338967] |
|0.273|1.0   |[0.52,1.151]  |13.57761250142538|[2.7547621481507067,11.859872224069784]|
|4.199|0.0   |[0.795,-0.226]|9.013097744073846|[1.8286676321297735,7.872826505878384] |
+-----+------+--------------+-----------------+---------------------------------------+

7. Isotonic regression

Isotonic regression belongs to the family of regression algorithms. Formally, isotonic regression is defined as follows: given a finite set of real numbers $Y = {y_1, y_2, ..., y_n}$ representing observed responses and $X = {x_1, x_2, ..., x_n}$ the unknown response values to be fitted, the task is to fit a function that minimizes
$$f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2 \tag{14}$$
subject to the total order $x_1 \le x_2 \le \ldots \le x_n$, where the $w_i$ are positive weights. The resulting function is called isotonic regression, and it is unique. It can be viewed as a least squares problem under an order restriction. Essentially, isotonic regression is the monotone function that best fits the original data points.

Spark implements the pool adjacent violators algorithm (PAVA), using an approach that parallelizes isotonic regression; the core pooling idea is sketched below. The training input is a DataFrame with three columns: label, features, and weight. In addition, the IsotonicRegression algorithm has an optional parameter isotonic (default true) indicating whether the fit is isotonic (monotonically increasing) or antitonic (monotonically decreasing).
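
For intuition, here is a sequential sketch of pool adjacent violators (illustration only; Spark's implementation is a parallelized variant and this helper is hypothetical):

def pava(y, w=None):
    """Isotonic (non-decreasing) fit minimizing sum w_i * (y_i - x_i)^2."""
    w = [1.0] * len(y) if w is None else list(w)
    # Each block stores [weighted mean, total weight, count of pooled points].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool adjacent blocks while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / tot, tot, c1 + c2])
    fitted = []
    for m, _, c in blocks:
        fitted.extend([m] * c)
    return fitted

print(pava([1, 3, 2, 4]))  # -> [1.0, 2.5, 2.5, 4.0]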

Training returns an IsotonicRegressionModel that can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function, so the prediction rules are:

  • If the prediction input exactly matches a training feature, the associated prediction is returned. If there are multiple predictions with the same feature, one of them is returned; which one is undefined (same as java.util.Arrays.binarySearch).
  • If the prediction input is lower or higher than all training features, the prediction with the lowest or highest feature is returned, respectively. If there are multiple predictions with the same feature, the lowest or highest one is returned, respectively.
  • If the prediction input falls between two training features, the prediction is treated as a piecewise linear function and the interpolated value is computed from the predictions of the two closest features. If there are multiple values with the same feature, the same rule as above applies.

Example code:

# -*- coding: utf-8 -*-
# @Time     : 2019/8/7 14:56
# @Author   : buracagyang
# @File     : isotonic_regression_example.py
# @Software : PyCharm

"""
Describe:
        
"""

from __future__ import print_function
from pyspark.ml.regression import IsotonicRegression
from pyspark.sql import SparkSession


if __name__ == "__main__":
    spark = SparkSession.builder.appName("IsotonicRegressionExample").getOrCreate()

    dataset = spark.read.format("libsvm").load("../data/mllib/sample_isotonic_regression_libsvm_data.txt")

    model = IsotonicRegression().fit(dataset)
    print("Boundaries in increasing order: %s\n" % str(model.boundaries))
    print("Predictions associated with the boundaries: %s\n" % str(model.predictions))

    model.transform(dataset).show(5)

    spark.stop()

The results are as follows:

Boundaries in increasing order: [0.01,0.17,0.18,0.27,0.28,0.29,0.3,0.31,0.34,0.35,0.36,0.41,0.42,0.71,0.72,0.74,0.75,0.76,0.77,0.78,0.79,0.8,0.81,0.82,0.83,0.84,0.85,0.86,0.87,0.88,0.89,1.0]

Predictions associated with the boundaries: [0.15715271294117644,0.15715271294117644,0.189138196,0.189138196,0.20040796,0.29576747,0.43396226,0.5081591025000001,0.5081591025000001,0.54156043,0.5504844466666667,0.5504844466666667,0.563929967,0.563929967,0.5660377366666667,0.5660377366666667,0.56603774,0.57929628,0.64762876,0.66241713,0.67210607,0.67210607,0.674655785,0.674655785,0.73890872,0.73992861,0.84242733,0.89673636,0.89673636,0.90719021,0.9272055075,0.9272055075]

+----------+--------------+-------------------+
|     label|      features|         prediction|
+----------+--------------+-------------------+
|0.24579296|(1,[0],[0.01])|0.15715271294117644|
|0.28505864|(1,[0],[0.02])|0.15715271294117644|
|0.31208567|(1,[0],[0.03])|0.15715271294117644|
|0.35900051|(1,[0],[0.04])|0.15715271294117644|
|0.35747068|(1,[0],[0.05])|0.15715271294117644|
+----------+--------------+-------------------+
only showing top 5 rows
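
As a hedged continuation (not in the original post; run before spark.stop()), predicting for feature values that fall strictly between two learned boundaries demonstrates the linear-interpolation rule:

from pyspark.ml.linalg import Vectors

# 0.125 lies between the boundaries 0.01 and 0.17, so its prediction is interpolated.
newData = spark.createDataFrame([(Vectors.dense(0.125),)], ["features"])
model.transform(newData).show()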

Related posts in this series:

【Spark】Pipelines

【Spark】Feature Engineering 1 - Extractors

【Spark】Feature Engineering 2 - Transformers

【Spark】Classification and Regression Algorithms - Classification

【Spark】Classification and Regression Algorithms - Regression

【Spark】Clustering

【Spark】Collaborative Filtering

【Spark】Frequent Pattern Mining

【Spark】Model Selection and Tuning
