LogisticRegression: Logistic Regression Classification
class pyspark.ml.classification.LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-06, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol='probability', rawPredictionCol='rawPrediction', standardization=True, weightCol=None, aggregationDepth=2, family='auto', lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None, lowerBoundsOnIntercepts=None, upperBoundsOnIntercepts=None)
Logistic regression. This class supports both multinomial (softmax) and binomial logistic regression.
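To make the binomial/multinomial distinction concrete, here is a minimal pure-Python sketch. It is not Spark's implementation, and the helper names `sigmoid` and `softmax` are illustrative only. It shows that a two-class softmax over the margins `[0, m]` gives the same class-1 probability as the binomial sigmoid of `m`.

```python
import math

def sigmoid(margin):
    """Binomial case: probability of class 1 given a single margin."""
    return 1.0 / (1.0 + math.exp(-margin))

def softmax(margins):
    """Multinomial case: one margin per class, normalized into probabilities."""
    shift = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]

m = 0.75  # arbitrary illustrative margin
two_class = softmax([0.0, m])
# Both numbers printed here are the class-1 probability
print(two_class[1], sigmoid(m))
```

With `family='auto'` the estimator picks the variant from the number of label classes; the sketch only illustrates why the two formulations agree when there are exactly two classes.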
aggregationDepth = Param(parent='undefined', name='aggregationDepth', doc='Suggested depth for treeAggregate (>= 2).')
elasticNetParam = Param(parent='undefined', name='elasticNetParam', doc='The ElasticNet mixing parameter, in the range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.')
family = Param(parent='undefined', name='family', doc='The name of the family, which describes the label distribution to be used in the model. Supported options: auto, binomial, multinomial')
fitIntercept = Param(parent='undefined', name='fitIntercept', doc='Whether to fit an intercept term.')
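lowerBoundsOnCoefficients = Param(parent='undefined', name='lowerBoundsOnCoefficients', doc='The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.')
lowerBoundsOnIntercepts = Param(parent='undefined', name='lowerBoundsOnIntercepts', doc='The lower bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression.')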
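probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')
rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='Raw prediction (a.k.a. confidence) column name.')
regParam = Param(parent='undefined', name='regParam', doc='Regularization parameter (>= 0).')
standardization = Param(parent='undefined', name='standardization', doc='Whether to standardize the training features before fitting the model.')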
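threshold = Param(parent='undefined', name='threshold', doc='Threshold in binary classification prediction, in the range [0, 1]. If threshold and thresholds are both set, they must match. For example, if threshold is p, then thresholds must be equal to [1-p, p].')
thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")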
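tol = Param(parent='undefined', name='tol', doc='The convergence tolerance for iterative algorithms (>= 0).')
upperBoundsOnCoefficients = Param(parent='undefined', name='upperBoundsOnCoefficients', doc='The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.')
upperBoundsOnIntercepts = Param(parent='undefined', name='upperBoundsOnIntercepts', doc='The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression.')
weightCol = Param(parent='undefined', name='weightCol', doc='Weight column name. If this is not set or is empty, all instance weights are treated as 1.0.')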
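As a rough illustration of how `regParam` and `elasticNetParam` interact, the sketch below evaluates the elastic-net penalty in the form given in the Spark MLlib guide: `regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)`. The function name `elastic_net_penalty` and the sample coefficients are hypothetical, and the sketch ignores details such as feature standardization, so treat it as a reading aid and not as Spark's actual objective code.

```python
def elastic_net_penalty(coefficients, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 and L2 terms scaled by reg_param."""
    l1 = sum(abs(w) for w in coefficients)      # ||w||_1
    l2_sq = sum(w * w for w in coefficients)    # ||w||_2^2
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_sq)

w = [-1.0, 0.5]  # made-up coefficients, for illustration only
print(elastic_net_penalty(w, 0.01, 0.0))  # alpha = 0: pure L2 penalty
print(elastic_net_penalty(w, 0.01, 1.0))  # alpha = 1: pure L1 penalty
print(elastic_net_penalty(w, 0.01, 0.5))  # alpha = 0.5: an even mix
```

Since `elasticNetParam` defaults to 0.0, setting only `regParam` yields a purely L2-regularized model.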
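**model.coefficients:** Model coefficients of binomial logistic regression. An exception is thrown for multinomial logistic regression; use model.coefficientMatrix there instead.
**model.intercept:** Model intercept of binomial logistic regression. An exception is thrown for multinomial logistic regression; use model.interceptVector there instead.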
01. Create the data
from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors
# Start a local Spark session that uses all available cores
spark = SparkSession.builder.appName("LogisticRegression").master("local[*]").getOrCreate()
# Four training rows: a binary label, an instance weight, and a 2-D dense feature vector
bdf = spark.createDataFrame([
    Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))
])
bdf.show()
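Output (the columns appear in alphabetical order because Spark 2.x sorts Row keyword fields by name; Spark 3.0 and later keep the declared order):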
+---------+-----+------+
| features|label|weight|
+---------+-----+------+
|[0.0,5.0]| 1.0| 1.0|
|[1.0,2.0]| 0.0| 2.0|
|[2.0,1.0]| 1.0| 3.0|
|[3.0,3.0]| 0.0| 4.0|
+---------+-----+------+
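02. Fit the logistic regression classifier and transform the original data for comparison
from pyspark.ml.classification import LogisticRegression
# L2-regularized binomial model (elasticNetParam defaults to 0.0); rows are weighted by the "weight" column
blor = LogisticRegression(regParam=0.01, weightCol="weight")
blorModel = blor.fit(bdf)
# transform() appends the rawPrediction, probability and prediction columns
blorModel.transform(bdf).show()
Output (show() truncates the long vector columns):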
+---------+-----+------+--------------------+--------------------+----------+
| features|label|weight| rawPrediction| probability|prediction|
+---------+-----+------+--------------------+--------------------+----------+
|[0.0,5.0]| 1.0| 1.0|[0.11868570761143...|[0.52963664585087...| 0.0|
|[1.0,2.0]| 0.0| 2.0|[-0.7394588648584...|[0.32312248644960...| 1.0|
|[2.0,1.0]| 1.0| 3.0|[-0.3050226266204...|[0.42433012185133...| 1.0|
|[3.0,3.0]| 0.0| 4.0|[2.06828482767961...|[0.88778220107828...| 0.0|
+---------+-----+------+--------------------+--------------------+----------+
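The prediction column in this table follows from the probability column through the `threshold` (binary) or `thresholds` (per-class) parameter. The sketch below is a plain-Python reading of those rules, not Spark code. The helper names are hypothetical, the class-0 probabilities are copied from the truncated table output, and all thresholds are assumed to be strictly positive.

```python
def predict_binary(prob_class1, threshold=0.5):
    """Binary rule: predict class 1 when its probability exceeds the threshold."""
    return 1.0 if prob_class1 > threshold else 0.0

def predict_with_thresholds(probabilities, thresholds):
    """Per-class rule: predict the class with the largest p / t (assumes every t > 0)."""
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return float(scaled.index(max(scaled)))

# Class-0 probabilities as printed in the table (digits after the ellipsis are unknown)
prob_class0 = [0.52963664585087, 0.32312248644960, 0.42433012185133, 0.88778220107828]
rows = [[p0, 1.0 - p0] for p0 in prob_class0]

default_preds = [predict_binary(p[1]) for p in rows]                    # threshold = 0.5
same_preds = [predict_with_thresholds(p, [0.5, 0.5]) for p in rows]     # thresholds = [1-p, p]
stricter_preds = [predict_binary(p[1], threshold=0.6) for p in rows]    # harder to predict class 1
print(default_preds, same_preds, stricter_preds)
```

With the default threshold of 0.5 the sketch reproduces the prediction column. Raising the threshold to 0.6 moves the third row to class 0, because its class-1 probability is only about 0.576.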
03. View the model coefficients
blorModel.coefficients
Output: DenseVector([-1.0807, -0.6463])
04. View the model intercept
blorModel.intercept
Output: 3.1127663191585175
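As a sanity check, the rawPrediction and probability columns from step 02 can be approximately reproduced by hand from the coefficients and intercept reported in steps 03 and 04. The sketch below uses plain Python and assumes the usual binomial convention: margin = intercept + w·x, rawPrediction = [-margin, margin], and probability of class 1 = sigmoid(margin). Because the coefficients are the rounded values from step 03, the results agree with the table only to about four decimal places.

```python
import math

coefficients = [-1.0807, -0.6463]   # rounded values reported in step 03
intercept = 3.1127663191585175      # value reported in step 04
features = [[0.0, 5.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]

def margin(x):
    """Linear score: intercept plus the dot product of coefficients and features."""
    return intercept + sum(w * xi for w, xi in zip(coefficients, x))

def raw_and_probability(x):
    """Return ([-margin, margin], [P(class 0), P(class 1)]) for one feature vector."""
    m = margin(x)
    p1 = 1.0 / (1.0 + math.exp(-m))
    return [-m, m], [1.0 - p1, p1]

for x in features:
    raw, prob = raw_and_probability(x)
    # First entries correspond to the leading values shown in the transform() output
    print(x, round(raw[0], 4), round(prob[0], 4))
```

The first raw value for each row comes out close to 0.1187, -0.7395, -0.3050 and 2.0683, matching the leading entries of the rawPrediction column.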