PySpark Classification--LogisticRegression

LogisticRegression: logistic regression classification

class pyspark.ml.classification.LogisticRegression(featuresCol='features', labelCol='label', predictionCol='prediction', maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-06, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol='probability', rawPredictionCol='rawPrediction', standardization=True, weightCol=None, aggregationDepth=2, family='auto', lowerBoundsOnCoefficients=None, upperBoundsOnCoefficients=None, lowerBoundsOnIntercepts=None, upperBoundsOnIntercepts=None)

Logistic regression. This class supports both multinomial logistic (softmax) regression and binomial logistic regression.
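To make the relationship between the two families concrete, here is a minimal sketch in plain Python. It is not Spark code, and the helper names `sigmoid` and `softmax` are chosen for illustration only. It shows that a two-class softmax over the margins [0, z] yields the same class-1 probability as the binomial sigmoid of z.

```python
import math

def sigmoid(z):
    """Binomial case: probability of class 1 given the margin z."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(margins):
    """Multinomial case: one probability per class, summing to 1."""
    m = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in margins]
    total = sum(exps)
    return [e / total for e in exps]

z = 0.8  # an arbitrary margin, used only for the comparison
two_class = softmax([0.0, z])
print(sigmoid(z), two_class[1])  # the two values agree up to rounding
```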

aggregationDepth = Param(parent='undefined', name='aggregationDepth', doc='Suggested depth for treeAggregate (>= 2).')

elasticNetParam = Param(parent='undefined', name='elasticNetParam', doc='The ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.')
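The way regParam and elasticNetParam combine can be sketched in plain Python. This assumes the elastic net form commonly given in the Spark MLlib documentation, regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). The function name `elastic_net_penalty` and the sample coefficients are illustrative and not part of the PySpark API.

```python
def elastic_net_penalty(coefficients, reg_param, alpha):
    """Elastic net penalty: a blend of the L1 norm and the squared L2 norm."""
    l1 = sum(abs(w) for w in coefficients)
    l2_squared = sum(w * w for w in coefficients)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

w = [3.0, -4.0]  # illustrative coefficients
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.0))  # pure L2 penalty
print(elastic_net_penalty(w, reg_param=0.1, alpha=1.0))  # pure L1 penalty
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.5))  # an even mix of both
```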

family = Param(parent='undefined', name='family', doc='The name of the family, which describes the label distribution used in the model. Supported options: auto, binomial, multinomial.')

fitIntercept = Param(parent='undefined', name='fitIntercept', doc='Whether to fit an intercept term.')

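lowerBoundsOnCoefficients = Param(parent='undefined', name='lowerBoundsOnCoefficients', doc='The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.')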

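lowerBoundsOnIntercepts = Param(parent='undefined', name='lowerBoundsOnIntercepts', doc='The lower bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression.')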

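probabilityCol = Param(parent='undefined', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.')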

rawPredictionCol = Param(parent='undefined', name='rawPredictionCol', doc='Raw prediction (a.k.a. confidence) column name.')

regParam = Param(parent='undefined', name='regParam', doc='Regularization parameter (>= 0).')

standardization = Param(parent='undefined', name='standardization', doc='Whether to standardize the training features before fitting the model.')

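threshold = Param(parent='undefined', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match. For example, if threshold is p, then thresholds must be equal to [1-p, p].')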

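thresholds = Param(parent='undefined', name='thresholds', doc="Thresholds in multi-class classification to adjust the probability of predicting each class. The array must have length equal to the number of classes, with values > 0, except that at most one value may be 0. The class with the largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.")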
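The requirement that threshold = p must match thresholds = [1-p, p] can be checked with a small plain-Python sketch. It assumes the rule from the Spark documentation that the class with the largest p/t wins, and it assumes all thresholds are strictly positive (the zero-threshold case is not handled). Both function names are hypothetical helpers, not PySpark calls.

```python
def predict_with_threshold(prob_class1, threshold):
    """Binary rule: predict class 1 when its probability exceeds the threshold."""
    return 1.0 if prob_class1 > threshold else 0.0

def predict_with_thresholds(probabilities, thresholds):
    """Multi-class rule: scale each probability p by its threshold t, pick the largest p/t."""
    scaled = [p / t for p, t in zip(probabilities, thresholds)]
    return float(scaled.index(max(scaled)))

p = 0.8  # a threshold stricter than the default 0.5
for q in (0.3, 0.7, 0.9):  # q is the predicted probability of class 1
    binary = predict_with_threshold(q, p)
    multi = predict_with_thresholds([1.0 - q, q], [1.0 - p, p])
    print(q, binary, multi)  # the two rules agree for every q shown here
```

The agreement follows from algebra: q/p > (1-q)/(1-p) holds exactly when q > p, for 0 < p < 1.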

tol = Param(parent='undefined', name='tol', doc='The convergence tolerance for iterative algorithms (>= 0).')

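upperBoundsOnCoefficients = Param(parent='undefined', name='upperBoundsOnCoefficients', doc='The upper bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression.')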

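upperBoundsOnIntercepts = Param(parent='undefined', name='upperBoundsOnIntercepts', doc='The upper bounds on intercepts if fitting under bound constrained optimization. The bound vector size must be equal to 1 for binomial regression, or to the number of classes for multinomial regression.')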
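The shape rules for the four bound parameters can be summarized in a short plain-Python sketch. The helpers `resolve_family` and `expected_bound_shapes` are hypothetical names used only here. The handling of family='auto' (binomial for one or two classes, multinomial otherwise) is taken from the Spark documentation rather than from this article, so treat it as an assumption.

```python
def resolve_family(family, num_classes):
    """Assumed 'auto' rule: binomial for 1 or 2 classes, multinomial otherwise."""
    if family == "auto":
        return "binomial" if num_classes <= 2 else "multinomial"
    return family

def expected_bound_shapes(family, num_classes, num_features):
    """Return the coefficient-bound matrix shape and the intercept-bound vector size."""
    resolved = resolve_family(family, num_classes)
    rows = 1 if resolved == "binomial" else num_classes
    return {"coefficients": (rows, num_features), "intercepts": rows}

# Two classes, two features, family left at 'auto'.
print(expected_bound_shapes("auto", num_classes=2, num_features=2))
# Three classes, four features, explicit multinomial family.
print(expected_bound_shapes("multinomial", num_classes=3, num_features=4))
```

In PySpark the bounds themselves would typically be built with `Matrices.dense` and `Vectors.dense` from `pyspark.ml.linalg`, using the shapes computed above.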

weightCol = Param(parent='undefined', name='weightCol', doc='Weight column name. If this is not set or empty, all instance weights are treated as 1.0.')
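As a conceptual illustration of what instance weights do, the following plain-Python sketch computes a weighted average log loss. It does not claim to reproduce Spark's exact objective (regularization and standardization are left out), and `weighted_log_loss` is a hypothetical helper name. It shows that a missing weight behaves like 1.0 and that a weight of 2.0 acts like a duplicated row.

```python
import math

def weighted_log_loss(labels, probs_class1, weights=None):
    """Weighted average of the per-row binary log loss."""
    if weights is None:
        weights = [1.0] * len(labels)  # unset weights are treated as 1.0
    total = 0.0
    for y, p, w in zip(labels, probs_class1, weights):
        total += -w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / sum(weights)

# A weight of 2.0 on the first row ...
weighted = weighted_log_loss([1.0, 0.0], [0.8, 0.3], [2.0, 1.0])
# ... matches listing that row twice with unit weights.
duplicated = weighted_log_loss([1.0, 1.0, 0.0], [0.8, 0.8, 0.3])
print(weighted, duplicated)
```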

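model.coefficients: the model coefficients of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.

model.intercept: the model intercept of binomial logistic regression. An exception is thrown in the case of multinomial logistic regression.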

01. Create the data

from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors

# Start a local Spark session that uses all available cores.
spark = SparkSession.builder.appName("LogisticRegression").master("local[*]").getOrCreate()

# Four rows, each with a label, an instance weight, and a 2-dimensional dense feature vector.
bdf = spark.createDataFrame([
    Row(label=1.0, weight=1.0, features=Vectors.dense(0.0, 5.0)),
    Row(label=0.0, weight=2.0, features=Vectors.dense(1.0, 2.0)),
    Row(label=1.0, weight=3.0, features=Vectors.dense(2.0, 1.0)),
    Row(label=0.0, weight=4.0, features=Vectors.dense(3.0, 3.0))
])
bdf.show()

Output:

+---------+-----+------+
| features|label|weight|
+---------+-----+------+
|[0.0,5.0]|  1.0|   1.0|
|[1.0,2.0]|  0.0|   2.0|
|[2.0,1.0]|  1.0|   3.0|
|[3.0,3.0]|  0.0|   4.0|
+---------+-----+------+

02. Fit the logistic regression classifier and transform the original data for comparison

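from pyspark.ml.classification import LogisticRegression
# Lightly regularized logistic regression; the "weight" column supplies per-row instance weights.
blor = LogisticRegression(regParam=0.01, weightCol="weight")
blorModel = blor.fit(bdf)
# Append the rawPrediction, probability, and prediction columns to the training data.
blorModel.transform(bdf).show()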

Output:

+---------+-----+------+--------------------+--------------------+----------+
| features|label|weight|       rawPrediction|         probability|prediction|
+---------+-----+------+--------------------+--------------------+----------+
|[0.0,5.0]|  1.0|   1.0|[0.11868570761143...|[0.52963664585087...|       0.0|
|[1.0,2.0]|  0.0|   2.0|[-0.7394588648584...|[0.32312248644960...|       1.0|
|[2.0,1.0]|  1.0|   3.0|[-0.3050226266204...|[0.42433012185133...|       1.0|
|[3.0,3.0]|  0.0|   4.0|[2.06828482767961...|[0.88778220107828...|       0.0|
+---------+-----+------+--------------------+--------------------+----------+
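To make the comparison explicit, the label, weight, and prediction columns from the table above can be copied into plain Python and scored by hand. This is a minimal sketch: the `accuracy` helper is hypothetical, and the weighted variant is just one reasonable way to account for the weight column.

```python
# (label, weight, prediction) copied from the output table above
rows = [
    (1.0, 1.0, 0.0),
    (0.0, 2.0, 1.0),
    (1.0, 3.0, 1.0),
    (0.0, 4.0, 0.0),
]

def accuracy(rows, weighted=False):
    """Fraction of rows (or of total weight) where the prediction equals the label."""
    correct = 0.0
    total = 0.0
    for label, weight, prediction in rows:
        w = weight if weighted else 1.0
        total += w
        if label == prediction:
            correct += w
    return correct / total

print(accuracy(rows))                 # 0.5: two of the four rows are predicted correctly
print(accuracy(rows, weighted=True))  # 0.7: the correct rows carry weights 3 and 4 out of 10
```

The two misclassified rows are the ones with the smallest weights, which is consistent with the model having been fitted with weightCol="weight".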

03. Inspect the model coefficients

blorModel.coefficients

Output: DenseVector([-1.0807, -0.6463])

04. Inspect the model intercept

blorModel.intercept

Output: 3.1127663191585175
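The coefficients and intercept are enough to reproduce the transform output by hand. The sketch below uses plain Python with the rounded coefficients printed earlier, so the results match the table only to about three or four decimal places. It assumes the binomial rawPrediction is [-margin, margin], which is consistent with the numbers in the table. The `predict` function is a hypothetical helper, not a PySpark call.

```python
import math

coefficients = [-1.0807, -0.6463]  # rounded values from blorModel.coefficients
intercept = 3.1127663191585175     # value from blorModel.intercept
features = [[0.0, 5.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]

def predict(x, threshold=0.5):
    """Return (rawPrediction, probability, prediction) for one feature vector."""
    margin = sum(w * v for w, v in zip(coefficients, x)) + intercept
    prob1 = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
    raw = [-margin, margin]
    probability = [1.0 - prob1, prob1]
    prediction = 1.0 if prob1 > threshold else 0.0
    return raw, probability, prediction

for x in features:
    raw, probability, prediction = predict(x)
    print(x, raw, probability, prediction)
```

For the first row, for example, the margin is 3.1128 - 0.6463 * 5, which is roughly -0.1187, so the first rawPrediction entry is about 0.1187 and the predicted class is 0.0, as in the table.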
