Contents

1. The logistic regression model and the log loss function
  1.1 Model definition
  1.2 Loss function
  1.3 Solving for the parameters with gradient descent
2. Estimating the parameters by maximum likelihood
3. Strengths and weaknesses of logistic regression
4. Implementing logistic regression with Spark ML
5. Discrete features as model input
1. The logistic regression model and the log loss function

1.1 Model definition

Logistic regression passes a linear combination of the features through the sigmoid function and interprets the output as the probability that the label is 1:

$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} = P(y=1 \mid x;\theta)$

1.2 Loss function

The model is trained by minimizing the log loss (cross-entropy) over the $m$ training samples:

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big]$

1.3 Solving for the parameters with gradient descent

Update every parameter $\theta_j$ simultaneously with

$\theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$

and iterate until convergence.
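As a minimal NumPy sketch of this training loop (the toy data, learning rate, and iteration count are illustrative choices, not from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, n_iter=1000):
    # batch gradient descent on the log loss; X is (m, n), y holds 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):            # in practice: stop once the update is small enough
        h = sigmoid(X @ theta)         # h_theta(x) for every sample
        grad = X.T @ (h - y) / m       # gradient of J(theta)
        theta -= alpha * grad          # gradient descent step
    return theta

# toy usage: a bias column plus two features
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.3],
              [1.0, 3.0, 2.2],
              [1.0, 2.0, 2.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic_regression(X, y)
print(sigmoid(X @ theta))              # fitted probabilities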
2. Estimating the parameters by maximum likelihood

Treat the label $y \in \{0, 1\}$ as a Bernoulli variable. The probability of the event is:

$P(y=1 \mid x;\theta) = h_\theta(x), \qquad P(y=0 \mid x;\theta) = 1 - h_\theta(x)$

which can be written in a single expression as $P(y \mid x;\theta) = h_\theta(x)^{y}\,\big(1-h_\theta(x)\big)^{1-y}$.
The likelihood of the $m$ training samples is:

$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,\big(1-h_\theta(x^{(i)})\big)^{1-y^{(i)}}$
Optimize the likelihood by gradient ascent on its logarithm (equivalently, gradient descent on the negative log-likelihood); the resulting update rule is exactly the one from section 1.3. The worked step is sketched below.
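To make the link between the two views explicit, here is the standard derivation from the log-likelihood to the update rule, in the notation of section 1:

$\log L(\theta) = \sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] = -m\,J(\theta)$

$\dfrac{\partial \log L(\theta)}{\partial \theta_j} = \sum_{i=1}^{m}\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$

$\theta_j := \theta_j + \alpha\sum_{i=1}^{m}\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$

So gradient ascent on the log-likelihood is the same computation as gradient descent on the log loss $J(\theta)$ from section 1.2.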
3. Strengths and weaknesses of logistic regression

Strengths:
1. The model is simple, cheap to train and to score, and scales well to very large, sparse data sets.
2. The learned weights are easy to interpret: each feature contributes through a single coefficient.
3. The output is a probability rather than only a class label.

Weaknesses:
1. The decision boundary is linear in the input features, so the model cannot capture non-linear relationships or feature interactions on its own.
2. Because of this, its accuracy depends heavily on feature engineering (discretization, feature crossing), which is the topic of section 5.
4. Implementing logistic regression with Spark ML

Step 1: read the data and split it into a training set and a test set.

# assumes an existing SparkSession named `spark`
# (e.g. spark = SparkSession.builder.getOrCreate())
births = spark.read.csv("births_transformed.csv", header=True, inferSchema=True)
births_train, births_test = births.randomSplit([0.7, 0.3])
Step 2: build the Pipeline: feature transformation plus the LR model.

import pyspark.ml.feature as ft
import pyspark.ml.classification as cl
from pyspark.ml import Pipeline

# one-hot encode the categorical birth-place code into a sparse vector
encoder = ft.OneHotEncoder(inputCol='BIRTH_PLACE', outputCol='BIRTH_PLACE_VEC')
# assemble the numeric columns plus the encoded vector into a single 'features' column
featuresCreator = ft.VectorAssembler(
    inputCols=[c for c in births.columns[2:]] + [encoder.getOutputCol()],
    outputCol='features')
logistic = cl.LogisticRegression(maxIter=10, regParam=0.01,
                                 labelCol='INFANT_ALIVE_AT_REPORT')
pipeline = Pipeline(stages=[encoder, featuresCreator, logistic])
model = pipeline.fit(births_train)
test_model = model.transform(births_test)
Here BIRTH_PLACE is a categorical variable that gets encoded into a vector, and INFANT_ALIVE_AT_REPORT is the label. The result looks like this:
test_model.take(1)
[Row(INFANT_ALIVE_AT_REPORT=0, BIRTH_PLACE=1, MOTHER_AGE_YEARS=13, FATHER_COMBINED_AGE=99, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=62, MOTHER_PRE_WEIGHT=218, MOTHER_DELIVERY_WEIGHT=240, MOTHER_WEIGHT_GAIN=22, DIABETES_PRE=0, DIABETES_GEST=0, HYP_TENS_PRE=0, HYP_TENS_GEST=0, PREV_BIRTH_PRETERM=0, BIRTH_PLACE_VEC=SparseVector(9, {1: 1.0}), features=SparseVector(24, {0: 13.0, 1: 99.0, 6: 62.0, 7: 218.0, 8: 240.0, 9: 22.0, 16: 1.0}), rawPrediction=DenseVector([0.9171, -0.9171]), probability=DenseVector([0.7145, 0.2855]), prediction=0.0)]
probability holds the predicted probability for each class and prediction is the predicted label.
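If you want to look inside the fitted pipeline, the trained LogisticRegressionModel is its last stage and exposes the learned parameters directly (a minimal sketch, reusing model from above):

lr_model = model.stages[-1]          # the fitted LogisticRegressionModel
print(lr_model.coefficients)         # one weight per entry of the 'features' vector
print(lr_model.intercept)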
Step 3: evaluate the model.

import pyspark.ml.evaluation as ev

# AUC only needs a score that ranks the examples, so the 'probability' column
# can serve as the raw prediction here
evaluator = ev.BinaryClassificationEvaluator(rawPredictionCol='probability',
                                             labelCol='INFANT_ALIVE_AT_REPORT')
print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderROC'}))
print(evaluator.evaluate(test_model, {evaluator.metricName: 'areaUnderPR'}))

The results are:
0.7368260192094396
0.7100164930103934
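The regParam=0.01 above is a hand-picked value. As a hedged sketch (reusing the pipeline, logistic, and evaluator objects defined above, with an illustrative parameter grid), Spark's built-in cross-validation could be used to tune it:

import pyspark.ml.tuning as tune

grid = tune.ParamGridBuilder() \
    .addGrid(logistic.regParam, [0.01, 0.05, 0.3]) \
    .build()
cv = tune.CrossValidator(estimator=pipeline,
                         estimatorParamMaps=grid,
                         evaluator=evaluator,
                         numFolds=3)
cv_model = cv.fit(births_train)                          # best model by cross-validated AUC
print(evaluator.evaluate(cv_model.transform(births_test)))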
5. Discrete features as model input

In industry it is more common to feed LR a very large number of discrete features. The advantages of using discretized features as model input are as follows (a sketch of the discretize-then-encode pattern follows the list):
1. LR is a linear model. Discretizing a continuous variable into N binary indicator variables gives every interval its own weight, which effectively adds non-linearity to the model, strengthens its expressive power, and improves the fit.
2. Discretized features are very robust to outliers: an extreme value simply falls into the highest (or lowest) bucket instead of distorting a single weight.
3. The core computation inside LR is an inner product of the weight and feature vectors; with sparse one-hot features this reduces to summing a handful of weights, which speeds up computation.
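As a minimal sketch of this pattern on the births data (the column MOTHER_AGE_YEARS and the bucket count are illustrative choices, not from the original post):

import pyspark.ml.feature as ft
from pyspark.ml import Pipeline

# split the continuous age into 5 quantile buckets, then one-hot encode the bucket index
age_bucketizer = ft.QuantileDiscretizer(numBuckets=5,
                                        inputCol='MOTHER_AGE_YEARS',
                                        outputCol='MOTHER_AGE_BUCKET')
age_encoder = ft.OneHotEncoder(inputCol='MOTHER_AGE_BUCKET',
                               outputCol='MOTHER_AGE_VEC')
age_pipeline = Pipeline(stages=[age_bucketizer, age_encoder])
births_discrete = age_pipeline.fit(births).transform(births)
births_discrete.select('MOTHER_AGE_YEARS', 'MOTHER_AGE_BUCKET', 'MOTHER_AGE_VEC').show(5)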
References:
1. Data download: http://www.tomdrabas.com/data/LearningPySpark/births_train.csv.gz
2. Coursera, Machine Learning course: https://www.coursera.org/learn/machine-learning/resources/Zi29t
3. 《数据挖掘与数据化运营实战》 (Data Mining and Data-Driven Operations in Practice)
4. https://testerhome.com/topics/11064/show_wechat