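ML with PySpark: a house-price regression case study on the Boston housing dataset using the LightGBM algorithm (with evaluation) in the PySpark framework
Contents
A house-price regression case study on the Boston housing dataset using the LightGBM algorithm (with evaluation) in the PySpark framework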
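# 1. Define the dataset
# 1.1. Create the SparkSession
# 1.2. Read the dataset
# 2. Data preprocessing / feature engineering
# 2.1. Define feature vectorization
# 2.2. Merge features and label
# 3. Model training and evaluation
# 3.1. Split the dataset
# 3.1.1. Split into training and validation sets
# 3.1.2. Type conversion: convert the Spark DataFrame (spark_df) to a pandas DataFrame (pandas_df)
# 3.1.3. Separate features and labels
# 3.1.4. Build the LightGBM datasets
# 3.2. Train the model
# 3.3. Model prediction and evaluation
# 4. Stop the SparkSession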
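Related articles
ML with PySpark: a house-price regression case study on the Boston housing dataset using the LightGBM algorithm (with evaluation) in the PySpark framework
ML with PySpark: a house-price regression case study on the Boston housing dataset using the LightGBM algorithm (with evaluation) in the PySpark framework, implementation code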
# Drop the meaningless index column
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
| ID| crim| zn|indus|chas| nox| rm| age| dis|rad|tax|ptratio| black|lstat|medv|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
| 1|0.00632|18.0| 2.31| 0|0.538|6.575|65.2| 4.09| 1|296| 15.3| 396.9| 4.98|24.0|
| 2|0.02731| 0.0| 7.07| 0|0.469|6.421|78.9|4.9671| 2|242| 17.8| 396.9| 9.14|21.6|
| 4|0.03237| 0.0| 2.18| 0|0.458|6.998|45.8|6.0622| 3|222| 18.7|394.63| 2.94|33.4|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 3 rows
root
|-- ID: integer (nullable = true)
|-- crim: double (nullable = true)
|-- zn: double (nullable = true)
|-- indus: double (nullable = true)
|-- chas: integer (nullable = true)
|-- nox: double (nullable = true)
|-- rm: double (nullable = true)
|-- age: double (nullable = true)
|-- dis: double (nullable = true)
|-- rad: integer (nullable = true)
|-- tax: integer (nullable = true)
|-- ptratio: double (nullable = true)
|-- black: double (nullable = true)
|-- lstat: double (nullable = true)
|-- medv: double (nullable = true)
15 ['ID', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
| crim| zn|indus|chas| nox| rm| age| dis|rad|tax|ptratio| black|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31| 0|0.538|6.575|65.2| 4.09| 1|296| 15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07| 0|0.469|6.421|78.9|4.9671| 2|242| 17.8| 396.9| 9.14|21.6|
|0.03237| 0.0| 2.18| 0|0.458|6.998|45.8|6.0622| 3|222| 18.7|394.63| 2.94|33.4|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 3 rows
23/04/12 18:05:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
|medv| features| crim| zn|indus|chas| nox| rm| age| dis|rad|tax|ptratio| black|lstat|
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
|24.0|[0.00632,18.0,2.3...|0.00632|18.0| 2.31| 0|0.538|6.575|65.2| 4.09| 1|296| 15.3| 396.9| 4.98|
|21.6|[0.02731,0.0,7.07...|0.02731| 0.0| 7.07| 0|0.469|6.421|78.9|4.9671| 2|242| 17.8| 396.9| 9.14|
|33.4|[0.03237,0.0,2.18...|0.03237| 0.0| 2.18| 0|0.458|6.998|45.8|6.0622| 3|222| 18.7|394.63| 2.94|
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
only showing top 3 rows
[DenseVector([38.3518, 0.0, 18.1, 0.0, 0.693, 5.453, 100.0, 1.4896, 24.0, 666.0, 20.2, 396.9, 30.59])
DenseVector([25.0461, 0.0, 18.1, 0.0, 0.693, 5.987, 100.0, 1.5888, 24.0, 666.0, 20.2, 396.9, 26.77])
DenseVector([14.2362, 0.0, 18.1, 0.0, 0.693, 6.343, 100.0, 1.5741, 24.0, 666.0, 20.2, 396.9, 20.32])
DenseVector([18.0846, 0.0, 18.1, 0.0, 0.679, 6.434, 100.0, 1.8347, 24.0, 666.0, 20.2, 27.25, 29.05])
……
DenseVector([9.2323, 0.0, 18.1, 0.0, 0.631, 6.216, 100.0, 1.1691, 24.0, 666.0, 20.2, 366.15, 9.53])]
[[ 25.0461 0. 18.1 ... 20.2 396.9 26.77 ]
[ 45.7461 0. 18.1 ... 20.2 88.27 36.98 ]
[ 14.2362 0. 18.1 ... 20.2 396.9 20.32 ]
...
[ 4.89822 0. 18.1 ... 20.2 375.52 3.26 ]
[ 6.53876 0. 18.1 ... 20.2 392.05 2.96 ]
[ 9.2323 0. 18.1 ... 20.2 366.15 9.53 ]]
[ 5. 5.6 7.2 7.2 7.4 8.1 8.3 8.4 8.7 8.8 8.8 9.5 9.7 10.4
10.5 10.5 10.8 10.9 11. 11.3 11.7 12. 12.3 13.1 13.4 13.5 13.5 13.6
……
32.9 33. 33.1 33.4 33.4 33.8 34.9 34.9 35.1 35.2 36.1 36.2 36.2 36.4
37. 37.2 37.6 37.9 39.8 41.7 43.1 44.8 46. 48.3 48.5 48.8 50. 50.
50. 50. 50. 50. 50. 50. 50. ]
lgb_train
lgb_eval
# Set the model parameters
# Train the model
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000313 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 638
[LightGBM] [Info] Number of data points in the train set: 251, number of used features: 12
[LightGBM] [Info] Start training from score 23.116335
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] valid_0's l1: 5.55296 valid_0's l2: 55.3567
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] valid_0's l1: 5.37904 valid_0's l2: 52.1612
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] valid_0's l1: 5.18424 valid_0's l2: 48.5647
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] valid_0's l1: 4.98621 valid_0's l2: 45.1861
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] valid_0's l1: 4.81712 valid_0's l2: 42.6588
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6] valid_0's l1: 4.64324 valid_0's l2: 39.9921
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7] valid_0's l1: 4.48456 valid_0's l2: 37.6254
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[8] valid_0's l1: 4.32798 valid_0's l2: 35.562
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[9] valid_0's l1: 4.1635 valid_0's l2: 33.5291
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[10] valid_0's l1: 4.04524 valid_0's l2: 31.976
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[11] valid_0's l1: 3.92398 valid_0's l2: 30.43
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[12] valid_0's l1: 3.82096 valid_0's l2: 29.178
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[13] valid_0's l1: 3.70099 valid_0's l2: 27.894
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[14] valid_0's l1: 3.5979 valid_0's l2: 26.7405
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[15] valid_0's l1: 3.5056 valid_0's l2: 25.7414
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[16] valid_0's l1: 3.42553 valid_0's l2: 24.9116
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[17] valid_0's l1: 3.33382 valid_0's l2: 23.9615
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[18] valid_0's l1: 3.26277 valid_0's l2: 23.1291
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[19] valid_0's l1: 3.19639 valid_0's l2: 22.4056
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[20] valid_0's l1: 3.13146 valid_0's l2: 21.7271
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[21] valid_0's l1: 3.06704 valid_0's l2: 21.0404
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[22] valid_0's l1: 3.01769 valid_0's l2: 20.5252
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[23] valid_0's l1: 2.97406 valid_0's l2: 19.9501
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[24] valid_0's l1: 2.93286 valid_0's l2: 19.5415
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[25] valid_0's l1: 2.89854 valid_0's l2: 19.2612
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[26] valid_0's l1: 2.86595 valid_0's l2: 18.9108
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[27] valid_0's l1: 2.83455 valid_0's l2: 18.4983
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[28] valid_0's l1: 2.80282 valid_0's l2: 18.1043
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[29] valid_0's l1: 2.7676 valid_0's l2: 17.7102
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[30] valid_0's l1: 2.73643 valid_0's l2: 17.5716
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[31] valid_0's l1: 2.7189 valid_0's l2: 17.3091
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[32] valid_0's l1: 2.70055 valid_0's l2: 17.0558
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[33] valid_0's l1: 2.67765 valid_0's l2: 16.7472
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[34] valid_0's l1: 2.66355 valid_0's l2: 16.5579
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[35] valid_0's l1: 2.65571 valid_0's l2: 16.4488
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[36] valid_0's l1: 2.64186 valid_0's l2: 16.1984
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[37] valid_0's l1: 2.62485 valid_0's l2: 15.9867
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[38] valid_0's l1: 2.60517 valid_0's l2: 15.7785
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[39] valid_0's l1: 2.60108 valid_0's l2: 15.6885
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[40] valid_0's l1: 2.58566 valid_0's l2: 15.5137
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[41] valid_0's l1: 2.56902 valid_0's l2: 15.2941
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[42] valid_0's l1: 2.55925 valid_0's l2: 15.2012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[43] valid_0's l1: 2.55608 valid_0's l2: 15.1871
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[44] valid_0's l1: 2.55078 valid_0's l2: 15.0495
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[45] valid_0's l1: 2.54551 valid_0's l2: 14.9857
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[46] valid_0's l1: 2.54433 valid_0's l2: 14.9713
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[47] valid_0's l1: 2.54001 valid_0's l2: 14.8974
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[48] valid_0's l1: 2.53317 valid_0's l2: 14.7989
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[49] valid_0's l1: 2.52767 valid_0's l2: 14.6993
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[50] valid_0's l1: 2.53182 valid_0's l2: 14.6695
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[51] valid_0's l1: 2.525 valid_0's l2: 14.5664
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[52] valid_0's l1: 2.52136 valid_0's l2: 14.5268
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[53] valid_0's l1: 2.51818 valid_0's l2: 14.4511
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[54] valid_0's l1: 2.51344 valid_0's l2: 14.3942
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[55] valid_0's l1: 2.51794 valid_0's l2: 14.3446
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[56] valid_0's l1: 2.51472 valid_0's l2: 14.3073
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[57] valid_0's l1: 2.51281 valid_0's l2: 14.2588
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[58] valid_0's l1: 2.51246 valid_0's l2: 14.1835
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[59] valid_0's l1: 2.51186 valid_0's l2: 14.1126
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[60] valid_0's l1: 2.50594 valid_0's l2: 14.0302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[61] valid_0's l1: 2.50899 valid_0's l2: 14.0065
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[62] valid_0's l1: 2.51171 valid_0's l2: 13.9461
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[63] valid_0's l1: 2.51573 valid_0's l2: 13.95
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[64] valid_0's l1: 2.52288 valid_0's l2: 13.9602
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[65] valid_0's l1: 2.52993 valid_0's l2: 13.9526
Early stopping, best iteration is:
[60] valid_0's l1: 2.50594 valid_0's l2: 14.0302
R2 = 0.757685
RMSE = 3.74569
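The two figures above are the coefficient of determination and the root mean squared error on the validation set. As a consistency check, the RMSE is the square root of the `l2` value at the best iteration (14.0302, whose square root is about 3.7457). The sketch below computes both metrics with plain NumPy; the helper names and the toy predictions are illustrative and are not the article's data.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy values: true prices from the first three rows, made-up predictions.
y_true = np.array([24.0, 21.6, 33.4])
y_pred = np.array([25.0, 20.6, 32.4])

r2_value = r2_score(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print("R2 = %g" % r2_value)
print("RMSE = %g" % rmse_value)
```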