ML with PySpark: a case study of house price regression on the Boston housing dataset using the LightGBM algorithm (with evaluation) on the PySpark framework

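Contents

House price regression on the Boston housing dataset using the LightGBM algorithm (with evaluation) on the PySpark framework

# 1. Define the dataset

# 1.1. Create the SparkSession

# 1.2. Read the dataset

# 2. Data preprocessing / feature engineering

# 2.1. Define the feature vectorization

# 2.2. Combine features and label

# 3. Model training and evaluation

# 3.1. Split the dataset

# 3.1.1. Split into training and validation sets

# 3.1.2. Type conversion: convert spark_df to pandas_df

# 3.1.3. Separate features and labels

# 3.1.4. Build the LightGBM datasets

# 3.2. Model training

# 3.3. Model prediction and evaluation

# 4. Stop the SparkSession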


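Related articles
ML with PySpark: a case study of house price regression on the Boston housing dataset using the LightGBM algorithm (with evaluation) on the PySpark framework
ML with PySpark: a case study of house price regression on the Boston housing dataset using the LightGBM algorithm (with evaluation) on the PySpark framework: implementation code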

House price regression on the Boston housing dataset using the LightGBM algorithm (with evaluation) on the PySpark framework

# 1. Define the dataset

# 1.1. Create the SparkSession

# 1.2. Read the dataset

# Drop the meaningless index column
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
| ID|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio| black|lstat|medv|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|  1|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|  2|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|  4|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
+---+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 3 rows

root
 |-- ID: integer (nullable = true)
 |-- crim: double (nullable = true)
 |-- zn: double (nullable = true)
 |-- indus: double (nullable = true)
 |-- chas: integer (nullable = true)
 |-- nox: double (nullable = true)
 |-- rm: double (nullable = true)
 |-- age: double (nullable = true)
 |-- dis: double (nullable = true)
 |-- rad: integer (nullable = true)
 |-- tax: integer (nullable = true)
 |-- ptratio: double (nullable = true)
 |-- black: double (nullable = true)
 |-- lstat: double (nullable = true)
 |-- medv: double (nullable = true)

15 ['ID', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat', 'medv']
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio| black|lstat|medv|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|24.0|
|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|21.6|
|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|33.4|
+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
only showing top 3 rows

# 2. Data preprocessing / feature engineering

# 2.1. Define the feature vectorization

# 2.2. Combine features and label

23/04/12 18:05:04 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
|medv|            features|   crim|  zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio| black|lstat|
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
|24.0|[0.00632,18.0,2.3...|0.00632|18.0| 2.31|   0|0.538|6.575|65.2|  4.09|  1|296|   15.3| 396.9| 4.98|
|21.6|[0.02731,0.0,7.07...|0.02731| 0.0| 7.07|   0|0.469|6.421|78.9|4.9671|  2|242|   17.8| 396.9| 9.14|
|33.4|[0.03237,0.0,2.18...|0.03237| 0.0| 2.18|   0|0.458|6.998|45.8|6.0622|  3|222|   18.7|394.63| 2.94|
+----+--------------------+-------+----+-----+----+-----+-----+----+------+---+---+-------+------+-----+
only showing top 3 rows

# 3. Model training and evaluation

# 3.1. Split the dataset

# 3.1.1. Split into training and validation sets

# 3.1.2. Type conversion: convert spark_df to pandas_df

# 3.1.3. Separate features and labels

[DenseVector([38.3518, 0.0, 18.1, 0.0, 0.693, 5.453, 100.0, 1.4896, 24.0, 666.0, 20.2, 396.9, 30.59])
 DenseVector([25.0461, 0.0, 18.1, 0.0, 0.693, 5.987, 100.0, 1.5888, 24.0, 666.0, 20.2, 396.9, 26.77])
 DenseVector([14.2362, 0.0, 18.1, 0.0, 0.693, 6.343, 100.0, 1.5741, 24.0, 666.0, 20.2, 396.9, 20.32])
 DenseVector([18.0846, 0.0, 18.1, 0.0, 0.679, 6.434, 100.0, 1.8347, 24.0, 666.0, 20.2, 27.25, 29.05])
……

 DenseVector([9.2323, 0.0, 18.1, 0.0, 0.631, 6.216, 100.0, 1.1691, 24.0, 666.0, 20.2, 366.15, 9.53])]


 [[ 25.0461    0.       18.1     ...  20.2     396.9      26.77   ]
 [ 45.7461    0.       18.1     ...  20.2      88.27     36.98   ]
 [ 14.2362    0.       18.1     ...  20.2     396.9      20.32   ]
 ...
 [  4.89822   0.       18.1     ...  20.2     375.52      3.26   ]
 [  6.53876   0.       18.1     ...  20.2     392.05      2.96   ]
 [  9.2323    0.       18.1     ...  20.2     366.15      9.53   ]]



[ 5.   5.6  7.2  7.2  7.4  8.1  8.3  8.4  8.7  8.8  8.8  9.5  9.7 10.4
 10.5 10.5 10.8 10.9 11.  11.3 11.7 12.  12.3 13.1 13.4 13.5 13.5 13.6
……
 32.9 33.  33.1 33.4 33.4 33.8 34.9 34.9 35.1 35.2 36.1 36.2 36.2 36.4
 37.  37.2 37.6 37.9 39.8 41.7 43.1 44.8 46.  48.3 48.5 48.8 50.  50.
 50.  50.  50.  50.  50.  50.  50. ]

# 3.1.4. Build the LightGBM datasets

lgb_train 
 
lgb_eval 
 

# 3.2. Model training

# Set the model parameters

# Train the model

[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000313 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 638
[LightGBM] [Info] Number of data points in the train set: 251, number of used features: 12
[LightGBM] [Info] Start training from score 23.116335
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1]	valid_0's l1: 5.55296	valid_0's l2: 55.3567
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2]	valid_0's l1: 5.37904	valid_0's l2: 52.1612
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3]	valid_0's l1: 5.18424	valid_0's l2: 48.5647
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4]	valid_0's l1: 4.98621	valid_0's l2: 45.1861
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5]	valid_0's l1: 4.81712	valid_0's l2: 42.6588
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6]	valid_0's l1: 4.64324	valid_0's l2: 39.9921
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7]	valid_0's l1: 4.48456	valid_0's l2: 37.6254
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[8]	valid_0's l1: 4.32798	valid_0's l2: 35.562
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[9]	valid_0's l1: 4.1635	valid_0's l2: 33.5291
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[10]	valid_0's l1: 4.04524	valid_0's l2: 31.976
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[11]	valid_0's l1: 3.92398	valid_0's l2: 30.43
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[12]	valid_0's l1: 3.82096	valid_0's l2: 29.178
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[13]	valid_0's l1: 3.70099	valid_0's l2: 27.894
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[14]	valid_0's l1: 3.5979	valid_0's l2: 26.7405
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[15]	valid_0's l1: 3.5056	valid_0's l2: 25.7414
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[16]	valid_0's l1: 3.42553	valid_0's l2: 24.9116
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[17]	valid_0's l1: 3.33382	valid_0's l2: 23.9615
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[18]	valid_0's l1: 3.26277	valid_0's l2: 23.1291
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[19]	valid_0's l1: 3.19639	valid_0's l2: 22.4056
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[20]	valid_0's l1: 3.13146	valid_0's l2: 21.7271
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[21]	valid_0's l1: 3.06704	valid_0's l2: 21.0404
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[22]	valid_0's l1: 3.01769	valid_0's l2: 20.5252
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[23]	valid_0's l1: 2.97406	valid_0's l2: 19.9501
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[24]	valid_0's l1: 2.93286	valid_0's l2: 19.5415
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[25]	valid_0's l1: 2.89854	valid_0's l2: 19.2612
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[26]	valid_0's l1: 2.86595	valid_0's l2: 18.9108
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[27]	valid_0's l1: 2.83455	valid_0's l2: 18.4983
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[28]	valid_0's l1: 2.80282	valid_0's l2: 18.1043
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[29]	valid_0's l1: 2.7676	valid_0's l2: 17.7102
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[30]	valid_0's l1: 2.73643	valid_0's l2: 17.5716
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[31]	valid_0's l1: 2.7189	valid_0's l2: 17.3091
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[32]	valid_0's l1: 2.70055	valid_0's l2: 17.0558
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[33]	valid_0's l1: 2.67765	valid_0's l2: 16.7472
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[34]	valid_0's l1: 2.66355	valid_0's l2: 16.5579
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[35]	valid_0's l1: 2.65571	valid_0's l2: 16.4488
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[36]	valid_0's l1: 2.64186	valid_0's l2: 16.1984
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[37]	valid_0's l1: 2.62485	valid_0's l2: 15.9867
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[38]	valid_0's l1: 2.60517	valid_0's l2: 15.7785
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[39]	valid_0's l1: 2.60108	valid_0's l2: 15.6885
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[40]	valid_0's l1: 2.58566	valid_0's l2: 15.5137
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[41]	valid_0's l1: 2.56902	valid_0's l2: 15.2941
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[42]	valid_0's l1: 2.55925	valid_0's l2: 15.2012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[43]	valid_0's l1: 2.55608	valid_0's l2: 15.1871
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[44]	valid_0's l1: 2.55078	valid_0's l2: 15.0495
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[45]	valid_0's l1: 2.54551	valid_0's l2: 14.9857
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[46]	valid_0's l1: 2.54433	valid_0's l2: 14.9713
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[47]	valid_0's l1: 2.54001	valid_0's l2: 14.8974
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[48]	valid_0's l1: 2.53317	valid_0's l2: 14.7989
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[49]	valid_0's l1: 2.52767	valid_0's l2: 14.6993
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[50]	valid_0's l1: 2.53182	valid_0's l2: 14.6695
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[51]	valid_0's l1: 2.525	valid_0's l2: 14.5664
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[52]	valid_0's l1: 2.52136	valid_0's l2: 14.5268
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[53]	valid_0's l1: 2.51818	valid_0's l2: 14.4511
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[54]	valid_0's l1: 2.51344	valid_0's l2: 14.3942
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[55]	valid_0's l1: 2.51794	valid_0's l2: 14.3446
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[56]	valid_0's l1: 2.51472	valid_0's l2: 14.3073
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[57]	valid_0's l1: 2.51281	valid_0's l2: 14.2588
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[58]	valid_0's l1: 2.51246	valid_0's l2: 14.1835
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[59]	valid_0's l1: 2.51186	valid_0's l2: 14.1126
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[60]	valid_0's l1: 2.50594	valid_0's l2: 14.0302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[61]	valid_0's l1: 2.50899	valid_0's l2: 14.0065
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[62]	valid_0's l1: 2.51171	valid_0's l2: 13.9461
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[63]	valid_0's l1: 2.51573	valid_0's l2: 13.95
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[64]	valid_0's l1: 2.52288	valid_0's l2: 13.9602
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[65]	valid_0's l1: 2.52993	valid_0's l2: 13.9526
Early stopping, best iteration is:
[60]	valid_0's l1: 2.50594	valid_0's l2: 14.0302

# 3.3. Model prediction and evaluation

R2 = 0.757685
RMSE = 3.74569
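
The two metrics can be computed with NumPy alone, as sketched below on a tiny made-up example (the arrays are not from the article). The last lines check that the reported RMSE agrees with the square root of the `l2` value logged at the best iteration, since LightGBM's `l2` metric is the mean squared error.

```python
import math

import numpy as np


def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up values, used only to exercise the two functions.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print("R2 = %g" % r2_score(y_true, y_pred))
print("RMSE = %g" % rmse(y_true, y_pred))

# Cross-check against the log: l2 at iteration 60 was 14.0302.
rmse_from_l2 = math.sqrt(14.0302)
print(round(rmse_from_l2, 4))
```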

# 4. Stop the SparkSession
