For this experiment we use the Bike Sharing dataset from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
Scenario: predict the number of shared-bike rentals.
Features: season, month, hour (0-23), holiday, day of week, working day, weather, temperature, feels-like temperature, humidity, wind speed.
Prediction target: the number of bike rentals in each hour.
1. Download and open the dataset
Enter the following commands in a terminal:
cd ~/pythonwork/PythonProject/data
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
unzip -j Bike-Sharing-Dataset.zip
cat hour.csv|more
Open hour.csv (rental counts aggregated by hour).
Field descriptions and the corresponding processing:
2. Start IPython/Jupyter Notebook and import the data
Run IPython/Jupyter Notebook from the terminal:
cd ~/pythonwork/ipynotebook
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=local[*] pyspark
Enter the following in IPython/Jupyter Notebook to import and read the data:
## Define the data path
global Path
if sc.master[:5] == "local":
    Path = "file:/home/yyf/pythonwork/PythonProject/"
else:
    Path = "hdfs://master:9000/user/yyf/"
## Read hour.csv
print("Importing data...")
rawData = sc.textFile(Path + "data/hour.csv")
header = rawData.first()  # the first line holds the field names
## Drop the header line
rData = rawData.filter(lambda x: x != header)
## Show the first 2 records
print(rData.take(2))
## Split each line on commas
lines = rData.map(lambda x: x.split(","))
print("Total records: " + str(lines.count()))
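The header-filtering and comma-splitting steps above can be mimicked in plain Python without Spark. A minimal sketch; the two data rows below are illustrative values shaped like hour.csv records, not actual dataset contents:

```python
# Hand-made lines shaped like hour.csv (header + two illustrative records).
raw = [
    "instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt",
    "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16",
    "2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.80,0.0,8,32,40",
]

header = raw[0]                              # first line holds the column names
records = [x for x in raw if x != header]    # drop the header, like rawData.filter(...)
lines = [x.split(",") for x in records]      # split each row on commas, like rData.map(...)

print("number of records:", len(lines))
print(lines[0][:4])
```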
1. Process the features
## Process the features
import numpy as np
def convert_float(v):
    """Convert a numeric string to a float"""
    return float(v)

def process_features(line):
    """Extract the feature values from one split record"""
    ## Season feature (a single column)
    SeasonFeature = [convert_float(line[2])]
    ## Remaining numeric features (month through wind speed)
    Features = [convert_float(value) for value in line[4:14]]
    # Return the concatenated feature list
    return SeasonFeature + Features
2. Process the prediction target
## Process the prediction target value
def process_label(line):
    return float(line[-1])

process_label(lines.first())
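A quick pure-Python check of this feature and label processing on a single illustrative row (shaped like an hour.csv record; the values are assumptions for demonstration):

```python
def convert_float(v):
    return float(v)

def process_features(line):
    season = [convert_float(line[2])]               # season column
    rest = [convert_float(v) for v in line[4:14]]   # mnth .. windspeed columns
    return season + rest

def process_label(line):
    return float(line[-1])                          # cnt, the hourly rental count

# One illustrative record with the hour.csv column layout.
row = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16".split(",")
features, label = process_features(row), process_label(row)
print(len(features), label)   # 11 features, label 16.0
```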
3. Build the LabeledPoint data format
Spark MLlib's training APIs take data in the LabeledPoint format; a LabeledPoint consists of a label and a feature vector.
## Build the LabeledPoint data:
from pyspark.mllib.regression import LabeledPoint
labelpointRDD = lines.map(lambda r: LabeledPoint(process_label(r),
                                                 process_features(r)))
labelpointRDD.first()
Returned result:
4. Split into training, validation, and test sets
## Split into training, validation, and test sets
(trainData, validationData, testData) = labelpointRDD.randomSplit([7, 1, 2])
print("Training set size: " + str(trainData.count()) + ", validation set size: " + str(validationData.count()) + ", test set size: " + str(testData.count()))
# Cache the data in memory to speed up subsequent computation
trainData.persist()
validationData.persist()
testData.persist()
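As a rough illustration of what a weighted 7:1:2 split does, here is a minimal pure-Python sketch. This is a simplified stand-in, not Spark's implementation; randomSplit itself normalizes the weights internally in a similar way:

```python
import random

def random_split(data, weights, seed=42):
    """Assign each item to one of len(weights) parts with the given odds."""
    rng = random.Random(seed)
    total = float(sum(weights))
    cuts, acc = [], 0.0
    for w in weights:
        acc += w / total
        cuts.append(acc)          # cumulative boundaries: 0.7, 0.8, 1.0
    cuts[-1] = 1.0                # guard against float rounding
    parts = [[] for _ in weights]
    for item in data:
        r = rng.random()
        for i, c in enumerate(cuts):
            if r <= c:
                parts[i].append(item)
                break
    return parts

train, val, test = random_split(list(range(1000)), [7, 1, 2])
print(len(train), len(val), len(test))   # roughly 700 / 100 / 200
```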
We train the model with the trainRegressor method of Spark MLlib's DecisionTree module, called as follows:
DecisionTree.trainRegressor(data, categoricalFeaturesInfo, impurity="variance", maxDepth=5, maxBins=32,
                            minInstancesPerNode=1, minInfoGain=0.0)
The parameters are described below:
## Train a decision tree regression model
from pyspark.mllib.tree import DecisionTree
model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=5, maxBins=32,
                                    minInstancesPerNode=1, minInfoGain=0.0)
We evaluate the model with the root mean square error (RMSE):
## Evaluate the model with RMSE
import numpy as np
## Define the model evaluation function
def RMSE(model, validationData):
    ## Predict on the validation features
    predict = model.predict(validationData.map(lambda p: p.features))
    ## Pair each prediction with its actual label
    predict_real = predict.zip(validationData.map(lambda p: p.label))
    ## Compute the root mean square error
    rmse = np.sqrt(predict_real.map(lambda p: (p[0] - p[1]) ** 2).sum() / predict_real.count())
    return rmse
## Compute the model's RMSE on the validation set
rmse = RMSE(model, validationData)
print("RMSE=" + str(rmse))
Returned result: RMSE=117.804043648
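The RMSE formula used above can be sanity-checked on a tiny hand-computable example with plain NumPy, no Spark needed:

```python
import numpy as np

def rmse(pred, real):
    """Root mean square error: sqrt(mean of squared prediction errors)."""
    pred, real = np.asarray(pred, dtype=float), np.asarray(real, dtype=float)
    return float(np.sqrt(((pred - real) ** 2).mean()))

# Errors are [0, 0, 2], so RMSE = sqrt((0 + 0 + 4) / 3) = sqrt(4/3).
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```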
The DecisionTree parameters maxDepth, maxBins, minInstancesPerNode, and minInfoGain affect both the model's RMSE and the training time. Below we evaluate different values of each parameter.
We create a trainEvaluateModel function that trains a model, evaluates it, and measures how long the whole process takes.
## trainEvaluateModel trains and evaluates a model and times the process
import time
def trainEvaluateModel(trainData, validationData, maxDepthParm, maxBinsParm, minInstancesPerNodeParm, minInfoGainParm):
    startTime = time.time()
    ## Create and train the model
    model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=maxDepthParm,
                                        maxBins=maxBinsParm, minInstancesPerNode=minInstancesPerNodeParm, minInfoGain=minInfoGainParm)
    ## Compute the RMSE
    rmse = RMSE(model, validationData)
    duration = time.time() - startTime  # elapsed time
    print("Train/evaluate: maxDepth=" + str(maxDepthParm) + ", maxBins=" + str(maxBinsParm) +
          ", minInstancesPerNode=" + str(minInstancesPerNodeParm) + ", minInfoGain=" + str(minInfoGainParm) + "\n"
          "===> elapsed time=" + str(duration) + ", RMSE=" + str(rmse))
    return rmse, duration, maxDepthParm, maxBinsParm, minInstancesPerNodeParm, minInfoGainParm, model
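The train-and-time pattern in trainEvaluateModel can be sketched generically; here sum is just a stand-in trainer so the sketch runs without Spark (the real call is DecisionTree.trainRegressor):

```python
import time

def train_evaluate(trainer, *params):
    """Run a trainer with the given parameters and time the call."""
    start = time.time()
    result = trainer(*params)          # train (or here: compute) with the parameters
    duration = time.time() - start     # wall-clock time in seconds
    return result, duration

result, duration = train_evaluate(sum, [1, 2, 3])
print(result, duration >= 0.0)
```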
1. Evaluate the maxDepth parameter
## Evaluate the maxDepth parameter
maxDepthList = [3, 5, 10, 15, 20, 25]
maxBinsList = [10]
minInstancesPerNodeList = [1]
minInfoGainList = [0.0]
## Store the returned results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]
Returned results:
Train/evaluate: maxDepth=3, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.782449960709, RMSE=134.527910738
Train/evaluate: maxDepth=5, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.586301088333, RMSE=113.391580393
Train/evaluate: maxDepth=10, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.838535070419, RMSE=89.6594076876
Train/evaluate: maxDepth=15, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=1.38911700249, RMSE=96.6207085889
Train/evaluate: maxDepth=20, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=2.21378922462, RMSE=105.020884741
Train/evaluate: maxDepth=25, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=3.13737297058, RMSE=107.091143371
We observe that in this case the RMSE is relatively large when maxDepth is either too small or too large, and that training time grows with maxDepth.
2. Evaluate the maxBins parameter
## Evaluate the maxBins parameter
maxDepthList = [10]
maxBinsList = [5, 10, 15, 100, 200, 500]
minInstancesPerNodeList = [1]
minInfoGainList = [0.0]
## Store the returned results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]
Returned results:
Train/evaluate: maxDepth=10, maxBins=5, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.910573959351, RMSE=124.584498251
Train/evaluate: maxDepth=10, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.759560108185, RMSE=89.6594076876
Train/evaluate: maxDepth=10, maxBins=15, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.758432865143, RMSE=91.626783826
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.798367023468, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=200, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.819510936737, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=500, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.823748111725, RMSE=80.187592618
We observe that increasing maxBins tends to cost somewhat more training time and to lower the RMSE, and that beyond a certain size the RMSE stops changing.
3. Evaluate the minInstancesPerNode parameter
## Evaluate the minInstancesPerNode parameter
maxDepthList = [10]
maxBinsList = [100]
minInstancesPerNodeList = [1, 3, 5, 10, 20, 50]
minInfoGainList = [0.0]
## Store the returned results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]
Returned results:
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.953587055206, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=3, minInfoGain=0.0
===> elapsed time=0.811645030975, RMSE=77.0894951256
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.0
===> elapsed time=0.798023939133, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=10, minInfoGain=0.0
===> elapsed time=0.776537179947, RMSE=77.276983917
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=20, minInfoGain=0.0
===> elapsed time=0.737972974777, RMSE=78.4232321135
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=50, minInfoGain=0.0
===> elapsed time=0.703710079193, RMSE=81.9538443152
We observe that the RMSE grows slightly when minInstancesPerNode is either too small or too large, but the effect is modest.
4. Evaluate the minInfoGain parameter
## Evaluate the minInfoGain parameter
maxDepthList = [10]
maxBinsList = [100]
minInstancesPerNodeList = [5]
minInfoGainList = [0.0, 0.1, 0.3, 0.5, 0.8]
## Store the returned results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]
Returned results:
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.0
===> elapsed time=1.01054096222, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.1
===> elapsed time=0.769320011139, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.3
===> elapsed time=0.804311990738, RMSE=76.7551434805
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.5
===> elapsed time=0.77669095993, RMSE=76.7553417561
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.8
===> elapsed time=0.762629985809, RMSE=76.7553265736
We observe that minInfoGain has almost no effect on the model here.
5. Grid search for the best parameter combination
## gridSearch searches the parameter grid for the best combination
def gridSearch(trainData, validationData, maxDepthList, maxBinsList, minInstancesPerNodeList, minInfoGainList):
    metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
               for maxDepth in maxDepthList
               for maxBins in maxBinsList
               for minInstancesPerNode in minInstancesPerNodeList
               for minInfoGain in minInfoGainList]
    # Sort by RMSE in ascending order and return the combination with the smallest RMSE
    sorted_metrics = sorted(metrics, key=lambda k: k[0], reverse=False)
    best_parameters = sorted_metrics[0]
    print("Best parameter combination: maxDepth=" + str(best_parameters[2]) +
          ", maxBins=" + str(best_parameters[3]) + ", minInstancesPerNode=" + str(best_parameters[4]) +
          ", minInfoGain=" + str(best_parameters[5]) + "\n" +
          ", RMSE=" + str(best_parameters[0]))
    return best_parameters
## Parameter grids
maxDepthList = [3, 5, 10, 20, 25]
maxBinsList = [30, 50, 100, 200]
minInstancesPerNodeList = [1, 3, 5, 10, 20]
minInfoGainList = [0.0, 0.3, 0.5]
## Call the function to get the best parameter combination
best_parameters = gridSearch(trainData, validationData, maxDepthList, maxBinsList, minInstancesPerNodeList, minInfoGainList)
Returned result:
Best parameter combination: maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3
, RMSE=76.4217911844
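The comprehension-then-sort structure of gridSearch can be illustrated with a toy two-parameter search, using a hypothetical score function in place of the real train-and-evaluate step:

```python
def grid_search(score, depth_list, bins_list):
    """Score every (depth, bins) combination and return the lowest-scoring one."""
    metrics = [(score(d, b), d, b)
               for d in depth_list
               for b in bins_list]
    metrics.sort(key=lambda m: m[0])   # ascending: smallest "RMSE" first
    return metrics[0]

# Pretend metric, minimized at depth=10, bins=100 (purely illustrative).
best = grid_search(lambda d, b: abs(d - 10) + abs(b - 100) / 10.0,
                   [3, 5, 10, 20], [30, 50, 100, 200])
print(best)   # (0.0, 10, 100)
```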
We have now obtained the best parameter combination maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3 and its RMSE on the validation set. We train a model with this combination and compute its RMSE on both the training data and the test data, to check for overfitting:
## Train a model with the best parameter combination maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3
best_model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=25,
                                         maxBins=30, minInstancesPerNode=10, minInfoGain=0.3)
rmse1 = RMSE(best_model, trainData)
rmse2 = RMSE(best_model, testData)
print("training: RMSE=" + str(rmse1))
print("testing: RMSE=" + str(rmse2))
Returned results:
training: RMSE=62.3712304863
testing: RMSE=79.1479978784
We observe that the RMSE on the training data is noticeably lower than on the test data, which indicates a degree of overfitting.