Python Spark MLlib: Decision Tree Regression Analysis

Data Preparation

This experiment uses the Bike Sharing dataset from the UCI repository (http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

Scenario: predict the number of bike-sharing rentals.
Features: season, month, hour (0-23), holiday, day of week, working day, weather, temperature, feels-like temperature, humidity, wind speed
Target: the number of bikes rented in each hour

1. Download and inspect the dataset

Run the following in a terminal:

cd ~/pythonwork/PythonProject/data
wget http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip
unzip -j Bike-Sharing-Dataset.zip
cat hour.csv | more

hour.csv contains the rental counts aggregated by hour.
(screenshot: first rows of hour.csv)
Field descriptions and how each field is handled:

  • instant: record index, ignored
  • dteday: date, ignored
  • season: season (1: spring, 2: summer, 3: fall, 4: winter), feature
  • yr: year (0: 2011, 1: 2012), ignored
  • mnth: month (1-12), feature
  • hr: hour (0-23), feature
  • holiday: holiday flag (0: not a holiday, 1: holiday), feature
  • weekday: day of week, feature
  • workingday: working-day flag, feature
  • weathersit: weather, graded from 1 (clear) to 4 (severe), feature
  • temp: temperature in Celsius (normalized by dividing by 41), feature
  • atemp: feels-like temperature (normalized by dividing by 50), feature
  • hum: humidity (normalized by dividing by 100), feature
  • windspeed: wind speed (normalized by dividing by 67), feature
  • casual: rentals by casual users in this hour, ignored
  • registered: rentals by registered users in this hour, ignored
  • cnt: total rentals in this hour, the prediction target
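
Counting columns from zero, the feature fields above sit at index 2 (season) and indices 4-13 (mnth through windspeed), with cnt in the last column. A quick standalone sketch against the header row confirms the slicing used later:

```python
# Header row of hour.csv, in column order (taken from the field list above).
header = ("instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt").split(",")

# season is column 2; mnth..windspeed are columns 4..13; cnt is the last column.
print(header[2])      # season
print(header[4:14])   # ['mnth', 'hr', ..., 'windspeed'] (10 feature columns)
print(header[-1])     # cnt
```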

2. Load the data in IPython/Jupyter Notebook

Start IPython/Jupyter Notebook from a terminal:

cd ~/pythonwork/ipynotebook
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=local[*] pyspark

In the notebook, run the following to load and read the data:

## Define the data path
global Path
if sc.master[:5] == "local":
    Path = "file:/home/yyf/pythonwork/PythonProject/"
else:
    Path = "hdfs://master:9000/user/yyf/"

## Read hour.csv
print("Importing data...")
rawData = sc.textFile(Path+"data/hour.csv")
header = rawData.first()  # the first line is the header row

## Drop the header row
rData = rawData.filter(lambda x: x != header)

## Take the first 2 records
print(rData.take(2))

## Split each line on commas
lines = rData.map(lambda x: x.split(","))
print("Total: " + str(lines.count()) + " records")

Output: the first two raw records, followed by the total record count (17,379 rows).


Data Preprocessing

1. Process the features

## Process the features
import numpy as np

def convert_float(v):
    """Convert a string value to float."""
    return float(v)

def process_features(line):
    """Extract the feature values from one split CSV line."""
    ## Season feature (column 2)
    SeasonFeature = [convert_float(line[2])]
    ## Remaining features (columns 4-13: mnth through windspeed)
    Features = [convert_float(value) for value in line[4:14]]
    # Return the concatenated feature list
    return SeasonFeature + Features
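
As a quick sanity check, applying the feature processing to the first data row of hour.csv should yield 11 feature values, season first. This standalone sketch repeats the two helpers above so it runs without Spark:

```python
def convert_float(v):
    """Convert a string value to float."""
    return float(v)

def process_features(line):
    """Extract season (column 2) plus columns 4-13 as floats."""
    return [convert_float(line[2])] + [convert_float(v) for v in line[4:14]]

# First data row of hour.csv, already split on commas.
row = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0,3,13,16".split(",")
features = process_features(row)
print(len(features))   # 11 features: season, mnth, hr, ..., windspeed
print(features[:3])    # [1.0, 1.0, 0.0]
```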

2. Process the target value

## Process the prediction target
def process_label(line):
    return float(line[-1])
process_label(lines.first())

3. Build LabeledPoint records

Spark MLlib's regression (and classification) algorithms take data in the LabeledPoint format, which pairs a label with a feature vector.

## Build the LabeledPoint RDD:
from pyspark.mllib.regression import LabeledPoint

labelpointRDD = lines.map(lambda r: LabeledPoint(process_label(r),
                                                 process_features(r)))
labelpointRDD.first()

Output: the first LabeledPoint, with the cnt value as its label and the converted feature columns as its feature vector.

4. Split into training, validation, and test sets

## Split into training, validation, and test sets
(trainData, validationData, testData) = labelpointRDD.randomSplit([7, 1, 2])
print("Training set size: " + str(trainData.count()) +
      ", validation set size: " + str(validationData.count()) +
      ", test set size: " + str(testData.count()))

# Cache the data in memory to speed up subsequent computation
trainData.persist()
validationData.persist()
testData.persist()
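
randomSplit normalizes its weights to fractions, so [7, 1, 2] requests roughly a 70%/10%/20% split. A plain-Python sketch of that normalization:

```python
weights = [7, 1, 2]
# randomSplit divides each weight by the total to get split fractions.
fractions = [w / sum(weights) for w in weights]
print(fractions)  # [0.7, 0.1, 0.2]
```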

Training the Model

We train a model with the trainRegressor method of Spark MLlib's DecisionTree module, which is called as follows:

DecisionTree.trainRegressor(data, categoricalFeaturesInfo, impurity="variance", maxDepth=5, maxBins=32,
minInstancesPerNode=1, minInfoGain=0.0)

The parameters are:

  • (1) data: the training data, as an RDD of LabeledPoint records
  • (2) categoricalFeaturesInfo: information about categorical feature columns; the categorical features in this dataset are already numeric, so an empty dict() is passed
  • (3) impurity: the impurity measure used to choose splits; for regression only "variance" is supported
  • (4) maxDepth: maximum depth of the tree
  • (5) maxBins: maximum number of bins used when splitting each node
  • (6) minInstancesPerNode: minimum number of instances required in each child node
  • (7) minInfoGain: minimum information gain required for a split (between 0 and 1); it should not be set too large, or the tree can hardly split at all

## Train a decision tree regression model
from pyspark.mllib.tree import DecisionTree
model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=5, maxBins=32,
                                    minInstancesPerNode=1, minInfoGain=0.0)

Model Evaluation

We evaluate the model with the root mean squared error (RMSE):

## Evaluate the model with RMSE
import numpy as np

## Define the evaluation function
def RMSE(model, validationData):
    ## Predict on the validation features
    predict = model.predict(validationData.map(lambda p: p.features))
    ## Zip predictions with the actual labels
    predict_real = predict.zip(validationData.map(lambda p: p.label))
    ## Compute the root mean squared error
    rmse = np.sqrt(predict_real.map(lambda p: (p[0]-p[1])**2).sum() / predict_real.count())
    return rmse

## Compute the model's RMSE on the validation set
rmse = RMSE(model, validationData)
print("RMSE=" + str(rmse))

Output: RMSE=117.804043648
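
The same formula on a toy pair of arrays (a standalone NumPy sketch, independent of Spark) shows what the function computes:

```python
import numpy as np

predictions = np.array([3.0, 5.0])
labels = np.array([1.0, 1.0])

# RMSE = sqrt(mean of squared errors): the errors are 2 and 4,
# squared errors 4 and 16, their mean is 10, so RMSE = sqrt(10).
rmse = np.sqrt(((predictions - labels) ** 2).mean())
print(rmse)  # ≈ 3.1623
```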


Model Parameter Selection

The DecisionTree parameters maxDepth, maxBins, minInstancesPerNode, and minInfoGain affect both the model's RMSE and the training time. Below we evaluate different values for each parameter.

First, create a trainEvaluateModel function that trains a model, evaluates it, and times the whole train-and-evaluate run.

## trainEvaluateModel trains and evaluates a model, timing the run
import time

def trainEvaluateModel(trainData, validationData, maxDepthParm, maxBinsParm, minInstancesPerNodeParm, minInfoGainParm):
    startTime = time.time()
    ## Build and train the model
    model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=maxDepthParm,
                                        maxBins=maxBinsParm, minInstancesPerNode=minInstancesPerNodeParm, minInfoGain=minInfoGainParm)
    ## Compute RMSE on the validation set
    rmse = RMSE(model, validationData)
    duration = time.time() - startTime   # elapsed time
    print("Train/evaluate: maxDepth=" + str(maxDepthParm) + ", maxBins=" + str(maxBinsParm) +
          ", minInstancesPerNode=" + str(minInstancesPerNodeParm) + ", minInfoGain=" + str(minInfoGainParm) + "\n"
          "===> elapsed time=" + str(duration) + ", RMSE=" + str(rmse))
    return rmse, duration, maxDepthParm, maxBinsParm, minInstancesPerNodeParm, minInfoGainParm, model

1. Evaluating maxDepth

## Evaluate the maxDepth parameter
maxDepthList = [3, 5, 10, 15, 20, 25]
maxBinsList = [10]
minInstancesPerNodeList = [1]
minInfoGainList = [0.0]

## Collect the results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]

Output:

Train/evaluate: maxDepth=3, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.782449960709, RMSE=134.527910738
Train/evaluate: maxDepth=5, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.586301088333, RMSE=113.391580393
Train/evaluate: maxDepth=10, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.838535070419, RMSE=89.6594076876
Train/evaluate: maxDepth=15, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=1.38911700249, RMSE=96.6207085889
Train/evaluate: maxDepth=20, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=2.21378922462, RMSE=105.020884741
Train/evaluate: maxDepth=25, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=3.13737297058, RMSE=107.091143371

In this case, RMSE is large when the maximum depth maxDepth is either too small or too large, and training time grows with maxDepth.
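
Reading the best depth off the RMSE values reported above (numbers copied from the output) can be done directly:

```python
# (maxDepth, validation RMSE) pairs from the run above.
results = {3: 134.527910738, 5: 113.391580393, 10: 89.6594076876,
           15: 96.6207085889, 20: 105.020884741, 25: 107.091143371}

# The depth with the smallest validation RMSE.
best_depth = min(results, key=results.get)
print(best_depth, results[best_depth])  # 10 89.6594076876
```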

2. Evaluating maxBins

## Evaluate the maxBins parameter
maxDepthList = [10]
maxBinsList = [5, 10, 15, 100, 200, 500]
minInstancesPerNodeList = [1]
minInfoGainList = [0.0]

## Collect the results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]

Output:

Train/evaluate: maxDepth=10, maxBins=5, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.910573959351, RMSE=124.584498251
Train/evaluate: maxDepth=10, maxBins=10, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.759560108185, RMSE=89.6594076876
Train/evaluate: maxDepth=10, maxBins=15, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.758432865143, RMSE=91.626783826
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.798367023468, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=200, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.819510936737, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=500, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.823748111725, RMSE=80.187592618

RMSE broadly decreases as maxBins grows, and beyond a certain point (here, maxBins=100) it stops changing.

3. Evaluating minInstancesPerNode

## Evaluate the minInstancesPerNode parameter
maxDepthList = [10]
maxBinsList = [100]
minInstancesPerNodeList = [1, 3, 5, 10, 20, 50]
minInfoGainList = [0.0]

## Collect the results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]

Output:

Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=1, minInfoGain=0.0
===> elapsed time=0.953587055206, RMSE=80.187592618
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=3, minInfoGain=0.0
===> elapsed time=0.811645030975, RMSE=77.0894951256
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.0
===> elapsed time=0.798023939133, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=10, minInfoGain=0.0
===> elapsed time=0.776537179947, RMSE=77.276983917
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=20, minInfoGain=0.0
===> elapsed time=0.737972974777, RMSE=78.4232321135
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=50, minInfoGain=0.0
===> elapsed time=0.703710079193, RMSE=81.9538443152

RMSE increases slightly when minInstancesPerNode is either too small or too large, but the effect is modest.

4. Evaluating minInfoGain

## Evaluate the minInfoGain parameter
maxDepthList = [10]
maxBinsList = [100]
minInstancesPerNodeList = [5]
minInfoGainList = [0.0, 0.1, 0.3, 0.5, 0.8]

## Collect the results in metrics
metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
           for maxDepth in maxDepthList
           for maxBins in maxBinsList
           for minInstancesPerNode in minInstancesPerNodeList
           for minInfoGain in minInfoGainList]

Output:

Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.0
===> elapsed time=1.01054096222, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.1
===> elapsed time=0.769320011139, RMSE=76.7551686019
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.3
===> elapsed time=0.804311990738, RMSE=76.7551434805
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.5
===> elapsed time=0.77669095993, RMSE=76.7553417561
Train/evaluate: maxDepth=10, maxBins=100, minInstancesPerNode=5, minInfoGain=0.8
===> elapsed time=0.762629985809, RMSE=76.7553265736

minInfoGain has almost no effect on this model.

5. Grid search for the best parameter combination

## gridSearch searches the parameter grid for the best combination

def gridSearch(trainData, validationData, maxDepthList, maxBinsList, minInstancesPerNodeList, minInfoGainList):
    metrics = [trainEvaluateModel(trainData, validationData, maxDepth, maxBins, minInstancesPerNode, minInfoGain)
               for maxDepth in maxDepthList
               for maxBins in maxBinsList
               for minInstancesPerNode in minInstancesPerNodeList
               for minInfoGain in minInfoGainList]
    # Sort by RMSE ascending and return the combination with the smallest RMSE
    sorted_metrics = sorted(metrics, key=lambda k: k[0], reverse=False)
    best_parameters = sorted_metrics[0]
    print("Best parameter combination: maxDepth=" + str(best_parameters[2]) +
          ", maxBins=" + str(best_parameters[3]) + ", minInstancesPerNode=" + str(best_parameters[4]) +
          ", minInfoGain=" + str(best_parameters[5]) + "\n" +
          ", RMSE=" + str(best_parameters[0]))
    return best_parameters

## Parameter grid
maxDepthList = [3, 5, 10, 20, 25]
maxBinsList = [30, 50, 100, 200]
minInstancesPerNodeList = [1, 3, 5, 10, 20]
minInfoGainList = [0.0, 0.3, 0.5]

## Run the grid search and get the best parameter combination
best_parameters = gridSearch(trainData, validationData, maxDepthList, maxBinsList, minInstancesPerNodeList, minInfoGainList)

Output:

Best parameter combination: maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3
, RMSE=76.4217911844
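
The nested comprehensions enumerate the full Cartesian product of the parameter lists; with the grid above that is 5 × 4 × 5 × 3 = 300 train-and-evaluate runs, which itertools.product makes explicit:

```python
from itertools import product

maxDepthList = [3, 5, 10, 20, 25]
maxBinsList = [30, 50, 100, 200]
minInstancesPerNodeList = [1, 3, 5, 10, 20]
minInfoGainList = [0.0, 0.3, 0.5]

# Every combination the grid search will train and evaluate.
grid = list(product(maxDepthList, maxBinsList, minInstancesPerNodeList, minInfoGainList))
print(len(grid))  # 300 parameter combinations
print(grid[0])    # (3, 30, 1, 0.0)
```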


Checking for Overfitting

The grid search found the best parameter combination maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3 together with its RMSE on the validation set. We now retrain with these parameters and compute RMSE on both the training data and the test data to check for overfitting:

## Train with the best parameters: maxDepth=25, maxBins=30, minInstancesPerNode=10, minInfoGain=0.3
best_model = DecisionTree.trainRegressor(trainData, categoricalFeaturesInfo={}, impurity="variance", maxDepth=25,
                                         maxBins=30, minInstancesPerNode=10, minInfoGain=0.3)
rmse1 = RMSE(best_model, trainData)
rmse2 = RMSE(best_model, testData)
print("training: RMSE=" + str(rmse1))
print("testing: RMSE=" + str(rmse2))

Output:

training: RMSE=62.3712304863
testing: RMSE=79.1479978784

The training RMSE (62.37) is lower than the test RMSE (79.15), as expected. The gap is moderate, and the test RMSE is close to the validation RMSE (76.42) found during the grid search, so the model does not appear to be severely overfit.
