暴走的鹏鹏哥哥

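[Python and Machine Learning] Analyzing Bike-Sharing Deployment Volume with Python 3

Predicting bike-sharing demand

Dataset: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Field information

Both hour.csv and day.csv contain the following fields, except that day.csv has no hr column.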

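instant: record index
dteday: date
season: season (1 - spring; 2 - summer; 3 - autumn; 4 - winter)
yr: year (0 - 2011; 1 - 2012)
mnth: month (1~12)
hr: hour (0~23)
holiday: whether the day is a holiday (taken from a US government website)
weekday: day of the week (0~6, 0 is Sunday)
workingday: whether the day is a working day
weathersit:
- 1: clear, few clouds, partly cloudy
- 2: mist + cloudy, mist + few clouds, mist
- 3: light snow, light rain + thunderstorm, light rain
- 4: heavy rain, ice pellets, thick fog, heavy snow
temp: normalized temperature in Celsius
atemp: normalized feeling temperature
hum: normalized humidity
windspeed: normalized wind speed
casual: number of rentals by casual (unregistered) users
registered: number of rentals by registered users
cnt: total number of rentals

As the list shows, weathersit is a categorical code, so it should be one-hot encoded.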
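As a minimal sketch of what one-hot encoding does to a field like weathersit, the snippet below applies pandas.get_dummies to a tiny hand-made sample. The four values are made up for illustration and are not taken from hour.csv; the post itself uses scikit-learn encoders later on.

```python
import pandas as pd

# Toy sample with made-up weathersit codes (illustrative only, not real data).
sample = pd.DataFrame({"weathersit": [1, 2, 1, 3]})

# Each distinct code becomes its own 0/1 column, so a model does not
# treat the codes as an ordered quantity (e.g. "3 is three times 1").
encoded = pd.get_dummies(sample["weathersit"], prefix="weathersit", dtype=int)
print(encoded)
```

Every row has exactly one column set to 1, which is where the name "one-hot" comes from.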

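Defining the problem

Bike-sharing companies have been chasing investment money while management remains poorly regulated, and the public ends up footing the bill. Against that background, let's analyze the demand for shared bikes.

Analyzing the dataset

1. Data preprocessing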

In [11]:

import numpy as np
import pandas as pd

import seaborn as sn
import matplotlib.pyplot as plt
%matplotlib inline

# Plot settings
params = {
    'legend.fontsize': 'x-large',
    'figure.figsize': (50, 25), 
    'axes.labelsize': 'x-large',
    'axes.titlesize':'x-large',
    'xtick.labelsize':'x-large',
    'ytick.labelsize':'x-large',
    'font.sans-serif':'SimHei',     # font able to render Chinese text
    'axes.unicode_minus':False
}

sn.set_style('whitegrid')
sn.set_context('talk')

plt.rcParams.update(params)         # apply the settings above
pd.options.display.max_colwidth = 600

In [2]:

hour_df = pd.read_csv('./data/hour.csv')
print('Dataset shape: {}'.format(hour_df.shape))

Dataset shape: (17379, 17)

In [3]:

# Look at the first few rows of the dataset
hour_df.head()

Out[3]:

	instant	dteday	season	mnth	hr	weekday	weathersit	temp	atemp	hum	casual	registered	cnt
0	1	2011-01-01	1	1	0	6	1	0.24	0.2879	0.81	3	13	16
1	2	2011-01-01	1	1	1	6	1	0.22	0.2727	0.80	8	32	40
2	3	2011-01-01	1	1	2	6	1	0.22	0.2727	0.80	5	27	32
3	4	2011-01-01	1	1	3	6	1	0.24	0.2879	0.75	3	10	13
4	5	2011-01-01	1	1	4	6	1	0.24	0.2879	0.75	0	1	1

In [4]:

# Check for missing values
hour_df.info()


RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
instant       17379 non-null int64
dteday        17379 non-null object
season        17379 non-null int64
yr            17379 non-null int64
mnth          17379 non-null int64
hr            17379 non-null int64
holiday       17379 non-null int64
weekday       17379 non-null int64
workingday    17379 non-null int64
weathersit    17379 non-null int64
temp          17379 non-null float64
atemp         17379 non-null float64
hum           17379 non-null float64
windspeed     17379 non-null float64
casual        17379 non-null int64
registered    17379 non-null int64
cnt           17379 non-null int64
dtypes: float64(4), int64(12), object(1)
memory usage: 2.2+ MB

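The dteday attribute needs a type conversion: object (str) -> timestamp.
Attributes such as season, holiday and weekday were read by pandas as integers; we need to convert them to categorical data, which is easier to interpret.

Before converting the types, we rename the columns (the header) to more readable names: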

In [5]:

hour_df.rename(columns={'instant':'rec_id',
                      'dteday':'datetime',
                      'holiday':'is_holiday',
                      'workingday':'is_workingday',
                      'weathersit':'weather_condition',
                      'hum':'humidity',
                      'mnth':'month',
                      'cnt':'total_count',
                      'hr':'hour',
                      'yr':'year'},
                      # inplace=True: modify the original object directly, without creating a new one;
                      # inplace=False: apply the change to a new object and return it.
                      inplace=True)

In [6]:

hour_df.head()

Out[6]:

	rec_id	datetime	season	month	hour	weekday	weather_condition	temp	atemp	humidity	casual	registered	total_count
0	1	2011-01-01	1	1	0	6	1	0.24	0.2879	0.81	3	13	16
1	2	2011-01-01	1	1	1	6	1	0.22	0.2727	0.80	8	32	40
2	3	2011-01-01	1	1	2	6	1	0.22	0.2727	0.80	5	27	32
3	4	2011-01-01	1	1	3	6	1	0.24	0.2879	0.75	3	10	13
4	5	2011-01-01	1	1	4	6	1	0.24	0.2879	0.75	0	1	1

Type conversion

In [7]:

# Convert the date column to a datetime type
hour_df['datetime'] = pd.to_datetime(hour_df.datetime)

# Convert the categorical attributes to the pandas category dtype
hour_df['season'] = hour_df.season.astype('category')
hour_df['is_holiday'] = hour_df.is_holiday.astype('category')
hour_df['weekday'] = hour_df.weekday.astype('category')
hour_df['weather_condition'] = hour_df.weather_condition.astype('category')
hour_df['is_workingday'] = hour_df.is_workingday.astype('category')
hour_df['month'] = hour_df.month.astype('category')
hour_df['year'] = hour_df.year.astype('category')
hour_df['hour'] = hour_df.hour.astype('category')

In [8]:

hour_df.info()


RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
rec_id               17379 non-null int64
datetime             17379 non-null datetime64[ns]
season               17379 non-null category
year                 17379 non-null category
month                17379 non-null category
hour                 17379 non-null category
is_holiday           17379 non-null category
weekday              17379 non-null category
is_workingday        17379 non-null category
weather_condition    17379 non-null category
temp                 17379 non-null float64
atemp                17379 non-null float64
humidity             17379 non-null float64
windspeed            17379 non-null float64
casual               17379 non-null int64
registered           17379 non-null int64
total_count          17379 non-null int64
dtypes: category(8), datetime64[ns](1), float64(4), int64(4)
memory usage: 1.3 MB

In [9]:

hour_df.describe()

Out[9]:

	rec_id	temp	atemp	humidity	windspeed	casual	registered	total_count
count	17379.0000	17379.000000	17379.000000	17379.000000	17379.000000	17379.000000	17379.000000	17379.000000
mean	8690.0000	0.496987	0.475775	0.627229	0.190098	35.676218	153.786869	189.463088
std	5017.0295	0.192556	0.171850	0.192930	0.122340	49.305030	151.357286	181.387599
min	1.0000	0.020000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
25%	4345.5000	0.340000	0.333300	0.480000	0.104500	4.000000	34.000000	40.000000
50%	8690.0000	0.500000	0.484800	0.630000	0.194000	17.000000	115.000000	142.000000
75%	13034.5000	0.660000	0.621200	0.780000	0.253700	48.000000	220.000000	281.000000
max	17379.0000	1.000000	1.000000	1.000000	0.850700	367.000000	886.000000	977.000000

2. Data distribution and trends

Hourly trend

In [12]:

fig, ax = plt.subplots()
sn.pointplot(data=hour_df[['hour',
                          'total_count',
                          'season']],
                          x='hour',
                          y='total_count',
                          hue='season',
                          ax=ax)
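ax.set(title='Hourly demand by season')
plt.xlabel('Hour')
plt.ylabel('Total demand')

Out[12]:

Text(0,0.5,'Total demand')

Findings:

The overall trend is similar across seasons
Demand peaks at 7-9 a.m. and 4-6 p.m. (commuting hours)
Demand is lowest in spring and highest in autumn

Logically, the demand distribution on working days should differ from that on weekends:

Hourly trend within a week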

In [13]:

fig, ax = plt.subplots()
sn.pointplot(data=hour_df[['hour','total_count','weekday']],
            x='hour',y='total_count',hue='weekday',ax=ax)
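ax.set(title='Hourly demand by day of the week')

Out[13]:

[Text(0.5,1,'Hourly demand by day of the week')]

Findings:

On weekends demand is higher in the afternoon; on working days it is higher in the morning and evening
The working-day demand curve is close to the overall curve
Overall demand is higher on working days than on weekends
Splitting by registered and casual users might reveal further interesting patterns

Now let's look at demand by month: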

In [14]:

fig,ax = plt.subplots()
sn.barplot(data=hour_df[['month',
                         'total_count']],
           x="month",y="total_count")
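ax.set(title="Monthly demand")

Out[14]:

[Text(0.5,1,'Monthly demand')]

Findings:
June to September appears to be the period of highest demand. These months largely coincide with the season labelled autumn (season 3) in this dataset, when the weather is pleasant for cycling.
This agrees with the earlier finding that demand is highest in autumn.

Yearly data. 0 stands for 2011 and 1 for 2012.
Here we use a violin plot to show several aspects of the data at once: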

In [15]:

sn.violinplot(data=hour_df[['year',
                            'total_count']],
             x='year', y='total_count')

Out[15]:

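Findings:

Both years show multimodal distributions
Overall demand in 2011 is lower than in 2012, and so is the median
The maximum demand is higher in 2012

Effect of holidays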

In [16]:

fig,(ax1,ax2) = plt.subplots(ncols=2)
sn.barplot(data=hour_df,x='is_holiday',y='total_count',hue='season',ax=ax1)
sn.barplot(data=hour_df,x='is_workingday',y='total_count',hue='season',ax=ax2)

Out[16]:

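Findings:

Demand is more stable on working days
Mean demand in summer and autumn is similar; in spring and winter, however, demand on working days is clearly higher than on holidays

Outliers

While exploring and getting to know the data, we should check whether the dataset contains values that do not look normal.
Outliers can easily affect the later steps.
We usually use a box plot to check for outliers in the data.
Next, let's look at numeric features such as total_count, temp and windspeed.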

In [17]:

fig,(ax1,ax2)= plt.subplots(ncols=2)
sn.boxplot(data=hour_df[['total_count',
                         'casual','registered']],ax=ax1)
sn.boxplot(data=hour_df[['temp','windspeed']],ax=ax2)

Out[17]:

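The two plots above clearly show that:

Total demand, casual users and registered users all have a considerable number of outliers
Of the weather variables plotted, only windspeed has some outliers

How a box plot defines an outlier: by the usual convention, a point is flagged when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

This is only the rule a box plot uses to flag outliers; the flagged points are not necessarily true anomalies.
For example, our data covers two years, spanning four seasons and 24 months. Wind speed is usually low throughout the year, but some winter days can be very windy. A box plot marks those readings as outliers even though they may be perfectly valid values that simply differ markedly from most of the data.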
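To make the box-plot rule concrete, here is a small sketch of the common 1.5 × IQR convention. The helper name iqr_outlier_mask, its whis parameter and the wind speed values are all made up for illustration, and a plotting library may compute its quartiles slightly differently.

```python
import numpy as np

def iqr_outlier_mask(values, whis=1.5):
    """Flag points outside [Q1 - whis*IQR, Q3 + whis*IQR] (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)

# Made-up normalized wind speeds: mostly calm, with one gusty reading.
windspeed = [0.10, 0.12, 0.15, 0.19, 0.20, 0.22, 0.25, 0.85]
mask = iqr_outlier_mask(windspeed)
print(mask)
```

Only the last reading is flagged here, yet a gusty day is a perfectly plausible measurement, which is exactly why flagged points should be inspected rather than dropped automatically.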

We can also check for outliers along other dimensions:

In [18]:

fig,ax = plt.subplots()
sn.boxplot(data=hour_df[['hour','total_count']],x="hour",y="total_count",ax=ax)
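ax.set(title=u"Box plot of hourly demand")

Out[18]:

[Text(0.5,1,'Box plot of hourly demand')]

Findings:

Demand is low during hours 0-4 and 21-23, with obvious outliers
The early afternoon hours also show obvious outliers (an effect of working days versus weekends)
During the commuting peaks the median demand is high and there are hardly any outliers (demand is very stable in these periods)

Correlation

Correlation helps us understand the pairwise linear relationships between the features in the data.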

In [19]:

corrMatt = hour_df[["temp","atemp",
                    "humidity","windspeed",
                    "casual","registered",
                    "total_count"]].corr()
# Boolean mask that hides the cells above the diagonal of the heatmap
mask = np.triu(np.ones(corrMatt.shape, dtype=bool), k=1)
sn.heatmap(corrMatt, mask=mask,
           vmax=.8, square=True,annot=True)

Out[19]:

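Findings:

temp and atemp are very highly correlated (as expected)
Likewise, total_count is highly correlated with both casual and registered
Wind speed (windspeed) and humidity are negatively correlated
Overall, the linear correlation between the features and the target variable is not very high

Regression

At a very abstract level, regression means estimating a continuous target value. Its counterpart is classification, where the target is usually discrete.
House price prediction is a classic regression example. Another introductory example is height and weight.
The height-weight example says that a person's weight tends to increase with their height. So, given enough training samples, we can estimate a person's weight from their height.
In essence, regression models the relationship between the features and the target variable. Still, it is worth stressing: correlation does not imply causation!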

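Types of regression

We covered several regression models in the theory lectures. Any of them can be described with the following notation:

$$\hat{y} = h_\theta(x)$$

We specify the form of the function h, while its parameters θ have to be learned from the data.
In other words, we fix the rough relationship between the features and the target variable by choosing the form of h, and the model learns the specifics of that relationship from the data.

The theory lectures focused on two kinds of regression model:

Linear regression: assumes a simple linear relationship between the features and the target variable. The regression line is then literally a straight line (a hyperplane in higher dimensions)
Non-linear regression: assumes a linear relationship between the target variable and polynomial terms of the features, which is also known as polynomial regression. The non-linear family also includes other models, such as decision tree regression and random forest regression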
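The difference between the two model families can be sketched on synthetic data. The example below is only an illustration under made-up assumptions (a noise-free quadratic target and degree-2 polynomial features); it is not part of the bike-sharing analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: y depends on x quadratically, with no noise.
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 2.0 * x.ravel() ** 2 + 1.0

# A straight line cannot follow the curve ...
linear_model = LinearRegression().fit(x, y)
# ... but a linear model on polynomial terms (1, x, x^2) can.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

linear_r2 = linear_model.score(x, y)
poly_r2 = poly_model.score(x, y)
print(linear_r2, poly_r2)
```

Polynomial regression is still linear in its parameters; only the features are transformed, which is why the same LinearRegression estimator can be reused.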

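Assumptions

Using a regression model usually involves some important implicit assumptions:

The training set is representative of what is being modeled
The features are linearly independent of one another
The errors are homoscedastic (they have constant variance)

Evaluation

The theory lectures discussed metrics such as MSE (mean squared error) and RMSE (root mean squared error), but other evaluation metrics exist as well.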
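As a quick worked example of these metrics, the sketch below computes MSE, RMSE and MAE for four made-up predictions using scikit-learn's metric functions. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual values and predictions (errors: -2, 2, -3, 5).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 35.0])

mse = mean_squared_error(y_true, y_hat)   # mean of the squared errors
rmse = np.sqrt(mse)                       # back in the units of the target
mae = mean_absolute_error(y_true, y_hat)  # mean of the absolute errors
print(mse, rmse, mae)
```

Because the errors are squared before averaging, MSE and RMSE weight the single large error (5) more heavily than MAE does.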

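Residual analysis

Regression is essentially an estimate of the target variable, computed from the feature variables with a regression equation.
Because the output is an estimate, it usually differs somewhat from the actual value. We call this difference the residual:

$$e_i = y_i - \hat{y}_i$$

If a regression model fits the whole dataset well, its residuals should look random and show no particular pattern.
We can usually check this with a scatter plot of predicted values against residuals.

$R^2$

Like the residuals, $R^2$ measures how well the regression fits. It measures how much of the variance of the target variable the model explains:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \quad SS_{res} = \sum_i (y_i - \hat{y}_i)^2, \quad SS_{tot} = \sum_i (y_i - \bar{y})^2$$

A simple way to understand explained variance:
When we want to predict a target variable Y, the most naive regression model is a constant model, $\hat{y} = \bar{y}$. In that case the model's mean squared error ($SS_{res}/n$) equals the variance of the target variable ($SS_{tot}/n$), so $R^2$ is 0, meaning the model has no predictive power for the target.
If the model is very good and its predictions match the actual values exactly, then its squared error ($SS_{res}$) is 0 and $R^2 = 1$, which indicates that the regression model fits the data very well.
So $R^2$ is at most 1, and the larger it is, the better the fit.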
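The two boundary cases can be checked numerically. The sketch below assumes the usual definition R² = 1 − SS_res / SS_tot; the helper name r_squared and the data are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_constant = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
r2_perfect = r_squared(y_true, y_true)                      # predictions match exactly
r2_reversed = r_squared(y_true, y_true[::-1])               # worse than the constant model
print(r2_constant, r2_perfect, r2_reversed)
```

The third case shows that R² can drop below 0 when a model does worse than simply predicting the mean.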

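Cross-validation

Cross-validation is always necessary. Especially when the training set is not huge, model selection should be done with cross-validation.
It also helps guard against problems such as overfitting.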
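As a rough sketch of what k-fold cross-validation does, the loop below splits synthetic data into 5 folds by hand and compares the result with scikit-learn's cross_val_score helper. The data, the coefficients and the names X_demo and y_demo are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 3)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on 4 folds, score (R^2) on the held-out fold.
manual_scores = []
for train_idx, valid_idx in kfold.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[valid_idx], y_demo[valid_idx]))

# The helper performs the same procedure in one call.
helper_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold)
print(np.mean(manual_scores), helper_scores.mean())
```

Every sample is used for validation exactly once, so the averaged score is a less optimistic estimate than a score computed on the training data itself.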

Preparing the data for machine learning

Remember that we need one-hot encoding

In [20]:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
def fit_transform_ohe(df, col_name):
    """
    One-hot encode the specified column.
    
    Args:
        df(pandas.DataFrame): dataframe containing the target data
        col_name: name of the column to one-hot encode
    Returns:
        tuple: label_encoder, one_hot_encoder, transformed column as pandas DataFrame
    """
    # First convert the values to integer labels
    le = LabelEncoder()
    le_labels = le.fit_transform(df[col_name])
    df[col_name+'_label'] = le_labels
    
    # Then convert the integer labels to a one-hot encoding
    ohe = OneHotEncoder()
    feature_attr = ohe.fit_transform(df[[col_name+'_label']]).toarray()
    feature_labels = [col_name+'_'+str(cls_label) for cls_label in le.classes_]
    features_df = pd.DataFrame(feature_attr, columns=feature_labels)
    return le, ohe, features_df

def transform_ohe(df,le,ohe,col_name):
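    """One-hot encode the given column with already fitted encoders.

    Args:
        df(pandas.DataFrame): dataframe containing the target data
        le(Label Encoder): fitted label encoder
        ohe(One Hot Encoder): fitted one-hot encoder
        col_name: name of the column to one-hot encode

    Returns:
        pandas.DataFrame: the one-hot encoded column

    """
    # First convert the values to integer labels
    col_labels = le.transform(df[col_name])
    df[col_name+'_label'] = col_labels
    
    # Then convert the integer labels to a one-hot encoding.
    # Use transform, not fit_transform, so the encoder fitted on the training set is reused.
    feature_arr = ohe.transform(df[[col_name+'_label']]).toarray()
    feature_labels = [col_name+'_'+str(cls_label) for cls_label in le.classes_]
    features_df = pd.DataFrame(feature_arr, columns=feature_labels)
    
    return features_df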

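Test set
Ideally a test set should be set aside before exploring the data at all, and only used when reporting the model's performance.
So far we have not prepared one; all the work up to this point was done on the whole dataset.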

In [21]:

from sklearn.model_selection import train_test_split
X, X_test, y, y_test = train_test_split(hour_df.iloc[:,0:-3], # the last column is the target; the two before it (casual, registered) are not usable features
                                        hour_df.iloc[:,-1],
                                        test_size=0.33,
                                        random_state=42)

X.reset_index(inplace=True)
y = y.reset_index()

X_test.reset_index(inplace=True)
y_test = y_test.reset_index()

In [22]:

cat_attr_list = ['season','is_holiday',
                 'weather_condition','is_workingday',
                 'hour','weekday','month','year']
numeric_feature_cols = ['temp','humidity','windspeed','hour','weekday','month','year']
subset_cat_features =  ['season','is_holiday','weather_condition','is_workingday']

In [23]:

encoded_attr_list = []
for col in cat_attr_list:
    return_obj = fit_transform_ohe(X,col)
    encoded_attr_list.append({'label_enc':return_obj[0],
                              'ohe_enc':return_obj[1],
                              'feature_df':return_obj[2],
                              'col_name':col})

In [24]:

# Put the features to be used together
feature_df_list = [X[numeric_feature_cols]]
feature_df_list.extend([enc['feature_df'] \
                            for enc in encoded_attr_list \
                                if enc['col_name'] in subset_cat_features])
train_df_new = pd.concat(feature_df_list, axis=1)
print("Shape:{}".format(train_df_new.shape))

Shape:(11643, 19)

In [25]:

train_df_new.head()

Out[25]:

	temp	humidity	windspeed	hour	weekday	month	year	season_1	season_2	season_3	is_holiday_0	weather_condition_1	is_workingday_0	is_workingday_1
0	0.64	0.65	0.1940	0	5	9	0	0.0	0.0	1.0	1.0	1.0	0.0	1.0
1	0.50	0.45	0.2239	13	2	3	0	0.0	1.0	0.0	1.0	1.0	0.0	1.0
2	0.86	0.47	0.5224	12	0	8	1	0.0	0.0	1.0	1.0	1.0	1.0	0.0
3	0.30	0.61	0.0000	2	3	2	1	1.0	0.0	0.0	1.0	1.0	0.0	1.0
4	0.54	0.19	0.4179	17	6	4	1	0.0	1.0	0.0	1.0	1.0	1.0	0.0

Building machine learning models

Linear regression

In [26]:

X = train_df_new
y = y.total_count.values.reshape(-1,1)

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

In [27]:

from sklearn.model_selection import cross_val_predict
predicted = cross_val_predict(lin_reg, X, y, cv=10)

In [28]:

fig, ax = plt.subplots()
ax.scatter(y, y-predicted)
ax.axhline(lw=2,color='black')
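ax.set_xlabel(u'Actual value')
ax.set_ylabel(u'Residual')

Out[28]:

Text(0,0.5,'Residual')

Clearly, the residuals here are not random with respect to the actual values: the larger the actual value, the larger the residual, which points to some underlying pattern the model has not captured.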

In [29]:

from sklearn.model_selection import cross_val_score

r2_scores = cross_val_score(lin_reg, X, y, cv=10) # uses R^2 by default
mse_scores = cross_val_score(lin_reg, X, y, cv=10,scoring='neg_mean_squared_error') # utility function: negative mean squared error

In [30]:

fig, ax = plt.subplots()
ax.plot([i for i in range(len(r2_scores))],r2_scores,lw=2)
ax.set_xlabel('Iteration')
ax.set_ylabel('R-Squared')
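ax.title.set_text("Cross-validation scores, Avg:{}".format(np.average(r2_scores)))

The mean score over the 10-fold cross-validation is only 39%, which means the model explains only 39% of the variance of the target variable.
This suggests that the current model does not fit the data well.

We can also use the root mean squared error (RMSE), the metric from the theory lectures, to see the underfitting: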

In [31]:

print("R-squared:{}".format(r2_scores.mean()))
print("RMSE::{}".format(np.sqrt(-mse_scores).mean()))
print("y mean:", y.mean())

R-squared:0.39423906942549125
RMSE::142.08580203044002
y mean: 191.21875805204843

In [32]:

lin_reg.fit(X,y) # cross_val_score works on clones, so lin_reg itself has not been fitted yet

Out[32]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

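Model evaluation metrics
Even for a single model, we can use different evaluation methods.
For regression we usually use MSE and RMSE, but R² and MAE can be used too. They are computed differently, yet all of them will reveal underfitting.

scikit-learn has several built-in evaluation metrics for regression to choose from: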

explained_variance
neg_mean_absolute_error
neg_mean_squared_error
neg_mean_squared_log_error
neg_median_absolute_error
r2

For details, see:
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
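Any of the scorer names listed above can be passed through the scoring parameter. The sketch below, on synthetic data made up for illustration, requests two of them and flips the sign back, since scikit-learn reports errors as negative values so that "larger is better" holds for every scorer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative only).
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 2)
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=200)

model = LinearRegression()
neg_mae = cross_val_score(model, X_demo, y_demo, cv=5,
                          scoring="neg_mean_absolute_error")
neg_mse = cross_val_score(model, X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")

mae_per_fold = -neg_mae            # undo the sign flip
rmse_per_fold = np.sqrt(-neg_mse)  # RMSE from the negative MSE
print(mae_per_fold.mean(), rmse_per_fold.mean())
```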

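Testing

The linear regression model clearly fits the bike-sharing dataset poorly.
It can still serve as a baseline for evaluating later models, though. Any model that performs worse than this simple linear regression would certainly be discarded.
So we also need an unbiased evaluation of this baseline, which means running the model on the test set to obtain a fair metric.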

In [33]:

test_encoded_attr_list = []
for enc in encoded_attr_list:
    col_name = enc['col_name']
    le = enc['label_enc']
    ohe = enc['ohe_enc']
    test_encoded_attr_list.append({'feature_df':transform_ohe(X_test,
                                                              le,ohe,
                                                              col_name),
                                   'col_name':col_name})
    
    
test_feature_df_list = [X_test[numeric_feature_cols]]
test_feature_df_list.extend([enc['feature_df'] \
                             for enc in test_encoded_attr_list \
                             if enc['col_name'] in subset_cat_features])

test_df_new = pd.concat(test_feature_df_list, axis=1) 
print("Shape::{}".format(test_df_new.shape))

Shape::(5736, 19)

In [34]:

test_df_new.head()

Out[34]:

	temp	humidity	windspeed	hour	weekday	month	year	season_1	season_2	season_3	season_4	is_holiday_0	is_holiday_1	weather_condition_1	is_workingday_0	is_workingday_1
0	0.80	0.27	0.1940	19	6	6	1	0.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0	0.0
1	0.24	0.41	0.2239	20	1	1	1	1.0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	0.0
2	0.32	0.66	0.2836	2	5	10	0	0.0	0.0	0.0	1.0	1.0	0.0	1.0	0.0	1.0
3	0.78	0.52	0.3582	19	2	5	1	0.0	1.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
4	0.26	0.56	0.3881	0	4	1	0	1.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0

In [35]:

X_test = test_df_new
y_test = y_test.total_count.values.reshape(-1,1)

y_pred = lin_reg.predict(X_test)
residuals = y_test-y_pred

In [36]:

from sklearn.metrics import mean_squared_error

r2_score = lin_reg.score(X_test,y_test)
print("R-squared::{}".format(r2_score))
print("RMSE: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))

R-squared::0.4024409682673428
RMSE: 138.07

In [37]:

fig, ax = plt.subplots()
ax.scatter(y_test, residuals)
ax.axhline(lw=2,color='black')
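ax.set_xlabel(u'Actual value')
ax.set_ylabel(u'Residual')
ax.title.set_text(u"$R^2$={}".format(np.average(r2_score)))

Note that we plotted the residuals here as well. The residual plot drawn earlier on the training set looks much like this one, but the two serve different purposes:

The residual plot on the training set helps us improve the model
The residual plot on the test set is for reporting the performance of the final model

Exercises

Try wrapping the data processing steps in a Pipeline so the code is easier to reuse
Try combining some new features to see whether they help the model perform better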
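As one possible starting point for the Pipeline exercise, the sketch below chains a ColumnTransformer with a linear model. It is only an outline under stated assumptions: the frame demo_df is tiny made-up data that reuses some of the post's column names, and the choice of columns is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame reusing the post's column names (not real data).
demo_df = pd.DataFrame({
    "season": [1, 2, 3, 4, 1, 2, 3, 4],
    "weather_condition": [1, 1, 2, 3, 2, 1, 1, 2],
    "temp": [0.2, 0.5, 0.8, 0.4, 0.3, 0.6, 0.7, 0.3],
    "humidity": [0.8, 0.6, 0.5, 0.7, 0.9, 0.5, 0.4, 0.6],
    "total_count": [20, 150, 300, 120, 30, 180, 280, 100],
})
categorical_cols = ["season", "weather_condition"]
numeric_cols = ["temp", "humidity"]
feature_cols = categorical_cols + numeric_cols

# One-hot encode the categorical columns, pass the numeric ones through.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("numeric", "passthrough", numeric_cols),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns the encoding and the model together; predict() reuses both.
pipeline.fit(demo_df[feature_cols], demo_df["total_count"])
demo_pred = pipeline.predict(demo_df[feature_cols].head(3))
print(demo_pred)
```

Because the encoder lives inside the pipeline, it is fitted only on the data passed to fit, which avoids accidentally refitting it on the test set.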

Food for thought

Think carefully about the R² metric. How does using it differ from using RMSE? Which one would you prefer?

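Decision tree regression

A decision tree is a supervised learning algorithm that can be used for both regression and classification. It is simple in itself, yet very powerful at modeling non-linear relationships.
Its defining trait is that it is built from simple decision rules (much like if-else). This lets us draw its decision process as a tree, which makes the model more interpretable.

Let's use an example to explain the basic concepts and logic of a decision tree model.
Suppose we have data about cars from different manufacturers. Each sample has roughly these features:

fuel_capacity: fuel tank capacity
engine_capacity: engine capacity
price: price
year_of_purchase: year of purchase
miles_driven: miles driven
mileage: miles per gallon (target variable)

We then need a model that predicts a car's miles per gallon (its fuel efficiency).

Starting from the root node, a decision tree splits the dataset into two or more disjoint subsets, each of which is a child of the root.
The child nodes are formed by splitting on some feature. Each child is then split further on some feature, until a target value is reached. This description may be hard to follow, but a picture helps:

This figure shows one possible decision tree for our example.

Node splitting

Decision trees are built top-down, so node splitting is their most important concept. Most decision tree algorithms use a greedy approach to split the data into subsets.
The idea is to split into subsets based on a feature, with a cost function that measures how good a split is. Each split is then chosen to have the lowest cost at that step, i.e. a greedy split. Classification and regression use different cost functions, though. Some commonly used measures are: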

MSE
MAE
Gini Index
Information Gain
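To illustrate a single greedy split for regression, the sketch below scans the candidate thresholds of one feature and keeps the one with the lowest summed squared error. The helper name best_split and the hour and demand values are made up; real implementations repeat this search over every feature and every node.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, cost) with the lowest summed squared error (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_cost = None, np.inf
    unique_x = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for threshold in (unique_x[:-1] + unique_x[1:]) / 2.0:
        left, right = y[x <= threshold], y[x > threshold]
        # Each side predicts its own mean; the cost is the squared error left over.
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Made-up data: low demand at night, high demand during the morning peak.
hours = [1, 2, 3, 8, 9, 10]
demand = [5, 7, 6, 100, 110, 105]
threshold, cost = best_split(hours, demand)
print(threshold, cost)
```

The chosen threshold separates the night hours from the morning hours, because that split leaves the least variation inside each group.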

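Stopping criteria

A decision tree splits nodes recursively with a greedy algorithm, but when does it stop?
There are many stopping strategies. The most common is to require a minimum number of samples in a node. Another is to limit the depth of the tree.
These limits keep the algorithm from overfitting too badly.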
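The effect of such limits can be seen on synthetic data. In the sketch below the data and the specific limits (max_depth=3, min_samples_leaf=20) are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.rand(300, 2)
y_demo = np.sin(6 * X_demo[:, 0]) + 0.1 * rng.randn(300)

# Without limits the tree keeps splitting until it memorizes the training data.
unrestricted = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# With stopping criteria the tree stays small.
restricted = DecisionTreeRegressor(max_depth=3, min_samples_leaf=20,
                                   random_state=0).fit(X_demo, y_demo)

print(unrestricted.get_depth(), unrestricted.get_n_leaves())
print(restricted.get_depth(), restricted.get_n_leaves())
```

The unrestricted tree reaches a near-perfect training score by fitting the noise, which is the overfitting that the stopping criteria are meant to prevent.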

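Hyperparameters

A decision tree's hyperparameters mostly describe the structure of the tree, for example the minimum number of samples per leaf or the maximum number of leaf nodes. We can use tools such as GridSearch to choose them.

Decision tree algorithms

Decision trees are a fairly old family of machine learning algorithms, and many variants have emerged over the years. Common ones include:

CART (Classification And Regression Tree)
ID3
C4.5

Training

As with the linear model, the overall workflow is not particularly different.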

In [38]:

# X and y were already prepared when we applied the linear regression model above
#X = train_df_new
#y= y.total_count.values.reshape(-1,1)

from sklearn.tree import DecisionTreeRegressor

dtr = DecisionTreeRegressor(max_depth=4,
                           min_samples_split=5,
                           max_leaf_nodes=10)
dtr.fit(X, y)

Out[38]:

DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [39]:

dtr.score(X,y)

Out[39]:

0.6056576562103779

Cross-validation

A decision tree has many hyperparameters, so we need cross-validation to choose among them.

In [40]:

param_grid = {"criterion": ["mse", "mae"],
              "min_samples_split": [10, 20, 40],
              "max_depth": [2, 6, 8],
              "min_samples_leaf": [20, 40, 100],
              "max_leaf_nodes": [5, 20, 100, 500, 800],
              }
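# To speed things up, we shrink the search space
# (note: scikit-learn >= 1.0 names these criteria 'squared_error' and 'absolute_error')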
param_grid = {"criterion": ["mse",],
              "min_samples_split": [10, 20],
              "max_depth": [2,6,8],
              "min_samples_leaf": [20,],
              "max_leaf_nodes": [5, 20, 500,800],
              }

In [41]:

from sklearn.model_selection import GridSearchCV

In [42]:

grid_cv_dtr = GridSearchCV(dtr, param_grid, cv=5, n_jobs=4) # slow
grid_cv_dtr.fit(X,y)

Out[42]:

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=DecisionTreeRegressor(criterion='mse', max_depth=4, max_features=None,
           max_leaf_nodes=10, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=5, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best'),
       fit_params=None, iid='warn', n_jobs=4,
       param_grid={'criterion': ['mse'], 'min_samples_split': [10, 20], 'max_depth': [2, 6, 8], 'min_samples_leaf': [20], 'max_leaf_nodes': [5, 20, 500, 800]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [43]:

print("R-Squared::{}".format(grid_cv_dtr.best_score_))
print("Best Hyperparameters:\n{}".format(grid_cv_dtr.best_params_))

R-Squared::0.85891903233008
Best Hyperparameters:
{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 20}

In [44]:

# Inspect the scores for every parameter combination
df = pd.DataFrame(data=grid_cv_dtr.cv_results_)
df.sort_values(by='mean_test_score',axis=0, ascending=False)

Out[44]:

	mean_fit_time	std_fit_time	mean_score_time	std_score_time	param_criterion	param_max_depth	param_max_leaf_nodes	param_min_samples_leaf	param_min_samples_split	params	...	mean_test_score	std_test_score	rank_test_score	split0_train_score	split1_train_score	split2_train_score	split3_train_score	split4_train_score	mean_train_score	std_train_score
23	0.097806	0.034927	0.010401	0.002332	mse	8	800	20	20	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.858919	0.013606	1	0.887223	0.876620	0.868682	0.877289	0.881233	0.878210	0.006075
22	0.119407	0.062221	0.008201	0.001470	mse	8	800	20	10	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.858858	0.013542	2	0.887223	0.876620	0.868682	0.877289	0.881233	0.878210	0.006075
21	0.112006	0.057751	0.022801	0.029139	mse	8	500	20	20	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.858858	0.013542	2	0.887223	0.876620	0.868682	0.877289	0.881233	0.878210	0.006075
20	0.118007	0.028853	0.011001	0.003847	mse	8	500	20	10	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.858858	0.013542	2	0.887223	0.876620	0.868682	0.877289	0.881233	0.878210	0.006075
13	0.147009	0.141874	0.008000	0.000632	mse	6	500	20	20	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.780143	0.011614	5	0.807639	0.787498	0.785540	0.786980	0.797286	0.792989	0.008424
15	0.089805	0.032556	0.008200	0.001470	mse	6	800	20	20	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.780143	0.011614	5	0.807639	0.787498	0.785540	0.786980	0.797286	0.792989	0.008424
14	0.077804	0.012156	0.008001	0.000632	mse	6	800	20	10	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.780143	0.011614	5	0.807639	0.787498	0.785540	0.786980	0.797286	0.792989	0.008424
12	0.154809	0.075315	0.011201	0.003544	mse	6	500	20	10	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.780143	0.011614	5	0.807639	0.787498	0.785540	0.786980	0.797286	0.792989	0.008424
18	0.097605	0.065909	0.009201	0.003124	mse	8	20	20	10	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.723000	0.008725	9	0.734366	0.733953	0.730833	0.736090	0.728585	0.732765	0.002692
19	0.076404	0.026779	0.012601	0.001744	mse	8	20	20	20	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.723000	0.008725	9	0.734366	0.733953	0.730833	0.736090	0.728585	0.732765	0.002692
10	0.110206	0.085488	0.012801	0.008281	mse	6	20	20	10	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.715495	0.010296	11	0.725802	0.724899	0.723102	0.732985	0.719649	0.725287	0.004387
11	0.132008	0.079629	0.011401	0.003611	mse	6	20	20	20	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.715495	0.010296	11	0.725802	0.724899	0.723102	0.732985	0.719649	0.725287	0.004387
8	0.049603	0.006771	0.008201	0.001939	mse	6	5	20	10	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.502116	0.009131	13	0.504443	0.518035	0.502760	0.506210	0.498082	0.505906	0.006640
9	0.049803	0.003060	0.007801	0.000748	mse	6	5	20	20	{'criterion': 'mse', 'max_depth': 6, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.502116	0.009131	13	0.504443	0.518035	0.502760	0.506210	0.498082	0.505906	0.006640
16	0.077404	0.063308	0.008401	0.000800	mse	8	5	20	10	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.502116	0.009131	13	0.504443	0.518035	0.502760	0.506210	0.498082	0.505906	0.006640
17	0.050603	0.006888	0.010001	0.003522	mse	8	5	20	20	{'criterion': 'mse', 'max_depth': 8, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.502116	0.009131	13	0.504443	0.518035	0.502760	0.506210	0.498082	0.505906	0.006640
7	0.092205	0.089305	0.008200	0.001939	mse	2	800	20	20	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
6	0.052403	0.015345	0.007800	0.000400	mse	2	800	20	10	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 800, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
5	0.069604	0.038393	0.009400	0.002333	mse	2	500	20	20	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
4	0.061203	0.017407	0.010801	0.004446	mse	2	500	20	10	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 500, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
3	0.060603	0.015552	0.026401	0.026159	mse	2	20	20	20	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
2	0.066204	0.024212	0.008401	0.001020	mse	2	20	20	10	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 20, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.489907	0.007814	17	0.497879	0.489265	0.495486	0.497041	0.493573	0.494649	0.003066
1	0.064004	0.026269	0.009201	0.002040	mse	2	5	20	20	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 20}	...	0.484010	0.007223	23	0.491806	0.483696	0.489150	0.491167	0.487932	0.488750	0.002883
0	0.305017	0.080214	0.037202	0.021490	mse	2	5	20	10	{'criterion': 'mse', 'max_depth': 2, 'max_leaf_nodes': 5, 'min_samples_leaf': 20, 'min_samples_split': 10}	...	0.484010	0.007223	23	0.491806	0.483696	0.489150	0.491167	0.487932	0.488750	0.002883

24 rows × 25 columns

In [45]:

fig,ax = plt.subplots()
sn.pointplot(data=df[['mean_test_score',
                           'param_max_leaf_nodes',
                           'param_max_depth']],
             y='mean_test_score',x='param_max_depth',
             hue='param_max_leaf_nodes',ax=ax)
ax.set(title="Effect of Depth and Leaf Nodes on Model Performance")

Out[45]:

[Text(0.5,1,'Effect of Depth and Leaf Nodes on Model Performance')]

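From the plot above we can see that:

Going from a tree depth of 2 to 6 gives a big jump in model performance
The more leaf nodes allowed, the better the result
Maximum leaf node counts of 500 and 800 make hardly any difference

Since the best score is at the largest depth we tried (8), the model probably still has room for tuning. Try it yourself and see whether you can find better hyperparameters.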

In [46]:

predicted = grid_cv_dtr.best_estimator_.predict(X)
residuals = y.flatten()-predicted

In [47]:

fig, ax = plt.subplots()
ax.scatter(y.flatten(), residuals)
ax.axhline(lw=2,color='black')
ax.set_xlabel(u'Actual value')
ax.set_ylabel(u'Residual')

Out[47]:

Text(0,0.5,'Residual')

In [48]:

r2_scores = cross_val_score(grid_cv_dtr.best_estimator_, X, y, cv=10)
mse_scores = cross_val_score(grid_cv_dtr.best_estimator_, X, y, cv=10,scoring='neg_mean_squared_error')

In [49]:

print("avg R-squared::{}".format(np.mean(r2_scores)))
print("RMSE::{}".format(np.mean(np.sqrt(-mse_scores))))

avg R-squared::0.8634253078970714
RMSE::67.31793465494015

Testing

In [50]:

# Current best model
best_dtr_model = grid_cv_dtr.best_estimator_

In [51]:

# X_test and y_test were already prepared for the linear model above

y_pred = best_dtr_model.predict(X_test)
residuals = y_test.flatten() - y_pred

In [52]:

r2_score = best_dtr_model.score(X_test,y_test)
print("R-squared:{}".format(r2_score))
print("RMSE: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))

R-squared:0.8722059567160857
RMSE: 63.85

In [53]:

fig, ax = plt.subplots()
ax.scatter(y_test.flatten(), residuals)
ax.axhline(lw=2,color='black')
ax.set_xlabel(u'Actual value')
ax.set_ylabel(u'Residual')

r2_score = grid_cv_dtr.best_estimator_.score(X_test,y_test)

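Exercises

Keep tuning the decision tree model and see whether you can push R² above 90%
Try other models to see whether one suits this application better

We can draw the trained decision tree. This requires installing the additional pydotplus package:

conda install -c conda-forge pydotplus

You also need to install its dependency GraphViz and, once installed, add its directory to the PATH environment variable.

Then run the code below (you may need to restart the current notebook):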

In [54]:

from sklearn.tree import export_graphviz
import pydotplus 

dot_data = export_graphviz(dtr, out_file=None) 
graph = pydotplus.graph_from_dot_data(dot_data) 
graph.write_pdf("bikeshare.pdf")

Out[54]:

True