[Practice] Bike Rental Prediction

Understanding the Data

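This project uses data from a city bike rental system: two years of hourly rental records from Washington, D.C. The training set consists of the first 19 days of each month, and the test set consists of the days from the 20th onward (whose rental counts we have to predict ourselves). Data source: the Kaggle bike rental prediction competition.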
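
The split rule described above (days 1 to 19 for training, day 20 onward for testing) can be sketched as follows. This is only an illustration on synthetic hourly timestamps; the frame `df` and the names `train_part` and `test_part` are assumptions, not part of the competition files, which already arrive split this way.

```python
import pandas as pd

# Synthetic hourly timestamps covering January 2011 (placeholder data, not the Kaggle file).
hours = pd.date_range("2011-01-01", periods=31 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"datetime": hours})

# Days 1-19 of the month form the training part; day 20 onward is held out.
is_train = df["datetime"].dt.day <= 19
train_part = df[is_train]
test_part = df[~is_train]

print(len(train_part), len(test_part))
```

For a 31-day month this gives 19 days of hourly rows for training and 12 days for testing.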

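The fields are described below:
(1) datetime: the timestamp, given as year-month-day hour.
(2) season: the season. 1 = spring, 2 = summer, 3 = fall, 4 = winter.
(3) holiday: whether the day is a holiday. 1 = yes, 0 = no.
(4) workingday: whether the day is a working day. 1 = yes, 0 = no.
(5) weather: the weather condition:
    1: clear, few clouds, or partly cloudy.
    2: mist with clouds, wind, etc.
    3: light snow or light rain, with lightning and clouds.
    4: heavy rain, hail, or lightning with dense fog, or heavy snow.
(6) temp: temperature in degrees Celsius.
(7) atemp: the "feels like" temperature.
(8) humidity: humidity.
(9) windspeed: wind speed.
(10) casual: number of bikes rented by casual (unregistered) users.
(11) registered: number of bikes rented by registered users.
(12) count: total number of rentals, i.e. casual + registered.
Fields 10 to 12 are not features; field 12 is the value we need to predict.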

Data Preprocessing

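Import the data analysis packages, embed matplotlib figures directly in the notebook, read the training data, look at the first ten rows, and check the column types and the size of the data set.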

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df_train = pd.read_csv('kaggle_bike_competition_train.csv',header = 0)

df_train.head(10)

[Figure 1: output of df_train.head(10)]

print("Column names and types:", df_train.dtypes, sep='\n')
print("Data set size:", df_train.shape, sep='\n')
print("Non-null count per column:", df_train.count(), sep='\n')
Column names and types:
datetime       object
season          int64
holiday         int64
workingday      int64
weather         int64
temp          float64
atemp         float64
humidity        int64
windspeed     float64
casual          int64
registered      int64
count           int64
dtype: object
Data set size:
(10886, 12)
Non-null count per column:
datetime      10886
season        10886
holiday       10886
workingday    10886
weather       10886
temp          10886
atemp         10886
humidity      10886
windspeed     10886
casual        10886
registered    10886
count         10886
dtype: int64

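The training set has 10,886 samples and 12 columns. Every column has 10,886 non-null entries, so there are no missing values.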
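
The "no missing values" conclusion comes from comparing each column's non-null count with the number of rows. A minimal sketch of the same check, run on a small made-up frame (the `toy` data below is illustrative, with one gap inserted on purpose):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative data only).
toy = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan],
    "humidity": [81, 80, 80],
})

# count() gives the non-null entries per column; isnull().sum() counts the gaps directly.
non_null = toy.count()
missing = toy.isnull().sum()
print(missing)

# A frame has no missing values exactly when every column's count equals the row count.
has_no_missing = bool((non_null == len(toy)).all())
print(has_no_missing)
```

Running `df_train.isnull().sum()` on the real training set would be the equivalent one-line check.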

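The datetime value carries a lot of information, so we pull out the month, the day of the week, and the hour into three separate columns. We then drop the columns the model should not learn from: the raw datetime, plus casual and registered, which together add up to the target.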

dt_index = pd.DatetimeIndex(df_train.datetime)
df_train['month'] = dt_index.month
df_train['day'] = dt_index.dayofweek  # day of the week: Monday=0 ... Sunday=6
df_train['hour'] = dt_index.hour
df_train_origin = df_train.copy()  # keep a copy of the full data set
df_train = df_train.drop(['datetime', 'casual', 'registered'], axis=1)
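
One detail in the code above is easy to miss: the `day` column holds the day of the week (`dayofweek`), not the day of the month. The sketch below shows the same extraction with the `.dt` accessor on two hand-picked timestamps; the `toy` frame is an assumption made for illustration.

```python
import pandas as pd

# Two hand-picked timestamps: a Saturday morning and a Monday evening.
toy = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2011-01-03 17:00:00"]})

# Parse the strings once, then read the components through the .dt accessor.
ts = pd.to_datetime(toy["datetime"])
toy["month"] = ts.dt.month
toy["day"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6, so Saturday is 5
toy["hour"] = ts.dt.hour

print(toy)
```

Day of the week is a reasonable choice here because the day of the month separates the training set (days 1 to 19) from the test set (day 20 onward), so a day-of-month feature would take values at test time that never appear in training.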

Split the data set into two parts:
1. df_train_target: the target, i.e. the count field.
2. df_train_data: the data used to produce the features.

df_train_target = df_train['count'].values
df_train_data = df_train.drop(['count'],axis = 1).values

Feature Engineering

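Applying a machine learning algorithm is mostly a matter of tuning parameters: different settings give different results (for example the regularization coefficient, the depth and number of trees in tree-based algorithms, or the distance metric).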

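We use cross-validation (with about 20% of the data held out as the validation set) to see how the models perform. We try Support Vector Regression, Ridge Regression, and Random Forest Regressor. Each model is run three times so we can look at the average result.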

from sklearn import linear_model
from sklearn import svm
from sklearn.ensemble import RandomForestRegressor
# sklearn.cross_validation, sklearn.learning_curve and sklearn.grid_search were
# removed in scikit-learn 0.20; their contents now live in sklearn.model_selection.
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.model_selection import learning_curve, GridSearchCV
from sklearn.metrics import explained_variance_score

# Three random 80/20 splits into training and validation data
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)


print("Ridge regression")
for train, test in cv.split(df_train_data):
    model = linear_model.Ridge().fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))

print("Support vector regression / SVR(kernel='rbf',C=10,gamma=.001)")
for train, test in cv.split(df_train_data):
    model = svm.SVR(kernel='rbf', C=10, gamma=.001).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))

print("Random forest regression / Random Forest(n_estimators = 100)")
for train, test in cv.split(df_train_data):
    model = RandomForestRegressor(n_estimators=100).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))
Ridge regression
train score: 0.339, test score: 0.332

train score: 0.330, test score: 0.370

train score: 0.342, test score: 0.320

Support vector regression / SVR(kernel='rbf',C=10,gamma=.001)
train score: 0.417, test score: 0.408

train score: 0.406, test score: 0.452

train score: 0.419, test score: 0.390

Random forest regression / Random Forest(n_estimators = 100)
train score: 0.981, test score: 0.866

train score: 0.981, test score: 0.880

train score: 0.981, test score: 0.870
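
The text says each model is run three times to look at the average, but the loops print only the per-split scores. One way to get the averages directly is `cross_validate` from `sklearn.model_selection`. The sketch below is not the author's code: it runs on synthetic regression data (`X_demo` and `y_demo` are made up), and only the splitter settings mirror the ones used above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic regression data (an assumption for illustration, not the bike data).
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.randn(200)

# Same splitter settings as in the article: three random 80/20 splits.
cv_demo = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# With no scoring argument, the estimator's own score() is used, which is R^2 for regressors.
result = cross_validate(Ridge(), X_demo, y_demo, cv=cv_demo, return_train_score=True)

mean_train = result["train_score"].mean()
mean_test = result["test_score"].mean()
print("mean train R^2: {0:.3f}, mean test R^2: {1:.3f}".format(mean_train, mean_test))
```

Replacing `Ridge()` with the SVR or random forest estimator, and the synthetic arrays with `df_train_data` and `df_train_target`, should give the averaged counterparts of the scores printed above.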

Model Tuning

Random forest regression gave the best results, so we use a grid search to look for better parameters. This takes roughly two minutes.

from sklearn.model_selection import GridSearchCV, train_test_split

X = df_train_data
y = df_train_target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tuned_parameters = [{'n_estimators': [10, 100, 500]}]

clf = GridSearchCV(RandomForestRegressor(), tuned_parameters, cv=5, scoring='r2')
clf.fit(X_train, y_train)

# Best model found by the search
print(clf.best_estimator_)
print("")
print("Scores for each setting:")
# grid_scores_ was removed in scikit-learn 0.20; cv_results_ holds the same information.
results = clf.cv_results_
for mean_score, std_score, params in zip(
        results['mean_test_score'], results['std_test_score'], results['params']):
    # As in the original run, the spread shown is half the standard deviation across folds.
    print("%0.3f (+/-%0.03f) for %r" % (mean_score, std_score / 2, params))
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

Scores for each setting:
0.846 (+/-0.008) for {'n_estimators': 10}
0.862 (+/-0.006) for {'n_estimators': 100}
0.863 (+/-0.005) for {'n_estimators': 500}
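
The three scores above differ only slightly between 100 and 500 trees, and a table makes that easier to see than formatted print lines. A possible sketch, assuming a recent scikit-learn where the fitted search exposes `cv_results_`; the data, the grid values, and the name `search` are made up and kept small so the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic problem (an assumption for illustration, not the bike data).
rng = np.random.RandomState(0)
X_demo = rng.rand(120, 4)
y_demo = 10 * X_demo[:, 0] + rng.rand(120)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [5, 20]},  # a tiny grid so the sketch runs in seconds
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, so it converts directly into a DataFrame.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

The `mean_fit_time` column of the same dict is also worth a look when deciding whether a gain in the third decimal place justifies five times as many trees.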

Next, look at the model's learning curve to see whether it overfits or underfits.

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt


# Both names come from sklearn.model_selection in scikit-learn 0.20 and later.
from sklearn.model_selection import ShuffleSplit, learning_curve

title = "Learning Curves (Random Forest, n_estimators = 100)"
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
estimator = RandomForestRegressor(n_estimators=100)
plot_learning_curve(estimator, title, X, y, (0.0, 1.01), cv=cv, n_jobs=4)

plt.show()

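[Figure 2: learning curves for the random forest (training score and cross-validation score against the number of training examples)]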

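Random forests have a strong capacity to learn. The plot shows a fairly large gap between the training score and the cross-validation score, so overfitting is still quite noticeable. Below we try to reduce it by limiting max_features and max_depth, but the improvement is small: the training score drops from about 0.98 to about 0.97 while the test score stays around 0.87.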

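from sklearn.model_selection import ShuffleSplit

# The same ten random 80/20 splits as used for the learning curve
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

print("Random forest regression / Random Forest(n_estimators=200, max_features=0.6, max_depth=15)")
for train, test in cv.split(df_train_data):
    model = RandomForestRegressor(n_estimators=200, max_features=0.6, max_depth=15).fit(df_train_data[train], df_train_target[train])
    print("train score: {0:.3f}, test score: {1:.3f}\n".format(
        model.score(df_train_data[train], df_train_target[train]),
        model.score(df_train_data[test], df_train_target[test])))
Random forest regression / Random Forest(n_estimators=200, max_features=0.6, max_depth=15)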
train score: 0.965, test score: 0.870

train score: 0.966, test score: 0.885

train score: 0.965, test score: 0.872

train score: 0.965, test score: 0.877

train score: 0.967, test score: 0.870

train score: 0.965, test score: 0.872

train score: 0.966, test score: 0.864

train score: 0.966, test score: 0.873

train score: 0.965, test score: 0.873

train score: 0.966, test score: 0.870
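
To put a number on how much the constrained forest helps, the printed scores can be averaged and the gap between training and test compared for the two configurations. The lists below are copied from the outputs in this article. The helper `gap` is a name introduced here for illustration, and the comparison is only rough, since the first run used 3 splits with 100 trees and the second used 10 splits with 200 trees.

```python
import numpy as np

# Scores copied from the printed output in this article.
baseline_train = [0.981, 0.981, 0.981]
baseline_test = [0.866, 0.880, 0.870]
constrained_train = [0.965, 0.966, 0.965, 0.965, 0.967, 0.965, 0.966, 0.966, 0.965, 0.966]
constrained_test = [0.870, 0.885, 0.872, 0.877, 0.870, 0.872, 0.864, 0.873, 0.873, 0.870]

def gap(train_scores, test_scores):
    """Mean training score minus mean test score; a larger gap suggests more overfitting."""
    return float(np.mean(train_scores) - np.mean(test_scores))

baseline_gap = gap(baseline_train, baseline_test)
constrained_gap = gap(constrained_train, constrained_test)

print("baseline gap:    {0:.3f}".format(baseline_gap))
print("constrained gap: {0:.3f}".format(constrained_gap))
print("mean test score: {0:.3f} -> {1:.3f}".format(
    np.mean(baseline_test), np.mean(constrained_test)))
```

The gap shrinks from about 0.109 to about 0.093, but almost all of that comes from a lower training score. The mean test score is about 0.872 in the first run and about 0.873 in the second, which matches the observation that the attempt did not help much.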
