Kaggle link: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
Notebook (ipynb): https://github.com/824024445/KaggleCases
One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions affected the bottom line.
In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.
Want to work with some of the world's largest data sets in a great environment? This is a chance to show the Walmart recruiting team your modeling prowess.
This competition counts towards rankings and achievements. If you wish to be considered for an interview at Walmart, check the "Allow host to contact me" box when you make your first entry.
You must compete as an individual in recruiting competitions, and you may only use the provided data to make your predictions.
Submissions for this competition are evaluated on the weighted mean absolute error (WMAE):
$$\mathrm{WMAE} = \frac{1}{\sum_i w_i} \sum_{i=1}^{n} w_i \, \lvert y_i - \hat{y}_i \rvert$$

where $n$ is the number of rows, $\hat{y}_i$ is the predicted weekly sales, $y_i$ is the actual weekly sales, and $w_i$ is a weight equal to 5 if the week is a holiday week and 1 otherwise.
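In code, the metric might look like this minimal sketch (the `wmae` helper is my own naming, not part of any competition kit):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted MAE: holiday weeks get weight 5, all other weeks weight 1."""
    w = np.where(is_holiday, 5, 1)
    return np.sum(w * np.abs(np.asarray(y_true) - np.asarray(y_pred))) / np.sum(w)
```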
Submission file: the Id column is formed by joining Store, Dept and Date with underscores (e.g. Store_Dept_2012-11-02).
For each row in the test set (a store + department + date triplet), you should predict the weekly sales of that department.
You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and your task is to predict the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including those holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effect of markdowns on these holiday weeks in the absence of complete/ideal historical data.
stores.csv:
This file contains anonymized information about the 45 stores, indicating the type and size of each store.
train.csv:
This is the historical sales data, covering 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store (the target)
IsHoliday - whether the week is a special holiday week
test.csv:
This file is identical to train.csv, except that the weekly sales are withheld. You must predict the sales for each store, department, and date triplet in this file.
features.csv:
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not always available for all stores. Any missing value is marked with NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week
For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):
Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
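Since the Date column is the Friday of each week, these holiday weeks can be matched against it directly. As an optional illustration (not used in the rest of this walkthrough), a named-holiday lookup could be built like this:

```python
# Optional sketch: map each listed holiday week to its name.
holiday_weeks = {
    "Super Bowl":   ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day":    ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas":    ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}
date_to_holiday = {d: name for name, dates in holiday_weeks.items() for d in dates}
# Once train.csv is loaded: train_df["HolidayName"] = train_df["Date"].map(date_to_holiday)
```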
I wrote a small function to download the data; it is all the original data from the official site, stored in my GitHub repository (https://github.com/824024445/KaggleCases).
All data is downloaded into a datasets folder under your current directory, and the data for each case goes into a subfolder named after that case.
All of my Kaggle case write-ups use this function to download the data; only the first two constants need to be changed.
> Note: this function is only used to download the data and runs entirely within this code cell. Nothing in it, including the constants, is used anywhere else.
import os
import zipfile
from six.moves import urllib

FILE_NAME = "walmart-recruiting-store-sales-forecasting.zip"  # archive file name
DATA_PATH = "datasets/walmart-recruiting-store-sales-forecasting"  # local folder, named after the archive for easy identification
DATA_URL = "https://github.com/824024445/KaggleCases/blob/master/datasets/" + FILE_NAME + "?raw=true"

def fetch_data(data_url=DATA_URL, data_path=DATA_PATH, file_name=FILE_NAME):
    if not os.path.isdir(data_path):  # create the data folder if it does not exist yet
        os.makedirs(data_path)
    zip_path = os.path.join(data_path, file_name)  # local path for the downloaded archive
    urllib.request.urlretrieve(data_url, zip_path)  # download the remote file to zip_path
    data_zip = zipfile.ZipFile(zip_path)
    data_zip.extractall(path=data_path)  # extract into this case's own folder so each dataset stays separate
    data_zip.close()

fetch_data()
import pandas as pd
import numpy as np
train_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/train.csv")
test_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/test.csv")
features = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/features.csv")
stores = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/stores.csv")
train_df = train_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
test_df = test_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
combine = [train_df, test_df]
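Note that features.csv also carries an IsHoliday column, so these merges leave duplicated IsHoliday_x / IsHoliday_y columns (visible in the output below; they are dropped later). If you prefer to avoid the duplication, one alternative, sketched here but not used in what follows, is to merge on the flag as well, since train.csv and features.csv agree on it for every (Store, Date) pair:

```python
# Sketch: merging on IsHoliday too avoids the duplicated _x/_y columns.
train_nodup = (pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/train.csv")
                 .merge(features, on=["Store", "Date", "IsHoliday"], how="left")
                 .merge(stores, on="Store", how="left"))
```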
train_df.head()
Store | Dept | Date | Weekly_Sales | IsHoliday_x | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday_y | Type | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | A | 151315 |
1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True | A | 151315 |
2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False | A | 151315 |
3 | 1 | 1 | 2010-02-26 | 19403.54 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False | A | 151315 |
4 | 1 | 1 | 2010-03-05 | 21827.90 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False | A | 151315 |
train_df.info()
Int64Index: 421570 entries, 0 to 421569
Data columns (total 17 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Date 421570 non-null object
Weekly_Sales 421570 non-null float64
IsHoliday_x 421570 non-null bool
Temperature 421570 non-null float64
Fuel_Price 421570 non-null float64
MarkDown1 150681 non-null float64
MarkDown2 111248 non-null float64
MarkDown3 137091 non-null float64
MarkDown4 134967 non-null float64
MarkDown5 151432 non-null float64
CPI 421570 non-null float64
Unemployment 421570 non-null float64
IsHoliday_y 421570 non-null bool
Type 421570 non-null object
Size 421570 non-null int64
dtypes: bool(2), float64(10), int64(3), object(2)
memory usage: 52.3+ MB
Observations: the five MarkDown columns are missing for most training rows (per the data description, MarkDown values only exist from November 2011 onward); every other column is complete.
test_df.info()
Int64Index: 115064 entries, 0 to 115063
Data columns (total 16 columns):
Store 115064 non-null int64
Dept 115064 non-null int64
Date 115064 non-null object
IsHoliday_x 115064 non-null bool
Temperature 115064 non-null float64
Fuel_Price 115064 non-null float64
MarkDown1 114915 non-null float64
MarkDown2 86437 non-null float64
MarkDown3 105235 non-null float64
MarkDown4 102176 non-null float64
MarkDown5 115064 non-null float64
CPI 76902 non-null float64
Unemployment 76902 non-null float64
IsHoliday_y 115064 non-null bool
Type 115064 non-null object
Size 115064 non-null int64
dtypes: bool(2), float64(9), int64(3), object(2)
memory usage: 13.4+ MB
Observations on the test set: the MarkDown columns are far more complete here (the test period starts in November 2012), but CPI and Unemployment are now missing for roughly a third of the rows.
train_df.describe()
Store | Dept | Weekly_Sales | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | Size | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 421570.000000 | 421570.000000 | 421570.000000 | 421570.000000 | 421570.000000 | 150681.000000 | 111248.000000 | 137091.000000 | 134967.000000 | 151432.000000 | 421570.000000 | 421570.000000 | 421570.000000 |
mean | 22.200546 | 44.260317 | 15981.258123 | 60.090059 | 3.361027 | 7246.420196 | 3334.628621 | 1439.421384 | 3383.168256 | 4628.975079 | 171.201947 | 7.960289 | 136727.915739 |
std | 12.785297 | 30.492054 | 22711.183519 | 18.447931 | 0.458515 | 8291.221345 | 9475.357325 | 9623.078290 | 6292.384031 | 5962.887455 | 39.159276 | 1.863296 | 60980.583328 |
min | 1.000000 | 1.000000 | -4988.940000 | -2.060000 | 2.472000 | 0.270000 | -265.760000 | -29.100000 | 0.220000 | 135.160000 | 126.064000 | 3.879000 | 34875.000000 |
25% | 11.000000 | 18.000000 | 2079.650000 | 46.680000 | 2.933000 | 2240.270000 | 41.600000 | 5.080000 | 504.220000 | 1878.440000 | 132.022667 | 6.891000 | 93638.000000 |
50% | 22.000000 | 37.000000 | 7612.030000 | 62.090000 | 3.452000 | 5347.450000 | 192.000000 | 24.600000 | 1481.310000 | 3359.450000 | 182.318780 | 7.866000 | 140167.000000 |
75% | 33.000000 | 74.000000 | 20205.852500 | 74.280000 | 3.738000 | 9210.900000 | 1926.940000 | 103.990000 | 3595.040000 | 5563.800000 | 212.416993 | 8.572000 | 202505.000000 |
max | 45.000000 | 99.000000 | 693099.360000 | 100.140000 | 4.468000 | 88646.760000 | 104519.540000 | 141630.610000 | 67474.850000 | 108519.280000 | 227.232807 | 14.313000 | 219622.000000 |
train_df.describe(include="O")
Date | Type | |
---|---|---|
count | 421570 | 421570 |
unique | 143 | 3 |
top | 2011-12-23 | A |
freq | 3027 | 215478 |
The test set's CPI and Unemployment have missing values; let's look at how the non-missing part is distributed:
test_df[["CPI","Unemployment"]].describe()
CPI | Unemployment | |
---|---|---|
count | 76902.000000 | 76902.000000 |
mean | 176.961347 | 6.868733 |
std | 41.239967 | 1.583427 |
min | 131.236226 | 3.684000 |
25% | 138.402033 | 5.771000 |
50% | 192.304445 | 6.806000 |
75% | 223.244532 | 8.036000 |
max | 228.976456 | 10.199000 |
corr_matrix = train_df.corr()
corr_matrix.Weekly_Sales.sort_values(ascending=False)
Weekly_Sales 1.000000
Size 0.243828
Dept 0.148032
MarkDown5 0.090362
MarkDown1 0.085251
MarkDown3 0.060385
MarkDown4 0.045414
MarkDown2 0.024130
IsHoliday_y 0.012774
IsHoliday_x 0.012774
Fuel_Price -0.000120
Temperature -0.002312
CPI -0.020921
Unemployment -0.025864
Store -0.085195
Name: Weekly_Sales, dtype: float64
corr_matrix[["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"]].sort_values(by="MarkDown5", ascending=False)
MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | |
---|---|---|---|---|---|
MarkDown5 | 0.160257 | -0.007440 | -0.026467 | 0.107792 | 1.000000 |
Size | 0.345673 | 0.108827 | 0.048913 | 0.168196 | 0.304575 |
MarkDown1 | 1.000000 | 0.024486 | -0.108115 | 0.819238 | 0.160257 |
MarkDown4 | 0.819238 | -0.007768 | -0.071095 | 1.000000 | 0.107792 |
Weekly_Sales | 0.085251 | 0.024130 | 0.060385 | 0.045414 | 0.090362 |
CPI | -0.055558 | -0.039534 | -0.023590 | -0.049628 | 0.060630 |
Dept | -0.002426 | 0.000290 | 0.001784 | 0.004257 | 0.000109 |
Unemployment | 0.050285 | 0.020940 | 0.012818 | 0.024963 | -0.003843 |
MarkDown2 | 0.024486 | 1.000000 | -0.050108 | -0.007768 | -0.007440 |
Temperature | -0.040594 | -0.323927 | -0.096880 | -0.063947 | -0.017544 |
MarkDown3 | -0.108115 | -0.050108 | 1.000000 | -0.071095 | -0.026467 |
Store | -0.119588 | -0.035173 | -0.031556 | -0.009941 | -0.026634 |
IsHoliday_x | -0.035586 | 0.334818 | 0.427960 | -0.000562 | -0.053719 |
IsHoliday_y | -0.035586 | 0.334818 | 0.427960 | -0.000562 | -0.053719 |
Fuel_Price | 0.061371 | -0.220895 | -0.102092 | -0.044986 | -0.128065 |
## MarkDown

For the missing MarkDown values in the training set, hold off for now; later the training data will be split in two, one copy keeping the rows with missing MarkDowns (filled in) and one dropping those rows. Note from the correlation table above that MarkDown4 is strongly correlated with MarkDown1 (0.82), which is why MarkDown4 is left out of the fills below and dropped later.
test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
test_df[["CPI","Unemployment"]] = test_df[["CPI","Unemployment"]].fillna(method="ffill")
Convert Type into one-hot encoding:
train_df = pd.get_dummies(train_df, columns=["Type"])
test_df = pd.get_dummies(test_df, columns=["Type"])
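Here this is safe because Type has the same three categories (A, B, C) in both sets, so the dummy columns line up. In general, calling get_dummies separately on train and test can yield mismatched columns; a small defensive check, purely as a sketch:

```python
# Defensive alignment in case a Type category were present in only one set
# (not actually needed here, since both sets contain types A, B and C):
for col in [c for c in train_df.columns if c.startswith("Type_")]:
    if col not in test_df.columns:
        test_df[col] = 0
```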
train_df.head()
Store | Dept | Date | Weekly_Sales | IsHoliday_x | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment | IsHoliday_y | Size | Type_A | Type_B | Type_C | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 2010-02-05 | 24924.50 | False | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 | False | 151315 | 1 | 0 | 0 |
1 | 1 | 1 | 2010-02-12 | 46039.49 | True | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.242170 | 8.106 | True | 151315 | 1 | 0 | 0 |
2 | 1 | 1 | 2010-02-19 | 41595.55 | False | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 | False | 151315 | 1 | 0 | 0 |
3 | 1 | 1 | 2010-02-26 | 19403.54 | False | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 | False | 151315 | 1 | 0 | 0 |
4 | 1 | 1 | 2010-03-05 | 21827.90 | False | 46.50 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 | False | 151315 | 1 | 0 | 0 |
Replace the date with its month:
train_df['Month'] = pd.to_datetime(train_df['Date']).dt.month
test_df["Month"] = pd.to_datetime(test_df['Date']).dt.month
# Remember to drop Date later; keep it in the test set for now, since it is still needed to build the submission Id
## Temperature

Presumably people are less likely to go out in extreme weather, so split the data into two groups: temperatures below 22.01 and above 91.03. (The cutoffs come from the temperature distribution; the histogram I used has since been deleted, but see the sketch below.)
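A minimal way to redraw it (assuming matplotlib is available):

```python
import matplotlib.pyplot as plt

# Redraw the temperature histogram that motivated the 22.01 / 91.03 cutoffs.
train_df["Temperature"].hist(bins=50)
plt.xlabel("Temperature (Fahrenheit)")
plt.ylabel("Number of weeks")
plt.show()
```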
train_df.loc[(train_df["Temperature"]<22.01)|(train_df["Temperature"]>91.03), "Is_temp_extr"]=1
train_df.loc[(train_df["Temperature"]>=22.01)& (train_df["Temperature"]<=91.03), "Is_temp_extr"]=0
test_df.loc[(test_df["Temperature"]<22.01)|(test_df["Temperature"]>91.03), "Is_temp_extr"]=1
test_df.loc[(test_df["Temperature"]>=22.01)& (test_df["Temperature"]<=91.03), "Is_temp_extr"]=0
train_df.corr().Weekly_Sales.sort_values(ascending=False)[["Temperature", "Is_temp_extr"]]
# The new feature's correlation with weekly sales is more than ten times stronger than raw temperature's. Remember to drop the raw Temperature feature later.
Temperature -0.002312
Is_temp_extr -0.030016
Name: Weekly_Sales, dtype: float64
## Fuel price

Will people stay home because fuel is too expensive?
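As a quick sanity check on the cutoff, 3.47 sits right at the median fuel price (3.452 in the describe() table above), so this is essentially an above/below-median split:

```python
# The cutoff is roughly the median, so the split is close to 50/50.
print(train_df["Fuel_Price"].median())         # about 3.452
print((train_df["Fuel_Price"] > 3.47).mean())  # share of "expensive fuel" rows
```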
train_df.loc[train_df["Fuel_Price"]>3.47, "Is_fuel_expen"]=1
train_df.loc[train_df["Fuel_Price"]<=3.47, "Is_fuel_expen"]=0
# However the cutoff is varied, the correlation stays very low, so this feature will be dropped shortly
train_df.corr().Weekly_Sales.sort_values(ascending=False)[["Fuel_Price", "Is_fuel_expen"]]
Fuel_Price -0.000120
Is_fuel_expen -0.006626
Name: Weekly_Sales, dtype: float64
## IsHoliday

Because of the earlier merges there are two IsHoliday columns (IsHoliday_x and IsHoliday_y); only one is needed, so the other will be dropped.
Also, convert the boolean into 0 and 5 (5 being the holiday weight used later in the evaluation). Pearson correlation is invariant under this kind of affine recoding, which is why the two correlations printed below are identical.
train_df["IsHoliday"] = train_df["IsHoliday_x"].replace(True, 5).replace(False,0)
test_df["IsHoliday"] = test_df["IsHoliday_x"].replace(True, 5).replace(False,0)
train_df.corr().Weekly_Sales.sort_values(ascending=False)[["IsHoliday_x", "IsHoliday"]]
IsHoliday_x 0.012774
IsHoliday 0.012774
Name: Weekly_Sales, dtype: float64
train_df = train_df.drop(["IsHoliday_x", "IsHoliday_y",'MarkDown4',"Date", "Temperature", "Fuel_Price","Is_fuel_expen"], axis=1)
# The submission file needs the test set's Date feature, so build the Id values here; after this, Date can be dropped.
# (Named submission_id rather than id to avoid shadowing the Python builtin.)
submission_id = test_df["Store"].astype(str)+"_"+test_df["Dept"].astype(str)+"_"+test_df["Date"].astype(str)
test_df = test_df.drop(["IsHoliday_x", "IsHoliday_y", "MarkDown4", "Date","Temperature", "Fuel_Price"], axis=1)
Before the datasets go into any model, make sure there are no null values left, so run one final check.
First make two copies of the training set: one keeping the rows with missing MarkDowns (filled with 0), and one dropping those rows.
train_df_one = train_df.copy()
train_df_two = train_df.copy()
train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
train_df_two.dropna(inplace=True)
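A quick programmatic check that nothing slipped through (the info() listings below show the same):

```python
# Final sanity check: none of the three frames should contain missing values.
for name, df in [("train_df_one", train_df_one),
                 ("train_df_two", train_df_two),
                 ("test_df", test_df)]:
    assert df.isnull().sum().sum() == 0, name
```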
train_df_one.info()
Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Weekly_Sales 421570 non-null float64
MarkDown1 421570 non-null float64
MarkDown2 421570 non-null float64
MarkDown3 421570 non-null float64
MarkDown5 421570 non-null float64
CPI 421570 non-null float64
Unemployment 421570 non-null float64
Size 421570 non-null int64
Type_A 421570 non-null uint8
Type_B 421570 non-null uint8
Type_C 421570 non-null uint8
Month 421570 non-null int64
Is_temp_extr 421570 non-null float64
IsHoliday 421570 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 46.2 MB
train_df_two.info()
Int64Index: 101480 entries, 92 to 421569
Data columns (total 16 columns):
Store 101480 non-null int64
Dept 101480 non-null int64
Weekly_Sales 101480 non-null float64
MarkDown1 101480 non-null float64
MarkDown2 101480 non-null float64
MarkDown3 101480 non-null float64
MarkDown5 101480 non-null float64
CPI 101480 non-null float64
Unemployment 101480 non-null float64
Size 101480 non-null int64
Type_A 101480 non-null uint8
Type_B 101480 non-null uint8
Type_C 101480 non-null uint8
Month 101480 non-null int64
Is_temp_extr 101480 non-null float64
IsHoliday 101480 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 11.1 MB
test_df.info()
Int64Index: 115064 entries, 0 to 115063
Data columns (total 15 columns):
Store 115064 non-null int64
Dept 115064 non-null int64
MarkDown1 115064 non-null float64
MarkDown2 115064 non-null float64
MarkDown3 115064 non-null float64
MarkDown5 115064 non-null float64
CPI 115064 non-null float64
Unemployment 115064 non-null float64
Size 115064 non-null int64
Type_A 115064 non-null uint8
Type_B 115064 non-null uint8
Type_C 115064 non-null uint8
Month 115064 non-null int64
Is_temp_extr 115064 non-null float64
IsHoliday 115064 non-null float64
dtypes: float64(8), int64(4), uint8(3)
memory usage: 11.7 MB
To test quickly, I wrote a helper class. Most of my case write-ups reuse this class, with small tweaks each time because the evaluation metric differs from case to case.
import time
from sklearn.metrics import mean_absolute_error
from sklearn.base import clone

class Tester():
    def __init__(self, target):
        self.target = target
        self.datasets = {}
        self.models = {}
        self.scores = {}
        self.cache = {}  # a simple cache so repeated (model, dataset) runs are free

    def addDataset(self, name, df):
        self.datasets[name] = df.copy()

    def addModel(self, name, model):
        self.models[name] = model

    def clearModels(self):
        self.models = {}

    def clearCache(self):
        self.cache = {}

    def testModelWithDataset(self, m_name, df_name, sample_len, cv):
        if (m_name, df_name, sample_len, cv) in self.cache:
            return self.cache[(m_name, df_name, sample_len, cv)]
        clf = clone(self.models[m_name])
        if not sample_len:
            sample = self.datasets[df_name]
        else:
            sample = self.datasets[df_name].sample(sample_len)
        X = sample.drop([self.target], axis=1)
        Y = sample[self.target]
        # If the scoring metric changes, modify this part.
        # WMAE weights: 5 for holiday weeks and 1 (not 0) for the rest;
        # a weight of 0 would drop non-holiday weeks from the score entirely.
        weights = X["IsHoliday"].replace(0, 1)
        clf.fit(X, Y)
        Y_pred = clf.predict(X)
        # Note: this scores on the same rows the model was fit on (in-sample
        # error). The cv argument is kept in the signature but unused here.
        s = mean_absolute_error(Y, Y_pred, sample_weight=weights)
        self.cache[(m_name, df_name, sample_len, cv)] = s
        return s

    def runTests(self, sample_len=97056, cv=3):
        # Test every added model on every added dataset.
        for m_name in self.models:
            for df_name in self.datasets:
                score = self.testModelWithDataset(m_name, df_name, sample_len, cv)
                self.scores[(m_name, df_name)] = score
        print('--- Top 10 Results ---')
        # If the scoring metric changes, this part needs changing too.
        for score in sorted(self.scores.items(), key=lambda x: x[1])[:10]:
            print(score)

    def obtain_result(self, X_test):
        # The stored models were never fitted (testModelWithDataset fits clones),
        # so refit a clone of the best (model, dataset) pair before predicting.
        (m_name, df_name), _ = sorted(self.scores.items(), key=lambda x: x[1])[0]
        clf = clone(self.models[m_name])
        df = self.datasets[df_name]
        clf.fit(df.drop([self.target], axis=1), df[self.target])
        return clf.predict(X_test)
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPRegressor
# We will use this tester object with all the models
tester = Tester('Weekly_Sales')
# Add the datasets
tester.addDataset('all_markdown', train_df_one)
tester.addDataset('wipe_markdown', train_df_two)
# Add the models
knn_reg = KNeighborsRegressor(n_neighbors=10)
tree_reg = ExtraTreesRegressor(n_estimators=100,max_features='auto', verbose=1, n_jobs=1)
rf_reg = RandomForestRegressor(n_estimators=100,max_features='log2', verbose=1)
svr_reg = SVR(kernel='rbf', gamma='auto')
mlp_reg = MLPRegressor(hidden_layer_sizes=(10,), activation='relu', verbose=3)
gbrt_reg = GradientBoostingRegressor(max_depth=8, warm_start=True)
tester.addModel('KNeighborsRegressor', knn_reg)
tester.addModel('ExtraTreesRegressor', tree_reg)
tester.addModel('RandomForestRegressor', rf_reg)
tester.addModel('SVR', svr_reg)
tester.addModel('MLPRegressor', mlp_reg)
tester.addModel('GradientBoostingRegressor', gbrt_reg)
# Run the tests
tester.runTests()
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 26.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 28.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed: 3.8s finished
Finally, fit the gradient boosting model on the full all_markdown training set and predict on the test set:
X = train_df_one.drop(["Weekly_Sales"], axis=1)
Y = train_df_one["Weekly_Sales"]
gbrt_reg.fit(X, Y)
Y_pred = gbrt_reg.predict(test_df)
submission = pd.DataFrame({
    "Id": submission_id,
    "Weekly_Sales": Y_pred
})
submission.to_csv('submission.csv', index=False)
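A final note on evaluation: the Tester scores each model on the same rows it was fit on, so those are in-sample errors and will be optimistic. A quick holdout check against the competition's metric, as a sketch reusing the wmae helper defined near the top, could look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# Hold out 20% of the training data and score with WMAE.
X_tr, X_val, y_tr, y_val = train_test_split(
    train_df_one.drop(["Weekly_Sales"], axis=1),
    train_df_one["Weekly_Sales"],
    test_size=0.2, random_state=42)

model = GradientBoostingRegressor(max_depth=8)
model.fit(X_tr, y_tr)
print(wmae(y_val, model.predict(X_val), X_val["IsHoliday"] > 0))
```

A time-based split (training on earlier weeks, validating on later ones) would be even more faithful to the forecasting setup.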