Machine Learning Kaggle Case: Walmart Recruiting - Store Sales Forecasting



1.1 Competition Description




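This competition counts towards rankings and achievements. If you would like to be considered for an interview at Walmart, check the "Allow host to contact me" box the first time you enter.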


1.2 Evaluation



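Submissions are scored with a weighted mean absolute error (WMAE), where:

  • n is the number of rows
  • yi is the actual sales
  • wi is the weight: wi=5 if the week is a holiday week, otherwise 1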
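
The definitions above describe a weighted mean absolute error, WMAE = (1 / Σ wi) · Σ wi · |yi − ŷi|, where ŷi is the predicted sales. A minimal sketch of that metric, assuming this standard form; the function name `wmae` and the toy numbers are made up for illustration:

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: weight 5 for holiday weeks, 1 otherwise."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), 5.0, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example: three weeks, the second one is a holiday week
score = wmae([100.0, 200.0, 300.0], [110.0, 180.0, 300.0], [False, True, False])
print(score)
```

Because the holiday week carries five times the weight, its error of 20 dominates the score even though it is only one of three rows.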



1.3 Data Description




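train.csv:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales for the given department in the given store (the target)
IsHoliday - whether the week is a special holiday week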


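features.csv:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after November 2011 and is not available for all stores all the time. Any missing value is marked with NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week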


Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13


2.1 Getting the Data

2.1.1 Downloading the Data



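> Note: this function is only used to download the data, and it runs inside this code cell. Neither the function nor its constants are used anywhere else.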

import os
import zipfile
import urllib.request

FILE_NAME = "walmart-recruiting-store-sales-forecasting.zip"  # archive file name
DATA_PATH = "datasets/walmart-recruiting-store-sales-forecasting"  # folder for the data, named after the archive so it is easy to identify
DATA_URL = "https://github.com/824024445/KaggleCases/blob/master/datasets/" + FILE_NAME + "?raw=true"

def fetch_data(data_url=DATA_URL, data_path=DATA_PATH, file_name=FILE_NAME):
    # Create the target folder if it does not exist yet
    os.makedirs(data_path, exist_ok=True)
    zip_path = os.path.join(data_path, file_name)  # local path and name of the downloaded file
    # urlretrieve() downloads the remote file straight to disk
    urllib.request.urlretrieve(data_url, zip_path)  # the second argument is the local save path
    # With no path argument extractall() unpacks into the current directory; pass data_path to keep everything in one folder
    with zipfile.ZipFile(zip_path) as data_zip:
        data_zip.extractall(path=data_path)

fetch_data()

2.1.2 Reading the Data

import pandas as pd
import numpy as np

train_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/train.csv")
test_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/test.csv")
features = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/features.csv")
stores = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/stores.csv")

train_df = train_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
test_df = test_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
combine = [train_df, test_df]

Store Dept Date Weekly_Sales IsHoliday_x Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday_y Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False A 151315
1 1 1 2010-02-12 46039.49 True 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True A 151315
2 1 1 2010-02-19 41595.55 False 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False A 151315
3 1 1 2010-02-26 19403.54 False 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False A 151315
4 1 1 2010-03-05 21827.90 False 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False A 151315
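
The IsHoliday_x and IsHoliday_y columns in the merged table appear because both input files carry an IsHoliday column that is not part of the merge key. A small sketch on made-up frames (stand-ins for train.csv and features.csv, not the real data) shows where the suffixes come from and one way to confirm that the two columns agree:

```python
import pandas as pd

# Tiny stand-ins for train.csv and features.csv (values are made up)
sales = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "IsHoliday": [False, True],
})
feats = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.31, 38.51],
    "IsHoliday": [False, True],
})

merged = sales.merge(feats, on=["Store", "Date"], how="left")
print(list(merged.columns))

# The overlapping non-key column is duplicated with _x / _y suffixes
same = bool((merged["IsHoliday_x"] == merged["IsHoliday_y"]).all())
print(same)
```

If the two columns always match, as the identical correlations later in this article suggest, only one of them needs to be kept.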

2.2 A First Look at the Data

2.2.1 info()


Int64Index: 421570 entries, 0 to 421569
Data columns (total 17 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday_x     421570 non-null bool
Temperature     421570 non-null float64
Fuel_Price      421570 non-null float64
MarkDown1       150681 non-null float64
MarkDown2       111248 non-null float64
MarkDown3       137091 non-null float64
MarkDown4       134967 non-null float64
MarkDown5       151432 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
IsHoliday_y     421570 non-null bool
Type            421570 non-null object
Size            421570 non-null int64
dtypes: bool(2), float64(10), int64(3), object(2)
memory usage: 52.3+ MB


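  • MarkDown1-5 have many missing values in the training set, but a later look at the test set shows these columns are fairly complete there, and the correlation check below ranks them among the features most correlated with Weekly_Sales (about 0.02 to 0.09)
  • The remaining features have no nulls, so they need no imputation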

Int64Index: 115064 entries, 0 to 115063
Data columns (total 16 columns):
Store           115064 non-null int64
Dept            115064 non-null int64
Date            115064 non-null object
IsHoliday_x     115064 non-null bool
Temperature     115064 non-null float64
Fuel_Price      115064 non-null float64
MarkDown1       114915 non-null float64
MarkDown2       86437 non-null float64
MarkDown3       105235 non-null float64
MarkDown4       102176 non-null float64
MarkDown5       115064 non-null float64
CPI             76902 non-null float64
Unemployment    76902 non-null float64
IsHoliday_y     115064 non-null bool
Type            115064 non-null object
Size            115064 non-null int64
dtypes: bool(2), float64(9), int64(3), object(2)
memory usage: 13.4+ MB


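  • The test set's MarkDown data is fairly complete, so the feature is probably still useful. Start with a version that leaves this feature out, then consider how to use the data in the improvement stage.
  • CPI and Unemployment have nulls in the test set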
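
The same conclusions can be read off more directly by computing the share of missing values per column instead of scanning the info() output. A minimal sketch on a made-up frame; the helper name `missing_share` is hypothetical:

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Made-up values, only the pattern of gaps matters
demo = pd.DataFrame({
    "MarkDown1": [np.nan, np.nan, 10.0, 20.0],
    "CPI": [211.1, 211.2, np.nan, 211.3],
    "Size": [151315, 151315, 151315, 151315],
})
share = missing_share(demo)
print(share)
```

Calling the helper on both train_df and test_df would put the MarkDown, CPI and Unemployment gaps side by side.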

2.2.2 describe()

Store Dept Weekly_Sales Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Size
count 421570.000000 421570.000000 421570.000000 421570.000000 421570.000000 150681.000000 111248.000000 137091.000000 134967.000000 151432.000000 421570.000000 421570.000000 421570.000000
mean 22.200546 44.260317 15981.258123 60.090059 3.361027 7246.420196 3334.628621 1439.421384 3383.168256 4628.975079 171.201947 7.960289 136727.915739
std 12.785297 30.492054 22711.183519 18.447931 0.458515 8291.221345 9475.357325 9623.078290 6292.384031 5962.887455 39.159276 1.863296 60980.583328
min 1.000000 1.000000 -4988.940000 -2.060000 2.472000 0.270000 -265.760000 -29.100000 0.220000 135.160000 126.064000 3.879000 34875.000000
25% 11.000000 18.000000 2079.650000 46.680000 2.933000 2240.270000 41.600000 5.080000 504.220000 1878.440000 132.022667 6.891000 93638.000000
50% 22.000000 37.000000 7612.030000 62.090000 3.452000 5347.450000 192.000000 24.600000 1481.310000 3359.450000 182.318780 7.866000 140167.000000
75% 33.000000 74.000000 20205.852500 74.280000 3.738000 9210.900000 1926.940000 103.990000 3595.040000 5563.800000 212.416993 8.572000 202505.000000
max 45.000000 99.000000 693099.360000 100.140000 4.468000 88646.760000 104519.540000 141630.610000 67474.850000 108519.280000 227.232807 14.313000 219622.000000
Date Type
count 421570 421570
unique 143 3
top 2011-12-23 A
freq 3027 215478


CPI Unemployment
count 76902.000000 76902.000000
mean 176.961347 6.868733
std 41.239967 1.583427
min 131.236226 3.684000
25% 138.402033 5.771000
50% 192.304445 6.806000
75% 223.244532 8.036000
max 228.976456 10.199000

2.2.3 corr()

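corr_matrix = train_df.corr(numeric_only=True)  # skip the non-numeric Date and Type columns
corr_matrix["Weekly_Sales"].sort_values(ascending=False)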
Weekly_Sales    1.000000
Size            0.243828
Dept            0.148032
MarkDown5       0.090362
MarkDown1       0.085251
MarkDown3       0.060385
MarkDown4       0.045414
MarkDown2       0.024130
IsHoliday_y     0.012774
IsHoliday_x     0.012774
Fuel_Price     -0.000120
Temperature    -0.002312
CPI            -0.020921
Unemployment   -0.025864
Store          -0.085195
Name: Weekly_Sales, dtype: float64
corr_matrix[["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"]].sort_values(by="MarkDown5", ascending=False)
MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5
MarkDown5 0.160257 -0.007440 -0.026467 0.107792 1.000000
Size 0.345673 0.108827 0.048913 0.168196 0.304575
MarkDown1 1.000000 0.024486 -0.108115 0.819238 0.160257
MarkDown4 0.819238 -0.007768 -0.071095 1.000000 0.107792
Weekly_Sales 0.085251 0.024130 0.060385 0.045414 0.090362
CPI -0.055558 -0.039534 -0.023590 -0.049628 0.060630
Dept -0.002426 0.000290 0.001784 0.004257 0.000109
Unemployment 0.050285 0.020940 0.012818 0.024963 -0.003843
MarkDown2 0.024486 1.000000 -0.050108 -0.007768 -0.007440
Temperature -0.040594 -0.323927 -0.096880 -0.063947 -0.017544
MarkDown3 -0.108115 -0.050108 1.000000 -0.071095 -0.026467
Store -0.119588 -0.035173 -0.031556 -0.009941 -0.026634
IsHoliday_x -0.035586 0.334818 0.427960 -0.000562 -0.053719
IsHoliday_y -0.035586 0.334818 0.427960 -0.000562 -0.053719
Fuel_Price 0.061371 -0.220895 -0.102092 -0.044986 -0.128065
  • MarkDown1 and MarkDown4 are strongly correlated (0.82), so one of them is enough; drop MarkDown4
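
Rather than reading the matrix by eye, redundant pairs such as MarkDown1 and MarkDown4 can be listed automatically. A sketch under the assumption that an absolute Pearson correlation above 0.8 counts as redundant; the threshold, the helper name `high_corr_pairs` and the toy columns are all illustrative choices:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: b is almost a multiple of a, c is unrelated
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 17],
    "c": [1, -1, 1, -1, 1, -1, 1, -1],
})
pairs = high_corr_pairs(demo)
print(pairs)
```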

2.3 Data Cleaning

2.3.1 Handling Missing Values

# MarkDown: leave the missing MarkDown values in the training set alone for now; later the data is split into two datasets, one that keeps the rows with missing MarkDown and fills them, and one that drops those rows
test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
test_df[["CPI","Unemployment"]] = test_df[["CPI","Unemployment"]].ffill()
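
Forward fill copies the last known value down into the gaps that follow it. The toy sketch below (made-up values) shows that behaviour, together with a per-store variant built on groupby. The per-store variant is an optional variation for comparison, not something the code above does:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "CPI": [211.0, np.nan, np.nan, np.nan, 130.0],
})

# Plain forward fill: the gap at the start of store 2 inherits store 1's value
plain = demo["CPI"].ffill()

# Per-store forward fill: values never cross a store boundary
per_store = demo.groupby("Store")["CPI"].ffill()

print(plain.tolist())
print(per_store.tolist())
```

With the per-store variant a gap at the very start of a store stays empty, so it would still need a fallback such as a backward fill.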

2.3.2 Creating New Features


train_df = pd.get_dummies(train_df, columns=["Type"])
test_df = pd.get_dummies(test_df, columns=["Type"])
Store Dept Date Weekly_Sales IsHoliday_x Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday_y Size Type_A Type_B Type_C
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False 151315 1 0 0
1 1 1 2010-02-12 46039.49 True 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True 151315 1 0 0
2 1 1 2010-02-19 41595.55 False 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False 151315 1 0 0
3 1 1 2010-02-26 19403.54 False 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False 151315 1 0 0
4 1 1 2010-03-05 21827.90 False 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False 151315 1 0 0


train_df['Month'] = pd.to_datetime(train_df['Date']).dt.month
test_df["Month"] = pd.to_datetime(test_df['Date']).dt.month


train_df.loc[(train_df["Temperature"]<22.01)|(train_df["Temperature"]>91.03), "Is_temp_extr"]=1
train_df.loc[(train_df["Temperature"]>=22.01)& (train_df["Temperature"]<=91.03), "Is_temp_extr"]=0

test_df.loc[(test_df["Temperature"]<22.01)|(test_df["Temperature"]>91.03), "Is_temp_extr"]=1
test_df.loc[(test_df["Temperature"]>=22.01)& (test_df["Temperature"]<=91.03), "Is_temp_extr"]=0

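train_df.corr(numeric_only=True).Weekly_Sales.sort_values(ascending=False)[["Temperature", "Is_temp_extr"]]
# The new feature's correlation with Weekly_Sales is more than ten times larger in magnitude; remember to drop the original Temperature column later.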
Temperature    -0.002312
Is_temp_extr   -0.030016
Name: Weekly_Sales, dtype: float64
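
The two .loc assignments per frame can also be written as a single vectorised expression. A sketch of an equivalent flag using Series.between, with the 22.01 and 91.03 cut-offs taken from the code above; the helper name `flag_outside` is hypothetical. Unlike the .loc version, which leaves NaN inputs as NaN, this one flags them as 1, which makes no difference here because Temperature has no nulls:

```python
import pandas as pd

def flag_outside(series, low, high):
    """Return 1.0 where the value lies outside [low, high], else 0.0."""
    return (~series.between(low, high)).astype(float)

temps = pd.Series([10.0, 22.01, 60.0, 91.03, 95.5])
flags = flag_outside(temps, 22.01, 91.03)
print(flags.tolist())
```

Both boundaries count as "not extreme", matching the >= and <= comparisons in the original assignments.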


train_df.loc[train_df["Fuel_Price"]>3.47, "Is_fuel_expen"]=1
train_df.loc[train_df["Fuel_Price"]<=3.47, "Is_fuel_expen"]=0
train_df.corr(numeric_only=True).Weekly_Sales.sort_values(ascending=False)[["Fuel_Price", "Is_fuel_expen"]]
Fuel_Price      -0.000120
Is_fuel_expen   -0.006626
Name: Weekly_Sales, dtype: float64


# Holiday weeks get weight 5 and all other weeks weight 1, matching the evaluation weights in 1.2
train_df["IsHoliday"] = train_df["IsHoliday_x"].map({True: 5.0, False: 1.0})
test_df["IsHoliday"] = test_df["IsHoliday_x"].map({True: 5.0, False: 1.0})

train_df.corr(numeric_only=True).Weekly_Sales.sort_values(ascending=False)[["IsHoliday_x", "IsHoliday"]]
IsHoliday_x    0.012774
IsHoliday      0.012774
Name: Weekly_Sales, dtype: float64

2.3.3 Dropping Features

train_df = train_df.drop(["IsHoliday_x", "IsHoliday_y",'MarkDown4',"Date", "Temperature", "Fuel_Price","Is_fuel_expen"], axis=1)

id = test_df["Store"].astype(str)+"_"+test_df["Dept"].astype(str)+"_"+test_df["Date"].astype(str)
test_df = test_df.drop(["IsHoliday_x", "IsHoliday_y", "MarkDown4", "Date","Temperature", "Fuel_Price"], axis=1) 

2.3.4 Final Check



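# train_df_one keeps every row and fills the missing MarkDown values with 0
train_df_one = train_df.copy()
train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
# train_df_two keeps only the rows in which every MarkDown value is present
train_df_two = train_df.dropna()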


Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Weekly_Sales    421570 non-null float64
MarkDown1       421570 non-null float64
MarkDown2       421570 non-null float64
MarkDown3       421570 non-null float64
MarkDown5       421570 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
Size            421570 non-null int64
Type_A          421570 non-null uint8
Type_B          421570 non-null uint8
Type_C          421570 non-null uint8
Month           421570 non-null int64
Is_temp_extr    421570 non-null float64
IsHoliday       421570 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 46.2 MB

Int64Index: 101480 entries, 92 to 421569
Data columns (total 16 columns):
Store           101480 non-null int64
Dept            101480 non-null int64
Weekly_Sales    101480 non-null float64
MarkDown1       101480 non-null float64
MarkDown2       101480 non-null float64
MarkDown3       101480 non-null float64
MarkDown5       101480 non-null float64
CPI             101480 non-null float64
Unemployment    101480 non-null float64
Size            101480 non-null int64
Type_A          101480 non-null uint8
Type_B          101480 non-null uint8
Type_C          101480 non-null uint8
Month           101480 non-null int64
Is_temp_extr    101480 non-null float64
IsHoliday       101480 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 11.1 MB

Int64Index: 115064 entries, 0 to 115063
Data columns (total 15 columns):
Store           115064 non-null int64
Dept            115064 non-null int64
MarkDown1       115064 non-null float64
MarkDown2       115064 non-null float64
MarkDown3       115064 non-null float64
MarkDown5       115064 non-null float64
CPI             115064 non-null float64
Unemployment    115064 non-null float64
Size            115064 non-null int64
Type_A          115064 non-null uint8
Type_B          115064 non-null uint8
Type_C          115064 non-null uint8
Month           115064 non-null int64
Is_temp_extr    115064 non-null float64
IsHoliday       115064 non-null float64
dtypes: float64(8), int64(4), uint8(3)
memory usage: 11.7 MB

2.4 Models and Prediction


import time
import os
from sklearn.metrics import mean_absolute_error
from sklearn.base import clone

class Tester():
    def __init__(self, target):
        self.target = target
        self.datasets = {}
        self.models = {}
        self.scores = {}
        self.cache = {}  # a simple cache to speed things up

    def addDataset(self, name, df):
        self.datasets[name] = df.copy()

    def addModel(self, name, model):
        self.models[name] = model
    def clearModels(self):
        self.models = {}

    def clearCache(self):
        self.cache = {}
    def testModelWithDataset(self, m_name, df_name, sample_len, cv):
        if (m_name, df_name, sample_len, cv) in self.cache:
            return self.cache[(m_name, df_name, sample_len, cv)]

        clf = clone(self.models[m_name])
        if not sample_len: 
            sample = self.datasets[df_name]
        else: sample = self.datasets[df_name].sample(sample_len)
        X = sample.drop([self.target], axis=1)
        Y = sample[self.target]

        weights = X["IsHoliday"]
        clf.fit(X, Y)
        Y_pred = clf.predict(X)
        s = mean_absolute_error(Y, Y_pred, sample_weight=weights)
        self.cache[(m_name, df_name, sample_len, cv)] = s

        return s

    def runTests(self, sample_len=97056, cv=3):
        # Test every added model on every added dataset
        for m_name in self.models:
            for df_name in self.datasets:
                # print('Testing %s' % str((m_name, df_name)), end='')
                start = time.time()

                score = self.testModelWithDataset(m_name, df_name, sample_len, cv)
                self.scores[(m_name, df_name)] = score
                end = time.time()
                # print(' -- %0.2fs ' % (end - start))

        print('--- Top 10 Results ---')
        # Lower is better for an error metric; update this sort if the scoring metric changes
        for score in sorted(self.scores.items(), key=lambda x: x[1])[:10]:
            print(score[0], score[1])

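    def obtain_result(self, X_test):
        # Scores are keyed by (model name, dataset name); take the pair with the lowest error
        m_name, df_name = sorted(self.scores.items(), key=lambda x: x[1])[0][0]
        df = self.datasets[df_name]
        # The stored models are never fitted (only clones are), so refit on the full dataset
        clf = clone(self.models[m_name]).fit(df.drop([self.target], axis=1), df[self.target])
        return clf.predict(X_test)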
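
Note that testModelWithDataset scores each model on the same sample it was fitted on, and the cv argument only serves as part of the cache key. A held-out estimate could look like the sketch below, which averages the weighted MAE over K folds. The helper name `cv_wmae`, the synthetic data and the shuffled KFold split are assumptions for illustration; for weekly sales a time-ordered split may be a better fit:

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

def cv_wmae(model, X, y, weights, n_splits=3, seed=0):
    """Average weighted MAE over held-out folds."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        # Fit a fresh clone on the training folds, score on the held-out fold
        clf = clone(model).fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[val_idx])
        scores.append(mean_absolute_error(y[val_idx], pred, sample_weight=weights[val_idx]))
    return float(np.mean(scores))

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)
weights = np.where(rng.random(300) < 0.1, 5.0, 1.0)

score = cv_wmae(RandomForestRegressor(n_estimators=20, random_state=0), X, y, weights)
print(score)
```

Scores obtained this way are usually higher than training-set scores for tree ensembles, which can fit the training sample almost perfectly.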

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPRegressor

# The same tester object is used for all models
tester = Tester('Weekly_Sales')

# Add the datasets
tester.addDataset('all_markdown', train_df_one)
tester.addDataset('wipe_markdown', train_df_two)

# Add the models
knn_reg = KNeighborsRegressor(n_neighbors=10)
tree_reg = ExtraTreesRegressor(n_estimators=100, max_features=1.0, verbose=1, n_jobs=1)  # max_features=1.0 uses all features, as 'auto' did for regressors before it was removed
rf_reg = RandomForestRegressor(n_estimators=100,max_features='log2', verbose=1)
svr_reg = SVR(kernel='rbf', gamma='auto')
mlp_reg = MLPRegressor(hidden_layer_sizes=(10,),  activation='relu', verbose=3)
gbrt_reg = GradientBoostingRegressor(max_depth=8, warm_start=True)
tester.addModel('KNeighborsRegressor', knn_reg)
tester.addModel('ExtraTreesRegressor', tree_reg)
tester.addModel('RandomForestRegressor', rf_reg)
tester.addModel('SVR', svr_reg)
tester.addModel('MLPRegressor', mlp_reg)
tester.addModel('GradientBoostingRegressor', gbrt_reg)

# Run the tests
tester.runTests()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   26.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   28.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.8s finished
X = train_df_one.drop(["Weekly_Sales"], axis=1)
Y = train_df_one["Weekly_Sales"]

gbrt_reg.fit(X, Y)
Y_pred = gbrt_reg.predict(test_df)
submission = pd.DataFrame({
        "Id": id,
        "Weekly_Sales": pd.DataFrame(Y_pred)[0]
})

submission.to_csv('submission.csv', index=False)
