Overall approach
TODO LIST (to finish by the Dragon Boat Festival; the basic materials are already in hand, so not much more data collection is needed):
How much camping gear will one store sell each month in a year?
In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days.
The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.
Time range: [2011-01-29, 2016-06-19], 1969 days in total.
/kaggle/input/m5-forecasting-accuracy/sample_submission.csv
/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv
/kaggle/input/m5-forecasting-accuracy/sell_prices.csv
/kaggle/input/m5-forecasting-accuracy/calendar.csv
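A quick sanity check of the stated time range against calendar.csv (a minimal sketch using the input path listed above):

import pandas as pd

cal = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv')
print(cal['date'].min(), cal['date'].max())  # 2011-01-29 2016-06-19
print(len(cal))  # 1969 rows, one per day: d_1 .. d_1969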
Each row contains an id that is a concatenation of an item_id and a store_id, plus a suffix that is either validation (corresponding to the Public leaderboard) or evaluation (corresponding to the Private leaderboard).
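The id is therefore easy to pull apart with plain string operations; a small illustrative sketch:

# Illustrative: split one id into item_id, store_id, and the leaderboard split.
example_id = 'HOBBIES_1_001_CA_1_validation'
parts = example_id.split('_')
item_id = '_'.join(parts[:3])    # 'HOBBIES_1_001'
store_id = '_'.join(parts[3:5])  # 'CA_1'
split = parts[5]                 # 'validation' (Public LB) or 'evaluation' (Private LB)
print(item_id, store_id, split)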
In the challenge, you are predicting item sales at stores in various locations for two 28-day time periods.
id,F1,...,F28
HOBBIES_1_001_CA_1_validation,0,...,2
HOBBIES_1_002_CA_1_validation,2,...,11
...
HOBBIES_1_001_CA_1_evaluation,3,...,7
HOBBIES_1_002_CA_1_evaluation,1,...,4
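Each item/store series thus appears twice in the sample submission, once per split; a quick structural check (assuming ss is the sample submission loaded further down):

print(ss.shape)  # (60980, 29): the id column plus F1..F28
print(ss['id'].str.endswith('_validation').sum())  # 30490 series
print(ss['id'].str.endswith('_evaluation').sum())  # 30490 series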
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])
# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns
# Below we chain the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index and keep only the sales data columns.
# 3. Transpose so that the days become rows.
# 4. Plot the data.
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
.set_index('id')[d_cols] \
.T \
.plot(figsize=(15, 5),
title='FOODS_3_090_CA_3 sales by "d" number',
color=next(color_cycle))
plt.legend('')
plt.show()
# Merge the calendar onto our item's data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T
example = example.rename(columns={8412: 'FOODS_3_090_CA_3'})  # the transposed column is named by its original row index
example = example.reset_index().rename(columns={'index': 'd'})  # turn the index into a 'd' column
example = example.merge(cal, how='left', validate='1:1')
example.set_index('date')['FOODS_3_090_CA_3'] \
.plot(figsize=(15, 5),
color=next(color_cycle),
title='FOODS_3_090_CA_3 sales by actual sale dates')
plt.show()
# Pick two more top-selling examples
example2 = stv.loc[stv['id'] == 'HOBBIES_1_234_CA_3_validation'][d_cols].T
example2 = example2.rename(columns={6324: 'HOBBIES_1_234_CA_3'})  # rename from the original row index
example2 = example2.reset_index().rename(columns={'index': 'd'})  # turn the index into a 'd' column
example2 = example2.merge(cal, how='left', validate='1:1')
example3 = stv.loc[stv['id'] == 'HOUSEHOLD_1_118_CA_3_validation'][d_cols].T
example3 = example3.rename(columns={6776: 'HOUSEHOLD_1_118_CA_3'})  # rename from the original row index
example3 = example3.reset_index().rename(columns={'index': 'd'})  # turn the index into a 'd' column
example3 = example3.merge(cal, how='left', validate='1:1')
examples = ['FOODS_3_090_CA_3','HOBBIES_1_234_CA_3','HOUSEHOLD_1_118_CA_3']
example_df = [example, example2, example3]
for i in [0, 1, 2]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 3))
    example_df[i].groupby('wday')[examples[i]].mean() \
.plot(kind='line',
title='average sale: day of week',
lw=5,
color=color_pal[0],
ax=ax1)
    example_df[i].groupby('month')[examples[i]].mean() \
.plot(kind='line',
title='average sale: month',
lw=5,
color=color_pal[4],
ax=ax2)
    example_df[i].groupby('year')[examples[i]].mean() \
.plot(kind='line',
lw=5,
title='average sale: year',
color=color_pal[2],
ax=ax3)
fig.suptitle(f'Trends for item: {examples[i]}',
size=20,
y=1.1)
plt.tight_layout()
plt.show()
twenty_examples = stv.sample(20, random_state=529) \
.set_index('id')[d_cols] \
.T \
.merge(cal.set_index('d')['date'],
left_index=True,
right_index=True,
validate='1:1') \
.set_index('date')
fig, axs = plt.subplots(10, 2, figsize=(15, 20))
axs = axs.flatten()
ax_idx = 0
for item in twenty_examples.columns:
twenty_examples[item].plot(title=item,
color=next(color_cycle),
ax=axs[ax_idx])
ax_idx += 1
plt.tight_layout()
plt.show()
stv.groupby('cat_id').count()['id'] \
.sort_values() \
.plot(kind='barh', figsize=(15, 5), title='Count of Items by Category')
plt.show()
We are provided data for 10 unique stores. What are the total sales by store?
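One way to answer this, in the same style as the category count plot above (a minimal sketch):

# Sum each store's daily unit sales over the whole training window.
stv.groupby('store_id')[d_cols].sum().sum(axis=1) \
    .sort_values() \
    .plot(kind='barh', figsize=(15, 5), title='Total Sales by Store')
plt.show()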
It appears that Walmart stores are closed on Christmas Day. The highest-demand day in the whole dataset was Sunday, March 6th, 2016. What happened on this day, you may ask... well, the seventh Democratic presidential candidates debate, hosted by CNN, was held in Flint, Michigan... I doubt that impacted sales though :D
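The closure claim is easy to verify: total up every series per day and look at the lowest-volume dates (a quick sketch reusing cal and stv):

# Days with the lowest total unit sales across all 30,490 series.
daily_totals = stv[d_cols].sum()
daily_totals.index = cal.set_index('d').loc[daily_totals.index, 'date']
print(daily_totals.sort_values().head(10))  # expect the December 25ths at or near zero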
We are given the historical sell prices of each item. Let's take a look at our example item from before.
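A minimal sketch for plotting that item's price history (prices are weekly, keyed by wm_yr_wk rather than by day):

sellp.query('item_id == "FOODS_3_090" and store_id == "CA_3"') \
    .set_index('wm_yr_wk')['sell_price'] \
    .plot(figsize=(15, 5), title='FOODS_3_090_CA_3 sell price by week',
          color=next(color_cycle))
plt.show()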
2.2 Data quality (outlier handling)
2.3 Correlation analysis
2.4 Preliminary feature engineering
2.5 Build a baseline flat (wide) table
2.6 Set up a pipeline
3.1 Initial model selection
3.2 Initial model comparison of commonly used models, e.g. ARIMA, moving average, random forest, XGBoost, LightGBM, GBDT, etc.
4.1 Hyperparameter tuning - grid search
4.2 Feature engineering - finding features (feature construction, external data, domain knowledge)
4.3 Model ensembling
Overall:
The main purpose is to build up a methodology of my own.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
%matplotlib inline
import matplotlib.pyplot as plt
sales_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv')
sell_prices_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sell_prices.csv')
calendar_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv')
sample_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sample_submission.csv')
from sklearn import preprocessing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from tqdm import tqdm
def one_hot_transf(feature_list, src_df_in):
    # Expand each feature in feature_list into one-hot indicator columns.
    src_df = src_df_in.copy()
    # LabelEncoder maps raw categories to integer codes;
    # OneHotEncoder then expands the codes into indicator columns.
    gle = preprocessing.LabelEncoder()
    gen_ohe = preprocessing.OneHotEncoder()
    for feature in feature_list:
        src_df[feature] = gle.fit_transform(src_df[feature].astype(str))
        feature_labels = [f'{feature}_{c}' for c in gle.classes_]
        transformed_features = gen_ohe.fit_transform(src_df[[feature]]).toarray()
        transformed_feature_df = pd.DataFrame(transformed_features,
                                              columns=feature_labels,
                                              index=src_df.index)
        # Accumulate into src_df and drop the encoded source column, so that
        # every feature in the list ends up one-hot (not just the last one).
        src_df = pd.concat([src_df, transformed_feature_df], axis=1)
        src_df.drop(feature, axis=1, inplace=True)
    return src_df
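A toy usage sketch (made-up data) just to show the output shape of the helper above:

# Toy example: one categorical column expands into three indicator columns.
toy = pd.DataFrame({'event_type_1': ['Sporting', 'Cultural', None],
                    'pro_num': [3, 0, 5]})
print(one_hot_transf(['event_type_1'], toy))
#    pro_num  event_type_1_Cultural  event_type_1_None  event_type_1_Sporting
# 0        3                    0.0                0.0                    1.0
# 1        0                    1.0                0.0                    0.0
# 2        5                    0.0                1.0                    0.0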
def fetch_sku_df(sku_name):
    name_list = sku_name.split('_')
    # item code at three category levels, e.g. HOBBIES / HOBBIES_1 / HOBBIES_1_001
    prod_l_cat_id = name_list[0]
    prod_m_cat_id = name_list[0] + '_' + name_list[1]
    prod_s_cat_id = name_list[0] + '_' + name_list[1] + '_' + name_list[2]
    # store code, e.g. CA_1
    shop_code = name_list[3] + '_' + name_list[4]
    # fetch the sales row for this item/store pair and reshape it to one row per day
    sales_sku_df = sales_df[(sales_df['item_id'] == prod_s_cat_id) & (sales_df['store_id'] == shop_code)]
    sales_sku_df = sales_sku_df.transpose()
    sales_sku_df.reset_index(drop=False, inplace=True)
    sales_sku_df.columns = ['d_name', 'pro_num']
    # attach calendar info; the id rows of the transpose match no 'd' value and drop out
    sku_df = pd.merge(calendar_df, sales_sku_df, left_on='d', right_on='d_name', how='left')
    sku_df.drop('d_name', axis=1, inplace=True)
    sku_df['pro_num'] = pd.to_numeric(sku_df['pro_num'])  # the transpose leaves an object dtype
    # attach price info (weekly, keyed on wm_yr_wk)
    sell_prices_item = sell_prices_df[(sell_prices_df['store_id'] == shop_code) & (sell_prices_df['item_id'] == prod_s_cat_id)].copy()
    sell_prices_item.drop(['store_id', 'item_id'], axis=1, inplace=True)
    sku_df = pd.merge(sku_df, sell_prices_item, how='left', on='wm_yr_wk')
    # drop the snap columns of the other two states plus bookkeeping columns
    state_list = ['CA', 'TX', 'WI']
    state_list.remove(name_list[3])
    drop_cols = ['snap_' + i for i in state_list] + ['date', 'weekday', 'wm_yr_wk', 'd']
    sku_df.drop(drop_cols, axis=1, inplace=True)
    # data processing
    sku_df['year'] = sku_df['year'] - sku_df['year'].min()
    categoric_columns = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    sku_df_cat = one_hot_transf(categoric_columns, sku_df)
    numeric_columns = ['wday', 'month', 'year']
    sku_df = one_hot_transf(numeric_columns, sku_df_cat)
    # keep only the rows where the item was on sale (has a price) or actually sold
    sku_df = sku_df[(sku_df['sell_price'].notnull()) | (sku_df['pro_num'] != 0)]
    sku_df = sku_df.loc[:, ~((sku_df == 0).all())]  # drop all-zero columns
    return sku_df.copy()
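For orientation, a single call builds one modeling table per item/store pair; e.g. with the first id from the sample submission:

# Rows are days; columns are the encoded features plus the target pro_num.
sku_df = fetch_sku_df('HOBBIES_1_001_CA_1_validation')
print(sku_df.shape)
print(sku_df.columns.tolist()[:8])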
def regression_method(model, x_train, y_train, x_val, y_val):
    model.fit(x_train, y_train)
    score = model.score(x_val, y_val)          # R^2 on the validation set
    result = model.predict(x_val)
    residual_square = (result - y_val) ** 2    # squared residuals
    RSS = residual_square.sum()                # residual sum of squares
    MSE = np.mean(residual_square)             # mean squared error
    num_regress = len(result)                  # number of validation samples
    print(f'n = {num_regress}, R^2 = {score:.4f}, RSS = {RSS:.4f}, MSE = {MSE:.6f}')
    return model
def fetch_skus(mode='test'):
    results = sample_df.copy()
    results.set_index('id', inplace=True)
    id_list = list(results.index)
    sku_num = 3  # in test mode only fit the first few ids to keep the run short
    if mode == 'test':
        skus = id_list[0:sku_num]
    else:
        skus = id_list
    for item in tqdm(skus):
        # 1. fetch the flat table for this sku and scale target/price to [0, 1]
        sku_flat_table = fetch_sku_df(item)
        max_pro_num = sku_flat_table['pro_num'].max()
        sku_flat_table['pro_num'] = sku_flat_table['pro_num'] / max_pro_num
        max_sell_price = sku_flat_table['sell_price'].max()
        sku_flat_table['sell_price'] = sku_flat_table['sell_price'] / max_sell_price
        # 2. split the data into train, validation and test sets
        train_val_df = sku_flat_table.iloc[0:-56, :]   # history: everything but the last 56 days
        train_val_col = train_val_df.columns.values.tolist()
        train_val_col.remove('pro_num')
        train_val_x = train_val_df[train_val_col].values
        train_val_y = train_val_df['pro_num'].values
        train_x, validation_x, train_y, validation_y = train_test_split(
            train_val_x, train_val_y, test_size=0.33, random_state=0)
        test_df = sku_flat_table.tail(56)              # the 56 future days to predict
        test_col = test_df.columns.values.tolist()
        test_col.remove('pro_num')
        test_x = test_df[test_col].values
        # 3. build the model, then train and validate it
        model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=800)
        model_rf = regression_method(model_RandomForestRegressor, train_x, train_y,
                                     validation_x, validation_y)
        # 4. predict on the test set and undo the scaling
        prediction = model_rf.predict(test_x)
        prediction = (prediction * max_pro_num).astype(int)
        # validation ids cover the first 28 test days, evaluation ids the next 28
        if item.endswith('_evaluation'):
            results.loc[item, :] = prediction[28:56]
        else:
            results.loc[item, :] = prediction[0:28]
    return results
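A serial trial run of the function above; mode='test' fits only the first three ids, and any other value fits them all:

results = fetch_skus(mode='test')
print(results.head(3))  # three filled rows, the rest keep the sample zeros
# results.reset_index().to_csv('submission.csv', index=False)  # hypothetical output filename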
import multiprocessing
import math

cores = multiprocessing.cpu_count()
print('cores = ', cores)
# The Pool itself is created further down, after the worker function is defined.
def predict_sku(item):
    # Worker for the multiprocessing pool: fit one sku and return its predictions.
    print('item = ', item)
    result = sample_df[sample_df['id'] == item].copy()
    result.set_index('id', inplace=True)
    # 1. fetch the flat table for this sku and scale target/price to [0, 1]
    sku_flat_table = fetch_sku_df(item)
    max_pro_num = sku_flat_table['pro_num'].max()
    sku_flat_table['pro_num'] = sku_flat_table['pro_num'] / max_pro_num
    max_sell_price = sku_flat_table['sell_price'].max()
    sku_flat_table['sell_price'] = sku_flat_table['sell_price'] / max_sell_price
    # 2. split the data into train, validation and test sets
    train_val_df = sku_flat_table.iloc[0:-56, :]
    train_val_col = train_val_df.columns.values.tolist()
    train_val_col.remove('pro_num')
    train_val_x = train_val_df[train_val_col].values
    train_val_y = train_val_df['pro_num'].values
    train_x, validation_x, train_y, validation_y = train_test_split(
        train_val_x, train_val_y, test_size=0.33, random_state=0)
    test_df = sku_flat_table.tail(56)
    test_col = test_df.columns.values.tolist()
    test_col.remove('pro_num')
    test_x = test_df[test_col].values
    # 3. build the model, then train and validate it
    model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=800)
    model_rf = regression_method(model_RandomForestRegressor, train_x, train_y,
                                 validation_x, validation_y)
    # 4. predict on the test set and undo the scaling
    prediction = model_rf.predict(test_x)
    prediction = (prediction * max_pro_num).astype(int)
    # validation ids cover the first 28 test days, evaluation ids the next 28
    if item.endswith('_evaluation'):
        result.loc[item, :] = prediction[28:56]
    else:
        result.loc[item, :] = prediction[0:28]
    result.reset_index(inplace=True)
    return result
# Create the worker pool only after predict_sku exists, so the forked
# workers can find the function in __main__.
pool = multiprocessing.Pool(processes=cores)
item_list = list(sample_df['id'])[0:50]  # trial run on the first 50 ids
epoch = 20                               # batch size handed to pool.map at a time
epoch_num = math.ceil(len(item_list) / epoch)
item_list_length = len(item_list)
print('item_list_length = ', item_list_length)
final = []
for i in tqdm(range(epoch_num)):
    print('i =', i)
    sku_list = item_list[epoch * i:min(epoch * (i + 1), item_list_length)]
    result_df = pool.map(predict_sku, sku_list)
    final += result_df
pool.close()
pool.join()
print('------------------ Finished !!! --------------')
output_df = pd.concat(final, axis=0)
output_df.to_csv('output.csv', index=False)