Kaggle Competition - A Complete Record of a Hands-on Sales Forecasting Project

M5 Forecasting - Accuracy

Overall Approach

TODO LIST (to finish over the Dragon Boat Festival; the basic materials are already in place, so not much more collection is needed):

  • 1. Complete the data analysis from five angles - my data-analysis baseline;
  • 2. Complete the data-engineering pipeline - based on a LightGBM model walkthrough;
  • 3. Reduce the data volume by randomly sampling 1/10 of the original data;
  • 4. Implement RMSE, MAPE, and WMAPE evaluation metrics (a sketch follows this list);
  • 5. Tune a single LightGBM model via grid search;
  • 6. Ensemble multiple models: Prophet, RF, and LightGBM;
  • 7. Write up the report;
  • 8. Extend to large-warehouse, high-frequency SKU forecasting;
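
For item 4, a minimal sketch of the three metrics, assuming `y_true` and `y_pred` are NumPy arrays of actual and predicted daily sales. (Note the competition itself is scored with WRMSSE; these three are for internal comparison.)

import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-9):
    # mean absolute percentage error; eps guards against zero-sales days,
    # which are common in this dataset and make plain MAPE explode
    return np.mean(np.abs((y_true - y_pred) / (y_true + eps)))

def wmape(y_true, y_pred):
    # weighted MAPE: errors are weighted by actual volume,
    # so slow movers do not dominate the score
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))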

Part I. Understanding the Problem


1.1 OBJECTIVE:

How much camping gear will one store sell each month in a year?

In this competition, the fifth iteration, you will use hierarchical sales data from Walmart, the world’s largest company by revenue, to forecast daily sales for the next 28 days.

1.1.1 Input Data

The data covers stores in three US states (California, Texas, and Wisconsin) and includes item-level, department, product-category, and store details. In addition, it has explanatory variables such as price, promotions, day of the week, and special events. Together, this robust dataset can be used to improve forecasting accuracy.

Time range: [2011-01-29, 2016-06-19]

1,969 days in total.

/kaggle/input/m5-forecasting-accuracy/sample_submission.csv
/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv
/kaggle/input/m5-forecasting-accuracy/sell_prices.csv
/kaggle/input/m5-forecasting-accuracy/calendar.csv

1.1.2 Output and Submission Format

Each row contains an id that is a concatenation of an item_id and a store_id, which is either validation (corresponding to the Public leaderboard), or evaluation (corresponding to the Private leaderboard). 

In the challenge, you are predicting item sales at stores in various locations for two 28-day time periods.

id,F1,...F28
HOBBIES_1_001_CA_1_validation,0,...,2
HOBBIES_1_002_CA_1_validation,2,...,11
...
HOBBIES_1_001_CA_1_evaluation,3,...,7
HOBBIES_1_002_CA_1_evaluation,1,...,4

1.1.3 Baseline model (see Appendix I)

1.1.4 Optimization Methodology:

  • Data analysis: choose the model according to the sales-volume distribution, e.g. whether there are patterns to exploit, whether volumes are very low, whether sales are stable, and whether there are outliers;
  • Use historical sales, combined with moving averages, as features to predict future sales (a sketch follows the setup code below);

Setup and data loading:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle
pd.set_option('display.max_columns', 50)
plt.style.use('bmh')
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(plt.rcParams['axes.prop_cycle'].by_key()['color'])

# Read in the data
INPUT_DIR = '../input/m5-forecasting-accuracy'
cal = pd.read_csv(f'{INPUT_DIR}/calendar.csv')
stv = pd.read_csv(f'{INPUT_DIR}/sales_train_validation.csv')
ss = pd.read_csv(f'{INPUT_DIR}/sample_submission.csv')
sellp = pd.read_csv(f'{INPUT_DIR}/sell_prices.csv')
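
As promised above, a minimal sketch of lag and moving-average features. This melts the wide sales table into long format, which is memory-hungry on the full data (hence TODO item 3's 1/10 sampling); a lag of 28 days matches the forecast horizon, so the features stay leakage-free. The column names are illustrative:

# wide -> long: one row per (item id, day)
d_cols = [c for c in stv.columns if c.startswith('d_')]
df = stv.melt(id_vars=['id'], value_vars=d_cols, var_name='d', value_name='sales')

# lag features: sales 28, 35, and 42 days back
for lag in [28, 35, 42]:
    df[f'lag_{lag}'] = df.groupby('id')['sales'].shift(lag)

# moving averages computed on top of the 28-day lag
for window in [7, 28]:
    df[f'rmean_28_{window}'] = (df.groupby('id')['sales']
                                  .transform(lambda s: s.shift(28).rolling(window).mean()))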

Part II. Data Analysis


2.1 Data Scope Analysis

  • Visualizing the data for a single item

d_cols = [c for c in stv.columns if 'd_' in c] # sales data columns

# Below we are chaining the following steps in pandas:
# 1. Select the item.
# 2. Set the id as the index, keep only the sales-data columns
# 3. Transform so it's a column
# 4. Plot the data
stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'] \
    .set_index('id')[d_cols] \
    .T \
    .plot(figsize=(15, 5),
          title='FOODS_3_090_CA_3 sales by "d" number',
          color=next(color_cycle))
plt.legend('')
plt.show()

Merging the data with real dates

# Merge calendar on our items' data
example = stv.loc[stv['id'] == 'FOODS_3_090_CA_3_validation'][d_cols].T
example = example.rename(columns={8412:'FOODS_3_090_CA_3'}) # Name it correctly
example = example.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example = example.merge(cal, how='left', validate='1:1')
example.set_index('date')['FOODS_3_090_CA_3'] \
    .plot(figsize=(15, 5),
          color=next(color_cycle),
          title='FOODS_3_090_CA_3 sales by actual sale dates')
plt.show()

# Select more top selling examples
example2 = stv.loc[stv['id'] == 'HOBBIES_1_234_CA_3_validation'][d_cols].T
example2 = example2.rename(columns={6324:'HOBBIES_1_234_CA_3'}) # Name it correctly
example2 = example2.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example2 = example2.merge(cal, how='left', validate='1:1')

example3 = stv.loc[stv['id'] == 'HOUSEHOLD_1_118_CA_3_validation'][d_cols].T
example3 = example3.rename(columns={6776:'HOUSEHOLD_1_118_CA_3'}) # Name it correctly
example3 = example3.reset_index().rename(columns={'index': 'd'}) # make the index "d"
example3 = example3.merge(cal, how='left', validate='1:1')

 

Sales broken down by time variables

  • Now that we have our example item, let's see how it sells by:
    • Day of the week
    • Month
    • Year
examples = ['FOODS_3_090_CA_3','HOBBIES_1_234_CA_3','HOUSEHOLD_1_118_CA_3']
example_df = [example, example2, example3]
for i in [0, 1, 2]:
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 3))
    example_df[i].groupby('wday')[examples[i]].mean() \
        .plot(kind='line',
              title='average sale: day of week',
              lw=5,
              color=color_pal[0],
              ax=ax1)
    example_df[i].groupby('month')[examples[i]].mean() \
        .plot(kind='line',
              title='average sale: month',
              lw=5,
              color=color_pal[4],
              ax=ax2)
    example_df[i].groupby('year')[examples[i]].mean() \
        .plot(kind='line',
              lw=5,
              title='average sale: year',
              color=color_pal[2],
              ax=ax3)
    fig.suptitle(f'Trends for item: {examples[i]}',
                 size=20,
                 y=1.1)
    plt.tight_layout()
    plt.show()

 

Let's look at a lot of different items!

  • Let's put it all together and plot 20 different items and their sales
  • Some observations from these plots:
    • It is common to see an item unavailable for a period of time.
    • Some items sell one unit or fewer per day, making them very hard to predict.
    • Other items show spikes in demand (Super Bowl Sunday?); the "events" provided to us could possibly help with these.
twenty_examples = stv.sample(20, random_state=529) \
    .set_index('id')[d_cols] \
    .T \
    .merge(cal.set_index('d')['date'],
           left_index=True,
           right_index=True,
           validate='1:1') \
    .set_index('date')
fig, axs = plt.subplots(10, 2, figsize=(15, 20))
axs = axs.flatten()
ax_idx = 0
for item in twenty_examples.columns:
    twenty_examples[item].plot(title=item,
                               color=next(color_cycle),
                               ax=axs[ax_idx])
    ax_idx += 1
plt.tight_layout()
plt.show()

 

Combined Sales over Time by Type

  • We have three item categories:
    • Hobbies
    • Household
    • Foods
  • Let's plot the total demand over time for each category (a count of items per category comes first; a demand sketch follows it)
stv.groupby('cat_id').count()['id'] \
    .sort_values() \
    .plot(kind='barh', figsize=(15, 5), title='Count of Items by Category')
plt.show()
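
The bar chart above only counts items per category. A sketch of the total-demand-over-time plot promised in the bullet, reusing the frames loaded earlier (`stv`, `cal`, `d_cols`):

# total sales per day, indexed by real dates
past_sales = stv.set_index('id')[d_cols].T \
    .merge(cal.set_index('d')['date'],
           left_index=True, right_index=True, validate='1:1') \
    .set_index('date')

for cat in ['HOBBIES', 'HOUSEHOLD', 'FOODS']:
    items_col = [c for c in past_sales.columns if cat in c]
    past_sales[items_col].sum(axis=1).plot(figsize=(15, 5), alpha=0.8)
plt.title('Total Sales by Item Category')
plt.legend(['HOBBIES', 'HOUSEHOLD', 'FOODS'])
plt.show()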

 

Rollout of items being sold.

  • We can see that some items come into supply that previously didn't exist; similarly, some items stop being sold completely.
  • Let's plot the sales, but only count whether an item is selling or not (0 → not selling, >0 → selling); a sketch follows this list.
  • This plot shows us that many items are slowly introduced into inventory, so many of them will not register a sale at the beginning of the provided data.
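
A minimal sketch of that rollout view, reusing `past_sales` from the previous block; each day we take the share of items that recorded at least one sale (note this conflates "not stocked" with "stocked but sold zero"):

selling = past_sales > 0                 # True where the item sold that day
share_selling = selling.mean(axis=1)     # fraction of items selling each day
share_selling.plot(figsize=(15, 5),
                   title='Share of items with at least one sale per day')
plt.show()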

Sales by Store

We are provided data for 10 unique stores. What are the total sales by store? (a sketch follows these notes)

  • Note that some stores are steadier than others.
  • CA_2 seems to undergo a big change in 2015.
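
A sketch of the per-store totals, assuming `stv` and `cal` as loaded earlier; a 90-day rolling mean smooths out the weekly cycle:

store_sales = stv.groupby('store_id')[d_cols].sum().T \
    .merge(cal.set_index('d')['date'],
           left_index=True, right_index=True, validate='1:1') \
    .set_index('date')

store_sales.rolling(90).mean().plot(figsize=(15, 5),
                                    title='Rolling 90-day total sales by store')
plt.legend(loc=(1.0, 0.5))
plt.show()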

Sales Heatmap Calendar

It appears that Walmart stores are closed on Christmas Day. The highest-demand day in the whole dataset was Sunday, March 6th, 2016. What happened on that day, you may ask... well, the seventh Democratic presidential candidates' debate, hosted by CNN and held in Flint, Michigan... I doubt that impacted sales, though :D
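
The original heatmap-calendar cell did not survive the export; a simplified substitute, pivoting one year of daily totals into a day-of-week × week grid (reusing `past_sales`; the year 2015 is just an example):

daily = past_sales.sum(axis=1).to_frame('sales')
daily.index = pd.to_datetime(daily.index)

one_year = daily.loc['2015'].copy()
one_year['week'] = one_year.index.isocalendar().week
one_year['dow'] = one_year.index.dayofweek

pivot = one_year.pivot_table(index='dow', columns='week', values='sales')
plt.figure(figsize=(15, 3))
sns.heatmap(pivot, cbar=False)
plt.title('2015 daily total sales (rows: day of week, columns: ISO week)')
plt.show()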

Sale Prices

We are given historical sale prices of each item. Let's take a look at our example item from before (a sketch follows these notes).

  • It looks to me like the price of this item is rising.
  • Different stores have different selling prices.
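
A sketch of the price history for the example item across all stores, assuming `sellp` and `cal` as loaded earlier; `wm_yr_wk` is mapped to the first date of each week via the calendar:

price_example = sellp[sellp['item_id'] == 'FOODS_3_090_CA_3'] \
    .merge(cal[['wm_yr_wk', 'date']].drop_duplicates('wm_yr_wk'), on='wm_yr_wk')
price_example['date'] = pd.to_datetime(price_example['date'])

fig, ax = plt.subplots(figsize=(15, 5))
for store, grp in price_example.groupby('store_id'):
    grp.set_index('date')['sell_price'].plot(ax=ax, label=store)
ax.set_title('FOODS_3_090_CA_3 sell price over time by store')
ax.legend()
plt.show()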

 

2.2 Data Quality (Outlier Handling)

2.3 Correlation Analysis

2.4 Preliminary Feature Engineering

2.5 Build a Baseline Flat Table

2.6 Build the Pipeline

 

Part III. Forecasting Models

3.1 Initial model selection;

3.2 Initial model comparison across common models such as ARIMA, moving average, random forest, XGBoost, LightGBM, GBDT, etc.;
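
As a starting point for that comparison, a minimal LightGBM fit on the lag/rolling features sketched in Part I (the feature names come from that illustrative sketch, `wmape` from the metrics sketch; not a tuned model):

import lightgbm as lgb

features = ['lag_28', 'lag_35', 'lag_42', 'rmean_28_7', 'rmean_28_28']
data = df.dropna(subset=features).copy()
data['d_num'] = data['d'].str[2:].astype(int)   # 'd_123' -> 123

# time-based split: hold out the last 28 days for validation
cutoff = data['d_num'].max() - 28
train, valid = data[data['d_num'] <= cutoff], data[data['d_num'] > cutoff]

model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05, num_leaves=63)
model.fit(train[features], train['sales'])
print('validation WMAPE:', wmape(valid['sales'].values, model.predict(valid[features])))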

 

Part IV. Model Optimization

4.1 Hyperparameter tuning - grid search
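
A hedged grid-search sketch over a small LightGBM grid; `TimeSeriesSplit` keeps the folds chronological (it assumes the rows are sorted by day, as in the Part III sketch), and the grid values are illustrative:

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
import lightgbm as lgb

param_grid = {
    'num_leaves': [31, 63, 127],
    'learning_rate': [0.02, 0.05, 0.1],
    'min_child_samples': [20, 50],
}

search = GridSearchCV(
    estimator=lgb.LGBMRegressor(n_estimators=300),
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=3),        # chronological folds, no shuffling
    scoring='neg_mean_absolute_error',
    n_jobs=-1,
)
search.fit(train[features], train['sales'])
print(search.best_params_, search.best_score_)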

4.2 Feature engineering - finding features (feature construction, external data, domain knowledge);

4.3 Model ensembling
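
A minimal weighted-average ensemble sketch over two models' validation predictions; `model_rf` is a hypothetical second model (e.g. the random forest from Part III), and the weight grid is illustrative:

# predictions from two models on the same validation window
pred_lgb = model.predict(valid[features])
pred_rf = model_rf.predict(valid[features])    # hypothetical second model

best_w, best_score = 0.0, float('inf')
for w in np.arange(0.0, 1.01, 0.05):
    blend = w * pred_lgb + (1 - w) * pred_rf
    score = wmape(valid['sales'].values, blend)
    if score < best_score:
        best_w, best_score = w, score
print(f'best LightGBM weight: {best_w:.2f}, blended WMAPE: {best_score:.4f}')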

 

Overall:

The main goal is to build up a methodology of my own.

 

References:

1. [Python][Parallel Computing] Multi-core parallel computing in Python

 

Appendix I: Baseline Model

  • Data loading
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

%matplotlib inline
import matplotlib.pyplot as plt

sales_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv')
sell_prices_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sell_prices.csv')
calendar_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv')
sample_df = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sample_submission.csv')

  • Core functions
from sklearn import preprocessing
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from tqdm import tqdm

def one_hot_transf(feature_list, src_df_in):
    # Transform each feature in feature_list into one-hot columns
    src_df = src_df_in.copy()
    gle = preprocessing.LabelEncoder()        # integer-encode the category labels
    gen_ohe = preprocessing.OneHotEncoder()   # expand the integers into one-hot columns

    for feature in feature_list:
        src_df[feature] = gle.fit_transform(src_df[feature].astype(str))
        feature_labels = feature + '_' + gle.classes_
        transformed_features = gen_ohe.fit_transform(src_df[[feature]]).toarray()
        transformed_feature_df = pd.DataFrame(transformed_features,
                                              columns=feature_labels,
                                              index=src_df.index)
        # accumulate onto src_df and drop the original column; the original
        # version rebuilt the output from scratch each iteration, so only the
        # last feature in the list actually ended up one-hot encoded
        src_df = pd.concat([src_df, transformed_feature_df], axis=1)
        src_df.drop(feature, axis=1, inplace=True)
    return src_df
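
A quick sanity check on a hypothetical toy frame, confirming each listed feature is expanded and the original column dropped:

toy = pd.DataFrame({'event_type_1': ['Sporting', 'Cultural', 'Sporting'],
                    'pro_num': [1, 2, 3]})
print(one_hot_transf(['event_type_1'], toy).columns.tolist())
# ['pro_num', 'event_type_1_Cultural', 'event_type_1_Sporting']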

def fetch_sku_df(sku_name):
    name_list = sku_name.split('_')
    
    # sku_code
    prod_l_cat_id = name_list[0]
    prod_m_cat_id = name_list[0] + '_' + name_list[1]
    prod_s_cat_id = name_list[0] + '_' + name_list[1] + '_' + name_list[2]
    # shop_code
    shop_code = name_list[3] + '_' + name_list[4]
    
    # fetch sales info
    sales_sku_df = sales_df[(sales_df['item_id']==prod_s_cat_id) & (sales_df['store_id']==shop_code)]
    
    # reshape sales info from wide (one row of d_1..d_N columns) to long
    sales_sku_df = sales_sku_df.transpose()
    sales_sku_df.reset_index(drop=False, inplace=True)
    sales_sku_df.columns = ['d_name', 'pro_num']
    
    sku_df = pd.merge(calendar_df, sales_sku_df, left_on='d', right_on='d_name', how='left')
    sku_df.drop('d_name', axis=1, inplace=True)
    
    
    # fetch price info
    sell_prices_item = sell_prices_df[(sell_prices_df['store_id']==shop_code)&(sell_prices_df['item_id']==prod_s_cat_id)].copy()
    sell_prices_item.drop(['store_id', 'item_id'], axis=1, inplace=True)
    sku_df = pd.merge(sku_df, sell_prices_item, how='left', left_on=['wm_yr_wk'], 
                      right_on=['wm_yr_wk'])
    
    # drop unused columns
    state_list = ['CA', 'TX', 'WI']
    state_list.remove(name_list[3])
    state_list = ['snap_' + i for i in state_list] + ['date', 'weekday', 'wm_yr_wk', 'd']
    sku_df.drop(state_list, axis=1, inplace=True)
    
    # data processing    
    sku_df['year'] = sku_df['year'] - sku_df['year'].min()
    
    categoric_columns = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    sku_df_cat = one_hot_transf(categoric_columns, sku_df.copy())
    numeric_columns = ['wday', 'month', 'year']
    sku_df = one_hot_transf(numeric_columns, sku_df_cat.copy())

    # drop null rows
    sku_df = sku_df[(sku_df['sell_price'].notnull())|(sku_df['pro_num']!=0)]
    sku_df = sku_df.loc[:,~((sku_df==0).all())]

    return sku_df.copy()


def regression_method(model, x_train, y_train, x_val, y_val):
    model.fit(x_train, y_train)
    score = model.score(x_val, y_val)          # R^2 on the validation set
    result = model.predict(x_val)
    ResidualSquare = (result - y_val) ** 2     # squared residuals
    RSS = sum(ResidualSquare)                  # residual sum of squares
    MSE = np.mean(ResidualSquare)              # mean squared error
    num_regress = len(result)                  # number of validation samples
    # the diagnostics above are computed for inspection only; the fitted model is returned
    return model
    

def fetch_skus(mode='test'):
    results = sample_df.copy()
    results.set_index('id', inplace=True)
    id_list = list(results.index)
    
    sku_num = 3
    if mode == 'test':
        skus = id_list[0:sku_num]
    else:
        skus = id_list
        
    for item in tqdm(skus):
        # 1. fetch each sku flat table
        sku_flat_table = fetch_sku_df(item)

        max_pro_num = sku_flat_table['pro_num'].max()
        sku_flat_table['pro_num'] = sku_flat_table['pro_num']/max_pro_num

        max_sell_price = sku_flat_table['sell_price'].max()
        sku_flat_table['sell_price'] = sku_flat_table['sell_price']/max_sell_price

        # 2. split data for train, validation and test            
        train_val_df = sku_flat_table.iloc[0:-56, :]
        train_val_col = train_val_df.columns.values.tolist()
        train_val_col.remove('pro_num')
        train_val_x = train_val_df[train_val_col].values
        train_val_y = train_val_df['pro_num'].values
        train_x,validation_x, train_y, validation_y = train_test_split(train_val_x, train_val_y, 
                                                           test_size=0.33, random_state=0)

        test_df = sku_flat_table.tail(56)
        test_col = test_df.columns.values.tolist()
        test_col.remove('pro_num')
        test_x = test_df[test_col].values

        # 3. fit the random-forest model on the training split and score it on validation
        model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=800)
        model_rf = regression_method(model_RandomForestRegressor, train_x, train_y, validation_x, validation_y)

        # 4. predict on the 56-day tail (validation + evaluation windows);
        #    only the first 28 days (F1..F28) are written into this row
        prediction = model_rf.predict(test_x)
        prediction = (prediction * max_pro_num).astype(int)
        results.loc[item, :] = prediction[0:28]
    
    return results
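
A smoke test before scaling up, using the built-in test mode (first 3 ids only):

preds = fetch_skus(mode='test')
print(preds.head(3))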
  • Parallel computing
import multiprocessing
import math

cores = multiprocessing.cpu_count()
print('cores = ', cores)

def predict_sku(item):
    print('item = ', item)
    result = sample_df[sample_df['id']==item].copy()
    result.set_index('id', inplace=True)
    
    # 1. fetch each sku flat table
    sku_flat_table = fetch_sku_df(item)

    max_pro_num = sku_flat_table['pro_num'].max()
    sku_flat_table['pro_num'] = sku_flat_table['pro_num']/max_pro_num

    max_sell_price = sku_flat_table['sell_price'].max()
    sku_flat_table['sell_price'] = sku_flat_table['sell_price']/max_sell_price

    # 2. split data for train, validation and test            
    train_val_df = sku_flat_table.iloc[0:-56, :]
    train_val_col = train_val_df.columns.values.tolist()
    train_val_col.remove('pro_num')
    train_val_x = train_val_df[train_val_col].values
    train_val_y = train_val_df['pro_num'].values
    train_x,validation_x, train_y, validation_y = train_test_split(train_val_x, train_val_y, 
                                                       test_size=0.33, random_state=0)

    test_df = sku_flat_table.tail(56)
    test_col = test_df.columns.values.tolist()
    test_col.remove('pro_num')
    test_x = test_df[test_col].values

    # 3. fit the random-forest model on the training split and score it on validation
    model_RandomForestRegressor = ensemble.RandomForestRegressor(n_estimators=800)
    model_rf = regression_method(model_RandomForestRegressor, train_x, train_y, validation_x, validation_y)

    # 4. predict on the 56-day tail; only the first 28 days (F1..F28) fill this row
    prediction = model_rf.predict(test_x)
    prediction = (prediction * max_pro_num).astype(int)
    result.loc[item, :] = prediction[0:28]
    result.reset_index(inplace=True)
#     print('result = \n', result.head(n=3))
    
    return result

# create the pool only after predict_sku is defined: with the default fork
# start method, workers are snapshots taken at Pool() time, so a function
# defined afterwards would not be visible to pool.map
pool = multiprocessing.Pool(processes=cores)

item_list = list(sample_df['id'])[0:50]
batch_size = 20
batch_num = math.ceil(len(item_list) / batch_size)
item_list_length = len(item_list)

print('item_list_length = ', item_list_length)

final = []

for i in tqdm(range(batch_num)):   # was hard-coded range(3); derive it from the batch count
    print('i =', i)
    sku_list = item_list[batch_size*i:min(batch_size*(i+1), item_list_length)]
    result_df = pool.map(predict_sku, sku_list)
    final += result_df

print('------------------  Finished !!!  --------------')
output_df = pd.concat(final, axis=0)
output_df.to_csv('output.csv', index=False)

 
