模型结构: 关于时间的广义线性模型
y ( t ) = g ( t ) + s ( t ) + h ( t ) + ϵ t y(t)=g(t)+s(t)+h(t)+\epsilon_t y(t)=g(t)+s(t)+h(t)+ϵt
其中,
g(t)表示趋势 trend,用分段线性函数 或者 逻辑斯蒂增长 (逻辑斯蒂增长 相对于 线性 会有上下界限) 函数拟合。
g ( t ) = C 1 + e − k ( t − m ) g(t)=\frac{C}{1+e^{-k(t-m)}} g(t)=1+e−k(t−m)C , C 表示上界 ; k表示增长/下降快慢; m表示增长/下降 最快的点。
用户需要给出 C 和 分段的段数。
s(t)表示季节性 seasonality, 用傅里叶级数拟合。可以叠加多个季节性, 比如 weekly, yearly。 s = s 1 + s 2 s=s_1+s_2 s=s1+s2
s ( t ) = ∑ n = 1 N [ a n c o s ( 2 π n t T ) + b n s i n ( 2 π n t T ) ] s(t)=\sum_{n=1}^N [a_ncos(\frac{2\pi nt}{T})+b_n sin(\frac{2\pi n t}{T})] s(t)=∑n=1N[ancos(T2πnt)+bnsin(T2πnt)], T 表示周期长度 , N表示阶数。 T表示了数据的周期性; N 表示了采用的傅里叶级数的最高阶数。 N越大,曲线本身的波动也越大,也越容易造成过拟合。
h(t)表示外部变量的影响 regressor,采用线性函数拟合。可以叠加多个外部变量,如节假日,温度,活动。
h = w 1 h 1 + w 2 h 2 + … h=w_1 h_1 +w_2 h_2 +\dots h=w1h1+w2h2+… h(t)可以是连续变量,也可以是0-1量。
ϵ t \epsilon_t ϵt表示模型残差,表示不可知的外部变量造成的影响。
import pandas as pd
import prophet
import matplotlib.pyplot as plt
df_sales = pd.read_csv("store_sales.csv",parse_dates=["week"])
df_prom = pd.read_csv("promotion_data.csv",parse_dates=["week"])
# print(df_sales.head(2))
# print(df_prom.head(2))
df_all = pd.merge(df_sales, df_prom, how="left")
df_all.fillna(0,inplace=True)
dept = 1
df_train = df_all[ (df_all["week"]<="2012-07-30")&
(df_all["store"]==1)&
(df_all["dept"]==dept)]
df_train.rename(columns={"week":"ds","sales":"y"},inplace=True)
df_test = df_all[ (df_all["week"]<="2012-08-06")&
(df_all["store"]==1)&
(df_all["dept"]==dept)]
df_test.rename(columns={"week":"ds","sales":"y"},inplace=True)
###### 年周期性
m = prophet.Prophet(yearly_seasonality=True)
####### add regressor
m.add_regressor("promotion_sales")
##### TRAIN
m.fit(df_train)
###### fit previous data
df_fit = m.predict(df_train)
###### visualize
fig1 = m.plot_components(df_fit)
fig2 = m.plot(df_fit)
plt.show()
import pandas as pd
import prophet
import matplotlib.pyplot as plt
import lightgbm as lgb
from lightgbm.sklearn import LGBMRegressor
df_sales = pd.read_csv("store_sales.csv",parse_dates=["week"])
df_prom = pd.read_csv("promotion_data.csv",parse_dates=["week"])
# print(df_sales.head(2))
# print(df_prom.head(2))
#
df_all = pd.merge(df_sales, df_prom, how="left")
df_all.fillna(0,inplace=True)
#### lightgbm
df_samples = df_all[ (df_all["store"]==1)&
(df_all["dept"]==1)].sort_values("week")
### construct features
feature_cols = []
### first feature: last week
df_samples["sales_lw"] = df_samples["sales"].shift(1)
df_samples["promotion_lw"] = df_samples["promotion_sales"].shift(1)
feature_cols = feature_cols + ["sales_lw", "promotion_lw"]
#### second features: last year
df_samples["sales_ly"] = df_samples["sales"].shift(52)
df_samples["promotion_ly"] = df_samples["promotion_sales"].shift(52)
feature_cols += ["sales_ly","promotion_ly"]
#### third features: variances waiting for prediction
feature_cols = feature_cols + ["promotion_sales"]
#### keep the data that is not nan
for col in feature_cols:
df_samples = df_samples[ ~df_samples[col].isna() ]
######### construct train data and test data
x_train = df_samples[df_samples["week"]<="2012-07-30"][feature_cols].values
y_train = df_samples[df_samples["week"]<="2012-07-30"]["sales"].values
x_test = df_samples[df_samples["week"]=="2012-08-06"][feature_cols].values
y_test = df_samples[df_samples["week"]=="2012-08-06"]["sales"].values
model = LGBMRegressor()
model.fit(x_train,y_train)
#### prediction
y_pred = model.predict(x_test)
print("actual data:",y_test)
print("prediction:",y_pred)
>>>16119.92
>>>16165.25781465
不同算法区别
特点 | ETS | ARIMA | Prophet | LightGBM |
---|---|---|---|---|
是否适合短的时间序列 | Y | N | N | N |
是否可解释 | Y | Y | Y | N |
是否支持多条序列批量预测 | N | N | N | Y,多条线可以在一个model中 |
是否支持外部变量 | N | N | Y,但只是线性变量 | Y |
使用Prophet 和 LightGBM算法 完成 之后所有时间的预测。
df_sales = pd.read_csv("store_sales.csv",parse_dates=["week"])
df_prom = pd.read_csv("promotion_data.csv",parse_dates=["week"])
# print(df_sales.head(2))
# print(df_prom.head(2))
#
df_all = pd.merge(df_sales, df_prom, how="left")
df_all.fillna(0,inplace=True)
#
dept = 1
print(df_all[ (df_all["store"]==1)&
(df_all["dept"]==dept)] .sort_values(["week"]).tail(3))
test_date_begin = pd.to_datetime("2012-07-30")
test_date_end = pd.to_datetime("2012-10-22")
test_date = df_all[ (df_all["week"]>=test_date_begin)&
(df_all["week"]<=test_date_end)]["week"].unique()
predict_result = []
actual_result = []
print("total prediction num:", len(test_date))
print(test_date)
for i, every_test_date in enumerate(test_date):
print("predicting: {}".format(i))
df_train = df_all[ (df_all["week"]<every_test_date)&
(df_all["store"]==1)&
(df_all["dept"]==dept)]
df_train.rename(columns={"week":"ds","sales":"y"},inplace=True)
actual_result.append( df_all[(df_all["week"]==every_test_date)&
(df_all["store"]==1)&
(df_all["dept"]==dept)]["sales"].values[-1] )
# print(actual_result[-1])
df_test = df_all[ (df_all["week"]==every_test_date)&
(df_all["store"]==1)&
(df_all["dept"]==dept)]
df_test.rename(columns={"week":"ds","sales":"y"},inplace=True)
m = prophet.Prophet(yearly_seasonality=True)
m.add_regressor("promotion_sales")
m.fit(df_train)
df_predict = m.predict(df_test)
predict_result.append(df_predict["yhat"].values[0])
# print(predict_result[-1])
plt.plot(test_date,predict_result,label="prediction")
plt.plot(test_date,actual_result,label="actual")
plt.legend()
plt.xticks(rotation=90)
plt.show()
#### lightgbm
df_sales = pd.read_csv("store_sales.csv",parse_dates=["week"])
df_prom = pd.read_csv("promotion_data.csv",parse_dates=["week"])
# print(df_sales.head(2))
# print(df_prom.head(2))
#
df_all = pd.merge(df_sales, df_prom, how="left")
df_all.fillna(0,inplace=True)
test_date_begin = pd.to_datetime("2012-07-30")
test_date_end = pd.to_datetime("2012-10-22")
dept = 1
df_samples = df_all[ (df_all["store"]==1)&
(df_all["dept"]==dept)].sort_values("week")
### construct features
feature_cols = []
### first feature: last week
df_samples["sales_lw"] = df_samples["sales"].shift(1)
df_samples["promotion_lw"] = df_samples["promotion_sales"].shift(1)
feature_cols = feature_cols + ["sales_lw", "promotion_lw"]
#### second features: last year
df_samples["sales_ly"] = df_samples["sales"].shift(52)
df_samples["promotion_ly"] = df_samples["promotion_sales"].shift(52)
feature_cols += ["sales_ly","promotion_ly"]
#### third features: variances waiting for prediction
feature_cols = feature_cols + ["promotion_sales"]
#### keep the data that is not nan
for col in feature_cols:
df_samples = df_samples[ ~df_samples[col].isna() ]
######### construct train data and test data
test_date = df_all[ (df_all["week"]>=test_date_begin)&
(df_all["week"]<=test_date_end)]["week"].unique()
predict_result = []
actual_result = []
for i,every_test_date in enumerate(test_date):
x_train = df_samples[df_samples["week"]<every_test_date][feature_cols].values
y_train = df_samples[df_samples["week"]<every_test_date]["sales"].values
x_test = df_samples[df_samples["week"]==every_test_date][feature_cols].values
y_test = df_samples[df_samples["week"]==every_test_date]["sales"].values
from lightgbm.sklearn import LGBMRegressor
model = LGBMRegressor()
model.fit(x_train,y_train)
#### prediction
y_pred = model.predict(x_test)
predict_result.append(y_pred[0])
actual_result.append(y_test[0])
print("actual data:",actual_result)
print("prediction:",predict_result)
plt.plot(test_date,predict_result,label="prediction")
plt.plot(test_date,actual_result,label="actual")
plt.legend()
plt.xticks(rotation=90)
plt.show()