In the earlier articles of this series we covered a variety of time series forecasting techniques and models. As we have seen, forecasting can be quite involved, with many steps to work through; readers may recall how many steps, concepts, and pieces of background knowledge the VAR model alone requires.
This article introduces Prophet, Facebook's open-source time series forecasting package. Prophet's designers believe that forecasting should be simple and easy to use for the analysts who rely on it, without having to walk through every step of the standard methodology; Prophet was created with exactly that goal in mind.
We will introduce Prophet by combining theory with practice and compare its forecasts against a VAR model. The example is a financial one: we want to forecast the monthly net sales of a hypothetical food-and-beverage company in a given country from April through December 2020, surfacing potential growth and losses in an uncertain economic environment. A few assumptions are made: the company has a complex supply chain that relies on both domestic and foreign raw materials, and it sells through a mix of domestic and global channels, so it is strongly affected by consumer sentiment, retail, industrial production, and supply-chain conditions. In other words, the company as a whole is sensitive to procurement, production, and distribution, both at home and abroad.
First, time series forecasting is a hard problem, but an important one. The most common application is sales forecasting: corporate data scientists, business analysts, and managers all want to measure potential growth, the likelihood of losses, and the uncertainty that comes with difficult periods. Forecasting is also in strong demand for weather, stock markets, macroeconomics (GDP), e-commerce, retail, and many other domains.
Second, predicting the future is hard. There is no shortage of methods, such as ARIMA (see my earlier article Time Series Analysis (6): the ARIMA(p,d,q) model), SARIMA, and the recently popular LSTM, but their drawbacks are clear: the first two are best suited to univariate forecasting, while LSTM is a black-box method with little interpretability. The VAR model we compare against here is described in my earlier article Time Series Analysis 19: Vector Autoregression (VAR).
Third, VAR is a multivariate regression model that forecasts several series jointly, whereas Prophet is a univariate model that can nonetheless incorporate additional regressors.
Finally, in my view every data-driven forecasting problem amounts to mining historical patterns in the data and assuming those patterns will persist; that assumption is the underlying logic that makes forecasting possible at all. If new patterns emerge or old ones break down, no forecast can be accurate.
Prophet is built on top of a generalized additive model (GAM):
$$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$
Here g(t) is the trend component, s(t) the seasonal component, h(t) the holiday effect, and ε_t the error term. Trend and seasonality are the patterns that dominate retail, economic, and financial data, while holidays typically drive short-term fluctuations. Prophet's core philosophy is to focus on the components of the time series (frequency, seasonality, trend) together with domain knowledge, the last of which is especially important.
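For reference, the Prophet paper (Taylor and Letham, 2017) models the seasonal component s(t) as a finite Fourier series with period P (for example P = 365.25 days for yearly seasonality), where the order N controls how flexible the seasonal pattern is, and models the default trend g(t) as a piecewise-linear function with automatically detected changepoints:

$$s(t) = \sum_{n=1}^{N}\left(a_n \cos\frac{2\pi n t}{P} + b_n \sin\frac{2\pi n t}{P}\right)$$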
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = 16, 10
import seaborn as sns
sns.set_style('ticks')
%matplotlib inline
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic
import os
from IPython.display import HTML
### disable scientific notation in pandas
pd.set_option('display.float_format', '{:.2f}'.format) ### display up to 2 decimal places
salesdf = pd.read_csv(r'E:\datasets\salesdf.csv')
salesdf['Date'] = pd.DatetimeIndex(salesdf['Date'])
pd.set_option('display.max_rows', 10)
salesdf.head()
The following macroeconomic indicators, such as gross domestic product per capita, average consumer price inflation, and the unemployment rate, come from the IMF: https://www.imf.org/external/pubs/ft/weo/2020/01/weodata/index.aspx
The import and export data come from the OECD: https://data.oecd.org/trade/trade-in-goods-and-services-forecast.htm
all_indicators = pd.read_csv(r'E:\datasets\economic_indicator_yearly.csv')
### Interpolate the data monthly
### Specifically, we change the frequency from yearly to monthly and interpolate the missing data using linear intepolation
all_indicators['Date'] = pd.DatetimeIndex(all_indicators['Date'])
all_indicators = all_indicators.set_index('Date', drop=True)
all_indicators = all_indicators.groupby(['Country']).resample('MS').mean()
all_indicators = all_indicators.interpolate().reset_index()
all_indicators.head()
#### Merge all indicators in a single DataFrame
masterdf = salesdf.merge(all_indicators, how='outer', on='Date')
masterdf = masterdf.drop('Country', axis=1)
masterdf.head()
The table above shows several macroeconomic indicators: GDP, inflation, and the unemployment rate come from the IMF, while the trade data come from the OECD.
We will use the OECD and IMF series as additional regressors to forecast monthly net sales. Using the Granger causality test and the Johansen cointegration test, the eight macroeconomic and trade indicators were narrowed down to three variables with a linear relationship to monthly sales: GDP, average consumer prices, and net trade.
For cointegration tests, see my earlier article Time Series Analysis 15: Cointegrated Series.
The Granger causality test is briefly introduced below; a dedicated article may follow later.
choose_cols = ['Date',
'Net Sales',
# 'Gross domestic product per capita, constant prices',
'Gross domestic product, current prices',
'Inflation, average consumer prices',
# 'Unemployment rate',
# 'Export',
# 'Export Market Growth',
# 'Import',
'Net Trade'
]
masterdf = masterdf[choose_cols]
## Keep masterdf intact; use data up to March 2020 as the current data
alldf = masterdf.copy()
alldf.index = pd.DatetimeIndex(alldf.Date)
alldf = alldf.sort_index()
### fill any missing values on previous date using backfill
alldf = alldf.fillna(method='bfill')
alldf.index = pd.DatetimeIndex(alldf.Date)
alldf = alldf[alldf.index < '2020-04-01']
alldf = alldf.reset_index(drop=True)
## we rename date as 'ds' and Net Sales as 'y'
## Prophet makes it compulsory to rename these columns as such
alldf = alldf.rename(columns={'Date':'ds',
'Net Sales':'y'
})
# alldf['ds'] = pd.to_datetime(alldf['ds'])
tsdf = alldf.copy()
tsdf['ds'] = pd.to_datetime(tsdf['ds'])
tsdf.index = pd.DatetimeIndex(tsdf.ds)
tsdf = tsdf.drop('ds', axis=1)
Seasonal decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(alldf['y'], model='additive', period=12)
result.plot()
from scipy import stats
stat, p = stats.normaltest(tsdf.y)
print('Statistics=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('Data looks Gaussian (fail to reject H0)')
else:
    print('Data does not look Gaussian (reject H0)')
print('Kurtosis of Normal Dist: {}'.format(stats.kurtosis(tsdf.y)))
print('Skewness of Normal Dist: {}'.format(stats.skew(tsdf.y)))
stats.probplot(tsdf['y'], plot=plt)
Statistics=3.566, p=0.168
Data looks Gaussian (fail to reject H0)
Kurtosis of Normal Dist: -0.46679521297409465
Skewness of Normal Dist: 0.38404204508729184
A brief introduction to the Granger causality test
The Granger causality test checks whether one time series helps predict another; Granger causality means that one series has a statistically significant effect on the other. The test is regression-based, and the null hypothesis is that no such causal relationship exists.
The test below specifies ssr_chi2test: with 'params_ftest' or 'ssr_ftest' an F distribution is used, while 'ssr_chi2test' and 'lrtest' use a chi-squared distribution.
from statsmodels.tsa.stattools import grangercausalitytests
maxlag=12
test = 'ssr_chi2test'
df = pd.DataFrame(np.zeros((len(tsdf.columns), len(tsdf.columns))), columns=tsdf.columns, index=tsdf.columns)
for c in df.columns:
    for r in df.index:
        test_result = grangercausalitytests(tsdf[[r, c]], maxlag=maxlag, verbose=False)
        p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
        min_p_value = np.min(p_values)
        df.loc[r, c] = min_p_value
df.columns = [var + '_x' for var in tsdf.columns]
df.index = [var + '_y' for var in tsdf.columns]
grangerdf = df
grangerdf
from statsmodels.tsa.vector_ar.vecm import coint_johansen
alpha = 0.05
out = coint_johansen(tsdf,-1,5)
d = {'0.90':0, '0.95':1, '0.99':2}
traces = out.lr1
cvts = out.cvt[:, d[str(1-alpha)]]
def adjust(val, length=6):
    return str(val).ljust(length)
# # Summary
col_name =[]
trace_value = []
cvt_value = []
indicator = []
for col, trace, cvt in zip(tsdf.columns, traces, cvts):
    col_name.append(col)
    trace_value.append(np.round(trace, 2))
    cvt_value.append(cvt)
    indicator.append(trace > cvt)
d = {'Name':col_name, 'Test Stat' : trace_value, 'Critical Value (95%)': cvt_value, 'Significance': indicator}
cj = pd.DataFrame(d)
cj[['Name', 'Test Stat', 'Critical Value (95%)', 'Significance']]
## Split into training and test sets
nobs = 10
df_train,df_test = tsdf[0:-nobs],tsdf[-nobs:]
print(df_train.shape)
print(df_test.shape)
(90, 4)
(10, 4)
Run the Augmented Dickey-Fuller test on the training set to check for stationarity.
def adf(data, variable=None, signif=0.05, verbose=False):
    r = adfuller(data, autolag='AIC')
    output = {'test_statistic': np.round(r[0], 4), 'pvalue': np.round(r[1], 4), 'n_lags': np.round(r[2], 4), 'n_obs': r[3]}
    p_value = output['pvalue']
    # Print Summary
    print(' Augmented Dickey-Fuller Test on ', variable)
    print(' Null Hypothesis: Variable is Non-Stationary.')
    print(' Significance Level = ', signif)
    print(' Test Statistic = ', output["test_statistic"])
    print(' No. Lags Chosen (lowest AIC)= ', output["n_lags"])
    for key, val in r[4].items():
        print(' Critical value', key, np.round(val, 3))
    if p_value <= signif:
        print("p-Value = ", p_value, ". P value is less than critical value. Reject Null H. Series is Stationary.")
    else:
        print("p-Value = ", p_value, ". P value is not less than critical value. Weak evidence to reject the Null H. Series is Non-Stationary.")
# ADF Test on each column
for name, column in df_train.iteritems():
    adf(column, variable=column.name)
    print('\n')
# 1st order differencing
df_differenced = df_train.diff().dropna()
# 2nd order differencing
df_differenced = df_differenced.diff().dropna()
fig, ax = plt.subplots(2,1, figsize=(12.5,7.5))
df_train.reset_index()['y'].plot(ax=ax[0])
df_differenced.reset_index()['y'].plot(ax=ax[1])
ax[0].set_title('Net Sales - Actual')
ax[1].set_title('Net Sales - 2nd order differenced')
# ADF Test on each column
for name, column in df_differenced.iteritems():
    adf(column, variable=column.name)
    print('\n')
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(2,1, figsize=(10,6))
plot_acf(tsdf['y'], ax=ax[0])
plot_pacf(tsdf['y'], ax=ax[1])
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(2,1, figsize=(10,6))
plot_acf(df_differenced['y'], ax=ax[0])
plot_pacf(df_differenced['y'], ax=ax[1])
Select the optimal lag order for the VAR model by comparing AIC, BIC, FPE, and HQIC.
# call the model and insert the differenced data
model = VAR(df_differenced)
x = model.select_order(maxlags=12)
x.summary()
## lag 4 seems decent, so we choose lag 4 and fit the model
model_fitted = model.fit(4)
model_fitted.summary()
from statsmodels.stats.stattools import durbin_watson
out = durbin_watson(model_fitted.resid)
for col, val in zip(tsdf.columns, out):
    print(col, ':', round(val, 2))
y : 2.2
Gross domestic product, current prices : 2.01
Inflation, average consumer prices : 1.98
Net Trade : 2.02
# Get the lag order
lag_order = model_fitted.k_ar
print(lag_order)
# Separate input data for forecasting
# the goal is to forecast based on the last 4 inputs (since the lag is 4)
forecast_input = df_differenced.values[-lag_order:]
4
# Forecast
## we insert the last four values and inform the model to predict the next 10 values
fc = model_fitted.forecast(y=forecast_input, steps=nobs)
## organize the output into a clear DataFrame layout, add '_f' suffix at each column indicating they are the forecasted values
df_forecast = pd.DataFrame(fc, index=tsdf.index[-nobs:], columns=tsdf.columns + '_f')
df_forecast
# get a copy of the forecast
df_fc = df_forecast.copy()
# get column name from the original dataframe
columns = df_train.columns
# Roll back the differencing to recover forecasts on the original scale
# for each forecast column we take the cumulative sum (from the top row to the bottom)
## and add it to the last observed value, undoing one round of differencing
## the recovered columns get a '_forecast' suffix
# If only first-order differencing had been applied, a single cumsum would be enough:
# for col in columns:
#     df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_f'].cumsum()
## since we differenced twice, first recover the first difference (seeded with the last observed first difference from df_train), then the level
for col in columns:
    df_fc[str(col)+'_first_differenced'] = (df_train[col].iloc[-1] - df_train[col].iloc[-2]) + df_fc[str(col)+'_f'].cumsum()
    df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_first_differenced'].cumsum()
df_results = df_fc
df_results.loc[:, [ 'y_forecast',
'Gross domestic product, current prices_forecast',
'Inflation, average consumer prices_forecast',
'Net Trade_forecast']]
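Written out, the two cumulative sums in the loop above undo the two rounds of differencing. Using $x_T$ for the last training observation and $\hat{d}^{(2)}_{T+j}$ for the forecast second differences (notation introduced here purely for illustration), the recovered first differences and levels are:

$$\hat{d}^{(1)}_{T+h} = (x_T - x_{T-1}) + \sum_{j=1}^{h}\hat{d}^{(2)}_{T+j}, \qquad \hat{x}_{T+h} = x_T + \sum_{j=1}^{h}\hat{d}^{(1)}_{T+j}$$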
forecast = df_results['y_forecast'].values
actual = df_test['y']
mae = np.mean(np.abs(forecast - actual)) # MAE
rmse = np.mean((forecast - actual)**2)**.5 # RMSE
mape = np.mean(np.abs(forecast - actual)/np.abs(actual)) # MAPE
print('Mean Absolute Error : ', mae)
print('Root Mean Squared Error : ',rmse)
print('Mean Absolute Percentage Error: ',mape)
Mean Absolute Error : 445641.1508824545
Root Mean Squared Error : 522946.3865355065
Mean Absolute Percentage Error: 0.0999243747966973
## you can use the statsmodels plot_forecast method but the output is not clean
fig = model_fitted.plot_forecast(10)
fig, ax = plt.subplots(figsize=(12, 5))
df_test.reset_index()['y'].plot(color='k', label='Actual')
df_results.reset_index()['y_forecast'].plot(color='r', label='Predicted')
plt.title('VAR Model: Predicted vs Actual on Test Set (June 2019 - Mar 2020)')
ax.legend()
m = Prophet(seasonality_mode='multiplicative',
growth='linear',
yearly_seasonality=True,
mcmc_samples=1000,
changepoint_prior_scale = 0.001,
)
Prophet's high-level API requires standardized column names: the time column must be named ds and the target variable y.
seasonality_mode='multiplicative' makes the seasonality multiplicative rather than additive. growth='linear' specifies a linear trend; a logistic growth model is also common, for example for business cycles, population growth, and epidemic models (such as COVID-19). A minimal sketch of logistic growth appears after these parameter notes.
weekly_seasonality=False and yearly_seasonality=True: by default, Prophet fits weekly seasonality with dummy variables and yearly seasonality with a Fourier series.
mcmc_samples=1000: by default, Prophet estimates uncertainty only in the trend, whether the series is heading up or down. To obtain uncertainty estimates for the seasonal components as well, Prophet must run Bayesian sampling; here we use 1000 samples to estimate the posterior distribution.
changepoint_prior_scale=0.001: Prophet defines changepoints as abrupt shifts in the trajectory. If we know shifts occur in a given period we can specify the changepoints ourselves, but by default Prophet detects them automatically (with a default prior scale of 0.05). Here we lower the value to avoid fitting too many changepoints.
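As noted above, Prophet also supports logistic growth, which requires a carrying-capacity column named cap in both the history and the future dataframes. The following is only a minimal sketch; the name m_logistic and the capacity value are illustrative and not part of this example:
# Minimal sketch of logistic growth (illustrative; not used in the rest of this article)
alldf_logistic = alldf[['ds', 'y']].copy()
alldf_logistic['cap'] = alldf_logistic['y'].max() * 1.5   # assumed carrying capacity, not tuned
m_logistic = Prophet(growth='logistic', yearly_seasonality=True)
m_logistic.fit(alldf_logistic)
future_logistic = m_logistic.make_future_dataframe(periods=9, freq='MS')
future_logistic['cap'] = alldf_logistic['cap'].iloc[0]    # the future frame also needs a cap
logistic_forecast = m_logistic.predict(future_logistic)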
cols = np.array(alldf.columns)
modified_array = np.delete(cols, np.where(cols == 'y'))
modified_array2 = np.delete(modified_array, np.where(modified_array == 'ds'))
alldf_regressor = modified_array2
m.add_regressor('Gross domestic product, current prices', prior_scale=0.5, mode='multiplicative')
m.add_regressor('Inflation, average consumer prices', prior_scale=0.5, mode='multiplicative')
m.add_regressor('Net Trade', prior_scale=0.5, mode='multiplicative')
Before fitting the model, we add the variables identified in the VAR analysis as additional regressors. Regressor values must be known both in the past and in the future: to forecast through December 2020 we need, for example, GDP values through December, which is why we use the GDP figures forecast by the OECD.
m.fit(alldf)
We now prepare to forecast monthly net sales from April through December 2020.
future = m.make_future_dataframe(periods=9, freq='MS')
future.tail(9)
masterdf2 = masterdf.reset_index().rename(columns={'Date':'ds'})
masterdf2['ds'] = pd.to_datetime(masterdf2['ds'])
baseline_future = future.merge(masterdf2.loc[:, modified_array], how='left', on='ds')
baseline_future = baseline_future.fillna(method='ffill')
baseline_future = baseline_future.fillna(method='bfill')
baseline_future.tail(9)
forecast = m.predict(baseline_future)
result = pd.concat([forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']], alldf[['y']]], axis=1)
result.tail(12)
Plot the results.
The black dots are the actual data, the blue line is the fitted values, and the shaded area is the uncertainty interval.
fig, ax = plt.subplots(figsize=(15, 6.5))
m.plot(forecast, ax=ax)
ax.set_title('Net Sales - DEC 2020: BASELINE FORECAST ', size=15)
m.plot_components(forecast)
a = result.loc[pd.DatetimeIndex(result['ds']).year == 2019, 'y'].sum()
b = result.loc[pd.DatetimeIndex(result['ds']).year == 2020, 'yhat'].sum()
(b - a)/ a
0.16700018251482737
from fbprophet.diagnostics import cross_validation
df_cv = cross_validation(m,
                         initial='730 days',  # size of the initial training window
                         period='90 days',    # make a new forecast every 90 days (once per quarter)
                         horizon='270 days'   # forecast horizon for each cutoff
                         )
df_cv.head()
from fbprophet.diagnostics import performance_metrics
df_p = performance_metrics(df_cv)
df_p.head()
Cross-validating the model is essential for assessing its accuracy. By repeatedly splitting the data into training and test windows, we obtain more reliable estimates of the forecast error. Passing the cross-validation output to the performance metrics function gives us, for each horizon, the MSE (Mean Squared Error), RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and MAPE (Mean Absolute Percentage Error). The first three of these metrics are scale-dependent, while MAPE is scale-free.
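For intuition, the same metrics can also be computed directly from df_cv, whose y and yhat columns hold the actuals and the cross-validated predictions. A minimal sketch, noting that performance_metrics aggregates over rolling windows by horizon, so these single aggregate numbers will differ slightly from the table above:
# Aggregate error metrics computed directly from the cross-validation output
cv_err = df_cv['y'] - df_cv['yhat']
cv_mse = np.mean(cv_err ** 2)                           # scale-dependent
cv_rmse = np.sqrt(cv_mse)                               # scale-dependent
cv_mae = np.mean(np.abs(cv_err))                        # scale-dependent
cv_mape = np.mean(np.abs(cv_err) / np.abs(df_cv['y']))  # scale-free
print('MSE: {:.2f}, RMSE: {:.2f}, MAE: {:.2f}, MAPE: {:.2%}'.format(cv_mse, cv_rmse, cv_mae, cv_mape))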
from fbprophet.plot import plot_cross_validation_metric
plot_cross_validation_metric(df_cv, metric='mape')
Because MAPE is scale-free, we use it to judge model performance; the plot above shows how MAPE evolves with the forecast horizon. The minimum is about 13%, which is a reasonable result.
Now we want to estimate how a decline in GDP and the other economic indicators would affect sales growth.
Below we build a new dataframe called future2, fill in the regressor values, and then lower those values by 5%. If we believe the economic environment will be even worse, we can lower them further. We then use the model trained above to forecast under this scenario.
future2 = future.merge(masterdf2.loc[:, modified_array], how='left', on='ds')
future2.iloc[-8:, 1:] = future2.iloc[-8:, 1:].apply(lambda x : x * 0.95)
scenario_forecast = m.predict(future2)
scenario_result = pd.concat([scenario_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']], alldf[['y']]], axis=1)
scenario_result.tail(12)
fig, ax = plt.subplots(figsize=(15, 6.5))
m.plot(scenario_forecast, ax=ax)
ax.set_title('Net Sales - SCENARIO FORECAST UP TO DEC 2020')
a = scenario_result.loc[pd.DatetimeIndex(scenario_result['ds']).year == 2019, 'y'].sum()
b = scenario_result.loc[pd.DatetimeIndex(scenario_result['ds']).year == 2020, 'yhat'].sum()
(b - a) / a
0.11716962913373659
Under this scenario, the forecast sales growth is just under 12%.
That is all for this topic; a more detailed article on Prophet will follow.