SARIMA(p,d,q)(P,D,Q,s)
季节性自回归移动平均模型,结构参数有七个
AR(p)
自回归模型,即用自己回归自己。基本假设是,当前序列值取决于序列的历史值。p 表示用多少个历史值来回归出预测值。
要确定初始 p,需要查看PACF
图并找到最大的显著时滞,在 p 之后其它时滞都不显著。
MA(q)
移动平均模型,是对时间序列的误差进行建模,并假设当前误差取决于带有滞后的误差。可以在ACF
图上找到初始值。
结合以上两种方法: A R ( p ) + M A ( q ) = A R M A ( p , q ) AR(p)+MA(q)=ARMA(p,q) AR(p)+MA(q)=ARMA(p,q),就是自回归移动平均模型
剩下的参数:
I(d)
表示积分的阶数为 d,出现积分是因为我们先对时间序列做 d 次微分,使得序列平稳。举个例子,对抛物线微分两次就得到了一个定常(平稳)的加速度,估计出了正确的加速度,积分两次回去,就还原了原时间序列。
现在我们有了 ARIMA 模型,可以对无季节变化的非稳态序列建模。
S(s)
用来建模序列的季节性,s 代表季节的长度
有了季节性,就需要额外三个参数 (P,D,Q)
P 表示季节自回归的阶数,从 PACF 推断。和小 p 不同的是,需要看的是季节长度的倍数上的时滞。比如,如果季节长度为 24,那么就要在 pacf 图上检查 24,48,72 个滞后的强度,如果滞后 48 的序列的 pacf 表现显著,那么 P 等于 2
Q 和 P 的取法类似,不过是通过 ACF 图选择
D 表示季节差分的阶数,一般就是 0 或 1,做了季节差分就是 1
好了,说了半天你也没听懂,接下来我们用 SARIMA 来建模
import warnings # do not disturbe mode
warnings.filterwarnings('ignore')
# Load packages
import numpy as np # vectors and matrices
import pandas as pd # tables and data manipulations
import matplotlib.pyplot as plt # plots
import seaborn as sns # more plots
from dateutil.relativedelta import relativedelta # working with dates with style
from scipy.optimize import minimize # for function minimization
import statsmodels.formula.api as smf # statistics and econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from itertools import product # some useful functions
from tqdm import tqdm_notebook
# Importing everything from forecasting quality metrics
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import median_absolute_error, mean_squared_error, mean_squared_log_error
ACF 和 PACF
# MAPE
def mean_absolute_percentage_error(y_true, y_pred):
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
"""
Plot time series, its ACF and PACF, calculate Dickey–Fuller test
y - timeseries
lags - how many lags to include in ACF, PACF calculation
"""
if not isinstance(y, pd.Series):
y = pd.Series(y)
with plt.style.context(style):
fig = plt.figure(figsize=figsize)
layout = (2, 2)
ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
acf_ax = plt.subplot2grid(layout, (1, 0))
pacf_ax = plt.subplot2grid(layout, (1, 1))
y.plot(ax=ts_ax)
p_value = sm.tsa.stattools.adfuller(y)[1]
ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
plt.tight_layout()
广告点击量时间序列
ads = pd.read_csv('ads.csv', index_col=['Time'], parse_dates=['Time'])
总共有 216 = 24 × 9 216 = 24 \times 9 216=24×9 个数据,9 天,每天 24 小时
plt.figure(figsize=(18, 6))
plt.plot(ads.Ads)
plt.title('Ads watched (hourly data)')
plt.grid(True)
plt.show()
tsplot(ads.Ads, lags=60)
可以看出,广告点击量数据呈现较强的季节性,以1天24小时为季节长度,所以我们先做个差分,除去季节性
# The seasonal difference
ads_diff = ads.Ads - ads.Ads.shift(24)
tsplot(ads_diff[24:], lags=60)
但是在自相关图和偏自相关图中,还是有太多显著的时滞
我们再来做个差分
ads_diff = ads_diff - ads_diff.shift(1)
tsplot(ads_diff[24+1:], lags=60)
当然了,以上只是一个粗略的估计,我们还是用程序来选择最优的参数
但是可以肯定的是:d = 1, D = 1,s = 24
下面列出所有候选参数组合,共有 36 组候选参数
# setting initial values and some bounds for them
ps = range(2, 5)
d=1
qs = range(2, 5)
Ps = range(0, 2)
D=1
Qs = range(0, 2)
s = 24 # season length is still 24
# creating list with all the possible combinations of parameters
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list) # 36
根据 AIC 调参
def optimizeSARIMA(parameters_list, d, D, s):
"""Return dataframe with parameters and corresponding AIC
parameters_list - list with (p, q, P, Q) tuples
d - integration order in ARIMA model
D - seasonal integration order
s - length of season
"""
results = []
best_aic = float("inf")
for param in tqdm_notebook(parameters_list):
# we need try-except because on some combinations model fails to converge
try:
model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(param[0], d, param[1]),
seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
except:
continue
aic = model.aic
# saving best model, AIC and parameters
if aic < best_aic:
best_model = model
best_aic = aic
best_param = param
results.append([param, model.aic])
result_table = pd.DataFrame(results)
result_table.columns = ['parameters', 'aic']
# sorting in ascending order, the lower AIC is - the better
result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
return result_table
最后得出的最优参数为 p = 2,q = 3,,P = 1,Q = 1
warnings.filterwarnings("ignore")
result_table = optimizeSARIMA(parameters_list, d, D, s)
'''
parameters aic
0 (2, 3, 1, 1) 3888.642174
1 (3, 2, 1, 1) 3888.763568
2 (4, 2, 1, 1) 3890.279740
3 (3, 3, 1, 1) 3890.513196
4 (2, 4, 1, 1) 3892.302849
5 (4, 3, 1, 1) 3892.322855
6 (3, 4, 1, 1) 3893.762846
7 (4, 4, 1, 1) 3894.327967
8 (2, 2, 1, 1) 3894.798147
9 (2, 3, 0, 1) 3897.170902
10 (3, 2, 0, 1) 3897.815032
11 (4, 2, 0, 1) 3899.073591
12 (3, 3, 0, 1) 3899.165271
13 (3, 4, 0, 1) 3900.500309
14 (2, 4, 0, 1) 3900.502494
15 (4, 3, 0, 1) 3901.255700
16 (4, 4, 0, 1) 3902.650501
17 (2, 2, 0, 1) 3903.905714
18 (2, 3, 1, 0) 3909.281188
19 (3, 2, 1, 0) 3909.502838
20 (4, 2, 1, 0) 3910.927759
21 (3, 3, 1, 0) 3911.192654
22 (3, 4, 1, 0) 3911.344351
23 (2, 4, 1, 0) 3911.809710
24 (4, 4, 1, 0) 3913.084053
25 (4, 3, 1, 0) 3913.409057
26 (2, 2, 1, 0) 3914.786853
27 (3, 2, 0, 0) 3925.433806
28 (2, 2, 0, 0) 3925.786566
29 (2, 3, 0, 0) 3925.879649
30 (2, 4, 0, 0) 3926.311190
31 (3, 3, 0, 0) 3927.427240
32 (4, 2, 0, 0) 3927.427417
33 (3, 4, 0, 0) 3928.376406
34 (4, 3, 0, 0) 3929.059361
35 (4, 4, 0, 0) 3930.078725
'''
# set the parameters that give the lowest AIC
p, q, P, Q = result_table.parameters[0]
best_model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(p, d, q),
seasonal_order=(P, D, Q, s)).fit(disp=-1)
print(best_model.summary())
'''
Statespace Model Results
==========================================================================================
Dep. Variable: Ads No. Observations: 216
Model: SARIMAX(2, 1, 3)x(1, 1, 1, 24) Log Likelihood -1936.321
Date: Sat, 07 Mar 2020 AIC 3888.642
Time: 15:00:14 BIC 3914.660
Sample: 09-13-2017 HQIC 3899.181
- 09-21-2017
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.7913 0.270 2.928 0.003 0.262 1.321
ar.L2 -0.5503 0.306 -1.799 0.072 -1.150 0.049
ma.L1 -0.7316 0.262 -2.793 0.005 -1.245 -0.218
ma.L2 0.5651 0.282 2.005 0.045 0.013 1.118
ma.L3 -0.1811 0.092 -1.964 0.049 -0.362 -0.000
ar.S.L24 0.3312 0.076 4.351 0.000 0.182 0.480
ma.S.L24 -0.7635 0.104 -7.361 0.000 -0.967 -0.560
sigma2 4.574e+07 5.61e-09 8.15e+15 0.000 4.57e+07 4.57e+07
===================================================================================
Ljung-Box (Q): 43.70 Jarque-Bera (JB): 10.56
Prob(Q): 0.32 Prob(JB): 0.01
Heteroskedasticity (H): 0.65 Skew: -0.28
Prob(H) (two-sided): 0.09 Kurtosis: 4.00
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.08e+32. Standard errors may be unstable.
'''
tsplot(best_model.resid[24+1:], lags=60)
def plotSARIMA(series, model, n_steps):
"""Plots model vs predicted values
series - dataset with timeseries
model - fitted SARIMA model
n_steps - number of steps to predict in the future
"""
# adding model values
data = series.copy()
data.columns = ['actual']
data['sarima_model'] = model.fittedvalues
# making a shift on s+d steps, because these values were unobserved by the model
# due to the differentiating
data['sarima_model'][:s+d] = np.NaN
# forecasting on n_steps forward
forecast = model.predict(start = data.shape[0], end = data.shape[0]+n_steps)
forecast = data.sarima_model.append(forecast)
# calculate error, again having shifted on s+d steps from the beginning
error = mean_absolute_percentage_error(data['actual'][s+d:], data['sarima_model'][s+d:])
plt.figure(figsize=(15, 7))
plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
plt.plot(forecast, color='r', label="model")
plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
plt.plot(data.actual, label="actual")
plt.legend()
plt.grid(True)
plotSARIMA(ads, best_model, 50)