Time Series Forecasting in Python: SARIMA

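SARIMA(p,d,q)(P,D,Q,s) is the seasonal autoregressive integrated moving average model. It has seven structural parameters.

AR(p) is the autoregressive model, i.e. the series regressed on itself. The basic assumption is that the current value of the series depends on its own past values. p is the number of past values (lags) used to regress the predicted value.

To choose an initial p, look at the PACF plot and find the largest significant lag, after which the other lags are not significant.

MA(q) is the moving average model. It models the error of the time series and assumes that the current error depends on lagged errors. An initial value for q can be read from the ACF plot.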

Combining the two: $AR(p) + MA(q) = ARMA(p, q)$, which is the autoregressive moving average model.
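Written out, the combination looks roughly as follows. This is a sketch in common textbook notation; the symbols $c$ (constant), $\phi_i$ (AR coefficients), $\theta_j$ (MA coefficients) and $\varepsilon_t$ (error at time $t$) are assumptions introduced here and do not appear in the original post.

```latex
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

The first sum is the AR(p) part (past values of the series), and the second sum is the MA(q) part (past errors).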

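The remaining parameters:

I(d) is the order of integration. The "integrated" part arises because we first difference the series d times to make it stationary. As an analogy, differentiating a parabola twice gives a constant (stationary) acceleration; once that acceleration is estimated correctly, integrating twice recovers the original series.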
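The parabola analogy can be checked on a discrete series, where differencing plays the role of differentiation and cumulative summation plays the role of integration. The sketch below is a minimal illustration using only the standard library; the helper names `difference` and `integrate` and the toy parabola are made up for this example.

```python
from itertools import accumulate

def difference(series):
    """First difference: series[t] - series[t - 1]."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def integrate(diffs, first_value):
    """Undo one round of differencing by cumulative summation from first_value."""
    return list(accumulate(diffs, initial=first_value))

# Toy parabola sampled at t = 0..7 (hypothetical data)
parabola = [3 * t * t + 2 * t + 1 for t in range(8)]
first_diff = difference(parabola)      # linear, still trending
second_diff = difference(first_diff)   # constant, the "acceleration"

# Integrate twice, supplying the starting value lost at each differencing step
restored = integrate(integrate(second_diff, first_diff[0]), parabola[0])
print(second_diff)            # [6, 6, 6, 6, 6, 6]
print(restored == parabola)   # True
```

Note that each round of differencing drops one observation, so the starting values have to be kept around to integrate back.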

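We now have the ARIMA model, which can model non-stationary series without seasonal variation.

S(s) models the seasonality of the series; s is the length of the season.

Once seasonality is involved, three more parameters are needed: (P, D, Q).

P is the order of the seasonal autoregression, inferred from the PACF. Unlike the small p, here you look at the lags that are multiples of the season length. For example, if the season length is 24, check the strength of lags 24, 48, and 72 on the PACF plot; if the PACF is significant up to lag 48 (but not at lag 72), then P equals 2.

Q is chosen the same way as P, but from the ACF plot.

D is the order of seasonal differencing, usually 0 or 1; it is 1 if seasonal differencing was applied.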
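Putting the seven parameters together, the model is often written compactly with the backshift operator $B$, where $B y_t = y_{t-1}$. The form below is a sketch in common textbook notation, not something taken from the original post: $\phi_p$ and $\theta_q$ are assumed to be polynomials of degree p and q in $B$, $\Phi_P$ and $\Theta_Q$ polynomials of degree P and Q in $B^s$, and $\varepsilon_t$ the error term. Any constant term is omitted, and sign conventions for the coefficients vary between references.

```latex
\Phi_P(B^s)\,\phi_p(B)\,(1 - B)^d\,(1 - B^s)^D\, y_t = \Theta_Q(B^s)\,\theta_q(B)\,\varepsilon_t
```

Here $(1 - B)^d$ is the ordinary differencing and $(1 - B^s)^D$ is the seasonal differencing described above.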

That was a lot of theory, and it may not have sunk in yet, so let's build an actual SARIMA model.

Import packages

import warnings                                  # do not disturb mode
warnings.filterwarnings('ignore')

# Load packages
import numpy as np                               # vectors and matrices
import pandas as pd                              # tables and data manipulations
import matplotlib.pyplot as plt                  # plots
import seaborn as sns                            # more plots

from dateutil.relativedelta import relativedelta # working with dates with style
from scipy.optimize import minimize              # for function minimization

import statsmodels.formula.api as smf            # statistics and econometrics
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

from itertools import product                    # some useful functions
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated

# Importing everything from forecasting quality metrics
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error

Autocorrelation and partial autocorrelation functions

ACF and PACF

# MAPE
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
    """
        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
        
        y - timeseries
        lags - how many lags to include in ACF, PACF calculation
    """
    
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()
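To make the MAPE helper above concrete, here is a small worked example. It is a standard-library-only sketch rather than the NumPy version used in this post; the function name `mape`, the zero check, and the numbers are assumptions added for illustration. The zero check is there because the percentage error divides by the actual value, so the metric is undefined whenever an actual value is 0.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (hypothetical stdlib-only variant)."""
    if any(actual == 0 for actual in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    errors = [abs((actual - pred) / actual) for actual, pred in zip(y_true, y_pred)]
    return sum(errors) / len(errors) * 100

# Made-up values: the individual errors are 10%, 10% and 0%
actual = [100, 200, 400]
predicted = [110, 180, 400]
score = mape(actual, predicted)
print(round(score, 2))   # 6.67
```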

Load the data

A time series of ad clicks

ads = pd.read_csv('ads.csv', index_col=['Time'], parse_dates=['Time'])

[Figure 1]
There are 216 = 24 × 9 data points in total: 9 days, 24 hours per day.

plt.figure(figsize=(18, 6))
plt.plot(ads.Ads)
plt.title('Ads watched (hourly data)')
plt.grid(True)
plt.show()

[Figure 2: Ads watched (hourly data)]
Plot the autocorrelation and partial autocorrelation functions of the series

tsplot(ads.Ads, lags=60)

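[Figure 3: time series, ACF, and PACF plots of the original series]
As the plots show, the ad data is strongly seasonal, with a season length of 24 hours (one day), so we first take a seasonal difference to remove the seasonality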

# The seasonal difference
ads_diff = ads.Ads - ads.Ads.shift(24)
tsplot(ads_diff[24:], lags=60)

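[Figure 4: time series, ACF, and PACF plots after seasonal differencing]
With the seasonality removed, doesn't that feel much better?

But the ACF and PACF plots still show too many significant lags.

Let's difference once more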

ads_diff = ads_diff - ads_diff.shift(1)
tsplot(ads_diff[24+1:], lags=60)

[Figure 5: time series, ACF, and PACF plots after seasonal and first differencing]
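The two differencing steps above can be traced on a toy series to see what each one removes and why the code slices off the first 24 and then 24+1 values. This is a minimal standard-library sketch; the toy series and the helper `difference` are made up for illustration and stand in for the `shift`-based pandas code.

```python
s = 24          # season length, as in the post
n_obs = 216     # 9 days of hourly data

# Toy series (hypothetical): a linear trend plus a repeating daily pattern
daily_pattern = [(hour - 12) ** 2 for hour in range(s)]
toy = [5 * t + daily_pattern[t % s] for t in range(n_obs)]

def difference(series, lag):
    """Return series[t] - series[t - lag]; the first `lag` values have no partner and are dropped."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

seasonal_diff = difference(toy, s)          # removes the daily pattern
both_diffs = difference(seasonal_diff, 1)   # removes what is left of the trend

print(len(seasonal_diff), len(both_diffs))   # 192 191
print(set(seasonal_diff), set(both_diffs))   # {120} {0}
```

After the seasonal difference only the trend contribution (5 × 24 = 120 per step) remains, and the first difference flattens that to zero. In total s + d = 25 observations are lost, which matches the `[24+1:]` slice.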

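Choosing the SARIMA parameters

  • p is 4, because lag 4 stands out on the PACF and the lags right after it are not significant.
  • d is 1, because we took a first difference.
  • q should be around 4, judging by the ACF.
  • s is 24, without a doubt.
  • P might be 2, because lags 24 (1s) and 48 (2s) are somewhat significant on the PACF.
  • D is 1, because we took a seasonal difference.
  • Q is probably 1: lag 24 (1s) on the ACF is clearly significant, while lag 48 is not.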

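Of course, these are only rough estimates; we will let a program pick the best parameters.

What we can be sure of is: d = 1, D = 1, s = 24

Below we list all the candidate parameter combinations, 36 in total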

# setting initial values and some bounds for them
ps = range(2, 5)
d=1 
qs = range(2, 5)
Ps = range(0, 2)
D=1 
Qs = range(0, 2)
s = 24 # season length is still 24

# creating list with all the possible combinations of parameters
parameters = product(ps, qs, Ps, Qs)
parameters_list = list(parameters)
len(parameters_list)  # 36

Tuning the parameters by AIC

def optimizeSARIMA(parameters_list, d, D, s):
    """Return dataframe with parameters and corresponding AIC
        
        parameters_list - list with (p, q, P, Q) tuples
        d - integration order in ARIMA model
        D - seasonal integration order 
        s - length of season
    """
    
    results = []
    best_aic = float("inf")

    for param in tqdm_notebook(parameters_list):
        # we need try-except because on some combinations model fails to converge
        try:
            model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(param[0], d, param[1]), 
                                            seasonal_order=(param[2], D, param[3], s)).fit(disp=-1)
        except Exception:
            continue
        aic = model.aic
        # saving best model, AIC and parameters
        if aic < best_aic:
            best_model = model
            best_aic = aic
            best_param = param
        results.append([param, model.aic])

    result_table = pd.DataFrame(results)
    result_table.columns = ['parameters', 'aic']
    # sorting in ascending order, the lower AIC is - the better
    result_table = result_table.sort_values(by='aic', ascending=True).reset_index(drop=True)
    
    return result_table

The best parameters turn out to be p = 2, q = 3, P = 1, Q = 1

warnings.filterwarnings("ignore") 
result_table = optimizeSARIMA(parameters_list, d, D, s)
'''
	 parameters	    aic
0	(2, 3, 1, 1)	3888.642174
1	(3, 2, 1, 1)	3888.763568
2	(4, 2, 1, 1)	3890.279740
3	(3, 3, 1, 1)	3890.513196
4	(2, 4, 1, 1)	3892.302849
5	(4, 3, 1, 1)	3892.322855
6	(3, 4, 1, 1)	3893.762846
7	(4, 4, 1, 1)	3894.327967
8	(2, 2, 1, 1)	3894.798147
9	(2, 3, 0, 1)	3897.170902
10	(3, 2, 0, 1)	3897.815032
11	(4, 2, 0, 1)	3899.073591
12	(3, 3, 0, 1)	3899.165271
13	(3, 4, 0, 1)	3900.500309
14	(2, 4, 0, 1)	3900.502494
15	(4, 3, 0, 1)	3901.255700
16	(4, 4, 0, 1)	3902.650501
17	(2, 2, 0, 1)	3903.905714
18	(2, 3, 1, 0)	3909.281188
19	(3, 2, 1, 0)	3909.502838
20	(4, 2, 1, 0)	3910.927759
21	(3, 3, 1, 0)	3911.192654
22	(3, 4, 1, 0)	3911.344351
23	(2, 4, 1, 0)	3911.809710
24	(4, 4, 1, 0)	3913.084053
25	(4, 3, 1, 0)	3913.409057
26	(2, 2, 1, 0)	3914.786853
27	(3, 2, 0, 0)	3925.433806
28	(2, 2, 0, 0)	3925.786566
29	(2, 3, 0, 0)	3925.879649
30	(2, 4, 0, 0)	3926.311190
31	(3, 3, 0, 0)	3927.427240
32	(4, 2, 0, 0)	3927.427417
33	(3, 4, 0, 0)	3928.376406
34	(4, 3, 0, 0)	3929.059361
35	(4, 4, 0, 0)	3930.078725
'''
# set the parameters that give the lowest AIC
p, q, P, Q = result_table.parameters[0]

best_model=sm.tsa.statespace.SARIMAX(ads.Ads, order=(p, d, q), 
                                        seasonal_order=(P, D, Q, s)).fit(disp=-1)
print(best_model.summary())
'''
                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                                Ads   No. Observations:                  216
Model:             SARIMAX(2, 1, 3)x(1, 1, 1, 24)   Log Likelihood               -1936.321
Date:                            Sat, 07 Mar 2020   AIC                           3888.642
Time:                                    15:00:14   BIC                           3914.660
Sample:                                09-13-2017   HQIC                          3899.181
                                     - 09-21-2017                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.7913      0.270      2.928      0.003       0.262       1.321
ar.L2         -0.5503      0.306     -1.799      0.072      -1.150       0.049
ma.L1         -0.7316      0.262     -2.793      0.005      -1.245      -0.218
ma.L2          0.5651      0.282      2.005      0.045       0.013       1.118
ma.L3         -0.1811      0.092     -1.964      0.049      -0.362      -0.000
ar.S.L24       0.3312      0.076      4.351      0.000       0.182       0.480
ma.S.L24      -0.7635      0.104     -7.361      0.000      -0.967      -0.560
sigma2      4.574e+07   5.61e-09   8.15e+15      0.000    4.57e+07    4.57e+07
===================================================================================
Ljung-Box (Q):                       43.70   Jarque-Bera (JB):                10.56
Prob(Q):                              0.32   Prob(JB):                         0.01
Heteroskedasticity (H):               0.65   Skew:                            -0.28
Prob(H) (two-sided):                  0.09   Kurtosis:                         4.00
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.08e+32. Standard errors may be unstable.
'''
tsplot(best_model.resid[24+1:], lags=60)

[Figure 6: time series, ACF, and PACF plots of the model residuals]
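The AIC that drives the parameter search can be reproduced by hand from the model summary above, using the standard definition AIC = 2k − 2·ln(L). The sketch below plugs in the reported log-likelihood and counts k = 8 estimated parameters (the seven coefficients plus sigma2). The BIC line rests on an assumption: that the 25 observations lost to differencing (d + D·s) are not counted, which is consistent with the reported value but is inferred from the numbers rather than stated in the post.

```python
import math

log_likelihood = -1936.321   # "Log Likelihood" from the summary
n_params = 8                 # ar.L1, ar.L2, ma.L1, ma.L2, ma.L3, ar.S.L24, ma.S.L24, sigma2

# AIC = 2k - 2 ln(L)
aic = 2 * n_params - 2 * log_likelihood

# Assumption: effective sample size excludes the d + D*s = 25 differenced-away points
n_effective = 216 - (1 + 1 * 24)
# BIC = k ln(n) - 2 ln(L)
bic = n_params * math.log(n_effective) - 2 * log_likelihood

print(round(aic, 3))   # 3888.642
print(round(bic, 2))   # 3914.66
```

Both values agree with the AIC (3888.642) and BIC (3914.660) printed in the summary. Since every extra parameter adds 2 to the AIC, a larger model has to improve the log-likelihood by more than 1 per parameter to rank higher in the table.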

def plotSARIMA(series, model, n_steps):
    """Plots actual vs model values
        
        series - dataset with timeseries
        model - fitted SARIMA model
        n_steps - number of steps to predict in the future    
    """
    
    # adding model values
    data = series.copy()
    data.columns = ['actual']
    data['sarima_model'] = model.fittedvalues
    # blanking the first s+d fitted values, because these values were
    # unobserved by the model due to the differencing
    # (.loc avoids chained assignment; np.NaN was removed in NumPy 2.0)
    data.loc[data.index[:s+d], 'sarima_model'] = np.nan
    
    # forecasting on n_steps forward 
    forecast = model.predict(start=data.shape[0], end=data.shape[0]+n_steps)
    # Series.append was removed in pandas 2.0, so concatenate instead
    forecast = pd.concat([data.sarima_model, forecast])
    # calculate error, again having shifted on s+d steps from the beginning
    error = mean_absolute_percentage_error(data['actual'][s+d:], data['sarima_model'][s+d:])

    plt.figure(figsize=(15, 7))
    plt.title("Mean Absolute Percentage Error: {0:.2f}%".format(error))
    plt.plot(forecast, color='r', label="model")
    plt.axvspan(data.index[-1], forecast.index[-1], alpha=0.5, color='lightgrey')
    plt.plot(data.actual, label="actual")
    plt.legend()
    plt.grid(True)

plotSARIMA(ads, best_model, 50)

[Figure 7: actual values versus SARIMA model values and forecast]
