Seasonal ARIMA: Time Series Forecasting

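SARIMAX (seasonal autoregressive integrated moving average with exogenous regressors) is a common time series forecasting method. The model has a trend (non-seasonal) part and a seasonal part, and each part in turn has autoregressive, differencing, and moving-average components.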

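Trend stationarity test: Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test

Null hypothesis: the time series is trend-stationary. Significance level: 0.05. KPSS is chosen over Dickey-Fuller because its null hypothesis is trend stationarity. A series is therefore accepted as stationary more readily than under Dickey-Fuller, which leads to fewer orders of differencing.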
$$y_t = \beta^{\prime} D_t + \mu_t + u_t$$
$$\mu_t = \mu_{t-1} + \epsilon_t,$$
$$\epsilon_t \sim WN(0, \sigma^2_\epsilon)$$

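where $D_t$ holds the deterministic components (a constant, or a constant plus a time trend), $\mu_t$ is a random walk, $u_t$ is a stationary error term, and $\epsilon_t$ is the white-noise innovation of the random walk.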

LM statistic:
$$KPSS = \left(T^{-2} \sum_{t=1}^{T} \hat S_t^2\right) / \hat\lambda^2$$
where $\hat S_t = \sum_{j=1}^{t} \hat u_j$, $\hat u_t$ is the residual from regressing $y_t$ on $D_t$, and $\hat\lambda^2$ is an estimate of the long-run variance of $u_t$. The test reduces to a Lagrange multiplier test of the hypothesis $\sigma^2_\epsilon = 0$.
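The statistic above can be sketched directly in NumPy. This is a minimal illustration for the level-stationary case (where $D_t$ is just a constant), not the article's implementation: the function name `kpss_statistic`, the Bartlett weights, and the rule-of-thumb truncation lag are assumptions made for the example. For real work, `statsmodels.tsa.stattools.kpss` is likely the better choice.

```python
import numpy as np

def kpss_statistic(y, n_lags=None):
    """KPSS statistic for level stationarity (D_t is a constant only)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # u_hat: residuals from regressing y_t on a constant, i.e. demeaning
    u_hat = y - y.mean()
    # S_hat_t = sum_{j <= t} u_hat_j
    S_hat = np.cumsum(u_hat)
    if n_lags is None:
        # assumed rule-of-thumb truncation lag, not taken from the article
        n_lags = int(np.ceil(12.0 * (T / 100.0) ** 0.25))
    # lambda_hat^2: long-run variance estimate with Bartlett weights (assumption)
    lam2 = np.sum(u_hat ** 2) / T
    for k in range(1, n_lags + 1):
        w = 1.0 - k / (n_lags + 1.0)
        lam2 += 2.0 * w * np.sum(u_hat[k:] * u_hat[:-k]) / T
    return np.sum(S_hat ** 2) / (T ** 2 * lam2)

rng = np.random.default_rng(0)
noise = rng.normal(size=200)          # stationary around a constant
ramp = np.arange(200, dtype=float)    # steady drift, clearly not level-stationary
stat_noise = kpss_statistic(noise)
stat_ramp = kpss_statistic(ramp)
```

A large statistic argues against stationarity. The commonly tabulated 5% critical value for the level-stationary case is 0.463, which the drifting series exceeds.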

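Seasonal stationarity test: Canova-Hansen method
Null hypothesis: the time series is seasonally stationary. Significance level: 0.05.

Frequencies tested: given the period m under test, all integer multiples of $2\pi/m$ up to $\pi$.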

Regression model for the time series $y_i$:
$$y_i = \mu + x^{\prime}_i \beta + S_i + e_i$$
$$S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j$$
where $q = m/2$ and
$$f^{\prime}_{ji} = (\cos((j/q) \pi i), \sin((j/q) \pi i))$$

LM statistic:
$$LM = \frac{1}{n^2} \operatorname{trace}\left( (\hat \Omega^f)^{-1} \sum_{i=1}^{n} \hat F_i \hat F^{\prime}_i \right)$$

where
$$\hat F_i = \sum_{t=1}^{i} f_t \hat e_t$$
$$\hat \Omega^f = \sum_{k=-m}^{m} w\left(\frac{k}{m}\right) \frac{1}{n} \sum_{i} f_{i+k} \hat e_{i+k} f^{\prime}_i \hat e_i$$

The vectors $f^{\prime}_{ji}$ are computed from the period m under test. Example code with sample data:

import numpy as np
import pandas as pd

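def seasonalDummy(tsArray, frequency):
    # Build the trigonometric seasonal regressors f'_ji for period m = frequency
    n = len(tsArray)
    m = frequency
    tt = np.arange(1, n + 1, 1)
    mat = np.zeros([n, 2 * m], dtype = float)
    for i in np.arange(0, m):
        # column pair (cos, sin) at frequency 2 * pi * (i + 1) / m
        mat[:, 2 * i] = np.cos(2.0 * np.pi * (i + 1) * tt / m)
        mat[:, 2 * i + 1] = np.sin(2.0 * np.pi * (i + 1) * tt / m)
    # keep the first m - 1 columns; the later ones are aliases or identically zero
    return mat[:, 0:(m-1)]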

tsArray = np.array([  4.00195672,   4.99944175,   5.99861146,   7.00000213,
         8.00124207,   9.00059539,  10.00029542,  10.99871969,
        11.99728933,  12.99965231,  14.00148869,  15.0006378 ,
        16.00117112,  17.00159081,  17.99848509,  18.99957668,
        20.00066721,  20.99932292,  21.99992471,  23.00099164,
        24.00127222,  25.00014385,  26.00014191,  27.00037435,
        27.9985619 ,  28.99949718,  29.99844772,  30.99911627,
        31.99965086,  33.00211019,  34.00240288,  34.99889972,
        36.00240406,  37.0002379 ,  37.99958145,  38.99825111,
        39.99932529,  40.9998374 ,  42.00034236,  43.00206289,
        43.9994545 ,  45.00141283,  46.00041818,  47.00132581,
        48.00216031,  48.99812424,  50.00060522,  51.00049892,
        51.99817633,  52.9997362 ])
frequency = 6
pd.DataFrame(seasonalDummy(tsArray, frequency)).head(2)

Canova-Hansen example code with sample data:

from numpy.linalg import lstsq as lsq

n = len(tsArray)
frec = np.ones(int((frequency + 1) / 2), dtype = int)
ltrunc = int(np.round(frequency * np.power(n / 100.0, 0.25)))

$$y_i = \mu + x^{\prime}_i \beta + S_i + e_i$$
$$S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j$$

where $f^{\prime}_{ji} = (\cos((j/q) \pi i), \sin((j/q) \pi i))$

# create dummy column f'ji
r1 = seasonalDummy(tsArray, frequency)
#create intercept column for regression
r1wInterceptCol = np.column_stack([np.ones(r1.shape[0], dtype = float), r1])

# residual ei:
result = lsq(a = r1wInterceptCol, b = tsArray)
residual = tsArray - np.matmul(r1wInterceptCol, result[0])

Long-run covariance matrix: $\Omega = \lim_{n \to \infty}\frac{1}{n}E(F_n F_n^{\prime})$
When $e_i$ may be serially correlated, it can be estimated by
$$\hat{\Omega}^f = \sum_{k=-m}^{m} w\left(\frac{k}{m}\right)\frac{1}{n}\sum_{i} f_{i+k} \hat e_{i+k} f_i^{\prime} \hat e_i$$

# f_i * e_hat_i for every observation i (one column per seasonal regressor)
fhataux = r1 * residual[:, np.newaxis]
# F_hat_i: cumulative sum of f_t * e_hat_t over t = 1..i
fhat = np.cumsum(fhataux, axis = 0)

$w(\cdot)$ is the kernel function, taken from Rob J. Hyndman's forecast package:

wnw = np.ones(ltrunc, dtype = float) - np.arange(1, ltrunc + 1, 1) / (ltrunc + 1)

Computing $\hat{\Omega}^f$:

Ne = fhataux.shape[0]
omnw = np.zeros([fhataux.shape[1], fhataux.shape[1]], dtype = float)
for k in np.arange(0, ltrunc):
    omnw = omnw + np.matmul(fhataux.T[:, (k+1):Ne], fhataux[0:(Ne-(k+1)), :]) * float(wnw[k])
cross = np.matmul(fhataux.T, fhataux)
omnwplusTranspose = omnw + omnw.T
omfhat = (cross + omnwplusTranspose) / float(Ne)

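Generalized Hannan's model: the matrix A selects the subset of frequencies to test. The program uses A = eye (the identity matrix), so every frequency is tested: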

sq = np.arange(0, frequency - 1, 2)
frecob = np.zeros(frequency - 1, dtype = int)    
for i in np.arange(0, len(frec)):
    if (frec[i] == 1) & (i + 1 == int(frequency / 2.0)):
        frecob[sq[i]] = 1
    if (frec[i] == 1) & (i + 1 < int(frequency / 2.0)):
        frecob[sq[i]] = 1
        frecob[sq[i] + 1] = 1

a = frecob.tolist().count(1)  # find nr of 1's in frecob
A = np.zeros([frequency - 1, a], dtype = float)
j = 0
for i in np.arange(0, frequency - 1):
    if frecob[i] == 1:
        A[i, j] = 1
        j = j + 1

Computing the LM statistic (Nyblom (1989), Hansen (1990)): when the LM statistic exceeds the critical value for the corresponding degrees of freedom, the null hypothesis (no unit root) is rejected:
$$LM = \frac{1}{n^2} \operatorname{trace}\left((A' \hat{\Omega}^f A)^{-1} A' \sum_{i=1}^{n}\hat{F}_i \hat{F}_i^{\prime} A\right)$$

where $\sum_{i=1}^{n}\hat{F}_i \hat{F}_i^{\prime}$ is given by `fhat` from the previous step:

from numpy.linalg import svd

aTomfhat = np.matmul(A.T, omfhat)
tmp = np.matmul(aTomfhat, A)

machineDoubleEps = 2.220446e-16

problems = min(svd(tmp)[1]) < machineDoubleEps # svd[1] are the singular values
if problems:
    stL = 0.0
else:
    solved = np.linalg.solve(tmp, np.eye(tmp.shape[1], dtype = float))
    step1 = np.matmul(solved, A.T)
    step2 = np.matmul(step1, fhat.T)
    step3 = np.matmul(step2, fhat)
    step4 = np.matmul(step3, A)
    stL = (1.0 / np.power(n, 2.0)) * sum(np.diag(step4))

If the Canova-Hansen test accepts the alternative hypothesis, the time series is differenced at lag = the period under test (seasonal differencing).
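Seasonal differencing at lag m can be sketched as follows. The helper name `seasonal_difference` and the sample series are hypothetical and only illustrate the operation $y_t - y_{t-m}$.

```python
import numpy as np

def seasonal_difference(y, m):
    """Return y_t - y_{t-m}; the first m observations are dropped."""
    y = np.asarray(y, dtype=float)
    return y[m:] - y[:-m]

# hypothetical series: linear trend plus a pattern that repeats every 6 steps
t = np.arange(48)
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 2.0])
y = 0.5 * t + np.tile(pattern, 8)
dy = seasonal_difference(y, m=6)
```

In this example the repeating pattern cancels exactly and only the trend increment over one period remains (0.5 * 6 = 3 at every point).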

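Seasonality detection
Before testing seasonal stationarity, first determine (a) whether the time series is seasonal at all, and (b) at which frequency it is seasonal. The result is passed as a parameter to the stationarity tests.

Seasonality detection combines the discrete Fourier transform (DFT) and the autocorrelation function (ACF) with a logical AND: the series is judged seasonal only when both methods return true.

The discrete Fourier transform maps the time series from the time domain to the frequency domain. The transformed sequence is:

$$X_k = \sum_{n=0}^{N-1} x_n \cdot [\cos(2 \pi kn/N) - i \cdot \sin(2 \pi kn/N)]$$

If the energy is at its maximum at the frequency under test, the method returns true.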
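A minimal sketch of the DFT check, using `numpy.fft.fft`. The function name `dft_has_period`, the mean removal, and the rounding of the peak period are assumptions for illustration. The sketch also assumes the series has already been detrended, since a trend would put most of the energy into the lowest frequencies.

```python
import numpy as np

def dft_has_period(x, m):
    """True if the strongest non-zero frequency of x corresponds to period m."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # remove the mean so the k = 0 (constant) term does not dominate
    spectrum = np.fft.fft(x - x.mean())
    # energy |X_k|^2 for k = 1 .. n // 2 (the upper half mirrors the lower half)
    energy = np.abs(spectrum[1:n // 2 + 1]) ** 2
    k_peak = int(np.argmax(energy)) + 1
    # frequency index k corresponds to a period of n / k samples
    return int(round(n / k_peak)) == m

# hypothetical series: dominant period 12 plus a weaker period-5 component
t = np.arange(120)
x = 5.0 + np.sin(2.0 * np.pi * t / 12.0) + 0.3 * np.cos(2.0 * np.pi * t / 5.0)
has_12 = dft_has_period(x, 12)
has_6 = dft_has_period(x, 6)
```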

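The autocorrelation function checks the correlation coefficients at every lag up to the period under test (maximum lag = period under test). The autocorrelation R between time points t and s is defined as:

$$R(s, t) = \frac{E[(X_t - \mu_t)(X_s - \mu_s)]}{\sigma_t \sigma_s}$$

If the correlation coefficient at the period under test exceeds the critical value of the two-sided confidence interval at the 0.05 significance level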

clim = norm.ppf((1 + ci) / 2) / np.sqrt(n)

then the method returns true.

The DFT calls numpy.fft.fft. The ACF calls statsmodels.tsa.stattools.acf.
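The ACF check can be sketched without extra dependencies. This is an illustration under assumptions: the function name `acf_has_period` is hypothetical, `ci = 0.95` is assumed to correspond to the 0.05 level, the sample autocorrelation is computed by hand rather than with `statsmodels.tsa.stattools.acf`, and `statistics.NormalDist` from the standard library stands in for `norm.ppf`.

```python
import numpy as np
from statistics import NormalDist

def acf_has_period(x, m, ci=0.95):
    """True if the sample autocorrelation at lag m exceeds the critical value."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    # sample autocorrelation coefficient at lag m
    r_m = np.sum(xc[m:] * xc[:-m]) / np.sum(xc ** 2)
    # same critical value as in the text: norm.ppf((1 + ci) / 2) / sqrt(n)
    clim = NormalDist().inv_cdf((1 + ci) / 2) / np.sqrt(n)
    return bool(r_m > clim)

# hypothetical series with period 12
t = np.arange(120)
x = np.sin(2.0 * np.pi * t / 12.0)
acf_12 = acf_has_period(x, 12)
acf_6 = acf_has_period(x, 6)   # lag 6 is half a period, so the correlation is negative
```

The final decision would then be the logical AND of this result and the DFT result for the same period.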

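Error metrics: different model evaluation metrics suit different situations. The algorithm uses Rob J. Hyndman's MASE (mean absolute scaled error); the other common metrics are defined below for comparison.

Definitions: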
R-squared:
$$SS_{tot} = \sum_{i}(y_i - \bar{y})^2$$
$$SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

RMSE:
$$RMSE = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}}$$

MAE:
$$MAE = \frac{\sum_{t=1}^{n} |\hat{y}_t - y_t|}{n}$$

MAPE:
$$MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{\hat{y}_t - y_t}{y_t} \right|$$

sMAPE:
$$SMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{|\hat{y}_t - y_t|}{(|\hat{y}_t| + |y_t|)/2}$$

MASE without seasonality:
$$MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-1} \sum_{t=2}^{n} |y_t - y_{t-1}|}$$

MASE with seasonality:
$$MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-m} \sum_{t=m+1}^{n} |y_t - y_{t-m}|}$$
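The definitions above translate almost line by line into NumPy. The function names and the sample values are hypothetical; the scaling term of MASE is computed on the same series as the errors, exactly as the formulas are written.

```python
import numpy as np

def r_squared(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean((y_hat - y) ** 2))

def mae(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean(np.abs(y_hat - y))

def mape(y, y_hat):
    # undefined when any y_t is zero
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 100.0 * np.mean(np.abs((y_hat - y) / y))

def smape(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 100.0 * np.mean(np.abs(y_hat - y) / ((np.abs(y_hat) + np.abs(y)) / 2.0))

def mase(y, y_hat, m=1):
    """m = 1 gives the non-seasonal form, m > 1 the seasonal form."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    # mean absolute error of the (seasonal) naive forecast y_t = y_{t-m}
    naive_mae = np.sum(np.abs(y[m:] - y[:-m])) / (n - m)
    return np.mean(np.abs(y_hat - y)) / naive_mae

y = [2.0, 4.0, 6.0, 8.0]
y_hat = [3.0, 4.0, 5.0, 8.0]
scores = {"r2": r_squared(y, y_hat), "rmse": rmse(y, y_hat), "mae": mae(y, y_hat),
          "mase": mase(y, y_hat), "mase_m2": mase(y, y_hat, m=2)}
```

A MASE below 1 means the forecast beats the naive forecast on average. Unlike MAPE, it remains defined when the series contains zeros.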

Trend smoothing: the trend of the time series is introduced into the SARIMA model as the exogenous regressor (X). Several algorithms are available:

Lowess (locally weighted scatterplot smoothing): a non-parametric fitting method based on k-nearest neighbors. The code calls statsmodels.api.nonparametric.lowess.

RANSAC
Random sample consensus, a robust regression method that detects outliers and makes the fit less sensitive to them. The code calls sklearn.linear_model.RANSACRegressor.

Weighted moving average:

Recursive form of exponential smoothing:
$$W_n = (1-\alpha) W_{n-1} + \alpha y_n$$
$$W_0 = y_0$$
$$\alpha = 2 / (span + 1)$$

The code calls pandas.ewma (removed from recent pandas releases, where the equivalent is the `ewm(span=...).mean()` method).
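The recursion can be written out by hand to make the role of `span` concrete. This is a sketch with a hypothetical function name `ewma_recursive`; in current pandas, `pandas.Series(y).ewm(span=span, adjust=False).mean()` follows the same recursion, while the default `adjust=True` uses a slightly different weighting for the first observations.

```python
import numpy as np

def ewma_recursive(y, span):
    """W_0 = y_0, W_n = (1 - alpha) * W_{n-1} + alpha * y_n, alpha = 2 / (span + 1)."""
    y = np.asarray(y, dtype=float)
    alpha = 2.0 / (span + 1.0)
    w = np.empty_like(y)
    w[0] = y[0]
    for n in range(1, len(y)):
        w[n] = (1.0 - alpha) * w[n - 1] + alpha * y[n]
    return w

# span = 3 gives alpha = 0.5
smoothed = ewma_recursive([1.0, 2.0, 3.0], span=3)
# smoothed is [1.0, 1.5, 2.25]
```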

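On impulse responses

Additional (multi-dimensional) exogenous regressors $X_i$ may affect the forecasting model, for example opportunity data that looks like a discrete pulse train. Assuming the pulse regressors are mutually independent and unaffected by the time series itself, multiple linear regression can first be used to find the relationship between $X_i$ and the time series trend $\hat y$.

Because the time series has historically responded to pulse inputs in different ways, the algorithm tests the correlation of three different impulse responses with the time series and picks the most strongly correlated one as the $X_i$ vector. The three are: (a) the raw pulse, (b) the exponentially smoothed pulse, and (c) the exponentially smoothed and accumulated (cumulative) pulse.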
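The selection step might look like the sketch below. The names `ewma` and `pick_impulse_response`, the use of the absolute Pearson correlation as the measure of "strongest", and the synthetic target series are all assumptions made for illustration.

```python
import numpy as np

def ewma(x, span):
    """Exponential smoothing: W_n = (1 - a) * W_{n-1} + a * x_n, a = 2 / (span + 1)."""
    x = np.asarray(x, dtype=float)
    a = 2.0 / (span + 1.0)
    w = np.empty_like(x)
    w[0] = x[0]
    for n in range(1, len(x)):
        w[n] = (1.0 - a) * w[n - 1] + a * x[n]
    return w

def pick_impulse_response(impulse, target, span=6):
    """Return the name, the candidate response, and all correlation scores."""
    impulse = np.asarray(impulse, dtype=float)
    smoothed = ewma(impulse, span)
    candidates = {
        "raw": impulse,                      # (a) raw pulse
        "smoothed": smoothed,                # (b) exponentially smoothed pulse
        "cumulative": np.cumsum(smoothed),   # (c) smoothed, then accumulated
    }
    # assumed criterion: largest absolute Pearson correlation with the target
    scores = {name: abs(np.corrcoef(resp, target)[0, 1])
              for name, resp in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best], scores

# pulse train with the same shape as the article's example
impulse = np.zeros(30)
impulse[0], impulse[10], impulse[19] = 3000.0, 1000.0, 1000.0
# hypothetical target that reacts cumulatively to the pulses
t = np.arange(30)
target = 5.0 + 0.01 * np.cumsum(ewma(impulse, 6)) + 0.1 * np.sin(t)
best_name, best_response, scores = pick_impulse_response(impulse, target)
```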

Example:

%matplotlib inline
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

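def myEwma(x, histPeriod = 6, fcstPeriod = 18):
    # pandas.ewma no longer exists in recent pandas; Series.ewm(...).mean() replaces it
    def ewma(values, span):
        return pd.Series(values, dtype = float).ewm(span = span).mean().values
    # smoothed history
    xfit = ewma(x, span = histPeriod)
    xpred = np.zeros(fcstPeriod)
    # rolling window: last histPeriod observations plus the latest smoothed value
    tmp = np.zeros(histPeriod + 1)
    tmp[:histPeriod] = x[-histPeriod:].copy()
    tmp[histPeriod] = xfit[-1]
    for i in np.arange(0, fcstPeriod):
        # forecast one step, then slide the window and append the forecast
        xpred[i] = ewma(tmp, span = histPeriod)[-1]
        tmp = shiftLeft(tmp)
        tmp[-1] = xpred[i]
    return xfit, xpred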

def shiftLeft(_ar, step = 1):
    if step == 0:
        return _ar
    ar = _ar.copy()
    ar[0:-step] = ar[step:]
    ar[-step:] = 0
    return ar

original = np.array([3000,0,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,0,0])
pd.DataFrame(original).plot(title = 'original')
movingAverage = myEwma(original)[0]
pd.DataFrame(movingAverage).plot(title = 'moving average')
cumulative = movingAverage.cumsum()
pd.DataFrame(cumulative).plot(title = 'cumulative')

Additivity test: a time series can be decomposed into trend, seasonality, and residual either additively or multiplicatively.

y = Trend + Seasonality + Residual, or

y = Trend * Seasonality * Residual

In the multiplicative case the distribution of the residual is not stable. So if the time series fails the additivity test, it is log-transformed to make it additive:

$$y_{new} = \log(y) = \log(Trend \cdot Seasonality \cdot Residual) = \log(Trend) + \log(Seasonality) + \log(Residual)$$
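A small numeric check of the log identity, using made-up components. The transform only works for strictly positive series, and forecasts produced on the log scale have to be mapped back with `np.exp`.

```python
import numpy as np

# hypothetical multiplicative components, all strictly positive
t = np.arange(1, 49)
trend = 10.0 + 0.5 * t
seasonality = 1.0 + 0.3 * np.sin(2.0 * np.pi * t / 12.0)
residual = np.exp(0.01 * np.cos(t))   # deterministic stand-in for noise
y = trend * seasonality * residual

y_new = np.log(y)                                               # additive on the log scale
additive = np.log(trend) + np.log(seasonality) + np.log(residual)
y_back = np.exp(y_new)                                          # back-transform
```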

Automatic search of the SARIMA parameter space (Auto ARIMA): search the differencing orders, detect whether seasonality is present, and search for the ARMA model orders (p, q, P, Q) that minimize the Akaike Information Criterion.

General form of the ARMA model:

$$X_t - \alpha_1 X_{t-1} - \dots - \alpha_{p^{\prime}} X_{t-p^{\prime}} = \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q}$$

The automatic search returns the orders p and q, which are passed as input to the core SARIMAX algorithm.

The initial values for the parameter-space search come from Section 3.2 of Rob J. Hyndman's paper.

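SARIMAX (seasonal autoregressive integrated moving average with exogenous regressors): forecasting is based on the state-space representation of SARIMAX. The core algorithm calls statsmodels.api.tsa.statespace.SARIMAX.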

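Summary

The sections above cover the mathematical methods and the implementation of each SARIMAX module. Because open-source time series forecasting packages in some languages (including Python) lack features such as stationarity tests, seasonal differencing, and automatic parameter-space search, these have to be implemented by hand from the mathematical formulas.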
