SARIMAX (seasonal autoregressive integrated moving average with exogenous regressor)是一种常见的时间序列预测方法,可以分为趋势部分和周期性部分;每个部分又可以分为自回归、差分和平滑部分。
趋势稳定性检测:Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test
null-hypothesis: 时间序列趋势稳定。significance: 0.05. 选择KPSS和不是Dickey Fuller由于KPSS的null-hypothesis是趋势稳定,所以接受条件相对于Dickey Fuller更宽松,差分阶数更少。
y t = β ′ D t + μ t + u t y_t = \beta^{\prime} D_t + \mu_t + u_t yt=β′Dt+μt+ut
μ t = μ t − 1 + ϵ t , \mu_t = \mu_{t-1} + \epsilon_t, μt=μt−1+ϵt,
ϵ t ∼ W N ( 0 , σ ϵ 2 ) \epsilon_t \sim WN(0, \sigma^2_\epsilon) ϵt∼WN(0,σϵ2)
其中D为确定时间序列趋势/常数。u为随机漫步。epsilon为残差。
LM statistic:
K P S S = ( T − 2 ∑ t = 1 T S ^ t 2 ) / λ ^ 2 KPSS = (T^-2 \sum_{t=1}^{T} \hat S_t^2) / \hat\lambda^2 KPSS=(T−2∑t=1TS^t2)/λ^2
其中 S ^ t = ∑ j = 1 t u ^ j \hat S_t = \sum_{j=1}^{t} \hat u_j S^t=∑j=1tu^j
u ^ \hat{u} u^ 是yt 拟合Dt的残差,$ \hat \lambda^2$ 是var(ut)的预估值。问题归结为拉格朗日乘数法证明 σ ϵ 2 = 0 \sigma^2_\epsilon = 0 σϵ2=0
季节稳定性检测:Canova-Hansen方法
null-hypothesis: 时间序列季节稳定性。significance: 0.05
测试频率:给定待测最大频率m以下所有 2pi/m整数倍。
时间序列yi的拟合:
y i = μ + x i ′ β + S i + e i y_i = \mu + x^{\prime}_i \beta + S_i + e_i yi=μ+xi′β+Si+ei
S i = ∑ j = 1 m / 2 f j i ′ γ j S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j Si=∑j=1m/2fji′γj
其中
f j i ′ = ( c o s ( ( j / q ) π i ) , s i n ( ( j / q ) π i ) ) f^{\prime}_{ji} = (cos((j/q) \pi i), sin((j/q) \pi i)) fji′=(cos((j/q)πi),sin((j/q)πi))
LM statistic:
L M = 1 n 2 t r a c e ( ( Ω ^ f ) − 1 ∑ i = 1 n F ^ i F ^ i ′ ) LM = \frac{1}{n^2} trace( (\hat \Omega ^f)^-1 \sum_{i=1}^{n} \hat F_i \hat F^{\prime}_i) LM=n21trace((Ω^f)−1i=1∑nF^iF^i′)
其中
F ^ i = ∑ t = 1 i f t e ^ t \hat F_i = \sum_{t=1}^{i}f_t \hat e_t F^i=∑t=1ifte^t
Ω ^ f = ∑ k = − m m w ( k m ) 1 n ∑ i f i + k e ^ i + k f i ′ e ^ i \hat \Omega^f = \sum_{k=-m}^{m} w (\frac{k}{m}) \frac{1}{n} \sum_{i} f_{i+k} \hat e_{i+k} f^{\prime}_i \hat e_i Ω^f=∑k=−mmw(mk)n1∑ifi+ke^i+kfi′e^i
用根据待检测频率m计算f’ji向量:用sample data示例代码如下:
import numpy as np
import pandas as pd
def seasonalDummy(tsArray, frequency):
n = len(tsArray)
m = frequency
#if m == 1: tsArray.reshape([n, m])
tt = np.arange(1, n + 1, 1)
mat = np.zeros([n, 2 * m], dtype = float)
for i in np.arange(0, m):
mat[:, 2 * i] = np.cos(2.0 * np.pi * (i + 1) * tt / m)
mat[:, 2 * i + 1] = np.sin(2.0 * np.pi * (i + 1) *tt / m)
return mat[:, 0:(m-1)]
tsArray = np.array([ 4.00195672, 4.99944175, 5.99861146, 7.00000213,
8.00124207, 9.00059539, 10.00029542, 10.99871969,
11.99728933, 12.99965231, 14.00148869, 15.0006378 ,
16.00117112, 17.00159081, 17.99848509, 18.99957668,
20.00066721, 20.99932292, 21.99992471, 23.00099164,
24.00127222, 25.00014385, 26.00014191, 27.00037435,
27.9985619 , 28.99949718, 29.99844772, 30.99911627,
31.99965086, 33.00211019, 34.00240288, 34.99889972,
36.00240406, 37.0002379 , 37.99958145, 38.99825111,
39.99932529, 40.9998374 , 42.00034236, 43.00206289,
43.9994545 , 45.00141283, 46.00041818, 47.00132581,
48.00216031, 48.99812424, 50.00060522, 51.00049892,
51.99817633, 52.9997362 ])
frequency = 6
pd.DataFrame(seasonalDummy(tsArray, frequency)).head(2)
Canova-Hansen用sample data示例代码如下:
from numpy.linalg import lstsq as lsq
n = len(tsArray)
frec = np.ones(int((frequency + 1) / 2), dtype = int)
ltrunc = int(np.round(frequency * np.power(n / 100.0, 0.25)))
y i = μ + x i ′ β + S i + e i y_i = \mu + x^{\prime}_i \beta + S_i + e_i yi=μ+xi′β+Si+ei
S i = ∑ j = 1 m / 2 f j i ′ γ j S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j Si=∑j=1m/2fji′γj
其中 f j i ′ = ( c o s ( ( j / q ) π i ) , s i n ( ( j / q ) π i ) ) f^{\prime}_{ji} = (cos((j/q) \pi i), sin((j/q) \pi i)) fji′=(cos((j/q)πi),sin((j/q)πi))
# create dummy column f'ji
r1 = seasonalDummy(tsArray, frequency)
#create intercept column for regression
r1wInterceptCol = np.column_stack([np.ones(r1.shape[0], dtype = float), r1])
# residual ei:
result = lsq(a = r1wInterceptCol, b = tsArray)
residual = tsArray - np.matmul(r1wInterceptCol, result[0])
long-run covariance matrix: Ω = l i m n → ∞ 1 n E ( F n F n ′ ) \Omega = lim_{n \to \infty}\frac{1}{n}E(F_n F_n^{\prime}) Ω=limn→∞n1E(FnFn′)
在ei可能有serial correlation的情况下可以用estimate
Ω ^ = ∑ k = − m m w ( k m ) 1 n ∑ i F i + k F i ′ \hat{\Omega} = \sum_{k=-m}^{m}w(\frac{k}{m})\frac{1}{n}\sum_{i}F_{i+k}F_i^{\prime} Ω^=k=−m∑mw(mk)n1i∑Fi+kFi′
fhat = np.zeros([n, frequency - 1], dtype = float)
fhataux = np.zeros([n, frequency - 1], dtype = float)
for i in np.arange(0, frequency - 1):
fhataux[:, i] = r1[:, i] * residual
for i in np.arange(0, n):
for j in np.arange(0, frequency - 1):
mySum = sum(fhataux[0:(i + 1), j])
fhat[i, j] = mySum
w ( ⋅ ) w(\cdot) w(⋅) 是kernel function,来自rob j. Hyndman的forecast包:
wnw = np.ones(ltrunc, dtype = float) - np.arange(1, ltrunc + 1, 1) / (ltrunc + 1)
Ω ^ f \hat{\Omega}^f Ω^f的计算:
Ne = fhataux.shape[0]
omnw = np.zeros([fhataux.shape[1], fhataux.shape[1]], dtype = float)
for k in np.arange(0, ltrunc):
omnw = omnw + np.matmul(fhataux.T[:, (k+1):Ne], fhataux[0:(Ne-(k+1)), :]) * float(wnw[k])
cross = np.matmul(fhataux.T, fhataux)
omnwplusTranspose = omnw + omnw.T
omfhat = (cross + omnwplusTranspose) / float(Ne)
Generalized Hannan’s model:通过设置矩阵A选择需要测试的频率的子集。在程序中用的A = eye:
sq = np.arange(0, frequency - 1, 2)
frecob = np.zeros(frequency - 1, dtype = int)
for i in np.arange(0, len(frec)):
if (frec[i] == 1) & (i + 1 == int(frequency / 2.0)):
frecob[sq[i]] = 1
if (frec[i] == 1) & (i + 1 < int(frequency / 2.0)):
frecob[sq[i]] = 1
frecob[sq[i] + 1] = 1
a = frecob.tolist().count(1) # find nr of 1's in frecob
A = np.zeros([frequency - 1, a], dtype = float)
j = 0
for i in np.arange(0, frequency - 1):
if frecob[i] == 1:
A[i, j] = 1
j = j + 1
LM statistic的计算 (Nyblom(1989), Hansen(1990)):当LM statistic值超过对应自由度的critical value时,拒绝 (null hypothesis = 没有单位根):
L M s t a t i s t i c = 1 n 2 t r a c e ( ( A ′ Ω ^ f A ) − 1 A ′ ∑ i = 1 n F i ^ F i ^ ′ A ) LM statistic = \frac{1}{n^2} trace((A' \hat{\Omega}^f A)^{-1} A' \sum_{i=1}^{n}\hat{F_i} \hat{F_i}^{\prime}A) LMstatistic=n21trace((A′Ω^fA)−1A′∑i=1nFi^Fi^′A)
其中 ∑ i = 1 n F i ^ \sum_{i=1}^{n}\hat{F_i} ∑i=1nFi^ 和 ∑ i = 1 n F i ^ ′ \sum_{i=1}^{n} \hat{F_i}^{\prime} ∑i=1nFi^′由前面步骤中的 f ^ \hat{f} f^ 给出:
from numpy.linalg import svd
aTomfhat = np.matmul(A.T, omfhat)
tmp = np.matmul(aTomfhat, A)
machineDoubleEps = 2.220446e-16
problems = min(svd(tmp)[1]) < machineDoubleEps # svd[1] are the singular values
if problems:
stL = 0.0
else:
solved = np.linalg.solve(tmp, np.eye(tmp.shape[1], dtype = float))
step1 = np.matmul(solved, A.T)
step2 = np.matmul(step1, fhat.T)
step3 = np.matmul(step2, fhat)
step4 = np.matmul(step3, A)
stL = (1.0 / np.power(n, 2.0)) * sum(np.diag(step4))
在Canova-Hansen test接受alternative hypothesis的情况下对时间序列进行lag = 待测频率的差分(季节差分)。
季节性检测:
在进行季节性时间序列稳定性检测之前,首先判断a.时间序列是否有季节性,和b.时间序列在什么频率上有季节性。结果会作为时间序列稳定性检测的参数输入。
季节性检测根据离散傅里叶变换和自相关函数的“与”关系得出结论(只有两个method都返回真值,才会判定时间序列有季节性)。
离散傅里叶变换吧时间序列从时域变为频域。变换后频域的新序列为:
X k = ∑ n = 0 N − 1 x n ⋅ [ c o s ( 2 π k n / N ) − i ⋅ s i n ( 2 π k n / N ) ] X_k = \sum_{n=0}^{N-1}x_n \cdot [cos(2 \pi kn/N) - i \cdot sin(2 \pi kn/N)] Xk=n=0∑N−1xn⋅[cos(2πkn/N)−i⋅sin(2πkn/N)]
在待检测频率上如果能量为最大值,则返回真值。
自相关函数检测最大lag = 待检频率各阶上的correlation coefficient。时间点t和s之间的自相关性R的定义为:
R ( s , t ) = E [ ( X t − μ t ) ( X s − μ s ) σ t σ s R(s, t) = \frac{E[(X_t - \mu_t)(X_s - \mu_s)}{\sigma_t \sigma_s} R(s,t)=σtσsE[(Xt−μt)(Xs−μs)
如果在待检频率上的相关系数超过双边confidence interval在0.05的临界值
clim = norm.ppf((1 + ci) / 2) / np.sqrt(n)
则method返回真值。
DFT调用了numpy.fft.fft方法。ACF调用了statsmodels.tsa.stattools.acf方法。
测量函数:根据不同情况采用不同模型测量方法。算法使用了Rob J. Hyndman的 MASE (mean absolute scaled error)。与其他测量方法的优劣对比。
定义:
R-squared:
S S t o t = ∑ i ( y i − y ˉ ) 2 SS_{tot} = \sum_{i}(y_i - \bar{y})^2 SStot=∑i(yi−yˉ)2
S S r e s = ∑ i ( y i − y ^ i ) 2 SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2 SSres=∑i(yi−y^i)2
R 2 = 1 − S S r e s S S t o t R^2 = 1 - \frac{SS_{res}}{SS_{tot}} R2=1−SStotSSres
RMSE:
R M S E = ∑ t = 1 n ( y ^ t − y t ) 2 n RMSE = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}} RMSE=n∑t=1n(y^t−yt)2
MAE:
M A E = ∑ t = 1 n ∣ y ^ t − y t ∣ n MAE = \frac{\sum_{t=1}^{n} |\hat{y}_t - y_t|}{n} MAE=n∑t=1n∣y^t−yt∣
MAPE:
M A P E = 100 n ∑ t = 1 n ∣ y ^ t − y t y t ∣ MAPE = \frac{100}{n} \sum_{t=1}^{n} | \frac{\hat{y}_t - y_t}{y_t} | MAPE=n100∑t=1n∣yty^t−yt∣
sMAPE:
S M A P E = 100 n ∑ t = 1 n ∣ y ^ t − y t ∣ ( ∣ y ^ t ∣ + ∣ y t ∣ ) / 2 SMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{|\hat{y}_t - y_t|}{(|\hat{y}_t| + |y_t|)/2} SMAPE=n100∑t=1n(∣y^t∣+∣yt∣)/2∣y^t−yt∣
MASE without seasonality:
M A S E = ∑ t = 1 n ∣ y ^ t − y t ∣ n n − 1 ∑ t = 2 n ∣ y t − y t − 1 ∣ MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-1} \sum_{t=2}^{n} |y_t - y_{t-1}|} MASE=n−1n∑t=2n∣yt−yt−1∣∑t=1n∣y^t−yt∣
MASE with seasonality:
M A S E = ∑ t = 1 n ∣ y ^ t − y t ∣ n n − m ∑ t = m + 1 n ∣ y t − y t − m ∣ MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-m} \sum_{t=m+1}^{n} |y_t - y_{t-m}|} MASE=n−mn∑t=m+1n∣yt−yt−m∣∑t=1n∣y^t−yt∣
趋势平滑:在SARIMA模型中引入时间序列的趋势作为exogenous regressor(X),有几种算法可以选择:
Lowess (locally weighted scatterplot smoothing): 基于KNN的非参数拟合方法。代码调用了 statsmodels.api.nonparametric.lowess
RANSAC
Random sample consensus,一种robust regression方法,可以探测异常值并使拟合对于异常值的敏感度降低。代码调用sklearn.linear_model.RANSACRegressor
Weighted moving average:
指数平滑递归表达:
W n = ( 1 − α ) ∗ W n − 1 + α ∗ y n W_n = (1-\alpha) * W_{n-1} + \alpha * y_n Wn=(1−α)∗Wn−1+α∗yn
W 0 = y 0 W_0 = y_0 W0=y0
α = 2 / ( s p a n + 1 ) \alpha = 2 / (span + 1) α=2/(span+1)
调用了 pandas.ewma
关于脉冲响应:
如果有另外的(多维)exogenous regressor Xi 影响预测模型,比如类似离散脉冲波形的机会点数据:假设各个脉冲regressor之间是独立的,并不受时间序列本身的影响:可以用多元线性回归首先发现Xi 和时间序列趋势yhat的关系。
考虑到历史上时间序列对于脉冲输入的响应不同,在算法中会测试三种不同脉冲响应与时间序列的相关性,并挑选相关性最强的脉冲响应作为Xi向量。这三种分别是a. 原始脉冲,b. 经指数平滑的脉冲,c. 经指数平滑并累加的脉冲(cumulative)。
示例:
%matplotlib inline
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
def myEwma(x, histPeriod = 6, fcstPeriod = 18):
from pandas import ewma
xfit = ewma(x, span = histPeriod)
xpred = np.zeros(fcstPeriod)
tmp = np.zeros(histPeriod + 1)
tmp[:histPeriod] = x[-histPeriod:].copy()
tmp[histPeriod] = xfit[-1]
for i in np.arange(0, fcstPeriod):
xpred[i] = ewma(tmp, span = histPeriod)[-1]
tmp = shiftLeft(tmp)
tmp[-1] = xpred[i]
return xfit, xpred
def shiftLeft(_ar, step = 1):
if step == 0:
return _ar
ar = _ar.copy()
ar[0:-step] = ar[step:]
ar[-step:] = 0
return ar
original = np.array([3000,0,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,0,0])
pd.DataFrame(original).plot(title = 'original')
movingAverage = myEwma(original)[0]
pd.DataFrame(movingAverage).plot(title = 'moving average')
cumulative = movingAverage.cumsum()
pd.DataFrame(cumulative).plot(title = 'cumulative')
检测相加性:时间序列拆分成趋势,季节性和残差的方式有相加和相乘两种。
y = Trend + Seasonality + Residual, 或者
y = Trend * Seasonality * Residual
如果是相乘的情况,残差的分布是不稳定的。所以如果时间序列没有通过相加性检测,会对时间序列做对数处理,变为相加:
y n e w = l o g ( y ) = l o g ( T r e n d ∗ S e a s o n a l i t y ∗ R e s i d u a l ) = l o g ( T r e n d ) + l o g ( S e a s o n a l i t y ) + l o g ( R e s i d u a l ) ynew = log(y) = log(Trend * Seasonality * Residual) = log(Trend) + log(Seasonality) + log(Residual) ynew=log(y)=log(Trend∗Seasonality∗Residual)=log(Trend)+log(Seasonality)+log(Residual)
自动搜索SARIMA参数空间 Auto ARIMA:搜索差分阶数,检测季节性的存在,并搜索能给出Akaike Information Criterion最小值的ARMA模型维度 ARMA(p, q, P, Q)
ARMA模型范式:
X t − α 1 X t − 1 − ⋯ − α p ′ X t − p ′ = ϵ t + θ 1 ϵ t − 1 + ⋯ + θ q ϵ t − q X_t - \alpha_1 X_{t-1} - \dots - \alpha_{p^{\prime}} X_{t-p^{\prime}} = \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} Xt−α1Xt−1−⋯−αp′Xt−p′=ϵt+θ1ϵt−1+⋯+θqϵt−q
自动搜索参数空间返回的结果是p和q的个数。作为输入给SARIMAX核心算法
搜索参数空间的初始值来自Rob J. Hyndman的paper section3.2.
SARIMAX 带外部变量的季节性自回归差分平滑算法:预测是基于SARIMAX的state-space representation。 核心算法调用了statsmodels.api.tsa.statespace.SARIMAX
.
以上是SARIMAX各个功能模块的详细数学方法和编程实现。由于一些语言(包括python)的时间序列预测开源算法包里缺少比如稳定性检测、季节性差分和自动搜索参数空间的功能,需要自己依照数学公式实现。