Reference blog: python实现时间序列分析 - ning-ML
Testing a time series
import numpy as np
import matplotlib.pyplot as plt
output = np.array([
97, 130, 156.5, 135.2, 137.7,
180.5, 205.2, 190, 188.6, 196.7,
180.3, 210.8, 196, 223, 238.2,
263.5, 292.6, 317, 335.4, 327,
321.9, 353.5, 397.8, 436.8, 465.7,
476.7, 462.6, 460.8, 501.8, 501.5,
489.5, 542.3, 512.2, 559.8, 542.0,
567.0
], dtype=float)
year = np.arange(1964, 2000)
Stationarity tests
Time series plot test
fig, ax = plt.subplots()
ax.plot(year, output, "+-")
plt.show()
The series has an obvious trend, so it is not a stationary series in the usual sense.
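The original post stops at this eyeball test; as a complement, a unit root test gives a quantitative check. A minimal sketch (not from the original), assuming statsmodels is installed and using its `adfuller` function:

```python
# Augmented Dickey-Fuller test; the null hypothesis is that the series has a
# unit root (i.e. is non-stationary). Illustration only, not in the original post.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, used_lag, n_obs, crit_values, _ = adfuller(output)
print("ADF statistic:", adf_stat)
print("p-value:", p_value)            # a large p-value: cannot reject non-stationarity
print("critical values:", crit_values)
```

Given the visible trend, a large p-value is to be expected here, agreeing with the time plot.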
ACF plot test
def acf(data, lag=None):
    """First attempt at the sample ACF (each lag normalized by its own length)."""
    n = len(data)
    if lag is None:
        lag = int(n / 2)
    cov = []
    sample_mean = np.mean(data)
    # sample variance with denominator n - 1
    var = np.sum((data - sample_mean) ** 2) / (n - 1)
    cov.append(var)
    for i in range(1, lag):
        x = data[:n-i] - sample_mean
        y = data[i:] - sample_mean
        cov.append(np.mean(x * y))   # autocovariance estimated with denominator n - i
    cov = np.array(cov, dtype=float)
    acf = cov / var
    return acf, np.arange(len(acf))
acf_inf, lags = acf(output, lag=23)
print(acf_inf, lags)
fig, ax = plt.subplots()
ax.scatter(lags, acf_inf)
[ 1. 0.91392186 0.86822948 0.81369058 0.76064724 0.68729172
0.63440934 0.57485247 0.49429248 0.41601171 0.32584615 0.20512486
0.08432045 -0.0501945 -0.17245647 -0.28041079 -0.37630504 -0.4783755
-0.59591061 -0.71243415 -0.83519788 -0.97330822 -1.08197993] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22]
Strangely, this does not match the book, and the absolute value of the ACF should never exceed 1, so where did I go wrong? I figured it out: the standard estimator uses plain sums in both the numerator and the denominator,

$$\hat{\rho}_k = \frac{\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n} (x_t - \bar{x})^2},$$

rather than the per-lag-averaged version computed above,

$$\hat{\rho}_k = \frac{\frac{1}{n-k}\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})}{\frac{1}{n-1}\sum_{t=1}^{n} (x_t - \bar{x})^2}.$$

The corrected version:
def acf(data, lag=None):
    """Sample ACF: both numerator and denominator are plain sums."""
    n = len(data)
    if lag is None:
        lag = int(n / 2)
    cov = []
    sample_mean = np.mean(data)
    var = np.sum((data - sample_mean) ** 2)   # denominator: total sum of squares
    cov.append(1)                             # rho_0 = 1 by definition
    for i in range(1, lag):
        x = data[:n-i] - sample_mean
        y = data[i:] - sample_mean
        cov.append(np.sum(x * y) / var)       # rho_i
    cov = np.array(cov, dtype=float)
    return cov, np.arange(len(cov))
acf_inf, lags = acf(output, lag=23)
print(acf_inf, lags)
fig, ax = plt.subplots()
ax.scatter(lags, acf_inf)
[ 1. 0.91392186 0.84342292 0.76719398 0.69544891 0.6087441
0.54377943 0.47630634 0.39543399 0.32092332 0.24205714 0.14651776
0.05781973 -0.03298495 -0.10840121 -0.16824648 -0.21503145 -0.25968956
-0.30646831 -0.34603944 -0.38180474 -0.41713209 -0.43279197] [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22]
The conclusion: the ACF plot shows an obvious triangular symmetry, which is a typical ACF pattern for a non-stationary series with a monotone trend.
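As a quick cross-check (not in the original post), statsmodels' own `acf` should reproduce these values, since it uses the same sum-over-sum estimator:

```python
# Cross-check the hand-rolled acf() against statsmodels (illustration only).
from statsmodels.tsa.stattools import acf as sm_acf
print(np.round(sm_acf(output, nlags=22, fft=False), 4))   # should match acf_inf above
```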
Randomness test
Skipped.
ARMA models
AR models
The autocorrelation function of an AR model has two notable properties: 1. it tails off; 2. it decays exponentially.
The general solution for the lag-$k$ autocorrelation coefficient is

$$\rho_k = \sum_{i=1}^{p} c_i \lambda_i^k,$$

where the $\lambda_i$ are the characteristic roots of the difference equation (with $|\lambda_i| < 1$ for a stationary model) and the $c_i$ are constants that are not all zero.
From this general form it is easy to see that $\rho_k$ always takes nonzero values and never becomes identically zero beyond some lag; this property is the tailing off.
And the decay at the exponential rate $\lambda_i^k$ is exactly the "short-term correlation" property referred to when judging stationarity from an ACF plot.
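As a small illustration of both properties (not from the original post): for an AR(1) process $x_t = \phi x_{t-1} + \varepsilon_t$ the theoretical ACF is simply $\rho_k = \phi^k$. A sketch comparing it with the sample ACF of a simulated series, reusing the `acf()` defined above:

```python
# Illustration only (not in the original): simulate an AR(1) with phi = 0.7 and
# compare its sample ACF against the theoretical rho_k = phi**k.
np.random.seed(0)
phi, n_sim = 0.7, 2000
eps = np.random.randn(n_sim)
x = np.zeros(n_sim)
for t in range(1, n_sim):
    x[t] = phi * x[t - 1] + eps[t]

sample_acf, _ = acf(x, lag=11)                 # the acf() defined above
theory_acf = phi ** np.arange(11)
print(np.round(sample_acf, 3))
print(np.round(theory_acf, 3))                 # nonzero at every lag, decaying geometrically
```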
The PACF of an AR(p) model cuts off after lag p, which can be proved using the theory of systems of linear equations; in fact this cutoff is one way of determining the order. The partial autocorrelation coefficients can also be obtained by solving the Yule-Walker equations, where the PACF at lag $k$ is $\phi_{kk}$:

$$\begin{pmatrix} 1 & \rho_1 & \cdots & \rho_{k-1} \\ \rho_1 & 1 & \cdots & \rho_{k-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{k-1} & \rho_{k-2} & \cdots & 1 \end{pmatrix} \begin{pmatrix} \phi_{k1} \\ \phi_{k2} \\ \vdots \\ \phi_{kk} \end{pmatrix} = \begin{pmatrix} \rho_1 \\ \rho_2 \\ \vdots \\ \rho_k \end{pmatrix}$$
def pcf(data, lag=None):
    """Compute partial autocorrelation coefficients via the Yule-Walker equations."""
    n = len(data)
    if lag is None:
        lag = int(n / 2)
    acf_inf, lags = acf(data, lag)
    # Toeplitz matrix of autocorrelations: M[i, j] = rho_{|i-j|}
    M = np.zeros((lag, lag), dtype=float)
    for i in range(lag):
        for j in range(i, lag):
            index = np.abs(i - j)
            M[i, j] = acf_inf[index]
            M[j, i] = acf_inf[index]
    return np.linalg.inv(M) @ acf_inf, lags
pcf_inf, lags = pcf(output, lag=30)
print(pcf_inf)
fig, ax = plt.subplots()
ax.scatter(lags, pcf_inf)
[ 1.00000000e+00 5.32907052e-15 -1.77635684e-15 8.88178420e-16
0.00000000e+00 -1.33226763e-15 -8.88178420e-16 -1.33226763e-15
4.44089210e-15 -2.22044605e-15 -1.33226763e-15 2.22044605e-15
-5.55111512e-16 7.77156117e-16 2.22044605e-16 7.77156117e-16
8.88178420e-16 -2.22044605e-16 -1.77635684e-15 1.77635684e-15
0.00000000e+00 2.22044605e-15 -4.44089210e-16 4.44089210e-16
-8.88178420e-16 -1.33226763e-15 8.88178420e-16 2.66453526e-15
-8.88178420e-16 -2.22044605e-16]
Did I mess something up again? This does not match the library's result either.
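My reading (the original never resolves this): in the Yule-Walker system the right-hand side should be $(\rho_1, \ldots, \rho_k)$, not $(\rho_0, \ldots, \rho_{k-1})$; since acf_inf is exactly the first column of M, `inv(M) @ acf_inf` just returns the trivial vector $(1, 0, \ldots, 0)$ printed above. A corrected sketch, where the PACF at lag $k$ is the last coefficient $\phi_{kk}$ of the order-$k$ system:

```python
# Corrected Yule-Walker solve (my sketch, not the original author's code).
def pcf_yw(data, lag=None):
    n = len(data)
    if lag is None:
        lag = int(n / 2)
    acf_inf, _ = acf(data, lag + 1)      # rho_0 .. rho_lag
    pacf_vals = [1.0]                    # by convention, the PACF at lag 0 is 1
    for k in range(1, lag):
        # order-k system: Toeplitz(rho_0..rho_{k-1}) @ phi = (rho_1..rho_k)
        R = np.array([[acf_inf[abs(i - j)] for j in range(k)] for i in range(k)])
        phi = np.linalg.solve(R, acf_inf[1:k + 1])
        pacf_vals.append(phi[-1])        # phi_kk is the PACF at lag k
    return np.array(pacf_vals), np.arange(lag)

pcf_fixed, lags_fixed = pcf_yw(output, lag=30)
print(np.round(pcf_fixed[:5], 3))        # the lag-1 value should be close to rho_1 ≈ 0.91
```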
MA models
The ACF of an MA(q) model cuts off after lag q, i.e. the autocorrelation coefficients are 0 beyond lag q.
The PACF of an MA(q) model tails off.
ARMA models
For an ARMA(p, q) model the ACF does not cut off, and neither does the PACF.
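These identification rules are easy to verify empirically. A quick sanity check (not from the original post; it borrows statsmodels a little ahead of the next section), simulating the three model types and printing their sample ACF/PACF:

```python
# Illustration only (not in the original): simulate AR(2), MA(2) and ARMA(1,1)
# processes and look at where the sample ACF / PACF cut off or tail off.
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.stattools import acf as sm_acf, pacf as sm_pacf

np.random.seed(1)
processes = {
    "AR(2)":     ArmaProcess(ar=[1, -0.5, -0.25], ma=[1]),
    "MA(2)":     ArmaProcess(ar=[1], ma=[1, 0.5, 0.4]),
    "ARMA(1,1)": ArmaProcess(ar=[1, -0.7], ma=[1, 0.4]),
}
for name, proc in processes.items():
    sample = proc.generate_sample(nsample=2000)
    print(name)
    print("  ACF :", np.round(sm_acf(sample, nlags=5, fft=False), 2))   # MA(2): ~0 after lag 2
    print("  PACF:", np.round(sm_pacf(sample, nlags=5), 2))             # AR(2): ~0 after lag 2
```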
Time series analysis with existing libraries
import pandas as pd
from random import randrange
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.api import tsa
import warnings
warnings.filterwarnings("ignore")
data = pd.DataFrame(output, columns=["output"],
index = pd.Index(tsa.datetools.dates_from_range("1964", "1999")))
data
 | output |
---|---|
1964-12-31 | 97.0 |
1965-12-31 | 130.0 |
1966-12-31 | 156.5 |
1967-12-31 | 135.2 |
1968-12-31 | 137.7 |
1969-12-31 | 180.5 |
1970-12-31 | 205.2 |
1971-12-31 | 190.0 |
1972-12-31 | 188.6 |
1973-12-31 | 196.7 |
1974-12-31 | 180.3 |
1975-12-31 | 210.8 |
1976-12-31 | 196.0 |
1977-12-31 | 223.0 |
1978-12-31 | 238.2 |
1979-12-31 | 263.5 |
1980-12-31 | 292.6 |
1981-12-31 | 317.0 |
1982-12-31 | 335.4 |
1983-12-31 | 327.0 |
1984-12-31 | 321.9 |
1985-12-31 | 353.5 |
1986-12-31 | 397.8 |
1987-12-31 | 436.8 |
1988-12-31 | 465.7 |
1989-12-31 | 476.7 |
1990-12-31 | 462.6 |
1991-12-31 | 460.8 |
1992-12-31 | 501.8 |
1993-12-31 | 501.5 |
1994-12-31 | 489.5 |
1995-12-31 | 542.3 |
1996-12-31 | 512.2 |
1997-12-31 | 559.8 |
1998-12-31 | 542.0 |
1999-12-31 | 567.0 |
data.plot();  # time series plot
plot_acf(data).show();  # ACF plot
plot_pacf(data).show();  # PACF plot
Differencing
data_diff = data.diff(1).dropna()
plot_acf(data_diff).show()
plot_pacf(data_diff).show()
Let's just fit an ARIMA(0, 1, 4):
result = ARIMA(data, order=(0, 1, 4)).fit(disp=False)
fitvalues = result.fittedvalues
fig, ax = plt.subplots()
ax.plot(data_diff, label="known")
ax.plot(fitvalues, label="fitted")
plt.title('ARIMA RSS: %.4f' % sum((result.fittedvalues - data_diff['output']) ** 2))
ax.legend()
plt.show()
Inspect the fit with summary
result.summary()
Dep. Variable: | D.output | No. Observations: | 35 |
---|---|---|---|
Model: | ARIMA(0, 1, 4) | Log Likelihood | -156.722 |
Method: | css-mle | S.D. of innovations | 20.534 |
Date: | Thu, 13 Jun 2019 | AIC | 325.444 |
Time: | 18:06:52 | BIC | 334.776 |
Sample: | 12-31-1965 - 12-31-1999 | HQIC | 328.666 |
 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 13.9682 | 0.726 | 19.227 | 0.000 | 12.544 | 15.392 |
ma.L1.D.output | -0.3682 | 0.200 | -1.840 | 0.076 | -0.761 | 0.024 |
ma.L2.D.output | -0.1066 | 0.182 | -0.585 | 0.563 | -0.463 | 0.250 |
ma.L3.D.output | -0.3034 | 0.196 | -1.545 | 0.133 | -0.688 | 0.081 |
ma.L4.D.output | -0.2218 | 0.176 | -1.262 | 0.217 | -0.566 | 0.123 |
 | Real | Imaginary | Modulus | Frequency |
---|---|---|---|---|
MA.1 | 1.0000 | -0.0000j | 1.0000 | -0.0000 |
MA.2 | -0.1585 | -1.4742j | 1.4827 | -0.2670 |
MA.3 | -0.1585 | +1.4742j | 1.4827 | 0.2670 |
MA.4 | -2.0510 | -0.0000j | 2.0510 | -0.5000 |
Here ma.L1.D.output is the first coefficient of the MA part of the model; since our AR order is 0 there are no AR terms, but if there were, they would appear as ar.L1.D.output and so on.
The P>|z| column gives the significance test for each coefficient; several of the terms apparently fail it here, and I am not sure what to do about that.
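One common way forward (not from the original post) is to compare a few candidate orders by AIC and keep the smallest, rather than agonizing over individual coefficients; a rough sketch using the same old-style ARIMA API as above:

```python
# Illustration only: brute-force a small grid of (p, q) orders with d = 1 and
# report the AIC of each fit; this helper loop is not part of the original post.
best = None
for p in range(0, 3):
    for q in range(0, 5):
        try:
            res = ARIMA(data, order=(p, 1, q)).fit(disp=False)
        except Exception:                 # some orders simply fail to converge
            continue
        print((p, 1, q), "AIC = %.2f" % res.aic)
        if best is None or res.aic < best[1]:
            best = ((p, 1, q), res.aic)
print("lowest AIC:", best)
```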
Ljung-Box (LB) statistic test
resid = result.resid
r, q, p = tsa.acf(resid.values.squeeze(), qstat=True)
print(len(r), len(q), len(p))
test_data = np.c_[range(1, len(r)), r[1:], q, p]
table = pd.DataFrame(test_data, columns=['lag', 'AC', 'Q', 'Prob(>Q)'])
print(table.set_index('lag'))
35 34 34
AC Q Prob(>Q)
lag
1.0 -0.006969 0.001850 0.965696
2.0 -0.004150 0.002525 0.998738
3.0 0.062144 0.158812 0.983947
4.0 0.143394 1.017765 0.907090
5.0 0.030833 1.058801 0.957685
6.0 -0.012886 1.066216 0.982977
7.0 0.082033 1.377449 0.986252
8.0 -0.040114 1.454627 0.993436
9.0 -0.058027 1.622335 0.996133
10.0 -0.053791 1.772216 0.997807
11.0 -0.124073 2.602859 0.995003
12.0 -0.086263 3.021837 0.995388
13.0 -0.119229 3.858617 0.992604
14.0 -0.190790 6.103343 0.963819
15.0 -0.116808 6.986800 0.958015
16.0 -0.015661 7.003517 0.973193
17.0 0.023369 7.042806 0.982988
18.0 -0.073408 7.453300 0.985714
19.0 -0.081616 7.992434 0.986747
20.0 0.089290 8.680747 0.986319
21.0 -0.209498 12.740500 0.917441
22.0 0.180079 15.970870 0.817329
23.0 0.096448 16.974725 0.810490
24.0 -0.039265 17.156229 0.841923
25.0 -0.083454 18.058138 0.839913
26.0 0.093348 19.311970 0.822992
27.0 0.067374 20.046753 0.828788
28.0 -0.078219 21.178618 0.817823
29.0 0.006681 21.188253 0.852199
30.0 0.013390 21.234691 0.880406
31.0 0.025666 21.447964 0.899588
32.0 -0.009549 21.487322 0.920437
33.0 -0.019575 21.735429 0.933336
34.0 0.009294 21.847286 0.946828
All the Prob(>Q) values are greater than 0.05, so the residuals can be regarded as white noise and the fitted model is significantly effective.
Forecasting
pred = result.predict('1999', '2010', typ='levels')
print(pred)
fig, ax = plt.subplots()
ax.plot(data, label="known")
ax.plot(pred, label="pred")
ax.legend()
plt.show()
1999-12-31 561.711101
2000-12-31 581.618193
2001-12-31 597.624198
2002-12-31 615.014458
2003-12-31 627.809643
2004-12-31 641.777833
2005-12-31 655.746023
2006-12-31 669.714214
2007-12-31 683.682404
2008-12-31 697.650595
2009-12-31 711.618785
2010-12-31 725.586976
Freq: A-DEC, dtype: float64
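For completeness (not in the original post), the old-style ARIMAResults object can also return forecast standard errors and confidence intervals through its `forecast` method; a sketch:

```python
# Illustration only: point forecasts plus 95% confidence intervals for the next
# 10 years, using ARIMAResults.forecast from the old statsmodels API.
fc, stderr, conf_int = result.forecast(steps=10)
future_index = pd.Index(tsa.datetools.dates_from_range("2000", "2009"))

fig, ax = plt.subplots()
ax.plot(data, label="known")
ax.plot(future_index, fc, label="forecast")
ax.fill_between(future_index, conf_int[:, 0], conf_int[:, 1],
                color="gray", alpha=0.3, label="95% CI")
ax.legend()
plt.show()
```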