In the previous article we covered the basic concept of Granger causality, its background, and the associated statistical tests. In this article we put it into practice with Python.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
df_apple = pd.read_csv('AAPL.csv')
df_walmart = pd.read_csv('WMT.csv')
df_tesla = pd.read_csv('TSLA.csv')
df = pd.merge(df_apple[['Date', 'Adj Close']], df_walmart[['Date', 'Adj Close']],
              on='Date', how='right')
df = df.rename(columns={'Adj Close_x': 'apple', 'Adj Close_y': 'walmart'})
df = df.merge(df_tesla[['Date', 'Adj Close']], on='Date', how='right').rename(columns={'Adj Close':'tesla'})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').rename_axis('company', axis=1)
df.head()
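As a side note, when two frames share a column name, pandas `merge` appends `_x`/`_y` suffixes, which is why the rename above targets `Adj Close_x` and `Adj Close_y`. A toy example with made-up numbers (not real quotes):

```python
import pandas as pd

a = pd.DataFrame({'Date': ['2021-01-04', '2021-01-05'], 'Adj Close': [129.4, 131.0]})
b = pd.DataFrame({'Date': ['2021-01-04', '2021-01-05'], 'Adj Close': [146.4, 147.1]})

# the overlapping 'Adj Close' columns come back as 'Adj Close_x' / 'Adj Close_y'
m = pd.merge(a, b, on='Date').rename(columns={'Adj Close_x': 'apple',
                                              'Adj Close_y': 'walmart'})
print(m.columns.tolist())  # → ['Date', 'apple', 'walmart']
```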
df.plot(figsize=(16,8))
Apple and Walmart show fairly similar price trends, while Tesla gained roughly 700% in 2020.
ADF (Augmented Dickey-Fuller) test:
Null hypothesis: the series is non-stationary.
Alternative hypothesis: the series is stationary.
n_obs = 20
df_train, df_test = df[0:-n_obs], df[-n_obs:]
from statsmodels.tsa.stattools import adfuller
def adf_test(df):
    result = adfuller(df.values)
    print('ADF Statistics: %f' % result[0])
    print('p-value: %f' % result[1])
    print('Critical values:')
    for key, value in result[4].items():
        print('\t%s: %.3f' % (key, value))
print('ADF Test: Apple time series')
adf_test(df_train['apple'])
print('ADF Test: Walmart time series')
adf_test(df_train['walmart'])
print('ADF Test: Tesla time series')
adf_test(df_train['tesla'])
KPSS test:
Null hypothesis: the series is stationary.
Alternative hypothesis: the series is non-stationary.
from statsmodels.tsa.stattools import kpss
def kpss_test(df):
    statistic, p_value, n_lags, critical_values = kpss(df.values)
    print(f'KPSS Statistic: {statistic}')
    print(f'p-value: {p_value}')
    print(f'num lags: {n_lags}')
    print('Critical Values:')
    for key, value in critical_values.items():
        print(f'   {key} : {value}')
print('KPSS Test: Apple time series')
kpss_test(df_train['apple'])
print('KPSS Test: Walmart time series')
kpss_test(df_train['walmart'])
print('KPSS Test: Tesla time series')
kpss_test(df_train['tesla'])
This time every p-value is below 0.05, so we reject the KPSS null hypothesis of stationarity: the series are non-stationary.
import plotly.express as px
df_train_transformed = df_train.diff().dropna()
fig = px.line(df_train_transformed, facet_col="company", facet_col_wrap=1)
fig.update_yaxes(matches=None)
fig.show()
print('ADF Test: Apple time series transformed')
adf_test(df_train_transformed['apple'])
print('ADF Test: Walmart time series transformed')
adf_test(df_train_transformed['walmart'])
print('ADF Test: Tesla time series transformed')
adf_test(df_train_transformed['tesla'])
The ADF results now let us reject the null hypothesis and conclude that the differenced series are stationary.
print('KPSS Test: Apple time series transformed')
kpss_test(df_train_transformed['apple'])
print('KPSS Test: Walmart time series transformed')
kpss_test(df_train_transformed['walmart'])
print('KPSS Test: Tesla time series transformed')
kpss_test(df_train_transformed['tesla'])
The KPSS results differ somewhat from ADF: for Walmart the test still rejects the null hypothesis of stationarity, suggesting that the differenced series may not be fully stationary.
We first fit a Vector AutoRegression (VAR) model, choosing the lag order with the smallest AIC.
from statsmodels.tsa.api import VAR
model = VAR(df_train_transformed)
for i in range(1, 16):
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)
    print('BIC : ', result.bic)
    print('FPE : ', result.fpe)
    print('HQIC: ', result.hqic, '\n')
results = model.fit(maxlags=15, ic='aic')
results.summary()
Correlation matrix of residuals
apple walmart tesla
apple 1.000000 0.321701 0.431945
walmart 0.321701 1.000000 0.122985
tesla 0.431945 0.122985 1.000000
The strongest residual correlation is between Apple and Tesla, at 0.43.
The Durbin-Watson statistic checks for autocorrelation in the residuals.
from statsmodels.stats.stattools import durbin_watson
out = durbin_watson(results.resid)
for col, val in zip(df.columns, out):
    print(col, ':', round(val, 2))
apple : 2.0
walmart : 2.0
tesla : 2.0
A value close to 2 indicates no detectable autocorrelation (values toward 0 suggest positive, toward 4 negative, autocorrelation).
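The Durbin-Watson statistic is just the ratio of squared successive residual differences to squared residuals, so it is easy to verify by hand (random numbers here stand in for model residuals):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
resid = rng.normal(size=200)  # uncorrelated residuals, so DW should be near 2

# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)
dw_manual = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(dw_manual, 2), round(durbin_watson(resid), 2))
```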
from statsmodels.tsa.stattools import grangercausalitytests
maxlag=15
test = 'ssr_chi2test'
def grangers_causation_matrix(data, variables, test='ssr_chi2test', verbose=False):
    df = pd.DataFrame(np.zeros((len(variables), len(variables))),
                      columns=variables, index=variables)
    for c in df.columns:
        for r in df.index:
            test_result = grangercausalitytests(data[[r, c]], maxlag=maxlag, verbose=False)
            p_values = [round(test_result[i+1][0][test][1], 4) for i in range(maxlag)]
            if verbose:
                print(f'Y = {r}, X = {c}, P Values = {p_values}')
            min_p_value = np.min(p_values)
            df.loc[r, c] = min_p_value
    df.columns = [var + '_x' for var in variables]
    df.index = [var + '_y' for var in variables]
    return df
grangers_causation_matrix(df_train_transformed, variables = df_train_transformed.columns)
In the table above, the rows (suffix _y) are the response variables and the columns (suffix _x) are the predictors; each cell holds the minimum p-value over all tested lags. For example, the value in row 1, column 2 is 0 < 0.05, so we can reject the null hypothesis and conclude that walmart_x Granger-causes apple_y. By the same reading, all three series Granger-cause one another.
Earlier we differenced the data; now we need to invert that transformation to bring the forecasts back to the original price scale.
lag_order = results.k_ar
df_input = df_train_transformed.values[-lag_order:]
df_forecast = results.forecast(y=df_input, steps=n_obs)
df_forecast = pd.DataFrame(df_forecast, index=df_test.index, columns=df_test.columns + '_pred')
def invert_transformation(df, pred):
    forecast = pred.copy()  # use the pred argument, not the global df_forecast
    for col in df.columns:
        forecast[str(col) + '_pred'] = df[col].iloc[-1] + forecast[str(col) + '_pred'].cumsum()
    return forecast
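The inversion is just a cumulative sum added to the last known level; a toy series makes that concrete:

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 15.0])
diffed = s.diff().dropna()              # the representation the model works in
restored = s.iloc[0] + diffed.cumsum()  # undo: starting level + running sum
print(restored.tolist())  # → [12.0, 11.0, 15.0], matching s[1:]
```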
output = invert_transformation(df_train, df_forecast)
combined = pd.concat([output['apple_pred'], df_test['apple'],
                      output['walmart_pred'], df_test['walmart'],
                      output['tesla_pred'], df_test['tesla']], axis=1)
combined.head()
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
mse = mean_squared_error(combined['apple_pred'], combined['apple'])
mae = mean_absolute_error(combined['apple_pred'], combined['apple'])
print('Forecast accuracy of Apple')
print('RMSE: ', round(np.sqrt(mse), 2))
print('MAE: ', round(mae, 2))
Forecast accuracy of Apple
RMSE: 7.31
MAE: 6.0
mse = mean_squared_error(combined['walmart_pred'], combined['walmart'])
mae = mean_absolute_error(combined['walmart_pred'], combined['walmart'])
print('Forecast accuracy of Walmart')
print('RMSE: ', round(np.sqrt(mse), 2))
print('MAE: ', round(mae, 2))
Forecast accuracy of Walmart
RMSE: 5.05
MAE: 4.52
mse = mean_squared_error(combined['tesla_pred'], combined['tesla'])
mae = mean_absolute_error(combined['tesla_pred'], combined['tesla'])
print('Forecast accuracy of Tesla')
print('RMSE: ', round(np.sqrt(mse), 2))
print('MAE: ', round(mae, 2))
Forecast accuracy of Tesla
RMSE: 125.16
MAE: 113.18
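The metrics above follow directly from their definitions; a quick check with made-up numbers confirms the formulas match scikit-learn's output:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 5.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of the mean squared error
mae = mean_absolute_error(y_true, y_pred)           # mean absolute error

# manual check against the definitions
assert np.isclose(rmse, np.sqrt(np.mean((y_true - y_pred) ** 2)))
assert np.isclose(mae, np.mean(np.abs(y_true - y_pred)))
print(rmse, mae)  # → 0.75 0.625
```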