Time Series Forecasting with Python
Time series analysis is the practice of extracting meaningful summaries and statistical information from data points in chronological order. It is widely used in applied science and engineering fields that involve temporal measurements, such as signal processing, pattern recognition, mathematical finance, weather forecasting, control engineering, healthcare digitization, and smart-city applications.
As we continuously monitor and collect time series data, the opportunities for applying time series analysis and forecasting keep growing.
In this article, I will show how to develop an ARIMA model with a seasonal component for time series forecasting in Python. We will follow the Box-Jenkins three-stage modeling approach to arrive at the best model for forecasting.
I encourage anyone to check out the Jupyter Notebook on my GitHub for the full analysis.
In time series analysis, the Box-Jenkins method, named after statisticians George Box and Gwilym Jenkins, applies ARIMA models to find the best fit of a time series model.
The method consists of three steps: model identification, parameter estimation, and model validation.
Time Series
As data, we will use the monthly milk production dataset. It contains monthly production records, in pounds per cow, from 1962 to 1975.
import pandas as pd
df = pd.read_csv('./monthly_milk_production.csv', sep=',', parse_dates=['Date'], index_col='Date')
Time Series Data Inspection
As we can observe from the plot above, the data shows an increasing trend and very strong seasonality.
We will use Python's statsmodels library to perform a time series decomposition, a statistical method that splits a time series into its trend, seasonal, and residual components.
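import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
# period=12: monthly data with a yearly cycle (older statsmodels releases named this argument freq)
decomposition = seasonal_decompose(df['Production'], period=12)
decomposition.plot()
plt.show()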
The decomposition plot indicates that monthly milk production has an increasing trend and a seasonal pattern.
If we want to observe the seasonal component more precisely, we can plot the data by month.
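One way to look at the data by month is to pivot the series into a year-by-month table and average each column. The sketch below is only an illustration: it uses a small synthetic monthly series rather than the milk dataset, and the names `monthly`, `table`, and `by_month` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the milk data: a linear trend plus a yearly cycle.
index = pd.date_range("1962-01-01", periods=48, freq="MS")
t = np.arange(48)
values = 600 + 2.0 * t + 40 * np.sin(2 * np.pi * (t % 12) / 12)
monthly = pd.Series(values, index=index, name="Production")

# One row per year, one column per calendar month.
table = pd.DataFrame({
    "year": monthly.index.year,
    "month": monthly.index.month,
    "Production": monthly.values,
}).pivot(index="year", columns="month", values="Production")

# Averaging over the years leaves the month-to-month seasonal profile.
by_month = table.mean(axis=0)
print(by_month.round(1))
```

With the real data, the same table could be built from `df['Production']`, and `by_month.plot()` would draw the seasonal profile.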
1. Model Identification
In this step, we need to check whether the time series is stationary and, if not, determine what kind of transformation is required to make it stationary.
A time series is stationary when its statistical properties, such as mean, variance, and autocorrelation, are constant over time. In other words, a time series is stationary when it does not depend on time and has no trend or seasonal effects. Most statistical forecasting methods assume that the time series is (approximately) stationary.
Imagine a time series that consistently increases over time: the sample mean and variance grow with the size of the sample, so they will always underestimate the mean and variance in future periods. This is why we need to start with a stationary time series, one from which the time-dependent trend and seasonal components have been removed.
We can check stationarity using different approaches:
- We can read it from plots, such as the decomposition plot shown earlier, where we already observed a trend and seasonality.
- We can plot the autocorrelation function (ACF) and partial autocorrelation function (PACF), which show how strongly time series values depend on their previous values. If the time series is stationary, the ACF/PACF plots show a quick cut-off after a small number of lags.
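import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(df['Production'], lags=50, ax=ax1)
plot_pacf(df['Production'], lags=50, ax=ax2)
plt.show()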
Here we see that neither the ACF nor the PACF plot cuts off quickly into the 95% confidence interval area (in blue), which means the time series is not stationary.
- We can apply statistical tests, of which the Augmented Dickey-Fuller (ADF) test is a widely used one. The null hypothesis of the test is that the time series has a unit root, meaning it is non-stationary. We interpret the result using the p-value of the test. If the p-value is below the threshold (5% or 1%), we reject the null hypothesis and treat the time series as stationary. If the p-value is above the threshold, we fail to reject the null hypothesis and treat the time series as non-stationary.
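import pandas as pd
from statsmodels.tsa.stattools import adfuller
# adfuller returns the test statistic, p-value, lags used, number of observations, and critical values
dftest = adfuller(df['Production'])
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)'%key] = value
print('Results of Dickey-Fuller Test:')
print(dfoutput)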
Results of Dickey-Fuller Test:
Test Statistic                  -1.303812
p-value                          0.627427
#Lags Used                      13.000000
Number of Observations Used    154.000000
Critical Value (1%)             -3.473543
Critical Value (5%)             -2.880498
Critical Value (10%)            -2.576878
The p-value is greater than the threshold, so we fail to reject the null hypothesis: the time series is non-stationary and has a time-dependent component.
All these approaches suggest that the data is non-stationary. Now we need to find a way to make it stationary.
There are two major causes of non-stationarity in a time series: trend and seasonality. We can apply differencing, that is, subtracting previous observations from current ones, to make the time series stationary. Doing so removes the trend and seasonality and stabilizes the mean of the time series. Because the data has both trend and seasonal components, we apply one non-seasonal differencing, diff(), and one seasonal differencing, diff(12).
df_diff = df.diff().diff(12).dropna()
Results of Dickey-Fuller Test:
Test Statistic                  -5.038002
p-value                          0.000019
#Lags Used                      11.000000
Number of Observations Used    143.000000
Critical Value (1%)             -3.476927
Critical Value (5%)             -2.881973
Critical Value (10%)            -2.577665
Applying the stationarity checks listed earlier, we notice that the plot of the differenced time series does not reveal any specific trend or seasonal behavior, the ACF/PACF plots cut off quickly, and the ADF test returns a p-value of almost 0.00, which is lower than the threshold. All these checks suggest that the differenced data is stationary.
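To make the effect of the two differencing steps concrete, here is a minimal sketch on a toy series (not the milk data) built from a linear trend plus a fixed 12-step pattern. The series `y` and its values are invented for illustration; the point is only to show what diff() followed by diff(12) removes and how many rows it costs.

```python
import numpy as np
import pandas as pd

# Toy monthly series: linear trend plus a repeating 12-month pattern.
t = np.arange(60)
pattern = np.array([0, 5, 9, 12, 14, 10, 6, 3, 1, -2, -4, -3], dtype=float)
y = pd.Series(100 + 1.5 * t + pattern[t % 12])

first = y.diff()         # y[t] - y[t-1]: turns the linear trend into a constant
both = first.diff(12)    # subtract the value 12 steps back: removes the yearly cycle
stationary = both.dropna()

# 1 row is lost to the first difference and 12 more to the seasonal one.
print(len(y), len(stationary))
# With purely deterministic trend and seasonality, nothing is left over.
print(stationary.abs().max())
```

On real data the result is not zero, of course; what remains is the irregular part that the ARIMA terms then model. The 13 lost rows are also why the ADF output for the differenced series reports fewer observations than the first test.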
We will apply the Seasonal Autoregressive Integrated Moving Average model (SARIMA or Seasonal ARIMA), an extension of ARIMA that supports time series data with a seasonal component. ARIMA stands for Autoregressive Integrated Moving Average and is one of the most common techniques for time series forecasting.
ARIMA models are written as ARIMA(p, d, q), and SARIMA models are written as SARIMA(p, d, q)(P, D, Q)m.
AR(p) is a regression model that uses the dependent relationship between an observation and p lagged observations.
I(d) is the order of differencing needed to make the time series stationary.
MA(q) is a model that uses the dependency between an observation and the residual errors of a moving average model applied to q lagged observations.
(P, D, Q)m is the additional set of parameters that describe the seasonal components of the model. P, D, and Q are the seasonal autoregressive, differencing, and moving average orders, and m is the number of data points in each seasonal cycle.
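For readers who prefer the notation written out, a SARIMA(p, d, q)(P, D, Q)m model is commonly expressed with the backshift operator B, where B y_t = y_{t-1}. The form below is the usual textbook one rather than something taken from this article; c is a constant, ε_t is white noise, and sign conventions for the moving average terms vary between references.

```latex
\phi(B)\,\Phi(B^{m})\,(1-B)^{d}\,(1-B^{m})^{D}\,y_t = c + \theta(B)\,\Theta(B^{m})\,\varepsilon_t

\phi(B) = 1 - \phi_1 B - \dots - \phi_p B^{p}, \qquad
\Phi(B^{m}) = 1 - \Phi_1 B^{m} - \dots - \Phi_P B^{Pm}

\theta(B) = 1 + \theta_1 B + \dots + \theta_q B^{q}, \qquad
\Theta(B^{m}) = 1 + \Theta_1 B^{m} + \dots + \Theta_Q B^{Qm}

% With d = 1, D = 1, m = 12 the differencing part expands to
(1-B)(1-B^{12})\,y_t = y_t - y_{t-1} - y_{t-12} + y_{t-13}
```

The last line is the same operation as the df.diff().diff(12) call used for the stationarity check.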
2. Model Parameter Estimation
We will use Python's pmdarima library to automatically find the best parameters for our seasonal ARIMA model. In the auto_arima function, we specify d=1 and D=1 because we difference once for the trend and once for seasonality, m=12 because we have monthly data, trend='c' to include a constant, and seasonal=True to fit a seasonal ARIMA. We also specify trace=True to print the status of each fit, which helps us determine the best parameters by comparing the AIC scores.
import pmdarima as pm
model = pm.auto_arima(df['Production'], d=1, D=1,
                      m=12, trend='c', seasonal=True,
                      start_p=0, start_q=0, max_order=6, test='adf',
                      stepwise=True, trace=True)
AIC (Akaike Information Criterion) is an estimator of out-of-sample prediction error and therefore of the relative quality of a model. Among the candidate models, we look for the one with the lowest AIC score.
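As a reminder of what is being compared, AIC is defined as 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below computes it for two hypothetical candidates; the log-likelihood values and labels are made up and do not come from the auto_arima trace.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log-likelihoods for two candidate models (made-up numbers).
candidates = {
    "simpler model (3 params)": aic(-530.0, 3),
    "larger model (5 params)": aic(-529.5, 5),
}

# The larger model fits slightly better but pays a penalty for 2 extra parameters.
best = min(candidates, key=candidates.get)
print(candidates)
print(best)
```

This is why a model with more terms does not automatically win: a small gain in likelihood has to outweigh the parameter penalty.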
The auto_arima results for the various (p, d, q)(P, D, Q)m combinations indicate that the lowest AIC score is obtained with the parameters (1, 1, 0)(0, 1, 1, 12).
We split the dataset into a train set and a test set, using 85% of the data for training. We then create a SARIMA model on the train set with the suggested parameters, using the SARIMAX class from the statsmodels library (the X stands for exogenous variables, but we don't add any here). After fitting the model, we can also print the summary statistics.
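The split itself is not shown in the article, so here is one plausible way to do it, as a sketch. For a time series the split should be chronological, with the test set being the most recent observations. The placeholder frame below assumes 168 monthly rows (1962 through 1975) and stands in for the real `df`; the variable `train_size` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Placeholder frame shaped like the milk data (assumed 168 monthly rows).
index = pd.date_range("1962-01-01", periods=168, freq="MS")
df = pd.DataFrame({"Production": np.arange(168, dtype=float)}, index=index)

train_size = int(len(df) * 0.85)   # 85% of the rows, rounded down
train = df.iloc[:train_size]       # earliest observations
test = df.iloc[train_size:]        # most recent observations, held out

print(len(train), len(test))
print(train.index.max(), test.index.min())
```

No shuffling is involved: every test date comes after every train date, so the model is evaluated on data it could not have seen.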
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(train['Production'],
                order=(1,1,0), seasonal_order=(0,1,1,12))
results = model.fit()
results.summary()
3. Model Validation
The primary concern of model validation is to ensure that the residuals are uncorrelated and normally distributed with zero mean.
To check the residual statistics, we can plot the model diagnostics:
results.plot_diagnostics()
plt.show()
- The top-left plot shows the residuals over time; they appear to be white noise with no seasonal component.
- The top-right plot shows that the KDE line (in red) closely follows the N(0,1) line, the standard notation for a normal distribution with zero mean and a standard deviation of 1, suggesting the residuals are normally distributed.
- The bottom-left normal Q-Q plot shows that the ordered distribution of residuals (in blue) closely follows the linear trend of samples taken from a standard normal distribution, again suggesting the residuals are normally distributed.
- The bottom-right plot is a correlogram, indicating that the residuals have low correlation with their lagged versions.
All these results suggest that the residuals are normally distributed with low correlation.
To measure the accuracy of the forecasts, we compare the predicted values on the test set with the real values.
forecast_object = results.get_forecast(steps=len(test))
mean = forecast_object.predicted_mean
conf_int = forecast_object.conf_int()
dates = mean.index
From the plot, we see that the model's predictions closely match the real values of the test set.
from sklearn.metrics import r2_score
predictions = mean  # assumption: predictions is the forecast mean computed above
r2_score(test['Production'], predictions)
>>> 0.9240433686806808
The R squared (coefficient of determination) of the model is 0.92, meaning the forecasts explain about 92% of the variance in the test set.
import numpy as np
mean_absolute_percentage_error = np.mean(np.abs(predictions - test['Production'])/np.abs(test['Production']))*100
>>> 1.649905
Mean absolute percentage error (MAPE) is one of the most widely used accuracy metrics; it expresses the forecast error as a percentage of the actual values. The model's MAPE is 1.65, meaning the forecast is off by about 1.65% on average, or roughly 98.35% accurate.
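Both metrics are simple enough to compute by hand, which can help when checking what a library call returns. The sketch below uses four made-up actual and predicted values (not the milk forecasts) purely to show the arithmetic.

```python
import numpy as np

# Made-up actual and predicted values, only to show the arithmetic.
actual = np.array([800.0, 820.0, 900.0, 880.0])
predicted = np.array([808.0, 811.8, 891.0, 888.8])

# MAPE: mean of |error| / |actual|, expressed in percent.
mape = np.mean(np.abs(predicted - actual) / np.abs(actual)) * 100

# R squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(round(mape, 2), round(r2, 3))
```

Every prediction here misses by 1% of the actual value, so the MAPE comes out at about 1, while R squared depends on how large those errors are relative to the spread of the actual values.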
Since both the diagnostic checks and the accuracy metrics indicate that the model fits well, we can go on to produce future forecasts.
Here is the forecast for the next 60 months.
results.get_forecast(steps=60)
I hope you enjoyed following this tutorial and building time series forecasts in Python.
Let me know if you have any questions or suggestions.✨
Translated from: https://towardsdatascience.com/hands-on-time-series-forecasting-with-python-d4cdcabf8aac