Two common classes of methods:
Basic steps of regression analysis:
Choosing a forecasting method (for reference only):
Data pattern | Forecasting method | Data requirements | Forecast horizon |
---|---|---|---|
Stationary series | Moving average | Number of observations equal to the moving-average window length | Very short |
Stationary series | Simple exponential smoothing | 5 or more | Short term |
Linear trend | Holt exponential smoothing | 5 or more | Short to medium term |
Linear trend | Simple linear regression | 10 or more | Short to medium term |
Nonlinear trend | Exponential model | 10 or more | Short to medium term |
Nonlinear trend | Polynomial function | 10 or more | Short to medium term |
Trend and seasonal components | Winters exponential smoothing | At least four cycles of quarterly or monthly data | Short to medium term |
Trend and seasonal components | Seasonal multiple regression | At least four cycles of quarterly or monthly data | Short, medium, and long term |
Trend, seasonal, and cyclical components | Decomposition forecasting | At least four cycles of quarterly or monthly data | Short, medium, and long term |
Evaluating forecasting methods: the quality of a forecasting method depends on the size of its forecast error, i.e. the gap between the predicted and actual values. Common measures include:
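The measures usually listed here are the mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE); a minimal NumPy sketch of these three (the toy `actual`/`forecast` arrays are made up for illustration):

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Return common forecast-error measures as a dict."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    e = actual - forecast                           # forecast errors
    return {
        'MAE': np.mean(np.abs(e)),                  # mean absolute error
        'MSE': np.mean(e ** 2),                     # mean squared error
        'MAPE': np.mean(np.abs(e / actual)) * 100,  # mean absolute percentage error, %
    }

errs = forecast_errors(actual=[100, 110, 120], forecast=[90, 115, 120])
print(errs)  # MAE = 5.0, MSE ≈ 41.67, MAPE ≈ 4.85%
```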
To study the relationship between employees' monthly salary, years of service, and gender, four men and four women were randomly sampled from a company's staff. Their monthly salary, years of service, and gender are as follows:
Monthly salary (yuan) | Years of service | Gender |
---|---|---|
2900 | 2 | Male |
3000 | 6 | Female |
4800 | 8 | Male |
1800 | 3 | Female |
2900 | 2 | Male |
4900 | 7 | Male |
4200 | 9 | Female |
4800 | 8 | Female |
Let $y$ denote monthly salary, $x_1$ years of service, and $x_2$ gender. With gender entered as a dummy variable, the regression equation is $y=\beta_0+\beta_1 x_1+\beta_2 x_2$, which gives:
The parameters are interpreted as follows:
The Python implementation:
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.DataFrame({
    '月工资收入':[2900,3000,4800,1800,2900,4900,4200,4800],
    '工作年限':[2,6,8,3,2,7,9,8],
    '性别':['男','女','男','女','男','男','女','女']
})
# Dummy-variable encoding (note: keeping both gender columns alongside the constant
# creates perfect collinearity — the "dummy trap" flagged in the summary below)
dummy_variables = pd.get_dummies(data=data['性别'].values)
X = np.column_stack(tup=(data['工作年限'].values,dummy_variables.values))
X = sm.add_constant(data=X) # prepend a constant column
y = data['月工资收入'].values
# Fit the regression by ordinary least squares
linear_model = sm.OLS(endog=y,exog=X)
ols_result = linear_model.fit()
# Print the fit summary
print(ols_result.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 22.78
Date: Sun, 22 May 2022 Prob (F-statistic): 0.00307
Time: 17:02:55 Log-Likelihood: -58.036
No. Observations: 8 AIC: 122.1
Df Residuals: 5 BIC: 122.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 950.7246 247.685 3.838 0.012 314.029 1587.420
x1 397.5845 60.183 6.606 0.001 242.879 552.290
x2 -85.0242 231.138 -0.368 0.728 -679.184 509.135
x3 1035.7488 172.207 6.015 0.002 593.076 1478.421
==============================================================================
Omnibus: 4.593 Durbin-Watson: 1.536
Prob(Omnibus): 0.101 Jarque-Bera (JB): 1.483
Skew: 1.049 Prob(JB): 0.477
Kurtosis: 3.219 Cond. No. 7.55e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.64e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
The output shows R-squared = 0.901 and adjusted R-squared = 0.862, and the p-value of the model's overall F-test is 0.00307 < 0.05, so the fit is reasonably good.
Turning to the coefficients, the p-value for x2 is 0.728 > 0.05, so below we drop x2 and refit.
'''
X1 = X[:,[0,1,3]]
# Refit by ordinary least squares
linear_model1 = sm.OLS(endog=y,exog=X1)
ols_result1 = linear_model1.fit()
# Print the fit summary
print(ols_result1.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 22.78
Date: Sun, 22 May 2022 Prob (F-statistic): 0.00307
Time: 17:18:21 Log-Likelihood: -58.036
No. Observations: 8 AIC: 122.1
Df Residuals: 5 BIC: 122.3
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 865.7005 447.091 1.936 0.111 -283.583 2014.984
x1 397.5845 60.183 6.606 0.001 242.879 552.290
x2 1120.7729 323.747 3.462 0.018 288.554 1952.992
==============================================================================
Omnibus: 4.593 Durbin-Watson: 1.536
Prob(Omnibus): 0.101 Jarque-Bera (JB): 1.483
Skew: 1.049 Prob(JB): 0.477
Kurtosis: 3.219 Cond. No. 20.8
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The overall fit is unchanged. At the 0.05 significance level both x1 and x2 pass the significance test, but the constant term does not, so below we drop the constant and refit.
'''
X2 = X[:,[1,3]]
# Refit by ordinary least squares
linear_model2 = sm.OLS(endog=y,exog=X2)
ols_result2 = linear_model2.fit()
# Print the fit summary
print(ols_result2.summary())
'''
OLS Regression Results
=======================================================================================
Dep. Variable: y R-squared (uncentered): 0.986
Model: OLS Adj. R-squared (uncentered): 0.981
Method: Least Squares F-statistic: 210.6
Date: Sun, 22 May 2022 Prob (F-statistic): 2.77e-06
Time: 17:22:47 Log-Likelihood: -60.274
No. Observations: 8 AIC: 124.5
Df Residuals: 6 BIC: 124.7
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 499.5470 35.188 14.197 0.000 413.446 585.648
x2 1502.1518 310.270 4.841 0.003 742.948 2261.356
==============================================================================
Omnibus: 0.202 Durbin-Watson: 1.923
Prob(Omnibus): 0.904 Jarque-Bera (JB): 0.253
Skew: -0.258 Prob(JB): 0.881
Kurtosis: 2.297 Cond. No. 10.5
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Both the model and coefficient significance improve, so the final estimated model is: y = 499.5470*x1 + 1502.1518*x2
'''
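As a usage sketch, the final equation can be applied to a hypothetical employee. Here x1 is years of service and x2 is the male dummy (1 = male, 0 = female) — the gender column retained above; the 5-year example inputs are made up for illustration:

```python
# Predicted monthly salary from the final fitted equation
# y = 499.5470*x1 + 1502.1518*x2  (x1 = years of service, x2 = 1 if male, else 0)
def predict_salary(years, is_male):
    return 499.5470 * years + 1502.1518 * is_male

print(predict_salary(5, 1))  # hypothetical male with 5 years: ≈ 3999.89 yuan
print(predict_salary(5, 0))  # hypothetical female with 5 years: ≈ 2497.74 yuan
```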
A toothpaste manufacturer asks its sales department to use market research to relate the company's toothpaste sales volume to its selling price, advertising spend, and other factors, so as to predict sales under different prices and advertising budgets. The sales department collected, for the past 30 sales periods, the company's sales volume, selling price, and advertising spend, together with the average selling price of comparable toothpaste from other manufacturers over the same periods:
Sales period | Company price (yuan) | Competitors' average price (yuan) | Advertising spend (million yuan) | Price difference (yuan) | Sales volume (million tubes) |
---|---|---|---|---|---|
1 | 3.85 | 3.8 | 5.5 | -0.05 | 7.38 |
2 | 3.75 | 4 | 6.75 | 0.25 | 8.51 |
3 | 3.7 | 4.3 | 7.25 | 0.6 | 9.52 |
…… | …… | …… | …… | …… | …… |
Build a mathematical model from these data to analyze how toothpaste sales volume depends on the other factors, providing a data-driven basis for pricing and advertising decisions.
When buying toothpaste, customers care about the price difference between brands rather than the price itself. Let sales volume be y, the price difference x1, advertising spend x2, the competitors' average price x3, and the company's price x4, with x1 = x3 - x4.
The regression analysis proceeds as follows:
The Python code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei'] # display Chinese characters correctly
plt.rcParams['axes.unicode_minus'] = False # display minus signs on the axes correctly
data = pd.read_excel(r'G:\牙膏销售数据表.xlsx')
print(data.info())
# 1. Correlation matrix
print(data[['公司销售价格(元)','其他厂家平均价格(元)','广告费用(百万元)','价格差(元)','销售量(百万支)']].corr())
'''
公司销售价格(元) 其他厂家平均价格(元) 广告费用(百万元) 价格差(元) 销售量(百万支)
公司销售价格(元) 1.000000 0.078367 -0.468793 -0.322067 -0.469220
其他厂家平均价格(元) 0.078367 1.000000 0.604540 0.918566 0.740948
广告费用(百万元) -0.468793 0.604540 1.000000 0.759964 0.875954
价格差(元) -0.322067 0.918566 0.759964 1.000000 0.889672
销售量(百万支) -0.469220 0.740948 0.875954 0.889672 1.000000
Sales volume correlates with the price difference and advertising spend at about 0.89 and 0.88 respectively — quite high — and the price difference and advertising spend are themselves fairly correlated (about 0.76)
'''
# 2. Scatter plots of each explanatory variable against sales volume
fig1 = plt.figure()
plt.scatter(x=data['广告费用(百万元)'].values,y=data['销售量(百万支)'].values,marker='*')
plt.xlabel('广告费用(百万元)')
plt.ylabel('销售量(百万支)')
fig2= plt.figure()
plt.scatter(x=data['价格差(元)'].values,y=data['销售量(百万支)'].values,marker='o')
plt.xlabel('价格差(元)')
plt.ylabel('销售量(百万支)')
plt.show()
'''
Sales volume is roughly linear in the price difference, while its relationship with advertising spend is somewhat curved.
Assuming the effects of price difference and advertising spend on sales are independent, we therefore fit a linear regression of sales on the price difference, and a nonlinear (quadratic) regression of sales on advertising spend
'''
# 3. Specify and estimate the models
# 3.1 Linear regression of sales volume on the price difference
X1 = sm.add_constant(data=data['价格差(元)'].values) # X1 = constant column plus the price difference
y = data['销售量(百万支)']
# Estimate by ordinary least squares
model1 = sm.OLS(endog=y,exog=X1)
result1 = model1.fit()
print(result1.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: 销售量(百万支) R-squared: 0.792
Model: OLS Adj. R-squared: 0.784
Method: Least Squares F-statistic: 106.3
Date: Fri, 27 May 2022 Prob (F-statistic): 4.88e-11
Time: 22:26:27 Log-Likelihood: -7.0261
No. Observations: 30 AIC: 18.05
Df Residuals: 28 BIC: 20.85
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 7.8141 0.080 97.818 0.000 7.650 7.978
x1 2.6652 0.258 10.310 0.000 2.136 3.195
==============================================================================
Omnibus: 5.481 Durbin-Watson: 2.414
Prob(Omnibus): 0.065 Jarque-Bera (JB): 4.092
Skew: 0.883 Prob(JB): 0.129
Kurtosis: 3.391 Cond. No. 4.69
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fit has R-squared = 0.792, and at significance level α = 0.05 both the overall F-test and the coefficient t-tests pass, giving the regression equation: y = 7.8141 + 2.6652*x1
'''
# 3.2 Nonlinear regression of sales volume on advertising spend
# Build the polynomial terms
from sklearn.preprocessing import PolynomialFeatures
X2 = PolynomialFeatures(degree=2).fit_transform(X=data[['广告费用(百万元)']])
# Estimate by ordinary least squares
model2 = sm.OLS(endog=y,exog=X2)
result2 = model2.fit()
print(result2.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: 销售量(百万支) R-squared: 0.838
Model: OLS Adj. R-squared: 0.826
Method: Least Squares F-statistic: 69.81
Date: Fri, 27 May 2022 Prob (F-statistic): 2.14e-11
Time: 22:45:32 Log-Likelihood: -3.2455
No. Observations: 30 AIC: 12.49
Df Residuals: 27 BIC: 16.69
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 25.1091 6.863 3.659 0.001 11.028 39.190
x1 -6.5589 2.217 -2.958 0.006 -11.109 -2.009
x2 0.6101 0.178 3.432 0.002 0.245 0.975
==============================================================================
Omnibus: 0.063 Durbin-Watson: 1.523
Prob(Omnibus): 0.969 Jarque-Bera (JB): 0.224
Skew: -0.090 Prob(JB): 0.894
Kurtosis: 2.617 Cond. No. 5.98e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.98e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
The fit has R-squared = 0.838, and at α = 0.05 both the overall and coefficient tests pass, giving the regression equation: y = 25.1091 - 6.5589*x2 + 0.6101*(x2)**2
'''
# 4. Improved model: add the price-difference/ad-spend interaction alongside the quadratic ad-spend term
data['价格差*广告费用'] = data['价格差(元)'] * data['广告费用(百万元)']
X3 = np.column_stack(tup=(X1,X2[:,[1,2]],data['价格差*广告费用'].values))
# Estimate by ordinary least squares
model3 = sm.OLS(endog=y,exog=X3)
result3 = model3.fit()
print(result3.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: 销售量(百万支) R-squared: 0.921
Model: OLS Adj. R-squared: 0.908
Method: Least Squares F-statistic: 72.78
Date: Fri, 27 May 2022 Prob (F-statistic): 2.11e-13
Time: 23:11:51 Log-Likelihood: 7.5137
No. Observations: 30 AIC: -5.027
Df Residuals: 25 BIC: 1.979
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 29.1133 7.483 3.890 0.001 13.701 44.525
x1 11.1342 4.446 2.504 0.019 1.978 20.291
x2 -7.6080 2.469 -3.081 0.005 -12.693 -2.523
x3 0.6712 0.203 3.312 0.003 0.254 1.089
x4 -1.4777 0.667 -2.215 0.036 -2.852 -0.104
==============================================================================
Omnibus: 0.242 Durbin-Watson: 1.512
Prob(Omnibus): 0.886 Jarque-Bera (JB): 0.148
Skew: -0.153 Prob(JB): 0.929
Kurtosis: 2.843 Cond. No. 9.81e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
R-squared rises to 0.921, and at α = 0.05 both the overall and coefficient tests pass, so the final regression equation is:
y = 29.1133 + 11.1342*x1 - 7.6080*x2 + 0.6712*(x2)**2 - 1.4777*x1*x2
'''
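As a usage sketch, the final equation can be evaluated at a hypothetical price difference and advertising budget (the inputs 0.2 yuan and 6.5 million yuan are made up for illustration):

```python
# Predicted sales (million tubes) from the final fitted equation
# x1 = price difference (yuan), x2 = advertising spend (million yuan)
def predict_sales(x1, x2):
    return 29.1133 + 11.1342*x1 - 7.6080*x2 + 0.6712*x2**2 - 1.4777*x1*x2

print(predict_sales(0.2, 6.5))  # ≈ 8.33 million tubes
```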
To study the performance of its tertiary (service) industry within the provincial macro-economy, a province statistically analyzed the three main factors affecting it — capital input, labor input, and technological progress — and built a basic mathematical model using the Cobb-Douglas production function. The raw data:
Year | Tertiary-industry GDP | Capital input | Employed persons |
---|---|---|---|
1992 | 448.96 | 524085 | 470.08 |
1993 | 611.23 | 1068889 | 480.77 |
1994 | 834.93 | 1632884 | 529.08 |
…… | …… | …… | …… |
2002 | 3120 | 10029357 | 903.14 |
We now want to predict tertiary-industry GDP when capital input is 11738245 and labor input is 987.37.
The Cobb-Douglas production function corresponds to the model $Y=AK^{\alpha}L^{\beta}$, where:
Taking logarithms of both sides turns this nonlinear model into a multiple linear one: $\ln Y = \ln A + \alpha \ln K + \beta \ln L$. Letting $\ln Y = y$, $\ln A = c$, $\ln K = x_1$, $\ln L = x_2$ gives $y = c + \alpha x_1 + \beta x_2$.
We implement this modeling process in Python:
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_excel(r'G:\第三产业国内生产总值数据表.xlsx')
# Linearize by taking logs
data1 = data[['第三产业国内生产总值', '资本投入', '从业人员']].apply(lambda x:np.log(x))
# Define the explanatory variables and the response
X = sm.add_constant(data=data1[['资本投入', '从业人员']].values) # prepend a constant column
y = data1['第三产业国内生产总值'].values
# Estimate by ordinary least squares
model = sm.OLS(endog=y,exog=X)
result = model.fit()
print(result.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.988
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 338.6
Date: Tue, 31 May 2022 Prob (F-statistic): 1.86e-08
Time: 23:20:30 Log-Likelihood: 15.034
No. Observations: 11 AIC: -24.07
Df Residuals: 8 BIC: -22.87
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -4.4745 1.229 -3.641 0.007 -7.308 -1.641
x1 0.4797 0.095 5.060 0.001 0.261 0.698
x2 0.6947 0.393 1.768 0.115 -0.212 1.601
==============================================================================
Omnibus: 2.430 Durbin-Watson: 1.134
Prob(Omnibus): 0.297 Jarque-Bera (JB): 0.971
Skew: 0.197 Prob(JB): 0.615
Kurtosis: 1.598 Cond. No. 967.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
'''
From the regression report we can see:
Economic interpretation: over 1992–2002 the province's tertiary-industry output elasticities with respect to capital and labor are 0.4797 and 0.6947. In other words, holding labor input fixed, a 1% increase in capital input raises tertiary-industry GDP by about 0.4797%; likewise, holding capital fixed, a 1% increase in labor input raises it by about 0.6947%. The two elasticities sum to 1.1744, the returns-to-scale parameter; since this exceeds 1, the province's tertiary industry exhibited mildly increasing returns to scale.
Model prediction: substituting capital input 11738245 and labor input 987.37 into the regression equation gives $\ln Y = -4.4745 + 0.4797\ln(11738245) + 0.6947\ln(987.37) = 8.124218$; exponentiating both sides, $Y = e^{8.124218} \approx 3375.23$. So when capital input is 11738245 and labor input is 987.37, the predicted tertiary-industry GDP is 3375.23.
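This substitution can be checked numerically — a small sketch using the fitted coefficients above:

```python
import numpy as np

# Coefficients from the fitted log-linear model above
c, alpha, beta = -4.4745, 0.4797, 0.6947
K, L = 11738245, 987.37

ln_Y = c + alpha * np.log(K) + beta * np.log(L)  # predicted ln(Y)
Y = np.exp(ln_Y)                                 # back-transform to the original scale
print(round(Y, 2))  # ≈ 3375.23
```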
This forecasting method suits stationary series (series with no trend or seasonal variation). It forecasts with a weighted average of past observations; the further back an observation lies, the more its weight decays exponentially — hence the name exponential smoothing.
The forecast for period t+1 is a linear combination of the period-t observation and the period-t smoothed value $S_t$; the forecasting model is $F_{t+1}=\alpha Y_t + (1-\alpha)S_t$, where
the model can therefore be expanded as $\hat{y}_{t+1|t} = \alpha y_t + \alpha(1-\alpha)y_{t-1} + \alpha(1-\alpha)^2 y_{t-2} + \cdots$
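Since for simple exponential smoothing the smoothed value $S_t$ equals the current forecast $F_t$, the recursion $F_{t+1}=\alpha Y_t+(1-\alpha)F_t$ is easy to sketch by hand (initializing the first forecast at the first observation, a common convention):

```python
def simple_exp_smoothing(y, alpha):
    """Return one-step-ahead forecasts F_2 .. F_{n+1} for series y."""
    f = y[0]                                 # convention: F_1 = y_1
    forecasts = []
    for obs in y:
        f = alpha * obs + (1 - alpha) * f    # F_{t+1} = a*y_t + (1-a)*F_t
        forecasts.append(f)
    return forecasts

print(simple_exp_smoothing([10, 12], alpha=0.5))  # [10.0, 11.0]
```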
Choosing the smoothing constant α for simple exponential smoothing:
A Python implementation example:
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
'year':list(range(1990,2006)),
'棉花产量(万吨)':[450.77,567.50,450.84,373.93,434.10,476.75,420.33,460.27,450.10,382.88,441.73,532.35,491.62,485.97,632.35,571.42]
})
fig0 = plt.figure()
plt.plot(dt['棉花产量(万吨)'].values)
plt.show()
'''
The line plot shows no obvious trend or seasonal variation
'''
# Fit three ways: alpha=0.2, alpha=0.6, and letting statsmodels search for the optimal alpha
fit1 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit(smoothing_level=0.2,optimized=False) # fit with alpha=0.2
fit2 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit(smoothing_level=0.6,optimized=False) # fit with alpha=0.6
fit3 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit() # no settings: search for the optimal alpha automatically
# Line plot of the fitted values plus forecasts for the next 3 years
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=3)),c='g',marker='o')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=3)),c='r',marker='o')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=3)),c='b',marker='o')
line4 = plt.plot(dt['棉花产量(万吨)'].values,c='y',marker='*')
plt.legend(labels = ['alpha=0.2','alpha=0.6','auto','data'],loc='best') # add a legend
plt.show()
The Holt (double) exponential smoothing model, usually just called the Holt model, suits series containing a trend component (and possibly some cyclical component).
The Holt model has two smoothing constants and three equations:
Determining the initial values in the Holt model:
Example
Per-capita GDP from 1990 to 2005:
Year | Per-capita GDP |
---|---|
1990 | 1644.42 |
1991 | 1892.76 |
1992 | 2311.09 |
1993 | 2998.36 |
1994 | 4044.00 |
1995 | 5045.73 |
1996 | 5845.89 |
1997 | 6420.18 |
1998 | 6796.03 |
1999 | 7158.50 |
2000 | 7857.68 |
2001 | 8621.71 |
2002 | 9398.05 |
2003 | 10541.97 |
2004 | 12335.58 |
2005 | 14040.00 |
Per-capita GDP increases year by year — a clear growth trend — so we use the Holt model to predict per-capita GDP for 2006:
The resulting table:
Year | Per-capita GDP | S | T | Predicted |
---|---|---|---|---|
1990 | 1644.47 | 1644.47 | 248.29 | 1644.47 |
1991 | 1892.76 | 1892.76 | 248.29 | 1892.76 |
1992 | 2311.09 | 2260.078 | 331.6096 | 2141.05 |
1993 | 2998.36 | 2876.35828 | 530.879076 | 2591.6876 |
1994 | 4044.00 | 3852.971207 | 842.8927716 | 3407.237356 |
1995 | 5045.73 | 4940.770194 | 1014.327122 | 4695.863978 |
1996 | 5845.89 | 5878.652195 | 960.8155375 | 5955.097316 |
1997 | 6420.18 | 6545.96632 | 755.3645487 | 6839.467732 |
1998 | 6796.03 | 6947.620261 | 507.7671232 | 7301.330868 |
1999 | 7158.50 | 7247.566215 | 362.2923052 | 7455.387384 |
2000 | 7857.68 | 7783.333556 | 483.7248302 | 7609.85852 |
2001 | 8621.71 | 8515.314516 | 657.5041209 | 8267.058386 |
2002 | 9398.05 | 9330.480591 | 767.8674889 | 9172.818637 |
2003 | 10541.97 | 10408.88342 | 985.2422297 | 10098.34808 |
2004 | 12335.58 | 12053.1437 | 1446.554859 | 11394.12565 |
2005 | 14040.00 | 13877.90957 | 1711.302567 | 13499.69856 |
2006 | 15589.21213 |
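The S, T, and predicted columns above can be reproduced with a short recursion. The table's values are consistent with smoothing constants α = γ = 0.7 (inferred from the table, not stated in the source), initial level S₁ = y₁, and initial trend T₁ = y₂ − y₁:

```python
def holt(y, alpha, gamma):
    """Additive Holt smoothing; returns (final level S, final trend T, forecasts)."""
    S, T = y[0], y[1] - y[0]           # initial level and trend
    forecasts = [y[0]]                 # the table starts the predict column at y_1
    for obs in y[1:]:
        forecasts.append(S + T)        # F_{t+1} = S_t + T_t
        S_new = alpha * obs + (1 - alpha) * (S + T)
        T = gamma * (S_new - S) + (1 - gamma) * T
        S = S_new
    forecasts.append(S + T)            # out-of-sample forecast for the next period
    return S, T, forecasts

gdp = [1644.47,1892.76,2311.09,2998.36,4044.00,5045.73,5845.89,6420.18,
       6796.03,7158.50,7857.68,8621.71,9398.05,10541.97,12335.58,14040.00]
S, T, f = holt(gdp, alpha=0.7, gamma=0.7)
print(round(f[-1], 2))  # 2006 forecast ≈ 15589.21
```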
For long-horizon forecasts, the Holt method's forecasts increase or decrease indefinitely into the future. In that case we use the damped trend method, with damping parameter $\phi$ ($0 < \phi < 1$), to keep forecasts from "running away", and we modify the three equations above accordingly:
A Python implementation example:
import pandas as pd
from statsmodels.tsa.holtwinters import Holt
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
'year':list(range(1990,2006)),
'人均GDP':[1644.47,1892.76,2311.09,2998.36,4044.00 ,5045.73,5845.89,6420.18,6796.03,7158.50,7857.68,8621.71,9398.05,10541.97,12335.58,14040.00]
})
fig0 = plt.figure()
plt.plot(dt['人均GDP'].values)
plt.show()
'''
The line plot shows a clear upward trend
'''
# Try three variants of the Holt method and compare the fits
fit1 = Holt(endog=dt['人均GDP'].values).fit(smoothing_level=0.8,smoothing_trend=0.2,optimized=False) # alpha=0.8, gamma=0.2
fit2 = Holt(endog=dt['人均GDP'].values,exponential=True).fit(smoothing_level=0.8,smoothing_trend=0.2,optimized=False) # exponential (multiplicative) trend instead of Holt's default additive trend
fit3 = Holt(endog=dt['人均GDP'].values,damped_trend=True).fit(smoothing_level=0.8,smoothing_trend=0.2) # damped additive Holt model: alpha=0.8 and gamma=0.2 fixed, damping parameter phi optimized
# Line plot of the fitted values plus forecasts for the next 3 years
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=3)),c='g',marker='.')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=3)),c='r',marker='.')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=3)),c='b',marker='.')
line4 = plt.plot(dt['人均GDP'].values,c='y',marker='^')
plt.legend(labels = ["Holt's linear trend",'Exponential trend','Additive damped trend','data'],loc='best') # add a legend
plt.show()
When a time series grows or decays geometrically, an exponential curve is a suitable fit. Its general form is $\hat{Y}_t = b_0 \exp{(b_1 t)} = b_0 e^{b_1 t}$, where $b_0$ and $b_1$ are coefficients to be estimated, exp is the inverse of the natural logarithm $\ln$, and $e \approx 2.71828$.
Linearizing by taking the natural logarithm of both sides gives $\ln{\hat{Y}_t} = \ln{b_0} + b_1 t$. Once linearized, the parameters can be estimated by ordinary least squares.
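A minimal sketch of this log-linearized fit, on synthetic data generated from known coefficients so the recovery can be checked:

```python
import numpy as np

# Synthetic exponential series: Y_t = 2.0 * exp(0.3 * t), t = 1..10
t = np.arange(1, 11)
Y = 2.0 * np.exp(0.3 * t)

# Linearize: ln(Y_t) = ln(b0) + b1 * t, then fit a straight line by least squares
b1_hat, ln_b0_hat = np.polyfit(t, np.log(Y), deg=1)
b0_hat = np.exp(ln_b0_hat)
print(b0_hat, b1_hat)  # ≈ 2.0, 0.3
```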
Some phenomena vary in more complicated ways — rising and falling rather than following one fixed form, possibly with several turning points — and then a polynomial function must be fitted:
The general form of a k-th order curve is $\hat{Y_t} = b_0 + b_1 t + b_2 t^2 + \cdots + b_k t^k$.
Linearize by substituting $t=x_1, t^2=x_2, \cdots, t^k=x_k$. After this substitution the function becomes a multiple linear regression, and $b_0, b_1, b_2, \cdots, b_k$ can be estimated by ordinary least squares.
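A small sketch of this substitution, again on synthetic data from known coefficients:

```python
import numpy as np

# Synthetic quadratic series: Y_t = 1 + 2t + 0.5t^2, t = 1..12
t = np.arange(1, 13)
Y = 1 + 2*t + 0.5*t**2

# Substitute x1 = t, x2 = t^2 and solve the resulting linear system by least squares
X = np.column_stack((np.ones_like(t), t, t**2))
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(coef)  # ≈ [1.0, 2.0, 0.5]
```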
The simple exponential smoothing model suits stationary series (no trend or seasonal component); the Holt model suits series with a trend component but no seasonal component.
If a time series contains both trend and seasonal components, the Winters exponential smoothing model can be used for forecasting. It requires quarterly or monthly data covering at least 4 years (4 full seasonal cycles).
The Winters (triple) exponential smoothing model has three smoothing parameters $\alpha$, $\gamma$, $\delta$ (each between 0 and 1) and four equations:
When the parameters $\alpha$, $\gamma$, $\delta$ are hard to determine, SPSS can search for them automatically, and the initial smoothed and trend values can likewise be determined automatically by SPSS.
The Winters method has two variants:
A Python implementation example:
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
'year_season':['2005_Q1','2005_Q2','2005_Q3','2005_Q4','2006_Q1','2006_Q2','2006_Q3','2006_Q4','2007_Q1','2007_Q2','2007_Q3','2007_Q4','2008_Q1','2008_Q2','2008_Q3','2008_Q4','2009_Q1','2009_Q2','2009_Q3','2009_Q4','2010_Q1','2010_Q2','2010_Q3','2010_Q4'],
'销售量':[25,32,37,26,30,38,42,30,29,39,50,35,30,39,51,37,29,42,55,38,31,43,54,41]
})
fig0 = plt.figure()
plt.plot(dt['销售量'].values)
plt.show()
'''
The line plot shows a clear upward trend and seasonal variation
'''
# Try the full Holt-Winters method, with trend and seasonal components, and compare the fits
fit1 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='add',seasonal_periods=4).fit() # additive trend, additive seasonality with period 4
fit2 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='mul',seasonal_periods=4).fit() # additive trend, multiplicative seasonality with period 4
fit3 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='add',damped_trend=True,seasonal_periods=4).fit() # additive damped trend, additive seasonality with period 4
fit4 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='mul',damped_trend=True,seasonal_periods=4).fit() # additive damped trend, multiplicative seasonality with period 4
# Line plot of the fitted values plus forecasts for the next 4 quarters
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=4)),c='g',marker='*')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=4)),c='r',marker='*')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=4)),c='b',marker='*')
line4 = plt.plot(list(fit4.fittedvalues)+list(fit4.forecast(steps=4)),c='m',marker='*')
line5 = plt.plot(dt['销售量'].values,c='y',marker='^')
plt.legend(labels = ["aa",'am','aa damped','am damped','data'],loc='best') # add a legend
plt.show()
Steps of the decomposition forecasting method:
Example:
(1) Compute the moving averages and seasonal ratios
Year | Quarter | Sales | 4-period moving average | Centered moving average | Seasonal ratio |
---|---|---|---|---|---|
2005 | Q1 | 25 | |||
2005 | Q2 | 32 | 30 | ||
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 |
2008 | Q2 | 39 | 39.25 | 39 | 1 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 |
2011 | Q1 | ||||
2011 | Q2 | ||||
2011 | Q3 | ||||
2011 | Q4 |
(2) Adjust to seasonal indexes
Year/Quarter | Q1 | Q2 | Q3 | Q4 | Mean |
---|---|---|---|---|---|
2005 | 1.208163265 | 0.8125 | |||
2006 | 0.898876404 | 1.101449275 | 1.204301075 | 0.860215054 | |
2007 | 0.805555556 | 1.03654485 | 1.302931596 | 0.909090909 | |
2008 | 0.776699029 | 1 | 1.303514377 | 0.93968254 | |
2009 | 0.720496894 | 1.027522936 | 1.333333333 | 0.912912913 | |
2010 | 0.744744745 | 1.026865672 | 1.223796034 | 0.877005348 | |
Average | 0.789274526 | 1.038476547 | 1.26267328 | 0.885234461 | 0.993914703 |
Seasonal index | 0.794106902 | 1.044834676 | 1.270404066 | 0.890654357 |
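The seasonal-index row is obtained by rescaling each quarter's average ratio so the four indexes average to 1 (i.e. dividing by the overall mean 0.993914703 from the table above); a quick check:

```python
# Quarterly averages of the seasonal ratios, taken from the table above
avg = [0.789274526, 1.038476547, 1.26267328, 0.885234461]

# Rescale so the four seasonal indexes average to exactly 1
overall_mean = sum(avg) / 4
index = [a / overall_mean for a in avg]
print(index)  # ≈ [0.794107, 1.044835, 1.270404, 0.890654]
```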
(3) Remove the seasonal component and fit a linear model for prediction
Year | Quarter | Sales | 4-period moving average | Centered moving average | Seasonal ratio | Seasonal index | Deseasonalized sales | time |
---|---|---|---|---|---|---|---|---|
2005 | Q1 | 25 | 0.794106902 | 31.48190746 | 1 | |||
2005 | Q2 | 32 | 30 | 1.044834676 | 30.62685489 | 2 | ||
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 | 1.270404066 | 29.12459193 | 3 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 | 0.890654357 | 29.19202024 | 4 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 | 0.794106902 | 37.77828896 | 5 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 | 1.044834676 | 36.36939019 | 6 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 | 1.270404066 | 33.06034759 | 7 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 | 0.890654357 | 33.68310027 | 8 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 | 0.794106902 | 36.51901266 | 9 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 | 1.044834676 | 37.3264794 | 10 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 | 1.270404066 | 39.35755666 | 11 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 | 0.890654357 | 39.29695032 | 12 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 | 0.794106902 | 37.77828896 | 13 |
2008 | Q2 | 39 | 39.25 | 39 | 1 | 1.044834676 | 37.3264794 | 14 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 | 1.270404066 | 40.14470779 | 15 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 | 0.890654357 | 41.54249034 | 16 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 | 0.794106902 | 36.51901266 | 17 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 | 1.044834676 | 40.19774705 | 18 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 | 1.270404066 | 43.29331232 | 19 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 | 0.890654357 | 42.66526034 | 20 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 | 0.794106902 | 39.03756526 | 21 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 | 1.044834676 | 41.15483626 | 22 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 | 1.270404066 | 42.50616119 | 23 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 | 0.890654357 | 46.03357037 | 24 |
2011 | Q1 | 0.794106902 | 25 | |||||
2011 | Q2 | 1.044834676 | 26 | |||||
2011 | Q3 | 1.270404066 | 27 | |||||
2011 | Q4 | 0.890654357 | 28 |
Fitting the linear trend model:
import pandas as pd
import statsmodels.api as sm
dt = pd.DataFrame(data={
'time':list(range(1,25)),
'分离季节成分':[31.48190746,30.62685489,29.12459193,29.19202024,37.77828896,36.36939019,33.06034759,33.68310027,36.51901266,37.3264794,39.35755666,39.29695032,37.77828896,37.3264794,40.14470779,41.54249034,36.51901266,40.19774705,43.29331232,42.66526034,39.03756526,41.15483626,42.50616119,46.03357037]
})
X = sm.add_constant(data=dt['time'].values) # prepend a constant column
y = dt['分离季节成分'].values
# Estimate the linear trend of the deseasonalized series by ordinary least squares
model = sm.OLS(endog=y,exog=X)
result = model.fit()
print(result.summary())
'''
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.767
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 72.22
Date: Sun, 19 Jun 2022 Prob (F-statistic): 2.13e-08
Time: 12:08:25 Log-Likelihood: -52.328
No. Observations: 24 AIC: 108.7
Df Residuals: 22 BIC: 111.0
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 30.5780 0.942 32.449 0.000 28.624 32.532
x1 0.5605 0.066 8.498 0.000 0.424 0.697
==============================================================================
Omnibus: 0.733 Durbin-Watson: 1.698
Prob(Omnibus): 0.693 Jarque-Bera (JB): 0.695
Skew: -0.089 Prob(JB): 0.707
Kurtosis: 2.186 Cond. No. 29.6
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
'''
The OLS results show that the adjusted R-squared is not very high, but both the model F-test and the coefficient t-tests pass at significance level $\alpha = 0.05$, so the linear trend equation for the deseasonalized series is $\hat{Y_t} = 30.5780 + 0.5605t$.
(4) Predict from the linear trend equation and compute the final forecasts
Year | Quarter | Sales | 4-period moving average | Centered moving average | Seasonal ratio | Seasonal index | Deseasonalized sales | time | Trend prediction | Final forecast | Error |
---|---|---|---|---|---|---|---|---|---|---|---|
2005 | Q1 | 25 | 0.794106902 | 31.48190746 | 1 | 31.1385 | 24.72729776 | 0.272702238 | |||
2005 | Q2 | 32 | 30 | 1.044834676 | 30.62685489 | 2 | 31.699 | 33.12021439 | -1.120214385 | ||
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 | 1.270404066 | 29.12459193 | 3 | 32.2595 | 40.98259996 | -3.982599964 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 | 0.890654357 | 29.19202024 | 4 | 32.82 | 29.23127598 | -3.231275983 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 | 0.794106902 | 37.77828896 | 5 | 33.3805 | 26.50768544 | 3.492314564 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 | 1.044834676 | 36.36939019 | 6 | 33.941 | 35.46273373 | 2.537266272 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 | 1.270404066 | 33.06034759 | 7 | 34.5015 | 43.83084588 | -1.83084588 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 | 0.890654357 | 33.68310027 | 8 | 35.062 | 31.22812305 | -1.22812305 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 | 0.794106902 | 36.51901266 | 9 | 35.6225 | 28.28807311 | 0.71192689 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 | 1.044834676 | 37.3264794 | 10 | 36.183 | 37.80525307 | 1.194746929 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 | 1.270404066 | 39.35755666 | 11 | 36.7435 | 46.6790918 | 3.320908205 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 | 0.890654357 | 39.29695032 | 12 | 37.304 | 33.22497012 | 1.775029883 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 | 0.794106902 | 37.77828896 | 13 | 37.8645 | 30.06846078 | -0.068460784 |
2008 | Q2 | 39 | 39.25 | 39 | 1 | 1.044834676 | 37.3264794 | 14 | 38.425 | 40.14777241 | -1.147772414 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 | 1.270404066 | 40.14470779 | 15 | 38.9855 | 49.52733771 | 1.472662289 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 | 0.890654357 | 41.54249034 | 16 | 39.546 | 35.22181718 | 1.778182815 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 | 0.794106902 | 36.51901266 | 17 | 40.1065 | 31.84884846 | -2.848848458 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 | 1.044834676 | 40.19774705 | 18 | 40.667 | 42.49029176 | -0.490291757 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 | 1.270404066 | 43.29331232 | 19 | 41.2275 | 52.37558363 | 2.624416373 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 | 0.890654357 | 42.66526034 | 20 | 41.788 | 37.21866425 | 0.781335748 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 | 0.794106902 | 39.03756526 | 21 | 42.3485 | 33.62923613 | -2.629236132 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 | 1.044834676 | 41.15483626 | 22 | 42.909 | 44.8328111 | -1.8328111 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 | 1.270404066 | 42.50616119 | 23 | 43.4695 | 55.22382954 | -1.223829543 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 | 0.890654357 | 46.03357037 | 24 | 44.03 | 39.21551132 | 1.78448868 |
2011 | Q1 | 0.794106902 | 25 | 44.5905 | 35.40962381 | ||||||
2011 | Q2 | 1.044834676 | 26 | 45.151 | 47.17533044 | ||||||
2011 | Q3 | 1.270404066 | 27 | 45.7115 | 58.07207546 | ||||||
2011 | Q4 | 0.890654357 | 28 | 46.272 | 41.21235839 |
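The final forecast column is just the trend prediction multiplied back by the quarter's seasonal index; the 2011 rows can be checked directly:

```python
# Final forecast = trend prediction * seasonal index
seasonal_index = [0.794106902, 1.044834676, 1.270404066, 0.890654357]  # Q1..Q4

for q, t in enumerate(range(25, 29)):      # time = 25..28 for 2011 Q1..Q4
    trend = 30.5780 + 0.5605 * t           # fitted linear trend equation
    print(f'2011 Q{q + 1}: {trend * seasonal_index[q]:.2f}')
```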
Tips: most of the material above is summarized from the book 《CDA数据分析实务》 (CDA Data Analysis in Practice), and also draws on the article "如何使用Python构建指数平滑模型:Simple Exponential Smoothing, Holt, and Holt-Winters" (http://t.zoukankan.com/harrylyx-p-11852149.html).