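Data pattern | Forecasting method | Data requirement | Forecast horizon |
Stationary series | Moving average | Number of observations equal to the moving-average window | Very short term |
Stationary series | Simple exponential smoothing | At least 5 observations | Short term |
Linear trend | Holt exponential smoothing | At least 5 observations | Short to medium term |
Linear trend | Simple linear regression | At least 10 observations | Short to medium term |
Nonlinear trend | Exponential model | At least 10 observations | Short to medium term |
Nonlinear trend | Polynomial function | At least 10 observations | Short to medium term |
Trend and seasonal components | Winters exponential smoothing | Quarterly or monthly data covering at least four cycles | Short to medium term |
Trend and seasonal components | Seasonal multiple regression | Quarterly or monthly data covering at least four cycles | Short, medium, and long term |
Trend, seasonal, and cyclical components | Decomposition forecasting | Quarterly or monthly data covering at least four cycles | Short, medium, and long term |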
月工资收入(元) (monthly salary, yuan) | 工作年限 (years of service) | 性别 (gender: 男 = male, 女 = female) |
2900 | 2 | 男 |
3000 | 6 | 女 |
4800 | 8 | 男 |
1800 | 3 | 女 |
2900 | 2 | 男 |
4900 | 7 | 男 |
4200 | 9 | 女 |
4800 | 8 | 女 |
Let $y$ denote monthly salary, $x_1$ years of service, and $x_2$ gender. With gender introduced as a dummy variable, the regression equation is $y=\beta_0+\beta_1 x_1+\beta_2 x_2$, which we can estimate as follows:
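import pandas as pd
import numpy as np
import statsmodels.api as sm
# Data taken from the table above
data = pd.DataFrame({
    '月工资收入': [2900, 3000, 4800, 1800, 2900, 4900, 4200, 4800],
    '工作年限': [2, 6, 8, 3, 2, 7, 9, 8],
    '性别': ['男', '女', '男', '女', '男', '男', '女', '女'],
})
# Encode gender as dummy variables (one column per category, ordered 女, 男)
dummy_variables = pd.get_dummies(data=data['性别'].values, dtype=int)
X = np.column_stack(tup=(data['工作年限'].values,dummy_variables.values))
X = sm.add_constant(data=X) # prepend a constant column
y = data['月工资收入'].values
# Fit the regression by ordinary least squares
linear_model = sm.OLS(endog=y,exog=X)
ols_result = linear_model.fit()
# Print the fit summary
print(ols_result.summary())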
OLS Regression Results
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 22.78
Date: Sun, 22 May 2022 Prob (F-statistic): 0.00307
Time: 17:02:55 Log-Likelihood: -58.036
No. Observations: 8 AIC: 122.1
Df Residuals: 5 BIC: 122.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 950.7246 247.685 3.838 0.012 314.029 1587.420
x1 397.5845 60.183 6.606 0.001 242.879 552.290
x2 -85.0242 231.138 -0.368 0.728 -679.184 509.135
x3 1035.7488 172.207 6.015 0.002 593.076 1478.421
Omnibus: 4.593 Durbin-Watson: 1.536
Prob(Omnibus): 0.101 Jarque-Bera (JB): 1.483
Skew: 1.049 Prob(JB): 0.477
Kurtosis: 3.219 Cond. No. 7.55e+16
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.64e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
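The results show an R² of 0.901, an adjusted R² of 0.862, and an overall F-test p-value of 0.00307 < 0.05, so the model fits reasonably well.
Among the coefficients, x2 has a p-value of 0.728 > 0.05. Warning [2] also reports a singular design matrix: the constant and the two gender dummies are perfectly collinear, so one dummy has to be removed. We therefore drop x2 (the 女 dummy) and fit again.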
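X1 = X[:,[0,1,3]] # keep the constant, years of service, and the 男 dummy
# Fit the regression by ordinary least squares
linear_model1 = sm.OLS(endog=y,exog=X1)
ols_result1 = linear_model1.fit()
# Print the fit summary
print(ols_result1.summary())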
OLS Regression Results
Dep. Variable: y R-squared: 0.901
Model: OLS Adj. R-squared: 0.862
Method: Least Squares F-statistic: 22.78
Date: Sun, 22 May 2022 Prob (F-statistic): 0.00307
Time: 17:18:21 Log-Likelihood: -58.036
No. Observations: 8 AIC: 122.1
Df Residuals: 5 BIC: 122.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 865.7005 447.091 1.936 0.111 -283.583 2014.984
x1 397.5845 60.183 6.606 0.001 242.879 552.290
x2 1120.7729 323.747 3.462 0.018 288.554 1952.992
Omnibus: 4.593 Durbin-Watson: 1.536
Prob(Omnibus): 0.101 Jarque-Bera (JB): 1.483
Skew: 1.049 Prob(JB): 0.477
Kurtosis: 3.219 Cond. No. 20.8
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
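X2 = X[:,[1,3]] # also drop the constant (p = 0.111 in the previous fit); keep years of service and the 男 dummy
# Fit the regression by ordinary least squares
linear_model2 = sm.OLS(endog=y,exog=X2)
ols_result2 = linear_model2.fit()
# Print the fit summary
print(ols_result2.summary())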
OLS Regression Results
Dep. Variable: y R-squared (uncentered): 0.986
Model: OLS Adj. R-squared (uncentered): 0.981
Method: Least Squares F-statistic: 210.6
Date: Sun, 22 May 2022 Prob (F-statistic): 2.77e-06
Time: 17:22:47 Log-Likelihood: -60.274
No. Observations: 8 AIC: 124.5
Df Residuals: 6 BIC: 124.7
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 499.5470 35.188 14.197 0.000 413.446 585.648
x2 1502.1518 310.270 4.841 0.003 742.948 2261.356
Omnibus: 0.202 Durbin-Watson: 1.923
Prob(Omnibus): 0.904 Jarque-Bera (JB): 0.253
Skew: -0.258 Prob(JB): 0.881
Kurtosis: 2.297 Cond. No. 10.5
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
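As shown, both the overall significance (F = 210.6) and the coefficient significance (both p-values below 0.01) improve. Note that this R² is uncentered because the model has no constant, so it is not directly comparable with the earlier 0.901. The final estimated model is y = 499.5470*x1 + 1502.1518*x2, where x1 is years of service and x2 is 1 for 男 (male) and 0 for 女 (female).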
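The dummy-variable step above can also be sketched with `drop_first=True`, which removes one category up front and so avoids the singular design matrix seen in the first fit. This is a minimal sketch, not the original workflow: it uses the eight rows of the salary table and, as an assumption made only to keep the example free of extra dependencies, NumPy's `lstsq` in place of `statsmodels`.

```python
import numpy as np
import pandas as pd

# Data from the salary table (月工资收入, 工作年限, 性别)
data = pd.DataFrame({
    "月工资收入": [2900, 3000, 4800, 1800, 2900, 4900, 4200, 4800],
    "工作年限": [2, 6, 8, 3, 2, 7, 9, 8],
    "性别": ["男", "女", "男", "女", "男", "男", "女", "女"],
})

# drop_first=True drops the first category (女), which becomes the baseline
dummies = pd.get_dummies(data["性别"], drop_first=True, dtype=float)

# Design matrix: constant, years of service, 男 dummy (full column rank)
X = np.column_stack((
    np.ones(len(data)),
    data["工作年限"].to_numpy(dtype=float),
    dummies.to_numpy(),
))
y = data["月工资收入"].to_numpy(dtype=float)

# Ordinary least squares; the solution is unique because X has full rank
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
const, beta_years, beta_male = coef
print(const, beta_years, beta_male)
```

The three estimates should agree with the second summary above (const, x1, x2), since that fit uses the same three columns.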
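销售周期 (sales period) | 公司销售价格(元) (company price, yuan) | 其他厂家平均价格(元) (competitors' average price, yuan) | 广告费用(百万元) (advertising spend, million yuan) | 价格差(元) (price difference, yuan) | 销售量(百万支) (sales volume, million tubes) |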
1 | 3.85 | 3.8 | 5.5 | -0.05 | 7.38 |
2 | 3.75 | 4 | 6.75 | 0.25 | 8.51 |
3 | 3.7 | 4.3 | 7.25 | 0.6 | 9.52 |
…… | …… | …… | …… | …… | …… |
import pandas as pd
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
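plt.rcParams['font.sans-serif'] = ['SimHei'] # use a font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False # render minus signs on the axes correctly
data = pd.read_excel(r'G:\牙膏销售数据表.xlsx')
# 1. Compute the correlation matrix (assumes the sheet's column names match the table above; the period column is excluded)
print(data.drop(columns='销售周期').corr())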
公司销售价格(元) 其他厂家平均价格(元) 广告费用(百万元) 价格差(元) 销售量(百万支)
公司销售价格(元) 1.000000 0.078367 -0.468793 -0.322067 -0.469220
其他厂家平均价格(元) 0.078367 1.000000 0.604540 0.918566 0.740948
广告费用(百万元) -0.468793 0.604540 1.000000 0.759964 0.875954
价格差(元) -0.322067 0.918566 0.759964 1.000000 0.889672
销售量(百万支) -0.469220 0.740948 0.875954 0.889672 1.000000
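# 2. Scatter plots of the explanatory variables against sales volume
fig1 = plt.figure()
plt.scatter(data['价格差(元)'], data['销售量(百万支)']) # price difference vs. sales volume
fig2= plt.figure()
plt.scatter(data['广告费用(百万元)'], data['销售量(百万支)']) # advertising spend vs. sales volume
plt.show()
# 3. Build the models and estimate their parameters
# 3.1 Linear regression of sales volume on price difference
X1 = sm.add_constant(data=data['价格差(元)'].values) # X1 holds a constant column and the price difference
y = data['销售量(百万支)']
# Estimate by ordinary least squares
model1 = sm.OLS(endog=y,exog=X1)
result1 = model1.fit()
print(result1.summary())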
OLS Regression Results
Dep. Variable: 销售量(百万支) R-squared: 0.792
Model: OLS Adj. R-squared: 0.784
Method: Least Squares F-statistic: 106.3
Date: Fri, 27 May 2022 Prob (F-statistic): 4.88e-11
Time: 22:26:27 Log-Likelihood: -7.0261
No. Observations: 30 AIC: 18.05
Df Residuals: 28 BIC: 20.85
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.8141 0.080 97.818 0.000 7.650 7.978
x1 2.6652 0.258 10.310 0.000 2.136 3.195
Omnibus: 5.481 Durbin-Watson: 2.414
Prob(Omnibus): 0.065 Jarque-Bera (JB): 4.092
Skew: 0.883 Prob(JB): 0.129
Kurtosis: 3.391 Cond. No. 4.69
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
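The fit has an R² of 0.792. At a significance level of α=0.05, both the overall model test and the coefficient tests pass, giving the regression equation y = 7.8141 + 2.6652*x1, where x1 is the price difference.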
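# 3.2 Nonlinear (quadratic) regression of sales volume on advertising spend
# Build the polynomial terms
from sklearn.preprocessing import PolynomialFeatures
X2 = PolynomialFeatures(degree=2).fit_transform(X=data[['广告费用(百万元)']]) # columns: 1, x, x**2
# Estimate by ordinary least squares
model2 = sm.OLS(endog=y,exog=X2)
result2 = model2.fit()
print(result2.summary())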
OLS Regression Results
Dep. Variable: 销售量(百万支) R-squared: 0.838
Model: OLS Adj. R-squared: 0.826
Method: Least Squares F-statistic: 69.81
Date: Fri, 27 May 2022 Prob (F-statistic): 2.14e-11
Time: 22:45:32 Log-Likelihood: -3.2455
No. Observations: 30 AIC: 12.49
Df Residuals: 27 BIC: 16.69
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 25.1091 6.863 3.659 0.001 11.028 39.190
x1 -6.5589 2.217 -2.958 0.006 -11.109 -2.009
x2 0.6101 0.178 3.432 0.002 0.245 0.975
Omnibus: 0.063 Durbin-Watson: 1.523
Prob(Omnibus): 0.969 Jarque-Bera (JB): 0.224
Skew: -0.090 Prob(JB): 0.894
Kurtosis: 2.617 Cond. No. 5.98e+03
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.98e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
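The fit has an R² of 0.838. At a significance level of α=0.05, both the overall model test and the coefficient tests pass, giving the regression equation y = 25.1091 - 6.5589*x2 + 0.6101*(x2)**2, where x2 denotes advertising spend (the summary labels its linear and squared terms x1 and x2).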
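# 4. Model refinement: add the interaction term and the squared advertising term
data['价格差*广告费用'] = data['价格差(元)'] * data['广告费用(百万元)']
X3 = np.column_stack(tup=(X1,X2[:,[1,2]],data['价格差*广告费用'].values)) # columns: const, price difference, advertising, advertising**2, interaction
# Estimate by ordinary least squares
model3 = sm.OLS(endog=y,exog=X3)
result3 = model3.fit()
print(result3.summary())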
OLS Regression Results
Dep. Variable: 销售量(百万支) R-squared: 0.921
Model: OLS Adj. R-squared: 0.908
Method: Least Squares F-statistic: 72.78
Date: Fri, 27 May 2022 Prob (F-statistic): 2.11e-13
Time: 23:11:51 Log-Likelihood: 7.5137
No. Observations: 30 AIC: -5.027
Df Residuals: 25 BIC: 1.979
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 29.1133 7.483 3.890 0.001 13.701 44.525
x1 11.1342 4.446 2.504 0.019 1.978 20.291
x2 -7.6080 2.469 -3.081 0.005 -12.693 -2.523
x3 0.6712 0.203 3.312 0.003 0.254 1.089
x4 -1.4777 0.667 -2.215 0.036 -2.852 -0.104
Omnibus: 0.242 Durbin-Watson: 1.512
Prob(Omnibus): 0.886 Jarque-Bera (JB): 0.148
Skew: -0.153 Prob(JB): 0.929
Kurtosis: 2.843 Cond. No. 9.81e+03
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
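The resulting regression equation is y = 29.1133 + 11.1342*x1 - 7.6080*x2 + 0.6712*(x2)**2 - 1.4777*x1*x2, where x1 is the price difference and x2 is advertising spend (x3 and x4 in the summary are the squared and interaction terms).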
年份 (year) | 第三产业国内生产总值 (tertiary-industry GDP) | 资本投入 (capital input) | 从业人员 (employed persons) |
1992 | 448.96 | 524085 | 470.08 |
1993 | 611.23 | 1068889 | 480.77 |
1994 | 834.93 | 1632884 | 529.08 |
…… | …… | …… | …… |
2002 | 3120 | 10029357 | 903.14 |
The Cobb-Douglas production function is $Y=AK^{\alpha}L^{\beta}$, where $Y$ is output (here, tertiary-industry GDP), $A$ is a constant reflecting the level of technology, $K$ is capital input, $L$ is labor input, and $\alpha$ and $\beta$ are the output elasticities of capital and labor.
This nonlinear model can be turned into a multiple linear model by taking logarithms of both sides: $\ln Y=\ln A+\alpha\ln K+\beta\ln L$. Setting $y=\ln Y$, $c=\ln A$, $x_1=\ln K$, and $x_2=\ln L$ gives $y=c+\alpha x_1+\beta x_2$.
import pandas as pd
import numpy as np
import statsmodels.api as sm
data = pd.read_excel(r'G:\第三产业国内生产总值数据表.xlsx')
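# Linearize by taking natural logs of all three variables
data1 = data[['第三产业国内生产总值', '资本投入', '从业人员']].apply(lambda x:np.log(x))
# Define the explanatory variables and the response
X = sm.add_constant(data=data1[['资本投入', '从业人员']].values) # prepend a constant column
y = data1['第三产业国内生产总值'].values
# Estimate by ordinary least squares
model = sm.OLS(endog=y,exog=X)
result = model.fit()
print(result.summary())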
OLS Regression Results
Dep. Variable: y R-squared: 0.988
Model: OLS Adj. R-squared: 0.985
Method: Least Squares F-statistic: 338.6
Date: Tue, 31 May 2022 Prob (F-statistic): 1.86e-08
Time: 23:20:30 Log-Likelihood: 15.034
No. Observations: 11 AIC: -24.07
Df Residuals: 8 BIC: -22.87
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -4.4745 1.229 -3.641 0.007 -7.308 -1.641
x1 0.4797 0.095 5.060 0.001 0.261 0.698
x2 0.6947 0.393 1.768 0.115 -0.212 1.601
Omnibus: 2.430 Durbin-Watson: 1.134
Prob(Omnibus): 0.297 Jarque-Bera (JB): 0.971
Skew: 0.197 Prob(JB): 0.615
Kurtosis: 1.598 Cond. No. 967.
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Prediction: given a capital input of 11738245 and a labor input of 987.37, substituting into the regression equation gives $\ln Y=-4.4745+0.4797\ln(11738245)+0.6947\ln(987.37)\approx 8.1242$. Exponentiating both sides gives $Y=e^{8.1242}\approx 3375.23$. So with a capital input of 11738245 and a labor input of 987.37, the predicted tertiary-industry GDP is about 3375.23.
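The prediction step can be checked with a few lines of Python. This is a minimal sketch that plugs in the rounded coefficients from the OLS summary above; the function name `predict_output` is made up for illustration.

```python
import math

# Rounded coefficients copied from the OLS summary (const, x1, x2)
c, alpha, beta = -4.4745, 0.4797, 0.6947

def predict_output(capital, labor):
    """Evaluate ln Y = c + alpha*ln K + beta*ln L and return (ln Y, Y)."""
    log_y = c + alpha * math.log(capital) + beta * math.log(labor)
    return log_y, math.exp(log_y)

log_y, y_hat = predict_output(11738245, 987.37)
print(round(log_y, 4), round(y_hat, 2))
```

Because the coefficients are rounded to four decimals, the result matches the figure in the text only to about two decimal places.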
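Simple exponential smoothing forecasts period $t+1$ as a linear combination of the period-$t$ observation and the period-$t$ smoothed value. The forecasting model is $F_{t+1}=\alpha Y_t+(1-\alpha)S_t$, where $Y_t$ is the observation in period $t$, $S_t$ is the smoothed value for period $t$, and $\alpha$ ($0<\alpha<1$) is the smoothing coefficient.
Expanding the recursion, the model can also be written as $\hat{y}_{t+1|t}=\alpha y_t+\alpha(1-\alpha)y_{t-1}+\alpha(1-\alpha)^2 y_{t-2}+\cdots$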
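The equivalence between the recursive form and the weighted-sum form can be sketched as follows. The series is made up for illustration, and initializing the smoothed value with the first observation is an assumption (other initializations are common); the function names are hypothetical.

```python
def ses_forecast(y, alpha):
    """Recursive form: S <- alpha*Y_t + (1-alpha)*S, starting from S = y[0]."""
    s = y[0]  # assumed initialization: first observation
    for obs in y:
        s = alpha * obs + (1 - alpha) * s
    return s  # forecast for the period after the last observation

def ses_weighted_sum(y, alpha):
    """Expanded form: geometrically decaying weights on past observations."""
    n = len(y)
    total = sum(alpha * (1 - alpha) ** j * y[n - 1 - j] for j in range(n))
    # Remaining weight (1-alpha)^n falls on the initial smoothed value y[0]
    return total + (1 - alpha) ** n * y[0]

y = [10.0, 12.0, 11.0, 13.0]  # made-up illustrative series
print(ses_forecast(y, 0.5), ses_weighted_sum(y, 0.5))
```

For a finite series the expansion ends with a term for the initial value, which is why the infinite sum in the text is only approximate in practice; its weight shrinks quickly unless $\alpha$ is very small.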
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
fig0 = plt.figure()
# Fit and forecast three ways: α=0.2, α=0.6, and letting statsmodels search for the optimal α
fit1 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit(smoothing_level=0.2,optimized=False) # fixed α=0.2
fit2 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit(smoothing_level=0.6,optimized=False) # fixed α=0.6
fit3 = SimpleExpSmoothing(endog=dt['棉花产量(万吨)'].values).fit() # no settings: α is optimized automatically
# Plot the fitted values followed by the forecasts for the next 3 years
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=3)),c='g',marker='o')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=3)),c='r',marker='o')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=3)),c='b',marker='o')
line4 = plt.plot(dt['棉花产量(万吨)'].values,c='y',marker='*')
plt.legend(labels = ['alpha=0.2','alpha=0.6','auto','data'],loc='best') # add a legend
plt.show()
年份 (year) | 人均GDP (per-capita GDP) |
1990 | 1644.47 |
1991 | 1892.76 |
1992 | 2311.09 |
1993 | 2998.36 |
1994 | 4044.00 |
1995 | 5045.73 |
1996 | 5845.89 |
1997 | 6420.18 |
1998 | 6796.03 |
1999 | 7158.50 |
2000 | 7857.68 |
2001 | 8621.71 |
2002 | 9398.05 |
2003 | 10541.97 |
2004 | 12335.58 |
2005 | 14040.00 |
年份 (year) | 人均GDP (per-capita GDP) | S (smoothed level) | T (trend) | predict |
1990 | 1644.47 | 1644.47 | 248.29 | 1644.47 |
1991 | 1892.76 | 1892.76 | 248.29 | 1892.76 |
1992 | 2311.09 | 2260.078 | 331.6096 | 2141.05 |
1993 | 2998.36 | 2876.35828 | 530.879076 | 2591.6876 |
1994 | 4044.00 | 3852.971207 | 842.8927716 | 3407.237356 |
1995 | 5045.73 | 4940.770194 | 1014.327122 | 4695.863978 |
1996 | 5845.89 | 5878.652195 | 960.8155375 | 5955.097316 |
1997 | 6420.18 | 6545.96632 | 755.3645487 | 6839.467732 |
1998 | 6796.03 | 6947.620261 | 507.7671232 | 7301.330868 |
1999 | 7158.50 | 7247.566215 | 362.2923052 | 7455.387384 |
2000 | 7857.68 | 7783.333556 | 483.7248302 | 7609.85852 |
2001 | 8621.71 | 8515.314516 | 657.5041209 | 8267.058386 |
2002 | 9398.05 | 9330.480591 | 767.8674889 | 9172.818637 |
2003 | 10541.97 | 10408.88342 | 985.2422297 | 10098.34808 |
2004 | 12335.58 | 12053.1437 | 1446.554859 | 11394.12565 |
2005 | 14040.00 | 13877.90957 | 1711.302567 | 13499.69856 |
2006 | | | | 15589.21213 |
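The S, T, and predict columns of the table above can be reproduced with a short recursion. This is a sketch under stated assumptions: the table does not give its smoothing parameters, and the values $\alpha=0.7$ and $\gamma=0.7$ used here are inferred from the table's figures. The initialization (first two levels equal to the data, trend equal to the first difference) is likewise read off the first two rows. The function name `holt_table` is hypothetical.

```python
gdp = [1644.47, 1892.76, 2311.09, 2998.36, 4044.00, 5045.73, 5845.89, 6420.18,
       6796.03, 7158.50, 7857.68, 8621.71, 9398.05, 10541.97, 12335.58, 14040.00]

def holt_table(y, alpha, gamma):
    """Return level S, trend T, one-step forecasts, and the next-period forecast."""
    n = len(y)
    S, T, pred = [0.0] * n, [0.0] * n, [0.0] * n
    # Initialization read off the table: S equals the data, T is the first difference
    S[0], S[1] = y[0], y[1]
    T[0] = T[1] = y[1] - y[0]
    pred[0], pred[1] = y[0], y[1]
    for t in range(2, n):
        pred[t] = S[t - 1] + T[t - 1]                           # forecast made one period earlier
        S[t] = alpha * y[t] + (1 - alpha) * pred[t]             # level update
        T[t] = gamma * (S[t] - S[t - 1]) + (1 - gamma) * T[t - 1]  # trend update
    return S, T, pred, S[-1] + T[-1]

# alpha and gamma are inferred from the table, not stated in it
S, T, pred, next_forecast = holt_table(gdp, alpha=0.7, gamma=0.7)
print(round(next_forecast, 2))
```

The last value returned is the one-step-ahead forecast for 2006, the sum of the final level and trend.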
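For long-horizon forecasts, Holt's method keeps increasing or decreasing indefinitely. To keep the forecast from running away, the damped-trend method introduces a damping parameter $\phi$ ($0<\phi<1$) into Holt's three equations (level, trend, and forecast).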
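For reference, the damped-trend equations in their standard textbook form are sketched below. The symbols $S_t$ (level), $T_t$ (trend), and $\gamma$ (trend smoothing parameter) are an assumption chosen to match the table columns and code comments in this article; other texts write $\ell_t$, $b_t$, and $\beta^*$.

```latex
\begin{aligned}
S_t &= \alpha y_t + (1-\alpha)\,(S_{t-1} + \phi T_{t-1}) \\
T_t &= \gamma\,(S_t - S_{t-1}) + (1-\gamma)\,\phi T_{t-1} \\
\hat{y}_{t+h} &= S_t + (\phi + \phi^2 + \cdots + \phi^h)\,T_t
\end{aligned}
```

With $\phi=1$ these reduce to Holt's linear method. For $0<\phi<1$ the forecast levels off at $S_t+\frac{\phi}{1-\phi}T_t$ as $h$ grows.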
import pandas as pd
from statsmodels.tsa.holtwinters import Holt
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
    '人均GDP':[1644.47,1892.76,2311.09,2998.36,4044.00 ,5045.73,5845.89,6420.18,6796.03,7158.50,7857.68,8621.71,9398.05,10541.97,12335.58,14040.00]
})
fig0 = plt.figure()
# Compare the fit of three variants of Holt's method
fit1 = Holt(endog=dt['人均GDP'].values).fit(smoothing_level=0.8,smoothing_trend=0.2,optimized=False) # alpha=0.8, gamma=0.2
fit2 = Holt(endog=dt['人均GDP'].values,exponential=True).fit(smoothing_level=0.8,smoothing_trend=0.2,optimized=False) # exponential trend instead of Holt's default additive trend
fit3 = Holt(endog=dt['人均GDP'].values,damped_trend=True).fit(smoothing_level=0.8,smoothing_trend=0.2) # damped additive trend: alpha=0.8 and gamma=0.2 are fixed, the damping parameter phi is optimized
# Plot the fitted values followed by the forecasts for the next 3 years
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=3)),c='g',marker='.')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=3)),c='r',marker='.')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=3)),c='b',marker='.')
line4 = plt.plot(dt['人均GDP'].values,c='y',marker='^')
plt.legend(labels = ["Holt's linear trend",'Exponential trend','Additive damped trend','data'],loc='best') # add a legend
plt.show()
When a time series grows or declines geometrically, an exponential curve is a suitable fit. Its general form is $\hat{Y}_t=b_0\exp(b_1 t)=b_0 e^{b_1 t}$, where $b_0$ and $b_1$ are coefficients to be estimated, $\exp$ is the inverse of the natural logarithm $\ln$, and $e\approx 2.71828$.
To linearize the exponential model, take natural logarithms of both sides: $\ln\hat{Y}_t=\ln b_0+b_1 t$. Once the model is linear, its parameters can be estimated by least squares.
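The log-linearization can be sketched in a few lines. The data here is synthetic and noise-free, generated from known coefficients purely to show that the procedure recovers them; the function name `fit_exponential` is hypothetical.

```python
import numpy as np

def fit_exponential(t, y):
    """Fit y = b0 * exp(b1 * t) by least squares on ln(y) = ln(b0) + b1 * t."""
    # np.polyfit returns the highest-degree coefficient first: (slope, intercept)
    b1, ln_b0 = np.polyfit(t, np.log(y), deg=1)
    return np.exp(ln_b0), b1

t = np.arange(1, 11)
y = 2.0 * np.exp(0.3 * t)  # synthetic data with b0 = 2.0, b1 = 0.3
b0, b1 = fit_exponential(t, y)
print(b0, b1)
```

Note that this minimizes squared errors in $\ln Y$, not in $Y$ itself, so on noisy data the estimates generally differ from a direct nonlinear least-squares fit.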
The general form of a $k$-th order polynomial curve is $\hat{Y}_t=b_0+b_1 t+b_2 t^2+\cdots+b_k t^k$.
To linearize it, set $t=x_1,\ t^2=x_2,\ \cdots,\ t^k=x_k$. The function then becomes a multiple linear regression equation, so $b_0,b_1,b_2,\cdots,b_k$ can be estimated by least squares.
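The substitution above amounts to building a design matrix whose columns are powers of $t$. Here is a minimal sketch on synthetic, noise-free data; the function name `fit_polynomial` is hypothetical, and NumPy's `lstsq` stands in for any least-squares routine.

```python
import numpy as np

def fit_polynomial(t, y, k):
    """Estimate b_0..b_k in y = b_0 + b_1 t + ... + b_k t^k by least squares."""
    t = np.asarray(t, dtype=float)
    # Columns 1, t, t^2, ..., t^k play the role of the constant and x_1..x_k
    X = np.column_stack([t ** j for j in range(k + 1)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

t = np.arange(1, 9)
y = 1.0 + 2.0 * t + 0.5 * t ** 2  # synthetic quadratic with known coefficients
coef = fit_polynomial(t, y, k=2)
print(coef)
```

For large $k$ the powers of $t$ become strongly collinear, which is consistent with the large condition numbers reported in the quadratic fits earlier in this article.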
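Winters exponential smoothing (triple exponential smoothing) has three smoothing parameters $\alpha$, $\gamma$, and $\delta$ (each between 0 and 1) and four equations (level, trend, seasonal, and forecast).
When the parameters $\alpha$, $\gamma$, and $\delta$ are hard to choose, SPSS can search for them automatically. The initial smoothed value and initial trend value can be determined automatically by SPSS in the same way.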
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from matplotlib import pyplot as plt
dt = pd.DataFrame(data={
fig0 = plt.figure()
# Try the full Holt-Winters method, with trend and seasonal components, and compare the fits
fit1 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='add',seasonal_periods=4).fit() # additive trend, additive seasonality with period 4
fit2 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='mul',seasonal_periods=4).fit() # additive trend, multiplicative seasonality with period 4
fit3 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='add',damped_trend=True,seasonal_periods=4).fit() # damped additive trend, additive seasonality with period 4
fit4 = ExponentialSmoothing(endog=dt['销售量'].values,trend='add',seasonal='mul',damped_trend=True,seasonal_periods=4).fit() # damped additive trend, multiplicative seasonality with period 4
# Plot the fitted values followed by the forecasts for the next 4 quarters
fig = plt.figure()
line1 = plt.plot(list(fit1.fittedvalues)+list(fit1.forecast(steps=4)),c='g',marker='*')
line2 = plt.plot(list(fit2.fittedvalues)+list(fit2.forecast(steps=4)),c='r',marker='*')
line3 = plt.plot(list(fit3.fittedvalues)+list(fit3.forecast(steps=4)),c='b',marker='*')
line4 = plt.plot(list(fit4.fittedvalues)+list(fit4.forecast(steps=4)),c='m',marker='*')
line5 = plt.plot(dt['销售量'].values,c='y',marker='^')
plt.legend(labels = ["aa",'am','aa damped','am damped','data'],loc='best') # add a legend
plt.show()
year | season | 销售量 (sales) | 4-term moving average | centered moving average | seasonal ratio |
2005 | Q1 | 25 | |||
2005 | Q2 | 32 | 30 | ||
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 |
2008 | Q2 | 39 | 39.25 | 39 | 1 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 |
2011 | Q1 | ||||
2011 | Q2 | ||||
2011 | Q3 | ||||
2011 | Q4 |
year/season | Q1 | Q2 | Q3 | Q4 | mean |
2005 | | | 1.208163265 | 0.8125 | |
2006 | 0.898876404 | 1.101449275 | 1.204301075 | 0.860215054 | |
2007 | 0.805555556 | 1.03654485 | 1.302931596 | 0.909090909 | |
2008 | 0.776699029 | 1 | 1.303514377 | 0.93968254 | |
2009 | 0.720496894 | 1.027522936 | 1.333333333 | 0.912912913 | |
2010 | 0.744744745 | 1.026865672 | 1.223796034 | 0.877005348 | |
Average | 0.789274526 | 1.038476547 | 1.26267328 | 0.885234461 | 0.993914703 |
Seasonal index | 0.794106902 | 1.044834676 | 1.270404066 | 0.890654357 | |
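The moving-average and seasonal-index steps in the two tables above can be sketched as follows, using the 24 quarterly sales figures. One caveat, stated as an observation about the tables: their last two 4-term moving averages (46 and 47.5) average only the three and two remaining quarters. This sketch leaves those positions empty instead, so its Q3 and Q4 indices differ slightly from the table while the Q1 and Q2 averages match.

```python
import numpy as np

sales = np.array([25, 32, 37, 26, 30, 38, 42, 30, 29, 39, 50, 35,
                  30, 39, 51, 37, 29, 42, 55, 38, 31, 43, 54, 41], dtype=float)
period = 4
n = len(sales)

# 4-term moving average, placed at the second quarter of each window (as in the table)
ma = np.full(n, np.nan)
for i in range(1, n - 2):
    ma[i] = sales[i - 1:i + 3].mean()

# Centered moving average: mean of two adjacent 4-term averages
cma = np.full(n, np.nan)
for i in range(2, n - 2):
    cma[i] = (ma[i - 1] + ma[i]) / 2

# Seasonal ratio: observation divided by its centered moving average (NaN where undefined)
ratio = sales / cma

# Average the ratios by quarter, then rescale so the four indices average to 1
quarter_mean = np.array([np.nanmean(ratio[q::period]) for q in range(period)])
seasonal_index = quarter_mean / quarter_mean.mean()
print(seasonal_index)
```

Dividing each observation by its quarter's index then gives the deseasonalized series used for the trend regression.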
year | season | 销售量 (sales) | 4-term moving average | centered moving average | seasonal ratio | seasonal index | 分离季节成分 (deseasonalized sales) | time |
2005 | Q1 | 25 | | | | 0.794106902 | 31.48190746 | 1 |
2005 | Q2 | 32 | 30 | | | 1.044834676 | 30.62685489 | 2 |
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 | 1.270404066 | 29.12459193 | 3 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 | 0.890654357 | 29.19202024 | 4 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 | 0.794106902 | 37.77828896 | 5 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 | 1.044834676 | 36.36939019 | 6 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 | 1.270404066 | 33.06034759 | 7 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 | 0.890654357 | 33.68310027 | 8 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 | 0.794106902 | 36.51901266 | 9 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 | 1.044834676 | 37.3264794 | 10 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 | 1.270404066 | 39.35755666 | 11 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 | 0.890654357 | 39.29695032 | 12 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 | 0.794106902 | 37.77828896 | 13 |
2008 | Q2 | 39 | 39.25 | 39 | 1 | 1.044834676 | 37.3264794 | 14 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 | 1.270404066 | 40.14470779 | 15 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 | 0.890654357 | 41.54249034 | 16 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 | 0.794106902 | 36.51901266 | 17 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 | 1.044834676 | 40.19774705 | 18 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 | 1.270404066 | 43.29331232 | 19 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 | 0.890654357 | 42.66526034 | 20 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 | 0.794106902 | 39.03756526 | 21 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 | 1.044834676 | 41.15483626 | 22 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 | 1.270404066 | 42.50616119 | 23 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 | 0.890654357 | 46.03357037 | 24 |
2011 | Q1 | | | | | 0.794106902 | | 25 |
2011 | Q2 | | | | | 1.044834676 | | 26 |
2011 | Q3 | | | | | 1.270404066 | | 27 |
2011 | Q4 | | | | | 0.890654357 | | 28 |
import pandas as pd
import statsmodels.api as sm
dt = pd.DataFrame(data={
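X = sm.add_constant(data=dt['time'].values) # prepend a constant column
y = dt['分离季节成分'].values
# Estimate the linear trend of the deseasonalized series by ordinary least squares
model = sm.OLS(endog=y,exog=X)
result = model.fit()
print(result.summary())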
OLS Regression Results
Dep. Variable: y R-squared: 0.767
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 72.22
Date: Sun, 19 Jun 2022 Prob (F-statistic): 2.13e-08
Time: 12:08:25 Log-Likelihood: -52.328
No. Observations: 24 AIC: 108.7
Df Residuals: 22 BIC: 111.0
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 30.5780 0.942 32.449 0.000 28.624 32.532
x1 0.5605 0.066 8.498 0.000 0.424 0.697
Omnibus: 0.733 Durbin-Watson: 1.698
Prob(Omnibus): 0.693 Jarque-Bera (JB): 0.695
Skew: -0.089 Prob(JB): 0.707
Kurtosis: 2.186 Cond. No. 29.6
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression output shows that the adjusted R² (0.756) is not very high, but the model F-test and the coefficient t-tests both pass at a significance level of $\alpha=0.05$. We therefore take the linear trend of the deseasonalized series to be $\hat{Y}_t=30.5780+0.5605t$.
year | season | 销售量 (sales) | 4-term moving average | centered moving average | seasonal ratio | seasonal index | 分离季节成分 (deseasonalized sales) | time | linear trend forecast | final forecast | error |
2005 | Q1 | 25 | | | | 0.794106902 | 31.48190746 | 1 | 31.1385 | 24.72729776 | 0.272702238 |
2005 | Q2 | 32 | 30 | | | 1.044834676 | 30.62685489 | 2 | 31.699 | 33.12021439 | -1.120214385 |
2005 | Q3 | 37 | 31.25 | 30.625 | 1.208163265 | 1.270404066 | 29.12459193 | 3 | 32.2595 | 40.98259996 | -3.982599964 |
2005 | Q4 | 26 | 32.75 | 32 | 0.8125 | 0.890654357 | 29.19202024 | 4 | 32.82 | 29.23127598 | -3.231275983 |
2006 | Q1 | 30 | 34 | 33.375 | 0.898876404 | 0.794106902 | 37.77828896 | 5 | 33.3805 | 26.50768544 | 3.492314564 |
2006 | Q2 | 38 | 35 | 34.5 | 1.101449275 | 1.044834676 | 36.36939019 | 6 | 33.941 | 35.46273373 | 2.537266272 |
2006 | Q3 | 42 | 34.75 | 34.875 | 1.204301075 | 1.270404066 | 33.06034759 | 7 | 34.5015 | 43.83084588 | -1.83084588 |
2006 | Q4 | 30 | 35 | 34.875 | 0.860215054 | 0.890654357 | 33.68310027 | 8 | 35.062 | 31.22812305 | -1.22812305 |
2007 | Q1 | 29 | 37 | 36 | 0.805555556 | 0.794106902 | 36.51901266 | 9 | 35.6225 | 28.28807311 | 0.71192689 |
2007 | Q2 | 39 | 38.25 | 37.625 | 1.03654485 | 1.044834676 | 37.3264794 | 10 | 36.183 | 37.80525307 | 1.194746929 |
2007 | Q3 | 50 | 38.5 | 38.375 | 1.302931596 | 1.270404066 | 39.35755666 | 11 | 36.7435 | 46.6790918 | 3.320908205 |
2007 | Q4 | 35 | 38.5 | 38.5 | 0.909090909 | 0.890654357 | 39.29695032 | 12 | 37.304 | 33.22497012 | 1.775029883 |
2008 | Q1 | 30 | 38.75 | 38.625 | 0.776699029 | 0.794106902 | 37.77828896 | 13 | 37.8645 | 30.06846078 | -0.068460784 |
2008 | Q2 | 39 | 39.25 | 39 | 1 | 1.044834676 | 37.3264794 | 14 | 38.425 | 40.14777241 | -1.147772414 |
2008 | Q3 | 51 | 39 | 39.125 | 1.303514377 | 1.270404066 | 40.14470779 | 15 | 38.9855 | 49.52733771 | 1.472662289 |
2008 | Q4 | 37 | 39.75 | 39.375 | 0.93968254 | 0.890654357 | 41.54249034 | 16 | 39.546 | 35.22181718 | 1.778182815 |
2009 | Q1 | 29 | 40.75 | 40.25 | 0.720496894 | 0.794106902 | 36.51901266 | 17 | 40.1065 | 31.84884846 | -2.848848458 |
2009 | Q2 | 42 | 41 | 40.875 | 1.027522936 | 1.044834676 | 40.19774705 | 18 | 40.667 | 42.49029176 | -0.490291757 |
2009 | Q3 | 55 | 41.5 | 41.25 | 1.333333333 | 1.270404066 | 43.29331232 | 19 | 41.2275 | 52.37558363 | 2.624416373 |
2009 | Q4 | 38 | 41.75 | 41.625 | 0.912912913 | 0.890654357 | 42.66526034 | 20 | 41.788 | 37.21866425 | 0.781335748 |
2010 | Q1 | 31 | 41.5 | 41.625 | 0.744744745 | 0.794106902 | 39.03756526 | 21 | 42.3485 | 33.62923613 | -2.629236132 |
2010 | Q2 | 43 | 42.25 | 41.875 | 1.026865672 | 1.044834676 | 41.15483626 | 22 | 42.909 | 44.8328111 | -1.8328111 |
2010 | Q3 | 54 | 46 | 44.125 | 1.223796034 | 1.270404066 | 42.50616119 | 23 | 43.4695 | 55.22382954 | -1.223829543 |
2010 | Q4 | 41 | 47.5 | 46.75 | 0.877005348 | 0.890654357 | 46.03357037 | 24 | 44.03 | 39.21551132 | 1.78448868 |
2011 | Q1 | | | | | 0.794106902 | | 25 | 44.5905 | 35.40962381 | |
2011 | Q2 | | | | | 1.044834676 | | 26 | 45.151 | 47.17533044 | |
2011 | Q3 | | | | | 1.270404066 | | 27 | 45.7115 | 58.07207546 | |
2011 | Q4 | | | | | 0.890654357 | | 28 | 46.272 | 41.21235839 | |
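The final-forecast column is the linear trend forecast multiplied by the seasonal index. The sketch below reproduces the four 2011 forecasts from the trend equation and the seasonal indices given above; the function name `forecast` is hypothetical.

```python
# Seasonal indices from the seasonal-index table
seasonal_index = {"Q1": 0.794106902, "Q2": 1.044834676,
                  "Q3": 1.270404066, "Q4": 0.890654357}

def forecast(time_index, quarter):
    """Trend forecast from Y_t = 30.5780 + 0.5605*t, re-seasonalized by the quarter's index."""
    trend = 30.5780 + 0.5605 * time_index
    return trend * seasonal_index[quarter]

# 2011 corresponds to time indices 25 to 28
forecasts_2011 = {q: forecast(t, q)
                  for t, q in zip(range(25, 29), ["Q1", "Q2", "Q3", "Q4"])}
for quarter, value in forecasts_2011.items():
    print(quarter, round(value, 2))
```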
Tips: most of the material above is summarized from the book 《CDA数据分析实务》. It also draws on the article 《如何使用Python构建指数平滑模型:Simple Exponential Smoothing, Holt, and Holt-Winters》, http://t.zoukankan.com/harrylyx-p-11852149.html.