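Simple linear regression is a statistical technique for describing the relationship between one dependent variable and one independent variable. The dependent variable is denoted Y and the independent variable X, and the two are assumed to be linearly related. Simple linear regression can be used to: (a) describe the linear dependence of one variable on another; (b) predict the value of one variable from the value of another; (c) correct for the linear dependence of one variable on another.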
The simple linear regression model has the form:
$$Y_{i}=\beta_{0}+\beta_{1} X_{i}+\varepsilon_{i}$$
where $\beta_{0}$ is the intercept, $\beta_{1}$ is the regression coefficient (slope) of $X$, and $\varepsilon_{i}$ is the error term.
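The coefficients are estimated by the method of least squares. For each observation, the residual (what is left over after fitting the line) is:
$$e_{i}=Y_{i}-\beta_{0}-\beta_{1} X_{i}$$
The least-squares line is the one that minimizes the residual sum of squares:
$$\sum_{i=1}^{n} e_{i}^{2}=\sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}$$
The sum itself is generally not zero at the minimum. Instead, its partial derivatives with respect to $\beta_{0}$ and $\beta_{1}$ are set to zero.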
Taking the partial derivative with respect to $\beta_{0}$ and setting it to zero:
$$\frac{\partial}{\partial \beta_{0}} \sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}=-2\left(\sum_{i=1}^{n} Y_{i}-n \beta_{0}-\beta_{1} \sum_{i=1}^{n} X_{i}\right)=0$$
Dividing by $-2$ and solving for $\beta_{0}$ gives
$$\beta_{0}=\bar{Y}-\beta_{1} \bar{X}$$
Similarly, taking the partial derivative with respect to $\beta_{1}$ and setting it to zero:
$$\frac{\partial}{\partial \beta_{1}} \sum_{i=1}^{n}\left(Y_{i}-\beta_{0}-\beta_{1} X_{i}\right)^{2}=-2 \sum_{i=1}^{n}\left(X_{i} Y_{i}-\beta_{0} X_{i}-\beta_{1} X_{i}^{2}\right)=0$$
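The step from this condition to the closed form for the slope can be written out explicitly. As a short sketch, substitute the expression for the intercept obtained above into the condition and collect the terms in the slope:

$$
\begin{aligned}
\sum_{i=1}^{n}\left(X_{i} Y_{i}-\left(\bar{Y}-\beta_{1} \bar{X}\right) X_{i}-\beta_{1} X_{i}^{2}\right) &= 0 \\
\sum_{i=1}^{n}\left(X_{i} Y_{i}-X_{i} \bar{Y}\right) &= \beta_{1} \sum_{i=1}^{n}\left(X_{i}^{2}-X_{i} \bar{X}\right)
\end{aligned}
$$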
Therefore, after substituting $\beta_{0}=\bar{Y}-\beta_{1} \bar{X}$ and using $\sum_{i=1}^{n} X_{i}=n \bar{X}$,
$$\beta_{1}=\frac{\sum_{i=1}^{n}\left(X_{i} Y_{i}-X_{i} \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_{i}^{2}-X_{i} \bar{X}\right)}=\frac{\sum_{i=1}^{n} X_{i} Y_{i}-n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_{i}^{2}-n \bar{X}^{2}}=\frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}$$
where $\operatorname{cov}(X, Y)$ is the covariance of $X$ and $Y$, a measure of how $X$ and $Y$ vary together, and $\operatorname{var}(X)$ is the variance of $X$.
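The closed-form estimates can be sketched in a few lines of Python. This is a minimal illustration, not part of the original derivation: the function name `fit_simple_ols` and the toy data are assumptions chosen for the example. It uses centered sums, which are algebraically equal to the raw-sum form of the slope formula.

```python
from statistics import fmean


def fit_simple_ols(x, y):
    """Return (beta0, beta1) for the least-squares line y = beta0 + beta1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = fmean(x), fmean(y)
    # Centered sums: S_xy / S_xx equals cov(X, Y) / var(X) because the
    # normalizing factors in the covariance and the variance cancel.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical toy data lying exactly on y = 1 + 2x,
# so the fit should recover that line up to rounding.
b0, b1 = fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)
```

Centering before summing is a design choice: it tends to be numerically more stable than subtracting two large raw sums.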
Problem: In a statistics class, the weight and height of 30 students were measured, as shown in the table below.
$$
\begin{array}{lrrrrrrrrr}
\hline
\text{Student} & \mathbf{1} & \mathbf{2} & \mathbf{3} & \mathbf{4} & \mathbf{5} & \mathbf{6} & \mathbf{7} & \mathbf{8} & \mathbf{9} \\
\text{Height (m)} & 1.43 & 1.10 & 1.24 & 1.36 & 2.26 & 1.25 & 1.74 & 1.55 & 1.51 \\
\text{Weight (kg)} & 92.18 & 77.76 & 65.44 & 114.19 & 82.81 & 106.66 & 94.44 & 75.32 & 67.35 \\
\hline
\text{Student} & \mathbf{10} & \mathbf{11} & \mathbf{12} & \mathbf{13} & \mathbf{14} & \mathbf{15} & \mathbf{16} & \mathbf{17} & \mathbf{18} \\
\text{Height (m)} & 1.82 & 1.57 & 1.59 & 2.19 & 1.54 & 2.06 & 1.86 & 1.76 & 1.51 \\
\text{Weight (kg)} & 101.55 & 76.37 & 91.66 & 75.85 & 88.82 & 83.02 & 74.66 & 97.57 & 104.56 \\
\hline
\text{Student} & \mathbf{19} & \mathbf{20} & \mathbf{21} & \mathbf{22} & \mathbf{23} & \mathbf{24} & \mathbf{25} & \mathbf{26} & \mathbf{27} \\
\text{Height (m)} & 2.39 & 1.83 & 2.02 & 1.99 & 1.40 & 1.54 & 1.60 & 1.88 & 1.52 \\
\text{Weight (kg)} & 113.36 & 64.71 & 103.79 & 70.02 & 78.35 & 80.70 & 90.54 & 91.55 & 82.57 \\
\hline
\text{Student} & \mathbf{28} & \mathbf{29} & \mathbf{30} & & & & & & \\
\text{Height (m)} & 1.41 & 1.38 & 1.18 & & & & & & \\
\text{Weight (kg)} & 82.49 & 87.98 & 67.54 & & & & & & \\
\hline
\end{array}
$$
Solution:
From the data above, we obtain:
$$
\begin{aligned}
&\sum X Y=4282.115, \quad \sum Y=2583.81, \quad \sum X=49.48, \quad \sum X^{2}=84.699, \\
&n=30, \quad \bar{X}=1.6493, \quad \bar{Y}=86.127
\end{aligned}
$$
Substituting these values into the equation for $\beta_{1}$:
$$\beta_{1}=\frac{\sum_{i=1}^{n} X_{i} Y_{i}-n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_{i}^{2}-n \bar{X}^{2}}=\frac{4282.115-30(1.6493)(86.127)}{84.699-30(1.6493)^{2}}=6.6503$$
The slope of the regression line is 6.6503. This figure is obtained with unrounded means; because the numerator is a small difference of two large numbers, evaluating the expression with the rounded values shown shifts the result in the second decimal place.
Substituting $\bar{Y}$, $\bar{X}$ and $\beta_{1}$ into the equation for $\beta_{0}$ then gives the intercept:
$$\beta_{0}=\bar{Y}-\beta_{1} \bar{X}=86.127-(6.6503)(1.6493)=75.1587$$
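The hand calculation can be checked with a short Python sketch that recomputes the sums from the table and plugs them into the raw-sum formulas. The variable names are illustrative assumptions; the data are the 30 height and weight pairs from the problem. Working at full precision avoids the rounding sensitivity of the ratio of small differences.

```python
# Heights (m) and weights (kg) of the 30 students, in table order.
height = [1.43, 1.10, 1.24, 1.36, 2.26, 1.25, 1.74, 1.55, 1.51, 1.82,
          1.57, 1.59, 2.19, 1.54, 2.06, 1.86, 1.76, 1.51, 2.39, 1.83,
          2.02, 1.99, 1.40, 1.54, 1.60, 1.88, 1.52, 1.41, 1.38, 1.18]
weight = [92.18, 77.76, 65.44, 114.19, 82.81, 106.66, 94.44, 75.32, 67.35,
          101.55, 76.37, 91.66, 75.85, 88.82, 83.02, 74.66, 97.57, 104.56,
          113.36, 64.71, 103.79, 70.02, 78.35, 80.70, 90.54, 91.55, 82.57,
          82.49, 87.98, 67.54]

n = len(height)
sum_x = sum(height)
sum_y = sum(weight)
sum_xy = sum(x * y for x, y in zip(height, weight))
sum_x2 = sum(x * x for x in height)
x_bar = sum_x / n
y_bar = sum_y / n

# Raw-sum form of the least-squares estimates, using unrounded means.
beta1 = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
beta0 = y_bar - beta1 * x_bar

print(f"n={n} sum_x={sum_x:.2f} sum_y={sum_y:.2f} "
      f"sum_xy={sum_xy:.3f} sum_x2={sum_x2:.3f}")
print(f"beta1={beta1:.4f} beta0={beta0:.4f}")
```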
Regression in R:
Enter the height and weight values in R as follows:
height<-c(1.43, 1.10, 1.24, 1.36, 2.26, 1.25, 1.74, 1.55, 1.51, 1.82, 1.57, 1.59,
2.19, 1.54, 2.06, 1.86, 1.76, 1.51, 2.39, 1.83, 2.02, 1.99, 1.40, 1.54, 1.60,
1.88, 1.52, 1.41, 1.38, 1.18)
weight<-c(92.18, 77.76, 65.44, 114.19, 82.81, 106.66, 94.44, 75.32, 67.35,
101.55, 76.37, 91.66, 75.85, 88.82, 83.02, 74.66, 97.57, 104.56, 113.36, 64.71,
103.79, 70.02, 78.35, 80.70, 90.54, 91.55, 82.57, 82.49, 87.98, 67.54)
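Fit the model by regressing the students' weight on their height, for example with summary(lm(weight ~ height)). The output is: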
Call:
lm(formula = weight ~ height)

Residuals:
    Min      1Q  Median      3Q     Max
-22.619  -9.917  -2.371   7.660  29.987

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   75.159     13.433   5.595 5.47e-06 ***
height         6.650      7.995   0.832    0.413
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.05 on 28 degrees of freedom
Multiple R-squared:  0.02412,  Adjusted R-squared:  -0.01074
F-statistic: 0.6919 on 1 and 28 DF,  p-value: 0.4125
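Most of the figures in this summary follow from the fitted line and its residuals. The sketch below recomputes them in plain Python using the standard textbook formulas for simple linear regression. It is not a reimplementation of R's internals, and the variable names are assumptions made for the example. The p-values are omitted because they need the t and F distributions, which the Python standard library does not provide.

```python
import math

# Heights (m) and weights (kg) of the 30 students, in table order.
height = [1.43, 1.10, 1.24, 1.36, 2.26, 1.25, 1.74, 1.55, 1.51, 1.82,
          1.57, 1.59, 2.19, 1.54, 2.06, 1.86, 1.76, 1.51, 2.39, 1.83,
          2.02, 1.99, 1.40, 1.54, 1.60, 1.88, 1.52, 1.41, 1.38, 1.18]
weight = [92.18, 77.76, 65.44, 114.19, 82.81, 106.66, 94.44, 75.32, 67.35,
          101.55, 76.37, 91.66, 75.85, 88.82, 83.02, 74.66, 97.57, 104.56,
          113.36, 64.71, 103.79, 70.02, 78.35, 80.70, 90.54, 91.55, 82.57,
          82.49, 87.98, 67.54]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(weight) / n
s_xx = sum((x - x_bar) ** 2 for x in height)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

# Residuals and the two sums of squares they are compared against.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(height, weight)]
rss = sum(e * e for e in residuals)            # residual sum of squares
tss = sum((y - y_bar) ** 2 for y in weight)    # total sum of squares
df = n - 2                                     # residual degrees of freedom

rse = math.sqrt(rss / df)                      # residual standard error
r_squared = 1 - rss / tss
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df
se_beta1 = rse / math.sqrt(s_xx)
se_beta0 = rse * math.sqrt(1 / n + x_bar ** 2 / s_xx)
t_beta1 = beta1 / se_beta1
f_stat = t_beta1 ** 2                          # with one predictor, F = t^2

print(f"RSE={rse:.2f} on {df} df, R2={r_squared:.5f}, adj R2={adj_r_squared:.5f}")
print(f"SE(beta0)={se_beta0:.3f} SE(beta1)={se_beta1:.3f} "
      f"t(beta1)={t_beta1:.3f} F={f_stat:.4f}")
```

The very small R-squared, together with the slope p-value of 0.413 reported by R, indicates that height explains almost none of the variation in weight in this sample.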
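import matplotlib.pyplot as plt

# cleaned_data is assumed to be a pandas DataFrame defined elsewhere,
# with 'Age' (years) and 'Height' (inches) columns.
ages = cleaned_data['Age']
heights = cleaned_data['Height']

# Scatter plot of the raw observations
plt.scatter(ages, heights, label='Raw Data')
plt.title('Height vs Age')
plt.xlabel('Age [Years]')
plt.ylabel('Height [Inches]')
plt.legend()
plt.show()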