Linear Regression Model for Machine Learning

Linear Regression is one of the fundamental supervised machine learning algorithms. While it is relatively simple and might not seem fancy compared to other machine learning algorithms, it remains widely used across domains such as biology, the social sciences, finance, and marketing. It is extremely powerful and can be used to forecast trends or generate insights. Thus, I simply cannot emphasize enough how important it is to know Linear Regression (its workings and variants) inside out before moving on to more complicated ML techniques.

Linear Regression Models are extremely powerful and can be used to forecast trends & generate insights.

The objective of this article is to provide a comprehensive overview of the linear regression model. It can serve as an excellent guide for last-minute revision, or as a mind map for studying Linear Regression in detail.

Note: Throughout this article, we will work with the popular Boston Housing Dataset, which can be imported directly in Python using sklearn.datasets or in R using the MASS library (Modern Applied Statistics with S). The code chunks are written in R.

What is Linear Regression?

Linear Regression is a statistical/machine learning technique that attempts to model the linear relationship between independent predictor variables X and a dependent quantitative response variable Y. Both the predictor and response variables must be numerical. A general linear regression model can be represented mathematically as

Linear Regression Model Equation; Image by Author
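
The equation image does not survive extraction; in standard textbook notation (symbols assumed, not taken from the image), the general model with p predictors reads:

```latex
Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
```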

Since the linear regression model only approximates the relationship between Y and X, capturing the remaining discrepancy in an irreducible error term gives

Linear Regression Model Equation with Approximation; Image by Author
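
Again in assumed standard notation, the same model with the irreducible error term \(\epsilon\) included:

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
```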

Here, we will use Linear Regression to predict the Median House Value (Y / response variable = medv) for 506 neighborhoods around Boston.

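
As the note above describes, the dataset ships with the MASS package; a minimal sketch (not from the original article) of loading it in R:

```r
# Load the Boston Housing Dataset that ships with the MASS package
library(MASS)
data(Boston)

dim(Boston)           # 506 neighborhoods, 14 columns
summary(Boston$medv)  # medv: median house value, in $1000s
```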
What insights does Linear Regression reveal?

Using Linear Regression to predict median house values will help answer the following five questions:

  1. Is there a linear relationship between the predictor & response variables?

  2. Is there an association between the predictor & response variables? How strong?

  3. How does each predictor variable affect the response variable?

  4. How accurate is the prediction of the response variable?

  5. Is there any interaction among the independent variables?

Types of Linear Regression Model

Depending on the number of predictor variables, Linear Regression can be classified into two categories:

  1. Simple Linear Regression — One predictor variable.

  2. Multiple Linear Regression — Two or more predictor variables.

The simplicity of the linear regression model can be attributed to its core assumptions. However, these assumptions introduce bias into the model, which leads to over-generalization/under-fitting (more about the Bias-Variance Tradeoff).

Assumptions of Linear Regression Model

LINE — a simple acronym that captures the four assumptions of Linear Regression Model.

  1. Linear Relationship: The relationship between the predictor & response variables is linear.

  2. Independent Observations: Observations in the dataset are independent of each other.

  3. Normal distribution of residuals: The residuals are normally distributed.

  4. Errors/Residuals have a constant variance: Also known as homoscedasticity.

Simple Linear Regression

A simple linear regression model predicts the response variable Y using a single predictor variable X. For the Boston Housing Dataset, after analyzing the correlation between the Median House Value (medv) column and the 12 predictor columns, scatterplots of medv against a few correlated columns are shown below:

Figure 1: Scatterplot for Boston Housing Dataset; Image by Author

On observing the scatterplots, we notice that medv and rm (average number of rooms) have an almost linear relationship. Therefore, their relationship can be represented as

Median Price Prediction Generalized Equation; Image by Author
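
In assumed standard notation, the simple model for medv in terms of rm is:

```latex
\mathrm{medv} \approx \beta_0 + \beta_1 \cdot \mathrm{rm}
```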

The goal is to fit a linear model by estimating coefficients that bring the fitted line as close to the 506 data points as possible. The difference between the predicted and observed values is the error, which needs to be minimized to find the best fit. A common approach to minimizing the sum of squared errors is known as the Ordinary Least Squares (OLS) method.

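
For simple linear regression, the OLS estimates have a standard closed form (a textbook result, with \(\bar{x}\), \(\bar{y}\) the sample means):

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```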
To create a simple linear regression model in R, run the following code chunk:

simpleLinearModel <- lm(medv ~ rm, data = Boston)

Let’s look at the fitted model,

plot(Boston$rm, Boston$medv)
abline(simpleLinearModel, col = 'red')

Figure 2: Simple Linear Regression Line fitted to the training data; Image by Author

Assess the Linear Regression Model’s accuracy using RSE, R², adjusted R², F-statistic.

The summary of the model (image below) tells us about the coefficients and helps in assessing the accuracy of the model using metrics such as

  • Residual Standard Error (RSE)

  • R² Statistic

  • Adjusted R-squared

  • F-statistic

which quantify how well the model fits the training data.

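
For reference, the first two metrics have standard definitions for simple linear regression (RSS is the residual sum of squares, TSS the total sum of squares, n the number of observations):

```latex
\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - 2}},
\qquad
R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
```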
print(summary(simpleLinearModel))
Figure 3: Summary of Simple Linear Regression Model; Image by Author

How to interpret Simple Linear Regression Model?

Using Simple Linear Regression to predict median house values, we can answer the following:

  • Is there an association between rm & medv? How strong?

The association between medv and rm, and its strength, can be determined by observing the p-value corresponding to the F-statistic in the summary table (Figure 3). As the p-value is very low, there is a strong association between medv and rm.

  • How does rm affect medv?

According to this simple linear regression model, a unit increase in the number of rooms leads to a $9.102k increase in the median house value.

  • How accurate is the prediction of the response variable?

The RSE estimates the standard deviation of medv from the true regression line; it is only 6.616, but this indicates an error of roughly ~30%. On the other hand, R² shows that only 48% of the variability in medv is explained by rm. The adjusted R² & F-statistic are useful metrics for multiple linear regression.

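
Assuming the simpleLinearModel fitted earlier, these metrics can be read directly off the summary object; a quick sketch:

```r
# Extract accuracy metrics from the fitted simple linear regression model
modelSummary <- summary(simpleLinearModel)

rse <- modelSummary$sigma      # Residual Standard Error, ~6.616
r2  <- modelSummary$r.squared  # R², ~0.48

# RSE relative to the mean response gives the rough ~30% error quoted above
rse / mean(Boston$medv)
```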
  • Is there a linear relationship between the Median House Value (medv) & the number of rooms in the house (rm)?

Other than using Figure 1 to identify an almost linear relationship between medv & rm, residual plots, as shown in Figure 4, help identify linearity: if no pattern is present, the relationship is linear. As there is a slight pattern here, it indicates a non-linear component in the relationship, albeit a weak one.

Figure 4: Residual plot with a smooth line to identify trend; Image by Author
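
A residual plot like Figure 4 can be produced directly from the fitted model; base R's first diagnostic plot draws residuals against fitted values with a smooth trend line:

```r
# Residuals vs fitted values, with a smooth curve to reveal any trend
plot(simpleLinearModel, which = 1)
```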

Multiple Linear Regression

A Multiple Linear Regression Model attempts to model a linear relationship between two or more predictor variables and a response variable. The most important question that comes to mind is

“How to choose predictor variables which are useful?”

This is known as Regression Variable Selection and can be achieved by using:

  1. Best Subset Selection

  2. Stepwise Selection — Forward, Backward, Hybrid

There are plenty of other ways to achieve this. By using Forward Stepwise Selection, I found that all predictor variables except age and indus are important for predicting medv.

# Variable selection using stepwise regression
nullmodel <- lm(medv ~ 1, data = Boston)
fullmodel <- lm(medv ~ ., data = Boston)

# Forward Stepwise Selection
multipleLinearModel <- step(nullmodel,
                            scope = list(lower = nullmodel, upper = fullmodel),
                            direction = "forward")

How to interpret Multiple Linear Regression Model?

Using Multiple Linear Regression to predict median house values, we can answer the following questions using the summary of the model:

Figure 5: Summary of Multiple Linear Regression Model; Image by Author

  • Is there an association between the subset of predictor variables & medv? How strong?

As the p-value corresponding to the F-statistic in the summary table (Figure 5) is very low, there is a strong association between the subset of predictor variables & medv.

  • How do the various predictor variables affect medv?

According to this multiple linear regression model, each predictor variable has a strong association with medv and the exact contribution can be discerned by using simple linear models.

  • How accurate is the prediction of the response variable?

The adjusted R² penalizes additional predictor variables that are added to the model without improving it, as opposed to R², which increases with every variable added to the model. Since the difference between the two is not large, we can deduce that this model is more accurate than the simple linear regression model, which could explain only 48% of the variability in medv, as opposed to the 74% explained by multiple linear regression.

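
The penalty is visible in the standard definition of adjusted R² (n observations, p predictors):

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```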
Potential Problems with Linear Regression Model

Having looked at Linear Regression Models, their types and assessment, it is important to acknowledge their shortcomings. Due to the assumptions of the linear regression model, several problems plague Linear Regression Models, such as:

  1. Collinearity (how to handle multi-collinearity)

  2. Correlation of residuals

  3. Non-constant variance/heteroscedasticity of residuals

  4. Outliers

  5. Non-linear relationships

An article on how to deal with these issues is in the pipeline.

Projects on GitHub exploring simple/multiple/polynomial linear regression, non-linear transformation of predictors, and stepwise selection:

  1. Boston Housing Price Prediction

  2. Fuel Efficiency Prediction (autompg)

  3. Predicting Wages

Translated from: https://towardsdatascience.com/linear-regression-model-for-ml-cd18a392bd8b
