Multicollinearity in Multiple Linear Regression
Linear regression is one of the simplest and most widely used algorithms for supervised machine learning problems in which the output is a numerical (quantitative) variable and the input is one or more independent variables.
The math behind it is easy to understand, which is what makes linear regression one of my favorite algorithms to work with. But this simplicity comes at a price.
When we decide to fit a linear regression model, we have to make sure that certain conditions are satisfied; otherwise the model will perform poorly or lead us to incorrect interpretations. So what are the conditions that have to be met?
1. Linearity: X and the mean of Y have a linear relationship.
2. Homoscedasticity: the variance of the error terms is the same for all values of X.
3. No collinearity: the independent variables are not highly correlated with each other.
4. Normality: Y is normally distributed for any fixed value of X.
If the above four conditions are satisfied, we can expect our Linear Regression model to perform well.
So how do we ensure the above conditions are met? If I went into depth on all of them, this would become a very long article. So here I will cover only the third condition, no collinearity: I will explain what multicollinearity is, why it is a problem in the first place, and what can be done to overcome it.
In a supervised machine learning regression problem, we have a set of independent variables and an output variable, which are used to train a model and then to make predictions and interpretations.
In a multivariate (multiple) linear regression problem, we make predictions with the trained model and use its coefficients to interpret it. For example:
Y = B0 + B1*X1 + B2*X2
The above equation states that a unit increase in X1, with X2 held constant, results in a B1 increase in the value of Y, and a unit increase in X2, with X1 held constant, results in a B2 increase in the value of Y.
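A quick numeric check makes this interpretation concrete. The coefficient values (B0 = 2, B1 = 3, B2 = 5) and the helper name `predict` are made up for illustration and do not come from the article.

```python
# Hypothetical coefficients, chosen only for illustration
B0, B1, B2 = 2.0, 3.0, 5.0

def predict(x1, x2):
    """Evaluate Y = B0 + B1*X1 + B2*X2 for the illustrative coefficients."""
    return B0 + B1 * x1 + B2 * x2

# Raise X1 by one unit while holding X2 fixed: Y changes by exactly B1
delta_x1 = predict(11, 4) - predict(10, 4)
# Raise X2 by one unit while holding X1 fixed: Y changes by exactly B2
delta_x2 = predict(10, 5) - predict(10, 4)
print(delta_x1, delta_x2)  # 3.0 5.0
```

The "holding the other variable fixed" part is what becomes hard to honor when X1 and X2 move together.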
The coefficients are essential for understanding which variable has the greatest influence on the output.
So how is multicollinearity a problem? When independent variables are highly correlated with each other, the estimated coefficients are not reliable, and we cannot make accurate interpretations based on their values.
To explain this point further, I created two dummy input variables and one dependent output variable in Python.
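import numpy as np
# Two predictors: x4 is built from x3, so the two are strongly correlated
x3 = np.random.randint(0, 100, 100)
x4 = 3*x3 + np.random.randint(0, 100, 100)
# The output is generated from x3 plus random noise
y1 = (4*x3) + np.random.randint(0, 100, 100)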
Plotting each input variable against the output gives us:
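import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 5))
# Left panel: x3 against y1
plt.subplot(1, 2, 1)
plt.xlabel('x3')
sns.scatterplot(x=x3, y=y1)  # keyword arguments; recent seaborn releases no longer accept x and y positionally
# Right panel: x4 against y1
plt.subplot(1, 2, 2)
plt.xlabel('x4')
sns.scatterplot(x=x4, y=y1)
plt.show()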
The scatterplots show that both x3 and x4 have a linear relationship with y1. Let's look at the correlation matrix for the variables and see what else we can learn. I put my variables into a DataFrame named S2 and created a correlation matrix.
S2.corr()
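The article does not show how S2 was built, so the following is only a minimal sketch. The column names X3, X4 and y1 are assumed from the later `S2[['X3','X4']]` and `S2['y1']` lookups, and the fixed random seed is my addition for reproducibility.

```python
import numpy as np
import pandas as pd

# Fixed seed so the sketch is reproducible (an addition, not in the article)
rng = np.random.default_rng(0)

x3 = rng.integers(0, 100, 100)
x4 = 3 * x3 + rng.integers(0, 100, 100)
y1 = 4 * x3 + rng.integers(0, 100, 100)

# Column names are assumed from the lookups used later in the article
S2 = pd.DataFrame({'X3': x3, 'X4': x4, 'y1': y1})

corr = S2.corr()  # pairwise Pearson correlations between all columns
print(corr.round(3))
```

The off-diagonal entry for X3 and X4 is the one to watch: it measures how strongly the two predictors move together.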
Judging by the correlation matrix, X3 and X4 not only have a high positive correlation with y1 but are also highly correlated with each other. Let's see how this affects our results.
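One common way to put a number on this, which the original article does not use, is the variance inflation factor (VIF). The sketch below computes it by hand with NumPy for the two-predictor case, on data regenerated with a fixed seed (an assumption, so the numbers will not match the article's). The helper name `r_squared` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
x3 = rng.integers(0, 100, 100).astype(float)
x4 = 3 * x3 + rng.integers(0, 100, 100)

def r_squared(target, predictor):
    """R-squared from regressing `target` on `predictor` plus an intercept."""
    design = np.column_stack([np.ones(len(predictor)), predictor])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    total = target - target.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)

# VIF for X4: 1 / (1 - R^2), where R^2 comes from regressing X4 on X3
vif_x4 = 1.0 / (1.0 - r_squared(x4, x3))
print(f"VIF for X4: {vif_x4:.2f}")
```

A VIF of 1 means the predictor is uncorrelated with the others. As a rule of thumb, values above roughly 5 to 10 are often treated as a warning sign.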
Before fitting a linear regression model to these variables, we need to understand the concepts of the p-value and the null hypothesis.
The p-value is used to decide whether or not to reject the null hypothesis.
The null hypothesis in our case is: "the variable does not have a significant relation with y" (that is, its coefficient is zero).
If the p-value is less than the chosen significance threshold (0.05 is the conventional choice), we reject the null hypothesis; otherwise, we fail to reject it. So let's move forward.
I import the statsmodels library and use it to fit an ordinary least squares (OLS) model to my variables.
The independent variables are X3 and X4, and the dependent variable is y1.
import statsmodels.api as sm
X = S2[['X3', 'X4']]
y = S2['y1']
# Add an intercept column, then fit OLS on both predictors
X = sm.add_constant(X)
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())
The results we get are:
We get a very high R-squared value, which shows that our model explains the variance in y1 quite well. The coefficients, on the other hand, tell an entirely different story.
The p-value for our X4 variable shows that we cannot reject the null hypothesis, which suggests that X4 does not have a significant relation with y1.
Furthermore, the coefficient is negative, which contradicts the scatterplots: they showed that y1 has a positive relationship with each independent variable.
So to sum it up: our coefficients are not reliable and our p-values cannot be trusted.
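To make "not reliable" more tangible, here is a small simulation sketch that is not part of the original article. It refits the two-predictor model on many freshly generated samples and measures how much the X3 coefficient varies from sample to sample, once with X4 built from X3 (as in the article) and once with X4 generated independently. The function name `coef_spread`, the number of trials, and the seed are all my assumptions.

```python
import numpy as np

def coef_spread(collinear, n_trials=500, n=100, seed=0):
    """Std. dev. of the fitted X3 coefficient across repeated simulated samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        x3 = rng.integers(0, 100, n).astype(float)
        noise = rng.integers(0, 100, n)
        if collinear:
            x4 = 3 * x3 + noise  # mirrors the article's construction
        else:
            x4 = 3 * rng.integers(0, 100, n) + noise  # unrelated to x3
        y1 = 4 * x3 + rng.integers(0, 100, n)
        # Fit y1 ~ intercept + x3 + x4 by ordinary least squares
        design = np.column_stack([np.ones(n), x3, x4])
        coef, *_ = np.linalg.lstsq(design, y1, rcond=None)
        estimates.append(coef[1])  # coefficient on x3
    return float(np.std(estimates))

spread_collinear = coef_spread(collinear=True)
spread_independent = coef_spread(collinear=False)
print(f"collinear: {spread_collinear:.3f}, independent: {spread_independent:.3f}")
```

The data-generating process for y1 is identical in both cases, so any extra spread in the collinear case comes purely from the correlation between the predictors.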
Regression with one variable only
In the previous multivariate example, our results suggested that X4 did not have a significant relation with y1. So let us analyze y1 against X4 alone and see what we get.
import statsmodels.api as sm
# Use X4 as the only predictor this time
X4 = S2['X4']
y1 = S2['y1']
X = sm.add_constant(X4)
est = sm.OLS(y1, X)
est2 = est.fit()
print(est2.summary())
After fitting our OLS model, we get:
The coefficient is now positive, and we can reject the null hypothesis that X4 is not related to y1. One more thing to take from this model is that the R-squared value has dropped noticeably, from 0.942 to 0.826. So what does that tell us? If our goal is prediction, we should think carefully before removing variables. But if our goal is to interpret each coefficient, collinearity can be troublesome, and we have to consider which variables to keep and which to remove.
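As a rough way to weigh which variable to keep, one can compare R-squared across the candidate models. The sketch below is my addition, not the author's procedure: it regenerates data with a fixed seed (so the numbers differ from the 0.942 and 0.826 reported above), and the helper name `fit_r_squared` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
x3 = rng.integers(0, 100, 100).astype(float)
x4 = 3 * x3 + rng.integers(0, 100, 100)
y1 = 4 * x3 + rng.integers(0, 100, 100)

def fit_r_squared(y, *predictors):
    """R-squared of an OLS fit of `y` on the given predictors plus an intercept."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    total = y - y.mean()
    return 1.0 - (residuals @ residuals) / (total @ total)

r2_full = fit_r_squared(y1, x3, x4)   # both predictors
r2_x4_only = fit_r_squared(y1, x4)    # the single-variable model used above
r2_x3_only = fit_r_squared(y1, x3)    # the alternative single-variable model
print(f"full: {r2_full:.3f}, X4 only: {r2_x4_only:.3f}, X3 only: {r2_x3_only:.3f}")
```

In this simulated setup y1 is generated from x3, so keeping X3 alone should give up very little explained variance, while keeping X4 alone costs more. With real data the right choice is usually less clear-cut and depends on domain knowledge.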
[1]: Gareth James et al., An Introduction to Statistical Learning. http://faculty.marshall.usc.edu/gareth-james/ISL/
Translated from: https://medium.com/swlh/how-multicollinearity-is-a-problem-in-linear-regression-dbb76e25cd80