Python dummy variable regression: How to handle collinearity of dummy variables for linear regression?

I am using scikit-learn LogisticRegression on a dataset of household characteristics and trying to understand how to prepare the independent variables.

I have created binary dummy variables in place of categorical variables.

e.g. The variable DWELLING_TYPE, which had 3 possible values (DetachedHouse, SemiDetached and Apartment), has been replaced with 3 binary variables DWELLING_TYPE_DetachedHouse, DWELLING_TYPE_SemiDetached and DWELLING_TYPE_Apartment that each have the value 1 or 0.

Clearly these 3 variables are co-dependent (co-linear?) because if one of these variables is 1, the other 2 must be 0. My understanding is that co-linearity should be minimised for Logistic Regression, so should I be omitting one of these variables from the input matrix?

Solution

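Yes, it's good practice. When you convert a categorical variable into dummies, you can drop one of them. The dummies for a single categorical variable always sum to 1 in every row, so any one of them is fully determined by the others, and together with an intercept the full set is perfectly collinear. Dropping one removes that redundancy, and the dropped category becomes the baseline against which the remaining coefficients are interpreted.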
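To see the redundancy concretely, here is a minimal sketch in plain Python. The sample values are made up for illustration; only the DWELLING_TYPE levels come from the question.

```python
# Hypothetical sample values for DWELLING_TYPE (illustration only).
dwelling_types = ["DetachedHouse", "Apartment", "SemiDetached", "Apartment"]
levels = ["Apartment", "DetachedHouse", "SemiDetached"]

# Full one-hot encoding: one 0/1 column per level.
full = [[int(value == level) for level in levels] for value in dwelling_types]

# Every row sums to 1, which is exactly what an intercept column looks like.
row_sums = [sum(row) for row in full]

# Drop the first level ("Apartment"); it can be rebuilt from the other two.
reduced = [row[1:] for row in full]
rebuilt_first = [1 - sum(row) for row in reduced]

print(row_sums)
print(rebuilt_first == [row[0] for row in full])
```

Since the dropped column can be rebuilt exactly from the remaining two, leaving it out loses no information.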

In Python you can do this with pd.get_dummies:

pd.get_dummies(df, columns=categorical_columns, drop_first=True)

Setting the drop_first parameter to True gives you k-1 dummies for a column with k categories by dropping the first category.
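A small runnable sketch of that call is below. The DataFrame contents and the NUM_OCCUPANTS column are hypothetical, made up only to show the shape of the output.

```python
import pandas as pd

# Hypothetical data; NUM_OCCUPANTS and the row values are made up.
df = pd.DataFrame({
    "DWELLING_TYPE": ["DetachedHouse", "Apartment", "SemiDetached", "Apartment"],
    "NUM_OCCUPANTS": [4, 1, 3, 2],
})
categorical_columns = ["DWELLING_TYPE"]

# Encode the categorical column, keeping k-1 dummies per variable.
encoded = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

print(list(encoded.columns))
```

With this data the Apartment dummy is the one dropped, since it sorts first among the levels, so Apartment acts as the baseline category. If you want a different baseline, one option is to encode without drop_first and drop the column of your choice yourself.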
