特征选择方法的三大类型 [1]:
1.filter method :利用一些统计指标进行特征选择,和模型没有关系
2.wrapper method:结合模型来做,每次加入或者减少特征看对模型的准确度是否有提升,如果有提升,那么就增加或者减少,所以
需要不断构建模型来判断是否要加入特征
3.embedded method:结合模型来做,和模型训练一起做,即模型训练完,特征就出来了;
所以,wrapper method 要不断的构建模型,花费的资源是比较多的!
filter的部分方法 [2] :
可视化:
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
iris=load_iris()
df=pd.DataFrame(iris.data,columns=["x1","x2","x3","x4"])
# print(df.info()) # 均非空
plt.figure(figsize=(8,6))
df_corr=df.corr(method="pearson") # 这里可以指定响应的方法
sb.heatmap(df_corr,annot=True)
plt.show()
假设把x1作为y,那么x2与x1的相关系数为-0.12 , x3 与x1的相关系数为0.87,x4与x1的相关系数为0.82 ,认为x2与x1相关性很弱,可以删除,常见的做法,是设定一个阈值,来判断
# 定位x1的相关系数
x1_corr=abs(df_corr["x1"])
print(x1_corr[x1_corr>0.5])
from sklearn.feature_selection import VarianceThreshold
X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
selector=VarianceThreshold() # 默认是0
X=selector.fit_transform(X)
print(selector.variances_)
print()
print(X)
# 输出
[0. 0.22222222 2.88888889 0. ]
[[2 0]
[1 4]
[1 1]]
wapper的部分方法[2]:
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression(multi_class="auto",solver="lbfgs",max_iter=1000)
# create the RFE model and select 3 attributes
rfe = RFE(model, 3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
embedded的部分方法[2]:
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris=load_iris()
ss=StandardScaler()
X,y=iris.data[:,:3],iris.data[:,3]
X=ss.fit_transform(X)
model_lasso=LassoCV(alphas=[1,0.5,0.1,0.05,0.001],cv=5).fit(X,y)
print(model_lasso.alpha_)
print(model_lasso.coef_) # 若为0表示剔除
refer:
[1] https://towardsdatascience.com/feature-selection-with-pandas-e3690ad8504b
[2] https://sebastianraschka.com/faq/docs/feature_sele_categories.html#:~:text=The%20third%20class%2C%20embedded%20methods,metric%20is%20used%20during%20learning.
[3] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
[4] https://arxiv.org/pdf/1202.3725.pdf
[5] https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall#:~:text=In%20the%20normal%20case%2C%20Kendall,small%20samples%20or%20some%20outliers.
[6] https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
[7] https://towardsdatascience.com/feature-selection-using-regularisation-a3678b71e499