sklearn.feature_selection
Classes in this module perform feature selection and dimensionality reduction on high-dimensional sample sets to improve estimators' performance.
sklearn.feature_selection.VarianceThreshold(threshold=0.0)
Variance thresholding is a simple baseline approach to feature selection: it removes all features whose variance does not meet a given threshold. The threshold defaults to 0, so by default the method removes features with zero variance, i.e. features that take the same value in every sample.
example:
from sklearn.feature_selection import VarianceThreshold
# feature variances of X_train: [0., 0.22222222, 2.88888889, 0.]
X_train = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]
# feature variances of X_test: [1.55555556, 8.66666667, 0.22222222, 0.22222222]
X_test = [[1, 2, 0, 3], [3, 4, 0, 3], [4, 9, 1, 2]]
# instantiate a variance selector with a threshold of 0.5
selector = VarianceThreshold(threshold=0.5)
# fit computes the variances of the given data (X_train here) and returns the fitted selector
selector.fit(X_train)
# the variances_ attribute holds the variance of each feature of X_train
selector.variances_
# transform keeps only the selected features: the third column, [[0], [4], [1]]
X_train = selector.transform(X_train)
# the mask fitted on X_train is applied to X_test: its third column, [[0], [0], [1]]
X_test = selector.transform(X_test)
Univariate feature selection chooses the best features based on univariate statistical tests. It can be seen as a preprocessing step for an estimator; scikit-learn exposes feature selection routines as objects that implement a transform method.
Univariate feature selection evaluates each feature's influence on the target variable independently, which is clear, intuitive, and easy to interpret; however, it ignores any relationships that may exist between features.
SelectKBest keeps the K highest-scoring features;
SelectPercentile keeps a user-specified top-scoring percentage of features;
SelectFpr, SelectFdr, and SelectFwe apply a common univariate statistical test to each feature, controlling the false positive rate, the false discovery rate, and the family-wise error rate, respectively;
GenericUnivariateSelect performs univariate feature selection with a configurable strategy, which makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator (see the sketch below).
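As a minimal sketch of that hyper-parameter-search idea (the pipeline, classifier, and grid values below are illustrative assumptions, not from the original text):

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('select', GenericUnivariateSelect(chi2)),
    ('clf', LinearSVC(dual=False)),
])
# search jointly over the selection strategy (mode) and its parameter
param_grid = [
    {'select__mode': ['k_best'], 'select__param': [1, 2, 3]},
    {'select__mode': ['percentile'], 'select__param': [25, 50, 75]},
]
grid = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
grid.best_params_  # the best mode/param combination found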
These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile); the available scoring functions are summarized in the table below.
The methods based on F-test estimate the degree of linear dependency between two random variables. On the other hand, mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
method | description | task |
---|---|---|
f_classif(X, y) | computes the ANOVA F-value of the samples | classification |
mutual_info_classif(X, y) | estimates mutual information for a discrete target variable | classification |
chi2(X, y) | computes the chi-squared statistic between non-negative features and the class labels | classification |
f_regression(X, y) | computes the F-value of univariate linear regression tests between each feature and the target | regression |
mutual_info_regression(X, y) | estimates mutual information for a continuous target variable | regression |
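As a quick sketch of how a scoring function from the table plugs into a selector (a minimal example on iris; the 50% figure is an arbitrary illustrative choice):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = load_iris(return_X_y=True)
# keep the top 50% of features ranked by the ANOVA F-value
X_new = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)
X_new.shape  # (150, 2)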
f_regression
sklearn.feature_selection.f_regression(X, y, center=True)
mutual_info_regression
sklearn.feature_selection.mutual_info_regression(X, y, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)
discrete_features specifies which features should be treated as discrete; n_neighbors is the number of neighbors used in [1] and [2] to estimate the mutual information between continuous variables. A larger n_neighbors lowers the variance of the estimate, but can introduce bias.
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
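To illustrate mutual_info_regression and its n_neighbors parameter, a minimal sketch on synthetic data (the data-generating function below is an arbitrary choice for illustration):

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
# y depends only on the first feature; the other two are pure noise
y = X[:, 0] + np.sin(6 * np.pi * X[:, 0]) + 0.1 * rng.randn(1000)
mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
# mi[0] should be clearly larger than mi[1] and mi[2]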
Given an external estimator that assigns weights to features (such as the coefficients of a linear model), recursive feature elimination (RFE) works by recursively considering smaller and smaller sets of features. First, the estimator is trained on the full set of original features and the importance of each feature is obtained through its coef_ or feature_importances_ attribute; then, the least important features are pruned from the current set. This procedure is repeated on the pruned set until the desired number of features is reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features.
example:
A recursive feature elimination example showing the relevance of pixels in a digit classification task.
A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.
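A minimal RFE sketch (iris and a linear-kernel SVC are illustrative assumptions, not from the original text):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# a linear kernel exposes coef_, which RFE uses to rank features
selector = RFE(SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(X, y)
selector.support_   # boolean mask of the selected features
selector.ranking_   # ranking of all features (1 = selected)

RFECV has the same interface but chooses the number of features by cross-validation instead of taking it as fixed.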
sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False, norm_order=1, max_features=None)
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. The features are considered unimportant and removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are "mean", "median" and float multiples of these like "0.1*mean".
example: Feature selection using SelectFromModel and LassoCV
Use sklearn.pipeline.Pipeline to chain the feature selection and training steps:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

clf = Pipeline([
    # the l1 penalty requires dual=False in LinearSVC
    ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
    ('classification', RandomForestClassifier())
])
clf.fit(X, y)
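To illustrate the string threshold heuristics mentioned above, a minimal sketch (make_regression and LassoCV are illustrative choices echoing the linked example, not the original text's code):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
# keep only the features whose |coef_| is at least the median coefficient magnitude
sfm = SelectFromModel(LassoCV(), threshold="median")
X_reduced = sfm.fit_transform(X, y)
X_reduced.shape  # roughly half of the features survive the "median" heuristic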