sklearn learning: Dimensionality reduction - feature selection

This article condenses the corresponding documentation, focusing on code examples.

Corresponding documentation: http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection


1.13. Feature selection

Purpose of feature selection: to improve estimators' accuracy scores, or to boost their performance on very high-dimensional datasets.

1.13.1. Removing features with low variance

VarianceThreshold removes all features whose variance does not meet a given threshold.

from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])
Notes on VarianceThreshold:
  • By default (threshold=0), it removes only features with zero variance, i.e. features that take the same value in every sample.
  • In VarianceThreshold(threshold=(.8 * (1 - .8))), the example assumes boolean features (taking the values 0 and 1), and the threshold parameter is a variance. A Bernoulli variable has variance p(1 - p), so the setting 0.8 * (1 - 0.8) removes every feature that is either 0 or 1 in more than 80% of the samples.
1.13.2. Univariate feature selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. It proceeds in two steps: score each feature, then select features based on the scores. The selection routines are:
  • SelectKBest: keeps the K highest-scoring features according to a scoring function
  • SelectPercentile: keeps the highest-scoring percentage of features
  • SelectFpr: selects features based on a false positive rate test
  • SelectFdr: selects features based on an estimated false discovery rate
  • SelectFwe: selects features based on the family-wise error rate
  • GenericUnivariateSelect: univariate selection with a configurable strategy

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
X_new.shape
(150, 2)
class sklearn.feature_selection.SelectKBest(score_func=f_classif, k=10)
Attributes:
  • scores_ : array-like, shape=(n_features,) — scores of features
  • pvalues_ : array-like, shape=(n_features,) — p-values of feature scores
Scoring functions:
  • For regression: f_regression
  • For classification: chi2 or f_classif
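
The scoring function must match the task. A minimal sketch, assuming a synthetic dataset from make_regression (the parameter values are illustrative), showing f_regression together with SelectPercentile on a regression problem:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectPercentile, f_regression

# 100 samples, 10 features, of which only 2 carry signal
X, y = make_regression(n_samples=100, n_features=10, n_informative=2,
                       random_state=0)
# keep the top 20% of features as ranked by the univariate F-test
X_new = SelectPercentile(f_regression, percentile=20).fit_transform(X, y)
X_new.shape
(100, 2)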

1.13.3. Recursive feature elimination (RFE)

Recursive feature elimination:
First, the estimator is trained on the initial set of features, and a weight is assigned to each of them.
Then, the features whose absolute weights are smallest are pruned from the current set of features.
This procedure is repeated recursively on the pruned set until the desired number of features is eventually reached.
RFECV performs RFE in a cross-validation loop to find the optimal number of features.

RFE example:
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)  # recursively eliminate features, one per step
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# visualize how early each pixel was eliminated
plt.matshow(ranking, cmap=plt.cm.Blues)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()
Note: the linear SVM can be replaced by another estimator that exposes feature weights, e.g. logistic regression; see the sketch below.
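
A minimal sketch, reusing X and y from the digits example above and swapping the linear SVM for LogisticRegression (any estimator that exposes coef_ works with RFE; max_iter=1000 is an illustrative assumption so the solver converges):

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

logreg = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=logreg, n_features_to_select=1, step=1)
rfe.fit(X, y)  # X, y: the flattened digits images and labels from above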

RFECV example:
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)
  • 1.13.4. Feature selection using SelectFromModel (sketches below)
    • 1.13.4.1. L1-based feature selection
    • 1.13.4.2. Randomized sparse models
    • 1.13.4.3. Tree-based feature selection
  • 1.13.5. Feature selection as part of a pipeline (sketch below)
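
These subsections are outlined above without code. Minimal sketches, modeled on the upstream scikit-learn documentation (the iris data and parameter values are illustrative assumptions; randomized sparse models are omitted, as they were deprecated in later scikit-learn releases):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = load_iris()
X, y = iris.data, iris.target

# 1.13.4.1. L1-based: an L1 penalty drives many coefficients to zero;
# SelectFromModel keeps the features with non-zero coefficients.
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
X_l1 = SelectFromModel(lsvc, prefit=True).transform(X)

# 1.13.4.3. Tree-based: the feature_importances_ of a tree ensemble
# serve as the importance measure for SelectFromModel.
trees = ExtraTreesClassifier(n_estimators=50).fit(X, y)
X_tree = SelectFromModel(trees, prefit=True).transform(X)

# 1.13.5. Feature selection as one step of a Pipeline, fitted end to end.
clf = Pipeline([
    ('feature_selection', SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False))),
    ('classification', ExtraTreesClassifier(n_estimators=50)),
])
clf.fit(X, y)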
