Common Feature Selection Methods and Their Python Code

Feature Selection

When a dataset contains many redundant variables, model performance suffers and the cost of building the model goes up, so feature selection is well worth doing.

At a minimum, feature selection brings benefits in three areas:

  • Less overfitting: with less redundant data, there is less chance of making decisions based on noise.
  • Better accuracy: with less misleading data, the model has an easier time fitting the real signal.
  • Shorter training time: less data means the machine has less to chew on, so it runs faster.

Feature Selection in Machine Learning

Four feature selection methods are introduced below. All of them use the Pima Indians Diabetes dataset, which is loaded directly from a URL in each code example.

1. Univariate Selection

Statistical tests can help select the features that are most strongly related to the target variable. The SelectKBest class in sklearn wraps a set of statistical tests for selecting a fixed number of features. The example below uses the chi-squared test (which requires non-negative feature values) to select features:

# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)

import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)

# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)

# summarize selected features
print(features[0:5,:])

The output shows the score for each variable and the four highest-scoring variables that were selected: plas, test, mass, and age.

[  111.52   1411.887    17.605    53.108  2175.565   127.669     5.393
   181.304]

[[ 148.     0.    33.6   50. ]
 [  85.     0.    26.6   31. ]
 [ 183.     0.    23.3   32. ]
 [  89.    94.    28.1   21. ]
 [ 137.   168.    43.1   33. ]]
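As a small extra sketch (my addition, not part of the original post; it reuses the fit and names objects from the code above), pairing the scores with the column names makes the ranking easier to read:

scores = pandas.Series(fit.scores_, index=names[0:8])
print(scores.sort_values(ascending=False))  # the top four should be plas, test, mass, age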

2. Recursive Feature Elimination

Recursive Feature Elimination (RFE) recursively removes variables and rebuilds the model on those that remain, using model accuracy to judge which variables (and combinations of variables) contribute most to predicting the target. See the documentation for sklearn's RFE class for more detail.

# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# feature extraction
model = LogisticRegression(solver='liblinear')  # liblinear was the default solver when this example was written

rfe = RFE(model, n_features_to_select=3)  # newer sklearn versions require the keyword argument
fit = rfe.fit(X, Y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Note that you have to specify how many features to keep. Here RFE selects the top 3 variables: preg, mass, and pedi.

Num Features: 3
Selected Features: [ True False False False False  True  True False] # selected features are marked True
Feature Ranking: [1 2 3 5 6 1 1 4] # selected features are ranked 1
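A quick follow-up sketch (my addition, assuming the RFE code above has been run): printing the names of the selected columns is usually more readable than the raw boolean mask:

selected = [name for name, keep in zip(names[0:8], fit.support_) if keep]
print(selected)  # expected: ['preg', 'mass', 'pedi']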

3. Principal Component Analysis

PCA is probably the best-known dimensionality reduction technique. It uses linear algebra to project the data into a compressed form. If you are not familiar with PCA, the Wikipedia article is worth a read. sklearn provides a PCA class.

# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA

# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)

# summarize components
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)

The first three principal components already explain most of the variance in the dataset.

Explained Variance: [ 0.88854663  0.06159078  0.02579012]
[[ -2.02176587e-03   9.78115765e-02   1.60930503e-02   6.07566861e-02
    9.93110844e-01   1.40108085e-02   5.37167919e-04  -3.56474430e-03]
 [  2.26488861e-02   9.72210040e-01   1.41909330e-01  -5.78614699e-02
   -9.46266913e-02   4.69729766e-02   8.16804621e-04   1.40168181e-01]
 [ -2.24649003e-02   1.43428710e-01  -9.22467192e-01  -3.07013055e-01
    2.09773019e-02  -1.32444542e-01  -6.39983017e-04  -1.25454310e-01]]
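A minimal follow-up sketch (my addition, assuming the PCA code above has been run): the cumulative explained-variance ratio is a common way to decide how many components are worth keeping:

print(numpy.cumsum(fit.explained_variance_ratio_))  # roughly [0.889  0.950  0.976] for the three components above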

4. Feature Importance

Tree-based ensembles such as Random Forest and Extra Trees can estimate the importance of each feature; the example below uses ExtraTreesClassifier.

# Feature Importance with Extra Trees Classifier

from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier

# load data

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values

X = array[:,0:8]
Y = array[:,8]

# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)

The importance of each feature (unsorted):

[ 0.11070069  0.2213717   0.08824115  0.08068703  0.07281761  0.14548537
  0.12654214  0.15415431]
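One last sketch (my addition, assuming the Extra Trees code above has been run): sorting the importances together with the column names makes the ranking explicit:

for score, name in sorted(zip(model.feature_importances_, names[0:8]), reverse=True):
    print("%.3f  %s" % (score, name))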

Summary

There are plenty of methods to choose from; whether they actually work on your problem is another story.

Reference: https://machinelearningmastery.com/feature-selection-machine-learning-python/
