When a dataset contains many redundant variables, model performance suffers and modeling costs go up, so feature selection is well worth doing.
At a minimum, feature selection brings three benefits: less overfitting, better accuracy, and shorter training time.
Below are four feature selection methods. All of the examples use the Pima Indians Diabetes dataset, which is loaded directly from the URL shown in each snippet.
Statistical tests can help us pick the features most strongly associated with the target variable. The SelectKBest class in sklearn wraps a family of statistical tests for selecting a fixed number of features. The example below uses the chi-squared test (which requires non-negative feature values) to select features:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
The output shows a score for each variable; the four highest-scoring variables, plas, test, mass, and age, are the ones selected.
[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393
181.304]
[[ 148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[ 183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]
[ 137. 168. 43.1 33. ]]
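As a quick check (not part of the original example, and assuming the variables test and names from the snippet above are still in scope), get_support can map the selected columns back to their names:

# map the indices of the k best features back to the column names
selected_idx = test.get_support(indices=True)
print([names[i] for i in selected_idx])  # expected: ['plas', 'test', 'mass', 'age']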
Recursive Feature Elimination (RFE) screens features by recursively removing variables and rebuilding the model on those that remain, using the model's accuracy to judge each variable's contribution to predicting the target. See the sklearn documentation for the RFE class for more detail.
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Note that you must specify how many variables to keep; here RFE selects the top 3: preg, mass, and pedi.
Num Features: 3
Selected Features: [ True False False False False  True  True False] # selected columns are marked True
Feature Ranking: [1 2 3 5 6 1 1 4] # selected columns are ranked 1
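As a small follow-up sketch (again assuming fit and names from the snippet above are in scope), the boolean mask in support_ can be turned into readable feature names:

# keep the column names whose support_ entry is True
selected_names = [name for name, kept in zip(names, fit.support_) if kept]
print(selected_names)  # expected: ['preg', 'mass', 'pedi']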
PCA is probably the best-known dimensionality reduction algorithm; it uses a linear algebra transformation to compress the data. If you are not familiar with PCA, the Wikipedia article is a good primer. sklearn provides a PCA class.
# Feature Extraction with PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s") % fit.explained_variance_ratio_
print(fit.components_)
The first three principal components already explain most of the variance in the dataset.
Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02
9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
[ 2.26488861e-02 9.72210040e-01 1.41909330e-01 -5.78614699e-02
-9.46266913e-02 4.69729766e-02 8.16804621e-04 1.40168181e-01]
[ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01
2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
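If you actually want the reduced dataset (a quick sketch continuing the snippet above; X_reduced is just an illustrative name), transform projects the original 8 columns onto the 3 components:

# project the samples onto the principal components
X_reduced = fit.transform(X)
print(X_reduced.shape)  # expected: (768, 3) -- 768 samples, 3 components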
Ensembles of decision trees, such as Random Forest and Extra Trees, can estimate feature importance.
# Feature Importance with Extra Trees Classifier
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier()
model.fit(X, Y)
print(model.feature_importances_)
The importance of each feature (unsorted):
[ 0.11070069 0.2213717 0.08824115 0.08068703 0.07281761 0.14548537 0.12654214 0.15415431]
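To make these numbers easier to read (a quick sketch continuing the snippet above; ranked is just an illustrative name), pair each importance with its column name and sort from most to least important:

# names[:-1] drops the 'class' label, leaving the 8 feature columns
ranked = sorted(zip(names[:-1], model.feature_importances_), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print("%-5s %.3f" % (name, score))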
There are plenty of methods; whether they actually work on your problem is another matter entirely.
Reference: https://machinelearningmastery.com/feature-selection-machine-learning-python/