Python: Dimensionality Reduction with Feature Extraction or Feature Selection

Dimensionality Reduction via Feature Extraction

Reducing Features with Principal Component Analysis (PCA)

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import datasets

# Standardize the digits data, then keep enough components to retain 99% of the variance
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
pca = PCA(n_components=0.99, whiten=True)
features_pca = pca.fit_transform(features)
>>> features.shape[1]   # original number of features
64
>>> features_pca.shape[1]   # reduced number of features
54
  • n_components=0.99 keeps enough components to retain 99% of the original variance (information).

  • whiten=True transforms the values of each principal component to have zero mean and unit variance.

  • svd_solver="randomized" uses a stochastic algorithm to find the first principal components, usually in far less time; a sketch follows below.
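
A minimal sketch of the randomized solver (not from the original recipe), reusing the standardized features from above. Note that a fractional n_components requires svd_solver="full", so an explicit integer component count is used here:

# Assumption: `features` is the standardized digits data defined earlier
pca_randomized = PCA(n_components=10, whiten=True, svd_solver="randomized", random_state=1)
features_pca_randomized = pca_randomized.fit_transform(features)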

Reducing Features When the Data Is Linearly Inseparable

  • If the data are linearly separable (the classes can be divided by a straight line or a hyperplane), plain PCA works well.
  • If the data are only separable by a curved boundary, linear PCA performs poorly.

  • Use kernel PCA for nonlinear dimensionality reduction.
  • make_circles()
    • generates a simulated dataset in which one class of samples completely surrounds the other class.
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

# Linearly inseparable data: one ring of points surrounding another
features, _ = make_circles(n_samples=1000, random_state=1, noise=0.1, factor=0.1)

# Kernel PCA with an RBF (radial basis function) kernel handles the nonlinearity
kpca = KernelPCA(kernel="rbf", gamma=15, n_components=1)
features_kpca = kpca.fit_transform(features)

>>> features.shape[1]
2
>>> features_kpca.shape[1]
1

Reducing Features by Maximizing Class Separability

  • Use Linear Discriminant Analysis (LDA) to project the features onto component axes that maximize the separation between classes.
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = datasets.load_iris()
features = iris.data
target = iris.target

# Project the four features onto the single axis that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1)
features_lda = lda.fit(features, target).transform(features)
  • explained_variance_ratio_ reports the fraction of variance explained by each retained component.
>>> features.shape[1]
4
>>> features_lda.shape[1]
1
>>> lda.explained_variance_ratio_
array([0.9912126])
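
If the right number of components is unknown in advance, a common follow-up is to fit LDA with n_components=None and accumulate explained_variance_ratio_ until a variance target is met. A minimal sketch (an addition, reusing features and target from above):

lda_full = LinearDiscriminantAnalysis(n_components=None)
lda_full.fit(features, target)

def select_n_components(var_ratio, goal_var):
    # Walk through the ratios, counting components until the goal is reached
    total_variance = 0.0
    n_components = 0
    for explained_variance in var_ratio:
        total_variance += explained_variance
        n_components += 1
        if total_variance >= goal_var:
            break
    return n_components

select_n_components(lda_full.explained_variance_ratio_, 0.95)  # returns 1 on iris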

Reducing Features with Matrix Factorization

  • Use Non-Negative Matrix Factorization (NMF) to reduce the dimensionality of a feature matrix whose entries are all non-negative.
from sklearn.decomposition import NMF
from sklearn import datasets

# NMF requires non-negative values; the digits pixel intensities qualify
digits = datasets.load_digits()
features = digits.data

nmf = NMF(n_components=10, random_state=1)
features_nmf = nmf.fit_transform(features)

A warning that may appear:

Warning (from warnings module):
  File "C:\Users\LX\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\decomposition\_nmf.py", line 1692
    warnings.warn(
ConvergenceWarning: Maximum number of iterations 200 reached. Increase it to improve convergence.
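
One way to address it is to give the solver more iterations through max_iter (the value 1000 below is an arbitrary choice, not from the original post):

nmf = NMF(n_components=10, random_state=1, max_iter=1000)
features_nmf = nmf.fit_transform(features)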
>>> features.shape[1]
64
>>> features_nmf.shape[1]
10

Reducing Features on Sparse Data

Use Truncated Singular Value Decomposition (TSVD).

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix
from sklearn import datasets

# Standardize the data, then store it as a sparse matrix
digits = datasets.load_digits()
features = StandardScaler().fit_transform(digits.data)
features_sparse = csr_matrix(features)

# Unlike PCA, TSVD works directly on sparse feature matrices
tsvd = TruncatedSVD(n_components=10)
features_sparse_tsvd = tsvd.fit(features_sparse).transform(features_sparse)
>>> features.shape[1]
64
>>> features_sparse_tsvd.shape[1]
10

Dimensionality Reduction via Feature Selection

  • Feature selection keeps the features that carry more information and discards those that carry less.
  • A feature with low variance is assumed to contain little information.

Thresholding Numerical Feature Variance

from sklearn.feature_selection import VarianceThreshold
from sklearn import datasets

iris = datasets.load_iris()
features = iris.data
target = iris.target

# Drop every feature whose variance is below 0.5
thresholder = VarianceThreshold(threshold=0.5)
features_high_variance = thresholder.fit_transform(features)
print(features_high_variance[0:3])
  • High-variance feature matrix:
[[5.1 1.4 0.2]
 [4.9 1.4 0.2]
 [4.7 1.3 0.2]]
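
Before picking a threshold, the variance of every feature can be inspected through the fitted thresholder's variances_ attribute (an addition, reusing the objects above):

>>> thresholder.fit(features).variances_
array([0.68112222, 0.18871289, 3.09550267, 0.57713289])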

Thresholding Binary Feature Variance

  • Given a set of binary features, remove those with low variance. For a Bernoulli feature, variance = p(1 - p), where p is the proportion of observations in one of the classes, so a threshold of 0.75 * (1 - 0.75) drops any feature in which a single class covers more than 75% of the samples.
  • Feature 0: 80% class 0
    Feature 1: 20% class 0
    Feature 2: 60% class 0
from sklearn.feature_selection import VarianceThreshold

features = [[0, 1, 0],
            [0, 1, 1],
            [0, 1, 0],
            [0, 1, 1],
            [1, 0, 0]]

# Bernoulli variance is p * (1 - p); keep only features above 0.75 * (1 - 0.75)
thresholder = VarianceThreshold(threshold=(0.75 * (1 - 0.75)))
thresholder.fit_transform(features)


  • Only feature 2 survives (variance 0.24 > 0.1875; features 0 and 1 have variance 0.16):

>>> thresholder.fit_transform(features)
array([[0],
       [1],
       [0],
       [1],
       [0]])

Handling Highly Correlated Features

  • If some features in the feature matrix are highly correlated with each other, they carry largely redundant information, and one feature from each correlated pair can be dropped, as in the sketch below.
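
A minimal sketch (with a made-up feature matrix; the 0.95 cutoff is an arbitrary choice): compute the absolute correlation matrix, keep only its upper triangle so each pair is counted once, and drop one column from every highly correlated pair.

import numpy as np
import pandas as pd

# Hypothetical data: the first two columns are almost perfectly correlated
features = np.array([[1, 1, 1],
                     [2, 2, 0],
                     [3, 3, 1],
                     [4, 4, 0],
                     [5, 5, 1],
                     [6, 6, 0],
                     [7, 7, 1],
                     [8, 7, 0],
                     [9, 7, 1]])
dataframe = pd.DataFrame(features)

# Absolute correlations, upper triangle only
corr_matrix = dataframe.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Drop one column from every pair with correlation above 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(dataframe.drop(to_drop, axis=1).head(3))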

Removing Features Irrelevant to the Classification Task
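
A minimal sketch of one common approach, using SelectKBest with the chi-squared statistic (the iris setup below is illustrative; chi2 requires non-negative features):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
features = iris.data.astype(int)  # cast to integers so the features behave like counts
target = iris.target

# Keep the two features with the highest chi-squared statistic w.r.t. the target
chi2_selector = SelectKBest(chi2, k=2)
features_kbest = chi2_selector.fit_transform(features, target)
print(features_kbest.shape[1])  # 2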

Recursive Feature Elimination
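
A minimal sketch using scikit-learn's RFECV, which repeatedly trains a model, discards the weakest feature, and lets cross-validation decide how many features to keep (the simulated regression setup is illustrative):

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Simulated data: 100 features, only 2 of them informative
features, target = make_regression(n_samples=10000, n_features=100,
                                   n_informative=2, random_state=1)

ols = LinearRegression()
# step=1: remove one feature per iteration; CV chooses the final feature count
rfecv = RFECV(estimator=ols, step=1, scoring="neg_mean_squared_error")
features_rfecv = rfecv.fit_transform(features, target)
print(rfecv.n_features_)  # number of features retained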
