A Summary of sklearn's Pipeline Utilities

Pipeline chains multiple estimators into a single one, which is useful for building a typical machine-learning workflow such as feature extraction, normalization, and classification. A pipeline serves two purposes:

Convenience: a single fit call and a single predict call are enough to fit and apply the whole chain of estimators (a short fit/predict sketch follows the first example below).

Joint parameter selection: the parameters of all estimators in the pipeline can be tuned together, e.g. with grid search.

All estimators in a pipeline except the last one must be transformers; the last step can be any kind of estimator, e.g. a transformer or a classifier.

A Pipeline is built from a list of (key, value) tuples: the key is a string naming the step (the name can be chosen freely), and the value is an estimator object.

① Instantiating a pipeline with Pipeline

In [1]: from sklearn.pipeline import Pipeline
   ...: from sklearn.svm import SVC
   ...: from sklearn.decomposition import PCA
   ...: estimators = [('reduce_dim',PCA()),('clf',SVC())]
   ...: pipe = Pipeline(estimators)
   ...: pipe
   ...:
Out[1]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
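
A quick sketch of the "fit/predict once" convenience mentioned above, assuming the iris dataset as a stand-in for real data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: iris plays the role of any (X, y).
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single fit call runs PCA.fit_transform on the training data and then SVC.fit on the result.
pipe.fit(X_train, y_train)

# A single predict/score call applies PCA.transform followed by SVC.predict.
print(pipe.score(X_test, y_test))
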
② Constructing a Pipeline with make_pipeline

sklearn.pipeline.make_pipeline(*steps): when constructing a pipeline this way, step names are neither required nor allowed; they are generated automatically from the lowercased class names of the estimators.

In [2]: from sklearn.naive_bayes import GaussianNB
   ...: from sklearn.preprocessing import StandardScaler
   ...: from sklearn.pipeline import make_pipeline
   ...: make_pipeline(StandardScaler(),GaussianNB(priors=None))
   ...:
Out[2]: Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('gaussiannb', GaussianNB(priors=None))])
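
A small sketch of how these auto-generated names plug into the <step>__<parameter> syntax introduced below (pipe2 is a hypothetical variable name):

# make_pipeline names each step after its lowercased class name,
# so nested parameters are addressed as 'standardscaler__...' and 'gaussiannb__...'.
pipe2 = make_pipeline(StandardScaler(), GaussianNB(priors=None))
pipe2.set_params(standardscaler__with_mean=False)
print(pipe2.steps[0][0])   # 'standardscaler'
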

Ways to access the estimators inside a pipeline:

① The estimators of a pipeline are stored as a list of tuples in the steps attribute, so an individual estimator can be accessed by list index.

In [4]: pipe.steps
Out[4]:
[('reduce_dim',
  PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)),
 ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))]


In [5]: pipe.steps[0]
Out[5]:
('reduce_dim',
 PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False))
② The estimators are also exposed as a dictionary in the named_steps attribute, so an individual estimator can be accessed by its step name.

In [2]: pipe.named_steps
Out[2]:
{'clf': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False)}


In [3]: pipe.named_steps['reduce_dim']
Out[3]:
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
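
These accessors are most useful after fitting, when the stored objects carry their learned attributes. A minimal sketch, assuming pipe has already been fitted as in the earlier sketch:

# The fitted PCA and SVC live inside the pipeline and can be inspected directly.
print(pipe.named_steps['reduce_dim'].explained_variance_ratio_)
print(pipe.steps[1][1].support_vectors_.shape)
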

③ Parameters of the estimators in a pipeline can be set with the <step>__<parameter> syntax, e.g. via set_params

In [6]: pipe.set_params(clf__C=10)
Out[6]:
Pipeline(steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])


In [7]: pipe.get_params('clf__C')
Out[7]:
{'clf': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
   decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
   max_iter=-1, probability=False, random_state=None, shrinking=True,
   tol=0.001, verbose=False),
 'clf__C': 10,
 'clf__cache_size': 200,
 'clf__class_weight': None,
 'clf__coef0': 0.0,
 'clf__decision_function_shape': None,
 'clf__degree': 3,
 'clf__gamma': 'auto',
 'clf__kernel': 'rbf',
 'clf__max_iter': -1,
 'clf__probability': False,
 'clf__random_state': None,
 'clf__shrinking': True,
 'clf__tol': 0.001,
 'clf__verbose': False,
 'reduce_dim': PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
   svd_solver='auto', tol=0.0, whiten=False),
 'reduce_dim__copy': True,
 'reduce_dim__iterated_power': 'auto',
 'reduce_dim__n_components': None,
 'reduce_dim__random_state': None,
 'reduce_dim__svd_solver': 'auto',
 'reduce_dim__tol': 0.0,
 'reduce_dim__whiten': False,
 'steps': [('reduce_dim',
   PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
     svd_solver='auto', tol=0.0, whiten=False)),
  ('clf', SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
     max_iter=-1, probability=False, random_state=None, shrinking=True,
     tol=0.001, verbose=False))]}
In [9]: pipe.get_params('clf__C')['clf__C']
Out[9]: 10
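
Strictly speaking, get_params only accepts a deep flag (get_params(deep=True)); the string passed above is simply treated as a truthy value, which is why the full parameter dict comes back and is then indexed by key. Listing those keys is a convenient way to discover the exact names accepted by set_params and by GridSearchCV below; a minimal sketch:

# Every valid <step>__<parameter> name, plus the step names themselves.
print(sorted(pipe.get_params(deep=True).keys()))
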

④ Tuning parameters with GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Whole steps can be searched over (None drops the step),
# alongside the nested parameters of the candidate estimators.
param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
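
A minimal sketch of running the search, assuming the digits dataset as stand-in data (chosen so that PCA(10) has enough input features):

from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# Fits every combination of reduce_dim, clf and clf__C with cross-validation.
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
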

FeatureUnion combines several transformers into a new transformer. It is built from a list of transformer objects; during fitting, each transformer is fitted to the data independently, and for transformation they are applied in parallel, with the sample matrices they output concatenated side by side into one large matrix. FeatureUnion serves the same purposes as Pipeline, and the two can be combined to build more complex models (a combined example is sketched at the end of this post).

A FeatureUnion is built from a list of (key, value) tuples, where the key is a string naming the transformation step (the name can be chosen freely) and the value is an estimator object.

① Instantiating a FeatureUnion with FeatureUnion

In [12]: from sklearn.pipeline import FeatureUnion
    ...: from sklearn.decomposition import PCA
    ...: from sklearn.decomposition import KernelPCA
    ...: estimators = [('linear_pca',PCA()),('kernel_pca',PCA())]
    ...: combined = FeatureUnion(estimators)
    ...: combined
    ...:
Out[12]:
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))],
       transformer_weights=None)
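
A minimal sketch of what the union produces, assuming the iris dataset as stand-in data:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Each PCA keeps all 4 iris components, so the two outputs are
# concatenated column-wise into a (150, 8) matrix.
X_combined = combined.fit_transform(X)
print(X_combined.shape)
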
② Instantiating a FeatureUnion with make_union

In [14]: from sklearn.pipeline import make_union
    ...: from sklearn.decomposition import PCA
    ...: from sklearn.decomposition import KernelPCA
    ...: make_union(PCA(),KernelPCA())
    ...:
Out[14]:
FeatureUnion(n_jobs=1,
       transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernelpca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=False, gamma=None, kernel='linear',
     kernel_params=None, max_iter=None, n_components=None, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0))],
       transformer_weights=None)
As with Pipeline, set_params can also be used to drop an individual step, by setting the parameter named after that step to None:

In [15]: combined.set_params(kernel_pca=None)
Out[15]:
FeatureUnion(n_jobs=1,
       transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', None)],
       transformer_weights=None)
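
Finally, as mentioned above, FeatureUnion and Pipeline can be nested to build more complex models. A minimal sketch (the step names here are hypothetical):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.decomposition import PCA, KernelPCA
from sklearn.svm import SVC

# A union of two feature extractors feeding a classifier.
union = FeatureUnion([('linear_pca', PCA(n_components=2)),
                      ('kernel_pca', KernelPCA(n_components=2, kernel='rbf'))])
model = Pipeline([('features', union), ('clf', SVC())])

# Nested parameters chain the __ syntax through both levels.
model.set_params(features__kernel_pca__gamma=0.1, clf__C=10)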
