Pipeline chains multiple estimators together into a single estimator, for example combining feature extraction, normalization, and classification into one typical machine-learning workflow. All estimators in a Pipeline except the last must be transformers; the last may be of any type (transformer, classifier, regressor). If the last estimator is a classifier, the whole pipeline can be used as a classifier; if it is a clusterer, the whole pipeline can be used as a clusterer.
In practice, calling the pipeline's fit method runs the first n-1 transformers on the features and then passes the result to the final estimator for training. The pipeline exposes all the methods of that last estimator.
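This equivalence can be sketched directly: fitting a pipeline gives the same result as manually calling fit_transform on each transformer and fitting the final estimator on the output (a sketch on the iris data; the two-step pipeline here is illustrative):

```python
# A sketch of what Pipeline.fit does internally: each transformer's
# fit_transform output feeds the next step, and the final estimator
# is fitted on the fully transformed features.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Manual chaining: transform, then fit the final estimator
scaler = StandardScaler()
Xt = scaler.fit_transform(X)
clf = SVC().fit(Xt, y)

# The equivalent two-step pipeline
pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC())]).fit(X, y)

# Both routes produce identical predictions
assert np.array_equal(clf.predict(scaler.transform(X)), pipe.predict(X))
```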
This brings two direct benefits: you only have to call fit and predict once on your data, and you can grid-search over the parameters of all the estimators in the pipeline at once.
A Pipeline's methods delegate to the corresponding method of the underlying estimators; if an estimator lacks that method, an error is raised. Suppose the Pipeline contains n estimators.
Typical steps:
1. Preprocess the data, e.g. handle missing values
2. Standardize the data
3. Reduce dimensionality
4. Select features
5. Classify, predict, or cluster (the estimator)
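The five steps above can be sketched as a single Pipeline. The step names, the SimpleImputer strategy, the LogisticRegression classifier, and the k/n_components values here are all illustrative choices, not prescribed by the text:

```python
# A sketch of the five typical steps as one Pipeline
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),   # 1. missing values
    ('scale', StandardScaler()),                  # 2. standardization
    ('pca', PCA(n_components=3)),                 # 3. dimensionality reduction
    ('select', SelectKBest(k=2)),                 # 4. feature selection
    ('clf', LogisticRegression(max_iter=1000)),   # 5. estimator
])
pipe.fit(X, y)
print(pipe.score(X, y))
```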
A Pipeline is built from a list of (key, value) pairs, where key is a string naming the step and value is the corresponding estimator.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pipe = Pipeline([('sc', StandardScaler()), ('pca', PCA()), ('svc', SVC())])
# e.g. 'sc' is a short name (alias) for the StandardScaler() step
pipe.fit(iris.data, iris.target)
Output:
Pipeline(memory=None,
steps=[('sc', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
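Because the last step is a classifier, the fitted pipeline itself can be used as a classifier, exposing predict and score:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline([('sc', StandardScaler()), ('pca', PCA()), ('svc', SVC())])
pipe.fit(iris.data, iris.target)

# predict and score delegate to the final SVC after running the
# transformers on the input
print(pipe.predict(iris.data[:5]))
print(pipe.score(iris.data, iris.target))
```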
Alternatively, a pipeline can be built with the make_pipeline function. It is a shorthand for the Pipeline class: you pass only the step instances, without naming them, and each step's name is automatically set to the lowercased class name:
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler  # scaling that is robust to outliers
new_pipeline = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
Output:
Pipeline(memory=None,
steps=[('robustscaler', RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
with_scaling=True)), ('lasso', Lasso(alpha=0.0005, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=1,
selection='cyclic', tol=0.0001, warm_start=False))])
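The automatic naming can be verified via the pipeline's steps attribute; make_pipeline lowercases each class name:

```python
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipe = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
# each step name is the lowercased class name
print([name for name, _ in pipe.steps])
```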
Parameters of any step can be reset with set_params, using the format stepname__param=value:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pipe = Pipeline([('sc', StandardScaler()), ('pca', PCA()), ('svc', SVC())])
# e.g. 'sc' is a short name (alias) for the StandardScaler() step
pipe.set_params(sc__copy=False)
# the format for changing a parameter: step name + double underscore + parameter name = value
pipe.fit(iris.data, iris.target)
Output:
# note that copy in the sc step has indeed changed from True to False
Pipeline(memory=None,
steps=[('sc', StandardScaler(copy=False, with_mean=True, with_std=True)), ('pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
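The same stepname__param keys that set_params accepts are also exposed by get_params, which is a quick way to confirm a change (a small sketch with an illustrative two-step pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('sc', StandardScaler()), ('svc', SVC())])
pipe.set_params(sc__with_mean=False, svc__C=10)

# get_params flattens every nested parameter under stepname__param keys
params = pipe.get_params()
print(params['sc__with_mean'], params['svc__C'])
```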
An example of combining a pipeline with GridSearchCV over the pca and svc steps:
from sklearn.model_selection import GridSearchCV
# the format is: step name + double underscore + parameter name = list of candidate values
# (n_components must not exceed the 4 iris features, hence [1, 2, 3])
params = dict(pca__n_components=[1, 2, 3],
              svc__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params)
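Fitting the grid search then selects the best combination, which can be read off best_params_ (a runnable sketch; the n_components grid is capped at 3 because iris has only 4 features):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
pipe = Pipeline([('sc', StandardScaler()), ('pca', PCA()), ('svc', SVC())])
params = dict(pca__n_components=[1, 2, 3],
              svc__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=params, cv=3)
grid_search.fit(iris.data, iris.target)
# the best combination, keyed by the same stepname__param names
print(grid_search.best_params_)
```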
FeatureUnion combines several transformer objects into a new transformer that concatenates their outputs; a FeatureUnion takes a list of transformer objects.
During fitting, each transformer is fitted independently on the data. During transformation, all the fitted transformers can run in parallel, and the feature vectors they output are concatenated end-to-end into one larger feature vector.
You only need to call fit and transform once to train the whole set of transformers on the dataset.
Grid search can be applied to the parameters of all the transformers inside a FeatureUnion.
FeatureUnion and Pipeline can be combined to build more complex models.
Note: FeatureUnion cannot check whether two transformers produce identical features; it merely concatenates otherwise independent feature vectors. Making sure they produce different features is the caller's responsibility.
A FeatureUnion is constructed from a list of (key, value) pairs, where key is the name you give the transformation and value is a transformer object.
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
combined = FeatureUnion(estimators)
print(combined)
Output:
FeatureUnion(n_jobs=None,
transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='linear',
kernel_params=None, max_iter=None, n_components=None, n_jobs=None,
random_state=None, remove_zero_eig=False, tol=0))],
transformer_weights=None)
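Transforming with the union concatenates the transformers' outputs column-wise. With n_components fixed at 2 for both (an illustrative choice), the 4-feature iris data becomes 4 columns:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA, PCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)
combined = FeatureUnion([('linear_pca', PCA(n_components=2)),
                         ('kernel_pca', KernelPCA(n_components=2))])
# 2 PCA columns + 2 KernelPCA columns, stacked side by side
Xt = combined.fit_transform(X)
print(X.shape, Xt.shape)
```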
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
estimators = [('linear_pca', PCA()), ('kernel', KernelPCA())]
combined = FeatureUnion(estimators)
combined.set_params(kernel__alpha=0.6)  # note the double underscore, otherwise an error is raised
print(combined)
Output:
FeatureUnion(n_jobs=None,
transformer_list=[('linear_pca', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('kernel', KernelPCA(alpha=0.6, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
fit_inverse_transform=False, gamma=None, kernel='linear',
kernel_params=None, max_iter=None, n_components=None, n_jobs=None,
random_state=None, remove_zero_eig=False, tol=0))],
transformer_weights=None)
A Pipeline processes features serially: each transformer consumes the previous transformer's output.
A FeatureUnion processes features in parallel: all transformers' outputs are concatenated into one larger feature vector.
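The contrast shows up in the output shapes: chaining a scaler and PCA in a Pipeline yields only the PCA output, while a FeatureUnion of the same two transformers applies both to the raw input and concatenates the results (a sketch on iris):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

serial = Pipeline([('sc', StandardScaler()), ('pca', PCA(n_components=2))])
parallel = FeatureUnion([('sc', StandardScaler()), ('pca', PCA(n_components=2))])

print(serial.fit_transform(X).shape)    # PCA sees the scaled data
print(parallel.fit_transform(X).shape)  # 4 scaled columns + 2 PCA columns
```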
# Author: Andreas Mueller
#
# License: BSD 3 clause
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
iris = load_iris()
X, y = iris.data, iris.target
# This dataset is way too high-dimensional. Better do PCA:
pca = PCA(n_components=2)
# Maybe some original features were good, too?
selection = SelectKBest(k=1)
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)
svm = SVC(kernel="linear")
# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", combined_features), ("svm", svm)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
features__univ_select__k=[1, 2],
svm__C=[0.1, 1, 10])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)
print(grid_search.best_estimator_)
Output:
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.960784 - 0.0s
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.901961 - 0.0s
[Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=1, score=0.979167 - 0.0s
[Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.941176 - 0.0s
[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.921569 - 0.0s
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=1, score=0.979167 - 0.0s
[Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.1s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.960784 - 0.0s
[Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.921569 - 0.0s
[Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=1, score=0.979167 - 0.0s
[Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.1s remaining: 0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.960784 - 0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.921569 - 0.0s
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=0.1, features__univ_select__k=2, score=0.979167 - 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=0.960784 - 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=0.921569 - 0.0s
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=1, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=0.980392 - 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=0.901961 - 0.0s
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=1, svm__C=10, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.960784 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.901961 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=1, score=0.979167 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.980392 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.941176 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=1, score=0.979167 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.980392 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.941176 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=1, score=0.979167 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.980392 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.941176 - 0.0s
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=0.1, features__univ_select__k=2, score=0.979167 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=0.960784 - 0.0s
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=1, features__univ_select__k=2, score=0.979167 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=0.980392 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=0.921569 - 0.0s
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=2, svm__C=10, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.980392 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.941176 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=1, score=0.979167 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=0.941176 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=1, score=0.979167 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=0.921569 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=1, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.980392 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.941176 - 0.0s
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=0.1, features__univ_select__k=2, score=0.979167 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=0.960784 - 0.0s
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=1, features__univ_select__k=2, score=0.979167 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=1.000000 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=0.921569 - 0.0s
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2
[CV] features__pca__n_components=3, svm__C=10, features__univ_select__k=2, score=1.000000 - 0.0s
[Parallel(n_jobs=1)]: Done 54 out of 54 | elapsed: 0.4s finished
Pipeline(memory=None,
steps=[('features', FeatureUnion(n_jobs=1,
transformer_list=[('pca', PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)), ('univ_select', SelectKBest(k=2, score_func=<function f_classif at 0x7f4770b1bcf8>))],
transformer...,
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False))])