[Python Example, Lecture 5] Dimensionality Reduction with Pipeline and GridSearchCV


This example builds a dimensionality-reduction pipeline and uses it for prediction with a support vector classifier. It demonstrates how GridSearchCV and Pipeline are combined to optimize over different classes of estimators. Note that a Pipeline can be instantiated with the memory parameter, which caches the transformers fitted inside the pipeline and avoids fitting the same transformers repeatedly.

Pipeline and GridSearchCV

This section shows how to use a pipeline together with GridSearchCV. First, import the required modules.

# Authors: Robert McGibbon, Joel Nothman, Guillaume Lemaitre

from __future__ import print_function, division

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2

print(__doc__)

Build a pipeline consisting of a PCA dimensionality-reduction step and a linear support vector classifier (LinearSVC).

pipe = Pipeline([
    ('reduce_dim', PCA()),
    ('classify', LinearSVC())
])
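
The step names defined here ('reduce_dim', 'classify') become parameter prefixes: every pipeline parameter is addressed as step_name__parameter_name. As a minimal sketch (not part of the original example), the same pipeline can be inspected and reconfigured directly:

# List the tunable parameters exposed by the pipeline; nested parameters
# follow the step_name__parameter_name convention, e.g. 'classify__C'.
print(sorted(pipe.get_params().keys()))

# Set nested parameters through the pipeline itself.
pipe.set_params(reduce_dim__n_components=8, classify__C=10)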

Build the parameter grid as a list whose elements are dictionaries of parameters.

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]
param_grid = [
    {
        'reduce_dim': [PCA(iterated_power=7), NMF()],
        'reduce_dim__n_components': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
    {
        'reduce_dim': [SelectKBest(chi2)],
        'reduce_dim__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]
reducer_labels = ['PCA', 'NMF', 'KBest(chi2)']
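
As a quick sanity check (added here for illustration, not part of the original example), ParameterGrid can expand this grid: the first dictionary yields 2 x 3 x 4 = 24 combinations and the second 1 x 3 x 4 = 12, i.e. 36 candidate settings in total.

from sklearn.model_selection import ParameterGrid

# 24 combinations from the PCA/NMF dictionary plus 12 from the
# SelectKBest dictionary give 36 candidates overall.
print(len(ParameterGrid(param_grid)))  # expected: 36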

GridSearchCV exhaustively searches the specified parameter grid over the pipeline with 3-fold cross-validation (36 candidates x 3 folds = 108 fits), producing the final estimator grid.

grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)

Load the handwritten digits dataset, introduced in [Python Example, Lecture 3], and fit the estimator object grid on it.

digits = load_digits()
grid.fit(digits.data, digits.target)
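
After fitting, the best parameter combination found by the search and its mean cross-validated score can be read from the fitted object (a short inspection added for illustration; it is not in the original example):

# Best parameter setting on the grid and its mean cross-validated accuracy.
print(grid.best_params_)
print(grid.best_score_)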

Compute the mean test scores and plot them to compare the classification accuracy of the three dimensionality-reduction techniques at each number of retained features.

mean_scores = np.array(grid.cv_results_['mean_test_score'])
# scores are in the order of param_grid iteration, which is alphabetical
mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
# select score for best C
mean_scores = mean_scores.max(axis=0)
bar_offsets = (np.arange(len(N_FEATURES_OPTIONS)) *
               (len(reducer_labels) + 1) + .5)

plt.figure()
COLORS = 'bgrcmyk'
for i, (label, reducer_scores) in enumerate(zip(reducer_labels, mean_scores)):
    plt.bar(bar_offsets + i, reducer_scores, label=label, color=COLORS[i])

plt.title("Comparing feature reduction techniques")
plt.xlabel('Reduced number of features')
plt.xticks(bar_offsets + len(reducer_labels) / 2, N_FEATURES_OPTIONS)
plt.ylabel('Digit classification accuracy')
plt.ylim((0, 1))
plt.legend(loc='upper left')
plt.show()

[Figure 1: bar chart titled "Comparing feature reduction techniques", showing digit classification accuracy versus the reduced number of features for PCA, NMF, and KBest(chi2)]

Caching transformers within the pipeline

Sometimes a given transformer is likely to be fitted and used more than once, and it is worth caching its state within the pipeline. This is done through the Pipeline parameter memory. Note that fitting a transformer and caching it at the same time costs both time and memory. The following example caches the transformers of a pipeline.

from tempfile import mkdtemp
from shutil import rmtree
from joblib import Memory  # in recent scikit-learn versions joblib is imported directly

# Create a temporary folder to store the transformers of the pipeline
cachedir = mkdtemp()
memory = Memory(location=cachedir, verbose=10)
cached_pipe = Pipeline([('reduce_dim', PCA()),
                        ('classify', LinearSVC())],
                       memory=memory)

# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, cv=3, n_jobs=1, param_grid=param_grid)
digits = load_digits()
grid.fit(digits.data, digits.target)

# Delete the temporary cache before exiting
rmtree(cachedir)

Part of the output is shown below.

[Figure 2: partial console output from fitting the cached pipeline]
