《Python机器学习及实践:从零开始通往Kaggle竞赛之路》 (Python Machine Learning and Practice: From Zero to Kaggle Competitions), Chapter 3 Advanced Topics, Study Notes (7): 3.1.4.2 Parallel Search Summary

Contents

3.1.4.2 Parallel Search

1. Parallel Search

2. Programming Practice


3.1.4.2 Parallel Search

1. Parallel Search

Finding a good combination of hyperparameters with grid search plus cross-validation is very time-consuming; however, once a good combination has been found, it can usually remain in use for quite a while, so this is a recommended and relatively once-and-for-all way to improve model performance. Better still, because each candidate model's cross-validation runs are independent of one another, we can make full use of a multicore processor, or even distributed computing resources, to carry out a parallel grid search, cutting the running time by a large factor.
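This independence of the individual fits is exactly what makes parallelization straightforward. As a minimal sketch (not scikit-learn's actual implementation, which delegates to joblib), the same 4 × 3 grid used in this section can be evaluated concurrently with Python's standard thread pool; `evaluate` is a hypothetical stand-in for fitting and scoring one model:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    # Hypothetical stand-in for one cross-validation run;
    # each (C, gamma) candidate is an independent task.
    C, gamma = params
    return params, C * gamma  # dummy "score" for illustration only

Cs = [0.1, 1.0, 10.0]            # np.logspace(-1, 1, 3)
gammas = [0.01, 0.1, 1.0, 10.0]  # np.logspace(-2, 1, 4)
candidates = list(itertools.product(Cs, gammas))  # 12 combinations

# Workers evaluate the candidates concurrently, analogous to n_jobs=-1.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(evaluate, candidates))

best_params, best_score = max(results, key=lambda r: r[1])
```

Because no task depends on another's result, the order in which workers finish does not matter, which is also why the interleaved log lines in the output below are harmless.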

2. Programming Practice

We slightly modify the hyperparameter search from the previous section, replacing it with a parallel search, to see what efficiency gain results.

# Code 67: Using multiple workers to run a parallelized grid search over the hyperparameter combinations of the SVM text-classification model
# Import the 20-newsgroups text fetcher from sklearn.datasets.
from sklearn.datasets import fetch_20newsgroups
# Import numpy under the alias np.
import numpy as np

# Download the full dataset with the news fetcher and store it in the variable news.
news = fetch_20newsgroups(subset='all')

# Import train_test_split from sklearn.model_selection to split the data.
from sklearn.model_selection import train_test_split

# Split the first 3000 news articles, holding out 25% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(news.data[:3000], news.target[:3000], test_size=0.25, random_state=33)

# Import the support vector machine (classification) model.
from sklearn.svm import SVC

# Import the TfidfVectorizer text feature extractor.
from sklearn.feature_extraction.text import TfidfVectorizer
# Import Pipeline.
from sklearn.pipeline import Pipeline

# Use a Pipeline to simplify the setup, chaining the text feature extractor and the classifier together.
clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')), ('svc', SVC())])

# The two hyperparameters under search take 4 and 3 candidate values respectively; svc__gamma ranges over 10^-2, 10^-1, 10^0, 10^1. Altogether there are 4 x 3 = 12 hyperparameter combinations, i.e. 12 models with different parameters.
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}

# Import the grid search module GridSearchCV from sklearn.model_selection.
from sklearn.model_selection import GridSearchCV

# Initialize and configure the parallel grid search; n_jobs=-1 means using all of the machine's CPU cores.
gs = GridSearchCV(clf, parameters, verbose=2, refit=True, cv=3, n_jobs=-1)

# Run the grid search in parallel across multiple workers.
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)

# Print the accuracy of the best model on the test set.
print(gs.score(X_test, y_test))
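For reference, `np.logspace(start, stop, num)` returns `num` points evenly spaced on a logarithmic scale between 10^start and 10^stop, so the parameter grid above expands to the following candidate values:

```python
import numpy as np

# The candidate values behind the parameters dictionary above.
gammas = np.logspace(-2, 1, 4)  # 0.01, 0.1, 1.0, 10.0
Cs = np.logspace(-1, 1, 3)      # 0.1, 1.0, 10.0

# GridSearchCV forms the Cartesian product: 4 * 3 = 12 candidates,
# and with cv=3 each candidate is fit 3 times -> 36 fits in total.
n_fits = len(gammas) * len(Cs) * 3
```

This accounts for the "36 fits" reported at the top of the log below.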

Note 1: the original import, from sklearn.cross_validation import train_test_split, raised an error:

from sklearn.cross_validation import train_test_split
ModuleNotFoundError: No module named 'sklearn.cross_validation'

cross_validation must be replaced with model_selection:

from sklearn.model_selection import train_test_split

Note 2: the original import, from sklearn.grid_search import GridSearchCV, raised an error:

from sklearn.grid_search import GridSearchCV
ModuleNotFoundError: No module named 'sklearn.grid_search'

grid_search must be replaced with model_selection:

from sklearn.model_selection import GridSearchCV

Local output:

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=  10.5s
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=  11.1s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=  10.9s
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=0.01, total=  11.2s
[CV] svc__C=0.1, svc__gamma=10.0 .....................................
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=  11.1s
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=  11.0s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=  10.9s
[CV] ....................... svc__C=0.1, svc__gamma=0.1, total=  11.2s
[CV] svc__C=1.0, svc__gamma=0.01 .....................................
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=  15.6s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=  15.7s
[CV] svc__C=1.0, svc__gamma=0.1 ......................................
[CV] ....................... svc__C=0.1, svc__gamma=1.0, total=  16.2s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=  15.7s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=  16.2s
[CV] svc__C=1.0, svc__gamma=1.0 ......................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=  16.1s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=0.1, svc__gamma=10.0, total=  16.4s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=0.01, total=  16.3s
[CV] svc__C=1.0, svc__gamma=10.0 .....................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=  15.6s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ....................... svc__C=1.0, svc__gamma=0.1, total=  15.4s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=  15.5s
[CV] svc__C=10.0, svc__gamma=0.01 ....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=  15.3s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=  15.7s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ....................... svc__C=1.0, svc__gamma=1.0, total=  15.6s
[CV] svc__C=10.0, svc__gamma=0.1 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=  15.5s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ...................... svc__C=1.0, svc__gamma=10.0, total=  15.5s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=  12.3s
[CV] svc__C=10.0, svc__gamma=1.0 .....................................
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=  12.4s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=  12.5s
[CV] ..................... svc__C=10.0, svc__gamma=0.01, total=  12.6s
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] svc__C=10.0, svc__gamma=10.0 ....................................
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=  12.7s
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=  13.0s
[CV] ...................... svc__C=10.0, svc__gamma=0.1, total=  12.9s
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=  13.0s
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   8.7s
[CV] ...................... svc__C=10.0, svc__gamma=1.0, total=   8.9s
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   8.4s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  1.1min finished
[CV] ..................... svc__C=10.0, svc__gamma=10.0, total=   8.5s
0.8226666666666667

Conclusion: with the same grid search, but executed as a parallel search to tune the hyperparameter combinations of the SVM model on the text-classification task, the same 36 fitting tasks took only about 1.1 minutes in total, and the best hyperparameter combination found still achieves the same top classification accuracy of 82.27% on the test set. Without affecting validation accuracy, the parallel search made effective use of the machine's CPU cores (the log above shows 8 concurrent workers), yielding a several-fold speedup and saving a great deal of time in the search for the best hyperparameter combination.
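The speedup can be checked on any machine by timing the same search with n_jobs=1 and n_jobs=-1. The sketch below uses a small synthetic dataset from make_classification as a stand-in for the TF-IDF features, so it finishes in seconds; note that on such a tiny problem the worker start-up overhead can mask the gain, which only becomes pronounced on workloads like the one above:

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the text-classification features.
X, y = make_classification(n_samples=300, n_features=20, random_state=33)
parameters = {'gamma': np.logspace(-2, 1, 4), 'C': np.logspace(-1, 1, 3)}

timings = {}
for n_jobs in (1, -1):
    start = time.time()
    gs = GridSearchCV(SVC(), parameters, cv=3, n_jobs=n_jobs)
    gs.fit(X, y)
    timings[n_jobs] = time.time() - start

# Both runs explore the same 12 candidates; only the wall time differs.
print(timings, gs.best_params_)
```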
