Automated Machine Learning: Comparing Multi-Core Speedups in TPOT (with Code)

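TPOT is an automated machine learning library that uses a genetic algorithm to generate machine learning pipelines automatically. You supply your training data and configure the genetic algorithm's parameters, and TPOT searches for and trains a pipeline for you. The code in this post calls TPOT in three ways (serial, parallel, and with dask), shared here for reference and discussion.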

import time

from dask.distributed import Client, LocalCluster
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

#load data
iris = load_iris()
#split training set and test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, 
    iris.target, train_size=0.75, test_size=0.25, random_state=42)

#=========================>> Run TPOT serially (n_jobs=1)
# Configure a small TPOT search
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=0, random_state=42, n_jobs=1, use_dask=False)
time_start = time.time()
#use tpot to train
tpot.fit(X_train, y_train)
time_end = time.time()
print("time to use tpot on 1 core without dask is ", time_end - time_start, " s.\n")

print("Fitting score : ", tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')

#=========================>> Run TPOT in parallel; n_jobs=-1 uses all cores on the machine
# Configure the same TPOT search
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=0, random_state=42, n_jobs=-1, use_dask=False)
time_start = time.time()
#use tpot to train
tpot.fit(X_train, y_train)
time_end = time.time()
print("time to use tpot on cores without dask is ", time_end - time_start, " s.\n")

print("Fitting score : ", tpot.score(X_test, y_test))
# Export under a separate name so the serial run's pipeline file is not overwritten
tpot.export('tpot_parallel_iris_pipeline.py')

#==========================>> Run TPOT with dask
# Start a local dask cluster; processes=False runs the 4 workers as threads inside this process
client = Client(LocalCluster(processes=False, threads_per_worker=1, n_workers=4))

#set tpot method with dask
tpot_d = TPOTClassifier(generations=5, population_size=50, verbosity=0, random_state=42, n_jobs=-1, use_dask=True)
time_start = time.time()
#use tpot to train
tpot_d.fit(X_train, y_train)
time_end = time.time()
print("time to use tpot on cores with dask is ", time_end - time_start, " s.\n")

print("Fitting score : ", tpot_d.score(X_test, y_test))
tpot_d.export('tpot_dask_iris_pipeline.py')
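
The three runs above repeat the same start/stop timing pattern. As a minimal sketch, that pattern could be factored into a small context manager. The names `timed` and `results` are hypothetical helpers introduced here for illustration, not part of TPOT or dask, and `time.sleep` stands in for the call to `tpot.fit` so the snippet runs without TPOT installed.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs still show up.
        results[label] = time.perf_counter() - start


results = {}  # hypothetical container: label -> elapsed seconds

with timed("placeholder run", results):
    time.sleep(0.05)  # stand-in for tpot.fit(X_train, y_train)

for label, seconds in results.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping each `fit` call in `with timed(...)` would collect all three timings in one dictionary, which makes it easier to repeat runs and compare them.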

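Below is the output from running this on my own machine, an Intel Core i7-1165G7 (4 cores, 8 threads). Training with multiple cores (n_jobs=-1) was indeed faster than with a single core, but the run that used dask was slower than both. That does not show that dask hurts performance in general. The dask run here used only 4 single-threaded workers inside one process, so it is not a like-for-like comparison with n_jobs=-1, and dask may be aimed mainly at cluster computing rather than a single laptop. These are also single-run timings on a very small dataset.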

time to use tpot on 1 core without dask is  97.58220362663269  s.
Fitting score :  0.9736842105263158

time to use tpot on cores without dask is  62.37632417678833  s.
Fitting score :  0.9736842105263158

time to use tpot on cores with dask is  151.76679229736328  s.
Fitting score :  1.0
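
To put the timings above in relative terms, the following short sketch computes each run's speedup against the serial baseline. The numbers are copied from the output above; the dictionary labels are my own. It also checks one reading of the scores: assuming the 150-sample iris set with test_size=0.25 yields a 38-sample test split, a score of 0.9736842105263158 corresponds to 37 of 38 correct, so the gap to 1.0 would be a single test sample.

```python
# Wall-clock times in seconds, copied from the output above.
timings = {
    "serial (n_jobs=1)": 97.58220362663269,
    "parallel (n_jobs=-1)": 62.37632417678833,
    "dask (4 thread workers)": 151.76679229736328,
}

baseline = timings["serial (n_jobs=1)"]

# Speedup > 1 means faster than the serial run, < 1 means slower.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x relative to serial")

# Assumption: the test split holds 38 samples, so one miss gives 37/38.
one_miss_accuracy = 37 / 38
print(f"37/38 = {one_miss_accuracy}")
```

By this calculation the parallel run is roughly 1.56x the serial speed and the dask run roughly 0.64x, on this one machine and this one run.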
