Most modern CPUs have multiple cores. In Python, the multiprocessing package makes it fairly easy to split a computation across those cores so the pieces run in parallel and finish faster.
The syntax you will use most often is:
Get the number of CPU cores:
n_cpu = multiprocessing.cpu_count()
Run a function in parallel:
proc = multiprocessing.Process(target=single_run, args=(digits, "parallel"))
proc.start()
proc.join()
Here, target is the function to be run in parallel and args holds that function's arguments; note that args must be passed as a tuple. A small self-contained sketch of how these pieces fit together is given below.
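The following is a minimal sketch, assuming a made-up worker function work(name, repeat) purely for illustration; it starts one process per core and waits for all of them to finish.

import multiprocessing

def work(name, repeat):
    # hypothetical worker, only for illustration
    for _ in range(repeat):
        print("hello from " + name)

if __name__ == '__main__':
    n_cpu = multiprocessing.cpu_count()
    procs = [multiprocessing.Process(target=work, args=("worker-" + str(i), 2))
             for i in range(n_cpu)]
    for proc in procs:
        proc.start()   # launch every worker process
    for proc in procs:
        proc.join()    # wait for them all to finish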
The following simple example demonstrates the effect of CPU parallelism: run t-SNE dimensionality reduction separately on each of the 10 classes of the MNIST digits data, and compare the running time of the parallel version with the serial one.
import numpy as np
import multiprocessing
from sklearn.manifold import TSNE
import time

path = "E:\\blog\\data\\MNIST50m\\"

def run_tsne(data):
    t_sne = TSNE(n_components=2, perplexity=30.0)
    Y = t_sne.fit_transform(data)
    return Y

def single_run(digits, fold="1by1"):
    # run t-SNE on each digit class and save the 2-D embedding
    for digit in digits:
        print(str(digit) + " starting...")
        X = np.loadtxt(path+str(digit)+".csv", dtype=float, delimiter=",")
        t_sne = TSNE(n_components=2, perplexity=30.0)
        Y = t_sne.fit_transform(X)
        np.savetxt(path+fold+"\\Y"+str(digit)+".csv", Y, fmt='%f', delimiter=",")
        print(str(digit) + " finished.")

def one_by_one():
    begin_time = time.time()
    digits = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    # digits = [1, 2, 3, 4, 5, 6]
    single_run(digits, "1by1")
    end_time = time.time()
    print("one by one time: ", end_time-begin_time)

def parallel():
    begin_time = time.time()
    n = 10  # number of digit classes; set to 6 for the 6-digit comparison below
    procs = []
    n_cpu = multiprocessing.cpu_count()
    chunk_size = n // n_cpu  # digits per process; the last process takes the remainder
    for i in range(0, n_cpu):
        min_i = chunk_size * i
        if i < n_cpu-1:
            max_i = chunk_size * (i+1)
        else:
            max_i = n
        digits = list(range(min_i, max_i))
        procs.append(multiprocessing.Process(target=single_run, args=(digits, "parallel")))
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    end_time = time.time()
    print("parallel time: ", end_time-begin_time)

if __name__ == '__main__':
    # the __main__ guard is required on Windows, where child processes are spawned
    # one_by_one()
    parallel()
The serial output is shown below; it took a little over 500 seconds.
1 starting...
1 finished.
2 starting...
2 finished.
3 starting...
3 finished.
4 starting...
4 finished.
5 starting...
5 finished.
6 starting...
6 finished.
7 starting...
7 finished.
8 starting...
8 finished.
9 starting...
9 finished.
one by one time: 538.7096929550171
The parallel output on my six-core i5-9400F is shown below. It took a little over 300 seconds, somewhat faster than the serial run, but far from the ideal speedup.
4 starting...
3 starting...
0 starting...
5 starting...
2 starting...
1 starting...
0 finished.
2 finished.
4 finished.
3 finished.
5 finished.
6 starting...
1 finished.
6 finished.
7 starting...
7 finished.
8 starting...
8 finished.
9 starting...
9 finished.
parallel time: 339.75568318367004
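The limited speedup on 10 digits is largely a load-balancing issue: with n = 10 and six cores, chunk_size is 1, so the first five processes each get a single digit while the last process gets digits 5 through 9 and works through them serially, which is why 6, 7, 8 and 9 only start after the earlier digits finish in the output above. A hedged alternative, not used in the run above, is multiprocessing.Pool, which hands the digits out one at a time; the sketch below reuses single_run from the listing above and assumes the output folder already exists.

from multiprocessing import Pool, cpu_count

def pool_run(digits, fold="parallel"):
    # give each task a single digit so no process is left with a long tail of work
    with Pool(processes=cpu_count()) as pool:
        pool.starmap(single_run, [([digit], fold) for digit in digits])

if __name__ == '__main__':
    pool_run(list(range(10)))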
To show the gap between serial and parallel CPU execution more clearly, I also ran t-SNE on 6 digit classes in each mode; the parallel run was roughly 4 times faster than the serial one.
Serial output for 6 digits:
1 starting...
1 finished.
2 starting...
2 finished.
3 starting...
3 finished.
4 starting...
4 finished.
5 starting...
5 finished.
6 starting...
6 finished.
one by one time: 357.5319800376892
Parallel output for 6 digits:
3 starting...
4 starting...
1 starting...
5 starting...
2 starting...
0 starting...
5 finished.
0 finished.
4 finished.
2 finished.
1 finished.
3 finished.
parallel time: 85.06037616729736
In short, multi-core CPU parallelism can speed up some workloads, but the gain is limited: on a six-core i5, the speedup cannot exceed 6x. To get a much larger speedup, you need GPU-based parallelism.
For GPU programming in Python, see the book "Python Parallel Programming Cookbook". It is an open-source book, and an electronic copy should be fairly easy to find online. If you really cannot find one (or cannot be bothered to look), you can also contact me for a copy.