Python Parallel Computing

I. Lab Description

This lab introduces the tools that can be used for parallel computing in Python.

1. Logging into the Environment

The environment logs in automatically without a password; the system username is shiyanlou.

2. About the Environment

This course's lab environment uses Spyder. First open a terminal, then enter the following commands:

git clone https://github.com/jrjohansson/scientific-python-lectures
spyder -w scientific-python-lectures

For how to use Spyder, see the documentation: https://pythonhosted.org/spyder/

This lab takes place mostly in the console, so you can close the other panes and keep only the console.

II. Lab Content

multiprocessing

Python's built-in process library for parallel computing:

import multiprocessing
import os
import time
import numpy

def task(args):
    print("PID =", os.getpid(), ", args =", args)        
    return os.getpid(), args

task("test")

=> PID = 28995 , args = test


pool = multiprocessing.Pool(processes=4)

result = pool.map(task, [1,2,3,4,5,6,7,8])

=> PID = 29006 , args = 1
   PID = 29009 , args = 4
   PID = 29007 , args = 2
   PID = 29008 , args = 3
   PID = 29006 , args = 6
   PID = 29009 , args = 5
   PID = 29007 , args = 8
   PID = 29008 , args = 7

result

=> [(29006, 1),
    (29007, 2),
    (29008, 3),
    (29009, 4),
    (29009, 5),
    (29006, 6),
    (29008, 7),
    (29007, 8)]

multiprocessing is well suited to highly parallel tasks whose processes do not need to communicate with one another, other than when the initial data is sent to the process pool and when the results are collected.
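
As a minimal sketch of that pattern (the worker function and the numbers here are made up for illustration), a CPU-bound task is fanned out over a pool; the if __name__ == "__main__" guard matters on platforms that spawn rather than fork worker processes:

import multiprocessing

def heavy(x):
    # CPU-bound stand-in task: sum of squares below x
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    # data goes in once via map and results come back once -- no other IPC
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(heavy, [10**5] * 8)
    print(len(results), "results collected")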

IPython parallel

IPython ships with a versatile parallel computing environment, built on IPython engines and controllers. Before using it we need to start an IPython cluster of engines; the simplest way is the ipcluster command:

$ ipcluster start -n 4

To use the IPython cluster in our Python programs, we first create an IPython.parallel.Client instance:

import numpy
import matplotlib.pyplot as plt
from IPython.parallel import Client

cli = Client()

Using the ids attribute, we can retrieve the list of ids of the IPython engines in the cluster:

cli.ids

=> [0, 1, 2, 3]

Each of these engines is ready to run tasks, and we can select engines individually to run our code on:

def getpid():
    """ return the unique ID of the current process """
    import os
    return os.getpid()

# first try it on the notebook process
getpid()

=> 28995

# run it on one of the engines
cli[0].apply_sync(getpid)

=> 30181

# run it on ALL of the engines at the same time
cli[:].apply_sync(getpid)

=> [30181, 30182, 30183, 30185]
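
A direct view can also move data between the client and the engines. Below is a sketch using the scatter/execute/pull methods of IPython.parallel's DirectView; the variable names chunk and partial_sum are made up for illustration:

dview = cli[:]

# split a list across the engines, one slice per engine
dview.scatter('chunk', list(range(16)), block=True)

# run a statement on every engine against its local slice
dview.execute('partial_sum = sum(chunk)', block=True)

# fetch the per-engine values back to the client
partial_sums = dview.pull('partial_sum', block=True)
print(partial_sums, "->", sum(partial_sums))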

The simplest way to dispatch a function to different engines is to use the decorator:

@view.parallel(block=True)

Here view should be a pool of engines. Once a function is bound with this decorator, we can dispatch it using the map method.

Example:

dview = cli[:]

@dview.parallel(block=True)
def dummy_task(delay):
    """ a dummy task that takes 'delay' seconds to finish """
    import os, time

    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()

    return [pid, t0, t1]


# generate random delay times for dummy tasks
delay_times = numpy.random.rand(4)

Now we call the map method:

dummy_task.map(delay_times)

=> [[30181, 1395044753.2096598, 1395044753.9150908],
    [30182, 1395044753.2084103, 1395044753.4959202],
    [30183, 1395044753.2113762, 1395044753.6453338],
    [30185, 1395044753.2130392, 1395044754.1905618]]

Now let's add more tasks and visualize how they are executed:

%matplotlib qt

def visualize_tasks(results):
    res = numpy.array(results)
    fig, ax = plt.subplots(figsize=(10, res.shape[1]))

    yticks = []
    yticklabels = []
    tmin = min(res[:,1])
    for n, pid in enumerate(numpy.unique(res[:,0])):
        yticks.append(n)
        yticklabels.append("%d" % pid)
        for m in numpy.where(res[:,0] == pid)[0]:
            ax.add_patch(plt.Rectangle((res[m,1] - tmin, n-0.25),
                         res[m,2] - res[m,1], 0.5, color="green", alpha=0.5))

    ax.set_ylim(-.5, n+.5)
    ax.set_xlim(0, max(res[:,2]) - tmin + 0.)
    ax.set_yticks(yticks)
    ax.set_yticklabels(yticklabels)
    ax.set_ylabel("PID")
    ax.set_xlabel("seconds")


delay_times = numpy.random.rand(64)


result = dummy_task.map(delay_times)
visualize_tasks(result)

Nice, but look closely: towards the end one engine sits idle while another still has several tasks left to process. IPython parallel provides a number of alternative "views"; so far we have used the "direct view", so now let's try the "load balanced view":

lbview = cli.load_balanced_view()


@lbview.parallel(block=True)
def dummy_task_load_balanced(delay):
    """ a dummy task that takes 'delay' seconds to finish """
    import os, time

    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()

    return [pid, t0, t1]


result = dummy_task_load_balanced.map(delay_times)
visualize_tasks(result)
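
The map calls above block until every task has finished. A load-balanced view can also submit work asynchronously; the following is a sketch assuming IPython.parallel's AsyncMapResult API, with plain_dummy_task as an undecorated helper defined just for this example:

def plain_dummy_task(delay):
    """ undecorated variant of the dummy task above """
    import os, time
    t0 = time.time()
    time.sleep(delay)
    return [os.getpid(), t0, time.time()]

# submit without blocking: an AsyncMapResult comes back immediately
async_result = lbview.map_async(plain_dummy_task, delay_times)

# ...do other work here, then wait for and collect the results
async_result.wait()
result = async_result.get()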

Further reading

  • http://ipython.org/ipython-doc/dev/parallel/

MPI

When the processes need to communicate more with each other, more sophisticated solutions such as MPI and OpenMP are required. MPI is a process-based parallel processing library/protocol, and we can use it in Python through the mpi4py package: http://mpi4py.scipy.org/

Let's start by importing it:

from mpi4py import MPI

An MPI Python program must be launched with the mpirun -n N command, where N is the number of processes.

Note that the IPython parallel environment also supports MPI, but to keep these introductory examples simple we use mpi4py and mpirun directly.

Example 1

%%file mpitest.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
   data = [1.0, 2.0, 3.0, 4.0]
   comm.send(data, dest=1, tag=11)
elif rank == 1:
   data = comm.recv(source=0, tag=11)
print "rank =", rank, ", data =", data

!mpirun -n 2 python mpitest.py

rank = 0 , data = [1.0, 2.0, 3.0, 4.0]
rank = 1 , data = [1.0, 2.0, 3.0, 4.0]
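
send/recv is point-to-point communication; MPI also provides collective operations in which every rank takes part. A minimal sketch that broadcasts a Python object from rank 0 (the file name mpi-bcast.py is made up; run it with mpirun as above):

%%file mpi-bcast.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {"step": 1, "coeffs": [1.0, 2.0, 3.0]}
else:
    data = None
# every rank, including rank 0, ends up with rank 0's object
data = comm.bcast(data, root=0)
print("rank =", rank, ", data =", data)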

Example 2

Send a numpy array from one process to another:

%%file mpi-numpy-array.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
   data = numpy.random.rand(10)
   comm.Send(data, dest=1, tag=13)
elif rank == 1:
   data = numpy.empty(10, dtype=numpy.float64)
   comm.Recv(data, source=0, tag=13)
print "rank =", rank, ", data =", data


!mpirun -n 2 python mpi-numpy-array.py

rank = 0 , data = [ 0.71397658  0.37182268  0.25863587  0.08007216  0.50832534  0.80038331
  0.90613024  0.99535428  0.11717776  0.48353805]
rank = 1 , data = [ 0.71397658  0.37182268  0.25863587  0.08007216  0.50832534  0.80038331
  0.90613024  0.99535428  0.11717776  0.48353805]

Example 3: matrix-vector multiplication

# prepare some random data
N = 16
A = numpy.random.rand(N, N)
numpy.save("random-matrix.npy", A)
x = numpy.random.rand(N)
numpy.save("random-vector.npy", x)


%%file mpi-matrix-vector.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
p = comm.Get_size()
def matvec(comm, A, x):
    m = A.shape[0] // p  # integer division: rows per process
    y_part = numpy.dot(A[rank * m:(rank+1)*m], x)
    y = numpy.zeros_like(x)
    comm.Allgather([y_part,  MPI.DOUBLE], [y, MPI.DOUBLE])
    return y
A = numpy.load("random-matrix.npy")
x = numpy.load("random-vector.npy")
y_mpi = matvec(comm, A, x)
if rank == 0:
    y = numpy.dot(A, x)
    print(y_mpi)
    print "sum(y - y_mpi) =", (y - y_mpi).sum()



!mpirun -n 4 python mpi-matrix-vector.py

[ 6.40342716  3.62421625  3.42334637  3.99854639  4.95852419  6.13378754
  5.33319708  5.42803442  5.12403754  4.87891654  2.38660728  6.72030412
  4.05218475  3.37415974  3.90903001  5.82330226]
sum(y - y_mpi) = 0.0
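
Note that in the example above every rank loads the full matrix from disk and slices out its own rows. A variant sketch that instead distributes the row blocks from rank 0 with the buffer-based Scatter, assuming N is divisible by the number of processes:

from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
p = comm.Get_size()
N = 16
m = N // p  # rows per rank
if rank == 0:
    A = numpy.load("random-matrix.npy")
else:
    A = None  # non-root ranks send nothing
x = numpy.load("random-vector.npy")  # the vector is small, every rank loads it
# each rank receives only its own m-by-N block of rows
A_part = numpy.empty((m, N), dtype=numpy.float64)
comm.Scatter(A, A_part, root=0)
y_part = numpy.dot(A_part, x)
y = numpy.zeros(N, dtype=numpy.float64)
comm.Allgather([y_part, MPI.DOUBLE], [y, MPI.DOUBLE])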

Example 4: summing the elements of a vector

# prepare some random data
N = 128
a = numpy.random.rand(N)
numpy.save("random-vector.npy", a)


%%file mpi-psum.py
from mpi4py import MPI
import numpy as np
def psum(a):
    r = MPI.COMM_WORLD.Get_rank()
    size = MPI.COMM_WORLD.Get_size()
    m = len(a) // size  # integer division: elements per process
    locsum = np.sum(a[r*m:(r+1)*m])
    rcvBuf = np.array(0.0, 'd')
    MPI.COMM_WORLD.Allreduce([locsum, MPI.DOUBLE], [rcvBuf, MPI.DOUBLE], op=MPI.SUM)
    return rcvBuf
a = np.load("random-vector.npy")
s = psum(a)
if MPI.COMM_WORLD.Get_rank() == 0:
    print "sum =", s, ", numpy sum =", a.sum()


!mpirun -n 4 python mpi-psum.py

sum = 64.948311241 , numpy sum = 64.948311241
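
mpi4py also has lowercase variants of the collectives that pickle arbitrary Python objects, so no receive buffer needs to be allocated by hand. The same partial sum with the lowercase allreduce, as a sketch:

from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
a = np.load("random-vector.npy")
m = len(a) // size
local_sum = float(np.sum(a[rank * m:(rank + 1) * m]))
# the lowercase collective accepts plain Python objects
total = comm.allreduce(local_sum, op=MPI.SUM)
if rank == 0:
    print("sum =", total)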

Further reading

  • http://mpi4py.scipy.org

  • http://mpi4py.scipy.org/docs/usrman/tutorial.html

  • https://computing.llnl.gov/tutorials/mpi/

OpenMP

OpenMP is a standard and widely used thread-based parallel API. Unfortunately it cannot be used directly from Python: the CPython implementation uses the GIL (global interpreter lock), which makes it impossible to run several Python threads simultaneously. Threads are therefore useless for parallel computing in Python, unless Python merely wraps compiled code that does the actual work. For OpenMP we thus turn to Cython, whose syntax is Python-like.
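
A quick way to see the GIL at work is to time a pure-Python CPU-bound function sequentially and then in threads; a small sketch (absolute timings will of course vary by machine):

import threading
import time

def busy(n=10**7):
    # pure-Python loop: holds the GIL the whole time it runs
    s = 0
    for i in range(n):
        s += i
    return s

# two calls, one after the other
t0 = time.time()
busy(); busy()
print("sequential: %.2f s" % (time.time() - t0))

# two threads: the GIL lets only one execute Python bytecode at a time,
# so this takes about as long as the sequential version
t0 = time.time()
threads = [threading.Thread(target=busy) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print("threaded:   %.2f s" % (time.time() - t0))

To see how many cores the machine makes available for OpenMP threads: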

import multiprocessing
N_core = multiprocessing.cpu_count()
print("This system has %d cores" % N_core)

This system has 4 cores

Below is a simple example that demonstrates using OpenMP from Cython:

%load_ext cythonmagic


%%cython -f -c-fopenmp --link-args=-fopenmp -c-g

cimport cython
cimport numpy
from cython.parallel import prange, parallel
cimport openmp

def cy_openmp_test():

    cdef int n, N

    # release GIL so that we can use OpenMP
    with nogil, parallel():
        N = openmp.omp_get_num_threads()
        n = openmp.omp_get_thread_num()
        with gil:
            print("Number of threads %d: thread number %d" % (N, n))


cy_openmp_test()

Number of threads 12: thread number 0
Number of threads 12: thread number 10
Number of threads 12: thread number 8
Number of threads 12: thread number 4
Number of threads 12: thread number 7
Number of threads 12: thread number 3
Number of threads 12: thread number 2
Number of threads 12: thread number 1
Number of threads 12: thread number 11
Number of threads 12: thread number 9
Number of threads 12: thread number 5
Number of threads 12: thread number 6

Example: matrix-vector multiplication

# prepare some random data
N = 4 * N_core

M = numpy.random.rand(N, N)
x = numpy.random.rand(N)
y = numpy.zeros_like(x)

Let's first take a look at a matrix-vector multiplication implemented in Cython:

%%cython

cimport cython
cimport numpy
import numpy

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_matvec(numpy.ndarray[numpy.float64_t, ndim=2] M, 
              numpy.ndarray[numpy.float64_t, ndim=1] x, 
              numpy.ndarray[numpy.float64_t, ndim=1] y):

    cdef int i, j, n = len(x)

    for i in range(n):
        for j in range(n):
            y[i] += M[i, j] * x[j]

    return y


# check that we get the same results
y = numpy.zeros_like(x)
cy_matvec(M, x, y)
numpy.dot(M, x) - y




array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])


%timeit numpy.dot(M, x)

100000 loops, best of 3: 2.93 µs per loop



%timeit cy_matvec(M, x, y)

100000 loops, best of 3: 5.4 µs per loop

The Cython implementation is slower than numpy.dot, but not by much, so if we bring in OpenMP its performance may well catch up with and overtake numpy.dot:

%%cython -f -c-fopenmp --link-args=-fopenmp -c-g

cimport cython
cimport numpy
from cython.parallel import parallel
cimport openmp

@cython.boundscheck(False)
@cython.wraparound(False)
def cy_matvec_omp(numpy.ndarray[numpy.float64_t, ndim=2] M, 
                  numpy.ndarray[numpy.float64_t, ndim=1] x, 
                  numpy.ndarray[numpy.float64_t, ndim=1] y):

    cdef int i, j, n = len(x), N, r, m

    # release GIL, so that we can use OpenMP
    with nogil, parallel():
        N = openmp.omp_get_num_threads()
        r = openmp.omp_get_thread_num()
        m = n // N

        for i in range(m):
            for j in range(n):
                y[r * m + i] += M[r * m + i, j] * x[j]

    return y


# check that we get the same results
y = numpy.zeros_like(x)
cy_matvec_omp(M, x, y)
numpy.dot(M, x) - y




array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])




%timeit numpy.dot(M, x)

100000 loops, best of 3: 2.95 µs per loop



%timeit cy_matvec_omp(M, x, y)

1000 loops, best of 3: 209 µs per loop

At this problem size, Cython+OpenMP is still slower, because the overhead of starting up and coordinating the OpenMP threads dominates:

N_vec  = numpy.arange(25, 2000, 25) * N_core


duration_ref = numpy.zeros(len(N_vec))
duration_cy = numpy.zeros(len(N_vec))
duration_cy_omp = numpy.zeros(len(N_vec))

for idx, N in enumerate(N_vec):

    M = numpy.random.rand(N, N)
    x = numpy.random.rand(N)
    y = numpy.zeros_like(x)

    t0 = time.time()
    numpy.dot(M, x)
    duration_ref[idx] = time.time() - t0

    t0 = time.time()
    cy_matvec(M, x, y)
    duration_cy[idx] = time.time() - t0

    t0 = time.time()
    cy_matvec_omp(M, x, y)
    duration_cy_omp[idx] = time.time() - t0


fig, ax = plt.subplots(figsize=(12, 6))

ax.loglog(N_vec, duration_ref, label='numpy')
ax.loglog(N_vec, duration_cy, label='cython')
ax.loglog(N_vec, duration_cy_omp, label='cython+openmp')

ax.legend(loc=2)
ax.set_yscale("log")
ax.set_ylabel("matrix-vector multiplication duration")
ax.set_xlabel("matrix size");

For larger problem sizes the Cython+OpenMP combination outperforms numpy.dot.

The speed-up:

((duration_ref / duration_cy_omp)[-10:]).mean()




3.0072232987815148

Clearly we can do better; the theoretical upper limit on the speed-up is the number of cores:

N_core

12

Further reading

  • http://openmp.org
  • http://docs.cython.org/src/userguide/parallelism.html

OpenCL

OpenCL is an API for heterogeneous computing, for example using a GPU for numerical computation. The pyopencl package lets OpenCL code be compiled, loaded and run on the compute units from within Python. This is a convenient way to use OpenCL, because the time-consuming computation is best done in compiled code on the compute units anyway, and Python works well as the controlling language.

%%file opencl-dense-mv.py

import pyopencl as cl
import numpy
import time

# problem size
n = 10000

# platform
platform_list = cl.get_platforms()
platform = platform_list[0]

# device
device_list = platform.get_devices()
device = device_list[0]

if False:
    print("Platform name:" + platform.name)
    print("Platform version:" + platform.version)
    print("Device name:" + device.name)
    print("Device type:" + cl.device_type.to_string(device.type))
    print("Device memory: " + str(device.global_mem_size//1024//1024) + ' MB')
    print("Device max clock speed:" + str(device.max_clock_frequency) + ' MHz')
    print("Device compute units:" + str(device.max_compute_units))

# context
ctx = cl.Context([device]) # or we can use cl.create_some_context()

# command queue
queue = cl.CommandQueue(ctx)

# kernel
KERNEL_CODE = """
//
// Matrix-vector multiplication: r = m * v
//
#define N %(mat_size)d
__kernel
void dmv_cl(__global float *m, __global float *v, __global float *r)
{
    int i, gid = get_global_id(0);

    r[gid] = 0;
    for (i = 0; i < N; i++)
    {
        r[gid] += m[gid * N + i] * v[i];
    }
}
"""

kernel_params = {"mat_size": n}
program = cl.Program(ctx, KERNEL_CODE % kernel_params).build()

# data
A = numpy.random.rand(n, n)
x = numpy.random.rand(n, 1)

# host buffers
h_y = numpy.empty(numpy.shape(x)).astype(numpy.float32)
h_A = numpy.real(A).astype(numpy.float32)
h_x = numpy.real(x).astype(numpy.float32)

# device buffers
mf = cl.mem_flags
d_A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_A)
d_x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_x)
d_y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_y.nbytes)

# execute OpenCL code
t0 = time.time()
event = program.dmv_cl(queue, h_y.shape, None, d_A_buf, d_x_buf, d_y_buf)
event.wait()
cl.enqueue_copy(queue, h_y, d_y_buf)
t1 = time.time()

print "opencl elapsed time =", (t1-t0)

# Same calculation with numpy
t0 = time.time()
y = numpy.dot(h_A, h_x)
t1 = time.time()

print "numpy elapsed time =", (t1-t0)

# see if the results are the same
print "max deviation =", numpy.abs(y-h_y).max()

Overwriting opencl-dense-mv.py



!python opencl-dense-mv.py

/usr/local/lib/python2.7/dist-packages/pyopencl-2012.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py:36: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  "to see more.", CompilerWarning)
opencl elapsed time = 0.0188570022583
numpy elapsed time = 0.0755031108856
max deviation = 0.0136719

Further reading

  • http://mathema.tician.de/software/pyopencl

License

This work is licensed under the Creative Commons Attribution 3.0 license.
