A Multiprocess Parallel-Processing Model in Python

At work we often need to process large batches of data or tasks offline. Suppose 20 million records need offline processing, and each record takes some time to handle: a single process may run for one or two hours, or even a full day (your colleagues won't wait around for it). Building a multiprocess concurrency model on the concurrent module solves this class of problem quickly, and it is simple enough to use as-is. The model is built on concurrent.futures, which is part of the standard library in Python 3 (on Python 2.x install the futures backport separately) and provides high-level multithreading and multiprocessing:

Executor - two subclasses create thread pools and process pools respectively: ThreadPoolExecutor and ProcessPoolExecutor

Future - the handle returned by Executor.submit for each submitted task

concurrent.futures usage examples

Example documentation for using concurrent.futures on Python 2.7 - Python's concurrent.futures

The motivation for the new concurrent module - futures - execute computations asynchronously

ThreadPoolExecutor

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    print(list(executor.map(sleeper, x)))
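The fragment above leaves sleeper and x undefined; a self-contained version (the sleeper function and the delay list here are illustrative stand-ins):

```python
import concurrent.futures
import time

def sleeper(secs):
    """Sleep for `secs` seconds and return the delay."""
    time.sleep(secs)
    return secs

x = [0.3, 0.2, 0.1]

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in input order, even if jobs finish out of order
    print(list(executor.map(sleeper, x)))  # [0.3, 0.2, 0.1]
```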

ProcessPoolExecutor

The ProcessPoolExecutor API is remarkably concise, with far less boilerplate than the older threading and multiprocessing modules:

from concurrent.futures import ProcessPoolExecutor

def pool_factorizer_go(nums, nprocs):
    with ProcessPoolExecutor(max_workers=nprocs) as executor:
        return {num: factors for num, factors in
                zip(nums, executor.map(factorize_naive, nums))}
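The factorize_naive worker is not defined in the snippet; a minimal trial-division sketch (the name follows the snippet above, the body is an assumption):

```python
def factorize_naive(n):
    """Return the prime factors of n, in ascending order, by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:  # whatever remains is itself prime
        factors.append(n)
    return factors

print(factorize_naive(360))  # [2, 2, 2, 3, 3, 5]
```

With this in place, a call like pool_factorizer_go(range(2, 100000), 4) spreads the factorizations across four processes.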

A simple multiprocess parallel-processing model

Define a Job class as the smallest unit of work a process executes

Job: wraps the task callable (func), a task id, the args and kwargs it takes, and any other metadata

class Job(object):
    def __init__(self, kwargs):
        self.id = kwargs.get('id', None)
        self.func = kwargs.get('func', None)
        self.args = kwargs.get('args', None)
        self.kwargs = kwargs.get('kwargs', None)

Build the process-pool object

concurrent.futures.ProcessPoolExecutor constructs the pool; Job tasks are handed to it via submit, together with their arguments.

future.add_done_callback registers a callback that handles each Job's result.

import concurrent.futures

class ProcessPoolDemo(object):
    def __init__(self, max_size):
        self._pool = concurrent.futures.ProcessPoolExecutor(max_workers=max_size)

    def executor_job(self, job):
        def job_executed(f):
            if f.exception() is not None:
                print('job[{job_id}] worker[{fun}] error : [{error_msg}]'.format(
                    job_id=job.id, fun=job.func, error_msg=f.exception()))
            else:
                print('job[{job_id}] worker[{fun}] finish: [{msg}]'.format(
                    job_id=job.id, fun=job.func, msg=f.result()))

        # args and kwargs are passed as two plain positional parameters here,
        # matching fun_demo(args, kwargs) below; use *job.args, **job.kwargs
        # to unpack them instead.
        future = self._pool.submit(job.func, job.args, job.kwargs)
        future.add_done_callback(job_executed)
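The callback wiring can be exercised in isolation; a minimal sketch (shown with ThreadPoolExecutor so it runs without pickling a worker function; the add_done_callback API is identical on ProcessPoolExecutor):

```python
import concurrent.futures

results = []

def on_done(future):
    # The finished Future is passed to the callback; check exception()
    # before result() so a failed job does not raise inside the callback.
    if future.exception() is not None:
        results.append(('error', str(future.exception())))
    else:
        results.append(('ok', future.result()))

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    ok_future = pool.submit(pow, 2, 10)
    ok_future.add_done_callback(on_done)
    bad_future = pool.submit(int, 'not a number')
    bad_future.add_done_callback(on_done)

# Both callbacks have fired by the time the with-block exits,
# because shutdown waits for all submitted work to finish.
print(sorted(results))
```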

A simple process throttle with multiprocessing.Queue

A Queue's first-in-first-out behavior gives a simple lock/throttle, so asynchronous multiprocess execution cannot run out of control:

put: when the queue is full, blocks until a slot frees up, then puts

get: when the queue is empty, blocks until an item is available, then gets

from multiprocessing import Queue

self.internal_job_queue = Queue(maxsize=30 * max_workers)

self.internal_job_queue.put("#", True)  # blocks until a slot frees up, then the next job runs

self.internal_job_queue.get(False)      # once the queue has an element, remove it to release a slot
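The throttling behavior can be seen in a toy run; a sketch assuming a pool of four workers (the token string '#' is arbitrary):

```python
from multiprocessing import Queue

MAX_WORKERS = 4
tokens = Queue(maxsize=MAX_WORKERS)

# Submission side: each put() claims a slot; a fifth put() would block
# until a running job returns its token.
for _ in range(MAX_WORKERS):
    tokens.put('#', True)
print(tokens.full())  # True

# Completion side: a finished job releases its slot. A blocking get() is
# used here; note that get(False) can raise queue.Empty even right after
# a put(), because multiprocessing.Queue flushes items via a feeder thread.
tokens.get()
print(tokens.full())  # False
```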

Complete example

#!/usr/bin/env python
# --*-- coding:utf-8 --*--
from __future__ import (absolute_import, unicode_literals, print_function)

import time
import random
import concurrent.futures
from multiprocessing import Queue


class Job(object):
    def __init__(self, kwargs):
        self.id = kwargs.get('id', None)
        self.func = kwargs.get('func', None)
        self.args = kwargs.get('args', None)
        self.kwargs = kwargs.get('kwargs', None)


class ProcessPoolDemo(object):
    def __init__(self, max_size):
        self._pool = concurrent.futures.ProcessPoolExecutor(max_workers=max_size)
        self.internal_job_queue = Queue(maxsize=max_size)  # one slot per process

    def executor_job(self, job):
        def job_executed(f):
            self.internal_job_queue.get(False)  # release one process slot
            if f.exception() is not None:
                print('job[{job_id}] worker[{fun}] error : [{error_msg}]'.format(
                    job_id=job.id, fun=job.func, error_msg=f.exception()))
            else:
                print('job[{job_id}] worker[{fun}] finish: [{msg}]'.format(
                    job_id=job.id, fun=job.func, msg=f.result()))

        future = self._pool.submit(job.func, job.args, job.kwargs)
        future.add_done_callback(job_executed)


def fun_demo(args, kwargs):
    sleep_time = random.uniform(0, 1)
    time.sleep(sleep_time)
    return sleep_time


def main():
    process_pool = ProcessPoolDemo(4)  # four worker processes
    for index in range(0, 10):  # ten jobs, submitted asynchronously
        process_pool.internal_job_queue.put('#', True)  # blocks while 4 jobs are in flight
        print('put in job[{job_id}]'.format(job_id=index))
        job_dict = {'id': index, 'func': fun_demo,
                    'args': [1, 2, 3], 'kwargs': {'name': 'bear', 'age': 26}}
        process_pool.executor_job(Job(job_dict))


if __name__ == '__main__':
    main()

Output

put in job[0]

put in job[1]

put in job[2]

put in job[3]

put in job[4]

job[2] worker[] finish: [0.0703135307709]

job[4] worker[] finish: [0.128978704821]

put in job[5]

job[0] worker[] finish: [0.309983499388]

put in job[6]

job[5] worker[] finish: [0.186503637742]

put in job[7]

job[6] worker[] finish: [0.162506532634]

put in job[8]

job[1] worker[] finish: [0.62530039205]

put in job[9]

job[3] worker[] finish: [0.707108629808]

job[7] worker[] finish: [0.386662916358]

job[8] worker[] finish: [0.361320846488]

job[9] worker[] finish: [0.325288205939]
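On Python 3, where concurrent.futures is built in, the same back-pressure can be achieved without a Queue by capping the set of in-flight futures with concurrent.futures.wait; a sketch under that assumption (shown with ThreadPoolExecutor so it is trivially runnable; substitute ProcessPoolExecutor for CPU-bound work):

```python
import concurrent.futures
import time

def work(job_id):
    # stand-in for the real per-record processing
    time.sleep(0.01)
    return job_id * 2

def run_throttled(job_ids, max_workers=4):
    results = {}
    pending = {}  # Future -> job_id
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for job_id in job_ids:
            while len(pending) >= max_workers:
                # Block until at least one in-flight job finishes,
                # mirroring the blocking internal_job_queue.put() above.
                done, _ = concurrent.futures.wait(
                    pending, return_when=concurrent.futures.FIRST_COMPLETED)
                for f in done:
                    results[pending.pop(f)] = f.result()
            pending[pool.submit(work, job_id)] = job_id
        for f in concurrent.futures.wait(pending).done:
            results[pending[f]] = f.result()
    return results

print(run_throttled(range(10)))
```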
