Using concurrent.futures the happy way

Preface

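I have been using multithreading and multiprocessing a lot lately while writing crawlers. As a rule of thumb, I/O-bound work such as the network requests a crawler makes should use multiple threads, while CPU-bound work such as floating-point computation or word segmentation should use multiple processes to make full use of the CPU. What follows is my summary of the official concurrent.futures documentation, plus what I learned from using it.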

Introduction to concurrent.futures

@python 3.6.8

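concurrent.futures is mainly used through two classes: ThreadPoolExecutor for multithreading and ProcessPoolExecutor for multiprocessing. Both are subclasses of the abstract Executor class and share the same interface.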
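
Because both classes share the Executor interface, code written against that interface does not need to know which pool it receives. Here is a minimal sketch of the idea; the helper names square and sum_of_squares are made up for illustration, and only a thread pool is exercised to keep the example small.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


def square(n):
    return n * n


def sum_of_squares(executor: Executor, numbers):
    """Hypothetical helper: uses only map(), which every Executor subclass provides."""
    return sum(executor.map(square, numbers))


# Both concrete pools derive from the same abstract base class.
print(issubclass(ThreadPoolExecutor, Executor))
print(issubclass(ProcessPoolExecutor, Executor))

with ThreadPoolExecutor(max_workers=4) as executor:
    total = sum_of_squares(executor, range(5))

print(total)  # 0 + 1 + 4 + 9 + 16 = 30
```

A ProcessPoolExecutor could be passed to sum_of_squares in the same way, provided the script is run with the usual `if __name__ == '__main__':` guard.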

Executor Objects

The Executor class cannot be used directly; its methods are called through its subclasses ThreadPoolExecutor and ProcessPoolExecutor.

submit

@python 3.6.8

from concurrent.futures import ThreadPoolExecutor


# submit(fn, *args, **kwargs)
# return Future object
with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(pow, 323, 1235)
    print(future.result())

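As you can see, even to introduce Executor I have to use ThreadPoolExecutor as the example, because Executor is an abstract class and cannot be used directly. The most important method on Executor is submit: it schedules a callable to run in the thread pool or process pool and returns a Future object, whose result method gives you the callable's return value.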
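
To make the submit/result relationship a bit more concrete, here is a small sketch. The divide function is only an illustration; the point is that arguments are forwarded to the callable and that an exception raised inside the worker is re-raised when result() is called.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(a, b):
    return a / b


error = None

with ThreadPoolExecutor(max_workers=2) as executor:
    ok = executor.submit(divide, 10, b=4)    # positional and keyword arguments are forwarded
    failed = executor.submit(divide, 1, 0)   # raises ZeroDivisionError inside the worker thread

    ok_value = ok.result()                   # blocks until the call has finished
    try:
        failed.result()                      # the worker's exception is re-raised here
    except ZeroDivisionError as exc:
        error = exc

print(ok_value)
print(repr(error))
```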

map

@python 3.6.8

import os
from concurrent.futures import ProcessPoolExecutor


def print_hello(n):
    print("I'm number {} say hello in pid {}".format(n, os.getpid()))


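# map(func, *iterables, timeout=None, chunksize=1)
# The __main__ guard keeps worker processes from re-running this block
# on platforms that start workers by re-importing the module.
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(print_hello, range(10))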

### result
# I'm number 0 say hello in pid 9254
# I'm number 1 say hello in pid 9255
# I'm number 4 say hello in pid 9254
# I'm number 2 say hello in pid 9256
# I'm number 3 say hello in pid 9257
# I'm number 5 say hello in pid 9259
# I'm number 7 say hello in pid 9261
# I'm number 8 say hello in pid 9262
# I'm number 9 say hello in pid 9263
# I'm number 6 say hello in pid 9260
###

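Executor.map is very similar to Python's built-in map: it takes its arguments from iterables and schedules the function calls on the process pool or thread pool. Above we created 10 processes, and the output lines appear out of order because on a multi-core CPU the processes run in parallel and the operating system decides how they are scheduled. You will also notice that process 9258 never ran a task while 9254 ran two. Although the pool holds 10 processes, a process that finishes its task goes back to the pool ready for the next one, and whether the other processes get used at all is also up to the operating system.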
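
One thing worth separating out: the order in which tasks run is not fixed, but the iterator returned by map still yields results in the order of the inputs. The sketch below uses a thread pool to stay small (the same ordering applies to map on a process pool); slow_echo and finish_order are names invented for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

finish_order = []


def slow_echo(n):
    # Smaller n sleeps longer, so later inputs tend to finish first.
    time.sleep((3 - n) * 0.1)
    finish_order.append(n)
    return n


with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(slow_echo, range(3)))

print(finish_order)  # completion order, typically reversed; depends on scheduling
print(results)       # always in input order: [0, 1, 2]
```
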
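map also takes a timeout parameter: if a result is not available within timeout seconds, counted from the original call to map, iterating over the results raises a TimeoutError. chunksize sets how many items from *iterables are grouped into one chunk and sent to a worker as a single task; it only has an effect with ProcessPoolExecutor. For example: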

@python 3.6.8

import os
from concurrent.futures import ProcessPoolExecutor


def print_hello(n):
    print("I'm number {} say hello in pid {}".format(n, os.getpid()))


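# map(func, *iterables, timeout=None, chunksize=1)
# The __main__ guard keeps worker processes from re-running this block
# on platforms that start workers by re-importing the module.
if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=10) as executor:
        executor.map(print_hello, range(10), chunksize=10)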

### result
# I'm number 0 say hello in pid 10292
# I'm number 1 say hello in pid 10292
# I'm number 2 say hello in pid 10292
# I'm number 3 say hello in pid 10292
# I'm number 4 say hello in pid 10292
# I'm number 5 say hello in pid 10292
# I'm number 6 say hello in pid 10292
# I'm number 7 say hello in pid 10292
# I'm number 8 say hello in pid 10292
# I'm number 9 say hello in pid 10292
###

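As you can see, no matter how many processes you create, if the whole workload is handed to one unlucky process the others have nothing to do, and the process that received every task runs them one after another, since it is the only one working. The chunksize parameter was added in Python 3.5 and defaults to 1, so choose a value that suits your number of processes and amount of work.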
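
The effect of chunksize can be illustrated without starting any processes. The sketch below is not the library's internal code: split_into_chunks only mimics the grouping idea, and suggest_chunksize is a hypothetical rule of thumb (aim for a few chunks per worker) that I am assuming for illustration.

```python
import math


def split_into_chunks(items, chunksize):
    """Group items into consecutive chunks; each chunk would go to a worker as one task."""
    items = list(items)
    return [items[i:i + chunksize] for i in range(0, len(items), chunksize)]


def suggest_chunksize(n_tasks, n_workers, chunks_per_worker=4):
    """Hypothetical heuristic, not part of concurrent.futures."""
    return max(1, math.ceil(n_tasks / (n_workers * chunks_per_worker)))


# chunksize=10 with 10 items leaves a single chunk, so a single process gets everything.
print(split_into_chunks(range(10), 10))
# chunksize=3 gives four chunks that can be spread across workers.
print(split_into_chunks(range(10), 3))
# 10000 tasks on 4 workers, about 4 chunks per worker.
print(suggest_chunksize(10000, 4))
```

With many tiny tasks a larger chunksize reduces inter-process overhead; with few or uneven tasks a small chunksize keeps all workers busy.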

shutdown

@python 3.6.8

import shutil
from concurrent.futures import ThreadPoolExecutor


# shutdown(wait=True)
e = ThreadPoolExecutor(max_workers=4)
e.submit(shutil.copy, 'src1.txt', 'dest1.txt')
e.submit(shutil.copy, 'src2.txt', 'dest2.txt')
e.submit(shutil.copy, 'src3.txt', 'dest3.txt')
e.submit(shutil.copy, 'src4.txt', 'dest4.txt')
e.shutdown(wait=True)

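With wait=True, shutdown blocks until all pending futures have finished executing and the executor's resources have been freed. With wait=False, shutdown returns immediately, but the tasks already submitted still run to completion, after which the resources are freed. Either way, the Python program will not exit until all pending futures have finished; shutdown(wait=True) simply makes the program wait at that point.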
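A better way is to use with, as in the earlier examples and the ones that follow: it waits for the pool to finish and frees its resources, and it is more concise. Like this: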

@python 3.6.8

import shutil
from concurrent.futures import ThreadPoolExecutor


with ThreadPoolExecutor(max_workers=4) as e:
    e.submit(shutil.copy, 'src1.txt', 'dest1.txt')
    e.submit(shutil.copy, 'src2.txt', 'dest2.txt')
    e.submit(shutil.copy, 'src3.txt', 'dest3.txt')
    e.submit(shutil.copy, 'src4.txt', 'dest4.txt')
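
As a small check of what leaving the with block implies, the sketch below (using pow instead of file copies so that it runs anywhere) shows that every future is done once the block exits, and that the executor rejects new work afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(pow, 2, n) for n in range(4)]

# Leaving the with block called executor.shutdown(wait=True),
# so every submitted call has finished by now.
all_done = all(f.done() for f in futures)
values = [f.result() for f in futures]

rejected = False
try:
    executor.submit(pow, 2, 10)   # the executor has been shut down
except RuntimeError:
    rejected = True

print(all_done, values, rejected)
```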

ThreadPoolExecutor and ProcessPoolExecutor

I have already shown some uses of these two main classes; now let's look at them in more detail.

@python 3.6.8

import urllib.request
import concurrent.futures


# ThreadPoolExecutor(max_workers=None, thread_name_prefix='')
# official code

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

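Since Python 3.5, if max_workers is not given it defaults to the number of processors on the machine multiplied by 5 (Python 3.8 later changed this default to min(32, os.cpu_count() + 4)). The thread_name_prefix parameter was added in Python 3.6; it makes threads easier to tell apart when debugging.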
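
A quick way to see thread_name_prefix at work is to have each task report the name of the thread it runs on. This is only a sketch; the prefix 'crawler' and the helper current_thread_name are arbitrary choices for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def current_thread_name(_):
    # Each task reports which worker thread executed it.
    return threading.current_thread().name


with ThreadPoolExecutor(max_workers=2, thread_name_prefix='crawler') as executor:
    names = set(executor.map(current_thread_name, range(4)))

print(names)  # every name starts with the prefix; the exact suffixes may vary
```

The same names show up in log records and tracebacks, which is where the prefix pays off.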

@python 3.6.8

import concurrent.futures
import math


# ProcessPoolExecutor(max_workers=None)
# official code

PRIMES = [
    112272535095293,
    112582705942171,
    112272535095293,
    115280095190773,
    115797848077099,
    1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False

    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
            print('%d is prime: %s' % (number, prime))

if __name__ == '__main__':
    main()

If max_workers is not given to ProcessPoolExecutor, it defaults to the number of processors on the machine; a value less than or equal to 0 raises a ValueError.
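
The max_workers check happens when the executor is constructed, so it can be observed without running any task. A minimal sketch, where rejects_worker_count is a helper name made up for this example:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def rejects_worker_count(executor_cls, max_workers):
    """Return True if the executor class refuses this max_workers value."""
    try:
        executor = executor_cls(max_workers=max_workers)
    except ValueError:
        return True
    executor.shutdown()
    return False


print(rejects_worker_count(ProcessPoolExecutor, 0))
print(rejects_worker_count(ProcessPoolExecutor, -1))
print(rejects_worker_count(ThreadPoolExecutor, 0))
```
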
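There is one small pitfall when running multiprocessing code. I write my code in the VS Code text editor, and most text editors, as well as the IDE that ships with Python, behave the same way, though a full IDE such as PyCharm probably does not. The __main__ module must be importable by the worker processes, so if you use the editor's built-in run command, which may execute the file differently from running python module.py in a terminal, the program can simply hang. The same happens when a third-party library uses multiprocessing internally (for example jieba with its parallel mode enabled). In that case the only option is to run the program from a terminal with python module.py (where module is the file name). I don't know the exact cause; discussion is welcome!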

Deadlocks

@python 3.6.8

from concurrent.futures import ThreadPoolExecutor


# official code
import time
def wait_on_b():
    time.sleep(5)
    print(b.result())  # b will never complete because it is waiting on a.
    return 5

def wait_on_a():
    time.sleep(5)
    print(a.result())  # a will never complete because it is waiting on b.
    return 6


executor = ThreadPoolExecutor(max_workers=2)
a = executor.submit(wait_on_b)
b = executor.submit(wait_on_a)

And

@python 3.6.8

from concurrent.futures import ThreadPoolExecutor


# official code
def wait_on_future():
    f = executor.submit(pow, 5, 2)
    # This will never complete because there is only one worker thread and
    # it is executing this function.
    print(f.result())

executor = ThreadPoolExecutor(max_workers=1)
executor.submit(wait_on_future)
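
One possible way around the single-worker case above, sketched under the assumption that the inner work can go to a different pool: the outer task then no longer waits on a future that needs the very worker it is occupying. The names outer and inner are my own.

```python
from concurrent.futures import ThreadPoolExecutor

# A separate pool for the nested call, so it never competes with the outer task.
inner = ThreadPoolExecutor(max_workers=1)


def wait_on_future():
    f = inner.submit(pow, 5, 2)
    # Safe to block here: the worker that runs pow belongs to the other pool.
    return f.result()


with ThreadPoolExecutor(max_workers=1) as outer:
    result = outer.submit(wait_on_future).result()

inner.shutdown()
print(result)  # 25
```

Another option is to have the task return without waiting and let the caller submit the follow-up work.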

@python 3.6.8

import os
from concurrent.futures import ProcessPoolExecutor


def print_hello():
    print('hello pid: {}'.format(os.getpid()))
    print('hello')


# my code
# deadlock
def process_in_process():
    print('process_in_process pid: {}'.format(os.getpid()))
    e.submit(print_hello())


if __name__ == '__main__':
    print('main pid: {}'.format(os.getpid()))
    with ProcessPoolExecutor(max_workers=10) as e: 
        e.submit(process_in_process)

### result
# main pid: 17016
# process_in_process pid: 17017
# hello pid: 17017
# hello
......

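Run the program above and you will find that it hangs after printing hello and never exits normally: a deadlock has occurred. The reason is that a callable submitted with submit should not call Executor or Future methods, even if the main program created several processes. (Note also that e.submit(print_hello()) calls print_hello right away inside the process_in_process worker and submits its return value, None, which is why both pids in the output are the same.) In the words of the official documentation: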

Calling Executor or Future methods from a callable submitted to a ProcessPoolExecutor will result in deadlock.

However, creating a new executor inside the submitted task works:

@python 3.6.8

import os
from concurrent.futures import ProcessPoolExecutor


def print_hello():
    print('hello pid: {}'.format(os.getpid()))
    print('hello')


# my code
# no deadlock
def process_in_process():
    print('process_in_process pid: {}'.format(os.getpid()))
    with ProcessPoolExecutor(max_workers=1) as e:
        e.submit(print_hello())


if __name__ == '__main__':
    print('main pid: {}'.format(os.getpid()))
    with ProcessPoolExecutor(max_workers=10) as e: 
        e.submit(process_in_process)

### result
# main pid: 18158
# process_in_process pid: 18159
# hello pid: 18159
# hello
###
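
A side note on the two listings above: e.submit(print_hello()) is different from e.submit(print_hello). The sketch below shows the difference with a thread pool; say_hello and calls are names I made up, and the behaviour is what I would expect from any executor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

main_thread = threading.current_thread().name
calls = []


def say_hello():
    # Record which thread actually executed the function.
    calls.append(threading.current_thread().name)


with ThreadPoolExecutor(max_workers=1, thread_name_prefix='worker') as executor:
    # Passing the function itself: the pool calls it on a worker thread.
    by_reference = executor.submit(say_hello)

    # Calling it first: say_hello() runs here in the caller, and its return
    # value (None) is what gets submitted, so the worker fails when it tries to call None.
    called_first = executor.submit(say_hello())

error = called_first.exception()
print(calls)
print(repr(error))
```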

Future Objects

Future instances are created by Executor.submit. Note that Executor.map does not return Future objects; it returns an iterator over the return values of the function calls.
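
The difference between the two return types can be seen in a few lines. This is a small sketch using built-in functions only:

```python
from concurrent.futures import Future, ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(pow, 2, 10)          # a Future wrapping one call
    mapped = executor.map(abs, [-1, -2, -3])      # an iterator of plain results

    is_future = isinstance(future, Future)
    map_is_future = isinstance(mapped, Future)
    value = future.result()
    values = list(mapped)

print(is_future, value)        # True 1024
print(map_is_future, values)   # False [1, 2, 3]
```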

@python 3.6.8

"""Future methods."""

cancel()
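# Attempts to cancel the call. Returns False if the call is already running or finished
# and cannot be cancelled; otherwise cancels it and returns True. A running task is not terminated.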

cancelled()
# Returns True if the call was successfully cancelled.

running()
# Returns True if the call is currently being executed and cannot be cancelled.

done()
# Returns True if the call was successfully cancelled or finished running.

result(timeout=None)
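# Returns the value returned by the call. timeout is how many seconds to wait for the result;
# a TimeoutError is raised if the call has not completed by then. With timeout=None the wait is unlimited.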

exception(timeout=None)
# Returns the exception raised by the call, or None if it completed without raising. timeout works as in result().

add_done_callback(fn)
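# Attaches fn as a callback; fn is called with the Future as its only argument
# when the future is cancelled or finishes running.
# Callbacks are always called in a thread belonging to the process that added them.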
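
Several of these methods can be exercised together with a single-worker pool. The following is a sketch under that assumption; blocker, started, release and callback_log are names invented here. One task occupies the only worker, so a second task stays pending and can still be cancelled.

```python
import threading
from concurrent.futures import CancelledError, ThreadPoolExecutor, TimeoutError

started = threading.Event()
release = threading.Event()
callback_log = []
timed_out = False
queued_raised = False


def blocker():
    started.set()
    release.wait(timeout=5)   # safety timeout so the sketch cannot hang forever
    return 'finished'


with ThreadPoolExecutor(max_workers=1) as executor:
    running = executor.submit(blocker)
    started.wait()                          # the only worker is now busy
    queued = executor.submit(pow, 2, 3)     # stays pending behind blocker
    queued.add_done_callback(lambda f: callback_log.append(f.cancelled()))

    running_cancelled = running.cancel()    # False: already running
    queued_cancelled = queued.cancel()      # True: still pending

    try:
        running.result(timeout=0.1)         # stop waiting after 0.1 s
    except TimeoutError:
        timed_out = True                    # the task itself keeps running

    release.set()
    final_value = running.result()

try:
    queued.result()
except CancelledError:
    queued_raised = True

print(running_cancelled, queued_cancelled, timed_out, final_value, callback_log)
```

Note that the timeout only limits how long the caller waits; it does not stop the task.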

Module Functions

@python 3.6.8

concurrent.futures.wait(fs, timeout=None, return_when=ALL_COMPLETED)
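# Waits for the Future instances given by fs to complete; the futures may have been
# created by different Executor objects.
# Returns a named 2-tuple of sets: the first set (done) holds the futures that finished
# or were cancelled, the second (not_done) holds those that did not complete.
# timeout limits how many seconds to wait before returning; no error is raised when it expires.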

# return_when constants
FIRST_COMPLETED
# Returns as soon as any future in fs finishes or is cancelled.

FIRST_EXCEPTION
# Returns as soon as any future in fs finishes by raising an exception;
# if none raises, this is equivalent to ALL_COMPLETED.

ALL_COMPLETED
# Returns when all futures have finished or been cancelled.


concurrent.futures.as_completed(fs, timeout=None)
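# Returns an iterator over the Future instances given by fs (which may have been created
# by different Executor objects), yielding each future as it finishes or is cancelled.
# A future that appears more than once in fs is yielded only once.
# If timeout is given, iterating raises a TimeoutError when the next result is not available
# within timeout seconds of the original call to as_completed.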
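
To show wait and as_completed side by side, here is a sketch in which one task finishes immediately and the other is held back by an event. The names slow and release are mine, and the event makes the outcome of wait deterministic.

```python
import threading
from concurrent.futures import (FIRST_COMPLETED, ThreadPoolExecutor,
                                as_completed, wait)

release = threading.Event()


def slow():
    release.wait(timeout=5)   # held back until the main thread releases it
    return 'slow'


with ThreadPoolExecutor(max_workers=2) as executor:
    fast_future = executor.submit(pow, 2, 5)
    slow_future = executor.submit(slow)

    # Returns once at least one future has finished; slow is still blocked.
    done, not_done = wait([fast_future, slow_future], return_when=FIRST_COMPLETED)

    release.set()

    # fast_future is passed twice but is yielded only once.
    completed = [f.result() for f in
                 as_completed([fast_future, fast_future, slow_future])]

print(done == {fast_future}, not_done == {slow_future})
print(completed)  # two results; the order follows completion, not input order
```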
