The base class for thread pools is Executor in the concurrent.futures module. Executor has two subclasses, ThreadPoolExecutor and ProcessPoolExecutor: ThreadPoolExecutor creates thread pools, while ProcessPoolExecutor creates process pools.
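Because both subclasses implement the same Executor interface, code written against one usually works with the other. A minimal sketch (work is a placeholder function, not from this post):
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def work(x):
    return x * x


if __name__ == '__main__':
    # Swap in ProcessPoolExecutor(max_workers=2) here and the rest is unchanged
    pool = ThreadPoolExecutor(max_workers=2)
    future = pool.submit(work, 3)
    print(future.result())  # 9
    pool.shutdown()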
Executor provides the following commonly used methods:
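- submit(fn, *args, **kwargs): schedules fn(*args, **kwargs) to run and returns a Future.
- map(func, *iterables, timeout=None, chunksize=1): like the built-in map, but the calls run concurrently and the results come back in argument order.
- shutdown(wait=True): tells the executor to free its resources; with wait=True it blocks until all pending tasks finish.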
The submit method returns a Future object. The Future class is mainly used to obtain the return value of the task function. Since the task runs asynchronously in another thread, it amounts to a task that will "complete in the future", which is why Python represents it with a Future.
Future provides the following methods:
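- cancel(): attempts to cancel the task; returns False if it is already running or finished.
- cancelled() / running() / done(): query the task's current state.
- result(timeout=None): returns the task's return value, blocking until it is ready.
- exception(timeout=None): returns the exception the task raised, if any.
- add_done_callback(fn): registers fn to be called with the Future once the task finishes.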
Start with the synchronous version: a crawler function simulates fetching a page, and time.sleep simulates the network I/O delay.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import time


def crawler():
    print('crawl page...')
    time.sleep(2)


def main():
    start = time.time()
    for _ in range(4):
        crawler()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
It takes about 8 seconds (four sequential calls, 2 seconds each):
$ python3 demo00.py
crawl page...
crawl page...
crawl page...
crawl page...
take 8.007 second
Now the thread pool version.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler():
    print(f'{threading.current_thread().name} crawl page...')
    time.sleep(2)


def main():
    start = time.time()
    # Create a thread pool with at most 4 worker threads;
    # thread_name_prefix (optional) names the workers crawler_0, crawler_1, ...
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for _ in range(4):
        pool.submit(crawler)
    # Wait for all submitted tasks to finish
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
$ python3 demo01.py
crawler_0 crawl page...
crawler_1 crawl page...
crawler_2 crawl page...
crawler_3 crawl page...
take 2.003 second
All four pages were crawled concurrently, so the total is about 2 seconds instead of 8. submit can also forward arguments to the task function:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        # submit accepts multiple arguments and passes them through to crawler
        pool.submit(crawler, page_num, page_num - 1)
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
$ python3 demo02.py
crawler_0 crawl page 1, index: 0crawler_1 crawl page 2, index: 1
crawler_2 crawl page 3, index: 2
crawler_3 crawl page 4, index: 3
take 2.003 second
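(Note how the first two output lines ran together: print calls from different threads share stdout and can interleave.)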
Next, getting the return value: crawler now returns a string, and we call future.result() right after each submit.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        future = pool.submit(crawler, page_num, page_num - 1)
        # Print the return value
        print(future.result())
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
Run it and... what? It has become synchronous:
$ python3 demo03.py
crawl page 1, index: 0
page 1 finished
crawl page 2, index: 1
page 2 finished
crawl page 3, index: 2
page 3 finished
crawl page 4, index: 3
page 4 finished
take 8.009 second
The reason is that future.result() blocks the calling thread until that task finishes, so the next submit does not happen until the previous page is done. Quite a trap. But we still want the return values, so how do we fix this?
Remember the add_done_callback(fn) method from the Future list above? The function fn is registered as a callback on the Future: it is invoked automatically when the task finishes, so the main thread is never blocked.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def get_result(future):
    print(future.result())


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    for page_num in range(1, 5):
        future = pool.submit(crawler, page_num, page_num - 1)
        # print(future.result())
        # Register a callback that prints the return value once the task finishes
        future.add_done_callback(get_result)
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
$ python3 demo03.py
crawler_0 crawl page 1, index: 0
crawler_1 crawl page 2, index: 1
crawler_2 crawl page 3, index: 2
crawler_3 crawl page 4, index: 3
page 1 finished
page 4 finishedpage 2 finished
page 3 finished
take 2.003 second
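An alternative to callbacks, not used in this post but worth knowing: submit everything first, keep the Future objects, and only then read the results. concurrent.futures.as_completed yields each Future as it finishes, so result() no longer serializes the submits. A minimal sketch:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    # Submit all four tasks up front so they run concurrently...
    futures = [pool.submit(crawler, page_num, page_num - 1)
               for page_num in range(1, 5)]
    # ...then collect results; as_completed yields each Future as it finishes
    for future in as_completed(futures):
        print(future.result())
    pool.shutdown()
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()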
Finally, Executor.map: like the built-in map, it applies the function to each set of arguments, but the calls run in the pool. Note that it returns the results themselves (lazily), not Future objects:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import threading
import time
from concurrent.futures import ThreadPoolExecutor

l1 = [1, 2, 3, 4]
l2 = ['a', 'b', 'c', 'd']


def crawler(page_num, index):
    print(f'{threading.current_thread().name} crawl page {page_num}, index: {index}')
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    start = time.time()
    pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler')
    # map submits one task per pair of arguments, like the built-in map,
    # and returns the results (not Futures) lazily
    results = pool.map(crawler, l1, l2)
    pool.shutdown()
    # The return value is a generator; results come back in argument order
    print(type(results))
    print(list(results))
    end = time.time()
    print(f'take {(end - start):2.3f} second')


if __name__ == '__main__':
    main()
$ python3 demo04.py
crawler_0 crawl page 1, index: a
crawler_1 crawl page 2, index: b
crawler_2 crawl page 3, index: c
crawler_3 crawl page 4, index: d
<class 'generator'>
['page 1 finished', 'page 2 finished', 'page 3 finished', 'page 4 finished']
take 2.003 second
Note: strictly speaking, neither map nor submit "creates threads". The pool manages its own worker threads (CPython spawns them lazily as tasks come in, up to max_workers); these methods merely queue tasks for those workers to execute.
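One last idiom worth adopting: ThreadPoolExecutor is also a context manager, and leaving the with block calls shutdown(wait=True) automatically. A minimal sketch:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import time
from concurrent.futures import ThreadPoolExecutor


def crawler(page_num):
    time.sleep(2)
    return f'page {page_num} finished'


def main():
    # shutdown(wait=True) is called automatically when the with block exits
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix='crawler') as pool:
        futures = [pool.submit(crawler, n) for n in range(1, 5)]
    for future in futures:
        print(future.result())


if __name__ == '__main__':
    main()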