A Short Summary of Concurrency with Python's ThreadPoolExecutor

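In a recent project I had to call the Gaode Maps (Amap) API frequently, and calling it in a single-threaded loop turned out to be too slow. After reading up on concurrency in Python, I decided to use ThreadPoolExecutor (a thread pool).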

Before that, let's look at how long it takes to fetch several web pages in a single-threaded loop:

import time
import requests
import threading

def get_html(url):
    # Print which thread handles the request, then fetch the page
    print('thread id:', threading.current_thread().name, ' visited:', url)
    return requests.get(url)

if __name__ == '__main__':
    URLS = ['http://www.baidu.com', 'http://www.qq.com', 'http://www.sina.com.cn', 'https://translate.google.cn', 'https://www.csdn.net']
    start = time.time()
    # Fetch the pages one after another on the main thread
    for url in URLS:
        response = get_html(url)
        print('url:%s ,len: %d' % (response.url, len(response.text)))
    end = time.time()
    print('Elapsed: %.2fs' % (end - start))

The output is shown below. Fetching the 5 pages on a single thread took about 5 seconds:

thread id: MainThread  visited: http://www.baidu.com
url:http://www.baidu.com/ ,len: 2381
thread id: MainThread  visited: http://www.qq.com
url:https://www.qq.com/ ,len: 229062
thread id: MainThread  visited: http://www.sina.com.cn
url:https://www.sina.com.cn/ ,len: 527374
thread id: MainThread  visited: https://translate.google.cn
url:https://translate.google.cn/ ,len: 183434
thread id: MainThread  visited: https://www.csdn.net
url:https://www.csdn.net/ ,len: 401478
Elapsed: 5.31s

Next, let's fetch the same pages with multiple threads from a thread pool:

import time
import requests
import threading
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

def get_html(url):
    # Print which worker thread handles the request, then fetch the page
    print('thread id:', threading.current_thread().name, ' visited:', url)
    return requests.get(url)

if __name__ == '__main__':
    # A simple use of map. Note: map returns an iterator that yields results in the same order as the inputs
    URLS = ['http://www.baidu.com', 'http://www.qq.com', 'http://www.sina.com.cn', 'https://translate.google.cn', 'https://www.csdn.net']
    # Number of CPU cores, used here as the pool size
    cpu_cores = multiprocessing.cpu_count()
    start = time.time()
    with ThreadPoolExecutor(max_workers=cpu_cores) as ex:
        # Each URL is submitted as a task; at most max_workers tasks run at the same time
        res_iter = ex.map(get_html, URLS)
    # Leaving the with block waits for all tasks, so every result is ready here
    for res in res_iter:
        print('url:%s ,len: %d' % (res.url, len(res.text)))
    end = time.time()
    print('Elapsed: %.2fs' % (end - start))

The output is as follows:

thread id: ThreadPoolExecutor-0_0  visited: http://www.baidu.com
thread id: ThreadPoolExecutor-0_1  visited: http://www.qq.com
thread id: ThreadPoolExecutor-0_2  visited: http://www.sina.com.cn
thread id: ThreadPoolExecutor-0_3  visited: https://translate.google.cn
thread id: ThreadPoolExecutor-0_4  visited: https://www.csdn.net
url:http://www.baidu.com/ ,len: 2381
url:https://www.qq.com/ ,len: 229487
url:https://www.sina.com.cn/ ,len: 528521
url:https://translate.google.cn/ ,len: 183434
url:https://www.csdn.net/ ,len: 389435
Elapsed: 1.47s
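
The results above come back in the same order as the URL list, even though the requests run in parallel. A minimal sketch of that behaviour, with no network involved: slow_square and run_map_demo are hypothetical names made up for this illustration, and time.sleep stands in for a request of varying duration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

finish_order = []  # filled in by the workers as each task completes

def slow_square(n):
    # Hypothetical task: larger inputs sleep less, so they tend to finish first
    time.sleep(0.05 * (4 - n))
    finish_order.append(n)  # list.append is thread-safe in CPython
    return n * n

def run_map_demo():
    inputs = [0, 1, 2, 3]
    with ThreadPoolExecutor(max_workers=4) as ex:
        # map yields results in input order, regardless of completion order
        return list(ex.map(slow_square, inputs))

if __name__ == '__main__':
    print('results     :', run_map_demo())
    print('finish order:', finish_order)
```

The results list follows the inputs, while finish_order depends on timing and will usually differ. If results should instead be handled as soon as each task finishes, the standard library's concurrent.futures.as_completed may be a better fit.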

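The improvement is obvious: the pool ran the 5 requests on 5 threads, and the whole run took under 2 seconds. Because each thread spends most of its time waiting on the network, the waits overlap instead of adding up. With more pages to fetch, the gap becomes even larger.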
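
To see how the gap grows with the number of pages without depending on real sites, here is a rough sketch that simulates each request with a fixed delay. fake_fetch, time_sequential, time_pooled and the example.com URLs are all assumptions made up for this illustration, and the 0.1 s delay is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url, delay=0.1):
    # Stand-in for requests.get: sleeping blocks the thread like a network wait
    time.sleep(delay)
    return url

def time_sequential(urls):
    # Fetch one URL after another and return (results, elapsed seconds)
    start = time.perf_counter()
    results = [fake_fetch(u) for u in urls]
    return results, time.perf_counter() - start

def time_pooled(urls, workers):
    # Fetch through a thread pool and return (results, elapsed seconds)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        results = list(ex.map(fake_fetch, urls))
    return results, time.perf_counter() - start

if __name__ == '__main__':
    # Hypothetical URLs; nothing is actually requested
    urls = ['http://example.com/page/%d' % i for i in range(8)]
    _, seq = time_sequential(urls)
    _, pooled = time_pooled(urls, workers=8)
    print('sequential: %.2fs, pooled: %.2fs' % (seq, pooled))
```

The sequential time grows roughly in proportion to the number of URLs, while the pooled time stays close to a single delay as long as there are enough workers. Real requests vary in latency, so the actual numbers will differ.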
