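In a recent project I had to call the Amap (高德地图) API frequently, and looping over the calls in a single thread turned out to be too slow. After reading up on concurrency in Python, I settled on ThreadPoolExecutor (a thread pool).
First, as a baseline, here is how long it takes to fetch several web pages one after another in a single thread: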
import time
import threading

import requests


def get_html(url):
    # Log which thread handles the request, then fetch the page.
    print('thread id:', threading.current_thread().name, 'visited:', url)
    return requests.get(url)


if __name__ == '__main__':
    URLS = ['http://www.baidu.com', 'http://www.qq.com', 'http://www.sina.com.cn', 'https://translate.google.cn', 'https://www.csdn.net']
    start = time.time()
    # Fetch the pages one by one; each request blocks until it finishes.
    for url in URLS:
        response = get_html(url)
        print('url:%s ,len: %d' % (response.url, len(response.text)))
    end = time.time()
    print('Elapsed: %.2fs' % (end - start))

The output is below; fetching the 5 pages in a single thread took about 5 seconds:

thread id: MainThread visited: http://www.baidu.com
url:http://www.baidu.com/ ,len: 2381
thread id: MainThread visited: http://www.qq.com
url:https://www.qq.com/ ,len: 229062
thread id: MainThread visited: http://www.sina.com.cn
url:https://www.sina.com.cn/ ,len: 527374
thread id: MainThread visited: https://translate.google.cn
url:https://translate.google.cn/ ,len: 183434
thread id: MainThread visited: https://www.csdn.net
url:https://www.csdn.net/ ,len: 401478
Elapsed: 5.31s
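The sequential total is essentially the sum of the individual request waits, since each request blocks the loop until it returns. A minimal, network-free sketch of that effect follows. The `fake_fetch` helper, the `.example` URLs, and the latency values are all hypothetical stand-ins, with `time.sleep` playing the role of a blocking `requests.get` call.

```python
import time

# Hypothetical per-URL latencies in seconds; stand-ins for real network waits.
FAKE_LATENCY = {
    'http://a.example': 0.05,
    'http://b.example': 0.10,
    'http://c.example': 0.15,
}


def fake_fetch(url):
    """Simulate a blocking request by sleeping for the URL's latency."""
    time.sleep(FAKE_LATENCY[url])
    return url


def fetch_sequentially(urls):
    """Fetch URLs one by one; return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = [fake_fetch(url) for url in urls]
    return results, time.perf_counter() - start


if __name__ == '__main__':
    results, elapsed = fetch_sequentially(list(FAKE_LATENCY))
    print('sum of latencies: %.2fs' % sum(FAKE_LATENCY.values()))
    print('sequential elapsed: %.2fs' % elapsed)
```

The two printed numbers should come out close to each other, which is why the single-threaded time grows roughly linearly with the number of pages.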
Next, here is the same crawl using a thread pool:
import time
import threading
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

import requests


def get_html(url):
    # Log which worker thread handles the request, then fetch the page.
    print('thread id:', threading.current_thread().name, 'visited:', url)
    return requests.get(url)


if __name__ == '__main__':
    # Simple use of map. Note: map returns an iterator, and results come back in input order.
    URLS = ['http://www.baidu.com', 'http://www.qq.com', 'http://www.sina.com.cn', 'https://translate.google.cn', 'https://www.csdn.net']
    # Use the number of CPU cores as the pool size.
    cpu_cores = multiprocessing.cpu_count()
    start = time.time()
    with ThreadPoolExecutor(max_workers=cpu_cores) as ex:
        # Each URL is submitted as one task; at most max_workers tasks run at the same time.
        res_iter = ex.map(get_html, URLS)
        for res in res_iter:
            print('url:%s ,len: %d' % (res.url, len(res.text)))
    end = time.time()
    print('Elapsed: %.2fs' % (end - start))

The output is as follows:

thread id: ThreadPoolExecutor-0_0 visited: http://www.baidu.com
thread id: ThreadPoolExecutor-0_1 visited: http://www.qq.com
thread id: ThreadPoolExecutor-0_2 visited: http://www.sina.com.cn
thread id: ThreadPoolExecutor-0_3 visited: https://translate.google.cn
thread id: ThreadPoolExecutor-0_4 visited: https://www.csdn.net
url:http://www.baidu.com/ ,len: 2381
url:https://www.qq.com/ ,len: 229487
url:https://www.sina.com.cn/ ,len: 528521
url:https://translate.google.cn/ ,len: 183434
url:https://www.csdn.net/ ,len: 389435
Elapsed: 1.47s
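The improvement is clear: the 5 pages were fetched by 5 worker threads (the pool size equals the CPU core count, which was at least 5 on this machine) in under 2 seconds, versus about 5 seconds sequentially. Because the requests are I/O-bound, the gap should grow as more pages are fetched.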
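Since `map` hands results back in input order, one slow page can hold up the printing of faster ones, and an exception from any task surfaces while iterating. Where that matters, `submit` combined with `as_completed` is a common alternative. The sketch below is only an illustration under assumptions: `fake_fetch`, the task names, the latencies, and the simulated `'broken'` failure are hypothetical, and `time.sleep` stands in for the real network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical latencies in seconds; the first task is deliberately the slowest.
FAKE_LATENCY = {'slow': 0.30, 'medium': 0.15, 'fast': 0.05}


def fake_fetch(name):
    """Simulate a blocking request; 'broken' stands in for a failing URL."""
    if name == 'broken':
        raise ValueError('simulated request failure')
    time.sleep(FAKE_LATENCY[name])
    return name


def fetch_in_input_order(names, workers=4):
    """ex.map yields results in the order of the inputs."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(fake_fetch, names))


def fetch_as_completed(names, workers=4):
    """submit + as_completed yields futures as they finish; errors are kept per task."""
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as ex:
        future_to_name = {ex.submit(fake_fetch, n): n for n in names}
        for future in as_completed(future_to_name):
            name = future_to_name[future]
            try:
                done.append(future.result())
            except ValueError as exc:
                # One failed task does not stop the others from being collected.
                failed[name] = str(exc)
    return done, failed


if __name__ == '__main__':
    print('map order:', fetch_in_input_order(['slow', 'medium', 'fast']))
    done, failed = fetch_as_completed(['slow', 'medium', 'fast', 'broken'])
    print('completion order:', done)
    print('failed:', failed)
```

With real requests, the same pattern would catch `requests.RequestException` instead of `ValueError`, so a single unreachable site would not abort the whole batch.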