21. Multiprocess Crawler

1. Basic usage of multiprocessing:

from multiprocessing import Pool         # import the process-pool class
pool = Pool(processes=4)                 # create a pool of 4 worker processes
pool.map(func, iterable[, chunksize])    # map() runs func across the pool; func is the function to run, iterable supplies its arguments
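A minimal runnable sketch of this API (the function name square and the choice of 4 workers are only for illustration):

from multiprocessing import Pool

def square(x):
    # work function executed inside the worker processes
    return x * x

if __name__ == '__main__':                      # the guard lets child processes import this module safely
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))   # splits the iterable across the 4 workers
    print(results)                              # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]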

2. Multiprocess performance comparison:
Target URL: https://www.qiushibaike.com/text/
Data scraped: user ID, joke text, number of laughs, number of comments
Parsing method: regular expressions
Benchmark: compare the running time of a single thread, 2 processes, and 4 processes

import requests
import re
from multiprocessing import Pool
import time

'''Parse each page with regular expressions and collect the fields; run with a process pool (saving to CSV is sketched after this block)'''
hds = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

def scraper(url):
    r = requests.get(url, headers=hds)
    # The HTML tags inside the original regex patterns were lost in formatting;
    # the selectors below are reconstructed from the page structure and may need adjusting.
    ids = re.findall('<h2>(.*?)</h2>', r.text, re.S)
    contents = re.findall('<div class="content">.*?<span>(.*?)</span>', r.text, re.S)
    laughs = re.findall('<span class="stats-vote"><i class="number">(.*?)</i>', r.text, re.S)
    comments = re.findall('<i class="number">(.*?)</i> 评论', r.text, re.S)
    infos = []
    for id, content, laugh, comment in zip(ids, contents, laughs, comments):
        info = {
            'id': id,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        infos.append(info)
    return infos

if __name__ == '__main__':      # this guard is required for multiprocessing; without it an error is raised
    '''Build the list of page URLs'''
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]

    start1 = time.time()
    for url in url_list:
        scraper(url)
    stop1 = time.time()
    print("Single thread took", stop1 - start1)

    start2 = time.time()
    pool = Pool(processes=2)
    pool.map(scraper, url_list)
    stop2 = time.time()
    print("2 processes took", stop2 - start2)

    start3 = time.time()
    pool = Pool(processes=4)    # the original mistakenly reused processes=2 here
    pool.map(scraper, url_list)
    stop3 = time.time()
    print("4 processes took", stop3 - start3)
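The comment at the top of the block mentions saving the results to a CSV file, which the code above never actually does. A minimal sketch of that step with the standard csv module, assuming scraper() returns a list of dicts as written above and using a made-up file name qiushibaike.csv:

import csv

def save_to_csv(rows, filename='qiushibaike.csv'):
    # rows: a flat list of the dicts produced by scraper()
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['id', 'content', 'laugh', 'comment'])
        writer.writeheader()
        writer.writerows(rows)

# Example usage inside the __main__ block above:
#     results = pool.map(scraper, url_list)                    # one list of dicts per page
#     save_to_csv([row for page in results for row in page])   # flatten, then write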

Results:

Single thread took 2.73215651512146
2 processes took 2.477141857147217
4 processes took 2.3501341342926025

The speedup is modest, most likely because the work is dominated by network I/O over only 13 pages and starting the pool adds its own overhead.
