Use gevent to fetch the data returned by asynchronous requests. Type "hello" into the Youdao dictionary search box and press Enter, then watch the network requests to see how Youdao builds its URL.
# Building the URL only requires the word itself
headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser User-Agent (adjust as needed)

def fetch_word_info(word):
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)
    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()
    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()
    return {'word': word, 'phonetic': pros, 'definition': description}
The requests library is blocking: one request must finish completely before the next one can start, and there is no official way to make it asynchronous. That is why the gevent monkey patch is used here; it replaces the blocking functions in the standard library with cooperative versions.
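A minimal sketch of what the monkey patch buys us, independent of requests or Youdao: after `gevent.monkey.patch_all()`, a blocking call such as `time.sleep` yields to other greenlets, so five 0.2-second "tasks" finish in roughly 0.2 seconds total instead of 1 second. The `task` function below is a made-up stand-in for a network call:

```python
import gevent.monkey
gevent.monkey.patch_all()  # patch blocking stdlib calls before anything else runs

import time
import gevent

def task(n):
    # stands in for a blocking network request; after patching,
    # time.sleep cooperatively yields to the other greenlets
    time.sleep(0.2)
    return n * n

start = time.time()
jobs = [gevent.spawn(task, n) for n in range(5)]
gevent.joinall(jobs)
elapsed = time.time() - start

results = [job.value for job in jobs]
print(results)   # [0, 1, 4, 9, 16]
print(elapsed)   # close to 0.2, not 5 * 0.2
```

Without the patch, the same script would take about one second, because each `time.sleep` would block the whole process.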
import requests
from pyquery import PyQuery as pq
import gevent
import time
import gevent.monkey
gevent.monkey.patch_all()

words = ['good', 'bad', 'cool',
         'hot', 'nice', 'better',
         'head', 'up', 'down',
         'right', 'left', 'east']

def synchronous():
    start = time.time()
    print('Synchronous run started')
    for word in words:
        print(fetch_word_info(word))
    end = time.time()
    print("Synchronous run time: %s seconds" % str(end - start))

# Run the synchronous version
synchronous()
import requests
from pyquery import PyQuery as pq
import gevent
import time
import gevent.monkey
gevent.monkey.patch_all()

words = ['good', 'bad', 'cool',
         'hot', 'nice', 'better',
         'head', 'up', 'down',
         'right', 'left', 'east']

def asynchronous():
    start = time.time()
    print('Asynchronous run started')
    events = [gevent.spawn(fetch_word_info, word) for word in words]
    wordinfos = gevent.joinall(events)
    for wordinfo in wordinfos:
        # get() retrieves the greenlet's return value
        print(wordinfo.get())
    end = time.time()
    print("Asynchronous run time: %s seconds" % str(end - start))

# Run the asynchronous version
asynchronous()
Asynchronous access to the target site speeds things up dramatically. Here we fetch 12 words, which means 12 requests hit the site almost simultaneously; that is harmless. But if you crawled 10,000+ words, gevent would dump all those requests on the site within a few seconds, and the site might well ban the crawler.
The fix is to split the word list into sublists and crawl in batches. As an example, take a list of the numbers 0-19 and split it evenly into 4 parts, so each sublist holds 5 numbers. Below are list-chunking solutions I found on Stack Overflow:
Method 1
sequence = list(range(20))
size = 5  # length of each sublist
output = [sequence[i:i+size] for i in range(0, len(sequence), size)]
print(output)
Method 2
seq = list(range(20))
chunks = lambda seq, size: [seq[i:i+size] for i in range(0, len(seq), size)]
print(chunks(seq, 5))
Method 3
def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i+size]

seq = list(range(20))
print(chunks(seq, 5))  # a generator object, not a list
for x in chunks(seq, 5):
    print(x)
With a small amount of data, any of the three works. With very large data, method 3 is recommended: as a generator it yields one chunk at a time instead of building every sublist in memory up front.
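A quick way to see the difference: method 3 returns a generator, so chunks are produced only on demand, while methods 1 and 2 build the whole list of sublists immediately. A small standard-library check:

```python
import types

def chunks(seq, size):
    # method 3 from above: a lazy, generator-based chunker
    for i in range(0, len(seq), size):
        yield seq[i:i+size]

seq = list(range(20))
gen = chunks(seq, 5)

# no chunk exists yet; each next() produces exactly one sublist
first = next(gen)
remaining = list(gen)

print(isinstance(gen, types.GeneratorType))  # True
print(first)      # [0, 1, 2, 3, 4]
print(remaining)  # the other three chunks
```

For a 10,000-word crawl this means only the current batch of words lives in memory as a sublist at any moment.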
import requests
from pyquery import PyQuery as pq
import gevent
import time
import gevent.monkey
gevent.monkey.patch_all()

headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser User-Agent (adjust as needed)

words = ['good', 'bad', 'cool',
         'hot', 'nice', 'better',
         'head', 'up', 'down',
         'right', 'left', 'east']

def fetch_word_info(word):
    url = "http://dict.youdao.com/w/eng/{}/".format(word)
    resp = requests.get(url, headers=headers)
    doc = pq(resp.text)
    pros = ''
    for pro in doc.items('.baav .pronounce'):
        pros += pro.text()
    description = ''
    for li in doc.items('#phrsListTab .trans-container ul li'):
        description += li.text()
    return {'word': word, 'phonetic': pros, 'definition': description}

def asynchronous(words):
    start = time.time()
    print('Asynchronous run started')
    chunks = lambda seq, size: [seq[i:i+size] for i in range(0, len(seq), size)]
    for subwords in chunks(words, 3):
        events = [gevent.spawn(fetch_word_info, word) for word in subwords]
        wordinfos = gevent.joinall(events)
        for wordinfo in wordinfos:
            # get() retrieves the greenlet's return value
            print(wordinfo.get())
        time.sleep(1)  # pause between batches so we don't hammer the site
    end = time.time()
    print("Asynchronous run time: %s seconds" % str(end - start))

asynchronous(words)