Hello! On this lonely afternoon, come take a look at this fairly lightweight article, "Scraping a Novel with aiohttp".
First, a quick word on synchronous vs. asynchronous. This is just my personal take, so here's an example.
Synchronous: you go buy instant noodles, and after picking them you have to wait while the shopkeeper rings you up; only once the bill is settled can you leave the shop.
Asynchronous: you're making instant noodles at night. You put the water on first, and while waiting for it to boil you can tear open the packaging and the sauce packets; then once the water boils, just pour it in.
A broke single dog like me only has instant noodles...
A quick word on coroutines first: a coroutine is a lightweight thread of sorts that cuts down on context-switch overhead, and so on and so forth... but that's not our focus; this article is mainly about aiohttp.
Let's look at a simple example of using a coroutine. It prints hello and "ugly enough to scare a little girl to tears".
Using a coroutine
import asyncio

async def word():
    print('hello')
    print('ugly enough to scare a little girl to tears')

asyncio.run(word())
# hello
# ugly enough to scare a little girl to tears
To run multiple tasks as coroutines, you need get_event_loop() to drive the loop and gather(), which accepts a list of awaitables. There is also the create_task() approach; a sketch of it follows the example below.
import asyncio

async def word():
    print('hello')
    print('ugly enough to scare a little girl to tears')

task1 = word()
task2 = word()
task3 = word()
task4 = word()
task5 = word()

tasks = [
    asyncio.ensure_future(task1),
    asyncio.ensure_future(task2),
    asyncio.ensure_future(task3),
    asyncio.ensure_future(task4),
    asyncio.ensure_future(task5),
]

# get_event_loop() outside a running loop is the pre-Python-3.10 style;
# see the create_task() sketch below for the modern equivalent
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
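As for the create_task() route mentioned above: it only works inside a running loop, so it pairs naturally with asyncio.run(). A minimal sketch of the same five tasks (my own rewrite, not from the original code):

import asyncio

async def word():
    print('hello')
    print('ugly enough to scare a little girl to tears')

async def main():
    # create_task() schedules each coroutine on the already-running loop
    tasks = [asyncio.create_task(word()) for _ in range(5)]
    await asyncio.gather(*tasks)

asyncio.run(main())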
Back to aiohttp
The target here is a novel site with no anti-scraping measures, so there's no need for a session; we just fire requests directly. In aiohttp, that means using aiohttp.request():
import asyncio

import aiohttp

url = 'http://www.shuquge.com/txt/115748/index.html'

async def get_index():
    # fetch the index page
    async with aiohttp.request('GET', url=url) as response:
        print(response.status)  # 200
        text = await response.text()
        return text

asyncio.run(get_index())
The await in front of the text() call is there because, as the source code shows, text() is itself a coroutine, and calling a coroutine requires await.
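By the way, although this site doesn't need one, the aiohttp docs recommend a shared ClientSession when you make many requests, since it reuses the connection pool. A minimal sketch of the session variant (same url as above):

import asyncio

import aiohttp

url = 'http://www.shuquge.com/txt/115748/index.html'

async def get_index_with_session():
    # one session for the whole crawl; it keeps connections alive
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

asyncio.run(get_index_with_session())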
Next, use a regex to extract the table of contents, build each chapter's URL, and download the chapter content. (get_index() also gets reworked to return (url, name) pairs; see the full code below.)
async def download_content(page_url, page_name):
    async with aiohttp.request('GET', url=page_url) as response:
        # aiohttp takes the encoding as an argument to text(),
        # rather than via a settable response.encoding attribute
        content = await response.text(encoding='utf-8')
    # the tag pattern was lost in publishing; this is an assumed
    # reconstruction of the chapter-body markup, adjust to the real HTML
    result = re.findall(r'<div id="content".*?>(.*?)</div>', content, re.S)
    result = ''.join(result)
    with open(page_name + '.txt', mode='w', encoding='utf-8') as f:
        f.write(result)

async def main():
    index = await get_index()
    for page_url, page_name in index:
        page_url = 'http://www.shuquge.com/txt/115748/' + page_url
        await download_content(page_url, page_name)
And just like that, the whole novel is downloaded. The full code:
import asyncio
import re
import time

import aiohttp

url = 'http://www.shuquge.com/txt/115748/index.html'

async def get_index():
    # fetch the index page and pull out (chapter_url, chapter_name) pairs;
    # the tag pattern was lost in publishing, so this <dd><a ...> pattern
    # is an assumed reconstruction of the site's index markup
    async with aiohttp.request('GET', url=url) as response:
        text = await response.text()
    result = re.findall(r'<dd><a href="(.*?)">(.*?)</a></dd>', text)
    return result

async def download_content(page_url, page_name):
    async with aiohttp.request('GET', url=page_url) as response:
        content = await response.text(encoding='utf-8')
    # assumed reconstruction of the chapter-body markup (see above)
    result = re.findall(r'<div id="content".*?>(.*?)</div>', content, re.S)
    result = ''.join(result)
    with open(page_name + '.txt', mode='w', encoding='utf-8') as f:
        f.write(result)

async def main():
    index = await get_index()
    for page_url, page_name in index:
        page_url = 'http://www.shuquge.com/txt/115748/' + page_url
        await download_content(page_url, page_name)

start_time = time.time()
asyncio.run(main())
print('elapsed', time.time() - start_time)
# elapsed 26.709527492523193
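One caveat: main() above awaits download_content() one chapter at a time, so those 26.7 seconds are still essentially sequential. To actually exploit aiohttp, hand all the downloads to gather() at once. A sketch of a concurrent main(), reusing get_index() and download_content() from above (the Semaphore and its limit of 10 are my own assumption, there to avoid hammering the site):

import asyncio

async def bounded_download(sem, page_url, page_name):
    # the semaphore caps how many chapters are in flight at once
    async with sem:
        await download_content(page_url, page_name)

async def main():
    sem = asyncio.Semaphore(10)  # assumed cap, tune to taste
    index = await get_index()
    await asyncio.gather(*(
        bounded_download(sem, 'http://www.shuquge.com/txt/115748/' + page_url, page_name)
        for page_url, page_name in index
    ))

asyncio.run(main())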