Concurrent Crawler Patterns with aiohttp and asyncio

Recently I needed to write a crawler for a site, so I tried implementing it with aiohttp and asyncio, drawing on some material I found online.

Round 1: the asynchronous version runs just like synchronous code

The code is as follows:

import asyncio
import random

import aiohttp
from lxml import etree

# FBASE_URL and parse() are defined elsewhere in the original script

async def fetch_get(session, url):
    # random delay to avoid hammering the site; it must be awaited or it does nothing
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    # fetch and process one detail page (omitted)
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        for url in urls:
            await result_get(session, url)  # awaiting inside the loop processes the URLs one by one

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())

Running it, you find that the URLs are still fetched one by one; there is no concurrency at all, because each await result_get(...) inside the for loop blocks until that coroutine finishes before the next one starts.
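To make the difference visible, here is a minimal sketch of my own (no aiohttp; asyncio.sleep stands in for a network request, and fake_fetch is a hypothetical helper) comparing awaiting in a loop with scheduling everything at once:

import asyncio
import time

async def fake_fetch(i):
    await asyncio.sleep(1)  # stands in for one network request
    return i

async def sequential():
    start = time.monotonic()
    for i in range(5):
        await fake_fetch(i)  # each await finishes before the next starts
    print('sequential:', time.monotonic() - start)  # roughly 5 seconds

async def concurrent():
    start = time.monotonic()
    await asyncio.gather(*(fake_fetch(i) for i in range(5)))  # all five run together
    print('concurrent:', time.monotonic() - start)  # roughly 1 second

loop = asyncio.get_event_loop()
loop.run_until_complete(sequential())
loop.run_until_complete(concurrent())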

Round 2: add all the tasks first, then run them concurrently

async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    # print('get:', url)
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with aiohttp.ClientSession() as session:
        shelf_text = await fetch_get(session, FBASE_URL)
        shelf_html = etree.HTML(shelf_text)
        urls = parse(shelf_html)
        tasks = []
        for url in urls:
            # ensure_future schedules the coroutine right away without waiting for it
            task = asyncio.ensure_future(result_get(session, url))
            tasks.append(task)
        # wait until every scheduled task has finished
        await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())

In this pattern, if urls contains many entries, tasks keep getting added without any limit, so a very large number of requests can be in flight at the same time.
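One common way to cap the number of in-flight requests without batching is to acquire a Semaphore around each request. This is only a sketch of that idea; the limit of 10 and the helper name bounded_get are my own choices, not from the original code:

import asyncio
import aiohttp

sem = asyncio.Semaphore(10)  # at most 10 requests in flight at once (arbitrary limit)

async def bounded_get(session, url):
    async with sem:  # blocks here once 10 requests are already running
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def main(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(bounded_get(session, u)) for u in urls]
        return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
results = loop.run_until_complete(main(['http://example.com/'] * 3))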

Round 3: add tasks in batches and run them concurrently

# allow at most 30 concurrent acquisitions in total
sem = asyncio.Semaphore(30)

async def fetch_get(session, url):
    await asyncio.sleep(random.randint(3, 6))
    async with sem:
        async with session.get(url) as response:
            return await response.text(encoding='utf-8')

async def result_get(session, url):
    pass

async def fetch_main():
    async with sem:
        async with aiohttp.ClientSession() as session:
            shelf_text = await fetch_get(session, FBASE_URL)
            shelf_html = etree.HTML(shelf_text)
            urls = parse(shelf_html)
            tasks = []
            part_tasks = []
            for index, url in enumerate(urls):
                task = asyncio.ensure_future(result_get(session, url))
                tasks.append(task)
                part_tasks.append(task)
                # after every 15 URLs, wait for that batch, then pause before adding more
                if (index + 1) % 15 == 0:
                    await asyncio.wait(part_tasks)
                    await asyncio.sleep(240)
                    part_tasks = []
            # wait for whatever is left over from the final, partial batch
            await asyncio.wait(tasks)

loop = asyncio.get_event_loop()
loop.run_until_complete(fetch_main())

This is the pattern I am currently using: after every 15 URLs are added, that batch runs asynchronously, followed by a 240-second pause before the next batch starts, with the total number of concurrent connections capped at 30 by the semaphore.
If one async task collects URLs and stores them in a database while another async task pulls URLs from the database to fetch their results, the result-fetching task can run a while loop that repeatedly takes a batch of URLs from the database, adds them as concurrent tasks, and stops once every URL in the database has been processed, as in the sketch below.
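A minimal sketch of that producer/consumer idea, with an asyncio.Queue standing in for the database, fetch_result as a hypothetical per-URL worker, and the example.com URLs purely illustrative (the batch size of 15 mirrors the code above):

import asyncio
import aiohttp

url_queue = asyncio.Queue()  # stands in for the "pending URLs" table in the database

async def produce_urls(session):
    # crawl index pages and push the discovered URLs (crawling details omitted)
    for url in ['http://example.com/page/%d' % i for i in range(100)]:
        await url_queue.put(url)

async def fetch_result(session, url):
    async with session.get(url) as response:
        return await response.text(encoding='utf-8')

async def consume_urls(session, batch_size=15):
    # keep pulling batches until the queue (i.e. the database) is empty
    while not url_queue.empty():
        batch = []
        while len(batch) < batch_size and not url_queue.empty():
            batch.append(url_queue.get_nowait())
        tasks = [asyncio.ensure_future(fetch_result(session, u)) for u in batch]
        await asyncio.wait(tasks)  # finish this batch before pulling the next one

async def main():
    async with aiohttp.ClientSession() as session:
        await produce_urls(session)
        await consume_urls(session)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())

In a real setup the producer and consumer would run concurrently (for example via asyncio.gather) against an actual database; here they run one after the other to keep the sketch short.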
