Scraping Tencent recruitment data with aiohttp and asyncio

I had some free time today, so I practiced with the async libraries I recently learned, this time scraping Tencent's job-posting data. The data is loaded via ajax, so it has to be captured from the background requests. The overall approach: hit the list-page API to get a PostId for each job, then pass that id to the detail-page API and extract the data there, using asyncio together with aiohttp. It runs noticeably faster than the synchronous version, but it is still not perfect: one session would really be enough for all the requests, yet in this program a session gets constructed in two separate places, and I couldn't find a way around it. The code is below; it should be useful for anyone who has studied asyncio and aiohttp.
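Before the full script, here is a minimal, self-contained sketch of the fan-out pattern it relies on: schedule many coroutines at once and collect their results in order with asyncio.gather. fake_fetch is a stand-in for a real aiohttp request, and asyncio.run is the modern wrapper around creating and running an event loop:

```python
import asyncio

# fake_fetch stands in for a network request: it yields control to the
# event loop (like awaiting I/O) and then returns a result.
async def fake_fetch(i):
    await asyncio.sleep(0.01)  # simulated I/O wait
    return i * 2

async def run_all():
    # Schedule all coroutines concurrently; gather preserves input order.
    tasks = [fake_fetch(i) for i in range(5)]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all())
print(results)  # [0, 2, 4, 6, 8]
```

Because every fake_fetch sleeps concurrently, the whole batch takes roughly one sleep's worth of time rather than five, which is exactly why the scraper below beats its synchronous counterpart.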

import aiohttp
import asyncio
import time
"""
异步提取腾讯招聘ajax后台数据
time: 2019年8月18日10:47:00
"""
a = time.time()


class Tecent(object):

    async def get_postid(self, url):
        """
        Fetch the PostId values from a list page, to be passed on
        to the detail-page requests.
        :param url: list-page API url
        :return: list of PostId strings
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp = await resp.json()
                id_list = []
                if resp:
                    # Guard against a missing or empty 'Posts' array
                    # instead of indexing into it blindly.
                    posts = resp.get('Data', {}).get('Posts') or []
                    for row in posts:
                        post_id = row.get('PostId')
                        id_list.append(f'{post_id}')
                # Always return a list so callers never iterate over None
                return id_list

    async def get_json_data(self, url):
        """
        Extract the job fields from a detail page.
        :param url: detail-page API url
        :return: one formatted line of csv-style data
        """
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                resp = await resp.json()
                # The ajax response is plain json, so the fields are easy to pull out
                data = resp.get('Data', {})
                BGName = data.get('BGName')
                CategoryName = data.get('CategoryName')
                LastUpdateTime = data.get('LastUpdateTime')
                LocationName = data.get('LocationName')
                # PostURL = data.get('PostURL')
                RecruitPostName = data.get('RecruitPostName')
                Responsibility = data.get('Responsibility')
                Requirement = data.get('Requirement')

                # Stitch the fields together into one line
                summary = BGName + "|" + CategoryName + "|" + LocationName + "|" + LastUpdateTime
                item = f'{RecruitPostName},{summary},{Responsibility},{Requirement} \n\n'

                return item

    def save_to_csv(self, data):
        """
        Append one formatted line to the output file.
        :param data: the line returned by get_json_data
        """
        with open('aiotecent111.csv', 'a', encoding='utf-8', newline='') as f:
            f.write(data)

    async def main(self):
        list_url = [f'https://careers.tencent.com/tencentcareer/api/post/Query?pageIndex={index}&pageSize=10' for index in range(1, 5)]
        # self is already a Tecent instance; no need to construct another
        tasks = [self.get_postid(url) for url in list_url]
        return await asyncio.gather(*tasks)


if __name__ == '__main__':
    tecent = Tecent()
    loop = asyncio.get_event_loop()  # create the event loop
    results = loop.run_until_complete(tecent.main())  # run the list-page tasks
    detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?postId={0}&language=zh-cn'
    for result in results:
        for post_id in result:
            urls = [detail_url.format(post_id)]
            tasks = [tecent.get_json_data(url) for url in urls]
            # Note: one event loop is enough -- reuse the loop created above.
            # Bind the detail results to a new name so the outer iteration
            # over `results` is not shadowed.
            detail_items = loop.run_until_complete(asyncio.gather(*tasks))
            for item in detail_items:
                tecent.save_to_csv(item)


b = time.time()
print(b-a)


That's the code. If you spot any problems, feel free to raise them so we can all discuss.
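As for the imperfection mentioned at the top: one possible restructuring is to flatten all the PostIds into a single list and run one gather on one event loop, instead of calling run_until_complete once per id; in a real implementation every request would then go through a single shared aiohttp.ClientSession. A hedged, asyncio-only sketch of that idea, where fetch_detail is a placeholder for the real aiohttp detail request:

```python
import asyncio

# fetch_detail is a placeholder for the real aiohttp detail-page request,
# which in a full version would go through one shared ClientSession.
async def fetch_detail(post_id):
    await asyncio.sleep(0)  # stands in for the network round trip
    return f'detail-{post_id}'

async def scrape(page_results):
    # Flatten the per-page id lists into one list and run a single
    # gather, so the event loop is entered exactly once.
    all_ids = [pid for page in page_results for pid in page]
    tasks = [fetch_detail(pid) for pid in all_ids]
    return await asyncio.gather(*tasks)

page_results = [['a1', 'a2'], ['b1']]  # ids as returned per list page
details = asyncio.run(scrape(page_results))
print(details)  # ['detail-a1', 'detail-a2', 'detail-b1']
```

This also removes the need for the two separate session constructions complained about above: with one top-level coroutine, a single `async with aiohttp.ClientSession() as session:` can wrap both the list-page and detail-page requests.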
