Crawling Sina Weibo post lists and post content with Scrapy

Source code: GitHub

Reference: blog post

Use the Scrapy framework to crawl a given account's profile information and Weibo posts.

Weibo account follower ranking as of now (January 15, 2019):

[Figure 1: Weibo account follower ranking]

Crawling approach: use the JSON API endpoints behind the web version of Weibo (m.weibo.cn).
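
Before wiring this into the spider, the endpoint can be sanity-checked with a plain HTTP request. The snippet below is not part of the original project, just a quick sketch; exactly which keys appear under `data` depends on what the getIndex API returns for the account.

    import requests

    # Quick check of the getIndex endpoint outside Scrapy
    url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=1195354434'
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    data = resp.json().get('data', {})
    print(list(data.keys()))  # top-level sections, e.g. profile info and tab/containerid data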

1. Override the start_requests method

    def start_requests(self):
        # Seed the crawl with one or more Weibo user IDs
        weibo_id = [1195354434, ]
        for wid in weibo_id:
            url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(wid)
            print(url)
            # Parse the profile first; pass the uid along in meta
            yield Request(url, callback=self.parse_userInfo, dont_filter=True,
                          meta={'uid': str(wid)})

2. Parse the profile information and get the containerid
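
The original post does not include this callback, so the following is only a minimal sketch of what parse_userInfo might look like. The JSON field names (userInfo, tabsInfo, tab_type) are assumptions about the getIndex response rather than code from the source; the post-list URL and the meta keys match what step 3 expects.

    # Sketch of the parse_userInfo callback (not from the original post)
    def parse_userInfo(self, response):
        uid = response.meta['uid']
        data = json.loads(response.text).get('data')

        # Basic profile information (field names are assumptions)
        user_info = data.get('userInfo', {})
        print(user_info.get('screen_name'), user_info.get('followers_count'))

        # The "weibo" tab carries the containerid needed for the post-list request
        containerid = ''
        for tab in data.get('tabsInfo', {}).get('tabs', []):
            if tab.get('tab_type') == 'weibo':
                containerid = tab.get('containerid')

        # Request the first page of the post list (handled in step 3)
        weibo_list_url = ('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid +
                          '&containerid=' + containerid + '&page=1')
        yield Request(weibo_list_url, callback=self.parse_weibo_list,
                      meta={'uid': uid, 'containerid': containerid, 'page': '1'})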

3. Crawl the blogger's posts and the accounts they follow

    # Parse the post list
    def parse_weibo_list(self, response):
        # Pull the metadata needed to request the next page
        next_page = str(int(response.meta['page']) + 1)
        uid = response.meta['uid']
        containerid = response.meta['containerid']

        data = response.text
        content = json.loads(data).get('data')
        cards = content.get('cards')

        if len(cards) > 0:
            print("-----Crawling page %s-----" % str(response.meta['page']))

            for j in range(len(cards)):
                card_type = cards[j].get('card_type')
                # A Weibo post
                # if card_type == 9:
                #     mblog = cards[j].get('mblog')
                #     attitudes_count = mblog.get('attitudes_count')  # number of likes
                #     comments_count = mblog.get('comments_count')  # number of comments
                #     created_at = self.date_format(mblog.get('created_at'))  # publish time
                #     reposts_count = mblog.get('reposts_count')  # number of reposts
                #     scheme = cards[j].get('scheme')  # post URL
                #     # Replace <br /> with newlines, then extract the plain text
                #     text = etree.HTML(str(mblog.get('text')).replace('<br />', '\n')).xpath('string()')  # post content
                #     pictures = mblog.get('pics')  # attached images, returned as a list
                #     pic_urls = []  # holds the image URLs
                #     if pictures:
                #         for picture in pictures:
                #             pic_url = picture.get('large').get('url')
                #             pic_urls.append(pic_url)
                #     uid = response.meta['uid']
                #     # Save the data
                #     sinaitem = SinaItem()
                #     sinaitem["uid"] = uid
                #     sinaitem["text"] = text
                #     sinaitem["scheme"] = scheme
                #     sinaitem["attitudes_count"] = attitudes_count
                #     sinaitem["comments_count"] = comments_count
                #     sinaitem["created_at"] = created_at
                #     sinaitem["reposts_count"] = reposts_count
                #     sinaitem["pictures"] = pic_urls
                #     yield sinaitem
                # Card listing who the user follows
                if card_type == 11:
                    # Get the URL of the accounts this user follows, e.g.
                    # https://m.weibo.cn/p/index?containerid=231051_-_followers_-_1195354434_-_1042015%3AtagCategory_050&luicode=10000011&lfid=1076031195354434
                    # (inspect how that page makes its requests)
                    fllow_url = str(cards[j]['card_group'][0]['scheme']).replace(
                        'https://m.weibo.cn/p/index?', 'https://m.weibo.cn/api/container/getIndex?')
                    print(fllow_url, '----')
                    yield Request(url=fllow_url, callback=self.parse_fllow)

            # Link to the next page of posts
            # weibo_list_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid + '&containerid=' + containerid + '&page=' + next_page
            # response.meta['page'] = next_page
            # yield Request(weibo_list_url, callback=self.parse_weibo_list, meta=response.meta)
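
The commented-out block above fills a SinaItem. The original post does not show items.py, but a definition consistent with the fields used there would look roughly like this (the field names come from the code above; the class itself is a sketch):

    # items.py -- sketch of SinaItem matching the fields saved in parse_weibo_list
    import scrapy

    class SinaItem(scrapy.Item):
        uid = scrapy.Field()
        text = scrapy.Field()
        scheme = scrapy.Field()
        attitudes_count = scrapy.Field()
        comments_count = scrapy.Field()
        created_at = scrapy.Field()
        reposts_count = scrapy.Field()
        pictures = scrapy.Field()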

4. Repeat the process with the IDs of the accounts they follow

    # Get the profile info of the accounts the user follows
    def parse_fllow(self, response):
        data = response.text
        content = json.loads(data).get('data')
        cards = content.get('cards')
        # if len(cards) > 0:
        for card in cards:
            # The card titled "all of their followings"
            if card.get('title') == '他的全部关注':
                for tmp in card.get('card_group'):
                    user = tmp.get('user')
                    # Get the ID of each followed account
                    uid = user.get('id')
                    yield Request('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(uid), callback=self.parse_userInfo, dont_filter=True,
                                  meta={'uid': str(uid)})

Because this process loops back on itself, you need some stopping condition for the crawl to ever finish (assuming your IP is not banned first).
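
Note that the requests above are issued with dont_filter=True, so Scrapy's duplicate filter will not break the cycle for you. One simple way to bound the crawl, as a sketch (the numbers are arbitrary examples), is through settings such as:

    # settings.py -- example limits; tune the numbers to your needs
    DEPTH_LIMIT = 3                 # stop expanding the follow graph beyond this depth
    CLOSESPIDER_PAGECOUNT = 1000    # close the spider after this many responses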

You can also filter for the users you are actually interested in first, and only then crawl their posts.

To avoid getting banned, using proxy IPs is recommended; just add them in a downloader middleware.
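
A minimal sketch of such a middleware is below; the proxy address is a placeholder, and you would still need to enable the class under DOWNLOADER_MIDDLEWARES in settings.py.

    # middlewares.py -- attach a random proxy to every outgoing request
    import random

    class ProxyMiddleware(object):
        # Placeholder list; replace with your own proxy pool
        PROXIES = ['http://127.0.0.1:8888']

        def process_request(self, request, spider):
            request.meta['proxy'] = random.choice(self.PROXIES)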
