Web Scraping in Practice 2, Continued: Batch-Crawling Weibo Posts, Comments, and Replies

Recap and Problem Statement

In the previous article, we wrote code to fully crawl all comments and replies under a single post:

Web Scraping in Practice 2: Weibo Comments and Replies (艽野尘梦better's blog, CSDN) — https://blog.csdn.net/qq_45270849/article/details/131220538?spm=1001.2014.3001.5502

In this article, we will modify and extend that code to fully crawl the content, comments, and replies of every post returned when searching for a keyword.

Main Approach

In the previous article, crawling a post's comments and replies required only the post's non-numeric ID. So here, all we need to add is a step that collects the non-numeric IDs of every post shown after searching for a keyword; we can then loop the previous code over those IDs.

Inspecting the network traffic shows that a post's non-numeric ID appears only in the page's raw HTML, specifically in the href of an <a> tag inside a <div> with class="from", so we need to parse the page and extract it.
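
As a minimal sketch of that extraction (the hypothetical helper extract_mblogids just mirrors what the full code below does; the class="from" markup and href format are taken from the current page layout and may change if Weibo updates its HTML):

import re
from bs4 import BeautifulSoup

def extract_mblogids(html):
    # Pull the non-numeric post IDs (mblogid) out of one search-result page
    soup = BeautifulSoup(html, 'lxml')
    ids = []
    for div in soup.find_all('div', attrs={'class': 'from'}):
        # href looks like //weibo.com/1644948230/MCH47FjIg?refer_flag=...
        m = re.search(r'//weibo\.com/(\d+)/([^?/]+)', div.a['href'])
        if m:
            ids.append(m.group(2))  # group(1) is the author's numeric uid
    return ids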

Notes:

1. The displayed comment count (1,176 in our example) is not necessarily how many you can actually fetch; some comments may have been filtered out.

2. After you crawl a certain number of comments, Weibo may suspend your cookies for a while and return a 400 error; during that window comments will not load even in the browser, so you have to pause for a while before continuing (see the backoff sketch after this list).

3. Every post has its own unique numeric ID, and every comment and reply also has its own unique ID; each reply record additionally carries the ID of the comment it replies to. Keeping these key fields makes it easy to join the tables later (a join example follows at the end of the article).

4. Posts are stored in a separate DataFrame that includes the numeric post ID, so it can be joined with the other two tables.

5. Each post therefore has both a non-numeric ID and a numeric ID.
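
For note 2, here is a minimal backoff sketch (the hypothetical helper get_with_backoff is not part of the full code below; the 300-second pause mirrors the one used there, but the actual ban duration is an assumption and may need tuning):

import time
import requests

def get_with_backoff(url, cookies, headers, params, pause=300, max_tries=3):
    # Retry a blocked request after a pause; temporary cookie bans usually clear on their own
    session = requests.session()
    for attempt in range(max_tries):
        resp = session.get(url, cookies=cookies, headers=headers, params=params, timeout=30)
        if resp.status_code == 200:
            return resp
        print("got {}, sleeping {}s (attempt {})".format(resp.status_code, pause, attempt + 1))
        time.sleep(pause)
        session = requests.session()  # start a fresh session after the pause
    return None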


 

Complete Code

Below is the complete code for batch-crawling posts, comments, and replies:

import requests
import json
import pandas as pd
#from hyper.contrib import HTTP20Adapter
import time
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup
import re


def error_process(sessions, url, headers, cookies, data):  # handle a blocked IP by retrying through a proxy
    # the proxy endpoint is expected to return a plain "ip:port" string
    proxy = requests.get("http://jshk.com.cn").text
    proxies = {
        'http': 'http://' + proxy,
        'https': 'https://' + proxy,
    }
    response = sessions.get(url, cookies=cookies, proxies=proxies, headers=headers, params=data)
    status = 1 if response.status_code == 200 else 0
    return (status, response)
    
 
cookies = {
    "note": "use your own cookies here"
}
 
headers = {
    #'authority': 'weibo.com',
    #'method': 'GET',
    #'path': '/ajax/statuses/buildComments?is_reload=1&id=4892220119847228&is_show_bulletin=2&is_mix=0&count=10&uid=1644948230&fetch_level=0',
    #'scheme': 'https',
    "accept": "application/json, text/plain, */*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6",
    "client-version": "v2.40.60",
    "referer": "https://weibo.com/1644948230/MCH47FjIg",
    "sec-ch-ua": "\"Microsoft Edge\";v=\"113\", \"Chromium\";v=\"113\", \"Not-A.Brand\";v=\"24\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"Windows\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin",
    "server-version": "v2023.06.02.2",
    "traceparent": "00-642df2976cbf1de971f7f7c3bb7bde24-386cfa439c76373c-00",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.50",
    "x-requested-with": "XMLHttpRequest",
    "x-xsrf-token": "UhCx8FLK9vBWS8xrQkRQ9r56"
}

# Collect the non-numeric post IDs, which are used to build each post's URL
url = 'https://s.weibo.com/weibo?q=%E8%8B%B9%E6%9E%9C&page='  # search URL for a keyword (here "苹果")
sessions = requests.session()
blogids = []
comments_list = []
replys_list = []
for i in range(1, 2):  # how many pages to crawl; search pages are numbered from 1
    response = sessions.get(url + str(i), cookies=cookies, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    divs = soup.find_all(name='div', attrs={'class': 'from'})
    for div in divs:
        href = div.a.attrs['href']
        groups = re.search(r'//weibo\.com/(\d+)/([^?/]+)', href)
        if groups:  # skip hrefs that don't match the expected pattern
            blogid = groups.group(2)  # non-numeric post ID; group(1) is the author's numeric uid
            blogids.append(blogid)
blogs_list = []
# With all target post IDs collected, crawl each post
for mblogid in blogids:
    # Fetch the post content
    data = {
        'id': mblogid
    }
    url = 'https://weibo.com/ajax/statuses/show'
    sessions = requests.session()
    sessions.mount('https://', HTTPAdapter(max_retries=3))  # maximum number of retries
    #sessions.mount(url, HTTP20Adapter())
    response = sessions.get(url, cookies=cookies, headers=headers, params=data)
    #print(response.url)
    post = json.loads(response.text)  # parse the JSON once instead of once per field
    blog = post['text_raw']  # post text (truncated for long posts)
    blog_id = post['id']  # numeric post ID
    uid = post['user']['id']  # author's user ID
    attitudes_count = post['attitudes_count']  # number of likes
    comments_count = post['comments_count']  # number of comments
    reposts_count = post['reposts_count']  # number of reposts
    screen_name = post['user']['screen_name']  # author's screen name
    print("post text crawled")
    try:
        url = 'https://weibo.com/ajax/statuses/longtext'
        response = sessions.get(url, cookies=cookies, headers=headers, params=data)
        blog_all = json.loads(response.text)['data']['longTextContent']  # full text of a long post
        print("full text crawled")
    except:
        print("short post, no expanded full text")
        blog_all = blog
    blogs_list.append([mblogid, blog_id, uid, screen_name, attitudes_count, comments_count, reposts_count, blog_all])
    # Crawl the comments
    data = {
        'is_reload': 1,
        'id': blog_id,  # numeric post ID
        'is_show_bulletin': 2,
        'is_mix': 0,
        'count': 10,
        'uid': uid,
        'fetch_level': 0,  # 0 = top-level comments
    }
    url = 'https://weibo.com/ajax/statuses/buildComments'
    while True:
        if comments_count == 0:
            break
        response = sessions.get(url, cookies=cookies, headers=headers, params=data, timeout=30)
        if response.status_code != 200:
            print(response.status_code)
            if response.status_code == 504:
                status, response = error_process(sessions, url, headers, cookies, data)
            else:
                # likely a temporary cookie ban (e.g. 400): wait it out, then retry with a fresh session
                time.sleep(300)
                sessions = requests.session()
                response = sessions.get(url, cookies=cookies, headers=headers, params=data)
                status = 1
        else:
            status = 1
        if status == 1:
            response = json.loads(response.text)
        else:
            break
        total_number = response['total_number']  # total comment count reported by the API
        comments = response['data']
        commentmax_id = response['max_id']  # cursor for the next page; 0 means this is the last page
        for i in range(len(comments)):
            blog_ID = blog_id  # numeric post ID
            comment_ID = comments[i]['id']  # comment ID, key field
            create_time = comments[i]['created_at']  # comment creation time
            comment_like_counts = comments[i]['like_counts']  # likes on the comment
            comment_likedbyauthor = comments[i]['isLikedByMblogAuthor']  # whether the post author liked it
            commenter_id = comments[i]['user']['idstr']  # commenter's account ID
            commenter_location = comments[i]['user']['location']  # commenter's hometown
            commenter_name = comments[i]['user']['screen_name']  # commenter's screen name
            commenter_gender = comments[i]['user']['gender']  # commenter's gender
            try:
                commenter_ip = comments[i]['source'][2:]  # commenter's IP region; strips the leading "来自" ("from")
            except:
                commenter_ip = 'N/A'
            #follow_me = comments[i]['user']['follow_me']  # whether the commenter is followed by the post author
            followers_count = comments[i]['user']['followers_count']  # follower count
            #following = comments[i]['user']['following']  # whether the commenter follows the post author
            friends_count = comments[i]['user']['friends_count']  # friend count
            #planet_video = comments[i]['user']['planet_video']
            comment_content = comments[i]['text_raw']  # comment text
            reply_num = comments[i]['total_number']  # number of replies to this comment
            comments_list.append([blog_ID, comment_ID, create_time, comment_like_counts, comment_likedbyauthor, commenter_id,
                                  commenter_location, commenter_name, commenter_gender, commenter_ip,
                                  followers_count, friends_count, comment_content, reply_num])
            if reply_num > 0:  # the comment has replies
                reply_n = 0  # count of replies fetched for this comment
                data = {
                    'is_reload': 1,
                    'id': comment_ID,  # comment ID
                    'is_show_bulletin': 2,
                    'is_mix': 1,
                    'fetch_level': 1,  # 1 = replies under a comment
                    'max_id': 0,
                    'count': 20,
                    'uid': uid,
                }
                while True:
                    response = sessions.get(url, cookies=cookies, headers=headers, params=data)
                    response = json.loads(response.text)
                    try:
                        replys = response['data']
                    except:
                        break
                    reply_number = response['total_number']  # reply count reported by the API
                    replymax_id = response['max_id']  # cursor for the next page of replies
                    for j in range(len(replys)):  # use j so we don't shadow the comment loop's index
                        blog_ID = blog_id  # numeric post ID
                        comment_ID = replys[j]['rootid']  # ID of the root comment
                        reply_object_ID = replys[j]['reply_comment']['id']  # ID of the comment being replied to
                        reply_ID = replys[j]['id']  # reply ID, key field
                        create_time = replys[j]['created_at']  # reply creation time
                        reply_like_counts = replys[j]['like_counts']  # likes on the reply
                        replyer_id = replys[j]['user']['idstr']  # replier's account ID
                        replyer_location = replys[j]['user']['location']  # replier's hometown
                        replyer_name = replys[j]['user']['screen_name']  # replier's screen name
                        replyer_gender = replys[j]['user']['gender']  # replier's gender
                        try:
                            replyer_ip = replys[j]['source'][2:]  # replier's IP region; strips the leading "来自"
                        except:
                            replyer_ip = 'N/A'
                        #follow_me = replys[j]['user']['follow_me']  # whether the replier is followed by the post author
                        followers_count = replys[j]['user']['followers_count']  # follower count
                        #following = replys[j]['user']['following']  # whether the replier follows the post author
                        friends_count = replys[j]['user']['friends_count']  # friend count
                        reply_content = replys[j]['text_raw']  # reply text
                        replys_list.append([blog_ID, comment_ID, reply_object_ID, reply_ID, create_time, reply_like_counts, replyer_id,
                                            replyer_location, replyer_name, replyer_gender, replyer_ip,
                                            followers_count, friends_count, reply_content])
                        reply_n += 1
                    if replymax_id == 0:
                        print("finished crawling replies, {} in total".format(reply_n))
                        break
                    data = {
                        'flow': 0,
                        'is_reload': 1,
                        'id': comment_ID,  # comment ID
                        'is_show_bulletin': 2,
                        'is_mix': 0,
                        'fetch_level': 1,
                        'max_id': replymax_id,
                        'count': 20,
                        'uid': uid,
                    }
            print("爬到{}条评论了".format(len(comments_list)))
            if len(comments_list)%50==0:
                print("休息一会")
                #time.sleep(5)
                #if len(comments_list)%300==0:
                    #time.sleep(300)
                    #sessions=requests.session()
                    #sessions.mount(url, HTTP20Adapter())
        if commentmax_id == 0:  # no more pages of comments
            break
        data = {
            'flow': 0,
            'is_reload': 1,
            'id': blog_id,  # numeric post ID
            'is_show_bulletin': 2,
            'is_mix': 0,
            'max_id': commentmax_id,
            'count': 20,
            'uid': uid,
            'fetch_level': 0,
        }
    
blogs_data = pd.DataFrame(blogs_list, columns=['mblogid', 'post_id', 'author_id', 'author_name', 'likes', 'comments', 'reposts', 'post_text'])
comments_data = pd.DataFrame(comments_list, columns=['post_id', 'comment_id', 'created_at', 'comment_likes', 'liked_by_author',
                                                     'commenter_id', 'commenter_location', 'commenter_name', 'commenter_gender', 'commenter_ip',
                                                     'followers_count', 'friends_count', 'comment_text', 'reply_count'])
replys_data = pd.DataFrame(replys_list, columns=['post_id', 'comment_id', 'reply_target_id', 'reply_id', 'created_at', 'reply_likes',
                                                 'replyer_id', 'replyer_location', 'replyer_name', 'replyer_gender', 'replyer_ip',
                                                 'followers_count', 'friends_count', 'reply_text'])
#blogs_data.to_csv(r"",index=None,encoding='utf-8-sig')
#comments_data.to_csv(r"",index=None,encoding='utf-8-sig')
#replys_data.to_csv(r"",index=None,encoding='utf-8-sig')
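
As promised in the notes, the three tables join cleanly because every comment row carries the numeric post ID and every reply row carries the ID of its root comment. A minimal sketch using the DataFrames built above:

# Attach post metadata to every comment via the numeric post ID,
# then attach each reply to the text of its root comment via the comment ID
comments_full = comments_data.merge(blogs_data, on='post_id', how='left')
replys_full = replys_data.merge(comments_data[['comment_id', 'comment_text']],
                                on='comment_id', how='left')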
