Scraping Sina Weibo's mobile site with Python in just a few steps

The Sina Weibo mobile site lives at https://m.weibo.cn

The goal is to scrape all 2019 posts from the Weibo account “锦鲤大王”.

Log in to the Weibo mobile site and find the information we need.

Find the Request URL
[Screenshot 1: locating the Request URL]
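Before writing the spider it is worth checking that the Request URL actually returns JSON. Below is a minimal sketch of such a check; it reuses the uid and containerid that appear in the spider code further down:

import requests

test_url = ("https://m.weibo.cn/api/container/getIndex?uid=3641513235"
            "&type=uid&value=3641513235&containerid=1076033641513235")
r = requests.get(test_url)
print(r.status_code)        # expect 200
print(r.json().get('ok'))   # the mobile API normally reports ok = 1 on success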
Find the user agent and cookie

[Screenshot 2: the user agent and cookie in the request headers]
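With the user agent and cookie copied from the browser, one way to attach them is to send the raw Cookie string as a request header. This is only a sketch of that idea (the spider below instead passes the cookie through requests' cookies parameter), and the placeholder string must be replaced with your own value:

import requests

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/77.0.3865.120 Safari/537.36",
    "cookie": "paste the Cookie value copied from your browser here",  # placeholder
}
r = requests.get("https://m.weibo.cn/api/container/getIndex?uid=3641513235"
                 "&type=uid&value=3641513235&containerid=1076033641513235",
                 headers=headers)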
The details of each post are stored here.

[Screenshot 3: where each post's details sit in the JSON response]
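Each post lives under data → cards → mblog in the JSON the endpoint returns; cards that are not posts have no mblog key. A minimal sketch for inspecting that structure and picking out the fields the spider uses later (attitudes_count, comments_count, reposts_count and so on):

import requests

url = ("https://m.weibo.cn/api/container/getIndex?uid=3641513235"
       "&type=uid&value=3641513235&containerid=1076033641513235")
data = requests.get(url).json()
for card in data['data']['cards']:
    mblog = card.get('mblog')
    if not mblog:        # skip cards that are not posts
        continue
    print(mblog['created_at'],
          mblog['attitudes_count'],
          mblog['comments_count'],
          mblog['reposts_count'])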

With that information we can write the spider. My code is rough and has not been polished further; it is only meant to give anyone who needs it an idea of the approach.

import requests
import csv
import time
import random
import json


def spider(page_num):
    main_url = "https://m.weibo.cn/api/container/getIndex?uid=3641513235&luicode=10000011&" \
               "lfid=231093_-_selffollowed&type=uid&value=3641513235&containerid=1076033641513235"  
    # main_url is the API address of the blogger's profile page
    
    if page_num:
        main_url = main_url + '&page=' + str(page_num)
    # Weibo's pagination returns 10 posts per page
    
    header = {
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/77.0.3865.120 Safari/537.36",
    }   # set the request headers
    
    cookie = {
        'cookies': "paste your own Weibo cookie here"
    }
    
    try:
        r = requests.get(url=main_url, headers=header, cookies=cookie)
        r.raise_for_status()
    except Exception as e:
        print("爬取失败", e)
        return 0
        
    result_json = json.loads(r.content.decode('utf-8'))
    info_list = []  
    for card in result_json['data']['cards']:
        info_list_sub = []
        if card.get("mblog"):
            info_list_sub.append(card['mblog']['attitudes_count'])  # number of likes

            info_list_sub.append(card['mblog']['comments_count'])  # number of comments

            info_list_sub.append(card['mblog']['reposts_count'])    # number of reposts
            

            if page_num == 1:
                info_list_sub.append(card['mblog']['created_at'])  # post time
            elif '2018' not in card['mblog']['created_at']:
                info_list_sub.append(card['mblog']['created_at'])
            else:
                print("Finished scraping the 2019 posts")
                break

            info_list_sub.append(card['mblog']['weibo_position'])    # whether the post is original

            if card['mblog'].get('raw_text'):
                info_list_sub.append(card['mblog']['raw_text'])   # post text
            else:
                info_list_sub.append(card['mblog']['text'])

            # if card['mblog']['source'] == '':
            #     info_list_sub.append(None)
            # else:
            #     info_list_sub.append(card['mblog']['source'])

            time.sleep(random.randint(4, 6))  # pause 4-6 seconds after each post to avoid anti-scraping measures

            info_list.append(info_list_sub)
        else:
            continue
    return info_list


def save_csv(infolist):
    with open('/home/long/Documents/weibo2.csv', 'a+', encoding='utf_8_sig', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(infolist)


def main(num):
    for i in range(1, num+1):
        information = spider(i)
        save_csv(information)
        print("第%s页爬取完毕" % i)


if __name__ == '__main__':
    main(500)   # scrape 500 pages
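save_csv only appends data rows, so the resulting CSV has no column names. If you want a header, one option is to write it once before calling main(); the column names below are my own labels, chosen to match the order in which spider() appends the fields:

import csv

# run once before main(); 'w' truncates any existing file
with open('/home/long/Documents/weibo2.csv', 'w', encoding='utf_8_sig', newline='') as f:
    csv.writer(f).writerow(['likes', 'comments', 'reposts',
                            'created_at', 'is_original', 'text'])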

I hope this helps anyone who needs it; if you run into problems, you can reach me in the comments.
