Scraping Baidu Wenku Articles with Python's requests Library

Scraping Baidu Wenku with Python

  • 1. Requests
  • 2. Installing requests
  • 3. Code

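I had a course project coming up and found it tedious, so I was looking to cut a few corners. I wanted to download an article from Baidu Wenku, only to find that it requires a paid membership. As a programmer, how could I put up with that?

That would only pile more hardship onto an already not-so-wealthy household. So I'll just have to rely on Python to keep my studies going!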

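1. Requests

Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license. It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs.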
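To give a feel for the work Requests saves, here is a rough sketch of attaching query parameters and a User-Agent header using only the standard library. The URL, parameters, agent string, and the `build_request` helper are made up for illustration, and the request is built but never sent. With Requests, the same intent is a single call: `requests.get(url, params=params, headers=headers)`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, user_agent):
    """Build (but do not send) a GET request with the standard library only."""
    # Query parameters have to be encoded and appended by hand.
    full_url = base_url + "?" + urlencode(params)
    # Headers are passed separately when constructing the Request object.
    return Request(full_url, headers={"User-Agent": user_agent})


# Hypothetical values, chosen only for the demonstration.
req = build_request(
    "https://example.com/search",
    {"q": "python", "page": 2},
    "demo-agent/0.1",
)
print(req.full_url)
print(req.get_method())
```

Requests also takes care of sending the request and decoding the response body, which is why the script in section 3 only needs `.text` to get the page source.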

2. Installing requests

pip install requests -i https://pypi.douban.com/simple

Or follow this tutorial:

https://blog.csdn.net/qq_44176343/article/details/109362134

3. Code

"""
author: 鹏鹏写代码

"""
import requests
import re  # regular expressions

# Pose as a mobile browser so the server returns the mobile version of the page
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Mobile Safari/537.36"
}


def get_num(url):
    response = requests.get(url, headers=headers).text
    # Pull the request parameters out of the page source
    result = re.search(
        r'&md5sum=(.*)&sign=(.*)&rtcs_flag=(.*)&rtcs_ver=(.*?)".*rsign":"(.*?)",', response, re.M | re.I)
    if result is None:
        raise ValueError("request parameters not found; the page layout may have changed")
    reader = {
        "md5sum": result.group(1),
        "sign": result.group(2),
        "rtcs_flag": result.group(3),
        "rtcs_ver": result.group(4),
        "width": 176,
        "type": "org",
        "rsign": result.group(5)
    }

    result_page = re.findall(
        r'merge":"(.*?)".*?"page":(.*?)}', response)  # (merge key, page number) for each page
    # Take the document ID from the URL, with or without a trailing ".html"
    doc_id = re.search(r'/view/([^/.?#]+)', url).group(1)
    doc_url = "https://wkretype.bdimg.com/retype/merge/" + doc_id  # URL prefix
    for start in range(0, len(result_page), 10):  # fetch at most 10 pages per request
        batch = result_page[start:start + 10]  # the last batch may hold fewer than 10 pages
        reader['pn'] = start + 1  # 1-based number of the first page in this batch
        reader['rn'] = len(batch)  # number of pages in this batch
        reader['callback'] = 'sf_edu_wenku_retype_doc_jsonp_%s_%s' % (
            reader['pn'], reader['rn'])
        reader['range'] = '_'.join(k for k, v in batch)
        get_page(doc_url, reader)


def get_page(url, data):
    response = requests.get(url, headers=headers, params=data).text
    response = response.encode(
        'utf-8').decode('unicode_escape')  # turn \uXXXX escapes into readable Chinese text
    response = re.sub(r',"no_blank":true', '', response)  # clean up the data
    result = re.findall(r'c":"(.*?)"}', response)  # match the text fragments
    result = '\n'.join(result)
    print(result)

if __name__ == '__main__':
    url = "http://wenku.baidu.com/view/f732a599db38376baf1ffc4ffe4733687e21fcf9"
    get_num(url)
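The two core steps of the script, grouping pages into batches of ten and pulling text out of the escaped response, can be tried offline. Below is a minimal sketch, assuming made-up page keys and a made-up response fragment in the shape the script's regex expects. `plan_batches` and `extract_text` are hypothetical helper names, not part of the original script, and nothing here contacts Baidu.

```python
import re


def plan_batches(page_keys, batch_size=10):
    """Yield (pn, rn, range) per request: first page (1-based), page count, joined keys."""
    for start in range(0, len(page_keys), batch_size):
        batch = page_keys[start:start + batch_size]
        yield start + 1, len(batch), "_".join(batch)


def extract_text(jsonp_body):
    """Decode backslash-u escapes, then collect every "c":"..." text fragment."""
    decoded = jsonp_body.encode("utf-8").decode("unicode_escape")
    return "\n".join(re.findall(r'c":"(.*?)"}', decoded))


# Made-up page keys standing in for the values scraped from the page source.
keys = ["k%d" % i for i in range(1, 24)]  # pretend the document has 23 pages
batches = list(plan_batches(keys))
print([(pn, rn) for pn, rn, _ in batches])

# Made-up response fragment; real responses are assumed to be much larger.
sample = r'cb({"body":[{"c":"\u4f60\u597d"},{"c":"Python"}]})'
text = extract_text(sample)
print(text)
```

With 23 pages this plans three requests: two full batches of ten and a final batch of three, which is the same split the loop in `get_num` is meant to produce.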

Output of running the full script: [screenshot]
With this, the course project practically does itself. A couple of keyboard shortcuts and it's done!
