Python Web Scraping (Week 8)

1. Font Anti-Scraping

Font anti-scraping, introduced through a case study on Qidian (起点中文网).

Goal: from https://www.qidian.com/rank/yuepiao/ , get the book titles and monthly-ticket (月票) counts on Qidian's monthly-ticket ranking.

Capturing the traffic shows that the book titles and monthly-ticket counts we need are HTML data inside the "yuepiao/" response, so we use etree from lxml and parse with XPath.

import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target URL
    url_ = 'https://www.qidian.com/rank/yuepiao/'

    # 2. Build the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text

    # Check that the response we got is correct
    with open('qidian.html', 'w', encoding='utf-8') as f:
        f.write(data_)

Note: Qidian is a large site, so make the request headers as complete as possible, and check whether the response object actually contains the data we need.

On inspection, the data we need is in the response, so the next step is to extract it. Since it is HTML, the key is working out the XPath expressions. Analyze before extracting: one page holds 20 books, so each extraction should also return 20 results.

Book title XPath: //h4/a/text()

Monthly-ticket count XPath: //span/span/text() , or //span[@class="IuAmFihj"]/text()

Note: the second XPath works when debugging in the browser, but when the program runs in PyCharm it extracts nothing, because the span's class attribute value changes on every visit to the site.

import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target URL
    url_ = 'https://www.qidian.com/rank/yuepiao/'

    # 2. Build the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text

    # # Check that the response we got is correct
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)

    # 4. Parse the data: book titles and monthly-ticket counts
    html_obj = etree.HTML(data_)
    book_list = html_obj.xpath('//h4/a/text()')
    num_list = html_obj.xpath('//span/span/text()')
    print(book_list)
    print(num_list)

Following the normal flow, we should now have both the titles and the counts, but printing the extracted data shows the following:

['夜的命名术', '不科学御兽', '我有一棵神话树', '我就是不按套路出牌', '从红月开始', '我的云养女友', '大梦主', '深空彼岸', '这个人仙太过正经', '斗罗大陆V重生唐三', '仙狐', '大奉打更人', '星门', '人族镇守使', '东晋北府一丘八', '我只能和S级女神谈恋爱', '我真不想看见bug', '稳住别浪', '赤心巡天', '全职艺术家']
['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

The book titles display fine, but the monthly-ticket counts are all garbled. This is what we call font anti-scraping.

The font anti-scraping measures we ran into:

1. the span's class attribute value changes on every visit to the site;
2. without special handling, the real data cannot be obtained.

Analysis:

Copying the monthly-ticket count straight off the page, as a normal user would, gives:

月票

Even a normal user copying the number gets mojibake, so an ordinary scraper certainly cannot obtain the real data.

The markup for the monthly-ticket count found in the yuepiao/ packet:

𘛽𘛽𘛿𘜀𘜄月票

The digits appear to have been replaced with characters like 𘛽.
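In the raw HTML these characters arrive as decimal numeric character references (`&#...;`) pointing into a private-use Unicode range. A quick standard-library check of the round trip (the code point 100544 used here is one of the values from the font's mapping table discussed below):

```python
import html

# '&#100544;' is a decimal numeric character reference; unescape()
# turns it into the single private-use character chr(100544)
ch = html.unescape('&#100544;')
print(ord(ch))        # 100544 -- the decimal code point
print(hex(ord(ch)))   # 0x188c0 -- the same value in hexadecimal
```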

Under Network → Font there are three woff files; files of this kind are exactly what is used for font obfuscation.

So which of the three do we actually need?

Using the element picker (the arrow at the top left of DevTools) on a monthly-ticket number jumps to its tag; the class attribute value there matches the name of one of the three woff files, so we guess that is the file being used.

Downloading the woff file:

1. double-click it in DevTools to download it;
2. send a request with Python code and save the response.

Opening it, we find the woff file's contents cannot be read directly; we need the third-party library fontTools, which we install ourselves:

pip install fonttools

Usage:

from fontTools.ttLib import TTFont
# Create the object; the argument is the obfuscated font file
font_obj = TTFont('FryVjKMa.woff')

# Convert the format
font_obj.saveXML('font.xml')

Note: the package is installed as fonttools but imported as fontTools.

After converting, search font.xml for cmap and you will find (subtable attributes elided here):

<cmap>
  <tableVersion version="0"/>
  <cmap_format_12 ...>
    <map code="0x188c0" name="eight"/>
    <map code="0x188c2" name="one"/>
    <map code="0x188c3" name="zero"/>
    <map code="0x188c4" name="three"/>
    <map code="0x188c5" name="period"/>
    <map code="0x188c6" name="four"/>
    <map code="0x188c7" name="two"/>
    <map code="0x188c8" name="nine"/>
    <map code="0x188c9" name="six"/>
    <map code="0x188ca" name="five"/>
    <map code="0x188cb" name="seven"/>
  </cmap_format_12>
</cmap>
This is the conversion rule for the obfuscated font: each map tag records one correspondence (together they form the mapping table).
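Before automating this, the map tags can be read by hand with the standard library. A minimal sketch over an embedded miniature fragment in the same shape as font.xml (the three entries are abbreviated from the table above, not the full file):

```python
import xml.etree.ElementTree as ET

# A miniature cmap fragment shaped like the one in font.xml
cmap_xml = '''
<cmap_format_12>
  <map code="0x188c0" name="eight"/>
  <map code="0x188c2" name="one"/>
  <map code="0x188c3" name="zero"/>
</cmap_format_12>
'''

root = ET.fromstring(cmap_xml)
# Build {decimal code point: glyph name} from each <map> tag
table = {int(m.get('code'), 16): m.get('name') for m in root.iter('map')}
print(table)  # {100544: 'eight', 100546: 'one', 100547: 'zero'}
```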

Reading it, we can roughly guess: 0x188c0 maps to 8, 0x188c2 maps to 1, ... (the 0x prefix means hexadecimal).

Converting them to decimal:

print(int(0x188c0))  # 100544
print(int(0x188c2))  # 100546

These match the numeric forms in the monthly-ticket markup we found in the yuepiao/ packet:

𘛽𘛽𘛿𘜀𘜄月票

(in the raw HTML those characters are &#1005xx;-style numeric character references).

So we can conclude: 0x188c0 maps to 8; 0x188c0 in decimal is 100544; therefore 100544 maps to 8.

Now that the mapping logic is clear, how do we build the full lookup table quickly?

from fontTools.ttLib import TTFont

# Create the object; the argument is the obfuscated font file
font_obj = TTFont('FryVjKMa.woff')

# Convert the format
font_obj.saveXML('font.xml')

# Get the mapping table from the map nodes
res_ = font_obj.getBestCmap()
print(res_)

'''
{100544: 'eight', 100546: 'one', 100547: 'zero', 100548: 'three', 100549: 'period', 
 100550: 'four', 100551: 'two', 100552: 'nine', 100553: 'six', 100554: 'five', 
 100555: 'seven'}
'''

Looking at the result, getBestCmap automatically converted the hexadecimal codes to decimal for us and returned the correspondences as a dict.
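With that dict, decoding a run of &#...; references takes just two lookups per character. A minimal sketch (the cmap dict is copied from the output above; decode is a helper name of our own, not part of fontTools):

```python
import re

# Mapping returned by getBestCmap(), copied from the run above
cmap = {100544: 'eight', 100546: 'one', 100547: 'zero', 100548: 'three',
        100549: 'period', 100550: 'four', 100551: 'two', 100552: 'nine',
        100553: 'six', 100554: 'five', 100555: 'seven'}

# Glyph name -> character ('period' is the decimal point)
names = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4',
         'five': '5', 'six': '6', 'seven': '7', 'eight': '8', 'nine': '9',
         'period': '.'}

def decode(entities):
    """Decode a run like '&#100546;&#100547;' into '10'."""
    return ''.join(names[cmap[int(c)]] for c in re.findall(r'\d+', entities))

print(decode('&#100546;&#100547;&#100547;'))  # prints 100
```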

When actually scraping, downloading the obfuscated font file by hand is impractical, so we download it with code.

Searching the yuepiao/ packet's response for woff shows that the font file's URL is embedded in the response itself.

XPath: //p/span/style/text()

This returns 20 results because the page has 20 books; every book points at the same woff file, so taking any one of them is enough.

import json
import re
from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Confirm the target URL
    url_ = 'https://www.qidian.com/rank/yuepiao/'

    # 2. Build the request headers
    headers_ = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
        'Referer': 'https://www.qidian.com/rank/'
    }
    # 3. Send the request and get the response
    response_ = requests.get(url_, headers=headers_)
    data_ = response_.text

    # # Check that the response we got is correct
    # with open('qidian.html', 'w', encoding='utf-8') as f:
    #     f.write(data_)

    # 4. Parse the data: font file, book titles, monthly-ticket counts
    html_obj = etree.HTML(data_)
    # Get the book titles
    book_list = html_obj.xpath('//h4/a/text()')

    # Get the obfuscated font file
    str_ = html_obj.xpath('//p/span/style/text()')[0]
    '''
    @font-face { font-family: khQtDpBC; src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot'); 
    src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') 
    format('truetype'); } .khQtDpBC { font-family: 'khQtDpBC' !important;     
    display: initial !important; color: inherit !important; vertical-align: initial !important; }
    '''
    # Extract the font file's URL from that CSS
    font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
    # Request the font URL to get the file itself
    font_response = requests.get(font_url, headers=headers_)
    # Save the obfuscated font file
    with open('font.woff', 'wb') as f:
        f.write(font_response.content)
    # Parse the obfuscated font file
    font_obj = TTFont('font.woff')
    # Convert it to a readable XML file
    font_obj.saveXML('font.xml')
    # Get the mapping table from the map nodes (hex -> decimal keys)
    res_ = font_obj.getBestCmap()
    # Convert the English number names in the mapping to Arabic digits
    dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6', 'seven': '7', 'eight': '8',
             'nine': '9', 'zero': '0'}
    for i in res_:
        for j in dict_:
            if res_[i] == j:
                res_[i] = dict_[j]

    # Get the monthly-ticket counts (𘛽𘛽𘛿𘜀𘜄 form, i.e. &#...; references in the raw HTML);
    # the capture sits between the inline <style> block and the 月票 text
    num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>月票', data_)
    # Strip the &# wrappers, keeping only the decimal code points
    list_ = []
    for i in num_:
        list_.append(re.findall(r'\d+', i))
    # Replace each code point with its single Arabic digit
    for i in list_:
        for j in enumerate(i):
            for k in res_:
                if j[1] == str(k):
                    i[j[0]] = res_[k]
    # Join lists like ['7', '6', '2', '1', '2'] into one string
    for i, j in enumerate(list_):
        new = ''.join(j)
        list_[i] = new

    # 5. Save the book titles and their monthly-ticket counts
    with open('起点中文网月榜.json', 'a', encoding='utf-8') as f:
        for i in range(len(book_list)):
            book_dict = {}
            book_dict[book_list[i]] = list_[i]
            json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
            f.write(json_data)
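As a side note, the triple-nested replacement loop above can be collapsed into a single lookup table keyed by the code point as a string. A sketch under the same variable names (raw here is hypothetical sample data, not values from the site):

```python
# res_ as produced by the mapping step: {code point: digit string}
res_ = {100544: '8', 100546: '1', 100547: '0'}

# One combined lookup table, keyed by the code point as a string
lookup = {str(k): v for k, v in res_.items()}

# Hypothetical extracted code-point runs for two books
raw = [['100546', '100547'], ['100544', '100546']]
numbers = [''.join(lookup[c] for c in row) for row in raw]
print(numbers)  # ['10', '81']
```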

2. Paginating the Example

import json
import re
import time

from fontTools.ttLib import TTFont
import requests
from lxml import etree

if __name__ == '__main__':
    for page in range(1, 6):
        # 1. Confirm the target URL
        url_ = f'https://www.qidian.com/rank/yuepiao/page{page}'

        # 2. Build the request headers
        headers_ = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
            'Cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_19%22%2C%22eid%22%3A%22%22%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C45%22%2C%22l1%22%3A5%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; _csrfToken=FJAYOKmb5GpRuB6mdxwLXF1sDkKqgTL0z5gG7Ana; newstatisticUUID=1613732256_1917024121; _yep_uuid=adb684fd-87c1-4108-391c-f50ab9ac0d5c; _gid=GA1.2.180413774.1628410724; e1=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22qd_A16%22%2C%22l1%22%3A3%7D; e2=%7B%22pid%22%3A%22qd_p_qidian%22%2C%22eid%22%3A%22%22%7D; _ga_FZMMH98S83=GS1.1.1628410723.1.1.1628410744.0; _ga_PFYW0QLV3P=GS1.1.1628410723.1.1.1628410744.0; _ga=GA1.2.707336986.1628410723',
            'Referer': 'https://www.qidian.com/rank/'
        }
        # 3. Send the request and get the response
        response_ = requests.get(url_, headers=headers_)
        data_ = response_.text

        # # Check that the response we got is correct
        # with open('qidian.html', 'w', encoding='utf-8') as f:
        #     f.write(data_)

        # 4. Parse the data: font file, book titles, monthly-ticket counts
        html_obj = etree.HTML(data_)
        # Get the book titles
        book_list = html_obj.xpath('//h4/a/text()')

        # Get the obfuscated font file
        str_ = html_obj.xpath('//p/span/style/text()')[0]
        '''
        @font-face { font-family: khQtDpBC; src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.eot?') format('eot'); 
        src: url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.woff') format('woff'), url('https://qidian.gtimg.com/qd_anti_spider/khQtDpBC.ttf') 
        format('truetype'); } .khQtDpBC { font-family: 'khQtDpBC' !important;     
        display: initial !important; color: inherit !important; vertical-align: initial !important; }
        '''
        # Extract the font file's URL from that CSS
        font_url = re.findall(r" format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", str_)[0]
        # Request the font URL to get the file itself
        font_response = requests.get(font_url, headers=headers_)
        # Save the obfuscated font file
        with open('font.woff', 'wb') as f:
            f.write(font_response.content)
        # Parse the obfuscated font file
        font_obj = TTFont('font.woff')
        # Convert it to a readable XML file
        font_obj.saveXML('font.xml')
        # Get the mapping table from the map nodes (hex -> decimal keys)
        res_ = font_obj.getBestCmap()
        # Convert the English number names in the mapping to Arabic digits
        dict_ = {'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6', 'seven': '7', 'eight': '8',
                 'nine': '9', 'zero': '0'}
        for i in res_:
            for j in dict_:
                if res_[i] == j:
                    res_[i] = dict_[j]

        # Get the monthly-ticket counts (𘛽𘛽𘛿𘜀𘜄 form, i.e. &#...; references in the raw HTML);
        # the capture sits between the inline <style> block and the 月票 text
        num_ = re.findall(r'</style><span class=".*?">(.*?)</span></span>月票', data_)
        # Strip the &# wrappers, keeping only the decimal code points
        list_ = []
        for i in num_:
            list_.append(re.findall(r'\d+', i))
        # Replace each code point with its single Arabic digit
        for i in list_:
            for j in enumerate(i):
                for k in res_:
                    if j[1] == str(k):
                        i[j[0]] = res_[k]
        # Join lists like ['7', '6', '2', '1', '2'] into one string
        for i, j in enumerate(list_):
            new = ''.join(j)
            list_[i] = new

        # 5. Save the book titles and their monthly-ticket counts
        with open('起点中文网月榜.json', 'a', encoding='utf-8') as f:
            for i in range(len(book_list)):
                book_dict = {}
                book_dict[book_list[i]] = list_[i]
                json_data = json.dumps(book_dict, ensure_ascii=False) + ',\n'
                f.write(json_data)

        # 6. Slow down the request rate
        time.sleep(1)
