44. Font-based anti-scraping: course summary and Shixiseng crawler assignment

1. URL: https://www.shixiseng.com/intern/inn_a7xabqqr4f9u

2. Obfuscated text: the salary field.

3. Font location: in the @font-face rule in the page source.
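Because the font is embedded in the @font-face rule as a base64 data URL, the first step is to pull that payload out of the page source and decode it. The following is a minimal sketch of that step only. It assumes the rule has the same form as the regular expression used later in this post; `SAMPLE_HTML`, `FONT_RE`, and `extract_font_bytes` are hypothetical names, and the payload is a placeholder byte string rather than a real font, so the snippet runs without network access.

```python
import base64
import re

# Hypothetical page fragment. The payload is just b"fake-font-data" encoded
# as base64, standing in for the real font file embedded in the page.
SAMPLE_HTML = (
    '<style>@font-face {font-family:myFont; '
    'src:url("data:application/octet-stream;base64,'
    + base64.b64encode(b"fake-font-data").decode("ascii")
    + '")}</style>'
)

# Assumed shape of the rule; the real page may differ in spacing or MIME type.
FONT_RE = re.compile(
    r'font-family:\s*myFont;\s*src:\s*url\('
    r'"data:application/octet-stream;base64,(.+?)"\)'
)


def extract_font_bytes(html):
    """Return the decoded bytes of the embedded font, or None if not found."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


font_bytes = extract_font_bytes(SAMPLE_HTML)
print(font_bytes)
```

With a real page, the returned bytes could be wrapped in `io.BytesIO` and passed to `TTFont`. Returning `None` when nothing matches makes it easy to notice that the site has changed its markup.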

Example code

import base64
import io
import re

import requests
from fontTools.ttLib import TTFont

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Referer": "https://www.shixiseng.com/interns?k=python&p=1",
}

# font_face is a base64-encoded string; once decoded, it is a font file.
font_face = '...'  # base64 string omitted

# So we base64-decode it to recover the font file.
font_bytes = io.BytesIO(base64.b64decode(font_face))

# With the font bytes in hand, TTFont gives us an object for working with the font.
baseFont = TTFont(font_bytes)

# Get the table that holds the shape (outline) of every glyph.
baseGlyf = baseFont['glyf']

# Map each digit to its glyph shape in this reference font.
baseFontMap = {
    0: baseGlyf['uni30'],
    1: baseGlyf['uni31'],
    2: baseGlyf['uni32'],
    3: baseGlyf['uni33'],
    4: baseGlyf['uni34'],
    5: baseGlyf['uni35'],
    6: baseGlyf['uni36'],
    7: baseGlyf['uni37'],
    8: baseGlyf['uni38'],
    9: baseGlyf['uni39'],
}

# Fetch the page.
url = "https://www.shixiseng.com/intern/inn_a7xabqqr4f9u"
resp = requests.get(url, headers=headers)
text = resp.text

# Extract the font file embedded in the current page.
result = re.search(r'font-family:myFont; src:url\("data:application/octet-stream;base64,(.+?)"\)', text)
if result is None:
    raise ValueError("embedded font not found in page source")
font_face = result.group(1)
b = base64.b64decode(font_face)
currentFont = TTFont(io.BytesIO(b))

# Get the glyph shapes of the current page's font.
currentGlyf = currentFont['glyf']

# Get the mapping from character code to glyph name.
codeNameMap = currentFont.getBestCmap()

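# Loop over every code and glyph name in the current font.
for code, name in codeNameMap.items():
    # Shape of this glyph in the current page's font.
    currentShape = currentGlyf[name]
    # Loop over the digit-to-shape dictionary.
    for number, shape in baseFontMap.items():
        # If the reference shape equals the current shape,
        # we have found which digit this code stands for.
        if shape == currentShape:
            # Build the code as it appears in the page source, i.e. "&#x" + hex digits.
            webcode = "&#x{:x}".format(code)
            # Replace that code in the page text with the digit.
            text = text.replace(webcode, str(number))

print(text)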
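The matching and replacement steps can also be tried in isolation, without fontTools or a network connection. The sketch below is not the code above but a simplified stand-in: outlines are plain tuples of points instead of glyph objects, all code points and coordinates are made up, and `build_code_map` and `decode_entities` are hypothetical helper names. It also assumes that each obfuscated character appears in the page as `&#x` followed by exactly four hex digits, optionally ending in a semicolon.

```python
import re


def build_code_map(reference_shapes, page_shapes):
    """Map each page code point to the digit whose outline it shares.

    reference_shapes: {digit: outline}
    page_shapes:      {code_point: outline}
    Outlines are compared for exact equality.
    """
    code_to_digit = {}
    for code, outline in page_shapes.items():
        for digit, reference in reference_shapes.items():
            if outline == reference:
                code_to_digit[code] = digit
                break
    return code_to_digit


def decode_entities(text, code_to_digit):
    """Replace every '&#xXXXX' reference whose code point is in the map."""
    def swap(match):
        code = int(match.group(1), 16)
        digit = code_to_digit.get(code)
        # References we cannot map are left untouched.
        return match.group(0) if digit is None else str(digit)

    # Assumption: four hex digits, optional trailing semicolon.
    return re.sub(r"&#x([0-9a-fA-F]{4});?", swap, text)


# Made-up outlines standing in for the reference font's digit glyphs.
reference_shapes = {
    0: ((0, 0), (10, 0), (10, 20)),
    1: ((5, 0), (5, 20)),
    5: ((0, 0), (8, 0), (8, 10), (0, 20)),
}
# Made-up code points standing in for the current page's font.
page_shapes = {
    0xE4D6: ((5, 0), (5, 20)),
    0xF399: ((0, 0), (8, 0), (8, 10), (0, 20)),
    0xE83C: ((0, 0), (10, 0), (10, 20)),
}

code_to_digit = build_code_map(reference_shapes, page_shapes)
sample = "&#xe4d6&#xf399&#xe83c-&#xF399&#xE83C;/day &#x4e00"
print(decode_entities(sample, code_to_digit))  # prints: 150-50/day &#x4e00
```

Doing the substitution in a single `re.sub` pass with a callback avoids rescanning the whole page once per digit, and it handles upper-case and lower-case hex the same way. Exact equality only works if the site reuses identical outlines under new code points; if the outlines were perturbed between requests, a tolerance-based comparison would be needed instead.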

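Previous article: Chapter 5, Advanced Crawling (43), 2020-03-01:

https://www.jianshu.com/p/ec139926c1dc

Next article: Chapter 6, The Scrapy Framework (1), 2020-03-03:

https://www.jianshu.com/p/7f8de4d4bae2

The material above comes from the internet and is for learning and discussion only. If it infringes your rights, message me privately and I will remove it. Thank you.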

Chapter 5, Advanced Crawling (44), 2020-03-02
