58同城字体反爬

对那些被编成乱码的文字进行爬取。次卧(龤室)     餼閏㎡<(次卧3室 15平方米),,你能看出来吗

所以我们要去破解这些乱七八糟的数据

先了解一下 StringIO and BytesIO

  • StringIO

很多时候,数据读写不一定是文件,也可以在内存中读写。

StringIO顾名思义就是在内存中读写str。

要把str写入StringIO,我们需要先创建一个StringIOf = StringIO(),然后,像文件一样写入即可,getvalue()方法用于获得写入后的str。

  • BytesIO

StringIO操作的只能是str,如果要操作二进制数据,就需要使用BytesIO。

BytesIO实现了在内存中读写bytes,我们创建一个BytesIO,然后写入一些bytes,

请注意,写入的不是str,而是经过UTF-8编码的bytes。

和StringIO类似,可以用一个bytes初始化BytesIO,然后,像读文件一样读取:

  • 通过这种方法去读取数据比读取文件快多了,,就像动车,和步行。。。

字体反爬原理

  1. 网页开发者自己创造一种字体,因为在字体上每个文字都有其代号,那么以后在网页中不会直接显示这个文字最终效果,而显示其他的代号,因此即使获得了网页的文本内容,也只是获得文本的代号,而不是文字本身,

  2. 因为创造字体费时耗力,并且如果把中国3000多个常用汉字都实现,那么这个字体将达到几十兆也会影响网页加载,所以一般情况下为了反爬虫,仅会针对0~9以及少数汉字进行自己独特的创新,接下来还是使用用户系统的自带字体。

解决方法:寻找字体

  1. 一般情况下为了考虑网页渲染性能,通常网页开发者会将自填编码成base64的方式,因此我们可以在网页源代码中寻找到 @font-face 中的url 函数进行加载。

58同城字体反爬_第1张图片
就是这些东西,,,,拿下来后,进行解码变成字体

  1. 如果没有使用base64,还有一种方式就是直接把字体文件放到服务器上,然后前端通过 @font-face 的url 函数进行加载。
@font-face {
  font-family: "Open Sans";
  src: url("/fonts/OpenSans-Regular-webfont.woff2") format("woff2"),
       url("/fonts/OpenSans-Regular-webfont.woff") format("woff");
}

就类似是这样的,这种的直接下下来就好,还不用解码。

分析字体

  1. 分析字体需要将字体转换成xml 文件,然后查看其中的 cmap 和glyf 中的属性。其中 cmap储存的是code 和 name 的映射,而glyf 下储存的是每个name 下的字体绘制规则。。。

  2. 从第一步我们知道 name 对应的字体的绘制规则,但是还是不知道这个字体长啥样。可以通过一个FontCreator 的软件来打开 .tff 的字体文件,这样就可以看到每个name 对应的字体最终的呈现。。。。(又是付费的,,我选择破解版。。)
    58跟人家不一样,他是code对应字体绘制规则。。。

from fontTools.ttLib import TTFont    # 需要下载
import base64,io

# 复制下来的base64 字符串
font_face = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8p/XQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQVTWFHAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOg21wxfDzz1AAsIAAAAAADY4wsbAAAAANjjCxsAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACQACAAoABQAHAAMACAABAAQABgAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAJAACVjwAAlY8AAAACAACZPAAAmTwAAAAKAACaSwAAmksAAAAFAACeOgAAnjoAAAAHAACeowAAnqMAAAADAACfZAAAn2QAAAAIAACfkgAAn5IAAAABAACfpAAAn6QAAAAEAACfpQAAn6UAAAAGAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA"
onlineFront = TTFont(io.BytesIO(base64.b64decode(font_face)))
 # 解码后以二进制形式放到管道中,在传到TTFont 类中,也可以保存为二进制文件在去读取,,就是慢

onlineFront.saveXML('58.xml')  # 保存为xml 格式
onlineFront.save('58.ttf')  # 保存为字体格式的文件

FontTools的使用

  • 读取字体:ont = TTFont('hhh.ttf')
  • 各节点名称:font.keys()
  • 获取getGlyphOrder节点name值:`font.getGlyphOrder()
  • 获取cmap节点code与name值映射 print(font.getBestCmap())

根据映射关系获取真实内容

  1. 在网页中,直接显示的是字体的code,而不是name.并且网页开发者为了增加爬虫的难度,有可能在多次请求之间code -> name-> 最终字体之间的映射会改变,但是最终字体的形状是不会改变的。因此我们可以通过形状对比来进行判断,,,
  2. 我们可以通过分析字体,得出每个字体形状对应的文字,然后保存到一个字典中,以后再请求网页的时候,就进行反向解析,先获取字体的形状,在通过字体形状反向获取带后所对应的具体文字内容。
import requests,re
import base64
from fontTools.ttLib import TTFont
import io

# 1. 解析数据,获得 xml 文件来找映射关系(这里使用慢的方式)
font_face = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8Z/YQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQZECttAAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOC5QrhfDzz1AAsIAAAAAADaxHAuAAAAANrEcC4AAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACQAHAAUAAwAKAAEABgACAAQACAAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAJAACVjwAAlY8AAAAHAACZPAAAmTwAAAAFAACaSwAAmksAAAADAACeOgAAnjoAAAAKAACeowAAnqMAAAABAACfZAAAn2QAAAAGAACfkgAAn5IAAAACAACfpAAAn6QAAAAEAACfpQAAn6UAAAAIAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
b = base64.b64decode(font_face) # 解析
# with open('123.ttf','wb') as f: # 二进制打开
#     f.write(b)
basefont = TTFont('123.ttf')  # ttf 文件时不可读的,转成xml 才能看懂
# basefont.saveXML('123.xml')


glyf = basefont['glyf']           # 获得字体形状,通过name 来获取对应的形状
baseFontMap = {                   # 建立映射关系,文字和形状
    0: glyf['glyph00001'],	      
    1: glyf['glyph00002'],  # 这些关系就是找规律,从源代码里对两三个就行。。。。。
    2: glyf['glyph00003'],
    3: glyf['glyph00004'],
    4: glyf['glyph00005'],
    5: glyf['glyph00006'],
    6: glyf['glyph00007'],
    7: glyf['glyph00008'],
    8: glyf['glyph00009'],
    9: glyf['glyph00010'],
}



# 从网络上拿网页源代码,获取code -> name -> 文字形状 之间的映射关系
url = 'https://qd.58.com/chuzu'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"}

resp = requests.get(url,headers=headers)
text = resp.text
font_face = re.search("@font-face.+base64,(.+?)'\)",text).group(1)  # 得到字体数据,可以保存成。ttf 文件使用时在读取,但是效率低

font_bytes= io.BytesIO(base64.b64decode(font_face))   # 就是把解码后的二机制数据放到内存的管道中,等直接使用就行 
currentFont = TTFont(font_bytes)  # 传入,拿到字体

# 找code -> name 的关系
codenameMap = currentFont.getBestCmap()  # 字典,得到十进制的code,但网页源代码中时十六进制的。。
currentGlyf = currentFont['glyf']
for code,name in codenameMap.items():
   # print({'code':hex(code),'name':name})   # 得到code与name的对应关系
    currentShape = currentGlyf[name]
    for number,shape in baseFontMap.items():
        if currentShape == shape:
            # print({'code':hex(code),'number':number})    # 得到code与实际数字的关系了
            webcode = str(hex(code)).replace("0", "&#", 1) + ";"     # 改成网页源代码中的格式
            text = re.sub(webcode, str(number), text)    # 改变网页源代码就得到没有编码的形式了
with open("58_1.html",'w',encoding='utf-8') as fp:
    fp.write(text)

那么fontcreater 这个软件有什么用呢。。。通过这个软件打开.ttf 文件,可以直接得到code 与 number 的关系,所以我们可以请求下来网页源代码,建立code 与 number的关系,然后直接替换就行。。但是问题就是每一次请求获得的关系可能不一样,也就是这个方式就是一次性的,,,而上面那个方式时通过反向配对的方法,number 与形状的关系总不能变吧(当然开发者换了也没办法)

你可能感兴趣的:(爬虫)