Cracking 58.com (58同城) Encrypted Content with a Python Crawler

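While scraping contact phone numbers from 58.com rental listings, I found that a number displayed on the page as '13823661900' appears in the fetched source as '龒鑶龤驋鑶餼餼龒鸺閏閏'.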

[Image 1]

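This looks like font encryption (strictly speaking, font obfuscation). The page does not change the character encoding itself. Instead, it ships its own font file and applies that font as the style of the protected text. The source contains code points that normally belong to rare CJK characters, and the custom font draws the glyphs for those code points as digits. A browser that loads the font therefore displays the correct number, while anything that reads the raw source without the custom font sees the same code points rendered by a default font as gibberish.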
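
To make this concrete, here is a small sketch (not part of the original code) that prints the code points of the scraped sample from above. It uses only the standard library, and the variable names are illustrative. As far as Unicode is concerned each character is an ordinary CJK ideograph, which is why the source looks like gibberish when the custom font is not loaded.

```python
import unicodedata

# Garbled text as it appears in the fetched page source (sample from above).
scraped = '龒鑶龤驋鑶餼餼龒鸺閏閏'

# Collect the Unicode code point of every character.
code_points = [ord(ch) for ch in scraped]

for ch, cp in zip(scraped, code_points):
    # unicodedata.name() returns the official character name, if there is one.
    print(ch, hex(cp), unicodedata.name(ch, 'UNKNOWN'))

# True when every character sits in the basic CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= cp <= 0x9FFF for cp in code_points)
print(all_cjk)
```

Nothing in the bytes themselves says "digit"; that meaning lives entirely in the font's glyph shapes.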

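Solution: locate the font file and work out the mapping it defines. The font is usually applied as a style on the element that holds the obfuscated text.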

[Image 2]

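As the figure above shows, once the fangchan-secret style is unchecked, the obfuscated parts of the page display exactly the characters we scraped, so fangchan-secret is most likely the custom font. Next, search the page source for 'fangchan-secret' to find the font file, as shown in the figure below. In that source, the font file is Base64-encoded (an encoding, not encryption) and embedded in the JS. Take the encoded part (the content between the two red vertical lines), either with a regular expression or by copying it directly. The mapping order in the font may change every time the page is refreshed, so check whether it has changed; it is safest to take the font and the text to decode from the same page load.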

[Image 3]
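
The extraction step can be sketched with the standard library alone. The `page_source` fragment below is a hypothetical stand-in for the real page (its MIME type is assumed, and only the first 16 characters of the Base64 payload are kept); the regular expression is the one used in this article's code. Reading the first 12 decoded bytes as an sfnt header is a cheap sanity check that the payload really is a font.

```python
import base64
import re
import struct

# Hypothetical page fragment; the real page embeds the complete font here.
page_source = (
    "@font-face{font-family:'fangchan-secret';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,"
    "AAEAAAALAIAAAwAw') format('truetype')}"
)

# Capture everything between 'base64,' and the closing quote + parenthesis.
match = re.search(r"charset=utf-8;base64,(.*?)'\)", page_source)
payload = match.group(1)

# b64decode accepts a str directly and returns bytes.
header = base64.b64decode(payload)[:12]

# sfnt header: version (uint32) followed by four uint16 fields, big-endian.
version, num_tables, search_range, entry_selector, range_shift = struct.unpack(
    '>IHHHH', header
)

print(hex(version), num_tables)
```

A version of 0x00010000 indicates TrueType outlines, so the data is a TrueType font even if it is later saved with an .otf extension.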

The code is as follows:

import base64
from io import BytesIO
from fontTools.ttLib import TTFont
import requests
import re

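# Fetch the page and extract the Base64 font data with a regular expression
url = 'https://sz.58.com/zufang/37600131385868x.shtml?entinfo=37600131385868_0&fzbref=0&params=rankbusitimeZZ8pc0020^desc&psid=125027535203666225060654534&iuType=gz_2&ClickID=1&cookie=|||c5/nn1pMpWpci9dLRdczAg==&PGTID=0d300008-00a7-53b9-8bbc-17b565fb4b3a&apptype=0&key=&pubid=66881617&trackkey=37600131385868_c528f692-5ebb-402e-99ff-dce613171bf0_20190329165957_1553849997376&fcinfotype=gz'
res = requests.get(url)
# [0] raises IndexError if the pattern is not found (for example, if the listing has expired)
bs64Str = re.findall(r"charset=utf-8;base64,(.*?)'\)", res.text)[0]
# Alternatively, copy the font data straight from the page source (this copy is what the rest of the script uses)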
base64Str = 'AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8J/ZQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQVGM29AAAA4AAAADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAYqAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOiv/hBfDzz1AAsIAAAAAADYyMFWAAAAANjIwVYAAP/mBGgGLgAAAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQAAAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAAsAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wAAAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAABwAGAAUAAwAJAAIACAAEAAEACgAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAHAACVjwAAlY8AAAAGAACZPAAAmTwAAAAFAACaSwAAmksAAAADAACeOgAAnjoAAAAJAACeowAAnqMAAAACAACfZAAAn2QAAAAIAACfkgAAn5IAAAAEAACfpAAAn6QAAAABAACfpQAAn6UAAAAKAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sEC6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCYjIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTYzMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxEjESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH
9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACEiJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjMyNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgAACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwAjAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAAGAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AADAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJzaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgBjAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA'
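# Base64-decode the payload; base64.decodebytes() expects bytes, so encode the str first
binData = base64.decodebytes(base64Str.encode())
# Write the decoded bytes to a font file (these are the author's local paths; adjust as needed)
filePath01 = r'F://data_temp/jiemi_20190402_03.otf'
filePath02 = r'F://data_temp/text_20190402_03.xml'  # not used below; presumably for an XML dump of the font
with open(filePath01, 'wb') as f:
    f.write(binData)  # the with block closes the file, so no f.close() is needed
# Parse the font file
font01 = TTFont(filePath01)
# BytesIO() wraps binData in a file-like object, which TTFont also accepts,
# so the font can be loaded without writing it to disk:
# font01 = TTFont(BytesIO(binData))
uniList = font01.getGlyphOrder()  # glyph names in glyph-index order
utfList = font01['cmap'].tables[0].cmap  # code point -> glyph name; font01.getBestCmap() is an alternative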
retList = []
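# Garbled text to decode (presumably copied from the same page load as the font above)
getText = '麣龒鸺驋龒鑶鑶麣龥龤龤'
for i in getText:
    # ord() takes a character and returns its Unicode code point
    if ord(i) in utfList:
        # Assumes glyph names end in a two-digit index and that the digit is that index minus 1
        text = int(utfList[ord(i)][-2:]) - 1
    else:
        # Characters with no entry in the font are kept unchanged
        text = i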
    retList.append(text)
crackText = ''.join([str(i) for i in retList])
print(crackText)
# 13823661900
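
The arithmetic in the loop can be checked without the font file. The sketch below uses a hand-written dictionary as a stand-in for the cmap. Its glyph names are assumptions inferred from the input and output shown above (fontTools typically gives unnamed glyphs names such as `glyph00004`), not values read from the font, and the helper `decode_digits` is hypothetical.

```python
# Stand-in for the font's cmap: code point -> glyph name (assumed values).
fake_cmap = {
    ord('龤'): 'glyph00001',
    ord('麣'): 'glyph00002',
    ord('驋'): 'glyph00003',
    ord('龒'): 'glyph00004',
    ord('鑶'): 'glyph00007',
    ord('鸺'): 'glyph00009',
    ord('龥'): 'glyph00010',
}


def decode_digits(text, cmap):
    """Replace every mapped character with its digit; keep the rest as-is."""
    out = []
    for ch in text:
        name = cmap.get(ord(ch))
        if name is None:
            out.append(ch)  # not obfuscated, e.g. punctuation or Latin text
        else:
            # The last two characters of the name are the glyph index;
            # digit glyphs start at index 1, so subtract 1.
            out.append(str(int(name[-2:]) - 1))
    return ''.join(out)


decoded = decode_digits('麣龒鸺驋龒鑶鑶麣龥龤龤', fake_cmap)
print(decoded)
```

Because the mapping can change on every refresh, a dictionary like this is only valid for the page load it was derived from; in real use it should be rebuilt from the font each time.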



As you can see, the printed result matches the number shown on the web page, so the decryption succeeded. Well done!