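While scraping rental listings from 58同城 (58.com), I found that the page uses a custom font to hide the real values, as shown in the figure: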
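Searching the page source turns up an @font-face rule: a custom font that obscures the digits that would otherwise be displayed normally.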
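As a minimal sketch of how the embedded font can be located, the snippet below pulls the base64 payload out of an @font-face rule and decodes it to raw bytes. The helper name extract_font_bytes, the font-family name demo-font, and the sample markup are all hypothetical; the real page may format the rule differently.

```python
import base64
import re

# The character class only accepts base64 characters, so the capture
# cannot run past the closing quote of the data URI.
FONT_RE = re.compile(
    r"base64,([A-Za-z0-9+/=]+)['\"]?\)\s*format\(['\"]truetype['\"]\)"
)


def extract_font_bytes(html):
    """Return the decoded bytes of the first embedded TrueType font, or None."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Hypothetical page fragment; the payload is just base64 for b"hello",
# not a real font file.
sample = (
    "<style>@font-face{font-family:'demo-font';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,aGVsbG8=') "
    "format('truetype')}</style>"
)
print(extract_font_bytes(sample))
```

Returning None when no font is found lets the caller decide whether the page simply has no obfuscation or the markup has changed.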
Next, decode this obfuscated font:
def get_list(url):
    resp = requests.get(url)
    if resp:
        # Extract the base64-encoded TTF font embedded in the @font-face rule.
        base64_str = re.findall(r"data:application/font-ttf;charset=utf-8;base64,(.*?)'\) format\('truetype'\)}", resp.text)
        bin_data = base64.b64decode(base64_str[0])
        fonts = TTFont(io.BytesIO(bin_data))
        bestcmap = fonts.getBestCmap()
        newmap = {}
        for key in bestcmap.keys():
            print(key)
            print(re.findall(r'(\d+)', bestcmap[key]))
            # The number in the glyph name is one greater than the digit it draws.
            value = int(re.findall(r'(\d+)', bestcmap[key])[0]) - 1
            key = hex(key)
            newmap[key] = value
        # Resulting newmap:
        # {'0x9a4b': 2, '0x9e3a': 8, '0x993c': 4, '0x9ea3': 5, '0x9476': 3, '0x958f': 6, '0x9f92': 0, '0x9fa4': 1, '0x9f64': 9, '0x9fa5': 7}

In the page source, however, these code points do not appear as bare hexadecimal numbers (each one is followed by a ';'), so the keys need a little extra processing:
resp_ = resp.text
for key, value in newmap.items():
    key_ = key.replace('0x', '') + ';'
    if key_ in resp_:
        resp_ = resp_.replace(key_, str(value))
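The mapping step can also be isolated from the network and font-parsing code, which makes it easier to check. The sketch below takes a plain dict shaped like the result of getBestCmap() (code point to glyph name) and applies the same "number in the glyph name minus one" rule. The helper name build_digit_map and the glyph names in the sample are assumptions chosen to reproduce three entries of the mapping printed earlier.

```python
import re


def build_digit_map(cmap):
    """Map each code point (as a hex string) to the digit its glyph draws.

    Assumes the glyph name contains a number that is one greater than
    the digit, as the `- 1` in the walkthrough implies.
    """
    digit_map = {}
    for code_point, glyph_name in cmap.items():
        numbers = re.findall(r"\d+", glyph_name)
        if not numbers:
            continue  # skip glyphs whose names carry no number
        digit_map[hex(code_point)] = int(numbers[0]) - 1
    return digit_map


# Hypothetical cmap; the glyph names are assumed, not taken from a real font.
sample_cmap = {0x9A4B: "glyph00003", 0x9F92: "glyph00001", 0x9F64: "glyph00010"}
print(build_digit_map(sample_cmap))
```

Because the font may be regenerated on each request, the mapping is best rebuilt from every response rather than cached.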
Here is the complete code:
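import base64
import io
import re

import requests
from fontTools.ttLib import TTFont


def get_list(url):
    resp = requests.get(url)
    if resp:
        # Extract the base64-encoded TTF font embedded in the @font-face rule.
        base64_str = re.findall(r"data:application/font-ttf;charset=utf-8;base64,(.*?)'\) format\('truetype'\)}", resp.text)
        bin_data = base64.b64decode(base64_str[0])
        fonts = TTFont(io.BytesIO(bin_data))
        # Mapping of code point -> glyph name.
        bestcmap = fonts.getBestCmap()
        newmap = {}
        for key in bestcmap.keys():
            print(key)
            print(re.findall(r'(\d+)', bestcmap[key]))
            # The number in the glyph name is one greater than the digit it draws.
            value = int(re.findall(r'(\d+)', bestcmap[key])[0]) - 1
            key = hex(key)
            newmap[key] = value
        print('==========', newmap)
        # Replace each obfuscated code point in the source with its real digit.
        resp_ = resp.text
        for key, value in newmap.items():
            key_ = key.replace('0x', '') + ';'
            if key_ in resp_:
                resp_ = resp_.replace(key_, str(value))
        return resp_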
At this point resp_ holds the normal page source, and the data can be extracted from it.
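The replacement in the code matches only the hex digits plus the trailing ';'. If the page writes each character as a full numeric reference such as &#x9a4b;, a variant that matches the whole reference avoids leaving the &#x prefix behind. The sketch below is one possible way to do that, and it also handles the case where the characters appear literally. The helper name decode_digits and the sample strings are hypothetical.

```python
import re


def decode_digits(text, digit_map):
    """Replace obfuscated code points in `text` with their real digits."""
    # Keys like "0x9a4b" become integer code points; values become strings.
    by_code_point = {int(key, 16): str(value) for key, value in digit_map.items()}

    # Form 1: numeric character references such as "&#x9a4b;".
    def replace_entity(match):
        code_point = int(match.group(1), 16)
        # Leave references that are not in the mapping untouched.
        return by_code_point.get(code_point, match.group(0))

    text = re.sub(r"&#x([0-9a-fA-F]+);", replace_entity, text)

    # Form 2: the literal characters, in case the references were already unescaped.
    return text.translate(by_code_point)


# Subset of the mapping printed in the walkthrough.
digit_map = {"0x9a4b": 2, "0x9e3a": 8, "0x9f92": 0}

# Hypothetical snippets, one per form.
print(decode_digits("&#x9a4b;&#x9e3a;&#x9f92;&#x9f92; 元/月", digit_map))
print(decode_digits("\u9a4b\u9e3a", digit_map))
```

Matching the full reference keeps unrelated text that happens to contain the same hex digits followed by ';' from being rewritten.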