Cracking 58.com's Custom-Font Anti-Scraping

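When scraping rental listings from 58.com (58同城), the page uses a custom font that hides the data that would otherwise be readable, as shown in the figure: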

[Figure 1: Cracking 58.com's custom-font anti-scraping]

 

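Searching the page source turns up an @font-face rule that defines a custom font; this is what hides the real data. The font itself is embedded in the rule as a base64-encoded TrueType file.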

[Figure 2: Cracking 58.com's custom-font anti-scraping]
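To make the extraction step concrete, here is a minimal sketch that pulls a base64 font payload out of an inline @font-face rule using only the standard library. The helper name `extract_font_bytes`, the font-family name, and the sample markup are assumptions made for illustration; the real page markup may differ.

```python
import base64
import re

# Capture the base64 payload that precedes "format('truetype')".
# The character class limits the capture to valid base64 characters.
FONT_RE = re.compile(
    r"base64,([A-Za-z0-9+/=]+)['\"]?\)\s*format\(['\"]truetype['\"]\)"
)


def extract_font_bytes(html):
    """Return the decoded font bytes embedded in html, or None if absent."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Hypothetical page fragment; the payload is just the bytes b"demo-font",
# not a real TrueType file.
payload = base64.b64encode(b"demo-font").decode("ascii")
sample = (
    "<style>@font-face{font-family:'demo-family';"
    "src:url('data:application/font-ttf;charset=utf-8;base64,"
    + payload
    + "') format('truetype')}</style>"
)

print(extract_font_bytes(sample))
```

Returning None when no font is found lets the caller decide whether to treat the page as already readable.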

Next, decode the obfuscated font data:

import base64
import io
import re

import requests
from fontTools.ttLib import TTFont


def get_list(url):
    resp = requests.get(url)
    if resp:
        # Pull the base64-encoded TrueType font out of the inline @font-face rule.
        base64_str = re.findall(r"data:application/font-ttf;charset=utf-8;base64,(.*?)'\) format\('truetype'\)}", resp.text)

        bin_data = base64.b64decode(base64_str[0])
        fonts = TTFont(io.BytesIO(bin_data))
        # getBestCmap() maps each code point (int) to a glyph name (str).
        bestcmap = fonts.getBestCmap()
        newmap = {}
        for key in bestcmap.keys():
            # The number in the glyph name, minus 1, is the real digit.
            value = int(re.findall(r'(\d+)', bestcmap[key])[0]) - 1
            key = hex(key)
            newmap[key] = value
        # newmap now looks like:
        # {'0x9a4b': 2, '0x9e3a': 8, '0x993c': 4, '0x9ea3': 5, '0x9476': 3, '0x958f': 6, '0x9f92': 0, '0x9fa4': 1, '0x9f64': 9, '0x9fa5': 7}
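The name-to-digit step above can be tried in isolation without downloading a font. This is a sketch only: the glyph names in `sample_cmap` (such as "glyph00003") are an assumed format, chosen so that the result agrees with the sample mapping shown above, and `glyph_name_to_digit` is a hypothetical helper.

```python
import re


def glyph_name_to_digit(glyph_name):
    """Return (number in the glyph name) - 1, or None if there is no number."""
    digits = re.search(r"\d+", glyph_name)
    if digits is None:
        return None
    return int(digits.group()) - 1


# Hypothetical stand-in for the dict returned by getBestCmap():
# code point (int) -> glyph name (str). The name format is an assumption.
sample_cmap = {
    0x9A4B: "glyph00003",
    0x9E3A: "glyph00009",
    0x9F92: "glyph00001",
}

newmap = {hex(cp): glyph_name_to_digit(name) for cp, name in sample_cmap.items()}
print(newmap)
```

Note that `hex()` always produces lowercase keys such as '0x9a4b', which matters when matching them against the page source.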
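In the page source, however, these code points do not appear as bare hexadecimal numbers but as HTML entities of the form &#x9a4b;, so the keys need a little more processing: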
resp_ = resp.text
for key, value in newmap.items():
    # '0x9a4b' -> '&#x9a4b;', the form used in the page source
    key_ = key.replace('0x', '&#x') + ';'
    if key_ in resp_:
        resp_ = resp_.replace(key_, str(value))
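As an alternative to repeated `str.replace` calls, the same substitution can be sketched as a single `re.sub` pass that also tolerates uppercase hex digits in the entities. The function name `decode_entities`, the sample markup, and the `price` class are hypothetical; only the mapping values come from the sample output above.

```python
import re


def decode_entities(html, newmap):
    """Replace &#x....; entities listed in newmap with their real digits."""
    # Index by integer code point so that hex letter case does not matter.
    lookup = {int(key, 16): str(value) for key, value in newmap.items()}

    def substitute(match):
        code_point = int(match.group(1), 16)
        # Leave entities that are not part of the custom font untouched.
        return lookup.get(code_point, match.group(0))

    return re.sub(r"&#x([0-9a-fA-F]+);", substitute, html)


newmap = {"0x9a4b": 2, "0x9e3a": 8, "0x9f92": 0}
sample = '<b class="price">&#x9a4b;&#x9E3A;&#x9f92;&#x9f92;</b> &#x4e2d;'
print(decode_entities(sample, newmap))
```

Entities outside the mapping (the last one in the sample) pass through unchanged, so ordinary text in the page is not affected.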

Here is the complete code:

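import base64
import io
import re

import requests
from fontTools.ttLib import TTFont


def get_list(url):
    resp = requests.get(url)
    if resp:
        # Pull the base64-encoded TrueType font out of the inline @font-face rule.
        base64_str = re.findall(r"data:application/font-ttf;charset=utf-8;base64,(.*?)'\) format\('truetype'\)}", resp.text)

        bin_data = base64.b64decode(base64_str[0])
        fonts = TTFont(io.BytesIO(bin_data))
        # getBestCmap() maps each code point (int) to a glyph name (str).
        bestcmap = fonts.getBestCmap()
        newmap = {}
        for key in bestcmap.keys():
            # The number in the glyph name, minus 1, is the real digit.
            value = int(re.findall(r'(\d+)', bestcmap[key])[0]) - 1
            key = hex(key)
            newmap[key] = value

        print('==========', newmap)
        resp_ = resp.text
        for key, value in newmap.items():
            # '0x9a4b' -> '&#x9a4b;', the form used in the page source
            key_ = key.replace('0x', '&#x') + ';'
            if key_ in resp_:
                resp_ = resp_.replace(key_, str(value))

        # Return the decoded source so the caller can parse it.
        return resp_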

At this point resp_ holds the normal, readable page source, and the data can be extracted from it.

 

 
