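I recently noticed a website whose pages render normally, yet inspecting the elements shows nothing but garbled characters. That looked interesting, so I dug into it: the site is using font obfuscation.
Font-based anti-scraping, also called custom-font anti-scraping, renders the text on a page through a custom font file. The text in the page source is no longer the real characters but the code points defined by that font, so copying the text or scraping it naively does not yield the real content. What a computer displays for a character is determined by how the font maps its code point to a glyph. If we change that mapping and put the remapped Unicode code points in the page, we get the effect described above: the page displays normally, but the inspected elements are garbled. This article does not discuss how to recover the original text from such a font; instead, we implement this kind of obfuscation ourselves.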
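To make the effect concrete, here is a minimal sketch of what a scraper sees. The mapping is hypothetical (the code points are picked by hand, not taken from any real font): the page carries Private Use Area characters, and only the custom font knows which glyphs to draw for them.

```python
import unicodedata

# Hypothetical mapping: plain character -> code point in the Private Use Area.
# A real obfuscated font would draw the glyph of the plain character at that code point.
mapping = {"价": 0xE001, "格": 0xE002}


def to_dom_text(text: str) -> str:
    """Return the text as it would be stored in the page source."""
    return "".join(chr(mapping[ch]) for ch in text)


dom_text = to_dom_text("价格")

# Every character is private-use (category "Co"), so without the font it carries no meaning.
print([unicodedata.category(ch) for ch in dom_text])
print(dom_text.encode("unicode_escape").decode("ascii"))
```

Copying the text from the browser gives exactly these private-use characters, which is why it looks garbled anywhere the custom font is not applied.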
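Below we write a Python script that generates an obfuscated font. I will put the final code on GitHub.
First, we write a function that processes the input: it removes duplicate characters and all whitespace. We do not handle emoji, so if any are present they are removed as well.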
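import re
from collections import OrderedDict

import emoji  # third-party package; replace_emoji() requires emoji >= 2.0


def _pre_deal_obfuscator_input_str(s: str) -> str:
    """
    Pre-process the input string: remove emoji, remove all whitespace, deduplicate characters.
    @param s: raw input text
    @return: cleaned text with each character appearing once, in first-seen order
    """
    # demojize() would only turn emoji into ':name:' text, so strip them instead.
    s = emoji.replace_emoji(s, replace='')
    s = re.sub(r'\s+', '', s)
    # Deduplicate last so the result never contains repeated characters.
    return "".join(OrderedDict.fromkeys(s))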
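The whitespace and deduplication steps can be tried on their own with the standard library. The sketch below deliberately leaves out emoji handling, and `normalize_input` is a hypothetical name used only for this illustration.

```python
import re


def normalize_input(s: str) -> str:
    """Strip all whitespace, then drop repeated characters, keeping first-seen order."""
    s = re.sub(r"\s+", "", s)
    # dict preserves insertion order (Python 3.7+), so this is an order-preserving dedup.
    return "".join(dict.fromkeys(s))


print(normalize_input("hello world\nhello"))
```

Keeping the first-seen order matters because each character is later paired with an obfuscated code point by position.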
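Next comes a helper that checks whether the font contains every character we need; a character missing from the font cannot be displayed properly.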
def _check_cmap_include_all_text(cmap: dict, s: str) -> bool:
    # cmap maps Unicode code points to glyph names, as returned by getBestCmap().
    for character in s:
        if ord(character) not in cmap:
            raise Exception(f'{character} Not in Font Lib!')
    return True
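When many characters are missing, stopping at the first one makes debugging slow. As a possible variation (not part of the original script), the sketch below collects every missing character in one pass; `find_missing_characters` and the toy `toy_cmap` are assumptions for illustration.

```python
def find_missing_characters(cmap: dict, s: str) -> list:
    """Return the characters of s whose code point has no entry in cmap."""
    return [ch for ch in s if ord(ch) not in cmap]


# Toy cmap in the shape returned by fontTools' getBestCmap(): code point -> glyph name.
toy_cmap = {ord("a"): "a", ord("b"): "b"}

missing = find_missing_characters(toy_cmap, "abc")
if missing:
    print("Not in font:", "".join(missing))
```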
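Now for the core of the code. The principle is simple: read the font's cmap (the table that maps code points to glyphs), then replace each character's code point with an obfuscated one. Let's go straight to the code.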
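The remapping idea can be shown with plain dictionaries before any font file is involved. This is a simplified sketch: `build_obfuscated_cmap` and the toy cmap are hypothetical, and real glyph outlines are stood in for by glyph names.

```python
import random


def build_obfuscated_cmap(source_cmap: dict, plain_text: str, seed=None):
    """Assign each character a random Private Use Area code point.

    Returns (new_cmap, glyph_source):
      new_cmap     maps the new code point to a new glyph name such as 'unie0a1'
      glyph_source maps that new glyph name to the source glyph whose outline to copy
    """
    rng = random.Random(seed)
    # The Private Use Area spans U+E000..U+F8FF inclusive.
    codes = rng.sample(range(0xE000, 0xF8FF + 1), len(plain_text))
    new_cmap, glyph_source = {}, {}
    for ch, code in zip(plain_text, codes):
        name = f"uni{code:04x}"
        new_cmap[code] = name
        glyph_source[name] = source_cmap[ord(ch)]
    return new_cmap, glyph_source


toy_cmap = {ord("a"): "a", ord("b"): "b"}
new_cmap, glyph_source = build_obfuscated_cmap(toy_cmap, "ab", seed=1)
print(new_cmap)
print(glyph_source)
```

The new font only needs `new_cmap` plus the copied outlines; the original code points of "a" and "b" never appear in it.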
import random
import string

from fontTools.fontBuilder import FontBuilder
from fontTools.pens.ttGlyphPen import TTGlyphPen
from fontTools.ttLib import TTFont

# FontNameTable, _get_project_root and _subset_ttf_font are helpers from the project source (not shown here).


def obfuscator(flag=0x0001, plain_text='', shadow_text='', source_font='', output_flag=0x0003,
               output_file_name='obfuscated_font', output_path='output', name_strings: FontNameTable = None):
    """
    Main Function for Font Obfuscator.
    @param flag: Flags can be combined. Mind the performance cost if you pass a lot of plain text.
        0x0001 Auto-obfuscate the font; you only need to pass plain_text.
        0x0002 Use the code points of shadow_text; you must set shadow_text.
        0x0100 Shuffle plain_text. Note that if you also pass shadow_text, the characters no longer line up with it.
        0x1000 Add all digits to plain_text.
        0x2000 Add all lowercase letters to plain_text.
        0x4000 Add all uppercase letters to plain_text.
        0x8000 Add 2,500 common Chinese characters to plain_text.
    @param plain_text: characters to obfuscate
    @param shadow_text: characters whose code points stand in for those of plain_text (used with 0x0002)
    @param source_font: path of the source font; defaults to the bundled KaiGenGothicCN-Regular.ttf
    @param output_flag: Flags can be combined to output several files.
        0x0001 .ttf
        0x0002 .woff & .woff2
    @param output_file_name: output file name without extension
    @param output_path: output directory
    @param name_strings: name table entries for the generated font
    @return: dict mapping each plain character to its HTML character reference
    """
    if flag & 0x1000:
        plain_text += string.digits
    if flag & 0x2000:
        plain_text += string.ascii_lowercase
    if flag & 0x4000:
        plain_text += string.ascii_uppercase
    if flag & 0x8000:
        from py_font_obfuscator.constants import NORMAL_CHINESE_CHARACTERS
        plain_text += NORMAL_CHINESE_CHARACTERS

    plain_text = _pre_deal_obfuscator_input_str(plain_text)
    shadow_text = _pre_deal_obfuscator_input_str(shadow_text)

    if flag & 0x0100:
        plain_text = list(plain_text)
        random.shuffle(plain_text)
        plain_text = ''.join(plain_text)

    if name_strings is None:
        name_strings = FontNameTable()

    _map = {}
    obfuscator_code_list = []

    # The Private Use Area spans U+E000..U+F8FF inclusive, hence the "+ 1".
    if flag & 0x0001:
        obfuscator_code_list = random.sample(range(0xE000, 0xF8FF + 1), len(plain_text))
    elif flag & 0x0002:
        if len(shadow_text) < len(plain_text):
            raise Exception('The length of shadow text must not be less than that of plain text!')
        obfuscator_code_list = [ord(i) for i in shadow_text]
    else:
        obfuscator_code_list = random.sample(range(0xE000, 0xF8FF + 1), len(plain_text))

    if not source_font:
        source_font = _get_project_root() / 'base-font/KaiGenGothicCN-Regular.ttf'

    source_font = TTFont(source_font)
    source_cmap = source_font.getBestCmap()
    _check_cmap_include_all_text(source_cmap, plain_text)

    glyphs, metrics, cmap = {}, {}, {}
    glyph_set = source_font.getGlyphSet()
    pen = TTGlyphPen(glyph_set)
    glyph_order = source_font.getGlyphOrder()
    final_shadow_text: list = []

    # Carry over the special glyphs if the source font has them.
    if 'null' in glyph_order:
        glyph_set['null'].draw(pen)
        glyphs['null'] = pen.glyph()
        metrics['null'] = source_font['hmtx']['null']
        final_shadow_text += ['null']

    if '.notdef' in glyph_order:
        glyph_set['.notdef'].draw(pen)
        glyphs['.notdef'] = pen.glyph()
        metrics['.notdef'] = source_font['hmtx']['.notdef']
        final_shadow_text += ['.notdef']

    # Copy each character's outline and metrics to a glyph at its obfuscated code point.
    for index, character in enumerate(plain_text):
        obfuscator_code = obfuscator_code_list[index]
        code_cmap_name = hex(obfuscator_code).replace('0x', 'uni')
        # HTML numeric character reference, e.g. '&#xe001;'
        html_code = hex(obfuscator_code).replace('0x', '&#x') + ';'
        _map[character] = html_code

        final_shadow_text.append(code_cmap_name)
        glyph_set[source_cmap[ord(character)]].draw(pen)
        glyphs[code_cmap_name] = pen.glyph()
        metrics[code_cmap_name] = source_font['hmtx'][source_cmap[ord(character)]]
        cmap[obfuscator_code] = code_cmap_name

    horizontal_header = {
        'ascent': source_font['hhea'].ascent,
        'descent': source_font['hhea'].descent,
    }

    fb = FontBuilder(source_font['head'].unitsPerEm, isTTF=True)
    fb.setupGlyphOrder(final_shadow_text)
    fb.setupCharacterMap(cmap)
    fb.setupGlyf(glyphs)
    fb.setupHorizontalMetrics(metrics)
    fb.setupHorizontalHeader(**horizontal_header)
    fb.setupNameTable(name_strings.get_name_strings())
    fb.setupOS2()
    fb.setupPost()

    result = dict()
    if output_flag & 0x0001:
        result['ttf'] = f'./{output_path}/{output_file_name}.ttf'
        fb.save(f'./{output_path}/{output_file_name}.ttf')
    if output_flag & 0x0002:
        _subset_ttf_font(f'./{output_path}/{output_file_name}')
    return _map
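The dictionary returned by `obfuscator` is what turns ordinary text into markup for the page. Here is a minimal sketch of that step, assuming the values are HTML character references such as `&#xe001;`; `encode_text` and the sample mapping are hypothetical and not part of the project.

```python
import html


def encode_text(text: str, mapping: dict) -> str:
    """Replace mapped characters with their references; HTML-escape everything else."""
    return "".join(mapping.get(ch, html.escape(ch)) for ch in text)


# Hand-written stand-in for the dictionary returned by obfuscator().
sample_map = {"价": "&#xe001;", "格": "&#xe002;"}

print(encode_text("价格 <10", sample_map))
```

Characters that are not in the mapping are passed through (escaped), so they are drawn by the fallback font rather than the obfuscated one.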
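Note that the function has to return the mapping dictionary, because the original string must be replaced with the codes used in the font. That completes the code, and it is not difficult. Next, let's write a page to test the generated font.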
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Font Obfuscator Demo</title>
    <style>
        @font-face {
            font-family: CustomAwesomeFont;
            src: url('./obfuscated_font.woff2') format("woff2");
        }
        .customFont {
            font-family: "CustomAwesomeFont", serif;
            font-style: normal;
            font-weight: normal;
            font-variant: normal;
            text-transform: none;
            line-height: 1;
            -webkit-font-smoothing: antialiased;
        }
    </style>
</head>
<body>
    <p class="customFont">, 小船儿推开波浪</p>
</body>
</html>
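Writing the test page by hand gets tedious once the mapping changes on every run. As a sketch, the page can be generated from Python with `string.Template`; `render_demo_page`, the default font URL and the sample references are assumptions for illustration.

```python
from string import Template

PAGE = Template("""<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Font Obfuscator Demo</title>
    <style>
        @font-face {
            font-family: CustomAwesomeFont;
            src: url('$font_url') format("woff2");
        }
        .customFont { font-family: "CustomAwesomeFont", serif; }
    </style>
</head>
<body>
    <p class="customFont">$body</p>
</body>
</html>
""")


def render_demo_page(body: str, font_url: str = "./obfuscated_font.woff2") -> str:
    """Fill the page template; body must already be encoded and escaped HTML."""
    return PAGE.substitute(font_url=font_url, body=body)


page = render_demo_page("&#xe001;&#xe002;")
print(page)
```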
Let's take a look at the result:
Source code: https://github.com/Litt1eQ/py_font_obfuscator