This blog is written by 胖胖鹏鹏胖胖鹏, an idle white hat, solely for personal technical exchange and sharing; it must not be used for commercial purposes. Please credit the source when reposting; reproducing or commercializing any content of this blog without permission is prohibited.
Last week we analyzed Diaphora's library-function matching rules, which mainly compare two databases. This time, let's look at how those databases are generated. The code lives in diaphora_ida.py and is interleaved with a lot of GUI code and IDA APIs; we will skip over the GUI handling and annotate the IDAPython API calls directly in the code. (This installment is a bit rough around the edges, apologies.)
Let's follow the main function. Since we have not set any environment variables, every os.getenv call in the code returns None, so, as the logic shows, we jump straight into the _diff_or_export function.
def main():
  global g_bindiff
  if os.getenv("DIAPHORA_AUTO") is not None:
    file_out = os.getenv("DIAPHORA_EXPORT_FILE")
    if file_out is None:
      raise Exception("No export file specified!")

    use_decompiler = os.getenv("DIAPHORA_USE_DECOMPILER")
    if use_decompiler is None:
      use_decompiler = False

    idaapi.autoWait()

    if os.path.exists(file_out):
      if g_bindiff is not None:
        g_bindiff = None

      remove_file(file_out)
      log("Database %s removed" % repr(file_out))

    bd = CIDABinDiff(file_out)
    project_script = os.getenv("DIAPHORA_PROJECT_SCRIPT")
    if project_script is not None:
      bd.project_script = project_script
    bd.use_decompiler_always = use_decompiler
    try:
      bd.export()
    except KeyboardInterrupt:
      log("Aborted by user, removing crash file %s-crash..." % file_out)
      os.remove("%s-crash" % file_out)

    idaapi.qexit(0)
  else:
    _diff_or_export(True)
Let's step into the _diff_or_export function.
def _diff_or_export(use_ui, **options):
  global g_bindiff

  # Get the total number of functions
  total_functions = len(list(Functions()))
  # Check the IDB file path
  if GetIdbPath() == "" or total_functions == 0:
    Warning("No IDA database opened or no function in the database.\nPlease open an IDA database and create some functions before running this script.")
    return

  # Parse the options
  opts = BinDiffOptions(**options)

  # If there is a UI, get the user's input
  if use_ui:
    x = CBinDiffExporterSetup()
    x.Compile()
    x.set_options(opts)

    if not x.Execute():
      return

    opts = x.get_options()

  # The input file must differ from the output file; the output filename must be
  # longer than 5 characters; neither file may be an IDA database, only SQLite files.
  if opts.file_out == opts.file_in:
    Warning("Both databases are the same file!")
    return
  elif opts.file_out == "" or len(opts.file_out) < 5:
    Warning("No output database selected or invalid filename. Please select a database file.")
    return
  elif is_ida_file(opts.file_in) or is_ida_file(opts.file_out):
    Warning("One of the selected databases is an IDA file. Please select only database files")
    return

  export = True
  # If the file already exists: 1) the last export crashed, so ask whether to
  # resume it; or 2) the filename collides, so ask whether to overwrite the file.
  if os.path.exists(opts.file_out):
    crash_file = "%s-crash" % opts.file_out
    resume_crashed = False
    crashed_before = False
    if os.path.exists(crash_file):
      crashed_before = True
      ret = askyn_c(1, "The previous export session crashed. Do you want to resume the previous crashed session?")
      if ret == -1:
        log("Cancelled")
        return
      elif ret == 1:
        resume_crashed = True

    if not resume_crashed and not crashed_before:
      ret = askyn_c(0, "Export database already exists. Do you want to overwrite it?")
      if ret == -1:
        log("Cancelled")
        return

      if ret == 0:
        export = False

    if export:
      if g_bindiff is not None:
        g_bindiff = None

      if not resume_crashed:
        remove_file(opts.file_out)
        log("Database %s removed" % repr(opts.file_out))

        if os.path.exists(crash_file):
          os.remove(crash_file)

  t0 = time.time()
  # Create the CIDABinDiff object and set its many parameters;
  # this runs a series of initialization steps such as creating the database.
  try:
    bd = CIDABinDiff(opts.file_out)
    bd.use_decompiler_always = opts.use_decompiler
    bd.exclude_library_thunk = opts.exclude_library_thunk
    bd.unreliable = opts.unreliable
    bd.slow_heuristics = opts.slow
    bd.relaxed_ratio = opts.relax
    bd.experimental = opts.experimental
    bd.min_ea = opts.min_ea
    bd.max_ea = opts.max_ea
    bd.ida_subs = opts.ida_subs
    bd.ignore_sub_names = opts.ignore_sub_names
    bd.ignore_all_names = opts.ignore_all_names
    bd.ignore_small_functions = opts.ignore_small_functions
    bd.function_summaries_only = opts.func_summaries_only
    bd.max_processed_rows = diaphora.MAX_PROCESSED_ROWS * max(total_functions / 20000, 1)
    bd.timeout = diaphora.TIMEOUT_LIMIT * max(total_functions / 20000, 1)
    bd.project_script = opts.project_script

    # Do the export and the comparison:
    # the export routine is bd.export,
    # the comparison routine is bd.diff.
    if export:
      exported = False
      if os.getenv("DIAPHORA_PROFILE") is not None:
        log("*** Profiling export ***")
        import cProfile
        profiler = cProfile.Profile()
        profiler.runcall(bd.export)
        exported = True
        profiler.print_stats(sort="time")
      else:
        try:
          bd.export()
          exported = True
        except KeyboardInterrupt:
          log("Aborted by user, removing crash file %s-crash..." % opts.file_out)
          os.remove("%s-crash" % opts.file_out)

      if exported:
        log("Database exported. Took {} seconds".format(time.time() - t0))

    if opts.file_in != "":
      if os.getenv("DIAPHORA_PROFILE") is not None:
        log("*** Profiling diff ***")
        import cProfile
        profiler = cProfile.Profile()
        profiler.runcall(bd.diff, opts.file_in)
        profiler.print_stats(sort="time")
      else:
        bd.diff(opts.file_in)
  except:
    print("Error: %s" % sys.exc_info()[1])
    traceback.print_exc()

  return bd
As the code shows, the core class is CIDABinDiff: the core routine that generates the SQLite database file is CIDABinDiff.export, and the core routine that performs the comparison is CIDABinDiff.diff. Let's walk through each of them.
Let's look at the diff function first.
def diff(self, db):
  res = diaphora.CBinDiff.diff(self, db)
  # Finally, show the best and partial matches
  # and register a hotkey to re-open the results
  self.show_choosers() # Open all the matched results
  self.register_menu() # Register the menu
  hide_wait_box()      # Hide the wait dialog
  return res
diff mainly calls the comparison routine we analyzed before, then displays the results.
We care more about the export function. It checks whether a crash log from a previous run exists and then runs do_export.
# Export the database needed for function matching
def export(self):
  if self.project_script is not None:
    log("Loading project specific Python script...")
    if not self.load_hooks():
      return False

  crashed_before = False
  crash_file = "%s-crash" % self.db_name
  if os.path.exists(crash_file):
    log("Resuming a previously crashed session...")
    crashed_before = True

  log("Creating crash file %s..." % crash_file)
  with open(crash_file, "wb") as f:
    f.close()

  try:
    show_wait_box("Exporting database")
    self.do_export(crashed_before)
  finally:
    hide_wait_box()

  self.db.commit()

  log("Removing crash file %s-crash..." % self.db_name)
  os.remove("%s-crash" % self.db_name)

  cur = self.db_cursor()
  cur.execute("analyze")
  cur.close()

  self.db_close()
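The `%s-crash` file used here is a classic sentinel pattern: create an empty marker file before a long job, delete it on success, and if it is still present on the next run, the previous run must have died. A minimal standalone sketch of the idea (the helper name and paths are mine, not Diaphora's):

```python
import os

def run_with_crash_marker(job, marker_path):
    """Run job(resuming), using a sentinel file to detect a prior crash.

    resuming is True when the marker already existed, i.e. the previous
    run did not finish cleanly."""
    resuming = os.path.exists(marker_path)
    # Create (or touch) the sentinel before doing any work
    open(marker_path, "wb").close()
    job(resuming)
    # Reached only if job() did not raise: clear the sentinel
    os.remove(marker_path)
    return resuming

ran = []
ran.append(run_with_crash_marker(lambda resuming: None, "demo-crash"))
print(ran)  # [False] on a clean first run
```

If `job` raises, the marker survives, so the next call sees `resuming=True`, which is exactly how do_export decides whether to skip already-exported functions.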
Stepping into do_export: the function contains some crash-handling code and then calls read_function for each function. A code fragment follows.
for func in func_list:
  i += 1
  # Print some progress logs
  if (total_funcs > 100) and i % (total_funcs/100) == 0 or i == 1:
    line = "Exported %d function(s) out of %d total.\nElapsed %d:%02d:%02d second(s), remaining time ~%d:%02d:%02d"
    elapsed = time.time() - t
    remaining = (elapsed / i) * (total_funcs - i)

    m, s = divmod(remaining, 60)
    h, m = divmod(m, 60)
    m_elapsed, s_elapsed = divmod(elapsed, 60)
    h_elapsed, m_elapsed = divmod(m_elapsed, 60)
    replace_wait_box(line % (i, total_funcs, h_elapsed, m_elapsed, s_elapsed, h, m, s))

  # Crash handling
  if crashed_before:
    rva = func - self.get_base_address()
    if rva != start_func:
      continue

    # When we get to the last function that was previously exported, switch
    # off the 'crash' flag and continue with the next row.
    crashed_before = False
    continue

  # read_function is the main routine we need to analyze
  props = self.read_function(func)
  if props == False:
    continue
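The progress line above estimates the remaining time from the average cost per exported function so far. A standalone sketch of that arithmetic (the function name and message format are mine, but the divmod chain mirrors do_export):

```python
import time

def format_progress(i, total_funcs, start_time, now=None):
    """Estimate elapsed/remaining time after exporting i of total_funcs
    functions, mirroring the divmod chain used in do_export."""
    if now is None:
        now = time.time()
    elapsed = now - start_time
    # Average time per function so far, scaled by the work left
    remaining = (elapsed / i) * (total_funcs - i)
    m, s = divmod(remaining, 60)
    h, m = divmod(m, 60)
    m_e, s_e = divmod(elapsed, 60)
    h_e, m_e = divmod(m_e, 60)
    return "Exported %d/%d. Elapsed %d:%02d:%02d, remaining ~%d:%02d:%02d" % (
        i, total_funcs, h_e, m_e, s_e, h, m, s)

# 250 of 1000 functions done in 90 seconds -> ~270 seconds remain
print(format_progress(250, 1000, 0, now=90))
# Exported 250/1000. Elapsed 0:01:30, remaining ~0:04:30
```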
read_function is quite long, so there is no way around it: let's keep going and see how it writes to the database.
# Main routine that exports a function into the database
# Input:
#   f = address of the function
#   discard = whether to skip it
# Output:
#   a tuple with the function's properties; anything else means an error
def read_function(self, f, discard=False):
  name = GetFunctionName(int(f))
  true_name = name
  demangled_name = Demangle(name, INF_SHORT_DN)
  if demangled_name == "":
    demangled_name = None

  if demangled_name is not None:
    name = demangled_name

  # Should some functions be hooked?
  if self.hooks is not None:
    ret = self.hooks.before_export_function(f, name)
    if not ret:
      return ret

  f = int(f)
  # Get the function object
  func = get_func(f)
  if not func:
    log("Cannot get a function object for 0x%x" % f)
    return False

  # Get the graph representation of the function
  flow = FlowChart(func)
  size = 0

  if not self.ida_subs:
    # Unnamed functions are ignored
    if name.startswith("sub_") or name.startswith("j_") or name.startswith("unknown") or name.startswith("nullsub_"):
      return False

    # Already recognized runtime's function?
    flags = GetFunctionFlags(f)
    if flags & FUNC_LIB or flags == -1:
      return False

  if self.exclude_library_thunk:
    # Skip library and thunk functions
    flags = GetFunctionFlags(f)
    if flags & FUNC_LIB or flags & FUNC_THUNK or flags == -1:
      return False

  image_base = self.get_base_address()
  nodes = 0
  edges = 0
  instructions = 0
  mnems = []
  dones = {}
  names = set()
  bytes_hash = []
  bytes_sum = 0
  function_hash = []
  outdegree = 0
  indegree = len(list(CodeRefsTo(f, 1))) # Get the references to the function
  assembly = {}
  basic_blocks_data = {}
  bb_relations = {}
  bb_topo_num = {}
  bb_topological = {}
  switches = []
  bb_degree = {}
  bb_edges = []
  constants = []

  # The callees will be calculated later
  callees = list()
  # Calculate the callers
  callers = list()
  for caller in list(CodeRefsTo(f, 0)):
    caller_func = get_func(caller)
    if caller_func and caller_func.startEA not in callers:
      callers.append(caller_func.startEA)

  mnemonics_spp = 1
  cpu_ins_list = GetInstructionList() # Get all instructions supported by the current CPU architecture
  cpu_ins_list.sort()

  # Process each basic block
  for block in flow:
    nodes += 1
    instructions_data = []

    block_ea = block.startEA - image_base
    idx = len(bb_topological)
    bb_topological[idx] = []
    bb_topo_num[block_ea] = idx

    # Process each instruction
    for x in list(Heads(block.startEA, block.endEA)):
      mnem = GetMnem(x)
      disasm = GetDisasm(x)
      size += ItemSize(x)
      instructions += 1

      # Pick a prime based on the instruction's index in the CPU instruction list
      if mnem in cpu_ins_list:
        mnemonics_spp *= self.primes[cpu_ins_list.index(mnem)]

      # Add the assembly line
      try:
        assembly[block_ea].append([x - image_base, disasm])
      except KeyError:
        if nodes == 1:
          assembly[block_ea] = [[x - image_base, disasm]]
        else:
          assembly[block_ea] = [[x - image_base, "loc_%x:" % x], [x - image_base, disasm]]

      decoded_size, ins = diaphora_decode(x)
      if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
        decoded_size -= ins.Operands[0].offb
      if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
        decoded_size -= ins.Operands[1].offb
      if decoded_size <= 0:
        decoded_size = 1

      # Check for constants
      for oper in ins.Operands:
        if oper.type == o_imm:
          if self.is_constant(oper, x) and self.constant_filter(oper.value):
            constants.append(oper.value)

      # Data references
      drefs = list(DataRefsFrom(x))
      if len(drefs) > 0:
        for dref in drefs:
          if get_func(dref) is None:
            str_constant = GetString(dref, -1, -1)
            if str_constant is not None:
              if str_constant not in constants:
                constants.append(str_constant)

      # Get the instruction's bytes
      curr_bytes = GetManyBytes(x, decoded_size, False)
      if curr_bytes is None or len(curr_bytes) != decoded_size:
        log("Failed to read %d bytes at [%08x]" % (decoded_size, x))
        continue

      bytes_hash.append(curr_bytes)
      bytes_sum += sum(map(ord, curr_bytes))

      function_hash.append(GetManyBytes(x, ItemSize(x), False))
      outdegree += len(list(CodeRefsFrom(x, 0))) # Count the outgoing references
      mnems.append(mnem)
      op_value = GetOperandValue(x, 1) # If there is a reference, get the target address
      if op_value == -1:
        op_value = GetOperandValue(x, 0)

      tmp_name = None
      if op_value != BADADDR and op_value in self.names:
        tmp_name = self.names[op_value]
        demangled_name = Demangle(tmp_name, INF_SHORT_DN)
        if demangled_name is not None:
          tmp_name = demangled_name
        if not tmp_name.startswith("sub_") and not tmp_name.startswith("nullsub_"):
          names.add(tmp_name)

      # Handle the callees
      l = list(CodeRefsFrom(x, 0))
      for callee in l:
        callee_func = get_func(callee)
        if callee_func and callee_func.startEA != func.startEA:
          if callee_func.startEA not in callees:
            callees.append(callee_func.startEA)

      if len(l) == 0:
        l = DataRefsFrom(x)

      tmp_type = None
      for ref in l:
        if ref in self.names:
          tmp_name = self.names[ref]
          tmp_type = GetType(ref)

      ins_cmt1 = GetCommentEx(x, 0)
      ins_cmt2 = GetCommentEx(x, 1)
      instructions_data.append([x - image_base, mnem, disasm, ins_cmt1, ins_cmt2, tmp_name, tmp_type])

      # Check whether it is a switch
      switch = get_switch_info_ex(x)
      if switch:
        switch_cases = switch.get_jtable_size() # Get the size of the jump table
        results = calc_switch_cases(x, switch)  # Compute the switch cases

        if results is not None:
          # It seems that IDAPython for idaq64 has some bug when reading
          # switch's cases. Do not attempt to read them if the 'cur_case'
          # returned object is not iterable.
          can_iter = False
          switch_cases_values = set()
          for idx in xrange(len(results.cases)):
            cur_case = results.cases[idx]
            if not '__iter__' in dir(cur_case):
              break

            can_iter |= True
            for cidx in xrange(len(cur_case)):
              case_id = cur_case[cidx]
              switch_cases_values.add(case_id)

          if can_iter:
            switches.append([switch_cases, list(switch_cases_values)])

    basic_blocks_data[block_ea] = instructions_data
    bb_relations[block_ea] = []
    if block_ea not in bb_degree:
      # bb in degree, out degree
      bb_degree[block_ea] = [0, 0]

    # Process the relations between basic blocks
    for succ_block in block.succs():
      succ_base = succ_block.startEA - image_base
      bb_relations[block_ea].append(succ_base)
      bb_degree[block_ea][1] += 1
      bb_edges.append((block_ea, succ_base))
      if succ_base not in bb_degree:
        bb_degree[succ_base] = [0, 0]
      bb_degree[succ_base][0] += 1

      edges += 1
      indegree += 1
      if not dones.has_key(succ_block.id):
        dones[succ_block] = 1

    for pred_block in block.preds():
      try:
        bb_relations[pred_block.startEA - image_base].append(block.startEA - image_base)
      except KeyError:
        bb_relations[pred_block.startEA - image_base] = [block.startEA - image_base]

      edges += 1
      outdegree += 1
      if not dones.has_key(succ_block.id):
        dones[succ_block] = 1

  # Build the topology between basic blocks
  for block in flow:
    block_ea = block.startEA - image_base
    for succ_block in block.succs():
      succ_base = succ_block.startEA - image_base
      bb_topological[bb_topo_num[block_ea]].append(bb_topo_num[succ_base])

  strongly_connected_spp = 0

  # Compute the topological sort of the strongly connected components,
  # mainly to detect self-loops
  try:
    strongly_connected = strongly_connected_components(bb_relations)
    bb_topological_sorted = robust_topological_sort(bb_topological)
    bb_topological = json.dumps(bb_topological_sorted)
    strongly_connected_spp = 1
    for item in strongly_connected:
      val = len(item)
      if val > 1:
        strongly_connected_spp *= self.primes[val]
  except:
    # XXX: FIXME: The original implementation that we're using is
    # recursive and can fail. We really need to create our own non
    # recursive version.
    strongly_connected = []
    bb_topological = None

  loops = 0
  for sc in strongly_connected:
    if len(sc) > 1:
      loops += 1
    else:
      if sc[0] in bb_relations and sc[0] in bb_relations[sc[0]]:
        loops += 1

  asm = []
  keys = assembly.keys()
  keys.sort()

  # Collect the ordered list of addresses, as shown in the assembly
  # viewer (when diffing). It will be extremely useful for importing
  # stuff later on.
  assembly_addrs = []

  # After sorting out the addresses of basic blocks, be sure that the
  # very first address is always the entry point, no matter at what
  # address it is.
  keys.remove(f - image_base)
  keys.insert(0, f - image_base)
  for key in keys:
    for line in assembly[key]:
      assembly_addrs.append(line[0])
      asm.append(line[1])
  asm = "\n".join(asm)

  cc = edges - nodes + 2
  proto = self.guess_type(f)
  proto2 = GetType(f)
  try:
    prime = str(self.primes[cc])
  except:
    log("Cyclomatic complexity too big: 0x%x -> %d" % (f, cc))
    prime = 0

  comment = GetFunctionCmt(f, 1)
  # The hashes are all computed with MD5
  bytes_hash = md5("".join(bytes_hash)).hexdigest()
  function_hash = md5("".join(function_hash)).hexdigest()

  function_flags = GetFunctionFlags(f)
  pseudo = None
  pseudo_hash1 = None
  pseudo_hash2 = None
  pseudo_hash3 = None
  pseudo_lines = 0
  pseudocode_primes = None
  if f in self.pseudo:
    pseudo = "\n".join(self.pseudo[f])
    pseudo_lines = len(self.pseudo[f])
    # Compute a fuzzy hash of the pseudocode
    pseudo_hash1, pseudo_hash2, pseudo_hash3 = self.kfh.hash_bytes(pseudo).split(";")
    if pseudo_hash1 == "":
      pseudo_hash1 = None
    if pseudo_hash2 == "":
      pseudo_hash2 = None
    if pseudo_hash3 == "":
      pseudo_hash3 = None
    pseudocode_primes = str(self.pseudo_hash[f])

  try:
    clean_assembly = self.get_cmp_asm_lines(asm)
  except:
    clean_assembly = ""
    print "Error getting assembly for 0x%x" % f

  clean_pseudo = self.get_cmp_pseudo_lines(pseudo)

  # The computation of md_index;
  # I don't know what principle it is based on
  md_index = 0
  if bb_topological:
    bb_topo_order = {}
    for i, scc in enumerate(bb_topological_sorted):
      for bb in scc:
        bb_topo_order[bb] = i
    tuples = []
    for src, dst in bb_edges:
      tuples.append((
        bb_topo_order[bb_topo_num[src]],
        bb_degree[src][0],
        bb_degree[src][1],
        bb_degree[dst][0],
        bb_degree[dst][1],))
    rt2, rt3, rt5, rt7 = (decimal.Decimal(p).sqrt() for p in (2, 3, 5, 7))
    emb_tuples = (sum((z0, z1 * rt2, z2 * rt3, z3 * rt5, z4 * rt7))
                  for z0, z1, z2, z3, z4 in tuples)
    md_index = sum((1 / emb_t.sqrt() for emb_t in emb_tuples))
    md_index = str(md_index)

  seg_rva = x - SegStart(x)
  rva = f - self.get_base_address()
  l = (name, nodes, edges, indegree, outdegree, size, instructions, mnems, names,
       proto, cc, prime, f, comment, true_name, bytes_hash, pseudo, pseudo_lines,
       pseudo_hash1, pseudocode_primes, function_flags, asm, proto2,
       pseudo_hash2, pseudo_hash3, len(strongly_connected), loops, rva, bb_topological,
       strongly_connected_spp, clean_assembly, clean_pseudo, mnemonics_spp, switches,
       function_hash, bytes_sum, md_index, constants, len(constants), seg_rva,
       assembly_addrs,
       callers, callees,
       basic_blocks_data, bb_relations)
  if self.hooks is not None:
    d = self.create_function_dictionary(l)
    d = self.hooks.after_export_function(d)
    l = self.get_function_from_dictionary(d)

  return l
Let's summarize briefly:
1. Get the function object and its graph representation; each node of the graph is a basic block.
2. Process each block's instructions, compute the block's in-degree and out-degree, and handle the relations between blocks.
3. For the instructions in each block, extract the mnemonics, immediates, call relations, data references, and raw bytes, and detect switch statements.
4. Build the topological graph between blocks and compute a topological sort of the strongly connected components.
5. Count the self-loops and compute the cyclomatic complexity.
6. Compute the md_index, a rather curious algorithm.
7. function_hash and the bytes hash are plain MD5; only the pseudocode hash is a fuzzy hash.
8. Finally, produce a tuple containing all of the function's attributes.
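The md_index computation from point 6 can be isolated into a standalone sketch: each CFG edge is summarized as a 5-tuple (topological position of the source, plus in/out-degrees of both endpoints), each tuple is embedded into a single number using irrational weights, and the reciprocal square roots are summed. The function below re-implements the snippet from read_function with my own names; the inputs are plain dicts rather than Diaphora's internal structures.

```python
import decimal

def md_index(edges, topo_order, in_deg, out_deg):
    """MD-index of a CFG, following the formula used in read_function.

    edges: list of (src, dst) basic-block ids
    topo_order / in_deg / out_deg: dicts keyed by basic-block id
    """
    rt2, rt3, rt5, rt7 = (decimal.Decimal(p).sqrt() for p in (2, 3, 5, 7))
    total = decimal.Decimal(0)
    for src, dst in edges:
        # One irrational weight per tuple position keeps different
        # (position, degree) profiles from colliding in the sum.
        emb = (decimal.Decimal(topo_order[src]) + in_deg[src] * rt2 +
               out_deg[src] * rt3 + in_deg[dst] * rt5 + out_deg[dst] * rt7)
        total += 1 / emb.sqrt()
    return total

# A tiny diamond CFG: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
topo_order = {0: 0, 1: 1, 2: 1, 3: 2}
in_deg = {0: 0, 1: 1, 2: 1, 3: 2}
out_deg = {0: 2, 1: 1, 2: 1, 3: 0}
print(md_index(edges, topo_order, in_deg, out_deg))
```

The result is a graph signature that is cheap to compare: two functions with structurally identical CFGs yield the same value regardless of addresses or register choices.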
Note: I find the way the primes are generated quite interesting. The initial primes are assigned per instruction, one prime per mnemonic, and from there the primes are multiplied together, layer by layer: instruction -> block -> function -> call graph. By factoring the prime product at each layer, the complete semantic information can be recovered.
Closing remarks: reading Diaphora's source took me about three weeks, with nearly a week of that spent on other things (slacking off, brushing up on fundamentals, and job hunting), but in the end I came away with a fairly complete understanding of Diaphora. My own conclusion: Diaphora's matching is effective and has a degree of error tolerance. When two library functions differ only in constants or addresses, Diaphora can ignore the difference (clean code); when they differ semantically, it can tell them apart. Bringing the pseudocode into the comparison improves the matching rate further. The downsides, as I see them: it is not easy to port — using it in Angr, for example, would be quite difficult; it relies on function signatures, which strikes me as rigid and prone to many missed matches; and the hash algorithm for bytes is MD5 rather than a fuzzy hash, which also raises the miss rate. Finally, since neither is that easy to port, I still need to think over whether to bring FLIRT or Diaphora into Angr. Maybe I really will have to turn to machine learning.