Reading the Diaphora Source: How the Library Function Database Is Generated

This blog post was written by the idle white hat 胖胖鹏鹏胖胖鹏 and is shared for personal technical exchange only; it may not be used commercially. Please credit the source when reposting; reposting or commercial use of any content in this blog without permission is prohibited.

Last week we analyzed Diaphora's function-matching rules, a comparison that relies on two databases. This time let's look at how those databases are generated. The code lives in diaphora_ida.py and is interleaved with a lot of GUI handling and IDA API calls; we will skip the GUI parts and annotate the IDAPython APIs inline in the code. (This installment is a bit rough; apologies.)

We follow the main function. Since we have not set any environment variables, every os.getenv call in the code returns None, so logically we jump straight into the _diff_or_export function.

def main():
  global g_bindiff
  if os.getenv("DIAPHORA_AUTO") is not None:
    file_out = os.getenv("DIAPHORA_EXPORT_FILE")
    if file_out is None:
      raise Exception("No export file specified!")

    use_decompiler = os.getenv("DIAPHORA_USE_DECOMPILER")
    if use_decompiler is None:
      use_decompiler = False

    idaapi.autoWait()

    if os.path.exists(file_out):
      if g_bindiff is not None:
        g_bindiff = None

      remove_file(file_out)
      log("Database %s removed" % repr(file_out))

    bd = CIDABinDiff(file_out)
    project_script = os.getenv("DIAPHORA_PROJECT_SCRIPT")
    if project_script is not None:
      bd.project_script = project_script
    bd.use_decompiler_always = use_decompiler

    try:
      bd.export()
    except KeyboardInterrupt:
      log("Aborted by user, removing crash file %s-crash..." % file_out)
      os.remove("%s-crash" % file_out)

    idaapi.qexit(0)
  else:
    _diff_or_export(True)

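The batch-mode branch in main() works because os.getenv returns None for any unset variable. A quick standalone illustration (the variable name is taken from the script above):

```python
import os

# Unset variables yield None, which is how main() decides between
# batch mode (DIAPHORA_AUTO set) and interactive mode.
os.environ.pop("DIAPHORA_AUTO", None)
assert os.getenv("DIAPHORA_AUTO") is None

os.environ["DIAPHORA_AUTO"] = "1"
assert os.getenv("DIAPHORA_AUTO") == "1"

# Note: the value is always a string; even "" counts as "set",
# which is why main() tests for None rather than truthiness.
os.environ["DIAPHORA_AUTO"] = ""
assert os.getenv("DIAPHORA_AUTO") is not None
```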
Let's step into the _diff_or_export function.

def _diff_or_export(use_ui, **options):
  global g_bindiff

  # Count all functions in the IDB
  total_functions = len(list(Functions()))
  # Check that an IDB is actually open
  if GetIdbPath() == "" or total_functions == 0:
    Warning("No IDA database opened or no function in the database.\nPlease open an IDA database and create some functions before running this script.")
    return

  # Build the options object
  opts = BinDiffOptions(**options)

  # If a UI is available, gather the user's input
  if use_ui:
    x = CBinDiffExporterSetup()
    x.Compile()
    x.set_options(opts)

    if not x.Execute():
      return

    opts = x.get_options()

  # The input file must differ from the output file; the output file name must be
  # longer than 5 characters; neither file may be an IDA database, only SQLite files.
  if opts.file_out == opts.file_in:
    Warning("Both databases are the same file!")
    return
  elif opts.file_out == "" or len(opts.file_out) < 5:
    Warning("No output database selected or invalid filename. Please select a database file.")
    return
  elif is_ida_file(opts.file_in) or is_ida_file(opts.file_out):
    Warning("One of the selected databases is an IDA file. Please select only database files")
    return

  export = True
  # If the output file already exists, either 1) the previous export crashed, so ask
  # whether to resume it; or 2) the file name collides, so ask whether to overwrite.
  if os.path.exists(opts.file_out):
    crash_file = "%s-crash" % opts.file_out
    resume_crashed = False
    crashed_before = False
    if os.path.exists(crash_file):
      crashed_before = True
      ret = askyn_c(1, "The previous export session crashed. Do you want to resume the previous crashed session?")
      if ret == -1:
        log("Cancelled")
        return
      elif ret == 1:
        resume_crashed = True

    if not resume_crashed and not crashed_before:
      ret = askyn_c(0, "Export database already exists. Do you want to overwrite it?")
      if ret == -1:
        log("Cancelled")
        return

      if ret == 0:
        export = False

    if export:
      if g_bindiff is not None:
        g_bindiff = None

      if not resume_crashed:
        remove_file(opts.file_out)
        log("Database %s removed" % repr(opts.file_out))
        if os.path.exists(crash_file):
          os.remove(crash_file)

  t0 = time.time()
  # Create the CIDABinDiff object and set its many parameters;
  # this also creates the database and runs the rest of the initialization
  try:
    bd = CIDABinDiff(opts.file_out)
    bd.use_decompiler_always = opts.use_decompiler
    bd.exclude_library_thunk = opts.exclude_library_thunk
    bd.unreliable = opts.unreliable
    bd.slow_heuristics = opts.slow
    bd.relaxed_ratio = opts.relax
    bd.experimental = opts.experimental
    bd.min_ea = opts.min_ea
    bd.max_ea = opts.max_ea
    bd.ida_subs = opts.ida_subs
    bd.ignore_sub_names = opts.ignore_sub_names
    bd.ignore_all_names = opts.ignore_all_names
    bd.ignore_small_functions = opts.ignore_small_functions
    bd.function_summaries_only = opts.func_summaries_only
    bd.max_processed_rows = diaphora.MAX_PROCESSED_ROWS * max(total_functions / 20000, 1)
    bd.timeout = diaphora.TIMEOUT_LIMIT * max(total_functions / 20000, 1)
    bd.project_script = opts.project_script

    # Do the export and/or the diff work:
    # the export routine is bd.export,
    # the diff routine is bd.diff
    if export:
      exported = False
      if os.getenv("DIAPHORA_PROFILE") is not None:
        log("*** Profiling export ***")
        import cProfile
        profiler = cProfile.Profile()
        profiler.runcall(bd.export)
        exported = True
        profiler.print_stats(sort="time")
      else:
        try:
          bd.export()
          exported = True
        except KeyboardInterrupt:
          log("Aborted by user, removing crash file %s-crash..." % opts.file_out)
          os.remove("%s-crash" % opts.file_out)

      if exported:
        log("Database exported. Took {} seconds".format(time.time() - t0))

    if opts.file_in != "":
      if os.getenv("DIAPHORA_PROFILE") is not None:
        log("*** Profiling diff ***")
        import cProfile
        profiler = cProfile.Profile()
        profiler.runcall(bd.diff, opts.file_in)
        profiler.print_stats(sort="time")
      else:
        bd.diff(opts.file_in)
  except:
    print("Error: %s" % sys.exc_info()[1])
    traceback.print_exc()

  return bd
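One detail worth noting above is how the comparison limits scale with the size of the binary: both max_processed_rows and timeout are multiplied by max(total_functions / 20000, 1). A minimal sketch of that scaling; the two constants here are illustrative placeholders, not the real values from diaphora.py:

```python
# Illustrative placeholders for diaphora.MAX_PROCESSED_ROWS and
# diaphora.TIMEOUT_LIMIT -- the real values live in diaphora.py.
MAX_PROCESSED_ROWS = 1000000
TIMEOUT_LIMIT = 180

def scaled_limits(total_functions):
    # The original code runs under Python 2, where '/' on ints floors;
    # '//' keeps the same behaviour under Python 3.
    factor = max(total_functions // 20000, 1)
    return MAX_PROCESSED_ROWS * factor, TIMEOUT_LIMIT * factor

# Small binaries keep the default budgets...
assert scaled_limits(5000) == (1000000, 180)
# ...huge binaries get proportionally larger ones.
assert scaled_limits(100000) == (5000000, 900)
```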
From the code we can see that the core class is CIDABinDiff: the function that generates the SQLite database is CIDABinDiff.export, and the one that performs the comparison is CIDABinDiff.diff. Let's walk through each of them.

Let's look at the diff function first.

  def diff(self, db):
    res = diaphora.CBinDiff.diff(self, db)
    # Finally, show the best and partial match lists
    # and register a menu entry so the results can be reopened
    self.show_choosers() # open all the match result lists
    self.register_menu() # register the menu entry
    hide_wait_box()      # hide the wait dialog
    return res
diff mainly calls the comparison routine analyzed previously and then presents the results.

We care more about the export function. export checks whether a crash log from a previous run exists, then calls do_export.

  # Export the database needed for function matching
  def export(self):
    if self.project_script is not None:
      log("Loading project specific Python script...")
      if not self.load_hooks():
        return False

    crashed_before = False
    crash_file = "%s-crash" % self.db_name
    if os.path.exists(crash_file):
      log("Resuming a previously crashed session...")
      crashed_before = True

    log("Creating crash file %s..." % crash_file)
    with open(crash_file, "wb") as f:
      f.close()

    try:
      show_wait_box("Exporting database")
      self.do_export(crashed_before)
    finally:
      hide_wait_box()

    self.db.commit()
    log("Removing crash file %s-crash..." % self.db_name)
    os.remove("%s-crash" % self.db_name)

    cur = self.db_cursor()
    cur.execute("analyze")
    cur.close()

    self.db_close()
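The crash handling in export() boils down to a simple sentinel-file pattern: an empty "<db>-crash" file exists exactly while an export is in flight, so finding one on startup means the previous session died. A minimal sketch of the same pattern outside IDA (file names hypothetical):

```python
import os

def run_with_crash_marker(db_name, do_export):
    """Minimal sketch of Diaphora's crash-marker pattern: an empty
    sentinel file exists exactly while an export is running, so a
    leftover marker at startup means the previous run crashed."""
    crash_file = "%s-crash" % db_name
    resuming = os.path.exists(crash_file)  # leftover marker => prior crash

    open(crash_file, "wb").close()         # create/refresh the marker
    do_export(resuming)                    # may raise; marker then survives
    os.remove(crash_file)                  # a clean exit removes it

db = "test.sqlite"
run_with_crash_marker(db, lambda resuming: None)
assert not os.path.exists(db + "-crash")
```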
Stepping into do_export: the function contains some crash-recovery code and then calls read_function for each function. A code fragment follows:
for func in func_list:
  i += 1
  # Print some progress information
  if (total_funcs > 100) and i % (total_funcs/100) == 0 or i == 1:
    line = "Exported %d function(s) out of %d total.\nElapsed %d:%02d:%02d second(s), remaining time ~%d:%02d:%02d"
    elapsed = time.time() - t
    remaining = (elapsed / i) * (total_funcs - i)

    m, s = divmod(remaining, 60)
    h, m = divmod(m, 60)
    m_elapsed, s_elapsed = divmod(elapsed, 60)
    h_elapsed, m_elapsed = divmod(m_elapsed, 60)

    replace_wait_box(line % (i, total_funcs, h_elapsed, m_elapsed, s_elapsed, h, m, s))

  # Crash handling
  if crashed_before:
    rva = func - self.get_base_address()
    if rva != start_func:
      continue

    # When we get to the last function that was previously exported,
    # switch off the 'crash' flag and continue with the next row.
    crashed_before = False
    continue

  # read_function is the main function we need to analyze
  props = self.read_function(func)
  if props == False:
    continue

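The progress message in the loop above assumes a constant per-function cost and extrapolates the remaining time with divmod. The same arithmetic in a standalone sketch:

```python
def eta_message(i, total_funcs, elapsed):
    """Reproduces the progress-string logic from do_export(): assume a
    constant per-function cost and extrapolate the remaining time."""
    remaining = (elapsed / i) * (total_funcs - i)
    m, s = divmod(int(remaining), 60)
    h, m = divmod(m, 60)
    m_e, s_e = divmod(int(elapsed), 60)
    h_e, m_e = divmod(m_e, 60)
    return ("Exported %d function(s) out of %d total. "
            "Elapsed %d:%02d:%02d, remaining ~%d:%02d:%02d"
            % (i, total_funcs, h_e, m_e, s_e, h, m, s))

# 250 of 1000 functions in 50s -> 0.2s per function -> 150s remaining.
print(eta_message(250, 1000, 50.0))
# -> Exported 250 function(s) out of 1000 total. Elapsed 0:00:50, remaining ~0:02:30
```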
read_function is quite long; there is no way around it, so let's keep going and see how it writes data into the database.

  # Main routine that exports a function into the database
  # Input:
  #   f: the function's address
  #   discard: whether to skip storing the result
  # Output:
  #   a tuple containing the function's attributes; otherwise, an error occurred
  def read_function(self, f, discard=False):
    name = GetFunctionName(int(f))
    true_name = name
    demangled_name = Demangle(name, INF_SHORT_DN)
    if demangled_name == "":
      demangled_name = None

    if demangled_name is not None:
      name = demangled_name

    # Give project-specific hooks a chance to filter certain functions
    if self.hooks is not None:
      ret = self.hooks.before_export_function(f, name)
      if not ret:
        return ret

    f = int(f)
    # Get the function object
    func = get_func(f)
    if not func:
      log("Cannot get a function object for 0x%x" % f)
      return False

    # Get the function's graph representation (flow chart)
    flow = FlowChart(func)
    size = 0

    if not self.ida_subs:
      # Skip unnamed functions
      if name.startswith("sub_") or name.startswith("j_") or name.startswith("unknown") or name.startswith("nullsub_"):
        return False

      # Already recognized runtime's function?
      flags = GetFunctionFlags(f)
      if flags & FUNC_LIB or flags == -1:
        return False

    if self.exclude_library_thunk:
      # Skip library functions and thunks
      flags = GetFunctionFlags(f)
      if flags & FUNC_LIB or flags & FUNC_THUNK or flags == -1:
        return False

    image_base = self.get_base_address()
    nodes = 0
    edges = 0
    instructions = 0
    mnems = []
    dones = {}
    names = set()
    bytes_hash = []
    bytes_sum = 0
    function_hash = []
    outdegree = 0
    indegree = len(list(CodeRefsTo(f, 1))) # code references to the function
    assembly = {}
    basic_blocks_data = {}
    bb_relations = {}
    bb_topo_num = {}
    bb_topological = {}
    switches = []
    bb_degree = {}
    bb_edges = []
    constants = []

    # The callees will be calculated later
    callees = list()
    # Calculate the callers
    callers = list()
    for caller in list(CodeRefsTo(f, 0)):
      caller_func = get_func(caller)
      if caller_func and caller_func.startEA not in callers:
        callers.append(caller_func.startEA)

    mnemonics_spp = 1
    cpu_ins_list = GetInstructionList() # all mnemonics supported by the current CPU, sorted below
    cpu_ins_list.sort()

    # Process each basic block
    for block in flow:
      nodes += 1
      instructions_data = []

      block_ea = block.startEA - image_base
      idx = len(bb_topological)
      bb_topological[idx] = []
      bb_topo_num[block_ea] = idx

      # Process each instruction in the block
      for x in list(Heads(block.startEA, block.endEA)):
        mnem = GetMnem(x)
        disasm = GetDisasm(x)
        size += ItemSize(x)
        instructions += 1

        # Pick a prime based on the mnemonic's index in the instruction list
        if mnem in cpu_ins_list:
          mnemonics_spp *= self.primes[cpu_ins_list.index(mnem)]

        # Record the assembly line
        try:
          assembly[block_ea].append([x - image_base, disasm])
        except KeyError:
          if nodes == 1:
            assembly[block_ea] = [[x - image_base, disasm]]
          else:
            assembly[block_ea] = [[x - image_base, "loc_%x:" % x], [x - image_base, disasm]]

        decoded_size, ins = diaphora_decode(x)
        if ins.Operands[0].type in [o_mem, o_imm, o_far, o_near, o_displ]:
          decoded_size -= ins.Operands[0].offb
        if ins.Operands[1].type in [o_mem, o_imm, o_far, o_near, o_displ]:
          decoded_size -= ins.Operands[1].offb
        if decoded_size <= 0:
          decoded_size = 1

        # Collect immediate constants
        for oper in ins.Operands:
          if oper.type == o_imm:
            if self.is_constant(oper, x) and self.constant_filter(oper.value):
              constants.append(oper.value)

          # Data references (e.g. string constants)
          drefs = list(DataRefsFrom(x))
          if len(drefs) > 0:
            for dref in drefs:
              if get_func(dref) is None:
                str_constant = GetString(dref, -1, -1)
                if str_constant is not None:
                  if str_constant not in constants:
                    constants.append(str_constant)

        # Read the instruction's bytes
        curr_bytes = GetManyBytes(x, decoded_size, False)
        if curr_bytes is None or len(curr_bytes) != decoded_size:
            log("Failed to read %d bytes at [%08x]" % (decoded_size, x))
            continue

        bytes_hash.append(curr_bytes)
        bytes_sum += sum(map(ord, curr_bytes))

        function_hash.append(GetManyBytes(x, ItemSize(x), False))
        outdegree += len(list(CodeRefsFrom(x, 0))) # count outgoing code references
        mnems.append(mnem)
        op_value = GetOperandValue(x, 1) # operand value, used below to resolve referenced names
        if op_value == -1:
          op_value = GetOperandValue(x, 0)

        tmp_name = None
        if op_value != BADADDR and op_value in self.names:
          tmp_name = self.names[op_value]
          demangled_name = Demangle(tmp_name, INF_SHORT_DN)
          if demangled_name is not None:
            tmp_name = demangled_name
          if not tmp_name.startswith("sub_") and not tmp_name.startswith("nullsub_"):
            names.add(tmp_name)

        # Handle the callees
        l = list(CodeRefsFrom(x, 0))
        for callee in l:
          callee_func = get_func(callee)
          if callee_func and callee_func.startEA != func.startEA:
            if callee_func.startEA not in callees:
              callees.append(callee_func.startEA)

        if len(l) == 0:
          l = DataRefsFrom(x)

        tmp_type = None
        for ref in l:
          if ref in self.names:
            tmp_name = self.names[ref]
            tmp_type = GetType(ref)

        ins_cmt1 = GetCommentEx(x, 0)
        ins_cmt2 = GetCommentEx(x, 1)
        instructions_data.append([x - image_base, mnem, disasm, ins_cmt1, ins_cmt2, tmp_name, tmp_type])

        # Check whether this is a switch statement
        switch = get_switch_info_ex(x)
        if switch:
          switch_cases = switch.get_jtable_size() # size of the jump table
          results = calc_switch_cases(x, switch) # compute the switch cases

          if results is not None:
            # It seems that IDAPython for idaq64 has some bug when reading
            # switch's cases. Do not attempt to read them if the 'cur_case'
            # returned object is not iterable.
            can_iter = False
            switch_cases_values = set()
            for idx in xrange(len(results.cases)):
              cur_case = results.cases[idx]
              if not '__iter__' in dir(cur_case):
                break

              can_iter |= True
              for cidx in xrange(len(cur_case)):
                case_id = cur_case[cidx]
                switch_cases_values.add(case_id)

            if can_iter:
              switches.append([switch_cases, list(switch_cases_values)])

      basic_blocks_data[block_ea] = instructions_data
      bb_relations[block_ea] = []
      if block_ea not in bb_degree:
        # bb in degree, out degree
        bb_degree[block_ea] = [0, 0]

      # Record the relations between blocks
      for succ_block in block.succs():
        succ_base = succ_block.startEA - image_base
        bb_relations[block_ea].append(succ_base)
        bb_degree[block_ea][1] += 1
        bb_edges.append((block_ea, succ_base))
        if succ_base not in bb_degree:
          bb_degree[succ_base] = [0, 0]
        bb_degree[succ_base][0] += 1

        edges += 1
        indegree += 1
        if not dones.has_key(succ_block.id):
          dones[succ_block] = 1

      for pred_block in block.preds():
        try:
          bb_relations[pred_block.startEA - image_base].append(block.startEA - image_base)
        except KeyError:
          bb_relations[pred_block.startEA - image_base] = [block.startEA - image_base]

        edges += 1
        outdegree += 1
        if not dones.has_key(succ_block.id):
          dones[succ_block] = 1

    # Build the topology between blocks
    for block in flow:
      block_ea = block.startEA - image_base
      for succ_block in block.succs():
        succ_base = succ_block.startEA - image_base
        bb_topological[bb_topo_num[block_ea]].append(bb_topo_num[succ_base])

    strongly_connected_spp = 0

    # Compute the strongly connected components and a topological sort,
    # mainly in order to detect (self-)loops
    try:
      strongly_connected = strongly_connected_components(bb_relations)
      bb_topological_sorted = robust_topological_sort(bb_topological)
      bb_topological = json.dumps(bb_topological_sorted)
      strongly_connected_spp = 1
      for item in strongly_connected:
        val = len(item)
        if val > 1:
          strongly_connected_spp *= self.primes[val]
    except:
      # XXX: FIXME: The original implementation that we're using is
      # recursive and can fail. We really need to create our own non
      # recursive version.
      strongly_connected = []
      bb_topological = None

    loops = 0
    for sc in strongly_connected:
      if len(sc) > 1:
        loops += 1
      else:
        if sc[0] in bb_relations and sc[0] in bb_relations[sc[0]]:
          loops += 1

    asm = []
    keys = assembly.keys()
    keys.sort()

    # Collect the ordered list of addresses, as shown in the assembly
    # viewer (when diffing). It will be extremely useful for importing
    # stuff later on.
    assembly_addrs = []

    # After sorting our the addresses of basic blocks, be sure that the
    # very first address is always the entry point, no matter at what
    # address it is.
    keys.remove(f - image_base)
    keys.insert(0, f - image_base)
    for key in keys:
      for line in assembly[key]:
        assembly_addrs.append(line[0])
        asm.append(line[1])
    asm = "\n".join(asm)

    cc = edges - nodes + 2
    proto = self.guess_type(f)
    proto2 = GetType(f)
    try:
      prime = str(self.primes[cc])
    except:
      log("Cyclomatic complexity too big: 0x%x -> %d" % (f, cc))
      prime = 0

    comment = GetFunctionCmt(f, 1)
	
    # Both hashes are plain MD5 digests
    bytes_hash = md5("".join(bytes_hash)).hexdigest()
    function_hash = md5("".join(function_hash)).hexdigest()

    function_flags = GetFunctionFlags(f)
    pseudo = None
    pseudo_hash1 = None
    pseudo_hash2 = None
    pseudo_hash3 = None
    pseudo_lines = 0
    pseudocode_primes = None
    if f in self.pseudo:
      pseudo = "\n".join(self.pseudo[f])
      pseudo_lines = len(self.pseudo[f])
      # Fuzzy-hash the pseudocode
      pseudo_hash1, pseudo_hash2, pseudo_hash3 = self.kfh.hash_bytes(pseudo).split(";")
      if pseudo_hash1 == "":
        pseudo_hash1 = None
      if pseudo_hash2 == "":
        pseudo_hash2 = None
      if pseudo_hash3 == "":
        pseudo_hash3 = None
      pseudocode_primes = str(self.pseudo_hash[f])

    try:
      clean_assembly = self.get_cmp_asm_lines(asm)
    except:
      clean_assembly = ""
      print "Error getting assembly for 0x%x" % f

    clean_pseudo = self.get_cmp_pseudo_lines(pseudo)

    # Computation of the MD-Index: a hash of the CFG's shape built from
    # per-edge tuples of topological order and in/out degrees
    md_index = 0
    if bb_topological:
      bb_topo_order = {}
      for i, scc in enumerate(bb_topological_sorted):
        for bb in scc:
          bb_topo_order[bb] = i
      tuples = []
      for src, dst in bb_edges:
        tuples.append((
            bb_topo_order[bb_topo_num[src]],
            bb_degree[src][0],
            bb_degree[src][1],
            bb_degree[dst][0],
            bb_degree[dst][1],))
      rt2, rt3, rt5, rt7 = (decimal.Decimal(p).sqrt() for p in (2, 3, 5, 7))
      emb_tuples = (sum((z0, z1 * rt2, z2 * rt3, z3 * rt5, z4 * rt7))
              for z0, z1, z2, z3, z4 in tuples)
      md_index = sum((1 / emb_t.sqrt() for emb_t in emb_tuples))
      md_index = str(md_index)

    seg_rva = x - SegStart(x)

    rva = f - self.get_base_address()
    l = (name, nodes, edges, indegree, outdegree, size, instructions, mnems, names,
             proto, cc, prime, f, comment, true_name, bytes_hash, pseudo, pseudo_lines,
             pseudo_hash1, pseudocode_primes, function_flags, asm, proto2,
             pseudo_hash2, pseudo_hash3, len(strongly_connected), loops, rva, bb_topological,
             strongly_connected_spp, clean_assembly, clean_pseudo, mnemonics_spp, switches,
             function_hash, bytes_sum, md_index, constants, len(constants), seg_rva,
             assembly_addrs,
             callers, callees,
             basic_blocks_data, bb_relations)

    if self.hooks is not None:
      d = self.create_function_dictionary(l)
      d = self.hooks.after_export_function(d)
      l = self.get_function_from_dictionary(d)

    return l
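The loop counting in read_function depends only on the strongly connected components of the basic-block graph: every SCC with more than one node is a loop, and a single-node SCC counts only when it has a self-edge. A standalone sketch of the same rule over a bb_relations-style dict; it uses Kosaraju's algorithm for the SCCs, whereas Diaphora uses a third-party strongly_connected_components implementation:

```python
def sccs(graph):
    """Kosaraju's algorithm; 'graph' maps node -> list of successors,
    the same shape as Diaphora's bb_relations. Recursive, so only
    suitable for small graphs (the code above hits a similar limit)."""
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in graph.get(u, []):
            if v not in seen:
                dfs1(v)
        order.append(u)            # record finish order
    for u in graph:
        if u not in seen:
            dfs1(u)
    rev = {}                       # transpose the graph
    for u in graph:
        for v in graph[u]:
            rev.setdefault(v, []).append(u)
    seen.clear()
    comps = []
    def dfs2(u, comp):
        seen.add(u)
        comp.append(u)
        for v in rev.get(u, []):
            if v not in seen:
                dfs2(v, comp)
    for u in reversed(order):      # reverse finish order on the transpose
        if u not in seen:
            comp = []
            dfs2(u, comp)
            comps.append(comp)
    return comps

def count_loops(bb_relations):
    """Same rule as read_function: multi-node SCCs are loops, and a
    single-node SCC only counts when the block jumps to itself."""
    loops = 0
    for sc in sccs(bb_relations):
        if len(sc) > 1:
            loops += 1
        elif sc[0] in bb_relations.get(sc[0], []):
            loops += 1
    return loops

# 0 -> 1 -> 2 -> 1 forms one loop, and 3 -> 3 is a self-loop.
g = {0: [1], 1: [2], 2: [1], 3: [3]}
assert count_loops(g) == 2
```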

Let's summarize briefly:

1. Get the function object and its graph representation; each node in the graph is a basic block.
2. For each block, process its instructions, compute the block's in/out degree, and record the relations between blocks.
3. For each instruction in a block, extract the mnemonic, immediates, call relations, data references and raw bytes, and detect switch statements.
4. Build the topological graph between blocks and compute a topological sort of the strongly connected components.
5. Count self-loops and compute the cyclomatic complexity.
6. Compute the MD-Index, a rather curious algorithm.
7. function_hash and the bytes hash are MD5 digests; only the pseudocode hash is a fuzzy hash.

8. Finally, produce a tuple containing all of the function's attributes.
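The MD-Index in step 6 is less mysterious than it looks: it hashes the CFG's shape by turning every edge into a tuple of (topological order of the source, in/out degrees of both endpoints), embedding each tuple into the reals with sqrt(2), sqrt(3), sqrt(5) and sqrt(7) weights (irrational, so distinct tuples almost never collide), and summing the inverse square roots. The same arithmetic as in read_function, reduced to plain dicts:

```python
import decimal

decimal.getcontext().prec = 50  # high precision, as in Diaphora

def md_index(edges, topo, indeg, outdeg):
    """edges: list of (src, dst); topo/indeg/outdeg: dicts keyed by
    basic-block id. Each edge becomes a 5-tuple embedded into the
    reals with irrational weights; the hash is the sum of the inverse
    square roots of the embeddings."""
    rt2, rt3, rt5, rt7 = (decimal.Decimal(p).sqrt() for p in (2, 3, 5, 7))
    total = decimal.Decimal(0)
    for src, dst in edges:
        emb = (topo[src] + indeg[src] * rt2 + outdeg[src] * rt3
               + indeg[dst] * rt5 + outdeg[dst] * rt7)
        total += 1 / emb.sqrt()
    return total

# Tiny 3-block CFG: entry A branches to B and C.
edges = [("A", "B"), ("A", "C")]
topo = {"A": 0, "B": 1, "C": 1}
indeg = {"A": 0, "B": 1, "C": 1}
outdeg = {"A": 2, "B": 0, "C": 0}
print(md_index(edges, topo, indeg, outdeg))  # ~0.8377
```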

Note: I find the way the primes are generated particularly interesting. The initial prime is derived from the instruction, one prime per mnemonic, and the primes are then multiplied together, progressing layer by layer from instruction to block to function to call graph. Factoring the product of each layer recovers the complete semantic information.
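The layering described in the note can be seen in miniature with mnemonics_spp: each mnemonic maps to a fixed prime, and the product encodes the multiset of instructions, order-independently but with multiplicity. A toy sketch (the prime and instruction tables here are illustrative, not IDA's):

```python
# A multiset of mnemonics encoded as a product of small primes, as in
# mnemonics_spp: each mnemonic gets a fixed prime, and the product is
# order-independent while still counting repetitions.
PRIMES = [2, 3, 5, 7, 11, 13]                       # illustrative table
INS = sorted(["add", "call", "jmp", "mov", "ret", "xor"])

def spp(mnemonics):
    h = 1
    for m in mnemonics:
        h *= PRIMES[INS.index(m)]
    return h

# The same instructions in a different order hash identically...
assert spp(["mov", "add", "ret"]) == spp(["ret", "mov", "add"])
# ...and repetitions are preserved: factoring the product recovers
# how often each mnemonic occurred.
assert spp(["mov", "mov", "ret"]) != spp(["mov", "ret"])
```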

Final words: I spent roughly three weeks reading the Diaphora source, with nearly a week in between on other things (slacking off, brushing up on fundamentals, job hunting), and I now have a fairly complete understanding of it. My own summary: Diaphora's matching mechanism is effective and tolerates a certain amount of noise. When two library functions differ only in constants or in addresses, Diaphora can ignore the difference (via the cleaned code); when they differ semantically, it can still tell them apart. Bringing pseudocode into the comparison improves recognition further. The drawbacks, in my view: it is not easy to port, e.g. using it inside Angr would be difficult; relying on function signatures is rigid and can cause many missed matches; and using MD5 rather than fuzzy hashing for most hashes also raises the miss rate. Finally, since neither is easy to port, I still need to think over whether to bring FLIRT or Diaphora into Angr. Perhaps machine learning really is the way to go.


References:
[1] The IDAPython book. https://leanpub.com/IDAPython-Book
[2] IDA SDK API reference. https://www.hex-rays.com/products/ida/support/sdkdoc/
