Diaphora Source Code Analysis: diaphora.py

This blog post was written by 胖胖鹏鹏胖胖鹏 (an idle white hat) for personal technical exchange only and must not be used commercially. Please credit the source when reposting; reposting or commercial use of any content from this blog without permission is forbidden.

      When using Diaphora to match library functions, we first have Diaphora generate a SQLite signature database (db1) from the original library (*.a, *.lib), then generate a signature database (db2) for the target program (*.so, *.exe, *.bin), and finally identify functions by comparing the signatures in the two databases. Generating the signature databases involves the IDAPython API, so we set that part aside for now and look first at how the database comparison is done. The code is here.

      diaphora.py provides two entry points: one called directly from IDA, and another invoked from the command line. When invoking from the command line we must supply two arguments: the program's signature database, which we call main, and the original library's signature database, which we call diff. Once these two are set, the analysis proceeds. The main background knowledge needed here is SQL queries; readers unfamiliar with them should brush up first (W3Schools is a good start) in order to follow the code. Diaphora's main analysis logic lives in the diff() function.

  def diff(self, db):
    # Main driver for function identification.
    # Parameters:
    #   db - the database to diff against
    self.last_diff_db = db

    cur = self.db_cursor()
    cur.execute('attach "%s" as diff' % db)

    try:
      cur.execute("select value from diff.version")
    except:
      log("Error: %s " % sys.exc_info()[1])
      Warning("The selected file does not look like a valid SQLite exported database!")
      cur.close()
      return False

    row = cur.fetchone()
    if not row:
      Warning("Invalid database!")
      return False

    if row["value"] != VERSION_VALUE:
      Warning("The database is from a different version (current %s, database %s)!" % (VERSION_VALUE, row[0]))
      return False

    try:
      t0 = time.time()
      log_refresh("Performing diffing...", True)
      
      self.do_continue = True
      if self.equal_db():
        log("The databases seems to be 100% equal")

      if self.do_continue:
        # Compare the call graphs of the two programs
        self.check_callgraph()

        # Look for functions that were not modified at all
        log_refresh("Finding best matches...")
        self.find_equal_matches()

        # Look for partially modified functions
        log_refresh("Finding partial matches")
        self.find_matches()

        if self.slow_heuristics:
          # Recover functions through the call graph
          log_refresh("Finding with heuristic 'Callgraph matches'")
          self.find_callgraph_matches()

        if self.unreliable:
          # Look for probably unreliable matches
          log_refresh("Finding probably unreliable matches")
          self.find_unreliable_matches()
        
        if self.experimental:
          # Match functions with experimental heuristics
          log_refresh("Finding experimental matches")
          self.find_experimental_matches()

        # Finally, collect the functions left unmatched
        log_refresh("Finding unmatched functions")
        self.find_unmatched()

        log("Done. Took {} seconds".format(time.time() - t0))
    finally:
      cur.close()
    return True
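The cross-database queries above all rest on one SQLite feature: a single connection can ATTACH a second database file and then join across both. A minimal standalone sketch (not Diaphora code; the table layout is invented for illustration):

```python
import os
import sqlite3
import tempfile

# Build two throwaway databases standing in for db1 and db2.
paths = []
for value in ("main db", "diff db"):
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    con = sqlite3.connect(path)
    con.execute("create table version (value text)")
    con.execute("insert into version values (?)", (value,))
    con.commit()
    con.close()
    paths.append(path)

# One connection, two databases: the same trick diff() uses. After the
# ATTACH, "diff.version", "diff.functions" etc. are addressable in SQL.
con = sqlite3.connect(paths[0])
con.execute('attach "%s" as diff' % paths[1])
rows = con.execute("""select value from version
                      union all
                      select value from diff.version""").fetchall()
print(rows)  # one row from each database
con.close()
for p in paths:
    os.remove(p)
```

This is also why the version check in diff() can fail cleanly: if the attached file is not a Diaphora export, `select value from diff.version` raises and the function bails out.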

      Let's take stock here. First the tool compares the signature databases' versions and checks whether the two databases are identical; next it compares the prime signatures of the CFGs in the two databases; then it searches for best matches, i.e. 100% identical functions, and for partially matched functions; if slow heuristics are allowed it also matches through the call graph; depending on user settings it then matches unreliable results and applies experimental heuristics; finally it collects the functions that remain unmatched. Next we go through the functions involved one by one.

0x01 equal_db()
  # Decide whether the two databases are identical.
  # They are equal when program.md5sum matches across both databases, or,
  # failing that, when the two functions tables contain exactly the same rows.
  # "program" is this binary's database; "diff" is the database being compared.
  # Returns: True when equal, False otherwise
  def equal_db(self):
    cur = self.db_cursor()
    sql = "select count(*) total from program p, diff.program dp where p.md5sum = dp.md5sum"
    cur.execute(sql)
    row = cur.fetchone()
    ret = row["total"] == 1
    if not ret:
      sql = "select count(*) total from (select * from functions except select * from diff.functions) x"
      cur.execute(sql)
      row = cur.fetchone()
      ret = row["total"] == 0
    else:
      log("Same MD5 in both databases")
    cur.close()
    return ret
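The second check in equal_db() is standard SQLite set arithmetic: `functions EXCEPT diff.functions` is empty exactly when every row of the main table also exists, byte for byte, in the attached one. A toy illustration (not Diaphora code; two plain tables stand in for the attached database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table a (name text, bytes_hash text)")
con.execute("create table b (name text, bytes_hash text)")
con.executemany("insert into a values (?, ?)", [("f1", "aa"), ("f2", "bb")])
con.executemany("insert into b values (?, ?)", [("f1", "aa"), ("f2", "cc")])

# Same shape as equal_db(): count what survives the set difference.
sql = "select count(*) total from (select * from a except select * from b) x"
total = con.execute(sql).fetchone()[0]
print(total)  # 1: only f2, whose bytes_hash changed, survives the EXCEPT
```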
0x02 check_callgraph()
  # Fetch callgraph_primes and callgraph_all_primes from both programs.
  # If the callgraph_primes products are equal, the call graphs are taken
  # to be structurally identical; otherwise the factorizations stored in
  # callgraph_all_primes are compared, and if nothing differs it is still
  # a perfect match. Failing that, log the percentage by which they differ.
  def check_callgraph(self):
    cur = self.db_cursor()
    sql = """select callgraph_primes, callgraph_all_primes from program
             union all
             select callgraph_primes, callgraph_all_primes from diff.program"""
    cur.execute(sql)
    rows = cur.fetchall()
    if len(rows) == 2:
      cg1 = decimal.Decimal(rows[0]["callgraph_primes"])
      cg_factors1 = json.loads(rows[0]["callgraph_all_primes"])
      cg2 = decimal.Decimal(rows[1]["callgraph_primes"])
      cg_factors2 = json.loads(rows[1]["callgraph_all_primes"])

      if cg1 == cg2:
        self.equal_callgraph = True
        log("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
        Warning("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
      else:
        FACTORS_CACHE[cg1] = cg_factors1
        FACTORS_CACHE[cg2] = cg_factors2
        diff = difference(cg1, cg2)
        total = sum(cg_factors1.values())
        if total == 0 or diff == 0:
          log("Callgraphs are 100% equal")
        else:
          percent = diff * 100. / total
          log("Callgraphs from both programs differ in %f%%" % percent)

    cur.close()

      Here the difference() function from jkutils is used; it was annotated at the beginning of the code.
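The prime signatures deserve a short aside. Each function's CFG is reduced to a number, the whole call graph to a product of primes, and callgraph_all_primes stores the factorization. My understanding of difference() is that it counts the prime factors, with multiplicity, that the two products do not share; the sketch below captures that idea under my own assumptions and is not the jkutils implementation:

```python
from collections import Counter

def factor_difference(factors1, factors2):
    """factors1/factors2 map prime -> exponent, like callgraph_all_primes."""
    c1, c2 = Counter(factors1), Counter(factors2)
    # Size of the symmetric difference of the two multisets of factors.
    return sum(((c1 - c2) + (c2 - c1)).values())

cg1 = {2: 3, 3: 1, 7: 2}   # 2**3 * 3    * 7**2
cg2 = {2: 3, 3: 2, 11: 1}  # 2**3 * 3**2 * 11
diff = factor_difference(cg1, cg2)
total = sum(cg1.values())            # 6 factors in the first call graph
print(diff, diff * 100.0 / total)    # 4 differing factors out of 6
```

With diff and total in hand, the percentage printed by check_callgraph() is just `diff * 100. / total`, as in the code above.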

0x03 best match rules and code

      What interests us most is how identical functions are found. Below is Diaphora's code along with the helper functions it involves; I have annotated the code in detail.

  def find_equal_matches(self):
    cur = self.db_cursor()
    # Start by counting the total number of functions in each database
    sql = """select count(*) total from functions
             union all
             select count(*) total from diff.functions"""
    cur.execute(sql)
    rows = cur.fetchall()
    if len(rows) != 2:
      Warning("Malformed database, only %d rows!" % len(rows))
      raise Exception("Malformed database!")

    self.total_functions1 = rows[0]["total"]
    self.total_functions2 = rows[1]["total"]

    # Rows that are identical across both functions tables:
    # this is where "100% equal" comes from
    sql = "select address ea, mangled_function, nodes from (select * from functions intersect select * from diff.functions) x"
    cur.execute(sql)
    rows = cur.fetchall()
    choose = self.best_chooser
    if len(rows) > 0:
      for row in rows:
        name = row["mangled_function"]
        ea = row["ea"]
        nodes = int(row["nodes"])

        choose.add_item(CChooser.Item(ea, name, ea, name, "100% equal", 1, nodes, nodes))
        self.matched1.add(name)
        self.matched2.add(name)

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

    # When two functions share bytes_hash, rva/segment_rva and instruction
    # count, and their names match (when not auto-generated sub_ names),
    # they also join the best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same RVA and hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where (df.rva = f.rva
                   or df.segment_rva = f.segment_rva)
                 and df.bytes_hash = f.bytes_hash
                 and df.instructions = f.instructions
                 and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
                   or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4)))"""
    log_refresh("Finding with heuristic 'Same RVA and hash'")
    self.add_matches_from_query(sql, choose)

    # When two functions share their position (id) and bytes_hash, add them as best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same order and hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where df.id = f.id
                 and df.bytes_hash = f.bytes_hash
                 and df.instructions = f.instructions
                 and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
                   or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4)))
                 and ((f.nodes > 1 and df.nodes > 1
                   and f.instructions > 5 and df.instructions > 5)
                    or f.instructions > 10 and df.instructions > 10)"""
    log_refresh("Finding with heuristic 'Same order and hash'")
    self.add_matches_from_query(sql, choose)

    # When the function_hash alone matches, also add as a best match
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Function hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where f.function_hash = df.function_hash 
                 and ((f.nodes > 1 and df.nodes > 1
                   and f.instructions > 5 and df.instructions > 5)
                    or f.instructions > 10 and df.instructions > 10)"""
    log_refresh("Finding with heuristic 'Function hash'")
    self.add_matches_from_query(sql, choose)

    # Matching bytes_hash and non-empty names also count as best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Bytes hash and names' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where f.bytes_hash = df.bytes_hash
                 and f.names = df.names
                 and f.names != '[]'
                 and f.instructions > 5 and df.instructions > 5"""
    log_refresh("Finding with heuristic 'Bytes hash and names'")
    self.add_matches_from_query(sql, choose)

    # Even a bytes_hash match alone is added
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Bytes hash' description,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.bytes_hash = df.bytes_hash
                 and f.instructions > 5 and df.instructions > 5"""
    log_refresh("Finding with heuristic 'Bytes hash'")
    self.add_matches_from_query(sql, choose)

    if not self.ignore_all_names:
      self.find_same_name(self.partial_chooser)

    # Handle the unreliable match results here
    if self.unreliable:
      # Added only when this option is selected; to keep the false-positive
      # rate down, it is better left disabled
      # Unreliable match: only the size, bytes_sum and mnemonics are equal
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Bytes sum' description,
                       f.nodes bb1, df.nodes bb2
                  from functions f,
                       diff.functions df
                 where f.bytes_sum = df.bytes_sum
                   and f.size = df.size
                   and f.mnemonics = df.mnemonics
                   and f.instructions > 5 and df.instructions > 5"""
      log_refresh("Finding with heuristic 'Bytes sum'")
      self.add_matches_from_query(sql, choose)

    # Perfect match: identical assembly or pseudo-code
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2
               from functions f,
                    diff.functions df
              where f.pseudocode = df.pseudocode
                and df.pseudocode is not null
                and f.pseudocode_lines >= 5 """ + postfix + """
                and f.name not like 'nullsub%'
                and df.name not like 'nullsub%'
              union
             select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal assembly' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2
               from functions f,
                    diff.functions df
              where f.assembly = df.assembly
                and df.assembly is not null
                and f.instructions > 5 and df.instructions > 5 
                and f.name not like 'nullsub%'
                and df.name not like 'nullsub%' """
    log_refresh("Finding with heuristic 'Equal assembly or pseudo-code'")
    self.add_matches_from_query(sql, choose)

    # Match by similarity ratio: identical cleaned-up assembly or pseudo-code
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same cleaned up assembly or pseudo-code' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where (f.clean_assembly = df.clean_assembly
                  or f.clean_pseudo = df.clean_pseudo) 
                  and f.pseudocode_lines > 5 and df.pseudocode_lines > 5
                  and f.name not like 'nullsub%'
                  and df.name not like 'nullsub%' """
    log_refresh("Finding with heuristic 'Same cleaned up assembly or pseudo-code'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

    # Matching rva, instruction count, node count, edge count and mnemonics
    # also qualifies, filed by similarity ratio
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same address, nodes, edges and mnemonics' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.rva = df.rva
                and f.instructions = df.instructions
                and f.nodes = df.nodes
                and f.edges = df.edges
                and f.mnemonics = df.mnemonics""" + postfix
    log_refresh("Finding with heuristic 'Same address, nodes, edges and mnemonics'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, None)

    cur.close()
	
  # Run the SQL statement and add every result to the chooser
  # Note: only use this when the similarity ratio is known to be 1.00
  def add_matches_from_query(self, sql, choose):
    """ Warning: use this *only* if the ratio is known to be 1.00 """
    if self.all_functions_matched():
      return
    
    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    while 1:
      i += 1
      if i % 1000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = str(row["ea2"])
      name2 = row["name2"]
      desc = row["description"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      choose.add_item(CChooser.Item(ea, name1, ea2, name2, desc, 1, bb1, bb2))
      self.matched1.add(name1)
      self.matched2.add(name2)
    cur.close()
	
  def add_matches_from_query_ratio(self, sql, best, partial, unreliable=None, debug=False):
    """ File each match by similarity ratio:
        ratio == 1: best match;
        0.5 <= ratio < 1: partial match;
        ratio < 0.5: unreliable match.
    """
    if self.all_functions_matched():
      return

    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    t = time.time()
    while 1:
      if time.time() - t > self.timeout:
        log("Timeout")
        break

      i += 1
      if i % 50000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = row["ea2"]
      name2 = row["name2"]
      desc = row["description"]
      pseudo1 = row["pseudo1"]
      pseudo2 = row["pseudo2"]
      asm1 = row["asm1"]
      asm2 = row["asm2"]
      ast1 = row["pseudo_primes1"]
      ast2 = row["pseudo_primes2"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])
      md1 = row["md1"]
      md2 = row["md2"]

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)
      if debug:
        print "0x%x 0x%x %d" % (int(ea), int(ea2), r)

      if r == 1:
        self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r >= 0.5:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r < 0.5 and unreliable is not None:
        unreliable.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      else:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)

    cur.close()
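check_ratio() itself is outside this excerpt; it blends several signals (the pseudo-code AST primes, the pseudo-code text, the assembly text and the MD index). As a rough stand-in for its text-similarity component only, the sketch below scores two pseudo-code listings with Python's difflib and applies the same bucketing as the code above; the snippets and the bucket() helper are invented for illustration:

```python
from difflib import SequenceMatcher

def bucket(pseudo1, pseudo2):
    # Same thresholds as add_matches_from_query_ratio():
    # 1 -> best, [0.5, 1) -> partial, below 0.5 -> unreliable.
    r = SequenceMatcher(None, pseudo1, pseudo2).ratio()
    if r == 1:
        return "best"
    elif r >= 0.5:
        return "partial"
    return "unreliable"

a = "v1 = a + b;\nreturn v1;"
print(bucket(a, a))                          # identical code: "best"
print(bucket(a, "v2 = a + b;\nreturn v2;"))  # small rename: "partial"
print(bucket(a, "while (x) x = next(x);"))   # unrelated code: "unreliable"
```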

      That may be a lot to digest, so let's summarize the best-match rules.

best match rules:
    Rule 1: every column of the functions table row is identical
    Rule 2: bytes_hash, rva/segment_rva, instruction count and name (not sub_*) match
    Rule 3: bytes_hash and the function's position (id) in the database match
    Rule 4: bytes_hash and names match
    Rule 5: bytes_hash matches
    Rule 6: function_hash matches
    Rule 7: the assembly or the pseudo-code is identical
    Rule 8: the cleaned-up assembly or pseudo-code is identical
    Rule 9: RVA, node count, edge count, instruction count and mnemonics match
Note: apart from Rule 1, a pair is confirmed as a best match only when its similarity ratio is 1.00

       Rules 8 and 9 can also produce partial and unreliable matches

      Summed up, the rules rest on only three key pieces of information: the byte hash, the assembly (or pseudo-code), and the instruction counts and mnemonics.
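Rule 1 is worth seeing in isolation: INTERSECT keeps only rows that are identical in every column, which is exactly the "100% equal" set built at the top of find_equal_matches(). A toy version (not Diaphora code; a second plain table stands in for the attached diff database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
for t in ("functions", "diff_functions"):
    con.execute("create table %s (address int, mangled_function text, nodes int)" % t)
con.executemany("insert into functions values (?, ?, ?)",
                [(0x1000, "memcpy", 3), (0x2000, "sub_2000", 7)])
con.executemany("insert into diff_functions values (?, ?, ?)",
                [(0x1000, "memcpy", 3), (0x2000, "sub_2000", 8)])  # nodes changed

sql = """select address ea, mangled_function, nodes from
         (select * from functions intersect select * from diff_functions) x"""
rows = con.execute(sql).fetchall()
print(rows)  # only the untouched memcpy row survives the INTERSECT
```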

0x04 partial match rules and code

      A partial match is a function pair whose similarity is below 1.0 but above 0.5 (or a user-defined threshold), hence the name. This part is long and the rules are many; if the code gets tedious, feel free to skip ahead, as I summarize the partial-match rules after the code.
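One detail worth isolating before the listing: the "Same rare MD Index" heuristic only trusts an md_index value when it is rare, i.e. shared by at most two functions per database, enforced with GROUP BY ... HAVING count(*) <= 2. A toy sketch of just that rarity filter (table contents invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table functions (name text, md_index real)")
con.executemany("insert into functions values (?, ?)",
                [("f1", 12.5), ("f2", 12.5),             # rare: shared by two
                 ("g1", 3.0), ("g2", 3.0), ("g3", 3.0),  # too common: dropped
                 ("h1", 0.0)])                           # zero is excluded
sql = """select md_index from functions
          where md_index != 0
          group by md_index
         having count(*) <= 2"""
rare = [r[0] for r in con.execute(sql)]
print(rare)  # [12.5]
```

A value shared by many functions carries no matching power, so filtering by rarity keeps this heuristic's false positives down.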

  def find_matches(self):
    """ Filter candidate function pairs with a set of rules, compute a
        similarity ratio from the AST, pseudo-code, assembly, MD index and
        related values, then file each pair by that ratio:
        (0, 0.5): unreliable
        [0.5, 1): partial match
        1: best match
    """
    choose = self.partial_chooser

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

    # Rule 1: same rare MD index
    # Result: best, partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same rare MD Index' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df,
                      (select md_index
                         from diff.functions
                        where md_index != 0
                        group by md_index
                       having count(*) <= 2
                        union 
                       select md_index
                         from main.functions
                        where md_index != 0
                        group by md_index
                       having count(*) <= 2
                      ) shared_mds
                where f.md_index = df.md_index
                  and df.md_index = shared_mds.md_index
                  and f.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Same rare MD Index'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

    # Rule 2: same MD index and constants
    # Result: best, partial
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same MD Index and constants' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2,
                     df.tarjan_topological_sort, df.strongly_connected_spp
                from functions f,
                     diff.functions df
               where f.md_index = df.md_index
                 and f.md_index > 0
                 and ((f.constants = df.constants
                 and f.constants_count > 0)) """ + postfix
    log_refresh("Finding with heuristic 'Same MD Index and constants'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

    # Rule 3: all attributes equal, or most attributes equal (excluding
    # pseudocode_hash1/2/3 and pseudocode_primes)
    # Result: best, partial
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'All attributes' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.nodes = df.nodes 
                and f.edges = df.edges
                and f.indegree = df.indegree
                and f.outdegree = df.outdegree
                and f.size = df.size
                and f.instructions = df.instructions
                and f.mnemonics = df.mnemonics
                and f.names = df.names
                and f.prototype2 = df.prototype2
                and f.cyclomatic_complexity = df.cyclomatic_complexity
                and f.primes_value = df.primes_value
                and f.bytes_hash = df.bytes_hash
                and f.pseudocode_hash1 = df.pseudocode_hash1
                and f.pseudocode_primes = df.pseudocode_primes
                and f.pseudocode_hash2 = df.pseudocode_hash2
                and f.pseudocode_hash3 = df.pseudocode_hash3
                and f.strongly_connected = df.strongly_connected
                and f.loops = df.loops
                and f.tarjan_topological_sort = df.tarjan_topological_sort
                and f.strongly_connected_spp = df.strongly_connected_spp """ + postfix + """
              union 
             select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Most attributes' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
               where f.nodes = df.nodes 
                 and f.edges = df.edges
                 and f.indegree = df.indegree
                 and f.outdegree = df.outdegree
                 and f.size = df.size
                 and f.instructions = df.instructions
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.prototype2 = df.prototype2
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.primes_value = df.primes_value
                 and f.bytes_hash = df.bytes_hash
                 and f.strongly_connected = df.strongly_connected
                 and f.loops = df.loops
                 and f.tarjan_topological_sort = df.tarjan_topological_sort
                 and f.strongly_connected_spp = df.strongly_connected_spp """
    sql += postfix
    log_refresh("Finding with heuristic 'All or most attributes'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

    # Rule 3': with slow heuristics enabled, match functions whose switch
    # structures are identical
    # Threshold: 0.2
    # Result: partial, unreliable
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Switch structures' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.switches = df.switches
              and df.switches != '[]' """ + postfix
      log_refresh("Finding with heuristic 'Switch structures'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.2)

    # Note: using add_matches_from_query_ratio_max does not rule out best
    # matches; it simply promotes only pairs with ratio = 1 to best
    # Rule 4: same constants and constant count
    # Threshold: 0.5
    # Result: partial, unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same constants' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.constants = df.constants
                and f.constants_count = df.constants_count
                and f.constants_count > 0 """ + postfix
    log_refresh("Finding with heuristic 'Same constants'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

    # Rule 5: same address, node count, edge count, primes and instruction count
    # Threshold: 0.5
    # Result: partial, unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same address, nodes, edges and primes (re-ordered instructions)' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.rva = df.rva
                and f.instructions = df.instructions
                and f.nodes = df.nodes
                and f.edges = df.edges
                and f.primes_value = df.primes_value
                and f.nodes > 3""" + postfix
    log_refresh("Finding with heuristic 'Same address, nodes, edges and primes (re-ordered instructions)'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

    # Rule 6: same names, MD index and instruction count
    # Result: best, partial
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Import names hash' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.names = df.names
                 and f.names != '[]'
                 and f.md_index = df.md_index
                 and f.instructions = df.instructions
                 and f.nodes > 5 and df.nodes > 5""" + postfix
    log_refresh("Finding with heuristic 'Import names hash'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

    # Rule 7: same node count, edge count, complexity, mnemonics, names,
    # prototype, in-degree and out-degree
    # Result: partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Nodes, edges, complexity, mnemonics, names, prototype2, in-degree and out-degree' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.nodes = df.nodes
                 and f.edges = df.edges
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.prototype2 = df.prototype2
                 and f.indegree = df.indegree
                 and f.outdegree = df.outdegree
                 and f.nodes > 3
                 and f.edges > 3
                 and f.names != '[]'"""  + postfix + """
               union
              select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Nodes, edges, complexity, mnemonics, names and prototype2' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.nodes = df.nodes
                 and f.edges = df.edges
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.names != '[]'
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.prototype2 = df.prototype2""" + postfix
    log_refresh("Finding with heuristic 'Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree'")
    self.add_matches_from_query_ratio(sql, self.partial_chooser, self.partial_chooser)

    # Rule 8: same mnemonics, instruction count and names
    # Result: partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Mnemonics and names' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.mnemonics = df.mnemonics
                 and f.instructions = df.instructions
                 and f.names = df.names
                 and f.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Mnemonics and names'")
    self.add_matches_from_query_ratio(sql, choose, choose)

    # Rule 9: same mnemonics small-primes-product and instruction count
    # Threshold: 0.6
    # Result: partial, unreliable
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Mnemonics small-primes-product' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.mnemonics_spp = df.mnemonics_spp
                 and f.instructions = df.instructions
                 and f.nodes > 1 and df.nodes > 1
                 and df.instructions > 5 """ + postfix
    log_refresh("Finding with heuristic 'Mnemonics small-primes-product'")
    self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)

    # Search using some of the previous criteria, but this time computing
    # the edit distance
    log_refresh("Finding with heuristic 'Small names difference'")
    self.search_small_differences(choose)

    # Rule 10: same pseudo-code fuzzy hash
    # Result: best, partial
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_hash1 = f.pseudocode_hash1
                   or df.pseudocode_hash2 = f.pseudocode_hash2
                   or df.pseudocode_hash3 = f.pseudocode_hash3""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy hashes'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
    else:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_hash1 = f.pseudocode_hash1""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy hash'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

	# Rule11:伪代码行数和名字相同
	# 结果:best,partial,unreliable
    sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code and names' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode_lines = df.pseudocode_lines
                and f.names = df.names
                and df.names != '[]'
                and df.pseudocode_lines > 5
                and df.pseudocode is not null 
                and f.pseudocode is not null""" + postfix
    log_refresh("Finding with heuristic 'Similar pseudo-code and names'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule12: 伪代码 fuzzy AST hash相同
	# 结果:best,partial
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy AST hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_primes = f.pseudocode_primes
                  and f.pseudocode_lines > 3
                  and length(f.pseudocode_primes) >= 35""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy AST hash'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

	  # Rule13: 部分伪代码fuzzy AST hash相同
	  # 阈值:0.5
	  # 结果:best,partial
      sql = """  select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Partial pseudo-code fuzzy hash' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where substr(df.pseudocode_hash1, 1, 16) = substr(f.pseudocode_hash1, 1, 16)
                     or substr(df.pseudocode_hash2, 1, 16) = substr(f.pseudocode_hash2, 1, 16)
                     or substr(df.pseudocode_hash3, 1, 16) = substr(f.pseudocode_hash3, 1, 16)""" + postfix
      log_refresh("Finding with heuristic 'Partial pseudo-code fuzzy hash'")
      self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)

	# Rule14:强连通分量个数和Tarjan拓扑排序相同
	# 结果:best,partial,unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Topological sort hash' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.strongly_connected = df.strongly_connected
                and f.tarjan_topological_sort = df.tarjan_topological_sort
                and f.strongly_connected > 3
                and f.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Topological sort hash'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule15:循环复杂度,程序原型,名字相同
	# 结果:partial
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity >= 20
                  and f.prototype2 = df.prototype2
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same high complexity, prototype and names'")
    self.add_matches_from_query_ratio(sql, choose, choose)

	# Rule16:名字和循环复杂度相同
	# 阈值:0.5
	# 结果:partial,unreliable
    if self.slow_heuristics:
      sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity and names' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where f.names = df.names
                    and f.cyclomatic_complexity = df.cyclomatic_complexity
                    and f.cyclomatic_complexity >= 15
                    and df.names != '[]'""" + postfix
      log_refresh("Finding with heuristic 'Same high complexity and names'")
      self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)

	  # Rule17:强连通分量个数相同
	  # 阈值:0.8
	  # 结果:partial
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected = df.strongly_connected
                  and df.strongly_connected > 1
                  and f.nodes > 5 and df.nodes > 5
                  and f.strongly_connected_spp > 1
                  and df.strongly_connected_spp > 1""" + postfix
      log_refresh("Finding with heuristic 'Strongly connected components'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.80)

	# Rule18:强连通的small-primes-product相同
	# 结果:best,partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components small-primes-product' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected_spp = df.strongly_connected_spp
                  and df.strongly_connected_spp > 1
                  and f.nodes > 10 and df.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Strongly connected components small-primes-product'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule19:循环个数相同
	# 阈值:0.49
	# 结果:partial
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.loops = df.loops
              and df.loops > 1
              and f.nodes > 3 and df.nodes > 3""" + postfix
      log_refresh("Finding with heuristic 'Loop count'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)

	# Rule20:名字和强连通分量的 small-primes-product 相同
	# 阈值:0.49
	# 结果:partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Strongly connected components SPP and names' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.names = df.names
                 and f.names != '[]'
                 and f.strongly_connected_spp = df.strongly_connected_spp
                 and f.strongly_connected_spp > 0
                 """ + postfix
    log_refresh("Finding with heuristic 'Strongly connected components SPP and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)
  def add_matches_from_query_ratio_max(self, sql, best, partial, val):
    """查询数据库,并根据相似率决定匹配归属:相似率等于 1.0 的直接加入 best chooser;
    大于阈值 val 的加入参数 best 指定的数据集;其余的在 partial 不为 None 时加入 partial。
    参数:
        sql: 需要执行的 sql 语句
        best: 相似率超过阈值 val 时的归属数据集
        partial: 部分匹配数据集(可为 None)
        val: 阈值
    """
    if self.all_functions_matched():
      return
    
    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    t = time.time()
    while self.max_processed_rows == 0 or (self.max_processed_rows != 0 and i < self.max_processed_rows):
      if time.time() - t > self.timeout:
        log("Timeout")
        break

      i += 1
      if i % 50000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = row["ea2"]
      name2 = row["name2"]
      desc = row["description"]
      pseudo1 = row["pseudo1"]
      pseudo2 = row["pseudo2"]
      asm1 = row["asm1"]
      asm2 = row["asm2"]
      ast1 = row["pseudo_primes1"]
      ast2 = row["pseudo_primes2"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])
      md1 = row["md1"]
      md2 = row["md2"]

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)

      if r == 1:
        self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r > val:
        best.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif partial is not None:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)

    cur.close()

      总结partial match的规则如下:

partial match匹配规则:
    Rule1:MD Index相同
    Rule2:MD Index和常量信息相同
    Rule3:全部或大部分属性(除hash1,hash2,hash3,pseudocode_primes外)相同
    Rule4:switch结构相同
    Rule5:常量和常量数量相同
    Rule6:RVA、node数,edge数、primes和指令数量相同
    Rule7:函数名字、md、指令数量相同
    Rule8:node数,edge数、复杂度、操作符、名字、函数原型、入度、出度相同
    Rule9:操作符、指令数量、函数名字相同
    Rule10:Mnemonics small-primes-product 和指令数量相同
    Rule11:伪代码fuzzy hash相同
    Rule12:伪代码和名字相同
    Rule13:伪代码 fuzzy AST hash相同
    Rule14:部分伪代码fuzzy AST hash相同
    Rule15:强连通分量个数和Tarjan拓扑排序相同
    Rule16:循环复杂度,程序原型,名字相同
    Rule17:名字和循环复杂度相同
    Rule18:强连通图相同
    Rule19:循环个数相同
    Rule20:名字和强连通分量的 small-primes-product 相同
注:这些匹配规则中,如果相似率达到了1.0,还需加入best match
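这条"相似率达到1.0还需加入best match"的分流规则,可以用一个极简的示意函数来表达(这是对 add_matches_from_query_ratio_max 分流逻辑的假设性简化,并非 Diaphora 原始代码):

```python
def classify(ratio, val):
    """模拟 add_matches_from_query_ratio_max 的分流逻辑:
    相似率为 1.0 时无条件进入 best chooser;大于阈值 val 时
    进入调用方传入的 best 数据集;否则落入 partial(若提供)。"""
    if ratio == 1.0:
        return "best_chooser"
    elif ratio > val:
        return "best"
    else:
        return "partial"

print(classify(1.0, 0.6))   # best_chooser
print(classify(0.8, 0.6))   # best
print(classify(0.3, 0.6))   # partial
```

也就是说,无论调用时传入的是哪两个数据集,完全匹配的结果总会被收进 best chooser。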

       MD Index我们暂时还不清楚,挖个坑以后填

在以上20条规则中,主要是针对MD Index、调用流图特征、伪代码fuzzy hash三者进行匹配。可以注意到,有些规则与best match的规则是重复的,并且这些匹配过程在相似率达到1.0时同样会产生best match,我们在进行识别时需要注意这一点。

0x05 从CFG中进行匹配

      这部分函数比较简单,就是当两个函数为best match和partial match时,搜索和这个函数相关的caller和callee,进行函数相似性匹配。这部分代码逻辑不放在这里赘述了。
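      其思路大致可以用下面的假设性简化代码表达(函数名与数据结构均为示意,并非 diaphora.py 的原始实现):对每一对已匹配的函数,取两边的 caller/callee 集合做两两比对,相似率超过阈值的就作为新的匹配。

```python
def match_neighbors(neighbors1, neighbors2, ratio_func, threshold=0.5):
    """对已匹配函数对两侧的 caller/callee 做两两比对,
    相似率超过 threshold 的作为新的候选匹配返回。"""
    matches = []
    for n1 in neighbors1:
        for n2 in neighbors2:
            r = ratio_func(n1, n2)
            if r > threshold:
                matches.append((n1, n2, r))
    return matches

# 用一个只比较名字前缀的玩具相似率函数演示
toy_ratio = lambda a, b: 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0
print(match_neighbors(["init_a", "free_a"], ["init_b", "copy_b"], toy_ratio))
# [('init_a', 'init_b', 1.0)]
```

这也解释了为什么 best/partial match 越多,这一步能恢复出的函数就越多:每个已确认的匹配都会带动它周围的 caller/callee 进入候选。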

0x06 unreliable match 规则与代码

      这部分其实我们并不关心,因为结果不准确,但是为了保证分析的完整性,我们继续跟踪。

  # 搜索不可靠的函数
  def find_unreliable_matches(self):
    choose = self.unreliable_chooser

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

	# Rule1:只有强连通分量相同
	# 阈值:0.54
	# 结果: partial ,unreliable
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected = df.strongly_connected
                  and df.strongly_connected > 2""" + postfix
      log_refresh("Finding with heuristic 'Strongly connected components'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.54)

	  # Rule2:只有loop count相同
	  # 阈值: 0.5
	  # 结果: partial,unreliable
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.loops = df.loops
              and df.loops > 1""" + postfix
      log_refresh("Finding with heuristic 'Loop count'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule3:Nodes, edges, 复杂度 和操作符个数相同
	  # 结果: best,partial
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity and mnemonics' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.mnemonics = df.mnemonics
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 1 and f.edges > 0""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity and mnemonics'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

	  # Rule4:Nodes, edges, 复杂度 和函数原型相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity and prototype' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.prototype2 = df.prototype2
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.prototype2 != 'int()'""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity and prototype'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule5:Nodes, edges, 复杂度、出度和入度相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity, in-degree and out-degree' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 3 and f.edges > 2
                   and f.indegree = df.indegree
                   and f.outdegree = df.outdegree""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity, in-degree and out-degree'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule6:Nodes, edges, 复杂度相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges and complexity' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 1 and f.edges > 0""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges and complexity'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule7:同为小函数且伪代码相同
	  # 阈值:0.5
	  # 结果: partial,unreliable
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode is not null 
                  and f.pseudocode is not null
                  and f.pseudocode_lines = df.pseudocode_lines
                  and df.pseudocode_lines > 5""" + postfix
      log_refresh("Finding with heuristic 'Similar small pseudo-code'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

	  # Rule8:循环复杂度高(不小于50)且相同
	  # 结果: partial,unreliable
      sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where f.cyclomatic_complexity = df.cyclomatic_complexity
                    and f.cyclomatic_complexity >= 50""" + postfix
      log_refresh("Finding with heuristic 'Same high complexity'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

unreliable match匹配规则:
    Rule1:强连通分量个数相同
    Rule2:只有loop count相同
    Rule3:Nodes, edges, 复杂度 和操作符个数相同
    Rule4:Nodes, edges, 复杂度 和函数原型相同
    Rule5:Nodes, edges, 复杂度、出度和入度相同
    Rule6:Nodes, edges, 复杂度相同
    Rule7:伪代码行数相同(且大于5)
    Rule8:循环复杂度高且相同

0x07 经验法进行匹配的规则与代码
  def find_experimental_matches(self):
    """使用经验法进行匹配。"""
    choose = self.unreliable_chooser
    
    # 调用地址顺序启发搜索:当一个函数匹配时,其地址相邻的下一个函数也有较大概率匹配
    self.find_from_matches(self.best_chooser.items)
    self.find_from_matches(self.partial_chooser.items)
    
    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

	# Rule1:伪代码行数相同
	# 阈值: 0.6
	# 结果: unreliable
    sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode_lines = df.pseudocode_lines
                and df.pseudocode_lines > 5
                and df.pseudocode is not null 
                and f.pseudocode is not null""" + postfix
    log_refresh("Finding with heuristic 'Similar pseudo-code'")
    self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)

	# Rule2:nodes,edges,强连通分量相同
	# 结果: best,unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same nodes, edges and strongly connected components' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.nodes = df.nodes
                and f.edges = df.edges
                and f.strongly_connected = df.strongly_connected
                and df.nodes > 4""" + postfix
    log_refresh("Finding with heuristic 'Same nodes, edges and strongly connected components'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose, self.unreliable_chooser)

	# Rule3:伪代码行数相同(小函数,行数不超过5)
	# 阈值:0.49
	# 结果: partial,unreliable
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.pseudocode_lines = df.pseudocode_lines
                  and df.pseudocode_lines <= 5
                  and df.pseudocode is not null 
                  and f.pseudocode is not null""" + postfix
      log_refresh("Finding with heuristic 'Similar small pseudo-code'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.49)

	  # Rule4:伪代码fuzzy AST hash相同
	  # 结果: partial,unreliable
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Small pseudo-code fuzzy AST hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_primes = f.pseudocode_primes
                  and f.pseudocode_lines <= 5""" + postfix
      log_refresh("Finding with heuristic 'Small pseudo-code fuzzy AST hash'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
    # Rule5:同为小函数且伪代码相同
	# 结果: best,partial
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal small pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode = df.pseudocode
                and df.pseudocode is not null
                and f.pseudocode_lines < 5""" + postfix
    log_refresh("Finding with heuristic 'Equal small pseudo-code'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

	# Rule6:循环复杂度(小于20)、原型、名字相同
	# 阈值: 0.5
	# 结果: partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity < 20
                  and f.prototype2 = df.prototype2
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same low complexity, prototype and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)

	# Rule7:循环复杂度、名字相同
	# 阈值: 0.5
	# 结果: partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same low complexity and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity < 15
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same low complexity and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)
  
    # Rule8:调用流图相同
	# 结果: best,partial,unreliable
    if self.slow_heuristics:
	  # 当数据库过大(超过25000个函数)时,可能产生以下错误:
      # OperationalError: database or disk is full
      sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                 'Same graph' description,
                 f.pseudocode pseudo1, df.pseudocode pseudo2,
                 f.assembly asm1, df.assembly asm2,
                 f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                 f.nodes bb1, df.nodes bb2,
                 cast(f.md_index as real) md1, cast(df.md_index as real) md2
            from functions f,
                 diff.functions df
           where f.nodes = df.nodes 
             and f.edges = df.edges
             and f.indegree = df.indegree
             and f.outdegree = df.outdegree
             and f.cyclomatic_complexity = df.cyclomatic_complexity
             and f.strongly_connected = df.strongly_connected
             and f.loops = df.loops
             and f.tarjan_topological_sort = df.tarjan_topological_sort
             and f.strongly_connected_spp = df.strongly_connected_spp""" + postfix + """
             and f.nodes > 5 and df.nodes > 5
           order by
                 case when f.size = df.size then 1 else 0 end +
                 case when f.instructions = df.instructions then 1 else 0 end +
                 case when f.mnemonics = df.mnemonics then 1 else 0 end +
                 case when f.names = df.names then 1 else 0 end +
                 case when f.prototype2 = df.prototype2 then 1 else 0 end +
                 case when f.primes_value = df.primes_value then 1 else 0 end +
                 case when f.bytes_hash = df.bytes_hash then 1 else 0 end +
                 case when f.pseudocode_hash1 = df.pseudocode_hash1 then 1 else 0 end +
                 case when f.pseudocode_primes = df.pseudocode_primes then 1 else 0 end +
                 case when f.pseudocode_hash2 = df.pseudocode_hash2 then 1 else 0 end +
                 case when f.pseudocode_hash3 = df.pseudocode_hash3 then 1 else 0 end DESC"""
      log_refresh("Finding with heuristic 'Same graph'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

    # Finally, brute-force whatever is left
    log_refresh("Brute-forcing...")
    self.find_brute_force()
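As a side note on the long ORDER BY in the 'Same graph' query above: it ranks candidate pairs by how many additional attributes agree, so the most promising pairs are processed first. A rough Python equivalent of that scoring, using hypothetical attribute dicts (the names `match_score`, `pair_good` and `pair_bad` are illustrative, not from Diaphora):

```python
# Attributes counted by the ORDER BY clause of the 'Same graph' query
ATTRS = ["size", "instructions", "mnemonics", "names", "prototype2",
         "primes_value", "bytes_hash", "pseudocode_hash1",
         "pseudocode_primes", "pseudocode_hash2", "pseudocode_hash3"]

def match_score(f, df):
    # f and df are dicts of function attributes from the two databases;
    # one point per attribute that is present in both and equal
    return sum(1 for a in ATTRS if a in f and a in df and f[a] == df[a])

pair_good = ({"size": 10, "names": "a"}, {"size": 10, "names": "a"})
pair_bad  = ({"size": 10, "names": "a"}, {"size": 12, "names": "b"})
pairs = [pair_bad, pair_good]
# Highest-scoring candidate pairs first, like the SQL's DESC ordering
pairs.sort(key=lambda p: match_score(*p), reverse=True)
```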

The heuristic stage is really three steps: first CFG matching, second the Rule filters, third brute force. The CFG matching used here differs from the one discussed earlier; it relies on a rule of thumb: if function Ai matches function Bj (A and B being the two databases, i and j the function indices), then A(i+1) and B(j+1) are also quite likely to match. This is only an unreliable heuristic, and the final decision still comes down to the similarity ratio. The second step applies the Rule filters, mainly the following:

Heuristic rules:
    Rule 1: same number of pseudocode lines
    Rule 2: same nodes, edges and strongly_connected
    Rule 3: same pseudocode fuzzy AST hash
    Rule 4: both are small functions with identical pseudocode
    Rule 5: same cyclomatic complexity, prototype and name
    Rule 6: same cyclomatic complexity and name
    Rule 7: same call flow graph (nodes, edges, indegree, outdegree, cyclomatic complexity, Tarjan topological sort, strongly_connected, loops)

None of these rules has a solid theoretical basis, so the results they produce are not guaranteed to be reliable. The third step is brute force: functions still unclassified after all the matching above get compared pairwise, which honestly does not feel very trustworthy.
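The neighbor heuristic described above (if A_i matched B_j, try A_{i+1} against B_{j+1}) can be sketched roughly as follows. This is a minimal illustration, not Diaphora's actual implementation; the names `neighbor_matches`, `seeds` and the 0.8 threshold are assumptions:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Character-level similarity of two function bodies
    return SequenceMatcher(None, a, b).ratio()

def neighbor_matches(funcs_a, funcs_b, seeds, threshold=0.8):
    """seeds: list of (i, j) index pairs already matched.
    For each seed, try matching the next function in each database,
    accepting the pair only if the similarity ratio is high enough."""
    matches = []
    for i, j in seeds:
        ni, nj = i + 1, j + 1
        if ni < len(funcs_a) and nj < len(funcs_b):
            r = ratio(funcs_a[ni], funcs_b[nj])
            if r >= threshold:
                matches.append((ni, nj, r))
    return matches
```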

0x08 Rules for computing the function similarity ratio

      After the analysis above you should have a good idea of Diaphora's matching rules and workflow. I recommend reading the source code side by side with a database that Diaphora has already generated; it makes things much quicker. We have also mentioned the function similarity ratio many times, so the last thing I want to analyze is that algorithm. The similarity ratio is computed by check_ratio().

# Compute the similarity ratio of two buffers
def quick_ratio(buf1, buf2):
  try:
    # Note: the original code checks buf1 == "" twice; buf2 == "" is what was meant
    if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":
      return 0
    s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
    return s.quick_ratio()
  except:
    print "quick_ratio:", str(sys.exc_info()[1])
    return 0

#-----------------------------------------------------------------------
# Get the similarity ratio faster, though less accurately
def real_quick_ratio(buf1, buf2):
  try:
    if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":
      return 0
    s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
    return s.real_quick_ratio()
  except:
    print "real_quick_ratio:", str(sys.exc_info()[1])
    return 0

#-----------------------------------------------------------------------
# Compute the similarity ratio of two ASTs
def ast_ratio(ast1, ast2):
  if ast1 == ast2:
    return 1.0
  elif ast1 is None or ast2 is None:
    return 0
  return difference_ratio(decimal.Decimal(ast1), decimal.Decimal(ast2))
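A side note on difflib's three ratio methods: `real_quick_ratio()` and `quick_ratio()` are documented upper bounds on `ratio()`, which is exactly why check_ratio() later re-checks a 1.0 from the cheap method with the more precise one. A small standalone demonstration:

```python
from difflib import SequenceMatcher

# Two "functions" whose lines are identical but ordered differently
asm1 = "mov eax, 1\nret"
asm2 = "ret\nmov eax, 1"

s = SequenceMatcher(None, asm1.split("\n"), asm2.split("\n"))

# real_quick_ratio() only compares sequence lengths; quick_ratio()
# compares line multisets; ratio() also accounts for ordering.
print(s.real_quick_ratio())  # 1.0
print(s.quick_ratio())       # 1.0
print(s.ratio())             # 0.5
```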
#-----------------------------------------------------------------------
  # Check the similarity ratio of two functions
  def check_ratio(self, ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2):
    fratio = quick_ratio
    decimal_values = "{0:.2f}"
    if self.relaxed_ratio:
      fratio = real_quick_ratio
      decimal_values = "{0:.1f}"

    # Step 1: compute the AST similarity first
    v3 = 0
    ast_done = False
    if self.relaxed_ratio and ast1 is not None and ast2 is not None and max(len(ast1), len(ast2)) < 16:
      ast_done = True
      v3 = self.ast_ratio(ast1, ast2)
      if v3 == 1:
        return 1.0

    # Step 2: compute the similarity of the cleaned pseudocode
    v1 = 0
    if pseudo1 is not None and pseudo2 is not None and pseudo1 != "" and pseudo2 != "":
      tmp1 = self.get_cmp_pseudo_lines(pseudo1)
      tmp2 = self.get_cmp_pseudo_lines(pseudo2)
      if tmp1 == "" or tmp2 == "":
        log("Error cleaning pseudo-code!")
      else:
        v1 = fratio(tmp1, tmp2)
        v1 = float(decimal_values.format(v1))
        if v1 == 1.0:
          # If real_quick_ratio() returned 1.0, double-check with quick_ratio(),
          # since the cheap method can report false positives. If real_quick_ratio()
          # already says they differ, there is no need to check any further.
          if fratio == real_quick_ratio:
            v1 = quick_ratio(tmp1, tmp2)
            if v1 == 1.0:
              return 1.0

    # Step 3: compare the assembly code
    tmp_asm1 = self.get_cmp_asm_lines(asm1)
    tmp_asm2 = self.get_cmp_asm_lines(asm2)
    v2 = fratio(tmp_asm1, tmp_asm2)
    v2 = float(decimal_values.format(v2))
    if v2 == 1:
      # Same double-check with quick_ratio() as for the pseudocode
      if fratio == real_quick_ratio:
        v2 = quick_ratio(tmp_asm1, tmp_asm2)
        if v2 == 1.0:
          return 1.0

    if self.relaxed_ratio and not ast_done:
      v3 = fratio(ast1, ast2)
      v3 = float(decimal_values.format(v3))
      if v3 == 1:
        return 1.0

    v4 = 0.0
    if md1 == md2 and md1 > 0.0:
      # An MD-Index >= 10.0 is rare
      if self.relaxed_ratio or md1 > 10.0:
        return 1.0
		
      v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
    elif md1 != 0 and md2 != 0 and False:  # note: 'and False' disables this branch
      tmp1 = max(md1, md2)
      tmp2 = min(md1, md2)
      v4 = tmp2 * 1. / tmp1

    r = max(v1, v2, v3, v4)
    return r
To summarize the algorithm:
v3 = ast_ratio(ast1, ast2)
v1 = sequence_ratio(cleaned_pseudo_code1, cleaned_pseudo_code2)
v2 = sequence_ratio(cleaned_asm_code1, cleaned_asm_code2)
if MD_Index1 == MD_Index2 and MD_Index1 > 0:
    v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
ratio = max(v1, v2, v3, v4)

That is, the result is the maximum of the four similarities: the AST, the disassembly, the pseudocode and the MD-Index score. (The branch that would set v4 = min(MD_Index1, MD_Index2) / max(MD_Index1, MD_Index2) is disabled by the 'and False' in the code.)
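A stripped-down, runnable version of this combination (ignoring the AST term and the relaxed mode; `seq_ratio` and `combined_ratio` are hypothetical names, not Diaphora's API) might look like:

```python
from difflib import SequenceMatcher

def seq_ratio(a, b):
    # Line-based similarity, in the spirit of Diaphora's quick_ratio()
    return SequenceMatcher(None, a.split("\n"), b.split("\n")).ratio()

def combined_ratio(pseudo1, pseudo2, asm1, asm2, md1, md2):
    v1 = seq_ratio(pseudo1, pseudo2)   # pseudocode similarity
    v2 = seq_ratio(asm1, asm2)         # assembly similarity
    v3 = 0.0                           # AST term omitted in this sketch
    v4 = 0.0
    if md1 == md2 and md1 > 0.0:       # an equal MD-Index boosts the score
        v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
    return max(v1, v2, v3, v4)
```

Note that an equal, non-zero MD-Index alone already pushes v4 to at least 0.75, which is why check_ratio() treats it as such strong evidence.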

0x09 Summary

      Let's sum up the matching techniques Diaphora uses. From a high-level point of view, Diaphora filters candidate functions with a set of matching rules, computes a similarity ratio for each candidate pair, and assigns the pair to a result category based on that ratio and the relevant thresholds. The filtering rules involve logical features such as the call flow and the assembly code, as well as data features such as constants and instruction sets. Some rules also rely on fuzzy hashes and small primes. In the next post we will explain in detail how Diaphora generates the feature databases.
