Diaphora Source Code Analysis: diaphora.py

This blog post was written by 胖胖鹏鹏胖胖鹏 (an idle white hat) for personal technical exchange only and must not be used commercially. Please credit the source when reposting; reposting or commercial use of any content from this blog without permission is forbidden.

      When using Diaphora to match library functions, we first have Diaphora generate a SQLite signature database (db1) from the original library (*.a, *.lib), then generate a signature database (db2) for the target program (*.so, *.exe, *.bin), and finally identify functions by comparing the signatures in the two databases. Generating the signature databases involves the IDAPython API, so we set that part aside for now and look first at how the database comparison is done. The code is here.

      diaphora.py provides two entry points: one called directly from IDA, and another invoked from the command line. When invoking from the command line we must supply two arguments: the program's signature database, which we call main, and the original library's signature database, which we call diff. Once these two are set, the analysis proceeds. The main background knowledge needed here is SQL queries; readers unfamiliar with them should brush up first (W3Schools is a good start) in order to follow the code. Diaphora's main analysis logic lives in the diff() function.

  def diff(self, db):
    # Main driver for function identification.
    # Parameters:
    #   db - the database to diff against
    self.last_diff_db = db

    cur = self.db_cursor()
    cur.execute('attach "%s" as diff' % db)

    try:
      cur.execute("select value from diff.version")
    except:
      log("Error: %s " % sys.exc_info()[1])
      Warning("The selected file does not look like a valid SQLite exported database!")
      cur.close()
      return False

    row = cur.fetchone()
    if not row:
      Warning("Invalid database!")
      return False

    if row["value"] != VERSION_VALUE:
      Warning("The database is from a different version (current %s, database %s)!" % (VERSION_VALUE, row[0]))
      return False

    try:
      t0 = time.time()
      log_refresh("Performing diffing...", True)
      
      self.do_continue = True
      if self.equal_db():
        log("The databases seems to be 100% equal")

      if self.do_continue:
        # Compare the call graphs of the two programs
        self.check_callgraph()

        # Look for functions that were not modified at all
        log_refresh("Finding best matches...")
        self.find_equal_matches()

        # Look for partially modified functions
        log_refresh("Finding partial matches")
        self.find_matches()

        if self.slow_heuristics:
          # Recover functions through the call graph
          log_refresh("Finding with heuristic 'Callgraph matches'")
          self.find_callgraph_matches()

        if self.unreliable:
          # Look for probably unreliable matches
          log_refresh("Finding probably unreliable matches")
          self.find_unreliable_matches()
        
        if self.experimental:
          # Match functions with experimental heuristics
          log_refresh("Finding experimental matches")
          self.find_experimental_matches()

        # Finally, collect the functions left unmatched
        log_refresh("Finding unmatched functions")
        self.find_unmatched()

        log("Done. Took {} seconds".format(time.time() - t0))
    finally:
      cur.close()
    return True
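The cross-database queries above all rest on one SQLite feature: a single connection can ATTACH a second database file and then join across both. A minimal standalone sketch (not Diaphora code; the table layout is invented for illustration):

```python
import os
import sqlite3
import tempfile

# Build two throwaway databases standing in for db1 and db2.
paths = []
for value in ("main db", "diff db"):
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    con = sqlite3.connect(path)
    con.execute("create table version (value text)")
    con.execute("insert into version values (?)", (value,))
    con.commit()
    con.close()
    paths.append(path)

# One connection, two databases: the same trick diff() uses. After the
# ATTACH, "diff.version", "diff.functions" etc. are addressable in SQL.
con = sqlite3.connect(paths[0])
con.execute('attach "%s" as diff' % paths[1])
rows = con.execute("""select value from version
                      union all
                      select value from diff.version""").fetchall()
print(rows)  # one row from each database
con.close()
for p in paths:
    os.remove(p)
```

This is also why the version check in diff() can fail cleanly: if the attached file is not a Diaphora export, `select value from diff.version` raises and the function bails out.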

      Let's take stock here. First the tool compares the signature databases' versions and checks whether the two databases are identical; next it compares the prime signatures of the CFGs in the two databases; then it searches for best matches, i.e. 100% identical functions, and for partially matched functions; if slow heuristics are allowed it also matches through the call graph; depending on user settings it then matches unreliable results and applies experimental heuristics; finally it collects the functions that remain unmatched. Next we go through the functions involved one by one.

0x01 equal_db()
  # Decide whether the two databases are identical.
  # They are equal when program.md5sum matches across both databases, or,
  # failing that, when the two functions tables contain exactly the same rows.
  # "program" is this binary's database; "diff" is the database being compared.
  # Returns: True when equal, False otherwise
  def equal_db(self):
    cur = self.db_cursor()
    sql = "select count(*) total from program p, diff.program dp where p.md5sum = dp.md5sum"
    cur.execute(sql)
    row = cur.fetchone()
    ret = row["total"] == 1
    if not ret:
      sql = "select count(*) total from (select * from functions except select * from diff.functions) x"
      cur.execute(sql)
      row = cur.fetchone()
      ret = row["total"] == 0
    else:
      log("Same MD5 in both databases")
    cur.close()
    return ret
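The second check in equal_db() is standard SQLite set arithmetic: `functions EXCEPT diff.functions` is empty exactly when every row of the main table also exists, byte for byte, in the attached one. A toy illustration (not Diaphora code; two plain tables stand in for the attached database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table a (name text, bytes_hash text)")
con.execute("create table b (name text, bytes_hash text)")
con.executemany("insert into a values (?, ?)", [("f1", "aa"), ("f2", "bb")])
con.executemany("insert into b values (?, ?)", [("f1", "aa"), ("f2", "cc")])

# Same shape as equal_db(): count what survives the set difference.
sql = "select count(*) total from (select * from a except select * from b) x"
total = con.execute(sql).fetchone()[0]
print(total)  # 1: only f2, whose bytes_hash changed, survives the EXCEPT
```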
0x02 check_callgraph()
  # Fetch callgraph_primes and callgraph_all_primes from both programs.
  # If the callgraph_primes products are equal, the call graphs are taken
  # to be structurally identical; otherwise the factorizations stored in
  # callgraph_all_primes are compared, and if nothing differs it is still
  # a perfect match. Failing that, log the percentage by which they differ.
  def check_callgraph(self):
    cur = self.db_cursor()
    sql = """select callgraph_primes, callgraph_all_primes from program
             union all
             select callgraph_primes, callgraph_all_primes from diff.program"""
    cur.execute(sql)
    rows = cur.fetchall()
    if len(rows) == 2:
      cg1 = decimal.Decimal(rows[0]["callgraph_primes"])
      cg_factors1 = json.loads(rows[0]["callgraph_all_primes"])
      cg2 = decimal.Decimal(rows[1]["callgraph_primes"])
      cg_factors2 = json.loads(rows[1]["callgraph_all_primes"])

      if cg1 == cg2:
        self.equal_callgraph = True
        log("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
        Warning("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
      else:
        FACTORS_CACHE[cg1] = cg_factors1
        FACTORS_CACHE[cg2] = cg_factors2
        diff = difference(cg1, cg2)
        total = sum(cg_factors1.values())
        if total == 0 or diff == 0:
          log("Callgraphs are 100% equal")
        else:
          percent = diff * 100. / total
          log("Callgraphs from both programs differ in %f%%" % percent)

    cur.close()

      Here the difference() function from jkutils is used; it was annotated at the beginning of the code.
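The prime signatures deserve a short aside. Each function's CFG is reduced to a number, the whole call graph to a product of primes, and callgraph_all_primes stores the factorization. My understanding of difference() is that it counts the prime factors, with multiplicity, that the two products do not share; the sketch below captures that idea under my own assumptions and is not the jkutils implementation:

```python
from collections import Counter

def factor_difference(factors1, factors2):
    """factors1/factors2 map prime -> exponent, like callgraph_all_primes."""
    c1, c2 = Counter(factors1), Counter(factors2)
    # Size of the symmetric difference of the two multisets of factors.
    return sum(((c1 - c2) + (c2 - c1)).values())

cg1 = {2: 3, 3: 1, 7: 2}   # 2**3 * 3    * 7**2
cg2 = {2: 3, 3: 2, 11: 1}  # 2**3 * 3**2 * 11
diff = factor_difference(cg1, cg2)
total = sum(cg1.values())            # 6 factors in the first call graph
print(diff, diff * 100.0 / total)    # 4 differing factors out of 6
```

With diff and total in hand, the percentage printed by check_callgraph() is just `diff * 100. / total`, as in the code above.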

0x03 best match rules and code

      What interests us most is how identical functions are found. Below is Diaphora's code along with the helper functions it involves; I have annotated the code in detail.

  def find_equal_matches(self):
    cur = self.db_cursor()
    # Start by counting the total number of functions in each database
    sql = """select count(*) total from functions
             union all
             select count(*) total from diff.functions"""
    cur.execute(sql)
    rows = cur.fetchall()
    if len(rows) != 2:
      Warning("Malformed database, only %d rows!" % len(rows))
      raise Exception("Malformed database!")

    self.total_functions1 = rows[0]["total"]
    self.total_functions2 = rows[1]["total"]

    # Rows that are identical across both functions tables:
    # this is where "100% equal" comes from
    sql = "select address ea, mangled_function, nodes from (select * from functions intersect select * from diff.functions) x"
    cur.execute(sql)
    rows = cur.fetchall()
    choose = self.best_chooser
    if len(rows) > 0:
      for row in rows:
        name = row["mangled_function"]
        ea = row["ea"]
        nodes = int(row["nodes"])

        choose.add_item(CChooser.Item(ea, name, ea, name, "100% equal", 1, nodes, nodes))
        self.matched1.add(name)
        self.matched2.add(name)

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

    # When two functions share bytes_hash, rva/segment_rva and instruction
    # count, and their names match (when not auto-generated sub_ names),
    # they also join the best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same RVA and hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where (df.rva = f.rva
                   or df.segment_rva = f.segment_rva)
                 and df.bytes_hash = f.bytes_hash
                 and df.instructions = f.instructions
                 and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
                   or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4)))"""
    log_refresh("Finding with heuristic 'Same RVA and hash'")
    self.add_matches_from_query(sql, choose)

    # When two functions share their position (id) and bytes_hash, add them as best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same order and hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where df.id = f.id
                 and df.bytes_hash = f.bytes_hash
                 and df.instructions = f.instructions
                 and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
                   or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4)))
                 and ((f.nodes > 1 and df.nodes > 1
                   and f.instructions > 5 and df.instructions > 5)
                    or f.instructions > 10 and df.instructions > 10)"""
    log_refresh("Finding with heuristic 'Same order and hash'")
    self.add_matches_from_query(sql, choose)

    # When the function_hash alone matches, also add as a best match
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Function hash' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where f.function_hash = df.function_hash 
                 and ((f.nodes > 1 and df.nodes > 1
                   and f.instructions > 5 and df.instructions > 5)
                    or f.instructions > 10 and df.instructions > 10)"""
    log_refresh("Finding with heuristic 'Function hash'")
    self.add_matches_from_query(sql, choose)

    # Matching bytes_hash and non-empty names also count as best matches
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Bytes hash and names' description,
                     f.nodes bb1, df.nodes bb2
                from functions f,
                     diff.functions df
               where f.bytes_hash = df.bytes_hash
                 and f.names = df.names
                 and f.names != '[]'
                 and f.instructions > 5 and df.instructions > 5"""
    log_refresh("Finding with heuristic 'Bytes hash and names'")
    self.add_matches_from_query(sql, choose)

    # Even a bytes_hash match alone is added
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Bytes hash' description,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.bytes_hash = df.bytes_hash
                 and f.instructions > 5 and df.instructions > 5"""
    log_refresh("Finding with heuristic 'Bytes hash'")
    self.add_matches_from_query(sql, choose)

    if not self.ignore_all_names:
      self.find_same_name(self.partial_chooser)

    # Handle the unreliable match results here
    if self.unreliable:
      # Added only when this option is selected; to keep the false-positive
      # rate down, it is better left disabled
      # Unreliable match: only the size, bytes_sum and mnemonics are equal
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Bytes sum' description,
                       f.nodes bb1, df.nodes bb2
                  from functions f,
                       diff.functions df
                 where f.bytes_sum = df.bytes_sum
                   and f.size = df.size
                   and f.mnemonics = df.mnemonics
                   and f.instructions > 5 and df.instructions > 5"""
      log_refresh("Finding with heuristic 'Bytes sum'")
      self.add_matches_from_query(sql, choose)

    # Perfect match: identical assembly or pseudo-code
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2
               from functions f,
                    diff.functions df
              where f.pseudocode = df.pseudocode
                and df.pseudocode is not null
                and f.pseudocode_lines >= 5 """ + postfix + """
                and f.name not like 'nullsub%'
                and df.name not like 'nullsub%'
              union
             select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal assembly' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2
               from functions f,
                    diff.functions df
              where f.assembly = df.assembly
                and df.assembly is not null
                and f.instructions > 5 and df.instructions > 5 
                and f.name not like 'nullsub%'
                and df.name not like 'nullsub%' """
    log_refresh("Finding with heuristic 'Equal assembly or pseudo-code'")
    self.add_matches_from_query(sql, choose)

    # Match by similarity ratio: identical cleaned-up assembly or pseudo-code
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same cleaned up assembly or pseudo-code' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where (f.clean_assembly = df.clean_assembly
                  or f.clean_pseudo = df.clean_pseudo) 
                  and f.pseudocode_lines > 5 and df.pseudocode_lines > 5
                  and f.name not like 'nullsub%'
                  and df.name not like 'nullsub%' """
    log_refresh("Finding with heuristic 'Same cleaned up assembly or pseudo-code'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

    # Matching rva, instruction count, node count, edge count and mnemonics
    # also qualifies, filed by similarity ratio
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same address, nodes, edges and mnemonics' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.rva = df.rva
                and f.instructions = df.instructions
                and f.nodes = df.nodes
                and f.edges = df.edges
                and f.mnemonics = df.mnemonics""" + postfix
    log_refresh("Finding with heuristic 'Same address, nodes, edges and mnemonics'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, None)

    cur.close()
	
  # Run the SQL statement and add every result to the chooser
  # Note: only use this when the similarity ratio is known to be 1.00
  def add_matches_from_query(self, sql, choose):
    """ Warning: use this *only* if the ratio is known to be 1.00 """
    if self.all_functions_matched():
      return
    
    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    while 1:
      i += 1
      if i % 1000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = str(row["ea2"])
      name2 = row["name2"]
      desc = row["description"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      choose.add_item(CChooser.Item(ea, name1, ea2, name2, desc, 1, bb1, bb2))
      self.matched1.add(name1)
      self.matched2.add(name2)
    cur.close()
	
  def add_matches_from_query_ratio(self, sql, best, partial, unreliable=None, debug=False):
    """ File each match by similarity ratio:
        ratio == 1: best match;
        0.5 <= ratio < 1: partial match;
        ratio < 0.5: unreliable match.
    """
    if self.all_functions_matched():
      return

    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    t = time.time()
    while 1:
      if time.time() - t > self.timeout:
        log("Timeout")
        break

      i += 1
      if i % 50000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = row["ea2"]
      name2 = row["name2"]
      desc = row["description"]
      pseudo1 = row["pseudo1"]
      pseudo2 = row["pseudo2"]
      asm1 = row["asm1"]
      asm2 = row["asm2"]
      ast1 = row["pseudo_primes1"]
      ast2 = row["pseudo_primes2"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])
      md1 = row["md1"]
      md2 = row["md2"]

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)
      if debug:
        print "0x%x 0x%x %d" % (int(ea), int(ea2), r)

      if r == 1:
        self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r >= 0.5:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r < 0.5 and unreliable is not None:
        unreliable.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      else:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)

    cur.close()
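check_ratio() itself is outside this excerpt; it blends several signals (the pseudo-code AST primes, the pseudo-code text, the assembly text and the MD index). As a rough stand-in for its text-similarity component only, the sketch below scores two pseudo-code listings with Python's difflib and applies the same bucketing as the code above; the snippets and the bucket() helper are invented for illustration:

```python
from difflib import SequenceMatcher

def bucket(pseudo1, pseudo2):
    # Same thresholds as add_matches_from_query_ratio():
    # 1 -> best, [0.5, 1) -> partial, below 0.5 -> unreliable.
    r = SequenceMatcher(None, pseudo1, pseudo2).ratio()
    if r == 1:
        return "best"
    elif r >= 0.5:
        return "partial"
    return "unreliable"

a = "v1 = a + b;\nreturn v1;"
print(bucket(a, a))                          # identical code: "best"
print(bucket(a, "v2 = a + b;\nreturn v2;"))  # small rename: "partial"
print(bucket(a, "while (x) x = next(x);"))   # unrelated code: "unreliable"
```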

      That may be a lot to digest, so let's summarize the best-match rules.

best match rules:
    Rule 1: every column of the functions table row is identical
    Rule 2: bytes_hash, rva/segment_rva, instruction count and name (not sub_*) match
    Rule 3: bytes_hash and the function's position (id) in the database match
    Rule 4: bytes_hash and names match
    Rule 5: bytes_hash matches
    Rule 6: function_hash matches
    Rule 7: the assembly or the pseudo-code is identical
    Rule 8: the cleaned-up assembly or pseudo-code is identical
    Rule 9: RVA, node count, edge count, instruction count and mnemonics match
Note: apart from Rule 1, a pair is confirmed as a best match only when its similarity ratio is 1.00

       Rules 8 and 9 can also produce partial and unreliable matches

      Summed up, the rules rest on only three key pieces of information: the byte hash, the assembly (or pseudo-code), and the instruction counts and mnemonics.
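Rule 1 is worth seeing in isolation: INTERSECT keeps only rows that are identical in every column, which is exactly the "100% equal" set built at the top of find_equal_matches(). A toy version (not Diaphora code; a second plain table stands in for the attached diff database):

```python
import sqlite3

con = sqlite3.connect(":memory:")
for t in ("functions", "diff_functions"):
    con.execute("create table %s (address int, mangled_function text, nodes int)" % t)
con.executemany("insert into functions values (?, ?, ?)",
                [(0x1000, "memcpy", 3), (0x2000, "sub_2000", 7)])
con.executemany("insert into diff_functions values (?, ?, ?)",
                [(0x1000, "memcpy", 3), (0x2000, "sub_2000", 8)])  # nodes changed

sql = """select address ea, mangled_function, nodes from
         (select * from functions intersect select * from diff_functions) x"""
rows = con.execute(sql).fetchall()
print(rows)  # only the untouched memcpy row survives the INTERSECT
```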

0x04 partial match rules and code

      A partial match is a function pair whose similarity is below 1.0 but above 0.5 (or a user-defined threshold), hence the name. This part is long and the rules are many; if the code gets tedious, feel free to skip ahead, as I summarize the partial-match rules after the code.
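One detail worth isolating before the listing: the "Same rare MD Index" heuristic only trusts an md_index value when it is rare, i.e. shared by at most two functions per database, enforced with GROUP BY ... HAVING count(*) <= 2. A toy sketch of just that rarity filter (table contents invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table functions (name text, md_index real)")
con.executemany("insert into functions values (?, ?)",
                [("f1", 12.5), ("f2", 12.5),             # rare: shared by two
                 ("g1", 3.0), ("g2", 3.0), ("g3", 3.0),  # too common: dropped
                 ("h1", 0.0)])                           # zero is excluded
sql = """select md_index from functions
          where md_index != 0
          group by md_index
         having count(*) <= 2"""
rare = [r[0] for r in con.execute(sql)]
print(rare)  # [12.5]
```

A value shared by many functions carries no matching power, so filtering by rarity keeps this heuristic's false positives down.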

  def find_matches(self):
    """ Filter candidate function pairs with a set of rules, compute a
        similarity ratio from the AST, pseudo-code, assembly, MD index and
        related values, then file each pair by that ratio:
        (0, 0.5): unreliable
        [0.5, 1): partial match
        1: best match
    """
    choose = self.partial_chooser

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

    # Rule 1: same rare MD index
    # Result: best, partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same rare MD Index' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df,
                      (select md_index
                         from diff.functions
                        where md_index != 0
                        group by md_index
                       having count(*) <= 2
                        union 
                       select md_index
                         from main.functions
                        where md_index != 0
                        group by md_index
                       having count(*) <= 2
                      ) shared_mds
                where f.md_index = df.md_index
                  and df.md_index = shared_mds.md_index
                  and f.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Same rare MD Index'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

    # Rule 2: same MD index and constants
    # Result: best, partial
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Same MD Index and constants' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2,
                     df.tarjan_topological_sort, df.strongly_connected_spp
                from functions f,
                     diff.functions df
               where f.md_index = df.md_index
                 and f.md_index > 0
                 and ((f.constants = df.constants
                 and f.constants_count > 0)) """ + postfix
    log_refresh("Finding with heuristic 'Same MD Index and constants'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

    # Rule 3: all attributes equal, or most attributes equal (excluding
    # pseudocode_hash1/2/3 and pseudocode_primes)
    # Result: best, partial
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'All attributes' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.nodes = df.nodes 
                and f.edges = df.edges
                and f.indegree = df.indegree
                and f.outdegree = df.outdegree
                and f.size = df.size
                and f.instructions = df.instructions
                and f.mnemonics = df.mnemonics
                and f.names = df.names
                and f.prototype2 = df.prototype2
                and f.cyclomatic_complexity = df.cyclomatic_complexity
                and f.primes_value = df.primes_value
                and f.bytes_hash = df.bytes_hash
                and f.pseudocode_hash1 = df.pseudocode_hash1
                and f.pseudocode_primes = df.pseudocode_primes
                and f.pseudocode_hash2 = df.pseudocode_hash2
                and f.pseudocode_hash3 = df.pseudocode_hash3
                and f.strongly_connected = df.strongly_connected
                and f.loops = df.loops
                and f.tarjan_topological_sort = df.tarjan_topological_sort
                and f.strongly_connected_spp = df.strongly_connected_spp """ + postfix + """
              union 
             select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Most attributes' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
               where f.nodes = df.nodes 
                 and f.edges = df.edges
                 and f.indegree = df.indegree
                 and f.outdegree = df.outdegree
                 and f.size = df.size
                 and f.instructions = df.instructions
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.prototype2 = df.prototype2
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.primes_value = df.primes_value
                 and f.bytes_hash = df.bytes_hash
                 and f.strongly_connected = df.strongly_connected
                 and f.loops = df.loops
                 and f.tarjan_topological_sort = df.tarjan_topological_sort
                 and f.strongly_connected_spp = df.strongly_connected_spp """
    sql += postfix
    log_refresh("Finding with heuristic 'All or most attributes'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

    # Rule 3': with slow heuristics enabled, match functions whose switch
    # structures are identical
    # Threshold: 0.2
    # Result: partial, unreliable
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Switch structures' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.switches = df.switches
              and df.switches != '[]' """ + postfix
      log_refresh("Finding with heuristic 'Switch structures'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.2)

    # Note: using add_matches_from_query_ratio_max does not rule out best
    # matches; it simply promotes only pairs with ratio = 1 to best
    # Rule 4: same constants and constant count
    # Threshold: 0.5
    # Result: partial, unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same constants' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.constants = df.constants
                and f.constants_count = df.constants_count
                and f.constants_count > 0 """ + postfix
    log_refresh("Finding with heuristic 'Same constants'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

    # Rule 5: same address, node count, edge count, primes and instruction count
    # Threshold: 0.5
    # Result: partial, unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same address, nodes, edges and primes (re-ordered instructions)' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.rva = df.rva
                and f.instructions = df.instructions
                and f.nodes = df.nodes
                and f.edges = df.edges
                and f.primes_value = df.primes_value
                and f.nodes > 3""" + postfix
    log_refresh("Finding with heuristic 'Same address, nodes, edges and primes (re-ordered instructions)'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

    # Rule 6: same names, MD index and instruction count
    # Result: best, partial
    sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Import names hash' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.names = df.names
                 and f.names != '[]'
                 and f.md_index = df.md_index
                 and f.instructions = df.instructions
                 and f.nodes > 5 and df.nodes > 5""" + postfix
    log_refresh("Finding with heuristic 'Import names hash'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

    # Rule 7: same node count, edge count, complexity, mnemonics, names,
    # prototype, in-degree and out-degree
    # Result: partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Nodes, edges, complexity, mnemonics, names, prototype2, in-degree and out-degree' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.nodes = df.nodes
                 and f.edges = df.edges
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.prototype2 = df.prototype2
                 and f.indegree = df.indegree
                 and f.outdegree = df.outdegree
                 and f.nodes > 3
                 and f.edges > 3
                 and f.names != '[]'"""  + postfix + """
               union
              select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Nodes, edges, complexity, mnemonics, names and prototype2' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.nodes = df.nodes
                 and f.edges = df.edges
                 and f.mnemonics = df.mnemonics
                 and f.names = df.names
                 and f.names != '[]'
                 and f.cyclomatic_complexity = df.cyclomatic_complexity
                 and f.prototype2 = df.prototype2""" + postfix
    log_refresh("Finding with heuristic 'Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree'")
    self.add_matches_from_query_ratio(sql, self.partial_chooser, self.partial_chooser)

    # Rule 8: same mnemonics, instruction count and names
    # Result: partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Mnemonics and names' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.mnemonics = df.mnemonics
                 and f.instructions = df.instructions
                 and f.names = df.names
                 and f.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Mnemonics and names'")
    self.add_matches_from_query_ratio(sql, choose, choose)

    # Rule 9: same mnemonics small-primes-product and instruction count
    # Threshold: 0.6
    # Result: partial, unreliable
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Mnemonics small-primes-product' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.mnemonics_spp = df.mnemonics_spp
                 and f.instructions = df.instructions
                 and f.nodes > 1 and df.nodes > 1
                 and df.instructions > 5 """ + postfix
    log_refresh("Finding with heuristic 'Mnemonics small-primes-product'")
    self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)

    # Search using some of the previous criteria, but this time computing
    # the edit distance
    log_refresh("Finding with heuristic 'Small names difference'")
    self.search_small_differences(choose)

    # Rule 10: same pseudo-code fuzzy hash
    # Result: best, partial
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_hash1 = f.pseudocode_hash1
                   or df.pseudocode_hash2 = f.pseudocode_hash2
                   or df.pseudocode_hash3 = f.pseudocode_hash3""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy hashes'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
    else:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_hash1 = f.pseudocode_hash1""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy hash'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

	# Rule11:伪代码行数和名字相同
	# 结果:best,partial,unreliable
    sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code and names' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode_lines = df.pseudocode_lines
                and f.names = df.names
                and df.names != '[]'
                and df.pseudocode_lines > 5
                and df.pseudocode is not null 
                and f.pseudocode is not null""" + postfix
    log_refresh("Finding with heuristic 'Similar pseudo-code and names'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule12: 伪代码 fuzzy AST hash相同
	# 结果:best,partial
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy AST hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_primes = f.pseudocode_primes
                  and f.pseudocode_lines > 3
                  and length(f.pseudocode_primes) >= 35""" + postfix
      log_refresh("Finding with heuristic 'Pseudo-code fuzzy AST hash'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, choose)

	  # Rule13: 部分伪代码fuzzy AST hash相同
	  # 阈值:0.5
	  # 结果:best,partial
      sql = """  select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Partial pseudo-code fuzzy hash' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where substr(df.pseudocode_hash1, 1, 16) = substr(f.pseudocode_hash1, 1, 16)
                     or substr(df.pseudocode_hash2, 1, 16) = substr(f.pseudocode_hash2, 1, 16)
                     or substr(df.pseudocode_hash3, 1, 16) = substr(f.pseudocode_hash3, 1, 16)""" + postfix
      log_refresh("Finding with heuristic 'Partial pseudo-code fuzzy hash'")
      self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)

	# Rule14:强连通分量个数和Tarjan拓扑排序相同
	# 结果:best,partial,unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Topological sort hash' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.strongly_connected = df.strongly_connected
                and f.tarjan_topological_sort = df.tarjan_topological_sort
                and f.strongly_connected > 3
                and f.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Topological sort hash'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule15:循环复杂度,程序原型,名字相同
	# 结果:partial
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity >= 20
                  and f.prototype2 = df.prototype2
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same high complexity, prototype and names'")
    self.add_matches_from_query_ratio(sql, choose, choose)

	# Rule16:名字和循环复杂度相同
	# 阈值:0.5
	# 结果:partial,unreliable
    if self.slow_heuristics:
      sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity and names' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where f.names = df.names
                    and f.cyclomatic_complexity = df.cyclomatic_complexity
                    and f.cyclomatic_complexity >= 15
                    and df.names != '[]'""" + postfix
      log_refresh("Finding with heuristic 'Same high complexity and names'")
      self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)

	  # Rule17:强连通分量个数相同
	  # 阈值:0.8
	  # 结果:partial
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected = df.strongly_connected
                  and df.strongly_connected > 1
                  and f.nodes > 5 and df.nodes > 5
                  and f.strongly_connected_spp > 1
                  and df.strongly_connected_spp > 1""" + postfix
      log_refresh("Finding with heuristic 'Strongly connected components'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.80)

	# Rule18:强连通的small-primes-product相同
	# 结果:best,partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components small-primes-product' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected_spp = df.strongly_connected_spp
                  and df.strongly_connected_spp > 1
                  and f.nodes > 10 and df.nodes > 10 """ + postfix
    log_refresh("Finding with heuristic 'Strongly connected components small-primes-product'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

	# Rule19:循环个数相同
	# 阈值:0.49
	# 结果:partial
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.loops = df.loops
              and df.loops > 1
              and f.nodes > 3 and df.nodes > 3""" + postfix
      log_refresh("Finding with heuristic 'Loop count'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)

	# Rule20:名字和强连通分量的 small-primes-product 相同
	# 阈值:0.49
	# 结果:partial
    sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                     'Strongly connected components SPP and names' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
                from functions f,
                     diff.functions df
               where f.names = df.names
                 and f.names != '[]'
                 and f.strongly_connected_spp = df.strongly_connected_spp
                 and f.strongly_connected_spp > 0
                 """ + postfix
    log_refresh("Finding with heuristic 'Strongly connected components SPP and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)
  def add_matches_from_query_ratio_max(self, sql, best, partial, val):
    """查询数据库,并根据相似率决定匹配归属:相似率等于 1.0 的直接加入 best chooser;
    大于阈值 val 的加入参数 best 指定的数据集;其余的在 partial 不为 None 时加入 partial。
    参数:
        sql: 需要执行的 sql 语句
        best: 相似率超过阈值 val 时的归属数据集
        partial: 部分匹配数据集(可为 None)
        val: 阈值
    """
    if self.all_functions_matched():
      return
    
    cur = self.db_cursor()
    try:
      cur.execute(sql)
    except:
      log("Error: %s" % str(sys.exc_info()[1]))
      return

    i = 0
    t = time.time()
    while self.max_processed_rows == 0 or (self.max_processed_rows != 0 and i < self.max_processed_rows):
      if time.time() - t > self.timeout:
        log("Timeout")
        break

      i += 1
      if i % 50000 == 0:
        log("Processed %d rows..." % i)
      row = cur.fetchone()
      if row is None:
        break

      ea = str(row["ea"])
      name1 = row["name1"]
      ea2 = row["ea2"]
      name2 = row["name2"]
      desc = row["description"]
      pseudo1 = row["pseudo1"]
      pseudo2 = row["pseudo2"]
      asm1 = row["asm1"]
      asm2 = row["asm2"]
      ast1 = row["pseudo_primes1"]
      ast2 = row["pseudo_primes2"]
      bb1 = int(row["bb1"])
      bb2 = int(row["bb2"])
      md1 = row["md1"]
      md2 = row["md2"]

      if name1 in self.matched1 or name2 in self.matched2:
        continue

      r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)

      if r == 1:
        self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif r > val:
        best.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)
      elif partial is not None:
        partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
        self.matched1.add(name1)
        self.matched2.add(name2)

    cur.close()

      总结partial match的规则如下:

partial match匹配规则:
    Rule1:MD Index相同
    Rule2:MD Index和常量信息相同
    Rule3:全部或大部分属性(除hash1,hash2,hash3,pseudocode_primes外)相同
    Rule4:switch结构相同
    Rule5:常量和常量数量相同
    Rule6:RVA、node数,edge数、primes和指令数量相同
    Rule7:函数名字、md、指令数量相同
    Rule8:node数,edge数、复杂度、操作符、名字、函数原型、入度、出度相同
    Rule9:操作符、指令数量、函数名字相同
    Rule10:Mnemonics small-primes-product 和指令数量相同
    Rule11:伪代码fuzzy hash相同
    Rule12:伪代码和名字相同
    Rule13:伪代码 fuzzy AST hash相同
    Rule14:部分伪代码fuzzy AST hash相同
    Rule15:强连通分量个数和Tarjan拓扑排序相同
    Rule16:循环复杂度,程序原型,名字相同
    Rule17:名字和循环复杂度相同
    Rule18:强连通图相同
    Rule19:循环个数相同
    Rule20:名字和强连通分量的 small-primes-product 相同
注:这些匹配规则中,如果相似率达到了1.0,还需加入best match
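这条"相似率达到1.0还需加入best match"的分流规则,可以用一个极简的示意函数来表达(这是对 add_matches_from_query_ratio_max 分流逻辑的假设性简化,并非 Diaphora 原始代码):

```python
def classify(ratio, val):
    """模拟 add_matches_from_query_ratio_max 的分流逻辑:
    相似率为 1.0 时无条件进入 best chooser;大于阈值 val 时
    进入调用方传入的 best 数据集;否则落入 partial(若提供)。"""
    if ratio == 1.0:
        return "best_chooser"
    elif ratio > val:
        return "best"
    else:
        return "partial"

print(classify(1.0, 0.6))   # best_chooser
print(classify(0.8, 0.6))   # best
print(classify(0.3, 0.6))   # partial
```

也就是说,无论调用时传入的是哪两个数据集,完全匹配的结果总会被收进 best chooser。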

       MD Index我们暂时还不清楚,挖个坑以后填

在以上20条规则中,主要是针对MD Index、调用流图特征、伪代码fuzzy hash三者进行匹配。可以注意到,有些规则与best match的规则是重复的,并且这些匹配过程在相似率达到1.0时同样会产生best match,我们在进行识别时需要注意这一点。

0x05 从CFG中进行匹配

      这部分函数比较简单,就是当两个函数为best match和partial match时,搜索和这个函数相关的caller和callee,进行函数相似性匹配。这部分代码逻辑不放在这里赘述了。
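      其思路大致可以用下面的假设性简化代码表达(函数名与数据结构均为示意,并非 diaphora.py 的原始实现):对每一对已匹配的函数,取两边的 caller/callee 集合做两两比对,相似率超过阈值的就作为新的匹配。

```python
def match_neighbors(neighbors1, neighbors2, ratio_func, threshold=0.5):
    """对已匹配函数对两侧的 caller/callee 做两两比对,
    相似率超过 threshold 的作为新的候选匹配返回。"""
    matches = []
    for n1 in neighbors1:
        for n2 in neighbors2:
            r = ratio_func(n1, n2)
            if r > threshold:
                matches.append((n1, n2, r))
    return matches

# 用一个只比较名字前缀的玩具相似率函数演示
toy_ratio = lambda a, b: 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0
print(match_neighbors(["init_a", "free_a"], ["init_b", "copy_b"], toy_ratio))
# [('init_a', 'init_b', 1.0)]
```

这也解释了为什么 best/partial match 越多,这一步能恢复出的函数就越多:每个已确认的匹配都会带动它周围的 caller/callee 进入候选。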

0x06 unreliable match 规则与代码

      这部分其实我们并不关心,因为结果不准确,但是为了保证分析的完整性,我们继续跟踪。

  # 搜索不可靠的函数
  def find_unreliable_matches(self):
    choose = self.unreliable_chooser

    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

	# Rule1:只有强连通分量相同
	# 阈值:0.54
	# 结果: partial ,unreliable
    if self.slow_heuristics:
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.strongly_connected = df.strongly_connected
                  and df.strongly_connected > 2""" + postfix
      log_refresh("Finding with heuristic 'Strongly connected components'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.54)

	  # Rule2:只有loop count相同
	  # 阈值: 0.5
	  # 结果: partial,unreliable
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
                  f.pseudocode pseudo1, df.pseudocode pseudo2,
                  f.assembly asm1, df.assembly asm2,
                  f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                  f.nodes bb1, df.nodes bb2,
                  cast(f.md_index as real) md1, cast(df.md_index as real) md2
             from functions f,
                  diff.functions df
            where f.loops = df.loops
              and df.loops > 1""" + postfix
      log_refresh("Finding with heuristic 'Loop count'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule3:Nodes, edges, 复杂度 和操作符个数相同
	  # 结果: best,partial
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity and mnemonics' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.mnemonics = df.mnemonics
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 1 and f.edges > 0""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity and mnemonics'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

	  # Rule4:Nodes, edges, 复杂度 和函数原型相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity and prototype' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.prototype2 = df.prototype2
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.prototype2 != 'int()'""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity and prototype'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule5:Nodes, edges, 复杂度、出度和入度相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges, complexity, in-degree and out-degree' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 3 and f.edges > 2
                   and f.indegree = df.indegree
                   and f.outdegree = df.outdegree""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges, complexity, in-degree and out-degree'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule6:Nodes, edges, 复杂度相同
	  # 结果: partial,unreliable
      sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
                       'Nodes, edges and complexity' description,
                       f.pseudocode pseudo1, df.pseudocode pseudo2,
                       f.assembly asm1, df.assembly asm2,
                       f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                       f.nodes bb1, df.nodes bb2,
                       cast(f.md_index as real) md1, cast(df.md_index as real) md2
                  from functions f,
                       diff.functions df
                 where f.nodes = df.nodes
                   and f.edges = df.edges
                   and f.cyclomatic_complexity = df.cyclomatic_complexity
                   and f.nodes > 1 and f.edges > 0""" + postfix
      log_refresh("Finding with heuristic 'Nodes, edges and complexity'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

	  # Rule7:同为小函数且伪代码相同
	  # 阈值:0.5
	  # 结果: partial,unreliable
      sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode is not null 
                  and f.pseudocode is not null
                  and f.pseudocode_lines = df.pseudocode_lines
                  and df.pseudocode_lines > 5""" + postfix
      log_refresh("Finding with heuristic 'Similar small pseudo-code'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)

	  # Rule8:循环复杂度高(不小于50)且相同
	  # 结果: partial,unreliable
      sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity' description,
                        f.pseudocode pseudo1, df.pseudocode pseudo2,
                        f.assembly asm1, df.assembly asm2,
                        f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                        f.nodes bb1, df.nodes bb2,
                        cast(f.md_index as real) md1, cast(df.md_index as real) md2
                   from functions f,
                        diff.functions df
                  where f.cyclomatic_complexity = df.cyclomatic_complexity
                    and f.cyclomatic_complexity >= 50""" + postfix
      log_refresh("Finding with heuristic 'Same high complexity'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)

unreliable match匹配规则:
    Rule1:强连通分量个数相同
    Rule2:只有loop count相同
    Rule3:Nodes, edges, 复杂度 和操作符个数相同
    Rule4:Nodes, edges, 复杂度 和函数原型相同
    Rule5:Nodes, edges, 复杂度、出度和入度相同
    Rule6:Nodes, edges, 复杂度相同
    Rule7:伪代码行数相同(且大于5)
    Rule8:循环复杂度高且相同

0x07 经验法进行匹配的规则与代码
  def find_experimental_matches(self):
    """使用经验法进行匹配。"""
    choose = self.unreliable_chooser
    
    # 调用地址顺序启发搜索:当一个函数匹配时,其地址相邻的下一个函数也有较大概率匹配
    self.find_from_matches(self.best_chooser.items)
    self.find_from_matches(self.partial_chooser.items)
    
    postfix = ""
    if self.ignore_small_functions:
      postfix = " and f.instructions > 5 and df.instructions > 5 "

	# Rule1:伪代码行数相同
	# 阈值: 0.6
	# 结果: unreliable
    sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode_lines = df.pseudocode_lines
                and df.pseudocode_lines > 5
                and df.pseudocode is not null 
                and f.pseudocode is not null""" + postfix
    log_refresh("Finding with heuristic 'Similar pseudo-code'")
    self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)

	# Rule2:nodes,edges,强连通分量相同
	# 结果: best,unreliable
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
                    'Same nodes, edges and strongly connected components' description,
                     f.pseudocode pseudo1, df.pseudocode pseudo2,
                     f.assembly asm1, df.assembly asm2,
                     f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                     f.nodes bb1, df.nodes bb2,
                     cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.nodes = df.nodes
                and f.edges = df.edges
                and f.strongly_connected = df.strongly_connected
                and df.nodes > 4""" + postfix
    log_refresh("Finding with heuristic 'Same nodes, edges and strongly connected components'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, choose, self.unreliable_chooser)

	# Rule3:伪代码行数相同(小函数,行数不超过5)
	# 阈值:0.49
	# 结果: partial,unreliable
    if self.slow_heuristics:
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.pseudocode_lines = df.pseudocode_lines
                  and df.pseudocode_lines <= 5
                  and df.pseudocode is not null 
                  and f.pseudocode is not null""" + postfix
      log_refresh("Finding with heuristic 'Similar small pseudo-code'")
      self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.49)

	  # Rule4:伪代码fuzzy AST hash相同
	  # 结果: partial,unreliable
      sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Small pseudo-code fuzzy AST hash' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where df.pseudocode_primes = f.pseudocode_primes
                  and f.pseudocode_lines <= 5""" + postfix
      log_refresh("Finding with heuristic 'Small pseudo-code fuzzy AST hash'")
      self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
    # Rule5:同为小函数且伪代码相同
	# 结果: best,partial
    sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal small pseudo-code' description,
                    f.pseudocode pseudo1, df.pseudocode pseudo2,
                    f.assembly asm1, df.assembly asm2,
                    f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                    f.nodes bb1, df.nodes bb2,
                    cast(f.md_index as real) md1, cast(df.md_index as real) md2
               from functions f,
                    diff.functions df
              where f.pseudocode = df.pseudocode
                and df.pseudocode is not null
                and f.pseudocode_lines < 5""" + postfix
    log_refresh("Finding with heuristic 'Equal small pseudo-code'")
    self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)

	# Rule6:循环复杂度(小于20)、原型、名字相同
	# 阈值: 0.5
	# 结果: partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity < 20
                  and f.prototype2 = df.prototype2
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same low complexity, prototype and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)

	# Rule7:循环复杂度、名字相同
	# 阈值: 0.5
	# 结果: partial,unreliable
    sql = """  select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same low complexity and names' description,
                      f.pseudocode pseudo1, df.pseudocode pseudo2,
                      f.assembly asm1, df.assembly asm2,
                      f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                      f.nodes bb1, df.nodes bb2,
                      cast(f.md_index as real) md1, cast(df.md_index as real) md2
                 from functions f,
                      diff.functions df
                where f.names = df.names
                  and f.cyclomatic_complexity = df.cyclomatic_complexity
                  and f.cyclomatic_complexity < 15
                  and df.names != '[]'""" + postfix
    log_refresh("Finding with heuristic 'Same low complexity and names'")
    self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)
  
    # Rule8:调用流图相同
	# 结果: best,partial,unreliable
    if self.slow_heuristics:
	  # 当数据库过大(超过25000个函数)时,可能产生以下错误:
      # OperationalError: database or disk is full
      sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
                 'Same graph' description,
                 f.pseudocode pseudo1, df.pseudocode pseudo2,
                 f.assembly asm1, df.assembly asm2,
                 f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
                 f.nodes bb1, df.nodes bb2,
                 cast(f.md_index as real) md1, cast(df.md_index as real) md2
            from functions f,
                 diff.functions df
           where f.nodes = df.nodes 
             and f.edges = df.edges
             and f.indegree = df.indegree
             and f.outdegree = df.outdegree
             and f.cyclomatic_complexity = df.cyclomatic_complexity
             and f.strongly_connected = df.strongly_connected
             and f.loops = df.loops
             and f.tarjan_topological_sort = df.tarjan_topological_sort
             and f.strongly_connected_spp = df.strongly_connected_spp""" + postfix + """
             and f.nodes > 5 and df.nodes > 5
           order by
                 case when f.size = df.size then 1 else 0 end +
                 case when f.instructions = df.instructions then 1 else 0 end +
                 case when f.mnemonics = df.mnemonics then 1 else 0 end +
                 case when f.names = df.names then 1 else 0 end +
                 case when f.prototype2 = df.prototype2 then 1 else 0 end +
                 case when f.primes_value = df.primes_value then 1 else 0 end +
                 case when f.bytes_hash = df.bytes_hash then 1 else 0 end +
                 case when f.pseudocode_hash1 = df.pseudocode_hash1 then 1 else 0 end +
                 case when f.pseudocode_primes = df.pseudocode_primes then 1 else 0 end +
                 case when f.pseudocode_hash2 = df.pseudocode_hash2 then 1 else 0 end +
                 case when f.pseudocode_hash3 = df.pseudocode_hash3 then 1 else 0 end DESC"""
      log_refresh("Finding with heuristic 'Same graph'")
      self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)

    # Finally, brute-force whatever is left
    log_refresh("Brute-forcing...")
    self.find_brute_force()
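As a side note on the long ORDER BY in the 'Same graph' query above: it ranks candidate pairs by how many additional attributes agree, so the most promising pairs are processed first. A rough Python equivalent of that scoring, using hypothetical attribute dicts (the names `match_score`, `pair_good` and `pair_bad` are illustrative, not from Diaphora):

```python
# Attributes counted by the ORDER BY clause of the 'Same graph' query
ATTRS = ["size", "instructions", "mnemonics", "names", "prototype2",
         "primes_value", "bytes_hash", "pseudocode_hash1",
         "pseudocode_primes", "pseudocode_hash2", "pseudocode_hash3"]

def match_score(f, df):
    # f and df are dicts of function attributes from the two databases;
    # one point per attribute that is present in both and equal
    return sum(1 for a in ATTRS if a in f and a in df and f[a] == df[a])

pair_good = ({"size": 10, "names": "a"}, {"size": 10, "names": "a"})
pair_bad  = ({"size": 10, "names": "a"}, {"size": 12, "names": "b"})
pairs = [pair_bad, pair_good]
# Highest-scoring candidate pairs first, like the SQL's DESC ordering
pairs.sort(key=lambda p: match_score(*p), reverse=True)
```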

The heuristic stage is really three steps: first CFG matching, second the Rule filters, third brute force. The CFG matching used here differs from the one discussed earlier; it relies on a rule of thumb: if function Ai matches function Bj (A and B being the two databases, i and j the function indices), then A(i+1) and B(j+1) are also quite likely to match. This is only an unreliable heuristic, and the final decision still comes down to the similarity ratio. The second step applies the Rule filters, mainly the following:

Heuristic rules:
    Rule 1: same number of pseudocode lines
    Rule 2: same nodes, edges and strongly_connected
    Rule 3: same pseudocode fuzzy AST hash
    Rule 4: both are small functions with identical pseudocode
    Rule 5: same cyclomatic complexity, prototype and name
    Rule 6: same cyclomatic complexity and name
    Rule 7: same call flow graph (nodes, edges, indegree, outdegree, cyclomatic complexity, Tarjan topological sort, strongly_connected, loops)

None of these rules has a solid theoretical basis, so the results they produce are not guaranteed to be reliable. The third step is brute force: functions still unclassified after all the matching above get compared pairwise, which honestly does not feel very trustworthy.
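The neighbor heuristic described above (if A_i matched B_j, try A_{i+1} against B_{j+1}) can be sketched roughly as follows. This is a minimal illustration, not Diaphora's actual implementation; the names `neighbor_matches`, `seeds` and the 0.8 threshold are assumptions:

```python
from difflib import SequenceMatcher

def ratio(a, b):
    # Character-level similarity of two function bodies
    return SequenceMatcher(None, a, b).ratio()

def neighbor_matches(funcs_a, funcs_b, seeds, threshold=0.8):
    """seeds: list of (i, j) index pairs already matched.
    For each seed, try matching the next function in each database,
    accepting the pair only if the similarity ratio is high enough."""
    matches = []
    for i, j in seeds:
        ni, nj = i + 1, j + 1
        if ni < len(funcs_a) and nj < len(funcs_b):
            r = ratio(funcs_a[ni], funcs_b[nj])
            if r >= threshold:
                matches.append((ni, nj, r))
    return matches
```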

0x08 Rules for computing the function similarity ratio

      After the analysis above you should have a good idea of Diaphora's matching rules and workflow. I recommend reading the source code side by side with a database that Diaphora has already generated; it makes things much quicker. We have also mentioned the function similarity ratio many times, so the last thing I want to analyze is that algorithm. The similarity ratio is computed by check_ratio().

# Compute the similarity ratio of two buffers
def quick_ratio(buf1, buf2):
  try:
    # Note: the original code checks buf1 == "" twice; buf2 == "" is what was meant
    if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":
      return 0
    s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
    return s.quick_ratio()
  except:
    print "quick_ratio:", str(sys.exc_info()[1])
    return 0

#-----------------------------------------------------------------------
# Get the similarity ratio faster, though less accurately
def real_quick_ratio(buf1, buf2):
  try:
    if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":
      return 0
    s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
    return s.real_quick_ratio()
  except:
    print "real_quick_ratio:", str(sys.exc_info()[1])
    return 0

#-----------------------------------------------------------------------
# Compute the similarity ratio of two ASTs
def ast_ratio(ast1, ast2):
  if ast1 == ast2:
    return 1.0
  elif ast1 is None or ast2 is None:
    return 0
  return difference_ratio(decimal.Decimal(ast1), decimal.Decimal(ast2))
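A side note on difflib's three ratio methods: `real_quick_ratio()` and `quick_ratio()` are documented upper bounds on `ratio()`, which is exactly why check_ratio() later re-checks a 1.0 from the cheap method with the more precise one. A small standalone demonstration:

```python
from difflib import SequenceMatcher

# Two "functions" whose lines are identical but ordered differently
asm1 = "mov eax, 1\nret"
asm2 = "ret\nmov eax, 1"

s = SequenceMatcher(None, asm1.split("\n"), asm2.split("\n"))

# real_quick_ratio() only compares sequence lengths; quick_ratio()
# compares line multisets; ratio() also accounts for ordering.
print(s.real_quick_ratio())  # 1.0
print(s.quick_ratio())       # 1.0
print(s.ratio())             # 0.5
```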
#-----------------------------------------------------------------------
  # Check the similarity ratio of two functions
  def check_ratio(self, ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2):
    fratio = quick_ratio
    decimal_values = "{0:.2f}"
    if self.relaxed_ratio:
      fratio = real_quick_ratio
      decimal_values = "{0:.1f}"

    # Step 1: compute the AST similarity first
    v3 = 0
    ast_done = False
    if self.relaxed_ratio and ast1 is not None and ast2 is not None and max(len(ast1), len(ast2)) < 16:
      ast_done = True
      v3 = self.ast_ratio(ast1, ast2)
      if v3 == 1:
        return 1.0

    # Step 2: compute the similarity of the cleaned pseudocode
    v1 = 0
    if pseudo1 is not None and pseudo2 is not None and pseudo1 != "" and pseudo2 != "":
      tmp1 = self.get_cmp_pseudo_lines(pseudo1)
      tmp2 = self.get_cmp_pseudo_lines(pseudo2)
      if tmp1 == "" or tmp2 == "":
        log("Error cleaning pseudo-code!")
      else:
        v1 = fratio(tmp1, tmp2)
        v1 = float(decimal_values.format(v1))
        if v1 == 1.0:
          # If real_quick_ratio() returned 1.0, double-check with quick_ratio(),
          # since the cheap method can report false positives. If real_quick_ratio()
          # already says they differ, there is no need to check any further.
          if fratio == real_quick_ratio:
            v1 = quick_ratio(tmp1, tmp2)
            if v1 == 1.0:
              return 1.0

    # Step 3: compare the assembly code
    tmp_asm1 = self.get_cmp_asm_lines(asm1)
    tmp_asm2 = self.get_cmp_asm_lines(asm2)
    v2 = fratio(tmp_asm1, tmp_asm2)
    v2 = float(decimal_values.format(v2))
    if v2 == 1:
      # Same double-check with quick_ratio() as for the pseudocode
      if fratio == real_quick_ratio:
        v2 = quick_ratio(tmp_asm1, tmp_asm2)
        if v2 == 1.0:
          return 1.0

    if self.relaxed_ratio and not ast_done:
      v3 = fratio(ast1, ast2)
      v3 = float(decimal_values.format(v3))
      if v3 == 1:
        return 1.0

    v4 = 0.0
    if md1 == md2 and md1 > 0.0:
      # An MD-Index >= 10.0 is rare
      if self.relaxed_ratio or md1 > 10.0:
        return 1.0
		
      v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
    elif md1 != 0 and md2 != 0 and False:  # note: 'and False' disables this branch
      tmp1 = max(md1, md2)
      tmp2 = min(md1, md2)
      v4 = tmp2 * 1. / tmp1

    r = max(v1, v2, v3, v4)
    return r
To summarize the algorithm:
v3 = ast_ratio(ast1, ast2)
v1 = sequence_ratio(cleaned_pseudo_code1, cleaned_pseudo_code2)
v2 = sequence_ratio(cleaned_asm_code1, cleaned_asm_code2)
if MD_Index1 == MD_Index2 and MD_Index1 > 0:
    v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
ratio = max(v1, v2, v3, v4)

That is, the result is the maximum of the four similarities: the AST, the disassembly, the pseudocode and the MD-Index score. (The branch that would set v4 = min(MD_Index1, MD_Index2) / max(MD_Index1, MD_Index2) is disabled by the 'and False' in the code.)
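A stripped-down, runnable version of this combination (ignoring the AST term and the relaxed mode; `seq_ratio` and `combined_ratio` are hypothetical names, not Diaphora's API) might look like:

```python
from difflib import SequenceMatcher

def seq_ratio(a, b):
    # Line-based similarity, in the spirit of Diaphora's quick_ratio()
    return SequenceMatcher(None, a.split("\n"), b.split("\n")).ratio()

def combined_ratio(pseudo1, pseudo2, asm1, asm2, md1, md2):
    v1 = seq_ratio(pseudo1, pseudo2)   # pseudocode similarity
    v2 = seq_ratio(asm1, asm2)         # assembly similarity
    v3 = 0.0                           # AST term omitted in this sketch
    v4 = 0.0
    if md1 == md2 and md1 > 0.0:       # an equal MD-Index boosts the score
        v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
    return max(v1, v2, v3, v4)
```

Note that an equal, non-zero MD-Index alone already pushes v4 to at least 0.75, which is why check_ratio() treats it as such strong evidence.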

0x09 Summary

      Let's sum up the matching techniques Diaphora uses. From a high-level point of view, Diaphora filters candidate functions with a set of matching rules, computes a similarity ratio for each candidate pair, and assigns the pair to a result category based on that ratio and the relevant thresholds. The filtering rules involve logical features such as the call flow and the assembly code, as well as data features such as constants and instruction sets. Some rules also rely on fuzzy hashes and small primes. In the next post we will explain in detail how Diaphora generates the feature databases.
