This blog is written by 胖胖鹏鹏胖胖鹏, an independent white hat, and is intended solely for personal technical exchange; commercial use is prohibited. Please credit the source when reposting. Reposting or commercial use of any content on this blog without permission is forbidden.
When using Diaphora for library-function matching, we first have Diaphora generate a SQLite feature database (db1) from the original library (*.a, *.lib), then generate a feature database (db2) from the target program (*.so, *.exe, *.bin), and finally identify functions by comparing the features in the two databases. Since generating the feature databases involves the IDAPython API, we will set that part aside for now and first look at how the database comparison works. The code is here.
diaphora.py provides two interfaces: one invoked directly from inside IDA, the other from the command line. When invoked from the command line we supply two databases: the program's feature database, which we call main, and the original library's feature database, which we call diff. Once these two are set, the analysis begins. The main prerequisite here is SQL; if you are not comfortable with SQL queries, work through a tutorial (e.g. W3Schools) first so the code is readable. Diaphora's main diffing logic lives in the diff() function.
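Everything below hinges on SQLite's ATTACH statement: a single connection opens the main database and mounts the second file under the diff schema prefix, after which one query can reference tables from both files. A minimal, self-contained sketch of that mechanic (the file names and the version string are made up for illustration):

```python
import os
import sqlite3
import tempfile

def make_db(path, version):
    # A tiny stand-in for a Diaphora export: just a version table.
    conn = sqlite3.connect(path)
    conn.execute("create table version (value text)")
    conn.execute("insert into version values (?)", (version,))
    conn.commit()
    conn.close()

tmpdir = tempfile.mkdtemp()
main_db = os.path.join(tmpdir, "program.sqlite")  # db2: the target binary
diff_db = os.path.join(tmpdir, "library.sqlite")  # db1: the original library
make_db(main_db, "1.0")
make_db(diff_db, "1.0")

conn = sqlite3.connect(main_db)
conn.row_factory = sqlite3.Row  # lets us index rows by column name, as diaphora does
cur = conn.cursor()
# Exactly what diff() does first: mount the second file under the 'diff' prefix...
cur.execute('attach "%s" as diff' % diff_db)
# ...then sanity-check its version before comparing anything else.
cur.execute("select value from diff.version")
row = cur.fetchone()
versions_match = (row["value"] == "1.0")
```

After the attach, a query like `select * from functions intersect select * from diff.functions` works across both files, which is exactly the trick the heuristics below rely on.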
def diff(self, db):
# Main driver for function matching
# Arguments:
# db -- the database to compare against
self.last_diff_db = db
cur = self.db_cursor()
cur.execute('attach "%s" as diff' % db)
try:
cur.execute("select value from diff.version")
except:
log("Error: %s " % sys.exc_info()[1])
Warning("The selected file does not look like a valid SQLite exported database!")
cur.close()
return False
row = cur.fetchone()
if not row:
Warning("Invalid database!")
return False
if row["value"] != VERSION_VALUE:
Warning("The database is from a different version (current %s, database %s)!" % (VERSION_VALUE, row[0]))
return False
try:
t0 = time.time()
log_refresh("Performing diffing...", True)
self.do_continue = True
if self.equal_db():
log("The databases seems to be 100% equal")
if self.do_continue:
# Check whether the call graphs match
self.check_callgraph()
# Look for unmodified functions
log_refresh("Finding best matches...")
self.find_equal_matches()
# Look for partially modified functions
log_refresh("Finding partial matches")
self.find_matches()
if self.slow_heuristics:
# Recover functions from the call graph
log_refresh("Finding with heuristic 'Callgraph matches'")
self.find_callgraph_matches()
if self.unreliable:
# Look for probably unreliable matches
log_refresh("Finding probably unreliable matches")
self.find_unreliable_matches()
if self.experimental:
# Match functions with experimental heuristics
log_refresh("Finding experimental matches")
self.find_experimental_matches()
# Finally, collect the functions that were never matched
log_refresh("Finding unmatched functions")
self.find_unmatched()
log("Done. Took {} seconds".format(time.time() - t0))
finally:
cur.close()
return True
Let's recap the flow: first compare the database versions and check whether the two databases are identical; then compare the prime-based CFG signatures of the two call graphs; next search for best matches (100% identical functions) and then for partially matched functions; if slow heuristics are enabled, match via the call graph; then, depending on user settings, collect unreliable matches and apply experimental heuristics; and finally gather the functions that were never matched. Let's now look at each of the functions involved.
# Check whether the two databases are identical:
# program.md5sum must match in both, or failing that, every row of the two functions tables must match
# program is this program's database; diff is the database being compared against
# Returns: True (equal) / False (not equal)
def equal_db(self):
cur = self.db_cursor()
sql = "select count(*) total from program p, diff.program dp where p.md5sum = dp.md5sum"
cur.execute(sql)
row = cur.fetchone()
ret = row["total"] == 1
if not ret:
sql = "select count(*) total from (select * from functions except select * from diff.functions) x"
cur.execute(sql)
row = cur.fetchone()
ret = row["total"] == 0
else:
log("Same MD5 in both databases")
cur.close()
return ret
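The EXCEPT query above does all the work: it returns the rows of functions that have no identical counterpart in diff.functions, so a count of zero means the tables match row for row. The semantics in isolation, with the schema cut down to two columns for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table functions (address integer, bytes_hash text)")
cur.executemany("insert into functions values (?, ?)",
                [(0x1000, "aa"), (0x2000, "bb")])
# Stand-in for diff.functions, with one row modified.
cur.execute("create table diff_functions (address integer, bytes_hash text)")
cur.executemany("insert into diff_functions values (?, ?)",
                [(0x1000, "aa"), (0x2000, "cc")])
# Count rows present in the first table but not the second, as equal_db() does.
cur.execute("""select count(*) total from
  (select * from functions except select * from diff_functions) x""")
differing = cur.fetchone()[0]
# differing == 0 would mean the tables are identical; here one row changed.
```

Note that EXCEPT compares whole rows, so any single changed column is enough to make a function count as different.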
# Fetch callgraph_primes and callgraph_all_primes from both programs.
# If the callgraph_primes values are equal, the programs are considered structurally identical;
# otherwise compare callgraph_all_primes factor by factor, and if nothing differs it is still a perfect match;
# otherwise, report the percentage by which the two differ.
def check_callgraph(self):
cur = self.db_cursor()
sql = """select callgraph_primes, callgraph_all_primes from program
union all
select callgraph_primes, callgraph_all_primes from diff.program"""
cur.execute(sql)
rows = cur.fetchall()
if len(rows) == 2:
cg1 = decimal.Decimal(rows[0]["callgraph_primes"])
cg_factors1 = json.loads(rows[0]["callgraph_all_primes"])
cg2 = decimal.Decimal(rows[1]["callgraph_primes"])
cg_factors2 = json.loads(rows[1]["callgraph_all_primes"])
if cg1 == cg2:
self.equal_callgraph = True
log("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
Warning("Callgraph signature for both databases is equal, the programs seem to be 100% equal structurally")
else:
FACTORS_CACHE[cg1] = cg_factors1
FACTORS_CACHE[cg2] = cg_factors2
diff = difference(cg1, cg2)
total = sum(cg_factors1.values())
if total == 0 or diff == 0:
log("Callgraphs are 100% equal")
else:
percent = diff * 100. / total
log("Callgraphs from both programs differ in %f%%" % percent)
cur.close()
The difference function used here comes from jkutils; I have annotated it at the top of the code.
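jkutils itself is not reproduced here, but the idea behind difference() can be sketched. Each call-graph signature is a product of small primes (one per basic-block feature), so factoring the two products and comparing exponents counts how many factors the graphs do not share. A rough stand-in under that assumption (the real jkutils code is more elaborate and reuses factorizations via FACTORS_CACHE):

```python
def factorize(n):
    # Trial-division factorization: returns {prime: exponent}.
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def difference(cg1, cg2):
    # Number of prime factors (counted with multiplicity) by which
    # the two signatures differ.
    f1, f2 = factorize(cg1), factorize(cg2)
    return sum(abs(f1.get(p, 0) - f2.get(p, 0)) for p in set(f1) | set(f2))

# 12 = 2*2*3 and 18 = 2*3*3 share one 2 and one 3, so two factors differ.
```

Dividing that count by the total number of factors in one signature gives the percentage reported by check_callgraph().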
What interests us most is how fully identical functions are found. Below is Diaphora's code together with the related helper functions; I have annotated it in detail inline.
def find_equal_matches(self):
cur = self.db_cursor()
# Start by counting the total number of functions in each database
sql = """select count(*) total from functions
union all
select count(*) total from diff.functions"""
cur.execute(sql)
rows = cur.fetchall()
if len(rows) != 2:
Warning("Malformed database, only %d rows!" % len(rows))
raise Exception("Malformed database!")
self.total_functions1 = rows[0]["total"]
self.total_functions2 = rows[1]["total"]
# Fetch the rows the two functions tables have in common --
# this is where "100% equal" comes from
sql = "select address ea, mangled_function, nodes from (select * from functions intersect select * from diff.functions) x"
cur.execute(sql)
rows = cur.fetchall()
choose = self.best_chooser
if len(rows) > 0:
for row in rows:
name = row["mangled_function"]
ea = row["ea"]
nodes = int(row["nodes"])
choose.add_item(CChooser.Item(ea, name, ea, name, "100% equal", 1, nodes, nodes))
self.matched1.add(name)
self.matched2.add(name)
postfix = ""
if self.ignore_small_functions:
postfix = " and f.instructions > 5 and df.instructions > 5 "
# When two functions have the same bytes_hash, rva/segment_rva and instruction count, and the same name (not both auto-generated sub_ names), add them to the best matches as well
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Same RVA and hash' description,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where (df.rva = f.rva
or df.segment_rva = f.segment_rva)
and df.bytes_hash = f.bytes_hash
and df.instructions = f.instructions
and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4) = 'sub_'))"""
log_refresh("Finding with heuristic 'Same RVA and hash'")
self.add_matches_from_query(sql, choose)
# When two functions have the same id (database order) and bytes_hash, add them to the best matches
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Same order and hash' description,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where df.id = f.id
and df.bytes_hash = f.bytes_hash
and df.instructions = f.instructions
and ((f.name = df.name and substr(f.name, 1, 4) != 'sub_')
or (substr(f.name, 1, 4) = 'sub_' or substr(df.name, 1, 4) = 'sub_'))
and ((f.nodes > 1 and df.nodes > 1
and f.instructions > 5 and df.instructions > 5)
or f.instructions > 10 and df.instructions > 10)"""
log_refresh("Finding with heuristic 'Same order and hash'")
self.add_matches_from_query(sql, choose)
# Or when only the function_hash matches, also add to the best matches
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Function hash' description,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where f.function_hash = df.function_hash
and ((f.nodes > 1 and df.nodes > 1
and f.instructions > 5 and df.instructions > 5)
or f.instructions > 10 and df.instructions > 10)"""
log_refresh("Finding with heuristic 'Function hash'")
self.add_matches_from_query(sql, choose)
# When bytes_hash and names match and the names are not empty, add to the best matches as well
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Bytes hash and names' description,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where f.bytes_hash = df.bytes_hash
and f.names = df.names
and f.names != '[]'
and f.instructions > 5 and df.instructions > 5"""
log_refresh("Finding with heuristic 'Bytes hash and names'")
self.add_matches_from_query(sql, choose)
# When only the bytes hash matches, add it too
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Bytes hash' description,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.bytes_hash = df.bytes_hash
and f.instructions > 5 and df.instructions > 5"""
log_refresh("Finding with heuristic 'Bytes hash'")
self.add_matches_from_query(sql, choose)
if not self.ignore_all_names:
self.find_same_name(self.partial_chooser)
# Unreliable matches are handled here
# They are only added when this option is enabled; to keep the false-positive rate down, we leave it off
# Unreliable match: functions that only share size, byte sum and mnemonics
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Bytes sum' description,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where f.bytes_sum = df.bytes_sum
and f.size = df.size
and f.mnemonics = df.mnemonics
and f.instructions > 5 and df.instructions > 5"""
log_refresh("Finding with heuristic 'Bytes sum'")
self.add_matches_from_query(sql, choose)
# Perfect match: identical assembly or pseudo-code
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where f.pseudocode = df.pseudocode
and df.pseudocode is not null
and f.pseudocode_lines >= 5 """ + postfix + """
and f.name not like 'nullsub%'
and df.name not like 'nullsub%'
union
select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal assembly' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2
from functions f,
diff.functions df
where f.assembly = df.assembly
and df.assembly is not null
and f.instructions > 5 and df.instructions > 5
and f.name not like 'nullsub%'
and df.name not like 'nullsub%' """
log_refresh("Finding with heuristic 'Equal assembly or pseudo-code'")
self.add_matches_from_query(sql, choose)
# Match by similarity ratio: identical cleaned-up assembly or pseudo-code
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Same cleaned up assembly or pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where (f.clean_assembly = df.clean_assembly
or f.clean_pseudo = df.clean_pseudo)
and f.pseudocode_lines > 5 and df.pseudocode_lines > 5
and f.name not like 'nullsub%'
and df.name not like 'nullsub%' """
log_refresh("Finding with heuristic 'Same cleaned up assembly or pseudo-code'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)
# If the address, node count, mnemonics and edge count match, add these too, filed by similarity ratio
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same address, nodes, edges and mnemonics' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.rva = df.rva
and f.instructions = df.instructions
and f.nodes = df.nodes
and f.edges = df.edges
and f.mnemonics = df.mnemonics""" + postfix
log_refresh("Finding with heuristic 'Same address, nodes, edges and mnemonics'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, None)
cur.close()
# Execute the SQL query and add the results to the chooser
# Note: only add results when the ratio is known to be 1.00
def add_matches_from_query(self, sql, choose):
""" Warning: use this *only* if the ratio is known to be 1.00 """
if self.all_functions_matched():
return
cur = self.db_cursor()
try:
cur.execute(sql)
except:
log("Error: %s" % str(sys.exc_info()[1]))
return
i = 0
while 1:
i += 1
if i % 1000 == 0:
log("Processed %d rows..." % i)
row = cur.fetchone()
if row is None:
break
ea = str(row["ea"])
name1 = row["name1"]
ea2 = str(row["ea2"])
name2 = row["name2"]
desc = row["description"]
bb1 = int(row["bb1"])
bb2 = int(row["bb2"])
if name1 in self.matched1 or name2 in self.matched2:
continue
choose.add_item(CChooser.Item(ea, name1, ea2, name2, desc, 1, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
cur.close()
def add_matches_from_query_ratio(self, sql, best, partial, unreliable=None, debug=False):
""" Run the query and file each match by its similarity ratio:
r == 1: best match
0.5 <= r < 1: partial match
r < 0.5: unreliable match (kept only when an unreliable chooser is given)
"""
if self.all_functions_matched():
return
cur = self.db_cursor()
try:
cur.execute(sql)
except:
log("Error: %s" % str(sys.exc_info()[1]))
return
i = 0
t = time.time()
while self.max_processed_rows == 0 or (self.max_processed_rows != 0 and i < self.max_processed_rows):
if time.time() - t > self.timeout:
log("Timeout")
break
i += 1
if i % 50000 == 0:
log("Processed %d rows..." % i)
row = cur.fetchone()
if row is None:
break
ea = str(row["ea"])
name1 = row["name1"]
ea2 = row["ea2"]
name2 = row["name2"]
desc = row["description"]
pseudo1 = row["pseudo1"]
pseudo2 = row["pseudo2"]
asm1 = row["asm1"]
asm2 = row["asm2"]
ast1 = row["pseudo_primes1"]
ast2 = row["pseudo_primes2"]
bb1 = int(row["bb1"])
bb2 = int(row["bb2"])
md1 = row["md1"]
md2 = row["md2"]
if name1 in self.matched1 or name2 in self.matched2:
continue
r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)
if debug:
print "0x%x 0x%x %d" % (int(ea), int(ea2), r)
if r == 1:
self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
elif r >= 0.5:
partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
elif r < 0.5 and unreliable is not None:
unreliable.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
else:
partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
cur.close()
This may make your head spin, so let's summarize the rules that produce a best match.
Best-match rules:
Rule 1: every column of the functions-table rows is identical
Rule 2: same bytes_hash, rva/segment_rva, instruction count, and name (not both sub_*)
Rule 3: same bytes_hash and same order (id) in the database
Rule 4: same bytes_hash and names
Rule 5: same bytes_hash
Rule 6: same function_hash
Rule 7: identical assembly or pseudo-code
Rule 8: identical cleaned-up assembly and pseudo-code
Rule 9: same RVA, node count, edge count, instruction count and mnemonics
Note: apart from Rule 1, every rule computes a similarity ratio, and only pairs with a ratio of 1.00 are confirmed as best matches.
Rule 8 and Rule 9 can also produce partial and unreliable matches.
Boiled down, these rules rest on three key signals: the bytes hash, the assembly/pseudo-code, and the instruction count and mnemonic set.
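To make the first of those signals concrete: a bytes hash in this style is simply a digest over the function's raw machine code, so byte-identical code matches regardless of where it was linked, while a single patched byte breaks the match and the pair falls through to the weaker heuristics. A toy version (hypothetical; Diaphora's actual hashing may normalize the bytes first):

```python
import hashlib

def bytes_hash(func_bytes):
    # Digest over the raw bytes: identical code gives an identical hash.
    return hashlib.md5(func_bytes).hexdigest()

lib_func = bytes.fromhex("554889e5c3")    # push rbp; mov rbp, rsp; ret
bin_func = bytes.fromhex("554889e5c3")    # the same bytes found in the target
patched  = bytes.fromhex("554889e590c3")  # one nop inserted before the ret

hash_match = bytes_hash(lib_func) == bytes_hash(bin_func)  # strong best-match signal
hash_broken = bytes_hash(lib_func) == bytes_hash(patched)  # falls to other heuristics
```

This brittleness is exactly why the ruleset pairs the hash with structural signals such as instruction counts and mnemonic sets.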
A partial match is a pair whose similarity is below 1.0 but above 0.5 (or a user-defined threshold); in other words, a partial match, as the name says. This part is long and rule-heavy; if the code is too much, feel free to skip ahead, since I summarize the partial-match rules after it.
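The ratios driving this classification come from check_ratio(), which (among other signals) compares the two functions' pseudo-code and assembly text. The flavor of that text comparison can be approximated with difflib; this is an illustrative sketch, not Diaphora's exact formula:

```python
from difflib import SequenceMatcher

def text_ratio(text1, text2):
    # Similarity in [0, 1] between two pseudo-code (or assembly) listings.
    if text1 is None or text2 is None:
        return 0.0
    return SequenceMatcher(None, text1, text2).ratio()

pseudo1 = "int f(int a) { return a + 1; }"
pseudo2 = "int f(int a) { return a + 2; }"  # one constant changed
pseudo3 = "void g(char *s) { puts(s); }"    # unrelated function

r_close = text_ratio(pseudo1, pseudo2)
r_far   = text_ratio(pseudo1, pseudo3)
# r_close is near 1.0 and would be filed as a partial (or best) match;
# r_far is much lower and would survive only as an unreliable match.
```

The SQL heuristics below only narrow the candidate set; it is this ratio step that decides which chooser a pair ends up in.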
def find_matches(self):
""" Select candidate pairs with a set of heuristics, compute a similarity
ratio from the function AST, pseudo-code, assembly, MD-Index and similar
attributes, and file each match by that ratio:
(0, 0.5): unreliable
[0.5, 1): partial match
1: best match
"""
choose = self.partial_chooser
postfix = ""
if self.ignore_small_functions:
postfix = " and f.instructions > 5 and df.instructions > 5 "
# Rule 1: same rare MD-Index
# Results: best, partial
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same rare MD Index' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df,
(select md_index
from diff.functions
where md_index != 0
group by md_index
having count(*) <= 2
union
select md_index
from main.functions
where md_index != 0
group by md_index
having count(*) <= 2
) shared_mds
where f.md_index = df.md_index
and df.md_index = shared_mds.md_index
and f.nodes > 10 """ + postfix
log_refresh("Finding with heuristic 'Same rare MD Index'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
# Rule 2: same MD-Index and constants
# Results: best, partial
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Same MD Index and constants' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2,
df.tarjan_topological_sort, df.strongly_connected_spp
from functions f,
diff.functions df
where f.md_index = df.md_index
and f.md_index > 0
and ((f.constants = df.constants
and f.constants_count > 0)) """ + postfix
log_refresh("Finding with heuristic 'Same MD Index and constants'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
# Rule 3: all attributes equal, or most of them (all except hash1/hash2/hash3 and pseudocode_primes)
# Results: best, partial
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
'All attributes' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.indegree = df.indegree
and f.outdegree = df.outdegree
and f.size = df.size
and f.instructions = df.instructions
and f.mnemonics = df.mnemonics
and f.names = df.names
and f.prototype2 = df.prototype2
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.primes_value = df.primes_value
and f.bytes_hash = df.bytes_hash
and f.pseudocode_hash1 = df.pseudocode_hash1
and f.pseudocode_primes = df.pseudocode_primes
and f.pseudocode_hash2 = df.pseudocode_hash2
and f.pseudocode_hash3 = df.pseudocode_hash3
and f.strongly_connected = df.strongly_connected
and f.loops = df.loops
and f.tarjan_topological_sort = df.tarjan_topological_sort
and f.strongly_connected_spp = df.strongly_connected_spp """ + postfix + """
union
select f.address ea, f.name name1, df.address ea2, df.name name2,
'Most attributes' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.indegree = df.indegree
and f.outdegree = df.outdegree
and f.size = df.size
and f.instructions = df.instructions
and f.mnemonics = df.mnemonics
and f.names = df.names
and f.prototype2 = df.prototype2
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.primes_value = df.primes_value
and f.bytes_hash = df.bytes_hash
and f.strongly_connected = df.strongly_connected
and f.loops = df.loops
and f.tarjan_topological_sort = df.tarjan_topological_sort
and f.strongly_connected_spp = df.strongly_connected_spp """
sql += postfix
log_refresh("Finding with heuristic 'All or most attributes'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)
# Rule 3': with slow heuristics enabled, filter functions with identical switch structures
# Threshold: 0.2
# Results: partial, unreliable
if self.slow_heuristics:
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Switch structures' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.switches = df.switches
and df.switches != '[]' """ + postfix
log_refresh("Finding with heuristic 'Switch structures'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.2)
# Note: using add_matches_from_query_ratio_max does not mean there are no best matches; only pairs with ratio == 1 are set as best
# Rule 4: same constants and constant count
# Threshold: 0.5
# Results: partial, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
'Same constants' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.constants = df.constants
and f.constants_count = df.constants_count
and f.constants_count > 0 """ + postfix
log_refresh("Finding with heuristic 'Same constants'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)
# Rule 5: same address, node count, edge count, primes and instruction count
# Threshold: 0.5
# Results: partial, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
'Same address, nodes, edges and primes (re-ordered instructions)' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.rva = df.rva
and f.instructions = df.instructions
and f.nodes = df.nodes
and f.edges = df.edges
and f.primes_value = df.primes_value
and f.nodes > 3""" + postfix
log_refresh("Finding with heuristic 'Same address, nodes, edges and primes (re-ordered instructions)'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)
# Rule 6: same names, MD-Index and instruction count
# Results: best, partial
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Import names hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.names != '[]'
and f.md_index = df.md_index
and f.instructions = df.instructions
and f.nodes > 5 and df.nodes > 5""" + postfix
log_refresh("Finding with heuristic 'Import names hash'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)
# Rule 7: same node count, edge count, complexity, mnemonics, names, prototype, in-degree and out-degree
# Results: partial
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges, complexity, mnemonics, names, prototype2, in-degree and out-degree' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.mnemonics = df.mnemonics
and f.names = df.names
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.prototype2 = df.prototype2
and f.indegree = df.indegree
and f.outdegree = df.outdegree
and f.nodes > 3
and f.edges > 3
and f.names != '[]'""" + postfix + """
union
select f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges, complexity, mnemonics, names and prototype2' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.mnemonics = df.mnemonics
and f.names = df.names
and f.names != '[]'
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.prototype2 = df.prototype2""" + postfix
log_refresh("Finding with heuristic 'Nodes, edges, complexity, mnemonics, names, prototype, in-degree and out-degree'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, self.partial_chooser)
# Rule 8: same mnemonics, instruction count and names
# Results: partial
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
'Mnemonics and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.mnemonics = df.mnemonics
and f.instructions = df.instructions
and f.names = df.names
and f.names != '[]'""" + postfix
log_refresh("Finding with heuristic 'Mnemonics and names'")
self.add_matches_from_query_ratio(sql, choose, choose)
# Rule 9: same mnemonics small-primes-product and instruction count
# Threshold: 0.6
# Results: partial, unreliable
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
'Mnemonics small-primes-product' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.mnemonics_spp = df.mnemonics_spp
and f.instructions = df.instructions
and f.nodes > 1 and df.nodes > 1
and df.instructions > 5 """ + postfix
log_refresh("Finding with heuristic 'Mnemonics small-primes-product'")
self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)
# Search using some of the previous criterias but calculating the
# edit distance
# (re-checks the earlier candidates, scoring small name differences by edit distance)
log_refresh("Finding with heuristic 'Small names difference'")
self.search_small_differences(choose)
# Rule 10: same pseudo-code fuzzy hash
# Results: best, partial
if self.slow_heuristics:
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where df.pseudocode_hash1 = f.pseudocode_hash1
or df.pseudocode_hash2 = f.pseudocode_hash2
or df.pseudocode_hash3 = f.pseudocode_hash3""" + postfix
log_refresh("Finding with heuristic 'Pseudo-code fuzzy hashes'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
else:
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where df.pseudocode_hash1 = f.pseudocode_hash1""" + postfix
log_refresh("Finding with heuristic 'Pseudo-code fuzzy hash'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
# Rule 11: same pseudo-code line count and names
# Results: best, partial, unreliable
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.pseudocode_lines = df.pseudocode_lines
and f.names = df.names
and df.names != '[]'
and df.pseudocode_lines > 5
and df.pseudocode is not null
and f.pseudocode is not null""" + postfix
log_refresh("Finding with heuristic 'Similar pseudo-code and names'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)
# Rule 12: same pseudo-code fuzzy AST hash
# Results: best, partial
if self.slow_heuristics:
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Pseudo-code fuzzy AST hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where df.pseudocode_primes = f.pseudocode_primes
and f.pseudocode_lines > 3
and length(f.pseudocode_primes) >= 35""" + postfix
log_refresh("Finding with heuristic 'Pseudo-code fuzzy AST hash'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose)
# Rule 13: partially matching pseudo-code fuzzy hashes (first 16 characters)
# Threshold: 0.5
# Results: partial, unreliable
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Partial pseudo-code fuzzy hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where substr(df.pseudocode_hash1, 1, 16) = substr(f.pseudocode_hash1, 1, 16)
or substr(df.pseudocode_hash2, 1, 16) = substr(f.pseudocode_hash2, 1, 16)
or substr(df.pseudocode_hash3, 1, 16) = substr(f.pseudocode_hash3, 1, 16)""" + postfix
log_refresh("Finding with heuristic 'Partial pseudo-code fuzzy hash'")
self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)
# Rule 14: same strongly connected components and Tarjan topological sort
# Results: best, partial, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
'Topological sort hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.strongly_connected = df.strongly_connected
and f.tarjan_topological_sort = df.tarjan_topological_sort
and f.strongly_connected > 3
and f.nodes > 10 """ + postfix
log_refresh("Finding with heuristic 'Topological sort hash'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)
# Rule 15: same high cyclomatic complexity, prototype and names
# Results: partial
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.cyclomatic_complexity >= 20
and f.prototype2 = df.prototype2
and df.names != '[]'""" + postfix
log_refresh("Finding with heuristic 'Same high complexity, prototype and names'")
self.add_matches_from_query_ratio(sql, choose, choose)
# Rule 16: same names and cyclomatic complexity
# Threshold: 0.5
# Results: partial, unreliable
if self.slow_heuristics:
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.cyclomatic_complexity >= 15
and df.names != '[]'""" + postfix
log_refresh("Finding with heuristic 'Same high complexity and names'")
self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.5)
# Rule 17: same strongly connected components
# Threshold: 0.8
# Results: partial
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.strongly_connected = df.strongly_connected
and df.strongly_connected > 1
and f.nodes > 5 and df.nodes > 5
and f.strongly_connected_spp > 1
and df.strongly_connected_spp > 1""" + postfix
log_refresh("Finding with heuristic 'Strongly connected components'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.80)
# Rule 18: same strongly-connected-components small-primes-product
# Results: best, partial, unreliable
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components small-primes-product' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.strongly_connected_spp = df.strongly_connected_spp
and df.strongly_connected_spp > 1
and f.nodes > 10 and df.nodes > 10 """ + postfix
log_refresh("Finding with heuristic 'Strongly connected components small-primes-product'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)
# Rule 19: same loop count
# Threshold: 0.49
# Results: partial
if self.slow_heuristics:
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.loops = df.loops
and df.loops > 1
and f.nodes > 3 and df.nodes > 3""" + postfix
log_refresh("Finding with heuristic 'Loop count'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)
# Rule 20: same strongly-connected-components SPP and names
# Threshold: 0.49
# Result: partial
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
'Strongly connected components SPP and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.names != '[]'
and f.strongly_connected_spp = df.strongly_connected_spp
and f.strongly_connected_spp > 0
""" + postfix
log_refresh("Finding with heuristic 'Strongly connected components SPP and names'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, None, 0.49)
def add_matches_from_query_ratio_max(self, sql, best, partial, val):
""" Run the query and assign matches by similarity ratio: pairs whose ratio is
greater than the threshold val all go into the best chooser; pairs with a lower
(but non-zero) ratio go into the partial chooser, if one was supplied.
Parameters:
sql: the SQL statement to execute
best: chooser receiving matches above the threshold
partial: chooser receiving matches below the threshold (may be None)
val: the threshold
"""
if self.all_functions_matched():
return
cur = self.db_cursor()
try:
cur.execute(sql)
except:
log("Error: %s" % str(sys.exc_info()[1]))
return
i = 0
t = time.time()
while self.max_processed_rows == 0 or (self.max_processed_rows != 0 and i < self.max_processed_rows):
if time.time() - t > self.timeout:
log("Timeout")
break
i += 1
if i % 50000 == 0:
log("Processed %d rows..." % i)
row = cur.fetchone()
if row is None:
break
ea = str(row["ea"])
name1 = row["name1"]
ea2 = row["ea2"]
name2 = row["name2"]
desc = row["description"]
pseudo1 = row["pseudo1"]
pseudo2 = row["pseudo2"]
asm1 = row["asm1"]
asm2 = row["asm2"]
ast1 = row["pseudo_primes1"]
ast2 = row["pseudo_primes2"]
bb1 = int(row["bb1"])
bb2 = int(row["bb2"])
md1 = row["md1"]
md2 = row["md2"]
if name1 in self.matched1 or name2 in self.matched2:
continue
r = self.check_ratio(ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2)
if r == 1:
self.best_chooser.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
elif r > val:
best.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
elif partial is not None:
partial.add_item(CChooser.Item(ea, name1, ea2, name2, desc, r, bb1, bb2))
self.matched1.add(name1)
self.matched2.add(name2)
cur.close()
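The routing step above boils down to three cases. The sketch below condenses it; `route_match` is a hypothetical helper written for this post (not a function in diaphora.py), and plain lists stand in for Diaphora's CChooser objects:

```python
# Minimal sketch of the chooser-routing logic in add_matches_from_query_ratio_max.
def route_match(r, val, best, partial, best_chooser):
    """File one candidate pair whose similarity ratio is r."""
    if r == 1:                 # perfect match: always goes to the global best chooser
        best_chooser.append(r)
    elif r > val:              # above this heuristic's threshold
        best.append(r)
    elif partial is not None:  # below the threshold: kept only if a partial chooser exists
        partial.append(r)

best_chooser, best, partial = [], [], []
for r in (1.0, 0.9, 0.3):
    route_match(r, 0.5, best, partial, best_chooser)
# best_chooser now holds [1.0], best holds [0.9], partial holds [0.3]
```

Note that the `best` parameter is per-heuristic (each rule picks its own target chooser), while a ratio of exactly 1.0 always goes to the global best-match list.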
To summarize, the partial-match rules are as follows:
Rule 1: same MD Index
Rule 2: same MD Index and constants
Rule 3: all or most attributes equal (everything except hash1, hash2, hash3 and pseudo_primes)
Rule 4: same switch structure
Rule 5: same constants and constant count
Rule 6: same RVA, node count, edge count, primes and instruction count
Rule 7: same function names, MD Index and instruction count
Rule 8: same node count, edge count, cyclomatic complexity, mnemonics, names, prototype, in-degree and out-degree
Rule 9: same mnemonics, instruction count and function names
Rule 10: same mnemonics small-primes-product and instruction count
Rule 11: same pseudo-code fuzzy hash
Rule 12: same pseudo-code and names
Rule 13: same pseudo-code fuzzy AST hash
Rule 14: partially matching pseudo-code fuzzy AST hash
Rule 15: same tarjan topological sort of the strongly connected components
Rule 16: same cyclomatic complexity, prototype and names
Rule 17: same names and cyclomatic complexity
Rule 18: same strongly connected components
Rule 19: same loop count
Rule 20: same strongly-connected-components small-primes-product
Note: in every rule above, a pair whose similarity ratio reaches 1.0 is added to the best matches instead.
We are not yet sure what exactly the MD Index is; we will leave that as a gap to fill later.
Across these 20 rules, the matching is driven mainly by the MD Index, call-graph features, and pseudo-code fuzzy hashes. Note, however, that some of them overlap with the best-match heuristics, and that this pass can itself produce best matches; keep both points in mind when interpreting the results.
This function is fairly simple: whenever two functions form a best match or a partial match, it looks up the callers and callees of that pair and runs the similarity matching on them as well. We will not go through its code here.
We do not really care about this next part, since its results are inaccurate, but for completeness we will follow the code anyway.
# Find unreliable matches
def find_unreliable_matches(self):
choose = self.unreliable_chooser
postfix = ""
if self.ignore_small_functions:
postfix = " and f.instructions > 5 and df.instructions > 5 "
# Rule 1: only the strongly connected components are the same
# Threshold: 0.54
# Result: partial, unreliable
if self.slow_heuristics:
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Strongly connected components' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.strongly_connected = df.strongly_connected
and df.strongly_connected > 2""" + postfix
log_refresh("Finding with heuristic 'Strongly connected components'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.54)
# Rule 2: only the loop count is the same
# Threshold: 0.5
# Result: partial, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Loop count' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.loops = df.loops
and df.loops > 1""" + postfix
log_refresh("Finding with heuristic 'Loop count'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
# Rule 3: same nodes, edges, complexity and mnemonics
# Result: best, partial
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges, complexity and mnemonics' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.mnemonics = df.mnemonics
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.nodes > 1 and f.edges > 0""" + postfix
log_refresh("Finding with heuristic 'Nodes, edges, complexity and mnemonics'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)
# Rule 4: same nodes, edges, complexity and prototype
# Result: partial, unreliable
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges, complexity and prototype' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.prototype2 = df.prototype2
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.prototype2 != 'int()'""" + postfix
log_refresh("Finding with heuristic 'Nodes, edges, complexity and prototype'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
# Rule 5: same nodes, edges, complexity, in-degree and out-degree
# Result: partial, unreliable
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges, complexity, in-degree and out-degree' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.nodes > 3 and f.edges > 2
and f.indegree = df.indegree
and f.outdegree = df.outdegree""" + postfix
log_refresh("Finding with heuristic 'Nodes, edges, complexity, in-degree and out-degree'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
# Rule 6: same nodes, edges and complexity
# Result: partial, unreliable
sql = """ select distinct f.address ea, f.name name1, df.address ea2, df.name name2,
'Nodes, edges and complexity' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.nodes > 1 and f.edges > 0""" + postfix
log_refresh("Finding with heuristic 'Nodes, edges and complexity'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
# Rule 7: similar small pseudo-code (same pseudo-code line count)
# Threshold: 0.5
# Result: partial, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where df.pseudocode is not null
and f.pseudocode is not null
and f.pseudocode_lines = df.pseudocode_lines
and df.pseudocode_lines > 5""" + postfix
log_refresh("Finding with heuristic 'Similar small pseudo-code'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, self.unreliable_chooser, 0.5)
# Rule 8: same, high cyclomatic complexity
# Result: partial, unreliable
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.cyclomatic_complexity = df.cyclomatic_complexity
and f.cyclomatic_complexity >= 50""" + postfix
log_refresh("Finding with heuristic 'Same high complexity'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
The unreliable-match rules are:
Rule 1: only the strongly connected components are the same
Rule 2: only the loop count is the same
Rule 3: same nodes, edges, complexity and mnemonics
Rule 4: same nodes, edges, complexity and prototype
Rule 5: same nodes, edges, complexity, in-degree and out-degree
Rule 6: same nodes, edges and complexity
Rule 7: similar small pseudo-code (same line count, more than 5 lines)
Rule 8: same, high cyclomatic complexity
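Every heuristic above shares the same SQL skeleton: ATTACH the second database as `diff`, then join `functions` against `diff.functions` on whatever features the rule compares, optionally appending a size-filter postfix. The self-contained sketch below reproduces that pattern on a toy schema (the table columns and function names are made up for illustration, not Diaphora's real schema):

```python
import sqlite3, tempfile, os

# Build two toy databases, each with a minimal "functions" table holding
# only the columns a nodes/edges-style heuristic needs.
def make_db(path, rows):
    con = sqlite3.connect(path)
    con.execute("create table functions(address int, name text, nodes int, edges int)")
    con.executemany("insert into functions values(?,?,?,?)", rows)
    con.commit()
    con.close()

d = tempfile.mkdtemp()
main_db, diff_db = os.path.join(d, "main.db"), os.path.join(d, "diff.db")
make_db(main_db, [(0x1000, "sub_1000", 4, 5), (0x2000, "sub_2000", 1, 0)])
make_db(diff_db, [(0x8000, "strcpy", 4, 5), (0x9000, "nop_func", 1, 0)])

con = sqlite3.connect(main_db)
con.execute('attach "%s" as diff' % diff_db)
# Same join shape as Diaphora's rules: equal features, plus a size filter
# playing the role of the "postfix" string that skips trivial functions.
cur = con.execute("""select f.name, df.name
                       from functions f, diff.functions df
                      where f.nodes = df.nodes and f.edges = df.edges
                        and f.nodes > 1 and f.edges > 0""")
matches = cur.fetchall()   # [('sub_1000', 'strcpy')]
con.close()
```

Here `sub_2000` is filtered out by the size condition even though `nop_func` has identical features, which is exactly the point of `ignore_small_functions`.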
def find_experimental_matches(self):
"""Find matches using experimental heuristics.
"""
choose = self.unreliable_chooser
# Call-address-order heuristic: once a function matches, the function that
# follows it by address is very likely to match too
self.find_from_matches(self.best_chooser.items)
self.find_from_matches(self.partial_chooser.items)
postfix = ""
if self.ignore_small_functions:
postfix = " and f.instructions > 5 and df.instructions > 5 "
# Rule 1: same pseudo-code line count
# Threshold: 0.6
# Result: unreliable
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.pseudocode_lines = df.pseudocode_lines
and df.pseudocode_lines > 5
and df.pseudocode is not null
and f.pseudocode is not null""" + postfix
log_refresh("Finding with heuristic 'Similar pseudo-code'")
self.add_matches_from_query_ratio_max(sql, choose, self.unreliable_chooser, 0.6)
# Rule 2: same nodes, edges and strongly connected components
# Result: best, unreliable
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2,
'Same nodes, edges and strongly connected components' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.strongly_connected = df.strongly_connected
and df.nodes > 4""" + postfix
log_refresh("Finding with heuristic 'Same nodes, edges and strongly connected components'")
self.add_matches_from_query_ratio(sql, self.best_chooser, choose, self.unreliable_chooser)
# Rule 3: same pseudo-code line count (small functions)
# Threshold: 0.49
# Result: partial, unreliable
if self.slow_heuristics:
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Similar small pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.pseudocode_lines = df.pseudocode_lines
and df.pseudocode_lines <= 5
and df.pseudocode is not null
and f.pseudocode is not null""" + postfix
log_refresh("Finding with heuristic 'Similar small pseudo-code'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.49)
# Rule 4: same small pseudo-code fuzzy AST hash
# Result: partial, unreliable
sql = """select distinct f.address ea, f.name name1, df.address ea2, df.name name2, 'Small pseudo-code fuzzy AST hash' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where df.pseudocode_primes = f.pseudocode_primes
and f.pseudocode_lines <= 5""" + postfix
log_refresh("Finding with heuristic 'Small pseudo-code fuzzy AST hash'")
self.add_matches_from_query_ratio(sql, self.partial_chooser, choose)
# Rule 5: equal small pseudo-code
# Result: best, partial
sql = """select f.address ea, f.name name1, df.address ea2, df.name name2, 'Equal small pseudo-code' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.pseudocode = df.pseudocode
and df.pseudocode is not null
and f.pseudocode_lines < 5""" + postfix
log_refresh("Finding with heuristic 'Equal small pseudo-code'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser)
# Rule 6: same cyclomatic complexity, prototype and names
# Threshold: 0.5
# Result: partial, unreliable
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same high complexity, prototype and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.cyclomatic_complexity < 20
and f.prototype2 = df.prototype2
and df.names != '[]'""" + postfix
log_refresh("Finding with heuristic 'Same low complexity, prototype and names'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)
# Rule 7: same cyclomatic complexity and names
# Threshold: 0.5
# Result: partial, unreliable
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2, 'Same low complexity and names' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.names = df.names
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.cyclomatic_complexity < 15
and df.names != '[]'""" + postfix
log_refresh("Finding with heuristic 'Same low complexity and names'")
self.add_matches_from_query_ratio_max(sql, self.partial_chooser, choose, 0.5)
# Rule 8: same graph
# Result: best, partial, unreliable
if self.slow_heuristics:
# With very large databases (more than 25,000 functions) this query may fail
# with the following error: OperationalError: database or disk is full
sql = """ select f.address ea, f.name name1, df.address ea2, df.name name2,
'Same graph' description,
f.pseudocode pseudo1, df.pseudocode pseudo2,
f.assembly asm1, df.assembly asm2,
f.pseudocode_primes pseudo_primes1, df.pseudocode_primes pseudo_primes2,
f.nodes bb1, df.nodes bb2,
cast(f.md_index as real) md1, cast(df.md_index as real) md2
from functions f,
diff.functions df
where f.nodes = df.nodes
and f.edges = df.edges
and f.indegree = df.indegree
and f.outdegree = df.outdegree
and f.cyclomatic_complexity = df.cyclomatic_complexity
and f.strongly_connected = df.strongly_connected
and f.loops = df.loops
and f.tarjan_topological_sort = df.tarjan_topological_sort
and f.strongly_connected_spp = df.strongly_connected_spp""" + postfix + """
and f.nodes > 5 and df.nodes > 5
order by
case when f.size = df.size then 1 else 0 end +
case when f.instructions = df.instructions then 1 else 0 end +
case when f.mnemonics = df.mnemonics then 1 else 0 end +
case when f.names = df.names then 1 else 0 end +
case when f.prototype2 = df.prototype2 then 1 else 0 end +
case when f.primes_value = df.primes_value then 1 else 0 end +
case when f.bytes_hash = df.bytes_hash then 1 else 0 end +
case when f.pseudocode_hash1 = df.pseudocode_hash1 then 1 else 0 end +
case when f.pseudocode_primes = df.pseudocode_primes then 1 else 0 end +
case when f.pseudocode_hash2 = df.pseudocode_hash2 then 1 else 0 end +
case when f.pseudocode_hash3 = df.pseudocode_hash3 then 1 else 0 end DESC"""
log_refresh("Finding with heuristic 'Same graph'")
self.add_matches_from_query_ratio(sql, self.best_chooser, self.partial_chooser, self.unreliable_chooser)
# Finally, fall back to brute-force search
log_refresh("Brute-forcing...")
self.find_brute_force()
The experimental pass really consists of three steps: CFG-order matching, rule filtering, and brute-force search. The first step uses a different kind of CFG matching than the ones discussed above. It rests on a simple rule of thumb: if function Ai equals function Bj, where A and B are the two databases and i, j are the functions' ordinal positions, then A(i+1) and B(j+1) are also likely to be equal. This is only an unreliable heuristic; whether two functions actually match is still decided by the similarity ratio. The second step applies the following rules:
Experimental rules:
Rule 1: same pseudo-code line count
Rule 2: same nodes, edges and strongly connected components
Rule 3: same pseudo-code line count (small functions)
Rule 4: same small pseudo-code fuzzy AST hash
Rule 5: equal small pseudo-code
Rule 6: same cyclomatic complexity, prototype and names
Rule 7: same cyclomatic complexity and names
Rule 8: same call graph (nodes, edges, in-degree, out-degree, cyclomatic complexity, tarjan topological sort, strongly connected components and loops)
None of these rules has a particularly solid basis, so the results they produce are not necessarily reliable either. The third step is brute force: any function still unclassified after all of the matching above is matched by brute force, which frankly does not inspire much confidence.
After the analysis above, you should have a good picture of Diaphora's matching rules and overall process. I recommend reading the source code side by side with a database Diaphora has already generated; it goes much faster that way. One thing we have mentioned repeatedly is the function similarity ratio, so the last thing to analyze is that algorithm. The similarity ratio is computed by check_ratio().
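The address-order idea from step one can be sketched in a few lines. Everything below (the function names and the `neighbor_candidates` helper itself) is hypothetical; Diaphora's real find_from_matches works on chooser items and still validates every candidate pair with the similarity ratio:

```python
# Sketch of the call-address-order heuristic: when A[i] is matched to B[j],
# propose the functions that follow them by address, A[i+1] and B[j+1],
# as a candidate pair to be verified by the similarity ratio.
def neighbor_candidates(funcs_a, funcs_b, matches):
    """funcs_a/funcs_b: function names sorted by address.
    matches: confirmed (name_a, name_b) pairs. Returns new candidate pairs."""
    idx_a = {name: i for i, name in enumerate(funcs_a)}
    idx_b = {name: j for j, name in enumerate(funcs_b)}
    out = []
    for a, b in matches:
        i, j = idx_a[a], idx_b[b]
        if i + 1 < len(funcs_a) and j + 1 < len(funcs_b):
            cand = (funcs_a[i + 1], funcs_b[j + 1])
            if cand not in matches:
                out.append(cand)
    return out

a = ["f1", "f2", "f3"]
b = ["g1", "g2", "g3"]
cands = neighbor_candidates(a, b, [("f1", "g1")])   # proposes ("f2", "g2")
```

The heuristic works because compilers usually emit functions from the same translation unit in a stable order, but nothing guarantees it, which is why the candidates are only proposals.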
# Compute the similarity ratio of two buffers
def quick_ratio(buf1, buf2):
try:
if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":  # note: upstream tests buf1 twice here; buf2 is clearly intended
return 0
s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
return s.quick_ratio()
except:
print "quick_ratio:", str(sys.exc_info()[1])
return 0
#-----------------------------------------------------------------------
# A faster way to compute the ratio, though apparently less accurate
def real_quick_ratio(buf1, buf2):
try:
if buf1 is None or buf2 is None or buf1 == "" or buf2 == "":  # same upstream typo as in quick_ratio: buf1 was tested twice
return 0
s = SequenceMatcher(None, buf1.split("\n"), buf2.split("\n"))
return s.real_quick_ratio()
except:
print "real_quick_ratio:", str(sys.exc_info()[1])
return 0
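The relationship between these two helpers comes straight from difflib: `real_quick_ratio()` is an upper bound computed from the sequence lengths alone, `quick_ratio()` is a tighter upper bound computed from element counts, and `ratio()` is the exact (and slowest) value, so `real_quick_ratio() >= quick_ratio() >= ratio()` always holds. A small demonstration:

```python
from difflib import SequenceMatcher

# Two pseudo-code fragments differing in one line, compared line-wise
# just like the helpers above (which split their buffers on "\n").
code1 = "int a = 0;\nreturn a;"
code2 = "int a = 1;\nreturn a;"
s = SequenceMatcher(None, code1.split("\n"), code2.split("\n"))

exact = s.ratio()                   # 0.5: one of the two lines matches
quick = s.quick_ratio()             # 0.5: upper bound from line multisets
real_quick = s.real_quick_ratio()   # 1.0: both sequences have two lines
```

This is why check_ratio() below re-checks a real_quick_ratio of 1.0 with quick_ratio: the cheap bound can report 1.0 for buffers that merely have the same length.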
#-----------------------------------------------------------------------
# Compute the difference ratio of two ASTs
def ast_ratio(ast1, ast2):
if ast1 == ast2:
return 1.0
elif ast1 is None or ast2 is None:
return 0
return difference_ratio(decimal.Decimal(ast1), decimal.Decimal(ast2))
#-----------------------------------------------------------------------
# Check the similarity ratio of two functions
def check_ratio(self, ast1, ast2, pseudo1, pseudo2, asm1, asm2, md1, md2):
fratio = quick_ratio
decimal_values = "{0:.2f}"
if self.relaxed_ratio:
fratio = real_quick_ratio
decimal_values = "{0:.1f}"
# First, compute the AST similarity ratio
v3 = 0
ast_done = False
if self.relaxed_ratio and ast1 is not None and ast2 is not None and max(len(ast1), len(ast2)) < 16:
ast_done = True
v3 = self.ast_ratio(ast1, ast2)
if v3 == 1:
return 1.0
# Second, compute the similarity ratio of the cleaned pseudo-code
v1 = 0
if pseudo1 is not None and pseudo2 is not None and pseudo1 != "" and pseudo2 != "":
tmp1 = self.get_cmp_pseudo_lines(pseudo1)
tmp2 = self.get_cmp_pseudo_lines(pseudo2)
if tmp1 == "" or tmp2 == "":
log("Error cleaning pseudo-code!")
else:
v1 = fratio(tmp1, tmp2)
v1 = float(decimal_values.format(v1))
if v1 == 1.0:
# If real_quick_ratio returned 1, verify with quick_ratio, because
# real_quick_ratio can produce false positives. If real_quick_ratio
# already reports a difference, there is no need to check any further.
if fratio == real_quick_ratio:
v1 = quick_ratio(tmp1, tmp2)
if v1 == 1.0:
return 1.0
# Third, compare the assembly code
tmp_asm1 = self.get_cmp_asm_lines(asm1)
tmp_asm2 = self.get_cmp_asm_lines(asm2)
v2 = fratio(tmp_asm1, tmp_asm2)
v2 = float(decimal_values.format(v2))
if v2 == 1:
# Same re-check as for the pseudo-code above
if fratio == real_quick_ratio:
v2 = quick_ratio(tmp_asm1, tmp_asm2)
if v2 == 1.0:
return 1.0
if self.relaxed_ratio and not ast_done:
v3 = fratio(ast1, ast2)
v3 = float(decimal_values.format(v3))
if v3 == 1:
return 1.0
v4 = 0.0
if md1 == md2 and md1 > 0.0:
# An MD-Index >= 10.0 is very rare
if self.relaxed_ratio or md1 > 10.0:
return 1.0
v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
elif md1 != 0 and md2 != 0 and False:  # dead branch: disabled by the "and False"
tmp1 = max(md1, md2)
tmp2 = min(md1, md2)
v4 = tmp2 * 1. / tmp1
r = max(v1, v2, v3, v4)
return r
To summarize the algorithm:
v3 = ast_ratio(ast1, ast2)
v1 = sequence_ratio(cleaned_pseudo_code1, cleaned_pseudo_code2)
v2 = sequence_ratio(cleaned_asm_code1, cleaned_asm_code2)
if MD_Index1 == MD_Index2 and MD_Index1 > 0:
    v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
# (the alternative v4 = min(MD_Index1, MD_Index2) / max(MD_Index1, MD_Index2)
#  exists in the source but is disabled by its "and False" condition)
ratio = max(v1, v2, v3, v4)
That is, the similarity ratio is the maximum of the four values: the AST, pseudo-code and disassembly ratios, plus the MD-Index-derived value.
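The final combination step can be reduced to a small standalone function. `combine_ratios` below is a simplified sketch written for this post (not Diaphora's check_ratio); v1/v2/v3 are assumed to be the already-computed pseudo-code, assembly and AST ratios, and `relaxed` plays the role of self.relaxed_ratio:

```python
# Simplified sketch of check_ratio()'s last step: derive v4 from the
# MD-Index values, then take the maximum of the four ratios.
def combine_ratios(v1, v2, v3, md1, md2, relaxed=False):
    v4 = 0.0
    if md1 == md2 and md1 > 0.0:
        if relaxed or md1 > 10.0:  # equal and rare MD-Index: treat as identical
            return 1.0
        v4 = min((v1 + v2 + v3 + 3.0) / 4, 1.0)
    # the md1/md2 quotient branch in the original source is disabled by "and False"
    return max(v1, v2, v3, v4)

# An equal non-zero MD-Index lifts an otherwise weak pair all the way to 1.0:
r_equal_md = combine_ratios(0.4, 0.5, 0.3, md1=3.7, md2=3.7)   # 1.0
r_plain    = combine_ratios(0.4, 0.5, 0.3, md1=2.0, md2=3.0)   # 0.5
```

Note how strongly the `(v1 + v2 + v3 + 3.0) / 4` term weights MD-Index equality: even three mediocre ratios saturate v4 at 1.0, which is worth remembering when judging matches produced this way.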
0x09 Summary
Let us sum up the matching techniques Diaphora uses. From a high-level point of view, Diaphora filters out candidate function pairs with a set of matching rules, computes a similarity ratio for each pair, and assigns the pair to a category by comparing that ratio against the relevant thresholds. The filtering rules involve logical features such as the call graph and the assembly code, as well as data features such as constants and instruction sets; some rules additionally rely on fuzzy hashes and small-primes products. In the next post, we will explain in detail how Diaphora generates its feature databases.