This blog is written by 胖胖鹏鹏胖胖鹏, an idle white hat, purely for personal technical exchange; it must not be used for commercial purposes. Please credit the source when reposting; without permission, reposting or commercial use of any content on this blog is prohibited.
In the previous posts we briefly introduced Diaphora's file structure and the ideas behind the FLIRT algorithm. Both are library-function identification techniques: FLIRT takes a pattern-matching approach, while Diaphora uses fuzzy hashes of the AST and of the call flow. During our research we found that in 2017 a new library-function identification method was proposed, whose authors noted that although Diaphora identifies a large number of functions, its false-positive rate is also high (the link is at the end of this post). To mitigate that weakness we first need to understand the algorithm before improving it, so this post looks at the implementation of jkutils, the helper library Diaphora depends on.
jkutils is a utility collection developed by joxeankoret, the author of Diaphora, and it is open source on GitHub (link here). The package currently offers the following:
graphs: graph construction, subgraph search, path-finding algorithms, and more
fuzzy_hashing: fuzzy hashing algorithms
factor: primes and factorization utilities
simple_log: a very simple logging tool
process: a simple process-management tool, mainly for parallelism and thread timeouts
query_utils: helpers for building SQL query statements
web_db: a wrapper around web.py that supports MySQL and SQLite
The toolbox contains plenty, but Diaphora only uses fuzzy_hashing and factor, so those two are what we will examine. The unused tools are still worth bookmarking: graph processing is complicated work, and a piece of efficient, well-tested code can save a lot of time.
0x01 jkutils/factor
Let's look at factor first; as the name suggests, it is code for handling primes and factorization. The file is a flat collection of functions with no connecting logic, so I have inserted my notes as comments in the code. I recommend going through it once, as it will make the later analysis of Diaphora's algorithm easier. The file is at https://github.com/joxeankoret/diaphora/blob/master/jkutils/factor.py.
#!/usr/bin/python
"""
Primes and factorization utilities
Part of the Malware Families Clusterization Project, adapted to be used
in the Joxean Koret's Diffing Plugin for IDA.
Copyright (c) 2012-2015 Joxean Koret
"""
import sys
import random
import decimal
#-----------------------------------------------------------------------
def primesbelow(N):
    # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    # Generate all primes p with 2 <= p < N and return them as a list
    # (the sieve below assumes N >= 6)
    correction = N % 6 > 1
    N = {0:N, 1:N-1, 2:N+4, 3:N+3, 4:N+2, 5:N+1}[N%6]
    sieve = [True] * (N // 3)
    sieve[0] = False
    for i in range(long(N ** .5) // 3 + 1):
        if sieve[i]:
            k = (3 * i + 1) | 1
            sieve[k*k // 3::2*k] = [False] * ((N//6 - (k*k)//6 - 1)//k + 1)
            sieve[(k*k + 4*k - 2*k*(i%2)) // 3::2*k] = [False] * ((N // 6 - (k*k + 4*k - 2*k*(i%2))//6 - 1) // k + 1)
    return [2, 3] + [(3 * i + 1) | 1 for i in range(1, N//3 - correction) if sieve[i]]
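
# Demo (added for this post): a quick sanity check of the sieve
# >>> primesbelow(30)
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]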
#-----------------------------------------------------------------------
# Precompute the set of all primes below 100000
smallprimeset = set(primesbelow(100000))
_smallprimeset = 100000

# For n below 100000, primality is just a membership test against the
# precomputed set; larger n fall through to a Miller-Rabin primality test
def isprime(n, precision=7):
    # http://en.wikipedia.org/wiki/Miller-Rabin_primality_test#Algorithm_and_running_time
    if n == 1 or n % 2 == 0:
        return False
    elif n < 1:
        raise ValueError("Out of bounds, first argument must be > 0")
    elif n < _smallprimeset:
        return n in smallprimeset

    d = n - 1
    s = 0
    while d % 2 == 0:
        d //= 2
        s += 1

    for repeat in range(precision):
        a = random.randrange(2, n - 2)
        x = pow(a, d, n)

        if x == 1 or x == n - 1: continue

        for r in range(s - 1):
            x = pow(x, 2, n)
            if x == 1: return False
            if x == n - 1: break
        else: return False

    return True
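
# Demo (added for this post): small values hit the precomputed set, large
# ones fall through to Miller-Rabin
# >>> isprime(97)
# True
# >>> isprime(2**61 - 1)   # a Mersenne prime, checked probabilistically
# True
# Edge case as written: the n % 2 == 0 test runs before the set lookup,
# so isprime(2) returns False.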
#-----------------------------------------------------------------------
# Brent's variant of the Pollard rho integer factorization algorithm,
# described at:
# https://comeoncodeon.wordpress.com/2010/09/18/pollard-rho-brent-integer-factorization/
# Returns: a non-trivial (but not necessarily prime) factor of n
def pollard_brent(n):
    if n % 2 == 0: return 2
    if n % 3 == 0: return 3

    y, c, m = random.randint(1, n-1), random.randint(1, n-1), random.randint(1, n-1)
    g, r, q = 1, 1, 1
    while g == 1:
        x = y
        for i in range(r):
            y = (pow(y, 2, n) + c) % n

        k = 0
        while k < r and g == 1:
            ys = y
            for i in range(min(m, r-k)):
                y = (pow(y, 2, n) + c) % n
                q = q * abs(x-y) % n
            g = gcd(q, n)
            k += m
        r *= 2
    if g == n:
        while True:
            ys = (pow(ys, 2, n) + c) % n
            g = gcd(abs(x - ys), n)
            if g > 1:
                break

    return g
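
# Demo (added for this post): on a semiprime this returns one of the two
# prime factors; which one depends on the random constants. Note that gcd
# is defined further down in this file, which is fine because Python only
# resolves the name when pollard_brent is actually called.
# >>> pollard_brent(101 * 103)
# 101                       # or 103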
#-----------------------------------------------------------------------
smallprimes = primesbelow(10000) # might seem low, but 10000*10000 = 100000000, so this will fully factor every composite < 100000000

# Decompose n into a list of prime factors, so that p1*p2*p3*... = n
def primefactors(n, sort=False):
    factors = []

    limit = long(n ** decimal.Decimal(.5)) + 1
    for checker in smallprimes:
        if checker > limit: break
        while n % checker == 0:
            factors.append(checker)
            n //= checker
            limit = long(n ** decimal.Decimal(.5)) + 1
            if checker > limit: break

    if n < 2: return factors

    while n > 1:
        if isprime(n):
            factors.append(n)
            break
        factor = pollard_brent(n) # trial division did not fully factor, switch to pollard-brent
        factors.extend(primefactors(factor)) # recurse to factor the not necessarily prime factor returned by pollard-brent
        n //= factor

    if sort: factors.sort()

    return factors
#-----------------------------------------------------------------------
# Convert the prime factorization from a list into a dict mapping
# prime -> exponent
def factorization(n):
    factors = {}
    for p1 in primefactors(n):
        try:
            factors[p1] += 1
        except KeyError:
            factors[p1] = 1
    return factors
#-----------------------------------------------------------------------
# Euler's totient function, memoized in the totients dict; returns 1 for
# n == 0 (as written), otherwise computes phi(n) from the factorization
totients = {}
def totient(n):
    if n == 0: return 1

    try: return totients[n]
    except KeyError: pass

    tot = 1
    for p, exp in factorization(n).items():
        tot *= (p - 1) * p ** (exp - 1)

    totients[n] = tot
    return tot
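
# Worked example (added for this post): 12 = 2**2 * 3, so
# phi(12) = (2-1)*2**(2-1) * (3-1)*3**0 = 2*2 = 4
# (the units modulo 12 are 1, 5, 7 and 11)
# >>> totient(12)
# 4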
#-----------------------------------------------------------------------
# Greatest common divisor
def gcd(a, b):
    if a == b: return a
    while b > 0: a, b = b, a % b
    return a

#-----------------------------------------------------------------------
# Least common multiple
def lcm(a, b):
    return abs(a * b) // gcd(a, b)
#-----------------------------------------------------------------------
FACTORS_CACHE = {}
def _difference(num1, num2):
    nums = [num1, num2]
    s = []
    for num in nums:
        if FACTORS_CACHE.has_key(num):
            x = FACTORS_CACHE[num]
        else:
            x = factorization(long(num))
            FACTORS_CACHE[num] = x
        s.append(x)

    diffs = {}
    for x in s[0].keys():
        if x in s[1].keys():
            if s[0][x] != s[1][x]:
                diffs[x] = max(s[0][x], s[1][x]) - min(s[0][x], s[1][x])
        else:
            diffs[x] = s[0][x]

    for x in s[1].keys():
        if x in s[0].keys():
            if s[1][x] != s[0][x]:
                diffs[x] = max(s[0][x], s[1][x]) - min(s[0][x], s[1][x])
        else:
            diffs[x] = s[1][x]

    return diffs, s
#-----------------------------------------------------------------------
def difference(num1, num2):
    # Compare the prime factorizations of two numbers: a prime present in only
    # one factorization contributes its full exponent to the difference, while
    # a prime present in both contributes the difference of its exponents
    """ Calculate the difference in prime numbers. If a prime number does not
    exist in one group but does in the other, the total value of the prime
    number is added as difference. If a prime number exists in both groups,
    the difference of the values is added. """
    diffs, s = _difference(num1, num2)
    return sum(diffs.values())
#-----------------------------------------------------------------------
def difference_ratio(num1, num2):
    # Turn the absolute difference into a similarity ratio of the two numbers
    """ Same as difference but returning a ratio of the changes. """
    diffs, s = _difference(num1, num2)
    total = max(sum(s[0].values()), sum(s[1].values()))
    return 1 - (sum(diffs.values()) * 1. / total)
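
# Worked example (added for this post): 12 = 2**2 * 3 and 18 = 2 * 3**2.
# Both primes occur in both factorizations and each exponent differs by 1,
# so difference(12, 18) = 1 + 1 = 2; each factorization has three prime
# factors in total, hence the ratio 1 - 2/3.
# >>> difference(12, 18)
# 2
# >>> difference_ratio(12, 18)
# 0.333...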
#-----------------------------------------------------------------------
def difference_matrix(samples, debug=True):
    # Compute the pairwise difference matrix for the given samples
    """ Calculate the difference matrix for the given set of samples. """
    diff_matrix = {}
    for x in samples:
        if debug:
            print "Calculating difference matrix for %s" % x
        if not diff_matrix.has_key(x):
            diff_matrix[x] = {}
        for y in samples:
            if samples[x] != samples[y]:
                d = difference(samples[x], samples[y])
                #print("Difference between %s and %s: %d" % (x, y, d))
                diff_matrix[x][y] = d
            else:
                diff_matrix[x][y] = 0
    return diff_matrix
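That is all of factor.py. To make the intended use concrete, here is a small sketch of how such prime products can be compared (my own illustration, not Diaphora's code: the mapping from AST node types to primes is hypothetical, and the functions above are assumed to be in scope):

# Hypothetical encoding: one small prime per AST node type
NODE_PRIMES = {"call": 2, "if": 3, "loop": 5, "return": 7}

def ast_product(node_types):
    # The product is independent of node order; exponents count occurrences
    p = 1
    for t in node_types:
        p *= NODE_PRIMES[t]
    return p

f1 = ast_product(["call", "if", "call", "return"])          # 2*3*2*7 = 84
f2 = ast_product(["call", "if", "call", "call", "return"])  # one more call: 168
print(difference(f1, f2))        # 1: the factorizations differ by a single 2
print(difference_ratio(f1, f2))  # 0.8 = 1 - 1/5

Two functions that differ by a single node still score a high ratio, which is exactly the kind of tolerant comparison an exact hash cannot give you.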
So, as the annotations show, factor boils down to prime factorization plus the comparison helpers built on top of it.

0x02 jkutils/kfuzzy

Fuzzy hashing should be the other important tool Diaphora uses for library-function identification and similarity matching. The general principle of a fuzzy hash is to scan the file with a weak hash over local content, cutting it into pieces whenever a trigger condition is met; a strong hash is then computed over every piece, the values are concatenated and compressed, and together with the trigger parameters they form the complete hash value. Comparing two fuzzy hash values then tells us how similar the two files are. The best-known implementation of this scheme is ssdeep. Although the overall pipeline is always similar, there is no single canonical fuzzy hash algorithm, and everyone handles the details differently; to know what Diaphora is doing, we have to look at the hash function it actually uses. The code is at https://github.com/joxeankoret/diaphora/blob/master/jkutils/kfuzzy.py.
Honestly, a fuzzy hash is hard to understand without an algorithm description or documentation. What follows is only my rough analysis of what each function does; since the algorithm is the Diaphora author's own invention, its description is scattered across several of his earlier papers.
import os
import sys
import base64

from itertools import imap

# modsum: convert every character in buf to its byte value, sum them up and
# reduce mod 255; prefer the C implementation from fasttoad_wrap if present
try:
    from fasttoad_wrap import modsum
except:
    def modsum(buf):
        return sum(imap(ord, buf)) % 255

# psyco was a JIT accelerator module, but it is abandoned by now; PyPy can be
# used instead for a similar speed-up
try:
    import psyco
    psyco.full()
except ImportError:
    pass
class CFileStr(str):
    fd = None

    def __init__(self, fd):
        # fd is an open file object
        self.fd = fd

    def __getslice__(self, x, y):
        # Slicing reads y-x bytes from offset x of the underlying file, so a
        # big file can be treated like a string without loading it into memory
        self.fd.seek(x, 0)
        buf = self.fd.read(y-x)
        self.fd.seek(y)
        return buf

    def __len__(self):
        # The length is the file size (seek to the end, then restore the position)
        old = self.fd.tell()
        self.fd.seek(0, 2)
        pos = self.fd.tell()
        self.fd.seek(old)
        return pos
class CKoretFuzzyHashing:
    """ Generate partial hashes of files or byte strings """
    bsize = 512
    output_size = 32
    ignore_range = 2
    big_file_size = 1024*1024*10
    algorithm = None
    reduce_errors = True
    remove_spaces = False

    def get_bytes(self, f, initial, final):
        # Read final bytes starting at offset initial
        f.seek(initial)
        return f.read(final)
    def edit_distance(self, sign1, sign2):
        # Position-by-position distance between two signatures: every index at
        # which the characters differ (or where only one signature still has a
        # character) adds 1, so this is a Hamming-style distance rather than a
        # true edit distance
        if sign1 == sign2:
            return 0

        m = max(len(sign1), len(sign2))
        distance = 0

        for c in xrange(0, m):
            if sign1[c:c+1] != sign2[c:c+1]:
                distance += 1

        return distance
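
    # Demo (added for this post):
    # >>> kf = CKoretFuzzyHashing()
    # >>> kf.edit_distance("abcdef", "abcdXf")
    # 1
    # >>> kf.edit_distance("abc", "abcxyz")   # trailing extra characters count too
    # 3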
    def simplified(self, bytes, aggresive = False):
        # Compress the input down to output_size characters: for each output
        # position, mod-sum a window of bsize+1 bytes and base64 the result
        output_size = self.output_size  # length of the final signature
        ignore_range = self.ignore_range
        bsize = self.bsize              # block size
        total_size = len(bytes)
        size = (total_size/bsize) / output_size  # blocks covered per output character
        buf = []
        reduce_errors = self.reduce_errors

        # Adjust the output to the desired output size: for each position, take
        # bsize+1 bytes and mod-sum them. Note the start offset c*size counts
        # blocks rather than bytes, so the windows cluster near the start of
        # the input.
        for c in xrange(0, output_size):
            tmp = bytes[c*size:(c*size+1)+bsize]
            ret = sum(imap(ord, tmp)) % 255
            if reduce_errors:
                if ret != 255 and ret != 0:
                    buf.append(chr(ret))
            else:
                buf.append(chr(ret))

        buf = "".join(buf)
        # base64-encode, strip the padding and cut down to output_size
        return base64.b64encode(buf).strip("=")[:output_size]
    def _hash(self, bytes, aggresive = False):
        # The default hash of a byte string
        idx = 0
        ret = []

        output_size = self.output_size
        ignore_range = self.ignore_range
        bsize = self.bsize
        total_size = len(bytes)
        rappend = ret.append
        chunk_size = idx*bsize
        reduce_errors = self.reduce_errors

        # Calculate the mod-255 sum of every bsize-byte block and collect the
        # resulting characters in ret (0x00 and 0xFF are dropped if
        # reduce_errors is set)
        while 1:
            chunk_size = idx*bsize
            buf = bytes[chunk_size:chunk_size+bsize]
            char = modsum(buf)

            if reduce_errors:
                if char != 255 and char != 0:
                    rappend(chr(char))
            else:
                rappend(chr(char))

            idx += 1

            if chunk_size+bsize > total_size:
                break

        ret = "".join(ret)
        size = len(ret) / output_size
        buf = []

        # Adjust the output to the desired output size: for each of the
        # output_size positions, look at the window ret[c:c+size+1] and keep
        # the byte at offset ignore_range (aggressive) or the byte at offset 1
        # plus the byte at 1-based position ignore_range (non-aggressive).
        # Note the window starts at c rather than c*size, so the windows
        # overlap heavily and mostly sample the beginning of ret.
        for c in xrange(0, output_size):
            if aggresive:
                buf.append(ret[c:c+size+1][ignore_range:ignore_range+1])
            else:
                buf.append(ret[c:c+size+1][1:2])
                i = 0
                for x in ret[c:c+size+1]:
                    i += 1
                    if i != ignore_range:
                        continue
                    i = 0
                    buf += x
                    break

        ret = "".join(buf)
        return base64.b64encode(ret).strip("=")[:output_size]
    def _fast_hash(self, bytes, aggresive = False):
        # Fast variant: one mod-sum per block, skipping _hash's second
        # downsampling pass; note the characters are collected into a set, so
        # duplicate block sums collapse and their ordering is not guaranteed
        i = -1
        ret = set()

        output_size = self.output_size
        size = len(bytes) *1.00 / output_size
        bsize = self.bsize
        radd = ret.add

        while i < output_size:
            i += 1
            buf = bytes[i*bsize:(i+1)*bsize]
            char = sum(imap(ord, buf)) % 255

            if self.reduce_errors:
                if char != 255 and char != 0:
                    radd(chr(char))
            else:
                radd(chr(char))

        ret = "".join(ret)
        return base64.b64encode(ret).strip("=")[:output_size]
    def xor(self, bytes):
        # XOR all the values in bytes together (called with an iterable of ints)
        ret = 0
        for byte in bytes:
            ret ^= byte
        return ret
    def _experimental_hash(self, bytes, aggresive = False):
        # Experimental hash: the length of each window adapts to the XOR of
        # the previous window, so the block boundaries depend on the data itself
        idx = 0
        ret = []
        bsize = self.bsize
        output_size = self.output_size
        size = len(bytes)
        ignore_range = self.ignore_range
        chunk_size = idx*self.bsize
        byte = None

        while size > chunk_size + (bsize/output_size):
            chunk_size = idx*self.bsize
            if byte is None:
                val = bsize
            elif ord(byte) > 0:
                val = ord(byte)
            else:
                val = output_size

            buf = bytes[chunk_size:chunk_size+val]
            byte = self.xor(imap(ord, buf)) % 255
            byte = chr(byte)

            if byte != '\xff' and byte != '\x00':
                ret.append(byte)

            idx += 1

        ret = "".join(ret)
        buf = ""
        size = len(ret)/output_size
        for n in xrange(0, output_size):
            buf += ret[n*size:(n*size)+1]

        return base64.b64encode(buf).strip("=")[:output_size]
    def mix_blocks(self, bytes):
        # Duplicate every block in place: each bsize-byte block is immediately
        # followed by a copy of itself
        idx = 0
        buf = bytes
        ret = ""
        size1 = 0
        size2 = 0
        total_size = len(bytes)

        while 1:
            size1 = idx*self.bsize
            size2 = (idx+1)*self.bsize

            tmp = buf[size1:size2]
            tm2 = tmp

            ret += tmp
            ret += tm2

            idx += 1

            if len(tmp) < self.bsize:
                break

        return ret
    def cleanSpaces(self, bytes):
        # Strip spaces, carriage returns, newlines and tabs
        bytes = bytes.replace(" ", "").replace("\r", "").replace("\n", "")
        bytes = bytes.replace("\t", "")
        return bytes
    def hash_bytes(self, bytes, aggresive = False):
        # Hash with the configured algorithm; the final signature is three
        # hashes joined by ";"
        if self.remove_spaces:
            bytes = self.cleanSpaces(bytes)

        mix = self.mix_blocks(bytes)
        if self.algorithm is None:
            func = self._hash
        else:
            func = self.algorithm

        hash1 = func(mix, aggresive)          # hash of the block-duplicated buffer
        hash2 = func(bytes, aggresive)        # hash of the bytes as-is
        hash3 = func(bytes[::-1], aggresive)  # hash of the reversed bytes

        return hash1 + ";" + hash2 + ";" + hash3
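
    # The signature therefore has the form "<mixed>;<plain>;<reversed>": three
    # strings of up to output_size base64 characters each. Two inputs are
    # compared by taking edit_distance between the corresponding components.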
    def hash_file(self, filename, aggresive = False):
        # Hash a whole file; big files are wrapped in CFileStr so they are
        # read lazily instead of being loaded into memory at once
        f = file(filename, "rb")
        f.seek(0, 2)
        size = f.tell()

        if size > self.big_file_size:
            print
            print "Warning! Support for big files (%d MB > %d MB) is broken!" % (size/1024/1024, self.big_file_size / 1024 / 1024)
            fbytes = CFileStr(f)
        else:
            f.seek(0)
            fbytes = f.read()
            f.close()

        return self.hash_bytes(fbytes, aggresive)
class kdha:
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Default Hashing Algorithm """
    digest_size = 32
    block_size = 512

    _bytes = ""
    _kfd = None

    def __init__(self, bytes):
        """ Initialize the object """
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()

    def update(self, bytes):
        """ Not very useful, just for compatibility... """
        self._bytes += bytes

    def hexdigest(self):
        """ Returns a hexadecimal digest """
        self._kfd.bsize = self.block_size
        self._kfd.output_size = self.digest_size
        hash = self._kfd.hash_bytes(self._bytes)
        return hash

    def digest(self):
        """ Same as hexdigest """
        return self.hexdigest()
class kfha(kdha):
    # Same hashlib-like interface, but using the _fast_hash method
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Fast Hashing Algorithm """
    def __init__(self, bytes):
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()
        self._kfd.algorithm = self._kfd._fast_hash
class ksha(kdha):
    # Same hashlib-like interface, but using the simplified method
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Simplified Hashing Algorithm """
    def __init__(self, bytes):
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()
        self._kfd.algorithm = self._kfd.simplified
def usage():
    print "Usage:", sys.argv[0], ""

def main(path):
    hash = CKoretFuzzyHashing()
    #hash.algorithm = hash._fast_hash

    if os.path.isdir(path):
        print "Signature;Simple Signature;Reverse Signature;Filename"
        for root, dirs, files in os.walk(path):
            for name in files:
                tmp = os.path.join(root, name)
                try:
                    ret = hash.hash_file(tmp, True)
                    print "%s;%s" % (ret, tmp)
                except:
                    print "***ERROR with file %s" % tmp
                    print sys.exc_info()[1]
    else:
        hash = CKoretFuzzyHashing()
        ret = hash.hash_file(path, True)
        print "%s;%s" % (path, ret)

if __name__ == "__main__":
    if len(sys.argv) == 1:
        usage()
    else:
        main(sys.argv[1])
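To see it in action, here is a minimal usage sketch (my own example, not part of kfuzzy.py; it assumes the listing above is saved as kfuzzy.py and runs under Python 2, which the code requires anyway):

from kfuzzy import CKoretFuzzyHashing, kdha

kf = CKoretFuzzyHashing()
a = "A" * 8192 + "B" * 8192
b = "A" * 8192 + "C" * 8192       # same layout, second half changed

h1 = kf.hash_bytes(a)             # "<mixed>;<plain>;<reversed>"
h2 = kf.hash_bytes(b)

# Compare the signatures component by component; smaller distances mean
# more similar inputs
for s1, s2 in zip(h1.split(";"), h2.split(";")):
    print(kf.edit_distance(s1, s2))

# The same default signature through the hashlib-style wrapper
print(kdha(a).hexdigest())

Since the first halves of a and b are identical, the matching positions keep the distances noticeably lower than for two unrelated buffers.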
We will not worry about the algorithm's details for now; it is enough to know that this is a fuzzy hash. The real focus is the algorithm Diaphora uses for function similarity matching, and the next post will analyze Diaphora's core code.
References:
1. The FLIRT technology in depth: https://www.hex-rays.com/products/ida/tech/flirt/in_depth.shtml
2. Library-function identification said to achieve higher accuracy than Diaphora: https://deepsec.net/docs/Slides/2017/Enhancing%20Control%20Flow%20Graph_Based_Binary_Function_Identification_Clemens_Jonischkeit.pdf
3. A blog on library-function identification with Diaphora: https://w00tsec.blogspot.com/
4. The fuzzy hash used by Diaphora: https://onlinelibrary.wiley.com/doi/10.1002/9781119183525.ch4