Diaphora Source Code Analysis: jkutils

This blog post was written by the idle white hat 胖胖鹏鹏胖胖鹏 and is intended solely for personal technical exchange; it may not be used for commercial purposes. Please credit the source when reposting. Reposting or commercial use of any content on this blog without permission is prohibited.

      In earlier posts we briefly introduced Diaphora's file structure and the ideas behind the FLIRT algorithm. Both are library-function identification techniques: FLIRT relies on pattern matching, while Diaphora uses fuzzy hashes of the AST and of the call flow. While researching this, we found that in 2017 a new library-function identification method was proposed, whose authors note that although Diaphora identifies a large number of functions, its false-positive rate is also high (the link is at the end of this post). To mitigate that, we first need to learn the algorithm and then improve on it. This post looks at the implementation of Diaphora's dependency, jkutils.

0x00 What is jkutils? Can you eat it? Is it tasty? How do you eat it?

      jkutils is a collection of utilities developed by joxeankoret, the author of Diaphora, and open-sourced on GitHub (link here). The toolkit currently offers the following:

graphs: graph building, subgraph search, path-finding algorithms, and so on
fuzzy_hashing: fuzzy hashing algorithms
factor: prime and factorization utilities
simple_log: a very simple logging tool
process: simple process-management helpers, mainly for parallelism and thread timeouts
query_utils: helpers for building SQL query statements
web_db: a wrapper around web.py that supports MySQL and SQLite

      The toolbox contains many tools, but Diaphora only uses fuzzy_hashing and factor, so those are the two we will examine. The rest are still worth keeping around: graph processing is genuinely hard, and having an efficient, well-written implementation on hand can save a lot of time.

0x01 jkutils/factor

      Let's start with factor, which, as the name suggests, handles primes and factorization. The file is a flat list of utilities with no logic connecting them; I have inserted my comments directly into the code, and I suggest reading it through once, as it will help when we dissect Diaphora's algorithms. Note that the code targets Python 2 (`long`, `dict.has_key`, print statements). The file is at https://github.com/joxeankoret/diaphora/blob/master/jkutils/factor.py.

#!/usr/bin/python

"""
Primes and factorization utilities
Part of the Malware Families Clusterization Project, adapted to be used
in the Joxean Koret's Diffing Plugin for IDA.
Copyright (c) 2012-2015 Joxean Koret
"""

import sys
import random
import decimal

#-----------------------------------------------------------------------
def primesbelow(N):
  # http://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
  # Generates all primes p with 2 <= p < N, returned as a list
  # (original docstring: input N >= 6, return a list of primes 2 <= p < N)
  correction = N % 6 > 1
  N = {0:N, 1:N-1, 2:N+4, 3:N+3, 4:N+2, 5:N+1}[N%6]
  sieve = [True] * (N // 3)
  sieve[0] = False
  for i in range(long(N ** .5) // 3 + 1):
    if sieve[i]:
      k = (3 * i + 1) | 1
      sieve[k*k // 3::2*k] = [False] * ((N//6 - (k*k)//6 - 1)//k + 1)
      sieve[(k*k + 4*k - 2*k*(i%2)) // 3::2*k] = [False] * ((N // 6 - (k*k + 4*k - 2*k*(i%2))//6 - 1) // k + 1)
  return [2, 3] + [(3 * i + 1) | 1 for i in range(1, N//3 - correction) if sieve[i]]

#-----------------------------------------------------------------------
# By default, build the set of primes below 100000
smallprimeset = set(primesbelow(100000))
_smallprimeset = 100000

# Since we already built the set of primes below 100000, small n are simply
# checked for membership in that set; larger n get a Miller-Rabin primality test
def isprime(n, precision=7):
  # http://en.wikipedia.org/wiki/Miller-Rabin_primality_test#Algorithm_and_running_time
  if n == 1 or n % 2 == 0: # note: this also rejects n == 2
    return False
  elif n < 1:
    raise ValueError("Out of bounds, first argument must be > 0")
  elif n < _smallprimeset:
    return n in smallprimeset


  d = n - 1
  s = 0
  while d % 2 == 0:
    d //= 2
    s += 1

  for repeat in range(precision):
    a = random.randrange(2, n - 2)
    x = pow(a, d, n)

    if x == 1 or x == n - 1: continue

    for r in range(s - 1):
      x = pow(x, 2, n)
      if x == 1: return False
      if x == n - 1: break
    else: return False

  return True

#-----------------------------------------------------------------------
# Pollard rho (Brent variant) integer factorization; details at the URL below
# https://comeoncodeon.wordpress.com/2010/09/18/pollard-rho-brent-integer-factorization/
# Returns: a factor of n
def pollard_brent(n):
  if n % 2 == 0: return 2
  if n % 3 == 0: return 3

  y, c, m = random.randint(1, n-1), random.randint(1, n-1), random.randint(1, n-1)
  g, r, q = 1, 1, 1
  while g == 1:
    x = y
    for i in range(r):
      y = (pow(y, 2, n) + c) % n

    k = 0
    while k < r and g==1:
      ys = y
      for i in range(min(m, r-k)):
        y = (pow(y, 2, n) + c) % n
        q = q * abs(x-y) % n
      g = gcd(q, n)
      k += m
    r *= 2
  if g == n:
    while True:
      ys = (pow(ys, 2, n) + c) % n
      g = gcd(abs(x - ys), n)
      if g > 1:
        break

  return g

#-----------------------------------------------------------------------
smallprimes = primesbelow(10000) # might seem low, but 10000*10000 = 100000000, so this will fully factor every composite < 100000000
# Factor n into primes, returning a list [p1, p2, p3, ...] with p1*p2*p3*... = n
def primefactors(n, sort=False):
  factors = []

  limit = long(n ** decimal.Decimal(.5)) + 1
  for checker in smallprimes:
    if checker > limit: break
    while n % checker == 0:
      factors.append(checker)
      n //= checker
      limit = long(n ** decimal.Decimal(.5)) + 1
      if checker > limit: break

  if n < 2: return factors

  while n > 1:
    if isprime(n):
      factors.append(n)
      break
    factor = pollard_brent(n) # trial division did not fully factor, switch to pollard-brent
    factors.extend(primefactors(factor)) # recurse to factor the not necessarily prime factor returned by pollard-brent
    n //= factor

  if sort: factors.sort()

  return factors

#-----------------------------------------------------------------------
# Convert the prime factorization result from a list to a {prime: exponent} dict
def factorization(n):
  factors = {}
  for p1 in primefactors(n):
    try:
      factors[p1] += 1
    except KeyError:
      factors[p1] = 1
  return factors

#-----------------------------------------------------------------------
# Euler's totient: returns 1 for n == 0; otherwise computed from the
# factorization (memoized in `totients`)
totients = {}
def totient(n):
  if n == 0: return 1

  try: return totients[n]
  except KeyError: pass

  tot = 1
  for p, exp in factorization(n).items():
    tot *= (p - 1)  *  p ** (exp - 1)

  totients[n] = tot
  return tot

#-----------------------------------------------------------------------
# Greatest common divisor
def gcd(a, b):
  if a == b: return a
  while b > 0: a, b = b, a % b
  return a

#-----------------------------------------------------------------------
# Least common multiple
def lcm(a, b):
  return abs(a * b) // gcd(a, b)

#-----------------------------------------------------------------------
FACTORS_CACHE = {}
def _difference(num1, num2):
  nums = [num1,
          num2]
  s = []
  for num in nums:
    if FACTORS_CACHE.has_key(num):
      x = FACTORS_CACHE[num]
    else:
      x = factorization(long(num))
      FACTORS_CACHE[num] = x
    s.append(x)

  diffs = {}
  for x in s[0].keys():
    if x in s[1].keys():
      if s[0][x] != s[1][x]:
        diffs[x] = max(s[0][x], s[1][x]) - min(s[0][x], s[1][x])
    else:
      diffs[x] = s[0][x]
  
  for x in s[1].keys():
    if x in s[0].keys():
      if s[1][x] != s[0][x]:
        diffs[x] = max(s[0][x], s[1][x]) - min(s[0][x], s[1][x])
    else:
      diffs[x] = s[1][x]

  return diffs, s

#-----------------------------------------------------------------------
def difference(num1, num2):
  """ Calculate the difference in prime factors. If a prime exists in one
    factorization but not the other, its full exponent is added to the
    difference; if it exists in both, the difference of exponents is added. """
  diffs, s = _difference(num1, num2)
  return sum(diffs.values())

#-----------------------------------------------------------------------
def difference_ratio(num1, num2):
  """ Same as difference() but returned as a ratio of the changes. """
  diffs, s = _difference(num1, num2)
  total = max(sum(s[0].values()), sum(s[1].values()))
  return 1 - (sum(diffs.values()) *1. / total)

#-----------------------------------------------------------------------
def difference_matrix(samples, debug=True):
  """ Calculate the difference matrix for the given set of samples. """
  diff_matrix = {}
  for x in samples:
    if debug:
      print "Calculating difference matrix for %s" % x
    if not diff_matrix.has_key(x):
      diff_matrix[x] = {}
    for y in samples:
      if samples[x] != samples[y]:
        d = difference(samples[x], samples[y])
        #print("Difference between %s and %s: %d" % (x, y, d))
        diff_matrix[x][y] = d
      else:
        diff_matrix[x][y] = 0
  return diff_matrix

      As the comments above show, factor.py is simply a collection of factorization routines and the helpers built on top of them.

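To make the comparison primitives concrete, here is a minimal Python 3 re-sketch of `factorization()`, `difference()` and `difference_ratio()`. It uses trial division only (the upstream Python 2 code also falls back to Pollard-Brent for large inputs), so it mirrors the names in factor.py but is an illustration, not the original code:

```python
# Python 3 sketch of factor.py's comparison idea: two integers are compared
# by the difference between their prime-factor exponent maps.
from collections import Counter

def prime_factors(n):
    # trial division only -- fine for illustration-sized inputs
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def factorization(n):
    # list of primes -> {prime: exponent}, like factor.py's factorization()
    return dict(Counter(prime_factors(n)))

def difference(num1, num2):
    # a prime present in only one number contributes its full exponent;
    # a prime present in both contributes the exponent difference
    f1, f2 = Counter(prime_factors(num1)), Counter(prime_factors(num2))
    return sum(abs(f1[p] - f2[p]) for p in set(f1) | set(f2))

def difference_ratio(num1, num2):
    f1, f2 = Counter(prime_factors(num1)), Counter(prime_factors(num2))
    total = max(sum(f1.values()), sum(f2.values()))
    return 1 - difference(num1, num2) / total
```

For example, 12 = 2·2·3 and 18 = 2·3·3, so the exponents of 2 and of 3 each differ by one: `difference(12, 18)` is 2, and `difference_ratio(12, 18)` is 1 - 2/3.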
0x02 jkutils/kfuzzy.py

      Fuzzy hashing is probably Diaphora's most important tool for library-function identification and similarity matching. The basic recipe: run a weak hash over local content and split the input into chunks wherever a trigger condition is met; compute a strong hash for every chunk; concatenate and compress those values; and combine the result with the trigger condition into the final signature. The similarity of two fuzzy hashes then approximates the similarity of the two files. The best-known implementation of this idea is ssdeep. While the overall pipeline is always similar, there is no single canonical fuzzy-hash algorithm, and different authors handle the details differently, so to understand what Diaphora does we need to look at its particular hash function. The code is at https://github.com/joxeankoret/diaphora/blob/master/jkutils/kfuzzy.py.
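The generic pipeline just described (a weak rolling hash picks chunk boundaries, a strong hash summarizes each chunk, and one piece per chunk is kept) can be sketched in a few lines. This is a toy ssdeep-style illustration with arbitrary trigger and hash choices, not Diaphora's algorithm:

```python
# Toy context-triggered piecewise hash: a weak rolling hash decides where
# chunks end, a strong hash (sha1 here) summarizes each chunk, and one
# character per chunk is concatenated into the signature.
import hashlib

def toy_ctph(data: bytes, trigger: int = 64) -> str:
    rolling, start, pieces = 0, 0, []
    for i, b in enumerate(data):
        rolling = (rolling * 31 + b) & 0xFFFFFFFF   # weak rolling hash
        if rolling % trigger == trigger - 1:        # chunk boundary hit
            digest = hashlib.sha1(data[start:i + 1]).hexdigest()
            pieces.append(digest[0])                # keep one char per chunk
            start = i + 1
    if start < len(data):                           # trailing partial chunk
        pieces.append(hashlib.sha1(data[start:]).hexdigest()[0])
    return "%d:%s" % (trigger, "".join(pieces))
```

Because the boundaries depend only on local content, a small edit changes only the digests of the chunks it touches while the rest of the signature survives; that locality is what makes such hashes comparable, unlike a whole-file sha1.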

      Frankly, fuzzy-hash code is hard to read without the algorithm's documentation. What follows is only a rough analysis of each function; since the algorithm is the Diaphora author's own design, its description is scattered across several of his earlier articles.

import os
import sys
import base64

from itertools import imap

# Modular addition: convert each character in buf to its ASCII code, sum them, mod 255
try:
    from fasttoad_wrap import modsum
except:
    def modsum(buf):
        return sum(imap(ord, buf)) % 255

# psyco is a speed-up module, but it is abandoned these days; pypy can be used for optimization instead
try:
    import psyco
    psyco.full()
except ImportError:
    pass

class CFileStr(str):
    fd = None

    def __init__(self, fd):
        # fd is a file handle
        self.fd = fd
    
    def __getslice__(self, x, y):
        # Slice: return the y-x bytes between offsets x and y
        self.fd.seek(x, 0)
        buf = self.fd.read(y-x)
        self.fd.seek(y)
        return buf

    def __len__(self):
        # Compute the file length (seek to end, record the position, restore the offset)
        old = self.fd.tell()
        self.fd.seek(0, 2)
        pos = self.fd.tell()
        self.fd.seek(old)
        return pos

class CKoretFuzzyHashing:
    """ Generate a partial hash of a file or byte string """
    bsize = 512
    output_size = 32
    ignore_range = 2
    big_file_size = 1024*1024*10
    algorithm = None
    reduce_errors = True
    remove_spaces = False

    def get_bytes(self, f, initial, final):
        # Read `final` bytes starting at offset `initial`
        f.seek(initial)
        return f.read(final)

    def edit_distance(self, sign1, sign2):
        # Distance between two signatures: +1 for every position where the characters differ
        if sign1 == sign2:
            return 0
        
        m = max(len(sign1), len(sign2))
        distance = 0
        
        for c in xrange(0, m):
            if sign1[c:c+1] != sign2[c:c+1]:
                distance += 1
        
        return distance

    def simplified(self, bytes, aggresive = False):
        # Compress the bytes down to the desired length: each output byte is a
        # mod-255 sum over a (bsize+1)-byte window, then base64-encode and trim
        output_size = self.output_size # length of the output
        ignore_range = self.ignore_range
        bsize = self.bsize             # block size
        total_size = len(bytes)
        size = (total_size/bsize) / output_size # number of blocks covered by each output byte
        buf = []
        reduce_errors = self.reduce_errors
        # Adjust the output to the desired output size
        # Loop: take bsize+1 bytes at a time and mod-sum them
        for c in xrange(0, output_size):
            tmp = bytes[c*size:(c*size+1)+bsize]
            ret = sum(imap(ord, tmp)) % 255
            if reduce_errors:
                if ret != 255 and ret != 0:
                    buf.append(chr(ret))
            else:
                buf.append(chr(ret))
        
        buf = "".join(buf)
        # base64-encode, then trim to output_size
        return base64.b64encode(buf).strip("=")[:output_size]

    def _hash(self, bytes, aggresive = False):
        # Compute the hash of the byte string
        idx = 0
        ret = []
        
        # Mod-sum every block of bytes and collect the resulting characters
        output_size = self.output_size
        ignore_range = self.ignore_range
        bsize = self.bsize
        total_size = len(bytes)
        rappend = ret.append
        chunk_size = idx*bsize
        reduce_errors = self.reduce_errors
        # Calculate the sum of every block
        while 1:
            chunk_size = idx*bsize
            #print "pre"
            buf = bytes[chunk_size:chunk_size+bsize]
            #print "post"
            char = modsum(buf)

            if reduce_errors:
                if char != 255 and char != 0:
                    rappend(chr(char))
            else:
                rappend(chr(char))

            
            idx += 1
            
            if chunk_size+bsize > total_size:
                break
        
        ret = "".join(ret)
        size = len(ret) / output_size
        buf = []
        
        # Adjust the output to the desired output size
        # Downsample the block sums to output_size: from each window take the
        # byte at offset 1 (or at ignore_range when aggresive), plus the
        # ignore_range-th byte of the window
        for c in xrange(0, output_size):
            if aggresive:
                buf.append(ret[c:c+size+1][ignore_range:ignore_range+1])
            else:
                buf.append(ret[c:c+size+1][1:2])
            
            i = 0
            for x in ret[c:c+size+1]:
                i += 1
                if i != ignore_range:
                    continue
                i = 0
                buf += x
                break
            
        ret = "".join(buf)
        
        return base64.b64encode(ret).strip("=")[:output_size]

    def _fast_hash(self, bytes, aggresive = False):
        # Fast hash: like _hash but without the final downsampling step
        i = -1
        ret = set()
        
        output_size = self.output_size
        size = len(bytes) *1.00 / output_size
        bsize = self.bsize
        radd = ret.add
        
        while i < output_size:
            i += 1
            buf = bytes[i*bsize:(i+1)*bsize]
            char = sum(imap(ord, buf)) % 255
            if self.reduce_errors:
                if char != 255 and char != 0:
                    radd(chr(char))
            else:
                radd(chr(char))
        
        ret = "".join(ret)
        return base64.b64encode(ret).strip("=")[:output_size]

    def xor(self, bytes):
        # XOR all of the bytes together
        ret = 0
        for byte in bytes:
            ret ^= byte
        return ret

    def _experimental_hash(self, bytes, aggresive = False):
        # Experimental hash
        idx = 0
        ret = []
        bsize = self.bsize
        output_size = self.output_size
        size = len(bytes)
        ignore_range = self.ignore_range
        chunk_size = idx*self.bsize
        byte = None
        
        while size > chunk_size + (bsize/output_size):
            chunk_size = idx*self.bsize
            if byte is None:
                val = bsize
            elif ord(byte) > 0:
                val = ord(byte)
            else:
                val = output_size
            
            buf = bytes[chunk_size:chunk_size+val]
            byte = self.xor(imap(ord, buf)) % 255
            byte = chr(byte)
            
            if byte != '\xff' and byte != '\x00':
                ret.append(byte)
            
            idx += 1
        
        ret = "".join(ret)
        buf = ""
        size = len(ret)/output_size
        for n in xrange(0, output_size):
            buf += ret[n*size:(n*size)+1]
        
        return base64.b64encode(buf).strip("=")[:output_size]

    def mix_blocks(self, bytes):
        # Insert a copy of each block immediately after it
        idx = 0
        buf = bytes
        ret = ""
        size1 = 0
        size2 = 0
        total_size = len(bytes)
        
        while 1:
            size1 = idx*self.bsize
            size2 = (idx+1)*self.bsize
            
            tmp = buf[size1:size2]
            tm2 = tmp
            ret += tmp
            ret += tm2
            
            idx += 1
            
            if len(tmp) < self.bsize:
                break
        
        return ret

    def cleanSpaces(self, bytes):
        # Strip all spaces, carriage returns, newlines and tabs
        bytes = bytes.replace(" ", "").replace("\r", "").replace("\n", "")
        bytes = bytes.replace("\t", "")
        return bytes

    def hash_bytes(self, bytes, aggresive = False):
        # Hash with the chosen algorithm; note the result is three hashes glued together
        if self.remove_spaces:
            bytes = self.cleanSpaces(bytes)
        
        mix = self.mix_blocks(bytes)
        if self.algorithm is None:
            func = self._hash
        else:
            func = self.algorithm
        
        hash1 = func(mix, aggresive)  # hash of the block-mixed data
        hash2 = func(bytes, aggresive) # hash of the bytes as-is
        hash3 = func(bytes[::-1], aggresive) # hash of the reversed bytes
        
        return hash1 + ";" + hash2 + ";" + hash3

    def hash_file(self, filename, aggresive = False):
        # Read the file's bytes and hash them
        f = file(filename, "rb")
        f.seek(0, 2)
        size = f.tell()
        
        if size > self.big_file_size:
            print
            print "Warning! Support for big files (%d MB > %d MB) is broken!" % (size/1024/1024, self.big_file_size / 1024 / 1024)
            fbytes = CFileStr(f)
        else:
            f.seek(0)
            fbytes = f.read()
            f.close()
        
        return self.hash_bytes(fbytes, aggresive)

class kdha:
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Default Hashing Algorithm """
    digest_size = 32
    block_size = 512
    _bytes = ""
    _kfd = None

    def __init__(self, bytes):
        """ Initialize the object """
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()

    def update(self, bytes):
        """ Not very useful, just for compatibility... """
        self._bytes += bytes

    def hexdigest(self):
        """ Returns a hexadecimal digest """
        self._kfd.bsize = self.block_size
        self._kfd.output_size = self.digest_size
        hash = self._kfd.hash_bytes(self._bytes)
        return hash

    def digest(self):
        """ Same as hexdigest """
        return self.hexdigest()

class kfha(kdha):
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Fast Hashing Algorithm """

    def __init__(self, bytes):
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()
        self._kfd.algorithm = self._kfd._fast_hash

class ksha(kdha):
    """ Interface to make partially compatible the KFuzzy hashing algorithms with
    the standard python hashlib format. This is the Koret Simplified Hashing Algorithm """

    def __init__(self, bytes):
        self._bytes = bytes
        self._kfd = CKoretFuzzyHashing()
        self._kfd.algorithm = self._kfd.simplified

def usage():
    print "Usage:", sys.argv[0], ""

def main(path):
    hash = CKoretFuzzyHashing()
    #hash.algorithm = hash._fast_hash
    
    if os.path.isdir(path):
        print "Signature;Simple Signature;Reverse Signature;Filename"
        for root, dirs, files in os.walk(path):
            for name in files:
                tmp = os.path.join(root, name)
                try:
                    ret = hash.hash_file(tmp, True)
                    print "%s;%s" % (ret, tmp)
                except:
                    print "***ERROR with file %s" % tmp
                    print sys.exc_info()[1]
    else:
        hash = CKoretFuzzyHashing()
        ret = hash.hash_file(path, True)
        print "%s;%s" % (path, ret)

if __name__ == "__main__":
    if len(sys.argv) == 1:
        usage()
    else:
        main(sys.argv[1])
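Stripping away the Python 2 specifics (`imap`, `file()`, print statements), the core of `_hash()` boils down to: one mod-255 sum per block, drop 0x00/0xFF sums when `reduce_errors` is set, downsample to `output_size` bytes, base64-encode and trim. Here is a simplified Python 3 re-sketch of just that core; the downsampling here is cruder than the original's, so outputs will not match upstream:

```python
# Python 3 sketch of the block-sum core of CKoretFuzzyHashing._hash():
# one output byte per block (sum(block) % 255), skipping 0/255 sums when
# reduce_errors is on, then downsample, base64-encode and trim.
import base64

def koret_style_hash(data: bytes, bsize: int = 512,
                     output_size: int = 32, reduce_errors: bool = True) -> str:
    sums = []
    for off in range(0, len(data), bsize):
        char = sum(data[off:off + bsize]) % 255
        if reduce_errors and char in (0, 255):
            continue                                 # drop padding-like blocks
        sums.append(char)
    step = max(1, len(sums) // output_size)
    picked = bytes(sums[::step][:output_size])       # simplified downsampling
    return base64.b64encode(picked).decode("ascii").rstrip("=")[:output_size]
```

Flipping one input byte perturbs only one block's sum, so only a few characters of the signature change; that locality is what the similarity comparison relies on.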

      We will not worry about the algorithm's details for now; it is enough to know that this is the fuzzy hash. The real focus is Diaphora's function-similarity matching algorithm, which the next post analyzes.


References:
1. FLIRT in depth: https://www.hex-rays.com/products/ida/tech/flirt/in_depth.shtml
2. A library-function identification approach said to be more accurate than Diaphora: https://deepsec.net/docs/Slides/2017/Enhancing%20Control%20Flow%20Graph_Based_Binary_Function_Identification_Clemens_Jonischkeit.pdf
3. A blog on library-function identification with Diaphora: https://w00tsec.blogspot.com/
4. The fuzzy hash used by Diaphora: https://onlinelibrary.wiley.com/doi/10.1002/9781119183525.ch4
