Existing techniques suffer from low accuracy, poor scalability, or coarse granularity, or they require extensive labeled training data to function.
Limitations of Learning-based Approaches.
As shown, the system takes two binaries as input and outputs basic-block-level diffing results.
Pre-processing analyzes the binaries and produces the inputs for embedding generation. More specifically, it produces inter-procedural CFGs for the binaries and applies a token embedding generation model to generate an embedding for each token (opcode or operand). These token embeddings are then transformed into basic-block-level feature vectors.
DEEPBINDIFF uses IDA Pro to extract basic block information and, by combining the call graph with the control-flow graph of each function, generates an inter-procedural CFG (ICFG) that provides program-wide contextual information.
Step1. Random Walks.
We generate random walks in the ICFGs so that each walk contains one possible execution path of the binary. To ensure complete basic block coverage, we configure the walking engine so that every basic block is guaranteed to be contained in at least 2 random walks. Further, each random walk is set to a length of 5 basic blocks to carry enough control flow information. Then, we put the random walks together to generate a complete instruction sequence for training.
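The walk generation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ICFG is a plain dict from block id to successor ids, and the function name `generate_walks`, its parameters, and the toy graph are hypothetical rather than taken from DEEPBINDIFF. A walk stops early when it reaches a block with no successors.

```python
import random
from collections import defaultdict

def generate_walks(icfg, walk_len=5, min_cover=2, seed=0):
    """Return random walks such that every block lies on >= min_cover walks.

    icfg: dict mapping a block id to the list of its successor block ids
    (an assumed representation, not DEEPBINDIFF's own data structure).
    """
    rng = random.Random(seed)
    coverage = defaultdict(int)
    walks = []
    # Collect every block, including ones that appear only as successors.
    blocks = set(icfg)
    for succs in icfg.values():
        blocks.update(succs)
    for start in sorted(blocks):
        # Start fresh walks at this block until it is covered often enough.
        while coverage[start] < min_cover:
            walk = [start]
            while len(walk) < walk_len:
                succs = icfg.get(walk[-1], [])
                if not succs:
                    break  # dead end: this walk is shorter than walk_len
                walk.append(rng.choice(succs))
            walks.append(walk)
            for block in set(walk):  # count each walk once per block
                coverage[block] += 1
    return walks

# Toy ICFG with four blocks (made up for illustration).
icfg = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
walks = generate_walks(icfg)
print(walks)
```

Starting a walk at every under-covered block is one simple way to meet the coverage guarantee; other strategies would satisfy the same constraint.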
Step2. Normalization.
Step3. Model Training.
CBOW model
Each token (opcode or operand) is treated as a word, the normalized random walks over the ICFGs as sentences, and the instructions around each token as its context.
For example, step 3 in Figure 2 shows that the current token is cmp (shown in red), so we use one instruction before and another instruction after (shown in green) in the random walk as the context. If the target instruction is at the block boundary (e.g., first instruction in the block), then only one adjacent instruction will be considered as its context.
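To make the context rule concrete, the sketch below builds (context, target) pairs for one normalized walk. It assumes a walk is a list of blocks, each block a list of instructions, and each instruction a list of tokens with the opcode first. The function name `cbow_pairs` and the sample tokens are hypothetical. The context never crosses a block boundary, following the rule stated above.

```python
def cbow_pairs(walk):
    """Return (context_tokens, target_token) pairs for one normalized walk.

    walk: list of basic blocks; each block is a list of instructions;
    each instruction is a list of tokens, opcode first (assumed layout).
    """
    pairs = []
    for block in walk:
        for i, instr in enumerate(block):
            context = []
            if i > 0:                    # previous instruction, if any
                context.extend(block[i - 1])
            if i + 1 < len(block):       # next instruction, if any
                context.extend(block[i + 1])
            for token in instr:          # every token shares this context
                pairs.append((context, token))
    return pairs

# One block with three normalized instructions (tokens are illustrative).
walk = [[["mov", "reg8", "ptr"], ["cmp", "reg8", "im"], ["jne", "loc"]]]
for context, target in cbow_pairs(walk):
    print(target, "<-", context)
```

For the target `cmp`, the context is the tokens of the `mov` and `jne` instructions. For `mov`, which sits at the start of the block, only the `cmp` instruction contributes.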
Step4. Feature Vector Generation.
Since each basic block can contain multiple instructions, and each instruction in turn has one opcode and potentially multiple operands, we average the operand embeddings, concatenate the result with the opcode embedding to form the instruction embedding, and then sum the instruction embeddings within the block to form the block feature vector.
instruction importance
To tackle this problem, DEEPBINDIFF adopts a weighting strategy that adjusts the weight of each opcode according to its importance, computed with a TF-IDF model. The calculated weight indicates how important an instruction is to the block that contains it, relative to all the blocks within the two input binaries.
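Following the description above, the feature vector of a basic block $b$ containing $j$ instructions can be written as:
$$FV_b = \sum_{i=1}^{j}\left(embed_{p_i} \cdot weight_{p_i} \;\Vert\; \frac{1}{|Set_{t_i}|}\sum_{n=1}^{k} embed_{t_{i_n}}\right)$$
where $\Vert$ denotes concatenation, and an instruction $in_i$ consists of an opcode $p_i$ and a set $Set_{t_i}$ of $k$ operands ($k$ may be 0).
$embed_{p_i}$: the embedding of the opcode
$weight_{p_i}$: the TF-IDF weight of the opcode
$embed_{t_{i_n}}$: the embedding of an operand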
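The block feature vector construction can be illustrated with a small pure-Python sketch. Everything named here is an assumption made for illustration: the function `block_feature_vector`, the tiny 2-dimensional embeddings, and the weights, which are supplied directly rather than computed with TF-IDF. Using a zero vector for the operand part of an instruction with no operands is also an assumption.

```python
def block_feature_vector(block, embed, weight):
    """Sum of per-instruction vectors: weighted opcode || mean of operands.

    block:  list of (opcode, [operands]) tuples
    embed:  dict token -> embedding (list of floats, all the same length)
    weight: dict opcode -> TF-IDF weight (supplied by the caller here)
    """
    dim = len(next(iter(embed.values())))
    fv = [0.0] * (2 * dim)
    for opcode, operands in block:
        # Opcode embedding scaled by its weight.
        op_part = [x * weight[opcode] for x in embed[opcode]]
        if operands:
            # Element-wise average of the operand embeddings.
            operand_part = [
                sum(embed[t][d] for t in operands) / len(operands)
                for d in range(dim)
            ]
        else:
            operand_part = [0.0] * dim  # assumption: zeros when k = 0
        instr_vec = op_part + operand_part  # list concatenation
        fv = [a + b for a, b in zip(fv, instr_vec)]
    return fv

# Made-up embeddings and weights, for illustration only.
embed = {"mov": [1.0, 0.0], "ret": [0.0, 1.0],
         "reg": [2.0, 2.0], "imm": [4.0, 0.0]}
weight = {"mov": 0.5, "ret": 2.0}
block = [("mov", ["reg", "imm"]), ("ret", [])]
print(block_feature_vector(block, embed, weight))  # -> [0.5, 2.0, 3.0, 1.0]
```

The first half of the result accumulates the weighted opcode embeddings and the second half accumulates the averaged operand embeddings, so the block vector is twice as long as a token embedding.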
1. TADW algorithm
Text-associated DeepWalk is an unsupervised graph embedding learning technique. As the name suggests, it is an improvement over the DeepWalk algorithm.
DeepWalk is an online graph embedding learning algorithm that treats a set of short truncated random walks as the corpus of a language modeling problem, and the graph vertices as its vocabulary. The embeddings are then learned from random walks over the vertices of the graph. Accordingly, vertices that share similar neighbours will have similar embeddings. DeepWalk excels at learning contextual information from a graph. Nevertheless, it does not consider node features during analysis.
Text-associated DeepWalk (TADW) is able to incorporate the features of vertices into the network representation learning process.
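For orientation, the objective that TADW optimizes, as given in the original TADW paper, is a regularized matrix factorization. The symbols below are not used elsewhere in these notes: $M$ is a vertex-to-vertex matrix derived from the graph, $T$ is the matrix of vertex features (here, presumably the basic block feature vectors), $W$ and $H$ are the factors being learned, and $\lambda$ is a regularization weight.

```latex
\min_{W,H} \; \lVert M - W^{\top} H T \rVert_F^2
  + \frac{\lambda}{2} \left( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \right)
```

In TADW the embedding of a vertex is obtained by concatenating its columns of $W$ and $HT$, which is how both graph structure and vertex features end up in the final representation.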
2. Graph Merging
Running TADW twice, once for each of the two ICFGs, has two drawbacks:
(1) First, it is less efficient to perform matrix factorization twice.
(2) Second, generating embeddings separately can miss some important indicators for similarity detection.
Ideally, two pairs of nodes should match: 'a' with '1' (both reference the string 'hello') and 'd' with '3' (both call fread).
In practice, the feature vectors of these basic blocks may not look very similar, as one basic block could contain multiple instructions while the call or the string reference is just one of them. Besides, the two pairs also have different contextual information (node 'a' has no incoming edge but '1' does). As a result, TADW may not generate similar embeddings for the two pairs of nodes.
Graph Merging: the two ICFGs are merged and TADW runs only once on the merged graph.
(1) In particular, DEEPBINDIFF extracts the string references and detects external library calls and system calls.
(2) Then, it creates virtual nodes for strings and library functions, and draws edges from the callsites to these virtual nodes.
Hence, the two graphs are merged into one on terminal virtual nodes.
Consequently, nodes 'a' and '1' have at least one common neighbor, which boosts the similarity between them. Further, the neighbors of 'a' and '1' also gain higher similarity because they in turn share similar neighbors.
Moreover, since we only merge the graphs on terminal nodes, the original graph structures stay unchanged.
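The merging step can be sketched as follows. This is an illustration only: the function `merge_icfgs`, the `bin1`/`bin2` prefixes, the `refs` dictionaries listing the strings and library calls each block references, and the `str:` naming scheme are all hypothetical. How those references are extracted from a binary is not shown.

```python
def merge_icfgs(icfg1, icfg2, refs1, refs2):
    """Merge two ICFGs by connecting blocks to shared virtual nodes.

    icfg1, icfg2: dict block id -> list of successor block ids
    refs1, refs2: dict block id -> list of referenced string or
                  library-call names (assumed to be extracted already)
    """
    merged = {}
    for tag, icfg, refs in (("bin1", icfg1, refs1), ("bin2", icfg2, refs2)):
        for block, succs in icfg.items():
            # Prefix block ids so blocks of the two binaries never collide.
            merged.setdefault((tag, block), []).extend((tag, s) for s in succs)
            for s in succs:
                merged.setdefault((tag, s), [])
        for block, names in refs.items():
            for name in names:
                virtual = ("virtual", name)
                merged.setdefault(virtual, [])  # terminal: no outgoing edges
                merged.setdefault((tag, block), []).append(virtual)
    return merged

# Toy example: block 'a' and block '1' both reference the string 'hello'.
icfg1 = {"a": ["b"], "b": []}
icfg2 = {"1": ["2"], "2": []}
merged = merge_icfgs(icfg1, icfg2, {"a": ["str:hello"]}, {"1": ["str:hello"]})
print(merged[("bin1", "a")])  # -> [('bin1', 'b'), ('virtual', 'str:hello')]
print(merged[("bin2", "1")])  # -> [('bin2', '2'), ('virtual', 'str:hello')]
```

Because the virtual nodes have no outgoing edges, the only connection between the two binaries runs through them, and the edges inside each original ICFG are left as they were.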
3. Basic Block Embeddings
With the merged graph, DEEPBINDIFF leverages the TADW algorithm to generate basic block embeddings. More specifically, DEEPBINDIFF feeds the merged graph and the basic block feature vectors into TADW for multiple iterations of optimization.
The goal is to find a basic block level matching solution that maximizes the similarity for the two input binaries.
Performing linear assignment on the basic block embeddings to produce an optimal matching has two major limitations:
(1) First, linear assignment can be inefficient, as binaries can contain an enormous number of blocks.
(2) Second, although the embeddings include some contextual information, linear assignment itself does not consider any graph information.
A possible improvement is to conduct linear assignment at two levels. Rather than matching basic blocks directly, we could match functions first by generating function-level embeddings. Then, basic blocks within the matched functions can be further matched using basic block embeddings. This approach, however, can be severely thwarted by compiler optimizations that alter function boundaries, such as function inlining.
k-Hop Greedy Matching
The high-level idea is to benefit from the ICFG contextual information and find matching basic blocks based on the similarity calculated from basic block embeddings within the k-hop neighbors of already matched ones.
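A rough sketch of this idea is given below. It covers only the high-level description above and should not be read as DEEPBINDIFF's actual algorithm. The function names, the use of cosine similarity, the `seeds` list of initially matched pairs, and following successor edges only when collecting neighbors are all assumptions. The sketch says nothing about how the initial matches are obtained or how blocks left unmatched are handled.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def k_hop(graph, start, k):
    """Blocks reachable from start within k edges, excluding start itself."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    seen.discard(start)
    return seen

def k_hop_greedy_match(g1, g2, emb1, emb2, seeds, k=2):
    """Greedily extend the seed matches through k-hop neighborhoods."""
    matched1 = {a for a, _ in seeds}
    matched2 = {b for _, b in seeds}
    result = list(seeds)
    queue = deque(seeds)
    while queue:
        a, b = queue.popleft()
        cands1 = sorted(k_hop(g1, a, k) - matched1)
        cands2 = sorted(k_hop(g2, b, k) - matched2)
        pairs = [(cosine(emb1[x], emb2[y]), x, y)
                 for x in cands1 for y in cands2]
        if not pairs:
            continue  # nothing left to match around this pair
        _, x, y = max(pairs, key=lambda p: p[0])  # most similar pair
        matched1.add(x)
        matched2.add(y)
        result.append((x, y))
        queue.append((x, y))  # explore around the new match
        queue.append((a, b))  # revisit: other neighbors may remain
    return result

# Toy graphs and embeddings (made up for illustration).
g1 = {"a": ["b", "c"], "b": [], "c": []}
g2 = {"1": ["2", "3"], "2": [], "3": []}
emb1 = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.1, 1.0]}
emb2 = {"1": [1.0, 0.0], "2": [0.9, 0.1], "3": [0.0, 1.0]}
print(k_hop_greedy_match(g1, g2, emb1, emb2, [("a", "1")], k=1))
# -> [('a', '1'), ('b', '2'), ('c', '3')]
```

Restricting the candidates to the neighborhoods of an existing match keeps each comparison set small, which addresses the efficiency concern, and it ties every new match to graph structure, which addresses the missing graph information.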