Type Recovery for Binary Code - Type inference on executable - wcventure

Type inference on executable

This document collects papers related to type inference on executables.

Type inference on executable

Survey

  • Caballero J, Lin Z. Type Inference on Executables. ACM Computing Surveys, 2016, 48(4):1-35. paper
    Type inference on executables builds on a long history of work in programming languages. Interested readers can refer to Caballero2016Type for more detail.

EKLAVYA

  • Zheng Leong Chua, Shiqi Shen, Prateek Saxena, Zhenkai Liang. Neural Nets Can Learn Function Type Signatures From Binaries. USENIX Security Symposium, 2017. paper

Abstract:

Function type signatures are important for binary analysis, but they are not available in COTS binaries. In this paper, we present a new system called EKLAVYA which trains a recurrent neural network to recover function type signatures from disassembled binary code. EKLAVYA assumes no knowledge of the target instruction set semantics to make such inference. More importantly, EKLAVYA results are “explicable”: we find by analyzing its model that it auto-learns relationships between instructions, compiler conventions, stack frame setup instructions, use-before-write patterns, and operations relevant to identifying types directly from binaries. In our evaluation on Linux binaries compiled with clang and gcc, for two different architectures (x86 and x64), EKLAVYA exhibits accuracy of around 84% and 81% for function argument count and type recovery tasks respectively. EKLAVYA generalizes well across the compilers tested on two different instruction sets with various optimization levels, without any specialized prior knowledge of the instruction set, compiler or optimization level.

PLDI16

  • Noonan M, Loginov A, Cok D. Polymorphic type inference for machine code. ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2016:27-41. paper
    Abstract:

    For many compiled languages, source-level types are erased very early in the compilation process. As a result, further compiler passes may convert type-safe source into type-unsafe machine code. Type-unsafe idioms in the original source and type-unsafe optimizations mean that type information in a stripped binary is essentially nonexistent. The problem of recovering high-level types by performing type inference over stripped machine code is called type reconstruction, and offers a useful capability in support of reverse engineering and decompilation.

    In this paper, we motivate and develop a novel type system and algorithm for machine-code type inference. The features of this type system were developed by surveying a wide collection of common source- and machine-code idioms, building a catalog of challenging cases for type reconstruction. We found that these idioms place a sophisticated set of requirements on the type system, inducing features such as recursively-constrained polymorphic types. Many of the features we identify are often seen only in expressive and powerful type systems used by high-level functional languages.

    Using these type-system features as a guideline, we have developed Retypd: a novel static type-inference algorithm for machine code that supports recursive types, polymorphism, and subtyping. Retypd yields more accurate inferred types than existing algorithms, while also enabling new capabilities such as reconstruction of pointer const annotations with 98% recall. Retypd can operate on weaker program representations than the current state of the art, removing the need for high-quality points-to information that may be impractical to compute.

SecondWrite

  • Elwazeer K, Anand K, Kotha A, et al. Scalable variable and data type detection in a binary rewriter. ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2013:51-60. paper
    We present scalable static analyses to recover variables, data types, and function prototypes from stripped x86 executables (without symbol or debug information) and obtain a functional intermediate representation (IR) for analysis and rewriting purposes. Our techniques on average run 352X faster than current techniques and still have the same precision. This enables analyzing executables as large as millions of instructions in minutes which is not possible using existing techniques. Our techniques can recover variables allocated to the floating point stack unlike current techniques. We have integrated our techniques to obtain a compiler level IR that works correctly if recompiled and produces the same output as the input executable. We demonstrate scalability, precision and correctness of our proposed techniques by evaluating them on the complete SPEC2006 benchmarks suite.

SmartDec

  • Fokin A, Derevenetc E, Chernov A, et al. SmartDec: Approaching C++ Decompilation. Reverse Engineering. IEEE, 2011:347-356. paper
    Abstract:

    Decompilation is a reconstruction of a program in a high-level language from a program in a low-level language. Typical applications of decompilation are software security assessment, malware analysis, error correction and reverse engineering for interoperability.

    Native code decompilation is traditionally considered in the context of the C programming language. C++ presents new challenges for decompilation, since the rules of translation from C++ to assembly language are far more complex than those of C. In addition, when decompiling a program that was originally written in C++, reconstruction of C++ specific constructs is desired.

    In this paper we discuss new methods that allow partial recovery of C++ specific language constructs from low-level code, provided that this code was obtained from a C++ compiler. The challenges that arise when decompiling such code are described. These challenges include reconstruction of polymorphic classes, class hierarchies, member functions and exception handling constructs. An approach to decompilation that is used to overcome these challenges is presented.

HOWARD

  • Slowinska A, Stancescu T, Bos H. Howard: A Dynamic Excavator for Reverse Engineering Data Structures. Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2011. paper
    Abstract:

    Even the most advanced reverse engineering techniques and products are weak in recovering data structures in stripped binaries—binaries without symbol tables. Unfortunately, forensics and reverse engineering without data structures is exceedingly hard. We present a new solution, known as Howard, to extract data structures from C binaries without any need for symbol tables. Our results are significantly more accurate than those of previous methods — sufficiently so to allow us to generate our own (partial) symbol tables without access to source code. Thus, debugging such binaries becomes feasible and reverse engineering becomes simpler. Also, we show that we can protect existing binaries from popular memory corruption attacks, without access to source code. Unlike most existing tools, our system uses dynamic analysis (on a QEMU-based emulator) and detects data structures by tracking how a program uses memory.

REWARDS

  • Lin Z, Zhang X, Xu D. Automatic Reverse Engineering of Data Structures from Binary Execution. Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2010. paper

Abstract:

With only the binary executable of a program, it is useful to discover the program’s data structures and infer their syntactic and semantic definitions. Such knowledge is highly valuable in a variety of security and forensic applications. Although there exist efforts in program data structure inference, the existing solutions are not suitable for our targeted application scenarios. In this paper, we propose a reverse engineering technique to automatically reveal program data structures from binaries. Our technique, called REWARDS, is based on dynamic analysis. More specifically, each memory location accessed by the program is tagged with a timestamped type attribute. Following the program’s runtime data flow, this attribute is propagated to other memory locations and registers that share the same type. During the propagation, a variable’s type gets resolved if it is involved in a type-revealing execution point or “type sink”. More importantly, besides the forward type propagation, REWARDS involves a backward type resolution procedure where the types of some previously accessed variables get recursively resolved starting from a type sink. This procedure is constrained by the timestamps of relevant memory locations to disambiguate variables reusing the same memory location. In addition, REWARDS is able to reconstruct in-memory data structure layout based on the type information derived. We demonstrate that REWARDS provides unique benefits to two applications: memory image forensics and binary fuzzing for vulnerability discovery.

As mentioned, the REWARDS system [12] takes a dynamic approach to reverse engineering data types. Because it relies on dynamic analysis, REWARDS cannot cover every variable defined in a program: it only sees the variables that are actually created and accessed at run time, and it is limited to the analysis of a single execution trace.
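The propagation scheme described in the abstract can be illustrated with a minimal sketch. It is not the REWARDS implementation: the `TypeTracker` class, its method names, and the `(location, timestamp)` keys are all hypothetical, and a union-find structure stands in for the shared type attribute. Locations that exchange data are merged into one group, and a later type sink resolves every earlier member of that group, which is the backward resolution step. The timestamp in the key keeps a reused address from inheriting a stale type.

```python
class TypeTracker:
    """Hypothetical sketch: locations that exchange data share one type attribute."""

    def __init__(self):
        self.parent = {}    # union-find forest over (location, timestamp) keys
        self.resolved = {}  # group representative -> resolved type name

    def _find(self, loc):
        self.parent.setdefault(loc, loc)
        while self.parent[loc] != loc:
            self.parent[loc] = self.parent[self.parent[loc]]  # path halving
            loc = self.parent[loc]
        return loc

    def move(self, dst, src):
        """A data move observed in the trace: dst and src now share a type."""
        a, b = self._find(dst), self._find(src)
        if a != b:
            self.parent[a] = b
            if a in self.resolved:
                self.resolved.setdefault(b, self.resolved.pop(a))

    def sink(self, loc, type_name):
        """A type-revealing point, e.g. an argument of a known library call."""
        self.resolved[self._find(loc)] = type_name

    def type_of(self, loc):
        return self.resolved.get(self._find(loc), "unknown")


t = TypeTracker()
buf_v1 = ("0x7ffe10", 1)          # stack slot as first written at time 1
t.move(("eax", 2), buf_v1)        # mov eax, [slot]
t.move(("ebx", 3), ("eax", 2))    # mov ebx, eax
t.sink(("ebx", 3), "char*")       # ebx passed to a call with a known prototype
buf_v2 = ("0x7ffe10", 9)          # same address reused by another variable later
print(t.type_of(buf_v1), t.type_of(buf_v2))
```

The sink at time 3 resolves the slot written at time 1, while the reuse at time 9 stays unresolved. A variable that never reaches a sink in the recorded trace also stays unresolved, which is the coverage limitation noted above.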

VTint

  • Zhang C, Song C, Chen K Z, et al. VTint: Protecting Virtual Function Tables’ Integrity. Network and Distributed System Security Symposium. 2015. paper

Abstract:

In the recent past, a number of approaches have been proposed to protect certain types of control data in a program, such as return addresses saved on the stack, rendering most traditional control flow hijacking attacks ineffective. Attackers, however, can bypass these defenses by launching advanced attacks that corrupt other data, e.g., pointers indirectly used to access code. One of the most popular targets is virtual table pointers (vfptr), which point to virtual function tables (vtable) consisting of virtual function pointers. Attackers can exploit vulnerabilities, such as use-after-free and heap overflow, to overwrite the vtable or vfptr, causing further virtual function calls to be hijacked (vtable hijacking). In this paper we propose a lightweight defense solution VTint to protect binary executables against vtable hijacking attacks. It uses binary rewriting to instrument security checks before virtual function dispatches to validate vtables’ integrity. Experiments show that it only introduces a low performance overhead (less than 2%), and it can effectively protect real-world vtable hijacking attacks.

MemPick

  • Haller I, Slowinska A, Bos H. MemPick: High-level data structure detection in C/C++ binaries. Working Conference on Reverse Engineering. IEEE, 2013:32-41. paper

Abstract:

Many existing techniques for reversing data structures in C/C++ binaries are limited to low-level programming constructs, such as individual variables or structs. Unfortunately, without detailed information about a program’s pointer structures, forensics and reverse engineering are exceedingly hard. To fill this gap, we propose MemPick, a tool that detects and classifies high-level data structures used in stripped binaries. By analyzing how links between memory objects evolve throughout the program execution, it distinguishes between many commonly used data structures, such as singly- or doubly-linked lists, many types of trees (e.g., AVL, red-black trees, B-trees), and graphs. We evaluate the technique on 10 real world applications and 16 popular libraries. The results show that MemPick can identify the data structures with high accuracy.

ARTISTE

  • ARTISTE: Automatic Generation of Hybrid Data Structure Signatures from Binary Code Executions paper

Abstract:

Data structure signatures can be used for finding instances of data structures holding sensitive data in memory, a crucial capability for many security applications such as memory forensics, rootkit detection, online games cheat analysis, reverse engineering, and virtual machine introspection. Manually generating data structure signatures is a tedious and error-prone process. Prior work automatically generates data structure signatures from the type definitions in the program’s source code, but unfortunately for many programs their source code is not publicly available.

In this paper we present ARTISTE, the first tool for automatically generating data structure signatures without access to the program’s source code or debugging symbols. The salient features of ARTISTE are: (1) it generates hybrid signatures that minimize false positives during scanning by combining points-to relationships, value invariants, and cycle invariants; (2) it uses a novel dynamic shape analysis to recover recursive data structures, classifying them by their shapes (e.g., doubly linked list or tree); (3) it identifies data structures of the same type allocated at different program points; and (4) it accumulates data structure information over multiple executions, increasingly improving its accuracy. Our experimental results on a number of binary programs show that the hybrid signatures generated by ARTISTE accurately identify instances of the data structures in memory with no false positives or false negatives in 80% of the programs, while prior signature types produce large false positive rates.

vfGuard

  • Prakash A, Hu X, Yin H. vfGuard: Strict Protection for Virtual Function Calls in COTS C++ Binaries. Network and Distributed System Security Symposium. 2015. paper

Abstract:

Control-Flow Integrity (CFI) is an important security property that needs to be enforced to prevent control-flow hijacking attacks. Recent attacks have demonstrated that existing CFI protections for COTS binaries are too permissive, and vulnerable to sophisticated code reusing attacks. Accounting for control flow restrictions imposed at higher levels of semantics is key to increasing CFI precision. In this paper, we aim to provide more stringent protection for virtual function calls in COTS C++ binaries by recovering C++ level semantics. To achieve this goal, we recover C++ semantics, including VTables and virtual callsites. With the extracted C++ semantics, we construct a sound CFI policy and further improve the policy precision by devising two filters, namely “Nested Call Filter” and “Calling Convention Filter”. We implement a prototype system called vfGuard, and evaluate its accuracy, precision, effectiveness, coverage and performance overhead against a test set including complex C++ binary modules used by Internet Explorer. Our experiments show a runtime overhead of 18.3% per module. On SpiderMonkey, an open-source JavaScript engine used by Firefox, vfGuard generated 199 call targets per virtual callsite – within the same order of magnitude as those generated from a source code based solution. The policies constructed by vfGuard are sound and of higher precision when compared to state-of-the-art binary-only CFI solutions.

YB

  • Yoo K, Barua R. Recovery of Object Oriented Features from C++ Binaries. 2014, 1:231-238. paper
    Abstract:

    Reverse engineering is the process of examining and probing a program to determine the original design. Over the past ten years researchers have produced a number of capabilities to explore, manipulate, analyze, summarize, hyperlink, synthesize, componentize, and visualize software artifacts. Many reverse engineering tools focus on non-object-oriented software binaries with the goal of transferring discovered information into the software engineers trying to reengineer or reuse it.

    In this paper, we present a method that recovers object-oriented features from stripped C++ binaries. We discover RTTI information, class hierarchies, member functions of classes, and member variables of classes. The information obtained can be used for reengineering legacy software, and for understanding the architecture of software systems.
    Our method works for stripped binaries, i.e., without symbolic or relocation information. Most deployed binaries are stripped. We compare our method with the same binaries with symbolic information to measure the accuracy of our techniques. In this manner we find our methods are able to identify 80% of virtual functions, 100% of the classes, 78% of member functions, and 55% of member variables from stripped binaries, compared to the total number of those artifacts in symbolic information in equivalent non-stripped binaries.

OBJDIGGER

  • Jin W, Cohen C, Gennari J, et al. Recovering C++ Objects From Binaries Using Inter-Procedural Data-Flow Analysis. ACM SIGPLAN Program Protection and Reverse Engineering Workshop. ACM, 2014:1. paper

Abstract:

Object-oriented programming complicates the already difficult task of reverse engineering software, and is being used increasingly by malware authors. Unlike traditional procedural-style code, reverse engineers must understand the complex interactions between object-oriented methods and the shared data structures on which they operate, a tedious manual process.
In this paper, we present a static approach that uses symbolic execution and inter-procedural data flow analysis to discover object instances, data members, and methods of a common class. The key idea behind our work is to track the propagation and usage of a unique object instance reference, called a this pointer. Our goal is to help malware reverse engineers to understand how classes are laid out and to identify their methods. We have implemented our approach in a tool called OBJDIGGER, which produced encouraging results when validated on real-world malware samples.

YM

  • Yan Q, McCamant S. Conservative Signed/Unsigned Type Inference for Binaries using Minimum Cut. Technical report, University of Minnesota, 2014. paper

Abstract:

Recovering variable types or other structural information from binaries is useful for reverse engineering in security, and to facilitate other kinds of analysis on binaries. However such reverse engineering tasks often lack precise problem definitions; some information is lost during compilation, and existing tools can exhibit a variety of errors. As a step in the direction of more principled reverse engineering algorithms, we isolate a sub-task of type inference, namely determining whether each integer variable is declared as signed or unsigned. The difficulty of this task arises from the fact that signedness information in a binary, when present at all, is associated with operations rather than with data locations. We propose a graph-based algorithm in which nodes represent variables and edges connect variables with the same signedness. In a program without casts or conversions, signed and unsigned variables would form distinct connected components, but when casts are present, signed and unsigned variables will be connected. Reasoning that developers prefer source code without casts, we compute a minimum cut between signed and unsigned variables, which corresponds to a minimal set of casts required for a legal typing. We evaluate this algorithm by erasing signedness information from debugging symbols, and testing how well our tool can recover it. Applying an intra-procedural version of the algorithm to the GNU Coreutils, we observe that many variables are unconstrained as to signedness, but that in almost all cases our tool recovers either the type from the original source, or a type that yields the same program behavior.
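The minimum-cut formulation can be sketched in a few lines. This is an illustration under assumptions rather than the authors' tool: the function name `infer_signedness`, the artificial terminals `"S"` and `"U"`, and the unit edge weights are all hypothetical, and a plain Edmonds-Karp max-flow stands in for whatever solver the paper uses. Each same-signedness edge has capacity 1, variables with signed or unsigned evidence are tied to a terminal with infinite capacity, and the edges crossing the minimum cut are the places where a cast is assumed.

```python
from collections import defaultdict, deque


def infer_signedness(same_sign_edges, signed_seeds, unsigned_seeds):
    """Return (variables on the signed side, number of casts in the minimum cut)."""
    if set(signed_seeds) & set(unsigned_seeds):
        raise ValueError("a variable cannot be seeded as both signed and unsigned")
    inf = float("inf")
    cap = defaultdict(lambda: defaultdict(int))

    def link(u, v, c):
        # An undirected edge is modelled as two directed arcs of equal capacity.
        cap[u][v] += c
        cap[v][u] += c

    for u, v in same_sign_edges:
        link(u, v, 1)
    for v in signed_seeds:
        link("S", v, inf)
    for v in unsigned_seeds:
        link(v, "U", inf)

    def reachable_from_source():
        parents = {"S": None}
        queue = deque(["S"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parents:
                    parents[v] = u
                    queue.append(v)
        return parents

    casts = 0
    while True:
        parents = reachable_from_source()
        if "U" not in parents:
            break  # no augmenting path left: the flow is maximal
        path, v = [], "U"
        while parents[v] is not None:
            path.append((parents[v], v))
            v = parents[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow
        casts += flow
    # Nodes still reachable in the residual graph form the signed side of the cut.
    return set(parents) - {"S"}, casts


# a is used by a signed operation, d by an unsigned one; b-c is the weakest link.
edges = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "d"), ("c", "d")]
signed, casts = infer_signedness(edges, signed_seeds=["a"], unsigned_seeds=["d"])
print(sorted(signed), casts)
```

In this example the single `b`-`c` edge is the cheapest place to separate the two groups, so `a` and `b` come out signed, `c` and `d` unsigned, and one cast is assumed.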

TOP

  • Zeng J, Fu Y, Miller K A, et al. Obfuscation resilient binary code reuse through trace-oriented programming. ACM Sigsac Conference on Computer & Communications Security. ACM, 2013:487-498. paper

Abstract:

With the wide existence of binary code, it is desirable to reuse it in many security applications, such as malware analysis and software patching. While prior approaches have shown that binary code can be extracted and reused, they are often based on static analysis and face challenges when coping with obfuscated binaries. This paper introduces trace-oriented programming (TOP), a general framework for generating new software from existing binary code by elevating the low-level binary code to C code with templates and inlined assembly. Different from existing work, TOP gains benefits from dynamic analysis such as resilience against obfuscation and avoidance of points-to analysis. Thus, TOP can be used for malware analysis, especially for malware function analysis and identification. We have implemented a proof-of-concept of TOP and our evaluation results with a range of benign and malicious software indicate that TOP is able to reconstruct source code from binary execution traces in malware analysis and identification, and binary function transplanting.

PointerScope

  • Zhang M, Prakash A, Li X, et al. Identifying and analyzing pointer misuses for sophisticated memory-corruption exploit diagnosis. Network and Distributed System Security Symposium (NDSS), 2012. paper

Abstract:

Software exploits are one of the major threats to the Internet security. A large family of exploits works by corrupting memory of the victim process to execute malicious code. To quickly respond to these attacks, it is critical to automatically diagnose such exploits to find out how they circumvent existing defense mechanisms. Because of the complexity of the victim programs and sophistication of recent exploits, existing analysis techniques fall short: they either miss important attack steps or report too much irrelevant information. In this paper, based on the observation that the key steps in memory corruption exploits often involve pointer misuses, we propose a novel solution, PointerScope, to use type inference on binary execution to detect the pointer misuses induced by an exploit. These pointer misuses highlight the important attack steps of the exploit, and therefore convey valuable information about the exploit mechanisms. Our approach complements dependency-based solutions to perform more comprehensive diagnosis of sophisticated memory exploits. We prototyped PointerScope and evaluated it using real-world exploit samples and demonstrated that PointerScope can successfully capture the key attack steps, which significantly facilitates attack response.

LEGO

  • Srinivasan V, Reps T. Recovery of Class Hierarchies and Composition Relationships from Machine Code. Compiler Construction. 2014:61-84. paper

We present a reverse-engineering tool, called Lego, which recovers class hierarchies and composition relationships from stripped binaries. Lego takes a stripped binary as input, and uses information obtained from dynamic analysis to (i) group the functions in the binary into classes, and (ii) identify inheritance and composition relationships between the inferred classes. The software artifacts recovered by Lego can be subsequently used to understand the object-oriented design of software systems that lack documentation and source code, e.g., to enable interoperability. Our experiments show that the class hierarchies recovered by Lego have a high degree of agreement—measured in terms of precision and recall—with the hierarchy defined in the source code.

RHK

  • Robbins E, Howe J M, King A. Theory propagation and rational-trees. Symposium on Principles and Practice of Declarative Programming. ACM, 2013:193-204. paper

Abstract:

SAT Modulo Theories (SMT) is the problem of determining the satisfiability of a formula in which constraints, drawn from a given constraint theory T, are composed with logical connectives. The DPLL(T) approach to SMT has risen to prominence as a technique for solving these quantifier-free problems. The key idea in DPLL(T) is to closely couple unit propagation in the propositional part of the problem with theory propagation in the constraint component. In this paper it is demonstrated how reification provides a natural way for orchestrating this in the setting of logic programming. This allows an elegant implementation of DPLL(T) solvers in Prolog. The work is motivated by a problem in reverse engineering, that of type recovery from binaries. The solution to this problem requires an SMT solver where the theory is that of rational-tree constraints, a theory not supported in off-the-shelf SMT solvers, but realised as unification in many Prolog systems. The solver is benchmarked against a number of type recovery problems, and compared against a lazy-basic SMT solver built on PicoSAT.

TIE

  • Lee J H, Avgerinos T, Brumley D. TIE: Principled Reverse Engineering of Types in Binary Programs. Network and Distributed System Security Symposium (NDSS), San Diego, California, USA, 2011. paper

Abstract:

A recurring problem in security is reverse engineering binary code to recover high-level language data abstractions and types. High-level programming languages have data abstractions such as buffers, structures, and local variables that all help programmers and program analyses reason about programs in a scalable manner. During compilation, these abstractions are removed as code is translated down to operations on registers and one globally addressed memory region. Reverse engineering consists of “undoing” the compilation to recover high-level information so that programmers, security professionals, and analyses can all more easily reason about the binary code.
In this paper we develop novel techniques for reverse engineering data type abstractions from binary programs. At the heart of our approach is a novel type reconstruction system based upon binary code analysis. Our techniques and system can be applied as part of both static or dynamic analysis, thus are extensible to a large number of security settings. Our results on 87 programs show that TIE is both more accurate and more precise at recovering high-level types than existing mechanisms.

Lee et al. [13] presented TIE, a tool aimed at the related problem of inferring primitive types for variables in a binary. However, the types TIE infers are limited to integer and pointer types. TIE uses inference rules based on the data transfers between variables and known functions. In our setting, TIE would classify every object as a pointer without being able to determine its actual type. Another problem is that TIE outputs an upper bound and a lower bound rather than a specific type, which is not precise enough to be useful to a reverse engineer.
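To make the upper-bound and lower-bound output concrete, here is a minimal sketch of interval-style inference over a small subtype lattice. The lattice (`top`, `reg32`, `num32`, `ptr32`, `int32`, `uint32`, `bottom`), the `Bounds` class and its method names are hypothetical simplifications chosen for illustration; they are not TIE's actual lattice or rules.

```python
# Hypothetical subtype tree: each type maps to its direct supertype.
PARENT = {
    "reg32": "top",
    "num32": "reg32",
    "ptr32": "reg32",
    "int32": "num32",
    "uint32": "num32",
}


def ancestors(t):
    """The type itself followed by all of its supertypes, nearest first."""
    chain = [t]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def is_subtype(a, b):
    return a == "bottom" or b in ancestors(a)


def meet(a, b):
    """Greatest lower bound: the more specific type, or bottom if unrelated."""
    if is_subtype(a, b):
        return a
    if is_subtype(b, a):
        return b
    return "bottom"


def join(a, b):
    """Least upper bound: the nearest common supertype."""
    if a == "bottom":
        return b
    if b == "bottom":
        return a
    return next(t for t in ancestors(a) if t in ancestors(b))


class Bounds:
    """Type interval [lower, upper] for one recovered variable."""

    def __init__(self):
        self.lower, self.upper = "bottom", "top"

    def used_as(self, t):
        # The variable flows into a context that requires t, so var <: t.
        self.upper = meet(self.upper, t)

    def assigned_from(self, t):
        # A value of type t flows into the variable, so t <: var.
        self.lower = join(self.lower, t)


x = Bounds()
x.used_as("num32")        # e.g. operand of an arithmetic instruction
x.assigned_from("int32")  # e.g. result of a signed division
print(x.lower, x.upper)
```

Here the result for `x` is the interval from `int32` to `num32` rather than a single type, which illustrates why a range can be less convenient for a human analyst than one concrete answer.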

Big code programming

Survey

Many solutions have been proposed that apply predictive methods to program analysis. Most of them work at the source code level, but they are still quite enlightening. Interested readers can refer to Bielik2015Programming for more detail.

Big Code

  • Raychev V, Vechev M, Krause A. Predicting Program Properties from “Big Code”. ACM Sigplan-Sigact Symposium on Principles of Programming Languages. ACM, 2015:111-124. paper

Abstract:

We present a new approach for predicting program properties from massive codebases (aka “Big Code”). Our approach first learns a probabilistic model from existing data and then uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the input program into a representation which allows us to phrase the problem of inferring program properties as structured prediction in machine learning. This formulation enables us to leverage powerful probabilistic graphical models such as conditional random fields (CRFs) in order to perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNICE for solving two kinds of problems in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNICE predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of the cases. In the first week since its release, JSNICE was used by more than 30,000 developers and in only few months has become a popular tool in the JavaScript developer community.

By formulating the problem of inferring program properties as structured prediction and showing how to perform both learning and inference in this context, our work opens up new possibilities for attacking a wide range of difficult problems in the context of “Big Code” including invariant generation, de-compilation, synthesis and others.

Raychev et al. 2015 [5] present a new approach for predicting program properties from “Big Code” based on conditional random fields (CRFs). The approach addresses two problems in JavaScript: predicting identifier names and predicting type annotations of variables. They show that the problem of inferring program properties can be transformed into structured prediction in machine learning.

Source code and binary code differ substantially, but adapting source-level techniques to binary code is a common trend that is still worth pursuing. These predictive techniques work well at the source code level because much of the program structure is easy to recover there. Stripped binaries offer far less program structure to work with.

POPL 16

  • Katz O, El-Yaniv R, Yahav E. Estimating types in binaries using predictive modeling. ACM Sigplan-Sigact Symposium on Principles of Programming Languages. ACM, 2016:313-326. paper

Abstract:

Reverse engineering is an important tool in mitigating vulnerabilities in binaries. As a lot of software is developed in object-oriented languages, reverse engineering of object-oriented code is of critical importance. One of the major hurdles in reverse engineering binaries compiled from object-oriented code is the use of dynamic dispatch. In the absence of debug information, any dynamic dispatch may seem to jump to many possible targets, posing a significant challenge to a reverse engineer trying to track the program flow.

We present a novel technique that allows us to statically determine the likely targets of virtual function calls. Our technique uses object tracelets – statically constructed sequences of operations performed on an object – to capture potential runtime behaviors of the object. Our analysis automatically pre-labels some of the object tracelets by relying on instances where the type of an object is known. The resulting type-labeled tracelets are then used to train a statistical language model (SLM) for each type. We then use the resulting ensemble of SLMs over unlabeled tracelets to generate a ranking of their most likely types, from which we deduce the likely targets of dynamic dispatches. We have implemented our technique and evaluated it over real-world C++ binaries. Our evaluation shows that when there are multiple alternative targets, our approach can drastically reduce the number of targets that have to be considered by a reverse engineer.


Katz et al. 2016 [6] present a novel technique for statically determining the likely targets of virtual function calls. The technique tackles the difficult task of reverse engineering binaries by combining statistical approaches and predictive modeling. It uses object tracelets to capture the usage, behaviors, and characteristics of objects and types in a binary.

However, the paper is concerned only with statically inferring the likely targets of each indirect call site, not with recovering the types of all variables. The approach is similar to ours and inspired us a lot: we also treat the actual usage sequences as a reflection of a variable and predict type information from them.
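The idea of training one statistical language model per type and ranking types for an unlabeled tracelet can be sketched as follows. This is only an illustration: the paper's SLMs are not necessarily bigram models, and the `BigramModel` class, the `rank_types` function, the add-one smoothing, and the event strings such as `"call vt[0]"` are all assumptions made for the example.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model over the events of object tracelets."""

    def __init__(self, tracelets, vocab):
        self.vocab_size = len(vocab)
        self.pairs, self.prefixes = Counter(), Counter()
        for tracelet in tracelets:
            seq = ["<s>"] + list(tracelet)
            for prev, cur in zip(seq, seq[1:]):
                self.pairs[(prev, cur)] += 1
                self.prefixes[prev] += 1

    def log_prob(self, tracelet):
        seq = ["<s>"] + list(tracelet)
        return sum(
            math.log((self.pairs[(p, c)] + 1) / (self.prefixes[p] + self.vocab_size))
            for p, c in zip(seq, seq[1:])
        )


def rank_types(labeled, unlabeled):
    """labeled: type name -> list of tracelets. Returns types, most likely first."""
    vocab = {ev for trs in labeled.values() for tr in trs for ev in tr} | set(unlabeled)
    models = {t: BigramModel(trs, vocab) for t, trs in labeled.items()}
    return sorted(models, key=lambda t: models[t].log_prob(unlabeled), reverse=True)


# Tracelets pre-labeled at sites where the object's type is known (made-up data).
labeled = {
    "Circle": [["call vt[0]", "read +8", "call vt[2]"], ["call vt[0]", "read +8"]],
    "Socket": [["write +4", "call vt[1]", "write +4"], ["write +4", "call vt[1]"]],
}
ranking = rank_types(labeled, ["call vt[0]", "read +8"])
print(ranking)
```

With this toy data the unlabeled tracelet ranks `Circle` ahead of `Socket`, so the candidate targets of a virtual call on that object would be narrowed to `Circle`'s methods first.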

BITY

  • Xu Z, Wen C, Qin S. Learning Types for Binaries. International Conference on Formal Engineering Methods. Springer, Cham, 2017:430-446. paper

Abstract:

Type inference for binary code is a challenging problem due partly to the fact that much type-related information has been lost during the compilation from high-level source code. Most of the existing research on binary code type inference tends to resort to program analysis techniques, which can be too conservative to infer types with high accuracy or too heavy-weight to be viable in practice. In this paper, we propose a new approach to learning types for recovered variables from their related representative instructions. Our idea is motivated by “duck typing”, where the type of a variable is determined by its features and properties. Our approach first learns a classifier from existing binaries with debug information and then uses this classifier to predict types for new, unseen binaries. We have implemented our approach in a tool called BITY and used it to conduct some experiments on the well-known benchmark coreutils (v8.4). The results show that our tool is more precise than the commercial tool Hex-Rays, both in terms of correct types and compatible types.
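The learn-then-predict workflow from the abstract can be sketched with a deliberately simple model. The nearest-centroid classifier, the mnemonic-histogram features, and the function names `features`, `train` and `predict` are stand-ins chosen for brevity; they are not the features or the classifier that BITY uses, and the training samples are made up.

```python
import math
from collections import Counter, defaultdict


def features(instructions):
    """Normalised mnemonic histogram of the instructions that touch one variable."""
    counts = Counter(ins.split()[0] for ins in instructions)
    total = sum(counts.values())
    return {mnemonic: n / total for mnemonic, n in counts.items()}


def train(samples):
    """samples: list of (instructions, type label) taken from binaries with debug info."""
    sums, seen = defaultdict(Counter), Counter()
    for instructions, label in samples:
        for mnemonic, weight in features(instructions).items():
            sums[label][mnemonic] += weight
        seen[label] += 1
    # One centroid (mean feature vector) per type label.
    return {
        label: {m: w / seen[label] for m, w in vec.items()}
        for label, vec in sums.items()
    }


def predict(centroids, instructions):
    """Return the label whose centroid is closest in Euclidean distance."""
    x = features(instructions)

    def distance(centroid):
        keys = set(x) | set(centroid)
        return math.sqrt(sum((x.get(k, 0) - centroid.get(k, 0)) ** 2 for k in keys))

    return min(centroids, key=lambda label: distance(centroids[label]))


samples = [
    (["movss xmm0, [rbp-4]", "addss xmm0, xmm1"], "float"),
    (["movss xmm0, [rbp-8]", "mulss xmm0, xmm1"], "float"),
    (["mov eax, [rbp-12]", "add eax, 1"], "int"),
    (["mov eax, [rbp-16]", "imul eax, ecx"], "int"),
]
model = train(samples)
guess = predict(model, ["movss xmm1, [rbp-20]", "addss xmm1, xmm0"])
print(guess)
```

The unseen variable is only touched by `movss` and `addss`, so it lands closest to the `float` centroid. A practical system would need richer features than mnemonics alone, since many types share the same instructions.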

Others

Hex-Rays

Hex-Rays [1] is one of the most popular tools for binary code analysis. We have already compared against Hex-Rays in our results. However, Hex-Rays is commercial software and, as far as we know, its techniques have not been published.

DBN

Wang et al. 2016 [7] leverage a powerful representation learning algorithm, namely deep learning, to extract source code features for defect prediction. This work converts code snippets into vectors of tokens, preserving structural and contextual information.
