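(An An's understanding) The idea is to build an interface for checking which code representations pair best with which machine learning architectures and which combination reaches the highest accuracy, i.e., to build a benchmark.
Then design a customized feature model.
The idea is very similar to the paper below, but that paper approaches the problem through code similarity algorithms.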
VulPecker: An Automated Vulnerability Detection System Based on Code Similarity Analysis (published in ACSAC '16, 2016)
1 INTRODUCTION
According to the recent survey by Ghaffarian and Shahriari [5], machine-learning-based vulnerability discovery can be divided into three areas:
vulnerability detection based on software metrics
anomaly detection
vulnerable code pattern recognition
This paper focuses on the third area.
Our goal is, therefore, a supervised machine learning process that extracts patterns of vulnerable code snippets and re-identifies them in unseen source code.
The vulnerability knowledge can be divided into known patterns and unknown patterns.
The latter can only be discovered by an anomaly-based method.
We will focus on known patterns categorized for the most common weaknesses, or CWEs [12], a project for classifying frequent vulnerabilities.
2 RELATED WORK
The vulnerability analysis on source code can be divided into three types: lexical, syntactic and semantic analysis [7].
By looking at source code analysis with machine learning at different stages, we can distinguish three waves [1]:
1) The first wave consists of basic tools with hand-crafted features.
2) The second wave follows a lexical analysis, which has already been studied extensively, treating code as text and organizing code into classes using natural language processing techniques.
3) The ongoing development, taking into account the semantics of programming languages, is referred to as the third wave of MLoC and pointed out as future work.
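As a minimal sketch of the second-wave, lexical view (code treated as text), the snippet below splits a C fragment into tokens with a regular expression and counts them. The regular expression and the helper name `lex_tokens` are illustrative assumptions, not taken from the paper; a real lexer would also handle comments, string literals, and preprocessor directives.

```python
import re
from collections import Counter

# Simplified C token pattern (an assumption for illustration):
# identifiers/keywords, integer literals, a few two-character operators,
# then any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||->|\+\+|--|[^\s\w]")

def lex_tokens(code: str) -> list[str]:
    """Split source text into a flat token sequence, ignoring whitespace."""
    return TOKEN_RE.findall(code)

snippet = "char buf[8]; strcpy(buf, src);"
tokens = lex_tokens(snippet)
bag = Counter(tokens)  # bag-of-tokens features: order and structure are lost
```

The bag of tokens shows the limitation of a purely lexical view: it records that `strcpy` and `buf` occur, but not how they relate syntactically or semantically.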
Allamanis et al. represent programs as graphs [2], where edges correspond to syntactic and semantic relationships. They evaluate their approach in open source C# projects to predict variable names based on their usage and to predict the correct variable name at the corresponding program location. They give selected examples for the correct prediction of variable usage and also test their model on unseen projects. This work is relevant to the third wave of MLoC and there is still plenty of room for improvements in accuracy and F1 score.
[2] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2017. Learning to Represent Programs with Graphs. arXiv:1711.00740 [cs] (Nov. 2017). http://arxiv.org/abs/1711.00740
In the work of Tufano et al. [17], different source code representations are used to detect code clones with a deep learning approach. Identifiers, abstract syntax trees (ASTs), control flow graphs (CFGs), and byte code are used as representations, each providing an orthogonal view of the code snippet. The authors demonstrate the effectiveness of each representation on its own and also create a combined model with ensemble learning.
The authors have shown that both single and combined representations work with a very high accuracy in this experiment. We also want to test single and combined representations of source code, but for another application: vulnerability detection.
[17] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2018. Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories - MSR '18. ACM Press, Gothenburg, Sweden, 542–553. https://doi.org/10.1145/3196398.3196431
The actual detection of vulnerabilities in source code is researched in [15] using an identifier representation in a deep learning environment. The authors created a data set from a variety of sources and labeled it based on the results of static analysis tools, considering the top five CWE categories. They empirically developed a best-practice pipeline that uses a random embedding of the source code tokens, learns features through a one-dimensional CNN, and feeds these features to a random forest classifier that decides whether the code is secure or vulnerable. This work could be assigned to the second wave of MLoC. By creating an embedding that pays more attention to the token semantics than just a random embedding, this method could achieve a higher classification accuracy. In addition, with a tree representation of the code, this method could be further improved.
[15] Rebecca L. Russell, Louis Y. Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul M. Ellingwood, and Marc W. McConley. 2018. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. In 17th IEEE International Conference on Machine Learning and Applications, ICMLA 2018, Orlando, FL, USA, December 17-20, 2018. 757–762. https://doi.org/10.1109/ICMLA.2018.00120
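The feature-extraction half of the pipeline attributed to [15] (random token embedding, then a one-dimensional convolution with pooling) can be sketched roughly as follows. This is a toy illustration in plain Python, not the authors' implementation: the function names, dimensions, and seeds are assumptions, and the filters are random rather than learned. The resulting vector is the kind of input that would then be handed to a classifier such as a random forest.

```python
import random

def random_embedding(vocab, dim, seed=0):
    """Assign every token a fixed random vector (no learned semantics)."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}

def conv1d_max_pool(seq, filters):
    """Slide each filter over the token sequence, apply ReLU, keep the maximum.

    seq:     list of embedding vectors, one per token
    filters: list of filters; each filter is `width` rows of `dim` weights
    Returns one pooled activation per filter.
    """
    features = []
    for filt in filters:
        width = len(filt)
        best = 0.0  # ReLU floor: negative activations are clipped to zero
        for start in range(len(seq) - width + 1):
            act = sum(
                w * x
                for row, vec in zip(filt, seq[start:start + width])
                for w, x in zip(row, vec)
            )
            best = max(best, act)
        features.append(best)
    return features

tokens = ["char", "buf", "[", "8", "]", ";", "strcpy", "(", "buf", ",", "src", ")", ";"]
dim, width, n_filters = 4, 3, 5  # hypothetical sizes, chosen for readability
emb = random_embedding(sorted(set(tokens)), dim)

# Filters are random here; in a trained CNN they would be learned from labels.
rng = random.Random(1)
filters = [
    [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(width)]
    for _ in range(n_filters)
]

feature_vector = conv1d_max_pool([emb[t] for t in tokens], filters)
```

Because of the global max pooling, the feature vector has one entry per filter regardless of how many tokens the function contains, which is what lets a fixed-input classifier consume functions of any length.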
A graph-based representation brings better performance for vulnerability detection, as shown in the work of Kronjee et al. [9] for CFGs to detect SQL injections and cross-site scripting in PHP applications. They have shown that the chosen machine learning algorithm is not crucial because the algorithms provide similar AUC-PR (area under the precision-recall curve) values for the same vulnerability. We also want to include CFGs as a code representation in our work, but for C/C++ fragments.
[9] Jorrit Kronjee, Arjen Hommersom, and Harald Vranken. 2018. Discovering Software Vulnerabilities Using Data-flow Analysis and Machine Learning. In Proceedings of the 13th International Conference on Availability, Reliability and Security (ARES 2018), Hamburg, Germany. ACM, New York, NY, USA, 6:1–6:10. https://doi.org/10.1145/3230833.3230856
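For reference, the AUC-PR metric mentioned above can be estimated with the step-wise "average precision" sum: rank samples by classifier score and add the precision at every rank where a true positive appears, weighted by the recall gained there. The sketch below is one common estimator, not necessarily the one used in [9]; the labels and scores are made-up toy values, and tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: 1 for vulnerable, 0 for not vulnerable
    scores: classifier confidence that the sample is vulnerable
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precision = true_pos / rank
            area += precision / total_pos  # recall rises by 1/total_pos here
    return area

# Toy data (hypothetical): two vulnerable and two benign samples.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(toy_labels, toy_scores)
```

With these toy values the two positives sit at ranks 1 and 3, so the sum is (1/1 + 2/3) / 2 = 5/6. Unlike plain accuracy, this metric stays informative when vulnerable samples are rare.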
A path-based approach called Code2Vec [3] is used to provide a framework that creates a fixed-length continuous vector from code snippets of any size to detect code similarities. However, the results show that the prediction accuracy depends less on the path than on the variable names of the start and end nodes in the considered path. Another limitation of this work is the non-universal closed label vocabulary, which is limited to the training data. The work is nevertheless a promising concept. We want to find out how well this approach works for vulnerability discovery in our future work.
[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. code2vec: Learning Distributed Representations of Code. arXiv:1803.09473 [cs, stat] (March 2018). http://arxiv.org/abs/1803.09473
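To make the notion of a path with start and end nodes concrete, the sketch below extracts (start token, path, end token) triples from a syntax tree. It is only an illustration of the idea: it uses Python's own `ast` module so that it is self-contained, the path notation (`^` for going up, `_` for going down) and the function names are assumptions, and the learned aggregation of these triples into a fixed-length vector is omitted entirely.

```python
import ast
from itertools import combinations

def leaf_paths(tree):
    """Return (token, root-to-leaf node list) for every identifier or literal."""
    found = []

    def walk(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            found.append((node.id, trail))
        elif isinstance(node, ast.arg):
            found.append((node.arg, trail))
        elif isinstance(node, ast.Constant):
            found.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, trail)

    walk(tree, [])
    return found

def path_contexts(source):
    """List (start token, path string, end token) for every pair of leaves."""
    contexts = []
    leaves = leaf_paths(ast.parse(source))
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(leaves, 2):
        # Length of the shared prefix = depth of the lowest common ancestor.
        shared = 0
        while (shared < min(len(trail_a), len(trail_b))
               and trail_a[shared] is trail_b[shared]):
            shared += 1
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        path = "^".join(up + [top]) + "_" + "_".join(down)
        contexts.append((tok_a, path, tok_b))
    return contexts

contexts = path_contexts("def f(x):\n    return x + 1")
```

Each triple keeps the two end tokens verbatim, which is consistent with the observation above that predictions lean heavily on the names at the ends of a path: renaming `x` changes every triple it takes part in, while the path strings stay the same.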
Binary code representations are used for pattern analysis [6, 10, 19] by identifying potentially dangerous usage patterns of the C standard library to predict the probability of a crash when executing certain commands and to test them with various fuzzing tools. However, due to high prediction errors, we move away from the idea of analyzing byte code and look instead at code, which is represented in a high-level format.
3 BENCHMARKING STATE-OF-THE-ART METHODS
4 FOCUS ON ACTIONABILITY
In general, static analysis tools often come with a high false positive rate, whereas dynamic analysis tools often provide a high number of false negatives [16].
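The two error types contrasted here follow directly from a confusion matrix. A minimal sketch, with hypothetical counts chosen only to make the arithmetic visible:

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) from confusion counts."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of safe code that is flagged
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of vulnerable code that is missed
    return fpr, fnr

# Hypothetical tool output: 50 vulnerable and 100 safe functions in total.
fpr, fnr = error_rates(tp=40, fp=30, tn=70, fn=10)
```

Here 30 of 100 safe functions are flagged (false positive rate 0.3) and 10 of 50 vulnerable functions are missed (false negative rate 0.2). A high false positive rate is what makes findings hard to act on, since every report has to be triaged by hand.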
Recent work [9, 15] classifies a code snippet at the function level.
5 COMBINING REPRESENTATIONS TO AN IMPROVED MODEL
Integrate different representations of the code, so that lexical and syntactic analysis can be combined.
Integrate different encodings.
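One simple way to combine models trained on different representations is soft voting: each per-representation model outputs a probability that the function is vulnerable, and the ensemble takes a weighted average. The notes do not say which ensemble scheme the paper intends, so the sketch below is an assumption; the representation names, probabilities, and the 0.5 threshold are hypothetical.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-representation vulnerability probabilities.

    probabilities: dict mapping representation name -> P(vulnerable)
    weights:       optional dict of non-negative weights, default uniform
    """
    if weights is None:
        weights = {name: 1.0 for name in probabilities}
    total = sum(weights[name] for name in probabilities)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * p for name, p in probabilities.items()) / total

# Hypothetical outputs of three models, one per representation of the same function.
per_representation = {"tokens": 0.75, "ast": 0.5, "cfg": 0.25}
combined = soft_vote(per_representation)
is_vulnerable = combined >= 0.5
```

The weights could, for example, be set from each model's validation score, so that a representation that works well for a given CWE category contributes more to the combined decision.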
6 EVALUATION PLAN