【ICSE 2021】ATVHUNTER: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications (paper notes)

【ICSE 2021】ATVHUNTER: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications


Affiliations: 1 The Hong Kong Polytechnic University, 2 Nankai University, 3 Tianjin University, 4 Nanyang Technological University, 5 Monash University

Venue: ICSE 2021

Paper link: ATVHUNTER: Reliable Version Detection of Third-Party Libraries for Vulnerability Identification in Android Applications

ABSTRACT

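The paper proposes ATVHUNTER (Android in-app Third-party library Vulnerability Hunter). By detecting the precise versions of the third-party libraries (TPLs) in an Android app and collecting information about TPL vulnerabilities, it reports the TPLs used by the input app and the vulnerabilities associated with them. In essence, it is a similarity-based library detection approach that relies on prior knowledge.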

For app analysis, ATVHUNTER uses a two-phase detection approach to identify specific TPL versions: Control Flow Graphs (CFGs) serve as the coarse-grained feature, and the opcodes in each basic block of the CFG serve as the fine-grained feature.

For the reference databases, ATVHUNTER builds a TPL database containing 189,545 unique TPLs with 3,006,676 versions, and a TPL vulnerability database containing 1,180 CVEs and 224 security bugs found in TPLs.

The authors evaluate ATVHUNTER in terms of effectiveness, efficiency, and obfuscation-resilient capability. They also run a large-scale analysis of 104,446 top apps with ATVHUNTER and find 9,050 vulnerable apps, involving 53,337 known vulnerabilities and 7,480 security bugs across 10,616 vulnerable TPLs.

1. INTRODUCTION

1.1 Why TPL detection matters

  • Attackers can exploit the vulnerabilities in TPLs
  • Attackers can inject backdoors in TPLs
  • TPLs are scattered in different apps
  • The TPL components in an app may not be transparent to app developers (due to many direct or transitive dependencies)

1.2 Existing TPL detection approaches

  • Without prior knowledge:

    • clustering-based methods:

      LibRadar (ICSE 2016), LibD (ICSE 2017), LibExtractor (WiSec 2020)

  • With prior knowledge:

    • similarity-based methods:

      LibScout (CCS 2016), LibID (ISSTA 2019)

[Figure: existing TPL detection tools]

Data source: Research on Third-Party Libraries in Android Apps: A Taxonomy and Systematic Literature Review (TSE 2021)

1.3 Weaknesses of existing TPL detection approaches

  • Clustering-based methods:

    • Require a considerable number of apps as input
    • Low recall: can only identify commonly used TPLs
    • Labor-intensive: the clustering results must be verified manually
    • Imprecise: cannot identify the precise version
  • Similarity-based methods:

    • Require a predefined TPL database as the reference database
    • Low recall: the published TPL databases are far smaller than the set of TPLs in the actual market
    • Imprecise: cannot identify the precise version

2. ARCHITECTURE

[Figure: ATVHUNTER architecture]

2.1 TPL Detection

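Goal: identify which TPLs an app contains, based on the data in the TPL database.

2.1.1 Preprocessing

  • Task 1: decompile the APK into bytecode and convert it into IR (using APKTOOL)

  • Task 2: delete the primary module from the APK

    • primary module: the code implemented by the app developer

    • non-primary module: TPLs

    • Implementation:

      • Use AndroidManifest.xml to find the package that contains MainActivity

        • For example:

          <manifest …… package="com.cmic.sso.myapplication" ……>
          
      • Delete the files under that package's namespace

    • Side Effects:

      • Side Effect 1: package flattening and package renaming obfuscation prevent the host code from being deleted

        • Before obfuscation: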

          mycompany.myapplication.MyMainActivity
          mycompany.myapplication.Foo
          mycompany.myapplication.Bar
          mycompany.myapplication.extra.FirstExtra
          mycompany.myapplication.extra.SecondExtra
          mycompany.util.FirstUtil
          mycompany.util.SecondUtil
          
        • After default ProGuard obfuscation:

          mycompany.myapplication.MyMainActivity
          mycompany.myapplication.a
          mycompany.myapplication.b
          mycompany.myapplication.a.a
          mycompany.myapplication.a.b
          mycompany.a.a
          mycompany.a.b
          
        • After obfuscation with -flattenpackagehierarchy 'myobfuscated':

          mycompany.myapplication.MyMainActivity
          mycompany.myapplication.a
          mycompany.myapplication.b
          myobfuscated.a.a
          myobfuscated.a.b
          myobfuscated.b.a
          myobfuscated.b.b
          

          myobfuscated.a replaces mycompany.myapplication.extra,

          so mycompany.myapplication.extra.FirstExtra and mycompany.myapplication.extra.SecondExtra cannot be deleted

      • Side Effect 2: a special package name prevents the host code from being deleted

      • Side Effect 3: the host app and TPLs share the same package namespace, so TPLs are deleted by mistake

    • Side Effects 1 and 2: do not affect the accuracy of TPL identification

    • Side Effect 3: causes false negatives (FN)
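The namespace-based removal in Task 2, and the reason Side Effect 1 leaves host code behind, can be illustrated with a minimal Python sketch. The function name `remove_primary_module` and the simple prefix-matching rule are assumptions made for illustration; these notes do not show the paper's actual implementation.

```python
def remove_primary_module(class_names, manifest_package):
    """Drop every class under the manifest package namespace (assumed host code)."""
    prefix = manifest_package + "."
    return [name for name in class_names if not name.startswith(prefix)]


# Class names after -flattenpackagehierarchy 'myobfuscated' (taken from the listing above).
flattened = [
    "mycompany.myapplication.MyMainActivity",
    "mycompany.myapplication.a",
    "mycompany.myapplication.b",
    "myobfuscated.a.a",  # formerly mycompany.myapplication.extra.FirstExtra
    "myobfuscated.a.b",  # formerly mycompany.myapplication.extra.SecondExtra
    "myobfuscated.b.a",
    "myobfuscated.b.b",
]

remaining = remove_primary_module(flattened, "mycompany.myapplication")
print(remaining)
```

The two `myobfuscated.a.*` classes are host code, but they no longer sit under the manifest package, so a prefix-based filter keeps them as if they were part of a TPL.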

2.1.2 Module Decoupling

Goal: separate the TPLs from one another.

Method: treat each Class Dependency Graph (CDG) as one TPL candidate (using Androguard).

class dependency relationship includes:

① class inheritance
② method call relationship
③ field reference relationship
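As a rough sketch of module decoupling, the following Python code groups classes into candidates by following dependency edges. It assumes that one candidate corresponds to one connected component of the dependency graph and that edges are treated as undirected; both are simplifications for illustration. The class names and the helper `split_candidates` are hypothetical, and no Androguard API is used here.

```python
from collections import defaultdict


def split_candidates(classes, dependencies):
    """Group classes into connected components of an undirected dependency graph."""
    graph = defaultdict(set)
    for src, dst in dependencies:
        graph[src].add(dst)
        graph[dst].add(src)

    seen = set()
    candidates = []
    for cls in classes:
        if cls in seen:
            continue
        # Depth-first traversal collects everything reachable from cls.
        stack = [cls]
        seen.add(cls)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        candidates.append(sorted(component))
    return candidates


# Hypothetical classes; each edge stands for inheritance, a method call, or a field reference.
classes = ["liba.A", "liba.B", "libb.X", "libb.Y", "libc.Solo"]
dependencies = [("liba.A", "liba.B"), ("libb.Y", "libb.X")]
candidates = split_candidates(classes, dependencies)
print(candidates)
```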

2.1.3 Feature Generation

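Goal: extract a fingerprint for each TPL.

Method:

(1) Coarse-grained feature:

① For each method in the candidate TPLs, extract the CFG (using soot) and number every node (basic block, BB) of the CFG in execution order, from small to large.

When numbering the child nodes of a branch node n:

  • the child with more outgoing edges is numbered n+1
  • if the outgoing edges are equal, the child with more statements is numbered n+1

② Represent each node in the form nodeCount -> (child1,child2,…)

③ Represent each CFG (one per method) as an adjacency list,
of the form [parent1 -> (child1,child2,…), parent2-> …]

④ Compute a hash of the adjacency list (each adjacency list corresponds to one method)

⑤ Sort the hashes of all methods in the TPL, then hash the sorted sequence; this hash is the coarse-grained feature (T1) of the TPL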
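Steps ② to ⑤ can be sketched in Python as follows. The sketch assumes the CFG nodes are already numbered as in step ①, and it uses SHA-256 only as a stand-in, since these notes do not say which hash function the paper uses. The names `method_hash` and `coarse_feature` are hypothetical.

```python
import hashlib


def method_hash(adjacency):
    """Hash one CFG given as {node_id: [child_ids]}, with nodes numbered in execution order."""
    parts = []
    for parent, children in sorted(adjacency.items()):
        child_text = ",".join(str(child) for child in children)
        parts.append(f"{parent}->({child_text})")
    return hashlib.sha256(",".join(parts).encode("utf-8")).hexdigest()


def coarse_feature(method_cfgs):
    """Sort the per-method hashes, then hash the sorted sequence (the T1 feature)."""
    hashes = sorted(method_hash(cfg) for cfg in method_cfgs)
    return hashlib.sha256("".join(hashes).encode("utf-8")).hexdigest()


# Two hypothetical methods: a diamond-shaped CFG and a straight-line CFG.
diamond = {1: [2, 3], 2: [4], 3: [4], 4: []}
straight = {1: [2], 2: []}

t1 = coarse_feature([diamond, straight])
t1_reordered = coarse_feature([straight, diamond])
t1_changed = coarse_feature([diamond, {1: [2], 2: [3], 3: []}])
print(t1 == t1_reordered, t1 == t1_changed)
```

Sorting before the final hash makes the feature independent of the order in which methods appear, while any structural change to a single CFG changes T1.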

(2) Fine-grained feature:

[Figure: fine-grained feature generation]

① For each CFG, extract the opcodes of its BBs in adjacency-list order (using soot)

② Compute a fuzzy hash of the opcode sequence (using ssdeep)

The advantage of fuzzy hashing is: If one part of the feature changes due to code obfuscation, it would not cause a big difference to the final fingerprint.
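To show why a piecewise hash tolerates local changes, here is a toy Python sketch. It is not ssdeep and does not reproduce ssdeep's algorithm: it simply emits one character per basic block, which is an assumption made purely for illustration. The opcode lists and the function `toy_piecewise_hash` are hypothetical.

```python
import hashlib

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def toy_piecewise_hash(blocks):
    """One character per basic block: a stand-in for a fuzzy hash, not ssdeep."""
    chars = []
    for opcodes in blocks:
        digest = hashlib.sha256(" ".join(opcodes).encode("utf-8")).digest()
        chars.append(ALPHABET[digest[0] % len(ALPHABET)])
    return "".join(chars)


original = [
    ["const/4", "if-eqz"],
    ["invoke-virtual", "move-result"],
    ["return-void"],
]
# Simulated obfuscation: a junk opcode is inserted into the second block only.
obfuscated = [
    ["const/4", "if-eqz"],
    ["nop", "invoke-virtual", "move-result"],
    ["return-void"],
]

fp_original = toy_piecewise_hash(original)
fp_obfuscated = toy_piecewise_hash(obfuscated)
differing = sum(1 for a, b in zip(fp_original, fp_obfuscated) if a != b)
print(fp_original, fp_obfuscated, differing)
```

Only the character for the modified block can change, so the two fingerprints stay close under edit distance; a whole-input cryptographic hash would instead change completely.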

2.1.4 TPL Database Construction

  • We crawled all Java TPLs from Maven Repository (189,545 unique TPLs with their 3,006,676 versions) to build our TPL database.

  • We store both coarse-grained and fine-grained features in a MongoDB database.

  • We spent more than one month to collect all the TPLs and another two months to generate the TPL feature database.

2.1.5 Library Identification

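Goal: find, for each TPL candidate in the app, the matching TPL and TPL version.

(1) Potential TPL Identification
  • a) Search by package names

    Use the package name to filter out unrelated TPLs

    • If the package name of the TPL candidate is not obfuscated: filter out unrelated TPLs
    • If the package name of the TPL candidate is obfuscated: apply no filtering
  • b) Search by the number of classes

    In essence, use the number of classes to filter out unrelated TPLs

    If the class count of one side is less than 40% of the class count of the other, no further comparison is made

  • c) Search by coarse-grained features

    • If the coarse-grained features (T1) are identical, the TPL is considered matched

    • If more than 70% of the coarse-grained features (T1) are the same, a potential TPL has been found

      Only potential TPLs go on to Version Identification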
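Filters b) and c) can be sketched as a small Python pipeline. Two points are assumptions: the 70% rule is interpreted as the share of the library's per-method hashes that also appear in the candidate, and an "exact" match is interpreted as identical sorted method hashes (which yields an identical T1). The function names and the dictionary layout are hypothetical.

```python
def class_counts_comparable(n_a, n_b, ratio=0.4):
    """False when the smaller side has fewer than `ratio` of the larger side's classes."""
    small, large = sorted((n_a, n_b))
    return small >= ratio * large


def shared_fraction(candidate_hashes, library_hashes):
    """Share of the library's distinct method hashes that also occur in the candidate."""
    library = set(library_hashes)
    if not library:
        return 0.0
    return len(library & set(candidate_hashes)) / len(library)


def classify(candidate, library, threshold=0.7):
    """Return 'skipped', 'exact', 'potential', or 'rejected' for one candidate/library pair."""
    if not class_counts_comparable(candidate["classes"], library["classes"]):
        return "skipped"
    if sorted(candidate["method_hashes"]) == sorted(library["method_hashes"]):
        return "exact"
    if shared_fraction(candidate["method_hashes"], library["method_hashes"]) > threshold:
        return "potential"
    return "rejected"


library = {"classes": 30, "method_hashes": [f"h{i}" for i in range(10)]}
same = {"classes": 30, "method_hashes": [f"h{i}" for i in reversed(range(10))]}
close = {"classes": 28, "method_hashes": [f"h{i}" for i in range(8)] + ["x1", "x2"]}
far = {"classes": 30, "method_hashes": [f"h{i}" for i in range(5)] + ["y1", "y2"]}
tiny = {"classes": 10, "method_hashes": ["h0"]}

for name, cand in [("same", same), ("close", close), ("far", far), ("tiny", tiny)]:
    print(name, classify(cand, library))
```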

(2) Version Identification
  • Similarity between two methods

    Method Similarity Score (MSS)

    [Figure: MSS formula]

    • where $d[m_a, m_b]$ is the Edit Distance between the fingerprints of $m_a$ and $m_b$ (the hash of the adjacency list), computed with ssdeep

    • Edit Distance: the minimum number of edit operations (i.e., insertion, deletion, and substitution) required to turn one fingerprint into the other.

    • If MSS $\ge \theta$ ($\theta = 0.85$), the two methods are considered matched

  • Similarity between two TPLs

    TPL Similarity Score (TSS)

    [Figure: TSS formula]

    • $t_1$: a TPL from the app

    • $t_2$: a TPL from the TPL DB

    • $M|t_2|$: the number of methods in $t_2$

    • $M|t_1 \cap t_2|$: the number of methods $m_j$ that satisfy the following conditions

      • $m_j$ is a method in $t_2$
      • there exists $m_i \in t_1$ such that $MSS(m_i, m_j) \ge \theta$ ($\theta = 0.85$)
      • $t_1$ and $t_2$ contain at least one pair of methods whose MSS is 1 (a perfectly matched pair)
    • If TSS $\ge \delta$ ($\delta = 0.95$), the two TPLs are considered matched (if several TPLs match, the one with the highest TSS is the final result)
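The version-identification step can be sketched in Python under two stated assumptions, because the MSS and TSS formulas appear only as figures in these notes. First, MSS is assumed to be one minus the edit distance normalized by the longer fingerprint. Second, TSS is assumed to be $M|t_1 \cap t_2|$ divided by $M|t_2|$, which is what the definitions above suggest. The fingerprints are hypothetical strings, and the edit distance is a plain Levenshtein implementation rather than a call into ssdeep.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mss(fp_a, fp_b):
    """Assumed normalization: 1 - distance / length of the longer fingerprint."""
    longest = max(len(fp_a), len(fp_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(fp_a, fp_b) / longest


def tss(app_methods, lib_methods, theta=0.85):
    """Assumed ratio: matched library methods / all library methods."""
    if not lib_methods:
        return 0.0
    scores = [[mss(mi, mj) for mi in app_methods] for mj in lib_methods]
    # At least one perfectly matched pair is required.
    if not any(score == 1.0 for row in scores for score in row):
        return 0.0
    matched = sum(1 for row in scores if any(score >= theta for score in row))
    return matched / len(lib_methods)


def best_version(app_methods, versions, delta=0.95):
    """Return the version with the highest TSS if it reaches delta, else None."""
    if not versions:
        return None
    scored = {version: tss(app_methods, methods) for version, methods in versions.items()}
    version, score = max(scored.items(), key=lambda item: item[1])
    return version if score >= delta else None


app = ["aaaaaaaaaa", "bbbbbbbbbb"]
versions = {
    "1.0": ["aaaaaaaaaa", "bbbbbbbbbb"],
    "1.1": ["aaaaaaaaaa", "bbbbbbbbbX", "cccccccccc"],
}
print(tss(app, versions["1.0"]), tss(app, versions["1.1"]), best_version(app, versions))
```

In this example version 1.0 matches every method, while version 1.1 matches only two of its three methods and falls below $\delta$, so 1.0 is selected.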

2.2 Vulnerable TPL-V Identification

2.2.1 Database Construction

(1) Known TPL Vulnerability Collection
  • Extract the CPE name of each TPL from the TPL database
    • CPE 2.3: cpe:/::::::
  • Use the cve-search tool to look up the vulnerabilities related to each TPL
  • Finally, we collected 1,180 CVEs from 957 unique TPLs with 38,243 affected versions.
(2) Security Bug Collection
  • We also obtain 224 security bugs from Github and Bitbucket.
  • These bugs come from 152 open-source TPLs with their corresponding 4,533 versions.

2.2.2 Vulnerable TPL-V Identification

Check whether the matched TPL is vulnerable.
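Conceptually, this step is a lookup of each detected (TPL, version) pair in the vulnerability database. The Python sketch below assumes an in-memory dictionary keyed by library name and version. The library names and issue identifiers are placeholders invented for illustration, not real CVEs or real database entries.

```python
# Placeholder database: (library, version) -> list of issue identifiers.
VULN_DB = {
    ("com.example:lib-a", "1.2.0"): ["EXAMPLE-CVE-1", "EXAMPLE-CVE-2"],
    ("com.example:lib-b", "3.0.1"): ["EXAMPLE-BUG-7"],
}


def vulnerable_findings(detected, vuln_db):
    """Keep only the detected (library, version) pairs that have known issues."""
    return {pair: vuln_db[pair] for pair in detected if pair in vuln_db}


# Hypothetical output of the TPL detection stage for one app.
detected = [
    ("com.example:lib-a", "1.2.0"),
    ("com.example:lib-a", "1.3.0"),
    ("com.example:lib-c", "0.9.0"),
]

findings = vulnerable_findings(detected, VULN_DB)
print(findings)
```

The lookup only works because the detection stage reports an exact version: lib-a 1.3.0 is not flagged even though lib-a 1.2.0 is.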

3. EVALUATION

Measures the effectiveness and performance of ATVHUNTER.

3.1 Preparation

3.1.1 Ground-truth Dataset Construction

  • We first collect the latest versions of 500 open-source apps from F-Droid.
  • For each app, we manually analyze it and get the in-app TPLs with their specific versions.
  • We then download these TPLs with their versions from the
    Maven repository.
  • We filter 144 apps out due to the incomplete versions of TPLs maintained in the Maven repository.
  • We choose 356 apps and 189 unique TPLs with the complete 6,819
    version files as the ground truth.

3.1.2 Threshold Selection

  • We randomly select three groups (3 * 200) of apps, excluding the aforementioned dataset, to decide appropriate thresholds for MSS and TSS.

[Figure: threshold selection results]

3.2 Effectiveness Evaluation

[Figure: effectiveness evaluation results]

3.3 Efficiency Evaluation

[Figure: efficiency evaluation results]

3.4 Obfuscation-resilient Capability

[Figure: obfuscation-resilience evaluation results]

4. LARGE-SCALE ANALYSIS

Use ATVHUNTER to reveal the real-world impact of TPL vulnerabilities.

  • We collected commercial Android apps from Google Play based on the number of installations.
  • We finally collected 104,446 apps and found 72% of them (73,110/104,446) use TPLs.
  • 9,050/73,110 of apps include vulnerable TPLs, involving 53,337 vulnerabilities and 7,480 security bugs.
    • vulnerabilities are from 166 TPLs with 10,362 versions
    • security bugs are from 27 TPLs with 284 versions

5. DISCUSSION

Limitations:

(1) Native libraries: the hash-based approach may not work

(2) App packing
