## 1. Introduction
**Seven categories of code search tools**
- text-based code search.
- I/O example code search.
- API-based code search.
  - ADECK [147]
- code clone search.
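- binary code search. `Source code compiles to different binaries; the search runs over the binary code.`
- UI search. `Search using UI sketches.` pix2code [14]
- programming video search. `Search for relevant code inside videos.` [10]
Of the 67 code search tools, only 12 are open source.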
**Evaluation**
Mainly ranking metrics: MRR (Mean Reciprocal Rank) and Precision.
Several recent papers also report R@1, R@5, and similar.
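
The metrics above can be sketched in a few lines. This is a minimal illustration, not the survey's evaluation code: the function names and the toy `RESULTS` data are made up for the example, and R@k is assumed here to mean the fraction of relevant snippets that appear in the top k (with a single ground-truth snippet per query this is 0 or 1).

```python
from typing import List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant result, or 0.0 if none is returned."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant result."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant results that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical data: (ranked result ids, ground-truth ids) for two queries.
RESULTS: List[Tuple[List[str], Set[str]]] = [
    (["a", "b", "c"], {"b"}),  # correct snippet at rank 2
    (["x", "y", "z"], {"x"}),  # correct snippet at rank 1
]

mrr = mean([reciprocal_rank(r, rel) for r, rel in RESULTS])
r_at_1 = mean([recall_at_k(r, rel, 1) for r, rel in RESULTS])
r_at_5 = mean([recall_at_k(r, rel, 5) for r, rel in RESULTS])
print(f"MRR={mrr:.2f} R@1={r_at_1:.2f} R@5={r_at_5:.2f}")
```

MRR rewards placing the first correct result high in the list, while R@k only checks whether it lands in the top k at all, which is why papers tend to report both.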
**Challenges**
- Standard Benchmark.
- Improve Machine Learning Models. Training data, cross-modal representation, loss function.
- Model Fusion. DL model, traditional IR model...
- Cross-Language Searches.
- Search Tasks. New tasks such as UI code and code used in programming videos.
## 2. Background
A typical workflow has seven modules.
![image-20220211154006997](https://gitee.com/hufanmax/image_bag/raw/master/image/image-20220211154006997.png)
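- Query. Mostly natural language. **[56, 85, 103] support structured, code-based queries.**
- Codebase. Different languages, different sources.
- Code Analysis Technique. How can more **programming knowledge** be mined from code? AST (abstract syntax tree) [136, 145], CFG (control flow graph) [21, 130], call graph (call relationships among variables and methods) [74, 75].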
- Modeling Technique.
  - Traditional IR.
  - Heuristics: hand-designed features, matching score.
  - ML model.
- Auxiliary Technique (**worth looking into**).
  - Query reformulation.
  - Code clustering.
  - Feedback learning.
- Evaluation Method.
- Performance Measures.
## 3. Methodology
Empirical study: Analyze the search logs of existing code search tools [6, 7, 26, 37, 38, 89, 104, 105, 138, 142]. This one is quite interesting: it analyzes search logs from real-world usage.
## 5. Key Components of Existing Code Search Tools
### 5.1 Code Analysis Technique
![image-20220211172951665](https://gitee.com/hufanmax/image_bag/raw/master/image/image-20220211172951665.png)
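**Semantics Analysis**: parses the components of a program and the dependencies between them.
AST (abstract syntax tree) [136, 145], CFG (control flow graph) [21, 130], call graph (call relationships among variables and methods) [74, 75].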
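
To make the AST and call graph ideas concrete, here is a minimal sketch that parses a snippet and records which names each function calls. It uses Python's standard `ast` module purely as a stand-in; the surveyed tools use their own parsers and languages, and `SOURCE` and `build_call_graph` are hypothetical names for this example.

```python
import ast
from collections import defaultdict
from typing import Dict, List

# Hypothetical snippet to analyze.
SOURCE = '''
def load(path):
    return open(path).read()

def count_words(path):
    text = load(path)
    return len(text.split())
'''


def build_call_graph(source: str) -> Dict[str, List[str]]:
    """Map each function name to the sorted names it calls (by name only)."""
    tree = ast.parse(source)  # source text -> abstract syntax tree
    graph = defaultdict(set)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            if isinstance(node, ast.Call):
                callee = node.func
                if isinstance(callee, ast.Name):         # plain call: f(x)
                    graph[func.name].add(callee.id)
                elif isinstance(callee, ast.Attribute):  # method call: obj.f(x)
                    graph[func.name].add(callee.attr)
    return {name: sorted(callees) for name, callees in graph.items()}


print(build_call_graph(SOURCE))
```

The sketch resolves callees by name only, with no type or scope resolution, so it is far cruder than a real call graph. It does show the kind of structural information a plain text index cannot see.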
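**Relevancy Analysis**: the relationship between code and query.
- Code difference. Identify where the query and the code are similar and where they differ, to narrow the search scope.
- Static code slice. Filter out irrelevant code.
- Symbolic execution.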
### 5.2 Modeling Techniques
![image-20220211174551982](https://gitee.com/hufanmax/image_bag/raw/master/image/image-20220211174551982.png)
IR Models
- TF-IDF, BM25
- Boolean models support the "AND" operator [86, 143].
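
As a rough illustration of the IR models listed above, the following is a minimal BM25 scorer over whitespace-tokenized snippet descriptions. It is a sketch under assumptions, not the formula used by any particular surveyed tool: it uses the common non-negative IDF variant, the conventional defaults `k1=1.5` and `b=0.75`, and a made-up `DOCS` corpus.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency saturates and is normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical corpus of snippet descriptions.
DOCS = [
    "read file into string",
    "write string to file",
    "sort list of integers",
]
print(bm25_scores("read file", DOCS))
```

The first description matches both query terms and scores highest; the third matches none and scores zero. TF-IDF follows the same pattern without the saturation and length-normalization terms.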
Heuristic Models
ML Models
### 5.3 Auxiliary Techniques
Inverted Index
Query Reformulation: **Expand & replace**
Code Clustering
Feedback Learning
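
Two of these auxiliary techniques are easy to sketch together: an inverted index and query expansion. The example below is a minimal illustration with hypothetical data (`SNIPPETS`, `SYNONYMS`) and function names. It ORs each query term with its synonyms, then ANDs across query terms in the style of a Boolean model. It covers only the "expand" half of query reformulation, not "replace".

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical snippet descriptions keyed by snippet id.
SNIPPETS = {
    "s1": "read text file into string",
    "s2": "write string to file",
    "s3": "delete file from disk",
}

# Hypothetical synonym table used for query expansion.
SYNONYMS = {"remove": ["delete"], "load": ["read"]}


def build_inverted_index(snippets: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of snippet ids that contain it."""
    index = defaultdict(set)
    for snippet_id, text in snippets.items():
        for token in text.lower().split():
            index[token].add(snippet_id)
    return dict(index)


def expand_term(term: str) -> List[str]:
    """Return the term plus any known synonyms."""
    return [term] + SYNONYMS.get(term, [])


def search_and(query: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Boolean AND over query terms, each term expanded with its synonyms."""
    result = None
    for term in query.lower().split():
        postings: Set[str] = set()
        for variant in expand_term(term):  # OR within one expanded term
            postings |= index.get(variant, set())
        result = postings if result is None else result & postings
    return result or set()


index = build_inverted_index(SNIPPETS)
print(search_and("remove file", index))
print(search_and("string file", index))
```

Without the synonym table, the query "remove file" would match nothing, because no snippet contains the word "remove". Expansion bridges that vocabulary gap between query and code.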
## 7. Code Search Evaluation
![image-20220211175755059](https://gitee.com/hufanmax/image_bag/raw/master/image/image-20220211175755059.png)
## 8. CHALLENGES AND OPPORTUNITIES
Challenge 1: Diversity of the Codebase.
Challenge 2: Limited Queries.
Challenge 3: Model Construction Issues. DL models have too many parameters and place demands on the quality of the training data.
Challenge 4: Evaluation Issues.
Challenge 5: Limited Performance Measures.
Challenge 6: Replication Issues.
Opportunity 1: Better Benchmarks.
Opportunity 2: DL-Based Model with Big Data.
Opportunity 3: Fusion of Different Types of Models.
Opportunity 4: Multi-Language Tool.
Opportunity 5: New Code Search Tasks.
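
As one possible reading of Opportunity 3, scores from an IR model and a DL model could be fused by normalizing each and taking a weighted sum. The notes do not say how fusion should be done, so everything here is an assumption for illustration: the min-max normalization, the weight `alpha`, the function names, and the toy scores.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different models become comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {key: 0.0 for key in scores}
    return {key: (value - low) / (high - low) for key, value in scores.items()}


def fuse(ir_scores: Dict[str, float], dl_scores: Dict[str, float],
         alpha: float = 0.5) -> Dict[str, float]:
    """Weighted sum of normalized scores; alpha is the weight of the IR model."""
    ir_norm, dl_norm = min_max(ir_scores), min_max(dl_scores)
    return {
        key: alpha * ir_norm[key] + (1 - alpha) * dl_norm.get(key, 0.0)
        for key in ir_norm
    }


# Hypothetical scores for three candidate snippets.
ir = {"a": 2.0, "b": 1.0, "c": 0.0}   # e.g. BM25-style scores
dl = {"a": 0.1, "b": 0.9, "c": 0.5}   # e.g. embedding similarities
fused = fuse(ir, dl)
print(max(fused, key=fused.get))
```

Here snippet `b` wins after fusion even though the IR model alone prefers `a`, which is the kind of complementary behavior that fusion is meant to exploit.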