Advances in bioinformatics methods for identifying cancer driver genes and significantly mutated genes


Summary:

  1. Cancer is often driven by the accumulation of genetic alterations, including single nucleotide variants, small insertions or deletions, gene fusions, copy-number variations, and large chromosomal rearrangements.

  2. Recent advances in next generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data and catalog somatic mutations in both common and rare cancer types. So far, the somatic mutation landscapes and signatures of >10 major cancer types have been reported; however, pinpointing driver mutations and cancer genes from millions of available cancer somatic mutations remains a monumental challenge.

  3. To tackle this important task, many methods and computational tools have been developed during the past several years and, thus, a review of its advances is urgently needed.

  4. Here, we first summarize the main features of these methods and tools for whole-exome, whole-genome and whole transcriptome sequencing data.

  5. Then, we discuss major challenges like tumor intra-heterogeneity, tumor sample saturation and functionality of synonymous mutations in cancer, all of which may result in false-positive discoveries.

  6. Finally, we highlight new directions in studying regulatory roles of noncoding somatic mutations and quantitatively measuring circulating tumor DNA in cancer.

This review may help investigators find an appropriate tool for detecting potential driver or actionable mutations in rapidly emerging precision cancer medicine.


Outline of the article:

In this review, we focus on the description of computational approaches and tools in identifying driver mutations and SMGs in cancer using NGS data.

In this review article:

  1. we first summarized the major biological resources that are commonly used for the development of these tools.

  2. Then, we described the main features of the tools in these five types.

  3. Next, we discussed some major challenges on identification of driver mutations or SMGs from large number of somatic mutations in cancer NGS data.

  4. Finally, we highlight several new directions, such as the study of noncoding regulatory mutations through integrated pan-cancer analyses of somatic mutations using functional genomics and whole-genome
    sequencing (WGS) data.


1. Data resources for method and tool development and evaluation

NGS data resources

Network and pathway data resources

[Image: Table 1]

2. Method and computational tools

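Methodological classification: the authors divide the algorithms and tools used to identify cancer driver genes and SMGs into the following five categories: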

  • Mutation frequency based
  • Functional impact based
  • Structural genomics based
  • Network or pathway based
  • Data integration based
[Image: Figure 1]
[Image: Table 2-2]

2.1 Mutation frequency based

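Most algorithms of this type are based on gene mutation frequency:

  • MuSiC is a mutation analysis pipeline that integrates sequencing data with clinical data; one research group used it to analyze TCGA ovarian cancer data and found 12 SMGs.

  • ContrastRank focuses on comparing gene mutation rates between tumor and normal tissue.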

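  • OncodriveCLUST focuses on gain-of-function (GoF) mutations that cluster in specific regions of the protein sequence; it uses coding-silent (synonymous) mutations as the background model.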

  • MutSigCV uses gene expression data and replication timing information to correct for variation in the background mutation rate.
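
As a rough illustration of the frequency-based idea (not the actual model of MuSiC, MutSigCV or any other tool listed here), a gene can be tested for an excess of mutations against an assumed uniform background mutation rate with a one-sided binomial test. The function names, the background rate, and the gene counts below are hypothetical.

```python
import math

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # Log of C(n, i) * p^i * (1 - p)^(n - i), to avoid overflow for large n.
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        # Past the mean the terms no longer grow, so stop once they are negligible.
        if i > n * p and term <= total * 1e-15:
            break
    return min(1.0, total)

def gene_pvalue(n_mutations, gene_length_bp, n_samples, background_rate):
    """One-sided p-value for seeing at least n_mutations in a gene by chance."""
    # Assumption: every base in every sample is an independent trial.
    n_sites = gene_length_bp * n_samples
    return binomial_upper_tail(n_mutations, n_sites, background_rate)

# Hypothetical cohort: (mutation count, coding length in bp) per gene.
genes = {"GENE_A": (30, 1200), "GENE_B": (1, 1200)}
for name, (count, length) in genes.items():
    p = gene_pvalue(count, length, n_samples=300, background_rate=1e-6)
    print(f"{name}: p = {p:.3g}")
```

A single uniform background rate is the weakest part of this sketch. Mutation rates vary between genes and samples, which is why tools such as MutSigCV bring in covariates like expression level and replication timing.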

2.2 Functional impact based

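Most algorithms of this type relate to gene function: they predict the effect of an alteration on a gene and on the function of its protein.

  • Sorting Intolerant from Tolerant (SIFT) is based on the degree of conservation of the protein sequence.

  • Polymorphism Phenotyping v2 (PolyPhen-2) integrates 8 sequence-based and 3 structure-based features.

  • MutationAssessor uses entropy theory to define evolutionary conservation patterns across species; it is limited to nonsynonymous SNVs.

  • OncodriveFM uses SIFT, PolyPhen-2 and MutationAssessor to identify SMGs mutated at low frequency.

  • MutationTaster can be used to evaluate the disease-causing potential of a mutation (using evolutionary conservation, loss-of-function mutations, and changes in protein function); it cannot evaluate INDELs (>12 base pairs) that span an exon-intron boundary.

  • CHASM trains a random forest model on 49 predictive features to predict the functional impact of missense mutations.

  • FATHMM uses a hidden Markov model (HMM) to distinguish driver mutations from passenger mutations (integrating homologous sequences and conserved protein domains).

  • CanDrA is based on a support vector machine (SVM) and integrates 95 structural and evolutionary features produced by 10 different functional prediction algorithms.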
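
Several of these tools rest on evolutionary conservation. The sketch below shows one simple way to quantify it, using the Shannon entropy of each column in a multiple sequence alignment. It is only an illustration of the concept, not the scoring scheme of SIFT or MutationAssessor; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conservation_scores(alignment):
    """Per-column score in [0, 1]; 1.0 means the column is fully conserved."""
    max_entropy = math.log2(20)  # 20 standard amino acids, all equally frequent
    return [1.0 - column_entropy(col) / max_entropy for col in zip(*alignment)]

# Toy alignment of four hypothetical orthologous sequences (equal length).
toy_alignment = ["ACD", "ACE", "ACF", "ACG"]
for position, score in enumerate(conservation_scores(toy_alignment), start=1):
    print(f"column {position}: conservation = {score:.3f}")
```

A substitution at a highly conserved column would be flagged as more likely to be damaging. Real predictors add further evidence, such as which amino acid is introduced and structural context.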

2.3 Structural genomics based

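Structure-based algorithms are mostly built on SNVs and rarely consider other types of alteration (such as gene fusions). A limitation of this class of methods is that not every protein has well-defined structural domain information.

  • MSEA (mutation set enrichment analysis) uses two models, MSEA-domain and MSEA-clust. MSEA-domain is based on hotspot mutation patterns in protein-coding regions, while MSEA-clust scans the genome for potential mutation hotspot regions.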

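  • ActiveDriver assumes that mutations preferentially alter phosphorylation sites; it identified SMGs such as ASF1, FLNB and GRM1 from 800 tumor samples across 8 cancer types (the Clinical Proteomic Tumor Analysis Consortium has generated a large amount of phosphorylation-related data).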

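  • SGDriver uses protein structure information (protein-ligand binding sites) to screen for potential drug-resistance mutations.

  • CanBind uses mutations in nucleic acid, small molecule, ion and peptide binding sites to screen for SMGs.

  • Identification of Protein Amino acid Clustering (iPAC) searches for non-random mutations in the three-dimensional structure (it identified both known and novel SMGs).

  • eDriver compares the distribution of somatic mutations across different structural domains.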
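
The domain-level comparison can be sketched as a test of whether a domain carries more mutations than its share of the protein length would predict. This is a simplified illustration of the idea only, not eDriver's published procedure; the function names, coordinates and mutation positions are hypothetical.

```python
import math

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p); fine for the small n of one protein."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def domain_enrichment(mutation_positions, domain_start, domain_end, protein_length):
    """Return (mutations inside the domain, one-sided enrichment p-value).

    Null hypothesis (an assumption of this sketch): mutations fall uniformly
    along the protein, so the chance of hitting the domain equals its length fraction.
    """
    n_total = len(mutation_positions)
    n_inside = sum(domain_start <= pos <= domain_end for pos in mutation_positions)
    domain_fraction = (domain_end - domain_start + 1) / protein_length
    return n_inside, binom_upper_tail(n_inside, n_total, domain_fraction)

# Hypothetical protein of 220 residues with one domain spanning residues 10-20.
positions = [10, 12, 15, 18, 200]
inside, p_value = domain_enrichment(positions, 10, 20, 220)
print(f"{inside} of {len(positions)} mutations in the domain, p = {p_value:.2e}")
```

The sketch needs domain boundaries as input, which mirrors the limitation noted above: proteins without annotated domains cannot be tested this way.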

2.4 Network or pathway based

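Network and pathway based algorithms give a good assessment of the effects that mutations produce in tumors.

However, because of current technical limitations, only 20-30% of potential PPIs are covered, and many of the inferred networks and pathways depend closely on the sample (tissue type, cell composition, physiological state).

  • PARADIGM integrates CNV and gene expression data to identify pathways that are consistently altered in tumors.

  • PARADIGM-SHIFT is an extension of PARADIGM that predicts the impact of a mutation on downstream genes using a belief-propagation algorithm; it covers both gain-of-function and loss-of-function mutations.

  • TieDIE uses a network diffusion approach to predict the effect of gene mutations on gene expression, and can also identify pathways associated with the expression changes caused by somatic mutations.

  • DriverNet predicts driver mutations by modeling the effect of gene mutations on the gene expression network; one advantage is that it can identify rare driver mutations.

  • VarWalker uses a random walk with restart algorithm to integrate large-scale cancer genomic data into the PPI network.

  • Network-based stratification (NBS) is a network-based algorithm for identifying subgroups within the data for a single tumor type.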

  • DawnRank ranks candidate driver genes using the PageRank algorithm.

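  • HotNet is based on a genome-scale interaction network and uses a network diffusion approach to detect significantly mutated pathways.

  • HotNet2 uses an insulated heat diffusion process to detect subnetworks (overcoming the limitations of existing single-gene, pathway and network methods), and it can identify subnetworks made up of rare mutations.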
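
Many of the tools above propagate mutation signal through an interaction network. The sketch below shows a plain random walk with restart on a toy undirected graph, the kind of algorithm named for VarWalker and closely related to PageRank and diffusion methods. It is not the implementation of any of these tools; the function name, the restart probability and the four-node network are hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Steady-state visiting probabilities of a walker that restarts at the seed nodes.

    adjacency: dict mapping each node to a list of its neighbours.
    Assumptions of this sketch: every node has at least one neighbour,
    and every seed is a node of the graph.
    """
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    current = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed node.
        updated = {n: restart * start[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour.
        for node in nodes:
            share = (1.0 - restart) * current[node] / len(adjacency[node])
            for neighbour in adjacency[node]:
                updated[neighbour] += share
        change = sum(abs(updated[n] - current[n]) for n in nodes)
        current = updated
        if change < tol:
            break
    return current

# Toy path-shaped network A - B - C - D; gene A is the mutated seed.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds={"A"})
for gene, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{gene}: {score:.4f}")
```

Nodes close to the seed receive higher scores, which is how network methods can prioritize rarely mutated genes that sit next to frequently mutated ones.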

2.5 Data integration based

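Building network analysis methods that integrate somatic mutations, structural variants, gene expression and methylation profiles is an important research direction. CNVs are critical here: about 15% of the human genome shows copy-number variation.

  • Driver Oncogene and Tumor Suppressor (DOTS)-Finder integrates three kinds of information: mutation position, functional impact, and mutation frequency.

  • SVMerge integrates several existing algorithms to detect structural variants and breakpoints.

  • CONEXIC identifies driver mutations by integrating CNV and gene expression data.

  • Helios identifies SMGs in large recurrently amplified regions by integrating genomic data with RNA interference data (example: RSF-1-mediated tumorigenesis and metastasis in vivo).

  • OncoIMPACT predicts patient-specific driver mutations based on phenotypic impact.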

3. Challenges on current approaches

3.1 Tumor heterogeneity and sample saturation

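  • If the sample is impure when it is collected, a large share of the tumor's aberrant signal may be masked.

  • Sequencing technologies have limits: each has its own strengths, but none is guaranteed to capture 100% of alterations, so it is important to analyze the same feature with several complementary technologies and approaches.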

3.2 The accuracy of somatic mutation calling

  • The detection rate (sensitivity) of different mutation calling algorithms ranges from 0.559 to 0.994.

  • Precision ranges from 0.101 to 0.997.
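
For reference, the two metrics quoted above can be computed by comparing a caller's output with a validated truth set. The sketch below is a minimal illustration; the function name and the variant tuples are hypothetical, and real benchmarks must also normalize variant representation before comparing.

```python
def precision_and_sensitivity(called_variants, true_variants):
    """Compare a call set with a truth set; variants are (chrom, pos, ref, alt) tuples."""
    called = set(called_variants)
    truth = set(true_variants)
    true_positives = len(called & truth)
    # Precision: fraction of reported calls that are real.
    precision = true_positives / len(called) if called else 0.0
    # Sensitivity (detection rate): fraction of real variants that were reported.
    sensitivity = true_positives / len(truth) if truth else 0.0
    return precision, sensitivity

# Hypothetical truth set of five somatic SNVs and a caller that reports four variants.
truth_set = [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 75, "C", "T"),
             ("chr3", 10, "T", "G"), ("chr7", 900, "G", "A")]
call_set = [("chr1", 100, "A", "T"), ("chr1", 250, "G", "C"), ("chr2", 75, "C", "T"),
            ("chr5", 42, "A", "G")]
precision, sensitivity = precision_and_sensitivity(call_set, truth_set)
print(f"precision = {precision:.2f}, sensitivity = {sensitivity:.2f}")
```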

3.3 Functional synonymous mutations in cancer

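  • The authors note that studies have shown that synonymous mutations near a splice site can inactivate that splice site and thereby cause loss of gene function (for example in TP53).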

References

  1. Cheng, F., Zhao, J., & Zhao, Z. (2016). Advances in computational approaches for prioritizing driver mutations and significantly mutated genes in cancer genomes. Briefings in bioinformatics, 17(4), 642–656. https://doi.org/10.1093/bib/bbv068
