DNA 2. SCI 文章中基因组变异分析神器之 maftools

点击关注,桓峰基因

桓峰基因
生物信息分析,SCI文章撰写及生物信息基础知识学习:R语言学习,perl基础编程,linux系统命令,Python遇见更好的你
80篇原创内容 -->
公众号

前       言

随着癌症基因组学的进步,突变注释格式(MAF)正在被广泛接受并用于存储检测到的体细胞变异。癌症基因组图谱项目已经对30多种不同的癌症进行了测序,每种癌症的样本量都超过了200个。由体细胞变体组成的结果数据以突变注释格式的形式存储。只要数据是MAF格式,本包试图从TCGA来源或任何内部研究中以有效的方式总结、分析、注释和可视化MAF文件。

实例解析

1. 软件安装

在安装这个软件maftools时,需要先安装BioManager,然后再安装maftools,如下:

if (!require("BiocManager")) {
install.packages("BiocManager")
}
if (!require(maftools)) {
BiocManager::install("maftools")
}
library(maftools)

2. 数据读取

maftools工具需要读入两个文件,如下:

1.MAF文件-可以是gz压缩。必需的;

2.与MAF中每个样本/肿瘤样本条码相关的可选推荐的临床数据;

3.一个可选的拷贝数数据:可以是GISTIC输出或自定义表。

1.maf文件格式

MAF文件包含许多字段,从染色体名称到cosmic注释。然而,maftools中的大多数分析使用以下列如下:1.Hugo_Symbol;

2.Chromosome;

3.Start_Position;

4.End_Position;

5.Reference_Allele;

6.Tumor_Seq_Allele2;

7.Variant_Classification;

8.Variant_Type;

9.Tumor_Sample_Barcode.

同时读取maf文件和临床信息文件,看下结果,如下:

# path to TCGA LAML MAF file
laml.maf = system.file("extdata", "tcga_laml.maf.gz", package = "maftools")
# clinical information containing survival information and histology. This is
# optional
laml.clin = system.file("extdata", "tcga_laml_annot.tsv", package = "maftools")
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
## -Reading
## -Validating
## -Silent variants: 475
## -Summarizing
## -Processing clinical data
## -Finished in 3.790s elapsed (0.550s cpu)
print(laml@data[1:5, ])
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1: ABCA10 10349 genome.wustl.edu 37 17
## 2: ABCA4 24 genome.wustl.edu 37 1
## 3: ABCB11 8647 genome.wustl.edu 37 2
## 4: ABCC3 8714 genome.wustl.edu 37 17
## 5: ABCF1 23 genome.wustl.edu 37 6
## Start_Position End_Position Strand Variant_Classification Variant_Type
## 1: 67170917 67170917 + Splice_Site SNP
## 2: 94490594 94490594 + Missense_Mutation SNP
## 3: 169780250 169780250 + Missense_Mutation SNP
## 4: 48760974 48760974 + Missense_Mutation SNP
## 5: 30554429 30554429 + Missense_Mutation SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1: T T C TCGA-AB-2988
## 2: C C T TCGA-AB-2869
## 3: G G A TCGA-AB-3009
## 4: C C T TCGA-AB-2887
## 5: G G A TCGA-AB-2920
## Protein_Change i_TumorVAF_WU i_transcript_name
## 1: p.K960R 45.66000 NM_080282.3
## 2: p.R1517H 38.12000 NM_000350.2
## 3: p.A1283V 46.97218 NM_003742.2
## 4: p.P1271S 56.41000 NM_003786.1
## 5: p.G658S 40.95000 NM_001025091.1

2.临床信息格式

临床数据格式包括:每个样本/肿瘤样本条码,和对应的临床数据,如下:

print([email protected][1:5, ])
## Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1: TCGA-AB-2802 M4 365
## 2: TCGA-AB-2803 M3 792
## 3: TCGA-AB-2804 M3 2557
## 4: TCGA-AB-2805 M0 577
## 5: TCGA-AB-2806 M1 945
## Overall_Survival_Status
## 1: 1
## 2: 1
## 3: 0
## 4: 1
## 5: 1

3.拷贝数变异格式

拷贝数数据包含样本名称,基因名称和拷贝数状态(Amp或Del)。

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes,
gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)
## -Processing Gistic files..
## --Processing amp_genes.conf_99.txt
## --Processing del_genes.conf_99.txt
## --Processing scores.gistic
## --Summarizing by samples
laml.gistic@data[1:5, ]
## Hugo_Symbol Tumor_Sample_Barcode Variant_Classification Variant_Type
## 1: KMT2A TCGA-AB-2805 Amp CNV
## 2: LINC00689 TCGA-AB-2805 Del CNV
## 3: MIR595 TCGA-AB-2805 Del CNV
## 4: RN7SL142P TCGA-AB-2805 Del CNV
## 5: SHH TCGA-AB-2805 Del CNV
## Cytoband
## 1: AP_2:11q23.3
## 2: DP_4:7q32.3
## 3: DP_4:7q32.3
## 4: DP_4:7q32.3
## 5: DP_4:7q32.3

3.数据统计

对整体数据情况进行汇总,统计样本量,突变基因量,以及突变类型,最后整理到表格中,如下:

# Shows sample summry.
getSampleSummary(laml)
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA-AB-3009 0 5 0
## 2: TCGA-AB-2807 1 0 1
## 3: TCGA-AB-2959 0 0 0
## 4: TCGA-AB-3002 0 0 0
## 5: TCGA-AB-2849 0 1 0
## ---
## 189: TCGA-AB-2942 0 0 0
## 190: TCGA-AB-2946 0 0 0
## 191: TCGA-AB-2954 0 0 0
## 192: TCGA-AB-2982 0 0 0
## 193: TCGA-AB-2903 0 0 0
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
## 3: 0 22 0 1 23
## 4: 0 15 1 5 21
## 5: 0 16 1 2 20
## ---
## 189: 1 0 0 0 1
## 190: 0 1 0 0 1
## 191: 0 1 0 0 1
## 192: 0 1 0 0 1
## 193: 0 0 0 0 0
# Shows gene summary.
getGeneSummary(laml)
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1: FLT3 0 0 1 33
## 2: DNMT3A 4 0 0 0
## 3: NPM1 0 33 0 0
## 4: IDH2 0 0 0 0
## 5: IDH1 0 0 0 0
## ---
## 1237: ZNF689 0 0 0 0
## 1238: ZNF75D 0 0 0 0
## 1239: ZNF827 1 0 0 0
## 1240: ZNF99 0 0 0 0
## 1241: ZPBP 0 0 0 0
## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1: 15 0 3 52 52
## 2: 39 5 6 54 48
## 3: 1 0 0 34 33
## 4: 20 0 0 20 20
## 5: 18 0 0 18 18
## ---
## 1237: 1 0 0 1 1
## 1238: 1 0 0 1 1
## 1239: 0 0 0 1 1
## 1240: 1 0 0 1 1
## 1241: 1 0 0 1 1
## AlteredSamples
## 1: 52
## 2: 48
## 3: 33
## 4: 20
## 5: 18
## ---
## 1237: 1
## 1238: 1
## 1239: 1
## 1240: 1
## 1241: 1
# shows clinical data associated with samples
getClinicalData(laml)
## Tumor_Sample_Barcode FAB_classification days_to_last_followup
## 1: TCGA-AB-2802 M4 365
## 2: TCGA-AB-2803 M3 792
## 3: TCGA-AB-2804 M3 2557
## 4: TCGA-AB-2805 M0 577
## 5: TCGA-AB-2806 M1 945
## ---
## 189: TCGA-AB-3007 M3 1581
## 190: TCGA-AB-3008 M1 822
## 191: TCGA-AB-3009 M4 577
## 192: TCGA-AB-3011 M1 1885
## 193: TCGA-AB-3012 M3 1887
## Overall_Survival_Status
## 1: 1
## 2: 1
## 3: 0
## 4: 1
## 5: 1
## ---
## 189: 0
## 190: 1
## 191: 1
## 192: 0
## 193: 0
# Shows all fields in MAF
getFields(laml)
## [1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
## [4] "NCBI_Build" "Chromosome" "Start_Position"
## [7] "End_Position" "Strand" "Variant_Classification"
## [10] "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
## [13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode" "Protein_Change"
## [16] "i_TumorVAF_WU" "i_transcript_name"
# Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = "laml")

数据汇总统计绘图,将每个样本中的变量数量显示为堆叠的barplot,将变量类型显示为Variant_Classification汇总的箱线图。如下:

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = "median", dashboard = TRUE,
titvRaw = FALSE)

4.突变数据可视化

瀑布图

# oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

顺反式图

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
# plot titv summary
plotTiTv(res = laml.titv)

棒棒糖图(氨基酸)

# lollipop plot for DNMT3A, which is one of the most frequent mutated gene in
# Leukemia.
lollipopPlot(maf = laml, gene = "DNMT3A", AACol = "Protein_Change", showMutationRate = TRUE,
labelPos = 882)
## HGNC refseq.ID protein.ID aa.length
## 1: DNMT3A NM_022552 NP_072046 912
## 2: DNMT3A NM_153759 NP_715640 723
## 3: DNMT3A NM_175629 NP_783328 912

棒棒糖图(蛋白)

plotProtein(gene = "TP53", refSeqID = "NM_000546")

降雨图

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.4)
## Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
## 1: 8 98129348 98133560 7 702.0000 4212
## 2: 8 98398549 98403536 9 623.3750 4987
## 3: 8 98453076 98456466 9 423.7500 3390
## 4: 8 124090377 124096810 22 306.3333 6433
## 5: 12 97436055 97439705 7 608.3333 3650
## 6: 17 29332072 29336153 8 583.0000 4081
## Tumor_Sample_Barcode C>G C>T
## 1: TCGA-A8-A08B 4 3
## 2: TCGA-A8-A08B 1 8
## 3: TCGA-A8-A08B 1 8
## 4: TCGA-A8-A08B 1 21
## 5: TCGA-A8-A08B 4 3
## 6: TCGA-A8-A08B 4 4

突变负荷与TCGA比较

laml.mutload = tcgaCompare(maf = laml, cohortName = "Example-LAML", logscale = TRUE,
capture_size = 50)

VAF分布图

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

CNV可视化

基因组分布

gisticChromPlot(gistic = laml.gistic, markBands = "all")

气泡图

gisticBubblePlot(gistic = laml.gistic)

瀑布图

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = "FAB_classification",
sortByAnnotation = TRUE, top = 10)

CBS 可视化

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)
## NULL

这期主要说说单纯的可视化数据,下期说说利用maftools分析一些实质性的操作,敬请期待,想学习生信的可以到B站,每天上传教程!

我们在也推出克隆进化的系列文章,从组织到单细胞的克隆进化分析都有介绍,供大家参考:

Topic 1.克隆进化之sciClone 

Topic 2.克隆进化之ClonEvol 

Topic 3. 克隆进化之 fishplot 

Topic 4. 克隆进化之 Pyclone 

Topic 5. 克隆进化之 CITUP

Topic 6. 克隆进化之 Canopy

Topic 7. 克隆进化之 Cardelino

Topic 8. 克隆进化之 RobustClone

Topic 9. 克隆进化之 TimeScape 

Clone 1. 肿瘤克隆进化之前世今生

Clone 2. 肿瘤克隆进化之不同进化模

有做肿瘤的转移,癌症的耐药性亚克隆,靶向用药等课题的老师都可以直接联系我,保证最好的服务,最顶级的科研方法,助力高分!

References:

  • Mayakonda A, Lin DC, Assenov Y, Plass C, Koeffler HP. 2018. Maftools: efficient and comprehensive analysis of somatic variants in cancer. Genome Resarch. PMID: 30341162
  • 本文使用 文章同步助手 同步

    你可能感兴趣的:(DNA 2. SCI 文章中基因组变异分析神器之 maftools)