昨天在复习Linux基础知识,没有继续记录,今天继续
参考学习资料TCGA CNV全攻略
具体数据处理流程见NIH的TCGA官网:https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/
参考文献:http://mcr.aacrjournals.org/content/12/4/485.long
先看处理流程:
Copy Number Variation Analysis Pipeline
简介:
This pipeline is built onto the existing TCGA level 2 data generated by Birdsuite and uses the DNAcopy R-package to perform a circular binary segmentation (CBS) analysis [1]. The GDC further transforms these copy number values into segment mean values, which are equal to log2(copy-number/ 2). Diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.
The GRCh38 probe-set was produced by mapping probe sequences to the GRCh38 reference genome and can be downloaded at the GDC Reference File Website.
Copy Number Estimation
Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 [2], [3] on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:
- Genes with focal CNV values smaller than -0.3 are categorized as a "loss" (-1)
- Genes with focal CNV values larger than 0.3 are categorized as a "gain" (+1)
- Genes with focal CNV values between and including -0.3 and 0.3 are categorized as "neutral" (0).
GISTIC2 Command Line Parameters
gistic2 \
-b \
-seg \
-mk \
-refgene \
-ta 0.1 \
-armpeel 1 \
-brlen 0.7 \
-cap 1.5 \
-conf 0.99 \
-td 0.1 \
-genegistic 1 \
-gcm extreme \
-js 4 \
-maxseg 2000 \
-qvt 0.25 \
-rx 0 \
-savegene 1 \
(-broad 1)
[1] Olshen, Adam B., E. S. Venkatraman, Robert Lucito, and Michael Wigler. "Circular binary segmentation for the analysis of array-based DNA copy number data." Biostatistics 5, no. 4 (2004): 557-572.
[2] Mermel, Craig H., Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. "GISTIC2. 0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers." Genome biology 12, no. 4 (2011): R41.
[3] Beroukhim, Rameen, Craig H. Mermel, Dale Porter, Guo Wei, Soumya Raychaudhuri, Jerry Donovan, Jordi Barretina et al. "The landscape of somatic copy-number alteration across human cancers." Nature 463, no. 7283 (2010): 899.
根据曾老师的例子下载数据
wget -c -r -np -nH -k -L --cut-dirs 6 -p -A "*snp_6*hg19*Level_3*" http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/
如果要下载其它癌症种类,只需要改变url里面的BRCA即可。如果要下载其它类型的数据,只需要改变-A 后面的匹配规则即可,其实就是打开上面url看到的几十个文件的文件名的规律。
打开网站可以看到
$ cd Downloads/
$ wget -c -r -np -nH -k -L --cut-dirs 6 -p -A "*snp_6*hg19*Level_3*" http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/
evel_3.2016012800.0 28%[====> ] 5.19M 7.15KB/s eta 23m 51s
下载速度限制,估计需要下载半个小时,如果用加速插件估计会快点。反正是在探索,而且时间也不是太长,就继续看教程
下载完成总共耗时50分钟
FINISHED --2019-11-08 10:39:06--
Total wall clock time: 50m 13s
Downloaded: 17 files, 19M in 50m 0s (6.62 KB/s)
Converted links in 0 files in 0 seconds.
(test) Cheng-MacBook-Pro:Downloads chelsea$
ls查看下载内容如下
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA-FFPE.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz
gdac.broadinstitute.org_BRCA.Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.tar.gz.md5
可以看到,下载的文件中有一个名称里面包含Merge_snp__genome_wide_snp_6__broad_mit_edu__Level_3__segmented_scna_minus_germline_cnv_hg19__seg.Level_3.2016012800.0.0.
,其中minus了germline的CNV的就是需要分析的somatic CNV.
#先安装一下bedtools
(test) Cheng-MacBook-Pro:Downloads chelsea$ cd ~
(test) Cheng-MacBook-Pro:~ chelsea$ cd biosoft/
(test) Cheng-MacBook-Pro:biosoft chelsea$ conda install -y bedtools
...
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
(test) Cheng-MacBook-Pro:biosoft chelsea$
拿到CNV做什么?
首先两个segment文本文件已经可以直接载入IGV查看所有BRCA样本的CNV情况,如下所示:
还可以进行CNV深度分析
注释基因
既然是对基因组片段做基因注释,那么首先就需要拿到基因的坐标信息咯,我是在gencode数据库里面下载,然后解析成下面的bed格式的,如下:
head ~/reference/gtf/gencode/protein_coding.hg19.position
chr1 69091 70008 OR4F5
chr1 367640 368634 OR4F29
chr1 621096 622034 OR4F16
chr1 859308 879961 SAMD11
chr1 879584 894689 NOC2L
chr1 895967 901095 KLHL17
chr1 901877 911245 PLEKHN1
chr1 910584 917473 PERM1
chr1 934342 935552 HES4
chr1 936518 949921 ISG15
然后要把下载的CNV文本文件,转为bed格式的,就是把列的顺序调换一下:
head Features.bed
chr1 3218610 95674710 TCGA-3C-AAAU-10A-01D-A41E-01 53225 0.0055
chr1 95676511 95676518 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.6636
chr1 95680124 167057183 TCGA-3C-AAAU-10A-01D-A41E-01 24886 0.0053
chr1 167057495 167059336 TCGA-3C-AAAU-10A-01D-A41E-01 3 -1.0999
chr1 167059760 181602002 TCGA-3C-AAAU-10A-01D-A41E-01 9213 -8e-04
chr1 181603120 181609567 TCGA-3C-AAAU-10A-01D-A41E-01 6 -1.2009
chr1 181610685 201473647 TCGA-3C-AAAU-10A-01D-A41E-01 12002 0.0055
chr1 201474400 201474544 TCGA-3C-AAAU-10A-01D-A41E-01 2 -1.4235
chr1 201475220 247813706 TCGA-3C-AAAU-10A-01D-A41E-01 29781 -4e-04
避免重复造轮子,曾老师用其擅长的bedtools解决这个需求,命令如下:
bedtools intersect -a Features.bed -b ~/reference/gtf/gencode/protein_coding.hg19.position -wa -wb \
| bedtools groupby -i - -g 1-4 -c 10 -o collapse
注释结果,可以看到,每个CNV片段都注释到了对应的基因,有些特别大的片段,会被注释到非常多的基因。
chr8 42584924 42783715 TCGA-5T-A9QA-01A-11D-A41E-01 CHRNB3,CHRNA6,THAP1,RNF170,HOOK3
chr8 42789728 42793594 TCGA-5T-A9QA-01A-11D-A41E-01 HOOK3
chr8 42797957 42933372 TCGA-5T-A9QA-01A-11D-A41E-01 RP11-598P20.5,FNTA,HOOK3
chr8 70952673 70964372 TCGA-5T-A9QA-01A-11D-A41E-01 PRDM14
chr10 42947970 43833200 TCGA-5T-A9QA-01A-11D-A41E-01 BMS1,RET,RASGEF1A,ZNF33B,CSGALNACT2
chr10 106384615 106473355 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 106478366 107298256 TCGA-5T-A9QA-01A-11D-A41E-01 SORCS3
chr10 117457285 117457859 TCGA-5T-A9QA-01A-11D-A41E-01 ATRNL1
chr11 68990173 69277078 TCGA-5T-A9QA-01A-11D-A41E-01 MYEOV
chr11 76378708 76926535 TCGA-5T-A9QA-01A-11D-A41E-01 LRRC32,B3GNT6,OMP,TSKU,MYO7A,ACER3,CAPN5
找SOMATIC CNVS
仔细看上面IGV的pattern你会发现某些染色体的某些片段经常会扩增或者缺失,这个现象就是人们想研究是recurrent CNV regions,当然不会用肉眼看咯,这时候需要用GISTIC这个软件。找到了recurrent CNV regions同样是需要进行基因注释,才能进行下游分析咯。
PICNIC软件用法:http://www.bio-info-trainee.com/1299.html
GISTIC软件用法:http://www.bio-info-trainee.com/1648.html
TCGA : http://www.biotrainee.com/thread-1696-1-1.html
从今天的教程中知道了CNV的分析还需要用到的软件:IGV,bedtools,picnic,gistic,又有得学了。