Conpair: checking concordance and contamination in paired tumor/normal samples

https://github.com/nygenome/conpair

Dependencies: python, numpy, scipy, GATK3

Installing numpy and scipy:

sudo pip install numpy
sudo pip install scipy

GATK4 does not work; I used 3.8.

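1. The official guide says to edit the configuration file. However, CONPAIR_DIR and GATK_JAR can both be supplied as command-line parameters (--conpair_dir, --gatk), whereas PYTHONPATH has no such parameter, so I only added CONPAIR_DIR and PYTHONPATH to the configuration file (the GATK_JAR line below can be left out if you pass --gatk instead):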

sudo vi /etc/profile
export CONPAIR_DIR=/your/path/to/CONPAIR  
export GATK_JAR=/your/path/to/GenomeAnalysisTK.jar
export PYTHONPATH=${PYTHONPATH}:/your/path/to/CONPAIR/modules/
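
After reloading the profile (for example with `source /etc/profile`), a quick sanity check can confirm that the variables are visible. The sketch below is only an illustration: the function name check_conpair_env is made up here and is not part of Conpair. It only tests that CONPAIR_DIR is set, that it contains a modules directory, and that PYTHONPATH mentions that directory.

```shell
# Hypothetical helper (not part of Conpair): check that the variables
# exported in /etc/profile are visible in the current shell.
check_conpair_env() {
    dir="${CONPAIR_DIR:-}"
    dir="${dir%/}"                      # tolerate a trailing slash
    if [ -z "$dir" ]; then
        echo "CONPAIR_DIR is not set"
        return 1
    fi
    if [ ! -d "$dir/modules" ]; then
        echo "no modules directory under $dir"
        return 1
    fi
    # Wrap PYTHONPATH in colons so every entry starts with ":".
    case ":${PYTHONPATH:-}:" in
        *":$dir/modules"*) ;;
        *) echo "PYTHONPATH does not include $dir/modules"; return 1 ;;
    esac
    echo "Conpair environment looks OK"
}
```

Call `check_conpair_env` in a new shell; a non-zero return code means one of the exports did not take effect.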

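2. Three reference genome files are required, but they do not all need to be listed after --reference; give only the first one (the .fai and .dict just have to sit in the same directory):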
human_g1k_v37.fa
human_g1k_v37.fa.fai
human_g1k_v37.dict
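
Before starting a long pileup run, it can help to confirm that all three files are actually in place. The following is a minimal sketch, assuming the naming shown above (the .fai appended to the full FASTA name, the .dict replacing the last extension); check_reference is a hypothetical helper name, not a Conpair command.

```shell
# Hypothetical helper (not part of Conpair): verify that the index and
# dictionary sit next to the FASTA that will be passed to --reference.
check_reference() {
    fasta="$1"
    base="${fasta%.*}"                  # strip the last extension, e.g. .fa
    missing=0
    for f in "$fasta" "$fasta.fai" "$base.dict"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Usage would look like `check_reference /your/path/to/human_g1k_v37.fa`. If a file is reported missing, `samtools faidx` and Picard's `CreateSequenceDictionary` are the usual tools for generating the .fai and .dict.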

3. Generate the pileup files (one for Tumor, one for Normal)

run_gatk_pileup_for_sample.py -B TUMOR_bam -O TUMOR_pileup
run_gatk_pileup_for_sample.py -B NORMAL_bam -O NORMAL_pileup

Other parameters:

--reference REFERENCE               reference genome in the fasta format, two additional files (.fai, .dict) located in the same directory as the fasta file are required. You may choose to avoid specifying the reference by following the steps in the "default reference genome" section above.
--markers MARKERS                   the set of preselected genomic positions in the BED format. Default: ${CONPAIR_DIR}/data/markers/GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.bed
--conpair_dir CONPAIR_DIR           path to ${CONPAIR_DIR}
--gatk GATK                         path to GATK JAR [$GATK by default]
--java JAVA                         path to JAVA [java by default]
--temp_dir_java TEMP_DIR_JAVA       java temporary directory to set -Djava.io.tmpdir
--xmx_java  XMX_JAVA                Xmx java memory setting [default: 12g]

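The main ones to add are --reference and --gatk.
The --markers file ships with the download, so as long as CONPAIR_DIR is set in the configuration file and the markers folder has not been moved, you do not need to specify it.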
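
Putting the two main options together, the pileup step for both samples might look like the sketch below. It only prints the commands (a dry run) so they can be checked first. The paths and the TUMOR_bam / NORMAL_bam names are placeholders, and pileup_cmd is a hypothetical helper name, not part of Conpair.

```shell
# Placeholder paths; replace with your own.
REFERENCE=/your/path/to/human_g1k_v37.fa
GATK_JAR=/your/path/to/GenomeAnalysisTK.jar

# Print the pileup command for one sample (TUMOR or NORMAL).
pileup_cmd() {
    sample="$1"
    echo "run_gatk_pileup_for_sample.py -B ${sample}_bam -O ${sample}_pileup" \
         "--reference $REFERENCE --gatk $GATK_JAR"
}

# Dry run: show both commands without executing them.
for s in TUMOR NORMAL; do
    pileup_cmd "$s"
done
```

Once the printed commands look right, they can be pasted into the terminal, or piped to `sh` if none of the paths contain spaces.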

4. Verify Tumor/Normal concordance

verify_concordance.py -T TUMOR_pileup -N NORMAL_pileup

Optional:
--help                              show help message and exit
--outfile OUTFILE                   write output to OUTFILE
--normal_homozygous_markers_only    use only normal homozygous positions to calculate concordance between TUMOR and NORMAL 
--min_cov MIN_COV                   require min of MIN_COV in both TUMOR and NORMAL to use the marker
--min_mapping_quality MIN_MAP_QUAL  do not use reads with mapping qual below MIN_MAP_QUAL [default: 10]
--min_base_quality  MIN_BASE_QUAL   do not use reads with base qual below MIN_BASE_QUAL of a specified position [default: 20]
--markers MARKERS                   the set of preselected genomic positions in the TXT format. Default: ${CONPAIR_DIR}/data/markers/GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.txt

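The end of the official documentation also says that, to account for CNV, it is best to add the -H flag. The help text does not list a flag under that name (it is presumably the short form of --normal_homozygous_markers_only above). I tried adding -H, and concordance rose from 99.18% to 100%.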

To eliminate the effect of copy number variation on the concordance levels, we recommend using the -H flag. If two samples are concordant the expected concordance level should be close to 99-100%.
For discordant samples concordance level should be close to 40%.
You can observe slightly lower concordance (80-99%) in presence of contamination and/or copy number changes (if the -H option wasn't used) in at least one of the samples.
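
The ranges quoted above can be turned into a rough rule of thumb. The sketch below is an illustration only: interpret_concordance is a hypothetical helper, and the cut-offs of 99 and 80 are assumptions read off the quoted ranges, not values defined by Conpair. It takes the concordance percentage as a plain number.

```shell
# Hypothetical helper: map a concordance percentage onto the ranges quoted
# from the official documentation. Cut-offs (99, 80) are assumptions.
interpret_concordance() {
    awk -v c="$1" 'BEGIN {
        if (c >= 99)      print "concordant"
        else if (c >= 80) print "lower than expected: check contamination or copy number changes"
        else              print "likely discordant"
    }'
}

interpret_concordance 99.18
interpret_concordance 41.5
```

Values that fall between the documented ranges (for example 60%) are reported as likely discordant here, but in practice they deserve a closer look.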

5. Estimate the contamination level

estimate_tumor_normal_contamination.py -T TUMOR_pileup -N NORMAL_pileup

Optional:
--help                              show help message and exit
--outfile OUTFILE                   write output to OUTFILE
--min_mapping_quality MIN_MAP_QUAL  do not use reads with mapping qual below MIN_MAP_QUAL [default: 10] 
--markers MARKERS                   the set of preselected genomic positions in the TXT format. Default: ${CONPAIR_DIR}/data/markers/GRCh37.autosomes.phase3_shapeit2_mvncall_integrated.20130502.SNV.genotype.sselect_v4_MAF_0.4_LD_0.8.txt
--conpair_dir CONPAIR_DIR           path to ${CONPAIR_DIR}
--grid  GRID                        grid interval [default: 0.01]

Even a very low contamination level (such as 0.5%) in the tumor sample will have a severe effect on calling somatic mutations, resulting in decreased specificity. Cross-individual contamination in the normal sample usually has a milder effect on somatic calling.
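
Since even around 0.5% contamination in the tumor matters, a simple threshold check on the reported level can be useful in a pipeline. This is a minimal sketch under stated assumptions: flag_tumor_contamination is a hypothetical helper, the level is passed in by hand as a percentage, and the default limit of 0.5 is taken from the example figure above rather than from any Conpair setting.

```shell
# Hypothetical helper: warn when the tumor contamination level (in percent)
# reaches a limit. The default limit of 0.5 is an assumption.
flag_tumor_contamination() {
    awk -v level="$1" -v limit="${2:-0.5}" 'BEGIN {
        if (level >= limit) {
            print "warning: tumor contamination " level "% >= " limit "%"
            exit 1
        }
        print "ok: tumor contamination " level "% < " limit "%"
    }'
}

flag_tumor_contamination 0.2
```

The function returns a non-zero status when the limit is reached, so it can gate a later somatic-calling step; a second argument overrides the limit, as in `flag_tumor_contamination 0.7 1.0`.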
