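Chromosome splitting - index building - alignment - horizontal analysis (read coverage, mapping rate, sequencing depth) - vertical analysis (SNP, indel)
1. Chromosome splitting
0. Get the line number of each chromosome header, in preparation for separating the chromosomes (the clumsiest possible method; suggestions for a better one are welcome)
(First download the two reference genomes from EnsemblPlants.)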
cat Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa |grep -n "dna:chromosome" >Aet_position.txt
cat Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa |grep -n "dna:chromosome" >WEW_position.txt
sed -n '1,8372172p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >1D
sed -n '8372173,19233192p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >2D
sed -n '19233193,29686238p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >3D
sed -n '29686239,38453219p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >4D
sed -n '38453220,48076148p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >5D
sed -n '48076149,56343142p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >6D
sed -n '56343143,70578551p' Aegilops_tauschii.Aet_v4.0.dna.toplevel.fa >7D
sed -n '1,21402080p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >1AB
sed -n '21402081,47711240p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >2AB
sed -n '47711241,74300756p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >3AB
sed -n '74300757,97639496p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >4AB
sed -n '97639497,121190107p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >5AB
sed -n '121190108,143267599p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >6AB
sed -n '143267600,167984010p' Triticum_dicoccoides.WEWSeq_v.1.0.dna.toplevel.fa >7AB
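The line ranges passed to sed above can be derived from the header line numbers instead of being typed by hand. A minimal Python sketch, assuming the header line numbers have been read from a file such as Aet_position.txt and that the total line count is known (for example from wc -l); the function name header_ranges is made up for illustration:

```python
def header_ranges(header_lines, total_lines):
    """Turn sorted 1-based header line numbers into (start, end) line ranges.

    Each record runs from its header line to the line before the next header;
    the last record runs to the end of the file.
    """
    ranges = []
    for i, start in enumerate(header_lines):
        if i + 1 < len(header_lines):
            end = header_lines[i + 1] - 1
        else:
            end = total_lines
        ranges.append((start, end))
    return ranges


# Header line numbers of 1D, 2D and 3D, taken from the sed commands above.
# Treating line 29686238 as the end of the input is only for this example.
for start, end in header_ranges([1, 8372173, 19233193], 29686238):
    print(f"sed -n '{start},{end}p'")
```

Each printed range matches the corresponding sed command, so the same function could generate the whole list for both genomes.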
# Merge the chromosomes in whatever order you need. I need 1A1B1D, 2A2B2D, ...: first merge 1AB and 1D into 1ABD, then concatenate 1ABD 2ABD 3ABD ... 7ABD into a reference named pseudo_wheat_ref
cat 1AB 1D >1ABD
cat -n 1ABD |grep "dna:chromosome"
sed '1c >chr1A' 1ABD >chr1A
sed '9893116c >chr1B' chr1A >chr1B
sed '21402081c >chr1D' chr1B >chr1D
mv chr1D chr1ABD
cat -n chr1ABD |grep "chr"
cat -n chr2ABD |grep "chr"
cat -n chr3ABD |grep "chr"
cat -n chr4ABD |grep "chr"
cat -n chr5ABD |grep "chr"
cat -n chr6ABD |grep "chr"
cat -n chr7ABD |grep "chr"
# line number >chr1A
# line number >chr1B
# line number >chr1D
# Search again for ">" to check whether any other sequences are present
cat -n chr1ABD |grep ">"
cat -n chr2ABD |grep ">"
cat -n chr3ABD |grep ">"
cat -n chr4ABD |grep ">"
cat -n chr5ABD |grep ">"
cat -n chr6ABD |grep ">"
cat -n chr7ABD |grep ">"
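# Drop the supercontig records entirely (header and sequence lines); removing only the header lines with grep -v would leave their sequence appended to the preceding chromosome
awk '/^>/{keep = ($0 !~ /dna:supercontig/)} keep' chr7ABD >chr7ABD_pure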
cat chr1ABD chr2ABD chr3ABD chr4ABD chr5ABD chr6ABD chr7ABD_pure >pure_ref_wheat.fasta
1.1 Because the wheat genome is so large, the tools can still produce SAM/BAM files, but samtools index then fails. What I ran into is shown below (it only cost me two weeks :( )
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup.bam
read mapping with mappers such as BWA, Tophat or STAR as the BAM output format used by these mappers limits the reference contig size to (2^29 - 1) bp (512 MB). Strictly speaking, the BAM files will be valid, however they cannot be indexed with "samtools index" so that random access to chromosomal regions is not possible.
samtools index --help does list an option for CSI indices (shown below), but a CSI index is hard to use in the downstream GATK analysis, so the chromosomes need to be split.
-m INT Set minimum interval size for CSI indices to 2^INT [14]
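To see which sequences are affected by the limit quoted above, the contig length can simply be compared with 2^29 - 1. A small Python sketch; the constant and function names are made up here, and the two example lengths are numbers that appear elsewhere in these notes:

```python
BAI_MAX_CONTIG = 2**29 - 1  # largest contig length quoted above for "samtools index"


def needs_split(length):
    """Return True if a contig of this length exceeds the quoted limit."""
    return length > BAI_MAX_CONTIG


print(BAI_MAX_CONTIG)           # 536870911
print(needs_split(594102056))   # chr1A size from the chromosome table further down: True
print(needs_split(416530229))   # chr1A split position from the script in 1.2: False
```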
1.2 Chromosome-splitting script [ref. 1]
touch split_wheat_chrom.py
vim split_wheat_chrom.py
import argparse
from Bio import SeqIO

# One positional argument: the FASTA file to split
parser = argparse.ArgumentParser(description="Split each chromosome into two parts")
parser.add_argument("fasta", help="input FASTA file")
args = parser.parse_args()

# Split position for each chromosome; chrUn (None) is written out unsplit
chr2_split_position = {'chr1A': 416530229, 'chr1B': 485877090, 'chr1D': 398856833, 'chr2A': 441318416, 'chr2B': 452187122, 'chr2D': 471155576, 'chr3A': 331357656, 'chr3B': 422034810, 'chr3D': 495018548, 'chr4A': 370990501, 'chr4B': 459574265, 'chr4D': 410577824, 'chr5A': 431266944, 'chr5B': 460582272, 'chr5D': 404944879, 'chr6A': 451707583, 'chr6B': 441332031, 'chr6D': 448911032, 'chr7A': 388979184, 'chr7B': 445475846, 'chr7D': 431442263, 'chrUn': None}

with open(args.fasta) as handle:
    for record in SeqIO.parse(handle, "fasta"):
        if record.id not in chr2_split_position:
            continue  # sequences not listed above are skipped, as in the original loop
        pos = chr2_split_position[record.id]
        if pos is None:
            print(">" + record.id + "\n" + str(record.seq))
        else:
            # part1 holds the first pos - 1 bases, part2 the rest
            print(">" + record.id + "_part1\n" + str(record.seq[:pos - 1]))
            print(">" + record.id + "_part2\n" + str(record.seq[pos - 1:]))
python split_wheat_chrom.py pure_ref_wheat.fasta >pure_ref_wheat_parts.fasta
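The slicing in the script decides where exactly the cut falls, and that matters later when positions reported on a _part2 sequence have to be mapped back to the whole chromosome. A toy Python sketch on a plain string (no Biopython needed); split_at and to_whole are names invented for this illustration:

```python
def split_at(seq, pos):
    """Split seq as the script does: part1 is seq[:pos - 1], part2 is the rest."""
    return seq[:pos - 1], seq[pos - 1:]


def to_whole(part, position, pos):
    """Convert a 1-based position on part 1 or 2 back to the unsplit chromosome."""
    return position if part == 1 else position + pos - 1


part1, part2 = split_at("ACGTACGT", 4)
print(part1, part2)        # ACG TACGT
print(to_whole(2, 1, 4))   # 4: the first base of part2 is base 4 of the original
```

So with a split position of 4, part1 carries bases 1 to 3 and part2 starts at base 4.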
1.3 Installing the module the splitting script needs
Python is required. Since I use conda, Python is certainly there already; only the module behind from Bio import SeqIO still has to be installed, with this command:
pip install biopython
Recent versions of Python (starting with Python 2.7.9 and Python 3.4) include the Python package management tool pip, which allows an easy installation from the command line on all platforms.
2 Building indexes
2.1 Index the reference genome
mkdir 0.index && cd 0.index
# Index the split reference, the same FASTA that the .dict, .fai and GATK steps below use
bwa index ~/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta
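2.2 Build a sequence dictionary (.dict); without it GATK fails when generating the VCF
# -R: input reference genome, FASTA or fasta.gz
# -O: output file, a SAM-format file holding only the sequence dictionary; by default the basename of the input with a .dict extension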
gatk CreateSequenceDictionary -R /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta -O pure_ref_wheat_parts.dict
2.3 Build the FASTA index (.fai)
nohup samtools faidx pure_ref_wheat_parts.fasta &
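Both the .dict and the .fai essentially record each sequence name with its length, so a quick independent check of the split reference is to compute those lengths directly. A minimal Python sketch, assuming an ordinary multi-line FASTA; fasta_lengths is a helper invented here, not part of samtools or GATK:

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The name is the header text up to the first whitespace
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = [">chr1A_part1\n", "ACGT\n", "AC\n", ">chr1A_part2\n", "GGG\n"]
print(fasta_lengths(example))  # {'chr1A_part1': 6, 'chr1A_part2': 3}
```

For the real file, pass an open file handle instead of the example list and compare the result with the first two columns of the .fai.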
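3. Alignment
3.1 Align to the reference genome and sort
Why sort? bwa mem writes alignments in the order of the input reads (that is, grouped by read name), whereas the downstream tools (samtools index, GATK) expect the records ordered by position on the reference, so the output has to be coordinate-sorted [ref. 2]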
cat >mapping.sh
vim mapping.sh
bwa mem -t 20 -R "@RG\tID:4AL_resequence\tSM:4AL_resequence\tLB:WGS\tPL:Illumina" \
$ref $fa/4AL_1.clean.fq.gz $fa/4AL_2.clean.fq.gz |samtools sort -@ 20 -o 4AL_TTD.sort.bam - 1>4AL_log.mark 2>&1
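The -R string is easy to get wrong by hand, especially the escaped tabs. A small Python sketch that assembles it from its fields; read_group is a made-up helper, and the field values are the ones used in the command above:

```python
def read_group(rg_id, sample, library, platform):
    """Build the @RG header line passed to bwa mem -R.

    The fields are joined with a literal backslash followed by 't', as on the
    command line above; bwa converts that into real tab characters.
    """
    fields = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}", f"PL:{platform}"]
    return "\\t".join(fields)


print(read_group("4AL_resequence", "4AL_resequence", "WGS", "Illumina"))
# @RG\tID:4AL_resequence\tSM:4AL_resequence\tLB:WGS\tPL:Illumina
```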
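3.2 Keep uniquely mapped reads
# -h: include the header; without it GATK fails later
# -q INT: keep only reads with mapping quality >= INT (here 1)
# -F INT: keep only reads with none of the given FLAG bits set (FLAG is column 2); 4 = read unmapped, 256 = secondary (not primary) alignment
# grep -v: invert the match, i.e. drop lines carrying an XA:Z or SA:Z tag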
samtools view -@ 20 -h -q 1 -F 256 4AL_TTD.sort.bam |grep -v XA:Z |grep -v SA:Z |samtools view -@ 20 -b - >4AL_TTD.sort.uniq.bam
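The same decision can be written out per alignment line, which makes it clearer what "unique" means here. A Python sketch under one stated simplification: it looks for XA:Z / SA:Z only among the optional tag columns, whereas grep -v matches anywhere in the line. The function name is_unique and the example lines are invented:

```python
def is_unique(sam_line, min_mapq=1, exclude_flags=256):
    """Return True if an alignment line passes the filters used above."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & exclude_flags:        # samtools view -F 256
        return False
    if mapq < min_mapq:             # samtools view -q 1
        return False
    # grep -v XA:Z / grep -v SA:Z: drop reads with alternative or supplementary hits
    tags = fields[11:]
    return not any(t.startswith(("XA:Z", "SA:Z")) for t in tags)


core = "\tchr1A_part1\t100\t{mapq}\t4M\t=\t200\t104\tACGT\tIIII"
print(is_unique("r1\t99" + core.format(mapq=60) + "\tNM:i:0"))                 # True
print(is_unique("r2\t99" + core.format(mapq=60) + "\tXA:Z:chr1B_part1,+50,4M,0;"))  # False
print(is_unique("r3\t99" + core.format(mapq=0)))                               # False
```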
4. Horizontal analysis: sequencing statistics
4.1 Mapping rate
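# samtools flagstat reports the number and percentage of mapped reads (samtools depth only gives per-base depth)
nohup samtools flagstat 4AL_TTD.sort.bam >./4AL_TTD.sort.bam.txt &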
4.2 Sequencing depth
nohup samtools depth 4AL_TTD.sort.bam -a >./4AL_TTD.sort.bam.depth &
4.3 Read coverage
nohup genomeCoverageBed -ibam 4AL_TTD.sort.bam -g /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pseudo_wheat_ref.fa -bga >./4AL_TTD_cov.bedgraph &
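The per-base table written by samtools depth -a in 4.2 (chromosome, position, depth) can be reduced to two summary numbers: mean depth, and the fraction of positions covered by at least one read. A minimal Python sketch; depth_summary is a made-up name, and for a genome this size the lines should be streamed from the file rather than held in a list:

```python
def depth_summary(lines):
    """Summarise samtools depth -a output (chrom, position, depth per line).

    Returns (mean depth, fraction of positions with depth >= 1).
    """
    total_depth = 0
    positions = 0
    covered = 0
    for line in lines:
        chrom, pos, depth = line.split()
        depth = int(depth)
        total_depth += depth
        positions += 1
        if depth > 0:
            covered += 1
    if positions == 0:
        return 0.0, 0.0
    return total_depth / positions, covered / positions


example = ["chr1A_part1\t1\t10", "chr1A_part1\t2\t0", "chr1A_part1\t3\t20", "chr1A_part1\t4\t10"]
mean_depth, coverage = depth_summary(example)
print(mean_depth, coverage)  # 10.0 0.75
```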
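5. Vertical analysis: variant calling
5.1.1 Mark PCR duplicates
• Mark/remove reads that are PCR duplicates
• This makes the downstream variant calls more reliable by removing false positives
Open question: what does "-Xmx20G -Djava.io.tmpdir=./" do? With it, CPU usage is very high; without it, the command fails. (As far as the options themselves go, -Xmx20G sets the maximum Java heap size to 20 GB and -Djava.io.tmpdir=./ makes Java write its temporary files to the current directory.)
mkdir gatk_markdup && cd gatk_markdup
cat >gatk_markdup.sh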
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" MarkDuplicates -I 4AL_TTD.sort.uniq.bam -O 4AL_TTD.sort.uniq_mkdup.bam -M 4AL_TTD_mkdup.metrics 1>4AL_TTD_mkdup_log.mark 2>&1
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup.bam
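5.1.2 Assign all reads to a read group
This step replaces the read group written into the header at the mapping step, "@RG\tID:4AL_resequence\tSM:4AL_resequence\tLB:WGS\tPL:Illumina", with -LB WGS -PL illumina -PU bwa -SM 4AL_TTD. If the read group was already set at the mapping step, this step can be skipped.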
gatk --list
# AddOrReplaceReadGroups: Assigns all the reads in a file to a single new read-group.
gatk AddOrReplaceReadGroups
#Required Arguments (input, output, -LB library, -PL sequencing platform, -PU platform unit, -SM sample name):
#--INPUT,-I:String Input file (BAM or SAM or a GA4GH url). Required.
#--OUTPUT,-O:File Output file (BAM or SAM). Required.
#--RGLB,-LB:String Read-Group library Required.
#--RGPL,-PL:String Read-Group platform (e.g. illumina, solid) Required.
#--RGPU,-PU:String Read-Group platform unit (eg. run barcode) Required.
#--RGSM,-SM:String Read-Group sample name Required.
cat >add_bam.sh
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" AddOrReplaceReadGroups -I 4AL_TTD.sort.uniq_mkdup.bam -O 4AL_TTD.sort.uniq_mkdup_add.bam -LB WGS -PL illumina -PU bwa -SM 4AL_TTD
samtools index -@ 20 4AL_TTD.sort.uniq_mkdup_add.bam
5.2 Variant calling
5.2.1 Generate the initial VCF (make sure the FASTA has both its index and its .dict)
mkdir vcf && cd vcf
gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller -R /home/wdd/WGS/4AL_TTD/pseudo_wheat_ref/pure_ref_wheat_parts.fasta -I /home/wdd/WGS/4AL_TTD/1.mapping/4AL_TTD.sort.uniq_mkdup_add.bam -O 4AL_TTD_raw.vcf 1>4AL_TTD_log.HC 2>&1
Speeding up HaplotypeCaller: one job per chromosome, run in parallel
touch chr_vcf.sh
vim chr_vcf.sh
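# $REF and $bam must be set beforehand to the reference FASTA and the input BAM
chroms=($(grep '>' $REF |sed 's/>//' | tr '\n' ' '))
for chr in ${chroms[@]}; do
    # Skip chromosomes whose VCF already exists
    if [ ! -f 4AL.${chr}.vcf.gz ]; then
        gatk HaplotypeCaller -R $REF -I $bam --genotyping-mode DISCOVERY \
            --intervals ${chr} --sample-ploidy 6 \
            -O 4AL.${chr}.vcf.gz &
    fi
done && wait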
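This is very fast, but there is a catch: running all 42 chromosomes at once will overwhelm the server, so the 42 chromosomes are run in 3 batches, listed in cs1.bed, cs2.bed and cs3.bed as follows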
cat cs1.bed
touch chr.sh
vim chr.sh
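# $REF and $bam must be set beforehand to the reference FASTA and the input BAM
chroms=($(awk '{print $1}' cs1.bed |tr '\n' ' '))
for chr in ${chroms[@]}; do
    # Skip chromosomes whose VCF already exists
    if [ ! -f 4AL.${chr}.vcf.gz ]; then
        gatk HaplotypeCaller -R $REF -I $bam --genotyping-mode DISCOVERY \
            --intervals ${chr} --sample-ploidy 6 \
            -O 4AL.${chr}.vcf.gz &
    fi
done && wait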
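The batch files themselves are not shown in these notes, so how the 42 names were divided is unknown. One simple way to produce three even batches is sketched below in Python; split_batches and the generated chromosome names (chr1A_part1 ... chr7D_part2, following the naming of the splitting script) are assumptions for illustration:

```python
def split_batches(items, n_batches):
    """Split items into n_batches consecutive groups of near-equal size."""
    size, extra = divmod(len(items), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


# 42 split chromosomes: 7 groups x 3 subgenomes x 2 parts
chroms = [f"chr{n}{g}_part{p}" for n in range(1, 8) for g in "ABD" for p in (1, 2)]
for i, batch in enumerate(split_batches(chroms, 3), start=1):
    print(f"cs{i}.bed: {len(batch)} chromosomes, {batch[0]} .. {batch[-1]}")
```

Because the parts differ in length, batches balanced by total base pairs rather than by count might finish more evenly; this sketch only balances the count.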
Each chromosome gets its own VCF and matching index; merge them once the parallel HaplotypeCaller jobs have finished
chroms=($(awk '{print $1}' cs.bed |tr '\n' ' '))
for chr in ${chroms[@]}; do
merge_vcfs=${merge_vcfs}" -I ${SM}.${chr}.vcf.gz"
done && gatk MergeVcfs ${merge_vcfs} -O ${SM}.HC.vcf.gz && echo "Vcfs have been successfully merged"
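Jobs started in the background with & can fail without stopping the loop, so it may be worth checking that every per-chromosome VCF exists before merging. A Python sketch that builds the same MergeVcfs argument list and refuses to continue if a file is missing; merge_command and its exists parameter are invented for this example, and the file naming follows the loop above:

```python
import os


def merge_command(sample, chroms, exists=os.path.exists):
    """Build the gatk MergeVcfs argument list for the per-chromosome VCFs.

    Raises FileNotFoundError if any expected input is missing.
    """
    args = ["gatk", "MergeVcfs"]
    missing = []
    for chrom in chroms:
        path = f"{sample}.{chrom}.vcf.gz"
        if not exists(path):
            missing.append(path)
        args += ["-I", path]
    if missing:
        raise FileNotFoundError("missing VCFs: " + ", ".join(missing))
    args += ["-O", f"{sample}.HC.vcf.gz"]
    return args


# Pretend both inputs exist so the example runs anywhere
print(" ".join(merge_command("4AL", ["chr1A_part1", "chr1A_part2"], exists=lambda p: True)))
```

The returned list could be handed to subprocess.run once the check has passed.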
[GATK 4] How to improve the efficiency of HaplotypeCaller
5.2.2 Select and filter SNPs
SNP/indel detection was performed using the GATK HaplotypeCaller (version 3.5-0 g36282e4) set for diploids with default filtering settings [54]. SNPs were preliminarily filtered using GATK VariantFiltration with the parameter --filterExpression "QD < 2.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0 || MQ < 40.0." The filtering settings for indels were "QD < 2.0, FS > 200.0," and "ReadPosRankSum < -20.0."
gatk SelectVariants -select-type SNP -V 4AL_resequence.vcf -O ./snp/4AL_resequence.snp.vcf
cd snp
gatk VariantFiltration -V 4AL_resequence.snp.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "Filter" -O 4AL_filter.snp.vcf
gatk SelectVariants --exclude-filtered true -V 4AL_filter.snp.vcf -O 4AL_filtered.snp.vcf
cat 4AL_filtered.snp.vcf| grep '1/1' >pure_4AL.snp.vcf
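The hard filter above is a chain of OR conditions: a site is flagged as soon as any one annotation crosses its threshold. A Python sketch of that logic with the same thresholds; fails_snp_filter is a made-up name, and skipping annotations that are absent is an assumption of this sketch, not a statement about how GATK treats them:

```python
# Thresholds copied from the --filter-expression above: (annotation, fails-if test)
SNP_CHECKS = [
    ("QD", lambda v: v < 2.0),
    ("MQ", lambda v: v < 40.0),
    ("FS", lambda v: v > 60.0),
    ("SOR", lambda v: v > 3.0),
    ("MQRankSum", lambda v: v < -12.5),
    ("ReadPosRankSum", lambda v: v < -8.0),
]


def fails_snp_filter(qual, info):
    """Return True if a SNP with this QUAL and INFO dict would be flagged."""
    if qual < 30.0:
        return True
    return any(name in info and test(info[name]) for name, test in SNP_CHECKS)


good = {"QD": 25.3, "MQ": 60.0, "FS": 1.2, "SOR": 0.8}
bad = {"QD": 1.5, "MQ": 60.0, "FS": 1.2, "SOR": 0.8}
print(fails_snp_filter(500.0, good))  # False
print(fails_snp_filter(500.0, bad))   # True
```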
5.2.3 Select and filter indels
gatk SelectVariants -select-type INDEL -V 4AL_resequence.vcf -O ./indel/4AL_resequence.indel.vcf
cd indel
gatk VariantFiltration -V 4AL_resequence.indel.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || MQ < 40.0 || FS > 200.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -20.0" --filter-name "Filter" -O 4AL_filter.indel.vcf
gatk SelectVariants --exclude-filtered true -V 4AL_filter.indel.vcf -O 4AL_filtered.indel.vcf
SNPs that did not meet the following criteria were further excluded: (1) a total read depth (DP) > 240 and < 2200; (2) minor allele frequency (MAF) ≥0.05 for each population, and for Ae. tauschii (n = 5), MAF should be ≥ 0.2; (3) a maximum missing rate < 0.1; and (4) biallelic alleles.
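1. VCF filtering
# --vcf: input file in VCF format
# --minDP: minimum sequencing depth
# --maxDP: maximum sequencing depth
# --maf: minimum minor allele frequency
# --max-missing <float>: completeness threshold between 0 and 1 (0 allows sites that are completely missing, 1 allows no missing data), so a missing rate below 0.1 corresponds to 0.9
vcftools --vcf 4AL_filtered.snp.vcf --minDP 240 --maxDP 2200 --maf 0.05 --max-missing 0.9 --min-alleles 2 --max-alleles 2 --recode --recode-INFO-all --out 4AL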
SNP and indel annotations were performed according to the wheat genome annotation using the software SnpEff (version 4.3p)
2. CNV: find copy-number variation
3. Circos plot showing SNPs, indels and SVs
Drawing Circos plots: the simplest plot, with explanation
Question: SNP density plot
wget https://github.com/bedops/bedops/releases/download/v2.4.37/bedops_linux_x86_64-v2.4.37.tar.bz2
tar jxvf bedops_linux_x86_64-v2.4.37.tar.bz2
cd bin/
# /home/huawei/software/bedops/bin
vim ~/.bashrc
export PATH="/home/huawei/software/bedops/bin:$PATH"
source ~/.bashrc
# chromosome  start  end
# chr1A 0 594102056
# chr2A 0 780798557
# chr3A 0 750843639
# chr4A 0 744588157
# chr5A 0 709773743
# chr6A 0 618079260
# chr7A 0 736706236
# chr1B 0 689851870
# chr2B 0 801256715
# chr3B 0 830829764
# chr4B 0 672617499
# chr5B 0 713149757
# chr6B 0 720988478
# chr7B 0 750620385
# chr1D 0 495453186
# chr2D 0 651852609
# chr3D 0 615553423
# chr4D 0 509857067
# chr5D 0 566080677
# chr6D 0 473592728
# chr7D 0 638686055
# chrUn 0 480980714
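cat wheat.windows
# fetchChromSizes hg38 | awk '{ print $1, "0", $2 }' > wheat.windows
# (hg38 is the human assembly, so do not run the line above for wheat; wheat.windows should hold the tab-separated wheat table listed above)
# wheat.windows has three columns (chromosome, start, end), so it is passed as a BED file with -b
bedtools makewindows -b wheat.windows -w 1000000 >wheat.bed
# chromosome  start  end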
#chr1A 0 1000000
#chr1A 1000000 2000000
#chr1A 2000000 3000000
#chr1A 3000000 4000000
#chr1A 4000000 5000000
sort-bed wheat.bed >wheat.sort.bed
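# --delim '\t': separate the appended count column with a tab (the default delimiter is |)
# --count --range 500000: count the overlapping elements after extending each window by 500000 bp on both sides
# <(vcf2bed < pure_DT4AS.vcf): convert the VCF to BED on the fly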
bedmap --delim '\t' --echo --count --range 500000 wheat.sort.bed <(vcf2bed < pure_DT4AS.vcf) > DT4AS_snp_counts.txt
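For a quick sanity check of the counts, the same kind of table can be produced without bedops. A simplified Python sketch that counts SNPs per fixed 1 Mb window; unlike the bedmap call above it does not pad the windows with --range, it reports only windows that contain at least one SNP, and snp_density is a name invented here:

```python
from collections import Counter


def snp_density(positions, window=1_000_000):
    """Count SNPs per fixed-size window.

    positions: iterable of (chromosome, 1-based position) pairs.
    Returns {(chromosome, window start, window end): count} with 0-based,
    half-open windows like those written by bedtools makewindows.
    """
    counts = Counter()
    for chrom, pos in positions:
        start = (pos - 1) // window * window
        counts[(chrom, start, start + window)] += 1
    return dict(counts)


snps = [("chr1A", 1), ("chr1A", 1_000_000), ("chr1A", 1_000_001), ("chr1B", 2_500_000)]
for (chrom, start, end), n in sorted(snp_density(snps).items()):
    print(chrom, start, end, n, sep="\t")
```

The last window of a chromosome is not clipped to the chromosome length here, which bedtools makewindows does.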
Preparing SNP density data for Circos plots