5 一步法找基因变异流程

1 先查看sam文件

随机选择3个

$ samtools mpileup SRR8517854.bam |head -95|tail -3
[mpileup] 1 samples in 1 input files
chr1    10105   N       8       AAAAcAAA        kuuu>6uK
chr1    10106   N       10      CCCCcCCCC^$c    uuuukAuuAK
chr1    10107   N       10      CCCCcCCCCc      upukaFfuKF

再看另外一个:

$ samtools mpileup SRR8517854.bam |head -168944|tail -3
[mpileup] 1 samples in 1 input files
chr1    909515  N       1       c       K
chr1    909516  N       1       c       K
chr1    909517  N       1       g       K

关于samtools mpileup的具体用法和结果解释请看

2 最简单的找变异流程

align文件夹

ref=/mnt/f/kelly/bioTree/server/wesproject/hg38/Homo_sapiens_assembly38.fasta
time samtools mpileup -ugf $ref *.bam|bcftools call -vmO z -o wxs_out.vcf.gz 
less -S wxs_out.vcf.gz

3载入IGV查看

3.1先构建索引

批量构建

ls *.bam|xargs -i samtools index {}

如果是服务器需要下载到本地查看

4 去除PCR重复

需要samtools的4个命令

align文件夹

samtools markdup -r SRR7696207.bam SRR7696207.rm.bam
[markdup] error: no ms score tag. Please run samtools fixmate on file first.
[markdup] error: no ms score tag. Please run samtools fixmate on file first.

提示先fixmate

$ samtools fixmate SRR7696207.bam SRR7696207.fixmate.bam
[bam_mating_core] ERROR: Coordinate sorted, require grouped/sorted by queryname.

提示先要sort,并且要以queryname进行sort

$ samtools sort -n -o SRR7696207.sort.bam SRR7696207.bam
$ samtools sort -n -o SRR7696207.namesort.bam SRR7696207.bam
[bam_sort_core] merging from 25 files and 1 in-memory blocks...

接下来就可以逆回去了,也就是是正确的顺序(本篇最下方有说明)

$ samtools sort -n -o SRR7696207.namesort.bam SRR7696207.bam
$ samtools fixmate -m SRR7696207.namesort.bam SRR7696207.fixmate.bam
$ samtools sort -o SRR7696207.positionsort.fixmat.bam SRR7696207.fixmate.bam
$ samtools markdup -r SRR7696207.positionsort.fixmat.bam SRR7696207.rm.bam

看下以上的文件结构和大小

├── [ 136]  1
├── [ 136]  2
├── [   0]  config
├── [ 42K]  markdup.bam
├── [ 22G]  SRR7696207.bam
├── [5.4G]  SRR7696207.fixmate.bam
├── [5.2G]  SRR7696207.namesort.bam
├── [4.0G]  SRR7696207.positionsort.fixmat.bam
├── [3.8G]  SRR7696207.rm.bam
├── [2.6G]  SRR8517853.bam
├── [3.4G]  SRR8517854.bam
└── [6.9G]  SRR8517856.bam

最后查看下rm.bam

samtools view SRR7696207.rm.bam |less -S
SRR7696207.27107225     2193    chr1    10024   0       86M64H  chr5    10324   0       CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.27728977     99      chr1    10025   17      91M18S  =       10025   91      TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
SRR7696207.27728977     147     chr1    10025   17      91M18S  =       10025   -91     TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
SRR7696207.4339191      163     chr1    10027   0       69M     =       10039   123     ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
SRR7696207.12487651     163     chr1    10028   0       86M     =       10028   81      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
SRR7696207.9370306      83      chr1    10028   0       116M    =       10016   -128    CCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACC
SRR7696207.12487651     83      chr1    10028   0       69S81M  =       10028   -81     TTTCTAGCAGTGGACTGCATACGTGTTCGCATACTGTG
SRR7696207.18904916     99      chr1    10030   0       81M69S  =       10030   79      CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.18904916     147     chr1    10030   0       79M     =       10030   -79     CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.4339191      83      chr1    10039   0       111M7S  =       10027   -123    ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
SRR7696207.20389847     99      chr1    10040   0       74M     =       10040   74      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
SRR7696207.20389847     147     chr1    10040   0       76S74M  =       10040   -74     CATATTTTTGTTTTTTTTAATGTTACGGCGACCACCGA

看rm后的是否有差异

samtools flagstat SRR7696207.bam
samtools flagstat SRR7696207.rm.bam

结果如下

$ samtools flagstat SRR7696207.bam
55398860 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
372636 + 0 supplementary
0 + 0 duplicates
55374129 + 0 mapped (99.96% : N/A)
55026224 + 0 paired in sequencing
27513112 + 0 read1
27513112 + 0 read2
54512924 + 0 properly paired (99.07% : N/A)
54978908 + 0 with itself and mate mapped
22585 + 0 singletons (0.04% : N/A)
330146 + 0 with mate mapped to a different chr
252082 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat SRR7696207.rm.bam
52114377 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
372636 + 0 supplementary
0 + 0 duplicates
52089646 + 0 mapped (99.95% : N/A)
51741741 + 0 paired in sequencing
25868532 + 0 read1
25873209 + 0 read2
51246880 + 0 properly paired (99.04% : N/A)
51699203 + 0 with itself and mate mapped
17807 + 0 singletons (0.03% : N/A)
319884 + 0 with mate mapped to a different chr
242709 + 0 with mate mapped to a different chr (mapQ>=5)

可以再试试-S参数

samtools markdup -S SRR7696207.sam SRR7696207.mk.bam

补充:samtoolsmarkdup操作的正确顺序

  1. The first sort can be omitted if the file is already name ordered
samtools sort -n -o namesort.bam example.bam
  1. Add ms and MC tags for markdup to use later
samtools fixmate -m namesort.bam fixmate.bam
  1. Markdup needs position order
samtools sort -o positionsort.bam fixmate.bam
  1. Finally mark duplicates
samtools markdup positionsort.bam markdup.bam

以上是简单的找变异流程,并不完善。下面是GATK找变异。

你可能感兴趣的:(5 一步法找基因变异流程)