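Samtools and Bcftools

Introduction to Samtools and Bcftools

SAMtools is a collection of tools for manipulating SAM and BAM files and provides many subcommands.
BCFtools is a collection of tools mainly for manipulating VCF and BCF files and likewise provides many subcommands.
The commands are used as follows: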

1. view

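The view command is mainly used to inspect the contents of BAM and SAM files.
BAM is the binary form of SAM; it takes less disk space and is faster to process.
Usage and common options of view:

Usage: samtools view [options] <in.bam>|<in.sam> [region [...]]
If no region is given, alignments from all regions are output.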

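Options:
-b  Output BAM; the default output format is SAM
-h  Include the header in SAM output; by default SAM output has no header
-H  Output the header only
-S  Input is SAM; the default input is assumed to be BAM. Older versions may fail on SAM input without this option, while recent versions detect the input format automatically and ignore it
-u  Requires -b. BAM output is compressed by default; this option writes it uncompressed, which saves time but needs more disk space
-c  Print only the number of matching alignments instead of the alignments themselves. Often combined with '-f', '-F' and '-q'
-t  FILE Use a list file as the source of the header; the file contains the sequence IDs and lengths
-T  FILE Use the reference FASTA file as the source of the header
-o  FILE Write the result to FILE; the default is standard output
-f  INT Flag bits that an alignment must have set; acts as a filter
-F  INT Flag bits that an alignment must not have set; 4 means the read is unmapped, 8 means its mate is unmapped
-q  INT Minimum mapping quality
-?  Show more help, including an explanation of the flags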

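Some examples:

# Convert a SAM file to BAM
$ samtools view -bS a.sam > a.bam

# Extract the alignments of reads that mapped to the reference
$ samtools view -bF 4 a.bam > a.F4.bam

# Extract the alignments of read pairs in which both reads mapped to the reference; the filter value is 4 + 8 = 12
$ samtools view -b -F 12 a.bam > a.F12.bam

# Extract the reads that did not map to the reference
$ samtools view -b -f 4 a.bam > a.f4.bam

# Extract the alignments on scaffold1 and save them in SAM format
# The BAM file must be sorted and indexed before a region can be extracted
$ samtools view a.bam scaffold1 > scaffold1.sam

# Extract the alignments in the 30k-40k region of scaffold1
$ samtools view a.bam scaffold1:30000-40000 > scaffold_30k_40K.sam

# Build the header from a FASTA file and add it to a SAM or BAM file
$ samtools view -T genome.fasta -h scaffold1.bam > scaffold1.h.sam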
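
The -f and -F filters work on the bits of the FLAG field. The following is a minimal Python sketch of that selection logic, assuming only the two flag bits named in the option list (4 and 8). The helper name passes_filter is hypothetical and is not part of samtools.

```python
# Hypothetical helper mirroring how `samtools view -f INT -F INT` selects records.
READ_UNMAPPED = 4   # 0x4: the read itself is unmapped
MATE_UNMAPPED = 8   # 0x8: the mate is unmapped

def passes_filter(flag, require=0, exclude=0):
    """Return True if all bits in `require` are set and no bit in `exclude` is set."""
    return (flag & require) == require and (flag & exclude) == 0

# FLAG 99 = paired + proper pair + mate reverse + first in pair: both ends mapped.
print(passes_filter(99, exclude=READ_UNMAPPED | MATE_UNMAPPED))  # kept by -F 12
# FLAG 73 = paired + mate unmapped + first in pair.
print(passes_filter(73, exclude=READ_UNMAPPED | MATE_UNMAPPED))  # dropped by -F 12
# FLAG 77 = paired + read unmapped + mate unmapped + first in pair.
print(passes_filter(77, require=READ_UNMAPPED))                  # kept by -f 4
```

Adding the two bit values (4 + 8 = 12) works because each flag occupies its own bit, so the sum is the same as the bitwise OR.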

2. sort

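sort sorts a BAM file.

Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix>

Options:
-m  Maximum memory to use; the default here is 500,000,000 bytes (about 500M; K/M/G suffixes are supported). For large data sets, a larger value saves time if enough memory is available.
-n  Sort by read name (ID). By default, alignments are sorted by reference sequence, in the order listed in the header, and then by leftmost position on that sequence.
-l  INT Compression level, from 0 (uncompressed) to 9 (best compression)

-@  INT Number of threads to use

Example

# The form below is the legacy syntax with an output prefix; samtools 1.3 and later expect: samtools sort -o a.sort.bam a.bam
$ samtools sort a.bam a.sort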

3. merge

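Merges two or more sorted BAM files into a single BAM file. The merged file is itself sorted and does not need to be sorted again.

Usage: samtools merge [-nr] [-h inh.sam] <out.bam> <in1.bam> <in2.bam> [...]
By default, the header of the merged BAM file is taken from in1.bam. Use -h to copy the header of a specific SAM or BAM file into out.bam instead.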

Options:
  -n         Input files are sorted by read name
  -t TAG     Input files are sorted by TAG value
  -r         Attach RG tag (inferred from file names)
  -u         Uncompressed BAM output
  -f         Overwrite the output BAM if exist
  -1         Compress level 1
  -l INT     Compression level, from 0 to 9 [-1]
  -R STR     Merge file in the specified region STR [all]
  -h FILE    Copy the header in FILE to <out.bam> [in1.bam]
  -c         Combine @RG headers with colliding IDs [alter IDs to be distinct]
  -p         Combine @PG headers with colliding IDs [alter IDs to be distinct]
  -s VALUE   Override random seed
  -b FILE    List of input BAM filenames, one per line [null]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
  -O, --output-fmt FORMAT[,OPT[=VAL]]...
               Specify output format (SAM, BAM, CRAM)
      --output-fmt-option OPT[=VAL]
               Specify a single output file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]
  -@, --threads INT
               Number of additional threads to use [0]
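
Merging inputs that are already sorted gives a sorted result without a second sort. The Python sketch below illustrates that idea with made-up records; it is not the samtools implementation, and the tuple layout (reference index, leftmost position, read name) is an assumption chosen for the example.

```python
import heapq

# Each record is (reference index, leftmost position, read name); values are made up.
in1 = [(0, 100, "r1"), (0, 250, "r4"), (1, 30, "r5")]
in2 = [(0, 120, "r2"), (0, 200, "r3"), (1, 10, "r6")]

# heapq.merge lazily interleaves already-sorted inputs into one sorted stream,
# so memory use stays small even when the inputs are large.
merged = list(heapq.merge(in1, in2))
print(merged)
```

The same approach only works if every input uses the same sort order, which is why merge needs -n when the inputs are sorted by read name.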

4. index

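Builds an index for a BAM file, producing a file with the .bai suffix that allows fast retrieval of reads.
The BAM file must be sorted first; otherwise an error is reported.
Many tools require the index file, especially those that display alignments, such as samtools tview and gbrowse2.

Usage: samtools index <in.bam> [out.index]

# Example:
$ samtools index a.bam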

5. faidx

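Builds an index for a FASTA file, producing a file with the .fai suffix. With the index in place, the command can also quickly extract individual sequences from the FASTA file.

Usage: samtools faidx <ref.fasta> [region1 [...]]

# Build an index for the genome file
$ samtools faidx genome.fa
The resulting index file, genome.fa.fai, is a text file with five columns:
column 1 is the name of the sequence;
column 2 is the length of the sequence;
column 3 is the byte offset of the first base of the sequence in the file, counted from 0 and including newline characters;
column 4 is the number of bases on each sequence line except the last, in bp;
column 5 is the line width, that is, the length in bytes of each sequence line except the last, including the newline. On Windows the newline is \r\n, so the width is the number of bases plus 2.

# With the index file in place, a subsequence can be extracted quickly from the genome in FASTA format
$ samtools faidx genome.fa chr1 > chr1.fa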
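
The offset, bases-per-line and line-width columns of the .fai file are enough to locate any base directly. The sketch below shows the arithmetic on a toy FASTA held in memory; the function name base_offset and the toy data are assumptions made for illustration.

```python
def base_offset(seq_offset, line_bases, line_width, pos):
    """Byte offset in the FASTA file of the 1-based base `pos` of a sequence."""
    zero = pos - 1                      # switch to 0-based counting
    full_lines = zero // line_bases     # complete sequence lines before the base
    return seq_offset + full_lines * line_width + zero % line_bases

# Toy FASTA: one sequence, 4 bases per line, Unix newlines.
fasta = b">chr1\nACGT\nTTGA\nCC\n"
# Matching .fai fields for chr1: offset 6, 4 bases per line, 5 bytes per line.
seq_offset, line_bases, line_width = 6, 4, 5

off = base_offset(seq_offset, line_bases, line_width, 7)
print(fasta[off:off + 1])  # the 7th base of chr1
```

Because the lookup is pure arithmetic, extracting a region does not require reading the sequences that come before it in the file.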

6. tview

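tview shows how reads align to the genome in an intuitive text view, somewhat like a genome browser.

Usage: samtools tview <aln.bam> [ref.fasta]
When a reference genome is given, its sequence is shown on the first row; otherwise the first row is filled with N.
Press g to be prompted for a position to jump to. For example, "chr1:100" jumps to base 100 of chr1.
Use h (left), j (down), k (up) and l (right) to scroll the view; uppercase letters scroll fast, lowercase letters scroll slowly.
Space scrolls quickly to the right (same as L); Backspace scrolls quickly to the left (same as H).
Ctrl+H scrolls 1 kb to the left; Ctrl+L scrolls 1 kb to the right.
Press 'm', 'b' or 'n' to color by mapping quality, base quality or nucleotide. Base or mapping qualities of 30-40 are shown in white, 20-30 in yellow, 10-20 in green and 0-10 in blue.
Press '.' to toggle the dot view, in which a dot marks a match on the forward strand and a comma marks a match on the reverse strand.
Press 'r' to toggle the display of read names.
Use '?' to see the other commands.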

7. flagstat

Prints a summary of the alignments in a BAM file.

Usage: samtools flagstat <in.bam>

$ samtools flagstat a.bam
187018343 + 0 in total (QC-passed reads + QC-failed reads)  # total number of reads
280885 + 0 secondary  # secondary alignments (FLAG 0x100)
0 + 0 supplementary  # supplementary alignments (FLAG 0x800)
0 + 0 duplicates  # reads marked as duplicates
186439767 + 0 mapped (99.69% : N/A)  # reads mapped to the reference genome
186737458 + 0 paired in sequencing  # reads that are paired in sequencing
93368729 + 0 read1  # number of read1
93368729 + 0 read2  # number of read2
180901328 + 0 properly paired (96.87% : N/A)  # reads mapped in a proper pair
185855190 + 0 with itself and mate mapped  # reads whose mate is also mapped, though not necessarily to the same chromosome
303692 + 0 singletons (0.16% : N/A)  # reads that are mapped while their mate is not
2828852 + 0 with mate mapped to a different chr  # reads whose mate is mapped to a different chromosome
1367179 + 0 with mate mapped to a different chr (mapQ>=5)  # as above, with mapping quality of at least 5
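
Each flagstat line has the shape "passed + failed label (details)", which makes it easy to parse. The following Python sketch reads three lines copied from the output above and recomputes the mapped fraction; the regular expression is an assumption based on this sample output only.

```python
import re

# Three lines copied from the flagstat output above.
text = """187018343 + 0 in total (QC-passed reads + QC-failed reads)
186439767 + 0 mapped (99.69% : N/A)
180901328 + 0 properly paired (96.87% : N/A)"""

counts = {}
for line in text.splitlines():
    # "<QC-passed> + <QC-failed> <label> [(details)]"
    m = re.match(r"(\d+) \+ (\d+) (.+?)(?: \(.*)?$", line)
    passed, failed, label = int(m.group(1)), int(m.group(2)), m.group(3)
    counts[label] = (passed, failed)

# Fraction of QC-passed records that are mapped.
mapped_rate = counts["mapped"][0] / counts["in total"][0]
print(f"{mapped_rate:.4f}")
```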

8. depth

Reports the sequencing depth at each position and writes it to standard output.

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
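   -a                  output all positions (including zero depth)
                       # output every position, including positions with zero depth
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
                       # output absolutely every position, including reference sequences that are not used
   -b <bed>            list of positions or regions
                       # compute depth only at the positions or regions listed in the BED file
   -f <list>           list of input BAM filenames, one per line [null]
                       # use the BAM files listed in FILE
   -l <int>            read length threshold (ignore reads shorter than <int>) [0]
                       # ignore reads shorter than this INT value
   -d/-m <int>         maximum coverage depth [8000]
                       # maximum depth value
   -q <int>            base quality threshold [0]
                       # count only bases with quality of at least this value
   -Q <int>            mapping quality threshold [0]
                       # count only reads with mapping quality of at least this value
   -r <chr:from-to>    region
                       # count only reads in the given region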
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.

# Example
$ samtools depth a.bam | less
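
The three-column table (reference name, position, depth) can be summarised with a few lines of code. The sketch below computes a mean depth per reference sequence from made-up rows; in practice the rows would come from the samtools depth output.

```python
import io

# Made-up rows in the three-column format: reference name, position, depth.
depth_output = io.StringIO("chr1\t1\t3\nchr1\t2\t5\nchr1\t3\t4\nchr2\t1\t8\n")

totals = {}  # reference name -> [sum of depths, number of reported positions]
for line in depth_output:
    name, _pos, depth = line.rstrip("\n").split("\t")
    entry = totals.setdefault(name, [0, 0])
    entry[0] += int(depth)
    entry[1] += 1

mean_depth = {name: s / n for name, (s, n) in totals.items()}
print(mean_depth)
```

Note that this averages over reported positions only. Without -a, positions with zero coverage may be missing from the input, so the mean would be higher than the mean over the whole reference.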

9. rmdup

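Removes PCR and optical duplicates; of each set of duplicate reads, only one pair is kept.

Usage:  samtools rmdup [-sS] <input.srt.bam> <output.bam>

Options:
  -s  Remove duplicates for SE reads. By default only PE reads are deduplicated
  -S  Treat PE reads as SE reads when removing duplicates

# Example
$ samtools rmdup a.bam b.bam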

10. 一些其他有用的命令

reheader: replaces the header of a BAM file

Usage: samtools reheader [-P] in.header.sam in.bam > out.bam
   or  samtools reheader [-P] -i in.header.sam file.bam

Options:
    -P, --no-PG      Do not generate an @PG header line.
    -i, --in-place   Modify the bam/cram file directly.
                     (Defaults to outputting to stdout.)

# Example
$ samtools reheader -i in.header.sam out.bam 
$ samtools reheader in.header.sam in.bam > out.bam

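cat: concatenates multiple BAM files; suitable for BAM files that are not sorted

Usage: samtools cat [options] <in1.bam>  [... <inN.bam>]
       samtools cat [options] <in1.cram> [... <inN.cram>]

Concatenate BAM or CRAM files, first those in <bamlist.fofn>, then those
on the command line.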

Options: -b FILE  list of input BAM/CRAM file names, one per line
         -h FILE  copy the header from FILE [default is 1st input file]
         -o FILE  output BAM/CRAM

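idxstats: prints a table with four columns: sequence name, sequence length, number of mapped reads and number of unmapped reads. The last row gives the number of reads that did not map to any sequence.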

chr1 195471971 6112404 0
chr10 130694993 3933316 0
chr11 122082543 6550325 0
chr12 120129022 3876527 0
chr13 120421639 5511799 0
chr14 124902244 3949332 0
chr15 104043685 3872649 0
chr16 98207768 6038669 0
chr17 94987271 13544866 0
chr18 90702639 4739331 0
chr19 61431566 2706779 0
chr2 182113224 8517357 0
chr3 160039680 5647950 0
chr4 156508116 4880584 0
chr5 151834684 6134814 0
chr6 149736546 7955095 0
chr7 145441459 5463859 0
chr8 129401213 5216734 0
chr9 124595110 7122219 0
chrM 16299 1091260 0
chrX 171031299 3248378 0
chrY 91744698 259078 0
* 0 0 0
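
Since the sequences differ greatly in length, the raw mapped-read counts are easier to compare after normalising by length. The Python sketch below does this for a few rows copied from the table above; the reads-per-kilobase measure is chosen for illustration and is not something idxstats reports.

```python
# A few rows copied from the idxstats table above.
rows = """chr1 195471971 6112404 0
chrM 16299 1091260 0
chrY 91744698 259078 0
* 0 0 0"""

stats = []
for line in rows.splitlines():
    # split() handles both tab- and space-separated columns
    name, length, mapped, unmapped = line.split()
    stats.append((name, int(length), int(mapped), int(unmapped)))

# Mapped reads per kilobase of reference, skipping the "*" row whose length is 0.
density = {n: m / (l / 1000) for n, l, m, _ in stats if l > 0}
densest = max(density, key=density.get)
print(densest)
```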

11. mpileup

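The samtools subcommand mpileup examines the alignments at every base position of the reference and can produce VCF/BCF output; BCF is the binary form of VCF. Bcftools is then used on the VCF/BCF file for SNP/indel calling. Bcftools is a companion program to samtools. In recent releases the genotype-likelihood output is produced by bcftools mpileup, which appears in the Bcftools command list, rather than by samtools mpileup.
Common options of mpileup: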

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
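  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
                          # base qualities use the Illumina 1.3+ scoring scheme
  -A, --count-orphans     do not discard anomalous read pairs
                          # keep anomalous read pairs when detecting variants
  -b, --bam-list FILE     list of input BAM filenames, one per line
                          # give the input BAM files as a list, one BAM file path per line
  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)

  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
                          # coefficient for downgrading the mapping quality; if reads contain excessive mismatches it should not be left at zero. The recommended value for BWA is 50
  -d, --max-depth INT     max per-file depth; avoids excessive memory usage [250]
                          # For the SNP/indel analysis at each reference position, samtools caps the number of
                          reads taken from each input file. The cap has a floor of 8000/n, where n is the
                          number of input BAM files. If -d is set above 8000/n, the cap is determined by -d;
                          otherwise -d has no effect. For example, when 1000 resequenced samples are analysed
                          together, 8000/n would be too small (possibly below the sequencing coverage), so -d
                          takes effect; with only a few samples the cap per position is 8000/n, which makes
                          full use of the data at positions with very high depth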
  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs

  -f, --fasta-ref FILE    faidx indexed reference sequence file
                          # reference genome FASTA file; it must have an index file with the .fai suffix
  -G, --exclude-RG FILE   exclude read groups listed in FILE

  -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
                          # BED file; analyse only the listed intervals
  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
                          # minimum mapping quality used in the analysis
  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
                          # minimum base quality used in the analysis
  -r, --region REG        region in which pileup is generated
                          # analyse only this region; by default all regions are analysed
  -R, --ignore-RG         ignore RG tags (one BAM = one sample)
                          # ignore the @RG information in the BAM files and treat each BAM file as one sample
  --rf, --incl-flags STR|INT  required flags: skip reads with mask bits unset []
  --ff, --excl-flags STR|INT  filter flags: skip reads with mask bits set
                                            [UNMAP,SECONDARY,QCFAIL,DUP]
  -x, --ignore-overlaps   disable read-pair overlap detection

Output options:
  -o, --output FILE       write output to FILE [standard output]
                          # write the result to the given file; the default is standard output
  -g, --BCF               generate genotype likelihoods in BCF format
                          # compute genotype likelihoods and write them in BCF format
  -v, --VCF               generate genotype likelihoods in VCF format
                          # compute genotype likelihoods and write them in VCF format. The VCF is bgzip-compressed by default; with -u it is not compressed

Output options for mpileup format (without -g/-v):   # effective only without -g/-v
  -O, --output-BP         output base positions on reads
                          # output the position of each base within its read
  -s, --output-MQ         output mapping quality
                          # output the mapping quality
      --output-QNAME      output read names
  -a                      output all positions (including zero depth)
                          # output all positions, including those with a coverage depth of 0
  -a -a (or -aa)          output absolutely all positions, including unused ref. sequences
                          # output all positions, including reference sequences to which no reads aligned
Output options for genotype likelihoods (when -g/-v is used):   # effective when -g/-v is used
  -t, --output-tags LIST  optional tags to output:
               DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
                          # comma-separated list of FORMAT and INFO tags to include
  -u, --uncompressed      generate uncompressed VCF/BCF output
                          # do not compress the output; typically used when piping into the next command

SNP/INDEL genotype likelihoods options (effective with -g/-v):      # SNP/indel genotyping options
  -e, --ext-prob INT      Phred-scaled gap extension seq error probability [20]
  -F, --gap-frac FLOAT    minimum fraction of gapped reads [0.002]
                           # minimum fraction of reads that contain a gap
  -h, --tandem-qual INT   coefficient for homopolymer errors [100]
  -I, --skip-indels       do not perform indel calling
                          # do not detect indel variants
  -L, --max-idepth INT    maximum per-file depth for INDEL calling [250]
  -m, --min-ireads INT    minimum number gapped reads for indel candidates [1]
                          # minimum number of gapped reads required for an indel candidate
  -o, --open-prob INT     Phred-scaled gap open seq error probability [40]
  -p, --per-sample-mF     apply -m and -F per-sample for increased sensitivity
  -P, --platforms STR     comma separated list of platforms for indels [all]
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

Notes: Assuming diploid individuals.
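
The interaction between -d and the 8000/n floor described under --max-depth can be written as a one-line rule. The sketch below follows that description only; the function name effective_max_depth and the use of integer division are assumptions, not taken from the samtools source.

```python
def effective_max_depth(d, n_files, floor_total=8000):
    """Per-file read cap as described for -d: the larger of -d and 8000/n.

    Integer division is an assumption made for this sketch.
    """
    return max(d, floor_total // n_files)

few = effective_max_depth(250, 4)      # few files: the 8000/n floor dominates
many = effective_max_depth(250, 1000)  # many files: the -d value dominates
print(few, many)
```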

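The mpileup output has six columns: reference sequence name; position; reference base; number of reads covering the position; read bases; base qualities.
Column 5 is the most involved and is interpreted as follows:

1. '.' means a base that matches the reference on the forward strand
2. ',' means a base that matches the reference on the reverse strand
3. 'ATGCN' means a mismatch on the forward strand
4. 'atgcn' means a mismatch on the reverse strand
5. '*' is a placeholder for a base deleted by a deletion reported at an earlier position
6. '^' marks the start of a read; the ASCII code of the character right after '^' minus 33 gives the mapping quality. These two characters modify the base that follows them, which is the first base of the read
7. '$' marks the end of a read and modifies the base before it
8. The regular expression '+[0-9]+[ACGTNacgtn]+' denotes bases inserted after this position
9. The regular expression '-[0-9]+[ACGTNacgtn]+' denotes bases deleted after this position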
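
The symbols in column 5 can be decoded with a small scanner. The Python sketch below tallies matches, mismatches, insertions and deletions in one pileup string. The function name count_pileup_bases and the example string are made up, '*' is counted as a placeholder for a base removed by an earlier deletion, and symbols outside the list above are simply skipped.

```python
import re

def count_pileup_bases(bases):
    """Tally matches, mismatches, insertions and deletions in pileup column 5."""
    counts = {"match": 0, "mismatch": 0, "ins": 0, "del": 0, "deleted_base": 0}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":            # read start; the next character encodes mapping quality
            i += 2
        elif c == "$":          # read end marker
            i += 1
        elif c in "+-":         # indel: sign, length, then that many bases
            m = re.match(r"\d+", bases[i + 1:])
            length = int(m.group())
            counts["ins" if c == "+" else "del"] += 1
            i += 1 + len(m.group()) + length
        elif c in ".,":         # match on forward or reverse strand
            counts["match"] += 1
            i += 1
        elif c == "*":          # placeholder for a base deleted earlier
            counts["deleted_base"] += 1
            i += 1
        elif c in "ACGTNacgtn": # mismatch on forward or reverse strand
            counts["mismatch"] += 1
            i += 1
        else:                   # symbols not covered by this sketch
            i += 1
    return counts

result = count_pileup_bases("^].,$..+2AG.-1t,*Ag")
print(result)
```

The indel branch has to skip the inserted or deleted bases explicitly; otherwise the letters after '+2' or '-1' would be miscounted as mismatches.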

Variant calling with Bcftools

Bcftools is commonly used for variant detection. Its usage and options are as follows:

Usage:   bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

 -- Indexing
    index        index VCF/BCF files

 -- VCF/BCF manipulation
    annotate     annotate and edit VCF/BCF files
    concat       concatenate VCF/BCF files from the same set of samples
    convert      convert VCF/BCF files to different formats and back
    isec         intersections of VCF/BCF files
    merge        merge VCF/BCF files from non-overlapping sample sets
    norm         left-align and normalize indels
    plugin       user-defined plugins
    query        transform VCF/BCF into user-defined formats
    reheader     modify VCF/BCF header, change sample names
    sort         sort VCF/BCF file
    view         VCF/BCF conversion, view, subset and filter VCF/BCF files

 -- VCF/BCF analysis
    call         SNP/indel calling
    consensus    create consensus sequence by applying VCF variants
    cnv          HMM CNV calling
    csq          call variation consequences
    filter       filter VCF/BCF files using fixed thresholds
    gtcheck      check sample concordance, detect sample swaps and contamination
    mpileup      multi-way pileup producing genotype likelihoods
    roh          identify runs of autozygosity (HMM)
    stats        produce VCF/BCF stats

 Most commands accept VCF, bgzipped VCF, and BCF with the file type detected
 automatically even when streaming from a pipe. Indexed VCF and BCF will work
 in all situations. Un-indexed VCF and BCF and streams will work in most but
 not all situations.

# Common options of call:
Usage:   bcftools call [options] <in.vcf.gz>

Options:
  -V, --skip-variants <type>      skip indels/snps
                                  # skip the given variant type (indels or snps)
  -v, --variants-only             output variant sites only
                                  # output information for variant sites only
  -c, --consensus-caller          the original calling method (conflicts with -m)
                                  # the original SNP/indel calling algorithm, suited to analysing a single sample; mutually exclusive with -m
  -m, --multiallelic-caller       alternative model for multiallelic and rare-variant calling (conflicts with -c)
                                  # algorithm suited to multiallelic and rare-variant calling and to analysing multiple samples; mutually exclusive with -c
  -o, --output <file>             write output to a file [standard output]
                                  # set the output file
  -O, --output-type <b|u|z|v>     output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
                                  # output format: b (compressed BCF), u (uncompressed BCF), z (compressed VCF), v (uncompressed VCF)
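
The mpileup and call steps are usually chained with a pipe. The Python sketch below only assembles the two command lines and prints them without running anything. The helper name build_calling_commands and the file names are hypothetical; the options are the ones listed above plus -f for the reference, and -O u on the mpileup side is an assumption that its output type option matches the one shown for call.

```python
import shlex

def build_calling_commands(reference, bam, out_vcf):
    """Return the two argument lists for a mpileup | call pipeline (not executed here)."""
    # -O u: uncompressed BCF, a common choice when piping into the next command
    mpileup = ["bcftools", "mpileup", "-O", "u", "-f", reference, bam]
    # -m: multiallelic caller, -v: variant sites only, -O z: compressed VCF, -o: output file
    call = ["bcftools", "call", "-m", "-v", "-O", "z", "-o", out_vcf]
    return mpileup, call

mpileup_cmd, call_cmd = build_calling_commands("genome.fa", "a.sort.bam", "a.vcf.gz")
pipeline = " | ".join(
    " ".join(shlex.quote(arg) for arg in cmd) for cmd in (mpileup_cmd, call_cmd)
)
print(pipeline)
```

Keeping the commands as argument lists avoids shell-quoting problems if they are later passed to a process launcher, and -m and -c are never combined because they conflict.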