curl -OL https://sourceforge.net/projects/samtools/files/samtools/1.6/samtools-1.6.tar.bz2
tar jxvf samtools-1.6.tar.bz2
cd samtools-1.6
./configure
make
mv samtools-1.6 samtools
ln -s ~/local/app/samtools/samtools ~/bin/samtools
or
echo 'export PATH=/home/chengquan/local/app/samtools:$PATH' >> ~/.bashrc
source ~/.bashrc
检查一下是否可以用
[sunchengquan 14:35:50 ~]
$ samtools
Program: samtools (Tools for alignments in the SAM format)
Version: 1.6 (using htslib 1.6)
Usage: samtools <command> [options]
Commands:
-- Indexing
dict create a sequence dictionary file
faidx index/extract FASTA
index index alignment
-- Editing
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
rmdup remove PCR duplicates
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates
-- File operations
collate shuffle and group alignments by name
cat concatenate BAMs
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
-- Statistics
bedcov read depth per BED region
depth compute the depth
flagstat simple stats
idxstats BAM index stats
phase phase heterozygotes
stats generate stats (former bamcheck)
-- Viewing
flags explain BAM flags
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
以下的命令讲解就是按照这个顺序
参考:生物信息数据格式:sam,bam格式
官方说明文档:http://www.htslib.org/doc/samtools.html
作用:建立参考基因组字典
[sunchengquan 15:12:43 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools dict lambda_virus.fa > lambda_virus.dict
功能:对fasta格式的文件建立索引,后缀名.fai。根据索引文件和序列文件,可以快速提取任意区域的序列文件。
fasta序列格式要求:每条序列,除了最后一行外,其他行的长度必须相同!
为了方便演示,我们使用bowtie2软件中的测试数据,在example/reference
下
查看建立fasta索引
[sunchengquan 14:44:59 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ ll
总用量 56K
-rw-r--r-- 1 sunchengquan sunchengquan 49K 7月 27 00:43 lambda_virus.fa
[sunchengquan 14:47:21 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools faidx lambda_virus.fa
[sunchengquan 14:48:32 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ ll
总用量 60K
-rw-r--r-- 1 sunchengquan sunchengquan 49K 7月 27 00:43 lambda_virus.fa
-rw-r----- 1 sunchengquan sunchengquan 43 12月 20 14:48 lambda_virus.fa.fai
查看lambda_virus.fa.fai
[sunchengquan 14:49:27 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ less -S lambda_virus.fa.fai
gi|9626243|ref|NC_001416.1| 48502 74 70 71
一共5列,每列之间tab分割
第一列:序列的名称
第二列:序列长度
第三列:第一个碱基的偏移量,从0开始计数
第四列:除了最后一行外,序列中每行的碱基数
第五列:除了最后一行外,序列中每行的长度(包括换行符)
提取fasta序列
因为文件中只有一个序列名称,所以我们提取这个序列100-200bp的区间:
[sunchengquan 15:01:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reference]
$ samtools faidx lambda_virus.fa 'gi|9626243|ref|NC_001416.1|':100-200 > NC_001416_100_200.fa
主要功能:对bam文件建立索引,但在此之前必须进行排序(sort),生成后缀是.bai的文件。
[sunchengquan 15:06:57 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools index tmp.sorted.bam
作用:替换bam文件的信息
[sunchengquan 15:21:38 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.bam |grep -v '^@HD' > in.header.bam
[sunchengquan 15:21:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ less in.header.bam
[sunchengquan 15:31:10 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader -P in.header.bam tmp.sorted.bam > tmp.sorted.h.bam
查看tmp.sorted.h.bam头部信息
[sunchengquan 15:31:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.h.bam
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.3 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
不加-P
参数
[sunchengquan 15:33:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader in.header.bam tmp.sorted.bam > tmp.sorted.h.bam
[sunchengquan 15:34:36 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.sorted.h.bam
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.3 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
@PG ID:samtools PN:samtools PP:bowtie2 VN:1.6 CL:samtools reheader in.header.bam tmp.sorted.bam
作用:将由PCR duplicates 获得的reads去掉,并保留高比对质量的reads
picard,GATK中也有相似的功能
[sunchengquan 15:36:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools rmdup
Usage: samtools rmdup [-sS] <input.srt.bam> <output.bam>
Option: -s rmdup for SE reads
-S treat PE reads as SE in rmdup (force -s)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
只需要加-s
参数, 就可以对单端测序进行去除PCR重复。其实对单端测序去除PCR重复很简单的,因为比对flag情况只有0,4,16,只需要它们比对到染色体的起始终止坐标一致即可,flag很容易一致。但是对于双端测序就有点复杂了
在双端测序里面一大堆的flag情况,双端测序数据,用samtools rmdup效果就很差,所以很多人建议用picard工具的MarkDuplicates 功能~~~
作用:将多个bam文件合并为一个bam文件(merge命令的区别是cat不需要将bam文件提前进行排序)
[sunchengquan 15:50:28 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools cat
Usage: samtools cat [options] <in1.bam> [... <inN.bam>]
samtools cat [options] <in1.cram> [... <inN.cram>]
Concatenate BAM or CRAM files, first those in <bamlist.fofn>, then those
on the command line.
Options: -b FILE list of input BAM/CRAM file names, one per line
-h FILE copy the header from FILE [default is 1st input file]
-o FILE output BAM/CRAM
功能:合并多个已经sort的bam文件
当有多个样本的bam文件时,可以使用samtools的merge命令将这些bam文件合并为一个排序的且保持所有输入记录并保持现有排序顺序的bam文件
格式:
samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b ] […
-l 指定压缩等级;
-b FILE 输入文件列表,一个文件一行;
-f 强制覆盖同名输出文件;
-h FILE 指定FILE内的’@’头复制到输出bam文件中并替换输出文件的文件头。否则,输出文件的文件头将从第一个输入文件复制过来;
-n 设定输入比对文件是以read名进行排序的而不是以染色体坐标排序的;
-R STRING 合并输入文件的指定区域;
-r 使RG标签添加到每一个比对文件上,标签值来自文件名;
-u 输出的bam文件不压缩;
-c 当多个输入文件包含相同的@RG头ID时,只保留第一个到合并后输出的文件。当合并多个相同样本的不同文件时,非常有用。
-p 与-c参数类似,对于要合并的每一个文件中的@PG ID只保留第一个文件中的@PG。
#-f 参数
[sunchengquan 18:47:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[bam_merge] File 'tmp.merge.bam' exists. Please apply '-f' to overwrite. Abort.
[sunchengquan 18:48:53 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -f tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:49:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD VN:1.0 SO:coordinate
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.3 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
@PG ID:bowtie2-1CE8AF82 PN:bowtie2 VN:2.3.4.3 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
#-p 参数
[sunchengquan 18:50:04 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -f -p tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:51:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD VN:1.0 SO:coordinate
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.3 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq"
#-h 参数
#header中多了一行@CO SUNCHENGQUAN
[sunchengquan 18:51:49 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat inh.bam
@HD VN:1.0 SO:coordinate
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq" VN:2.3.4.3
@CO SUNCHENGQUAN
[sunchengquan 18:52:54 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:53:22 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.bam
@HD VN:1.0 SO:coordinate
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq" VN:2.3.4.3
@CO SUNCHENGQUAN
#-r 参数
[sunchengquan 18:58:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p -r tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 18:58:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |head -1
r523 165 gi|9626243|ref|NC_001416.1| 1 0 * = 1 0 TTTCGCACTCAATCCGCCGGGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTAT (?,8F5CH3+1H<H&.7;F+H?@>%$./?%>F.+/9(H3?4%.=<6A>&"5/.FG?-% YT:Z:UP RG:Z:tmp.sorted.bam1
[sunchengquan 18:59:03 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |tail -1
r9929 141 * 0 0 * * 0 0 ACCGTAACAAGCAACAGGCAGNCGTGACAGCCAGCAAACCAAAACTCGACCTGACAAACACAGACTGGATTTACGGGGTGGATCTATGAAAATCATC 74)E*11'-G;2G'(!+A26+.E>F'5+.F=@A,/9105>;'E)#F#8C:;G6)G148+#$:7%=?#/H'-F("G4'D/&'&(;2$$+ YT:Z:UP RG:Z:tmp.sorted.bam2
#-R 参数
[sunchengquan 18:59:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools merge -h inh.bam -f -p -r -R 'gi|9626243|ref|NC_001416.1|':1-10 tmp.merge.bam tmp.sorted.bam1 tmp.sorted.bam2
[sunchengquan 19:02:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |wc -l
22
[sunchengquan 19:03:00 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.merge.bam |awk '{print $4}'
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
4
4
8
8
作用:对参考基因组每个位点做碱基堆积,用于call SNP和INDEL。主要是生成BCF、VCF文件或者pileup一个或多个bam文件。比对记录以在@RG中的样本名作为区分标识符。如果样本标识符缺失,那么每一个输入文件则视为一个样本。
mpileup命令中参数比较多,这里只介绍一些常用的参数。
参数 | 解释 |
---|---|
-A | 在检测变异中,不忽略异常的reads对 |
-C | 用于调节比对质量的系数,如果reads中含有过多的错配,不能设置为零 |
-D | 输出每个样本的reads深度 |
-l | BED文件或者包含区域位点的位置列表文件,注意:位置文件包含两列,染色体和位置,从1开始计数。BED文件至少包含3列,染色体、起始和终止位置,开始端从0开始计数。 |
-r | 在指定区域产生pileup,需已建立索引的bam文件,通常和-l参数一起使用 |
-o/v/g | 输出文件类型(标准格式文件或者VCF、BCF文件) |
-t | 设置FORMAT和INFO的列表内容,以逗号分割 |
-u | 生成未压缩的VCF和BCF文件 |
-I | 跳过INDEL检测 |
-m | 候选INDEL的最小间隔的reads |
-f | 输入有索引文件的fasta参考序列 |
-F | 含有间隔reads的最小片段 |
-Q | 最低的碱基质量,默认设置是13 |
-q | 最低的mapping质量,默认设置是0 |
samtools mpileup
[sunchengquan 19:04:21 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup tmp.sorted.bam |head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
gi|9626243|ref|NC_001416.1| 1 N 0 * *
gi|9626243|ref|NC_001416.1| 2 N 0 * *
gi|9626243|ref|NC_001416.1| 3 N 0 * *
gi|9626243|ref|NC_001416.1| 4 N 0 * *
gi|9626243|ref|NC_001416.1| 5 N 1 G /
gi|9626243|ref|NC_001416.1| 6 N 0 * *
gi|9626243|ref|NC_001416.1| 7 N 1 C =
gi|9626243|ref|NC_001416.1| 8 N 0 * *
gi|9626243|ref|NC_001416.1| 9 N 0 * *
gi|9626243|ref|NC_001416.1| 10 N 1 C E
在pileup格式中(没有-u或者-g参数),每一行代表
mpileup生成的结果包含6列
参考序列名
位置
参考碱基
比对上的reads数
比对情况
其中第5列比较复杂,解释如下:
1 ‘.’代表与参考序列正链匹配。
2 ‘,’代表与参考序列负链匹配。
3 ‘ATCGN’代表在正链上的不匹配。
4 ‘atcgn’代表在负链上的不匹配。
5 ‘*’代表模糊碱基
6 ‘’代表匹配的碱基是一个read的开始;’'后面紧跟的ascii码减去33代表比对质量;这两个符号修饰的是后面的碱基,其后紧跟的碱基(.,ATCGatcgNn)代表该read的第一个碱基。
7 ‘$’代表一个read的结束,该符号修饰的是其前面的碱基。
8 正则式’+[0-9]+[ACGTNacgtn]+’代表在该位点后插入的碱基;
9 正则式’-[0-9]+[ACGTNacgtn]+’代表在该位点后缺失的碱基;
比对上的碱基的质量
[sunchengquan 19:30:08 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup -f ../reference/lambda_virus.fa tmp.sorted.bam |tail
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
gi|9626243|ref|NC_001416.1| 48493 A 3 ,,,$ 43D
gi|9626243|ref|NC_001416.1| 48494 C 3 ,,, 6A<
gi|9626243|ref|NC_001416.1| 48495 A 2 ,$, />
gi|9626243|ref|NC_001416.1| 48496 G 2 ,, ?D
gi|9626243|ref|NC_001416.1| 48497 G 1 , <
gi|9626243|ref|NC_001416.1| 48498 T 0 * *
gi|9626243|ref|NC_001416.1| 48499 T 1 ,$ B
gi|9626243|ref|NC_001416.1| 48500 A 0 * *
gi|9626243|ref|NC_001416.1| 48501 C 0 * *
gi|9626243|ref|NC_001416.1| 48502 G 0 * *
[sunchengquan 19:38:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools mpileup -uvf ../reference/lambda_virus.fa tmp.sorted.bam |head -25
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
##fileformat=VCFv4.2
##FILTER=
##samtoolsVersion=1.6+htslib-1.6
##samtoolsCommand=samtools mpileup -uvf ../reference/lambda_virus.fa tmp.sorted.bam
##reference=file://../reference/lambda_virus.fa
##contig=
##ALT=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tmp.sorted.bam
gi|9626243|ref|NC_001416.1| 1 . G <*> 0 . DP=1;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0 PL 0,0,0
gi|9626243|ref|NC_001416.1| 2 . G <*> 0 . DP=2;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0,0;MQ0F=0 PL 0,0,0
gi|9626243|ref|NC_001416.1| 2 . GGCG GGCGCGGGGGCG 0 . INDEL;IDV=1;IMF=0.5;DP=2;I16=0,0,1,0,0,0,41,1681,0,0,42,1764,0,0,0,0;QS=0,1;SGB=-0.379885;MQ0F=0 PL 41,3,0
主要功能:对bam文件进行排序(不能对sam文件进行排序)
[sunchengquan 19:45:09 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-T PREFIX Write temporary files to PREFIX.nnnn.bam
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-O, --output-fmt FORMAT[,OPT[=VAL]]...
Specify output format (SAM, BAM, CRAM)
--output-fmt-option OPT[=VAL]
Specify a single output file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
-l:设置文件压缩等级,0不压缩,9压缩最高
-m:每个线程运行内存大小(可使用K M G表示)
-n:按照read名称进行排序
-o:排序后的输出文件
-T:PREFIX临时文件前缀
-@:设置排序和压缩的线程数,默认单线程
[sunchengquan 22:24:41 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort -l 9 -m 90M -n -o tmp.sorted.bam -T sorted -@ 2 tmp.bam
[bam_sort_core] merging from 0 files and 2 in-memory blocks...
查看tmp.sorted.bam
[sunchengquan 22:35:28 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.sorted.bam|less -S
r1 99 gi|9626243|ref|NC_001416.1| 18401 42 122M = 18430
r1 147 gi|9626243|ref|NC_001416.1| 18430 42 198M = 18401
r2 99 gi|9626243|ref|NC_001416.1| 8886 42 275M = 8973
r2 147 gi|9626243|ref|NC_001416.1| 8973 42 188M = 8886
r3 83 gi|9626243|ref|NC_001416.1| 11599 42 338M = 11596
r3 163 gi|9626243|ref|NC_001416.1| 11596 42 67M = 11599
r4 99 gi|9626243|ref|NC_001416.1| 40075 42 184M = 40211
r4 147 gi|9626243|ref|NC_001416.1| 40211 42 48M = 40075
不加-n
参数
[sunchengquan 22:36:58 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools sort -l 9 -m 90M -o tmp.sorted.bam -T sorted -@ 2 tmp.bam
[bam_sort_core] merging from 0 files and 2 in-memory blocks...
[sunchengquan 22:38:10 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view tmp.sorted.bam|less -S
r523 165 gi|9626243|ref|NC_001416.1| 1 0 * = 1
r2543 165 gi|9626243|ref|NC_001416.1| 1 0 * = 1
r4819 163 gi|9626243|ref|NC_001416.1| 1 24 4M16I183M =
r523 89 gi|9626243|ref|NC_001416.1| 1 0 3M26I196M =
r2543 89 gi|9626243|ref|NC_001416.1| 1 42 214M = 1
r6080 177 gi|9626243|ref|NC_001416.1| 1 42 4M2I54M = 5850
r9207 177 gi|9626243|ref|NC_001416.1| 1 42 2M6I204M =
r9555 113 gi|9626243|ref|NC_001416.1| 1 42 116M = 5826
r3903 99 gi|9626243|ref|NC_001416.1| 2 42 1M8I238M =
r874 99 gi|9626243|ref|NC_001416.1| 4 42 49M = 232
r3796 113 gi|9626243|ref|NC_001416.1| 8 42 120M = 5912
作用:根据read group 分割bam文件
[sunchengquan 17:04:02 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ find *|grep '[0-9]$' |awk '{print "@RG\tID:"$0}' >> inh.bam
[sunchengquan 17:04:06 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools reheader -P inh.bam tmp.merge.bam > tmp.merge.h.bam
[sunchengquan 17:04:14 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -H tmp.merge.h.bam
@HD VN:1.0 SO:coordinate
@SQ SN:gi|9626243|ref|NC_001416.1| LN:48502
@PG ID:bowtie2 PN:bowtie2 CL:"/home/sunchengquan/local/app/bowtie2-2.3.4.3-linux-x86_64/bowtie2-align-s --wrapper basic-0 -x ../index/lambda_virus -1 reads_1.fq -2 reads_2.fq" VN:2.3.4.3
@RG ID:tmp.sorted.bam1
@RG ID:tmp.sorted.bam2
[sunchengquan 17:04:27 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools split tmp.merge.h.bam
[sunchengquan 17:04:43 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ ll * |grep '.*merge'
-rw-r----- 1 sunchengquan sunchengquan 2.7M 12月 21 16:29 tmp.merge.bam
-rw-r----- 1 sunchengquan sunchengquan 2.5M 12月 21 17:04 tmp.merge.h_0.bam
-rw-r----- 1 sunchengquan sunchengquan 2.5M 12月 21 17:04 tmp.merge.h_1.bam
-rw-r----- 1 sunchengquan sunchengquan 2.7M 12月 21 17:04 tmp.merge.h.bam
作用:bam文件转换为fastq
[sunchengquan 17:11:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools fastq tmp.sorted.bam -1 read1.fq.gz -2 read2.fq.gz
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 20000 reads
[sunchengquan 17:13:44 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ ll *|grep 'gz$'
-rw-r----- 1 sunchengquan sunchengquan 1.1M 12月 21 17:13 read1.fq.gz
-rw-r----- 1 sunchengquan sunchengquan 1.1M 12月 21 17:13 read2.fq.gz
[sunchengquan 17:14:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fq.gz |wc -l
40000
[sunchengquan 17:14:58 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat read_1.fq |wc -l
cat: read_1.fq: 没有那个文件或目录
0
[sunchengquan 17:15:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat reads_1.fq |wc -l
40000
bam文件转换为fasta
[sunchengquan 17:15:31 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools fasta tmp.sorted.bam -1 read1.fa.gz -2 read2.fa.gz
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 20000 reads
[sunchengquan 17:18:56 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fa.gz |wc -l
20000
[sunchengquan 17:19:09 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ zcat read1.fa.gz |head
>r523
GACTTCCATTGTTCATTCCACGGACACAAACAGAGAAAGGAAACGACAGAGNCCAAAAAGCCTNGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAATTTTCATAAATAGCGAAAACCCGCGAGGNCGCCGCCCCCGCGCCCGGCGGANTGAGTGCGAAA
>r2543
CAGCTGCTTTTTGTTGACTTCCATTGTTCATTCCACGGACAAAAACAGAGANAGGAAACGACAGAGGCCAAAAAGCCTCGCTTTCAGCACCTGTCGTTTCCTTTCTTTTCAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAAATTTCNTAAATAGCGAAAACCCGCGAGGTCGCCGCCC
>r9555
TCCCTTCTTTTAAGAGGGTATTTTAAATAAAAACATTAAGTTATGACGAAGAAGAACGGAAACGCCTTAAACCGGAAAATTTTCATAAATAGCGAAAACCCGCGAGGTCGCCGCCC
>r3903
NGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGTAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCG
>r874
CGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTTCGGTTTAAGG
输出每个panel的总的碱基数;或者将一个panel中每个碱基的深度加到一起。panel指的是每个bed中的一个窗口,由chrID,start,end定义使用方法
输出每个碱基的深度,即匹配到的次数。可以通过-b
添加bed文件,来缩小范围。另外。默认最大深度为8000,如果是高深度测序应该通过-d
或-m
来定义更大的深度。如果需要统计深度为0的位点,应该加上-a
选项。参数较多,主要参数如下:
-a:输出所有的碱基深度(包括0)
-b/-r:控制深度的范围(后面跟染色体)
-f:bam文件名字
-l:设置read长度阈值
-d/-m:最大覆盖深度
-q:碱基质量阈值
-Q:比对质量阈值
查看100-200bp区域的深度
[sunchengquan 19:14:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools index tmp.sorted.bam
[sunchengquan 19:15:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools depth -r 'gi|9626243|ref|NC_001416.1|':100-200 tmp.sorted.bam > tmp.sorted.bam.depth
[sunchengquan 19:15:22 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ cat tmp.sorted.bam.depth|head
gi|9626243|ref|NC_001416.1| 100 30
gi|9626243|ref|NC_001416.1| 101 31
gi|9626243|ref|NC_001416.1| 102 31
gi|9626243|ref|NC_001416.1| 103 31
gi|9626243|ref|NC_001416.1| 104 30
gi|9626243|ref|NC_001416.1| 105 30
gi|9626243|ref|NC_001416.1| 106 30
gi|9626243|ref|NC_001416.1| 107 30
gi|9626243|ref|NC_001416.1| 108 30
gi|9626243|ref|NC_001416.1| 109 29
结果是tab分割的三列
第一列:染色体名称
第二列:碱基位置
第三列:测序深度
对bam文件进行简单的统计,一共有13项,根据QC结果,每一项都分PASS和FAIL,一般都是PASS,毕竟一般QC都在比对前做过了,低质量的大都过滤了
[sunchengquan 19:15:49 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flagstat tmp.sorted.bam
20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
18995 + 0 mapped (94.98% : N/A)
20000 + 0 paired in sequencing
10000 + 0 read1
10000 + 0 read2
18332 + 0 properly paired (91.66% : N/A)
18416 + 0 with itself and mate mapped
579 + 0 singletons (2.90% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
总的reads
20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
总体reads比对率
18995 + 0 mapped (94.98% : N/A)
PEreads
20000 + 0 paired in sequencing
read1
10000 + 0 read1
read2
10000 + 0 read2
PE reads比对到同一条参考序列,并且两条reads之间的距离符合设置的阈值
18332 + 0 properly paired (91.66% : N/A)
PE中两条都比对到参考序列上的reads数
18416 + 0 with itself and mate mapped
单独一条匹配到参考序列上的reads数,和上一个相加,则是总的匹配上的reads数
579 + 0 singletons (2.90% : N/A)
PE中两条分别比对到两条不同的参考序列的reads数
0 + 0 with mate mapped to a different chr
PE中两条分别比对到两条不同的参考序列的reads数(比对质量>=5)
0 + 0 with mate mapped to a different chr (mapQ>=5)
对bam/sam/cram文件的索引进行输出,一共四列:
文件的索引文件必须存在于输入文件相同目录下,如果没有需要先建立索引
[sunchengquan 19:49:46 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools idxstats tmp.sorted.bam
gi|9626243|ref|NC_001416.1| 48502 18995 579
* 0 0 426
对bam文件做详细统计,其统计结果可用mics/plot-bamstats作图
提取SN
[sunchengquan 20:08:45 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools stats tmp.sorted.bam|grep '^SN' |cut -f 2-
raw total sequences: 20000
filtered sequences: 0
sequences: 20000
is sorted: 1
1st fragments: 10000
last fragments: 10000
reads mapped: 18995
reads mapped and paired: 18416 # paired-end technology bit set + both mates mapped
reads unmapped: 1005
reads properly paired: 18332 # proper-pair bit set
reads paired: 20000 # paired-end technology bit set
reads duplicated: 0 # PCR or optical duplicate bit set
reads MQ0: 11 # mapped and MQ=0
reads QC failed: 0
non-primary alignments: 0
total length: 2178385 # ignores clipping
bases mapped: 2079519 # ignores clipping
bases mapped (cigar): 2079519 # more accurate
bases trimmed: 0
bases duplicated: 0
mismatches: 51081 # from NM fields
error rate: 2.456385e-02 # mismatches / bases mapped (cigar)
average length: 108
maximum length: 366
average quality: 16.3
insert size average: 251.6
insert size standard deviation: 44.2
inward oriented pairs: 8485
outward oriented pairs: 0
pairs with other orientation: 40
pairs on different chromosomes: 0
提取IS
[sunchengquan 21:25:35 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools stats tmp.sorted.bam|grep '^IS' |cut -f 2-|tail
357 7 7 0 0
358 5 5 0 0
359 6 6 0 0
360 4 4 0 0
361 7 7 0 0
362 5 5 0 0
363 1 1 0 0
364 0 0 0 0
365 1 1 0 0
366 4 4 0 0
将字符型和数字型flag相互转换
[sunchengquan 21:29:13 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags 83
0x53 83 PAIRED,PROPER_PAIR,REVERSE,READ1
[sunchengquan 21:29:20 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read2
0x80 128 READ2
[sunchengquan 21:30:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read1
0x40 64 READ1
[sunchengquan 21:30:24 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools flags read1,read2
0xc0 192 READ1,READ2
直观显示reads比对到基因组的情况,和基因组浏览器类似
主要参数:
-d:输出类型
-p:直接定位给定位置
-s:reads显示
当给出参考基因组的时候,会在第一排给出参考基因组的序列,否则第一排全用N表示。
首先利用sort进行排序后,在利用index建立索引后,用下面命令:
[sunchengquan 21:35:17 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools tview -p 'gi|9626243|ref|NC_001416.1|':300 tmp.sorted.bam
示意图:
利用颜色来标注比对质量、碱基质量、核苷酸等。
>=30 的碱基质量或比对质量使用白色表示
20-39的用黄色表示
10-19的用绿色表示
0-9的用蓝色表示
使用字母r切换显示read name等。
具体的操作说明可以?键查看。
sam和bam文件之间相互转换,针对bam文件进行相关操作。bam文件是sam文件的二进制格式,占据内存较小且运算速度快
重要参数释义:
-b:输出bam格式,用于后续分析
-C:输出CRAM文件
-1:快速压缩(需要-b)
-u:输出未压缩的bam文件,节约时间,占据较多磁盘空间(需要-b)
-h:默认输出sam文件不带表头,该参数设定后输出带表头信息
-H:仅仅输出表头信息
-c:仅打印匹配数
-o:输出文件(stdout标准输出)
-U:输出没有经过过滤选择的reads
-t:制表分隔符文件(需要提供额外的参考数据,比如参考基因组的索引)
-L:仅包括和bed文件有重叠的reads
-r:仅输出在STR读段组中的reads
-R:仅输出特定reads
-q:定位的质量大于INT[默认0]
-l:仅输出在STR 库中的reads
-F:过滤标识(flag)(去除匹配标识对应的reads,保留剩余的)[默认0]
-f:匹配标识(flag)(保留匹配的标识对应的reads)[默认0]
-T:使用fasta格式的参考序列
[sunchengquan 21:41:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -h tmp.sorted.bam >tmp.sam
[sunchengquan 21:41:12 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -bS tmp.sam > tmp.sorted.bam
$ samtools view -f 16 tmp.sorted.bam|head -1
r523 89 gi|9626243|ref|NC_001416.1| 1 0 3M26I196M = 1 0 TTTCGCACTCANTCCGCCGGGCGCGGGGGCGGCGNCCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCNAGGCTTTTTGGNCTCTGTCGTTTCCTTTCTCTGTTTGTGTCCGTGGAATGAACAATGGAAGTC "1&;;,*'<-1G$427,3>D:+F+';13<:06=F2$:71D5G><@E@;0>");,G0=@7)!:H9&D.C?=1@00H=81E3C,8@<=B?;9;'.+2A!H<#!073&7=8&9G!CHDCG*>%'1C!;9!.+68,5?9304=12>4:18:C&%G7+H6-8CD0$:2"./(#D5E14?5=230>"?4?>G&@>'60&(?;44:3G5/B::4'9FC);(52!@<<6F5=5 AS:i:-97 XN:i:0 XM:i:7 XO:i:1 XG:i:26 NM:i:33 MD:Z:0G0G0G5A126G11C24T26 YT:Z:UP
[sunchengquan 21:55:03 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -F 16 tmp.sorted.bam|head -1
r523 165 gi|9626243|ref|NC_001416.1| 1 0 * = 1 0 TTTCGCACTCAATCCGCCGGGCGCGGGGGCGGCGACCTCGCGGGTTTTCGCTATTTAT (?,8F5CH3+1H<H&.7;F+H?@>%$./?%>F.+/9(H3?4%.=<6A>&"5/.FG?-% YT:Z:UP
[sunchengquan 21:56:24 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -c tmp.sorted.bam
20000
[sunchengquan 21:58:02 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ echo `samtools view -c tmp.sorted.bam`/2 |bc
10000
[sunchengquan 21:59:33 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -cf 4 tmp.sorted.bam
1005
[sunchengquan 22:09:01 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -f 4 tmp.bam|cut -f10 |head -3
GCGTNGTACNNGNNGATGNTTACGACTAANACTTNNNNNTNCNGNNT
ANTGGNTNAGCGGNCGNAGNGNTCNGNAGCNANNAANCTGNTATGNTNNNTNNATCNNNCNNNNGTGNNAANNNCCAGNNNNGGCNNNNCATACANNNTNNNCNTNNNTACANNATANNCGNNATTNNACNGTNNCGNGNNCCGNGTNNNNN
NNTNNNNNAGNTNAANANANGTNTTGNGAANCTNNANACNNCTTTTNNNNTATGNTNNTNAGCCTAG
[sunchengquan 22:16:19 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -q 30 tmp.bam |awk '{print $1,$5}'|head -3
r1 42
r1 42
r2 42
[sunchengquan 22:17:46 ~/local/app/bowtie2-2.3.4.3-linux-x86_64/example/reads]
$ samtools view -F 4 tmp.bam |awk '{print $6}'|grep '[IDNSHPX]'|head -4
72M2D119M
126M3D61M
70M1D71M
53M5D134M