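References:
RNA-seq(5):序列比对 (RNA-seq (5): Sequence alignment)
RNA-seq练习 第二部分 (RNA-seq practice, part 2)
转录组入门(5): 序列比对 (Transcriptome primer (5): Sequence alignment)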
BAM and SAM formats are designed to contain the same information. The SAM format is more human readable, and easier to process by conventional text based processing programs, such as awk, sed, python, cut and so on. The BAM format provides binary versions of most of the same data, and is designed to compress reasonably well.
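Why convert between formats? To make the data easier for the computer to process. SAM (Sequence Alignment/Map) is the standard format for storing alignment data in high-throughput sequencing. BAM is the binary form of SAM and takes far less storage.
The main tool for working with SAM files is SAMtools, written by Heng Li. Besides the C implementation there are versions in other languages, such as Picard (Java), Pysam (Python) and cl-sam (Common Lisp). The subcommands used here are described below.
The three everyday operations are format conversion, sorting, and indexing. For anything more advanced, read the documentation.
1. Format conversion (view command)
# samtools view usage: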
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]
Options:
-b output BAM
-C output CRAM (requires -T)
-1 use fast BAM compression (implies -b)
-o FILE output file name [stdout]
-S ignored (input format is auto-detected)
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-@, --threads INT
Number of additional threads to use [0]
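# Example 1: convert with samtools view. -S marks the input as SAM (current versions ignore it and auto-detect the format); -b sets the output format to BAM; the result is redirected with > into the .bam file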
$ cd mnt/f/rna_seq/aligned
$ for ((i=56;i<=62;i++));do samtools view -S SRR35899${i}.sam -b > SRR35899${i}.bam;done
# Session log:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ls
m3108.sam m3110_1.sam m3110.sam m3111.sam m3112.sam m3113.sam m3114.sam m3122.sam msh1.sam msh2.sam Scr.sam
# Options used: -S, -b, -@. The output is written with > redirection; -o FILE can be used instead to name the output file
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools view -@ 8 -S Scr.sam -b > Scr.bam
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 256G
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 23G 6月 2 15:08 Scr.sam
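# Comparing the two: the .sam file is 5 to 6 times the size of the .bam file
# Adding -1 (fast compression) makes the conversion slightly faster and the result slightly larger; compare the two files below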
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 13:16 Scr.bam-0
-rw-rw-r-- 1 zexing zexing 4.6G 6月 3 13:09 Scr.bam_1
Final command:
samtools view -@ 8 -S ${i}.sam -1b -o ${i}.bam
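In a loop, the variable in a command like this has to be written `${i}`. With `{$i}` the shell expands `$i` and keeps the braces as literal characters in the file name. The following is a minimal sketch of the difference. `bam_name` and `convert_all` are hypothetical helper names introduced here for illustration; they are not part of samtools.

```shell
#!/bin/bash
i=Scr

# {$i}: only $i is expanded, the braces stay in the file name
echo {$i}.sam      # prints {Scr}.sam

# ${i}: the braces delimit the variable name and disappear
echo ${i}.sam      # prints Scr.sam

# Hypothetical helper: swap a trailing .sam for .bam
bam_name() {
    printf '%s.bam\n' "${1%.sam}"
}

# Convert every SAM file given as an argument (needs samtools on PATH)
convert_all() {
    local sam
    for sam in "$@"; do
        samtools view -@ 8 -1 -b -o "$(bam_name "$sam")" "$sam"
    done
}
```

With these definitions loaded, `convert_all *.sam` would convert every SAM file in the current directory.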
2. Sorting (sort command)
# samtools sort usage:
Usage: samtools sort [options...] [in.bam]
Options:
-l INT Set compression level, from 0 (uncompressed) to 9 (best)
-m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
-n Sort by read name
-t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set)
-o FILE Write final output to FILE rather than standard output
-@, --threads INT
Number of additional threads to use [0]
# Example 1: sort all BAM files in the default order, by chromosomal coordinate
$ for ((i=56;i<=62;i++));do samtools sort SRR35899${i}.bam -o SRR35899${i}_sorted.bam;done
# Session log:
# Options used: -n, -l, -@; -o FILE names the output file; the input file goes last.
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort -@ 8 -n -l 4 -o Scr.bam.sort Scr.bam
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 270G
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6月 3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 23G 6月 2 15:08 Scr.sam
# Options used: -n, -l, -@; -o FILE names the output file; this time the input file goes before the options.
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort Scr.bam -@ 8 -n -l 9 -o Scr.bam.sort_9
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 274G
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6月 3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 4.1G 6月 3 13:38 Scr.bam.sort_9
-rw-rw-r-- 1 zexing zexing 23G 6月 2 15:08 Scr.sam
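# A higher compression level shrinks the file somewhat (4.1G at -l 9 against 4.5G at -l 4)
# The default sort is by chromosomal coordinate; -n sorts by read name, and -t sorts by the value of a TAG.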
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools sort -@ 8 -l 5 -o Scr.bam.sort_n Scr.bam
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 263G
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 4.5G 6月 3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 2.1G 6月 3 14:26 Scr.bam.sort_n
-rw-rw-r-- 1 zexing zexing 23G 6月 2 15:08 Scr.sam
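# Scr.bam.sort_n above was sorted in the default coordinate order (despite the _n in its name, no -n was passed) and is clearly smaller. Why does a BAM file shrink after sorting? BAM is a compressed binary format; once the records are sorted, similar content sits together, the compression ratio improves, and the sorted BAM comes out smaller. SAM, by contrast, is plain text, so sorting a SAM file does not change its size. In RNA-seq, reads pile up on highly expressed genes, so the gain in compression after samtools sort is even larger than for DNA-seq. In addition, after coordinate sorting, reads that did not align are placed at the end of the file.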
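The compression argument can be illustrated with a toy experiment. This is only an analogy, under stated assumptions: gzip on a text file stands in for the block compression inside BAM, and the "reads" are random substrings of a random sequence, not real data. Sorting by position puts overlapping reads next to each other, which is what lets the compressor reuse them.

```shell
#!/bin/bash
# Toy illustration only: gzip on text stands in for BAM's internal compression.
tmp=$(mktemp -d)

# Simulate 10000 reads of 50 bases at random positions of a random 100 kb sequence
awk 'BEGIN {
    srand(42)
    for (i = 1; i <= 100000; i++) base[i] = substr("ACGT", int(rand() * 4) + 1, 1)
    for (r = 1; r <= 10000; r++) {
        pos = int(rand() * 99951) + 1
        seq = ""
        for (k = 0; k < 50; k++) seq = seq base[pos + k]
        printf "%d\t%s\n", pos, seq
    }
}' > "$tmp/unsorted.tsv"

# Same records, ordered by position so that overlapping reads become neighbours
sort -n "$tmp/unsorted.tsv" > "$tmp/sorted.tsv"

# Compare the compressed sizes of the two orderings
unsorted_size=$(gzip -c "$tmp/unsorted.tsv" | wc -c)
sorted_size=$(gzip -c "$tmp/sorted.tsv" | wc -c)
echo "unsorted: $unsorted_size bytes, sorted: $sorted_size bytes"
rm -r "$tmp"
```

The position-sorted file should compress to noticeably fewer bytes than the unsorted one, although both hold exactly the same records.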
Final command:
samtools sort -@ 8 -l 5 -o ${i}.bam.sort ${i}.bam
3. Indexing (index command)
# samtools index usage:
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]
Options:
-b Generate BAI-format index for BAM files [default]
-c Generate CSI-format index for BAM files
-m INT Set minimum interval size for CSI indices to 2^INT [14]
-@ INT Sets the number of threads [none]
# Example 1: index all the sorted files; the index files get the .bai suffix
$ for ((i=56;i<=62;i++));do samtools index SRR35899${i}_sorted.bam;done
# Session log:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam Scr.bam.index
[E::hts_idx_push] Unsorted positions on sequence #15: 76904671 followed by 76904427
samtools index: failed to create index for "Scr.bam"
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam.sort Scr.bam.index
[E::hts_idx_push] Unsorted positions on sequence #19: 44575082 followed by 44574995
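# Both commands above fail. Scr.bam was never sorted, and Scr.bam.sort was sorted by read name (-n) in the earlier samtools sort run; an index can only be built on a coordinate-sorted file.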
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools index -@ 8 Scr.bam.sort_n Scr.bam.index
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ ll
total 263G
-rw-rw-r-- 1 zexing zexing 4.3G 6月 3 12:22 Scr.bam
-rw-rw-r-- 1 zexing zexing 2.2M 6月 3 14:58 Scr.bam.index
-rw-rw-r-- 1 zexing zexing 4.5G 6月 3 13:30 Scr.bam.sort
-rw-rw-r-- 1 zexing zexing 2.1G 6月 3 14:26 Scr.bam.sort_n
-rw-rw-r-- 1 zexing zexing 23G 6月 2 15:08 Scr.sam
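# This attempt succeeds: Scr.bam.sort_n was sorted in the default chromosomal coordinate order in the earlier samtools sort run, so the index is built without trouble.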
Final command:
samtools index -@ 8 ${i}.bam.sort ${i}.bam.index
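Since indexing fails on anything that is not coordinate-sorted, it can help to check the order first. samtools sort normally records it in the `SO:` field of the `@HD` header line (`coordinate` or `queryname`). The following is a minimal sketch. `sort_order` and `index_if_sorted` are hypothetical helper names, and files from other tools may lack the field altogether.

```shell
#!/bin/bash
# Read SAM header text on stdin and print the SO: value of the @HD line
sort_order() {
    awk -F'\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) { print substr($i, 4); exit }
    }'
}

# Index a BAM only if its header says it is coordinate-sorted
index_if_sorted() {
    local bam=$1
    if [ "$(samtools view -H "$bam" | sort_order)" = "coordinate" ]; then
        samtools index -@ 8 "$bam"
    else
        echo "skip $bam: not coordinate-sorted" >&2
        return 1
    fi
}
```

For a quick manual check, `samtools view -H Scr.bam.sort | sort_order` prints the recorded order of a file.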
4. Checking read alignment statistics (flagstat command)
# samtools flagstat usage:
Usage: samtools flagstat [options] <in.bam>
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
-@, --threads INT
Number of additional threads to use [0]
# Session log:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ samtools flagstat Scr.bam.sort
50976263 + 0 in total (QC-passed reads + QC-failed reads)
5208051 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
49743886 + 0 mapped (97.58% : N/A)
45768212 + 0 paired in sequencing
22884106 + 0 read1
22884106 + 0 read2
43519458 + 0 properly paired (95.09% : N/A)
43928560 + 0 with itself and mate mapped
607275 + 0 singletons (1.33% : N/A)
103944 + 0 with mate mapped to a different chr
76413 + 0 with mate mapped to a different chr (mapQ>=5)
Final command:
samtools flagstat -@ 8 ${i}.bam.sort
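The percentages in the flagstat report are simple ratios of the counts in the first column. As a hedged sketch, the mapped rate can be recomputed from the `in total` and `mapped` lines. `mapped_percent` is a hypothetical helper, and the field positions assume the text layout shown in the log above.

```shell
#!/bin/bash
# Read `samtools flagstat` text on stdin and print mapped / total as a percentage
mapped_percent() {
    awk '
        $4 == "in" && $5 == "total" { total = $1 }    # "N + 0 in total (...)"
        $4 == "mapped"              { mapped = $1 }   # "N + 0 mapped (...)"
        END { if (total > 0) printf "%.2f\n", 100 * mapped / total }
    '
}
```

Usage would be `samtools flagstat -@ 8 Scr.bam.sort | mapped_percent`. For the counts in the log above (49743886 mapped out of 50976263) this gives 97.58, matching the report.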
5. Combining the commands above in a shell script
Example 1:
#!/bin/bash
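# A small script that puts the three steps into one for loop; the loop processes each SAM file in turn
for i in `seq 77 80`
do
samtools view -S /media/yanfang/FYWD/RNA_seq/sam_files/SRR9576${i}.sam -b > /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam
# Step 1: convert the aligned SAM file to BAM. -S is followed by the path of the SAM file; -b sets the output format to BAM (it takes no argument); the result is redirected into the .bam file
samtools sort /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}.bam -o /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
# Step 2: sort each BAM file in the default order, by chromosomal coordinate. -o gives the output file name
samtools index /media/yanfang/FYWD/RNA_seq/bam_files/SRR9576${i}_sorted.bam
# Step 3: index each sorted file; the generated index file has the .bai suffix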
done
Session log:
# The shell script is as follows:
#!/bin/bash
# This script chains the samtools view, sort and index commands and loops over several of Zhao Xiujuan's datasets
# 2020/06/03 Li Zexing First release
for i in msh1 msh2 m3108 m3110 m3111 m3112 m3113 m3114
do
samtools view -@ 8 -S /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.sam -1b -o /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam
samtools sort -@ 8 -l 5 -o /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.sort /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam
samtools index -@ 8 /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.sort /f/xudonglab/zexing/projects/zhaoxiujuan/aligned/$i.bam.index
done
# Run it in the background as follows
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ nohup sh samtools.sh &
[1] 46028
# The resulting nohup.out is as follows:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ cat nohup.out
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
[W::sam_read1] Parse error at line 12148383
[main_samview] truncated file.
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] Parse error at line 13734
[main_samview] truncated file.
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
[bam_sort_core] merging from 24 files and 8 in-memory blocks...
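The `Parse error` and `truncated file` messages in this log suggest that two of the input SAM files could not be read to the end, yet the loop went on to sort and index whatever had been converted. One way to catch this is to stop a sample at its first failing step and report it. The following is a minimal sketch, assuming samtools exits with a non-zero status on such errors. `process_sample` and `process_all` are hypothetical names, and the relative file names stand in for the full paths used in the script.

```shell
#!/bin/bash
# SAMTOOLS can be overridden, e.g. with a full path; defaults to samtools on PATH
SAMTOOLS=${SAMTOOLS:-samtools}

# Convert, sort and index one sample; stop at the first step that fails
process_sample() {
    local s=$1
    "$SAMTOOLS" view -@ 8 -1 -b -o "$s.bam" "$s.sam" &&
    "$SAMTOOLS" sort -@ 8 -l 5 -o "$s.bam.sort" "$s.bam" &&
    "$SAMTOOLS" index -@ 8 "$s.bam.sort" "$s.bam.index"
}

# Run every sample and collect the names of those that failed
process_all() {
    local s failed=""
    for s in "$@"; do
        process_sample "$s" || failed="$failed $s"
    done
    if [ -n "$failed" ]; then
        echo "failed samples:$failed" >&2
        return 1
    fi
}
```

Calling `process_all msh1 msh2 m3108 m3110 m3111 m3112 m3113 m3114` would then end with a list of the samples whose SAM files need to be regenerated.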
View the alignment results with a shell loop; the session log is as follows:
(base) zexing@DNA:~/projects/zhaoxiujuan/aligned$ for i in msh1 msh2 m3108 m3110 m3111 m3112 m3113 m3114; do samtools flagstat -@ 8 $i.bam.sort; done
56463066 + 0 in total (QC-passed reads + QC-failed reads)
4886336 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
55172624 + 0 mapped (97.71% : N/A)
51576730 + 0 paired in sequencing
25788365 + 0 read1
25788365 + 0 read2
49186058 + 0 properly paired (95.36% : N/A)
49644168 + 0 with itself and mate mapped
642120 + 0 singletons (1.24% : N/A)
107584 + 0 with mate mapped to a different chr
83662 + 0 with mate mapped to a different chr (mapQ>=5)
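6. Other uses of samtools view
samtools view is a very handy subcommand: besides the format conversion shown earlier, it can also extract and filter data.
For example, it can extract the reads aligned to region 1234-123456 of chromosome 1.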
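A hedged sketch of both uses follows. Region queries need a coordinate-sorted BAM whose index sits at the default location (for instance one created by `samtools index` without an explicit output name). Whether the chromosome is called `1` or `chr1` depends on the reference used for alignment. The helper names and the MAPQ threshold of 30 are assumptions made for illustration.

```shell
#!/bin/bash
# Build a region string such as 1:1234-123456 (name:start-end)
region() {
    printf '%s:%d-%d\n' "$1" "$2" "$3"
}

# Extract the reads overlapping a region from a sorted, indexed BAM into a new BAM
extract_region() {
    local bam=$1 out=$2 chrom=$3 start=$4 end=$5
    samtools view -b -o "$out" "$bam" "$(region "$chrom" "$start" "$end")"
}

# Filtering: drop unmapped reads (-F 4) and keep alignments with MAPQ >= 30 (-q 30)
filter_mapped() {
    local bam=$1 out=$2
    samtools view -b -F 4 -q 30 -o "$out" "$bam"
}
```

For the region mentioned above, the call would be `extract_region sorted.bam region.bam 1 1234 123456`, where both file names are placeholders.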
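This part follows the reference article 转录组入门(5): 序列比对 (Transcriptome primer (5): Sequence alignment) and remains to be studied further.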