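A BAM record has 11 mandatory columns, followed by optional tags from column 12 onward. For the meaning of each column, see my earlier summary, "SAM format explained".
Let's run a test: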
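As a rough illustration of that layout, the sketch below splits one tab-separated SAM line into the 11 mandatory fields and a dictionary of optional tags. The field names follow the SAM specification; `parse_sam_line` is a hypothetical helper written for this note, and the example line is a made-up short record, not output from the test below.

```python
# The 11 mandatory SAM/BAM columns, in order (names from the SAM specification).
MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}

def parse_sam_line(line):
    """Split one SAM text line into mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY, cols[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    # Columns 12 and beyond are optional tags of the form TAG:TYPE:VALUE.
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record

# Made-up record, for illustration only.
example = "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tFFFF\tNH:i:1\tAS:i:3"
rec = parse_sam_line(example)
print(rec["FLAG"], rec["CIGAR"], rec["SEQ"], rec["tags"])
```

For real work, pysam or samtools is the better choice; this only shows where the 11 columns end and the tags begin.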
$ vim test.fq
@seq001_plus
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F
@seq001_minus
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:,FF,,,:,,:F,,,,,F,:,,,,F,FF,,F,F:,F:F:,:F:FFF,,,F
Compress it:
$ gzip test.fq
Align, using a command copied from a snakemake workflow:
$ mkdir map
$ STAR --runThreadN 1 \
--genomeDir /home/wangjl/data/ref/hg38/gencode/index/STAR/ \
--readFilesIn test.fq.gz \
--readFilesCommand zcat \
--outFilterMultimapNmax 1 \
--outSAMtype BAM SortedByCoordinate \
--outFilterMatchNminOverLread 0.13 --outFilterScoreMinOverLread 0.13 \
--outFileNamePrefix map/A_
This took about 2 min, of which more than 1 min was spent loading the genome.
View the result:
$ samtools view map/A_Aligned.sortedByCoord.out.bam
seq001_plus 0 chr14 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F NH:i:1 HI:i:1 AS:i:94 nM:i:1
seq001_minus 16 chr14 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG F,,,FFF:F:,:F:F,:F,F,,FF,F,,,,:,F,,,,,F:,,:,,,FF,:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:1 HI:i:1 AS:i:94 nM:i:1
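Conclusion: the BAM stores every read in the orientation of the forward strand of the reference genome FASTA.
For a read on the + strand, the sequence in column 10 of the BAM is the original FASTQ sequence.
For a read on the - strand, the sequence in the BAM is the reverse complement of the FASTQ sequence (and, as the records above show, the quality string is reversed to match).
Continuing the test with the BAM produced above: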
$ samtools index /home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam
# Python
import pysam

inputFile = "/home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam"
samfile = pysam.AlignmentFile(inputFile, "rb")
for line in samfile:
    print(line)
    print(line.get_forward_sequence())  # matches the FASTQ
    print(line.seq)  # the sequence exactly as stored in the BAM
    print()
samfile.close()
Output:
seq001_plus 0 #13 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 25, 37, 25, 37, 37, 25, 37]) [('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
seq001_minus 16 #13 69380255 255 98M * 0 0 CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG array('B', [37, 11, 11, 11, 37, 37, 37, 25, 37, 25, 11, 25, 37, 25, 37, 11, 25, 37, 11, 37, 11, 11, 37, 37, 11, 37, 11, 11, 11, 11, 25, 11, 37, 11, 11, 11, 11, 11, 37, 25, 11, 11, 25, 11, 11, 11, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37]) [('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
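A side note on the `array('B', [...])` part of the printed records: these numbers are the Phred scores behind the quality string, which for this data is the ASCII code of each character minus 33. The helper below is a minimal sketch of that conversion; `phred_scores` is a name made up for this note.

```python
def phred_scores(quality_string, offset=33):
    """Convert a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality_string]

# The three quality characters that occur in test.fq.
print(phred_scores("F:,"))  # [37, 25, 11]
```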
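Conclusion: to recover the sequence as it appears in the FASTQ, use line.get_forward_sequence(). line.seq only returns the sequence as stored in the BAM, which for a - strand read is the reverse complement of the original.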
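The rule behind this can be sketched in plain Python, without pysam: if bit 16 of the flag is set, reverse-complement the stored sequence and reverse the stored quality string; otherwise return both unchanged. `reverse_complement` and `fastq_orientation` are hypothetical helpers for illustration, and the short inputs are made up.

```python
# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def fastq_orientation(seq, qual, flag):
    """Return (sequence, quality) as they were in the FASTQ.

    seq and qual are columns 10 and 11 of the BAM record; flag is column 2.
    """
    if flag & 16:  # read aligned to the reverse strand
        return reverse_complement(seq), qual[::-1]
    return seq, qual

# Made-up short record, once with flag 0 and once with flag 16.
print(fastq_orientation("AACG", "F:,,", 0))
print(fastq_orientation("AACG", "F:,,", 16))
```

With pysam available, get_forward_sequence() and get_forward_qualities() on the read do the same job and should be preferred.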
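Flag 1024 marks a read as a PCR (or optical) duplicate. But what about reads whose MAPQ is 255, the value STAR assigns to uniquely mapped reads? Count the flag/MAPQ combinations in one BAM: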
$ samtools view B04_bam_pA/pA_AGCCGGTGCGATAC-1.bam |awk '{print $2"\t"$5}' |sort|uniq -c
588 0 255
84 1024 255
87 1040 255
685 16 255
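The flag is a bit field, so each value in this tally can be decoded by testing individual bits. The sketch below does that for the four values seen here; the bit meanings come from the SAM specification, while the short labels and the `decode_flag` helper are names chosen for this note.

```python
# SAM flag bits and short labels (meanings per the SAM specification).
FLAG_BITS = {
    1: "paired", 2: "proper_pair", 4: "unmapped", 8: "mate_unmapped",
    16: "reverse_strand", 32: "mate_reverse_strand", 64: "first_in_pair",
    128: "second_in_pair", 256: "secondary", 512: "qc_fail",
    1024: "duplicate", 2048: "supplementary",
}

def decode_flag(flag):
    """List the labels of all bits set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

for flag in (0, 16, 1024, 1040):
    print(flag, decode_flag(flag))
```

So 1040 = 1024 + 16 is a duplicate on the reverse strand, and 0 is a forward-strand read with no bits set.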
Meaning of the Flag values (the second column of the output above):