bam 格式 | 测序比对文件探究(STAR 比对 + pysam 包操作)

bam 共11列+12列的可选tag。每一列的意义见 此前的总结, sam格式详解。

1. bam 中第10列保存的序列是原始fastq中的序列吗?

做一个测试,

(1) 一段序列,及其反向互补序列

$ vim test.fq 
@seq001_plus
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F
@seq001_minus
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFF:,FF,,,:,,:F,,,,,F,:,,,,F,FF,,F,F:,F:F:,:F:FFF,,,F

压缩 $ gzip test.fq

(2) 比对

比对,从 snakemake 复制的:
$ mkdir map
$ STAR --runThreadN 1 \
	--genomeDir /home/wangjl/data/ref/hg38/gencode/index/STAR/ \
	--readFilesIn test.fq.gz \
	--readFilesCommand zcat \
	--outFilterMultimapNmax 1 \
	--outSAMtype BAM SortedByCoordinate \
	--outFilterMatchNminOverLread 0.13 --outFilterScoreMinOverLread 0.13 \
	--outFileNamePrefix map/A_
耗时 2min,载入基因组占了1分多。

(3) 查看结果

查看结果:
$ samtools view map/A_Aligned.sortedByCoord.out.bam 
seq001_plus     0       chr14   69380255        255     98M     *       0       0       CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG      FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFF:FFFFFFFF,F:FFFF:F:FF:F   NH:i:1  HI:i:1  AS:i:94 nM:i:1
seq001_minus    16      chr14   69380255        255     98M     *       0       0       CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG      F,,,FFF:F:,:F:F,:F,F,,FF,F,,,,:,F,,,,,F:,,:,,,FF,:FFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:1  HI:i:1  AS:i:94 nM:i:1

结论: bam中保存的是参考基因组fa中的正向序列。

  • 也就是说,对比到+链上的序列,bam中第10列的序列就是fastq的原始序列。
  • 比对到-链的序列,bam中的序列是fq的反向互补序列。

2. pysam 从bam怎么获取序列才能和fastq中一致?

继续测试,使用上文输出的bam
$ samtools index /home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam

inputFile="/home/wangjl/data/MDA/test2/mapTest/map/A_Aligned.sortedByCoord.out.bam"
import pysam
samfile = pysam.AlignmentFile(inputFile, "rb")
for line in samfile:
	print(line)
	print(line.get_forward_sequence())  #和fq一致
	print(line.seq)  #仅仅是bam中的序列
	print()
samfile.close()

输出:

seq001_plus	0	#13	69380255	255	98M	*	0	0	CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG	array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 37, 37, 37, 25, 37, 25, 37, 37, 25, 37])	[('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG

seq001_minus	16	#13	69380255	255	98M	*	0	0	CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG	array('B', [37, 11, 11, 11, 37, 37, 37, 25, 37, 25, 11, 25, 37, 25, 37, 11, 25, 37, 11, 37, 11, 11, 37, 37, 11, 37, 11, 11, 11, 11, 25, 11, 37, 11, 11, 11, 11, 11, 37, 25, 11, 11, 25, 11, 11, 11, 37, 37, 11, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])	[('NH', 1), ('HI', 1), ('AS', 94), ('nM', 1)]
CTTTCTCTTTCTTTTTTTTTTTTTACTTAAACATTTTTATGATGATTTAGATGGAAGTTGTTCTTCGTCACTTAATGTTGGTTCCAGTTCTTCAACTG
CAGTTGAAGAACTGGAACCAACATTAAGTGACGAAGAACAACTTCCATCTAAATCATCATAAAAATGTTTAAGTAAAAAAAAAAAAAGAAAGAGAAAG

结论: 也就是说,我们要使用 line.get_forward_sequence() 获取序列,才是 fastq 中的序列。

3. bam 中第5列MAPQ255(STAR输出)时第二列Flag1024的意义?

flag1024 表示该序列为PCR重复,但是如果 mapq255呢?

$ samtools view B04_bam_pA/pA_AGCCGGTGCGATAC-1.bam |awk '{print $2"\t"$5}' |sort|uniq -c
    588 0       255
     84 1024    255
     87 1040    255
    685 16      255

解释第二列Flag的意义:

  • https://broadinstitute.github.io/picard/explain-flags.html
    • -F 3844 过滤掉质量低的reads,对于没有UMI的测序,使用比对层面判断是否PCR重复。
    • -F 2820 对于有UMI的reads,按照UMI去重,而不是依赖sam比对层面判断的是否PCR重复。

Ref

  • http://events.jianshu.io/p/3a56ab001a83

你可能感兴趣的:(高通量测序,高通量测序)