70. Bioinformatics Data Skills: pysam AlignedSegment attributes

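The pysam package represents a BAM/SAM file as an AlignmentFile object and each read (alignment record) in it as an AlignedSegment object. AlignedSegment exposes a rich set of attributes that carry most of the information you need about a read; this note walks through some of them.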

Basic AlignedSegment attributes

Using NA12891_CEU_sample.bam as the example again, first open the file:

>>> import pysam
>>> bamfile = pysam.AlignmentFile("NA12891_CEU_sample.bam", mode = "rb")

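Fetch the first read, which is returned as an AlignedSegment (the built-in next() works on both Python 2 and Python 3, whereas the .next() method is Python 2 only):

>>> read = next(bamfile)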

Use dir() to list every attribute of the AlignedSegment; there are many:

>>> dir(read)
['__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'aend', 'alen', 'aligned_pairs', 'bin', 'blocks', 'cigar', 'cigarstring', 'cigartuples', 'compare', 'flag', 'from_dict', 'fromstring', 'get_aligned_pairs', 'get_blocks', 'get_cigar_stats', 'get_forward_qualities', 'get_forward_sequence', 'get_overlap', 'get_reference_positions', 'get_reference_sequence', 'get_tag', 'get_tags', 'has_tag', 'header', 'infer_query_length', 'infer_read_length', 'inferred_length', 'is_duplicate', 'is_paired', 'is_proper_pair', 'is_qcfail', 'is_read1', 'is_read2', 'is_reverse', 'is_secondary', 'is_supplementary', 'is_unmapped', 'isize', 'mapping_quality', 'mapq',
'mate_is_reverse', 'mate_is_unmapped', 'mpos', 'mrnm', 'next_reference_id', 'next_reference_name', 'next_reference_start', 'opt', 'overlap', 'pnext', 'pos', 'positions', 'qend', 'qlen', 'qname', 'qqual', 'qstart', 'qual', 'query', 'query_alignment_end', 'query_alignment_length', 'query_alignment_qualities', 'query_alignment_sequence', 'query_alignment_start', 'query_length', 'query_name', 'query_qualities', 'query_sequence', 'reference_end', 'reference_id', 'reference_length', 'reference_name', 'reference_start', 'rlen', 'rname', 'rnext', 'seq', 'setTag', 'set_tag', 'set_tags', 'tags', 'template_length', 'tid', 'tlen', 'to_dict', 'to_string', 'tostring']

The most basic ones are the read name, the read sequence, the per-base qualities, and the read length:

>>> read.query_name
'SRR005672.8895'
>>> read.query_sequence
'GGAATAAATATAGGAAATGTATAATATATAGGAAATATATATATATAGTAA'
>>> read.query_qualities
array('B', [26, 28, 27, 29, 28, 27, 29, 27, 24, 27, 28, 27, 24, 25, 27, 28, 30, 29, 27, 28, 29, 30, 29, 31, 30, 29, 29, 24, 29, 29, 29, 28, 29, 31, 32, 31, 30, 31, 30, 31, 30, 30, 22, 30, 30, 28, 28, 25, 16, 24, 26])
>>> read.query_length
51
>>> len(read.query_sequence) == read.query_length
True
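
The integers in query_qualities are Phred scores, while the QUAL column of a SAM file stores them as ASCII characters with an offset of 33. The following is a minimal sketch of that relationship in plain Python; the helper names phred_to_qual_string and phred_to_error_prob are hypothetical and are not part of pysam, and the input is just the first five scores shown above.

```python
def phred_to_qual_string(qualities, offset=33):
    """Encode integer Phred scores as a SAM QUAL string (Phred+33)."""
    return "".join(chr(q + offset) for q in qualities)


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10.0)


# First five scores from the query_qualities output above.
quals = [26, 28, 27, 29, 28]

qual_string = phred_to_qual_string(quals)
error_probs = [phred_to_error_prob(q) for q in quals]

print(qual_string)
print(error_probs)
```

A score of 30 corresponds to an error probability of 0.001, so most bases in this read are called with an estimated error rate of a few in a thousand.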

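Bitwise flag attributes

SAM files have a bitwise flag column: a single integer whose individual bits encode a lot of information about the read. pysam already decodes these bits into intuitive boolean attributes that can be read directly: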

>>> read.is_unmapped
False
>>> read.is_paired
True
>>> read.is_proper_pair
True
>>> read.is_qcfail
False
>>> read.is_read1
True
>>> read.is_read2
False
...
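
To make the link between the flag integer and these boolean attributes concrete, here is a small sketch that decodes a flag by hand with bit masks. The bit values follow the SAM specification and the keys reuse the pysam attribute names, but decode_flag is a hypothetical helper, and the flag value 99 is an illustrative assumption rather than a value read from the example BAM file.

```python
# Flag bits as defined in the SAM specification, keyed by the
# corresponding pysam attribute names.
FLAG_BITS = {
    "is_paired": 0x1,
    "is_proper_pair": 0x2,
    "is_unmapped": 0x4,
    "mate_is_unmapped": 0x8,
    "is_reverse": 0x10,
    "mate_is_reverse": 0x20,
    "is_read1": 0x40,
    "is_read2": 0x80,
    "is_secondary": 0x100,
    "is_qcfail": 0x200,
    "is_duplicate": 0x400,
    "is_supplementary": 0x800,
}


def decode_flag(flag):
    """Return a dict mapping each attribute name to True/False for this flag."""
    return {name: bool(flag & bit) for name, bit in FLAG_BITS.items()}


# 99 = 1 + 2 + 32 + 64: paired, proper pair, mate on reverse strand, first in pair.
# This value is made up for illustration.
decoded = decode_flag(99)
for name in sorted(decoded):
    print(name, decoded[name])
```

With pysam itself the same integer is available as read.flag, so a check such as bool(read.flag & 0x1) should agree with read.is_paired.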

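CIGAR attributes

CIGAR is another important SAM column. It records how many bases of a read are aligned (as matches or mismatches), inserted, deleted, or soft-clipped. AlignedSegment.cigartuples and AlignedSegment.cigarstring both expose it: the former returns a list of (operation, length) tuples with the operation encoded as an integer, and the latter returns the CIGAR string. Here we focus on soft-clipping, because the part of a read that aligns to the reference can be shorter than the original read. First find a read that has been soft-clipped: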

>>> for read in bamfile:
...     # cigarstring is None for unmapped reads, so guard before testing it
...     if read.cigarstring and 'S' in read.cigarstring:
...         break
...
>>> read.cigarstring
'35M16S'

Its full read sequence is:

>>> read.query_sequence
'TAGGAAATGTATAATATATAGGAAATATATATATATAGGAAATATATAATA'

The part of it that aligns to the reference is:

>>> read.query_alignment_sequence
'TAGGAAATGTATAATATATAGGAAATATATATATA'

This read has 16 soft-clipped bases, so the two sequences differ in length by 16 bases:

>>> len(read.query_sequence) - len(read.query_alignment_sequence)
16
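
The same numbers can be recovered from the CIGAR string alone. The sketch below parses '35M16S' into (operation, length) tuples using the integer operation codes that pysam uses for cigartuples (M=0, I=1, D=2, N=3, S=4, H=5, P=6, '='=7, X=8) and then sums the lengths by category. The function names are hypothetical helpers written for this note, not pysam calls.

```python
import re

# Operation letters in the order of their integer codes (M=0 ... X=8).
CIGAR_OPS = "MIDNSHP=X"
SOFT_CLIP = CIGAR_OPS.index("S")

# Operations that consume bases of the read and of the reference,
# following the SAM specification.
CONSUMES_QUERY = {CIGAR_OPS.index(op) for op in "MIS=X"}
CONSUMES_REFERENCE = {CIGAR_OPS.index(op) for op in "MDN=X"}


def parse_cigar(cigarstring):
    """Turn a CIGAR string into a list of (operation_code, length) tuples."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigarstring)
    return [(CIGAR_OPS.index(op), int(length)) for length, op in pairs]


def total_length(cigartuples, ops):
    """Sum the lengths of all operations whose code is in ops."""
    return sum(length for op, length in cigartuples if op in ops)


tuples = parse_cigar("35M16S")
read_length = total_length(tuples, CONSUMES_QUERY)
soft_clipped = total_length(tuples, {SOFT_CLIP})
reference_span = total_length(tuples, CONSUMES_REFERENCE)

print(tuples)                       # [(0, 35), (4, 16)]
print(read_length)                  # 51
print(soft_clipped)                 # 16
print(read_length - soft_clipped)   # 35
print(reference_span)               # 35
```

The read length of 51 minus the 16 soft-clipped bases gives the 35 aligned bases, matching the length of query_alignment_sequence above.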
