使用Pysam操作BAM文件

Pysam操作BAM文件

Pysam包是一个处理基因组数据的python模块，它打包了htslib-1.3、samtools-1.3 和 bcftools-1.3的核心功能，能在编程时非常灵活的处理bam和bcf文件，实现python处理基因组数据的无缝衔接，而不用在python程序内部调用samtools、bcftools等软件。

>>> import pysam
>>> samfile = pysam.AlignmentFile("ex1.bam", "rb")

导入pysam包，samfile是一个pysam.AlignmentFile对象，该对象有很多内置方法提供后续的分析，对BAM文件的操作也基本上基于这个对象。

AlignmentFile对象内置方法

check_index(self)
如果index文件存在则返回True

>>> samfile.check_index()
True

close(self)
关闭pysam.AlignmentFile
count(self，contig=None, start=None, stop=None, region=None, until_eof=False, read_callback='nofilter', reference=None，end=None)
计算目标区域内比对上的reads数目

>>> samfile.count(contig='chr1', start=1000000, stop=1000100)
107

count_coverage(self, contig=None, start=None, stop=None, region=None, quality_threshold=15, read_callback='all', reference=None, end=None)
计算目标区域内的覆盖度。返回1个4维的array，代表ACGT的覆盖度，而每个维度的array长度为100，里面的数字代表该碱基在各个位置上的覆盖度。

>>> samfile.count_coverage(contig='chr1', start=1000000, stop=1000100)
(array('L', [0, 1, 0, 0, 85, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 58, 0, 0, 0, 52, 0, 0, 47, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 35, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 22, 1, 1, 0, 0, 20, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),\
 array('L', [0, 0, 0, 0, 0, 0, 77, 0, 70, 0, 68, 67, 0, 58, 58, 0, 59, 0, 0, 1, 51, 49, 0, 48, 0, 0, 0, 41, 0, 0, 0, 40, 0, 0, 0, 38, 0, 0, 0, 36, 0, 0, 0, 36, 0, 0, 34, 0, 0, 0, 28, 0, 31, 0, 28, 0, 0, 25, 0, 0, 0, 21, 24, 0, 0, 0, 22, 22, 22, 22, 0, 0, 0, 0, 0, 19, 0, 0, 0, 0, 17, 20, 20, 0, 0, 17, 0, 17, 17, 0, 0, 0, 14, 15, 15, 0, 12, 0, 13, 0]), \
array('L', [94, 0, 88, 87, 0, 83, 0, 72, 0, 67, 0, 0, 61, 0, 0, 0, 0, 59, 56, 0, 0, 0, 0, 0, 46, 46, 40, 0, 41, 41, 41, 0, 0, 38, 38, 0, 36, 36, 36, 0, 36, 0, 36, 0, 34, 34, 0, 31, 0, 31, 0, 31, 0, 29, 0, 26, 25, 0, 24, 0, 1, 0, 0, 23, 0, 21, 0, 0, 0, 0, 0, 0, 20, 22, 21, 0, 19, 19, 9, 0, 0, 0, 0, 17, 19, 0, 0, 0, 0, 0, 15, 15, 0, 0, 0, 14, 0, 13, 0, 13]), \
array('L', [0, 90, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 37, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))

fetch(self, contig=None, start=None, stop=None, region=None, tid=None, until_eof=False, multiple_iterators=False, reference=None, end=None)
提取出比对到目标区域内的全部reads。返回的是一个迭代器，可以通过for循环或者next函数从中取出reads，我们使用next()函数取出第一条reads，reads是用AlignedSegment对象表示，可以通过该对象的内置方法再对这条reads进行一些查询操作。

>>> allreads=samfile.fetch(contig='chr1', start=1000000, stop=1000010)
>>> allreads

>>> reads1=next(allreads)
>>> reads1

find_introns(self, read_iterator)
返回一个字典{(start, stop): count}，列出了reads中的intronic sites（cigar字符串中的'N'）
get_index_statistics(self)
通过index统计该BAM文件中在各个染色体上mapped/unmapped的reads个数。

>>> samfile.get_index_statistics()
[IndexStats(contig='chr1', mapped=310538, unmapped=940, total=311478), IndexStats(contig='chr2', mapped=0, unmapped=0, total=0), IndexStats(contig='chr3', mapped=0, unmapped=0, total=0), IndexStats(contig='chr4', mapped=0, unmapped=0, total=0), IndexStats(contig='chr5', mapped=0, unmapped=0, total=0), IndexStats(contig='chr6', mapped=0, unmapped=0, total=0), IndexStats(contig='chr7', mapped=0, unmapped=0, total=0), IndexStats(contig='chr8', mapped=0,unmapped=0,
total=0), IndexStats(contig='chr9', mapped=0, unmapped=0, total=0), IndexStats(contig='chr10', mapped=0, unmapped=0, total=0), IndexStats(contig='chr11', mapped=0, unmapped=0, total=0), IndexStats(contig='chr12', mapped=0, unmapped=0, total=0), IndexStats(contig='chr13', mapped=0, unmapped=0, total=0), IndexStats(contig='chr14', mapped=0, unmapped=0, total=0), IndexStats(contig='chr15', mapped=0, unmapped=0, total=0), IndexStats(contig='chr16', mapped=0, unmapped=0, total=0), IndexStats(contig='chr17', mapped=0, unmapped=0, total=0), IndexStats(contig='chr18', mapped=0, unmapped=0, total=0), IndexStats(contig='chr19', mapped=0, unmapped=0, total=0), IndexStats(contig='chr20', mapped=0, unmapped=0, total=0), IndexStats(contig='chr21', mapped=0, unmapped=0, total=0), IndexStats(contig='chr22', mapped=0, unmapped=0, total=0), IndexStats(contig='chrX', mapped=0, unmapped=0, total=0), IndexStats(contig='chrY', mapped=0, unmapped=0, total=0), IndexStats(contig='chrM', mapped=0, unmapped=0, total=0)]

get_reference_name(self, tid)
得到tid（数字）相对应的reference name（字符串），例如在这个BAM中tid为0代表chr1，tid为1代表chr2，tid为24代表chrM。

>>> samfile.get_reference_name(0)
'chr1'
>>> samfile.get_reference_name(1)
'chr2'
>>> samfile.get_reference_name(24)
'chrM'

get_tid(self, reference)
相似的，将reference转换成tid

>>> samfile.get_tid('chr2')
1

has_index(self)
如果hts文件存在打开的index文件则返回True。
head(self, n, multiple_iterators=True)
返回含有前n个比对reads的迭代器，常用于检查bam文件。
header
得到BAM文件的头文件，用字典表示。
is_valid_tid(self, tid)
判断这个tid是否存在（合法），若存在则返回True。
lengths
返回各个reference的长度的一个元组。
mapped
返回比对的上（mapped）的reads个数。
unmapped
返回未比对的上（unmapped）的reads个数。
mate(self, AlignedSegment read)
返回传入的AlignedSegment read的另一配对reads（mate reads）。

>>> reads1

>>> samfile.mate(reads1)

nocoordinate
通过index文件统计没有比对坐标的reads数目
nreferences
返回reference的个数
pileup(self, contig=None, start=None, stop=None, region=None, reference=None, end=None, **kwargs)
类似于samtools的pileup操作，返回一个迭代器，每次迭代返回一个位点的PileupColumn对象,该对象的操作在后面进行详细介绍。

>>> pileupcolumn = samfile.pileup("chr1", 999863, 999865,truncate=True)
>>> pileupcolumn

>>> pc=next(pileupcolumn)
>>> pc

references
得到所有的reference name。

>>> samfile.references
('chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM')

text
得到字符串类型的BAM header信息
write(self, AlignedSegment read) → int
向samfile中写入信息（AlignedSegment格式reads）

AlignedSegmennt对象内置方法

该对象代表了某个比对上的segment。使用以下的内置方法属性来查询操作这个segment。

bin
不知道什么意思
cigarstring
该segment的cigar，同bam文件中一个segment记录的cigar字段,字符串格式。

>>> reads1.cigarstring
'151M'

cigartuples
cigar记录的元组格式，下面的例子就代表了151M（0 代表 M）。

>>> reads1.cigartuples
[(0, 151)]

alignment	meaning	operation
M	BAM_CMATCH	0
I	BAM_CINS	1
D	BAM_CDEL	2
N	BAM_CREF_SKIP	3
S	BAM_CSOFT_CLIP	4
H	BAM_CHARD_CLIP	5
P	BAM_CPAD	6
=	BAM_CEQUAL	7
X	BAM_CDIFF	8
B	BAM_CBACK	9

compare(self, AlignedSegment other)
和其他AlignedSegment比较，值为两个segment在reference上坐标差值

>>> reads1.compare(reads2)
-17

flag
返回该segment的flag信息
get_aligned_pairs(self, matches_only=False, with_seq=False)
展示了query和reference之间的比对情况，如果遇到indel或者skipping的情况，相应的query或者reference会变为None

>>> reads1.get_aligned_pairs(matches_only=False, with_seq=True)
[(0, 999850, 'C'), (1, 999851, 'G'), (2, 999852, 'G'), (3, 999853, 'C'), (4, 999854, 'C'), (5, 999855, 'G'), (6, 999856, 'G'), (7, 999857, 'G'), (8, 999858, 'A'), (9, 999859, 'C'), (10, 999860, 'C'), (11, 999861, 'C'), (12, 999862, 'C'), (13, 999863, 'A'), (14, 999864, 'C'), (15, 999865, 'C'), (16, 999866, 'T'), (17, 999867, 'T'), (18, 999868, 'G'), (19, 999869, 'C'), (20, 999870, 'G'), (21, 999871, 'G'), (22, 999872, 'T'), (23, 999873, 'G'), (24, 999874, 'C'), (25, 999875, 'T'), (26, 999876, 'C'), (27, 999877, 'G'), (28, 999878, 'G'), (29, 999879, 'C'), (30, 999880, 'C'), (31, 999881, 'G'), (32, 999882, 'C'), (33, 999883, 'G'), (34, 999884, 'C'), (35, 999885, 'T'), (36, 999886, 'C'), (37, 999887, 'C'), (38, 999888, 'G'), (39, 999889, 'G'), (40, 999890, 'G'), (41, 999891, 'G'), (42, 999892, 'C'), (43, 999893, 'T'), (44, 999894, 'T'), (45, 999895, 'G'), (46, 999896, 'T'), (47, 999897, 'C'), (48, 999898, 'T'), (49, 999899, 'G'), (50, 999900, 'G'), (51, 999901, 'G'), (52, 999902, 'G'), (53, 999903, 'T'), (54, 999904, 'C'), (55, 999905, 'C'), (56, 999906, 'G'), (57, 999907, 'G'), (58, 999908, 'C'), (59, 999909, 'T'), (60, 999910, 'G'), (61, 999911, 'G'), (62, 999912, 'C'), (63, 999913, 'G'), (64, 999914, 'C'), (65, 999915, 'T'), (66, 999916, 'G'), (67, 999917, 'G'), (68, 999918, 'C'), (69, 999919, 'C'), (70, 999920, 'G'), (71, 999921, 'G'), (72, 999922, 'C'), (73, 999923, 'G'), (74, 999924, 'C'), (75, 999925, 'T'), (76, 999926, 'C'), (77, 999927, 'C'), (78, 999928, 'T'), (79, 999929, 'G'), (80, 999930, 'C'), (81, 999931, 'C'), (82, 999932, 'A'), (83, 999933, 'T'), (84, 999934, 'C'), (85, 999935, 'G'), (86, 999936, 'G'), (87, 999937, 'C'), (88, 999938, 'G'), (89, 999939, 'A'), (90, 999940, 'G'), (91, 999941, 'G'), (92, 999942, 'C'), (93, 999943, 'G'), (94, 999944, 'C'), (95, 999945, 'T'), (96, 999946, 'C'), (97, 999947, 'G'), (98, 999948, 'G'), (99, 999949, 'T'), (100, 999950, 'T'), (101, 999951, 'T'), (102, 999952, 'C'), (103, 999953, 'C'), (104, 999954, 'C'), (105, 999955, 'C'), (106, 999956, 'G'), (107, 999957, 'G'), (108, 999958, 'C'), (109, 999959, 'G'), (110, 999960, 'T'), (111, 999961, 'G'), (112, 999962, 'T'), (113, 999963, 'C'), (114, 999964, 'T'), (115, 999965, 'G'), (116, 999966, 'C'), (117, 999967, 'G'), (118, 999968, 'G'), (119, 999969, 'C'), (120, 999970, 'C'), (121, 999971, 'A'), (122, 999972, 'T'), (123, 999973, 'G'), (124, 999974, 'G'), (125, 999975, 'T'), (126, 999976, 'G'), (127, 999977, 'C'), (128, 999978, 'G'), (129, 999979, 'C'), (130, 999980, 'C'), (131, 999981, 'C'), (132, 999982, 'C'), (133, 999983, 'G'), (134, 999984, 'C'), (135, 999985, 'G'), (136, 999986, 'C'), (137, 999987, 'C'), (138, 999988, 'T'), (139, 999989, 'C'), (140, 999990, 'C'), (141, 999991, 'C'), (142, 999992, 'C'), (143, 999993, 'G'), (144, 999994, 'T'), (145, 999995, 'G'), (146, 999996, 'C'), (147, 999997, 'C'), (148, 999998, 'G'), (149, 999999, 'G'), (150, 1000000, 'G')]

get_blocks(self)
返回该segment比对到reference的start和end坐标。
get_cigar_stats(self)
同样返回cigar信息，不过是两个arrays的形式。

>>> reads1.get_cigar_stats()
(array('I', [151, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
>>> reads2.get_cigar_stats()
(array('I', [151, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
>>> reads3.get_cigar_stats()
(array('I', [134, 0, 0, 0, 17, 0, 0, 0, 0, 0, 0]), array('I', [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]))

get_overlap(self, uint32_t start, uint32_t end)
返回和reference传入参数区域内比对上的碱基数

>>> reads1.get_blocks()
[(999850, 1000001)]
>>> reads1.get_overlap(999950, 1000010)
51

get_reference_positions(self, full_length=False)
返回和reference比对上的位置列表，数值为ref上的位置
get_reference_sequence(self)
返回该reads比对到的ref区域的reference碱基。

>>> reads1.get_reference_sequence()             
'CGGCCGGGACCCCACCTTGCGGTGCTCGGCCGCGCTCCGGGGCTTGTCTGGGGTCCGGCTGGCGCTGGCCGGCGCTCCTGCCATCGGCGAGGCGCTCGGTTTCCCCGGCGTGTCTGCGGCCATGGTGCGCCCCGCGCCTCCCCGTGCCGGG'

get_tags(self, with_value_type=False)
得到该segment的tag信息

>>> reads1.get_tags()
[('MC', '151M'), ('BD', 'MMMOQNNKOPNJJNKMNLMOKLNKOOMMLNNKNKNOMNKLIINPMNLPNPOJJOPOLMOPPOOLOPPOOOLMOLOPNONPPOOPQNMOLPOPOLOPNNMONDNOKKLMOLPLLPNPPLMOOOPPOOLPLOOKKLOLOPPOPLLNRNROLMJ'), ('MD', '151'), ('PG', 'MarkDuplicates.H'), ('RG', 'group1'), ('BI', 'OOPQQOQLRRPKKQNOQOPQNOPNQQNOPPONQOQROPNPLLQRPQOSQSQLLQSPNPQRSQQOQRSQQPNPQOQRPPRSRPRQRPPQOSRQQPRSOPPQPHRPLLNQRPRPPTRTSPQRQSRSRRPSPRQMMORPRQSPQMMOQQTPNPL'), ('NM', 0), ('MQ', 60), ('AS', 151), ('XS', 0)]

get_tag(self, tag, with_value_type=False)
查询某个tag的值

>>> reads1.get_tag( 'MC')
'151M'

has_tag(self, tag)
查询是否有某个tag值，若有则返回True。
infer_query_length(self, always=False)
通过CIGAR推断query的长度，不包括hard-clipped区域碱基
infer_reads_length(self)
通过CIGAR推断reads的长度，包括hard-clipped区域碱基
is_duplicate
判断是否是PCR duplicate
is_paired
判断reads是不是成对的
is_proper_pair
判断reads对是否都比对上了
is_qcfail
判断是否QC失败了
is_read1
判断是否是read1
is_read2
判断是否是read2
is_reverse
判断是否比对到了ref的负义链上
is_secondary
若不是primary比对则返回true
is_supplementary
判断是否是supplementary比对
is_unmapped
判断是否没比对上
mapping_quality
比对质量
mate_is_reverse
配对reads是否比对到负义链
mate_is_unmapped
配对reads是否没比对上
next_reference_id
配对reads比对上的reference id
next_refernece_name
配对reads比对上的reference name
next_reference_start
配对reads比对在reference的起始位置
query_alignment_end
query比对上ref的结束位置，数值为query上的相对位置

>>> reads1.query_alignment_end
151

query_alignment_start
query比对上ref的起始位置，数值为query上的相对位置
query_alignment_length
相当于query_alignment_end-query_alignment_start
query_alignment_qualities
query比对上ref的碱基的质量值

>>> reads1.query_alignment_qualities
array('B', [28, 26, 29, 34, 33, 28, 32, 32, 29, 35, 34, 34, 34, 30, 35, 34, 35, 30, 37, 37, 29, 32, 28, 37, 37, 35, 36, 29, 32, 37, 34, 29, 37, 29, 37, 35, 36, 34, 29, 32, 32, 32, 37, 35, 30, 37, 28, 36, 35, 37, 32, 32, 32, 28, 36, 34, 29, 32, 37, 35, 37, 32, 37, 29, 37, 35, 37, 32, 37, 34, 29, 32, 37, 29, 37, 35, 36, 34, 35, 37, 37, 34, 30, 28, 36, 29, 32, 37, 29, 14, 37, 32, 37, 29, 37, 35, 36, 29, 32, 28, 30, 30, 36, 34, 34, 34, 29, 32, 37, 29, 28, 37, 28, 36, 35, 37, 37, 29, 32, 37, 34, 30, 28, 37, 32, 28, 37, 37, 29, 37, 34, 34, 34, 29, 37, 29, 37, 34, 35, 36, 34, 34, 34, 29, 28, 36, 35, 30, 23, 26, 26])

query_alignment_sequence
query比对上ref的碱基
query_length
query的长度
query_name
query的名字
query_qualities
query的碱基质量值，这个和query_alignment_qualities不一样，query_alignment_qualities为query比对上那部分序列的碱基质量值，若存在soft-clip，query_alignment_qualities的碱基质量值个数会少相应的部分。
query_sequence
query碱基序列
reference_id
比对上的reference ID
reference_name
比对上的reference name
reference_length
和reference能比对上的序列的长度
reference_start
和reference能比对上部分的起始坐标（reference坐标）
set_tag(self, tag, value, value_type=None, replace=True)
设定tag值
set_tags(self, tags)
重新设定tag

PileupColumn对象内置方法

之前说到AlignmentFile对象使用pileup方法产生一个存储了PileupColumn对象的迭代器，这个PileupColumn对象就代表了某个位点上的pileup结果。通过PileupColumn对象的内置方法我们可以得到更多的信息。

nsegments
在这个pileupcolumn上比对上的segment个数
pileups
返回一个列表，列表中的元素为PuleupRead对象

>>> pr=pc.pileups
>>> pr1 = pr[0]
>>> pr1

reference_id
reference_name
reference_pos
这个pileupcolumn在ref上的位置

PileupRead对象内置方法

代表了比对到ref某个位置的reads。

alignment
返回一个AlignedSegment对象

>>> pr1.alignment

indel
后面的碱基是否处在indel上，若在insertion上则返回正数，数值为insertion长度，若在deltion上则为负数，数值为deltion长度。
is_del
此位点的碱基是否处在deltion上，若是则返回1，不是则返回0。
id_head
此位点的碱基是否是比对上的第一个碱基，若是则返回1，不是则返回0。
is_refskip
此位点的碱基是否是CIGAR N碱基，若是则返回1，不是则返回0。
is_tail
此位点的碱基是否是比对上的最后一个碱基，若是则返回1，不是则返回0。
level
the level of the read in the “viewer” mode. Note that this value is currently not computed.
query_position
该碱基位点在query上的位置，0代表第一个。若处在deltion上或者是refskip则返回None。
query_position_or_next
同上，但是如果此位点是deltion则返回下一个碱基的位置。