python生信小练习(四)

生信编程直播第一题:人类基因组的外显子区域到底有多长

方法1:分别以外显子坐标的起始、终止坐标作为字典的键值对

#计算人基因组的外显子长度
f5 = r'E:\Bioinformatics\data\human\CCDS.current.txt'
head = 1
exon = {}
for line in open(f5):
    if head:
        head -= 1
        continue
    temp = line.split('\t')
    if temp[9] != '-':    #添加条件:如果基因坐标不为空
        coordinate = temp[9].lstrip('[').rstrip(']').split(', ')
    for i in range(len(coordinate)):
        startCDS = coordinate[i].split('-')[0]
        exon[startCDS] = coordinate[i].split('-')[1]
length = 0
for startCDS, endCDS in exon.items():
    length += int(endCDS.strip("'")) - int(startCDS.strip("'")) +1    #要去除以字符串形式存储的数字的单引号,否则会报错
print(length)

方法2:以外显子起始、终止坐标创建等长列表进行计算

f5 = r'E:\Bioinformatics\data\human\CCDS.current.txt'
head = 1
startCDS = []
endCDS = []
for line in open(f5):
    if head:
        head -= 1
        continue
    temp = line.split('\t')
    if temp[9] != '-':
        coordinate = temp[9].lstrip('[').rstrip(']').split(', ')
    for i in range(len(coordinate)):
        startCDS.append(coordinate[i].split('-')[0])
        endCDS.append(coordinate[i].split('-')[1])
length = 0
for i in range(len(startCDS)):
    length += int(endCDS[i]) - int(startCDS[i]) +1    #要去除以字符串形式存储的数字的单引号,否则会报错
print(length)

生信编程直播第二题-hg19基因组序列的一些探究

因为自己目前研究番茄,所以以番茄基因组fasta文件为操作对象
首先,看一下该文件的大小:

番茄基因组fasta文件

不到一个G,约为822M
番茄一共有12条染色体,对每条染色体进行GC%, N%, Length, A, T, G, C, N统计(注意:在计算GC%时,需要去除N):
方法一:
思路为首先创建以染色体名称为key,序列为value的字典;随后,针对该字典中的value进行操作,针对序列进行遍历计数操作;最后,根据所记录的字符数进行GC%及N%的计算,并打印至屏幕。该方法非常消耗计算机资源。

import time
start = time.clock()
f = r'E:\Bioinformatics\data\S_lycopersicum_chromosomes.3.00.fa\S_lycopersicum_chromosomes.3.00.fa'
genome = {}
for line in open(f):
    if line.startswith('>'):
        chr_id = line.strip()
        genome[chr_id] = []
    else:
        genome[chr_id].append(line.strip())

print('Chr\tGC_ratio\tN_ratio\tLength\tA\tT\tG\tC\tN')
for chrom, seq in genome.items():
    seqTotal = ''.join(seq)
    ANum = 0
    TNum = 0
    GNum = 0
    CNum = 0
    NNum = 0
    GC_ratio = 0
    N_ratio = 0
    for i in seqTotal:
        if i == 'A' or i == 'a':
            ANum += 1
        elif i == 'T' or i == 't':
            TNum += 1
        elif i == 'G' or i == 'g':
            GNum += 1
        elif i == 'C' or i == 'c':
            CNum += 1
        elif i == 'N' or i == 'n':
            NNum += 1
    GC_ratio = '%.04f' % ((GNum + CNum)/(len(seqTotal)-NNum))
    N_ratio = '%.04f' % (NNum/len(seqTotal))
    print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(chrom, GC_ratio, N_ratio, len(seqTotal), ANum,
                                              TNum, GNum, CNum, NNum))
end = time.clock()
print("The function run time is : %.03f seconds" %(end-start))
番茄基因组信息

可以看到,耗时极长,共358秒。
对该程序进行改进是否会减少计算资源的消耗呢?

#改进版本,利用字符串计数方法 .count()
import time
start = time.clock()
f = r'E:\Bioinformatics\data\S_lycopersicum_chromosomes.3.00.fa\S_lycopersicum_chromosomes.3.00.fa'
genome = {}
for line in open(f):
    if line.startswith('>'):
        chr_id = line.strip()
        genome[chr_id] = []
    else:
        genome[chr_id].append(line.strip())

print('Chr\tGC_ratio\tN_ratio\tLength\tA\tT\tG\tC\tN')
for chrom, seq in genome.items():
    seqTotal = ''.join(seq)
    ANum = 0
    TNum = 0
    GNum = 0
    CNum = 0
    NNum = 0
    GC_ratio = 0
    N_ratio = 0
    ANum = seqTotal.count('A') + seqTotal.count('a')
    TNum = seqTotal.count('T') + seqTotal.count('t')
    GNum = seqTotal.count('G') + seqTotal.count('g')
    CNum = seqTotal.count('C') + seqTotal.count('c')
    NNum = seqTotal.count('N') + seqTotal.count('n')
    GC_ratio = '%.04f' % ((GNum + CNum)/(len(seqTotal)-NNum))
    N_ratio = '%.04f' % (NNum/len(seqTotal))
    print('{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}'.format(chrom, GC_ratio, N_ratio, len(seqTotal), ANum,
                                              TNum, GNum, CNum, NNum))
end = time.clock()
print("The function run time is : %.03f seconds" %(end-start))
方法一改进版

该方法居然快了近10倍!!看来在处理序列类对象的时候,尽量使用针对该类的方法,能不用循环就不用,这之间是“质”的差距。

你可能感兴趣的:(python生信小练习(四))