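This post is pretty basic. I hadn't planned to write anything this elementary, but this is exactly where my Python skills are. I'm using it to record how I learned to solve simple bioinformatics problems with Python, and to show myself how weak my Python really is.
I mostly use Python for myself: outside of work, while studying on my own, I write small scripts or clean and tidy up data. With little self-discipline my code has become more and more unrestrained, and it now looks a lot like my Perl habits: $a $b $c ...
The complete problem set is at http://rosalind.info/problems/list-view/. Each problem may have several solutions, which are marked ## 1/2/3. If, like me, you have only just started processing bioinformatics data with Python, please work through each problem yourself before reading the code here.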
1. Counting the bases in a sequence
The test.txt file:
GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
# Basic solution
#!/usr/bin/env python3
nts = {c: 0 for c in 'ATGC'}
with open('./test.txt', 'r') as f:
    for a in f:
        a = a.upper()
        for nt in a.rstrip():
            nts[nt] += 1
print(nts)
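# Advanced solution
import re
with open('test.txt', 'r') as f:
    test = f.read().upper()
# count the matches of each base over the whole file content
print([len(re.findall(b, test)) for b in ['A', 'T', 'G', 'C']])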
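The counting step can also be sketched with `collections.Counter` from the standard library. The helper name `count_bases` is made up for this illustration, and the input is passed as a string rather than read from test.txt so that the snippet runs on its own.

```python
from collections import Counter

def count_bases(text):
    """Count A, C, G and T in text, ignoring case, newlines and other characters."""
    counts = Counter(text.upper())
    # A Counter returns 0 for missing keys, so absent bases are reported as 0.
    return {base: counts[base] for base in "ACGT"}

# The two lines of test.txt from above, joined into one string.
sample = "GAGCCTACTAACGGGAT\nCATCGTAATGACGGCCT\n"
print(count_bases(sample))
```

Unlike the dictionary version, this one does not raise a KeyError when the file contains an unexpected character such as N.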
2. Transcribing DNA into RNA
# Basic 1
import re
with open('./test.txt', 'r') as f:
    for line in f:
        line = line.upper()
        RnaSeq = re.sub('T', 'U', line.rstrip())
        print(RnaSeq)
# Advanced 1
with open('./test.txt', 'r') as f:
    for line in f:
        # strip the newline first so that print() does not emit blank lines
        line = line.rstrip().upper()
        print(line.replace('T', 'U'))
# Advanced 2
import re
with open('test.txt', 'r') as f:
    for line in f:
        print(re.sub('T', 'U', line.rstrip().upper()))
3. Getting the reverse complement of a sequence
# Basic 1
trans = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
with open('./test.txt', 'r') as f:
    for line in f:
        seq = ''
        line = line.upper()
        for aa in line.rstrip():
            seq += trans[aa]
        print(seq[::-1])
# Basic 2
def reverse_complement(seq):
    ntComplement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    RevSeqList = list(reversed(seq))
    RevComSeqList = [ntComplement[k] for k in RevSeqList]
    RevComSeq = ''.join(RevComSeqList)
    return RevComSeq
with open('./test.txt', 'r') as f:
    for line in f:
        line = line.upper()
        print(reverse_complement(line.rstrip()))
# Advanced
trans = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
with open('test.txt', 'r') as f:
    for line in f:
        tmp_list = list(reversed(line.strip().upper()))
        tmp_list2 = [trans[a] for a in tmp_list]
        print(''.join(tmp_list2))
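Another way to write the same idea is a translation table built with `str.maketrans`, which avoids the per-base dictionary lookup. This is only a sketch: the constant name `COMPLEMENT` is invented here, and the function takes a string instead of reading test.txt.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("GAGCCTACTAACGGGAT"))
```

Characters that are not in the table (for example N) pass through unchanged instead of raising a KeyError, which may or may not be what you want.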
4. Finding the sequence with the highest GC content in a FASTA file
# Basic 1
import re
Seq = {}
seqGC = {}
with open('./test.fa', 'r') as f:
    for line in f:
        if re.match(">", line):
            SeqName = line[1:].rstrip()
            Seq[SeqName] = ''
            seqGC[SeqName] = 0
        else:
            line = line.upper()
            line = line.rstrip()
            Seq[SeqName] += line
            seqGC[SeqName] += line.count('G')
            seqGC[SeqName] += line.count('C')
maxGC = -1.0
tmp = None
for key, value in Seq.items():
    gc = seqGC[key] / len(value) * 100
    if maxGC < gc:
        maxGC = gc
        tmp = key
print('>' + tmp + '\n' + Seq[tmp])
# Advanced, first version
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
GcContent = OrderedDict()
with open('./test.fa', 'r') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            SeqName = line[1:]
            SeqTest[SeqName] = ''
            continue
        SeqTest[SeqName] += line.upper()
for key, value in SeqTest.items():
    totalLength = len(value)
    gcNum = value.count('G') + value.count('C')
    GcContent[key] = gcNum / totalLength * 100
sortedGC = sorted(GcContent.items(), key=itemgetter(1))
largeName = sortedGC[-1][0]
largeGCContent = sortedGC[-1][1]
print('The sequence with the highest GC content is %s (%s%%)' % (largeName, largeGCContent))
# Advanced, refined version
import re
seq = {}
seq_gc = {}
with open('test.fa', 'r') as f:
    for line in f:
        line = line.strip()
        if re.match('>', line):
            seq_id = re.sub('>', '', line)
            seq[seq_id] = ''
        else:
            seq[seq_id] = seq[seq_id] + line.upper()
for key, s in seq.items():
    seq_gc[key] = (s.count('G') + s.count('C')) / len(s)
# str() is required: concatenating a float to a string raises TypeError
print(max(seq_gc, key=seq_gc.get) + '\t' + str(max(seq_gc.values())))
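All three versions mix FASTA parsing with the GC calculation. As a rough sketch, the two steps can be separated into small functions. The names `parse_fasta`, `gc_percent` and `highest_gc` are invented for this example, and the records in the usage lines are toy data, not the contents of test.fa.

```python
def parse_fasta(lines):
    """Return {name: sequence} from an iterable of FASTA lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    # Join the collected chunks once per record instead of growing a string.
    return {key: "".join(chunks) for key, chunks in records.items()}

def gc_percent(seq):
    """GC content of seq as a percentage (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def highest_gc(records):
    """Return (name, GC percentage) of the record with the highest GC content."""
    name = max(records, key=lambda key: gc_percent(records[key]))
    return name, gc_percent(records[name])

# Toy input; with a real file, pass the open file object to parse_fasta.
toy = [">s1", "ATAT", ">s2", "GGCA", "TT"]
print(highest_gc(parse_fasta(toy)))
```

Because `parse_fasta` accepts any iterable of lines, the same function works on an open file handle and on a list of strings in a test.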
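5. Counting point mutations
Given two sequences s and t of equal length, the Hamming distance between s and t, written dH(s,t), is the number of positions at which the two sequences differ. The task is to return the Hamming distance between two sequences.
## First version
lst = []
with open('./test.txt', 'r') as fh:
    for line in fh:
        lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
    if lst[0][i] == lst[1][i]:
        continue
    hamming_dis += 1
print(hamming_dis)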
# Advanced 1
with open('./test.txt', 'r') as fh:
    seq = fh.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print(len(mutation))
# Advanced 2
import numpy as np
import pandas as pd
sequence = []
number = 0
with open('test.txt', 'r') as f:
    for line in f:
        sequence.append(list(line.strip().upper()))
frame = pd.DataFrame(np.asarray(sequence))
for i in range(len(frame.columns)):
    if frame[i][0] != frame[i][1]:
        number = number + 1
print(number)
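The definition of the Hamming distance maps directly onto `zip`: pair up the bases position by position and count the pairs that differ. The function name below is invented for this sketch, and the two strings are the lines of test.txt shown in problem 1.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    # True counts as 1 when summed, so this counts the mismatching pairs.
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))
```

The explicit length check matters because `zip` silently stops at the shorter input, which would hide a truncated sequence.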
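6. Mendel's First Law
A population contains organisms of three genotypes: k homozygous dominant, m heterozygous, and n homozygous recessive individuals. Assume the trait is controlled by a single pair of alleles and that any two individuals picked at random from the population can mate. Find the probability that the offspring of two randomly chosen individuals carries a dominant allele.
# First version
k = int(input("enter the number of homozygous dominant: "))
m = int(input("enter the number of heterozygous: "))
n = int(input("enter the number of homozygous recessive: "))
num = k + m + n
choice = num*(num-1)/2.0  # number of possible pairs
# subtract the probability of a homozygous recessive child:
# aa x aa (always), Aa x Aa (1/4), Aa x aa (1/2)
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)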
# Advanced 1
from scipy.special import comb  # scipy.misc.comb is no longer available in current SciPy
num = input("Number of individuals (k,m,n): ")
k, m, n = map(int, num.split(','))
t = k + m + n
rr = comb(n, 2) / comb(t, 2)
hh = comb(m, 2) / comb(t, 2)
hr = comb(n, 1) * comb(m, 1) / comb(t, 2)
p = 1 - (rr + hh*1/4 + hr*1/2)
print(p)
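The same calculation can be sketched without SciPy, using `math.comb` from the standard library (available from Python 3.8 on). The function name `dominant_probability` is made up for this example, and the values 2, 2, 2 are just a small test case.

```python
from math import comb

def dominant_probability(k, m, n):
    """Probability that a random mating produces a child with a dominant allele."""
    total_pairs = comb(k + m + n, 2)
    # Expected number of pairs whose child is homozygous recessive:
    # aa x aa always, Aa x Aa with probability 1/4, Aa x aa with probability 1/2.
    recessive = comb(n, 2) + 0.25 * comb(m, 2) + 0.5 * m * n
    return 1 - recessive / total_pairs

print(round(dominant_probability(2, 2, 2), 5))
```

For k = m = n = 2 there are 15 possible pairs, and the recessive term is 1 + 0.25 + 2 = 3.25, so the result is 1 - 3.25/15.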
7. Translating RNA into protein
def translate_rna(sequence):
    # stop codons map to '' and are therefore skipped in the output
    codonTable = {
        'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
        'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
        'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
        'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
        'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
        'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
        'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
        'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
        'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
        'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
    }
    proteinsequence = ''
    for n in range(0, len(sequence), 3):
        codon = sequence[n:n+3]
        if codon in codonTable:
            proteinsequence += codonTable[codon]
    return proteinsequence
# open both files in one with statement so the output file is always closed
with open('./rna.txt', 'r') as f, open('./protein.txt', 'w') as protein_fh:
    for line in f:
        protein_fh.write(translate_rna(line.strip()))
## Advanced
codonTable = {
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}
rnaseq = ''
with open('./rna.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        rnaseq += line.upper()  # accumulate the whole sequence
aminoAcids = []
i = 0
while i + 3 <= len(rnaseq):  # ignore a trailing partial codon
    codon = rnaseq[i:i+3]
    if codonTable[codon] != '':
        aminoAcids.append(codonTable[codon])
    i += 3
peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq
# Bio.Alphabet was removed in Biopython 1.78, so the Seq is built from the string alone
# translate
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print(rna.translate())
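The two hand-written versions above skip stop codons and keep translating. A variant that stops at the first stop codon could look like the sketch below. The codon table is generated from the standard genetic code written in UCAG order rather than typed out by hand, and all names (`BASES`, `AMINO_ACIDS`, `CODON_TABLE`, `translate_until_stop`) are invented for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_until_stop(rna):
    """Translate rna codon by codon and stop at the first stop codon."""
    rna = rna.upper()
    protein = []
    # len(rna) - 2 makes the loop ignore a trailing partial codon.
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_until_stop("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"))
```

Whether to stop at or skip stop codons depends on the task; for a single open reading frame, stopping is usually the intended behaviour.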
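8. Finding a DNA motif
## 1
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
# + 1 so that the last possible window is checked as well
for i in range(len(seq) - motif_len + 1):
    if seq[i:i+motif_len] == motif:
        position.append(i + 1)  # report 1-based positions
print(position)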
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])
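A third option, sketched here with plain `str.find`, restarts the search one character after each hit so that overlapping occurrences are found too. The function name `find_motif` is made up for this example.

```python
def find_motif(seq, motif):
    """Return the 1-based start positions of motif in seq, overlaps included."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        # Resume one character later so overlapping matches are not skipped.
        start = seq.find(motif, start + 1)
    return positions

print(find_motif("GATATATGCATATACTT", "ATAT"))
```

Plain `re.findall('ATAT', seq)` would miss the overlapping hit, which is why the regex version above relies on a lookahead.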
9. Consensus sequence of several equal-length sequences
For example, given the following sequences:
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
Base counts at each position:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Consensus A T G C A A C T
#1 Basic
def seq_list(fasta):
    seqs = []
    for line in fasta.readlines():
        if not line.startswith('>'):
            seq = list(line.rstrip())
            seqs.append(seq)
    return seqs
def statistic_base(seq_list):
    for base in 'ATGC':
        base_total = []
        for sit in range(len(seq_list[0])):
            col = [x[sit] for x in seq_list]
            num = col.count(base)
            base_total.append(num)
        print('%s:%s' % (base, base_total))
with open('./test.fa', 'r') as fh:
    sequence_list = seq_list(fh)
statistic_base(sequence_list)
# Advanced, basic version
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt', 'wt')
with open('./test.fa', 'r') as f:
    for line in f:
        if line.startswith('>'):
            seq_name = line.rstrip()
            seq[seq_name] = ''
            continue
        seq[seq_name] += line.upper().rstrip()
seqLength = len(seq[seq_name])
a, t, g, c = [], [], [], []
consensus = ''
for i in range(seqLength):
    # collect the i-th base of every sequence
    sequence = ''
    for j in seq.keys():
        sequence += seq[j][i]
    a.append(sequence.count('A'))
    t.append(sequence.count('T'))
    g.append(sequence.count('G'))
    c.append(sequence.count('C'))
    counts = Counter(sequence)
    consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()
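# Advanced, refined version
import pandas as pd
seq = []
seq_dict = {}
with open('t.list', 'r') as f:
    for line in f:
        if not line.startswith('>'):
            seq.append(list(line.strip().upper()))
frame = pd.DataFrame(seq)  # one row per sequence, one column per position
for aa in ['A', 'T', 'G', 'C']:
    seq_dict[aa] = [list(frame[i]).count(aa) for i in range(len(frame.columns))]
# in the count table, pick the base with the highest count at each position
print(''.join(pd.DataFrame(seq_dict).idxmax(axis=1)))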
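The column-wise counting can also be sketched with `zip(*seqs)`, which yields one tuple per alignment column, plus `collections.Counter`. The function name `profile_and_consensus` is invented here, and the sequences are the seven example sequences listed above, typed in as a list so the snippet runs without test.fa.

```python
from collections import Counter

def profile_and_consensus(seqs):
    """Return ({base: counts per position}, consensus string) for equal-length seqs."""
    profile = {base: [] for base in "ACGT"}
    consensus = []
    for column in zip(*seqs):  # one tuple of bases per position
        counts = Counter(column)
        for base in "ACGT":
            profile[base].append(counts[base])
        # On a tie, max() keeps the first base in "ACGT" order.
        consensus.append(max("ACGT", key=lambda b: counts[b]))
    return profile, "".join(consensus)

seqs = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
        "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile, consensus = profile_and_consensus(seqs)
print(consensus)
for base in "ACGT":
    print(base + ":", *profile[base])
```

When two bases tie at a position there is more than one valid consensus; this sketch just takes the first in alphabetical order.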
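10. Mortal Fibonacci rabbits
The Fibonacci sequence is the sequence of numbers defined by the recurrence relation Fn = Fn-1 + Fn-2, with starting values F1 = F2 = 1.
Suppose each rabbit lives for m months. How many rabbits are there after n months?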
## 1
def fib(n, m):
    f = [0, 1, 1]
    for i in range(3, n+1):
        if i <= m:
            # nobody has died yet: ordinary Fibonacci step
            total = f[i-1] + f[i-2]
        elif i == m+1:
            # the first pair dies
            total = f[i-1] + f[i-2] - 1
        else:
            # the pairs born m months earlier die
            total = f[i-1] + f[i-2] - f[i-m-1]
        f.append(total)
    return f[n]
inp = input('rabbit lifespan in months (m) and number of months elapsed (n); n<=100, m<=20; input n,m: ')
n, m = map(int, inp.split(','))
print(fib(n, m))
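A way to cross-check the recurrence is to simulate the population directly, tracking how many pairs there are of each age. This is a sketch under the usual Rosalind assumptions (one newborn pair at the start, pairs breed from their second month on and die after m months); the function name `mortal_rabbits` is made up.

```python
def mortal_rabbits(n, m):
    """Number of rabbit pairs alive in month n when each pair lives m months."""
    # ages[i] is the number of pairs that are i months old; start with one newborn pair.
    ages = [1] + [0] * (m - 1)
    for _ in range(n - 1):
        newborns = sum(ages[1:])       # every pair at least one month old breeds
        ages = [newborns] + ages[:-1]  # all pairs age by a month; the oldest die
    return sum(ages)

print(mortal_rabbits(6, 3))
```

The simulation needs no special case for month m + 1, which makes it a handy reference when checking the index arithmetic in `fib`.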
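11. Graph Theory
The problem description is tedious to reproduce, so read it yourself: http://rosalind.info/problems/grph/
In short, two sequences are connected whenever the last three bases of one are identical to the first three bases of the other.
Input file: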
>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
Output:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
seq = {}
with open('./overlap.fa', 'r') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seqname = line[1:]
            seq[seqname] = ''
            continue
        seq[seqname] += line.upper()
for key, value in seq.items():
    for key2, value2 in seq.items():
        if key != key2 and value[-3:] == value2[:3]:
            print(key + '\t' + key2)
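The nested loop compares every sequence with every other one. For larger inputs, one possible refinement is to index the sequences by their first three bases and then look up each suffix in that index. The sketch below uses the five example sequences from the input file above; the function name `overlap_edges` and the parameter `k` are invented for this illustration.

```python
from collections import defaultdict

def overlap_edges(records, k=3):
    """Return (name, other) pairs where the last k bases of name start other."""
    by_prefix = defaultdict(list)
    for name, seq in records.items():
        by_prefix[seq[:k]].append(name)
    edges = []
    for name, seq in records.items():
        # .get() avoids creating empty entries for suffixes that match nothing.
        for other in by_prefix.get(seq[-k:], []):
            if other != name:
                edges.append((name, other))
    return edges

records = {
    "Rosalind_0498": "AAATAAA",
    "Rosalind_2391": "AAATTTT",
    "Rosalind_2323": "TTTTCCC",
    "Rosalind_0442": "AAATCCC",
    "Rosalind_5013": "GGGTGGG",
}
for source, target in overlap_edges(records):
    print(source, target)
```

As in the original loop, a sequence is never linked to itself, which is why Rosalind_5013 produces no edge.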
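12. Calculating expected offspring
Again I won't explain the theory here; for the details see http://rosalind.info/problems/iev/
There are six possible genotype pairings for a couple:
AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa
Given six non-negative integers giving the number of couples with each genotype pairing, find the expected number of offspring showing the dominant phenotype in the next generation, assuming each couple has exactly two children.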
def expected(a, b, c, d, f, g):
    # probability that a child of each pairing shows the dominant phenotype
    AA_AA = 1
    AA_Aa = 1
    AA_aa = 1
    Aa_Aa = 0.75
    Aa_aa = 0.5
    aa_aa = 0
    # two children per couple
    p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
    return p
inp = input('input(a,b,c,d,f,g): ')
a, b, c, d, f, g = map(int, inp.split(','))
print(expected(a, b, c, d, f, g))
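The expectation is just a weighted sum, so it can be sketched compactly by pairing the probabilities with the couple counts. The names `DOMINANT_PROB` and `expected_dominant` and the parameter `children_per_couple` are invented for this example, and the input list is a small made-up case.

```python
# Probability of a dominant-phenotype child for each pairing, in the order
# AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa.
DOMINANT_PROB = (1.0, 1.0, 1.0, 0.75, 0.5, 0.0)

def expected_dominant(couples, children_per_couple=2):
    """Expected number of dominant-phenotype children for the six couple counts."""
    if len(couples) != len(DOMINANT_PROB):
        raise ValueError("expected six couple counts")
    return children_per_couple * sum(p * c for p, c in zip(DOMINANT_PROB, couples))

print(expected_dominant([1, 0, 0, 1, 0, 1]))
```

With one AA-AA couple, one Aa-Aa couple and one aa-aa couple, the sum is 2 * (1 + 0.75 + 0) = 3.5.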
13. Finding the longest common substring of several sequences, i.e. a motif shared by all of them (Finding a Shared Motif)
# Test data
>Rosalind_1
GATTACA
>Rosalind_2
TAGACCA
>Rosalind_3
ATACA
# Output:
AC
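One straightforward way to approach this, sketched below as a brute-force search rather than an efficient algorithm, is to try every substring of the shortest sequence from longest to shortest and return the first one found in all sequences. The function name `longest_shared_motif` is made up. When several substrings share the maximum length (here AC, CA and TA all qualify), any of them is a valid answer, so the result may differ from the AC shown above.

```python
def longest_shared_motif(seqs):
    """Return one longest substring common to all seqs ('' if there is none)."""
    shortest = min(seqs, key=len)
    # Try candidate lengths from longest to shortest; the first hit is maximal.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in seqs):
                return candidate
    return ""

motif = longest_shared_motif(["GATTACA", "TAGACCA", "ATACA"])
print(motif)
```

This checks on the order of L squared candidates for a shortest sequence of length L, which is fine for short test data but slow for long sequences.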