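I recently used exonerate to align proteins against a genome and could not find a convenient script for extracting the sequences from its log output, so I wrote one in Python. (It is not optimized; improvements are welcome in the replies!)
Usage: python script.py logfile
The idea: extract the Target amino-acid lines from the results to build a preliminary protein file in three-letter codes, then convert that file to one-letter codes.
The log file looks like this: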
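As a rough illustration of the second half of that idea, here is a minimal sketch of the three-letter to one-letter conversion. The names `AA_CODES` and `three_to_one` and the `'X'` placeholder for unrecognized codes are assumptions made for this example, not part of exonerate. Note that this sketch silently drops any lowercase fragment at the very start of the input, so it should be fed the fully joined sequence rather than individual alignment lines.

```python
import re

# Standard three-letter to one-letter amino acid codes
AA_CODES = {
    'Ala': 'A', 'Cys': 'C', 'Asp': 'D', 'Glu': 'E',
    'Phe': 'F', 'Gly': 'G', 'His': 'H', 'Lys': 'K',
    'Ile': 'I', 'Leu': 'L', 'Met': 'M', 'Asn': 'N',
    'Pro': 'P', 'Gln': 'Q', 'Arg': 'R', 'Ser': 'S',
    'Thr': 'T', 'Val': 'V', 'Tyr': 'Y', 'Trp': 'W',
}


def three_to_one(seq3, unknown='X'):
    """Convert a run of three-letter codes, e.g. 'MetThrLeu', to 'MTL'."""
    # Every residue starts with a capital letter followed by lowercase letters.
    codes = re.findall(r'[A-Z][a-z]*', seq3)
    # Codes missing from the table map to the placeholder (an assumption here).
    return ''.join(AA_CODES.get(code, unknown) for code in codes)


# A codon split across two alignment lines ("...LeuAr" + "gAsp...") is
# handled correctly once the lines are joined before conversion.
print(three_to_one('MetThrLeuSerGlyAspIleLys'))  # MTLSGDIK
print(three_to_one('LeuAr' + 'gAsp'))            # LRD
```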
Command line: [exonerate --model protein2genome --percent 20 --maxintron 100000 test.fa ../01-genome/gene-DB/chr/TcChr09a.fa]
Hostname: [smp10]
C4 Alignment:
------------
Query: toxin-1 TcChr09a 4150993——>4157095
Target: TcChr09a
Model: protein2genome:local
Raw score: 10528
Query range: 0 -> 2034
Target range: 4150993 -> 4157095
1 : MetThrLeuSerGlyAspIleLysAlaLeuValAspAsnProGluSerPheLeuAr : 19
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
MetThrLeuSerGlyAspIleLysAlaLeuValAspAsnProGluSerPheLeuAr
4150994 : ATGACTCTCTCTGGCGATATTAAAGCGTTGGTGGACAATCCAGAATCCTTTTTAAG : 4151048
20 : gAspAsnArgLeuGlyPheAsnLeuAsnArgAsnIleAlaArgLysAspGlnLeuV : 38
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
gAspAsnArgLeuGlyPheAsnLeuAsnArgAsnIleAlaArgLysAspGlnLeuV
4151049 : GGATAATCGTCTGGGCTTCAACCTCAATCGCAACATAGCGAGGAAAGACCAGCTTG : 4151105
39 : alLysLeuValArgValThrAlaAsnSerTyrAspLeuLysPheSerGluThrGlu : 56
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
alLysLeuValArgValThrAlaAsnSerTyrAspLeuLysPheSerGluThrGlu
4151106 : TAAAACTGGTTCGAGTCACAGCGAACTCGTACGATCTTAAATTTTCCGAGACAGAG : 4151159
57 : SerGluGluAsnThrIleSerSerTyrIleLeuGlyTyrLysThrAsnGluAlaAs : 75
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SerGluGluAsnThrIleSerSerTyrIleLeuGlyTyrLysThrAsnGluAlaAs
4151160 : TCAGAGGAAAACACGATATCCAGCTACATCCTTGGATACAAGACGAACGAAGCAAA : 4151216
76 : nAspAlaValPheLeuAspIleProSerArgGlyValLysGluGlyThrPheLeuP : 94
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
nAspAlaValPheLeuAspIleProSerArgGlyValLysGluGlyThrPheLeuP
4151217 : TGATGCCGTGTTTCTGGACATCCCGAGCAGAGGCGTGAAGGAGGGAACATTTTTGT : 4151273
95 : heThrSerGluLeuSerGlyCysSerLeuValValThrArgLeuLysAspAspThr : 112
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
heThrSerGluLeuSerGlyCysSerLeuValValThrArgLeuLysAspAspThr
4151274 : TCACATCTGAACTCTCCGGCTGCTCCCTCGTCGTCACACGGCTGAAAGATGATACA : 4151327
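The header fields at the top of each alignment (Query, Target, Model, Raw score, and the two ranges) can also be collected for later use. The following is a minimal sketch that assumes the header lines look exactly like the ones in the log above; `parse_header` and `parse_range` are hypothetical helper names for this example.

```python
import re

# Field names as they appear in the log excerpt above
HEADER_RE = re.compile(
    r'\s*(Query|Target|Model|Raw score|Query range|Target range):\s*(.*)'
)


def parse_header(lines):
    """Collect the 'Key: value' header fields of one alignment into a dict."""
    header = {}
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            header[m.group(1)] = m.group(2).strip()
    return header


def parse_range(text):
    """Turn a range such as '4150993 -> 4157095' into a tuple of two ints."""
    start, end = text.split('->')
    return int(start), int(end)


sample = [
    "Target: TcChr09a",
    "Model: protein2genome:local",
    "Raw score: 10528",
    "Query range: 0 -> 2034",
    "Target range: 4150993 -> 4157095",
]
header = parse_header(sample)
print(header['Target'], parse_range(header['Target range']))
```

Keeping the ranges as integers makes it easy to sort hits by position or to filter them by length afterwards.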
The code is as follows:
import os
import re
import sys

aa_codes = {
    'Ala':'A','Cys':'C','Asp':'D','Glu':'E',
    'Phe':'F','Gly':'G','His':'H','Lys':'K',
    'Ile':'I','Leu':'L','Met':'M','Asn':'N',
    'Pro':'P','Gln':'Q','Arg':'R','Ser':'S',
    'Thr':'T','Val':'V','Tyr':'Y','Trp':'W'
}  # lookup table: three-letter -> one-letter amino acid codes
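
# Step 1: extract the Target amino-acid lines from the log (three-letter codes)
with open(sys.argv[1], 'r') as f, open("pro-three.fa", "w", encoding='utf-8') as t:
    take_next = False  # True when the previous line was an alignment match line
    for line in f:
        if take_next:
            # The line right after the match line is the Target translation.
            # Keep letters only; do not slice the line, so no residue is lost.
            take_next = False
            print(re.sub(r'[^A-Za-z]', '', line), end="", file=t)
        elif 'Query:' in line:
            # Header lines are tested before the match-line test,
            # so IDs that contain '|' are not mistaken for match lines.
            print("\n>" + line.split()[1] + " ", end="", file=t)
        elif 'Target:' in line:
            print(line.split()[1] + " ", end="", file=t)
        elif 'Target range:' in line:
            fields = line.split()
            # ASCII arrow keeps the FASTA header portable
            print(fields[2] + "->" + fields[4], file=t)
        elif '|' in line or '!' in line:
            take_next = True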

# Step 2: convert the three-letter codes to one-letter codes
with open("pro-three.fa", 'r', encoding='utf-8') as fin, \
        open('pro-tmp.fa', 'w', encoding='utf-8') as fout_tmp:
    for line in fin:
        if '>' in line:
            # header line: copy it through unchanged
            print("\n", line, sep="", end="", file=fout_tmp)
        else:
            # split before every capital letter: "MetThr" -> ["Met", "Thr"]
            for code in re.sub(r"([A-Z])", r" \1", line).split():
                # unknown codes become 'X' instead of the literal text "None"
                print(aa_codes.get(code, 'X'), end='', file=fout_tmp)

# Step 3: remove blank lines from the temporary file to get the final result
with open('pro-tmp.fa', 'r', encoding='utf-8') as file1, \
        open('pro-one.fa', 'w', encoding='utf-8') as file2:
    for line in file1:
        if line.strip():  # skip empty lines
            file2.write(line)
os.remove("pro-tmp.fa")
print("Extraction finished\npro-one.fa: one-letter amino acid sequences\npro-three.fa: three-letter amino acid sequences")
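After running the script, it may be worth checking the output before using it downstream. Below is a minimal sketch of such a check: it reads FASTA records and reports, for each one, the sequence length and how many characters fall outside the 20 standard one-letter codes. The helper names `parse_fasta` and `summarize` are made up for this example.

```python
VALID = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def parse_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            header = line[1:]
            records[header] = ''
        elif header is not None:
            records[header] += line
    return records


def summarize(records):
    """Return (header, length, count of non-standard characters) per record."""
    return [
        (header, len(seq), sum(1 for c in seq if c not in VALID))
        for header, seq in records.items()
    ]


# In-memory demo; the sequence is the first alignment row of the log above
demo = ['>toxin-1 TcChr09a\n', 'MTLSGDIKALVDNPESFLR\n']
print(summarize(parse_fasta(demo)))  # [('toxin-1 TcChr09a', 19, 0)]
```

To check the real output, pass an open file handle instead of the demo list, for example `with open('pro-one.fa', encoding='utf-8') as fh: print(summarize(parse_fasta(fh)))`. A non-zero third value suggests that some residues were not converted cleanly.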
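Feedback is welcome!
Author email: [email protected]
This script builds on one by msw521sg:
"exonerate结果整理,获取target序列" (Organizing exonerate results and extracting the target sequence), msw521sg's blog, CSDN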