1. Input file
A FASTA file from the CARD protein database:
cat test.txt
>gb|ACT97415.1|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15]
MKAYFIAILTLFTCIATVVRAQQMSELENRIDSLLNGKKATVGIAVWTDKGDMLRYNDHVHFPLLSVFKFHVALAVLDKMDKQSISLDSIVSIKASQMPPNTYSPLRKKFPDQDFTITLRELMQYSISQSDNNACDILIEYAGGIKHINDYIHRLSIDSFNLSETEDGMHSSFEAVYRNWSTPSAMVRLLRTADEKELFSNKELKDFLWQTMIDTETGANKLKGMLPAKTVVGHKTGSSDRNADGMKTADNDAGLVILPDGRKYYIAAFVMDSYETDEDNANIIARISRMVYDAMR
>gb|AEJ08681.1|ARO:3001109|SHV-52 [Klebsiella pneumoniae]
MRYIRLCIISLLAALPLAVHASPQPLEQIKQSESQLSGRVGMIEMDLASGRTLTAWRADERFPMISTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLAIVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR
>gb|AAD01868.1|ARO:3002867|dfrF [Enterococcus faecalis]
MIGLIVARSKNNVIGKNGNIPWKIKGEQKQFRELTTGNVVIMGRKSYEEIGHPLPNRMNIVVSTTTEYQGDNLVSVKSLEDALLLAKGRDVYISGGYGLFKEALQIVDKMYITEVDLNIEDGDTFFPEFDINDFEVLIGETLGEEVKYTRTFYVRKNELSRFWI
>gb|AFJ59957.1|ARO:3001989|CTX-M-130 [Escherichia coli]
MVTKRVQRMMFAAAACIPLLLGSAPLYAQTSAVQQKLAALEKSSGGRLGVALIDTADNTQVLYRGDERFPMCSTSKVMAAAAVLKQSETQKQLLNQPVEIKPADLVNYNPIAEKHVNGTMTLAELSAAALQYSDNTAMNKLIAQLGGPGGVTAFARAIGDETFRLDRTEPTLNTAIPGDPRDTTTPRAMAQTLRQLTLGHALGETQRAQLVTWLKGNTTGAASIRAGLPTSWTVGDKTGSGDYGTTNDIAVIWPQGRAPLVLVTYFTQPQQNAERRHDVLASAARIIAEGL
>gb|AEX08599.1|ARO:3002356|NDM-6 [Escherichia coli]
MELPNIMHPVAKLSTALAAALMLSGCMPGEIRPTIGQQMETGDQRFGDLVFRQLAPNVWQHTSYLDMPGFGAVASNGLIVRDGGRVLVVDTAWTDDQTAQILNWIKQEINLPVALAVVTHAHQDKMGGMDALHAAGIATYANALSNQLAPQEGMVAAQHSLTFAANGWVEPATAPNFGPLKVFYPGPGHTSDNITVGIDGTDIAFGGCLIKDSKAKSLGNLGDADTEHYAASVRAFGAAFPKASMIVMSHSAPDSRAAITHTARMADKLR
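2. Python 3 script
- re.split extracts the ARO ID from each header line
- len computes the sequence length
- an if test combined with format produces the formatted output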
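import re

with open("out.txt", 'w') as o:
    with open("test.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines; line[0] would raise IndexError
            if line.startswith(">"):
                # header fields are "|"-separated; index 2 is the ARO ID
                aro = re.split(r'\|', line)[2]
                o.write("{}".format(aro))
            else:
                # each sequence is assumed to sit on a single line
                o.write("\t{}\n".format(len(line)))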
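To see why index 2 of the split result is the ARO ID, the snippet below applies the same re.split call to the first header line of test.txt. It is only an illustration of the header layout shown in the input file, not part of the original script.

```python
import re

# First header line from test.txt
header = ">gb|ACT97415.1|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15]"

# "|" is a regex metacharacter (alternation), so it has to be escaped
fields = re.split(r"\|", header)
print(fields)
# ['>gb', 'ACT97415.1', 'ARO:3002999', 'CblA-1 [mixed culture bacterium AX_gF3SD01_15]']

aro = fields[2]
print(aro)  # ARO:3002999

# For a fixed delimiter, str.split gives the same result without a regex
assert header.split("|") == fields
```

Since the delimiter here is a single literal character, header.split("|") would work just as well; re.split only becomes necessary when the separator is a real pattern.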
3. Result
ARO:3002999 296
ARO:3001109 286
ARO:3002867 164
ARO:3001989 291
ARO:3002356 270
Example: command-line version
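#!/usr/bin/env python3
import re
import sys

# usage: fasta_length.py <input FASTA> <output file>
script, infile, outfile = sys.argv
with open(outfile, 'w') as o:
    with open(infile) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # record ID: the text between ">" and the first space
                seq_id = re.split(r' ', line[1:])[0]
                o.write("{}".format(seq_id))
            else:
                # each sequence is assumed to sit on a single line
                o.write("\t{}\n".format(len(line)))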
python3 fasta_length.py genome_merge.fasta genome_merge.length
MF3KA12003AB_00001 68
MF3KA12003AB_00002 2848
MF3KA12003AB_00003 1505
MF3KA12003AB_00004 3882
MF3KA12003AB_00005 4161
MF3KA12003AB_00006 600
MF3KA12003AB_00007 369
MF3KA12003AB_00008 546
MF3KA12003AB_00009 573
MF3KA12003AB_00010 1080
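Both scripts above rely on every sequence occupying exactly one line. Many FASTA files wrap sequences across several lines, and in that case one length would be written per line instead of per record. The following is a minimal sketch of one way to handle wrapped input by summing line lengths until the next header; the function name fasta_lengths and the demo records are made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Yield (record_id, sequence_length) for each record in a FASTA stream.

    Hypothetical helper: sequence lines are summed until the next header.
    """
    record_id, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if record_id is not None:
                yield record_id, length
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            length = 0
        elif record_id is not None:
            length += len(line)
    # emit the final record
    if record_id is not None:
        yield record_id, length


# Made-up input: seq1 is wrapped over two lines
demo = io.StringIO(">seq1 some description\nACGT\nAC\n>seq2\nACGTACGT\n")
for record_id, length in fasta_lengths(demo):
    print("{}\t{}".format(record_id, length))
```

For real files, the same generator can be fed an open file object, for example `with open(infile) as f: ...`, and each pair written to the output file in the tab-separated layout used above.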