Python: counting the length of each FASTA sequence

1. Input file

A FASTA file from the CARD protein database

cat test.txt
>gb|ACT97415.1|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15] 
MKAYFIAILTLFTCIATVVRAQQMSELENRIDSLLNGKKATVGIAVWTDKGDMLRYNDHVHFPLLSVFKFHVALAVLDKMDKQSISLDSIVSIKASQMPPNTYSPLRKKFPDQDFTITLRELMQYSISQSDNNACDILIEYAGGIKHINDYIHRLSIDSFNLSETEDGMHSSFEAVYRNWSTPSAMVRLLRTADEKELFSNKELKDFLWQTMIDTETGANKLKGMLPAKTVVGHKTGSSDRNADGMKTADNDAGLVILPDGRKYYIAAFVMDSYETDEDNANIIARISRMVYDAMR
>gb|AEJ08681.1|ARO:3001109|SHV-52 [Klebsiella pneumoniae] 
MRYIRLCIISLLAALPLAVHASPQPLEQIKQSESQLSGRVGMIEMDLASGRTLTAWRADERFPMISTFKVVLCGAVLARVDAGDEQLERKIHYRQQDLVDYSPVSEKHLADGMTVGELCAAAITMSDNSAANLLLAIVGGPAGLTAFLRQIGDNVTRLDRWETELNEALPGDARDTTTPASMAATLRKLLTSQRLSARSQRQLLQWMVDDRVAGPLIRSVLPAGWFIADKTGAGERGARGIVALLGPNNKAERIVVIYLRDTPASMAERNQQIAGIGAALIEHWQR
>gb|AAD01868.1|ARO:3002867|dfrF [Enterococcus faecalis] 
MIGLIVARSKNNVIGKNGNIPWKIKGEQKQFRELTTGNVVIMGRKSYEEIGHPLPNRMNIVVSTTTEYQGDNLVSVKSLEDALLLAKGRDVYISGGYGLFKEALQIVDKMYITEVDLNIEDGDTFFPEFDINDFEVLIGETLGEEVKYTRTFYVRKNELSRFWI
>gb|AFJ59957.1|ARO:3001989|CTX-M-130 [Escherichia coli] 
MVTKRVQRMMFAAAACIPLLLGSAPLYAQTSAVQQKLAALEKSSGGRLGVALIDTADNTQVLYRGDERFPMCSTSKVMAAAAVLKQSETQKQLLNQPVEIKPADLVNYNPIAEKHVNGTMTLAELSAAALQYSDNTAMNKLIAQLGGPGGVTAFARAIGDETFRLDRTEPTLNTAIPGDPRDTTTPRAMAQTLRQLTLGHALGETQRAQLVTWLKGNTTGAASIRAGLPTSWTVGDKTGSGDYGTTNDIAVIWPQGRAPLVLVTYFTQPQQNAERRHDVLASAARIIAEGL
>gb|AEX08599.1|ARO:3002356|NDM-6 [Escherichia coli] 
MELPNIMHPVAKLSTALAAALMLSGCMPGEIRPTIGQQMETGDQRFGDLVFRQLAPNVWQHTSYLDMPGFGAVASNGLIVRDGGRVLVVDTAWTDDQTAQILNWIKQEINLPVALAVVTHAHQDKMGGMDALHAAGIATYANALSNQLAPQEGMVAAQHSLTFAANGWVEPATAPNFGPLKVFYPGPGHTSDNITVGIDGTDIAFGGCLIKDSKAKSLGNLGDADTEHYAASVRAFGAAFPKASMIVMSHSAPDSRAAITHTARMADKLR
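
Each header line above is pipe-delimited, with the ARO accession in the third field. A quick sketch of how one such header splits (the variable names are illustrative only):

```python
import re

# Header copied from the first record of test.txt above.
header = ">gb|ACT97415.1|ARO:3002999|CblA-1 [mixed culture bacterium AX_gF3SD01_15]"

# Split on the literal pipe; "|" must be escaped in a regular expression.
fields = re.split(r"\|", header)

# fields[0] is ">gb", fields[1] the protein accession, fields[2] the ARO id.
aro_id = fields[2]
print(aro_id)  # ARO:3002999

# For a fixed delimiter, str.split gives the same fields without a regex.
same_as_str_split = header.split("|") == fields
```

Since the delimiter is a fixed character, `str.split("|")` would work just as well as `re.split` here.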

2. The python3 script

  • re.split extracts the ARO id
  • len computes the sequence length
  • an if test combined with format produces the formatted output
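import re

with open("out.txt", 'w') as o:
    with open("test.txt") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines; line[0] would raise IndexError
            if line.startswith(">"):
                # Header line: the ARO id is the third pipe-delimited field
                aro = re.split(r'\|', line)[2]
                o.write("{}".format(aro))
            else:
                # Sequence line (each sequence is assumed to sit on one line)
                o.write("\t{}\n".format(len(line)))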
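
The script above relies on each sequence sitting on a single line, as in test.txt. Many FASTA files wrap sequences over several lines, in which case it would write one length per line rather than one per record. A minimal sketch that sums the lengths per record instead; `fasta_lengths` is a hypothetical helper name, and the id is assumed to be the third pipe-delimited field as above:

```python
import io


def fasta_lengths(handle):
    """Yield (aro_id, length) per record, summing wrapped sequence lines."""
    aro_id, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if aro_id is not None:
                yield aro_id, length  # emit the previous record
            aro_id, length = line.split("|")[2], 0
        else:
            length += len(line)
    if aro_id is not None:
        yield aro_id, length  # emit the last record


# Toy input (made up for illustration) with one wrapped sequence.
demo = io.StringIO(
    ">gb|X1|ARO:0000001|toy\nMKAY\nFIA\n\n>gb|X2|ARO:0000002|toy\nMRYIRL\n"
)
result = list(fasta_lengths(demo))
print(result)  # [('ARO:0000001', 7), ('ARO:0000002', 6)]
```

Passing an open file instead of the `io.StringIO` object should work the same way, since the function only iterates over lines.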

3. Results

ARO:3002999 296
ARO:3001109 286
ARO:3002867 164
ARO:3001989 291
ARO:3002356 270

Worked example

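#!/usr/bin/env python3
import sys

# Usage: fasta_length.py <infile> <outfile>
_script, infile, outfile = sys.argv

with open(outfile, 'w') as o:
    with open(infile) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines; line[0] would raise IndexError
            if line.startswith(">"):
                # Sequence id: first space-delimited token, without the leading ">"
                seq_id = line[1:].split(' ')[0]
                o.write("{}".format(seq_id))
            else:
                o.write("\t{}\n".format(len(line)))

python3 fasta_length.py genome_merge.fasta genome_merge.length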
MF3KA12003AB_00001      68
MF3KA12003AB_00002      2848
MF3KA12003AB_00003      1505
MF3KA12003AB_00004      3882
MF3KA12003AB_00005      4161
MF3KA12003AB_00006      600
MF3KA12003AB_00007      369
MF3KA12003AB_00008      546
MF3KA12003AB_00009      573
MF3KA12003AB_00010      1080

OK
