根据ID从FASTA文件中批量提取序列(Python,seqtk,seqkit),参考“冷月”, "Nickier"," 苏牧传媒"

根据ID从FASTA文件中批量提取序列是做序列分析常做的事情,有网友让我帮忙从11万条中挑选7万条,我自己写写了一个,太慢了;后来发现Biopython官方文档里面“Cookbook – Cool things to do with it”第一件事就是做这个事情的,后来我又学习了“冷月”小伙伴在知乎的帖子,稍微改写了一下,其实就是ctrl+c和ctrl+v,供大家参考,欢迎交流。

冷月的帖子:

https://zhuanlan.zhihu.com/p/57938795

Biopython官方文档:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec365

当然首先在要电脑上装python和Biopython。

然后就是我ctrl+c和ctrl+v的代码了:

#coding:utf-8

import click

from Bio import SeqIO

@click.command()

@click.option('-f', '--fastafile', help='Input a fasta file', required=True)

@click.option('-i', '--idfile', help='Input an idlist', required=True)

@click.option('-o', '--outfile', help='Input the name of result file', default='result_out.fa')

def main(fastafile="FA.fa",idfile="ID.txt",outfile= "result_out.fa"):

with open(idfile) as id_handle:

    wanted = set(line.rstrip("\n").split(None,1)[0] for line in id_handle)

print("Found %i unique identifiers in %s" % (len(wanted), idfile))

records = (r for r in SeqIO.parse(fastafile, "fasta") if r.id in wanted)

count = SeqIO.write(records, outfile, "fasta")

print("Saved %i records from %s to %s" % (count, fastafile, outfile))

if count < len(wanted):

    print("Warning %i IDs not found in %s" % (len(wanted) - count, fastafile))

if __name__ == '__main__':

    main()

另存为get_seqs_by_id.py

使用方法:

python get_seqs_by_id.py -f **.fasta -i ID.txt -o out_file.fasta

加个时间模块测试了一下,11万条中挑选7万花了不到两秒,回去看看我自己写的啥JB玩意儿啊~。。。。

其实还有下面更快的方法:

seqtk方法也十分方便("Nickier"):

https://www.jianshu.com/p/2671198ae625

seqtk subseq  **.fasta **.txt > out.fasta

还有seqkit的方法("苏牧传媒"):

https://www.jianshu.com/p/471283080bd6

你可能感兴趣的:(根据ID从FASTA文件中批量提取序列(Python,seqtk,seqkit),参考“冷月”, "Nickier"," 苏牧传媒")