根据ID从FASTA文件中批量提取序列是做序列分析常做的事情，有网友让我帮忙从11万条中挑选7万条，我自己写写了一个，太慢了；后来发现Biopython官方文档里面“Cookbook – Cool things to do with it”第一件事就是做这个事情的，后来我又学习了“冷月”小伙伴在知乎的帖子，稍微改写了一下，其实就是ctrl+c和ctrl+v，供大家参考，欢迎交流。

冷月的帖子：

https://zhuanlan.zhihu.com/p/57938795

Biopython官方文档：

http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec365

当然首先在要电脑上装python和Biopython。

然后就是我ctrl+c和ctrl+v的代码了：

#coding:utf-8

import click

from Bio import SeqIO

@click.command()

@click.option('-f', '--fastafile', help='Input a fasta file', required=True)

@click.option('-i', '--idfile', help='Input an idlist', required=True)

@click.option('-o', '--outfile', help='Input the name of result file', default='result_out.fa')

def main(fastafile="FA.fa",idfile="ID.txt",outfile= "result_out.fa"):

with open(idfile) as id_handle:

wanted = set(line.rstrip("\n").split(None,1)[0] for line in id_handle)

print("Found %i unique identifiers in %s" % (len(wanted), idfile))

records = (r for r in SeqIO.parse(fastafile, "fasta") if r.id in wanted)

count = SeqIO.write(records, outfile, "fasta")

print("Saved %i records from %s to %s" % (count, fastafile, outfile))

if count < len(wanted):

print("Warning %i IDs not found in %s" % (len(wanted) - count, fastafile))

if __name__ == '__main__':

main()

另存为get_seqs_by_id.py

使用方法：

python get_seqs_by_id.py -f **.fasta -i ID.txt -o out_file.fasta

加个时间模块测试了一下，11万条中挑选7万花了不到两秒，回去看看我自己写的啥JB玩意儿啊~。。。。

其实还有下面更快的方法：

seqtk方法也十分方便("Nickier")：

https://www.jianshu.com/p/2671198ae625

seqtk subseq **.fasta **.txt > out.fasta

还有seqkit的方法("苏牧传媒"):

https://www.jianshu.com/p/471283080bd6

根据ID从FASTA文件中批量提取序列(Python,seqtk,seqkit),参考“冷月”, "Nickier"," 苏牧传媒"

冷月的帖子：

Biopython官方文档：

当然首先在要电脑上装python和Biopython。

seqtk方法也十分方便("Nickier")：

还有seqkit的方法("苏牧传媒"):

你可能感兴趣的:(根据ID从FASTA文件中批量提取序列(Python,seqtk,seqkit),参考“冷月”, "Nickier"," 苏牧传媒")