细菌基因组下载

最近需要尽可能的下载细菌基因组,ncbi ftp站点(ftp://ftp.ncbi.nlm.nih.gov/genomes)里面"all"这个目录可以下载,不过"all"目录还有古细菌之类的哈。如果只想取Refseq或genbank数据库其一,可以通过以下方式:

#下载Refseq/genbank数据库中的细菌基因组数据
#截止20190106,有140681条基因组数据
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#文件第12列是拼接状态(Complete Genome/Scaffold/contig),第20列是ftp路径,下面是下载组装完整的基因组信息
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gbff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths
cut -d / -f 11 ftpfilepaths | paste - ftpfilepaths  | while read a b;do echo "wget -c -nd -r -np -k -L -p -nd -P genbank $b && gzip -d genbank/$a";done >run.download.sh
#脚本只下载了gbff文件,当然,如果想要gff,fna,faa通过小脚本转就可以了,或者加上“-A faa.gz,fna.gz,gff.gz,gpff.gz”

真的超级占存储空间,我只下载gbff文件,需要其他格式的文件再简单格式转化就可以了。同时,下载一批立马处理一批。
Refseq和genbank的区别
Refseq下载的是GCF开头的文件,而genbank下载的是GCA开头的文件。

具体看这里:https://zhuanlan.zhihu.com/p/20749737

GenBank是核苷酸数据库,RefSeq是基因数据库 ,具体哪个更全?互为补充,以前做分析的时候发现有些细菌基因组在GenBank里面没有在RefSeq有,还有可能两个数据库里面某个基因组只有一个fna文件没有注释,这种情况如果遇到自己动手用prokka十分钟就能解决。


生信小脚本收藏癖,有时遇到比较实用的脚本会存下来放在自己环境变量bin里面,积累会使以后的分析变得更加高效

附上两个网上看到好用的脚本

1.GenBank转faa格式

#!/usr/bin/env perl
# This program reads gbk files and extracts all amino acid sequences from the
# /translation fields into a FASTA file. The FASTA header contains a sequential
# number followed by the taxon id, which is extracted from the
# /db_xref="taxon:" field. Only letters in the 20 letter amino acid
# alphabet are retained in the FASTA file.
#
# author: Peter Menzel
#
# This file is part of Kaiju, Copyright 2015-2017 Peter Menzel and Anders Krogh
# Kaiju is licensed under the GPLv3, see the file LICENSE.
#

use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;

my $t = 0;
my $taxid;
my $protein_id;

if(!defined $ARGV[1]) { die "Usage: $0 infile.gbk outfile.faa"; }
open(OUT,">",$ARGV[1]) or die "Could not open file $ARGV[1] for writing.";
my $in_fh = new IO::Uncompress::AnyUncompress $ARGV[0] or die "Opening input file failed: $AnyUncompressError\n";

while(<$in_fh>) {
    chomp;
    if(m,/db_xref="taxon:(\d+)",) {
        $taxid = $1;
    }
    elsif(m,/protein_id="([^"]+)",) {
        $protein_id=$1;
    }
    elsif(m,\s+/translation="([^"]+)",)  {  
        if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
        print OUT ">$protein_id\_$taxid\n";
        my $seq = $1;
        $seq =~ tr/BZ/DE/;  # a.a. alphabet specifies `B' matches `N' or `D', and `Z' matches `Q' or `E.', here we use substitution with higher score
        $seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
        print OUT "$seq\n";
    }
    elsif(m,\s+/translation="([^"]+)$,) {
        if(!defined($taxid)) { die "No taxon id found in gbk file $ARGV\n";}
        print OUT ">$protein_id\_$taxid\n";
        $t = 1;
        my $seq = $1;
        $seq =~ tr/BZ/DE/;
        $seq =~ s/[^ARNDCQEGHILKMFPSTWYV]//gi;
        print OUT "$seq\n";
    }
    elsif($t) {
        if(m,",) { 
            tr/BZ/DE/;
            s/[^ARNDCQEGHILKMFPSTWYV]//gi;
            print OUT $_,"\n";   
            $t = 0;
        }
        else { 
            tr/BZ/DE/;
            s/[^ARNDCQEGHILKMFPSTWYV]//gi;
            print OUT $_,"\n";   
        }
    }
}
close($in_fh);
close(OUT);

2.Genbank转fna格式

#!/usr/bin/env python

import sys
from Bio import SeqIO

if len(sys.argv) < 3:
    print('USAGE: gbk2fna GBK FNA')
    sys.exit(65)

SeqIO.write(SeqIO.parse(sys.argv[1], 'genbank'), sys.argv[2], 'fasta')

你可能感兴趣的:(细菌基因组下载)