Perl in practice: extracting the sequence you need from NCBI records (Part 1)

Let's start by looking at how GenBank formats a sequence record. I copied the record at http://www.ncbi.nlm.nih.gov/nuccore/JX118024.1 into a file named strawberry.gb:


LOCUS       JX118024                 460 bp    DNA     linear   PLN 25-SEP-2012
DEFINITION  Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1)
            gene, partial cds; plastid.
ACCESSION   JX118024
VERSION     JX118024.1  GI:402238751
KEYWORDS    .
SOURCE      plastid Fragaria vesca subsp. americana
  ORGANISM  Fragaria vesca subsp. americana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae;
            Fragariinae; Fragaria.
REFERENCE   1  (bases 1 to 460)
  AUTHORS   Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N.
  TITLE     Insights into phylogeny, sex function and age of Fragaria based on
            whole chloroplast genome sequencing
  JOURNAL   Mol. Phylogenet. Evol. (2012) In press
   PUBMED   22982444
  REMARK    Publication Status: Available-Online prior to print
REFERENCE   2  (bases 1 to 460)
  AUTHORS   Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017
            ALS, Corvallis, OR 97331, USA
COMMENT     ##Assembly-Data-START##
            Assembly Method       :: yasra v. 2009; mulan v. 2005
            Sequencing Technology :: Illumina
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..460
                     /organism="Fragaria vesca subsp. americana"
                     /organelle="plastid"
                     /mol_type="genomic DNA"
                     /sub_species="americana"
                     /db_xref="taxon:101019"
     gene            complement(<1..>460)
                     /gene="rpoC1"
     CDS             complement(<1..>460)
                     /gene="rpoC1"
                     /codon_start=2
                     /transl_table=11
                     /product="RNA polymerase beta subunit"
                     /protein_id="AFQ39140.1"
                     /db_xref="GI:402238752"
                     /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG
                     KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA
                     SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV"
ORIGIN      
        1 tactaaaatg ggctggaacg cctgtatgcc taatctatgc agagtaggcg ccctattcag
       61 caatacggga tgcccttgca taacttcctg aagtatttcc catacaaccg gctctttttc
      121 ccgaatttta ctcttagcca ctcctatatt cgaggcaaaa tgctgcctaa ttaaaccacg
      181 aattacaaat gtctggaaaa gctctattgc tatttcgcga ggcaatccac atcgatgtaa
      241 tgaaagtgaa gggcctacaa caatgacaga acgccccgaa taatcgaccc gtttgccaag
      301 cagagtctca cggaatcttc cctcttttcc ttcaattaca tctgaaaacg acttgtaaac
      361 cttattatga ccgtccctca ttggttgtcc acggattcca ttatcaagaa gcgtatccac
      421 agcttcttgt accaatttct cctgacacat gactaattct
//

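This is a very short record. Everything before the sequence is annotation, which is not what we are after. The data we want starts on the line after ORIGIN. Each of those lines begins with a number, which is there to make counting bases easier; it is not part of the sequence, so we do not need it either. The record ends with //, the end-of-record marker.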
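To make the clean-up concrete, here is a minimal sketch that takes one line copied from the ORIGIN block above and removes the position number and the spaces between the groups of bases. The helper name strip_numbering is hypothetical and used only for illustration; the substitution is the same one the full program applies later.

```perl
use strict;
use warnings;

# One line copied from the ORIGIN block of the record above.
my $line = "        1 tactaaaatg ggctggaacg cctgtatgcc taatctatgc agagtaggcg ccctattcag\n";

# Hypothetical helper (not part of the original program):
# delete every whitespace character and digit, keeping only the bases.
sub strip_numbering {
    my ($text) = @_;
    (my $bases = $text) =~ s/[\s0-9]//g;
    return $bases;
}

my $bases = strip_numbering($line);
print length($bases), " bases: $bases\n";
```

The six groups of ten bases on that line come out as one unbroken string of 60 letters.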

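The approach the program takes is:

First: ignore everything above ORIGIN.

Second: stop when we reach //.

Third: delete all the digits from the sequence we collected.

Fourth: to tell annotation from sequence, we use a flag, in_sequence. Its value is 0 until we meet ORIGIN; when we meet ORIGIN we set it to 1.

Fifth: the tests inside the foreach loop run from the end of the record backwards. We first test whether the line is the closing // line, then whether we are already inside the sequence, then whether this is the ORIGIN line, and otherwise treat the line as annotation. The // test has to come first; if the in_sequence test ran before it, the // line would be appended to the sequence.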
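The five points above can be tried without a file on disk. The sketch below applies the same tests, in the same order, to a shortened toy record held in memory. The sub name extract_dna and the abbreviated sequence lines are assumptions made for illustration; they are not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the record as a list of lines and returns
# the cleaned sequence plus a reference to the annotation lines.
sub extract_dna {
    my @lines       = @_;
    my $in_sequence = 0;     # 0 before ORIGIN, 1 after it
    my $dna         = '';
    my @annotation;

    foreach my $line (@lines) {
        if ($line =~ m{^//}) {           # end-of-record marker: stop
            last;
        }
        elsif ($in_sequence) {           # already past ORIGIN: sequence line
            $dna .= $line;
        }
        elsif ($line =~ /^ORIGIN/) {     # ORIGIN itself: switch the flag
            $in_sequence = 1;
        }
        else {                           # everything else is annotation
            push @annotation, $line;
        }
    }

    $dna =~ s/[\s0-9]//g;                # drop whitespace and position numbers
    return ($dna, \@annotation);
}

# A shortened toy record: the real LOCUS line, then two abbreviated
# sequence lines (not the real 60-bases-per-line layout).
my @record = (
    "LOCUS       JX118024                 460 bp    DNA     linear   PLN 25-SEP-2012\n",
    "ORIGIN      \n",
    "        1 tactaaaatg ggctggaacg\n",
    "       21 cctgtatgcc taatctatgc\n",
    "//\n",
);

my ($dna, $annotation) = extract_dna(@record);
print "annotation lines: ", scalar(@$annotation), "\n";
print "sequence: $dna\n";
```

Returning the values, rather than filling variables through references, is only a convenience for this small demonstration; the program below uses references.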


The program below also saves all of the annotation lines above ORIGIN into @annotation and prints them.

use strict;
use warnings;

my @annotation = ();
my $sequence   = '';
# get_file_data (called inside parse1) prompts for the file path itself,
# so no filename needs to be passed in here.
parse1(\@annotation, \$sequence);
print @annotation;
print_sequence($sequence, 50);


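# parse1: split a GenBank record into annotation lines and the DNA sequence.
# Arguments: a reference to an array (annotation) and a reference to a scalar (DNA).
sub parse1
{
    my ($annotation, $dna) = @_;

    my $in_sequence = 0;
    my @GenBankFile = get_file_data();

    foreach my $line (@GenBankFile)
    {
        # match the last line of the record, "//"
        if ($line =~ m{^//})
        {
            last;
        }
        elsif ($in_sequence)
        {
            $$dna .= $line;
        }
        elsif ($line =~ /^ORIGIN/)
        {
            $in_sequence = 1;
        }
        else
        {
            # push through the reference that was passed in,
            # not onto the file-level @annotation
            push(@$annotation, $line);
        }
    }
    # remove whitespace and line numbers from the DNA sequence
    $$dna =~ s/[\s0-9]//g;
}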



sub get_file_data
{
    # A subroutine to get data from a file: it prompts for the file path
    # and returns the lines of that file as a list.
    my $dna_filename;
    my @filedata;
    print "please input the Path just like this f:\\\\perl\\\\data.txt\n";
    chomp($dna_filename = <STDIN>);
    # three-argument open with a lexical filehandle
    open(my $fh, '<', $dna_filename) or die("can not open the file '$dna_filename': $!");
    @filedata = <$fh>;
    close $fh;
    return @filedata;    # remember to return the value from the subroutine
}


sub print_sequence    
{    
    # A subroutine to format and print sequence data    
    my ($sequence, $length) = @_;    
    for (my $pos =0; $pos<length($sequence);$pos+=$length)    
    {    
        print substr($sequence,$pos,$length),"\n";    
    }    
}    

The output is as follows:

F:\>perl\a.pl
please input the Path just like this f:\\perl\\data.txt
f:\\perl\strawberry.gb
LOCUS       JX118024                 460 bp    DNA     linear   PLN 25-SEP-2012
DEFINITION  Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1)
            gene, partial cds; plastid.
ACCESSION   JX118024
VERSION     JX118024.1  GI:402238751
KEYWORDS    .
SOURCE      plastid Fragaria vesca subsp. americana
  ORGANISM  Fragaria vesca subsp. americana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae;
            Fragariinae; Fragaria.
REFERENCE   1  (bases 1 to 460)
  AUTHORS   Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N.
  TITLE     Insights into phylogeny, sex function and age of Fragaria based on
            whole chloroplast genome sequencing
  JOURNAL   Mol. Phylogenet. Evol. (2012) In press
   PUBMED   22982444
  REMARK    Publication Status: Available-Online prior to print
REFERENCE   2  (bases 1 to 460)
  AUTHORS   Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017
            ALS, Corvallis, OR 97331, USA
COMMENT     ##Assembly-Data-START##
            Assembly Method       :: yasra v. 2009; mulan v. 2005
            Sequencing Technology :: Illumina
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..460
                     /organism="Fragaria vesca subsp. americana"
                     /organelle="plastid"
                     /mol_type="genomic DNA"
                     /sub_species="americana"
                     /db_xref="taxon:101019"
     gene            complement(<1..>460)
                     /gene="rpoC1"
     CDS             complement(<1..>460)
                     /gene="rpoC1"
                     /codon_start=2
                     /transl_table=11
                     /product="RNA polymerase beta subunit"
                     /protein_id="AFQ39140.1"
                     /db_xref="GI:402238752"
                     /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG
                     KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA
                     SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV"
tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg
ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc
catacaaccggctctttttcccgaattttactcttagccactcctatatt
cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa
gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa
gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag
cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg
acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca
ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat
gactaattct

F:\>

If we delete the line that prints @annotation, we get only the information we want. The output is as follows:

F:\>perl\a.pl
please input the Path just like this f:\\perl\\data.txt
f:\\perl\\strawberry.gb
tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg
ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc
catacaaccggctctttttcccgaattttactcttagccactcctatatt
cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa
gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa
gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag
cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg
acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca
ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat
gactaattct

F:\>
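One quick sanity check on this output is to compare the number of bases extracted with the length declared on the LOCUS line (460 bp for this record). The sketch below is an optional addition, not part of the original program: the helper declared_length and its regular expression are assumptions based on the layout of the LOCUS line shown above, and the sequence is pasted in from the output.

```perl
use strict;
use warnings;

# Hypothetical helper: read the declared length ("460 bp") from a LOCUS line.
# Returns undef if the line does not look like a LOCUS line.
sub declared_length {
    my ($locus_line) = @_;
    return $locus_line =~ /^LOCUS\s+\S+\s+(\d+)\s+bp\b/ ? $1 : undef;
}

my $locus = "LOCUS       JX118024                 460 bp    DNA     linear   PLN 25-SEP-2012\n";

# The ten output lines printed by the program, joined back into one string.
my $sequence = join '', qw(
    tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg
    ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc
    catacaaccggctctttttcccgaattttactcttagccactcctatatt
    cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa
    gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa
    gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag
    cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg
    acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca
    ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat
    gactaattct
);

my $expected = declared_length($locus);
if (defined $expected && $expected == length($sequence)) {
    print "length matches the LOCUS line: $expected bp\n";
}
else {
    print "length mismatch: extracted ", length($sequence), " bases\n";
}
```

In the real program the same comparison could be made against the first line stored in @annotation.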

