我们首先来看一下GenBank中序列的格式,我从http://www.ncbi.nlm.nih.gov/nuccore/JX118024.1中复制了这个信息到strawberry.gb中
LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012 DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1) gene, partial cds; plastid. ACCESSION JX118024 VERSION JX118024.1 GI:402238751 KEYWORDS . SOURCE plastid Fragaria vesca subsp. americana ORGANISM Fragaria vesca subsp. americana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae; Fragariinae; Fragaria. REFERENCE 1 (bases 1 to 460) AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N. TITLE Insights into phylogeny, sex function and age of Fragaria based on whole chloroplast genome sequencing JOURNAL Mol. Phylogenet. Evol. (2012) In press PUBMED 22982444 REMARK Publication Status: Available-Online prior to print REFERENCE 2 (bases 1 to 460) AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N. TITLE Direct Submission JOURNAL Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017 ALS, Corvallis, OR 97331, USA COMMENT ##Assembly-Data-START## Assembly Method :: yasra v. 2009; mulan v. 2005 Sequencing Technology :: Illumina ##Assembly-Data-END## FEATURES Location/Qualifiers source 1..460 /organism="Fragaria vesca subsp. americana" /organelle="plastid" /mol_type="genomic DNA" /sub_species="americana" /db_xref="taxon:101019" gene complement(<1..>460) /gene="rpoC1" CDS complement(<1..>460) /gene="rpoC1" /codon_start=2 /transl_table=11 /product="RNA polymerase beta subunit" /protein_id="AFQ39140.1" /db_xref="GI:402238752" /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV" ORIGIN 1 tactaaaatg ggctggaacg cctgtatgcc taatctatgc agagtaggcg ccctattcag 61 caatacggga tgcccttgca taacttcctg aagtatttcc catacaaccg gctctttttc 121 ccgaatttta ctcttagcca ctcctatatt cgaggcaaaa tgctgcctaa ttaaaccacg 181 aattacaaat gtctggaaaa gctctattgc tatttcgcga ggcaatccac atcgatgtaa 241 tgaaagtgaa gggcctacaa caatgacaga acgccccgaa taatcgaccc gtttgccaag 301 cagagtctca cggaatcttc cctcttttcc ttcaattaca tctgaaaacg acttgtaaac 361 cttattatga ccgtccctca ttggttgtcc acggattcca ttatcaagaa gcgtatccac 421 agcttcttgt accaatttct cctgacacat gactaattct //
我们写这个程序的主要方法就是:
第一:忽略ORIGIN上面的所有内容
第二:当遇到//的时候停止
第三:把我们获取序列中的所有的数字替换为空
第四:注释和序列之间的区别,我们用了一个标志符in_sequence.在没有遇到ORIGIN的时候,值为0,当遇到ORIGIN是我们把它的值赋值为1.
第五:foreach中判断语句的顺序,我们按照从后往前的顺序,也就是也判读是不是最后的结束行,然后看如果是序列,然后ORI一行,然后是注释。
下面的程序我们把上边的所有注释内容页保存到@annotation中去了,并且也做了输出。
use strict; use warnings; my @annotation=( ); my $sequence =''; my $filename =''; parse1(\@annotation,\$sequence,\$filename); print @annotation; print_sequence($sequence,50); #parse1 sub parse1 { my ($annotation,$dna,$filename) = @_; my $in_sequence = 0; my @GenBankFile =( ); @GenBankFile =get_file_data($filename); foreach my $line(@GenBankFile) { #用来匹配最后一行 if($line=~/^\/\/\n/) { last; } elsif($in_sequence) { $$dna .=$line; } elsif($line=~/^ORIGIN/) { $in_sequence = 1; } else { push(@annotation,$line); } } #remove whitespace adn line numbers from DNA sequence $$dna =~ s/[\s0-9]//g; } sub get_file_data { # A subroutine to get data from a file given its filename #读取文件的子序列 my $dna_filename; my @filedata; print "please input the Path just like this f:\\\\perl\\\\data.txt\n"; chomp($dna_filename=<STDIN>); open(DNAFILENAME,$dna_filename)||die("can not open the file!"); @filedata = <DNAFILENAME>; close DNAFILENAME; return @filedata;#子函数的返回值一定要记住写 } sub print_sequence { # A subroutine to format and print sequence data my ($sequence, $length) = @_; for (my $pos =0; $pos<length($sequence);$pos+=$length) { print substr($sequence,$pos,$length),"\n"; } }
结果如下:
F:\>perl\a.pl please input the Path just like this f:\\perl\\data.txt f:\\perl\strawberry.gb LOCUS JX118024 460 bp DNA linear PLN 25-SEP-2012 DEFINITION Fragaria vesca subsp. americana RNA polymerase beta subunit (rpoC1) gene, partial cds; plastid. ACCESSION JX118024 VERSION JX118024.1 GI:402238751 KEYWORDS . SOURCE plastid Fragaria vesca subsp. americana ORGANISM Fragaria vesca subsp. americana Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; fabids; Rosales; Rosaceae; Rosoideae; Potentilleae; Fragariinae; Fragaria. REFERENCE 1 (bases 1 to 460) AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.L. and Bassil,N. TITLE Insights into phylogeny, sex function and age of Fragaria based on whole chloroplast genome sequencing JOURNAL Mol. Phylogenet. Evol. (2012) In press PUBMED 22982444 REMARK Publication Status: Available-Online prior to print REFERENCE 2 (bases 1 to 460) AUTHORS Njuguna,W., Liston,A., Cronn,R., Ashman,T.-L. and Bassil,N. TITLE Direct Submission JOURNAL Submitted (16-MAY-2012) Horticulture, Oregon State University, 4017 ALS, Corvallis, OR 97331, USA COMMENT ##Assembly-Data-START## Assembly Method :: yasra v. 2009; mulan v. 2005 Sequencing Technology :: Illumina ##Assembly-Data-END## FEATURES Location/Qualifiers source 1..460 /organism="Fragaria vesca subsp. americana" /organelle="plastid" /mol_type="genomic DNA" /sub_species="americana" /db_xref="taxon:101019" gene complement(<1..>460) /gene="rpoC1" CDS complement(<1..>460) /gene="rpoC1" /codon_start=2 /transl_table=11 /product="RNA polymerase beta subunit" /protein_id="AFQ39140.1" /db_xref="GI:402238752" /translation="ELVMCQEKLVQEAVDTLLDNGIRGQPMRDGHNKVYKSFSDVIEG KEGRFRETLLGKRVDYSGRSVIVVGPSLSLHRCGLPREIAIELFQTFVIRGLIRQHFA SNIGVAKSKIREKEPVVWEILQEVMQGHPVLLNRAPTLHRLGIQAFQPILV" tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc catacaaccggctctttttcccgaattttactcttagccactcctatatt cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat gactaattct F:\>
F:\>perl\a.pl please input the Path just like this f:\\perl\\data.txt f:\\perl\\strawberry.gb tactaaaatgggctggaacgcctgtatgcctaatctatgcagagtaggcg ccctattcagcaatacgggatgcccttgcataacttcctgaagtatttcc catacaaccggctctttttcccgaattttactcttagccactcctatatt cgaggcaaaatgctgcctaattaaaccacgaattacaaatgtctggaaaa gctctattgctatttcgcgaggcaatccacatcgatgtaatgaaagtgaa gggcctacaacaatgacagaacgccccgaataatcgacccgtttgccaag cagagtctcacggaatcttccctcttttccttcaattacatctgaaaacg acttgtaaaccttattatgaccgtccctcattggttgtccacggattcca ttatcaagaagcgtatccacagcttcttgtaccaatttctcctgacacat gactaattct F:\>