多个fasta文件中去除重复序列

前言

需求:N个fasta文件,里面的序列可能有大量的重复,但是其header并不一定相同,需要把他们合并并去重。

方法一

首先用cat命令将他们合并

cat *.fasta > merge.fasta

perl合并

perl rmfastadup.pl merge.fasta nodup.fasta
rmfastadup.pl 代码如下:

#! /usr/bin/perl

# Downloaded from http://www.bioinformatics-made-simple.com
# 去除fasta文件中重复的序列
# 用法:perl rmfastadup.pl input,fasta output.fasta
use Bio::Perl;
my $infile1 = $ARGV[0];
my %FIRSTSEQ;
my $total = 0;
my $total1=0;
my $dup = 0;

open (OUTIE, ">$ARGV[1]");

$infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);

#read in all fasta sequences

while ((my $seqobj = $infile1->next_seq()))
{

    $rawid = $seqobj -> display_id;
    $seq = $seqobj -> seq;

    #print "ID is $rawid\n";
    #print "Sequence is $seq\n";

    

    if(defined($FIRSTSEQ{$seq}))
    {
        print "Key match with $FIRSTSEQ{$seq}\n";
    
        $dup++;
    }
    else
    {
        $FIRSTSEQ{$seq} = $rawid;
        $total++;
    }
}

while ( ($key, $value) = each(%FIRSTSEQ) )
{
    print OUTIE ">$value\n";
    print OUTIE "$key\n\n";
    $total1++;
}

print "\n$total unduplicated sequences in the file\n";
print "$dup duplicated sequences in the file\n";
print "$total1 unique sequences printed out.\n";

close(OUTIE);

这个方法在fasta文件数量少、序列少的情况下比较简单好用,但是序列数多了以后就会贼慢,让人受不了,亲测10G的fasta整整跑了四个多小时,吐血。

方法二

大量的序列,必须通过多线程的编程来提高效率,下面提供了一个perl代码,是上面的多线程版本。

用法

perl p-rmfastadup.pl file1.fasta file2.fasta <...>
最后会生成nodup.fasta文件。
另外,核心数可以自行修改,具体看代码中的注释吧

p-rmfastadup.pl的代码如下

#! /usr/bin/perl

#同时处理X个fasta文件,最终去除这些fasta文件中序列相同的(header可以不同,一概被认为是同一序列)序列,最终输出无重复的fasta文件:nodup.fasta
#usage: perl p-rmfastadup.pl file1.fasta file2.fasta <...>

use threads;
use threads::shared;
use Bio::Perl;

my %FIRSTSEQ : shared;
my $total : shared;
my $total1 : shared;
my $dup : shared;
my @files : shared;

@files = @ARGV;

# 创建线程,可以自行增减
my $t1 = threads->create(\&checkfasta);
sleep(0.1);
my $t2 = threads->create(\&checkfasta);

my $t3 = threads->create(\&checkfasta);

my $t4 = threads->create(\&checkfasta);

my $t5 = threads->create(\&checkfasta);

my $t6 = threads->create(\&checkfasta);

my $t7 = threads->create(\&checkfasta);

my $t8 = threads->create(\&checkfasta);

my $t9 = threads->create(\&checkfasta);

my $t10 = threads->create(\&checkfasta);


 # 收割线程,需要配合上面的线程
$t1->join; 
$t2->join;
$t3->join;
$t4->join;
$t5->join; 
$t6->join;
$t7->join;
$t8->join;
$t9->join; 
$t10->join;

sub checkfasta {
    while (@files) {
        my $infile1  = shift @files;
        print $infile1."\n";
        $infile1 = Bio::SeqIO -> new('-format'=>'Fasta',-file => $infile1);

        #read in all fasta sequences

        while ((my $seqobj = $infile1->next_seq()))
        {

            $rawid = $seqobj -> display_id;
            $seq = $seqobj -> seq;

            #print "ID is $rawid\n";
            #print "Sequence is $seq\n";



            if(defined($FIRSTSEQ{$seq}))
            {
                print "Key match with $FIRSTSEQ{$seq}\n";

                $dup++;
            }
            else
            {
                $FIRSTSEQ{$seq} = $rawid;
                $total++;
            }
        }
    }
}

#checkfasta();

open (OUTIE, ">nodup.fasta");
while ( ($key, $value) = each(%FIRSTSEQ) )
{
    print OUTIE ">$value\n";
    print OUTIE "$key\n\n";
    $total1++;
}


print "$total1 unique sequences printed out.\n";

close(OUTIE);

2020.6.26更新

没事干造什么轮子,已经有人写好啦!用seqkit完美解决需求!

你可能感兴趣的:(多个fasta文件中去除重复序列)