khaper: removing redundant sequences from a genome assembly

This tool is mainly used to remove redundant (duplicated) sequences from a genome assembly.

See the project's GitHub repository for details.

Note

  • The software needs more than 100 GB of memory
  • The genome must be smaller than 4 Gb

Basic usage

Basic idea: build a k-mer frequency table from Illumina reads at >40x coverage, then use that table to remove redundant sequences from the assembly.
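To check the >40x requirement before starting, depth can be approximated as total read bases divided by the estimated genome size. The helpers below are a minimal sketch: the function names `count_fastq_bases` and `estimate_depth` and the example numbers are illustrative assumptions, not part of khaper.

```shell
#!/usr/bin/env bash

# Sum the sequence lengths of a 4-line-per-record FASTQ stream read from stdin.
count_fastq_bases() {
    awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# estimate_depth TOTAL_BASES GENOME_SIZE
# Prints the approximate depth (total bases / genome size) with one decimal.
estimate_depth() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

# Hypothetical example: 60 Gb of reads for a 1 Gb genome.
estimate_depth 60000000000 1000000000
```

With real data, the total could be obtained with something like `zcat reads1.gz reads2.gz | count_fastq_bases` and passed to `estimate_depth` together with the expected genome size.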

1. Download the tool directly from GitHub. It requires jellyfish, which can be installed with conda:

conda install -c bioconda jellyfish

2. Prepare the data

  • assemble.fasta
  • reads1.gz
  • reads2.gz

3. Build the k-mer frequency table

ls *.gz > fq.lst
perl Bin/Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3 -d Kmer_17

# Parameters:
-m minimum k-mer count
-i list of FASTQ files
-k k-mer size
-d output directory
-s steps to run, as listed below
1: count k-mer by jellyfish
2: record unique k-mer into .h5 file
3: record unique k-mer into .bit file
4: record all k-mer into .h5 file
5: record all k-mer into .bit file
6: record all k-mer into .bit file with -m at 0.5 of the peak
7: get the genome size, repeat rate and heterozygosity rate

Note:

# For k=17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 17 -s 1,3,5 -d Kmer_17
# For k>17, we recommend:
perl Graph.pl pipe -i fq.lst -m 2 -k 23 -s 1,2,4 -d Kmer_23

#######################################
k=15 is suitable for genomes smaller than 100 Mb.
k=17 is suitable for genomes smaller than 10 Gb.
This version only supports k<=17.
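The size thresholds above can be turned into a small lookup. This is only a sketch of the rule as stated in the notes; the function name `suggest_k` is a hypothetical helper and not a khaper command.

```shell
#!/usr/bin/env bash

# suggest_k GENOME_SIZE_BP
# Prints the k-mer size suggested by the notes above:
#   < 100 Mb -> 15, < 10 Gb -> 17, otherwise "none" (outside the stated ranges).
suggest_k() {
    awk -v g="$1" 'BEGIN {
        if (g < 100000000)        print 15
        else if (g < 10000000000) print 17
        else                      print "none"
    }'
}

# Hypothetical example: a 2 Gb genome.
suggest_k 2000000000
```

The printed value can then be passed to the -k option of Graph.pl.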

The resulting file is at Kmer_17/02.Uinque_bit/kmer_17.bit

4. Remove redundant sequences from the assembly

perl remDup.pl

       Options:
              --ref    The ref genome to build kbit
              --kbit   The unique kmer file
              --kmer   the kmer size [15]
              --sort   sort seq by length [1]

Run it as follows:

perl Bin/remDup.pl  --kbit Kmer_17/02.Uinque_bit/kmer_17.bit \
  --kmer 17 assemble.fasta Compress 0.3

The result is the compressed file: Compress/trinity.single.fasta.gz

Note:

a. If the compressed file is larger than the estimated genome size, turn down the cutoff value
b. If the compressed file is smaller than the estimated genome size, turn up the cutoff value
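The comparison in (a) and (b) can be scripted. The sketch below assumes that "size" means the total number of bases in the compressed assembly, which is an interpretation rather than something the tool documents; `fasta_bases` and `cutoff_hint` are hypothetical helper names.

```shell
#!/usr/bin/env bash

# Sum the lengths of all sequence lines of a FASTA stream read from stdin.
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }'
}

# cutoff_hint ASSEMBLY_BASES ESTIMATED_GENOME_SIZE
# Prints which way to move the cutoff, following rules (a) and (b) above.
cutoff_hint() {
    awk -v asm="$1" -v est="$2" 'BEGIN {
        if (asm > est)      print "turn down"
        else if (asm < est) print "turn up"
        else                print "keep"
    }'
}

# Hypothetical example: a 1.3 Gb result against a 1.0 Gb genome estimate.
cutoff_hint 1300000000 1000000000
```

On real output, the first argument could come from `zcat Compress/trinity.single.fasta.gz | fasta_bases`; after adjusting the cutoff (0.3 in the command above), rerun remDup.pl and compare again.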

Other tools

  • Purge_dups
  • purge_haplotigs
