RECON

RECON: a package for automated de novo identification of repeat families from genomic sequences

Description

Proper identification of repetitive sequences is an essential step in genome analysis. The RECON package performs de novo identification and classification of repeat sequence families from genomic sequences. The underlying algorithm is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Specifically, our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. RECON should be useful for first-pass automatic classification of repeats in newly sequenced genomes.

User Tips

Organize the initial all-vs-all pairwise comparison carefully; doing so may save up to half of the running time. In particular, avoid collecting self hits, which can slow RECON down significantly. With RECON 1.03 or earlier, self hits may also cause the program to crash.
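As a rough illustration of discarding self hits before they reach RECON, here is a minimal Python sketch. It assumes the all-vs-all hits are in the standard 12-column BLAST tabular format (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), and it assumes that a "self hit" means a sequence aligned to itself over identical coordinates. The function name drop_self_hits is made up for this example and is not part of RECON.

```python
def drop_self_hits(lines):
    """Yield BLAST tabular lines that are not trivial self hits.

    Assumption: a self hit has the same query and subject ID and the
    same start/end coordinates on both sides.
    """
    for line in lines:
        # Skip blank lines and comment lines (e.g. from commented tabular output).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        if qseqid == sseqid and qstart == sstart and qend == send:
            continue  # sequence aligned to itself over the same interval
        yield line


if __name__ == "__main__":
    example = [
        "seq1\tseq1\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t900\n",
        "seq1\tseq2\t90.0\t200\t20\t0\t10\t209\t5\t204\t1e-50\t300\n",
    ]
    for kept in drop_self_hits(example):
        print(kept, end="")
```

Hits between two different regions of the same sequence are kept, since those may be genuine repeat copies.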

You may need to rename your input sequences. Some sequence names are not recognized properly by the program, for reasons that are not yet clear. A safe choice is a name like "seq123456", i.e., the string "seq" followed by a number. I'm working on this one.
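One way to apply the "seq" plus number naming is sketched below in Python. It assumes the input is in FASTA format and numbers the records in order of appearance; the function name rename_fasta and the returned mapping are illustrative choices, not something RECON provides.

```python
def rename_fasta(lines, prefix="seq"):
    """Rename FASTA records to prefix + running number.

    Returns the rewritten lines and a dict mapping each new name
    back to the original header text (without the leading '>').
    """
    mapping = {}
    renamed = []
    for line in lines:
        if line.startswith(">"):
            new_name = f"{prefix}{len(mapping) + 1}"
            mapping[new_name] = line[1:].strip()
            renamed.append(f">{new_name}\n")
        else:
            renamed.append(line)  # sequence lines pass through unchanged
    return renamed, mapping


if __name__ == "__main__":
    fasta = [">chr1 Homo sapiens\n", "ACGT\n", ">scaffold_2|x\n", "GGCC\n"]
    new_lines, name_map = rename_fasta(fasta)
    print("".join(new_lines), end="")
    print(name_map)
```

Keeping the mapping makes it possible to translate RECON's results back to the original sequence names afterwards.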

Typically, I focus only on families with >= 10 copies. I build consensus sequences for these families, then use the consensus sequences to annotate the genomic sequences (using RepeatMasker). This way, I can recover older or more divergent family members that were not detected in the initial all-vs-all BLAST.
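The copy-number cutoff can be sketched as a small Python filter. The input here is a hypothetical dictionary mapping a family ID to the list of its copies; how that dictionary is built from RECON's output is not shown and would depend on the output files of your RECON version.

```python
def families_to_keep(families, min_copies=10):
    """Return (family_id, copy_count) pairs with at least min_copies copies,
    largest families first.

    'families' is an assumed structure: {family_id: [copy, copy, ...]}.
    """
    counts = [(fam, len(copies)) for fam, copies in families.items()]
    kept = [(fam, n) for fam, n in counts if n >= min_copies]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: family IDs and placeholder copy labels.
    toy = {
        "fam1": [f"copy{i}" for i in range(25)],
        "fam2": [f"copy{i}" for i in range(3)],
        "fam3": [f"copy{i}" for i in range(10)],
    }
    print(families_to_keep(toy))
```

The families that pass the filter are the ones to carry forward to consensus building.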

If you have more than 30 to 50 Mb of sequence, you should consider taking the incremental approach described in the RECON paper.
