Beginning users should take a look at the Getting started guide for a tutorial on running TopHat-Fusion.
After running tophat, you can run tophat-fusion-post to filter out fusion candidates.
Usage: tophat-fusion-post [options]*
Options:
Option | Description |
---|---|
-v/--version | Prints the help message and exits. |
-p/--num-threads | The number of threads used. The default is 1. |
--num-fusion-reads | Fusions with at least this many supporting reads will be reported. The default is 3. |
--num-fusion-pairs | Fusions with at least this many supporting pairs will be reported. The default is 2. |
--num-fusion-both | The sum of supporting reads and pairs must be at least this number for a fusion to be reported. The default is 0. |
--fusion-read-mismatches | Reads support a fusion if they map across it with at most this many mismatches. The default is 2. |
--fusion-multireads | Reads that map to more than this many places will be ignored. The default is 2. |
--non-human | Use this option if your annotation is not the human annotation. |
In addition to the files output by TopHat, the tophat (with --fusion-search) and tophat-fusion-post scripts produce a number of files in the directories in which they were invoked. Most of these are internal, intermediate files generated for use within the pipeline. The output files you will likely want to look at are:
tophat (with --fusion-search option)
- accepted_hits.bam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is here.
We assume that a fusion alignment involves two chromosomes or different places on the same chromosome (distant or inversion). So, a fusion alignment is reported in two partial alignments (one before the fusion point and the other after) as follows:
(1) SRR064286.11667197 chr12 2910488 255 16M chr20 62505047 0 ATATGTTTGAGAGGCT BCBCCBCCCBBBBBBB XF:Z:1 chr12-chr20 2910488 16M62500758F34M
(2) SRR064286.11667197 chr20 62500758 255 34M = 62505047 0 GGCTGAGGAGGAGGAGCTCAGGGCTGAGCTTACC BB?B@;@B?AA?A?;?:@>>@@B>>;;B>?=9## XF:Z:2 chr12-chr20 2910488 16M62500758F34M
Some fields are omitted for the sake of illustration. As shown above, 16 bp of the 50-bp read map to chr12 and the remaining 34 bp map to chr20. The whole fusion alignment is available in a custom field, XF, in both partial alignments. XF:Z:1 and XF:Z:2 indicate the first and the second partial alignments, respectively.
A new CIGAR operator 'F' (Fusion) is introduced to indicate the presence of a fusion in read alignments. The following example shows a read alignment across a fusion between chromosome 20 and chromosome 12.
XF:Z:1 chr12-chr20 2910488 16M62500758F34M
The leftmost base of the read (SRR064286.11667197) maps to base 2910488 of chromosome 12, so the first 16 bases of the read cover positions 2910488 to 2910503 on chromosome 12. The 17th base of the read maps to base 62500758 of chromosome 20, and the remaining 34 bases cover positions 62500758 to 62500791 on chromosome 20. The precise fusion point is therefore between 2910503 on chromosome 12 and 62500758 on chromosome 20. Note that lowercase operators such as m, n, i, and d have the opposite interpretation to their uppercase counterparts: the coordinate decreases instead of increasing.
In case the other end of a pair spans a fusion point, a custom field XP is used as follows:
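The walk-through above can be sketched in code. The following Python snippet is an illustration only and is not part of TopHat-Fusion: the function name parse_xf is hypothetical, and it handles only the uppercase M and F operators that appear in the example. Lowercase operators and other CIGAR operations are deliberately rejected rather than guessed at.

```python
import re


def parse_xf(value):
    """Return (chromosome, first_base, last_base) for each M block of an XF value.

    Hypothetical helper for illustration; only uppercase 'M' and 'F' are handled.
    """
    chroms, start, cigar = value.split()
    left_chrom, right_chrom = chroms.split("-", 1)
    chrom, pos = left_chrom, int(start)
    blocks = []
    for number, op in re.findall(r"(\d+)([A-Za-z])", cigar):
        n = int(number)
        if op == "M":
            # n matched bases, coordinate increasing
            blocks.append((chrom, pos, pos + n - 1))
            pos += n
        elif op == "F":
            # the number before 'F' is the position on the second chromosome
            chrom, pos = right_chrom, n
        else:
            raise ValueError(f"operator {op!r} is not handled in this sketch")
    return blocks


for chrom, first, last in parse_xf("chr12-chr20 2910488 16M62500758F34M"):
    print(chrom, first, last)
```

For the example alignment this prints `chr12 2910488 2910503` followed by `chr20 62500758 62500791`, matching the coordinates given above.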
SRR064286.73586371 chr11 44890 50M XP:Z:chr1-chr10 14482
The read (SRR064286.73586371) maps to position 44890 on chr11, while its mate spans a fusion point between chr1 and chr10.
- fusions.out. A list of the fusions that tophat (with --fusion-search) finds before tophat-fusion-post is run; each fusion is supported by at least one read alignment. Each row represents one fusion, as described in detail below. A row can be split into seven rows using @ as a line separator.
chr20-chr17 49411707 59445685 ff 106 116 167 0 37 36 0.569598 @
11 25 38 49 63 @
CAGCGGGGCGCGCGAGCTCGCGCTCTTCCTGACCCCCGAGCCTGGGGCCG AGGTAGGGGACGGGGCTGTGGAGTTGGAGGAGAGGGTTCTCGCGGTTAGG @
CCTGCTCCCTGAAGGTGTGGACTCAACGTCAGATGTCCCGTGTGTGCCAC AGGTACCTTTGACAGGAGCGTGACCCTGCTGGAGGTGTGCGGGAGCTGGC @
106 106 106 106 106 106 106 106 106 106 106 106 106 106 98 90 84 79 79 78 74 68 65 64 63 63 59 59 56 55 52 51 20 15 12 10 8 0 0 0 0 0 0 0 0 0 0 0 0 0 @
106 106 106 106 106 106 106 106 106 106 106 106 106 98 96 94 91 86 55 54 51 50 47 47 43 43 42 41 38 32 28 27 27 22 16 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 @
-6:1 11:0 16:-3 18:1 14:6 14:6 15:7 23:0 5:21 31:5 18:19 36:-1 ...
This is a fusion between base 49411707 on chromosome 20 and base 59445685 on chromosome 17.
ff is the orientation of the two chromosomes: both are in the forward direction, like (chr 20) -----> -----> (chr 17).
106 is the number of reads that span the fusion, 116 is the number of mate pairs that support the fusion, 167 is the number of mate pairs that support the fusion and whose one end spans the fusion.
0 is the number of reads that contradict the fusion by mapping to only one of the chromosomes 20 and 17.
37 and 36 are the number of bases on the left and right sides of a fusion, respectively, covered by spanning reads.
The second row is likely to be dropped in the next version.
The third row shows two 50-bp contigs around a fusion point on chromosome 20. The fourth row is similarly defined for chromosome 17.
The fifth row is the depth of coverage by spanning reads on chromosome 20 from base 49411707 to base 49411658. The sixth row is the depth of coverage by spanning reads on chromosome 17 from base 59445685 to base 59445734.
The seventh row lists distances (distance1:distance2) between a mate pair and the fusion, i.e., distance1 between the left end of a pair and the left side of the fusion, and distance2 between the right end of a pair and the right side of the fusion.
A useful Linux command, awk, can be used for filtering fusions as follows:
awk '{if($5 > 100) print}' fusions.out | sed 's/@\t/\n/g'
Here, $5 is the number of spanning reads, so this command shows the fusions supported by more than 100 spanning reads.
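The same filter can be sketched in Python for cases where the fields are easier to handle by name. This is an illustrative sketch, not a tool shipped with TopHat-Fusion: the function names and dictionary keys are hypothetical, it assumes whitespace-separated fields and @ separators as in the sample row above, and it ignores the eleventh value on the first row because that value is not described here. The sample row is abbreviated to its first two sections.

```python
def parse_fusion_header(row):
    """Parse the first '@'-separated section of one fusions.out row."""
    header = row.split("@")[0].split()
    left_chrom, right_chrom = header[0].split("-", 1)
    return {
        "left_chrom": left_chrom,
        "right_chrom": right_chrom,
        "left_pos": int(header[1]),
        "right_pos": int(header[2]),
        "orientation": header[3],
        "spanning_reads": int(header[4]),         # $5 in the awk command
        "supporting_pairs": int(header[5]),
        "pairs_with_spanning_end": int(header[6]),
        "contradicting_reads": int(header[7]),
        "left_bases": int(header[8]),
        "right_bases": int(header[9]),
    }


def filter_by_spanning_reads(rows, threshold):
    """Keep rows with more than `threshold` spanning reads, like the awk filter."""
    return [
        row for row in rows
        if row.strip() and parse_fusion_header(row)["spanning_reads"] > threshold
    ]


# Abbreviated copy of the sample row (first two sections only).
sample_row = "chr20-chr17 49411707 59445685 ff 106 116 167 0 37 36 0.569598 @ 11 25 38 49 63 @"
fusion = parse_fusion_header(sample_row)
print(fusion["left_chrom"], fusion["right_chrom"], fusion["spanning_reads"])
print(len(filter_by_spanning_reads([sample_row], 100)))
```

To apply it to a real file, the rows could be read with `open("fusions.out")` and passed to filter_by_spanning_reads, assuming each fusion occupies a single line as described above.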
tophat-fusion-post
- result.html. A list of fusion candidates is given in HTML format. A sample list is found here.
- result.txt. A text version of result.html.
TopHat-Fusion is built on TopHat, so it inherits all of TopHat's options and output formats (refer to the TopHat website for installation and basic information). The TopHat-Fusion algorithm is described in our poster at the CSHL Biology of Genomes conference. TopHat-Fusion consists of two sub-programs (tophat and tophat-fusion-post). Using RNA-Seq data from the breast cancer cell line MCF7 from Edgren et al. (Genome Biology 2011), the following tutorial demonstrates how to use TopHat-Fusion to identify fusion genes, including three known fusions (BCAS4-BCAS3, ARFGEF2-SULF2, RPS6KB1-TMEM49).
Sample | Reads | -r and --mate-std-dev values |
---|---|---|
BT474 | BT474_mix | -r 50 --mate-std-dev 80 |
SKBR3 | SKBR3_mix | -r 50 --mate-std-dev 80 |
KPL4 | SRR064287 | -r 0 --mate-std-dev 80 |
MCF7 | SRR064286 | -r 0 --mate-std-dev 80 |
To run TopHat-Fusion:
tophat -o tophat_MCF7 -p 8 --fusion-search --keep-fasta-order --bowtie1 --no-coverage-search -r 0 --mate-std-dev 80 --max-intron-length 100000 --fusion-min-dist 100000 --fusion-anchor-length 13 --fusion-ignore-chromosomes chrM /path/to/h_sapiens/bowtie_index SRR064286_1.fastq SRR064286_2.fastq
- Create a (top_dir) directory and run the above command inside it (see Required directory structure). If you have multiple samples, you can run them all under (top_dir).
- Use tophat_(sample_name) for the output directory ("-o" option), such as tophat_MCF7. The sample name (MCF7) will be used later for annotation purposes.
- You can change the number of threads using "-p" option.
- Turn on fusion algorithm (--fusion-search) and use Bowtie1 (--bowtie1).
- Turn off "coverage-search", which takes lots of memory and is slow.
- The mean fragment length of the data is 100 bp and the reads are 50 bp long, so the inner mate distance is 0 (= 100 - 50 * 2). In this example, we use a larger standard deviation (80 bp) for the inner mate distance because TopHat-Fusion makes use of the region (mate_inner_dist ± std_dev) to discover fusions.
- In addition to inter-chromosomal fusions, TopHat-Fusion tries to identify intra-chromosomal fusions, that is, rearrangements within a chromosome whose two sides are separated by at least --fusion-min-dist.
- A read supports a fusion if it maps to both sides of the fusion by at least --fusion-anchor-length bases.
- In addition to outputs from TopHat, TopHat-Fusion outputs a list of potential fusions (fusions.out - the first 2,000 out of 68,168 fusions) and a modified SAM alignment file that represents "fusion" alignments using the 'F' CIGAR operator, which is not supported by SAM tools.
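The arithmetic behind the -r value in the notes above can be written out as a short sketch. The function names here are hypothetical and exist only for illustration; the formula (fragment length minus twice the read length) and the mate_inner_dist ± std_dev region are taken from the notes, and the numbers are those of the MCF7 example.

```python
def inner_mate_distance(fragment_length, read_length):
    """Inner distance between the two mates of a pair (the value passed to -r)."""
    return fragment_length - 2 * read_length


def search_region(mate_inner_dist, std_dev):
    """Range mate_inner_dist ± std_dev, as described in the notes above."""
    return (mate_inner_dist - std_dev, mate_inner_dist + std_dev)


# MCF7 example: 100-bp mean fragment length, 50-bp reads, --mate-std-dev 80.
r_value = inner_mate_distance(fragment_length=100, read_length=50)
print(r_value)
print(search_region(r_value, std_dev=80))
```

This prints `0` and `(-80, 80)`, which corresponds to `-r 0 --mate-std-dev 80` in the command above.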
tophat-fusion-post -p 8 --num-fusion-reads 1 --num-fusion-pairs 2 --num-fusion-both 5 /path/to/h_sapiens/bowtie_index
- TopHat-Fusion uses BLAST search results to filter out false fusions caused by highly similar sequences or pseudogenes. The search results can also be used for annotation when there are no known genes in the provided annotation files. The 50-bp sequence on the left side of a fusion and the 50-bp sequence on the right side are combined into a 100-bp sequence, which is then BLASTed against the blast database. If match length (range: 0 to 100) + identity percent (0 to 100) is greater than 160, the fusion is filtered out. This BLAST step is usually performed for a few hundred fusions that remain after the prior filtering steps. Thus, it is highly recommended to install BLAST and download the blast database as follows.
- Install BLAST binaries (blastall and blastn).
- Make (top_dir)/blast directory, download human_genomic*, other_genomic*, and nt* from blast database, and extract them under (top_dir)/blast.
- Use --non-human option for genomes other than the human genome.
- The final list of fusion candidates is given in (top_dir)/tophatfusion_out/result.html.
- You may want to repeat the filtering process with various filtering parameters such as --num-fusion-reads and --num-fusion-pairs without deleting (top_dir)/tophatfusion_out, which is a database tophat-fusion-post internally uses for fast computation.
- This program requires Bowtie1 and the index files for Bowtie1, as it uses Bowtie1 internally mostly for filtering purposes.
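The BLAST-based rule described in the notes above can be expressed as a small sketch. This is not the code used by tophat-fusion-post: the function names are hypothetical, the sequences are placeholders rather than real data, and only the stated rule (match length + identity percent greater than 160) is implemented.

```python
def blast_query(left_seq, right_seq):
    """Join the 50-bp sequences on each side of a fusion into one 100-bp query."""
    if len(left_seq) != 50 or len(right_seq) != 50:
        raise ValueError("each side is expected to be 50 bp long")
    return left_seq + right_seq


def is_filtered_out(match_length, identity_percent):
    """Apply the stated rule: filter when match length + identity percent > 160."""
    return match_length + identity_percent > 160


# Placeholder sequences, used only to show the shape of the query.
query = blast_query("A" * 50, "C" * 50)
print(len(query))

# A hit covering 98 bp at 99% identity would cause the fusion to be filtered out,
# while a hit covering 60 bp at 95% identity would not.
print(is_filtered_out(98, 99))
print(is_filtered_out(60, 95))
```

Note that a sum of exactly 160 does not trigger the filter, since the rule uses "greater than".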
Required directory structure
- (top_dir)
  - tophat_sample_1 - the output directory by tophat (you may want to run tophat on several samples)
  - tophat_sample_2
  - ...
  - tophat_sample_n
  - tophatfusion_out - the output directory by tophat-fusion-post
  - ensGene.txt
  - refGene.txt
  - blast - BLAST database
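Before running tophat-fusion-post, it can be handy to check that (top_dir) matches the layout above. The following Python sketch is a hypothetical helper, not part of TopHat-Fusion. It assumes the entry names listed above, does not check for tophatfusion_out (which tophat-fusion-post writes itself), and reports the blast directory as missing even though BLAST is recommended rather than strictly required.

```python
from pathlib import Path


def missing_entries(top_dir):
    """Return the entries of the expected layout that are absent from top_dir."""
    top = Path(top_dir)
    missing = []
    # At least one tophat output directory named tophat_<sample_name>.
    if not any(p.is_dir() for p in top.glob("tophat_*")):
        missing.append("tophat_<sample_name>/")
    # Annotation files placed directly under top_dir.
    for name in ("ensGene.txt", "refGene.txt"):
        if not (top / name).is_file():
            missing.append(name)
    # BLAST database directory (recommended).
    if not (top / "blast").is_dir():
        missing.append("blast/")
    return missing


# Check the current directory; an empty list means nothing is missing.
print(missing_entries("."))
```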