Implementing the DC3 Algorithm for Suffix Array Construction

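DC3 (Difference Cover modulo 3) is the linear-time suffix array construction algorithm described by J. Kärkkäinen and P. Sanders in their 2003 paper "Simple Linear Work Suffix Array Construction". Compared with prefix doubling, it has a lower asymptotic time complexity (O(n) versus O(n log n)), but a larger constant factor. The idea resembles the median-of-medians selection algorithm (http://en.wikipedia.org/wiki/Selection_algorithm) in that it is divide and conquer. First, recursively sort the suffixes whose starting index is 1 (mod 3) or 2 (mod 3), which shrinks the problem to 2/3 of the original suffix set; call the sorted result S12. Then, using S12, sort the suffixes whose starting index is 0 (mod 3). This step needs only a two-digit radix sort: one digit is the character at the 0 (mod 3) starting index, and the other is the S12 rank of the suffix that starts one position later. Call the resulting sorted array S0. Finally, merge S0 and S12, as in merge sort. The merge relies on the difference cover idea: with the S12 ranks known, the order of the two suffixes being compared is decided by one of two cases, depending on whether the S12 suffix starts at 1 (mod 3) or 2 (mod 3).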
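
The two merge cases can be sketched roughly as follows. This is a minimal illustration, not the author's code: the class `Dc3MergeCompare` and its methods are hypothetical names, positions past the end of the text are assumed to be padded with character 0 and rank 0, and the ranks come from a naive sort of all suffixes instead of the recursive S12 step (the relative order of the S12 suffixes is the same either way).

```java
import java.util.Arrays;

// Hypothetical helper class, not part of the original listing.
class Dc3MergeCompare {
    // Character at position i, or 0 past the end (padding smaller than any real character).
    static int ch(String t, int i) { return i < t.length() ? t.charAt(i) : 0; }

    // Rank of suffix i, or 0 past the end (the empty suffix sorts first).
    static int rk(int[] rank, int i) { return i < rank.length ? rank[i] : 0; }

    // True if suffix i (i % 3 == 0) sorts before suffix j (j % 3 != 0).
    static boolean mod0Before(String t, int[] rank, int i, int j) {
        if (ch(t, i) != ch(t, j)) return ch(t, i) < ch(t, j);
        if (j % 3 == 1) {
            // Case 1: i+1 and j+1 are 1 and 2 (mod 3), so their ranks are known from S12.
            return rk(rank, i + 1) < rk(rank, j + 1);
        }
        // Case 2: compare one more character; i+2 and j+2 are 2 and 1 (mod 3).
        if (ch(t, i + 1) != ch(t, j + 1)) return ch(t, i + 1) < ch(t, j + 1);
        return rk(rank, i + 2) < rk(rank, j + 2);
    }

    // Naive 1-based ranks of all suffixes, used here only to supply rank values.
    static int[] naiveRank(String t) {
        int n = t.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        int[] rank = new int[n];
        for (int k = 0; k < n; k++) rank[sa[k]] = k + 1;
        return rank;
    }
}
```

Each comparison touches at most two characters and one pair of ranks, so a single comparison costs constant time and the whole merge is linear.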


Implementation:

/**
 * 
 * Build Suffix Array using DC3/KS Algorithm 
 *  
 *  
 * Copyright (c) 2011 ljs (http://blog.csdn.net/ljsspace/)
 * Licensed under GPL (http://www.opensource.org/licenses/gpl-license.php) 
 * 
 * @author ljs
 * 2011-07-18
 *
 */
public class DC3 {
	public static final char MAX_CHAR = '\u00FF';

	class Suffix{
		int[] sa;  
		//Note: the p-th suffix is stored in sa at SA[rank[p]-1];
		//p is the index into the array "rank", starting from 0;
		//the p-th suffix of a text S is S[p..n], n=S.length-1.
		int[] rank; 
		boolean done;
		 
		public Suffix(int[] sa,int[] rank){
			this.sa = sa;
			this.rank = rank;
		}
	}
	

	//a prefix of suffix[isuffix] represented with digits
	class Tuple{
		int isuffix; //the p-th suffix
		int[] digits;
		public Tuple(int suffix,int[] digits){
			this.isuffix = suffix;
			this.digits = digits;			
		}
		public String toString(){
			StringBuilder sb = new StringBuilder();
			sb.append(isuffix);
			sb.append("(");
			for(int i=0;i<...
	...
		for(...;j>=0;j--){
			//C[A[j]] <= A.length 
			tB[--C[tA[j].digits[d]]]=tA[j];			
		}
	}
	
	//tA: input
	//tB: output for rank calculation
	private void radixSort(Tuple[] tA,Tuple[] tB,int max,int digitsLen){
		int len = tA.length;
		int digitsTotalLen = tA[0].digits.length;
			
		for(int d=digitsTotalLen-1,j=0;j<...
	...
				...>rank[q]){
					sa[k++] = q;j++;
				}else{
					if(rank12[p+1]<...
	...
				...>rank[q]){
					sa[k++] = q;j++;
				}else{
					if(rank[p+1]<...>rank[q+1]){
						sa[k++] = q;j++;
					}else{
						if(rank12[p+2]...
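

Since the radix sort survives only in fragments in the listing above, here is a minimal sketch of how the 0 (mod 3) suffixes might be sorted with two stable counting-sort passes. It is an assumption-laden illustration and not the author's code: `Mod0RadixSort`, `countingPass` and `sortMod0` are hypothetical names, characters are assumed to fit in 0..255 (like MAX_CHAR above), and `rank` is assumed to hold 1-based ranks, with rank 0 standing in for a position past the end of the text.

```java
// Hypothetical helper class, not part of the original listing.
class Mod0RadixSort {
    // One stable counting-sort pass: reorders idx by key[idx[k]], with keys in 0..max.
    static int[] countingPass(int[] idx, int[] key, int max) {
        int[] count = new int[max + 1];
        for (int i : idx) count[key[i]]++;
        for (int v = 1; v <= max; v++) count[v] += count[v - 1];
        int[] out = new int[idx.length];
        // Walk backwards so that equal keys keep their previous relative order.
        for (int k = idx.length - 1; k >= 0; k--) out[--count[key[idx[k]]]] = idx[k];
        return out;
    }

    // Sorts the suffixes starting at i % 3 == 0 by the pair (text[i], rank[i+1]).
    static int[] sortMod0(String text, int[] rank) {
        int n = text.length();
        int m = (n + 2) / 3;                 // number of positions i < n with i % 3 == 0
        int[] chars = new int[n];
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            chars[i] = text.charAt(i);       // assumed to be <= 255
            next[i] = i + 1 < n ? rank[i + 1] : 0;
        }
        int[] idx = new int[m];
        for (int k = 0; k < m; k++) idx[k] = 3 * k;
        idx = countingPass(idx, next, n);    // least significant digit first
        return countingPass(idx, chars, 255);
    }
}
```

Sorting by the rank digit first and the character second is the usual least-significant-digit order; it works only because each pass is stable.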



Test output:


Text: GACCCACCACC#
suffix array:
 11 8 5 1 10 7 4 9 6 3 2 0
rank array:
 12 4 11 10 7 3 9 6 2 8 5 1
Text: mississippi#
suffix array:
 11 10 7 4 1 0 9 8 6 3 5 2
rank array:
 6 5 12 10 4 11 9 3 8 7 2 1
Text: abcdefghijklmmnopqrstuvwxyz#
suffix array:
 27 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
rank array:
 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 1
Text: yabbadabbado#
suffix array:
 12 1 6 4 9 3 8 2 7 5 10 11 0
rank array:
 13 2 8 6 4 10 3 9 7 5 11 12 1
Text: DFDLKJLJldfasdlfjasdfkldjasfldafjdajfdsfjalkdsfaewefsdafdsfa#
suffix array:
 60 0 2 1 5 7 4 6 3 59 47 54 30 34 41 17 11 25 53 29 33 9 19 23 13 56 44 37 50 48 58 46 10 55 36 39 15 31 20 27 51 40 16 24 32 35 43 21 28 8 22 14 42 52 18 12 57 45 38 26 49
rank array:
 2 4 3 9 7 5 8 6 50 22 33 17 56 25 52 37 43 16 55 23 39 48 51 24 44 18 60 40 49 20 13 38 45 21 14 46 35 28 59 36 42 15 53 47 27 58 32 11 30 61 29 41 54 19 12 34 26 57 31 10 1
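
To cross-check output like the above, a naive construction is handy. The sketch below is a hypothetical checker (`NaiveSuffixArrayCheck` is not part of the original program): it sorts the suffixes directly with String.compareTo, which is far slower than DC3 but easy to trust, and derives the 1-based rank array using the same convention as the listing, rank[sa[k]] = k + 1.

```java
import java.util.Arrays;

// Hypothetical checker class, not part of the original listing.
class NaiveSuffixArrayCheck {
    // Suffix array by direct comparison of suffix strings (quadratic or worse; for checking only).
    static int[] suffixArray(String t) {
        Integer[] order = new Integer[t.length()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        int[] sa = new int[order.length];
        for (int k = 0; k < sa.length; k++) sa[k] = order[k];
        return sa;
    }

    // 1-based rank array: the suffix starting at sa[k] has rank k + 1.
    static int[] rankArray(int[] sa) {
        int[] rank = new int[sa.length];
        for (int k = 0; k < sa.length; k++) rank[sa[k]] = k + 1;
        return rank;
    }

    public static void main(String[] args) {
        int[] sa = suffixArray("GACCCACCACC#");
        System.out.println(Arrays.toString(sa));
        System.out.println(Arrays.toString(rankArray(sa)));
    }
}
```

Because '#' has a smaller character code than any letter, the terminator suffix always comes first, which matches the first entry of every suffix array listed above.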

