Levenshtein distance algorithm

Dynamic Programming Algorithm (DPA) for Edit-Distance
Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm in 1965. If you can't spell or pronounce Levenshtein, the metric is also sometimes called edit distance.

The Levenshtein distance algorithm has been used in:

1. Spell checking
2. Speech recognition
3. DNA analysis
4. Plagiarism detection

 

Definition

The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of:

1. Change a letter

2. Insert a letter

3. Delete a letter

Levenshtein distance is obtained by finding the cheapest way to transform one string into another. The transformations are the one-step operations of insertion, deletion and substitution of a single symbol. In the simplest version, which the recurrence relations in this article use, every operation costs one unit, except that substituting a symbol by itself costs zero. A common weighted variant charges two units for a substitution (zero when the source and target symbols are identical) and one unit, half the cost of a substitution, for an insertion or a deletion.
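To make the effect of the cost choice concrete, here is a minimal sketch that takes the substitution cost and the insertion/deletion cost as parameters. It fills a table in the same way as the dynamic programming method described later in this article. The class name WeightedEditDistance and the parameter names subCost and indelCost are made up for this illustration.

```java
final class WeightedEditDistance {

    // Hypothetical helper: edit distance with caller-supplied costs.
    // subCost   = cost of changing one character into a different one
    // indelCost = cost of inserting or deleting one character
    static int distance(String s, String t, int subCost, int indelCost) {
        int n = s.length();
        int m = t.length();
        int[][] d = new int[n + 1][m + 1];
        // Turning a prefix into the empty string (or back) takes only deletions (insertions).
        for (int i = 1; i <= n; i++) d[i][0] = i * indelCost;
        for (int j = 1; j <= m; j++) d[0][j] = j * indelCost;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int change = d[i - 1][j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : subCost);
                int delete = d[i - 1][j] + indelCost;
                int insert = d[i][j - 1] + indelCost;
                d[i][j] = Math.min(change, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 1, 1)); // prints 3 (unit costs)
        System.out.println(distance("kitten", "sitting", 2, 1)); // prints 5 (substitution costs 2)
    }
}
```

With unit costs, "kitten" becomes "sitting" through two substitutions and one insertion. When a substitution costs two units, the same edit script costs 2 + 2 + 1 = 5.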

The greater the Levenshtein distance, the more different the strings are.

The following recurrence relations define the edit distance, d(s1,s2), of two strings s1 and s2:

d('', '') = 0               -- '' = empty string
d(s, '')  = d('', s) = |s|  -- i.e. length of s
d(s1+ch1, s2+ch2)
  = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi,
         d(s1+ch1, s2) + 1,
         d(s1, s2+ch2) + 1 )
The first two rules above are obviously true, so it is only necessary to consider the last one. Here, neither string is the empty string, so each has a last character, ch1 and ch2 respectively. Somehow, ch1 and ch2 have to be explained in an edit of s1+ch1 into s2+ch2. If ch1 equals ch2, they can be matched for no penalty, i.e. 0, and the overall edit distance is d(s1,s2). If ch1 differs from ch2, then ch1 could be changed into ch2 at a cost of 1, giving an overall cost of d(s1,s2)+1. Another possibility is to delete ch1 and edit s1 into s2+ch2, giving d(s1,s2+ch2)+1. The last possibility is to edit s1+ch1 into s2 and then insert ch2, giving d(s1+ch1,s2)+1. There are no other alternatives. We take the least expensive, i.e. the min, of these alternatives.

diag    above
left    min(above + delete,
            diag  + replace,
            left  + insert)



The recurrence relations imply an obvious ternary-recursive routine. This is not a good idea: its running time grows exponentially with the lengths of the strings, which makes it impractical for strings of more than a few characters.
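For illustration only, the ternary recursion can be sketched directly from the recurrence relations. The class name NaiveEditDistance and the calls counter are made up for this sketch. The counter is there just to show how quickly the number of recursive calls grows.

```java
final class NaiveEditDistance {

    // Counts invocations of d(), purely to illustrate the blow-up.
    static long calls = 0;

    static int d(String s1, String s2) {
        calls++;
        if (s1.isEmpty()) return s2.length(); // d('', s) = |s|
        if (s2.isEmpty()) return s1.length(); // d(s, '') = |s|
        String p1 = s1.substring(0, s1.length() - 1); // s1 without its last character
        String p2 = s2.substring(0, s2.length() - 1); // s2 without its last character
        char ch1 = s1.charAt(s1.length() - 1);
        char ch2 = s2.charAt(s2.length() - 1);
        int change = d(p1, p2) + (ch1 == ch2 ? 0 : 1); // match or change ch1 into ch2
        int insert = d(s1, p2) + 1;                    // edit s1 into p2, then insert ch2
        int delete = d(p1, s2) + 1;                    // delete ch1, then edit p1 into s2
        return Math.min(change, Math.min(insert, delete));
    }

    public static void main(String[] args) {
        System.out.println(d("kitten", "sitting")); // prints 3
        System.out.println(calls + " recursive calls");
    }
}
```

Every call on two non-empty strings spawns three more calls, and the same pairs of prefixes are recomputed over and over. That repeated work is what the dynamic programming method removes.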

Examination of the relations reveals that d(s1,s2) depends only on d(s1',s2') where s1' is shorter than s1, or s2' is shorter than s2, or both. This allows the dynamic programming technique to be used.

A two-dimensional matrix, m[0..|s1|,0..|s2|] is used to hold the edit distance values:

m[i,j] = d(s1[1..i], s2[1..j])

m[0,0] = 0
m[i,0] = i,  i=1..|s1|
m[0,j] = j,  j=1..|s2|

m[i,j] = min(m[i-1,j-1]
             + if s1[i]=s2[j] then 0 else 1 fi,
             m[i-1, j] + 1,
             m[i, j-1] + 1 ),  i=1..|s1|, j=1..|s2|
m[,] can be computed row by row: row m[i,] depends only on row m[i-1,] and on the entries already computed to the left in row m[i,]. The time complexity of this algorithm is O(|s1|*|s2|). If s1 and s2 have a similar length, about n say, this complexity is O(n^2), much better than exponential!
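To see what the matrix looks like, here is a small sketch that builds m row by row and prints it for a short pair of strings. The class name EditMatrix and the method name build are made up for this illustration. Java strings are indexed from 0, so s1[i] in the formulas above corresponds to s1.charAt(i-1) in the code.

```java
final class EditMatrix {

    // Builds the full table with m[i][j] = d(first i chars of s1, first j chars of s2).
    static int[][] build(String s1, String s2) {
        int[][] m = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) m[i][0] = i; // boundary: m[i,0] = i
        for (int j = 0; j <= s2.length(); j++) m[0][j] = j; // boundary: m[0,j] = j
        for (int i = 1; i <= s1.length(); i++) {            // row by row
            for (int j = 1; j <= s2.length(); j++) {
                int diag = m[i - 1][j - 1] + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
                m[i][j] = Math.min(diag, Math.min(m[i - 1][j] + 1, m[i][j - 1] + 1));
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Prints:
        // [0, 1, 2, 3, 4]
        // [1, 0, 1, 2, 3]
        // [2, 1, 0, 1, 2]
        // [3, 2, 1, 1, 1]
        for (int[] row : build("cat", "cart")) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

The bottom-right entry is the edit distance, here 1, because a single insertion of "r" turns "cat" into "cart".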


The Algorithm 

Step  Description
1     Set n to be the length of s.
      Set m to be the length of t.
      If n = 0, return m and exit.
      If m = 0, return n and exit.
      Construct a matrix d containing 0..n rows and 0..m columns.
2     Initialize the first column to 0..n.
      Initialize the first row to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0.
      If s[i] doesn't equal t[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately above plus 1: d[i-1,j] + 1.
      b. The cell immediately to the left plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].



Complexity
The time complexity of the algorithm is O(|s1|*|s2|), i.e. O(n^2) if the lengths of both strings are about n. The space complexity is also O(n^2) if the whole of the matrix is kept for a trace-back to find an optimal alignment. If only the value of the edit distance is needed, only two rows of the matrix need be allocated; they can be "recycled", and the space complexity is then O(|s2|), the length of one row (or O(min(|s1|,|s2|)) if the shorter string is laid along the row), i.e. O(n).
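The two-row idea can be sketched as follows. This is only an illustration: the class name TwoRowEditDistance and the array names prev and curr are made up. It assumes unit costs, under which the distance is symmetric, so the two strings can be swapped to put the shorter one along the row.

```java
final class TwoRowEditDistance {

    static int distance(String s1, String s2) {
        // Lay the shorter string along the row so that each row is as short as possible.
        if (s2.length() > s1.length()) {
            String tmp = s1;
            s1 = s2;
            s2 = tmp;
        }
        int[] prev = new int[s2.length() + 1]; // row i-1 of the matrix
        int[] curr = new int[s2.length() + 1]; // row i, being filled in
        for (int j = 0; j <= s2.length(); j++) prev[j] = j; // row 0
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i; // first column
            for (int j = 1; j <= s2.length(); j++) {
                int diag = prev[j - 1] + (s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(diag, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            // Recycle the arrays: the row just filled becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()]; // after the final swap, prev holds the last row
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The price of the saving is that the full matrix is no longer available, so this version returns only the distance and cannot be used for a trace-back.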

Variations
The costs of the point mutations can be varied to be numbers other than 0 or 1. Linear gap costs (often called affine gap costs) are sometimes used, where a run of insertions (or deletions) of length x has a cost of ax+b, for constants a and b. If b>0, this penalises numerous short runs of insertions and deletions relative to a single long run.
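One common way to handle a gap cost of ax+b is to keep three tables instead of one, distinguished by the kind of operation the edit ends with. The sketch below is one possible formulation and is not taken from the sources of this article. The class name AffineGapDistance and the parameter subCost (the cost of changing one character into a different one) are assumptions made for the example.

```java
final class AffineGapDistance {

    // Stands for "unreachable"; halved so that adding small costs cannot overflow.
    private static final int INF = Integer.MAX_VALUE / 2;

    // A run of x insertions (or deletions) costs a*x + b; a mismatch costs subCost.
    static int distance(String s, String t, int a, int b, int subCost) {
        int n = s.length();
        int m = t.length();
        int[][] mat = new int[n + 1][m + 1]; // cheapest edit ending in a match or change
        int[][] del = new int[n + 1][m + 1]; // cheapest edit ending in a deletion
        int[][] ins = new int[n + 1][m + 1]; // cheapest edit ending in an insertion
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                mat[i][j] = INF;
                del[i][j] = INF;
                ins[i][j] = INF;
            }
        }
        mat[0][0] = 0;
        for (int i = 1; i <= n; i++) del[i][0] = a * i + b; // one run of i deletions
        for (int j = 1; j <= m; j++) ins[0][j] = a * j + b; // one run of j insertions
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : subCost;
                mat[i][j] = min3(mat[i - 1][j - 1], del[i - 1][j - 1], ins[i - 1][j - 1]) + cost;
                // Extending an existing run costs a; opening a new run costs a + b.
                del[i][j] = min3(mat[i - 1][j] + a + b, del[i - 1][j] + a, ins[i - 1][j] + a + b);
                ins[i][j] = min3(mat[i][j - 1] + a + b, ins[i][j - 1] + a, del[i][j - 1] + a + b);
            }
        }
        return min3(mat[n][m], del[n][m], ins[n][m]);
    }

    private static int min3(int x, int y, int z) {
        return Math.min(x, Math.min(y, z));
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", 1, 0, 1)); // prints 3, same as unit costs
        System.out.println(distance("abcdef", "af", 1, 2, 1));      // prints 6, one run of 4 deletions
    }
}
```

With a=1 and b=0 every gap character costs one unit, so the result agrees with the plain edit distance. With b>0, opening a new run is more expensive than extending an existing one.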

 

source:

public class Similarity {

	private int min(int one, int two, int three) {
		int min = one;
		if(two < min) {
			min = two;
		}
		if(three < min) {
			min = three;
		}
		return min;
	}
	
	public int ld(String str1, String str2) {
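		int d[][];	// the distance matrix
		int n = str1.length();
		int m = str2.length();
		int i;	// index into str1
		int j;	// index into str2
		char ch1;	// current character of str1
		char ch2;	// current character of str2
		int temp;	// substitution cost for the current cell: 0 if the characters match, otherwise 1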
		if(n == 0) {
			return m;
		}
		if(m == 0) {
			return n;
		}
		d = new int[n+1][m+1];
		for(i=0; i<=n; i++) {	// initialize the first column
			d[i][0] = i;
		}
		for(j=0; j<=m; j++) {	// initialize the first row
			d[0][j] = j;
		}
		for(i=1; i<=n; i++) {	// iterate over str1
			ch1 = str1.charAt(i-1);
			// compare against each character of str2
			for(j=1; j<=m; j++) {
				ch2 = str2.charAt(j-1);
				if(ch1 == ch2) {
					temp = 0;
				} else {
					temp = 1;
				}
				// take the minimum of: above + 1, left + 1, upper-left + temp
				d[i][j] = min(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+temp);
			}
		}
		return d[n][m];
	}
	
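	public double sim(String str1, String str2) {
		int maxLen = Math.max(str1.length(), str2.length());
		if(maxLen == 0) {
			return 1.0;	// two empty strings are identical; avoids 0/0 (NaN)
		}
		int ld = ld(str1, str2);
		return 1 - (double) ld / maxLen;
	}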
	
	public static void main(String[] args) {
		Similarity s = new Similarity();
		String str1 = "java.org";
		String str2 = "iteye.com";
		System.out.println("ld="+s.ld(str1, str2));
		System.out.println("sim="+s.sim(str1, str2));
	}
}

 


 
