To start, here is the Levenshtein algorithm; more statistical background will be added later.
http://www.dewen.org/q/6668/如何设计一个比较两篇文章相似性的算法?
http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm
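The Levenshtein distance measures the difference between two strings: the minimum number of single-character insertions, deletions and substitutions needed to turn one into the other. I use it in a web crawler application to compare the new and old versions of a web page. If it has changed enough, I update it in my database.
Description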
The original algorithm creates a matrix whose size is StrLen1*StrLen2. If both strings are 1000 chars long, the resulting matrix is 1M elements; if the strings are 10,000 chars, the matrix will be 100M elements. If the elements are integers, it will be 4*100M == 400 MB. Ouch!
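The arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes 4-byte integer elements and uses the same rounded sizes as the text (StrLen1*StrLen2 rather than the exact (StrLen1+1)*(StrLen2+1)). The function names are hypothetical. For comparison it also computes the cost of keeping just two rows of the matrix.

```python
BYTES_PER_INT = 4  # assumption: 32-bit integer elements, as in the text


def full_matrix_bytes(len1: int, len2: int) -> int:
    """Approximate memory for a full len1 x len2 matrix of integers."""
    return len1 * len2 * BYTES_PER_INT


def two_vector_bytes(length: int) -> int:
    """Approximate memory for two vectors of `length` integers each."""
    return 2 * length * BYTES_PER_INT


for length in (1_000, 10_000):
    print(length, full_matrix_bytes(length, length), two_vector_bytes(length))
```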
This version of the algorithm uses only 2*StrLen elements, so the latter example would give 2*10,000*4 = 80 KB. The result is that it not only uses less memory but is also faster, because the memory allocation takes less time. When both strings are about 1K in length, the new version is more than twice as fast.
Example
The original version would create a matrix[6+1,5+1]; my version creates two vectors[6+1] (the yellow elements, labelled v0 and v1 in the tables below). In both versions, the order of the strings is irrelevant, that is, it could be matrix[5+1,6+1] and two vectors[5+1].
The new algorithm
Steps
Step  Description
1     Set n to be the length of s ("GUMBO"). Set m to be the length of t ("GAMBOL"). If n = 0, return m and exit. If m = 0, return n and exit. Construct two vectors, v0[m+1] and v1[m+1], with elements indexed 0..m.
2     Initialize v0 to 0..m.
3     Examine each character of s (i from 1 to n).
4     Examine each character of t (j from 1 to m).
5     If s[i] equals t[j], the cost is 0. If s[i] is not equal to t[j], the cost is 1.
6     Set cell v1[j] equal to the minimum of:
      a. The cell immediately above plus 1: v1[j-1] + 1.
      b. The cell immediately to the left plus 1: v0[j] + 1.
      c. The cell diagonally above and to the left plus the cost: v0[j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in the cell v1[m].
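The steps above can be sketched in Python as follows. This is not the article's original code, only a minimal illustration of the two-vector idea; the function name levenshtein is an assumption. Python strings are 0-indexed, so s[i] and t[j] in the steps become s[i - 1] and t[j - 1] here.

```python
def levenshtein(s: str, t: str) -> int:
    """Two-vector Levenshtein distance, following steps 1 to 7."""
    n, m = len(s), len(t)
    # Step 1: handle empty strings.
    if n == 0:
        return m
    if m == 0:
        return n
    # Steps 1 and 2: v0 is the previous column (0..m), v1 the current one.
    v0 = list(range(m + 1))
    v1 = [0] * (m + 1)
    # Step 3: examine each character of s.
    for i in range(1, n + 1):
        if i > 1:
            # Swap the references, not the contents of the vectors.
            v0, v1 = v1, v0
        v1[0] = i  # first cell of the current column is the column number
        # Step 4: examine each character of t.
        for j in range(1, m + 1):
            # Step 5: substitution cost.
            cost = 0 if s[i - 1] == t[j - 1] else 1
            # Step 6: minimum of the three neighbouring cells.
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
    # Step 7: the distance is in the last cell of the current column.
    return v1[m]


print(levenshtein("GUMBO", "GAMBOL"))
```

Swapping the two names at the start of each iteration after the first keeps the result in v1[m], which matches step 7.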
This section shows how the Levenshtein distance is computed when the source string is "GUMBO" and the target string is "GAMBOL":
Steps 1 and 2
    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1
A    2
M    3
B    4
O    5
L    6
Steps 3 to 6, when i = 1
    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0
A    2   1
M    3   2
B    4   3
O    5   4
L    6   5
Steps 3 to 6, when i = 2
SWAP(v0,v1): If you look in the code, you will see that I don't swap the contents of the vectors; I only swap the references to them.
Set v1[0] to the column number, e.g. 2.
        v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1
A    2   1   1
M    3   2   2
B    4   3   3
O    5   4   4
L    6   5   5
Steps 3 to 6, when i = 3
SWAP(v0,v1).
Set v1[0] to the column number, e.g. 3.
            v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2
A    2   1   1   2
M    3   2   2   1
B    4   3   3   2
O    5   4   4   3
L    6   5   5   4
Steps 3 to 6, when i = 4
SWAP(v0,v1).
Set v1[0] to the column number, e.g. 4.
                v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2   3
A    2   1   1   2   3
M    3   2   2   1   2
B    4   3   3   2   1
O    5   4   4   3   2
L    6   5   5   4   3
Steps 3 to 6, when i = 5
SWAP(v0,v1).
Set v1[0] to the column number, e.g. 5.
                    v0  v1
         G   U   M   B   O
     0   1   2   3   4   5
G    1   0   1   2   3   4
A    2   1   1   2   3   4
M    3   2   2   1   2   3
B    4   3   3   2   1   2
O    5   4   4   3   2   1
L    6   5   5   4   3   2
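The walkthrough can be reproduced by printing the current vector after each pass over t. This is a rough sketch rather than the article's code; the generator name levenshtein_columns is hypothetical. Each yielded list is one column of the table, starting with the column number in v1[0].

```python
from typing import Iterator, List


def levenshtein_columns(s: str, t: str) -> Iterator[List[int]]:
    """Yield a copy of v1 after each character of s has been processed."""
    m = len(t)
    v0 = list(range(m + 1))  # column 0 of the table: 0..m
    v1 = [0] * (m + 1)
    for i in range(1, len(s) + 1):
        if i > 1:
            v0, v1 = v1, v0  # swap references only
        v1[0] = i  # set v1[0] to the column number
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            v1[j] = min(v1[j - 1] + 1, v0[j] + 1, v0[j - 1] + cost)
        yield list(v1)  # copy, because v1 is reused on later passes


source, target = "GUMBO", "GAMBOL"
for i, column in enumerate(levenshtein_columns(source, target), start=1):
    print(f"i = {i} ({source[i - 1]}): {column}")
```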
Step 7
The distance is in the lower right-hand corner of the matrix, v1[m] == 2. This corresponds to our intuitive realization that "GUMBO" can be transformed into "GAMBOL" by substituting "A" for "U" and adding "L" (one substitution and one insertion = two changes).
Improvements
If you are sure that your strings will always be shorter than 2^16 chars, you could use ushort instead of int; if the strings are shorter than 2^8 chars, you could use byte. I guess the algorithm would be even faster if we used unmanaged code, but I have not tried it.
References
Levenshtein Distance, in Three Flavors
History
2006-03-22
Version 1.0.
2006-03-24
Detailed description of the algorithm. The code has been rewritten so that it now follows the description. :-)