原文地址:http://davefancher.com/2012/04/22/string-distances-9/
One of my first projects at Leaf has been trying to match some data read from an OCR solution against what is in the database. As sophisticated as OCR algorithms have become though it’s still not reliable enough to guarantee 100% accurate results every time due to the number of variations in type faces, artifacts introduced through scanning or faxing the document, or any number of other factors.
Most of the documents I’ve been working with have been pretty clean and I’ve been able to get an exact match automatically. One of my samples though, has some security features that intentionally obfuscate some of the information I care about. This naturally makes getting an exact match difficult. Amazingly though, the OCR result was about 80% accurate so there was still some hope.
One of my coworkers suggested that I look at some of the string distance algorithms to see if any of them could help us get closer to an exact match. He pointed me at the Levenshtein algorithm so I took a look at that along with the Hamming and Damerau-Levenshtein algorithms.
For the uninitiated (like myself a week ago), these algorithms provide a way to determine the distance between two strings. The distance is essentially a measurement of string similarity. In other words, they calculate how many steps are required to transform one string into another.
I want to look briefly at each of these and show some examples. Note that each of these algorithms are case sensitive but modifying them to ignore case is trivial.
create function HamDist(@value1 char(8000), @value2 char(8000)) returns int as begin declare @distance int declare @i int declare @len int select @distance = 0, @i =1, @len = case when len(@value1) > len(@value2) then len(@value1) else len(@value2) end if (@value1 is null) or (@value2 is null) return null while (@i <= @len) select @distance = @distance + case when substring(@value1,@i,1) != substring(@value2,@i,1) then 1 else 0 end, @i = @i +1 return @distance end
Named for the American mathematician Richard Hamming, the Hamming distance is identifies how many characters differ in equal length strings. Implementing this algorithm is pretty easy, especially with LINQ.
public static partial class StringDistance { public static int GetHammingDistance(string source, string target) { if(source.Length != target.Length) { throw new Exception("Strings must be equal length"); } int distance = source.ToCharArray() .Zip(target.ToCharArray(), (c1, c2) => new { c1, c2 }) .Count(m => m.c1 != m.c2); return distance; } }
To compute the Hamming distance we just project the character arrays from each string into a new collection (using Zip) and count the characters that don’t match.
StringDistance.GetHammingDistance("abcde", "abcde").Dump(); // 0 StringDistance.GetHammingDistance("abcde", "abcdz").Dump(); // 1 StringDistance.GetHammingDistance("abcde", "abcyz").Dump(); // 2 StringDistance.GetHammingDistance("abcde", "abxyx").Dump(); // 3 StringDistance.GetHammingDistance("abcde", "awxyz").Dump(); // 4 StringDistance.GetHammingDistance("abcde", "vwxyz").Dump(); // 5
As you can see, when the strings are identical the Hamming distance is zero and when they’re completely different they’re equal to the length of the strings.
There was some potential for using the Hamming distance in my project since each of the values I’m checking are supposed to be the same length but the OCR library threw a wrench in that plan by not recognizing some characters or mistaking smudges for additional characters. Since I couldn’t rely on equal length strings I moved on to the Levenshtein distance.
The Levenshtein distance is named for the Russian scientist, Vladimir Levenshtein. Calculating the Levenshtein distance is a bit more involved than the Hamming distance but since it applies to variable length strings it is applicable in more situations. The Levenshtein distance takes into account not only character alterations but also insertions and deletions.
public static partial class StringDistance { public static int GetLevenshteinDistance(string source, string target) { var bounds = new { Height = source.Length + 1, Width = target.Length + 1 }; int[,] matrix = new int[bounds.Height, bounds.Width]; for(int height = 0; height < bounds.Height; height++) { matrix[height, 0] = height; }; for(int width = 0; width < bounds.Width; width++) { matrix[0, width] = width; }; for(int height = 1; height < bounds.Height; height++) { for(int width = 1; width < bounds.Width; width++) { int cost = (source[height - 1] == target[width - 1]) ? 0 : 1; int insertion = matrix[height, width - 1] + 1; int deletion = matrix[height - 1, width] + 1; int substitution = matrix[height - 1, width - 1] + cost; int distance = Math.Min(insertion, Math.Min(deletion, substitution)); matrix[height, width] = distance; } } return matrix[bounds.Height - 1, bounds.Width - 1]; } }
In order to calculate the distance, this algorithm uses a matrix to track the differences as it works over the two strings. Insertions are detected by looking at the slot to the left of the current position whereas deletions are detected by looking at the slot above the current position. Substitution detection requires examining the slot above and to the left of the current position. The final distance is the number stored in the bottom right slot.
StringDistance.GetLevenshteinDistance("abcde", "abcde").Dump(); // 0 StringDistance.GetLevenshteinDistance("abcde", "abcdz").Dump(); // 1 StringDistance.GetLevenshteinDistance("abcde", "abcyz").Dump(); // 2 StringDistance.GetLevenshteinDistance("abcde", "abxyx").Dump(); // 3 StringDistance.GetLevenshteinDistance("abcde", "awxyz").Dump(); // 4 StringDistance.GetLevenshteinDistance("abcde", "vwxyz").Dump(); // 5
The above examples illustrate how the Levenshtein distance and the Hamming distance are identical for equal length strings but how does it work for strings of differing lengths?
StringDistance.GetLevenshteinDistance("abcdefg", "abcde").Dump(); // 2 StringDistance.GetLevenshteinDistance("abcdefg", "zxyde").Dump(); // 5
The Levenshtein distance is a good fit for my project but consider these examples:
StringDistance.GetLevenshteinDistance("abcde", "bacde").Dump(); // 2 StringDistance.GetLevenshteinDistance("abcde", "baced").Dump(); // 4
The major limitation of the Levenshtein distance is that it doesn’t take character transpositions into account. Does it really take two operations to transform “abcde” into “bacde” or just one? Frederick Damerau asked this same question and gave us the Damerau-Levenshtein distance.
Frederick Damerau adapted Levenshtein’s process to account properly for transpositions of adjacent characters. There are two ways to calculate the Damerau-Levenshtein distance but the one we’ll look at here is known as optimal string alignment.
The optimal string alignment algorithm is really just an extension of the Levenshtein distance algorithm.
public static partial class StringDistance { public static int GetDamerauLevenshteinDistance(string source, string target) { var bounds = new { Height = source.Length + 1, Width = target.Length + 1 }; int[,] matrix = new int[bounds.Height, bounds.Width]; for(int height = 0; height < bounds.Height; height++) { matrix[height, 0] = height; }; for(int width = 0; width < bounds.Width; width++) { matrix[0, width] = width; }; for(int height = 1; height < bounds.Height; height++) { for(int width = 1; width < bounds.Width; width++) { int cost = (source[height - 1] == target[width - 1]) ? 0 : 1; int insertion = matrix[height, width - 1] + 1; int deletion = matrix[height - 1, width] + 1; int substitution = matrix[height - 1, width - 1] + cost; int distance = Math.Min(insertion, Math.Min(deletion, substitution)); if (height > 1 && width > 1 && source[height - 1] == target[width - 2] && source[height - 2] == target[width - 1]) { distance = Math.Min(distance, matrix[height - 2, width - 2] + cost); } matrix[height, width] = distance; } } return matrix[bounds.Height - 1, bounds.Width - 1]; } }
In this variation we do an additional check to handle transpositions.
StringDistance.GetDamerauLevenshteinDistance("abcde", "abcde").Dump(); // 0 StringDistance.GetDamerauLevenshteinDistance("abcde", "abcdz").Dump(); // 1 StringDistance.GetDamerauLevenshteinDistance("abcde", "abcyz").Dump(); // 2 StringDistance.GetDamerauLevenshteinDistance("abcde", "abxyx").Dump(); // 3 StringDistance.GetDamerauLevenshteinDistance("abcde", "awxyz").Dump(); // 4 StringDistance.GetDamerauLevenshteinDistance("abcde", "vwxyz").Dump(); // 5 // Transpositions StringDistance.GetDamerauLevenshteinDistance("abcde", "bacde").Dump(); // 1 StringDistance.GetDamerauLevenshteinDistance("abcde", "baced").Dump(); // 2
All of the results are the same except for the transpositions which show one and two operations respectively.
You may have noticed that these algorithms (particularly the Levenshtein variations) don’t lend themselves well to parallelization due to their dependency on previous steps. I did decide though that although the algorithms aren’t really parallelizable, matching against a known set certainly is. Furthermore, parallelizing the matching is trivial with PLINQ.
var distances = GetStringsToMatch() .AsParallel() .Select ( x => new { Value = x, Distance = StringDistance.GetLevenshteinDistance("abcde", x) } );
We’ve looked at algorithms for calculating the Hamming distance, Levenshtein distance, and Damerau-Levenshtein distance. In my project I’m leaning toward using the traditional Levenshtein distance since I shouldn’t have to consider transpositions and the data isn’t guaranteed to be the same length. I also decided to make each of these algorithms available through extension methods on the string class to make them more accessible to other developers.
To those more knowledgeable about these things, are there other variations or techniques I should consider instead?