[笔记]Delphi实现获取字符串相似度

维基百科对字符串相似度(Damerau–Levenshtein distance)的定义是:

In information theory and computer science, the Damerau–Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or atransposition of two adjacent characters. In his seminal paper[1], Damerau not only distinguished these four edit operations but also stated that they correspond to more than 80% of all human misspellings. Damerau's paper considered only misspellings that could be corrected with at most one edit operation. The corresponding edit distance, i.e., dealing with multiple edit operations, known as the Levenshtein distance, was introduced by Levenshtein,[2] but it did not include transpositions in the set of basic operations. The name Damerau–Levenshtein distance is used to refer to the edit distance that allows multiple edit operations including transpositions, although it is not clear whether the term Damerau–Levenshtein distanceis sometimes used in some sources as to take into account non-adjacent transpositions or not.

简单翻译下,两个字符序列的DL距离,就是从一个变换到另一个的最小操作次数。这个变换包括插入一个字符删除一个字符替换一个字符、或互换两个相邻字符

而所谓“编辑距离(edit distance,或叫Levenshtein distance)”,并不包含互换两个相邻字符

主要应用是在字符拼写检查上,当然也可以用在其他地方,比方不少输入法就提供类似的校正功能(搜狗拼音输入法即实现了此功能)。

看起来简单,实现还是有一定困难的,好在有牛人已经做好相应的函数,如 KambizHow to match two strings approximately 中提供了两个函数:

计算DL距离的函数DamerauLevenshteinDistance(Str1, Str2)

function DamerauLevenshteinDistance(const Str1, Str2: string): Integer;

var

  LenStr1, LenStr2: Integer;

  I, J, T, Cost, Minimum: Integer;

  pStr1, pStr2, S1, S2: PChar;

  D, RowPrv2, RowPrv1, RowCur, Temp: PIntegerArray;

begin

  LenStr1 := Length(Str1);

  LenStr2 := Length(Str2);



  // to save some space, make sure the second index points to the shorter string

  if LenStr1 < LenStr2 then begin

    T := LenStr1;

    LenStr1 := LenStr2;

    LenStr2 := T;

    pStr1 := PChar(Str2);

    pStr2 := PChar(Str1);

  end

  else begin

    pStr1 := PChar(Str1);

    pStr2 := PChar(Str2);

  end;



  // to save some time and space, look for exact match

  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin

    Inc(pStr1);

    Inc(pStr2);

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  // when one string is empty, length of the other is the distance

  if LenStr2 = 0 then begin

    Result := LenStr1;

    Exit;

  end;



  // calculate the edit distance

  T := LenStr2 + 1;

  GetMem(D, 3 * T * SizeOf(Integer));

  FillChar(D^, 2 * T * SizeOf(Integer), 0);

  RowCur := D;

  RowPrv1 := @D[T];

  RowPrv2 := @D[2 * T];

  S1 := pStr1;



  for I := 1 to LenStr1 do begin

    Temp := RowPrv2;

    RowPrv2 := RowPrv1;

    RowPrv1 := RowCur;

    RowCur := Temp;

    RowCur[0] := I;

    S2 := pStr2;



    for J := 1 to LenStr2 do begin

      Cost := Ord(S1^ <> S2^);

      Minimum := RowPrv1[J - 1] + Cost;                 // substitution

      T := RowCur[J - 1] + 1;                           // insertion



      if T < Minimum then Minimum := T;



      T := RowPrv1[J] + 1;                              // deletion



      if T < Minimum then Minimum := T;



      if (I <> 1) and (J <> 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^) then begin

        T := RowPrv2[J - 2] + Cost;                     // transposition



        if T < Minimum then Minimum := T;

      end;



      RowCur[J] := Minimum;

      Inc(S2);

    end;



    Inc(S1);

  end;



  Result := RowCur[LenStr2];

  FreeMem(D);

end;

还有计算字符串相似度的函数 StringSimilarityRatio(Str1, Str2, IgnoreCase): Double;

返回值在0到1之间,0表示不相似,1表示完全相似。

function StringSimilarityRatio(const Str1, Str2: string; IgnoreCase: Boolean): Double;

var

  MaxLen: Integer;

  Distance: Integer;

begin

  Result := 1.0;



  if Length(Str1) > Length(Str2) then

    MaxLen := Length(Str1)

  else

    MaxLen := Length(Str2);



  if MaxLen <> 0 then begin

    if IgnoreCase then

      Distance := DamerauLevenshteinDistance(LowerCase(Str1), LowerCase(Str2))

    else

      Distance := DamerauLevenshteinDistance(Str1, Str2);



    Result := Result - (Distance / MaxLen);

  end;

end;

后来data man 参考一个德国人的ApproxStrUtils单元(该单元计算的是L距离,不是DL距离)给出一个据说效率更高的DL距离函数遗憾的是调用它会有“Invalid Pointer Operation”的报错,还没有Debug出问题所在,所以暂时先用前一个版本吧。

function DamerauLevenshteinDistance2(const Str1, Str2: string): Integer;

  function Min(const A, B, C: Integer): Integer; inline;

  begin

    Result := A;

    if B < A then

      Result := B;

    if C < Result then

      Result := C;

  end;



var

  LenStr1, LenStr2: Integer;

  I, J, T, Cost, PrevCost: Integer;

  pStr1, pStr2, S1, S2: PChar;

  D: PIntegerArray;

begin

  LenStr1 := Length(Str1);

  LenStr2 := Length(Str2);



  // to save some space, make sure the second index points to the shorter string

  if LenStr1 < LenStr2 then begin

    T := LenStr1;

    LenStr1 := LenStr2;

    LenStr2 := T;

    pStr1 := PChar(Str2);

    pStr2 := PChar(Str1);

  end

  else begin

    pStr1 := PChar(Str1);

    pStr2 := PChar(Str2);

  end;



  // to save some time and space, look for exact match

  while (LenStr2 <> 0) and (pStr1^ = pStr2^) do begin

    Inc(pStr1);

    Inc(pStr2);

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  while (LenStr2 <> 0) and ((pStr1 + LenStr1 - 1)^ = (pStr2 + LenStr2 - 1)^) do begin

    Dec(LenStr1);

    Dec(LenStr2);

  end;



  if LenStr2 = 0 then begin

    Result := LenStr1;

    Exit;

  end;



  // calculate the edit distance

  T := LenStr2 + 1;

  GetMem(D, T * SizeOf(Integer));



  for I := 0 to T do D[I] := I;



  S1 := pStr1;

  for I := 1 to LenStr1 do begin

    PrevCost := I - 1;

    Cost := I;

    S2 := pStr2;



    for J := 1 to LenStr2 do begin

      if (S1^ = S2^) or ((I > 1) and (J > 1) and (S1^ = (S2 - 1)^) and (S2^ = (S1 - 1)^)) then

        Cost := PrevCost

      else

        Cost := 1 + min(Cost, PrevCost, D[J]);



      PrevCost := D[J];

      D[J] := Cost;

      Inc(S2);

    end;



    Inc(S1);

  end;



  Result := D[LenStr2];

  FreeMem(D);

end;

参考文献:

  1. Damerau–Levenshtein_distance
    http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
  2. How to match two strings approximately
    http://www.delphiarea.com/articles/how-to-match-two-strings-approximately/
  3. Fuzzy string matching
    www.delphiarea.com/articles/how-to-match-two-strings-approximately
  4. Fuzzy search in strings
    http://www.gausi.de/approxstrutils-en.html
Technorati 标签: , , , , , ,

你可能感兴趣的:(Delphi)