String Similarity Measures

Recommended articles:

  1. FIVE MOST POPULAR SIMILARITY MEASURES IMPLEMENTATION IN PYTHON
  2. String similarity algorithms: Jaro-Winkler Distance

Recommended implementation:

  1. python-string-similarity
    A library implementing different string similarity and distance measures. A dozen algorithms (including Levenshtein edit distance and siblings, Jaro-Winkler, Longest Common Subsequence, cosine similarity, etc.) are currently implemented.

The main characteristics of each implemented algorithm are presented below. The "Cost" column gives an estimate of the computational cost of computing the similarity between two strings of lengths m and n, respectively.

Algorithm                          Provides              Normalized?  Metric?  Type     Cost           Typical usage
---------------------------------  --------------------  -----------  -------  -------  -------------  --------------------------------
Levenshtein                        distance              No           Yes      -        O(m*n) [1]     -
Normalized Levenshtein             distance, similarity  Yes          No       -        O(m*n) [1]     -
Weighted Levenshtein               distance              No           No       -        O(m*n) [1]     OCR
Damerau-Levenshtein [3]            distance              No           Yes      -        O(m*n) [1]     -
Optimal String Alignment [3]       distance              No           No       -        O(m*n) [1]     -
Jaro-Winkler                       similarity, distance  Yes          No       -        O(m*n)         typo correction
Longest Common Subsequence         distance              No           No       -        O(m*n) [1][2]  diff utility, Git reconciliation
Metric Longest Common Subsequence  distance              Yes          Yes      -        O(m*n) [1][2]  -
N-Gram                             distance              Yes          No       -        O(m*n)         -
Q-Gram                             distance              No           No       Profile  O(m+n)         -
Cosine similarity                  similarity, distance  Yes          No       Profile  O(m+n)         -
Jaccard index                      similarity, distance  Yes          Yes      Set      O(m+n)         -
Sorensen-Dice coefficient          similarity, distance  Yes          No       Set      O(m+n)         -
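
To make the "Profile" and "Set" types more concrete, here is a minimal pure-Python sketch of the four O(m+n) measures. It does not use the library's API. The function names, the default shingle length k=2, and the convention for empty inputs are all assumptions made for illustration, and the library's own definitions may differ in detail. "Profile" measures compare how often each k-character substring occurs, while "Set" measures only look at which substrings occur.

```python
from collections import Counter
from math import sqrt


def shingles(s, k=2):
    """Return the list of overlapping k-character substrings of s."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def qgram_distance(a, b, k=2):
    """Profile-based: sum of absolute differences of shingle counts."""
    pa, pb = Counter(shingles(a, k)), Counter(shingles(b, k))
    return sum(abs(pa[g] - pb[g]) for g in pa.keys() | pb.keys())


def cosine_similarity(a, b, k=2):
    """Profile-based: cosine of the angle between the two count vectors."""
    pa, pb = Counter(shingles(a, k)), Counter(shingles(b, k))
    dot = sum(pa[g] * pb[g] for g in pa.keys() & pb.keys())
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    return dot / norm if norm else 0.0  # assumption: empty profile gives 0.0


def jaccard_similarity(a, b, k=2):
    """Set-based: |A & B| / |A | B| over the sets of shingles."""
    sa, sb = set(shingles(a, k)), set(shingles(b, k))
    if not sa and not sb:
        return 1.0  # assumption: two empty sets count as identical
    return len(sa & sb) / len(sa | sb)


def sorensen_dice_similarity(a, b, k=2):
    """Set-based: 2 * |A & B| / (|A| + |B|) over the sets of shingles."""
    sa, sb = set(shingles(a, k)), set(shingles(b, k))
    if not sa and not sb:
        return 1.0  # assumption: two empty sets count as identical
    return 2 * len(sa & sb) / (len(sa) + len(sb))
```

For "night" and "nacht" with k=2, the only shared shingle is "ht", so the Jaccard similarity is 1/7 while the Sorensen-Dice and cosine similarities are both 0.25. For the similarity measures, the corresponding distance is usually taken as 1 minus the similarity.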

[1] In this library, Levenshtein edit distance, LCS distance and their siblings are computed using dynamic programming, which costs O(m*n).
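
The dynamic programming approach mentioned in note [1] can be sketched in pure Python as follows. This is an illustrative sketch rather than the library's code, and the function names are hypothetical. Each function fills a table of (m+1) x (n+1) cells one row at a time and keeps only the previous row, so the running time is O(m*n). Dividing by the length of the longer string, and defining LCS distance as len(a) + len(b) - 2 * LCS length, are common conventions assumed here.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string to keep rows short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Assumed normalization: distance divided by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(a, b):
    """Assumed definition: edit distance allowing only insert and delete."""
    return len(a) + len(b) - 2 * lcs_length(a, b)
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion), and `lcs_length("AGCAT", "GAC")` is 2.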
