The Python difflib library
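import difflib

# Query string and three candidate organization names to compare it against
query_str = '广州公安'
s1 = '广州市邮政局'
s2 = '广州市公安局'
s3 = '广州市检查院'

# quick_ratio() returns an upper bound on ratio() and is cheaper to compute
print(difflib.SequenceMatcher(None, query_str, s1).quick_ratio())
print(difflib.SequenceMatcher(None, query_str, s2).quick_ratio())
print(difflib.SequenceMatcher(None, query_str, s3).quick_ratio())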
The output is:
0.4
0.8
0.4
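The scores above can also be used to pick the closest candidate automatically. The following is only a minimal sketch and not part of the original example: best_match is a hypothetical helper name, and it uses ratio() instead of quick_ratio() so that the score reflects the actual matching blocks rather than an upper bound.

```python
import difflib


def best_match(query, candidates):
    """Return (candidate, score) for the candidate most similar to query.

    Hypothetical helper for illustration; scores come from
    SequenceMatcher.ratio().
    """
    scored = [
        (difflib.SequenceMatcher(None, query, c).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)  # the highest ratio wins
    return candidate, score


candidates = ['广州市邮政局', '广州市公安局', '广州市检查院']
name, score = best_match('广州公安', candidates)
print(name, score)
```

With the three candidates from the example, the second string should come out on top, matching the ranking given by quick_ratio().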
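When comparing sequences, you will often find that some characters, such as spaces, carry no useful information, and you would like to ignore them. SequenceMatcher supports this through its first argument, isjunk: a function that takes one element and returns True if that element should be treated as junk. For example: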
from difflib import SequenceMatcher
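def show_results(match):
    # match is the named tuple Match(a, b, size) returned by find_longest_match()
    print(' a = {}'.format(match.a))
    print(' b = {}'.format(match.b))
    print(' size = {}'.format(match.size))
    i, j, k = match
    # A and B are the module-level strings defined below
    print(' A[a:a+size] = {!r}'.format(A[i:i + k]))
    print(' B[b:b+size] = {!r}'.format(B[j:j + k]))
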
A = " abcd"
B = "abcd abcd"
print('A = {!r}'.format(A))
print('B = {!r}'.format(B))
print('\nWithout junk detection:')
s1 = SequenceMatcher(None, A, B)
match1 = s1.find_longest_match(0, len(A), 0, len(B))
show_results(match1)
print('\nTreat spaces as junk:')
s2 = SequenceMatcher(lambda x: x == " ", A, B)
match2 = s2.find_longest_match(0, len(A), 0, len(B))
show_results(match2)
The output is:
A = ' abcd'
B = 'abcd abcd'

Without junk detection:
 a = 0
 b = 4
 size = 5
 A[a:a+size] = ' abcd'
 B[b:b+size] = ' abcd'

Treat spaces as junk:
 a = 1
 b = 0
 size = 4
 A[a:a+size] = 'abcd'
 B[b:b+size] = 'abcd'
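The ratio() function:
Returns a measure of the sequences' similarity as a float in the range [0, 1]. If T is the total number of elements in both sequences and M is the number of matches, the value is 2.0*M / T, so 1.0 means the sequences are identical and 0.0 means they have nothing in common.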
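To make the definition of ratio() more concrete, the sketch below recomputes it by hand from get_matching_blocks() and prints it next to quick_ratio() and real_quick_ratio(), which are cheaper upper bounds on the same value. The helper name ratio_by_hand and the sample strings are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute ratio() as 2*M/T from the matching blocks (illustrative helper)."""
    sm = SequenceMatcher(None, a, b)
    # The last block returned is a zero-size sentinel, so it adds nothing.
    matches = sum(block.size for block in sm.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty sequences are treated as identical.
    return 2.0 * matches / total if total else 1.0


for a, b in [("abcd", "bcde"), ("abcd", "dcba")]:
    sm = SequenceMatcher(None, a, b)
    print(a, b,
          sm.ratio(), ratio_by_hand(a, b),
          sm.quick_ratio(), sm.real_quick_ratio())
```

The second pair contains the same characters in reverse order, so quick_ratio(), which ignores order, is much higher than ratio(). This is why quick_ratio() is best used as a fast pre-filter rather than as a final score.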
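Differ() and compare():
class difflib.Differ(linejunk=None, charjunk=None)
The optional keyword arguments linejunk and charjunk are filter functions (or None); both default to None. linejunk takes a single string (a line) and charjunk takes a single character; each returns true if its argument should be treated as junk. The compare(a, b) method compares two sequences of lines and generates the delta as a sequence of lines.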
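As a small illustration of Differ and compare(), the sketch below diffs two short lists of lines. The sample data and the function name diff_lines are made up for this example; the junk filters are left at their default of None.

```python
from difflib import Differ


def diff_lines(old, new):
    """Return the Differ delta between two lists of lines as a list of strings."""
    differ = Differ()  # linejunk and charjunk both default to None
    return list(differ.compare(old, new))


old = ['apple', 'banana', 'cherry']
new = ['apple', 'blueberry', 'cherry']

for line in diff_lines(old, new):
    # Each line starts with a two-character code:
    # '- ' only in old, '+ ' only in new, '  ' common to both
    print(line)
```

When two changed lines are similar enough, Differ also emits lines starting with '? ' that point at the differing characters within the line; the lines in this sample are too different for that to happen.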