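Master the algorithm for merging postings lists when a search system processes an AND query over multiple terms.
Gain a thorough understanding of this merge algorithm and implement it in Python. When the user enters a query at the prompt, the program merges the postings lists of the query terms.
The system reads a preset document and lists all terms available for querying. The user enters query terms at the prompt; the system looks up the postings list of each term, runs the multi-term AND merge on those lists, and prints the merged result.
The program consists of two functional modules: a prompt-and-input module, and a module that merges postings lists for multi-term AND queries.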
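The find function returns the postings lists for the terms the user entered, one list per AND-connected clause, for use in the merge.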
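'''
Return the postings lists for the terms of a query
'''
def find(test, dict1, dict0):
    # Drop the parentheses, then split the query into AND-connected clauses
    ft0 = re.split('[()]', test)
    ft = []
    for i in range(len(ft0)):
        ft = ft + ((ft0[i].replace(' ', '')).split("AND"))
    ft = [i for i in ft if i != '']
    p = []
    for j in range(len(ft)):
        p0 = []
        if('OR' in ft[j]):
            # OR clause: union of the postings lists of its terms
            ft1 = ft[j].split('OR')
            for k in range(len(ft1)):
                if(ft1[k] in dict1):
                    p0 = p0 + dict1[ft1[k]]
            # Intersect needs ascending doc IDs, so sort after removing duplicates
            p0 = sorted(set(p0))
        elif('NOT' in ft[j]):
            # NOT clause: every document that is absent from the term's postings list
            ft[j] = ft[j].replace('NOT', '')
            if(ft[j] in dict1):
                p0 = [y for y in dict0 if y not in dict1[ft[j]]]
            else:
                # Unknown term: no document contains it, so all documents match
                p0 = list(dict0)
        elif(ft[j] in dict1):
            # Plain term: copy its postings list so the dictionary is never modified
            p0 = list(dict1[ft[j]])
        p.append(p0)
    return p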
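The query-splitting step at the start of find can be tried in isolation. The sketch below copies that step into a helper called split_and_clauses; the helper name is an assumption made for this illustration and is not part of the report's program.

```python
import re

def split_and_clauses(query):
    """Split a Boolean query into its AND-connected clauses (hypothetical helper)."""
    clauses = []
    # Parentheses are removed, not interpreted: the query is flattened.
    for part in re.split('[()]', query):
        clauses += part.replace(' ', '').split('AND')
    # re.split and str.split leave empty strings behind; drop them.
    return [c for c in clauses if c != '']

print(split_and_clauses('(things OR who) AND (mean AND NOT always)'))
# ['thingsORwho', 'mean', 'NOTalways']
```

Because the parentheses are only stripped, this approach appears to suit queries whose top-level operator is AND; a query such as a OR (b AND c) would be split differently from what its grouping suggests.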
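The Intersect function is the module that merges postings lists for a multi-term AND query. It uses len to get the number of postings lists, loops over them by index, and walks through the elements of each list, merging it with the result of the previous merge. It returns the final result list. Every postings list must be in ascending document ID order.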
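'''
Postings list merge algorithm
'''
def Intersect(p):
    # An empty query has no postings lists to merge
    if len(p) == 0:
        return []
    r = p[0]
    for i in range(1, len(p)):
        j, k = 0, 0
        r0 = []
        # Two-pointer walk; both lists must be in ascending doc ID order
        while(j < len(p[i]) and k < len(r)):
            if(p[i][j] == r[k]):
                r0.append(r[k])
                j, k = j + 1, k + 1
            elif(p[i][j] > r[k]):
                k = k + 1
            else:
                j = j + 1
        # The merged result becomes the input for the next postings list
        r = r0
    return r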
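The merge described above can also be written as a pairwise two-pointer intersection that is folded over all the lists. This is a minimal sketch of the same technique, not the report's code; the names intersect_two and intersect_all are assumptions made for illustration, and the document IDs are made up.

```python
from functools import reduce

def intersect_two(a, b):
    """Intersect two postings lists sorted in ascending doc ID order."""
    i, j = 0, 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i += 1      # a is behind, advance it
        else:
            j += 1      # b is behind, advance it
    return out

def intersect_all(lists):
    """AND-merge any number of sorted postings lists."""
    if not lists:
        return []
    return reduce(intersect_two, lists)

print(intersect_all([[1, 2, 4, 6], [2, 3, 6], [6]]))  # [6]
```

Each pairwise merge takes time linear in the combined length of the two lists, since every loop iteration advances at least one pointer.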
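The createdict function uses Python's re module to process the preset document. It returns all terms, which are shown to the user as the available query terms, and computes the postings list of every term. Each line of the document is treated as one document, numbered from 1.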
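'''
Create the document dictionary
'''
def createdict(f0):
    # Split on spaces, newlines and punctuation; drop the empty strings re.split produces
    dl = [w for w in set(re.split('[ \n?!,.;]', f0)) if w != '']
    # Each line of the file is one document; doc IDs start at 1
    d = f0.split('\n')
    dict1 = {}
    dict0 = []
    for i in range(len(d)):
        dict0.append(i + 1)
        # Compare whole words, so that "be" does not match inside "because"
        words = set(re.split('[ \n?!,.;]', d[i]))
        for word in dl:
            if word in words:
                if word not in dict1:
                    dict1[word] = [i + 1]
                else:
                    dict1[word].append(i + 1)
    # dict1 maps each term to its postings list; dict0 lists all doc IDs
    return dict1, dict0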
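To make the shape of the returned data concrete, here is a small sketch that builds the same kind of index from a three-line sample. The function name build_index and the sample text are assumptions made for this illustration; only the separator pattern is taken from createdict.

```python
import re

def build_index(text):
    """Map each term to the ascending list of line numbers (doc IDs) containing it."""
    index = {}
    all_docs = []
    for doc_id, line in enumerate(text.split('\n'), start=1):
        all_docs.append(doc_id)
        # set() removes repeated words within one line
        for word in set(re.split('[ \n?!,.;]', line)):
            if word:
                index.setdefault(word, []).append(doc_id)
    return index, all_docs

sample = "you want to go\nyou can go on\nthey want everything"
index, all_docs = build_index(sample)
print(index['you'])   # [1, 2]
print(index['want'])  # [1, 3]
print(all_docs)       # [1, 2, 3]
```

Because lines are visited in order, every postings list comes out in ascending doc ID order, which is what a two-pointer merge relies on.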
'''
Sort the postings lists by length
'''
def sort0(p):
    # Shortest list first, so the intermediate results in Intersect stay small.
    # sorted() returns a new outer list and leaves the postings lists unmodified.
    return sorted(p, key = len)
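The effect of processing the shortest list first can be observed by counting loop iterations of the merge. The sketch below is an illustration under assumptions: intersect_count and total_steps are hypothetical helpers, and the postings lists are made up.

```python
def intersect_count(a, b):
    """Two-pointer intersection that also reports how many loop iterations ran."""
    i = j = steps = 0
    out = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps

def total_steps(lists):
    """AND-merge the lists left to right and sum the iterations of every merge."""
    result, total = lists[0], 0
    for other in lists[1:]:
        result, steps = intersect_count(result, other)
        total += steps
    return result, total

postings = [[1, 2, 3, 4, 5, 6], [1, 3, 5], [1]]
print(total_steps(postings))                   # ([1], 6)
print(total_steps(sorted(postings, key=len)))  # ([1], 2)
```

The result is the same either way; only the amount of work changes, and the saving depends on the data.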
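Several print calls display prompts, and input reads the query string entered by the user; the statement p = find(ft0, dict1, dict0) then computes the postings lists.
The test statement is a hard-coded query used during debugging in place of the user input.
"""
d is the document, ft is findtext (the query), r is the result, dict1 is the dictionary mapping each term to its postings list, dict0 is the list of all document IDs
"""
import re
'''
Open the document
'''
with open("document.txt", "r") as f:
    f0 = f.read()
dict1, dict0 = createdict(f0)
k = [key for key in dict1]
print("Terms available for querying:", k, "\n")
#print("Enter a standard query in the form used by the textbook: ", end = '')
#ft0 = input()
test = '(things OR who) AND (mean AND NOT always)'
p = find(test, dict1, dict0)
print("\nPostings lists:", p)
p = sort0(p)
print("Merged result:", Intersect(p))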
The sample document.txt is shown below; any English text should work in its place.
There are moments in life when you miss someone so much that you just want to pick them from your dreams and hug them for real! Dream what you want to dream;go where you want to go;be what you want to be,because you have only one life and one chance to do all the things you want to do.
May you have enough happiness to make you sweet,enough trials to make you strong,enough sorrow to keep you human,enough hope to make you happy? Always put yourself in others’shoes.If you feel that it hurts you,it probably hurts the other person, too.
The happiest of people don’t necessarily have the best of everything;they just make the most of everything that comes along their way.Happiness lies for those who cry,those who hurt, those who have searched,and those who have tried,for only they can appreciate the importance of people
Who have touched their lives.Love begins with a smile,grows with a kiss and ends with a tear.The brightest future will always be based on a forgotten past, you can’t go on well in life until you let go of your past failures and heartaches.
When you were born,you were crying and everyone around you was smiling.Live your life so that when you die,you’re the one who is smiling and everyone around you is crying.
Please send this message to those people who mean something to you,to those who have touched your life in one way or another,to those who make you smile when you really need it,to those that make you see the brighter side of things when you are really down,to those who you want to let them know that you appreciate their friendship.And if you don’t, don’t worry,nothing bad will happen to you,you will just miss out on the opportunity to brighten someone’s day with this message.
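During debugging, in addition to merging postings lists for multi-term AND queries, merging for OR and NOT was also implemented. Since these are not the main objective, they are shown in the program debugging section.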
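OR and NOT can also be computed with linear walks over sorted postings lists, in the same spirit as the AND merge. The following is a sketch under assumptions, not the report's implementation; union_sorted and complement_sorted are hypothetical names, and the document IDs are made up.

```python
def union_sorted(a, b):
    """OR: merge two ascending postings lists, keeping each doc ID once."""
    i, j = 0, 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # At most one of the two lists still has elements left
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def complement_sorted(all_docs, a):
    """NOT: doc IDs in all_docs that are absent from the ascending list a."""
    j = 0
    out = []
    for doc_id in all_docs:
        # Skip entries of a that are smaller than the current doc ID
        while j < len(a) and a[j] < doc_id:
            j += 1
        if j < len(a) and a[j] == doc_id:
            continue
        out.append(doc_id)
    return out

print(union_sorted([1, 4], [2, 4, 6]))               # [1, 2, 4, 6]
print(complement_sorted([1, 2, 3, 4, 5, 6], [4]))    # [1, 2, 3, 5, 6]
```

Both helpers return ascending lists, so their output can be passed directly to a two-pointer AND merge.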
Entering the query down AND day AND bad AND see gives the merged result shown in the figure below.