Section 5 covers the implementation of Boolean queries.
Section 3 already built the inverted lists for each field; on top of them, every field supports the simple Boolean queries AND, OR, and NOT.
First, load the mapping from document ids to document paths, which comes in handy when printing results:
def opendb():
    opdb = open('dbase_id_doc', 'r')
    iddoc = pickle.load(opdb)
    return iddoc
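As a sanity check, the dump/load round trip that opendb() relies on can be sketched as follows (the id-to-path entries here are made up for illustration, and a temporary file stands in for 'dbase_id_doc'):

```python
import os
import pickle
import tempfile

# Hypothetical id -> document-path mapping, for illustration only.
id_doc = {0: 'maildir/user1/1.', 1: 'maildir/user1/2.'}

# Dump the mapping to disk and load it back, as opendb() does.
path = os.path.join(tempfile.mkdtemp(), 'dbase_id_doc')
with open(path, 'wb') as f:   # binary mode is the safe choice for pickle
    pickle.dump(id_doc, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

assert loaded == id_doc
```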
The main entry point:
def main():
    print '* Which Field Do You Want to Search?'
    print """
    (T)o
    (F)rom
    (D)ate
    (S)ubject
    (b)ody
    """
    first_letter = raw_input('* Please enter the First Letter as shown above: ')
    if first_letter == 'T':
        mapping = choose_dbase('dbase_to')
    elif first_letter == 'F':
        mapping = choose_dbase('dbase_from')
    elif first_letter == 'D':
        mapping = choose_dbase('dbase_date')
    elif first_letter == 'S':
        mapping = choose_dbase('dbase_subject')
    elif first_letter == 'b':
        mapping = choose_dbase('dbase_body')
    else:
        print '* Wrong Input! Doesn\'t have this Option...'
        print '* Please check again, terminate and quit now!'
        exit()
    while True:
        query_input = raw_input('Please input query string(support simple boolean query \'AND\'|\'OR\'|\'NOT\', \'q\' to quit): ')
        if query_input == 'q':
            exit()
        elif 'AND' in re.findall('AND', query_input):
            sqt = and_split(query_input)
            merge_query('AND', sqt, mapping)
        elif 'OR' in re.findall('OR', query_input):
            sqt = or_split(query_input)
            merge_query('OR', sqt, mapping)
        elif 'NOT' in re.findall('NOT', query_input):
            sqt = not_split(query_input)
            merge_query('NOT', sqt, mapping)
        else:
            merge_query('default', query_input, mapping)

The simple Boolean queries now support the following forms: a single-word query (xxx), OR (xxx1 OR xxx2), NOT (xxx1 NOT xxx2, NOT xxx), and AND (xxx1 AND xxx2, xxx1 AND xxx2 AND xxx3). Each query input is first split to determine which of these cases applies, and the corresponding subroutine is then called to handle it.
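The dispatch above can be sketched as a standalone function (a simplified variant: it matches the operator with surrounding spaces, whereas main() uses re.findall on the bare word; the 'NOT xxx' prefix form is checked first, since it would otherwise be caught by the generic NOT branch):

```python
import re

def classify(query):
    # Mirrors the query dispatch in main(), as a simplified sketch.
    if re.findall('^NOT ', query):
        return 'NOT-prefix'   # 'NOT xxx' is handled by a separate branch
    if ' AND ' in query:
        return 'AND'
    if ' OR ' in query:
        return 'OR'
    if ' NOT ' in query:
        return 'NOT'
    return 'default'

assert classify('jfk1 AND jfk2') == 'AND'
assert classify('jfk1 OR jfk2') == 'OR'
assert classify('jfk1 NOT jfk2') == 'NOT'
assert classify('NOT jfk') == 'NOT-prefix'
assert classify('jfk') == 'default'
```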
'''
If you input a query string containing a bool word (AND, NOT, or OR),
the corresponding "*_split()" function splits it into two words.
For example, 'jfk1 AND jfk2' -> ['jfk1', 'jfk2']
Only 'AND' supports a 3-word query, like 'jfk1 AND jfk2 AND jfk3'.
Attention: a query like 'NOT jfk' is not handled here; it is processed
in a different way (check main() for details).
'''
def and_split(query_token):
    query = re.split(' AND ', query_token)
    return query

def or_split(query_token):
    query = re.split(' OR ', query_token)
    return query

def not_split(query_token):
    query = re.split(' NOT ', query_token)
    return query
(Note that the above does not cover the "NOT xxx" form; its implementation is introduced later.)
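The split behavior is easy to verify with a couple of hypothetical tokens; re.split happily produces three parts when the separator occurs twice, which is what makes the 3-word AND form work:

```python
import re

def and_split(query_token):
    # Same splitting logic as above: the separator is ' AND ' with spaces.
    return re.split(' AND ', query_token)

assert and_split('jfk1 AND jfk2') == ['jfk1', 'jfk2']
assert and_split('jfk1 AND jfk2 AND jfk3') == ['jfk1', 'jfk2', 'jfk3']
```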
The merge_query function relies on Python's set type: an AND query takes the intersection of the result sets of the individual words, OR takes the union, and NOT takes the set difference.
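On toy document-id sets (the ids below are made up for illustration), the three operations look like this:

```python
# Document-id sets for two hypothetical tokens.
docs_a = {1, 2, 3, 5}
docs_b = {2, 3, 7}

assert docs_a & docs_b == {2, 3}            # AND: intersection
assert docs_a | docs_b == {1, 2, 3, 5, 7}   # OR: union
assert docs_a - docs_b == {1, 5}            # NOT: set difference
```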
For AND queries, the program additionally computes a cosine-similarity score between the query vector and each document vector. The score is computed as follows:
Suppose xxx1 occurs in document d with frequency tknum1 and xxx2 occurs with frequency tknum2, so the document vector is (tknum1, tknum2). Since intersection holds exactly the documents in which both xxx1 and xxx2 occur, the query vector is (1, 1). The similarity between the query and the document is then the dot product: tk = (tknum1, tknum2) · (1, 1) = tknum1*1 + tknum2*1. (Strictly speaking this is an unnormalized dot product; a true cosine similarity would also divide by the norms of the two vectors.)
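The difference between the dot-product score used here and a fully normalized cosine similarity can be sketched as follows (the term frequencies are made up; the ranking is the same in both cases only when documents have similar norms):

```python
import math

def dot_score(doc_tf):
    # Query vector is all ones, so the dot product reduces to the sum of tfs.
    return sum(doc_tf)

def cosine_similarity(doc_tf):
    # Proper cosine: dot product divided by the product of vector norms.
    query = [1.0] * len(doc_tf)
    dot = sum(q * d for q, d in zip(query, doc_tf))
    norm = math.sqrt(sum(q * q for q in query)) * math.sqrt(sum(d * d for d in doc_tf))
    return dot / norm

# Document d contains xxx1 three times and xxx2 once: tf vector (3, 1).
assert dot_score([3, 1]) == 4
assert abs(cosine_similarity([3, 1]) - 4 / math.sqrt(2 * 10)) < 1e-9
```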
'''
Merge the respective query results of the split words into one ultimate
query result, according to the boolean word. Uses the 'set' type in Python.
'''
def merge_query(bool_identifier, splited_query_token, mapping):
    # single-word query
    if bool_identifier == 'default':
        if check_existence(splited_query_token, mapping) == True:
            print '****Hit docs for query: \'', splited_query_token, '\'****'
            id_doc = opendb()
            i = 1
            for doc in mapping[splited_query_token]:
                print i, ': ', id_doc[doc[0]]
                i = i + 1
            print '******************************************'
        else:
            return
    # Boolean query
    else:
        for word in splited_query_token:
            if check_existence(word, mapping) == False:
                return
        set1 = set()
        set2 = set()
        for doc1 in mapping[splited_query_token[0]]:
            set1.add(doc1[0])
        for doc2 in mapping[splited_query_token[1]]:
            set2.add(doc2[0])
        if bool_identifier == 'AND':
            # dealing with 3 words in an 'AND' query
            if len(splited_query_token) == 3:
                set3 = set()
                for doc3 in mapping[splited_query_token[2]]:
                    set3.add(doc3[0])
                intersection = set1 & set2 & set3
                # cosine similarity of every hit doc and the query vector
                map_cnt = {}
                for i in intersection:
                    tknum1 = 0
                    tknum2 = 0
                    tknum3 = 0
                    for token1 in mapping[splited_query_token[0]]:
                        if i == token1[0]:
                            tknum1 = token1[1]
                            break
                    for token2 in mapping[splited_query_token[1]]:
                        if i == token2[0]:
                            tknum2 = token2[1]
                            break
                    # bug fixed here: this loop used index [1] instead of [2]
                    for token3 in mapping[splited_query_token[2]]:
                        if i == token3[0]:
                            tknum3 = token3[1]
                            break
                    # cosine similarity (dot product with the all-ones query vector)
                    tk = tknum1 * 1 + tknum2 * 1 + tknum3 * 1
                    map_cnt[i] = tk
                # sort by score, descending
                sorted_list = sorted(map_cnt.iteritems(), key=lambda a: a[1], reverse=True)
                print_result(sorted_list, 'AND', splited_query_token)
            else:
                intersection = set1 & set2
                # cosine similarity of every hit doc and the query vector
                map_cnt = {}
                for i in intersection:
                    tknum1 = 0
                    tknum2 = 0
                    for token1 in mapping[splited_query_token[0]]:
                        if i == token1[0]:
                            tknum1 = token1[1]
                            break
                    for token2 in mapping[splited_query_token[1]]:
                        if i == token2[0]:
                            tknum2 = token2[1]
                            break
                    tk = tknum1 * 1 + tknum2 * 1
                    map_cnt[i] = tk
                sorted_list = sorted(map_cnt.iteritems(), key=lambda a: a[1], reverse=True)
                print_result(sorted_list, 'AND', splited_query_token)
        if bool_identifier == 'OR':
            union = set1 | set2
            print_result(union, 'OR', splited_query_token)
        if bool_identifier == 'NOT':
            complement = set1 - set2
            print_result(complement, 'NOT', splited_query_token)

Next comes the implementation of the "NOT xxx" form; the idea is explained in the English comments:
# Deal with a query like 'NOT jfk'. The idea is as follows:
# First, check whether 'jfk' is in the mapping dictionary.
# If it is, use 'set_not' to hold the docs that contain the token 'jfk',
# then use 'union_set' to hold the complete doc list (this takes a LONG time!).
# Take the complement of 'set_not' with respect to 'union_set' ('complement_set') and print it.
# Again:
# - This method is not an optimal one: it traverses all docs and, furthermore, merges them!
# - Even when I run it on my lab's (Infonet Lab.) server, it still takes a very long time!
elif 'NOT ' in re.findall('^NOT ', query_input):
    query_input_tmp = query_input[4:]
    # ----if----
    if check_existence(query_input_tmp, mapping) == True:
        print '****Hit docs for query: \'', query_input, '\'****'
        set_not = set()
        for item in mapping[query_input_tmp]:
            set_not.add(item[0])
        union_set = set()
        for key in mapping.keys():
            for doc in mapping[key]:
                union_set.add(doc[0])  # traversing the whole index here is the slow part
        complement_set = union_set - set_not
        id_doc = opendb()
        i = 1
        for doc in complement_set:
            print i, ': ', id_doc[doc]
            i = i + 1
        print '******************************************'
    else:
        pass
    # ----end if----

Finally, a few utility functions:
1. Choose the database (pickle file) to load:
def choose_dbase(Dbase_Name):
    print '* Import pickle file \'', Dbase_Name, '\' from hard disk'
    print '* Load the inverted list into memory...'
    mydb = open(Dbase_Name, 'r')
    tmp_mapping = pickle.load(mydb)
    print '* Done!'
    print '* Now, you can query!'
    return tmp_mapping
2. Check whether a token exists in the dictionary:
def check_existence(word, mapping):
    if word not in mapping.keys():
        print '* Oops, doesn\'t have token: \'', word, '\''
        print '* Please Try again...'
        return False
    else:
        return True

3. Print the results, adjusted according to the input parameters; the output differs slightly between Boolean operators (mainly for AND). For AND queries the user can also choose how many results to print, since the hits are sorted by their cosine-similarity scores:
def print_result(result_set, bool_identifier, splited_query_token):
    # dealing with the 3-word banner in 'AND'
    if bool_identifier == 'AND':
        if len(splited_query_token) == 3:
            print '****Hit docs for query: \'', splited_query_token[0], 'AND', splited_query_token[1], 'AND', splited_query_token[2], '\'****'
        else:
            print '****Hit docs for query: \'', splited_query_token[0], 'AND', splited_query_token[1], '\'****'
        print 'calculate the Cosine Similarity between each hit doc and query vector (just using \'tf\')'
        print 'Outputting the sorted result:'
    # dealing with the normal case
    else:
        print '****Hit docs for query: \'', splited_query_token[0], bool_identifier, splited_query_token[1], '\'****'
    id_doc = opendb()
    if bool_identifier == 'AND':
        result_num = raw_input('< How many results do you want to print?(input \'all\' print all)> ')
        m = 1
        for i in result_set:
            print m, ': ', id_doc[i[0]], ' ( cosine similarity:', i[1], ')'
            m = m + 1
            if result_num != 'all' and m > int(result_num):
                break
        print '****************************************************************'
    else:
        m = 1
        for i in result_set:
            print m, ': ', id_doc[i]
            m = m + 1
        print '****************************************************************'
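When only the top few results are wanted, fully sorting the score dictionary is unnecessary; heapq.nlargest picks the k best entries directly. A minimal sketch with made-up doc ids and scores:

```python
import heapq

# doc_id -> similarity score (made-up values for illustration).
map_cnt = {10: 4, 11: 9, 12: 1, 13: 7}

# Top-2 (doc_id, score) pairs by score, without sorting the whole dict.
top2 = heapq.nlargest(2, map_cnt.items(), key=lambda kv: kv[1])
assert top2 == [(11, 9), (13, 7)]
```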
Below is the output of running the query "the AND for AND one" against the Subject field: the top 20 documents after sorting, together with their cosine-similarity scores: