Fielded Retrieval and Querying of 500K Email Texts in Python (5)

Part 5 covers the implementation of Boolean queries.

Part 3 already built the inverted index for each field. On top of those, every field now supports the simple Boolean operators AND, OR, and NOT.

First, load the mapping from document id to document path, which is needed when printing results:

import pickle
import re

def opendb():
    # load the pickled {doc_id: doc_path} mapping built in part 3
    opdb = open('dbase_id_doc', 'rb')
    iddoc = pickle.load(opdb)
    opdb.close()
    return iddoc
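For reference, here is the shape of the data that all snippets below assume (a sketch inferred from how the code indexes into these structures; the actual files are produced in part 3, and the example values are made up):

# dbase_id_doc : {doc_id: doc_path}, e.g.
#     {0: 'mails/0001.txt', 1: 'mails/0002.txt'}
#
# dbase_to / dbase_from / dbase_date / dbase_subject / dbase_body :
#     {token: postings}, where each posting is a (doc_id, term_frequency)
#     pair, e.g. {'jfk': [(0, 2), (5, 1)]}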

The main entry point:

def main():
    print '* Which Field Do You Want to Search?'
    print """
        (T)o
        (F)rom
        (D)ate
        (S)ubject
        (b)ody
       """
    first_letter = raw_input('* Please enter the First Letter as shown above: ')

    if first_letter == 'T':
        mapping = choose_dbase('dbase_to')
    elif first_letter == 'F':
        mapping = choose_dbase('dbase_from')
    elif first_letter == 'D':
        mapping = choose_dbase('dbase_date')
    elif first_letter == 'S':
        mapping = choose_dbase('dbase_subject')
    elif first_letter == 'b':
        mapping = choose_dbase('dbase_body')
    else:
        print '* Wrong input! No such option...'
        print '* Please check again; terminating now!'
        exit()

    while True:
        query_input = raw_input('Please input a query string (supports simple boolean queries \'AND\'|\'OR\'|\'NOT\'; \'q\' to quit): ')

        if query_input == 'q':
            exit()

        # the branches are checked in this order, so a query containing
        # several operators is dispatched by the first one that matches
        elif 'AND' in re.findall('AND', query_input):
            sqt = and_split(query_input)
            merge_query('AND', sqt, mapping)

        elif 'OR' in re.findall('OR', query_input):
            sqt = or_split(query_input)
            merge_query('OR', sqt, mapping)

        # the prefix form 'NOT xxx' is handled by an extra elif branch
        # (shown further below); it must be placed before this one,
        # otherwise 'NOT xxx' would be caught here
        elif 'NOT' in re.findall('NOT', query_input):
            sqt = not_split(query_input)
            merge_query('NOT', sqt, mapping)

        else:
            merge_query('default', query_input, mapping)
The simple Boolean query now supports four forms: a single term (xxx), OR (xxx1 OR xxx2), NOT (xxx1 NOT xxx2, NOT xxx), and AND (xxx1 AND xxx2, xxx1 AND xxx2 AND xxx3). Each input is first split to determine which of these cases it is, and the corresponding subroutine is then called.
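To make the dispatch concrete, here is how a few sample inputs (hypothetical terms) fall through the elif chain above:

# 'jfk'                    -> default branch: single-term lookup
# 'jfk1 AND jfk2'          -> AND branch (likewise for three terms)
# 'jfk1 OR jfk2'           -> OR branch
# 'jfk1 NOT jfk2'          -> NOT branch
# 'NOT jfk'                -> the separate prefix-NOT branch, shown later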

'''
  If the input query string contains a Boolean word (AND, NOT, or OR),
  the corresponding "*_split()" function splits it into two terms.
  For example, 'jfk1 AND jfk2' -> ['jfk1', 'jfk2']
  Only 'AND' supports a three-term query, like 'jfk1 AND jfk2 AND jfk3'.
  Attention:
   a query like 'NOT jfk' is not handled here; it is processed in a
   different way (check main() for details).
'''

def and_split(query_token):
    return re.split(' AND ', query_token)

def or_split(query_token):
    return re.split(' OR ', query_token)

def not_split(query_token):
    return re.split(' NOT ', query_token)
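A quick check of what the split functions return, in an interactive session (hypothetical terms):

>>> and_split('jfk1 AND jfk2 AND jfk3')
['jfk1', 'jfk2', 'jfk3']
>>> not_split('jfk1 NOT jfk2')
['jfk1', 'jfk2']
>>> not_split('NOT jfk')   # no surrounding spaces, so nothing is split
['NOT jfk']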

(Note that the above does not handle the 'NOT xxx' prefix form; as the last example shows, not_split leaves such a query intact. Its implementation is introduced later.)

The merge_query function relies on Python's set type: an AND query takes the intersection of the two terms' result sets, OR takes the union, and NOT takes the set difference.

For AND queries, the program additionally computes a cosine-similarity score between the query vector and each document vector. The score is computed as follows:

Suppose xxx1 occurs tknum1 times in document d and xxx2 occurs tknum2 times; the document vector is then (tknum1, tknum2). Since intersection records exactly the documents in which both xxx1 and xxx2 appear, the query vector is (1, 1). The score between query and document is the dot product: tk = (tknum1, tknum2) · (1, 1) = tknum1*1 + tknum2*1. (Strictly speaking this is an unnormalized dot product over raw term frequencies rather than a true cosine similarity, since neither vector is divided by its length; this is what the "just using 'tf'" remark in the code refers to.)
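A tiny worked example, using hypothetical postings in the (doc_id, tf) layout sketched earlier:

# hypothetical postings for two terms
mapping = {'jfk1': [(3, 2), (7, 1)],
           'jfk2': [(3, 4), (9, 5)]}

set1 = set(d for d, tf in mapping['jfk1'])   # {3, 7}
set2 = set(d for d, tf in mapping['jfk2'])   # {3, 9}
intersection = set1 & set2                   # {3}

# doc 3: tknum1 = 2, tknum2 = 4, so tk = 2*1 + 4*1 = 6
for i in intersection:
    tknum1 = dict(mapping['jfk1'])[i]
    tknum2 = dict(mapping['jfk2'])[i]
    print i, ':', tknum1 * 1 + tknum2 * 1    # prints: 3 : 6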

'''
  Merge the result sets of the split terms into one final query result,
  according to the Boolean operator.
  Uses the 'set' type in Python.
'''
def merge_query(bool_identifier, splited_query_token, mapping):
    # single-term query
    if bool_identifier == 'default':
        if check_existence(splited_query_token, mapping):
            print '****Hit docs for query: \'', splited_query_token, '\'****'

            id_doc = opendb()

            i = 1
            for doc in mapping[splited_query_token]:
                print i, ': ', id_doc[doc[0]]
                i = i + 1
            print '******************************************'
        else:
            return

    # Boolean query
    else:
        for word in splited_query_token:
            if not check_existence(word, mapping):
                return

        # doc-id sets of the first two terms
        set1 = set()
        set2 = set()
        for doc1 in mapping[splited_query_token[0]]:
            set1.add(doc1[0])
        for doc2 in mapping[splited_query_token[1]]:
            set2.add(doc2[0])

        if bool_identifier == 'AND':
            # a three-term 'AND' also needs the third doc-id set
            if len(splited_query_token) == 3:
                set3 = set()
                for doc3 in mapping[splited_query_token[2]]:
                    set3.add(doc3[0])
                intersection = set1 & set2 & set3
                terms = splited_query_token
            else:
                intersection = set1 & set2
                terms = splited_query_token[:2]

            # ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
            # cosine-similarity score of every hit doc against the query
            # vector: sum the term frequencies of the query terms in doc i
            map_cnt = {}
            for i in intersection:
                tk = 0
                for word in terms:
                    for posting in mapping[word]:
                        if i == posting[0]:
                            tk = tk + posting[1]
                            break
                map_cnt[i] = tk

            # sort by score, highest first
            sorted_list = sorted(map_cnt.iteritems(), key=lambda a: a[1], reverse=True)
            # +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
            print_result(sorted_list, 'AND', splited_query_token)

        if bool_identifier == 'OR':
            union = set1 | set2
            print_result(union, 'OR', splited_query_token)

        if bool_identifier == 'NOT':
            complement = set1 - set2
            print_result(complement, 'NOT', splited_query_token)
Handling 'NOT xxx' queries; the idea is described in the comment below.

 #  Deal with a query of the form 'NOT jfk'. The idea is as follows:
 #  First, check whether 'jfk' is in the mapping dictionary;
 #  if it is, collect the docs that contain the token 'jfk' in 'set_not',
 #  then collect the full doc list in 'union_set' (this takes a LONG TIME!),
 #  take the complement of 'set_not' within 'union_set' ('complement_set'), and print it out.
 #  Again:
 #   - This method is far from optimal: it has to traverse all docs and, furthermore, union them!
 #   - Even running it on my lab's (Infonet Lab.) server, it still takes very long.

        # this elif belongs in main()'s while loop, before the generic
        # 'NOT' branch -- otherwise 'NOT xxx' would never reach it
        elif query_input.startswith('NOT '):
            query_input_tmp = query_input[4:]   # the term after 'NOT '
            #  ----if----
            if check_existence(query_input_tmp, mapping):
                print '****Hit docs for query: \'', query_input, '\'****'

                # docs that DO contain the token
                set_not = set()
                for item in mapping[query_input_tmp]:
                    set_not.add(item[0])

                # all docs in the collection -- walks every posting list
                # on every query, which is why it is so slow
                union_set = set()
                for key in mapping.keys():
                    set_tmp = set()
                    for doc in mapping[key]:
                        set_tmp.add(doc[0])
                    union_set = union_set | set_tmp   # the algorithm here is really poor (see the note below)

                complement_set = union_set - set_not

                id_doc = opendb()

                i = 1
                for doc in complement_set:
                    print i, ': ', id_doc[doc]
                    i = i + 1
                print '******************************************'

            else:
                pass
            #  ----end if----
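One possible improvement (a sketch, not in the original code): since dbase_id_doc maps every doc id to its path, its key set is exactly the universal doc set, so it can be built once after loading and reused for every 'NOT xxx' query instead of re-walking all posting lists each time:

# build the universal doc-id set once, right after the index is loaded
id_doc = opendb()
all_docs = set(id_doc.keys())

# each 'NOT xxx' query then reduces to a single set difference:
# set_not = set(item[0] for item in mapping[query_input_tmp])
# complement_set = all_docs - set_not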
Finally, a few helper functions:

1. Choose the database (pickle file) to load:

def choose_dbase(Dbase_Name):
    print '* Import pickle file \'', Dbase_Name, '\' from hard disk'
    print '* Load the inverted list into memory...'
    mydb = open(Dbase_Name, 'rb')
    tmp_mapping = pickle.load(mydb)
    mydb.close()
    print '* Done!'
    print '* Now, you can query!'
    return tmp_mapping
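For context, the counterpart in part 3 would have written these files with pickle.dump, roughly like this (an assumed sketch; the actual code is in the earlier installment, and subject_mapping with its toy contents is hypothetical):

import pickle

# hypothetical toy index; the real one is built in part 3
subject_mapping = {'jfk': [(0, 2), (5, 1)]}

out = open('dbase_subject', 'wb')
pickle.dump(subject_mapping, out)
out.close()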

2. Check whether a token exists in the dictionary:

def check_existence(word, mapping):
    if word not in mapping:   # dict membership test; no need for .keys()
        print '* Oops, no such token: \'', word, '\''
        print '* Please try again...'
        return False
    else:
        return True
3. Print the results according to the operator. The output differs slightly between operators (mainly for AND), and for AND queries the user can choose how many results to show, since those results are ranked by the cosine-similarity score:

def print_result(result_set, bool_identifier, splited_query_token):
    # header: 'AND' queries are ranked, so also print a note about the scoring
    if bool_identifier == 'AND':
        if len(splited_query_token) == 3:
            print '****Hit docs for query: \'', splited_query_token[0], 'AND', splited_query_token[1], 'AND', splited_query_token[2], '\'****'
        else:
            print '****Hit docs for query: \'', splited_query_token[0], 'AND', splited_query_token[1], '\'****'
        print 'Calculating the Cosine Similarity between each hit doc and the query vector (just using \'tf\')'
        print 'Outputting the sorted result:'
    # the normal case
    else:
        print '****Hit docs for query: \'', splited_query_token[0], bool_identifier, splited_query_token[1], '\'****'

    id_doc = opendb()

    if bool_identifier == 'AND':
        # result_set holds (doc_id, score) pairs, already sorted by score
        result_num = raw_input('< How many results do you want to print? (input \'all\' to print all) >  ')
        m = 1
        for i in result_set:
            print m, ': ', id_doc[i[0]], '\t  ( cosine similarity:', i[1], ')'
            m = m + 1
            if result_num != 'all' and m > int(result_num):
                break
        print '****************************************************************'

    else:
        # result_set holds plain doc ids
        m = 1
        for i in result_set:
            print m, ': ', id_doc[i]
            m = m + 1
        print '****************************************************************'

To wrap up, here is the output of searching the Subject field for "the AND for AND one", printing the top 20 ranked documents together with their cosine-similarity scores: [screenshot: top-20 results with cosine-similarity values]


