Field-Based Retrieval and Query over 500,000 Email Texts in Python (4)

Part 4 computes Top 50 statistics separately for the inverted index of each field.

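The script interacts with the user in the shell and loads the database that matches the field the user enters. For every key (token) in the loaded dictionary, it sums the token's occurrence counts across all documents, then calls the built-in sorted function to rank the tokens by that total. The code is as follows: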

import pickle
def frncy(Field_Name,Dbase_Name):
 # pickle data is binary, so open the file in 'rb' mode and close it after loading
 mydb=open(Dbase_Name,'rb')
 mapping=pickle.load(mydb)
 mydb.close()

 mapping_cnt={}
 for key in mapping.keys():
  cnt=0
  for doc in mapping[key]:
   # doc is a (doc_id, count) pair: doc[1] is how often 'key' occurs in document doc[0]
   cnt=cnt+doc[1]
  # store the total as a plain integer so it sorts and prints as a number
  mapping_cnt[key]=cnt

 print
 print '********Top 50 Tokens and Their Frequency for \'',Field_Name,'\'*********'

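 # rank tokens by total frequency, highest first
 sorted_list=sorted(mapping_cnt.iteritems(), key=lambda a:a[1], reverse=True)
 # guard against fields that have fewer than 50 distinct tokens
 for i in range(min(50,len(sorted_list))):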
  tmp=sorted_list[i]
  print '* Token',i+1,': ',tmp[0], '   ( Frequency:',tmp[1],')'
 print '**********************************************************'

def main():
 print '( ------- Top 50 Tokens for Any Field -------- )'
 while True:
  field_name=raw_input('* Please input field name(\'To\'|\'From\'|\'Subject\'|\'body\', \'q\' to quit): ')
  if field_name=='q':
   # leave the input loop; main() then returns and the script ends
   break
  elif field_name=='To':
   frncy('To','dbase_to')
  elif field_name=='From':
   frncy('From','dbase_from')
  elif field_name=='Subject':
   frncy('Subject','dbase_subject')
  elif field_name=='body':
   frncy('mail body','dbase_body')
  else:
   print '** Oops, no such field! Try again...'

if __name__ == '__main__':
 main()
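
The if/elif chain in main() maps each field name typed by the user to a display name and a database file. As a rough sketch only, the same mapping could be kept in a dictionary. The names FIELD_TABLE and resolve_field are hypothetical and do not appear in the original script; the field and file names are copied from the calls in the listing above.

```python
# Hypothetical table-driven lookup replacing the if/elif chain in main().
# Keys are what the user types; values are (display name, pickle file name),
# taken from the frncy(...) calls in the listing above.
FIELD_TABLE = {
    'To': ('To', 'dbase_to'),
    'From': ('From', 'dbase_from'),
    'Subject': ('Subject', 'dbase_subject'),
    'body': ('mail body', 'dbase_body'),
}


def resolve_field(field_name):
    """Return (display_name, dbase_name) for a known field, or None."""
    return FIELD_TABLE.get(field_name)
```

Inside the loop, main() could then call resolve_field(field_name), print the error message when the result is None, and otherwise pass the two values to frncy. Adding a new field would only need one more entry in the table.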
 

Top 50 for the "Subject" field:

[Image 1: Top 50 output for the "Subject" field]
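
To try the ranking step without the real email databases, here is a minimal sketch on toy data. It assumes the inverted index has the shape the script relies on, token -> list of (doc_id, count) pairs, and it swaps the manual loop plus sorted for collections.Counter from the standard library. The function top_tokens and the toy_index contents are invented for illustration.

```python
from collections import Counter


def top_tokens(mapping, n=50):
    """Return up to n (token, total_count) pairs, most frequent first.

    Assumes mapping has the shape: token -> list of (doc_id, count) pairs.
    """
    totals = Counter()
    for token, postings in mapping.items():
        # Sum the per-document counts to get the token's total frequency.
        totals[token] = sum(count for _doc_id, count in postings)
    # most_common returns fewer than n pairs if there are fewer tokens.
    return totals.most_common(n)


# Toy inverted index, made up for this example only.
toy_index = {
    'meeting': [(1, 2), (3, 1)],
    'report': [(2, 5)],
    'lunch': [(1, 1)],
}

ranked = top_tokens(toy_index, n=2)
# ranked == [('report', 5), ('meeting', 3)]
```

Because most_common simply returns everything when fewer than n tokens exist, this version needs no special handling for small fields.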


