广告系统中weak-and算法原理及编码验证

wand(weak and)算法基本思路

  一般搜索的query比较短,但如果query比较长,如是一段文本,需要搜索相似的文本,这时候一般就需要wand算法,该算法在广告系统中有比较成熟的应

该,主要是adsense场景,需要搜索一个页面内容的相似广告。

   Wand方法简单来说,一般我们在计算文本相关性的时候,会通过倒排索引的方式进行查询,通过倒排索引已经要比全量遍历节约大量时间,但是有时候仍

然很慢。

   原因是很多时候我们其实只是想要top n个结果,一些结果明显较差的也进行了复杂的相关性计算,而weak-and算法通过计算每个词的贡献上限来估计文档

的相关性上限,从而建立一个阈值对倒排中的结果进行减枝,从而得到提速的效果。

   wand算法首先要估计每个词对相关性贡献的上限,最简单的相关性就是TF*IDF,一般query中词的TF均为1,IDF是固定的,因此就是估计一个词在文档中的

词频TF上限,一般TF需要归一化,即除以文档所有词的个数,因此,就是要估算一个词在文档中所能占到的最大比例,这个线下计算即可。

   知道了一个词的相关性上界值,就可以知道一个query和一个文档的相关性上限值,显然就是他们共同的词的相关性上限值的和。

   这样对于一个query,获得其所有词的相关性贡献上限,然后对一个文档,看其和query中都出现的词,然后求这些词的贡献和即可,然后和一个预设值比

较,如果超过预设值,则进入下一步的计算,否则则丢弃。

  如果按照这样的方法计算n个最相似文档,就要取出所有的文档,每个文档作预计算,比较threshold,然后决定是否在top-n之列。这样计算当然可行,但

是还是可以优化的。

 

 

   wand(weak and)算法原理演示

   代码实现了主要的算法逻辑以验证算法的有效性,供大家参考,该实现优化了原始算法的一些逻辑尽量减少了无谓的循环:

 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
www.169it.com
#!/usr/bin/python
#wangben updated 20130108
class WAND:
     '''implement wand algorithm'''
     def __init__( self , InvertIndex, last_docid):
         self .invert_index = InvertIndex #InvertIndex: term -> docid1, docid2, docid3 ...
         self .current_doc = 0
         self .current_invert_index = {}
         self .query_terms = []
         self .threshold = 2
         self .sort_terms = []
         self .LastID = 2000000000 #big num
         self .debug_count = 0
         self .last_docid = last_docid
     def __InitQuery( self , query_terms):
         '''check terms len > 0'''
         self .current_doc = - 1
         self .current_invert_index.clear()
         self .query_terms = query_terms
         self .sort_terms[:] = []
         self .debug_count = 0
         for term in query_terms:
             #initial start pos from the first position of term's invert_index
             self .current_invert_index[term] = [ self .invert_index[term][ 0 ], 0 ] #[ docid, index ]
                                                                                    
     def __SortTerms( self ):
         if len ( self .sort_terms) = = 0 :
             for term in self .query_terms:
                 if term in self .current_invert_index:
                     doc_id = self .current_invert_index[term][ 0 ]
                     self .sort_terms.append([ int (doc_id), term ])
         self .sort_terms.sort()
                                                                                            
     def __PickTerm( self , pivot_index):
         return 0
     def __FindPivotTerm( self ):
         score = 0
         for i in range ( 0 , len ( self .sort_terms)):
             score + = 1
             if score > = self .threshold:
                 return [ self .sort_terms[i][ 1 ], i]
         return [ None , len ( self .sort_terms) ]
     def __IteratorInvertIndex( self , change_term, docid, pos):
         '''move to doc id > docid'''
         doc_list = self .invert_index[change_term]
         i = 0
         for i in range (pos, len (doc_list)):
             if doc_list[i] > = docid:
                 pos = i
                 docid = doc_list[i]
                 break
         return [ docid, pos ]
     def __AdvanceTerm( self , change_index, docid ):
         change_term = self .sort_terms[change_index][ 1 ]
         pos = self .current_invert_index[change_term][ 1 ]
         (new_doc, new_pos) = \
             self .__IteratorInvertIndex(change_term, docid, pos)
                                                                                        
         self .current_invert_index[change_term] = \
             [ new_doc , new_pos ]
         self .sort_terms[change_index][ 0 ] = new_doc
                                                                                        
                                                                                        
     def __Next( self ):
         if self .last_docid = = self .current_doc:
             return None
                                                                                            
         while True :
             self .debug_count + = 1
             #sort terms by doc id
             self .__SortTerms()
                                                                                            
             #find pivot term > threshold
             (pivot_term, pivot_index) = self .__FindPivotTerm()
             if pivot_term = = None :
                 #no more candidate
                 return None
                                                                                            
             #debug_info:
             for i in range ( 0 , pivot_index + 1 ):
                 print self .sort_terms[i][ 0 ], self .sort_terms[i][ 1 ], "|" ,
             print ""
                                                                                                
             pivot_doc_id = self .current_invert_index[pivot_term][ 0 ]
             if pivot_doc_id = = self .LastID: #!!
                 return None
             if pivot_doc_id < = self .current_doc:
                 change_index = self .__PickTerm(pivot_index)
                 self .__AdvanceTerm( change_index, self .current_doc + 1 )
             else :
                 first_docid = self .sort_terms[ 0 ][ 0 ]
                 if pivot_doc_id = = first_docid:
                     self .current_doc = pivot_doc_id
                     return self .current_doc
                 else :
                     #pick all preceding term
                     for i in range ( 0 , pivot_index):
                         change_index = i
                         self .__AdvanceTerm( change_index, pivot_doc_id )
                                                                                    
     def DoQuery( self , query_terms):
         self .__InitQuery(query_terms)
                                                                                        
         while True :
             candidate_docid = self .__Next()
             if candidate_docid = = None :
                 break
             print "candidate_docid:" ,candidate_docid
             #insert candidate_docid to heap
             #update threshold
         print "debug_count:" , self .debug_count
                                                                                        
if __name__ = = "__main__" :
     testIndex = {}
     testIndex[ "t1" ] = [ 0 , 1 , 2 , 3 , 6 , 2000000000 ]
     testIndex[ "t2" ] = [ 3 , 4 , 5 , 6 , 2000000000 ]
     testIndex[ "t3" ] = [ 2 , 5 , 2000000000 ]
     testIndex[ "t4" ] = [ 4 , 6 , 2000000000 ]
     w = WAND(testIndex, 6 )
     w.DoQuery([ "t1" , "t2" , "t3" , "t4" ])

  输出结果中会展示next中循环的次数,以及最后被选为candidate的docid。  这里省略了建立堆的过程,使用了一个默认阈值2作为doc的删选条件,候选doc和query doc采用重复词的个数计算UB,这里只是一个算法演示,实际使用的时候需要根据自己的相关性公式进行调整

 

本文来源:广告系统中weak-and算法原理及编码验证

你可能感兴趣的:(广告系统)