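Today I suddenly felt like reading some papers on information retrieval, partly to broaden my horizons. They are mainly about improving the BM25 algorithm: the proximity of neighboring terms and the number of unique query terms matched are brought into the calculation of certain weights, which I find quite interesting. I also wrote up a summary, which I plan to use later when I take part in Xapian. I have joined the Xapian development mailing list and IRC as well; I will drop by there more often when I get the chance, and I expect it will help me improve a lot. Below is my summary of the BM25 improvements, which I wrote in English myself.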
The function value question (how to plug the function into BM25)
The paper suggests that the contribution of a proximity distance
measure should follow a function of convex shape.
Their final function therefore uses the popular logarithm function
to convert a proximity distance measure into a proximity feature value,
which is then combined with an existing retrieval function,
the Okapi BM25 model.
What is the minimum pair distance (MinDist)?
The minimum pair distance is defined as the smallest distance
value of all pairs of unique matched query terms.
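
To make the definition concrete, here is a minimal brute-force sketch. The function name `min_pair_distance` and the `postings` layout (term mapped to its positions in one document) are my own assumptions, not something taken from the paper, and the distance between two terms is assumed to be the gap between their closest occurrences.

```python
from itertools import combinations


def min_pair_distance(postings):
    """Smallest distance over all pairs of unique matched query terms.

    postings maps each matched query term to its positions in one document
    (a hypothetical layout chosen for this sketch).
    """
    best = None
    for positions_a, positions_b in combinations(postings.values(), 2):
        # Distance between two terms: gap between their closest occurrences.
        closest = min(abs(a - b) for a in positions_a for b in positions_b)
        if best is None or closest < best:
            best = closest
    return best  # None when fewer than two unique terms matched


print(min_pair_distance({"a": [2, 10], "b": [5], "c": [20]}))
```

This compares every pair of occurrences, so it is only meant to illustrate the definition, not to be efficient.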
How it is implemented
1 Assume that a document matches K unique query terms, and the total number of
occurrences of these K query terms is N. We can record the positions of these N occurrences in order
in the inverted index so that we can scan them one by one.
2 While scanning, we maintain a list of length K, in which we store the last position of each seen query term. In other words, if a term t occurs twice, we record the location of the first
occurrence when the scan hits the first one and update it when we hit the second one.
3 In each step, we calculate the span solely from the information in the list, and finally select
the smallest span value obtained during the scanning process.
4 Since K is often very small, the algorithm is close to linear in N.
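
The four steps above can be sketched roughly as follows. This is my own reading of them, not code from the paper: the name `smallest_span` is made up, the span is assumed to be the largest stored position minus the smallest one, and I assume a span only counts once all K terms have been seen at least once.

```python
def smallest_span(occurrences):
    """occurrences: (position, term) pairs for the matched query terms,
    already sorted by position (the N occurrences from step 1)."""
    k = len({term for _, term in occurrences})
    if k < 2:
        return None  # assumption: no span is defined for fewer than two terms
    last_seen = {}   # the "list of length K": last position of each seen term
    best = None
    for position, term in occurrences:
        last_seen[term] = position  # record or update the last position
        if len(last_seen) == k:
            # Span computed only from the stored last positions.
            span = max(last_seen.values()) - min(last_seen.values())
            if best is None or span < best:
                best = span
    return best


print(smallest_span([(1, "a"), (3, "a"), (4, "b"), (8, "b")]))
```

Each step does O(K) work on the stored positions, which is why the whole scan stays close to linear in N when K is small.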
Here is an example:
Document1: t1 t2 t1 t3 t5 t4 t2 t3 t4
Document2: t4 t3 t2 t1 t5 t1 t3 t6
The inverted index should look like this (each posting is written as document ID[number of occurrences]):
term   document ID[occurrences]   positions of this term
t1     1[2],2[2]                  [1,3,4,6]
t2     1[2],2[1]                  [2,7,3]
t3     1[2],2[2]                  [4,8,2,7]
t5     1[1],2[1]                  [5,5]
t4     1[2],2[1]                  [6,9,1]
t6     2[1]                       [8]
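
For anyone who wants to rebuild this index from the two documents, here is a small sketch. The function name `build_index` and the nested-dictionary layout are my own choices for illustration and have nothing to do with how Xapian stores its postings.

```python
from collections import defaultdict


def build_index(documents):
    """documents: doc_id -> list of terms.
    Returns term -> {doc_id: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in documents.items():
        for position, term in enumerate(terms, start=1):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


documents = {
    1: "t1 t2 t1 t3 t5 t4 t2 t3 t4".split(),
    2: "t4 t3 t2 t1 t5 t1 t3 t6".split(),
}
index = build_index(documents)
for term in sorted(index):
    postings = index[term]
    # Print each posting as docID[occurrence count], then the positions.
    counts = ",".join(f"{doc}[{len(pos)}]" for doc, pos in postings.items())
    print(term, counts, postings)
```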
Now assume that the query terms are t1, t3 and t6.
Then Document1 matches two unique terms (t1 and t3), and the total number of occurrences of
these two query terms is four. We already have the positions of these unique query terms
in the inverted index.
We then scan the inverted index entries for t1 and t3, maintaining a list of length two for Document1.
The positions of these four occurrences are in order, so we scan them one by one.
step1 : Scan the inverted index; the lowest position is t1 at 1, so the list is {1}.
step2 : Scan the inverted index; the next lowest position is again t1, at 3, so the list is updated to {3}.
step3 : Scan the inverted index; the next lowest position is t3 at 4, so the list is {3,4} and the smallest span value is now 1.
step4 : Scan the inverted index; the last position is t3 at 8, and t3 is already in the list, so the list becomes {3,8} (span 5) and we do not update the smallest span value.
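
The walk-through for Document1 can be reproduced with a short trace. This is only a sketch under my own assumptions: `trace_scan` is a made-up name, the per-term position lists are merged with `heapq.merge` from the Python standard library to get "the lowest position next", and the span is taken as the largest stored position minus the smallest.

```python
import heapq


def trace_scan(postings):
    """postings: term -> sorted positions of that term in one document.
    Returns one (term, position, stored positions, smallest span) row per step."""
    streams = [[(pos, term) for pos in positions]
               for term, positions in postings.items()]
    last_seen = {}
    best = None
    trace = []
    # heapq.merge always yields the lowest remaining position next.
    for position, term in heapq.merge(*streams):
        last_seen[term] = position
        if len(last_seen) == len(postings):
            span = max(last_seen.values()) - min(last_seen.values())
            if best is None or span < best:
                best = span
        trace.append((term, position, sorted(last_seen.values()), best))
    return trace


# Positions of t1 and t3 in Document1.
for step, row in enumerate(trace_scan({"t1": [1, 3], "t3": [4, 8]}), start=1):
    print("step", step, row)
```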
Finally we have the smallest span value, and we can use the new retrieval function to calculate weights:
R (Q, D) = BM25(Q, D) + π(Q, D)
π(Q, D) is the function π(Q, D) = log(α + exp(-δ(Q, D)))
δ(Q, D) is the smallest span value of the matched query terms.
α is a parameter introduced here to allow for certain variations, and α=0.3 works well for most data
sets (I got this from the paper).
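
As a quick check of how the formula behaves, here is a small sketch. The function names `proximity_score` and `combined_score` are mine, the natural logarithm is assumed, and the BM25 score is simply passed in as a number because computing it is outside the scope of these notes.

```python
import math


def proximity_score(delta, alpha=0.3):
    """pi(Q, D) = log(alpha + exp(-delta)), with delta the smallest span."""
    return math.log(alpha + math.exp(-delta))


def combined_score(bm25_score, delta, alpha=0.3):
    """R(Q, D) = BM25(Q, D) + pi(Q, D); bm25_score is computed elsewhere."""
    return bm25_score + proximity_score(delta, alpha)


# A smaller span gives a larger proximity contribution.
for delta in (1, 5, 20):
    print(delta, proximity_score(delta))
```

As δ grows, exp(-δ) goes to zero and π(Q, D) levels off at log(α), so documents whose matched terms are far apart all receive roughly the same contribution, while close matches are rewarded.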