With a point-wise approach that simply sorts candidates by predicted CTR in descending order, highly similar items tend to cluster together. A run of near-identical items is a poor user experience.
Maximal Marginal Relevance (MMR).
See reference [2] for details.
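The basic idea: given a query, a set of documents $A$ is recalled. We need to pick a subset $A_k$ of size $k$ from $A$ to show to the user. Each time an element $i$ is selected, both the diversity of $A_k$ and its relevance to the query are taken into account.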
The scoring function for $A_k$ is:
$$\rho(A_k)=\alpha \times d(A_k)+(1-\alpha)\sum_{a_i\in A_k} s(a_i) \tag{1}$$
where $A_k$ is a subset of $A$ of size $k$.
$s(a)$ is the relevance between item $a$ and the current customer.
$d(A_k)$ measures the diversity of $A_k$.
The optimal subset is given by:
$$A_k^* := \underset{A_k\subseteq A,\ |A_k|=k}{\arg\max}\ \rho(A_k) \tag{2}$$
Diversity is defined concretely as:
$$d(A_k)=\sum_{i=1}^{k} \sum_{j=1}^{i} \mathrm{distance}(a_i,a_j) \tag{3}$$
$$\mathrm{distance}(a_i,a_j)=\mathrm{virtualCateDistance}(a_i,a_j) \times \mathrm{spanWeight}(|i-j|) \tag{4}$$
That is, the virtual-category distance between two items and the gap between their positions are combined.
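To make the scoring function concrete, here is a minimal Python sketch of $\rho(A_k)$ for an ordered list of items. The source does not specify `virtualCateDistance` or `spanWeight`, so both are stand-ins chosen for illustration: `virtual_cate_distance` returns 0 for equal categories and 1 otherwise, and `span_weight` decays as `1/gap`. The item fields `cate` and `score` are likewise hypothetical.

```python
from typing import Any, Dict, Sequence

Item = Dict[str, Any]  # hypothetical item record: {"cate": str, "score": float}


def span_weight(gap: int) -> float:
    """Assumed position weight: 1/gap, and 0 when an item is paired with itself."""
    return 1.0 / gap if gap > 0 else 0.0


def virtual_cate_distance(a: Item, b: Item) -> float:
    """Assumed category distance: 0 for the same category, 1 otherwise."""
    return 0.0 if a["cate"] == b["cate"] else 1.0


def diversity(items: Sequence[Item]) -> float:
    """d(A_k): sum over ordered pairs (i, j <= i) of category distance times span weight."""
    total = 0.0
    for i in range(len(items)):
        for j in range(i + 1):  # the j == i term contributes 0 under span_weight
            total += virtual_cate_distance(items[i], items[j]) * span_weight(i - j)
    return total


def rho(items: Sequence[Item], alpha: float = 0.5) -> float:
    """rho(A_k) = alpha * d(A_k) + (1 - alpha) * sum of relevance scores."""
    relevance = sum(float(it["score"]) for it in items)
    return alpha * diversity(items) + (1.0 - alpha) * relevance


shoes_a = {"cate": "shoes", "score": 0.9}
shoes_b = {"cate": "shoes", "score": 0.8}
hat = {"cate": "hats", "score": 0.7}

clustered = [shoes_a, shoes_b, hat]    # the two shoes sit next to each other
interleaved = [shoes_a, hat, shoes_b]  # same items, shoes separated by the hat
print(rho(clustered), rho(interleaved))
```

Both lists contain the same items, so the relevance term is identical; only the ordering changes the diversity term. Under these assumed weights the interleaved list scores about 2.2 against about 1.95 for the clustered one.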
Eq. (2) is a special case of the NP-hard (non-deterministic polynomial-time hard) maximum set cover problem, so finding the exact optimum is impractical for large candidate sets.
Instead, we use an iterative greedy procedure to obtain a near-optimal solution:
$$A_{i+1}=A_i \cup \Big\{\underset{a\in A\setminus A_i}{\arg\max}\ \rho(A_i\cup \{a\})\Big\} \tag{5}$$
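The greedy update can be sketched in a few lines of Python. This is an illustrative sketch, not the reference implementation: items are represented as hypothetical `(category, relevance)` tuples, and `toy_rho` is a stand-in objective (category mismatch weighted by `1/gap`, plus summed relevance) used only so the example runs on its own.

```python
from typing import Callable, List, Sequence, Tuple

Item = Tuple[str, float]  # hypothetical representation: (category, relevance score)


def toy_rho(subset: Sequence[Item], alpha: float = 0.5) -> float:
    """Stand-in objective: alpha * diversity + (1 - alpha) * total relevance."""
    div = sum(
        (1.0 if subset[i][0] != subset[j][0] else 0.0) / (i - j)
        for i in range(len(subset))
        for j in range(i)
    )
    rel = sum(score for _, score in subset)
    return alpha * div + (1.0 - alpha) * rel


def greedy_select(
    candidates: Sequence[Item],
    k: int,
    rho: Callable[[Sequence[Item]], float] = toy_rho,
) -> List[Item]:
    """Repeatedly append the candidate that maximizes rho(selected + [candidate])."""
    selected: List[Item] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: rho(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected


pool = [("shoes", 0.9), ("shoes", 0.8), ("hats", 0.7), ("bags", 0.6)]
print(greedy_select(pool, k=3))
```

Each round evaluates the objective once per remaining candidate, so selecting $k$ items from $n$ candidates costs on the order of $k \times n$ objective evaluations, rather than one evaluation per size-$k$ subset.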
Here is a very simple and widely used approach.
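Define a similarity between two elements, $\mathrm{distance}(i,j) \in \{0,1\}$. In e-commerce recommendation, two products can be treated as similar if they share a category, a shop, a brand, and so on; matching any one of these counts as similar.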
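$$\mathrm{distance}(a_i,a_j) = \begin{cases} 1, & a_i \text{ and } a_j \text{ share a category, an author, ...} \\ 0, & \text{otherwise} \end{cases}$$
Despite its name, this function acts as a similarity indicator: the value 1 marks a similar pair, so that a total of 0 against a set means the element is similar to none of its members.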
Define the similarity between an element and a set:
$$\mathrm{distance}(S,a)=\sum_{b\in S} \mathrm{distance}(a,b)$$
The iteration then becomes:
$$A_{i+1}=A_i \cup \Big\{ \underset{a\in A\setminus A_i,\ \mathrm{distance}(A_i,a)=0}{\arg\max}\ s(a) \Big\} \tag{10}$$
That is, among all candidates that satisfy the scatter (de-clustering) constraint, pick the most relevant one.
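The rule-based iteration can be sketched as follows, reading the constraint as "the candidate is not similar to any item already selected". The item fields (`id`, `cate`, `shop`, `brand`, `score`) are hypothetical, and stopping early when no candidate satisfies the constraint is an assumption of this sketch; the text does not say what to do in that case.

```python
from typing import Any, Dict, List, Sequence

Item = Dict[str, Any]  # hypothetical record with id, cate, shop, brand, score


def is_similar(a: Item, b: Item) -> bool:
    """Two items are similar if they share a category, a shop, or a brand."""
    return a["cate"] == b["cate"] or a["shop"] == b["shop"] or a["brand"] == b["brand"]


def scatter_select(candidates: Sequence[Item], k: int) -> List[Item]:
    """Pick up to k items, each the most relevant one not similar to any earlier pick."""
    selected: List[Item] = []
    # A single pass in descending relevance is enough: a candidate rejected once
    # stays rejected, because the selected set only grows.
    for item in sorted(candidates, key=lambda x: x["score"], reverse=True):
        if len(selected) == k:
            break
        if not any(is_similar(item, chosen) for chosen in selected):
            selected.append(item)
    return selected


pool = [
    {"id": "A", "cate": "shoes", "shop": "s1", "brand": "b1", "score": 0.9},
    {"id": "B", "cate": "shoes", "shop": "s2", "brand": "b2", "score": 0.8},
    {"id": "C", "cate": "hats", "shop": "s1", "brand": "b3", "score": 0.7},
    {"id": "D", "cate": "hats", "shop": "s3", "brand": "b4", "score": 0.6},
    {"id": "E", "cate": "bags", "shop": "s4", "brand": "b4", "score": 0.5},
    {"id": "F", "cate": "bags", "shop": "s5", "brand": "b5", "score": 0.4},
]
print([item["id"] for item in scatter_select(pool, k=3)])
```

In this toy pool, B and C are skipped because they share a category or a shop with A, and E is skipped because it shares a brand with D, so the picks are A, D and F. In practice the constraint is often relaxed (for example, applied only within a sliding window of recent positions) so that the list can still be filled; that variation is outside what the text describes.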