#Paper Notes# Demystifying Core Ranking in Pinterest Image Search

This paper describes the image search ranking system developed at Pinterest. These notes walk through it from the perspectives of training data, user/image featurization, and ranking models, and summarize the experiments on performance and quality.

Background:

1. Purpose: improve the image search quality by evolving ranking techniques

The first search system was built on top of Apache Lucene and Solr, and results were ranked by text relevance scores between queries and the text descriptions of images. The search system subsequently evolved by adding various advancements that address the unique challenges of Pinterest (problems encountered and solutions explored).

2. Three challenges: 

The first challenge comes from why users search for images on Pinterest: their intent is expressed only implicitly, through engagement actions on pins such as 'try it', 'close up', and 'repin'. This raises a new problem: it is difficult to define a universal measure that weighs these actions against each other (is a 'try it' more important than a 'close up'?). Another challenge comes from the nature of images themselves: a single image can carry rich information that is hard to extract and describe with a few words. The third is balancing the efficiency and effectiveness of ranking algorithms: although many algorithms have been deployed in industry, it is rarely clear which one is best for a given application.

3. Three aspects:

data: combine the engagement training data & human curated relevance judgement data

Featurization: includes feature engineering, word and visual embeddings, visual relevance signals, query intent understanding, user intent understanding, etc.

Modeling: a cascading core ranking component (to balance search latency and search quality)

Section 2: How the training data is curated

maximize both engagement & relevance

engagement data (user behavioral metrics):

Click-through engagement logs have become the standard training data for learning-to-rank optimization in search engines. Each record has the form <q, u, (P, T)>, but T can contain multiple actions towards pins, including click-through, repin, close-up, like, hide, comment, try-it, etc.

This raises a new challenge: how should we simultaneously combine and optimize multiple types of feedback?

Approach 1: train one model per engagement type, then ensemble and tune the individual models. Many variants were tried, but none of them avoided sacrificing some engagement metric.

Approach 2: ensemble at the data level. l(p | q, u) denotes the engagement-based quality label of a pin (image) p for user u under query q, written l_p. For the pin set P (the recall set of the user's query) we can then generate a label set L; for each pin in P, its l_p is a linear combination over the action set T,

Equation 1: T is the set of actions, including those mentioned above (repin, close-up, etc.), c_t is the count of action t, and w_t is its weight; the weight is inversely proportional to the action's volume (the volume of close-ups is larger than that of clicks).

Equation 2 normalizes Equation 1, adjusting each pin's label by position and age: age_p and pos_p are the age and position of pin p, τ is the normalization weight for pin age, and λ is the parameter that controls the position decay.
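A plausible reconstruction of the two equations from the description above; Equation 1 follows directly from the text, while the exact decay form of Equation 2 is not restated here, so the second expression is only an assumed sketch consistent with "normalize by age and position":

```latex
% Eq. (1): engagement label as a weighted sum of action counts
l_p \;=\; \sum_{t \in T} w_t \, c_t ,
\qquad w_t \;\propto\; \frac{1}{\mathrm{volume}(t)}

% Eq. (2), assumed form: discount the label by pin age and result position
\tilde{l}_p \;=\; l_p \cdot e^{-\tau\,\mathrm{age}_p} \cdot \lambda^{\,\mathrm{pos}_p}
```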

Another challenge: the samples are skewed, with far more negatives than positives. The training set is pruned by (1) discarding query groups in which every l_p is below zero, and (2) discarding query groups whose number of negative samples exceeds a threshold (see the sketch below).

The final training data has the form <q, u, (P, T, L)>.
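A minimal sketch of the query-group pruning described above, assuming records are flat (query, user, pin, label) tuples; the grouping key and the threshold value are illustrative assumptions, not values from the paper:

```python
from collections import defaultdict

# Each record: (query, user, pin_id, label), where label is the engagement label l_p.
MAX_NEGATIVES_PER_QUERY = 200  # hypothetical threshold

def prune_query_groups(records):
    """Drop query groups that are all-negative or contain too many negatives."""
    groups = defaultdict(list)
    for q, u, pin, label in records:
        groups[q].append((u, pin, label))

    kept = []
    for q, rows in groups.items():
        labels = [label for _, _, label in rows]
        if all(l < 0 for l in labels):           # rule 1: no positive signal at all
            continue
        negatives = sum(1 for l in labels if l < 0)
        if negatives > MAX_NEGATIVES_PER_QUERY:  # rule 2: too many negatives
            continue
        kept.extend((q, u, pin, label) for u, pin, label in rows)
    return kept
```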

human relevance data (human relevance judgment):

Although aggregating a large number of individually unreliable user search sessions yields reliable engagement training data from implicit feedback, it also inherits biases from the current ranking function; position bias is one example. To correct the ranking bias, relevance judgment data is also curated from human experts, similar to a survey with manual labeling.

combining:

Each training data source is considered independently and fed into the ranking function to train a separate model.

Section 3: Feature representation for ranking

how we enhance traditional ranking features to address unique challenges

Beyond Text Relevance Features:

Background: the text description of each pin is usually short and noisy.

1. Word-level relevance: each image is given text annotations in n-gram form (unigrams, bigrams, and trigrams) drawn from different sources such as the pin's text and description, text from the crawled linked web page, and automatically classified annotation labels. Text matching scores are then computed with BM25 or proximity BM25 (a sketch of BM25 scoring follows this list).

2. Categoryboost: 32 L1 categories and 500 L2 categories.

3. Topicboost: topics are discovered via statistical topic modeling over word distributions (Latent Dirichlet Allocation).

4. Embedding Features: measure the similarity between the user's query and the pins in a latent representation space.
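A minimal sketch of the BM25 scoring mentioned under word-level relevance, assuming a simple bag-of-words annotation per pin; the parameter values k1 and b are the common defaults, not values reported in the paper:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one pin's text annotations against the query terms with classic BM25."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term never appears in the corpus
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        freq = tf[term]
        norm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```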

User Intent Features: help the learning algorithm learn which types of pins are “really” relevant and interesting to users

1. Navboost: describes how a pin performs in general and in the context of specific queries and user segments, based on historical statistics of close-up, click, long-click, and repin, further segmented by country, gender, aggregation time window, etc. (a sketch of this kind of aggregation follows this feature list).

2. Tokenboost: how well a pin performs in general and in the context of a specific token.

3. Gender Features:

4. Personalized Features:
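A hedged sketch of the Navboost-style aggregation described in item 1: per-segment engagement rates computed from historical action counts. The segment keys and the smoothing constant are illustrative assumptions, not the paper's exact definition:

```python
from collections import defaultdict

def navboost_rates(events, prior=1.0):
    """Aggregate historical engagement rates per (pin, query, country, gender) segment.

    `events` is an iterable of dicts with keys: pin, query, country, gender,
    impressions, repins, clicks, closeups.  The additive smoothing `prior`
    and the choice of segment keys are illustrative.
    """
    agg = defaultdict(lambda: {"impressions": 0, "repins": 0, "clicks": 0, "closeups": 0})
    for e in events:
        key = (e["pin"], e["query"], e["country"], e["gender"])
        bucket = agg[key]
        for field in ("impressions", "repins", "clicks", "closeups"):
            bucket[field] += e[field]

    features = {}
    for key, b in agg.items():
        denom = b["impressions"] + prior
        features[key] = {
            "repin_rate": b["repins"] / denom,
            "click_rate": b["clicks"] / denom,
            "closeup_rate": b["closeups"] / denom,
        }
    return features
```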

Socialness, Visual and other Features

Image size, aspect ratio, image hash features, pin following, friends following.

Section 4: Cascading models

Three stages: a light-weight stage, a full-ranking stage, and a re-ranking stage. As shown in Figure 4, the candidate set shrinks from millions to tens of thousands to roughly 1,000 pins (a sketch of the cascade follows the stage descriptions below).

Notation: q is the query, u the user, p a pin, x the feature vector for a tuple (q, u, p), l_p the engagement label from Section 2, y the ground truth of l_p, s_p the quality score of p given q and u, L the loss function, and S the scoring function.
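In this notation, every model in Table 1 instantiates the same generic objective. This is a standard learning-to-rank formulation written out from the notation above, not an equation quoted from the paper; φ stands for the featurization of Section 3 and is a symbol introduced here for clarity:

```latex
s_p = S(x;\,\theta), \qquad x = \phi(q, u, p)

\theta^{*} = \arg\min_{\theta} \sum_{(q,\,u,\,p)} L\bigl(S(x;\,\theta),\; y\bigr)
```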

Table 1 lists the models used at each stage. The rule-based models are skipped here because they are straightforward.

light-weight stage:

filter out negative pins

full-ranking stage:

re-ranking stage:

Improves freshness, diversity, localness, and language awareness.
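A minimal sketch of the three-stage cascade, assuming each stage exposes a scoring function and a cut-off. The stage sizes follow the millions → tens of thousands → ~1,000 funnel mentioned above, and the function names are illustrative, not Pinterest's API:

```python
def cascade_rank(query, user, candidates,
                 light_score, full_score, rerank_adjust,
                 k_light=10_000, k_full=1_000):
    """Rank candidates through the light-weight, full-ranking and re-ranking stages."""
    # Stage 1: a cheap model prunes millions of candidates down to ~k_light.
    stage1 = sorted(candidates, key=lambda p: light_score(query, user, p), reverse=True)[:k_light]

    # Stage 2: an expensive model scores the survivors and keeps the top ~k_full.
    stage2 = sorted(stage1, key=lambda p: full_score(query, user, p), reverse=True)[:k_full]

    # Stage 3: re-ranking adjusts for freshness, diversity, localness, language, etc.
    return sorted(stage2, key=lambda p: rerank_adjust(query, user, p), reverse=True)
```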

models:

Gradient Boost Decision Tree (GBDT): uses mean squared loss as the training loss.

Deep Neural Network (DNN): learns to predict the quality score s_p by casting the task as multi-class classification over a 4-scale label [1, 2, 3, 4]; l_p is mapped to y ∈ [1, 2, 3, 4] and the model is trained with cross-entropy loss.

Convolutional Neural Network (CNN): identical to the DNN setup except for the network architecture.

RankNet: learns the correct ordering of pin pairs; a single model learns a ranking function S(q, u, p_i, p_j; θ) that predicts the probability that s_{p_i} > s_{p_j}. Pair labels are derived from the unnormalized l_p by comparing l_{p_i} and l_{p_j}, and the loss function is Equation 6 (see the sketch after the model list).

RankSVM: 

Gradient Boost Ranking Tree (GBRT): a combination of RankSVM and GBDT; each iteration learns a function for the probability that p_i is ranked ahead of p_j, and, as in GBDT, each base ranker is a regression tree of limited depth.
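For reference, the standard RankNet pairwise cross-entropy, which is what the description of Equation 6 above points to; this is the generic RankNet formulation, not copied from the paper:

```latex
P_{ij} \;=\; \Pr\!\left(s_{p_i} > s_{p_j}\right) \;=\; \frac{1}{1 + e^{-(s_{p_i} - s_{p_j})}}

\mathcal{L} \;=\; -\sum_{(i,j)} \left[\, \bar{y}_{ij} \log P_{ij} + (1 - \bar{y}_{ij}) \log\bigl(1 - P_{ij}\bigr) \,\right],
\qquad \bar{y}_{ij} = \mathbb{1}\!\left[l_{p_i} > l_{p_j}\right]
```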

Model blending:

The engagement data and the human relevance data yield different models, which are blended linearly.

Models that are cheap to train are blended during training; otherwise the blending is done after training.
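A minimal sketch of the post-training linear blend of an engagement-trained model and a relevance-trained model; the blend weight `alpha` is a hypothetical tuning parameter, not a value from the paper:

```python
def blended_score(pin_features, engagement_model, relevance_model, alpha=0.7):
    """Linearly combine the scores of two models trained on different label sources.

    `alpha` is a hypothetical blend weight, assumed to be tuned on validation data.
    """
    s_eng = engagement_model.predict(pin_features)
    s_rel = relevance_model.predict(pin_features)
    return alpha * s_eng + (1 - alpha) * s_rel
```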

Section 5: Experiments

Offline Experimental Setting:

For each country and language, the human-relevance data consists of 5,000 queries with user judgments for 400 pins per query; the engagement data is a random 1% sample of the logs from the most recent 7 days. 70% is used for training, 20% for testing, and 10% for validation.

Offline Measurement Metrics: Figure 6 shows the distributions of some features. NDCG is used for offline evaluation.

Online Experimental Setting:

100 buckets

Online Measurement Metrics 

User level and query level.

For query-level measurement metrics, repin per search (Qrepin), click per search (Qclick), close-up per search (Qclose up), and engagement per search (Qengaged) were the main metrics used.

Performance Results

5.3.1 Lightweight Ranking Comparison.

Figure 7: on the offline data, with the rule-based model as the baseline, RankSVM shows a large improvement; online, however, the gain from RankSVM is small, which is a common pattern. The quality lift is modest but latency drops (Table 2).

5.3.2 Full Ranking Comparison

Offline: compared against RankSVM, the ordering is CNN ⪰ GBRT ⪰ DNN ⪰ RankNet ⪰ GBDT.

Online: the neural network models introduce latency, so their scores are precomputed offline and added as features to the online model.

5.3.3 Re-ranking Comparison.

When the rule-based re-ranking model is replaced with GBRT, the repin rate and CTR of fresh pins increase by 20%.

Reference

Normalized Discounted Cumulative Gain (NDCG)

With precision and recall, every item in the returned set carries the same weight: the item at position k counts the same as the item at position 1. That is not how users perceive results; for a search engine's ranked list, the top answers matter far more than those further down.

Normalized Discounted Cumulative Gain (NDCG) accounts for this. NDCG builds on three progressively refined metrics: Cumulative Gain (CG), Discounted Cumulative Gain (DCG), and finally the normalized version. CG sums the relevance scores of the top-k returned items; DCG multiplies each item's score by a weight that decreases with position, so earlier positions weigh more. NDCG then normalizes the result by dividing the DCG by the DCG of the ideal ordering, so the value always falls between 0.0 and 1.0. The Wikipedia article gives the detailed formulas.

DCG and NDCG are widely used in information retrieval and in any model or method where the position of the returned items matters.
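The standard formulas behind this description, using the common log2 position discount (IDCG@k is the DCG of the ideal ordering):

```latex
\mathrm{CG}@k = \sum_{i=1}^{k} rel_i,
\qquad
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)},
\qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
```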
