Studying Recommendation Algorithms by Graph Analysis

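These are my notes from reading the paper Studying Recommendation Algorithms by Graph Analysis. Nothing here is original; it is just a collection of scattered points. If anything is wrong, readers are welcome to point it out!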



Abstract:

We present a novel framework for studying recommendation algorithms in terms of the ‘jumps’ that they make to connect people to artifacts. This approach emphasizes reachability via an algorithm within the implicit graph structure underlying a recommender dataset and allows us to consider questions relating algorithmic parameters to properties of the datasets.

We illustrate the approach with a common jump called the ‘hammock’ using movie recommender datasets.

1. Introduction

Key terms from this section:

information overload

recommender landscape

customized search engines

handcrafted content indices

personalized shopping agents on e-commerce sites

news-on-demand services

simple keyword matching of consumer profiles

harnessing online information resources

information aggregation

1.1 Motivating scenarios

Key terms from this section:

transactional data

genre

follow power-law distributions

The common theme among these applications is that they emphasize many important aspects of a recommender system other than predictive accuracy. This framework allows a more direct approach to reasoning about recommendation algorithms and their relationship to the recommendation patterns of users.

We effectively ignore the issue of predictive accuracy, and so the framework is a complement to approaches based on field studies.

2. Characterizing recommendation algorithms

Most current research efforts cast recommendation as a specialized task of information retrieval/filtering or as a task of function approximation/learning mappings.

Even approaches that focus on clustering view clustering primarily as a pre-processing step for functional modeling, or as a technique to ensure scalability or to overcome sparsity of ratings. This emphasis on functional modeling and retrieval has influenced evaluation criteria for recommender systems.

Key term: traditional information retrieval

Ideas such as cross-validation on an unseen test set have been used to evaluate mappings from people to artifacts, especially in collaborative filtering recommender systems.

Such approaches miss many desirable aspects of the recommendation process, namely:

1. Recommendation is an indirect way of bringing people together.

2. Recommendation, as a process, should emphasize modeling connections from people to artifacts, besides predicting ratings for artifacts.

3. Recommendations should be explainable and believable.

4. Recommendations are not delivered in isolation, but in the context of an implicit/explicit social network.

 

2.1 Related research

Phrase note: "as a basis to A" means serving as a foundation for A.

Most graph-based algorithms for information networks can be studied in terms of the graph modeling, and the structures/operations that are mined/conducted on the graph.

One of the most celebrated examples of graph analysis arises in search engines that exploit both link information and textual content.

Google essentially models a one-mode directed graph and uses measures involving principal components to ascertain ‘page ranks.’ Jon Kleinberg’s HITS algorithm goes a step further by viewing the one-mode web graph as actually comprising two modes (called hubs and authorities).

A hub is a node primarily with edges to authorities, and so a good hub has links to many authorities. A good authority is a page that is linked to by many hubs.
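To make the mutual definition of hubs and authorities concrete, here is a minimal sketch of the iterative scoring idea behind HITS. The function name `hits`, the toy edge list, and the fixed iteration count are my own illustrative assumptions; they do not come from the paper.

```python
from math import sqrt


def hits(edges, iterations=50):
    """Iteratively score hubs and authorities on a directed graph.

    edges: iterable of (source, target) links. Illustrative sketch only.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy web graph: h1..h3 link to a1 and a2.
toy_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a1")]
hub, auth = hits(toy_edges)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
```

In this toy graph the page with the most incoming links from good hubs ends up with the highest authority score, and a page that links to both authorities scores higher as a hub than one that links to only one.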

The use of link analysis in recommender systems was highlighted by the “referral chaining” technique of the ReferralWeb project (Kautz et al., 1997). The idea is to use the co-occurrence of names in any of the documents available on the web to detect the existence of direct relationships between people and discover a social network. The underlying assumption is that people with similar interests swarm in the same circles to discover collaborators (Payton, 1998).

The exploration of link analysis in social structures has led to several new avenues of research, most notably small-world networks. Small-world networks are highly clustered but relatively sparse networks with small average length.

Watts and Strogatz (1998) provide the first such characterization of small-world networks in the form of a graph generation model.

Thus, small-world networks fall in between regular and random networks, having the small average lengths of random networks but high clustering coefficients akin to regular networks.
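The two quantities in this description, clustering coefficient and average path length, can be measured directly. The sketch below is my own illustration, not the paper's code: it starts from a regular ring lattice and adds a few fixed shortcut edges. Note that this is a deterministic stand-in for the random rewiring used in the Watts-Strogatz model, chosen so that the example is reproducible; all function names are hypothetical.

```python
from collections import deque


def ring_lattice(n, k):
    """Regular graph: each node is linked to its k nearest ring neighbours (k even)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k // 2 + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj):
    """Average, over all nodes, of the fraction of neighbour pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        nbrs = list(nbrs)
        d = len(nbrs)
        if d < 2:
            continue  # such a node contributes 0 to the average
        links = sum(
            1
            for a in range(d)
            for b in range(a + 1, d)
            if nbrs[b] in adj[nbrs[a]]
        )
        total += links / (d * (d - 1) / 2)
    return total / len(adj)


def average_path_length(adj):
    """Mean BFS distance over all ordered pairs of distinct, mutually reachable nodes."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring = ring_lattice(40, 4)
c_ring, l_ring = clustering_coefficient(ring), average_path_length(ring)

# Add three long-range shortcuts (a deterministic substitute for random rewiring).
shortcut = ring_lattice(40, 4)
for a, b in [(0, 20), (10, 30), (5, 25)]:
    shortcut[a].add(b)
    shortcut[b].add(a)
c_short, l_short = clustering_coefficient(shortcut), average_path_length(shortcut)

print(f"ring:      C={c_ring:.3f}  L={l_ring:.3f}")
print(f"shortcuts: C={c_short:.3f}  L={l_short:.3f}")
```

A handful of shortcuts lowers the average path length while leaving the clustering coefficient close to that of the regular lattice, which is the "in between" behaviour the passage describes.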

The small-world network concept has implications for a variety of domains.

Watts and Strogatz simulate the spread of an infectious disease in a small-world network (Watts and Strogatz, 1998). Adamic shows that the world wide web is a small-world network and suggests that search engines capable of exploiting this fact can be more effective in hyperlink modeling, crawling, and finding authoritative sources.

Key term: crawling strategy

3. Graph analysis

We develop a novel way to study algorithms for recommender systems. Algorithms are distinguished, not by the predicted ratings of services/artifacts they produce, but by the combinations of people and artifacts that they bring together. Two algorithms are considered equivalent if they bring together identical sets of nodes regardless of whether they work in qualitatively different ways. Our emphasis is on the role of a recommender system as a mechanism for bridging entities in a social network. We refer to this approach of studying recommendation as jumping connections.
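The notion of equivalence can be illustrated with a small sketch. The two procedures below are hypothetical toy algorithms of my own, not ones taken from the paper: one reasons through people who share a rating, the other through artifacts that were rated together. They work differently but connect exactly the same (person, artifact) pairs, so under the definition above they would count as equivalent.

```python
def user_based_pairs(ratings):
    """Connect p to artifacts rated by any other person sharing a rating with p."""
    pairs = set()
    for p, rated_p in ratings.items():
        for q, rated_q in ratings.items():
            if q != p and rated_p & rated_q:
                pairs |= {(p, a) for a in rated_q - rated_p}
    return pairs


def item_based_pairs(ratings):
    """Connect p to artifacts that co-occur with something p has already rated."""
    # co[a] = every artifact that appears with a in some person's rating set.
    co = {}
    for rated in ratings.values():
        for a in rated:
            co.setdefault(a, set()).update(rated)
    pairs = set()
    for p, rated_p in ratings.items():
        reachable = set()
        for b in rated_p:
            reachable |= co[b]
        pairs |= {(p, a) for a in reachable - rated_p}
    return pairs


# Hypothetical toy data: person -> set of rated movies.
ratings = {
    "ann": {"m1", "m2"},
    "bob": {"m2", "m3"},
    "cat": {"m3", "m4"},
    "dan": {"m5"},
}

same = user_based_pairs(ratings) == item_based_pairs(ratings)
print("equivalent on this dataset:", same)
print(sorted(user_based_pairs(ratings)))
```

The comparison is made purely on the sets of connected pairs; no predicted rating appears anywhere, which matches the scope restriction the framework adopts.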

Our metrics, hence, will not lead a designer to directly conclude that an algorithm A is more accurate than an algorithm B. Such conclusions can only be made through a field evaluation (involving feedback and reactions from users) or via survey/interview procedures.

By restricting its scope to exclude the actual aspect of making ratings and predictions, the jumping connections framework provides a systematic and rigorous way to study recommender systems.

Notice also that when an algorithm brings together person X and artifact Y, it could imply either a positive recommendation or a negative one. Such differences are, again, not captured by our framework unless the mechanism for making connections restricts its jumps, for instance, to only those artifacts for which ratings satisfy some threshold. In other words, thresholds for making recommendations could be abstracted into the mechanism for jumping.

Jumping connections satisfies all the aspects outlined in the previous section. It involves a social-network model, and thus, emphasizes connections rather than prediction. The nature of connections jumped also aids in explaining the recommendations. The graph-theoretic nature of jumping connections allows the use of mathematical models (such as random graphs) to analyze the properties of the social networks in which recommender algorithms operate.

3.1. The jumping connections construction

We view a recommender system as exploiting the social connections (the jumps) that bring together a person with other people who have rated an artifact of (potential) interest.

 

3.2. Basic approach

Our basic approach is to think of the operation of a given recommendation algorithm as performing a jump and study this jump algorithmically using tools from random graph theory. Our specific contributions in this paper are the emphasis on graph analysis as a way to study recommendation algorithms and the use of random graph models to predict properties of interest about the social network and recommender graphs. The mathematical machinery brought to bear upon this problem is the work of Newman et al. (2000).

 

3.3. Hammocks

There is some consensus in the community that hammocks are fundamental in recommender algorithms since they represent commonality of ratings. It is our hypothesis that hammocks are fundamental to all recommender system jumps.
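These notes do not spell out the definition of a hammock, so the sketch below assumes the usual reading: a hammock of width w joins two people when they have rated at least w artifacts in common, and the resulting person-person edges form the social network graph. The function name `hammock_graph` and the toy ratings are hypothetical.

```python
from itertools import combinations


def hammock_graph(ratings, w):
    """Return person-person edges induced by a hammock jump of width w.

    Assumption: two people are linked when they share at least w rated artifacts.
    """
    edges = set()
    for p, q in combinations(sorted(ratings), 2):
        if len(ratings[p] & ratings[q]) >= w:
            edges.add((p, q))
    return edges


# Hypothetical toy data: person -> set of rated movies.
ratings = {
    "ann": {"m1", "m2", "m3"},
    "bob": {"m2", "m3", "m4"},
    "cat": {"m3", "m4", "m5"},
    "dan": {"m5", "m6"},
}

for w in (1, 2, 3):
    print(w, sorted(hammock_graph(ratings, w)))
```

Raising w makes the jump stricter: edges disappear first, and eventually people drop out of the graph altogether because they no longer share enough ratings with anyone.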

 

3.4. Random graph models

Our goal is to be able to answer questions about hammock width, minimum ratings, and path length in a typical graph. The approach we take is to use a model of random graphs adapted from the work of Newman et al. This model, while having limitations, is the best-fit of existing models, and as we shall see, provides imprecise but descriptive results.

 

3.5. Modeling the social network graph

The Newman-Strogatz-Watts model:

To obtain an expression that describes the typical length of a path, we can consider how many steps we need to go from a node to be able to get to every other node in the graph.
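The notes do not reproduce the resulting expression. As I recall it from the Newman-Strogatz-Watts work on random graphs with arbitrary degree distributions, the estimate is l ≈ ln(N / z1) / ln(z2 / z1) + 1, where N is the number of nodes and z1, z2 are the average numbers of first and second nearest neighbours. The sketch below is written under that assumption; the helper names are hypothetical, and it is not the paper's own derivation.

```python
import math


def neighbour_counts(adj):
    """Average numbers of first (z1) and second (z2) nearest neighbours."""
    z1 = z2 = 0
    for node, first in adj.items():
        second = set()
        for nb in first:
            second |= adj[nb]
        # Second neighbours exclude the node itself and its direct neighbours.
        second -= first
        second.discard(node)
        z1 += len(first)
        z2 += len(second)
    n = len(adj)
    return z1 / n, z2 / n


def nsw_typical_length(n, z1, z2):
    """Assumed NSW estimate: ln(n / z1) / ln(z2 / z1) + 1, valid only for z2 > z1 > 0."""
    if not (z2 > z1 > 0):
        raise ValueError("the estimate needs z2 > z1 > 0")
    return math.log(n / z1) / math.log(z2 / z1) + 1


# Made-up numbers: 1000 nodes, 10 first neighbours and 100 second neighbours on average.
estimate = nsw_typical_length(1000, 10, 100)
print(estimate)

# Neighbour counts on a tiny path graph 0-1-2-3, to show what z1 and z2 measure.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(neighbour_counts(path))
```

The intuition is that the number of nodes within reach grows by roughly a factor of z2 / z1 per step, so the estimate counts how many such steps are needed to cover all N nodes.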

 

3.6. Modeling the recommender graph

Various complicated formulas.

 

3.7. Caveats with the Newman-Strogatz-Watts equations

There are various problems with using the above formulas in a realistic setting (Heath, 2001). First, unlike most results in random graph theory, the formulas do not include any guarantees and/or confidence levels. Second, all the equations above are obtained over the ensemble of random graphs that have the given degree distribution, and hence assume that all such graphs are equally likely. The specificity of the jumping connections construction implies that the G_s and G_r graphs are poor candidates to serve as a typical random instance of a graph.

In addition, the equations utilizing N_P and N_M assume that all vertices are reachable from any starting vertex. This will not be satisfied for very strict jumping constraints.

Finally, the Newman-Strogatz-Watts model is fundamentally more complicated than traditional models of random graphs. It has a potentially infinite set of parameters, does not address the possibility of multiple edges or loops, and, by not fixing the size of the graph, assumes that the same degree distribution sequence applies for all graphs, of all sizes. These observations hint that we cannot hope for more than a qualitative indication of the dependence of the average path length on the jump constraints.

4. Experimental results

4.1. Preliminary investigation

4.2. Experiments

The goal of our experiments is to investigate the effect of the hammock width w on the average characteristic path lengths of the induced G_s social network graph and G_r recommender graph for the above datasets.

Key term: threshold

Once again, the formulas assume significantly less clustering than the actual data. In other words, the higher values of lengths from the actual measurements indicate that there is some source of clustering that is not captured by the degree distribution, and so is not included in the formulas.

 

4.3. Discussion of results

We can make several preliminary observations from the results so far:

1. As Newman et al. point out, the random graph model defined by degree distributions makes strong qualitative predictions of actual lengths, using only local information about the number of first and second nearest neighbors. We have demonstrated that this holds true even for graphs induced by hammock jumps.

 

2. For sufficiently large values of κ, the relationship between hammock width w and average l_pp length follows two distinct phases:

(i) in the first regime, there is a steady increase of l_pp up to a threshold < κ, where only edges are lost;

(ii) in the second phase, nodes are shattered but without much effect on the average l_pp values. This two-phase phenomenon can thus serve as a crucial calibration mechanism for connecting the number of people to be brought together by a recommender system and the average l_pp length.

 

3. Visualizing the overlap in the social network graphs, as hammock width constraints are increased, leads to interesting observations.

Smaller and smaller graphs are created, but with a common core, as illustrated by the ‘clam shells’ in the diagram. In other words, there is not a handful of people or movies who form a cutset for the graph: there is no small group of people or movies whose removal (by a sufficiently strict hammock width) causes the entire graph to be broken into two or more pieces.

 

Our study is the first to investigate this property for graphs induced by hammock jumps. However, we believe this robustness is due to the homogeneous nature of ratings in the movie domain.
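The quantities discussed in observations 2 and 3 (edges lost, people dropping out, and whether the graph stays in one piece) can be tabulated by sweeping the hammock width. The sketch below is only a measurement harness under my own assumptions: two people are linked when they share at least w ratings, the helper names are hypothetical, and the toy data is far too small to reproduce the two-phase behaviour reported in the paper.

```python
from itertools import combinations


def hammock_edges(ratings, w):
    """Assumed hammock jump: link people who share at least w rated artifacts."""
    return {
        (p, q)
        for p, q in combinations(sorted(ratings), 2)
        if len(ratings[p] & ratings[q]) >= w
    }


def largest_component(nodes, edges):
    """Size of the largest connected component (0 for an empty node set)."""
    adj = {n: set() for n in nodes}
    for p, q in edges:
        adj[p].add(q)
        adj[q].add(p)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best


def sweep(ratings, widths):
    """For each w: (w, edge count, people with at least one edge, largest component)."""
    rows = []
    for w in widths:
        edges = hammock_edges(ratings, w)
        connected = {p for edge in edges for p in edge}
        rows.append((w, len(edges), len(connected), largest_component(connected, edges)))
    return rows


# Hypothetical toy data: person -> set of rated movies.
ratings = {
    "ann": {"m1", "m2", "m3"},
    "bob": {"m2", "m3", "m4"},
    "cat": {"m3", "m4", "m5"},
    "dan": {"m5", "m6"},
}

rows = sweep(ratings, [1, 2, 3])
for row in rows:
    print(row)
```

On a real dataset, comparing the third and fourth columns shows whether the surviving people still form a single component as w grows, which is the robustness property described in observation 3.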

 

5. Concluding remarks

This research makes two key contributions.

First, we have shown how algorithms for recommender systems can be studied in terms of the connections they make in a bipartite graph. This view enables a new methodology to conduct experimental analysis and comparison of algorithms. Using this approach, algorithms can be distinguished by the pairs of nodes that are brought together.

 

Second, we have described the application of our framework to a particular form of jump, the hammock jump. We have demonstrated a two-phase phenomenon for the induced social network graph that allows us to connect the minimum rating constraint κ, the hammock width w, the size of the largest component, and the average person-person length l_pp. In particular, the choice of κ determines the phase transition in varying the hammock width w, with κ being an upper bound on w. Once formalized further (see below), this connection will permit tradeoff analysis between choices for the minimum rating constraint, and the strength of jumps as measured by the hammock width.

5.1. Designing a recommendation algorithm

5.2. Future work

We now describe various opportunities for future research involving graph analysis, more expressive forms of jumps, and development of new random graph models.

 

The existence of ratings structures such as the hits-buffs distribution and our ability to exploit them is very crucial to the success of recommender systems.

 

This approach would clearly point us in the direction of trying to find algorithms that use the strongest jump possible for all datasets. However, these algorithms may be computationally intensive and rely on latent variables, while in a particular dataset the jump is actually equivalent to one performed by a much less expensive algorithm. This point is illustrated by the observation in our experimentation that the path lengths in the recommender graph are between 1 and 2, which suggests that algorithms that attempt to find longer paths will end up being equivalent to algorithms that do not. Therefore, the choice of algorithm is strongly dependent on the dataset, and simpler machinery may be more efficient and cost effective than more complex algorithms that would perform better in other contexts.

 

