网上对Page算法讲解的很多,实现代码也很多很杂, 所以为了找到一个更高质量的PageRank算法的实现,
我阅读了Python Networkx库上自带的pagerank方法的源码。部分多余内容我删除了,有兴趣可以直接下这个库查看源码
源码的地址在http://networkx.github.io/download.html
具体的pagerank代码我已经上传到网盘在http://pan.baidu.com/s/1ntOafH3
PageRank算法最主要的地方在于对两个问题的解决,一个是dangling nodes,一个是spider trap
前者是说,没有引用其他网页的链接,用图角度的理解就是出度为0。
后者是说,进入到一个网页或者几个网页,这个单个网页,或者几个网页之间相互引用,这样资源进去后就一直在里面转,出不来了,造成rank sink问题。
对于dangling nodes,我们可以计算他们的PR贡献值,然后均分给所有节点
对于spider trap,需要用心灵漂移的方式去解决。
Networkx库里面,主要有三种计算PageRank的方法(PS为什么是三种,我直观觉得是因为这是三个人写的,因为连某些效果完全相同的初始化语句,三个写的都不一样)
1 pagerank函数
以图结构为基础
通过迭代收敛方法计算PR值(PageRank值)
2 pagerank_numpy函数
将图转换为numpy邻接矩阵,
通过google_matrix和numpy矩阵计算
通过计算最大特征值对应的主特征向量,即为所求PR值
3 pagerank_scipy函数
将图转换为sparse稀疏矩阵
通过迭代收敛方法计算PR值
"""PageRank analysis of graph structure. """ # BSD license. # NetworkX:http://networkx.lanl.gov/ import networkx as nx @not_implemented_for('multigraph') def pagerank(G, alpha=0.85, personalization=None, max_iter=100, tol=1.0e-6, nstart=None, weight='weight', dangling=None): """Return the PageRank of the nodes in the graph. Parameters ----------- G : graph A NetworkX graph. 在PageRank算法里面是有向图 alpha : float, optional 稳定系数, 默认0.85, 心灵漂移teleporting系数,用于解决spider trap问题 personalization: dict, optional 个性化向量,确定在分配中各个节点的权重 格式举例,比如四个点的情况: {1:0.25,2:0.25,3:0.25,4:0.25} 默认个点权重相等,也可以给某个节点多分配些权重,需保证权重和为1. max_iter : integer, optional 最大迭代次数 tol : float, optional 迭代阈值 nstart : dictionary, optional 整个网络各节点PageRank初始值 weight : key, optional 各边权重 dangling: dict, optional 字典存储的是dangling边的信息 key --dangling边的尾节点,也就是dangling node节点 value --dangling边的权重 PR值按多大程度将资源分配给dangling node是根据personalization向量分配的 This must be selected to result in an irreducible transition matrix (see notes under google_matrix). It may be common to have the dangling dict to be the same as the personalization dict. Notes ----- 特征值计算是通过迭代方法进行的,不能保证收敛,当超过最大迭代次数时,还不能减小到阈值内,就会报错 """ #步骤一:图结构的准备-------------------------------------------------------------------------------- if len(G) == 0: return {} if not G.is_directed(): D = G.to_directed() else: D = G # Create a copy in (right) stochastic form W = nx.stochastic_graph(D, weight=weight) N = W.number_of_nodes() # 确定PR向量的初值 if nstart is None: x = dict.fromkeys(W, 1.0 / N) #和为1 else: # Normalized nstart vector s = float(sum(nstart.values())) x = dict((k, v / s) for k, v in nstart.items()) if personalization is None: # Assign uniform personalization vector if not given p = dict.fromkeys(W, 1.0 / N) else: missing = set(G) - set(personalization) if missing: raise NetworkXError('Personalization dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) s = float(sum(personalization.values())) p = dict((k, v / s) for k, v in personalization.items()) #归一化处理 if dangling is None: # Use personalization vector if dangling vector not specified dangling_weights = p else: missing = set(G) - set(dangling) if missing: raise NetworkXError('Dangling node dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) s = float(sum(dangling.values())) dangling_weights = dict((k, v/s) for k, v in dangling.items()) dangling_nodes = [n for n in W if W.out_degree(n, weight=weight) == 0.0] #dangling_nodes dangling节点 #danglesum dangling节点PR总值 #dangling初始化 默认为personalization #dangling_weights 根据dangling而生成,决定dangling node资源如何分配给全局的矩阵 # 迭代计算-------------------------------------------------------------------- #PR=alpha*(A*PR+dangling分配)+(1-alpha)*平均分配 #也就是三部分,A*PR其实是我们用图矩阵分配的,dangling分配则是对dangling node的PR值进行分配,(1-alpha)分配则是天下为公大家一人一份分配的 #其实通俗的来说,我们可以将PageRank看成抢夺大赛,有三种抢夺机制。 #1,A*PR这种是自由分配,大家都愿意参与竞争交流的分配 #2,dangling是强制分配,有点类似打倒土豪分田地的感觉,你不参与自由市场,那好,我们就特地帮你强制分。 #3,平均分配,其实就是有个机会大家实现共产主义了,不让spider trap这种产生rank sink的节点捞太多油水,其实客观上也是在帮dangling分配。 #从图和矩阵的角度来说,可以这样理解,我们这个矩阵可以看出是个有向图 #矩阵要收敛-->矩阵有唯一解-->n阶方阵对应有向图是强连通的-->两个节点相互可达,1能到2,2能到1 #如果是个强连通图,就是我们上面说的第1种情况,自由竞争,那么我们可以确定是收敛的 #不然就会有spider trap造成rank sink问题 for _ in range(max_iter): xlast = x x = dict.fromkeys(xlast.keys(), 0) #x初值 danglesum = alpha * sum(xlast[n] for n in dangling_nodes) #第2部分:计算dangling_nodes的PR总值 for n in x: for nbr in W[n]: x[nbr] += alpha * xlast[n] * W[n][nbr][weight] #第1部分:将节点n的PR资源分配给各个节点,循环之 for n in x: x[n] += danglesum * dangling_weights[n] + (1.0 - alpha) * p[n] #第3部分:节点n加上dangling nodes和均分的值 # 迭代检查 err = sum([abs(x[n] - xlast[n]) for n in x]) if err < N*tol: return x raise NetworkXError('pagerank: power iteration failed to converge ' 'in %d iterations.' % max_iter) def google_matrix(G, alpha=0.85, personalization=None, nodelist=None, weight='weight', dangling=None): """Return the Google matrix of the graph. Parameters ----------- 和PageRank参数表相似 Returns ------- A : NumPy matrix Google matrix of the graph Notes ----- 这个方法返回的“谷歌矩阵” 描述了PageRank用到的马尔科夫链。 PageRank需要收敛的话,就需要方程有唯一解,这样过度矩阵必须是不可約的。 用图的角度讲,就是说两个点之间必须相互可达,节点1能到节点2,节点2也有路径到1。 不然的话,会出现spider trap,导致rank sink问题出现,造成矩阵无法收敛,集中到sink节点了。 The matrix returned represents the transition matrix that describes the Markov chain used in PageRank. For PageRank to converge to a unique solution (i.e., a unique stationary distribution in a Markov chain), the transition matrix must be irreducible. In other words, it must be that there exists a path between every pair of nodes in the graph, or else there is the potential of "rank sinks." """ import numpy as np if nodelist is None: nodelist = G.nodes() M = nx.to_numpy_matrix(G, nodelist=nodelist, weight=weight) N = len(G) if N == 0: return M # Personalization vector if personalization is None: p = np.repeat(1.0 / N, N) else: missing = set(nodelist) - set(personalization) if missing: raise NetworkXError('Personalization vector dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) p = np.array([personalization[n] for n in nodelist], dtype=float) p /= p.sum() # Dangling nodes if dangling is None: dangling_weights = p else: missing = set(nodelist) - set(dangling) if missing: raise NetworkXError('Dangling node dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) # 确定dangling node的PR资源的分配矩阵 dangling_weights = np.array([dangling[n] for n in nodelist],dtype=float) dangling_weights /= dangling_weights.sum() dangling_nodes = np.where(M.sum(axis=1) == 0)[0] #计算行和为0,也就是计算出度为0的节点,就是dangling nodes # Assign dangling_weights to any dangling nodes (nodes with no out links) for node in dangling_nodes: M[node] = dangling_weights #将这个的出度直接设成平均分配 M /= M.sum(axis=1) # 归一化 #这个时候M矩阵已经解决了dangling node的问题,再通过1-alpha解决 return alpha * M + (1 - alpha) * np.outer(np.ones(N), p) def pagerank_numpy(G, alpha=0.85, personalization=None, weight='weight', dangling=None): """ 备注 特征向量计算,是利用了Numpy在LAPACK库的方法进行的,对小型图的计算非常快速精确 """ import numpy as np if len(G) == 0: return {} M = google_matrix(G, alpha, personalization=personalization, weight=weight, dangling=dangling) # use numpy LAPACK solver eigenvalues, eigenvectors = np.linalg.eig(M.T) ind = eigenvalues.argsort() # eigenvector of largest eigenvalue at ind[-1], normalized largest = np.array(eigenvectors[:, ind[-1]]).flatten().real norm = float(largest.sum()) return dict(zip(G, map(float, largest / norm))) def pagerank_scipy(G, alpha=0.85, personalization=None, max_iter=100, tol=1.0e-6, weight='weight', dangling=None): """ ----- 该方法应用的是SciPy库,实现方法和第一个:pagerank函数是基本一致的 ----- 备注 特征向量计算,是利用了SciPy库稀疏矩阵计算 """ import scipy.sparse N = len(G) if N == 0: return {} nodelist = G.nodes() M = nx.to_scipy_sparse_matrix(G, nodelist=nodelist, weight=weight, dtype=float) S = scipy.array(M.sum(axis=1)).flatten() S[S != 0] = 1.0 / S[S != 0] Q = scipy.sparse.spdiags(S.T, 0, *M.shape, format='csr') M = Q * M # 初始化PageRank值 x = scipy.repeat(1.0 / N, N) # 确定personalization字典,各节点分配权重,默认权重相同 if personalization is None: p = scipy.repeat(1.0 / N, N) else: missing = set(nodelist) - set(personalization) if missing: raise NetworkXError('Personalization vector dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) p = scipy.array([personalization[n] for n in nodelist], dtype=float) p = p / p.sum() # Dangling nodes的确定和资源配置 if dangling is None: dangling_weights = p else: missing = set(nodelist) - set(dangling) if missing: raise NetworkXError('Dangling node dictionary ' 'must have a value for every node. ' 'Missing nodes %s' % missing) # Convert the dangling dictionary into an array in nodelist order dangling_weights = scipy.array([dangling[n] for n in nodelist], dtype=float) dangling_weights /= dangling_weights.sum() is_dangling = scipy.where(S == 0)[0] # 迭代 #第1部分:x*M, 第2部分dangling node分配, 第3部分rank sink解决 for _ in range(max_iter): xlast = x x = alpha * (x * M + sum(x[is_dangling]) * dangling_weights) + (1 - alpha) * p # check convergence, l1 norm err = scipy.absolute(x - xlast).sum() if err < N * tol: return dict(zip(nodelist, map(float, x))) raise NetworkXError('pagerank_scipy: power iteration failed to converge ' 'in %d iterations.' % max_iter)