Mining Massive Datasets (MMDs), Jure Leskovec's course: study notes on link analysis and the PageRank algorithm
{The Analysis of Large Graphs}
Social networks (Facebook social graph, Twitter)
Social media networks (connections between political blogs)
Information networks (citation networks between journals, maps of science)
Communication networks (the Internet)
Technological networks (Seven Bridges of Königsberg, power grids, road networks, water distribution networks)
In 1996, Yahoo's approach was to sort all web pages by hand into a set of categories. The big question on the web is: which web pages should we trust?
Unlike PageRank, the idea behind Hubs and Authorities is that there are two types of web pages: the web graph contains pages called hubs and pages called authorities.
A page in the web graph is as important as the number of links it receives.
1. In-coming links? Out-going links?
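2. Are all in-links equal?
Links from important pages carry more weight. For example, a link from a web page that itself receives many in-links is more important than a link from a web page that receives only very few in-links.
Node C has a relatively high PageRank score even though it receives only one incoming link, because that link comes from node B, which is itself quite important.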
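Each node, say node j, collects votes from its in-links and splits its own vote equally among its out-links. The idea is that PageRank score flows through the network; this is the flow formulation (flow model) of PageRank.
Importance score (rank): assign each node j an importance score r_j. The importance of node j is the sum, over all nodes i that point to it, of the importance of i divided by the out-degree of i. (See the formula in the figure.)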
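The figure with the formula is not reproduced in these notes, so here is the flow equation described above, written out. The symbol d_i for the out-degree of node i is notation assumed for this sketch.

```latex
r_j = \sum_{i \to j} \frac{r_i}{d_i}
```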
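The system of three equations in three unknowns has infinitely many solutions, because the flow equations are linearly dependent. What these solutions have in common is that they are all equivalent up to a scaling factor.
Adding a normalization constraint (the scores sum to 1) makes the solution unique, and the equations can then be solved by Gaussian elimination. This is only practical for small graphs.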
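As a minimal sketch of the Gaussian-elimination route, the block below solves the flow equations exactly for a three-node graph. The graph (y links to y and a, a links to y and m, m links to a) and the helper name `solve_linear` are assumptions made for illustration; they are not taken from the notes.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact fractions."""
    n = len(A)
    # Build the augmented matrix [A | b].
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)]
           for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot in this column and swap it up.
        pivot = next(k for k in range(col, n) if aug[k][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so that the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for k in range(n):
            if k != col and aug[k][col] != 0:
                f = aug[k][col]
                aug[k] = [vk - f * vc for vk, vc in zip(aug[k], aug[col])]
    return [row[n] for row in aug]


# Column-stochastic matrix of the assumed graph; order of nodes: y, a, m.
half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, 1],
     [0, half, 0]]
n = len(M)

# Flow equations r = M r, rewritten as (M - I) r = 0.
A = [[M[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
b = [0] * n
# The equations are linearly dependent, so replace the last one
# with the normalization constraint r_y + r_a + r_m = 1.
A[-1] = [1] * n
b[-1] = 1

r = solve_linear(A, b)
print(r)
```

The cost of elimination grows cubically with the number of nodes, which is why this approach only suits small graphs.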
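Once this connection is made (the flow equations can be written as r = M·r), we no longer solve a system of equations; instead we look for an eigenvector of the matrix M (the principal eigenvector: the eigenvector corresponding to eigenvalue 1).
This eigenvector is usually computed by power iteration.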
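Power iteration can be sketched in a few lines of plain Python. The example graph (y, a, m) and the function names are assumptions for illustration; the stopping rule uses the L1 distance between successive iterates.

```python
def mat_vec(M, r):
    """Multiply matrix M (list of rows) by vector r."""
    return [sum(m_ji * r_i for m_ji, r_i in zip(row, r)) for row in M]


def power_iteration(M, epsilon=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is small."""
    n = len(M)
    r = [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        r_new = mat_vec(M, r)
        if sum(abs(x - y) for x, y in zip(r_new, r)) < epsilon:
            return r_new, iteration
        r = r_new
    return r, max_iter


# Assumed graph: y -> {y, a}, a -> {y, m}, m -> {a}; columns are source nodes.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r, iterations = power_iteration(M)
print(r, iterations)
```

For this graph the iterates approach (0.4, 0.4, 0.2), the same vector that solving the normalized flow equations gives.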
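1. Norm:
Given a vector x = (x1, x2, ..., xn):
L1 norm: the sum of the absolute values of the elements.
L2 norm: the square root of the sum of the squares of the elements, also called the Euclidean norm.
Lp norm: the sum of the p-th powers of the absolute values of the elements, raised to the power 1/p.
L∞ norm: the largest absolute value among the elements.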
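The norm definitions above translate directly into code. This is a small sketch with hypothetical helper names `lp_norm` and `linf_norm`; the L1 norm is the one used as the stopping criterion in the power iteration.

```python
def lp_norm(x, p):
    """Lp norm: (sum of |x_i|^p) raised to the power 1/p."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def linf_norm(x):
    """L-infinity norm: the largest absolute value among the elements."""
    return max(abs(v) for v in x)


x = [3, -4]
l1 = lp_norm(x, 1)    # |3| + |-4|
l2 = lp_norm(x, 2)    # sqrt(3^2 + 4^2)
linf = linf_norm(x)   # max(|3|, |-4|)
print(l1, l2, linf)
```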
{Random Walk Interpretation & The Stationary Distribution: explains what PageRank scores mean. PageRank scores are equivalent to the probability distribution of a random walker on the graph.}
1. P(t) is the probability distribution over all nodes of the graph at a given time t. The value at each node is the probability that the random walker is at that node at time t.
2. Whatever the probability of the random walker being at a given node i, the walker now picks one of the out-links of i, pointing to some node j. Put simply, r(t) in the power iteration can be read as the probability distribution over nodes at time t.
3. The PageRank score is the probability that a random surfer who walks the web graph for an infinitely long time resides at the given node at some given time t. This is the random walk interpretation of PageRank: the score (rank) of a node is the probability that the random walker is at that node at some fixed time t.
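Random walks are formally Markov processes, more precisely first-order Markov processes. So if you know about stationary distributions of Markov processes, you know the conditions for stationarity and why this iteration reaches the stationary distribution.
Conditions for a stationary distribution: certain conditions on the structure of the matrix M, namely that the chain is aperiodic and that its states are connected, i.e. irreducible. (The note-taker's summary.)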
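The random-walk interpretation can be checked empirically: simulate a long walk and count how often each node is visited. The sketch below assumes a small example graph (y, a, m) and a fixed random seed; both are illustrative choices, not part of the notes.

```python
import random


def simulate_walk(out_links, steps, seed=0):
    """Walk the graph for `steps` steps and return visit frequencies."""
    rng = random.Random(seed)
    visits = {node: 0 for node in out_links}
    node = next(iter(out_links))  # start at the first node
    for _ in range(steps):
        # Follow one of the current node's out-links uniformly at random.
        node = rng.choice(out_links[node])
        visits[node] += 1
    return {node: count / steps for node, count in visits.items()}


# Assumed graph: irreducible and aperiodic (y has a self-loop).
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_walk(out_links, steps=200_000)
print(freq)
```

The visit frequencies should be close to the stationary distribution (0.4, 0.4, 0.2) of this graph, up to sampling noise.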
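{Solving the non-convergence of power iteration, i.e. the existence and uniqueness of the importance distribution over the nodes of the web graph: certain conditions that the matrix M has to satisfy in order for PageRank to exist and to be unique.}
Note: the problems below are easily explained by the remarks on Markov chain convergence above. Spider traps and dead ends both mean, in essence, that some pairs of states in the transition matrix are not mutually reachable, so the convergence theorem for Markov chains does not apply. (The two-node case cannot be explained this way, however; see example 1.1.)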
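Non-convergence: the “spider trap” problem. The probability weight of the trap node keeps increasing.
In this case m is a spider trap because its only out-link is a self-loop. Once the random walker reaches this node there is no way out; the walker stays on m forever, so the probability of m goes to 1.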
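An undesired result: the “dead end” problem. Dead-end nodes keep leaking PageRank score.
In the example with nodes a and b, the score cannot be passed on to any other node, so it is lost, and the iteration eventually converges to the zero vector.
The matrix M is no longer stochastic: the column of a dead end no longer sums to 1 but is the zero vector.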
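Problem 1: dead ends are pages with no outgoing links, i.e. nodes with out-degree zero. Their importance cannot be propagated onward, so it leaks out.
Problem 2: spider traps, where the out-links of a set of pages all stay within a small group. When the random walker enters this group it gets stuck. In the end, the pages in that part of the graph get very high weight, and every other page gets very low weight.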
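Both problems are easy to reproduce with plain power iteration (no teleport). The two matrices below are assumed three-node examples (y, a, m) chosen for illustration: in the first, m links only to itself; in the second, m has no out-links.

```python
def iterate(M, steps):
    """Apply r <- M r `steps` times, starting from the uniform vector."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r


# Columns are source nodes (y, a, m); rows are destination nodes.
# Spider trap: the only out-link of m is a self-loop.
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]

# Dead end: m has no out-links, so its column is all zeros.
M_dead = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 0.0]]

r_trap = iterate(M_trap, 100)
r_dead = iterate(M_dead, 100)
print("spider trap:", r_trap)
print("dead end:", r_dead, "total score left:", sum(r_dead))
```

With the spider trap, almost all score ends up on m; with the dead end, the total score drains towards zero.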
Change the way the random walk proceeds: use random teleports, i.e. give every node a small probability of jumping to a random node.
Always Teleport
Note: P_ij is the probability that, if we are in state i, we transition to state j in a given time step.
Note: in other words, the state transition matrix P (the matrix M in the PageRank formula) must be stochastic, irreducible, and aperiodic. Teleporting is what makes M satisfy these three conditions: adding random teleports gives us a stochastic transition matrix that has these properties.
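{Fixing dead ends?}
1. A matrix is (column) stochastic if each of its columns sums to 1.
2. Note that here A and M are matrices, while a and e are column vectors.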
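1. Periodic: the interval between visits to the same state (node) is always a multiple of some k > 1.
2. The random walk here is deterministic, and every two steps we return to the same node.
3. Adding teleports (the green lines in the figure) breaks the periodicity.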
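Periodicity and its fix can be seen on the smallest possible example, a two-node cycle a -> b -> a. This graph and the helper `step` are assumptions for illustration. Without teleport (beta = 1) the distribution oscillates forever; with beta = 0.85 it settles.

```python
def step(M, r, beta):
    """One update r <- beta * M r + (1 - beta) / n, assuming sum(r) == 1."""
    n = len(r)
    return [beta * sum(M[j][i] * r[i] for i in range(n)) + (1 - beta) / n
            for j in range(n)]


# Two-node cycle: a -> b and b -> a (period 2).
M_cycle = [[0, 1],
           [1, 0]]

# Without teleport: the walker's position flips on every step.
r = [1.0, 0.0]
history = []
for _ in range(4):
    r = step(M_cycle, r, beta=1.0)
    history.append(r)
print(history)

# With teleport: the same start converges to the uniform distribution.
r_tp = [1.0, 0.0]
for _ in range(200):
    r_tp = step(M_cycle, r_tp, beta=0.85)
print(r_tp)
```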
1. Irreducible also means that we can never get stuck in a given state.
2. To make the given graph irreducible we add all the other possible links, which basically means adding random jumps.
{Assuming there are no dead ends?}
1. Interpretation of the PageRank equation: the importance of node j is, first, the sum of the importances of all nodes i that point to it, each divided by the out-degree of i, which is the probability that the random walker actually traverses the link towards j. This only happens with probability beta, because the random walker at node i has to decide to actually follow a link, and does so with probability beta. Second, with probability 1 - beta the random walker decides to jump, and then lands at a given node j with probability 1/N, where N is the number of nodes in the network.
2. The first part of the formula is the random walk term, weighted by beta; the second part is the random jump term, weighted by (1 - beta).
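The two terms can be combined into a single matrix A = beta * M + (1 - beta) / N (the Google matrix) and fed to power iteration. The sketch below assumes a three-node graph with a spider trap at m and beta = 0.8; both are illustrative choices.

```python
def google_matrix(M, beta):
    """A[j][i] = beta * M[j][i] + (1 - beta) / N for every entry."""
    n = len(M)
    return [[beta * M[j][i] + (1 - beta) / n for i in range(n)]
            for j in range(n)]


def power_iterate(A, steps):
    """Apply r <- A r `steps` times, starting from the uniform vector."""
    n = len(A)
    r = [1.0 / n] * n
    for _ in range(steps):
        r = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r


# Assumed graph (columns y, a, m): m is a spider trap with a self-loop only.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]

A = google_matrix(M, beta=0.8)
column_sums = [sum(A[j][i] for j in range(3)) for i in range(3)]
r = power_iterate(A, 200)
print(column_sums)
print(r)
```

Every column of A still sums to 1, and the trap node m no longer absorbs all of the score: the result is (7/33, 5/33, 21/33) rather than (0, 0, 1).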
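1. beta is usually set to about 0.85, which means the walker makes a random jump roughly once every five to seven steps (0.85 : 0.15 ≈ 5.7 : 1; on average one jump every 1/0.15 ≈ 6.7 steps).
2. If dead ends are taken into account: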
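Either treat beta as a matrix and set the beta entries for the all-zero (dead-end) columns of M to 0, so that a random walker arriving at a dead end always makes a random jump (teleport); see the green text in the figure for the section "Back to Google's solution: Random Jump / Teleport" above.
Or accept that some score leaks each time a new r is computed, so that r no longer sums to 1, and renormalize r after every iteration so that it sums to 1 again; see the green text in the "Sparse Matrix Formulation" figure below. This is the method actually used. The details are handled in the "complete PageRank algorithm" part, together with spider traps; until then, M is assumed to have no dead ends.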
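The first option (always teleport out of a dead end) amounts to replacing every all-zero column of M with a uniform column 1/N. A minimal sketch, with the hypothetical helper name `teleport_from_dead_ends` and an assumed three-node graph in which m is a dead end:

```python
def teleport_from_dead_ends(M):
    """Replace each all-zero column of M with the uniform column 1/N."""
    n = len(M)
    # Column i is a dead end if no entry in it is non-zero.
    dead = [all(M[j][i] == 0 for j in range(n)) for i in range(n)]
    return [[1.0 / n if dead[i] else M[j][i] for i in range(n)]
            for j in range(n)]


# Assumed graph (columns y, a, m): m has no out-links.
M_dead = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 0.0]]

M_fixed = teleport_from_dead_ends(M_dead)
column_sums = [sum(M_fixed[j][i] for j in range(3)) for i in range(3)]
print(M_fixed)
print(column_sums)
```

After the fix every column sums to 1 again, so the matrix is stochastic. The drawback is that the fixed matrix is denser than M, which is one reason the renormalization approach is preferred for large graphs.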
Note: the matrix M is still stochastic; the only problem is that node m is a spider trap. Even so, the improved Google matrix solves the problem well.
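{How to compute PageRank scores for a graph without loading all of it into memory: how do we compute it for graphs that don't even fit into the main memory of a machine?}
Storing r takes 2*4*10^9 bytes = 8 GB (two vectors, old and new, each with 10^9 entries of 4 bytes); storing A takes 4*(10^9)^2 bytes = 4000 PB, which is far too large to be feasible.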
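The storage estimate is plain arithmetic and can be reproduced as follows. The interpretation (N = 10^9 pages, 4 bytes per entry, two copies of r) is read off the formulas above; the variable names are illustrative.

```python
N = 10**9              # number of pages
BYTES_PER_ENTRY = 4    # one 4-byte number per entry

# Two rank vectors (old and new), N entries each.
r_bytes = 2 * BYTES_PER_ENTRY * N
# The dense matrix A has N * N entries.
A_bytes = BYTES_PER_ENTRY * N**2

print(r_bytes / 10**9, "GB")
print(A_bytes / 10**15, "PB")
```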
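There is really no need for such a derivation: substituting A = beta*M + (1-beta)/N (the second term being an N×N matrix with every entry equal to (1-beta)/N) directly into r = A·r gives the same result.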
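Written out, the substitution looks like this. The index convention (M_{ji} is the entry for the link from i to j) and the use of \sum_i r_i = 1 are the assumptions of this sketch; the latter holds when M has no dead ends.

```latex
r_j = \sum_{i=1}^{N} A_{ji}\, r_i
    = \sum_{i=1}^{N} \left[ \beta M_{ji} + \frac{1-\beta}{N} \right] r_i
    = \beta \sum_{i=1}^{N} M_{ji}\, r_i + \frac{1-\beta}{N} \sum_{i=1}^{N} r_i
    = \beta \sum_{i=1}^{N} M_{ji}\, r_i + \frac{1-\beta}{N}
```

In vector form this reads r = \beta M r + [(1-\beta)/N]_N, so only the sparse matrix M has to be stored, never the dense matrix A.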
Here we first assume that M has no dead ends; the handling of an M with dead ends is given in the complete algorithm below.
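This final algorithm handles both spider traps and dead ends in the graph.
To account for dead ends, the leaked score is added back at the end of the computation of r: with S the sum of the newly computed scores, 1 - S is the total leaked score, which is spread evenly over every r_j.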
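A compact sketch of this complete algorithm, using adjacency lists instead of a matrix so that only the existing links are stored. The function name, the default parameters, and the example graph (with c as a dead end) are assumptions for illustration; every link target must also appear as a key of `out_links`.

```python
def pagerank(out_links, beta=0.8, epsilon=1e-12, max_iter=1000):
    """PageRank with teleport; leaked score (1 - S) is spread evenly."""
    nodes = list(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Random-walk part: each node i sends beta * r_i / d_i along each link.
        r_new = {v: 0.0 for v in nodes}
        for i, targets in out_links.items():
            for j in targets:
                r_new[j] += beta * r[i] / len(targets)
        # S = sum of the new scores; 1 - S covers both the teleport share
        # and whatever leaked through dead ends.
        leaked = 1.0 - sum(r_new.values())
        for v in nodes:
            r_new[v] += leaked / n
        diff = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if diff < epsilon:
            break
    return r


# Assumed graph: a -> {b, c}, b -> {c}, c is a dead end.
graph = {"a": ["b", "c"], "b": ["c"], "c": []}
scores = pagerank(graph)
print(scores)
```

Because the leaked amount is recomputed from S on every pass, the scores always sum to 1, and no special case for dead ends is needed in the inner loop.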
Note: there is a small trick to this question: when computing by hand you find that a is always 0.1*3, so b + c = 1*3 - 0.3 = 2.7.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
__title__ = ''
__author__ = '皮'
__mtime__ = '9/25/2015-025'
__email__ = '[email protected]'
"""
from math import e

import numpy as np


def pageRank(M, r, beta, epsilon, flag=False):
    """Power iteration with teleport: r_new = beta * M.r + (1 - beta) / N."""
    it_count = 0
    N = r.size
    while True:
        it_count += 1
        r_new = beta * np.dot(M, r) + (1 - beta) / N
        if flag and (it_count == 4 or it_count == 5):
            print('After %s iterations: %s' % (it_count, r_new))
        # Stop when the L1 distance between successive iterates is below epsilon.
        if sum(abs(r - r_new)) < epsilon:
            break
        r = r_new
    return r, it_count


def question1():
    M = np.array([[0, 0, 0],
                  [0.5, 0, 0],
                  [0.5, 1, 1]])
    print(M)
    r = np.array([1, 1, 1]).T
    print(r)
    beta = 0.7
    print('.' * 50)
    # Note: pow(e, -6) is e^-6 (about 2.5e-3), not 1e-6.
    r, it_count = pageRank(M, r, beta=beta, epsilon=pow(e, -6))
    # Scale by 3 because the question asks for scores that sum to 3.
    print("%s\n%s iterations" % (r * 3, it_count))


def question2():
    M = np.array([[0, 0, 1],
                  [0.5, 0, 0],
                  [0.5, 1, 0]])
    print(M)
    r = np.array([1 / 3, 1 / 3, 1 / 3]).T
    print(r)
    beta = 0.85
    print('.' * 50)
    r, it_count = pageRank(M, r, beta=beta, epsilon=pow(e, -6))
    print("%s\n%s iterations" % (r, it_count))

    # Check which of the candidate identities holds for the computed scores.
    a, b, c = r
    epsilon = pow(e, -6)
    if abs(0.95 * c - (0.9 * b + 0.475 * a)) < epsilon:
        print('True1')
    if abs(c - (0.9 * b + 0.475 * a)) < epsilon:
        print('True2')
    if abs(0.85 * a - (c + 0.15 * b)) < epsilon:
        print('True3')
    if abs(c - (b + 0.575 * a)) < epsilon:
        print('True4')


def question3():
    M = np.array([[0, 0, 1],
                  [0.5, 0, 0],
                  [0.5, 1, 0]])
    print(M)
    r = np.array([1, 1, 1]).T
    print(r)
    beta = 1  # no teleport
    print('.' * 50)
    # flag=True prints the iterates after the 4th and 5th iterations.
    r, it_count = pageRank(M, r, beta=beta, epsilon=pow(e, -6), flag=True)
    print("%s\n%s iterations" % (r, it_count))


question1()
print('*' * 50, '\n\n')
question2()
print('*' * 50, '\n\n')
question3()
