目录
一、基本概念
1.1 背景介绍
1.2 算法中心思想
二、算法和公式
2.1 PageRank公式
2.2 矩阵化表达:使用转移概率矩阵/马尔科夫矩阵
2.3 通过矩阵化表达,快速计算 PR 值
2.4 两种方式的关系
三、Dead Ends 问题
3.1 Dead Ends 的产生
3.2 解决方法:Teleport
3.3 Dead Ends 问题修正公式
四、Spider Traps 问题
4.1 Spider Traps 的产生
4.2 解决方法
4.3 Spider Traps 问题修正公式
五、代码实战
六、PageRank 优缺点
PageRank 算法由 Google 创始人 Larry Page 在斯坦福读大学时提出,又称 PR,佩奇排名。主要针对网页进行排名,计算网站的重要性,优化搜索引擎的搜索结果。PR 值是表示其重要性的因子。
1. 数量假设
当在网页模型图中,一个网页接受到的其他网页指向的入链(in-links)越多,说明该网页越重要。
2. 质量假设
当一个质量高的网页指向(out-links)一个网页,说明这个被指的网页重要。
3. 出链入链
举例:
初始化的PR值为 1/N = 1/4 。
从A将跳转到 B 或 C 的概率为 1/2。
从D将跳转到 A 的概率为 1。(矩阵的列表示出链)
(第一次迭代得到的 PR 值)
B没有任何出链(out-links)这就是 Dead Ends,Dead Ends 会导致网站权重变为 0。
举例:
使用转移概率矩阵快速计算PR值:
按照这个规律,我们在多次循环之后,会发现这个模型中所有的 PR 值都会归于 0。
修正M:
A 节点与其他节点之间无 out-links,这就是 Spider Traps,这将会导致网站权重变为向一个节点偏移。
举例:
按照这个规律,我们在多次循环之后,会发现这个模型中 A 的 PR 值都会归于 1,其他归为 0。
修正M:
# 导包
import networkx as nx
import matplotlib.pyplot as plt
import random
Graph = nx.DiGraph()
Graph.add_nodes_from(range(0, 100))
for i in range(100):
j = random.randint(0, 100)
k = random.randint(0, 100)
Graph.add_edge(k, j)
# 绘图
nx.draw(Graph, with_labels=True)
plt.show()
# 打印全部点的 PR 值
pr = nx.pagerank(Graph, max_iter=100, alpha=0.01)
print(pr)
{0: 0.009937991017511711, 1: 0.009913062905888668, 2: 0.009888464827268925, 3: 0.009888712352021399, 4: 0.009839268670029438, 5: 0.009987022158249549, 6: 0.009839268670029438, 7: 0.009839268670029438, 8: 0.00988887736852305, 9: 0.009839268670029438, 10: 0.009872066108189095, 11: 0.009839268670029438, 12: 0.009839268670029438, 13: 0.009938898608270786, 14: 0.010036548348492335, 15: 0.009839268670029438, 16: 0.009839268670029438, 17: 0.009872066108189095, 18: 0.009839268670029438, 19: 0.009839268670029438, 20: 0.009839268670029438, 21: 0.009889207401526351, 22: 0.009889207401526351, 23: 0.009938403558765836, 24: 0.009888629843770575, 25: 0.009839268670029438, 26: 0.009904946054599578, 27: 0.009839268670029438, 28: 0.010134940662971308, 29: 0.009888464827268925, 30: 0.009839268670029438, 31: 0.009839268670029438, 32: 0.010036713364993984, 33: 0.009839268670029438, 34: 0.010003750910332676, 35: 0.010012940368882492, 36: 0.009839268670029438, 37: 0.009921262265428582, 38: 0.010036878381495635, 39: 0.009839268670029438, 40: 0.009889372418028, 41: 0.009888629843770575, 42: 0.009872066108189095, 43: 0.009839268670029438, 44: 0.009938156034013362, 45: 0.009839268670029438, 46: 0.009839268670029438, 47: 0.009839268670029438, 48: 0.009889083639150113, 49: 0.009889372418028, 50: 0.01006934578665199, 51: 0.009839268670029438, 52: 0.009989167372770998, 53: 0.010003338369078551, 54: 0.009839268670029438, 55: 0.009839268670029438, 56: 0.009889083639150113, 57: 0.009839268670029438, 58: 0.009888464827268925, 59: 0.009970458422668069, 60: 0.010036053298987385, 61: 0.009921757314933532, 62: 0.009839268670029438, 63: 0.009938156034013362, 64: 0.009989332389272649, 65: 0.009839268670029438, 66: 0.009839268670029438, 67: 0.009863866748649181, 68: 0.009839268670029438, 69: 0.009839268670029438, 70: 0.010039188612518738, 71: 0.009839268670029438, 72: 0.009863866748649181, 73: 0.009872231124690746, 74: 0.01001990210466003, 75: 0.009839268670029438, 76: 0.009839268670029438, 77: 0.009962754112633105, 78: 0.009937660984508411, 79: 0.010012692844130016, 80: 0.009839268670029438, 81: 0.009839268670029438, 82: 0.009839268670029438, 83: 0.009888712352021399, 84: 0.009839268670029438, 85: 0.009888712352021399, 86: 0.009839268670029438, 87: 0.009986857141747896, 88: 0.009973016178443645, 89: 0.010095594030288237, 90: 0.009839268670029438, 91: 0.009971448521677969, 92: 0.009938651083518312, 93: 0.009863866748649181, 94: 0.009987847240757798, 95: 0.009839268670029438, 96: 0.009888464827268925, 97: 0.009922417380940133, 98: 0.009971448521677969, 99: 0.009872231124690746, 100: 0.009937660984508411}
# 最大的 PR 值
print(max(pr.values()))
0.010134940662971308
import operator
# 最大 PR 值的点
print(max(pr.items(), key=operator.itemgetter(1))[0])
# PR 值之和
print(sum(pr.values()))
28 1.0000000000000007
PageRank 优点
1. 通过网页之间的链接来决定网页的重要性,一定程度消除了人为对排名的影响。
2. 离线计算 PageRank 值,而非查找的时候计算,提升了查询的效率。
PageRank 缺点
1. 存在时间就网站,PageRank 值会越来越大,而新生的网站,PageRank 值增长慢。
2. 非查询相关的特性,查询结果会偏离搜索内容。
3. 通过“僵尸”网站或链接,人为刷 PageRank 值。