----------------- Bringing Order to the Web
A search engine indexes pages from all over the web, but among so many pages, which ones are actually high quality? A user looking for information certainly hopes the results are pages with substantial content, not pages stuffed with ads. So how do we rank hundreds of millions of pages by importance? PageRank solves this problem well.
The whole PageRank algorithm is built on a single assumption:

A page that many other sites link to is very likely a better page.
This assumption matches everyday experience. Take a popular CSDN blog: the author writes excellent articles, many people read and follow them, and many repost those articles to their own blogs, producing many links pointing back to the author's homepage. In this sense, when a site is pointed to by many other sites, we have good reason to believe it is a better site.
F(u): the set of pages that page u points to
B(u): the set of pages that link to page u
c: a normalization factor
num(u): the size of F(u), i.e. the number of out-links of u
R(u): the rank value measuring the importance of page u
The simple version is very easy to implement. From PageRank's core assumption, a page's rank is determined by its B(u) set, so we can define it recursively:

R(u) = c * Σ_{v ∈ B(u)} R(v) / num(v)

Iterate this formula repeatedly until R converges.
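As a minimal sketch of this simple iteration (on a small hypothetical 4-page graph, using NumPy in place of hand-rolled matrix code):

```python
import numpy as np

# A hypothetical 4-page link graph: links[j] lists the pages page j points to.
links = {0: [1, 2], 1: [2], 2: [0, 1], 3: [2]}
n = 4

# Build the column-stochastic matrix M, where M[i, j] = 1/num(j) if j links to i.
M = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

# Iterate R <- c * M R until convergence; c here just renormalizes R to sum to 1.
R = np.full(n, 1.0 / n)
for _ in range(100):
    R_new = M @ R
    R_new /= R_new.sum()          # normalization factor c
    done = np.abs(R_new - R).sum() < 1e-9
    R = R_new
    if done:
        break

print(R)  # page 2, which collects links from three pages, ends with the largest rank
```

Note that page 3 has no in-links, so its rank drains to zero; the next section's escape term fixes exactly this kind of degenerate case.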
E(u) is the escape term: the probability of jumping straight to page u instead of following a link. We usually take E to be uniform over the whole page set. The full update then becomes:

R(u) = c * Σ_{v ∈ B(u)} R(v) / num(v) + c * E(u)

Also, the paper reports that in their experiments the link-following term on the left is normalized to about 0.85 of the total rank, and the escape term on the right to about 0.15.
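A sketch of the iteration with the escape term added (again on a hypothetical 4-page graph; the factor d = 0.85 plays the role of the 0.85/0.15 split from the paper):

```python
import numpy as np

d = 0.85                       # weight of the link-following term; escape gets 1 - d = 0.15
n = 4
links = {0: [1, 2], 1: [2], 2: [0, 1], 3: [2]}  # hypothetical graph: links[j] = pages j points to

# Column-stochastic transition matrix: M[i, j] = 1/num(j) if j links to i.
M = np.zeros((n, n))
for j, outs in links.items():
    for i in outs:
        M[i, j] = 1.0 / len(outs)

E = np.full(n, 1.0 / n)        # E(u): uniform escape distribution over all pages
R = np.full(n, 1.0 / n)
for _ in range(100):
    R_new = d * (M @ R) + (1 - d) * E   # 0.85 link term + 0.15 escape term
    done = np.abs(R_new - R).sum() < 1e-9
    R = R_new
    if done:
        break

print(R)
```

Unlike the simple version, every page now keeps at least (1 - d)/n of the rank mass, so a page with no in-links no longer collapses to zero.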
On small data this converges quite quickly; for large-scale data, consider distributed computation with Hadoop.
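To give a feel for how one iteration maps onto map/reduce, here is a toy sketch of the idea in plain Python (an assumed decomposition for illustration, not actual Hadoop code): the mapper splits each page's rank across its out-links, and the reducer sums the contributions arriving at each page and applies the escape term.

```python
from collections import defaultdict

def mapper(page, out_links, rank):
    """Emit (destination, contribution) pairs for one page."""
    for dest in out_links:
        yield dest, rank / len(out_links)

def reducer(contributions, n, d=0.85):
    """Sum contributions per page and add the escape term (1 - d)/n."""
    new_rank = defaultdict(lambda: (1 - d) / n)
    for dest, share in contributions:
        new_rank[dest] += d * share
    return dict(new_rank)

# One iteration on a hypothetical 3-page graph.
graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = {p: 1.0 / 3 for p in graph}
pairs = [kv for p in graph for kv in mapper(p, graph[p], ranks[p])]
ranks = reducer(pairs, n=3)
print(ranks)
```

On a real cluster the mapper and reducer would run over the link graph stored as (page, out_links, rank) records, with the shuffle phase grouping contributions by destination page.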
#!/usr/bin/env python
# coding=utf-8
import random

import numpy as np


class PageRank:
    """Generate a random web-page link graph and run PageRank on it."""

    def __init__(self, n, eps, run_factor):
        """
        Parameters:
            n          : number of web pages
            eps        : convergence precision
            run_factor : probability of getting bored and jumping away (escape term)
        """
        self.n = n
        link_graph, self.E_matrix = self.generate_data(n)
        self.point_graph = self.cal_point_graph(n, link_graph)
        self.eps = eps
        self.run_factor = run_factor

    def generate_data(self, n):
        """
        Generate a random link graph and the escape-probability matrix.

        Returns:
            link_graph : n x n 0/1 matrix; link_graph[i][j] = 1 means page j links to page i
            E_matrix   : n x n matrix of uniform escape probabilities (1/n each)
        """
        link_graph = [[1 if random.random() > 0.5 else 0 for _ in range(n)]
                      for _ in range(n)]
        E_matrix = np.full((n, n), 1.0 / n)

        edges = [(i, j) for i in range(n) for j in range(n) if link_graph[i][j]]
        print('-' * 50, 'graph structure', '-' * 50)
        print(edges)
        print('-' * 50, 'graph structure', '-' * 50)
        return link_graph, E_matrix

    def cal_point_graph(self, n, link_graph):
        """
        Turn the 0/1 link graph into a column-stochastic transition matrix:
        point_graph[i][j] = 1 / out_degree(j) if page j links to page i.
        """
        point_graph = np.zeros((n, n))
        for j in range(n):
            out_num = sum(1 for i in range(n) if link_graph[i][j] != 0)
            if out_num != 0:
                for i in range(n):
                    if link_graph[i][j] != 0:
                        point_graph[i, j] = 1.0 / out_num
        return point_graph

    def run(self):
        """
        Iterate the PageRank update until the rank vector changes by less than eps.

        Returns:
            rank_list : list of rank values, one per page
        """
        rank_list = np.full((self.n, 1), 1.0 / self.n)
        turn_cnt = 0
        while True:
            turn_cnt += 1
            print('running turn %d' % turn_cnt)
            rank_list_1 = self.point_graph @ rank_list   # follow-a-link term
            rank_list_2 = self.E_matrix @ rank_list      # escape term
            # Normalize the link term to sum to (1 - run_factor) and the escape
            # term to sum to run_factor, i.e. the 0.85 / 0.15 split from the paper.
            rank_list_1 = rank_list_1 * (1.0 - self.run_factor) / np.sum(rank_list_1)
            rank_list_2 = rank_list_2 * self.run_factor / np.sum(rank_list_2)
            rank_list_new = rank_list_1 + rank_list_2
            diff_sum = np.sum(np.abs(rank_list_new - rank_list))
            rank_list = rank_list_new
            if diff_sum < self.eps:
                break
        return [rank_list[i, 0] for i in range(self.n)]


if __name__ == '__main__':
    n = 5
    rank = PageRank(n, 1e-6, 0.15)
    rank_list = rank.run()
    for i, each in enumerate(rank_list):
        print('id %d rank_val %f' % (i, each))