A Simple Introduction to SimRank
Figure 1. A bipartite graph
A bipartite graph is a graph whose nodes can be partitioned into two subsets such that every edge connects a node in one subset to a node in the other. Let I(v) and O(v) denote the in-neighbors and out-neighbors of node v. In the bipartite graph above, treat A and B as two people and a, b, c as three products, with a directed edge meaning that a person bought a product. The basic idea of SimRank is: if two entities are similar, then the entities related to them should be similar too. In the figure above, for example, if a and c are similar, then A and B should also be similar, because A is related to a while B is related to c.
The basic SimRank equation:
\begin{equation}s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\label{basic}\end{equation}
s(a,b) is the similarity between nodes a and b; when a=b, s(a,b)=1. $I_i(a)$ denotes the i-th in-neighbor of a. When $I(a)=\emptyset$ or $I(b)=\emptyset$, \eqref{basic} is taken to be 0. In one sentence, \eqref{basic} says: the similarity between a and b is the average of the similarities between a's in-neighbors and b's in-neighbors. The parameter C is a damping factor, and its role can be understood as follows: if I(a)=I(b)={A}, then \eqref{basic} gives s(a,b)=C·s(A,A)=C, which is why $C\in(0,1)$.
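As a quick check of \eqref{basic} on Figure 1, where $I(a)=\{A\}$ and $I(b)=\{A,B\}$, the definition unrolls to

$s(a,b)=\frac{C}{1\cdot{2}}\left[s(A,A)+s(A,B)\right]=\frac{C}{2}\left[1+s(A,B)\right]$

so s(a,b) depends on s(A,B), which in turn depends on the item similarities; this mutual recursion is exactly what the iterative algorithm below resolves.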
Applying \eqref{basic} to the bipartite graph of Figure 1 gives:
\begin{equation}s(A,B)=\frac{C_1}{|O(A)||O(B)|}\sum_{i=1}^{|O(A)|}\sum_{j=1}^{|O(B)|}s(O_i(A),O_j(B))\ \ \ \ for\ A\ne{B}\label{out}\end{equation}
\begin{equation}s(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\ \ \ \ for\ a\ne{b}\label{in}\end{equation}
Ignoring $C_1$ and $C_2$: \eqref{out} says the similarity between buyers A and B is the average similarity between the items they bought, and \eqref{in} says the similarity between items a and b is the average similarity between the buyers who bought them.
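For instance, on Figure 1, where $O(A)=\{a,b\}$ and $O(B)=\{b,c\}$, \eqref{out} unrolls to

$s(A,B)=\frac{C_1}{2\cdot{2}}\left[s(a,b)+s(a,c)+s(b,b)+s(b,c)\right]=\frac{C_1}{4}\left[s(a,b)+s(a,c)+1+s(b,c)\right]$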
For non-bipartite graphs, a node may have both in-neighbors and out-neighbors. In a paper-citation network, for example, a paper can cite other papers and also be cited by them. From the citing point of view, the similarity between two papers is
\begin{equation}s_1(a,b)=\frac{C_1}{|O(a)||O(b)|}\sum_{i=1}^{|O(a)|}\sum_{j=1}^{|O(b)|}s_2(O_i(a),O_j(b))\label{hout}\end{equation}
From the cited point of view, the similarity between two papers is
\begin{equation}s_2(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s_1(I_i(a),I_j(b))\label{hin}\end{equation}
Depending on the actual application, the final s(a,b) can be \eqref{hout} alone, \eqref{hin} alone, or a combination of the two.
Naive SimRank
The iterative SimRank algorithm:
$R_0(a,b)=\left\{\begin{matrix}0\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_{k+1}(a,b)=\left\{\begin{matrix}\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}R_k(I_i(a),I_j(b))\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$
$R_k(*,*)$ is monotonically non-decreasing in k, and $\lim_{k\to\infty}R_k(a,b)=s(a,b)$. In practice $R_k(*,*)$ is found to converge quickly, so k does not need to be large.
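To make the recurrence concrete, here is a minimal per-pair implementation of the iteration above (a sketch with illustrative names, not from any library: `in_nbrs` maps each node to its list of in-neighbors):

import itertools

def naive_simrank(in_nbrs, C=0.8, iterations=10):
    '''Per-pair iteration: R_{k+1}(a,b) = C/(|I(a)||I(b)|) * sum of R_k over in-neighbor pairs.'''
    nodes = list(in_nbrs)
    # R_0(a,b) = 1 if a == b else 0
    R = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        R_next = {}
        for a, b in itertools.product(nodes, nodes):
            if a == b:
                R_next[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                R_next[(a, b)] = 0.0  # defined as 0 when either in-neighbor set is empty
            else:
                total = sum(R[(u, v)] for u in in_nbrs[a] for v in in_nbrs[b])
                R_next[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        R = R_next
    return R

# Toy usage on the bipartite graph of Figure 1 (in-neighbor lists match linkgraph_bipartite below):
in_nbrs = {"A": ["a", "b"], "B": ["b", "c"], "a": ["A"], "b": ["A", "B"], "c": ["B"]}
R = naive_simrank(in_nbrs)
print(R[("a", "b")], R[("A", "B")])

Each iteration touches every node pair and every in-neighbor pair, which quickly becomes expensive; that cost is what motivates the matrix form next.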
Next we give the matrix form, because the element-wise recurrence above is hard to parallelize directly, and once the node count grows a bit (say, past a hundred thousand) its time and space cost on a single machine becomes enormous.
\begin{equation}\left\{\begin{matrix}S^{(0)}=(1-c)\cdot{I_n}\\S^{(k+1)}=c\cdot{Q^T}\cdot{S^{(k)}}\cdot{Q}+(1-c)\cdot{I_n}\end{matrix}\right.\label{juzhen}\end{equation}
S is the similarity matrix. Q is the transition probability matrix; each of its columns sums to 1. If there is an edge from node i to node j, and j has n such in-neighbors in total, then $Q_{i,j}=\frac{1}{n}$.
An upper bound on the iteration error:
\begin{equation}\|S^{(k)}-S\|_{max}\le{c}^{k+1}\ \ \ \ (\forall{k}=0,1,2,\ldots)\label{err_ceil}\end{equation}
The max norm of a matrix is defined as:
$\|X_{p\times{q}}\|_{max}\stackrel{def}{=}\max_{i=1}^{p}\max_{j=1}^{q}\mid{x}_{i,j}\mid$
For example, with c=0.8 and k=10, the bound guarantees $\|S^{(10)}-S\|_{max}\le 0.8^{11}\approx 0.086$.
Now, a small exercise. Figure 2 is a web-page link graph: a university's homepage links to the personal homepages of professors A and B, the homepages of professor B and student B link to each other, and so on. We want to compute the similarities of these 5 nodes from the link structure alone.
Figure 2. Web-page link graph
First, store the directed graph above in a text file.
linkgraph
univ	profA	profB
profA	studentA
studentA	univ
profB	studentB
studentB	profB
The first column of each line in the file is a node, and the remaining columns are its out-neighbors; that is, a walk over this graph may only follow the direction of the arrows. The bipartite case is different: \eqref{out} walks along the direction of the bipartite graph's arrows, while \eqref{in} walks against it, so every edge must be traversable in both directions. The bipartite graph of Figure 1 is therefore stored as:
linkgraph_bipartite
A a b
B b c
a A
b A B
c B
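To see what the transition matrix looks like before running any code: for the Figure 2 graph stored in linkgraph, index the nodes in the order they first appear in the file (univ=0, profA=1, profB=2, studentA=3, studentB=4). Column j holds $\frac{1}{|I(j)|}$ at the rows of j's in-neighbors:

$Q=\begin{bmatrix}0 & 1 & \frac{1}{2} & 0 & 0\\0 & 0 & 0 & 1 & 0\\0 & 0 & 0 & 0 & 1\\1 & 0 & 0 & 0 & 0\\0 & 0 & \frac{1}{2} & 0 & 0\end{bmatrix}$

For example, profB's in-neighbors are univ and studentB, so its column (index 2) has $\frac{1}{2}$ at rows 0 and 4; every column sums to 1.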
naive_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []  # all nodes, stored in an array
nodesnum = 0  # total number of nodes
nodes_index = {}  # <node name, index of the node in the nodes array>
damp = 0.8  # damping factor
trans_matrix = None  # transition probability matrix Q
sim_matrix = None  # node similarity matrix S


def initParam(graphFile):
    '''
    Build nodes, nodes_index, trans_matrix and the generation-0 sim_matrix.
    Expected line format: node\toutneighbor\toutneighbor\t... or node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global nodesnum
    link_in = {}  # <node id, list of in-neighbor ids>
    for line in open(graphFile, "r"):
        arr = line.strip("\n").split()
        node = arr[0]
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for outneighbor in arr[1:]:
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            link_in.setdefault(outneighborid, []).append(nodeid)
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        prob = 1.0 / len(inneighbors)
        for neighbor in inneighbors:
            # column `node` gets 1/|I(node)| at each in-neighbor's row
            trans_matrix[neighbor, node] = prob
    sim_matrix = np.identity(nodesnum) * (1 - damp)  # S^(0) = (1-c)·I


def iterate():
    '''
    One update of the similarity matrix: S <- c·Q^T·S·Q + (1-c)·I
    '''
    global sim_matrix
    sim_matrix = damp * np.dot(np.dot(trans_matrix.transpose(),
                                      sim_matrix), trans_matrix) + (1 - damp) * np.identity(nodesnum)


def printResult(sim_node_file):
    '''
    Write the similarity results: one node per line, followed by the other
    nodes sorted by similarity in descending order.
    '''
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, largest first
        neighbour.sort(key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print("nodes:")
    print(nodes_index)
    print("trans ratio:")
    print(trans_matrix)
    for i in range(maxIteration):
        print("iteration %d:" % (i + 1))
        iterate()
        print(sim_matrix)


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_naive"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
The final pairwise similarities of the 5 nodes:
nodesim_naive
univ profB:0.10803511296 studentB:0.02203058176
profA profB:0.36478881792 studentB:0.08159625216
profB profA:0.36478881792 univ:0.10803511296 studentB:0.0642220032 studentA:0.03022258176
studentA studentB:0.28216737792 profB:0.03022258176
studentB studentA:0.28216737792 profA:0.08159625216 profB:0.0642220032 univ:0.02203058176
The Square-Cache Method
\eqref{err_ceil} already shows that SimRank converges very quickly; here is a method that accelerates convergence further: the square-cache method.
\begin{equation}\left\{\begin{matrix}S_{\left \langle 2 \right \rangle }^{(0)}=(1-c)\cdot{I_n}\\S_{\left \langle 2 \right \rangle }^{(k+1)}=S_{\left \langle 2 \right \rangle }^{(k)}+c^{2^k}\cdot{(Q^{2^k})^T}\cdot{S_{\left \langle 2 \right \rangle }^{(k)}}\cdot{Q^{2^k}}\end{matrix}\right.\label{square_cache}\end{equation}
The square-cache method converges even more impressively:
\begin{equation}\|S_{\left \langle 2 \right \rangle }^{(k)}-S\|_{max}\le{c}^{2^k}\ \ \ \ (\forall{k}=0,1,2,\ldots)\label{square_err_ceil}\end{equation}
Note that $Q^{2^k}=(Q^{2^{k-1}})^2$, so each iteration need not compute $Q^{2^k}$ from scratch; it only has to square the matrix cached from the previous iteration. This is, in fact, the entire reason \eqref{square_cache} is faster than \eqref{juzhen}. Indeed:
$S_{\left \langle 2 \right \rangle }^{(k)}=S^{(2^k-1)}$
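This identity is easy to check numerically. The following standalone sketch (illustrative names, independent of the scripts in this post) builds a random column-stochastic Q, runs $2^k-1$ steps of \eqref{juzhen} and k steps of \eqref{square_cache}, and compares the results:

import numpy as np

def naive_step(S, Q, c):
    # one step of the basic matrix iteration: S <- c * Q^T * S * Q + (1-c) * I
    return c * Q.T @ S @ Q + (1 - c) * np.identity(Q.shape[0])

def square_cache(Q, c, k):
    # k steps of the square-cache recurrence: use c^{2^i} and Q^{2^i}, then square both
    S = (1 - c) * np.identity(Q.shape[0])
    for _ in range(k):
        S = S + c * Q.T @ S @ Q
        c = c * c
        Q = Q @ Q
    return S

rng = np.random.default_rng(0)
Q = rng.random((6, 6))
Q /= Q.sum(axis=0)  # normalize each column to sum to 1
c, k = 0.8, 4
S = (1 - c) * np.identity(6)
for _ in range(2 ** k - 1):  # 2^k - 1 = 15 naive steps
    S = naive_step(S, Q, c)
print(np.allclose(S, square_cache(Q, c, k)))  # expect True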
Let us recompute the node similarities of Figure 2 with the square-cache method.
square_cache_simrank.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np

nodes = []  # all nodes, stored in an array
nodesnum = 0  # total number of nodes
nodes_index = {}  # <node name, index of the node in the nodes array>
damp = 0.8  # damping factor
trans_matrix = None  # stores Q^T, the transpose of the transition matrix
sim_matrix = None  # node similarity matrix S


def initParam(graphFile):
    '''
    Build nodes, nodes_index, trans_matrix and the generation-0 sim_matrix.
    Expected line format: node\toutneighbor\toutneighbor\t... or node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global nodesnum
    link_in = {}  # <node id, list of in-neighbor ids>
    for line in open(graphFile, "r"):
        arr = line.strip("\n").split()
        node = arr[0]
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for outneighbor in arr[1:]:
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            link_in.setdefault(outneighborid, []).append(nodeid)
    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        prob = 1.0 / len(inneighbors)
        for neighbor in inneighbors:
            # row `node`, column `neighbor`: this stores Q^T directly
            trans_matrix[node, neighbor] = prob
    sim_matrix = np.identity(nodesnum) * (1 - damp)  # S^(0) = (1-c)·I


def iterate():
    '''
    One update: S <- S + c^{2^k}·(Q^{2^k})^T·S·Q^{2^k}.
    trans_matrix holds (Q^T)^{2^k}; after each update, damp and trans_matrix
    are squared so the next call can reuse them without recomputing. Note the
    order: the current damp and trans_matrix are used first, then squared,
    so the first call applies c and Q as \eqref{square_cache} requires.
    '''
    global trans_matrix
    global sim_matrix
    global damp
    sim_matrix = damp * np.dot(np.dot(trans_matrix, sim_matrix),
                               trans_matrix.transpose()) + sim_matrix
    damp = damp ** 2
    trans_matrix = np.dot(trans_matrix, trans_matrix)


def printResult(sim_node_file):
    '''
    Write the similarity results: one node per line, followed by the other
    nodes sorted by similarity in descending order.
    '''
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim > 0:
                    neighbour.append((j, sim))
        # sort by similarity, largest first
        neighbour.sort(key=lambda x: x[1], reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    initParam(graphFile)
    print("nodes:")
    print(nodes_index)
    print("trans ratio:")
    print(trans_matrix)
    for i in range(maxIteration):
        print("iteration %d:" % (i + 1))
        iterate()
        print(sim_matrix)


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_square"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)
The final similarity values differ from the naive method's (the naive run stops at $S^{(10)}$, which has not fully converged, while ten squaring iterations reach $S^{(2^{10}-1)}$), but the similarity ranking is exactly the same as the naive method's.
Dimensionality Reduction via Arnoldi Iteration
With \eqref{square_cache}, every iteration performs matrix multiplications, and once the matrix gets even moderately large the time and space cost is staggering: multiplying two $n\times{n}$ square matrices takes $O(n^3)$ time, and even Strassen's divide-and-conquer algorithm needs $O(n^{2.81})$; moreover, the textbook form of Strassen wants square matrices whose side length is a power of 2. If we can reduce the dimensionality of Q and S in \eqref{square_cache}, we stand to save a great deal of computation.
Arnoldi iteration is a method for reducing the dimensionality of a high-dimensional sparse matrix (here, the transition probability matrix Q above).
\begin{equation}Q=V\cdot{H}\cdot{V^T}\label{decompose}\end{equation}
where the columns of V are orthogonal and of unit length: $V=[v_0\mid{v_1}\mid\ldots\mid{v}_{r-1}]$, with $v_i$ the i-th column of V.
H is an $r\times{r}$ upper Hessenberg matrix (upper triangular plus one subdiagonal), where r is the rank of Q; it has the form:
$H_{r\times{r}}=\begin{bmatrix}h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,r}\\h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,r}\\0 & h_{3,2} & h_{3,3} & \cdots & h_{3,r}\\\vdots & \ddots & \ddots & \ddots & \vdots\\0 & \cdots & 0 & h_{r,r-1} & h_{r,r}\end{bmatrix}$
V and H are computed as follows:
step 1. $v_0=[1,0,0,\ldots]^T$
step 2. for $k\in[1,\alpha]$:
        $v_k=Q\cdot{v_{k-1}}$
        for $j\in[0,k)$:
            $H[j][k-1]=v_j^T\cdot{v_k}$
            $v_k=v_k-H[j][k-1]\cdot{v_j}$
        $norm2=\|v_k\|_2$
        if $norm2=0$: break
        $H[k][k-1]=norm2$
        $v_k=\frac{v_k}{norm2}$
step 3. discard the last column of V and the last row of H
In step 2, if the outer loop exits because norm2=0, the resulting H has side length equal to r, the rank of Q; if it exits because the preset iteration cap $\alpha$ was reached, H has side length $\alpha$.
arnoldi.py
#!/usr/bin/env python
# coding=utf-8
import numpy as np
from sparse_matrix import SparseMatrix


def arnoldi_iteration(Q, n, rank):
    '''
    Decompose Q so that QV = VH.
    Q is the input, of numpy.matrix type, with n rows and n columns and rank r.
    V and H are the outputs, of numpy.matrix type.
    V has n rows and r+1 columns; its columns are orthogonal with unit
    length, so V^T·V is the identity.
    H is an (r+1)-by-r upper Hessenberg matrix.
    rank caps the number of loop iterations, r <= rank.
    '''
    if rank > n or rank <= 0:
        rank = n
    V = np.zeros((n, 1))
    V[0, 0] = 1
    h_col_list = []
    k = 1
    while k <= rank:
        h_col = []
v_k = Q.
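Following the pseudocode, the rest of the loop orthogonalizes $v_k$ against the previous columns, records the coefficients in H, and stops early on breakdown. For reference, here is a compact, self-contained NumPy sketch of the whole procedure (an illustrative reconstruction, not the original file: names like `arnoldi` and `tol` are mine, Q is assumed to be a dense 2-D array rather than the SparseMatrix type, and the breakdown test uses a small tolerance instead of an exact zero):

import numpy as np

def arnoldi(Q, alpha, tol=1e-12):
    # Arnoldi iteration following the pseudocode: returns V (n x m, orthonormal
    # columns) and H (m x m upper Hessenberg) with Q·V ≈ V·H, where m <= alpha.
    n = Q.shape[0]
    V = np.zeros((n, alpha + 1))
    V[0, 0] = 1.0  # step 1: v_0 = [1, 0, 0, ...]^T
    H = np.zeros((alpha + 1, alpha))
    m = alpha
    for k in range(1, alpha + 1):  # step 2
        v_k = Q @ V[:, k - 1]  # v_k = Q · v_{k-1}
        for j in range(k):  # orthogonalize against v_0 .. v_{k-1}
            H[j, k - 1] = V[:, j] @ v_k
            v_k -= H[j, k - 1] * V[:, j]
        norm2 = np.linalg.norm(v_k)
        if norm2 < tol:  # breakdown: the Krylov subspace is exhausted
            m = k
            break
        H[k, k - 1] = norm2
        V[:, k] = v_k / norm2
    # step 3: drop the last column of V and the last row of H
    return V[:, :m], H[:m, :m]

# Toy check on a small column-stochastic matrix; Q·V = V·H holds exactly
# when the loop ends in breakdown (the full rank was reached):
Q = np.array([[0.0, 1.0, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.0, 0.0]])
V, H = arnoldi(Q, alpha=3)
print(np.allclose(Q @ V, V @ H))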