SimRank:基于图结构的相似度计算方法

简单理解SimRank

SimRank:基于图结构的相似度计算方法_第1张图片

图1.二部图

所谓二部图(bipartite graphs),是指图中的节点可以分这两个子集,任意一条边关联的两个节点分别来自于这两个子集。用I(v)和O(v)分别表示节点v的in-neighbors和out-neighbors。看上面的二部图,我们把A、B当成两个人,把a、b、c当成三件商品,有向边代表人购买的商品。simrank的基本思想是:如果两个实体相似,那么跟它们相关的实体应该也相似。比如在上图中如果a和c相似,那么A和B应该也相似,因为A和a相关,而B和c相关。

SimRank的基本公式:

\begin{equation}s(a,b)=\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\label{basic}\end{equation}

s(a,b)是节点a和b的相似度,当a=b时,s(a,b)=1。$I_i(a)$表示a的第i个in-neighbor。当$I(a)=\emptyset$或$I(b)=\emptyset$时式\eqref{basic}为0。\eqref{basic}式用一句话描述就是:a和b的相似度等于a的in-neighbors和b的in-neighbors相似度的平均值。参数C是个阻尼系数,它的含义可以这么理解:假如I(a)=I(b)={A},按照\eqref{basic}式计算出sim(a,b)=C*sim(A,A)=C,所以$C\in(0,1)$。

把式\eqref{basic}应用于图1所示的二部图就是:

\begin{equation}s(A,B)=\frac{C_1}{|O(A)||O(B)|}\sum_{i=1}^{|O(A)|}\sum_{j=1}^{|O(B)|}s(O_i(A),O_j(B))\ \ \ \ for\ A\ne{B}\label{out}\end{equation}

\begin{equation}s(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s(I_i(a),I_j(b))\ \ \ \ for\ a\ne{b}\label{in}\end{equation}

忽略$C_1$和$C_2$,\eqref{out}式是说买家A和B的相似度等于他们购买的物品之间相似度的平均值,\eqref{in}式是说物品a和b的相似度是购买它们的买家之间相似度的平均值。

对于非二部图的情况,一个节点既可能有in-neighbors也可能有out-neighbors,比如在论文引用的场景下,一篇论文既可能引用其他论文,也可能被其他论文引用。站在引用的角度,两篇论文的相似度为

\begin{equation}s_1(a,b)=\frac{C_1}{|O(a)||O(b)|}\sum_{i=1}^{|O(a)|}\sum_{j=1}^{|O(b)|}s_2(O_i(a),O_j(b))\label{hout}\end{equation}

站在被引用的角度,两篇论文的相似度为

\begin{equation}s_2(a,b)=\frac{C_2}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}s_1(I_i(a),I_j(b))\label{hin}\end{equation}

根据实际的应用场景,最终的s(a,b)可以采用\eqref{hout}和\eqref{hin}其中的任何一个,或者是两者的综合。

Naive SimRank

SimRank迭代算法:

$R_0(a,b)=\left\{\begin{matrix}0\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$

$R_{k+1}(a,b)=\left\{\begin{matrix}\frac{C}{|I(a)||I(b)|}\sum_{i=1}^{|I(a)|}\sum_{j=1}^{|I(b)|}R_k(I_i(a),I_j(b))\ \ \ \ if\ a\ne{b}\\1\ \ \ \ if\ a=b\end{matrix}\right.$

$R_k(*,*)$是k的单调不减函数,$lim_{k\to\infty}R_k(a,b)=s(a,b)$,实践中发现$R_k(*,*)$收敛得很快,k不需要设得太大。

 下面给出矩阵的形式,因为直接使用上面的迭代公式很难展开并行计算,数量稍微大一些(比如上十万)时在单机上跑时间和空间开销非常大。

\begin{equation}\left\{\begin{matrix}S^{(0)}=(1-c)\cdot{I_n}\\S^{(k+1)}=c\cdot{Q^T}\cdot{S^{(k)}}\cdot{Q}+(1-c)\cdot{I_n}\end{matrix}\right.\label{juzhen}\end{equation}

S是相似度矩阵。Q是转移概率矩阵,它的每一列和为1,如果从节点i可以转移到节点j,并且这样的节点i一共有n个,则$Q_{i,j}=\frac{1}{n}$

迭代误差的上界:

\begin{equation}\|S^{(k)}-S\|_{max}\le{c}^{k+1}\ \ \ \ (\vee{k}=0,1,2\ldots)\label{err_ceil}\end{equation}

矩阵的max范数定义为:

$\|X_{p\times{q}}\|_{max}\stackrel{def}{=}max_{i=1}^{p}max_{j=1}^{q}\mid{x}_{i,j}\mid$

好了,来做个小练习。图2是一张网页链接关系图,表示一所大学的主页上放了A、B两个教授的个人主页的链接,教授B和学生B的个人主页互相链接了对方,等等。下面我们要通过这种链接关系求这5个节点的相似度。

SimRank:基于图结构的相似度计算方法_第2张图片

图2. 网页链接关系图

首先用一个文本文件来存储上面的有向图。

linkgraph

univ	profA	profB
profA	studentA
studentA	univ
profB	studentB
studentB	profB

文件中每一行的首列是一个节点,后面的列是首列的out-neighbors,即在图上游走时只能顺着箭头的方向。对于二部图情况就不一样了,\eqref{out}式是顺着二部图箭头的方向,\eqref{in}式是逆着二部图箭头的方向,即每一条边都是允许双向游走的。于是图1所示的二部图可以表示为:

linkgraph_bipartite

A    a    b
B    b    c
a    A
b    A    B
c    B

naive_simrank.py

#!/usr/bin/env python
# coding=utf-8

import numpy as np
import scipy as sp

nodes = []  # 所有的节点存入数组
nodesnum = 0  # 所有节点的数目
nodes_index = {}  # <节点名,节点在nodes数组中的编号>
damp = 0.8  # 阻尼系数
trans_matrix = np.matrix(0)  # 转移概率矩阵
sim_matrix = np.matrix(0)  # 节点相似度矩阵


def initParam(graphFile):
    '''
    构建nodes、nodes_index、trans_matrix和第0代的sim_matrix.
    输入文件行格式要求:node\toutneighbor\toutneighbor\t...或 node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        nodeid = -1
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for ele in arr[1:]:
            outneighbor = ele
            outneighborid = -1
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = []
            if outneighborid in link_in:
                inneighbors = link_in[outneighborid]
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors

    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        num = len(inneighbors)
        prob = 1.0 / num
        for neighbor in inneighbors:
            trans_matrix[neighbor, node] = prob

    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''
    迭代更新相似度矩阵
    '''
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    sim_matrix = damp * np.dot(np.dot(trans_matrix.transpose(),
                                      sim_matrix), trans_matrix) + (1 - damp) * np.identity(nodesnum)


def printResult(sim_node_file):
    '''
    打印输出相似度计算结果
    '''
    global sim_matrix
    global link_out
    global link_in
    global nodes
    global nodesnum

    # 打印node之间的相似度
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim == None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        # 按相似度由大到小排序
        neighbour = sorted(
            neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    global nodes_index
    global trans_matrix
    global sim_matrix

    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_naive"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)

最终得到5个节点两两之间的相似度

nodesim_naive

univ	profB:0.10803511296	studentB:0.02203058176	
profA	profB:0.36478881792	studentB:0.08159625216	
profB	profA:0.36478881792	univ:0.10803511296	studentB:0.0642220032	studentA:0.03022258176	
studentA	studentB:0.28216737792	profB:0.03022258176	
studentB	studentA:0.28216737792	profA:0.08159625216	profB:0.0642220032	univ:0.02203058176	

 平方缓存法

\eqref{err_ceil}已经证明了simrank的收敛速度是非常快的,下面给出一种可以加速收敛的方法--平方缓存法。

\begin{equation}\left\{\begin{matrix}S_{\left \langle 2 \right \rangle }^{(0)}=(1-c)\cdot{I_n}\\S_{\left \langle 2 \right \rangle }^{(k+1)}=S_{\left \langle 2 \right \rangle }^{(k)}+c^{2^k}\cdot{(Q^{2^k})^T}\cdot{S_{\left \langle 2 \right \rangle }^{(k)}}\cdot{Q^{2^k}}\end{matrix}\right.\label{square_cache}\end{equation}

平方缓存法的收敛速度更惊人:

\begin{equation}\|S_{\left \langle 2 \right \rangle }^{(k)}-S\|_{max}\le{c}^{2^k}\ \ \ \ (\vee{k}=0,1,2\ldots)\label{square_err_ceil}\end{equation}

注意$Q^{2^k}=(Q^{2^{(k-1)}})^2$即每次迭代不必从头计算$Q^{2^k}$,只需要在上一次迭代的基础上平方一下就可以了,这其实就是\eqref{square_cache}比\eqref{juzhen}快的全部原因。事实上:

$S_{\left \langle 2 \right \rangle }^{(k)}=S^{(2^k-1)}$

我们用平方缓存法重新计算图2中各节点的相似度。

 square_cache_simrank.py

#!/usr/bin/env python
# coding=utf-8

import numpy as np
import scipy as sp

nodes = []  # 所有的节点存入数组
nodesnum = 0  # 所有节点的数目
nodes_index = {}  # <节点名,节点在nodes数组中的编号>
damp = 0.8  # 阻尼系数
trans_matrix = np.matrix(0)  # 转移概率矩阵
sim_matrix = np.matrix(0)  # 节点相似度矩阵


def initParam(graphFile):
    '''
    构建nodes、nodes_index、trans_matrix和第0代的sim_matrix.
    输入文件行格式要求:node\toutneighbor\toutneighbor\t...或 node\tinneighbor\tinneighbor\t...
    '''
    global nodes
    global nodes_index
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    link_in = {}
    for line in open(graphFile, "r", 1024):
        arr = line.strip("\n").split()
        node = arr[0]
        nodeid = -1
        if node in nodes_index:
            nodeid = nodes_index[node]
        else:
            nodeid = len(nodes)
            nodes_index[node] = nodeid
            nodes.append(node)
        for ele in arr[1:]:
            outneighbor = ele
            outneighborid = -1
            if outneighbor in nodes_index:
                outneighborid = nodes_index[outneighbor]
            else:
                outneighborid = len(nodes)
                nodes_index[outneighbor] = outneighborid
                nodes.append(outneighbor)
            inneighbors = []
            if outneighborid in link_in:
                inneighbors = link_in[outneighborid]
            inneighbors.append(nodeid)
            link_in[outneighborid] = inneighbors

    nodesnum = len(nodes)
    trans_matrix = np.zeros((nodesnum, nodesnum))
    for node, inneighbors in link_in.items():
        num = len(inneighbors)
        prob = 1.0 / num
        for neighbor in inneighbors:
            trans_matrix[node, neighbor] = prob

    sim_matrix = np.identity(nodesnum) * (1 - damp)


def iterate():
    '''
    迭代更新相似度矩阵
    '''
    global trans_matrix
    global sim_matrix
    global damp
    global nodesnum

    damp=damp**2
    trans_matrix=np.dot(trans_matrix,trans_matrix)
    sim_matrix = damp * np.dot(np.dot(trans_matrix,
                                      sim_matrix), trans_matrix.transpose()) + sim_matrix


def printResult(sim_node_file):
    '''
    打印输出相似度计算结果
    '''
    global sim_matrix
    global link_out
    global link_in
    global nodes
    global nodesnum

    # 打印node之间的相似度
    f_out_user = open(sim_node_file, "w")
    for i in range(nodesnum):
        f_out_user.write(nodes[i] + "\t")
        neighbour = []
        for j in range(nodesnum):
            if i != j:
                sim = sim_matrix[i, j]
                if sim == None:
                    sim = 0
                if sim > 0:
                    neighbour.append((j, sim))
        # 按相似度由大到小排序
        neighbour = sorted(
            neighbour, cmp=lambda x, y: cmp(x[1], y[1]), reverse=True)
        for (u, sim) in neighbour:
            f_out_user.write(nodes[u] + ":" + str(sim) + "\t")
        f_out_user.write("\n")
    f_out_user.close()


def simrank(graphFile, maxIteration):
    global nodes_index
    global trans_matrix
    global sim_matrix

    initParam(graphFile)
    print "nodes:"
    print nodes_index
    print "trans ratio:"
    print trans_matrix
    for i in range(maxIteration):
        print "iteration %d:" % (i + 1)
        iterate()
        print sim_matrix


if __name__ == '__main__':
    graphFile = "../data/linkgraph"
    sim_node_file = "../data/nodesim_square"
    maxIteration = 10
    simrank(graphFile, maxIteration)
    printResult(sim_node_file)

最终算得的节点相似度虽然跟naive方法不一致,但相似度排序跟naive方法是完全一致的。

Arnoldi迭代降维法

采用\eqref{square_cache}式每次迭代都要作矩阵的乘法,当矩阵稍微大一些时对时间和空间的消耗是非常惊人的,两个$n\times{n}$的方阵相乘时间复杂度为$n^3$,即使采用strassen分治法也需要$O(n^{2.81})$的时间复杂度,况且strassen要求两个矩阵是方阵且边长为2的整次幂。如果我们能够对\eqref{square_cache}式中的Q和S进行降维,势必会节省大量计算时间。

Arnoldi迭代法是一种对高维稀疏矩阵(说的是上文中的转移概率矩阵Q)进行降维的方法。

\begin{equation}Q=V\cdot{H}\cdot{V^T}\label{decompose}\end{equation}

其中V的各列正交,且模长为1。$V=[v_0\mid{v_1}\mid\ldots{v}_{r-1}]$,其中$v_i$是V的第i列。

H是$r\times{r}$的上三角矩阵,r是Q的秩,形如:
$H_{r\times{r}}=\begin{bmatrix}h_{1,1} & h_{1,2} & h_{1,3} & \cdots & h_{1,r}\\h_{2,1} & h_{2,2} & h_{2,3} & \cdots & h_{2,r}\\0 & h_{3,2} & h_{3,3} & \cdots & h_{3,r}\\\vdots & \ddots & \ddots & \ddots & \vdots\\0 & \cdots  & 0 & h_{r,r-1} & h_{r,r}\end{bmatrix}$

V和H的计算流程如下:

$step1.$

$v_0=[1,0,0,\ldots]^T$

$step2.$

$for\ k \in [1,\alpha]$

$\ \ \ \ v_k=Q\cdot{v_{k-1}}$

$\ \ \ \ for\ j \in [0,k)$

$\ \ \ \ \ \ \ \ H[j][k-1]=v_j^T\cdot{v_k}$

$\ \ \ \ \ \ \ \ v_k=v_k-H[j][k-1]\cdot{v_j}$

$\ \ \ \ norm2=\|v_k\|_2$

$\ \ \ \ if\ norm2=0$

$\ \ \ \ \ \ \ \ break$

$\ \ \ \ H[k][k-1]=norm2$

$\ \ \ \ v_k=\frac{v_k}{norm2}$

$step3.$

$V舍弃最后一列,H舍弃最后一行$

在step2中,如果是由于norm2=0导致的最外层for循退出,则得到的H的边长为Q的秩即r;如果是由于达到了人为设定的循环上限$\alpha$,则得到的H的边长为$\alpha$。

arnoldi.py

#!/usr/bin/env pyton
# coding=utf-8

import numpy as np
from sparse_matrix import SparseMatrix

def arnoldi_iteration(Q, n, rank):
	'''
	对Q进行分解,QV=VH。
	Q是输入参数,numpy.matrix类型,n行n列,Q的秩为r。
	V和H都是输出参数,numpy.matrix类型。
	V是n行r+1列,每列模长为1且各列正交。V的转置与逆相等。
	H是r+1行r列的上三角矩阵。
	rank用于限制循环次数,r<=rank。
	'''
	if rank > n or rank <= 0:
		rank = n
	V = np.zeros((n, 1))
	V[0, 0] = 1
	h_col_list=[]
	k = 1
	while k <= rank:
		h_col = []
		v_k = Q.

你可能感兴趣的:(python,大数据)