[Paper Reading] On clustering using random walks

Reading notes on "On clustering using random walks"

1. Problem Modeling

1.1 Problem Description

Let $G(V,E,\omega)$ be a weighted graph, where $V$ is the set of nodes, $E$ is the set of edges between nodes in $V$, and $\omega: E \to \mathbb{R}^{+}$ is the weighting function that measures the similarity between pairs of items (a higher value means more similar).

$$p_{ij} = \frac{\omega(i,j)}{d_i}$$

$$d_i = \sum_{k=1}^{n}\omega(i,k)$$

$M^G \in \mathbb{R}^{n \times n}$ is the associated transition matrix,

$$M^G_{ij} = \begin{cases} p_{ij} & \langle i,j \rangle \in E \\ 0 & \textrm{otherwise} \end{cases}$$
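
A minimal sketch of this construction, assuming the graph is given as a dense, non-negative weight matrix in which a zero entry means "no edge". The function name `transition_matrix` and the example matrix `W` are illustrative, not taken from the paper.

```python
import numpy as np


def transition_matrix(weights):
    """Row-normalize a weight matrix W so that M[i, j] = w(i, j) / d_i."""
    weights = np.asarray(weights, dtype=float)
    degrees = weights.sum(axis=1)  # d_i = sum_k w(i, k)
    if np.any(degrees == 0):
        raise ValueError("every node needs at least one weighted edge")
    return weights / degrees[:, None]


# Hypothetical 3-node weighted graph (symmetric, no self-loops).
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
M = transition_matrix(W)
print(M)
```

Each row of `M` sums to 1, so row `i` is the probability distribution of a single step starting at node `i`.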

Question:

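  1. $\omega$ denotes the similarity between nodes, but in practice we only have an undirected graph that records whether two nodes are connected. How do we construct $\omega$ from the information we have?
    Answer: the similarity here can be taken to be the edge weight between nodes, so $M^G_{ij}$ can be regarded as the adjacency matrix after normalization.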

This part is tricky: I could not find anywhere in the paper how $P^{k}_{\textrm{visit}}(i)$ is computed, and I was stuck here for a long time.

The paper describes it as follows:

Now, denote by $P^k_{visit}(i) \in \mathbb{R}^n$ the vector whose j-th component is the probability that a random walk originating at i will visit node j in its k-th step. Thus, $P^k_{visit}(i)$ is the i-th row in the matrix $(M^G)^k$, the k’th power of $M^G$.

We now know how $M^G$ is computed, but what about $(M^G)^k$? The paper calls it "the k’th power of $M^G$", which I take to mean the $k$-th power of the matrix $M^G$ under ordinary matrix multiplication.

That is, $P^k_{visit}(i)$ is the $i$-th row of the matrix $(M^G)^k$:

$$P^k_{visit}(i) = (M^G)^k_i$$

Treating each $P^k_{visit}(i)$ as a row vector, the matrix is these rows stacked:

$$(M^G)^k = \begin{bmatrix} P^k_{visit}(1) \\ P^k_{visit}(2) \\ \vdots \\ P^k_{visit}(n) \end{bmatrix}$$
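
A small sketch of this reading, assuming `M` is already a row-stochastic transition matrix. The helper name `visit_probabilities` and the star-shaped example graph are illustrative only.

```python
import numpy as np


def visit_probabilities(M, k, i):
    """Return P^k_visit(i): row i of the k-th matrix power of M."""
    return np.linalg.matrix_power(M, k)[i]


# Hypothetical star graph: node 0 is linked to nodes 1 and 2.
M = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

p1 = visit_probabilities(M, 1, 0)  # distribution after one step from node 0
p2 = visit_probabilities(M, 2, 0)  # distribution after two steps from node 0
print(p1, p2)
```

After one step the walk is at node 1 or node 2 with equal probability, and after two steps it is certainly back at node 0, which matches what the matrix power gives.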

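Note: up to this point the computation is essentially the same as in the Markov Cluster algorithm (MCL). MCL keeps iterating until the matrix no longer changes; here, to limit the computational cost, the authors instead use the sum of the results of the first $k$ steps.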

We now offer two methods for performing the edge separation, both based on deterministic analysis of random walks.

Edge separation here means sharpening.

NS: Separation by neighborhood similarity.

CE: Separation by circular escape.

Terms to note: the weighted neighborhood, bipartite subgraph.

$$P^{\leq k}_{\textrm{visit}}(v) = \sum_{i=1}^{k} P^{i}_{\textrm{visit}}(v)$$
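
The sum can be accumulated one power at a time without storing every power. The following is a sketch under the assumption that `M` is a dense row-stochastic matrix; the name `cumulative_visit` is made up for illustration.

```python
import numpy as np


def cumulative_visit(M, k):
    """Return the matrix whose row v is P^{<=k}_visit(v) = sum_{i=1..k} (M^i)[v]."""
    M = np.asarray(M, dtype=float)
    power = np.eye(M.shape[0])   # M^0
    total = np.zeros_like(M)
    for _ in range(k):
        power = power @ M        # M^1, M^2, ..., M^k
        total += power
    return total


# Hypothetical star graph: node 0 is linked to nodes 1 and 2.
M = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
P = cumulative_visit(M, 2)
print(P)
```

Because every power of a row-stochastic matrix is row-stochastic, each row of the result sums to exactly `k`.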

2. NS: Separation by neighborhood similarity.

Now, in order to estimate the closeness of two nodes $v$ and $u$, we fix some small $k$ (e.g., $k = 3$) and compare $P^{\leq k}_{\textrm{visit}}(v)$ and $P^{\leq k}_{\textrm{visit}}(u)$. The smaller the difference, the closer $u$ and $v$ are.

$$NS(G) \xlongequal{dfn} G_s(V, E, \omega_s),$$

where $\forall \langle v, u \rangle \in E,\ \omega_s(u, v) = sim^k(P^{\leq k}_{visit}(v), P^{\leq k}_{visit}(u))$

$sim^k(x,y)$ is some similarity measure of the vectors $x$ and $y$, whose value increases as $x$ and $y$ become more similar.

A suitable choice for $sim^k(x,y)$ is:

$$f^k(x,y) \xlongequal{dfn} \exp(2k - \|x - y\|_{L_1}) - 1 \tag{1}$$

where

$$\|x - y\|_{L_1} = \sum_{i=1}^{n}|x_i - y_i|$$

Another choice is:

$$\cos(x,y) = \frac{(x,y)}{\sqrt{(x,x)} \cdot \sqrt{(y,y)}} \tag{2}$$

where $(\cdot,\cdot)$ denotes the inner product.
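
To make the NS operator concrete, here is a rough sketch that re-weights every existing edge with the measure from equation (1). It assumes a dense symmetric weight matrix in which zero means "no edge". All function names and the example graph (two triangles joined by a single bridge edge) are illustrative and not from the paper.

```python
import numpy as np


def cumulative_visit(M, k):
    """Row v of the result is P^{<=k}_visit(v) = sum_{i=1..k} (M^i)[v]."""
    power = np.eye(M.shape[0])
    total = np.zeros_like(M)
    for _ in range(k):
        power = power @ M
        total += power
    return total


def f_k(x, y, k):
    """Equation (1): exp(2k - ||x - y||_1) - 1."""
    return float(np.exp(2 * k - np.abs(x - y).sum()) - 1.0)


def cosine(x, y):
    """Equation (2): cosine of the angle between x and y."""
    return float((x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y)))


def ns_weights(W, k=3):
    """Re-weight each existing edge <u, v> by f_k of the two visit vectors."""
    W = np.asarray(W, dtype=float)
    M = W / W.sum(axis=1, keepdims=True)      # transition matrix M^G
    P = cumulative_visit(M, k)
    W_s = np.zeros_like(W)
    for u, v in zip(*np.nonzero(W)):          # only edges already in E
        W_s[u, v] = f_k(P[u], P[v], k)
    return W_s


# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0

W_s = ns_weights(W, k=3)
print(W_s[0, 1], W_s[2, 3])
```

The edge inside a triangle ends up with a larger weight than the bridge edge, which is the sharpening effect the operator is meant to produce. Since each visit vector is non-negative and sums to `k`, the L1 distance is at most `2k`, so `f_k` is never negative.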

3.2 CE: Separation by circular escape.

3.3 Code Implementation

Undirected weighted graph

import numpy as np


def markovCluster(adjacencyMat, dimension, numIter, power=2, inflation=2):
    columnSum = np.sum(adjacencyMat, axis=0)
    probabilityMat = adjacencyMat / columnSum

    # Expansion: raise the matrix to the given power
    # by repeated matrix multiplication.
    def _expand(probabilityMat, power):
        expandMat = probabilityMat
        for i in range(power - 1):
            expandMat = np.dot(expandMat, probabilityMat)
        return expandMat

    expandMat = _expand(probabilityMat, power)

    # Inflation: raise every entry to the given power element-wise,
    # then rescale each column so that it sums to 1.
    def _inflate(expandMat, inflation):
        powerMat = expandMat
        for i in range(inflation - 1):
            powerMat = powerMat * expandMat
        inflateColumnSum = np.sum(powerMat, axis=0)
        inflateMat = powerMat / inflateColumnSum
        return inflateMat

    inflateMat = _inflate(expandMat, inflation)

    for i in range(numIter):
        expand = _expand(inflateMat, power)
        inflateMat = _inflate(expand, inflation)
    print(inflateMat)
    print(np.zeros((dimension, dimension)) != inflateMat)


if __name__ == "__main__":
    dimension = 4
    numIter = 10
    adjacencyMat = np.array([[1, 1, 1, 1],
                             [1, 1, 0, 1],
                             [1, 0, 1, 0],
                             [1, 1, 0, 1]])

    # adjacencyMat = np.array([[1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 1, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [0, 1, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          ])
    markovCluster(adjacencyMat, dimension, numIter)

Output:

[[1.00000000e+000 1.00000000e+000 1.00000000e+000 1.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]]
[[ True  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True  True  True  True]]

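From the boolean mask, the clustering can be read as $\{\{1,2,4\},\{3\}\}$. Note, however, that entries on the order of $10^{-218}$ are numerically zero. Under the usual MCL reading, where each row with nonzero entries is an attractor and its nonzero columns form one cluster, only row 1 remains and all four nodes fall into a single cluster.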
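
For reference, the usual way to read clusters from a converged MCL matrix is to treat each row that still holds mass as an attractor and to group the columns that are nonzero in that row. The sketch below follows that convention with a small tolerance. The helper name `extract_clusters` and the threshold `tol` are assumptions of mine, and indices are 0-based.

```python
import numpy as np


def extract_clusters(mat, tol=1e-8):
    """Group column indices by the rows (attractors) that hold their mass."""
    clusters = []
    for row in np.asarray(mat, dtype=float):
        members = tuple(int(j) for j in np.nonzero(row > tol)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# The matrix printed by the run above.
converged = np.array([
    [1.0] * 4,
    [5.23869755e-218] * 4,
    [0.0] * 4,
    [5.23869755e-218] * 4,
])
print(extract_clusters(converged))
```

With the tolerance applied, the values near `5e-218` count as zero, so this reading yields one cluster containing all four nodes rather than a split between node 3 and the rest.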

