Graph Representation Learning: DeepWalk

Graph representation learning embeds the vertices of a graph into low-dimensional vectors, such that the vectors capture both the structural information and the attribute information of the vertices.
In NLP, the word2vec algorithm represents natural-language words as low-dimensional vectors based on word co-occurrence in a corpus. DeepWalk borrows the idea of word2vec. word2vec builds word co-occurrence relations from the sentences in a corpus, then either predicts the context words from the target word (the skip-gram model) or predicts the target word from its context words (the CBOW model); DeepWalk instead uses random walks over the graph's connectivity structure to explicitly construct "sentence" sequences, and then applies word2vec to represent the graph's vertices as low-dimensional vectors.

Random walk

For an undirected graph $G=(V,E)$, let $e_{i,j}$ denote the weight of the edge between vertices $v_i$ and $v_j$ (0 if no edge exists). A random walk constructs a sequence $\mathcal{S}=\{s^{(1)},s^{(2)},\cdots, s^{(T)}\}$ of length $T$, where the probability of walking from vertex $v_i$ to vertex $v_j$ is

$$p(s^{(t+1)}=v_j \mid s^{(t)}=v_i) = \begin{cases} \frac{e_{i,j}}{Z_i}, & \text{if } e_{i,j} \neq 0 \\ 0, & \text{if } e_{i,j} = 0 \end{cases}$$

where $Z_i$ is the sum of the weights of all edges incident to $v_i$, i.e.
$$Z_i=\sum_{j=1}^{|V|}e_{i,j}$$
Clearly, the larger the weight of the edge between $v_i$ and $v_j$, the more likely the walk moves from $v_i$ to $v_j$. A random walk can be viewed as a depth-first search that is allowed to backtrack. Directed graphs admit random walks in principle, but a walk can easily get stuck at a vertex with out-degree 0, so in practice DeepWalk is mostly applied to undirected graphs.
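
As a quick illustration, here is a minimal sketch that computes this transition distribution for one vertex of a weighted networkx graph (the helper name transition_probs is mine, not from the original post):

import networkx as nx

def transition_probs(g, v_i):
    # Collect the neighbors of v_i and the edge weights e_{i,j}.
    neighbors = list(g[v_i])
    weights = [g[v_i][v_j]['weight'] for v_j in neighbors]
    Z_i = sum(weights)  # normalization constant Z_i
    # p(v_j | v_i) = e_{i,j} / Z_i for neighbors; non-neighbors get 0.
    return {v_j: w / Z_i for v_j, w in zip(neighbors, weights)}

g = nx.Graph()
g.add_weighted_edges_from([(1, 2, 0.5), (1, 4, 1.0)])
print(transition_probs(g, 1))  # {2: 0.333..., 4: 0.666...}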

word2vec (skip-gram)

Given the "sentence" sequences produced by random walks, the objective of the word2vec algorithm (skip-gram model) is
$$\text{maximize} \quad \frac{1}{T}\sum_{t=1}^{T}\sum_{-c \leq j \leq c,\, j \neq 0}\log p(w_{s^{(t+j)}} \mid w_{s^{(t)}})$$
where $w$ is a vertex's vector representation, randomly initialized at first and gradually converging as the objective is optimized, and $c$ is the size of the context window.
As in word2vec, $p(w_{O} \mid w_{I})$ can be implemented with hierarchical softmax or negative sampling.
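
For reference, under negative sampling each $\log p(w_{O} \mid w_{I})$ term is replaced by the standard word2vec objective below, where $\sigma$ is the sigmoid function, $K$ the number of negative samples, and $P_n$ the noise distribution over vertices (for simplicity this reuses the single $w$ notation above, whereas word2vec keeps separate input and output vectors):
$$\log \sigma\left(w_{O}^\top w_{I}\right) + \sum_{k=1}^{K}\mathbb{E}_{w_k \sim P_n}\left[\log \sigma\left(-w_{k}^\top w_{I}\right)\right]$$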

Code implementation of DeepWalk

Random walk

import networkx as nx
import random

def gen_graph():
    g = nx.Graph()
    g.add_weighted_edges_from([(1, 2, 0.5), (2, 3, 1.5), (1, 4, 1.0), (2, 4, 0.5), (4, 5, 1.0)])
    return g


def random_pick(candidates, probs):
    """Sample one candidate with probability proportional to its weight."""
    if len(candidates) == 0:
        return None
    Z = sum(probs)
    # Draw from [0, Z) so the unnormalized weights can be used directly.
    x = random.uniform(0, Z)
    cumulative_prob = 0.0
    for candidate, prob in zip(candidates, probs):
        cumulative_prob += prob
        if x < cumulative_prob:
            break
    return candidate


def random_walk(g, walk_length=40, start_node=None):
    """Generate a single weighted random walk of length walk_length."""
    if start_node is None:
        start_node = random.choice(list(g.nodes()))
    walk_seq = [start_node]
    while len(walk_seq) < walk_length:
        cur = walk_seq[-1]
        # Gather the current vertex's neighbors and edge weights e_{i,j}.
        candidate_nbs = []
        candidate_ws = []
        for nb, attr in g[cur].items():
            candidate_nbs.append(nb)
            candidate_ws.append(attr['weight'])
        candidate = random_pick(candidate_nbs, candidate_ws)
        if candidate is not None:
            walk_seq.append(candidate)
        else:
            raise ValueError("current node with 0 degree")
    return walk_seq


def sample_walks(g, walk_length=40, num=10, shuffle=True):
    """Run num passes over all vertices, starting one walk from each vertex."""
    walks = []
    all_nodes = list(g.nodes())
    print("start sampling walks:")
    for i in range(num):
        print("iteration: {} / {}".format(i + 1, num))
        if shuffle:
            random.shuffle(all_nodes)
        for node in all_nodes:
            walks.append(random_walk(g, walk_length, start_node=node))
    return walks


if __name__ == '__main__':
    g = gen_graph()
    print(g.nodes(data=True))
    print(g.edges(data=True))
    for node in g.nodes():
        print(g[node].items())
    walks = sample_walks(g, walk_length=4, num=2)
    print(walks)

output:
[(1, {}), (2, {}), (3, {}), (4, {}), (5, {})]
[(1, 2, {'weight': 0.5}), (1, 4, {'weight': 1.0}), (2, 3, {'weight': 1.5}), (2, 4, {'weight': 0.5}), (4, 5, {'weight': 1.0})]
dict_items([(2, {'weight': 0.5}), (4, {'weight': 1.0})])
dict_items([(1, {'weight': 0.5}), (3, {'weight': 1.5}), (4, {'weight': 0.5})])
dict_items([(2, {'weight': 1.5})])
dict_items([(1, {'weight': 1.0}), (2, {'weight': 0.5}), (5, {'weight': 1.0})])
dict_items([(4, {'weight': 1.0})])
start sampling walks:
iteration: 1 / 2
iteration: 2 / 2
[[1, 2, 3, 2], [3, 2, 1, 2], [2, 3, 2, 3], [5, 4, 1, 2], [4, 1, 4, 1], [3, 2, 3, 2], [5, 4, 1, 2], [4, 1, 2, 3], [2, 1, 2, 3], [1, 4, 1, 2]]

skip-gram

For the skip-gram code, see the TensorFlow implementation of skip-gram.
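
Alternatively, as a minimal sketch of the training stage, the walks sampled above can be fed to gensim's Word2Vec (gensim is not used in the original post; the hyperparameter values below are illustrative, and gensim >= 4.0 argument names are assumed):

from gensim.models import Word2Vec

# gensim expects string tokens, so convert the vertex ids first;
# `walks` is the list returned by sample_walks above.
sentences = [[str(v) for v in walk] for walk in walks]

# sg=1 selects the skip-gram model; negative=5 enables negative sampling.
model = Word2Vec(sentences, vector_size=64, window=5, min_count=0,
                 sg=1, negative=5, epochs=5)

print(model.wv['1'])  # the low-dimensional vector of vertex 1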

References

  1. Perozzi, B., Al-Rfou, R., Skiena, S. DeepWalk: Online Learning of Social Representations. KDD 2014.
