Paper notes: LINE: Large-scale Information Network Embedding

Graph embedding

Idea:

  • first-order proximity
    Directly connected pairs of nodes are similar, which holds for social-network-like applications.
  • second-order proximity
    Nodes with overlapping neighborhoods are similar, which holds for e-commerce or social-network applications.

First-order proximity says similar nodes appear together (are directly linked), while second-order proximity says similar nodes make similar choices, e.g. buy the same commodities or join the same communities.

========================================================================================
First-order proximity modeling:
First-order proximity is modeled to represent the strength of edges, so edge weights are naturally what we want to incorporate into the model. In this paper, edge weights are normalized to form a distribution over edges: each edge (i, j) is observed with empirical probability
p̂1(i, j) = w_ij / sum_{(k,l) in E} w_kl

The paper then models the joint probability of nodes i and j as a sigmoid of the dot product of their embeddings:
p1(i, j) = 1 / (1 + exp(-E_i^T · E_j))

The loss function is the KL-divergence between the two distributions above; dropping the constants that do not depend on the embeddings, it reduces to a weighted negative log-likelihood:
loss1 = -sum_{(i,j) in E} w_ij * log p1(i, j)
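
Below is a minimal numpy sketch of loss1. The toy edge list, the dimensions, and names like `emb` are illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy graph: (i, j, weight) triples; purely illustrative.
edges = [(0, 1, 3.0), (1, 2, 1.0), (0, 2, 2.0)]
num_nodes, dim = 3, 4
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(num_nodes, dim))  # embedding E_i per node i

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss1(emb, edges):
    # loss1 = -sum over edges (i, j) of w_ij * log sigmoid(E_i^T . E_j)
    total = 0.0
    for i, j, w in edges:
        total -= w * np.log(sigmoid(emb[i] @ emb[j]))
    return total

print(loss1(emb, edges))  # decreases as connected nodes' embeddings align
```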

Second-order proximity modeling:
Second-order proximity is modeled to incorporate neighborhood information. Given a node i, its neighborhood information can be represented as a conditional distribution over target nodes:
p2(j | i) = exp(E_i^T · E_j) / sum_k exp(E_i^T · E_k), where the sum runs over all nodes k in the graph (the paper actually uses separate "context" vectors for the targets j and k, which is also why the normalization must cover every node, not just neighbors).
The empirical distribution is p̂2(j | i) = w_ij / sum_k w_ik, where the sum runs over the out-edges of i; for a non-neighbor k, w_ik = 0.
The KL-divergence between these two conditional distributions given i is KL(i). Sum KL(i) over all nodes i in the graph, weighted by a node-importance (prestige) factor lambda_i; in this paper, node importance is simply the out-degree d_i = sum_k w_ik.
Therefore, the loss function for second-order proximity is:
loss2 = -sum_i { lambda_i * sum_j [ (w_ij / d_i) * log p2(j | i) ] }
With lambda_i = d_i, this simplifies to loss2 = -sum_{(i,j) in E} w_ij * log p2(j | i).
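
A minimal numpy sketch of loss2 with the full softmax (before any negative-sampling approximation); the toy adjacency matrix and names like `ctx` are illustrative assumptions.

```python
import numpy as np

# Toy directed graph as a weighted adjacency matrix W[i, j]; illustrative only.
W = np.array([[0.0, 3.0, 2.0],
              [1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
num_nodes, dim = W.shape[0], 4
rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(num_nodes, dim))  # vertex vectors E_i
ctx = rng.normal(scale=0.1, size=(num_nodes, dim))  # context vectors for targets

def loss2(emb, ctx, W):
    # loss2 = -sum_i lambda_i * sum_j (w_ij / d_i) * log p2(j | i);
    # with lambda_i = d_i this reduces to -sum_{(i,j)} w_ij * log p2(j | i).
    total = 0.0
    for i in range(W.shape[0]):
        scores = emb[i] @ ctx.T                        # E_i^T . E'_k for all k
        log_p = scores - np.log(np.exp(scores).sum())  # log-softmax over ALL nodes
        for j in np.nonzero(W[i])[0]:
            total -= W[i, j] * log_p[j]
    return total

print(loss2(emb, ctx, W))
```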

========================================================================================
Several tricks are used to make optimization easier (a sketch of tricks 1 and 3 follows this list):
1. Negative sampling is used to approximate p2(j | i), as in word2vec.
2. For sparsely connected nodes, the neighborhood is expanded from first order to nth order, with a decay factor: the second-order weight between i and a neighbor-of-neighbor k is w_ik = sum_j w_ij * (w_jk / d_j), summed over intermediate nodes j, where d_j = sum_k w_jk.
3. To narrow the big gap between edge weights (which makes a single SGD learning rate hard to choose), each weighted edge is conceptually unfolded into several binary edges according to its weight value; edge sampling, i.e. sampling edges with probability proportional to their weights, avoids materializing the unfolded edges and saves memory.
4. The embedding of a newly added vertex is computed by minimizing loss1 and loss2 over its edges without updating the existing vertex embeddings.
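
A minimal sketch combining tricks 1 and 3: each SGD step samples one edge proportional to its weight, then applies the negative-sampling objective. The toy graph, the learning rate, and uniform negative sampling are illustrative assumptions (the paper draws negatives from a noise distribution proportional to degree^0.75 and uses the alias method for O(1) edge draws).

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 3.0), (1, 2, 1.0), (0, 2, 2.0)]   # illustrative toy graph
weights = np.array([w for _, _, w in edges])
probs = weights / weights.sum()                   # p(edge) proportional to weight
num_nodes, dim, K, lr = 3, 4, 2, 0.025            # K negatives per positive edge
emb = rng.normal(scale=0.1, size=(num_nodes, dim))  # vertex vectors E_i
ctx = np.zeros((num_nodes, dim))                    # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(1000):
    # Edge sampling (trick 3): drawing an edge with probability proportional
    # to its weight means each step sees an effectively binary edge.
    i, j, _ = edges[rng.choice(len(edges), p=probs)]
    # Negative sampling (trick 1): maximize log sigmoid(E_i . C_j) for the
    # observed edge and log sigmoid(-E_i . C_k) for K random nodes k.
    g = 1.0 - sigmoid(emb[i] @ ctx[j])      # gradient coefficient, positive pair
    grad_i = g * ctx[j]
    ctx[j] += lr * g * emb[i]
    for k in rng.integers(0, num_nodes, size=K):
        g = -sigmoid(emb[i] @ ctx[k])       # gradient coefficient, negative pair
        grad_i += g * ctx[k]
        ctx[k] += lr * g * emb[i]
    emb[i] += lr * grad_i                   # gradient ascent on the objective
```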

This paper presents a new method for embedding graphs. Differences from DeepWalk:
1. LINE uses first-order and second-order proximity to model different kinds of similarity.
2. LINE explores the neighborhood breadth-first (mainly the first-order neighborhood, expanded to nth order for sparsely connected nodes), while DeepWalk explores depth-first via random walks that model behavior traces. This reflects their different objectives (relation-oriented vs. sequence/action-oriented).
