Graph representation learning alleviates the need to do feature engineering every single time (the features are learned automatically)
Goal: efficient task-independent feature learning for machine learning with graphs
Why embedding?
Goal: encode nodes so that similarity in the embedding space approximates similarity in the graph
a) Encoder ENC maps nodes to embeddings (low-dimensional vectors)
b) Define a node similarity function (i.e., a measure of similarity in the original network)
c) Decoder DEC maps from embeddings to the similarity score
d) Optimize the parameters of the encoder so that $\text{similarity}(u, v) \approx z_v^T z_u$
Simplest encoding approach: the encoder is just an embedding lookup, so each node is assigned a unique embedding vector
$$\text{ENC}(v) = z_v = Z \cdot v$$
where $Z \in \mathbb{R}^{d \times |V|}$ is a matrix whose columns are the node embeddings and $v \in \{0,1\}^{|V|}$ is an indicator vector, all zeroes except for a one in the position indicating node $v$
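As a sanity check, here is a minimal numpy sketch of this lookup encoder; the graph size and embedding dimension are arbitrary toy values, not anything prescribed by the method:

```python
import numpy as np

num_nodes, embedding_dim = 5, 3          # toy sizes for illustration

# Z: one d-dimensional embedding per node, stored as columns
Z = np.random.randn(embedding_dim, num_nodes)

def enc(v: int) -> np.ndarray:
    """Embedding lookup: multiply Z by the indicator vector of node v."""
    indicator = np.zeros(num_nodes)
    indicator[v] = 1.0                   # all zeroes except position v
    return Z @ indicator                 # equivalent to simply Z[:, v]

z_v = enc(2)                             # embedding of node 2
```

In practice the matrix product is never materialized; the lookup is implemented directly as the column (or row) slice of $Z$.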
Methods: DeepWalk, node2vec
$z_u^T z_v \approx$ probability that $u$ and $v$ co-occur on a random walk over the graph
Why random walks?
Intuition: find an embedding of nodes in $d$-dimensional space that preserves similarity
Idea: learn node embeddings such that nodes that are nearby in the network are close together in the embedding space
$N_R(u)$: neighborhood of $u$ obtained by the strategy $R$
Goal: learn a mapping $f: u \to \mathbb{R}^d$: $f(u) = z_u$
Log-likelihood objective:
$$\max_{f}\sum_{u\in V} \log P(N_R(u)\mid z_u)$$
Given node $u$, we want to learn feature representations that are predictive of the nodes in its random walk neighborhood $N_R(u)$
a) Run short fixed-length random walks starting from each node $u$ in the graph using the random walk strategy $R$
b) For each node $u$, collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$
c) Optimize embeddings
$$\max_{f}\sum_{u\in V} \log P(N_R(u)\mid z_u)$$
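A minimal Python sketch of steps a) and b), assuming a small toy graph stored as an adjacency list (the graph, walk count, and walk length are illustrative):

```python
import random
from collections import defaultdict

# hypothetical toy graph as an adjacency list
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def random_walk(graph, start, length):
    """One short fixed-length, unbiased random walk from `start`."""
    walk = [start]
    for _ in range(length):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

def collect_neighborhoods(graph, num_walks=10, walk_length=5):
    """N_R(u): the multiset of nodes visited on walks starting from u."""
    N_R = defaultdict(list)
    for u in graph:
        for _ in range(num_walks):
            N_R[u].extend(random_walk(graph, u, walk_length)[1:])
    return N_R
```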
Parameterize $P(v\mid z_u)$ using softmax:
$$P(v\mid z_u)=\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}$$
Let
$$L=\sum_{u\in V}\sum_{v\in N_R(u)}-\log P(v\mid z_u)=\sum_{u\in V}\sum_{v\in N_R(u)}-\log\left(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\right)$$
Optimizing random walk embeddings = finding embeddings $z_u$ that minimize $L$
Time complexity: $O(|V|^2)$, since the softmax denominator sums over all nodes and is itself nested inside a sum over all nodes
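To make the cost concrete, here is a naive sketch of computing $L$, taking the column-embedding matrix `Z` and the neighborhoods `N_R` in the shapes used by the sketches above; the denominator touches every node for every term of the outer sum:

```python
import numpy as np

def naive_loss(Z, N_R):
    """Full-softmax loss L. Computing the denominator is O(|V|), and it
    sits inside a loop over all nodes, hence the O(|V|^2) total cost."""
    loss = 0.0
    for u, neighborhood in N_R.items():
        scores = Z[:, u] @ Z                      # z_u^T z_n for every node n
        log_denominator = np.log(np.exp(scores).sum())
        for v in neighborhood:
            loss -= Z[:, u] @ Z[:, v] - log_denominator
    return loss
```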
Solution: Negative sampling
$$\log\left(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\right) \approx \log(\sigma(z_u^T z_v)) - \sum_{i=1}^k \log(\sigma(z_u^T z_{n_i})), \quad n_i\sim P_V$$
where $\sigma(\cdot)$ is the sigmoid function and the $n_i$ are drawn from $P_V$, a random distribution over nodes
Instead of normalizing w.r.t. all nodes, just normalize against $k$ random “negative samples” $n_i$
Sample $k$ negative nodes, each with probability proportional to its degree ($k = 5\sim 20$ in practice)
To minimize L L L, we can use gradient descent (GD) or stochastic gradient descent (SGD)
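A minimal sketch of one SGD step under the negative-sampling objective above; the learning rate, initialization, and toy degree distribution are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_step(Z, u, v, neg_probs, k=5, lr=0.025):
    """One stochastic step that increases
    log sigma(z_u^T z_v) - sum_i log sigma(z_u^T z_{n_i})
    for a single co-occurring pair (u, v) and k sampled negatives."""
    zu = Z[:, u].copy()
    # positive pair: pull z_u and z_v together
    g = 1.0 - sigmoid(zu @ Z[:, v])
    Z[:, u] += lr * g * Z[:, v]
    Z[:, v] += lr * g * zu
    # negatives, sampled with probability proportional to node degree
    for n in rng.choice(len(neg_probs), size=k, p=neg_probs):
        g = 1.0 - sigmoid(zu @ Z[:, n])
        Z[:, u] -= lr * g * Z[:, n]
        Z[:, n] -= lr * g * zu

# toy setup: 4 nodes with degrees (2, 2, 3, 1), d = 3
degrees = np.array([2.0, 2.0, 3.0, 1.0])
Z = 0.1 * rng.standard_normal((3, 4))
negative_sampling_step(Z, u=0, v=1, neg_probs=degrees / degrees.sum(), k=2)
```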
How should we randomly walk?
Simplest idea: just run fixed-length, unbiased random walks starting from each node (this is DeepWalk)
Interpolating BFS and DFS
Two parameters:
a) Return parameter $p$: controls the probability of walking back to the node we just came from
b) In-out parameter $q$: controls moving outwards (DFS-like) vs. inwards (BFS-like); intuitively, $q$ is the “ratio” of BFS vs. DFS
Biased Random Walks
Idea: remember where the walk came from
Node2vec Algorithm
a) compute random walk probabilities
b) simulate $r$ random walks of length $l$ starting from each node $u$
c) optimize the node2vec objective using SGD
Time complexity: linear
Steps are individually parallelizable
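For concreteness, a sketch of a single biased walk, reusing the same hypothetical toy graph as above. Production implementations precompute the per-edge transition probabilities in step a), but this inline version makes the roles of $p$ and $q$ visible:

```python
import random

graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy graph

def biased_walk(graph, start, length, p=1.0, q=1.0):
    """node2vec-style second-order walk: each step remembers the
    previous node. p = return parameter, q = in-out parameter."""
    walk = [start, random.choice(graph[start])]
    for _ in range(length - 1):
        prev, cur = walk[-2], walk[-1]
        weights = []
        for x in graph[cur]:
            if x == prev:               # go back where we came from
                weights.append(1.0 / p)
            elif x in graph[prev]:      # stay at distance 1 from prev (BFS-like)
                weights.append(1.0)
            else:                       # move outwards (DFS-like)
                weights.append(1.0 / q)
        walk.append(random.choices(graph[cur], weights=weights)[0])
    return walk

walk = biased_walk(graph, start=0, length=5, p=0.5, q=2.0)
```

Small $p$ keeps the walk close to home (BFS-like exploration); small $q$ pushes it outwards (DFS-like exploration).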
Goal: embed a subgraph or an entire graph $G$
Run a standard graph embedding technique on the (sub)graph $G$ and then just sum the node embeddings in the (sub)graph $G$
$$z_G=\sum_{v\in G}z_v$$
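This readout is a one-liner given a column-embedding matrix `Z` like the one in the sketches above:

```python
import numpy as np

def embed_graph(Z, nodes):
    """z_G: sum of the node embeddings over the (sub)graph's nodes."""
    return Z[:, list(nodes)].sum(axis=1)
```

Averaging instead of summing is a common variant of the same idea.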
Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique