CS224W: Machine Learning with Graphs - 03 Node Embeddings

Node Embeddings

1. Graph Representation Learning

Graph representation learning alleviates the need to do feature engineering every single time by learning the features automatically
Goal: efficient task-independent feature learning for machine learning with graphs
Why embedding?

  • Similarity of embeddings between nodes indicates their similarity in the network
  • Encode network information
  • Potentially used for many downstream tasks (node classification, link prediction, graph classification, anomalous node detection, clustering, …)

2. Node Embeddings: Encoder and Decoder

Goal: encode nodes so that similarity in the embedding space approximates similarity in the graph
a) Encoder ENC maps from nodes to embeddings (a low-dimensional vector)
b) Define a node similarity function (i.e., a measure of similarity in the original network)
c) Decoder DEC maps from embeddings to the similarity score
d) Optimize the parameters of the encoder so that $\text{similarity}(u, v) \approx z_v^T z_u$

1). “Shallow” Encoding

Simplest encoding approach: the encoder is just an embedding lookup, so each node is assigned a unique embedding vector
$$\text{ENC}(v) = z_v = Z \cdot v$$
where $Z \in \mathbb{R}^{d \times |V|}$ is a matrix whose columns are the node embeddings and $v$ is an indicator vector with all zeros except a one in the column indicating node $v$
Methods: DeepWalk, node2vec
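As a minimal sketch (dimensions and values are made up for illustration), the shallow encoder is just an embedding-matrix lookup: multiplying $Z$ by the indicator vector of node $v$ selects that node's column.

```python
import numpy as np

num_nodes, dim = 5, 3                  # toy sizes: |V| nodes, d-dimensional embeddings
Z = np.random.randn(dim, num_nodes)    # embedding matrix Z: one column per node

def encode(v_idx):
    # ENC(v) = Z . v : multiplying by the indicator vector selects column v_idx of Z
    v = np.zeros(num_nodes)
    v[v_idx] = 1.0
    return Z @ v

z_v = encode(2)                        # embedding of node 2
assert np.allclose(z_v, Z[:, 2])       # identical to directly looking up column 2
```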

3. Random Walk Approaches for Node Embeddings

  • Vector $z_u$ is the embedding of node $u$
  • Probability $P(v \mid z_u)$ is the (predicted) probability of visiting node $v$ on random walks starting from node $u$
  • Random walk: given a graph and a starting point, we select one of its neighbors at random and move to this neighbor; then we select a neighbor of this point at random and move to it, etc. The (random) sequence of points visited this way is a random walk on the graph (see the sketch below).
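A minimal sketch of such an unbiased random walk, with a toy adjacency-list graph assumed for illustration:

```python
import random

# Toy undirected graph as an adjacency list (assumed example)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

def random_walk(graph, start, walk_length):
    # Repeatedly hop to a uniformly random neighbor of the current node
    walk = [start]
    for _ in range(walk_length):
        walk.append(random.choice(graph[walk[-1]]))
    return walk

print(random_walk(graph, start=0, walk_length=5))   # e.g. [0, 2, 1, 3, 1, 0]
```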

1). Random-walk Embeddings

$$z_u^T z_v \approx \text{probability that } u \text{ and } v \text{ co-occur on a random walk over the graph}$$

  • Estimate the probability $P_R(v \mid u)$ of visiting node $v$ on a random walk starting from node $u$ under the random walk strategy $R$
  • Optimize embeddings to encode these random walk statistics

Why random walks?

  • Expressivity: flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information (if a random walk starting from node $u$ visits $v$ with high probability, $u$ and $v$ are similar)
  • Efficiency: do not need to consider all node pairs when training; only need to consider pairs that co-occur on random walks

2). Unsupervised Feature Learning

Intuition: find an embedding of nodes in $d$-dimensional space that preserves similarity
Idea: learn node embeddings such that nodes that are nearby in the network end up close together in the embedding space
$N_R(u)$: neighborhood of $u$ obtained by the strategy $R$
Goal: learn a mapping $f: u \to \mathbb{R}^d$ with $f(u) = z_u$
Log-likelihood objective:
$$\max_f \sum_{u\in V} \log P(N_R(u) \mid z_u)$$
Given node $u$, we want to learn feature representations that are predictive of the nodes in its random walk neighborhood $N_R(u)$

3). Random-walk Optimization

a) Run short fixed-length random walks starting from each node $u$ in the graph using the random walk strategy $R$
b) For each node $u$, collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$
c) Optimize embeddings:
$$\max_f \sum_{u\in V} \log P(N_R(u) \mid z_u)$$
Parameterize $P(v \mid z_u)$ using the softmax:
$$P(v \mid z_u)=\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}$$
Let
$$L=\sum_{u\in V}\sum_{v\in N_R(u)}-\log P(v \mid z_u)=\sum_{u\in V}\sum_{v\in N_R(u)}-\log\left(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\right)$$
Optimizing random walk embeddings = finding embeddings $z_u$ that minimize $L$
Naively this is expensive: the softmax denominator sums over all nodes, so evaluating $L$ costs $O(|V|^2)$ time
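A minimal sketch of the naive loss computation that makes the quadratic cost explicit (the toy embedding matrix and neighborhoods are assumptions for illustration):

```python
import numpy as np

def naive_softmax_loss(Z, neighborhoods):
    # Z: |V| x d embedding matrix; neighborhoods: dict u -> list of nodes in N_R(u)
    loss = 0.0
    for u, neigh in neighborhoods.items():
        scores = Z @ Z[u]                         # z_u^T z_n for every node n: O(|V|) work
        log_denom = np.log(np.exp(scores).sum())  # softmax normalizer over all nodes
        for v in neigh:
            loss -= Z[u] @ Z[v] - log_denom       # -log P(v | z_u)
    # Summing over all starting nodes u and normalizing over all nodes n is O(|V|^2)
    return loss

Z = np.random.randn(4, 8)                         # toy: 4 nodes, 8-dim embeddings (assumed)
print(naive_softmax_loss(Z, {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}))
```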
Solution: Negative sampling
$$\log\left(\frac{\exp(z_u^T z_v)}{\sum_{n\in V}\exp(z_u^T z_n)}\right) \approx \log\left(\sigma(z_u^T z_v)\right) - \sum_{i=1}^k \log\left(\sigma(z_u^T z_{n_i})\right), \quad n_i\sim P_V$$
where $\sigma(\cdot)$ is the sigmoid function and the negative samples $n_i$ are drawn from a distribution $P_V$ over nodes
Instead of normalizing w.r.t. all nodes, we just normalize against $k$ random "negative samples" $n_i$
Sample $k$ negative nodes, each with probability proportional to its degree ($k = 5 \sim 20$ in practice)
To minimize $L$, we can use gradient descent (GD) or stochastic gradient descent (SGD)
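A minimal sketch of the negative-sampling loss for a single co-occurring pair $(u, v)$. This follows the standard word2vec-style form, which puts $\sigma(-z_u^T z_{n_i})$ in the negative term; the embedding matrix, degree vector, and sampler below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(Z, u, v, degrees, k=5):
    # Positive term: pull z_u and z_v together
    loss = -np.log(sigmoid(Z[u] @ Z[v]))
    # Negative term: sample k nodes with probability proportional to degree
    probs = degrees / degrees.sum()
    negatives = rng.choice(len(degrees), size=k, p=probs)
    for n in negatives:
        loss -= np.log(sigmoid(-Z[u] @ Z[n]))    # push z_u and z_n apart
    return loss

Z = np.random.randn(10, 16)                      # toy embeddings: 10 nodes, 16 dims (assumed)
degrees = np.array([3, 1, 4, 2, 5, 1, 2, 3, 1, 2], dtype=float)
print(neg_sampling_loss(Z, u=0, v=4, degrees=degrees, k=5))
```

In a full training loop, this loss would be accumulated over all pairs $(u, v)$ with $v \in N_R(u)$ and minimized with SGD.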
How should we randomly walk?
Simplest idea: just run fixed-length, unbiased random walks starting from each node (this is the idea of DeepWalk)

4). Overview of Node2vec

  • Goal: embed nodes with similar network neighborhoods close in the feature space
  • Key observation: a flexible notion of the network neighborhood $N_R(u)$ of node $u$ leads to rich node embeddings
  • Develop a biased $2^{\text{nd}}$-order random walk $R$ to trade off between local (BFS) and global (DFS) views of the network

Interpolating BFS and DFS
Two parameters:

  • Return parameter $p$: return back to the previous node
  • In-out parameter $q$ (ratio of BFS vs. DFS): moving outwards (DFS) or inwards (BFS)

Biased Random Walks
Idea: remember where the walk came from (see the sketch after this list)

  • BFS-like walk: low value of $p$
  • DFS-like walk: low value of $q$
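A minimal sketch of one step of such a biased second-order walk: the unnormalized transition weights $1/p$, $1$, $1/q$ depend on whether the candidate neighbor returns to the previous node, stays at the same distance from it, or moves farther away (the toy graph is an assumption for illustration).

```python
import random

# Toy undirected graph as an adjacency list (assumed example)
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}

def biased_step(prev, curr, p, q):
    # Second-order walk: the next hop depends on where the walk came from (prev)
    neighbors = graph[curr]
    weights = []
    for nxt in neighbors:
        if nxt == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif nxt in graph[prev]:        # same distance from prev (BFS-like move)
            weights.append(1.0)
        else:                           # moving farther from prev (DFS-like move)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

print(biased_step(prev=0, curr=1, p=1.0, q=0.5))  # next node of the walk 0 -> 1 -> ?
```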

Node2vec Algorithm
a) compute random walk probabilities
b) simulate $r$ random walks of length $l$ starting from each node $u$
c) optimize the node2vec objective using SGD
Time Complexity: linear
Steps are individually parallelizable

5). Other Random Walk Ideas

  • Different kinds of biased random walks: based on node attributes/learned weights
  • Alternative optimization schemes: directly optimize based on 1-hop and 2-hop random walk probabilities
  • Network preprocessing techniques: run random walks on modified versions of the original network

4. Embedding Entire Graphs

Goal: embed a subgraph or an entire graph $G$

1). Approach 1

Run a standard graph embedding technique on the (sub)graph $G$ and then just sum the node embeddings in the (sub)graph $G$:
$$z_G=\sum_{v\in G} z_v$$
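A minimal sketch, assuming a toy node-embedding matrix with one row per node and a made-up list of node indices belonging to the (sub)graph:

```python
import numpy as np

Z = np.random.randn(10, 16)        # toy node embeddings: 10 nodes, 16 dims (assumed)
nodes_in_G = [0, 3, 7]             # node indices belonging to the (sub)graph (assumed)

z_G = Z[nodes_in_G].sum(axis=0)    # graph embedding = sum of its node embeddings
```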

2). Approach 2

Introduce a “virtual node” to represent the (sub)graph and run a standard graph embedding technique

3). Approach 3: Anonymous Walk Embeddings

  • States in anonymous walks correspond to the index of the first time we visited the node in a random walk (agnostic to the identity of the nodes visited). Simulate anonymous walks $w_i$ of $l$ steps, record their counts, and represent the graph as a probability distribution over these walks (see the sketch after this list)
  • The number of distinct anonymous walks grows exponentially with the walk length
  • Sampling anonymous walks
    Generate a set of $m$ independent random walks
    Represent the graph as a probability distribution over these walks; to have error of more than $\epsilon$ with probability less than $\delta$, it suffices to take
    $$m = \left\lceil \frac{2}{\epsilon^2}\left(\log(2^{\eta}-2)-\log(\delta)\right) \right\rceil$$
    where $\eta$ is the total number of anonymous walks of length $l$
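A minimal sketch of sampling walks, anonymizing them, and building the distribution (the toy graph, walk length, and 0-based state indexing are assumptions for illustration):

```python
from collections import Counter
import random

# Toy undirected graph as an adjacency list (assumed example)
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def anonymize(walk):
    # Replace each node by the index of its first appearance in the walk,
    # e.g. [5, 9, 5, 3] -> (0, 1, 0, 2)
    first_seen = {}
    return tuple(first_seen.setdefault(v, len(first_seen)) for v in walk)

def anonymous_walk_distribution(graph, m=1000, length=4):
    counts = Counter()
    nodes = list(graph)
    for _ in range(m):
        walk = [random.choice(nodes)]
        for _ in range(length - 1):
            walk.append(random.choice(graph[walk[-1]]))
        counts[anonymize(walk)] += 1
    # Normalize counts into a probability distribution over anonymous walks
    return {w: c / m for w, c in counts.items()}

dist = anonymous_walk_distribution(graph)
print(sorted(dist.items(), key=lambda kv: -kv[1])[:3])   # most frequent anonymous walks
```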

5. How to Use Embeddings

  • Clustering/community detection: cluster points $z_i$
  • Node classification: predict the label of node $i$ based on $z_i$
  • Link prediction: predict edge $(i, j)$ based on $(z_i, z_j)$, where we can concatenate, average, take the (Hadamard) product, or take the difference of the embeddings (see the sketch below)
  • Graph classification: obtain a graph embedding $z_G$ by aggregating node embeddings or via anonymous random walks, then predict the label based on $z_G$
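A minimal sketch of the edge-feature operators mentioned for link prediction (the embeddings are toy values assumed for illustration):

```python
import numpy as np

def edge_features(z_i, z_j):
    # Common ways to combine two node embeddings into a feature for edge (i, j)
    return {
        "concat": np.concatenate([z_i, z_j]),
        "average": (z_i + z_j) / 2,
        "hadamard": z_i * z_j,        # element-wise product
        "difference": z_i - z_j,
    }

z_i, z_j = np.random.randn(8), np.random.randn(8)   # toy embeddings (assumed)
feats = edge_features(z_i, z_j)
print({name: vec.shape for name, vec in feats.items()})
```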
