CS224W: Machine Learning with Graphs - 06 Graph Neural Networks (GNN) 1: GNN Model

GNN Model

0. Limitations of shallow embedding methods

  • $O(|V|)$ parameters are needed: no sharing of parameters between nodes, so every node has its own unique embedding (see the sketch after this list)
  • Inherently “transductive”: cannot generate embeddings for nodes not seen during training
  • Do not incorporate node features: features should be leveraged
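
A minimal sketch of the parameter-count and transductivity issues, assuming PyTorch (sizes are made up for illustration):

```python
import torch
import torch.nn as nn

num_nodes, dim = 10_000, 64

# Shallow encoder: one free vector per node, so parameters grow as O(|V|)
shallow = nn.Embedding(num_nodes, dim)
print(sum(p.numel() for p in shallow.parameters()))  # 640_000 = |V| * dim

# A node id outside the training graph has no row in the lookup table,
# so the transductive encoder simply cannot embed it.
try:
    shallow(torch.tensor([num_nodes]))  # unseen node id
except IndexError as e:
    print("cannot embed unseen node:", e)
```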

1. Deep Graph Encoders

0). Deep Methods based on GNN

$\text{ENC}(v)=$ multiple layers of non-linear transformations based on graph structure
Note: all deep encoders can be combined with node similarity functions

1). Modern ML Toolbox

The modern deep learning toolbox is designed for simple sequences and grids, but networks are far more complex:

  • Arbitrary size and complex topological structure (i.e., no spatial locality like grids)
  • No fixed node ordering or reference point
  • Often dynamic and have multimodal features

2. Basics of Deep Learning

To be updated

3. Deep Learning for Graphs

1). A Naive Approach

Join the adjacency matrix and node features, then feed them into a deep neural network (see the sketch after this list)
Issues:

  • $O(|V|)$ parameters
  • Not applicable to graphs of different sizes
  • Sensitive to node ordering
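
A hedged sketch of this naive approach in PyTorch (all sizes and names are illustrative), showing why the input size, and hence the weight matrix, is tied to $|V|$ and to the node ordering:

```python
import torch
import torch.nn as nn

num_nodes, feat_dim, hidden = 5, 3, 16

A = torch.randint(0, 2, (num_nodes, num_nodes)).float()  # adjacency matrix
X = torch.randn(num_nodes, feat_dim)                      # node features

# Concatenate adjacency and features per node, then flatten to one vector
inp = torch.cat([A, X], dim=1).flatten()                  # length |V| * (|V| + d)

mlp = nn.Sequential(nn.Linear(inp.numel(), hidden), nn.ReLU(), nn.Linear(hidden, 1))
out = mlp(inp)
print(out.shape)

# Issues: the first Linear layer alone has O(|V|^2) weights, a graph with a
# different |V| does not fit the input size, and permuting rows/columns of A
# changes `inp` and therefore the output.
```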

2). Convolutional Networks

a). From images to graphs

Goal: generalize convolutions beyond simple lattices and leverage node features/attributes
Problem:

  • There is no fixed notion of locality or sliding window on the graph
  • Graphs are permutation invariant: there is no canonical ordering of the nodes

Idea: transform information at the neighbors and combine it (a toy sketch follows this list):

  • Transform “message” $h_i$ from neighbor $i$: $W_i h_i$
  • Add them up: $\sum_i W_i h_i$
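
A toy illustration of this idea in PyTorch (a sketch, not the lecture's code): transform each incoming message with its weight matrix and sum the results.

```python
import torch

dim = 4
neighbor_msgs = [torch.randn(dim) for _ in range(3)]   # messages h_i from 3 neighbors
weights = [torch.randn(dim, dim) for _ in range(3)]    # per-message weights W_i

# Transform each message and add them up: sum_i W_i h_i
aggregated = sum(W @ h for W, h in zip(weights, neighbor_msgs))
print(aggregated.shape)  # torch.Size([4])
```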
b). Graph convolutional networks

Idea: node’s neighborhood defines a computation graph (determine node computation graph; propagate and transform information)
Basic approach: average information from neighbors and apply a neural network (a minimal sketch follows the definitions below)
$h_v^0 = x_v$
$h_v^{l+1} = \sigma\left(W_l\sum_{u\in N(v)} \dfrac{h_u^l}{|N(v)|} + B_l h_v^l\right), \quad \forall l\in \{0,\dots,L-1\}$
$z_v = h_v^L$
where

  • $h_v^l$: hidden representation of node $v$ at layer $l$
  • $W_l$: weight matrix for neighborhood aggregation
  • $B_l$: weight matrix for transforming the node's own hidden vector
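
A minimal per-node implementation of this update rule, assuming PyTorch and an explicit neighbor list (class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One layer of h_v^{l+1} = sigma(W_l * mean_{u in N(v)} h_u^l + B_l * h_v^l)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # neighborhood aggregation
        self.B = nn.Linear(in_dim, out_dim, bias=False)  # self transformation

    def forward(self, H, neighbors):
        # H: [|V|, in_dim] hidden states; neighbors: dict {v: list of neighbor ids}
        out = []
        for v in range(H.size(0)):
            nbrs = neighbors[v]
            agg = H[nbrs].mean(dim=0) if len(nbrs) > 0 else torch.zeros_like(H[v])
            out.append(torch.relu(self.W(agg) + self.B(H[v])))
        return torch.stack(out)

# Tiny usage example on a 3-node path graph 0-1-2
H0 = torch.randn(3, 8)                       # h_v^0 = x_v
layer = GCNLayer(8, 16)
H1 = layer(H0, {0: [1], 1: [0, 2], 2: [1]})
print(H1.shape)  # torch.Size([3, 16])
```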
c). Matrix formulation

Many aggregations can be performed efficiently by (sparse) matrix operations
Let $H^l=[h_1^l \cdots h_{|V|}^l]^T$, then $\sum_{u\in N(v)}h_u^l=A_vH^l$
Let $D$ be the diagonal degree matrix where $D_{vv} = \mathrm{Deg}(v)=|N(v)|$, so $D_{vv}^{-1} = 1/|N(v)|$
Rewriting the update function in matrix form:
$H^{l+1}=\sigma(\tilde A H^l W_l^T + H^l B_l^T)$
where $\tilde A=D^{-1}A$
This implies that efficient sparse matrix multiplication can be used ($\tilde A$ is sparse)
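
A sketch of the same update in matrix form with PyTorch sparse tensors (the graph and variable names are illustrative):

```python
import torch

# Adjacency of a 3-node path graph 0-1-2, stored as a sparse COO tensor
indices = torch.tensor([[0, 1, 1, 2],
                        [1, 0, 2, 1]])
A = torch.sparse_coo_tensor(indices, torch.ones(4), (3, 3))

deg = torch.sparse.mm(A, torch.ones(3, 1))   # column vector of degrees |N(v)|
H = torch.randn(3, 8)                        # H^l
W = torch.randn(16, 8)                       # W_l
B = torch.randn(16, 8)                       # B_l

# A_tilde H^l = D^{-1} A H^l: sparse matmul followed by row-wise scaling
AH = torch.sparse.mm(A, H) / deg
H_next = torch.relu(AH @ W.T + H @ B.T)      # H^{l+1} = sigma(A_tilde H W^T + H B^T)
print(H_next.shape)  # torch.Size([3, 16])
```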

d). How to train a GNN
  • Node embedding $z_v$ is a function of the input graph
  • Supervised setting: minimize the loss $L$
    $\min_\theta L(y, f(z_v))$
    Example: node classification, where $\theta$ denotes the classification weights (both losses are sketched after this list)
    $L=-\sum_{v\in V}\left[y_v\log\left(\sigma(z_v^T\theta)\right)+(1-y_v)\log\left(1-\sigma(z_v^T\theta)\right)\right]$
  • Unsupervised setting: no node labels are available, so use the graph structure as the supervision.
    Similar nodes have similar embeddings
    $L=\sum_{z_u,z_v}\text{CrossEntropy}(y_{uv}, \text{DEC}(z_u,z_v))$
    where $y_{uv}=1$ when nodes $u$ and $v$ are similar and DEC is the decoder (e.g., inner product)
    Node similarity can be anything such as random walks (node2vec, DeepWalk, struc2vec), matrix factorization, or node proximity in the graph
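
A hedged sketch of both training signals in PyTorch; the classifier weights $\theta$ and the inner-product decoder are illustrative choices, not fixed by the lecture:

```python
import torch
import torch.nn.functional as F

Z = torch.randn(5, 16, requires_grad=True)        # node embeddings z_v (GNN output)

# Supervised: binary node classification with logits z_v^T theta
theta = torch.randn(16, requires_grad=True)       # classification weights
y = torch.tensor([0., 1., 1., 0., 1.])            # node labels y_v
supervised_loss = F.binary_cross_entropy_with_logits(Z @ theta, y)

# Unsupervised: "similar nodes have similar embeddings"
# y_uv = 1 for similar pairs; decoder DEC(z_u, z_v) = inner product
pairs = torch.tensor([[0, 1], [2, 3], [1, 4]])    # candidate node pairs (u, v)
y_uv = torch.tensor([1., 0., 1.])                 # 1 if u and v are "similar"
scores = (Z[pairs[:, 0]] * Z[pairs[:, 1]]).sum(dim=1)   # inner-product decoder
unsupervised_loss = F.binary_cross_entropy_with_logits(scores, y_uv)

print(supervised_loss.item(), unsupervised_loss.item())
```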
e). Model design: overview
  1. Define a neighborhood aggregation function
  2. Define a loss function on the embeddings
  3. Train on a set of nodes, i.e., a batch of compute graphs
  4. Generate embeddings for nodes as needed (a compact end-to-end sketch follows)
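
One way the four steps might line up in code: a compact, self-contained sketch using a dense mean-of-neighbors matrix (all names, sizes, and the toy graph are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny graph: 4 nodes; A_tilde = D^{-1} A, kept dense for brevity
A = torch.tensor([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])
A_tilde = A / A.sum(dim=1, keepdim=True)
X = torch.randn(4, 8)                      # node features x_v
y = torch.tensor([0., 0., 1., 1.])         # labels

# 1. Define the neighborhood aggregation: H' = relu(A_tilde X W^T + X B^T)
W = nn.Linear(8, 16, bias=False)
B = nn.Linear(8, 16, bias=False)
theta = nn.Linear(16, 1)                   # classifier on top of the embeddings
opt = torch.optim.Adam(
    list(W.parameters()) + list(B.parameters()) + list(theta.parameters()), lr=0.01)

# 3. Train on a set of nodes
train_nodes = torch.tensor([0, 2])
for _ in range(100):
    Z = torch.relu(W(A_tilde @ X) + B(X))  # embeddings z_v
    # 2. Loss on the embeddings (binary cross-entropy on the training nodes)
    loss = F.binary_cross_entropy_with_logits(
        theta(Z).squeeze(-1)[train_nodes], y[train_nodes])
    opt.zero_grad()
    loss.backward()
    opt.step()

# 4. Generate embeddings for all nodes (including ones never used in the loss)
Z = torch.relu(W(A_tilde @ X) + B(X))
print(Z.shape)  # torch.Size([4, 16])
```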
f). Inductive Capability

The same aggregation parameters ($W_l$ and $B_l$) are shared for all nodes: the number of model parameters is sublinear in $|V|$ and we can generalize to unseen nodes (new graphs or new nodes).
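
A short sketch of this inductive property: the same layer weights embed graphs of different sizes (the graphs here are random and purely illustrative):

```python
import torch
import torch.nn as nn

W = nn.Linear(8, 16, bias=False)   # shared neighborhood-aggregation weights
B = nn.Linear(8, 16, bias=False)   # shared self-transformation weights

def gcn_layer(A, X):
    """Mean-of-neighbors update; the same W and B work for any graph size."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # avoid division by zero
    return torch.relu(W((A / deg) @ X) + B(X))

# The same parameters embed a 3-node graph ...
Z_small = gcn_layer(torch.tensor([[0., 1., 1.],
                                  [1., 0., 0.],
                                  [1., 0., 0.]]), torch.randn(3, 8))
# ... and a larger, previously unseen 100-node graph.
A_big = (torch.rand(100, 100) < 0.1).float()
A_big = ((A_big + A_big.T) > 0).float()
A_big.fill_diagonal_(0)
Z_big = gcn_layer(A_big, torch.randn(100, 8))
print(Z_small.shape, Z_big.shape)  # torch.Size([3, 16]) torch.Size([100, 16])
```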
