CS224W: Machine Learning with Graphs - 09 How Expressive are GNNs

How Expressive are GNNs

0. Theory of GNNs

How powerful are GNNs?

  • Many GNN models have been proposed (e.g., GCN, GAT, GraphSAGE)
  • What is the expressive power (ability to distinguish different graph structures) of these models?
  • How to design a maximally expressive GNN model?

1. Local Neighborhood Structures

We specifically consider the local neighborhood structure around each node in a graph.
Key question: can GNN node embeddings distinguish different nodes’ local neighborhood structures?
Next: we need to understand how a GNN captures local neighborhood structures

1). Computational Graph

  • In each layer, a GNN aggregates neighboring node embeddings.
  • A GNN generates node embeddings through a computational graph defined by the neighborhood.
  • But a GNN only sees node features (not node IDs).
  • A GNN will generate the same embedding for two nodes if their computational graphs are the same and their node features are identical.
  • In general, different local neighborhoods define different computation graphs
  • Computational graphs are identical to rooted subtree structures around each node.
  • A GNN’s node embeddings capture rooted subtree structures.
  • The most expressive GNN maps different rooted subtrees to different node embeddings.
  • That is, the most expressive GNN should map subtrees to node embeddings injectively.

Key observation: subtrees of the same depth can be recursively characterized from the leaf nodes to the root nodes

  • If each step of the GNN’s aggregation fully retains the neighboring information, the generated node embeddings can distinguish different rooted subtrees (see the sketch after this list).
  • In other words, the most expressive GNN would use an injective neighbor aggregation function at each step, i.e., one that maps different sets of neighbors to different embeddings.
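As a concrete illustration of the bullets above, here is a minimal sketch (plain Python, with a hypothetical toy graph) that unrolls a node's K-layer computational graph into its rooted subtree; two nodes with identical features and identical rooted subtrees are indistinguishable to a message-passing GNN.

```python
# Minimal sketch: unroll a node's K-hop computational graph into a rooted subtree.
# The graph and feature values below are hypothetical illustrations.

def rooted_subtree(adj, feats, node, depth):
    """Recursively expand the computational graph of `node` down to `depth` layers.

    A GNN only sees features, not node IDs, so the subtree stores features only.
    Children are sorted so that structurally identical neighborhoods compare equal.
    """
    if depth == 0:
        return (feats[node],)
    children = tuple(sorted(rooted_subtree(adj, feats, u, depth - 1) for u in adj[node]))
    return (feats[node], children)

# Two nodes (1 and 4) sitting in different parts of the graph, but with
# identical features and identical 2-hop neighborhood structure.
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5, 6], 5: [4, 6], 6: [4, 5]}
feats = {v: "x" for v in adj}          # identical input features everywhere

print(rooted_subtree(adj, feats, 1, 2) == rooted_subtree(adj, feats, 4, 2))  # True
```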

2. Design the Most Powerful GNN

1). Neighbor Aggregation

Observation: neighbor aggregation can be abstracted as a function over a multi-set (a set with repeating elements)

a). GCN (mean-pool)
  • Take the element-wise mean, followed by a linear function and ReLU activation
  • Theorem: GCN’s aggregation function cannot distinguish different multi-sets with the same color proportion.
b). GraphSAGE (max-pool)
  • Apply an MLP, then take element-wise max
  • Theorem: GraphSAGE’s aggregation function cannot distinguish different multi-sets with the same set of distinct colors.
c). Summary
  • Expressive power of GNNs can be characterized by that of the neighbor aggregation function
  • Neighbor aggregation is a function over multi-sets (sets with repeating elements)
  • GCN’s and GraphSAGE’s aggregation functions cannot distinguish some basic multi-sets and hence are not injective (see the numerical check after this list)
  • Therefore, GCN and GraphSAGE are not maximally powerful GNNs.
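A quick numerical check of the two theorems above, using one-hot node "colors" as features (the specific multisets are hypothetical examples): mean pooling collapses multisets with the same color proportions, max pooling collapses multisets with the same set of distinct colors, while sum pooling keeps both pairs apart.

```python
import numpy as np

yellow = np.array([1.0, 0.0])   # one-hot "colors" used as node features
blue   = np.array([0.0, 1.0])

# Two different multisets with the same color proportion (1:1) -> mean collides.
s1 = np.stack([yellow, blue])
s2 = np.stack([yellow, yellow, blue, blue])
print(np.allclose(s1.mean(0), s2.mean(0)))   # True: GCN-style mean cannot tell them apart

# Two different multisets with the same set of distinct colors -> max collides.
s3 = np.stack([yellow, blue])
s4 = np.stack([yellow, blue, blue])
print(np.allclose(s3.max(0), s4.max(0)))     # True: GraphSAGE-style max cannot tell them apart

# Sum pooling distinguishes both pairs, because it keeps the multiplicities.
print(np.allclose(s1.sum(0), s2.sum(0)), np.allclose(s3.sum(0), s4.sum(0)))  # False False
```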

2). Neighbor Aggregation of Graph Isomorphism Network (GIN)

  • Goal: design maximally powerful GNNs in the class of message-passing GNNs
  • This can be achieved by designing an injective neighbor aggregation function over multi-sets
  • We design a neural network that can model injective multi-set functions
a). Injective multi-set function

Theorem: any injective multi-set function can be expressed as
$\Phi\left(\sum_{x\in S}f(x)\right)$
where $\Phi$ and $f$ are non-linear functions.
Proof intuition: $f$ produces one-hot encodings of colors, and summing the one-hot encodings retains all the information about the input multi-set.
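A tiny sketch of this proof intuition, assuming colors are represented as indices: summing one-hot encodings produces a color histogram, which carries exactly the same information as the multi-set itself.

```python
from collections import Counter
import numpy as np

def f(color, num_colors=3):
    """One-hot encode a color index (the role of f in the theorem)."""
    v = np.zeros(num_colors)
    v[color] = 1.0
    return v

multiset = [0, 0, 2, 1, 0]                  # a hypothetical multiset of color indices
summed = sum(f(c) for c in multiset)        # sum of one-hots = color histogram
print(summed)                               # [3. 1. 1.]
print(Counter(multiset))                    # same information: color -> multiplicity
```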

b). Universal approximation theorem

A 1-hidden-layer MLP with sufficiently large hidden dimensionality and an appropriate non-linearity $\sigma(\cdot)$ can approximate any continuous function to arbitrary accuracy.
We have thus arrived at a neural network that can model any injective multi-set function:
$\text{MLP}_{\Phi}\left(\sum_{x\in S}\text{MLP}_f(x)\right)$
In practice, an MLP hidden dimensionality of 100 to 500 is sufficient.
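A minimal PyTorch sketch of this construction (the layer sizes and the two-layer MLPs are illustrative choices, not prescribed by the lecture):

```python
import torch
import torch.nn as nn

class InjectiveMultisetAggregator(nn.Module):
    """MLP_Phi( sum_{x in S} MLP_f(x) ): a neural model of an injective multi-set function."""

    def __init__(self, in_dim, hidden_dim=300, out_dim=64):
        super().__init__()
        self.mlp_f = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                   nn.Linear(hidden_dim, hidden_dim))
        self.mlp_phi = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, out_dim))

    def forward(self, x):                 # x: (set_size, in_dim), one row per multiset element
        return self.mlp_phi(self.mlp_f(x).sum(dim=0))

agg = InjectiveMultisetAggregator(in_dim=16)
out = agg(torch.randn(5, 16))             # aggregate a multiset of 5 neighbor embeddings
print(out.shape)                          # torch.Size([64])
```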

c). Graph isomorphism network (GIN)

Apply an MLP to each neighbor embedding, take the element-wise sum, then apply another MLP:
$\text{MLP}_{\Phi}\left(\sum_{x\in S}\text{MLP}_f(x)\right)$
Theorem: GIN’s neighbor aggregation function is injective.
GIN is therefore the most expressive GNN in the class of message-passing GNNs.

3). Full Model of GIN

a). WL graph kernel

Recall the color refinement algorithm in the WL kernel:
Given a graph $G$ with a set of nodes $V$, assign an initial color $c^{0}(v)$ to each node $v$. Then iteratively refine the node colors by
$c^{k+1}(v)=\text{HASH}\left(c^{k}(v),\ \{c^{k}(u)\}_{u\in N(v)}\right)$
where HASH maps different inputs to different colors. After $K$ steps of color refinement, $c^{K}(v)$ summarizes the structure of the $K$-hop neighborhood.
The process continues until a stable coloring is reached.
Two graphs are considered isomorphic by the WL test if they end up with the same set of colors.
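A compact sketch of the color-refinement loop described above, on a hypothetical 5-node graph; relabeling each distinct (own color, sorted neighbor colors) signature with a fresh integer plays the role of the injective HASH:

```python
def wl_colors(adj, num_iters):
    """Run WL color refinement for `num_iters` rounds; returns the final node colors."""
    colors = {v: 0 for v in adj}                     # uniform initial color c^0(v)
    for _ in range(num_iters):
        signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        # Injective "HASH": relabel each distinct signature with a fresh small integer.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors

# Hypothetical 5-node example: a triangle attached to a path.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
print(wl_colors(adj, num_iters=3))                   # final color of every node
```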

b). Complete GIN model

GIN uses a neural network to model the injective HASH function
$c^{k+1}(v)=\text{HASH}\left(c^{k}(v),\ \{c^{k}(u)\}_{u\in N(v)}\right)$
Specifically, we model an injective function over the tuple $\left(c^{k}(v),\ \{c^{k}(u)\}_{u\in N(v)}\right)$, where $c^{k}(v)$ is the root node’s color/features and $\{c^{k}(u)\}_{u\in N(v)}$ are the neighboring node colors.
Theorem: any injective function over the tuple $\left(c^{k}(v),\ \{c^{k}(u)\}_{u\in N(v)}\right)$ can be modeled as
$\text{MLP}_{\Phi}\left((1+\epsilon)\cdot\text{MLP}_f\left(c^{k}(v)\right)+\sum_{u\in N(v)}\text{MLP}_f\left(c^{k}(u)\right)\right)$
where $\epsilon$ is a learnable scalar.
If the input feature $c^{0}(v)$ is represented as a one-hot vector, direct summation is already injective, so we only need $\text{MLP}_{\Phi}$ to ensure injectivity:
$\text{GINConv}\left(c^{k}(v),\ \{c^{k}(u)\}_{u\in N(v)}\right)=\text{MLP}_{\Phi}\left((1+\epsilon)\cdot c^{k}(v)+\sum_{u\in N(v)}c^{k}(u)\right)$
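A minimal dense-adjacency PyTorch sketch of one such GINConv layer, following the formula above (real implementations such as PyTorch Geometric's GINConv use sparse message passing; this is only an illustration with a hypothetical toy graph):

```python
import torch
import torch.nn as nn

class GINConv(nn.Module):
    """One GIN layer: MLP_Phi( (1 + eps) * c(v) + sum_{u in N(v)} c(u) )."""

    def __init__(self, dim, hidden_dim=300):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))              # learnable scalar epsilon
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, dim))

    def forward(self, c, adj):
        # c: (num_nodes, dim) node colors/embeddings; adj: (num_nodes, num_nodes) dense adjacency
        neighbor_sum = adj @ c                               # sum of neighboring embeddings
        return self.mlp((1 + self.eps) * c + neighbor_sum)

# Tiny hypothetical example: a triangle, with one-hot input colors c^0.
adj = torch.tensor([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
c0 = torch.eye(3)
layer = GINConv(dim=3)
print(layer(c0, adj).shape)                                  # torch.Size([3, 3])
```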

c). GIN and WL graph kernel

GIN can be understood as a differentiable, neural version of the WL graph kernel. They have exactly the same expressive power, and both are powerful enough to distinguish most real-world graphs.

|                 | Update target                      | Update function |
|-----------------|------------------------------------|-----------------|
| WL graph kernel | Node colors (one-hot)              | HASH            |
| GIN             | Node embeddings (low-dim vectors)  | GINConv         |

Advantages of GIN over the WL graph kernel are:

  • Node embeddings are low-dimensional; hence, they can capture the fine-grained similarity of different nodes
  • Parameters of the update function can be learned for downstream tasks (see the sketch after this list)
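To illustrate the second point, the sketch below (reusing the hypothetical GINConv layer and the toy triangle graph from the previous example) stacks GIN layers, sum-pools node embeddings into a graph embedding, and backpropagates a downstream classification loss into the update-function parameters, including $\epsilon$:

```python
import torch
import torch.nn as nn

class GIN(nn.Module):
    """Stack of GIN layers followed by a sum readout, trainable end-to-end."""

    def __init__(self, dim, num_layers=3, num_classes=2):
        super().__init__()
        # GINConv is the hypothetical layer from the previous sketch.
        self.layers = nn.ModuleList([GINConv(dim) for _ in range(num_layers)])
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, c, adj):
        for layer in self.layers:
            c = layer(c, adj)
        return self.classifier(c.sum(dim=0))        # sum-pool node embeddings into a graph embedding

model = GIN(dim=3)
logits = model(c0, adj)                             # c0, adj from the previous example
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([1]))
loss.backward()                                     # gradients flow into every layer's MLP and epsilon
```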
