Large-scale:
GPU memory is too small and CPU computation is too slow.
With massive data it is impossible to feed all nodes into the model at once for training, so training must be split into mini-batches.
Each batch is obtained by random sampling, so the sampled nodes are almost always isolated from each other,
while a GNN needs to aggregate neighboring node features.
The approach proposed by GraphSage:
To compute the embedding of a single node, all we need is its $K$-hop neighborhood
(which defines the computation graph)
As shown in the figure below, to get node 0's embedding we only need the nodes inside the red dashed circle; no matter how large the whole graph is, the other nodes are not needed.
Given a set of $M$ different nodes in a mini-batch, we can generate their embeddings using $M$ computational graphs. These can be computed on a GPU!
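A minimal sketch of this idea (the adjacency-dict representation, the `k_hop_neighborhood` helper, and the toy graph are illustrative, not from the lecture):

```python
from collections import deque

def k_hop_neighborhood(adj, target, K):
    """Collect all nodes within K hops of `target` via BFS.

    adj: dict mapping node -> list of neighbor nodes.
    Only these nodes are needed to compute `target`'s embedding with a
    K-layer GNN, regardless of how large the full graph is.
    """
    visited = {target: 0}          # node -> hop distance from target
    queue = deque([target])
    while queue:
        u = queue.popleft()
        if visited[u] == K:        # do not expand beyond K hops
            continue
        for v in adj[u]:
            if v not in visited:
                visited[v] = visited[u] + 1
                queue.append(v)
    return set(visited)

# Example: node 0's 2-hop neighborhood in a tiny graph
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}
print(k_hop_neighborhood(adj, target=0, K=2))  # {0, 1, 2, 3}
```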
Question: how is this implemented? Does every node get its own computational graph? Can this be parallelized?
Question: every node's embedding has to be trained, and every edge depends on its endpoint nodes, so the union of all nodes' K-hop neighborhoods necessarily covers every edge — does this save anything at all?
Key idea: Construct the computational graph by (randomly) sampling at most $H_k$ neighbors at each hop.
Because each hop is capped at $H_k$ neighbors, no layer of the computational graph gets too large, so the network can be made reasonably deep.
A $K$-layer GNN will involve at most $\prod_{k=1}^K H_k$ leaf nodes in its computational graph.
A smaller $H_k$ leads to more efficient neighbor aggregation, but results in more unstable training due to the larger variance in neighbor aggregation.
Even with neighbor sampling, the size of the computational graph is still exponential with respect to the number of GNN layers $K$.
Adding one more GNN layer makes the computation $H$ times more expensive.
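A rough sketch of per-node neighbor sampling under the same toy setup (the `adj` dict, the per-hop budgets `H`, and the function names are assumptions for illustration; this is not GraphSAGE's actual implementation):

```python
import random

def sample_computation_graph(adj, target, H):
    """Build a sampled computational graph for one target node.

    adj: dict node -> list of neighbors.
    H:   list of per-hop sampling budgets, e.g. H = [10, 5] for a 2-layer GNN.
    Returns one node frontier per hop; the number of leaf nodes is at most
    prod(H), independent of the full graph size.
    """
    frontiers = [[target]]
    for h_k in H:
        prev, nxt = frontiers[-1], []
        for u in prev:
            neighbors = adj[u]
            # randomly keep at most h_k neighbors of u at this hop
            nxt.extend(random.sample(neighbors, min(h_k, len(neighbors))))
        frontiers.append(nxt)
    return frontiers

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(sample_computation_graph(adj, target=0, H=[2, 2]))
```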
Random sampling: fast but often not optimal (may sample many "unimportant" nodes).
Natural graphs are "scale-free", so sampling random neighbors picks up many low-degree "leaf" nodes.
Random Walk with Restarts:
Compute the Random Walk with Restarts score $R_i$ starting at the node of interest.
At each level, sample the $H$ neighbors with the highest $R_i$.
This strategy works much better in practice.
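A sketch of this strategy: `rwr_scores` is a simple Monte-Carlo estimate of $R_i$ from the node of interest, and the top-$H$ scoring neighbors are kept instead of uniformly random ones (the estimator, parameters, and names are illustrative assumptions):

```python
import random
from collections import Counter

def rwr_scores(adj, start, alpha=0.15, num_steps=10000):
    """Monte-Carlo estimate of Random-Walk-with-Restarts visit scores R_i."""
    counts = Counter()
    node = start
    for _ in range(num_steps):
        if random.random() < alpha or not adj[node]:
            node = start                      # restart at the node of interest
        else:
            node = random.choice(adj[node])   # take one random step
        counts[node] += 1
    return {v: c / num_steps for v, c in counts.items()}

def top_h_neighbors(adj, node, scores, H):
    """Instead of uniform sampling, keep the H neighbors with highest R_i."""
    return sorted(adj[node], key=lambda v: scores.get(v, 0.0), reverse=True)[:H]

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
R = rwr_scores(adj, start=0)
print(top_h_neighbors(adj, 0, R, H=2))
```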
The size of the computational graph becomes exponentially large w.r.t. the number of GNN layers.
Computation is redundant, especially when nodes in a mini-batch share many neighbors.
As shown in the figure below, when computing A and B, the same message-passing operations on C and D are computed twice.
One solution: HAG (hierarchical aggregation graphs), which finds the repeated computation and computes it only once. The details are not covered in the course.
Another solution is Cluster-GCN. Question: it looks like Cluster-GCN does not actually solve this problem?
In the full-batch, layer-wise update, all node embeddings are updated together using the embeddings of the previous layer.
In each layer, only $2 \times (edge\_num)$ messages need to be computed, because each edge sends a message to each of its two endpoint nodes.
For a $K$-layer GNN, only $K \times 2 \times (edge\_num)$ messages need to be computed.
GNN’s entire computation is only linear in #(edges) and #(GNN layers). Fast !
The layer-wise node embedding update allows the re-use of embeddings from the previous layer.
This significantly reduces the computational redundancy of neighbor sampling.
Of course, the layer-wise update is not feasible for a large graph due to limited GPU memory.
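A minimal sketch of the full-batch, layer-wise update in matrix form, assuming a mean-aggregation GCN-style layer and a sparse adjacency matrix (illustrative code, not any specific library's API); every layer reuses all embeddings from the previous layer, so the total work is proportional to $K \times 2 \times (edge\_num)$:

```python
import numpy as np
import scipy.sparse as sp

def layerwise_gcn(A, X, weights):
    """Full-batch, layer-wise GCN-style forward pass.

    A: (N, N) sparse adjacency; X: (N, F) node features;
    weights: list of K weight matrices.
    Every layer reuses the previous layer's embeddings for ALL nodes,
    so no message is ever recomputed.
    """
    # Row-normalize A so that A_hat @ H averages neighbor embeddings.
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv = sp.diags(1.0 / np.maximum(deg, 1))
    A_hat = D_inv @ A

    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0)  # aggregate, transform, ReLU
    return H

# Tiny example: 4 nodes, 2 layers
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
X = np.random.randn(4, 8)
weights = [np.random.randn(8, 16), np.random.randn(16, 4)]
print(layerwise_gcn(A, X, weights).shape)  # (4, 4)
```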
My understanding: in neighbor sampling, each node's computational graph seems to be built separately. Suppose a node has two neighbors; since each layer randomly samples H nodes, this node is not necessarily included in its neighbors' computational graphs. Moreover, because every computational graph samples H nodes per layer at random, the same node at the same layer may end up with different embeddings in different computational graphs, so they cannot be reused. Looking at the concrete implementation would help with understanding.
Question: I don't follow — can't everyone re-use embeddings from the previous layer? Is there something I'm missing? If neighbor sampling has redundant computation, then this has it too, maybe even more.
Question: does this mean we first compute each node's message and then aggregate according to the computational graph? Computed that way, the redundancy is indeed removed.
Question: is this a computational advantage that comes from the fact that GCN can be written in matrix form? Can other GNNs do the same? One would need to look at how other GNNs are implemented.
Key idea: We can sample a small subgraph of the large graph and then perform the efficient layer-wise node embeddings update over the subgraph.
Subgraphs should retain edge connectivity structure of the original graph as much as possible.
This way, the GNN over the subgraph generates embeddings closer to the GNN over the original graph.
Real-world graphs exhibit community structure.
A large graph can be decomposed into many small communities.
Key insight [Chiang et al. KDD 2019]: Sample a community as a subgraph. Each subgraph retains essential local connectivity pattern of the original graph.
Two steps: pre-processing and mini-batch training.
Given a large graph, partition it into groups of nodes (i.e., subgraphs).
We can use any scalable community detection methods, e.g., Louvain, METIS
Note: between-group edges are dropped; groups do not share any nodes; every node belongs to exactly one group.
Sample one node group at a time and apply the GNN's message passing over the induced subgraph; only one group of nodes is computed at a time.
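A sketch of the vanilla Cluster-GCN training loop, assuming the node groups come from a community-detection pre-processing step (METIS/Louvain) and reusing the `layerwise_gcn` helper from the earlier sketch (illustrative only, with the loss/optimizer step elided):

```python
import numpy as np
import random

def induced_subgraph(A, nodes):
    """Restrict sparse adjacency A to `nodes`; edges leaving the group are dropped."""
    idx = np.array(sorted(nodes))
    return A[idx][:, idx], idx

def vanilla_cluster_gcn_epoch(A, X, weights, groups):
    """One epoch of vanilla Cluster-GCN: one node group per mini-batch."""
    for group in random.sample(groups, len(groups)):     # visit groups in random order
        A_sub, idx = induced_subgraph(A, group)
        H_sub = layerwise_gcn(A_sub, X[idx], weights)     # layer-wise update on the subgraph only
        # ...compute the loss on labeled nodes in `idx` from H_sub and update `weights`...

# `groups` would come from the pre-processing partition, e.g.
# groups = [[0, 1, 2], [3, 4, 5], ...]
```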
Solution: Aggregate multiple node groups per mini-batch.
Two steps:
Partition the graph into relatively small groups of nodes (smaller subgraphs than before).
For each mini-batch: sample multiple node groups and construct the subgraph induced by their union (see the sketch after this list).
Question: are the edges between these small groups the original edges or added afterwards? It looks like the original edges are kept; adding extra links would change the graph, which shouldn't be allowed.
The rest is the same as vanilla Cluster-GCN
The sampled nodes are more representative of the entire node set, which leads to less variance in gradient estimation.
Between-group edges are included, so messages can flow across groups.
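A sketch of the mini-batch construction for this variant, under the same assumptions as before: several groups are sampled and the subgraph induced on their union is taken from the original adjacency, which automatically brings the between-group edges among the sampled groups back in:

```python
import numpy as np
import random

def multi_group_minibatch(A, X, groups, num_groups_per_batch):
    """Sample several node groups and build one induced subgraph over their union.

    Because the subgraph is induced from the ORIGINAL adjacency A, the
    between-group edges among the sampled groups are kept, so messages
    can flow across groups.
    """
    sampled = random.sample(groups, num_groups_per_batch)
    nodes = sorted(set().union(*sampled))
    idx = np.array(nodes)
    A_sub = A[idx][:, idx]   # within-group AND between-group edges
    return A_sub, X[idx], idx
```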
Neighbor sampling: with $M$ nodes in a mini-batch, at most $H$ neighbors sampled per layer, and a $K$-layer GNN,
the cost is $M \times H^K$.
Cluster-GCN: with $M$ nodes in a subgraph, an average node degree of $D_{avg}$, and a $K$-layer GNN,
the cost is $K \times M \times D_{avg}$.
Assume $H = \frac{D_{avg}}{2}$, i.e., 50% of neighbors are sampled.
Then Cluster-GCN is much more efficient than neighbor sampling: linear in $K$ rather than exponential in $K$.
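A quick worked example with illustrative numbers (not from the lecture), under the 50% sampling assumption above: take $M = 1000$, $D_{avg} = 10$, $K = 3$, and $H = D_{avg}/2 = 5$. Then

$$M \times H^K = 1000 \times 5^3 = 125{,}000 \quad \text{vs.} \quad K \times M \times D_{avg} = 3 \times 1000 \times 10 = 30{,}000$$

messages, so Cluster-GCN does far less work at this depth.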
However, in neighbor sampling $H$ is usually set to 2 to 3 times $D_{avg}$; since $K$ does not need to be large, the complexity stays manageable, so neighbor sampling is more commonly used in practice. Which method to use depends on the dataset.
(can be efficiently performed even on a CPU)
Wu et al. demonstrated that performance on benchmarks does not drop much when the non-linear activation is removed from the GCN (the paper should have the details).
Because Simplified GCN is simpler, it can be applied to larger datasets.
Starting from the GCN layer update $H^{(k+1)} = \mathrm{ReLU}(\tilde{A} H^{(k)} W_k^T)$, removing the ReLU (and collapsing the per-layer weight matrices into a single $W$) gives
$H^{(K)}=\tilde{A}^K X W^T$
$\tilde{A}^K$ acts like a reachability matrix: it encodes which nodes can be reached from a given node within $K$ steps.
$\tilde{A}^K$ also contains no parameters, so it can be pre-computed. Doing $X \leftarrow \tilde{A}X$ for $K$ times gives
$H^{(K)}=\tilde{X} W^T$
The task now reduces to a linear transformation of the pre-computed matrix $\tilde{X}$.
Back to the node embedding form:
$h^{(K)}_v=W \tilde{X}_v$
From the equation above, the embedding of node $v$ only depends on its own (pre-processed) feature $\tilde{X}_v$.
Simplified GCN thus consists of two steps: (1) pre-compute $\tilde{X}=\tilde{A}^K X$ (this step has no parameters and can be done on a CPU), and (2) train the linear model $h^{(K)}_v = W\tilde{X}_v$ on the pre-computed features.
Computation is very fast and the number of parameters is very small.
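A minimal sketch of this two-step pipeline (the symmetric normalization with self-loops, the logistic-regression head, and all hyper-parameters are illustrative assumptions; see the SGC paper for the exact formulation):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

def simplified_gcn_features(A, X, K):
    """Step 1 (pre-processing, no parameters): X_tilde = A_tilde^K X.

    A_tilde is the symmetrically normalized adjacency with self-loops,
    D^{-1/2} (A + I) D^{-1/2}; we simply repeat X <- A_tilde X, K times.
    """
    N = A.shape[0]
    A_hat = A + sp.eye(N)
    deg = np.asarray(A_hat.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    A_tilde = D_inv_sqrt @ A_hat @ D_inv_sqrt
    X_tilde = X
    for _ in range(K):
        X_tilde = A_tilde @ X_tilde
    return X_tilde

# Step 2: a plain linear model on the pre-computed features.
A = sp.csr_matrix(np.array([[0, 1, 1, 0],
                            [1, 0, 1, 0],
                            [1, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
X = np.random.randn(4, 8)
y = np.array([0, 0, 1, 1])
X_tilde = simplified_gcn_features(A, X, K=2)
clf = LogisticRegression().fit(X_tilde, y)   # h_v = W * X_tilde_v
print(clf.predict(X_tilde))
```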
Surprisingly, on semi-supervised node classification benchmarks, Simplified GCN works comparably to the original GNNs despite being less expressive.
Reason: graph homophily.
Many node classification tasks exhibit a homophily structure, i.e., nodes connected by edges tend to share the same target labels.
For example, nodes that interact frequently in a social network are more likely to share certain labels; two papers tend to share the same category if one cites the other; two users tend to like the same movie if they are friends in a social network.
Question: what exactly is graph homophily? It should mean that nodes inside the same cluster are similar to each other.
What determines a node's prediction is its own features multiplied by the adjacency matrix $K$ times, i.e., a fusion of the features of the nodes it is connected to.
So when the graph exhibits local homophily, Simplified GCN can perform very well.