Graph Attention Networks [ICLR 2018]
- paper: Graph Attention Networks
- github: https://github.com/PetarV-/GAT (tensorflow)
- code: https://github.com/Diego999/pyGAT
This paper proposes GAT, which performs graph convolution with masked self-attention layers. The method assigns different weights to different neighbor nodes and can handle both transductive and inductive problems.
1. Introduction
- GCN [Kipf et al., ICLR 2017]
- Attention mechanism: can handle inputs of varying size; self-attention was introduced in "Attention Is All You Need".
2. Architecture
GAT combines the attention mechanism with graph convolutional networks: when aggregating node information, each neighbor node is assigned a different weight (also called an attention score). Like the self-attention in the Transformer, GAT also supports multi-head attention, where each head maintains its own parameters; the heads' results are finally concatenated or averaged to obtain the node representation.
(Figure from the paper: left, the process of computing a neighbor's attention score; right, the multi-head attention update.)
The steps are as follows:
- Step 1: compute the unnormalized attention score $e_{ij}$. Along each edge, the linear transformations of the two endpoint node representations are concatenated and passed through a single-layer MLP (a LeakyReLU-activated linear layer);
- Step 2: obtain the normalized attention score $\alpha_{ij}$ by normalizing $e_{ij}$ row-wise (over each node's neighborhood) with a softmax;
- Step 3: aggregate node information along the edges. This comes in single-head and multi-head variants; the multi-head version offers two merge strategies: (1) multiply each head's hidden vectors by its attention scores and concatenate the heads' outputs, or (2) average the heads' outputs and apply a nonlinearity (used at the output layer). The exact equations are given below.
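For reference, the corresponding equations from the paper, where $\mathcal{N}_i$ is the neighborhood of node $i$ and $K$ is the number of attention heads:

$$e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^{\top}[\mathbf{W}h_i \,\Vert\, \mathbf{W}h_j]\big)$$
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i}\exp(e_{ik})}$$
$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i}\alpha_{ij}\mathbf{W}h_j\Big)$$
$$h_i' = \Big\Vert_{k=1}^{K}\sigma\Big(\sum_{j \in \mathcal{N}_i}\alpha_{ij}^{k}\mathbf{W}^{k}h_j\Big) \quad\text{(multi-head, concatenation)}$$
$$h_i' = \sigma\Big(\tfrac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i}\alpha_{ij}^{k}\mathbf{W}^{k}h_j\Big) \quad\text{(multi-head, averaging, used at the output layer)}$$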
3. Contributions
- Computationally efficient: the attention computation can be parallelized across edges and the aggregation across nodes;
- Assigning different importances to different neighbors makes the model more interpretable;
- Applicable to directed graphs and to inductive learning settings;
- GraphSAGE samples a fixed-size neighborhood for each node, and its LSTM aggregator requires a random ordering of the neighbors; GAT instead attends over all neighbors and needs no ordering.
4. Experiment
4.1 Transductive learning
Node classification datasets:
- citation graphs: Cora, Citeseer, Pubmed
Setup details:
- 2-layer GAT
- First layer: K = 8 attention heads, F' = 8 features per head (64 features in total), ELU activation
- Second layer: a. Cora, Citeseer: a single attention head computing C class scores, followed by softmax; b. Pubmed: K = 8 output heads, averaged
- Regularization (L2 weight decay): a. Cora, Citeseer: λ = 0.0005; b. Pubmed: λ = 0.001
- Dropout with p = 0.6, applied to both layers' inputs and to the normalized attention coefficients
4.2 Inductive learning
The protein-protein interaction (PPI) dataset contains 24 graphs. The layer parameters are learned on the training graphs and then used to compute node representations on the validation/test graphs for a multi-label node classification task.
Setup details:
- 3-layer GAT
- Layers 1 and 2: K = 4 attention heads, F' = 256 features per head, ELU activation
- Layer 3 (output): K = 6 attention heads, 121 features per head (one per label), averaged and followed by a logistic sigmoid
- Skip connections across the intermediate attentional layer
- batch size = 2 (graphs)
Note: the ELU activation is $\mathrm{ELU}(x) = x$ for $x > 0$ and $\alpha(e^{x} - 1)$ for $x \le 0$ (with $\alpha = 1$ by default).
5. Code
This section walks through an implementation of GAT. The first part shows how a GATLayer is built, using the DGL framework to sketch the overall structure of the code (see the referenced sources for the complete implementation); the second part shows how to assemble a GAT model from GATLayer.
See the detailed DGL tutorial on GAT and the GAT example code in DGL.
5.1 GATLayer
==Steps==:
a. Fully connected layer: $z_i = \mathbf{W}h_i$, projecting the high-dimensional input features to a lower dimension;
b. Message: compute the un-normalized attention score $e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^{\top}[z_i \,\Vert\, z_j]\big)$; this score can be viewed as a feature of the edge;
c. Reduce:
- normalize: compute the attention score $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$;
- aggregate: $h_i' = \sum_{j \in \mathcal{N}_i}\alpha_{ij} z_j$.
from dgl.nn.pytorch import GATConv
Reference: the GATConv source code.
# Key parts of the GATConv layer source (points to note):
# mainly the constructor arguments and the residual-connection branch
def __init__(self,
             in_feats,
             out_feats,
             num_heads,
             feat_drop=0.,               # dropout on input features
             attn_drop=0.,               # dropout on attention coefficients
             negative_slope=0.2,         # negative slope of the LeakyReLU
             residual=False,             # whether to add a residual connection
             activation=None,
             allow_zero_in_degree=False):
    # ...
    if residual:
        if self._in_dst_feats != out_feats:
            self.res_fc = nn.Linear(
                self._in_dst_feats, num_heads * out_feats, bias=False)
        else:
            self.res_fc = Identity()

def forward(self, graph, ...):
    # ...
    # residual: h_(l+1)' = h_(l+1) + W h_(l)
    if self.res_fc is not None:
        resval = self.res_fc(h_dst).view(h_dst.shape[0], -1, self._out_feats)
        rst = rst + resval
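As a quick sanity check, here is a minimal usage sketch of GATConv. The toy graph, feature dimensions, and head count below are made up for illustration only:

import dgl
import torch
from dgl.nn.pytorch import GATConv

# toy directed graph with 4 nodes; add self-loops so no node has zero in-degree
g = dgl.graph(([0, 1, 2, 3], [1, 2, 3, 0]))
g = dgl.add_self_loop(g)

feat = torch.randn(4, 10)                      # 4 nodes, 10-dim input features
conv = GATConv(in_feats=10, out_feats=8, num_heads=3)
out = conv(g, feat)
print(out.shape)                               # (4, 3, 8): (nodes, heads, out_feats)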
The DGL tutorial on GAT also provides a simplified implementation of GATLayer:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim):
        super(GATLayer, self).__init__()
        self.g = g
        # equation (1): linear projection z = Wh
        self.fc = nn.Linear(in_dim, out_dim, bias=False)
        # equation (2): attention MLP over concatenated endpoint features
        self.attn_fc = nn.Linear(2 * out_dim, 1, bias=False)
        self.reset_parameters()

    def reset_parameters(self):
        """Reinitialize learnable parameters."""
        gain = nn.init.calculate_gain('relu')
        nn.init.xavier_normal_(self.fc.weight, gain=gain)
        nn.init.xavier_normal_(self.attn_fc.weight, gain=gain)

    def edge_attention(self, edges):
        # edge UDF for equation (2)
        z2 = torch.cat([edges.src['z'], edges.dst['z']], dim=1)
        a = self.attn_fc(z2)
        return {'e': F.leaky_relu(a)}

    def message_func(self, edges):
        # message UDF for equations (3) & (4)
        return {'z': edges.src['z'], 'e': edges.data['e']}

    def reduce_func(self, nodes):
        # reduce UDF for equations (3) & (4)
        # equation (3): softmax over incoming edges
        alpha = F.softmax(nodes.mailbox['e'], dim=1)
        # equation (4): weighted sum of neighbor features
        h = torch.sum(alpha * nodes.mailbox['z'], dim=1)
        return {'h': h}

    def forward(self, h):
        # equation (1)
        z = self.fc(h)
        self.g.ndata['z'] = z
        # equation (2)
        self.g.apply_edges(self.edge_attention)
        # equations (3) & (4)
        self.g.update_all(self.message_func, self.reduce_func)
        return self.g.ndata.pop('h')
# Multi-head attention is implemented by stacking several GATLayer modules
class MultiHeadGATLayer(nn.Module):
    def __init__(self, g, in_dim, out_dim, num_heads, merge='cat'):
        super(MultiHeadGATLayer, self).__init__()
        self.heads = nn.ModuleList()
        for i in range(num_heads):
            self.heads.append(GATLayer(g, in_dim, out_dim))
        self.merge = merge

    def forward(self, h):
        head_outs = [attn_head(h) for attn_head in self.heads]
        if self.merge == 'cat':
            # concat on the output feature dimension (dim=1)
            return torch.cat(head_outs, dim=1)
        else:
            # merge by averaging over the heads (dim=0 is the head dimension)
            return torch.mean(torch.stack(head_outs), dim=0)
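A minimal smoke test of the layers above; the toy graph, feature size, and head count are made up for illustration:

import dgl
import torch

# toy graph with self-loops so every node receives at least one message
g = dgl.graph(([0, 1, 2], [1, 2, 0]))
g = dgl.add_self_loop(g)
feat = torch.randn(3, 5)                       # 3 nodes, 5-dim input features

layer = MultiHeadGATLayer(g, in_dim=5, out_dim=4, num_heads=2, merge='cat')
out = layer(feat)
print(out.shape)                               # (3, 8): two heads of 4 features, concatenated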
5.2 GAT model
Tips:
- the first layer (input projection) has no residual connection;
- the output layer has no activation, and its multi-head outputs are merged by averaging.
import torch.nn as nn
from dgl.nn.pytorch import GATConv

class GAT(nn.Module):
    def __init__(self,
                 g,
                 num_layers,
                 in_dim,
                 num_hidden,
                 num_classes,
                 heads,
                 activation,
                 feat_drop,
                 attn_drop,
                 negative_slope,
                 residual):
        super(GAT, self).__init__()
        self.g = g
        self.num_layers = num_layers
        self.gat_layers = nn.ModuleList()
        self.activation = activation
        # input projection (no residual)
        self.gat_layers.append(GATConv(
            in_dim, num_hidden, heads[0],
            feat_drop, attn_drop, negative_slope, False, self.activation))
        # hidden layers
        for l in range(1, num_layers):
            # due to multi-head, the input dim is num_hidden * num_heads
            self.gat_layers.append(GATConv(
                num_hidden * heads[l - 1], num_hidden, heads[l],
                feat_drop, attn_drop, negative_slope, residual, self.activation))
        # output projection (no activation)
        self.gat_layers.append(GATConv(
            num_hidden * heads[-2], num_classes, heads[-1],
            feat_drop, attn_drop, negative_slope, residual, None))

    def forward(self, inputs):
        h = inputs
        for l in range(self.num_layers):
            # flatten(1) concatenates the head outputs: (N, H, D) -> (N, H * D)
            h = self.gat_layers[l](self.g, h).flatten(1)
        # output projection: the output heads are averaged
        logits = self.gat_layers[-1](self.g, h).mean(1)
        return logits
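For illustration, a hypothetical instantiation roughly matching the paper's transductive setup on Cora (the head counts and dropout follow Section 4.1; the dataset loading and variable names are my own sketch, not part of the original code):

import dgl
import torch.nn.functional as F

dataset = dgl.data.CoraGraphDataset()
g = dgl.add_self_loop(dataset[0])

model = GAT(g,
            num_layers=1,                       # input layer + output layer = 2 GAT layers
            in_dim=g.ndata['feat'].shape[1],
            num_hidden=8,
            num_classes=dataset.num_classes,
            heads=[8, 1],                       # 8 heads in the first layer, 1 at the output
            activation=F.elu,
            feat_drop=0.6,
            attn_drop=0.6,
            negative_slope=0.2,
            residual=False)

logits = model(g.ndata['feat'])                 # shape: (num_nodes, num_classes)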
--end--
If anything is unclear, questions and suggestions are welcome!