Original paper: [1710.10903] Graph Attention Networks (arxiv.org)
论文代码:https://github.com/PetarV-/GAT
The English here is typed entirely by hand, as a summary and paraphrase of the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post reads more like personal notes, so take it with a grain of salt.
Table of Contents
1. TL;DR
1.1. Takeaways
1.2. Paper framework diagram
2. Section-by-section close reading
2.1. Abstract
2.2. Introduction
2.3. GAT architecture
2.3.1. Graph attention layer
2.4. Evaluation
2.4.1. Datasets
2.4.2. State-of-the-art method
2.4.3. Experimental setup
2.4.4. Results
2.5. Conclusions
3. Background knowledge
3.1. Spectral and non-spectral approaches for GNN
3.2. Spectral domain and frequency domain
3.3. t-SNE
4. Reference List
(1)So the Introduction already seems to cover the related work?
(2)Huge praise for the Datasets table; I barely need to summarize it myself
①They propose graph attention networks (GATs), which are suitable for both inductive and transductive problems
②There is no need for special and costly matrix operations, nor for knowing the graph structure upfront
③They test their model on the Cora, Citeseer, and Pubmed citation network datasets and on a protein-protein interaction dataset
upfront adj. paid in advance; frank; honest; straightforward adv. in advance, paid beforehand
①CNNs have been widely used in translation, image classification, and semantic segmentation. However, they cannot be applied to non-grid, i.e. irregularly structured, data such as social/telecommunication/biological networks, 3D meshes, and brain connectomes. Graph structures describe such data more accurately
②Early works adopted recursive neural networks to process directed acyclic graphs
③They review spectral and non-spectral approaches to graph processing
④Since it allows inputs of different sizes, the attention mechanism has been successfully used in NLP
⑤The attention mechanism can be parallelized across node-neighbor pairs, assigns different weights to different neighbors, and can be used in inductive learning
acyclic adj. containing no cycles; non-cyclic; non-periodic; not ring-shaped
reminiscent adj. nostalgic; reminding one of (a person or thing); recalling the past n. a person who reminisces
①Input node features: $\mathbf{h} = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$ with $\vec{h}_i \in \mathbb{R}^F$,
where $N$ denotes the number of nodes and $F$ denotes the number of features per node
②Then the node features are transformed to a higher level with a shared weight matrix $\mathbf{W} \in \mathbb{R}^{F' \times F}$ and attention coefficients are computed: $e_{ij} = a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j)$,
where $a$ is an attention mechanism;
$j$ is a neighbor node in the neighborhood $\mathcal{N}_i$ of node $i$;
also, $e_{ij}$ indicates the importance of node $j$'s features to node $i$.
③Normalize over the neighbors with a softmax: $\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$,
where $\mathcal{N}_i$ denotes the neighborhood of node $i$ and its order is set to 1, i.e. first-order neighbors (including $i$ itself).
④Further expanding the attention mechanism $a$,
which is a single-layer feedforward neural network: $\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\top}\left[\mathbf{W}\vec{h}_i \,\|\, \mathbf{W}\vec{h}_k\right]\right)\right)}$,
where $\vec{a} \in \mathbb{R}^{2F'}$ denotes a weight vector;
the LeakyReLU negative slope is $0.2$;
$\|$ denotes concatenation.
⑤Applying a nonlinearity $\sigma$ to get the final output of the layer: $\vec{h}'_i = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W}\vec{h}_j\right)$
⑥They further introduce multi-head attention with concatenation: $\vec{h}'_i = \big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j\right)$,
where $\alpha^k_{ij}$ denotes the normalized attention coefficients calculated by the $k$-th attention mechanism and $\mathbf{W}^k$ is the corresponding weight matrix
⑦In the prediction (final) layer, averaging the heads is more sensible than concatenation: $\vec{h}'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} \mathbf{W}^k \vec{h}_j\right)$
⑧The figure of this model:
where the left panel shows the attention mechanism $a(\mathbf{W}\vec{h}_i, \mathbf{W}\vec{h}_j)$ and the right panel shows the multi-head attention mechanism with $K = 3$ heads on the neighborhood of node 1 (a code sketch of the full layer follows below)
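To make steps ①-⑦ concrete, here is a minimal NumPy sketch of one multi-head graph attention layer. This is not the authors' TensorFlow implementation; the function names, the dense adjacency-matrix formulation, and the tiny random graph in the usage example are my own illustrative assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    # LeakyReLU with the paper's negative slope of 0.2
    return np.where(x > 0.0, x, slope * x)

def elu(x):
    # ELU nonlinearity, used after the hidden (concatenating) layers
    return np.where(x > 0.0, x, np.expm1(np.minimum(x, 0.0)))

def masked_softmax(e, adj):
    # Softmax over each row, restricted to first-order neighbours (adj > 0)
    e = np.where(adj > 0, e, -np.inf)
    e = np.exp(e - e.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gat_head(h, adj, W, a):
    """One attention head.
    h: (N, F) node features; adj: (N, N) adjacency with self-loops;
    W: (F, Fp) shared weight matrix; a: (2*Fp,) attention vector."""
    Wh = h @ W                                   # (N, Fp)
    Fp = W.shape[1]
    # e_ij = LeakyReLU(a^T [W h_i || W h_j]), split into two dot products
    src = Wh @ a[:Fp]                            # (N,) term depending on node i
    dst = Wh @ a[Fp:]                            # (N,) term depending on node j
    e = leaky_relu(src[:, None] + dst[None, :])  # (N, N) raw scores e_ij
    alpha = masked_softmax(e, adj)               # (N, N) attention coefficients
    return alpha @ Wh                            # (N, Fp) aggregated features

def gat_layer(h, adj, heads, concat=True):
    """Multi-head layer; `heads` is a list of (W, a) parameter pairs."""
    outs = [gat_head(h, adj, W, a) for W, a in heads]
    if concat:                     # hidden layers: concatenate the K heads, then ELU
        return elu(np.concatenate(outs, axis=1))
    return np.mean(outs, axis=0)   # final layer: average heads (softmax/sigmoid applied afterwards)

# Tiny usage example on a random 5-node graph with F = 4 input features
rng = np.random.default_rng(0)
N, F, Fp, K = 5, 4, 8, 3
adj = (rng.random((N, N)) > 0.5).astype(float)
np.fill_diagonal(adj, 1.0)         # add self-loops, as in the paper
h = rng.normal(size=(N, F))
heads = [(rng.normal(size=(F, Fp)), rng.normal(size=(2 * Fp,))) for _ in range(K)]
print(gat_layer(h, adj, heads).shape)   # (5, 24) = (N, K * Fp)
```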
(1)Their improvements:
①There is no need for eigendecomposition or other time-consuming matrix calculations. Furthermore, the attention heads can be computed in parallel
②GAT allows assigning different weights (importances) to different neighbors
③It adapts to directed graphs by simply omitting $\alpha_{ij}$ when there is no edge $j \to i$ in the graph (see the masking sketch right after this list)
④It is applicable to inductive learning
⑤GraphSAGE samples a fixed-size neighborhood and thus cannot process the whole neighborhood, whereas GAT can
⑥Compared with MoNet, which relies on the nodes' structural properties, GAT uses node features for the similarity computations
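As a small illustration of point ③, the sketch below (using a hypothetical 3-node directed graph of my own) shows how masking the raw scores $e_{ij}$ restricts each node's attention to its existing in-edges, so each row of the coefficient matrix sums to 1 over the in-neighborhood only.

```python
import numpy as np

# Hypothetical directed 3-node graph: adj[i, j] = 1 means there is an edge
# j -> i, so node i is allowed to attend over node j (self-loops included).
adj = np.array([[1., 1., 0.],     # node 0 attends to itself and node 1
                [0., 1., 0.],     # node 1 attends only to itself
                [1., 0., 1.]])    # node 2 attends to itself and node 0
e = np.random.default_rng(0).normal(size=(3, 3))   # raw attention scores e_ij

e = np.where(adj > 0, e, -np.inf)                  # omit e_ij where edge j -> i is absent
alpha = np.exp(e - e.max(axis=1, keepdims=True))   # masked softmax, row by row
alpha /= alpha.sum(axis=1, keepdims=True)
print(np.round(alpha, 2))   # zeros where there is no edge; each row sums to 1
```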
on par with: comparable to, on the same level as
par n. face value (of a stock); (golf) the standard number of strokes; an average amount, the normal level or standard; the usual standard (esp. of someone's work or health) adj. at par, equal to face value; average, normal v. (golf) to score par
Datasets information:
(1)Transductive learning
①In the three citation datasets, nodes represent documents, undirected edges represent citations, and node features are the elements of a bag-of-words representation of the document
②Training set: 20 nodes per class are used for training
(2)Inductive learning
①Pre-processing: provided by Hamilton et al. (GraphSAGE)
(1)Transductive learning
Comparison table:
(2)Inductive learning
Comparison table:
where Const-GAT adopts a constant attention mechanism, i.e. it assigns the same weight to every neighbor
(3)Summary
They additionally provide a per-node MLP baseline that does not use the graph structure at all
(1)Transductive learning
①They adopted a two-layer model. The first layer uses $K = 8$ attention heads computing $F' = 8$ features each, followed by an ELU. The second layer uses a single attention head computing $C$ features (the number of classes), followed by a softmax.
②Moreover, L2 regularization is applied with $\lambda = 0.0005$
③Dropout rate: 0.6
(2)Inductive learning
①They chose a three-layer model. The first two layers each use $K = 4$ attention heads computing $F' = 256$ features, followed by an ELU. The third layer uses $K = 6$ attention heads computing 121 features each, which are averaged and followed by a logistic sigmoid.
②The training set is large enough that L2 regularization and dropout can be ignored
③Skip connections are adopted across the intermediate attention layer
(3)Summary
①Initialization: Glorot
②Optimizer: Adam SGD
③Learning rate: 0.01 for Pubmed, and 0.005 for others
④Early stopping strategy: a patience of 100 epochs (the full setup is summarized in the sketch below)
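For quick reference, the hyperparameters listed in this section can be collected into plain Python dictionaries. This is only a restatement of the setup above (plus the Pubmed-specific values from the paper); the dictionary structure and key names are my own, not taken from the authors' code.

```python
# Plain-Python restatement of the setup above; key names are my own.
transductive_setup = {            # Cora / Citeseer (Pubmed: 8 output heads, L2 = 0.001)
    "layers": 2,
    "layer_1": {"heads": 8, "features_per_head": 8, "activation": "ELU"},
    "layer_2": {"heads": 1, "features_per_head": "C (number of classes)",
                "activation": "softmax"},
    "l2_regularization": 0.0005,
    "dropout": 0.6,
}

inductive_setup = {               # PPI dataset
    "layers": 3,
    "layers_1_2": {"heads": 4, "features_per_head": 256, "activation": "ELU"},
    "layer_3": {"heads": 6, "features_per_head": 121,
                "head_aggregation": "average", "activation": "logistic sigmoid"},
    "l2_regularization": None,
    "dropout": None,
    "skip_connections": "across the intermediate attention layer",
}

shared_setup = {
    "initialization": "Glorot",
    "optimizer": "Adam SGD",
    "learning_rate": {"Pubmed": 0.01, "other datasets": 0.005},
    "early_stopping_patience": 100,   # epochs
}
```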
①They tune and adjust the other models to be comparable to GAT for a fair comparison
②A t-SNE visualization of the transformed feature representations with 7 class labels (see the sketch below for how such a plot can be produced):
where this figure shows the output of the first layer of GAT on the Cora dataset
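As a minimal sketch of how such a plot could be produced with scikit-learn's t-SNE (see also Section 3.3 of these notes), the code below uses random placeholders (`hidden_feats`, `labels`) standing in for the first-layer GAT activations on Cora and its 7 class labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder stand-ins for the first GAT layer's output on Cora
# (2708 nodes x 64 features = 8 heads x 8 features, 7 classes);
# replace these with the real activations and labels.
rng = np.random.default_rng(0)
hidden_feats = rng.normal(size=(2708, 64))
labels = rng.integers(0, 7, size=2708)

# Project the learned representations to 2D with t-SNE
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(hidden_feats)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of first-layer GAT features (7 classes)")
plt.show()
```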
①They reiterate the model's low computational cost: the operations are parallelizable and no expensive matrix operations are needed.
②"A particularly interesting research direction would be using the attention mechanism to perform a thorough analysis of the model's interpretability"? Huh??? People have been saying this from 2018 all the way to 2023, and interpretability still has no real results
③Taking edge features into account is a feasible extension
(1)Spectral domain: mainly used in GNNs; the Fourier transform is taken over the graph (spatial) domain, via the eigendecomposition of the graph Laplacian (see the sketch after this list)
(2)Frequency domain: mainly used in signal and image processing; the Fourier transform is taken over the temporal dimension
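To make the "Fourier transform on the graph (spatial) domain" concrete, here is a minimal sketch on a hypothetical 4-node path graph of my own: the graph Fourier basis is the eigenvector matrix of the graph Laplacian, and this eigendecomposition is exactly the costly step that spectral GNNs need and GAT avoids.

```python
import numpy as np

# Hypothetical 4-node undirected path graph 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                              # (unnormalized) graph Laplacian

# Graph Fourier basis = eigenvectors of the Laplacian
eigvals, U = np.linalg.eigh(L)

x = np.array([1.0, 2.0, 3.0, 4.0])     # a signal defined on the nodes
x_hat = U.T @ x                        # graph Fourier transform
x_back = U @ x_hat                     # inverse transform recovers the signal
print(np.allclose(x, x_back))          # True
```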
Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903