[Paper Close Reading] Inductive Representation Learning on Large Graphs

Original paper: Inductive representation learning on large graphs | Proceedings of the 31st International Conference on Neural Information Processing Systems (acm.org)

The English here was typed entirely by hand! It is my summarizing and paraphrasing of the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post reads more like notes, so take it with a grain of salt!

Table of Contents

1. TL;DR

1.1. Thoughts

1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

2.2. Introduction

2.3. Related work

2.4. Proposed method: GraphSAGE

2.4.1. Embedding generation (i.e., forward propagation) algorithm

2.4.2. Learning the parameters of GraphSAGE

2.4.3. Aggregator Architectures

2.5. Experiments

2.5.1. Inductive learning on evolving graphs: Citation and Reddit data

2.5.2. Generalizing across graphs: Protein-protein interactions

2.5.3. Runtime and parameter sensitivity

2.5.4. Summary comparison between the different aggregator architectures

2.6. Theoretical analysis

2.7. Conclusion

3. Supplementary knowledge

3.1. Hyperparameter hacking

4. Reference List


1. TL;DR

1.1. Thoughts

(1) Help, why do I have no takeaways? I've already ascended!

1.2. Paper framework diagram

[Figure: paper framework diagram]

2. Section-by-section close reading

2.1. Abstract

        ①Previous works require all nodes to be present (or labeled) during training and cannot produce embeddings for unseen nodes

        ②They propose an inductive framework that generates embeddings for unseen nodes by sampling and aggregating features from a node's local neighborhood

2.2. Introduction

        ①Node embedding methods aim to distill the high-dimensional information about a node's neighborhood into a dense embedding vector.

        ②Making predictions on new, previously unseen nodes requires an inductive capability.

        ③They generalize the traditional GCN approach to trainable aggregation functions and to the inductive, unsupervised setting.

        ④Rich node features, such as degree, text attributes, and node profile information, allow the learned model to generalize to unseen nodes.

        ⑤Each node aggregates information from neighbors at increasing hops, or "search depths", away from it.

        ⑥The authors evaluate their model on three node-classification tasks, showing that it generalizes to unseen nodes and achieves higher accuracy than the baselines.

        ⑦Their main idea:

[Figure: visual illustration of the GraphSAGE sample-and-aggregate approach]

2.3. Related work

(1)Factorization-based embedding approaches

        ①Node embedding methods include random walk statistics, matrix factorization, etc.

        ②However, because their objectives are invariant to orthogonal transformations of the embeddings, the learned embedding space is tied to a fixed graph: it does not naturally generalize to other graphs and may drift when the model is retrained

(2)Supervised learning over graphs

        Previous supervised learning over graphs has mainly focused on classifying entire graphs rather than individual nodes

(3)Graph convolutional networks

        ①Previous GCNs were only applied in the transductive setting with a fixed graph and do not readily generalize to unseen nodes or scale to large graphs

        ②The original GCN is semi-supervised and transductive, and requires the full graph Laplacian to be known during training

2.4. Proposed method: GraphSAGE

2.4.1. Embedding generation (i.e., forward propagation) algorithm

        ①Assuming there are K aggregator functions

        ②Each node aggregates information from its neighbors using the aggregator functions:

\mathrm{AGGREGATE}_{k},\forall k\in\{1,...,K\}

        ③Information is propagated between layers, or "search depths", using the weight matrices:

\mathbf{W}^{k},\forall k\in\{1,...,K\}

        ④Forward propagating pseudocode:

[Figure: Algorithm 1 — GraphSAGE embedding generation (forward propagation) algorithm]

where k denotes the search depth (step) and \mathbf{h}_{v}^{k} is the representation of node v at step k. Line 4 aggregates the representations of v's sampled neighbors into a single neighborhood vector \mathbf{h}_{\mathcal{N}(v)}^{k}. Line 5 then concatenates the node's current representation with this neighborhood vector and passes the result through a fully connected layer with nonlinear activation function \sigma. Finally, line 3 does not have to iterate over all nodes: in the minibatch setting only the nodes needed to compute the target representations are kept, and the complete minibatch pseudocode is given in the appendix (see the code sketch below).

        ⑤The traditional Weisfeiler-Lehman isomorphism test uses a hash function as its aggregator; if two subgraphs produce the same output, they are declared isomorphic. The authors replace the hash function with trainable neural networks so that the model learns to encode topological structure

        ⑥To keep the per-batch cost fixed and training fast, they uniformly sample a fixed-size set of neighbors, drawing a different sample at each iteration
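To make the forward pass concrete, below is a minimal NumPy sketch of Algorithm 1 under simplifying assumptions: the mean aggregator, ReLU as the nonlinearity \sigma, and fixed-size uniform neighbor sampling. The helper names (sample_neighbors, graphsage_forward) are illustrative and not from the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(neighbors, size):
    """Uniformly sample a fixed-size neighborhood (with replacement only if necessary)."""
    neighbors = list(neighbors)
    if len(neighbors) >= size:
        return rng.choice(neighbors, size=size, replace=False)
    return rng.choice(neighbors, size=size, replace=True)

def graphsage_forward(x, adj, weights, sample_sizes):
    """x: (N, d) input features; adj: node id -> neighbor ids;
    weights[k]: (2 * d_k, d_{k+1}) matrix W^k; sample_sizes[k]: S_k."""
    h = x.copy()                                            # h^0_v <- x_v
    for W, S in zip(weights, sample_sizes):                 # k = 1 .. K
        h_next = np.zeros((h.shape[0], W.shape[1]))
        for v in range(h.shape[0]):
            nbrs = sample_neighbors(adj[v], S)              # fixed-size uniform sample of N(v)
            h_nv = h[nbrs].mean(axis=0)                     # line 4: AGGREGATE (mean aggregator)
            cat = np.concatenate([h[v], h_nv])              # line 5: CONCAT(h_v^{k-1}, h_{N(v)}^k)
            h_next[v] = np.maximum(0.0, cat @ W)            # sigma = ReLU
        norms = np.linalg.norm(h_next, axis=1, keepdims=True)
        h = h_next / np.maximum(norms, 1e-12)               # line 7: normalize to unit length
    return h                                                # z_v <- h_v^K

# toy usage: 4 nodes on a cycle, 3-dim features, K = 2, S_1 = S_2 = 2
x = rng.normal(size=(4, 3))
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
weights = [rng.normal(size=(6, 5)), rng.normal(size=(10, 5))]
z = graphsage_forward(x, adj, weights, sample_sizes=[2, 2])  # (4, 5) embeddings
```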

placeholder  n. a symbol or piece of text that temporarily fills the place of missing content; in grammar, a word that is syntactically required but carries no meaning (e.g., the "it" in "It's a pity she left")

2.4.2. Learning the parameters of GraphSAGE

        ①The parameters are learned by stochastic gradient descent; the graph-based loss function is:

J_{\mathcal{G}}(\mathbf{z}_{u})=-\log\left(\sigma(\mathbf{z}_{u}^{\top}\mathbf{z}_{v})\right)-Q\cdot\mathbb{E}_{v_{n}\sim P_{n}(v)}\log\left(\sigma(-\mathbf{z}_{u}^{\top}\mathbf{z}_{v_{n}})\right)

where v is a node that co-occurs near u on a fixed-length random walk, \sigma is the sigmoid function, P_{n} denotes the negative-sampling distribution, and Q denotes the number of negative samples
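As a minimal sketch of this loss for a single positive pair, assuming z_u and z_v are the embeddings of a node and one of its random-walk co-occurring neighbors, and z_neg stacks Q negative samples already drawn from P_n (all names here are illustrative, not from the authors' code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_neg):
    """J_G(z_u): attract the co-occurring pair (u, v), repel the Q negative samples."""
    pos = -np.log(sigmoid(z_u @ z_v))               # -log(sigma(z_u^T z_v))
    q = z_neg.shape[0]
    neg = -np.log(sigmoid(-(z_neg @ z_u))).mean()   # Monte-Carlo estimate of E_{v_n ~ P_n}[...]
    return pos + q * neg

z_u, z_v = np.random.rand(8), np.random.rand(8)
z_neg = np.random.rand(5, 8)                        # Q = 5 negatives drawn from P_n
print(unsupervised_loss(z_u, z_v, z_neg))
```

Note that the embeddings z_u fed into this loss are produced by the forward pass from a node's features and neighborhood, rather than looked up from a per-node embedding table as in DeepWalk; this is what makes the approach inductive.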

2.4.3. Aggregator Architectures

        ①An aggregator should ideally be symmetric, i.e., invariant to permutations of its inputs, so that its output does not depend on the order in which the neighbor vectors arrive.

        ②The first candidate aggregator is mean aggregator:

\mathbf{h}_{v}^{k}\leftarrow \sigma \left ( \mathbf{W}\cdot \mathrm{MEAN} \left ( \left \{ \mathbf{h}_{v}^{k-1} \right \} \cup \left \{ \mathbf{h}_{u}^{k-1},\forall u\in \mathcal{N}\left ( v \right ) \right \}\right )\right )

(the formula as printed in the original paper appears to contain a subscript typo, corrected above) This update replaces lines 4 and 5 of Algorithm 1. Unlike the other aggregators, this mean aggregator does not perform the concatenation in line 5, i.e., it does not concatenate the node's previous representation with the aggregated neighborhood vector; that concatenation can be viewed as a simple form of skip connection between different search depths.

        ③The second candidate is the LSTM aggregator, which has greater expressive capability but is not inherently symmetric; the authors therefore apply it to a random permutation of the node's neighbors.

        ④The third candidate aggregator is pooling aggregator:

\text{AGGREGATE}_k^\text{pool}=\max(\left\{\sigma\left(\mathbf{W}_{\text{pool}}\mathbf{h}_{u_i}^k+\mathbf{b}\right),\forall u_i\in\mathcal{N}(v)\right\})

which is both symmetric and trainable. In principle any multilayer perceptron (MLP) can be applied before the element-wise max pooling; here a single-layer one is used. Moreover, the authors found no significant difference between max- and mean-pooling.
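Below is a minimal sketch of the mean and max-pooling aggregators, assuming ReLU as the nonlinearity and illustrative parameter names; note that both are invariant to the row order of h_neighbors, i.e., symmetric in the sense discussed above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mean_aggregator(h_v, h_neighbors, W):
    """Mean ("GCN-style") aggregator: averages {h_v^{k-1}} ∪ {h_u^{k-1}, u in N(v)} and skips
    the CONCAT step, directly producing h_v^k."""
    stacked = np.vstack([h_v[None, :], h_neighbors])
    return relu(stacked.mean(axis=0) @ W)

def max_pooling_aggregator(h_neighbors, W_pool, b):
    """Pooling aggregator: a single-layer MLP applied to each neighbor, then an element-wise max;
    the result h_{N(v)}^k is still concatenated with h_v^{k-1} in line 5 of Algorithm 1."""
    return relu(h_neighbors @ W_pool + b).max(axis=0)
```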

lattice  n. a framework of crossed wooden or metal strips (used e.g. as fencing); a structure or pattern of crossed diagonal lines

2.5. Experiments

(1)They used 3 tasks to test the performance of GraphSAGE:

        ①Classifying academic papers into subjects using the Web of Science citation dataset

        ②Community classification of Reddit posts

        ③Classifying protein functions across protein-protein interaction (PPI) graphs

(2)Experimental set-up

        ①They compare against four baselines: a random classifier, a logistic-regression classifier on the raw features (ignoring graph structure), DeepWalk, and DeepWalk combined with the raw features.

        ②In addition, four GraphSAGE variants with different aggregators are included in the comparison.

        ③Activation: ReLU

        ④K=2

        ⑤Neighborhood sample sizes S_{1}=25,S_{2}=10

        ⑥Optimizer: Adam 

        ⑦The same hyperparameters are used for all GraphSAGE variants

        ⑧Comparison results:

[Figure: comparison of prediction results for the baselines and GraphSAGE variants on the three datasets]

2.5.1. Inductive learning on evolving graphs: Citation and Reddit data

(1)Citation data:

        ①Dataset: Thomson Reuters Web of Science Core Collection

        ②Samples: all papers in 6 biology-related fields from 2000 to 2005

        ③Number of nodes: 302,424

        ④Average degree: 9.15

        ⑤Training set: 2000-2004

        ⑥Testing set: 70% of the 2005 papers

        ⑦Validation set: the remaining 30% of the 2005 papers

(2)Reddit data:

        ①Dataset: a graph built from Reddit posts made in September 2014

        ②Samples: 50 communities, 232,965 posts

        ③Average degree: 492

        ④Two post nodes are connected if the same user commented on both posts

(3)⭐The model makes predictions on the unseen test data without any fine-tuning

2.5.2. Generalizing across graphs: Protein-protein interactions

        ①Graphs: each graph corresponds to a different human tissue

        ②Features: positional gene sets, motif gene sets, immunological signatures

        ③Labels: 121 gene ontology sets

        ④Average number of nodes in one graph: 2372

        ⑤Average degree: 28.8

        ⑥Training set (graphs): 20

        ⑦Testing set: 2

2.5.3. Runtime and parameter sensitivity

        ①Training-time comparison on the Reddit data:

[Figure: training-time comparison of the different methods on the Reddit data]

        ②The following figure shows how the neighborhood sample size affects performance and runtime:

[Figure: effect of neighborhood sample size on accuracy and runtime]

with K=2 and S_{1}=S_{2}. K=2 was chosen because increasing the depth from 1 to 2 gives a consistent boost in accuracy, whereas going beyond 2 yields only marginal gains at a much higher runtime cost; likewise, the returns from increasing the neighborhood sample size diminish as the sample gets larger.

2.5.4. Summary comparison between the different aggregator architectures

        ①Experimental settings: 6 in total, i.e., the 3 datasets each under the unsupervised and the supervised setting

        ②Test: the non-parametric Wilcoxon signed-rank test

2.6. Theoretical analysis

        ①They examine whether GraphSAGE can learn to predict the clustering coefficient of a node, i.e., the proportion of closed triangles in the node's 1-hop neighborhood.

        ②Each node has an input feature x_{v}\in U,\forall v\in \mathcal{V}. Suppose there exists a fixed positive constant C such that \left \| x_{v}-x_{v'} \right \|_{2}> C for all pairs of nodes v and v'. Then for any \epsilon > 0 there exists a parameter setting such that after K=4 iterations:

\left | z_{v}-c_{v} \right |<\epsilon,\forall v\in\mathcal{V}

where z_{v} is the final output of Algorithm 1 for node v and c_{v} is the node's clustering coefficient
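As a side note (not from the paper's code), the quantity c_{v} in the theorem, the local clustering coefficient, can be computed directly from the graph; the adj structure and helper name below are illustrative:

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient c_v: the fraction of possible edges among v's neighbors
    that actually exist (adj maps each node to the set of its neighbors, undirected graph)."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    closed = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2.0 * closed / (k * (k - 1))

# toy example: node 0 has neighbors 1, 2, 3, and only the edge (1, 2) exists among them -> c_0 = 1/3
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(clustering_coefficient(adj, 0))  # 0.333...
```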

2.7. Conclusion

        By sampling node neighborhoods, GraphSAGE trades a controlled loss in performance for much lower training and inference time while still achieving high accuracy. The authors consider extensions to directed and multi-modal graphs, as well as non-uniform neighborhood sampling, to be promising future directions.

3. Supplementary knowledge

3.1. Hyperparameter hacking

(1) This seems to be a concept coined by the authors themselves; I could not find an explanation for it online.

4. Reference List

Hamilton, W., Ying, R. & Leskovec, J. (2017) 'Inductive representation learning on large graphs', 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1025-1035. doi: https://doi.org/10.48550/arXiv.1706.02216
