Original paper: Inductive representation learning on large graphs | Proceedings of the 31st International Conference on Neural Information Processing Systems (acm.org)
The English was typed entirely by hand! It is my summarizing and paraphrasing of the original paper, so some unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments! This post is closer to personal notes, so read with caution!
Table of Contents
1. TL;DR
1.1. Takeaways
1.2. Paper framework diagram
2. Paragraph-by-paragraph close reading
2.1. Abstract
2.2. Introduction
2.3. Related work
2.4. Proposed method: GraphSAGE
2.4.1. Embedding generation (i.e., forward propagation) algorithm
2.4.2. Learning the parameters of GraphSAGE
2.4.3. Aggregator Architectures
2.5. Experiments
2.5.1. Inductive learning on evolving graphs: Citation and Reddit data
2.5.2. Generalizing across graphs: Protein-protein interactions
2.5.3. Runtime and parameter sensitivity
2.5.4. Summary comparison between the different aggregator architectures
2.6. Theoretical analysis
2.7. Conclusion
3. Supplementary knowledge
3.1. Hyperparameter hacking
4. Reference List
(1)Help, why do I have no takeaways? I must have already ascended.
①Previous works do not handle unseen nodes; they only focus on nodes that are already present or labeled during training
②The authors adopt an inductive framework that samples and aggregates features from a node's neighbors
①Node embedding aims to reduce the dimensionality of node/neighborhood features, turning them into low-dimensional embedding vectors.
②To make predictions about new, unseen nodes, inductive capability is necessary for ML/DL models.
③They generalize the traditional GCN to trainable aggregation functions and to inductive, unsupervised learning
④Rich node features, such as degree, text attributes, and node profile information, make it possible to generalize to and make predictions for unseen nodes
⑤Each node aggregates information from neighbors at different hops, or search depths, away from it.
⑥The authors test their model on three classification tasks and demonstrate its inductive prediction ability. In addition, their model achieves higher accuracy than the baselines.
⑦Their main idea (Figure 1 in the paper): sample a node's local neighborhood, aggregate feature information from the sampled neighbors, and use the aggregated representation to predict labels or graph context.
(1)Factorization-based embedding approaches
①Node embedding methods: random walk statistics, matrix factorization etc.
②However, all of these approaches learn embeddings for a fixed node set: the training objective is invariant to orthogonal transformations of the embedding space, so the learned embeddings do not generalize or transfer to other graphs and may need to be retrained from scratch in new training sessions
(2)Supervised learning over graphs
Previous supervised learning over graphs (e.g., graph kernel approaches) mainly focuses on classifying entire graphs rather than individual nodes
(3)Graph convolutional networks
①Previous GCNs cannot easily generalize to unseen nodes or scale to very large graphs
②The original GCN is semi-supervised and transductive, and requires researchers to know the entire graph Laplacian throughout training
①Assume there are $K$ aggregator functions, denoted $\text{AGGREGATE}_k, \forall k \in \{1, \dots, K\}$.
②Each node $v$ aggregates information from its sampled neighborhood $\mathcal{N}(v)$ as:
$$\mathbf{h}^{k}_{\mathcal{N}(v)} \leftarrow \text{AGGREGATE}_k\big(\{\mathbf{h}^{k-1}_u, \forall u \in \mathcal{N}(v)\}\big)$$
③Information is then propagated between layers or "search depths" using the weight matrices $\mathbf{W}^k$:
$$\mathbf{h}^{k}_v \leftarrow \sigma\big(\mathbf{W}^k \cdot \text{CONCAT}(\mathbf{h}^{k-1}_v, \mathbf{h}^{k}_{\mathcal{N}(v)})\big)$$
④Forward-propagation pseudocode (Algorithm 1 in the paper), where $k$ represents the step (search depth) and $\mathbf{h}^k_v$ is node $v$'s representation at the $k$-th step. Line 4 of the algorithm (the aggregation equation above) builds each node's neighborhood vector from its neighbors' representations. Line 5 (the update equation above) concatenates the node's current vector with its neighborhood vector and feeds the result through a fully connected layer with the nonlinear activation function $\sigma$. Finally, line 3 does not really have to iterate over all nodes: the authors note that only the nodes needed for the current computation have to be retained, and the complete mini-batch pseudocode is provided in the paper's appendix. A minimal code sketch of this forward pass follows right after this point.
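Below is a minimal NumPy sketch of this forward pass, using full neighborhoods and mean aggregation for illustration. The data layout (feature and neighbor dicts), the function name, and the weight shapes are my own illustrative choices, not the authors' code.

```python
import numpy as np

def graphsage_forward(features, neighbors, weights, K):
    """Sketch of GraphSAGE forward propagation (Algorithm 1).

    features:  dict {node: np.ndarray of shape (d,)} -- input features x_v
    neighbors: dict {node: list of neighbor nodes}   -- neighborhood function N(v)
    weights:   list of K matrices; W^k has shape (d_k, 2 * d_{k-1})
    K:         search depth (number of aggregation layers)
    """
    h = {v: x.copy() for v, x in features.items()}            # h^0_v = x_v
    for k in range(K):
        h_next = {}
        for v in h:
            # line 4: aggregate neighbor representations (mean aggregator here)
            h_neigh = np.mean([h[u] for u in neighbors[v]], axis=0)
            # line 5: concatenate self and neighborhood vectors, apply W^k and ReLU
            concat = np.concatenate([h[v], h_neigh])
            h_next[v] = np.maximum(weights[k] @ concat, 0.0)
            # line 7: normalize each representation to unit length
            h_next[v] /= (np.linalg.norm(h_next[v]) + 1e-12)
        h = h_next
    return h                                                   # z_v = h^K_v
```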
⑤The classical Weisfeiler-Lehman isomorphism test uses a hash function as its aggregator: if two subgraphs produce the same outputs, they are judged isomorphic. The authors replace the hash function with trainable neural networks so that the model can represent topological structure
⑥They draw a different, fixed-size uniform sample of each node's neighbors at every iteration, which keeps the per-batch cost bounded and makes training faster (a minimal sampling sketch follows below)
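A minimal sketch of this fixed-size uniform neighbor sampling; falling back to sampling with replacement when a node has fewer neighbors than the sample size is my assumption here, not a detail taken from these notes.

```python
import random

def sample_neighbors(neighbors, v, sample_size):
    """Fixed-size uniform sample of v's neighborhood, redrawn at every iteration."""
    nbrs = neighbors[v]
    if len(nbrs) >= sample_size:
        return random.sample(nbrs, sample_size)            # without replacement
    # assumption: sample with replacement when the node has too few neighbors
    return [random.choice(nbrs) for _ in range(sample_size)]
```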
placeholder n. something that stands in for a missing part; placeholder text; (grammar) a word that is syntactically required but carries no real meaning, e.g., the "it" in "It's a pity she left"
①The parameters are adjusted by stochastic gradient descent; the (unsupervised, graph-based) loss function is:
$$J_{\mathcal{G}}(\mathbf{z}_u) = -\log\big(\sigma(\mathbf{z}_u^{\top}\mathbf{z}_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}\log\big(\sigma(-\mathbf{z}_u^{\top}\mathbf{z}_{v_n})\big)$$
where $v$ is a node that co-occurs near $u$ (on a fixed-length random walk), $\sigma$ is the sigmoid function, $P_n$ denotes a negative sampling distribution, and $Q$ denotes the number of negative samples. In the fully supervised setting this loss can be replaced, or augmented, by a task-specific objective. A small code sketch of this loss follows below.
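A minimal NumPy sketch of this loss for a single positive pair and Q negative samples; variable names are illustrative, and a real implementation would of course backpropagate through this.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_negatives, Q):
    """J_G(z_u) for one positive co-occurring pair (u, v) and Q negative samples.

    z_u, z_v:     embedding vectors of u and a node v that co-occurs with it
    z_negatives:  array of shape (Q, d) with embeddings drawn from P_n(v)
    """
    positive_term = -np.log(sigmoid(z_u @ z_v))
    # Monte-Carlo estimate of the expectation over the negative sampling distribution
    negative_term = -Q * np.mean(np.log(sigmoid(-z_negatives @ z_u)))
    return positive_term + negative_term
```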
①A symmetric aggregator is able to ignore the ordering of its inputs, i.e., its output is not influenced by the order in which the neighbors are presented.
②The first candidate aggregator is the mean aggregator:
$$\mathbf{h}^{k}_v \leftarrow \sigma\big(\mathbf{W} \cdot \text{MEAN}(\{\mathbf{h}^{k-1}_v\} \cup \{\mathbf{h}^{k-1}_u, \forall u \in \mathcal{N}(v)\})\big)$$
(the parentheses in this formula seem to be mistyped in the original paper), which can replace lines 4 and 5 of Algorithm 1. This aggregator omits the concatenation operation, which is what connects representations across different layers or search depths (a form of skip connection).
③The second candidate aggregator is the LSTM aggregator, which has greater expressive capability. However, it is not symmetric (not permutation invariant), so the authors apply it to random permutations of a node's neighbors.
④The third candidate aggregator is the pooling aggregator:
$$\text{AGGREGATE}^{\text{pool}}_k = \max\big(\{\sigma(\mathbf{W}_{\text{pool}}\mathbf{h}^{k}_{u_i} + \mathbf{b}), \forall u_i \in \mathcal{N}(v)\}\big)$$
which is both symmetric and trainable. Any number of MLP layers can be placed before the element-wise max-pooling operation. Moreover, the authors found no significant difference between mean- and max-pooling. Code sketches of the mean and pooling aggregators follow below.
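For concreteness, here are minimal NumPy sketches of the mean and pooling aggregators as described above; the weight shapes and function names are illustrative assumptions, and the LSTM aggregator is omitted because it needs a full recurrent cell.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mean_aggregator(h_self, h_neighbors, W):
    """GCN-style mean aggregator: average self + neighbors, no concatenation."""
    stacked = np.vstack([h_self[None, :], h_neighbors])   # {h_v} ∪ {h_u, u ∈ N(v)}
    return relu(W @ stacked.mean(axis=0))

def pooling_aggregator(h_neighbors, W_pool, b):
    """Pooling aggregator: per-neighbor dense layer followed by element-wise max."""
    transformed = relu(h_neighbors @ W_pool.T + b)         # σ(W_pool h_u + b) per neighbor
    return transformed.max(axis=0)
```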
lattice n. a framework of crossed wooden or metal strips (used as a fence, etc.); a lattice structure; a lattice pattern
(1)They used 3 tasks to test the performance of GraphSAGE:
①Subject classification of academic papers using the Web of Science citation dataset
②Community classification of Reddit posts
③Function classification of protein-protein interaction (PPI)
(2)Experimental set-up
①They compare against four baselines: a random classifier, a feature-based logistic regression classifier (which ignores graph structure), DeepWalk embeddings, and DeepWalk embeddings combined with raw features.
②In addition, the authors include four GraphSAGE variants with different aggregators for comparison.
③Activation: ReLU
④Search depth: K = 2
⑤Neighborhood sample sizes: S1 = 25, S2 = 10
⑥Optimizer: Adam
⑦They use the same set of hyperparameters for all GraphSAGE variants
⑧Comparison results (Table 1 in the paper):
(1)Citation data:
①Dataset: Thomson Reuters Web of Science Core Collection
②Samples: all papers in 6 biology-related fields from 2000 to 2005
③Number of nodes: 302,424
④Average degree: 9.15
⑤Training set: papers from 2000-2004
⑥Test set: 70% of the 2005 papers
⑦Validation set: the remaining 30% of the 2005 papers
(2)Reddit data:
①Dataset: a graph constructed from Reddit posts made in September 2014
②Samples: 50 communities, 232,965 posts
③Average degree: 492
④Two post nodes are connected whenever the same user comments on both posts (a small construction sketch follows below)
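As a small illustration of this construction (not the authors' pipeline; the (user, post) record format is my assumption), the post-to-post edges could be built like this:

```python
from collections import defaultdict
from itertools import combinations

def build_post_graph(comments):
    """comments: iterable of (user, post) pairs. Returns a set of post-post edges."""
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)
    edges = set()
    for posts in posts_by_user.values():
        # connect every pair of posts commented on by the same user
        for a, b in combinations(sorted(posts), 2):
            edges.add((a, b))
    return edges
```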
(3)⭐Their model is applied to the unseen test data without any fine-tuning
①Graphs: protein-protein interaction graphs, each corresponding to a different human tissue
②Features: positional gene sets, motif gene sets, immunological signatures
③Labels: 121 gene ontology sets
④Average number of nodes in one graph: 2372
⑤Average degree: 28.8
⑥Training set: 20 graphs
⑦Test set: 2 graphs (2 further graphs are used for validation)
①Training-time comparison on the Reddit data:
②The figure shows how the neighborhood sample size influences performance, with $K = 2$ and $S_1 \cdot S_2 \leq 500$. This is because they found that $K = 2$ is generally the point of steepest gain in the accuracy curve, i.e., accuracy grows only slowly beyond it; likewise, the returns diminish as the neighborhood sample size increases.
①Experimental settings: 6 in total (the 3 datasets, each in an unsupervised and a supervised setting)
②Test: the non-parametric Wilcoxon Signed-Rank Test (an illustrative usage example follows below)
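As an illustration of how such a paired test could be run (not the authors' code, and the scores below are made-up, purely hypothetical numbers), comparing two aggregators across the six settings might look like this:

```python
from scipy.stats import wilcoxon

# hypothetical paired micro-F1 scores of two aggregators over the 6 settings
pool_scores = [0.600, 0.954, 0.600, 0.648, 0.500, 0.612]
lstm_scores = [0.612, 0.907, 0.632, 0.650, 0.612, 0.600]

stat, p_value = wilcoxon(pool_scores, lstm_scores)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")
```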
①They want to test whether GraphSAGE is able to predict the clustering coefficient of a node, i.e., the proportion of closed triangles in the node's 1-hop neighborhood.
②Suppose the input features satisfy $x_v \in U, \forall v \in V$. Then, assuming there is a fixed positive constant $C \in \mathbb{R}^+$ such that $\|x_v - x_{v'}\|_2 > C$ for all pairs of nodes $v, v'$, for any $\epsilon > 0$ there exists a parameter setting $\Theta^*$ such that after $K = 4$ iterations of Algorithm 1:
$$|z_v - c_v| < \epsilon, \quad \forall v \in V,$$
where $z_v \in \mathbb{R}$ is the final output value and $c_v$ is the node's clustering coefficient. A sketch of the clustering coefficient computation follows below.
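For reference, here is a small plain-Python sketch of the local clustering coefficient that the theorem refers to; the data layout (a node-to-neighbor-set dict) is an assumption for illustration.

```python
from itertools import combinations

def clustering_coefficient(neighbors, v):
    """Fraction of closed triangles among v's 1-hop neighbors."""
    nbrs = list(neighbors[v])
    if len(nbrs) < 2:
        return 0.0
    # count neighbor pairs that are themselves connected
    links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    possible = len(nbrs) * (len(nbrs) - 1) / 2
    return links / possible
```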
By sampling neighborhoods, GraphSAGE trades off between performance and training/testing time while still achieving high accuracy. In addition, the authors consider directed graphs, multi-modal graphs, and non-uniform neighborhood sampling to be promising future directions.
(1)This seems to be a concept coined by the authors themselves; I could not find any explanation of it online
Hamilton, W., Ying, R. & Leskovec, J. (2017) 'Inductive representation learning on large graphs', 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, pp. 1025-1035. doi: https://doi.org/10.48550/arXiv.1706.02216