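We tend to identify the knowledge in a trained model with the learned parameter values.
In other words, knowledge is usually taken to be the parameter values the trained model has learned.
...which we call “distillation”, to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment.
Distillation transfers knowledge from a large, cumbersome model (the teacher model) to a small model suited to deployment (the student network).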
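Logits are unnormalized scores, i.e., weighted sums of the features.
Logits are the output of the fully connected layer; a sigmoid or softmax function turns them into normalized probabilities, as in the following example:
Suppose we feed in a picture of a cat. After the network processes it, the fully connected layer outputs logits, say the array [8, 6, 1], where each number is a weighted sum of features. After normalization (for example, by passing the logits through softmax), the result can be expressed as class probabilities, either [1, 0, 0] or [0.88, 0.119, 0.001]. These two ways of expressing the probabilities are what we call hard targets and soft targets.
For the cat picture: when the label assigns probability 1 to cat and 0 to every other class, it is a hard target; when it assigns 0.88 to cat, 0.119 to dog, and 0.001 to pig, it is a soft target.
Hard targets erase the possibility of every other class and can be implemented with one-hot encoding;
Soft targets keep all the possibilities and can be implemented with softmax;
This shows that soft targets carry more information than hard targets.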
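The cat example can be checked with a few lines of standard-library Python. This is only an illustrative sketch: the helper names `softmax` and `one_hot` and the `temperature` parameter are choices made here, not part of the original notes.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, num_classes):
    """Hard target: probability 1 for one class, 0 for the rest."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

logits = [8.0, 6.0, 1.0]  # cat, dog, pig
soft_targets = softmax(logits)
hard_targets = one_hot(logits.index(max(logits)), len(logits))

print([round(p, 3) for p in soft_targets])  # [0.88, 0.119, 0.001]
print(hard_targets)                         # [1.0, 0.0, 0.0]
```

The soft targets still say that the picture looks more like a dog than a pig, which the hard targets cannot express.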
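Two loss functions are introduced to adjust the weights.
distillation loss
The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model.
Input: the soft targets of the student model and of the teacher model at the same temperature
Purpose: make the student network's predicted class distribution fit the teacher network's predicted distribution as closely as possible
In short, make the student network imitate the teacher network as closely as possible
Common choices: cross entropy loss or KL divergence
student loss
The second objective function is the cross entropy with the correct labels. This is computed using exactly the same logits in softmax of the distilled model but at a temperature of 1.
Input: the student model's softmax output at temperature 1 and the correct labels
Purpose: reduce how much of the teacher network's erroneous information is distilled into the student network
Common choice: cross entropy loss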
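1. At temperature T, train the teacher network and obtain soft targets 1
2. At the same temperature T, train the student network and obtain soft targets 2
3. Compute the distillation loss from soft targets 1 and soft targets 2
4. At temperature 1, train the student network and obtain soft targets 3
5. Compute the student loss from soft targets 3 and the correct labels
[Notes]:
1. α is usually much smaller than β, and we set β = 1 - α
2. The gradients produced by the soft targets scale as 1/T^2, so the distillation loss is multiplied by T^2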
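The five steps and the two notes can be sketched as one loss function. This is a minimal illustration under stated assumptions: α is taken to weight the student loss and β = 1 - α the distillation loss, cross entropy is used for both terms, and the names `kd_loss`, `T`, and `alpha` as well as the example logits are made up here.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def kd_loss(teacher_logits, student_logits, label, T=4.0, alpha=0.1):
    beta = 1.0 - alpha                                     # note 1: beta = 1 - alpha
    soft_teacher = softmax(teacher_logits, T)              # soft targets 1
    soft_student = softmax(student_logits, T)              # soft targets 2
    distillation = cross_entropy(soft_teacher, soft_student)
    hard = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    student = cross_entropy(hard, softmax(student_logits, 1.0))  # soft targets 3 vs. label
    # note 2: multiply the distillation term by T^2 to undo the 1/T^2 gradient scaling
    return alpha * student + beta * (T ** 2) * distillation

loss = kd_loss([8.0, 6.0, 1.0], [5.0, 4.0, 2.0], label=0)
print(loss)
```

In a real training loop the teacher logits would be held fixed and only the student would receive gradients.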
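Link: a video on knowledge distillation
From the knowledge distillation described above, we can see that its essence is to update the network parameters by computing a loss function, so that the student network's parameter distribution comes to match the teacher network's. In effect, distillation is carried out by controlling the value of the loss function, so one of the paper's innovations is to control the direction of distillation through the loss: when the loss is 0, the network's parameters are not updated and the network acts as the teacher; when the loss is nonzero, the network is updated and acts as the student. This determines the direction of distillation at that moment. When both networks have the chance to act as the teacher, we call it free-direction knowledge distillation; the authors use the phrase "collaboratively learn" to describe this behavior.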
Knowledge distillation (KD) has demonstrated its effectiveness to boost the performance of graph neural networks (GNNs), where its goal is to distill knowledge from a deeper teacher GNN into a shallower student GNN.
知识蒸馏 (KD) 已证明其提高图神经网络 (GNN) 性能的有效性,其目标是将知识从较深的教师 GNN 提炼成较浅的学生 GNN。
However, it is actually difficult to train a satisfactory teacher GNN due to the well-known over-parametrized and over-smoothing issues, leading to invalid knowledge transfer in practical applications.
然而,由于众所周知的过度参数化和过度平滑问题,实际上很难训练出令人满意的教师 GNN,导致实际应用中的无效知识迁移。
Over-parameterization
Over-smoothing
Knowledge transfer
In this paper, we propose the first Free-direction Knowledge Distillation framework via Reinforcement learning for GNNs, called FreeKD, which is no longer required to provide a deeper well-optimized teacher GNN.
在本文中,我们提出了第一个通过 GNN 强化学习的自由方向知识蒸馏框架,称为 FreeKD,它不再需要提供更深层次的优化教师 GNN。
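Terminology
agent: the decision maker -> a clever Mario
state: the state of the environment -> the current screen
action: an action -> moving up, down, left, or right
reward: the reward -> [collect a coin: +10], [distance to the goal decreases by 1: +1], [distance to the goal increases by 1: -1], [reach the goal: +999, game over], [die: -999, game over]
policy: the strategy
Take the screen Mario is on in the figure below. We treat the current screen as the environment state, and the agent receives different rewards for different actions. For example, "left -> no coin, no death, distance to the goal +1, so the reward for this action is -1"; "right -> no coin, death, distance to the goal -1, so the reward for this action is -998"; and likewise "up -> coin collected, no death, distance unchanged, reward +10". After Mario carries out the action, the reward for that action is computed and added to his cumulative reward. The screen then changes, which means the environment state has changed, and Mario starts deciding on his next action.
During training, different combinations of actions accumulate different rewards, and the goal is to obtain the largest cumulative reward.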
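The reward rules in the Mario example can be written as a small function. This is a toy sketch; `step_reward` and its parameters are hypothetical names used only to reproduce the three worked cases above.

```python
def step_reward(got_coin=False, distance_change=0, reached_goal=False, died=False):
    """Reward for one action, following the rules listed in the terminology notes."""
    reward = 0
    if got_coin:
        reward += 10
    # distance_change is -1 when Mario moves one step closer to the goal (+1 reward)
    # and +1 when he moves one step away (-1 reward)
    reward -= distance_change
    if reached_goal:
        reward += 999
    if died:
        reward -= 999
    return reward

left = step_reward(distance_change=+1)               # no coin, no death, farther away
right = step_reward(distance_change=-1, died=True)   # closer, but Mario dies
up = step_reward(got_coin=True)                      # coin, distance unchanged
print(left, right, up)  # -1 -998 10
```

Summing these per-step values over an episode gives the cumulative reward that the agent tries to maximize.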
Link: an introductory article on reinforcement learning
The core idea of our work is to collaboratively build two shallower GNNs in an effort to exchange knowledge between them via reinforcement learning in a hierarchical way.
我们工作的核心思想是协作构建两个较浅的 GNN,以通过分层方式的强化学习在它们之间交换知识。
Hierarchical manner
As we observe that one typical GNN model often has better and worse performances at different nodes during training, we devise a dynamic and free-direction knowledge transfer strategy that consists of two levels of actions:
- node-level action determines the directions of knowledge transfer between the corresponding nodes of two networks; and then
- structure-level action determines which of the local structures generated by the node-level actions to be propagated.
正如我们观察到的,一个典型的 GNN 模型在训练期间在不同节点上的性能通常会越来越差,因此我们设计了一种动态和自由方向的知识转移策略,该策略由两个级别的动作组成:
1)节点级别的动作决定了方向两个网络的相应节点之间的知识转移; 然后
2) 结构级动作确定要传播的节点级动作生成的局部结构中的哪一个。
How are "dynamic" and "free-direction" achieved?
In essence, our FreeKD is a general and principled framework which can be naturally compatible with GNNs of different architectures.
本质上,我们的 FreeKD 是一个通用且有原则的框架,可以自然地兼容不同架构的 GNN。
Extensive experiments on five benchmark datasets demonstrate our FreeKD outperforms two base GNNs in a large margin, and shows its efficacy to various GNNs. More surprisingly, our FreeKD has comparable or even better performance than traditional KD algorithms that distill knowledge from a deeper and stronger teacher GNN.
在5个基准数据集上进行的大量实验表明,我们的FreeKD在很大程度上优于两个基本的gnn,并显示了其对各种gnn的有效性。更令人惊讶的是,我们的FreeKD的性能与传统的KD算法相当,甚至更好,后者从更深更强的教师GNN中提取知识。
• Computing methodologies → Neural networks; • Mathematics of computing → Graph algorithms.
计算方法→神经网络;计算的数学学→图形算法
KEYWORDS
Graph Neural Networks, Free-direction Knowledge Distillation, Reinforcement Learning
图神经网络,自由方向的知识蒸馏,强化学习
Graph data is becoming increasingly prevalent and ubiquitous with the rapid development of the Internet, such as social networks [13], citation networks [32], etc. To better handle graph-structured data, graph neural networks (GNNs) provide an effective means to learn node embeddings by aggregating feature information of neighborhood nodes [33]. Because of the powerful ability in modeling relations of data, various graph neural networks have been proposed in the past decade [10, 13, 18, 29, 33]. The representative works include GraphSAGE [13], GAT [33], GCN [18], etc.
随着互联网的快速发展,图数据越来越普遍,如社交网络[13]、引文网络[32]等。为了更好地处理图结构数据,图神经网络(GNNs)通过聚合邻域节点[33]的特征信息,提供了一种学习节点嵌入的有效方法。由于其在数据关系建模方面的强大能力,在过去的十年中,人们提出了各种图神经网络[10,13,18,29,33]。代表性作品包括GraphSAGE [13]、GAT [33]、GCN [18]等
Node embeddings
Recently, some researchers extend an interesting learning scheme, called knowledge distillation (KD) , into GNNs to further improve the performance of GNNs [11, 40, 41]. The basic idea among these methods is to optimize a shallower student GNN model by distilling knowledge from a deeper teacher GNN model. For instance, LSP[41] proposed a local structure preserving module to transfer the topological structure information of a GNN teacher model. The work in [39] proposed a light GNN architecture, called TinyGNN,
and attempted to distill knowledge from a deep GNN teacher model to the light GNN model. GFKD [11] designed a data-free knowledge distillation strategy for GNNs, enabling to transfer knowledge from a GNN teacher model by generating fake graphs.
最近,一些研究人员将一种有趣的学习方案知识蒸馏(KD)扩展到gnn中,以进一步提高gnn[11,40,41]的性能。这些方法的基本思想是通过从更深的教师GNN模型中提取知识来优化较浅的学生GNN模型。例如,LSP[41]提出了一个局部结构保持模块来传递GNN教师模型的拓扑结构信息。[39]的工作提出了一种名为轻GNN架构的TinyGNN,并试图将知识从深度GNN教师模型中提取到轻GNN模型中。GFKD [11]为GNN设计了一种无数据的知识蒸馏策略,使其能够通过生成假图来从GNN教师模型中转移知识。
LSP
TinyGNN
GFKD
The above methods follow the same teacher-student architecture
as the traditional knowledge distillation methods [3, 14], and resort
to a deeper well-optimized teacher GNN for distilling knowledge.
However, when applying such an architecture to GNNs, it often
suffers from the following limitations: first, it is often difficult and
inefficient to train a satisfactory teacher GNN. As we know, the existing over-parameterized and over-smoothing issues often degrade
the performance of the deeper GNN model. Moreover, training a
deeper well-optimized model usually needs plenty of data and high
computational costs.
上述方法遵循与传统知识蒸馏方法[3,14]相同的师生架构,并采用更深层次的良好优化的教师GNN来提取知识。然而,当将这种体系结构应用于GNN时,它往往存在以下限制:首先,培训一个令人满意的教师GNN往往是困难和低效的。我们知道,现有的过参数化和过平滑问题往往会降低更深层次的GNN模型的性能。此外,训练一个更深入的优化模型通常需要大量的数据和较高的计算成本。
Second, according to [24, 39, 42], we know that a stronger teacher model may not necessarily lead to a better
student model. This may be because the mismatching of the representation capacities between a teacher model and a student model
makes the student model hard to mimic the outputs of a too strong
teacher model. Thus, it is difficult to find an optimal teacher GNN
for a student GNN in practical applications. Considering that many
powerful GNN models have been proposed in the past decade [37],
this gives rise to one intuitive thought: whether we can explore a new
knowledge distillation architecture to boost the performance of GNNs,
avoiding the obstacle involved by training a deeper well-optimized
teacher GNN?
其次,根据[24,39,42],我们知道一个更强的教师模式不一定会导致一个更好的学生模式。这可能是因为教师模型和学生模型之间的表示能力的不匹配,使得学生模型难以模拟一个过强的教师模型的输出。因此,在实际应用中,很难找到一个最优的教师GNN。考虑到在过去的十年中,[37]中已经提出了许多强大的GNN模型,这就产生了一个直观的想法:我们是否可以探索一种新的知识蒸馏架构来提高GNN的性能,通过培训一个更深入的、优化良好的教师GNN来避免所涉及的障碍?
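Figure 1: Cross entropy losses on the nodes with IDs 1 to 10 of the Cora dataset, obtained by two typical GNN models, GraphSAGE [13] and GAT [33], after training for 20 epochs. The value in each block is the corresponding loss.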
In light of these, we propose a new knowledge distillation framework, Free-direction Knowledge Distillation based on Reinforcement learning tailored for GNNs, called FreeKD. Rather than requiring a deeper well-optimized teacher GNN for unidirectional
knowledge transfer, we collaboratively learn two shallower GNNs
in an effort to distill knowledge from each other via reinforcement
learning in a hierarchical way. This idea stems from our observation that one typical GNN model often has better and worse
performances at different nodes during training.
在此基础上,我们提出了一种新的知识蒸馏框架,即基于针对gnn的强化学习的自由方向知识蒸馏,称为FreeKD。我们不需要一个更深入的、良好优化的教师GNN来进行单向知识转移,而是协同学习两个较浅的GNN,努力通过分层的强化学习方式相互提取知识。这一想法源于我们的观察,即一个典型的GNN模型在训练过程中在不同节点上的表现往往更好或更差。
No deeper teacher for one-way knowledge transfer; the two networks learn collaboratively instead
As shown in Figure 1, GraphSAGE [13] has lower cross entropy losses at nodes with
ID= {1, 4, 5, 7, 9}, while GAT [33] has better performances at the
rest nodes. Based on this observation, we design a free-direction
knowledge distillation strategy to dynamically exchange knowledge between two shallower GNNs to benefit from each other.
Considering that the direction of distilling knowledge for each
node will have influence on the other nodes, we thus regard determining the directions for different nodes as a sequential decision
making problem. Meanwhile, since the selection of the directions
is a discrete problem, we can not optimize it by stochastic gradient descent based methods [34].
如图1所示,GraphSAGE [13]在ID为={1、4,5,7,9}的节点上具有较低的交叉熵损失,而GAT [33]在其他节点上具有更好的性能。在此基础上,我们设计了一种自由方向的知识蒸馏策略,在两个较浅的gnn之间动态交换知识,以相互受益。考虑到每个节点的知识提取方向会影响其他节点,因此我们将不同节点的方向的确定作为一个顺序决策问题。同时,由于方向的选择是一个离散的问题,我们不能用基于随机梯度下降的方法[34]对其进行优化。
Thus, we address this problem
via reinforcement learning in a hierarchical way. Our hierarchical
reinforcement learning algorithm consists of two levels of actions:
Level 1, called node-level action, is used to distinguish which GNN
is chosen to distill knowledge to the other GNN for each node. After
determining the direction of knowledge transfer for each node, we
expect to propagate not only the soft label of the node, but also its
neighborhood relations. Thus level 2, called structure-level action,
decides which of the local structures generated by our node-level
actions to be propagated. One may argue that we could directly use
the loss, e.g., cross entropy, to decide the directions of node-level
knowledge distillation.
因此,我们通过强化学习的分层方式来解决这个问题。我们的层次强化学习算法由两个层次的动作组成:第1级,称为节点级动作,用于区分选择哪个GNN将知识提取到每个节点的另一个GNN中。在确定了每个节点的知识转移方向后,我们期望不仅要传播节点的软标签,还要传播节点的邻域关系。因此,第2级,称为结构级操作,决定要传播由节点级操作生成的哪些局部结构。有人可能会说,我们可以直接利用损失,例如,交叉熵,来决定节点级知识蒸馏的方向。
However, this heuristic strategy only considers the performance of the node itself, but neglects its influence
on other nodes, thus might lead to a sub-optimal solution. Our
experimental results also verify our reinforcement learning based
strategy significantly outperforms the above heuristic one.
然而,这种启发式策略只考虑了节点本身的性能,而忽略了它对其他节点的影响,因此可能会导致次优解。我们的实验结果也验证了我们的基于强化学习的策略显著优于上述启发式策略。
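To make the heuristic baseline concrete, the sketch below picks the distillation direction for each node by comparing the two models' cross entropy losses. It illustrates the loss-comparison strategy that the paper argues against, not FreeKD itself, and the loss values are made-up numbers (not the values from Figure 1), chosen so that the first model wins at node IDs 1, 4, 5, 7, and 9.

```python
def heuristic_directions(ce_a, ce_b):
    """Per node: 0 means model A teaches model B, 1 means model B teaches model A.

    The model with the lower cross entropy loss at a node acts as the teacher there.
    Each node is judged in isolation, so its influence on other nodes is ignored.
    """
    return [0 if loss_a <= loss_b else 1 for loss_a, loss_b in zip(ce_a, ce_b)]

# Illustrative per-node losses for node IDs 1..10 (hypothetical numbers)
ce_graphsage = [0.2, 0.9, 0.8, 0.1, 0.3, 0.7, 0.2, 0.9, 0.4, 0.6]
ce_gat = [0.5, 0.4, 0.3, 0.6, 0.8, 0.2, 0.7, 0.3, 0.9, 0.1]

directions = heuristic_directions(ce_graphsage, ce_gat)
graphsage_teaches = [i + 1 for i, d in enumerate(directions) if d == 0]
print(graphsage_teaches)  # [1, 4, 5, 7, 9]
```

Because the decision for one node never looks at any other node, this rule cannot account for the interactions that motivate treating direction selection as a sequential decision problem.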
The contributions of this paper can be summarized as:
• We propose a new knowledge distillation architecture for
GNNs, avoiding requiring a deeper well-optimized teacher
model for distilling knowledge. The proposed framework is
general and principled, which can be naturally compatible
with GNNs of different architectures.
• We devise a free-direction knowledge distillation strategy via
a hierarchical reinforcement learning algorithm, which can
dynamically manage the directions of knowledge transfer
from both node-level and structure-level aspects.
• Extensive experiments on five benchmark datasets demonstrate the proposed framework promotes the performance
of two shallower GNNs in a large margin, and is valid to
various GNNs. More surprisingly, the performance of our
FreeKD is comparable to or even better than traditional KD
algorithms distilling knowledge from a deeper and stronger
teacher GNN.
本文的贡献可以总结为:
•我们提出了一种新的gnn知识蒸馏体系结构,避免了需要一个更深入的良好优化的教师模型来提取知识。该框架具有通用性和原则,可以自然地与不同架构的gnn兼容。
•我们通过层次强化学习算法设计了一种自由方向的知识蒸馏策略,该算法可以从节点级和结构级方面动态地管理知识转移的方向。
•在5个基准数据集上进行的大量实验表明,该框架在较大范围内提高了两个较浅的神经网络的性能,并且对各种神经网络都有效。更令人惊讶的是,我们的FreeKD的性能与传统的KD算法相当,甚至更好,它从更深更强的教师GNN中提取知识。
This work is related to graph neural networks, graph-based knowledge distillation, and reinforcement learning.
这项工作与图神经网络、基于图的知识蒸馏和强化学习有关。
Graph neural networks have achieved promising results in processing graph data, whose goal is to learn node embeddings by
aggregating nodes’ neighbor information.
图神经网络在处理图数据方面取得了很好的结果,其目标是通过聚合节点的邻居信息来学习节点嵌入。
In recent years, lots of GNNs have been proposed [13, 18, 33]. For instance, GCN [18] designed a convolutional neural network architecture for graph data.
GraphSAGE [13] proposed an efficient sample strategy to aggregate
neighbor nodes.
近年来,许多gnn被提出使用[13,18,33]。例如,GCN [18]为图形数据设计了一个卷积神经网络架构。GraphSAGE [13]提出了一种有效的聚合策略。
GAT [33] applied a self-attention mechanism to
GNN to assign different weights to different neighbors. SGC [36]
simplified GCN by removing nonlinearities and weight matrices
between consecutive convolutional layers. ROD [43] proposed an
ensemble learning based GNN model to fuse the knowledge in multiple hops.
GAT [33]对GNN应用自注意机制,对不同的邻居分配不同的权重。SGC [36]通过去除连续卷积层之间的非线性和权值矩阵,简化了GCN。ROD [43]提出了一种基于集成学习的GNN模型,以在多跳中融合知识。
APPNP [19] analyzed the relationship between GCN
and PageRank [27], and proposed a propagation model combined
with a personalized PageRank. Cluster-GCN [10] built an efficient
model based on graph clustering [31].
APPNP [19]分析了GCN与PageRank [27]之间的关系,并提出了一个结合个性化PageRank的传播模型。聚类-GCN[10]建立了一个基于图聚类[31]的高效模型。
Being orthogonal to the
above approaches developing different powerful GNN models, we
concentrate on developing a new knowledge distillation framework
on the basis of various GNNs.
根据上述方法开发不同强大的GNN模型,我们专注于在不同的GNN基础上开发一个新的知识精馏框架。
Knowledge distillation (KD) has been widely studied in computer
vision [5, 23], natural language processing [1, 16], etc. Recently, a
few KD methods have been proposed for GNNs. LSP [41] transferred the
topological structure knowledge from a pre-trained deeper teacher
GNN to a shallower student GNN. CPF [40] designed a student
architecture that is a combination of a parameterized label propagation and MLP layers. GFKD [11] proposed a method to generate
fake graphs and distilled knowledge from a teacher GNN model
without any training data involved. The authors in [39] designed an
efficient GNN model by utilizing the information from peer nodes
to model the local structure explicitly and distilling the neighbor
structure information from a deeper GNN implicitly
知识蒸馏(KD)在计算机视觉[5,23]、自然语言处理[1,16]等领域得到了广泛的研究。最近,人们对gnn提出了一些KD方法。LSP [41]将拓扑结构知识从预先训练过的较深的教师GNN转移到较浅的学生GNN。CPF [40]设计了一个由参数化的标签传播和MLP层组成的学生体系结构。GFKD [11]提出了一种不涉及任何训练数据,从教师GNN模型中生成假图和提取知识的方法。[39]的作者设计了一个高效的GNN模型,利用对等节点的信息显式地建模局部结构,并从更深的GNN中提取邻居结构信息
The work in [9] studied a self-distillation framework, and proposed an adaptive discrepancy retaining regularizer to empower the transferability of knowledge. RDD [44] was a semi-supervised knowledge distillation method for GNNs. It online learnt a complicated teacher GNN model by ensemble learning, and distilled knowledge from the generated teacher model into the student model. Different from them, we focus on studying a new free-direction knowledge distillation architecture, with the purpose of dynamically exchanging knowledge between two shallower GNNs.
[9]的工作研究了一个自蒸馏框架,并提出了一个自适应的差异保持正则化器,以增强知识的可转移性。RDD [44]是一种神经的半监督知识蒸馏方法。通过集成学习,在线学习了一个复杂的教师GNN模型,并从生成的教师模型中提取知识到学生模型中。与它们不同的是,我们重点研究了一种新的自由方向知识蒸馏架构,目的是在两个较浅的gnn之间动态地交换知识。
Reinforcement learning aims at training agents to make optimal
decisions by learning from interactions with the environment [2].
Reinforcement learning mainly has two genres [2]: value-based
methods and policy-based methods. Value-based methods estimate
the expected reward of actions [25], while policy-based methods
take actions according to the output probabilities of the agent [35].
There is also a hybrid of these two genres, called the actor-critic
architecture [12]. The actor-critic architecture utilizes the value-based method as a value function to estimate the expected reward,
and employs the policy-based method as a policy search strategy
to take actions. Until now, reinforcement learning has been taken
as a popular tool to solve various tasks, such as recommendation
systems [45], anomaly detection [26], multi-label learning [8], etc.
In this paper, we explore reinforcement learning for graph data
based knowledge distillation.
强化学习的目的是通过学习与环境[2]的交互来训练代理做出最优决策。强化学习主要有两种类型的[2]:基于价值的方法和基于策略的方法。基于价值的方法估计行动[25]的预期奖励,而基于策略的方法则根据代理[35]的输出概率采取行动。还有这两种类型的混合,被称为演员-评论家架构[12]。行为评论家体系结构利用基于价值的方法作为价值函数来估计预期奖励,并采用基于策略的方法作为策略搜索策略来采取行动。到目前为止,强化学习一直被认为是解决各种任务的流行工具,如推荐系统[45]、异常检测[26]、多标签学习[8]等。在本文中,我们探索了基于图数据的知识蒸馏的强化学习。
In this section, we elaborate the details of our FreeKD framework
that is shown in Figure 2. Before introducing it, we first give some
notations and preliminaries.
在本节中,我们将详细介绍图2中所示的FreeKD框架的详细信息。在介绍它之前,我们首先给出一些符号和准备工作。
Being orthogonal to those works developing various GNN models, our goal is to explore a new knowledge distillation framework
for promoting the performance of GNNs, while addressing the issue
involved because of producing a deeper teacher GNN model in the
existing KD methods.
与那些开发各种GNN模型的工作正交,我们的目标是探索一个新的知识蒸馏框架,以提高GNN的性能,同时解决在现有KD方法中产生更深层次的教师GNN模型所涉及的问题。
As shown in Figure 1, we observe typical GNN models often have
different performances at different nodes during training. Based
on this observation, we intend to dynamically exchange useful
knowledge between two shallower GNNs, so as to benefit from
each other. However, a challenging problem is attendant upon that:
how to decide the directions of knowledge distillation for different
nodes during training.
如图1所示,我们观察到典型的GNN模型在训练过程中在不同的节点上往往有不同的性能。基于这一观察结果,我们打算在两个较浅的gnn之间动态地交换有用的知识,以便相互受益。然而,一个具有挑战性的问题是:如何在训练过程中确定不同节点的知识蒸馏方向。
To address this, we propose to manage the
directions of knowledge distillation via reinforcement learning,
where we regard the directions of knowledge transfer for different
nodes as a sequential decision making problem [28]. Consequently,
we propose a free-direction knowledge distillation framework via
a hierarchical reinforcement learning, as shown in Figure 2.
为了解决这个问题,我们提出通过强化学习来管理知识蒸馏的方向,其中我们将不同节点的知识转移方向视为一个顺序决策问题[28]。因此,我们提出了一个通过层次强化学习的自由方向的知识蒸馏框架,如图2所示。
In our framework, the hierarchical reinforcement learning can be taken as
a reinforced knowledge judge that consists of two levels of actions:
- Level 1, called node-level action, is used to decide the distillation direction of each node for propagating the soft label;
- Level 2, called structure-level action, is used to determine which of the local structures generated via node-level actions to be propagated.
在我们的框架中,层次强化学习可以作为一个强化知识判断器,它包括两个层次的动作:
1)1级,称为节点级动作,用来决定每个节点的蒸馏方向;
2)2级,称为结构级动作,用来确定通过节点级动作生成的局部结构要传播。
Specifically, the reinforced knowledge judge (we call it agent for
convenience) interacts with the environment constructed by two
GNN models in each iteration, as in Figure 2. It receives the soft
labels and cross entropy losses for a batch of nodes, and regards
them as its node-level states. The agent then samples sequential
node-level actions for nodes according to a learned policy network,
where each action decides the direction of knowledge distillation
for propagating node-level knowledge.
具体来说,强化知识判断器(为了方便起见,我们称之为agent)在每次迭代中都会与由两个GNN模型构建的环境进行交互,如图2所示。它接收一批节点的软标签和交叉熵损失,并将其视为其节点级状态。然后,agent根据学习到的策略网络对节点的顺序节点级操作进行采样,其中每个动作决定了传播节点级知识的知识蒸馏方向。
Then, the agent receives the structure-level states and produces structure-level actions to decide which of the local structures generated on the basis of node-level actions to be propagated. After that, the two GNN models are trained based on the agent’s actions with a new loss function. Finally, the agent calculates the reward for each action to train the policy network, where the agent’s target is to maximize the expected reward. This process is repeatedly iterated until convergence.
然后,代理接收结构级状态并产生结构级操作,以决定基于节点层动作生成的哪些局部结构。然后,利用一个新的损失函数,根据代理的动作对两个GNN模型进行训练。最后,代理计算每个动作的奖励,以训练策略网络,其中代理的目标是最大化预期的奖励。这个过程被反复迭代,直到收敛。
In this section, we introduce our reinforcement learning based
strategy to dynamically distill node-level knowledge between two
GNN models.
在本节中,我们将介绍基于强化学习的策略来动态提取两个GNN模型之间的节点级知识。
We concatenate the following features as the node-level state vector $\mathbf{s}_i^{[1]}$ for node $i$:
(1) Soft label vector of node $i$ in GNN Φ.
(2) Cross entropy loss of node $i$ in GNN Φ.
(3) Soft label vector of node $i$ in GNN Ψ.
(4) Cross entropy loss of node $i$ in GNN Ψ.
The first two kinds of features are based on the intuition that the
cross entropy loss and soft label can quantify the useful knowledge
for node $i$ in GNN Φ to some extent. The last two kinds of features
have the same function for Ψ. Since these features can measure the
knowledge each node contains to some extent, we use them as the
feature of the node-level state for predicting the node-level actions.
前两种特征都是基于交叉熵损失和软标签可以在一定程度上量化GNN Φ中节点的有用知识的直觉。后两种特性对Ψ具有相同的功能。由于这些特性可以在一定程度上衡量每个节点所包含的知识,因此我们将它们作为节点级状态的特征来预测节点级的动作。
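The four features can be concatenated into one flat vector. The sketch below assumes the soft label is the softmax output of each model and that the true label of the node is available; `node_level_state` and the example logits are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross entropy of a predicted distribution against an integer class label."""
    return -math.log(probs[label])

def node_level_state(logits_phi, logits_psi, label):
    """Concatenate [soft label in Φ, CE loss in Φ, soft label in Ψ, CE loss in Ψ]."""
    p_phi = softmax(logits_phi)
    p_psi = softmax(logits_psi)
    return p_phi + [cross_entropy(p_phi, label)] + p_psi + [cross_entropy(p_psi, label)]

# A node with 3 classes whose true label is class 0
state = node_level_state([2.0, 0.5, -1.0], [0.1, 1.5, 0.3], label=0)
print(len(state))  # 8 = 3 + 1 + 3 + 1
```

With C classes the state has 2C + 2 entries, which fixes the input width of the node-level policy network.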
The node-level action $a_i^{[1]} \in \{0, 1\}$ decides the direction of knowledge distillation for node $i$. $a_i^{[1]} = 0$ means transferring knowledge from GNN Φ to GNN Ψ at node $i$, while $a_i^{[1]} = 1$ means the distillation direction from Ψ to Φ. If $a_i^{[1]} = 0$, we define node $i$ in Φ as agent-selected node, otherwise, we define node $i$ in Ψ as agent-selected node. The actions are sampled from the probability distributions produced by a node-level policy function $\pi_\theta$, where $\theta$ is the trainable parameters in the policy network and $\pi_\theta(\mathbf{s}_i^{[1]}, a_i^{[1]})$ means the probability to take action $a_i^{[1]}$ over the state $\mathbf{s}_i^{[1]}$. In this paper, we adopt a three-layer MLP with the tanh activation function as our node-level policy network.
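A policy network of this kind can be sketched in plain Python. Everything beyond "three-layer MLP with tanh" is an assumption made for illustration: the hidden width of 16, the uniform weight initialization, the softmax over two output units, and the helper names.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def policy_probs(state, layers):
    """Three linear layers with tanh in between, then softmax over the two actions."""
    h = state
    for weights, bias in layers[:-1]:
        h = [math.tanh(z) for z in linear(h, weights, bias)]
    logits = linear(h, *layers[-1])
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Sample 0 (Φ teaches Ψ) or 1 (Ψ teaches Φ) from the policy distribution."""
    return 0 if rng.random() < probs[0] else 1

rng = random.Random(0)
state_dim, hidden = 8, 16
layers = [init_layer(state_dim, hidden, rng),
          init_layer(hidden, hidden, rng),
          init_layer(hidden, 2, rng)]
state = [0.79, 0.17, 0.04, 0.24, 0.16, 0.65, 0.19, 1.84]  # an example node-level state
probs = policy_probs(state, layers)
action = sample_action(probs, rng)
print(probs, action)
```

Sampling, rather than taking the most probable action, is what lets a policy-gradient method explore both directions during training.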
After determining the direction of knowledge distillation for each node, the two GNN models can exchange beneficial node-level knowledge. We take Figure3(a) as an example to illustrate our idea. In Figure 3(a), the agent-selected nodes {1, 4, 5} in GNN Φ will serve as the distilled nodes to transfer knowledge to the nodes {1, 4, 5} in GNN Ψ. In the meantime, the agent-selected nodes {2, 3, 6} in Ψ will be used as the distilled nodes to distill knowledge for the nodes {2, 3, 6} in Φ. In order to transfer node-level knowledge, we utilize the KL divergence to measure the distance between the soft labels of the same node in the two GNN models, and propose to minimize a new loss function for each GNN model:
在确定了每个节点的知识蒸馏方向后,两个GNN模型可以交换有益的节点级知识。我们以图3(a)为例来说明我们的想法。在图3(a)中,GNN Φ中代理选择的节点{1,4,5}将作为蒸馏后的节点,将知识转移到GNN Ψ中的节点{1,4,5}。同时,将Ψ中代理选择的节点{2,3,6}作为蒸馏节点,提取Φ中节点{2,3,6}的知识。为了传递节点级的知识,我们利用KL散度来度量两个GNN模型中同一节点的软标签之间的距离,并提出为每个GNN模型最小化一个新的损失函数:
where the value of $a_i^{[1]}$ is 0 or 1. When $a_i^{[1]} = 0$, we use the KL divergence to make the probability distribution $\mathbf{p}_i^{\Psi}$ match $\mathbf{p}_i^{\Phi}$ as much as possible, enabling the knowledge from Φ to be transferred to Ψ at node $i$, and vice versa for $a_i^{[1]} = 1$. Thus, by minimizing the two loss functions $L_{node}^{\Phi}$ and $L_{node}^{\Psi}$, we can reach the goal of dynamically exchanging useful node-level knowledge between two GNN models and thus obtaining gains from each other.
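The behavior described here, where each node contributes a KL term to exactly one of the two models depending on its action, can be sketched as follows. The argument order of the KL divergence (teacher first) and the plain sum over nodes are assumptions, and `node_level_losses` is a hypothetical name.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def node_level_losses(p_phi, p_psi, actions):
    """Return (loss for Φ, loss for Ψ) given per-node soft labels and node-level actions."""
    loss_phi, loss_psi = 0.0, 0.0
    for soft_phi, soft_psi, a in zip(p_phi, p_psi, actions):
        if a == 0:
            # Φ teaches Ψ at this node: pull Ψ's distribution toward Φ's
            loss_psi += kl_divergence(soft_phi, soft_psi)
        else:
            # Ψ teaches Φ at this node: pull Φ's distribution toward Ψ's
            loss_phi += kl_divergence(soft_psi, soft_phi)
    return loss_phi, loss_psi

p_phi = [[0.9, 0.1], [0.5, 0.5]]
p_psi = [[0.6, 0.4], [0.2, 0.8]]
loss_phi, loss_psi = node_level_losses(p_phi, p_psi, actions=[0, 1])
print(loss_phi, loss_psi)
```

In an actual implementation the teacher-side distribution would be treated as a constant so that gradients only flow into the model being taught.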
As we know, the structure information is important for graph learning. Thus, we attempt to dynamically transfer structure-level knowledge between two GNNs. It is worth noting that we don’t propagate
all neighborhood information of one node as structure-level knowledge. Instead, we propagate a neighborhood subset of the node,
which is comprised of agent-selected nodes. This is because we
think agent-selected nodes contain more useful knowledge. We take
Figure 3(b) as an example to illustrate it. Node 1 is an agent-selected node
in Φ. When transferring its local structure information to Ψ, we
only transfer the local structure composed of {1, 4, 5}. In other
words, the local structure of node 1 we consider to transfer is made
up of agent-selected nodes. We call it agent-selected neighborhood
set. Moreover, considering the knowledge of the local structure in
graphs is not always reliable [4, 44], we design a reinforcement
learning based strategy to distinguish which of the local structures
to be propagated. Next, we introduce it in detail.
正如我们所知,结构信息对于图形学习很重要。因此,我们试图在两个gnn之间动态地转移结构级的知识。值得注意的是,我们并没有将一个节点的所有邻域信息作为结构级知识来传播。相反,我们传播节点的一个邻域子集,该子集由智能体选择的节点组成。这是因为我们认为智能体选择的节点包含了更多有用的知识。我们以图3(b)为例来说明它。1是Φ中的一个由智能体选择的节点。当将其局部结构信息传递到Ψ时,我们只转移由{1,4,5}组成的局部结构。换句话说,我们所考虑传输的节点1的局部结构是由智能体选择的节点组成的。我们称之为代理选择的邻域集。此外,考虑到图中局部结构的知识并不总是可靠的[4,44],我们设计了一种基于强化学习的策略来区分哪些局部结构需要传播。接下来,我们将详细介绍它。
We adopt the following features as the structure-level state vector $\mathbf{s}_i^{[2]}$ for the local structure of node $i$:
(1) Node-level state of node $i$.
(2) Center similarity of node $i$'s agent-selected neighborhood set in the distilled network.
(3) Center similarity of the same node set as node $i$'s agent-selected neighborhood set in the guided network.
Since the node-level state contains much information for measuring the information of local structures, we use the node-level
state as the first feature of structure-level state. As [38] points out,
the center similarity can indicate the performance of GNNs, where
the center similarity measures the degree of similarity between the
node and its neighbors. In other words, if center similarity is high,
the structure information should be more reliable. Thus, we also
take center similarity as another feature. Motivated by [38], we
present a similar strategy to calculate the center similarity as:
由于节点级状态包含了大量用于度量局部结构信息的信息,因此我们使用节点级状态作为结构级状态的第一个特征。正如[38]所指出的,中心相似度可以表示gnn的性能,其中中心相似度度量节点与其相邻节点之间的相似度程度。换句话说,如果中心相似度较高,则结构信息应该更加可靠。因此,我们也将中心相似性作为另一个特征。受[38]的启发,我们提出了一个类似的策略来计算中心相似度,如下:
where $g$ can be an arbitrary similarity function. Here we use the cosine similarity function $g(\mathbf{x}, \mathbf{y}) = \cos(\mathbf{x}, \mathbf{y})$. $\mathbf{u}_i$ is a two-dimension vector. In order to better present what $\mathbf{u}_i$ stands for, we take nodes 1 and 3 in Figure 3(b) as an example. Node 1 is an agent-selected node in Φ, i.e., $a_1^{[1]} = 0$, and node 3 is an agent-selected node in Ψ, i.e., $a_3^{[1]} = 1$. For $\mathbf{u}_1$, its first element $u_1^{(1)}$ is the center similarity between node 1 and {4, 5} in the distilled network Φ, while its second element $u_1^{(2)}$ is the center similarity between node 1 and {4, 5} in the guided network Ψ. Similarly, for $\mathbf{u}_3$, $u_3^{(1)}$ measures the center similarity between node 3 and {2, 6} in Ψ, and $u_3^{(2)}$ is the center similarity between node 3 and {2, 6} in Φ. In a word, the first element in $\mathbf{u}_i$ measures the center similarity in the distilled network, and the second element measures the center similarity in the guided network.
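As a rough illustration of the two-element vector for node 1, the sketch below averages the cosine similarity between a node's representation and those of its agent-selected neighbors. Averaging, the choice of representation, and the toy embeddings are all assumptions; the names `center_similarity`, `emb_phi`, and `emb_psi` are hypothetical.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center_similarity(embeddings, center, neighbors):
    """Mean cosine similarity between the center node and its selected neighbors."""
    return sum(cosine(embeddings[center], embeddings[n]) for n in neighbors) / len(neighbors)

# Toy node representations in the distilled network Φ and the guided network Ψ
emb_phi = {1: [1.0, 0.0], 4: [0.9, 0.1], 5: [0.8, 0.3]}
emb_psi = {1: [1.0, 0.0], 4: [0.0, 1.0], 5: [0.5, 0.5]}

# First element: similarity in the distilled network; second: in the guided network
u_1 = [center_similarity(emb_phi, 1, [4, 5]), center_similarity(emb_psi, 1, [4, 5])]
print(u_1)
```

Here node 1 is far more similar to its neighbors in Φ than in Ψ, the kind of signal a structure-level policy could use when judging whether the local structure is reliable.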
Structure-level action $a_i^{[2]} \in \{0, 1\}$ is the second level action that determines which of the structure-level knowledge to be propagated. If $a_i^{[2]} = 1$, the agent decides to transfer the knowledge of the local structure encoded in the agent-selected neighborhood set of node $i$, otherwise it will not be transferred. Similar to the node-level policy network, the structure-level policy network that produces structure-level actions is also comprised of a three-layer MLP with the tanh activation function.
We first introduce how to distill structure-level knowledge from Φ to Ψ. The method for distilling from Ψ to Φ is the same. First, we define the similarity between two agent-selected nodes $i$ and $j$ by:
$\hat{\mathbf{s}}_i^{\Phi}$ represents the distribution of the similarities between node $i$ and its agent-selected neighborhoods in Φ, while $\hat{\mathbf{s}}_i^{\Psi}$ represents the distribution of the similarities between node $i$ and its corresponding neighborhoods in Ψ. If the local structure of node $i$ is decided to transfer, we adopt the KL divergence to make $\hat{\mathbf{s}}_i^{\Psi}$ match $\hat{\mathbf{s}}_i^{\Phi}$, so as to transfer structure-level knowledge. Similarly, we can propose another new loss function for distilling knowledge from Ψ to Φ as:
By jointly minimizing (9) and (10), we can dynamically exchange
structure-level knowledge between Φ and Ψ.
通过联合最小化(9)和(10),我们可以在Φ和Ψ之间动态地交换结构级的知识。
In this section, we introduce the optimization procedure of our
method. The detailed training procedure is in Appendix A.1
在本节中,我们将介绍我们的方法的优化过程。详细的训练过程见附录A.1
Following [34], our actions are sampled in batch,
and obtain the delayed reward after two GNNs being updated according to a batch of sequential actions. Similar to [21], we utilize
the performance of the models after being updated as the reward.
We use the negative value of the cross entropy loss to measure the
performance of the models as in [21, 42], defined as:
在[34]之后,我们的动作被批量采样,并在根据一批连续动作更新两个gnn后获得延迟奖励。与[21]类似,我们利用模型更新后的表现作为奖励。我们使用交叉熵损失的负值来衡量模型的性能,如在[21,42]中,定义为:
where $\gamma$ is a hyper-parameter. $R_i$ is the reward for the action taken at node $i$, and $\mathcal{B}$ is a batch set of nodes from the training set. The reward for an action consists of two parts: the first part is the average performance for a batch of nodes, measuring the global effects that the action brings on the GNN model; the second part is the average performance of the neighborhoods of node $i$, in order to model the local effects of the action.
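The two-part reward can be sketched as follows, taking "performance" to be the negative cross entropy loss as stated above. Summing the two models' losses per node, weighting the neighborhood part with a factor `gamma`, and the toy loss values are assumptions made for this illustration.

```python
def mean(values):
    return sum(values) / len(values)

def reward(ce_phi, ce_psi, batch, neighbors, gamma=0.5):
    """Global part (batch average) plus gamma times local part (neighborhood average)."""
    def performance(nodes):
        # Negative cross entropy: a lower loss means a higher performance
        return -mean([ce_phi[n] + ce_psi[n] for n in nodes])
    return performance(batch) + gamma * performance(neighbors)

# Cross entropy losses of the two updated models on three training nodes (toy numbers)
ce_phi = {1: 0.2, 2: 0.9, 3: 0.4}
ce_psi = {1: 0.5, 2: 0.3, 3: 0.6}

# Reward for the action taken at node 1, whose neighbors are nodes 2 and 3
r_1 = reward(ce_phi, ce_psi, batch=[1, 2, 3], neighbors=[2, 3])
print(r_1)
```

Because the losses are measured after both GNNs have been updated, the reward is delayed: it only becomes available once a whole batch of actions has been applied.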
Optimization for Policy Networks. Following previous studies about hierarchical reinforcement learning [22], the gradient of expected cumulative reward $\nabla_{\theta, \phi} J$ could be computed as follows:
where $\theta$ and $\phi$ are the learned parameters of the node-level policy network and structure-level policy network, respectively. Similar to [20], to speed up convergence and reduce variance, we also add a baseline reward that is the reward at node $i$ in the last epoch. The motivation behind this is to encourage the agent to achieve better performance than that of the last epoch. Finally, we update the parameters of policy networks by gradient ascent [35] as:
We minimize the following loss functions for optimizing Φ and Ψ, respectively:
我们分别最小化以下损失函数来优化Φ和Ψ:
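L(CE, Φ) and L(CE, Ψ) denote the cross entropy losses of models Φ and Ψ, respectively.
L(node, Φ) and L(node, Ψ) denote the node-level losses of models Φ and Ψ, respectively.
L(struct, Φ) and L(struct, Ψ) denote the structure-level losses of models Φ and Ψ, respectively.
μ and ρ are trade-off parameters.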
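Given the three loss terms and the two trade-off parameters listed above, one plausible form of the two objectives is a weighted sum. The pairing of μ with the node-level term and ρ with the structure-level term is an assumption based on the order in which the terms are listed, not something stated in these notes.

```latex
\begin{aligned}
L^{\Phi} &= L_{CE}^{\Phi} + \mu \, L_{node}^{\Phi} + \rho \, L_{struct}^{\Phi}, \\
L^{\Psi} &= L_{CE}^{\Psi} + \mu \, L_{node}^{\Psi} + \rho \, L_{struct}^{\Psi}.
\end{aligned}
```

Under this reading, setting μ = ρ = 0 reduces each model to ordinary supervised training with cross entropy alone.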
(Someone actually read this carefully and even left a comment, so I will keep going!)
To verify the effectiveness of our proposed FreeKD, we run experiments on five benchmark datasets from different domains and on GNN models with different architectures. More implementation details are given in Appendix A.3.
We use five widely used benchmark datasets to evaluate our method. Cora [32] and Citeseer [32] are two citation datasets where nodes represent documents and edges represent citation relationships. Chameleon [30] and Texas [29] are two web network datasets where nodes stand for web pages and edges show their hyperlink relationships. The PPI dataset [13] consists of 24 protein–protein interaction graphs, corresponding to different human tissues. In Appendix A.2, we give more information about these datasets. Following [6] and [15], we use 1000 nodes for training, 500 nodes for validation, and the rest for testing on the Cora and Citeseer datasets. For the Chameleon and Texas datasets, we randomly split the nodes of each class into 60%, 20%, and 20% for training, validation, and testing respectively, following [29] and [7]. For the PPI dataset, we use 20 graphs for training, 2 graphs for validation, and 2 graphs for testing, as in [7]. Following previous works [29, 33], we study the transductive setting on the first four datasets and the inductive setting on the PPI dataset. In the transductive setting, we predict the labels of nodes that were observed during training, whereas in the inductive setting, we predict the labels of nodes in graphs never seen before.
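The per-class 60% / 20% / 20% split used for Chameleon and Texas can be sketched as follows. This is an illustration only: `split_per_class`, the fixed `seed` and the toy label list are assumptions, and the exact rounding used by the authors is not stated in the text.

```python
import random


def split_per_class(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split node indices into train/val/test within each class."""
    rng = random.Random(seed)
    by_class = {}
    for node, label in enumerate(labels):
        by_class.setdefault(label, []).append(node)

    train, val, test = [], [], []
    for nodes in by_class.values():
        rng.shuffle(nodes)
        n_train = int(fractions[0] * len(nodes))
        n_val = int(fractions[1] * len(nodes))
        train += nodes[:n_train]
        val += nodes[n_train:n_train + n_val]
        # Whatever remains after rounding goes to the test set.
        test += nodes[n_train + n_val:]
    return train, val, test


labels = [0] * 10 + [1] * 5  # toy labels: 10 nodes of class 0, 5 of class 1
train, val, test = split_per_class(labels)
```

Splitting inside each class keeps the class proportions roughly equal across the three subsets, which matters for small datasets such as Texas.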
In the experiments, we adopt three popular GNN models, GCN [18], GAT [33], and GraphSAGE [13], as the basic models in our method. Our framework aims to improve the performance of these GNN models, so the three models also serve as our baselines. Since we propose a free-direction knowledge distillation framework, we also compare with five typical knowledge distillation approaches proposed recently, including KD [14], LSP [41], CPF [40], RDD [44], and GNN-SD [9], to further verify the effectiveness of our method. Following [6, 13], we use the Micro-F1 score as the evaluation measure throughout the experiments.
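For readers unfamiliar with the metric, a minimal Micro-F1 computation is sketched below. It assumes labels are given as 0/1 indicator rows (the multi-label form, as on PPI); the function name `micro_f1` and the toy inputs are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all nodes and all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
score = micro_f1(y_true, y_pred)  # tp=2, fp=1, fn=1
```

For single-label multi-class tasks such as Cora, where each node has exactly one true and one predicted class, Micro-F1 reduces to plain accuracy.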
In this subsection, we evaluate our method using three popular GNN models, GCN [18], GAT [33], and GraphSAGE [13]. We arbitrarily select two networks from these three models as our basic models Φ and Ψ and apply FreeKD, enabling them to learn from each other. Note that we do not run GCN on the PPI dataset because of the inductive setting.
Table 1 and Table 2 report the experimental results. As both tables show, our FreeKD consistently improves the performance of the basic GNN models by a large margin on all the datasets. For instance, our method achieves more than 4.5% improvement through mutual learning between two GCN models on the Chameleon dataset, compared with a single GCN model. In summary, for the transductive learning tasks, our method improves performance by 1.01% ∼ 1.97% on the Cora and Citeseer datasets and by 1.08% ∼ 4.61% on the Chameleon and Texas datasets, compared with the corresponding GNN models. For the inductive learning task, our method improves performance by 1.31% ∼ 3.11% on the PPI dataset. In addition, we observe that two GNN models, whether they share the same architecture or use different architectures, can both benefit from each other with our method, which shows its efficacy across various GNN models.
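Table 1: Results (%) of the compared methods on node classification in the transductive setting on the Cora, Chameleon, Citeseer, and Texas datasets. The values in brackets denote the performance improvement of our FreeKD over the corresponding baseline. Here, GraphSAGE is abbreviated as GSAGE.
Table 2: Results (%) of the compared methods on node classification in the inductive setting on the PPI dataset.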
Since our method is related to knowledge distillation, we also compare with existing knowledge distillation methods to further verify its effectiveness. In this experiment, we first compare with three traditional knowledge distillation methods, KD [14], LSP [41], and CPF [40], which distill knowledge from a deeper and stronger teacher GCNII model [7] into a shallower student GAT model. The structure details of GCNII and GAT can be found in Appendix A.3. In addition, we compare with an ensemble learning method, RDD [44], where a complex teacher network is generated by ensemble learning for distilling knowledge. Finally, we take GNN-SD [9] as another baseline, which distills knowledge from shallow layers into deep layers within one GNN. For our FreeKD, we take two GATs sharing the same structure as the basic models.
Table 3 lists the experimental results. Surprisingly, our FreeKD performs comparably to or even better than the traditional knowledge distillation methods (KD, LSP, CPF) on all the datasets. This demonstrates the effectiveness of our method, since those methods distill knowledge from the stronger teacher GCNII, while we only mutually distill knowledge between two shallower GATs. In addition, our FreeKD consistently outperforms GNN-SD and RDD, which further illustrates the effectiveness of our proposed FreeKD.
We perform an ablation study to verify the effectiveness of the components in our method. We use GCN as the basic models Φ and Ψ and conduct the experiments on two datasets from different domains, Chameleon and Cora. Setting ρ = 0 means that we only transfer the node-level knowledge; we denote this variant FreeKD-node for short. To evaluate our reinforcement learning based node judge module, we design the following variants:
• FreeKD-w.o.-judge: our FreeKD without using the agent. Φ and Ψ distill knowledge for each node from each other.
• FreeKD-loss: our FreeKD without using the reinforced knowledge judge. It determines the directions of knowledge distillation relying only on the cross entropy loss.
• FreeKD-all-neighbors: our FreeKD selecting the directions of node-level knowledge distillation via node-level actions, but using all neighborhood nodes as the local structure.
• FreeKD-all-structures: our FreeKD selecting the directions of node-level knowledge distillation, but without using structure-level actions for structure-level knowledge distillation.
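The FreeKD-loss variant replaces the learned agent with a simple rule based on the cross entropy loss. One plausible reading of that rule, sketched below, is that at each node the model with the lower loss acts as the teacher. The function name `directions_by_loss`, the string labels, and the tie-breaking choice are assumptions, not details given in the text.

```python
def directions_by_loss(ce_phi, ce_psi):
    """Pick a teacher per node from per-node cross entropy losses.

    Returns a list with "phi" where Phi should teach Psi at that node and
    "psi" where Psi should teach Phi. Ties go to Phi (arbitrary choice).
    """
    return ["phi" if loss_phi <= loss_psi else "psi"
            for loss_phi, loss_psi in zip(ce_phi, ce_psi)]


ce_phi = [0.2, 1.5, 0.7]
ce_psi = [0.9, 0.3, 0.7]
teachers = directions_by_loss(ce_phi, ce_psi)
```

The rule looks at each node in isolation, which is exactly the limitation the ablation targets: it ignores how a node's distillation direction affects its neighbors.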
Table 4 shows the results. FreeKD-node is better than GCN, showing that mutually transferring node-level knowledge via reinforcement learning is useful for boosting the performance of GNNs. FreeKD obtains better results than FreeKD-node, which illustrates that distilling structure knowledge with our method is beneficial to GNNs. FreeKD achieves better performance than FreeKD-w.o.-judge, illustrating that dynamically determining the knowledge distillation direction is important. In addition, FreeKD outperforms FreeKD-loss. This shows that directly using the cross entropy loss to decide the directions of knowledge distillation is sub-optimal. As stated before, this heuristic strategy only considers the performance of the node itself and neglects the influence of the node on other nodes. Additionally, FreeKD is superior to FreeKD-all-neighbors, demonstrating that transferring the part of the neighborhood information selected by our method is more effective than transferring all neighborhood information. Finally, FreeKD obtains better performance than FreeKD-all-structures, which indicates that our reinforcement learning based method can transfer more reliable structure-level knowledge. In summary, these results demonstrate that our proposed knowledge distillation framework with a hierarchical reinforcement learning strategy is effective.
Table 4: Ablation study on the Chameleon and Cora datasets.
We further show intuitively how effective the reinforced knowledge judge is at dynamically deciding the directions of knowledge distillation. We set GCN as Φ and GraphSAGE as Ψ and train our FreeKD on the Cora dataset. Then, we poison Φ by adding random Gaussian noise with a standard deviation $\sigma$ to its model parameters. Finally, we visualize the agent's output, i.e., the node-level policy probabilities $\pi_\theta(s_i^{[1]}, 0)$ and $\pi_\theta(s_i^{[1]}, 1)$ at node $i$ for Φ and Ψ, respectively. For clearer visualization, we show a subgraph composed of the first 30 nodes and their neighborhoods.
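The poisoning step can be illustrated with a short sketch that adds zero-mean Gaussian noise to every parameter. It assumes parameters are stored as a dictionary of flat lists of floats; `poison_parameters`, the `seed` argument and the layer name are hypothetical, and a real GNN would hold tensors instead.

```python
import random


def poison_parameters(params, sigma, seed=0):
    """Return a noisy copy of params; the input dictionary is not modified.

    Each weight w becomes w + eps with eps drawn from N(0, sigma^2).
    """
    rng = random.Random(seed)
    return {
        name: [w + rng.gauss(0.0, sigma) for w in weights]
        for name, weights in params.items()
    }


params = {"layer1": [0.5, -0.2, 0.1]}
clean = poison_parameters(params, sigma=0.0)  # no noise: values unchanged
noisy = poison_parameters(params, sigma=0.5)  # larger sigma, larger damage
```

Increasing `sigma` degrades the outputs of Φ more strongly, which is the knob varied across the panels of the visualization.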
Figure 4 shows the results using different standard deviations $\sigma$. In Figure 4 (a), (b), and (c), the higher the probability output by the agent, the redder the node. A redder node means a higher probability that the node in this network serves as a distilled node that transfers knowledge to the corresponding node of the other network. As shown in Figure 4 (a), when no noise is added, the degrees of red in Φ and Ψ are comparable. As the noise in Φ is gradually increased, the red becomes lighter and lighter in Φ, while the opposite happens in Ψ, as shown in Figure 4 (b) and (c). This is because the noise has a negative influence on the outputs of the network, leading to inaccurate soft labels and large losses. In such a case, our agent outputs low probabilities for the network Φ. Thus, our agent can effectively determine the direction of knowledge distillation for each node.
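Figure 4: Output probabilities of the node-level policy in our reinforced knowledge judge module when different degrees of noise are added to the network Φ. The redder a node is in one network, the higher the probability that this node serves as a teacher node that transfers knowledge to the corresponding node of the other network.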