The mathematical interpretation of the model in these personal notes is largely inspired by and draws on this blog post.
$T = S \cup Q$: support set and query set; the support set $S$ in each episode serves as the labeled training set.
$x_i$ and $y_i \in \{C_1, \dots, C_N\} = C_T \subset C$: the $i$-th input sample and its label, where $C$ is the set of all classes of either the training or the test dataset, and $C_{train} \cap C_{test} = \emptyset$.
$G = (\mathcal{V}, \mathcal{E}; T)$: the graph constructed with samples from the task $T$; $\mathcal{V}$: node set; $\mathcal{E}$: edge set.
$y_{ij}$: ground-truth edge label, defined by the ground-truth node labels.
$\mathbf{e}_{ij} = \{e_{ijd}\}_{d=1}^{2} \in [0,1]^2$: edge feature, representing the (normalized) strengths of the intra- and inter-class relations of the two connected nodes.
$\tilde{e}_{ijd}$: normalized edge feature.
$f^l_v$: node feature transformation network.
$f^l_e$: metric network.
$\hat{y}_{ij}$: probability that the two nodes $V_i$ and $V_j$ are from the same class.
$\delta(y_j = C_k)$: Kronecker delta function.
So GNNs naturally have great potential for solving the few-shot learning problem. Previous GNN approaches to few-shot learning have mainly been based on the node-labeling framework, which implicitly models the intra-cluster similarity and inter-cluster dissimilarity.
If the few-shot learning task is viewed as training a classifier, the task $T$ can be split into a support set (input data with their labels) and a query set (unlabeled data).
The authors adopt a meta-learning based few-shot learning approach (which also covers a semi-supervised setting).
Here, meta-learning computes class representations and then uses a metric function to measure the similarity between a query sample and each class representation. As an efficient way of meta-learning, the authors adopt episodic training.
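As a concrete toy illustration of episodic sampling, the sketch below builds one N-way K-shot episode from a labeled pool. The names (`sample_episode`, `pool`) are my own for illustration, not from the paper or its code.

```python
import random

def sample_episode(pool, n_way=5, k_shot=1, q_query=3):
    """Sample one N-way K-shot episode: a labeled support set and a query set.
    `pool` maps each class label to a list of samples (toy structure, my own)."""
    classes = random.sample(list(pool.keys()), n_way)
    support, query = [], []
    for episode_label, c in enumerate(classes):
        samples = random.sample(pool[c], k_shot + q_query)
        # the first k_shot samples form the labeled support set ...
        support += [(x, episode_label) for x in samples[:k_shot]]
        # ... the remaining samples are queries to classify within this episode
        query += [(x, episode_label) for x in samples[k_shot:]]
    return support, query

# toy pool with 20 classes and 30 dummy samples per class
pool = {c: [f'img_{c}_{i}' for i in range(30)] for c in range(20)}
support, query = sample_episode(pool, n_way=5, k_shot=1, q_query=3)
print(len(support), len(query))  # 5, 15
```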
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}; \mathcal{T})$ be the graph constructed from the samples of one episode $\mathcal{T}$, with $|\mathcal{T}| = NK + T$; $\mathcal{V} = \{V_i\}_{i=1,\dots,|\mathcal{T}|}$ is the node set and $\mathcal{E} = \{E_{ij}\}_{i,j=1,\dots,|\mathcal{T}|}$ is the edge set. The ground-truth edge labels are defined by the ground-truth node labels:
$y_i$ and $y_j$ are the ground-truth node labels and $y_{ij}$ is the ground-truth edge label; if the two nodes have the same label, the edge label is 1.
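Written as a formula (this is the definition as I read it from the paper):

$$
y_{ij} =
\begin{cases}
1, & \text{if } y_i = y_j, \\
0, & \text{otherwise.}
\end{cases}
$$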
Each edge is then given a two-dimensional feature $\mathbf{e}_{ij} = \{e_{ijd}\}_{d=1}^2 \in [0,1]^2$, where each dimension lies in $[0,1]$ and the two dimensions sum to 1; they represent the intra-class and inter-class relations (i.e., how similar and how dissimilar the classes of the two samples are). That is, $d=1$ is the intra-class (similarity) component and $d=2$ is the inter-class (dissimilarity) component.
In this paper, the edge feature $\mathbf{e}_{ij}$ should be asymmetric, i.e., $\mathbf{e}_{ij}$ and $\mathbf{e}_{ji}$ can differ, so the edges are effectively directed. For example: when weighing the inter-class relation, $e_{ij1}$ is compared with $e_{ij2}$; when weighing the intra-class relation, $e_{ij1}$ is compared with $e_{ji1}$.
Next, the node features and the edge features are initialized:
Node: node features are initialized by the output of the convolutional embedding network.
Edge: edge features are initialized by the edge labels.
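Spelled out (my paraphrase of the paper's initialization; the exact notation may differ slightly): the node features start from the embedding network output, $v_i^0 = f_{emb}(x_i; \theta_{emb})$, and the edge features start from the edge labels,

$$
\mathbf{e}_{ij}^{0} =
\begin{cases}
[1 \,\|\, 0], & \text{if } y_{ij} = 1 \text{ and both } i, j \text{ are labeled (support) samples}, \\
[0 \,\|\, 1], & \text{if } y_{ij} = 0 \text{ and both } i, j \text{ are labeled (support) samples}, \\
[0.5 \,\|\, 0.5], & \text{otherwise.}
\end{cases}
$$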
The graph network has $L$ layers, and the update rules are as follows:
```python
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
# `tt` is the config/utility object used in the EGNN repo; `tt.arg.device` holds the target device.


class NodeUpdateNetwork(nn.Module):
    def __init__(self,
                 in_features,
                 num_features,
                 ratio=[2, 1],
                 dropout=0.0):
        super(NodeUpdateNetwork, self).__init__()
        # set size
        self.in_features = in_features
        self.num_features_list = [num_features * r for r in ratio]
        self.dropout = dropout

        # layers
        layer_list = OrderedDict()
        for l in range(len(self.num_features_list)):
            layer_list['conv{}'.format(l)] = nn.Conv2d(
                in_channels=self.num_features_list[l - 1] if l > 0 else self.in_features * 3,
                out_channels=self.num_features_list[l],
                kernel_size=1,
                bias=False)
            layer_list['norm{}'.format(l)] = nn.BatchNorm2d(num_features=self.num_features_list[l])
            layer_list['relu{}'.format(l)] = nn.LeakyReLU()

            if self.dropout > 0 and l == (len(self.num_features_list) - 1):
                layer_list['drop{}'.format(l)] = nn.Dropout2d(p=self.dropout)

        self.network = nn.Sequential(layer_list)

    def forward(self, node_feat, edge_feat):
        # get size
        num_tasks = node_feat.size(0)
        num_data = node_feat.size(1)

        # get eye matrix (batch_size x 2 x node_size x node_size)
        diag_mask = 1.0 - torch.eye(num_data).unsqueeze(0).unsqueeze(0).repeat(num_tasks, 2, 1, 1).to(tt.arg.device)

        # set diagonal as zero and normalize
        edge_feat = F.normalize(edge_feat * diag_mask, p=1, dim=-1)

        # compute attention and aggregate
        # torch.matmul and torch.bmm both support batched matrix multiplication;
        # a.squeeze(N) removes dimension N of `a` when its size is 1
        aggr_feat = torch.bmm(torch.cat(torch.split(edge_feat, 1, 1), 2).squeeze(1), node_feat)

        node_feat = torch.cat([node_feat, torch.cat(aggr_feat.split(num_data, 1), -1)], -1).transpose(1, 2)

        # non-linear transform
        node_feat = self.network(node_feat.unsqueeze(-1)).transpose(1, 2).squeeze(-1)
        return node_feat
```
The inputs to layer $l$ (the in_features) are $\{v^{l-1}_i\}$ and $\{e^{l-1}_{ij}\}$.
The aggregation part:

```python
# compute attention and aggregate
# torch.matmul and torch.bmm both support batched matrix multiplication;
# a.squeeze(N) removes dimension N of `a` when its size is 1
aggr_feat = torch.bmm(torch.cat(torch.split(edge_feat, 1, 1), 2).squeeze(1), node_feat)
node_feat = torch.cat([node_feat, torch.cat(aggr_feat.split(num_data, 1), -1)], -1).transpose(1, 2)
```
It can be seen that the attention weighting lies in the torch.bmm matrix multiplication: the edge features edge_feat are used to weight node_feat. So the "attention" mentioned in the paper essentially means weighting the node features with the edge feature information.
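To make the tensor shapes concrete, here is a small standalone walk-through of that aggregation, with dummy shapes of my own choosing (not from the repo):

```python
import torch

B, N, F_dim = 2, 10, 4                   # tasks, nodes per task, node feature dim
edge_feat = torch.rand(B, 2, N, N)       # two channels: intra- / inter-class strengths
node_feat = torch.rand(B, N, F_dim)

# stack the two edge channels along the row dimension: (B, 2N, N)
stacked = torch.cat(torch.split(edge_feat, 1, 1), 2).squeeze(1)
print(stacked.shape)                     # torch.Size([2, 20, 10])

# edge-weighted aggregation of neighbour features: (B, 2N, F_dim)
aggr_feat = torch.bmm(stacked, node_feat)
print(aggr_feat.shape)                   # torch.Size([2, 20, 4])

# split back into the two channels and concatenate with the original node features
out = torch.cat([node_feat, torch.cat(aggr_feat.split(N, 1), -1)], -1)
print(out.shape)                         # torch.Size([2, 10, 12]) -> in_features * 3
```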
$f^l_v$ is the node feature transformation network of layer $l$. In the code, the original node feature is concatenated back onto the aggregated features. Also, as mentioned in Cade's blog, the conv2d in the code could be replaced with conv1d: the 1x1 convolution is really just a linear layer, written this way simply because it is convenient for the subsequent BatchNorm and Dropout to operate on every feature dimension.
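Putting the code back into formula form (my own reading of the code; the extra $v_i^{l-1}$ term comes from the concatenation above, which is why the first conv has `in_features * 3` input channels):

$$
v_i^{l} = f_v^{l}\Big(\Big[\, v_i^{l-1} \,\Big\|\, \sum_j \tilde e_{ij1}^{\,l-1} v_j^{l-1} \,\Big\|\, \sum_j \tilde e_{ij2}^{\,l-1} v_j^{l-1} \Big]\Big),
\qquad
\tilde e_{ijd}^{\,l-1} = \frac{e_{ijd}^{\,l-1}}{\sum_k e_{ikd}^{\,l-1}}
$$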
```python
class EdgeUpdateNetwork(nn.Module):
    def __init__(self,
                 in_features,
                 num_features,
                 ratio=[2, 2, 1, 1],
                 separate_dissimilarity=False,
                 dropout=0.0):
        super(EdgeUpdateNetwork, self).__init__()
        # set size
        self.in_features = in_features
        self.num_features_list = [num_features * r for r in ratio]
        self.separate_dissimilarity = separate_dissimilarity
        self.dropout = dropout

        # layers
        layer_list = OrderedDict()
        for l in range(len(self.num_features_list)):
            # set layer
            layer_list['conv{}'.format(l)] = nn.Conv2d(in_channels=self.num_features_list[l - 1] if l > 0 else self.in_features,
                                                       out_channels=self.num_features_list[l],
                                                       kernel_size=1,
                                                       bias=False)
            layer_list['norm{}'.format(l)] = nn.BatchNorm2d(num_features=self.num_features_list[l])
            layer_list['relu{}'.format(l)] = nn.LeakyReLU()

            if self.dropout > 0:
                layer_list['drop{}'.format(l)] = nn.Dropout2d(p=self.dropout)

        layer_list['conv_out'] = nn.Conv2d(in_channels=self.num_features_list[-1],
                                           out_channels=1,
                                           kernel_size=1)
        self.sim_network = nn.Sequential(layer_list)

        if self.separate_dissimilarity:
            # layers
            layer_list = OrderedDict()
            for l in range(len(self.num_features_list)):
                # set layer
                layer_list['conv{}'.format(l)] = nn.Conv2d(in_channels=self.num_features_list[l - 1] if l > 0 else self.in_features,
                                                           out_channels=self.num_features_list[l],
                                                           kernel_size=1,
                                                           bias=False)
                layer_list['norm{}'.format(l)] = nn.BatchNorm2d(num_features=self.num_features_list[l])
                layer_list['relu{}'.format(l)] = nn.LeakyReLU()

                if self.dropout > 0:
                    layer_list['drop{}'.format(l)] = nn.Dropout(p=self.dropout)

            layer_list['conv_out'] = nn.Conv2d(in_channels=self.num_features_list[-1],
                                               out_channels=1,
                                               kernel_size=1)
            self.dsim_network = nn.Sequential(layer_list)

    def forward(self, node_feat, edge_feat):
        # compute abs(x_i - x_j)
        x_i = node_feat.unsqueeze(2)
        x_j = torch.transpose(x_i, 1, 2)
        x_ij = torch.abs(x_i - x_j)
        # torch.transpose swaps the two given dimensions; move the feature dim to the channel position
        x_ij = torch.transpose(x_ij, 1, 3)

        # compute similarity/dissimilarity (batch_size x feat_size x num_samples x num_samples)
        sim_val = F.sigmoid(self.sim_network(x_ij))

        if self.separate_dissimilarity:
            dsim_val = F.sigmoid(self.dsim_network(x_ij))
        else:
            dsim_val = 1.0 - sim_val

        diag_mask = 1.0 - torch.eye(node_feat.size(1)).unsqueeze(0).unsqueeze(0).repeat(node_feat.size(0), 2, 1, 1).to(tt.arg.device)
        edge_feat = edge_feat * diag_mask
        merge_sum = torch.sum(edge_feat, -1, True)  # row sum of edge_feat per channel

        # set diagonal as zero and normalize
        edge_feat = F.normalize(torch.cat([sim_val, dsim_val], 1) * edge_feat, p=1, dim=-1) * merge_sum

        # torch.eye returns an n x n 2-D tensor (n = node_feat.size(1)) with ones on the diagonal and zeros elsewhere
        force_edge_feat = torch.cat((torch.eye(node_feat.size(1)).unsqueeze(0), torch.zeros(node_feat.size(1), node_feat.size(1)).unsqueeze(0)), 0).unsqueeze(0).repeat(node_feat.size(0), 1, 1, 1).to(tt.arg.device)
        edge_feat = edge_feat + force_edge_feat  # refill the zeroed diagonal
        edge_feat = edge_feat + 1e-6
        # renormalize so the two channels of each pair sum to 1
        edge_feat = edge_feat / torch.sum(edge_feat, dim=1).unsqueeze(1).repeat(1, 2, 1, 1)

        return edge_feat
```
Here $f^l_e$ is the metric network that computes the similarity (or dissimilarity). From the code, the normalization $\overline{e^{\,l}_{ij}} = \mathrm{F.normalize}\big((\mathrm{sim\_val} \,\|\, \mathrm{dsim\_val}) \times e^{\,l-1}_{ij}\big)$ is exactly where the paper's $f^l_e$ appears, with $f^l_e = (\mathrm{sim\_val} \,\|\, \mathrm{dsim\_val})$. In other words, $f^l_e$ is a single function that jointly represents similarity and dissimilarity. (Surprisingly neat that one function encodes both at the same time.)
Equation (4) can then be written as

$$
\bar e^{\,l}_{ij1} = \frac{f^l_e(v^l_i, v^l_j)\, e^{\,l-1}_{ij1}}{\sum_k f^l_e(v^l_i, v^l_k)\, e^{\,l-1}_{ik1}} \sum_k e^{\,l-1}_{ik1},
$$

so that $\sum_k \bar e^{\,l}_{ik1} = \sum_k e^{\,l-1}_{ik1}$, i.e. the normalization preserves each row sum.
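A quick toy check of this row-sum claim (my own snippet, not repo code): the `F.normalize(...) * merge_sum` step keeps each channel's row sums unchanged.

```python
import torch
import torch.nn.functional as F

B, N = 1, 5
edge_feat = torch.rand(B, 2, N, N)       # previous-layer edge features
sim_val = torch.rand(B, 1, N, N)         # stand-in for the metric network output
dsim_val = 1.0 - sim_val

diag_mask = 1.0 - torch.eye(N).view(1, 1, N, N).repeat(B, 2, 1, 1)
edge_feat = edge_feat * diag_mask
merge_sum = torch.sum(edge_feat, -1, True)

new_edge = F.normalize(torch.cat([sim_val, dsim_val], 1) * edge_feat, p=1, dim=-1) * merge_sum
print(torch.allclose(new_edge.sum(-1), edge_feat.sum(-1)))  # True: row sums are preserved
```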
Notes on [2019][CVPR] Edge-Labeling Graph Neural Network for Few-shot Learning