本教程将演示如何构建一个基于半监督的节点分类任务的GNN网络,任务基于一个小数据集Cora,这是一个将论文作为节点,引用关系作为边的网络结构。
任务就是预测一个论文的所属分类。每一个论文包含一个词频信息作为属性特征。
pip install dgl -i https://pypi.douban.com/simple/
import dgl.data
dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
这样会自动下载Cora数据集到Extracting file to C:\Users\vincent\.dgl\cora_v2\
目录下,输出结果如下:
Downloading C:\Users\vincent\.dgl\cora_v2.zip from https://data.dgl.ai/dataset/cora_v2.zip...
Extracting file to C:\Users\vincent\.dgl\cora_v2
Finished data loading and preprocessing.
NumNodes: 2708
NumEdges: 10556
NumFeats: 1433
NumClasses: 7
NumTrainingSamples: 140
NumValidationSamples: 500
NumTestSamples: 1000
Done saving data into cached files.
Number of categories: 7
一个DGL数据集可能包含多个Graph,但是Cora数据集仅包含一个Graph:
g = dataset[0]
一个DGL图可以通过字典的形式存储节点的属性ndata
和边的属性edata
。在DGL Cora数据集中,graph包含下面几个节点特征:
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)
输出结果:
Node features
{'feat': tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]), 'label': tensor([3, 4, 4, ..., 3, 3, 3]), 'test_mask': tensor([False, False, False, ..., True, True, True]), 'train_mask': tensor([ True, True, True, ..., False, False, False]), 'val_mask': tensor([False, False, False, ..., False, False, False])}
Edge features
{}
我们将构建一个两层的GCN网络,每一层通过聚合邻居信息来计算一个节点表示。
为了构建这样一个多层的GCN,我们可以简单的堆叠dgl.nn.GraphConv
模块,这个模块继承了torch.nn.Module
。
import torch
import torch.nn as nn
import dgl.data
from dgl.nn.pytorch import GraphConv
import torch.nn.functional as F
dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
g = dataset[0]
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)
class GCN(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(GCN, self).__init__()
self.conv1 = GraphConv(in_feats, h_feats)
self.conv2 = GraphConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h
# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
print(model)
DGL实现了很多当下流行的聚合邻居的模块,我们可以只用一行代码就可以使用。
使用DGL训练GCN与训练其他Pytorch神经网络过程类似:
import torch
import torch.nn as nn
import dgl.data
from dgl.nn.pytorch import GraphConv
import torch.nn.functional as F
dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
g = dataset[0]
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)
class GCN(nn.Module):
def __init__(self, in_feats, h_feats, num_classes):
super(GCN, self).__init__()
self.conv1 = GraphConv(in_feats, h_feats)
self.conv2 = GraphConv(h_feats, num_classes)
def forward(self, g, in_feat):
h = self.conv1(g, in_feat)
h = F.relu(h)
h = self.conv2(g, h)
return h
def train(g, model):
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
best_val_acc = 0
best_test_acc = 0
features = g.ndata['feat']
labels = g.ndata['label']
train_mask = g.ndata['train_mask']
val_mask = g.ndata['val_mask']
test_mask = g.ndata['test_mask']
for e in range(100):
# Forward
logits = model(g, features)
# Compute prediction
pred = logits.argmax(1)
# Compute loss
# Note that you should only compute the losses of the nodes in the training set.
loss = F.cross_entropy(logits[train_mask], labels[train_mask])
# Compute accuracy on training/validation/test
train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
test_acc = (pred[test_mask] == labels[test_mask]).float().mean()
# Save the best validation accuracy and the corresponding test accuracy.
if best_val_acc < val_acc:
best_val_acc = val_acc
best_test_acc = test_acc
# Backward
optimizer.zero_grad()
loss.backward()
optimizer.step()
if e % 5 == 0:
print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
e, loss, val_acc, best_val_acc, test_acc, best_test_acc))
# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
print(model)
train(g, model)
输出结果:
In epoch 0, loss: 1.946, val acc: 0.134 (best 0.134), test acc: 0.138 (best 0.138)
In epoch 5, loss: 1.892, val acc: 0.506 (best 0.522), test acc: 0.499 (best 0.539)
In epoch 10, loss: 1.806, val acc: 0.600 (best 0.612), test acc: 0.633 (best 0.636)
In epoch 15, loss: 1.698, val acc: 0.594 (best 0.612), test acc: 0.626 (best 0.636)
In epoch 20, loss: 1.567, val acc: 0.632 (best 0.632), test acc: 0.653 (best 0.653)
In epoch 25, loss: 1.417, val acc: 0.712 (best 0.712), test acc: 0.700 (best 0.700)
In epoch 30, loss: 1.251, val acc: 0.738 (best 0.738), test acc: 0.737 (best 0.737)
In epoch 35, loss: 1.079, val acc: 0.746 (best 0.746), test acc: 0.751 (best 0.751)
In epoch 40, loss: 0.909, val acc: 0.746 (best 0.748), test acc: 0.758 (best 0.756)
In epoch 45, loss: 0.751, val acc: 0.738 (best 0.748), test acc: 0.766 (best 0.756)
In epoch 50, loss: 0.612, val acc: 0.744 (best 0.748), test acc: 0.767 (best 0.756)
In epoch 55, loss: 0.494, val acc: 0.752 (best 0.752), test acc: 0.773 (best 0.773)
In epoch 60, loss: 0.399, val acc: 0.762 (best 0.762), test acc: 0.776 (best 0.776)
In epoch 65, loss: 0.322, val acc: 0.762 (best 0.766), test acc: 0.776 (best 0.776)
In epoch 70, loss: 0.262, val acc: 0.764 (best 0.768), test acc: 0.778 (best 0.775)
In epoch 75, loss: 0.215, val acc: 0.766 (best 0.768), test acc: 0.778 (best 0.775)
In epoch 80, loss: 0.178, val acc: 0.766 (best 0.768), test acc: 0.779 (best 0.775)
In epoch 85, loss: 0.149, val acc: 0.766 (best 0.768), test acc: 0.780 (best 0.775)
In epoch 90, loss: 0.126, val acc: 0.768 (best 0.768), test acc: 0.779 (best 0.775)
In epoch 95, loss: 0.107, val acc: 0.768 (best 0.768), test acc: 0.776 (best 0.775)
在GPU上训练需要将模型和数据通过to()
方法放到GPU上:
g = g.to('cuda')
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to('cuda')
train(g, model)