DGL Tutorial (1): Classification with the Cora Dataset

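This tutorial shows how to build a GNN for a semi-supervised node classification task. The task uses Cora, a small citation network in which papers are nodes and citations are edges.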

The goal is to predict the category of each paper. Each paper carries a word-frequency vector as its feature.

First, install dgl:

pip install dgl -i https://pypi.douban.com/simple/

Loading the Cora dataset

import dgl.data

dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)

This automatically downloads the Cora dataset (in this run, to the directory C:\Users\vincent\.dgl\cora_v2\) and prints the following:

Downloading C:\Users\vincent\.dgl\cora_v2.zip from https://data.dgl.ai/dataset/cora_v2.zip...
Extracting file to C:\Users\vincent\.dgl\cora_v2
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.
Number of categories: 7

A DGL dataset may contain several graphs, but the Cora dataset contains only one:

g = dataset[0]

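A DGL graph stores node features and edge features in two dictionary-like attributes, ndata and edata. In the DGL Cora dataset, the graph contains the following node features: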

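  • train_mask: a boolean tensor indicating whether a node belongs to the training set
  • val_mask: a boolean tensor indicating whether a node belongs to the validation set
  • test_mask: a boolean tensor indicating whether a node belongs to the test set
  • label: the ground-truth category of the node
  • feat: the feature vector of the node
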
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)

Output:

Node features
{'feat': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), 'label': tensor([3, 4, 4,  ..., 3, 3, 3]), 'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 'train_mask': tensor([ True,  True,  True,  ..., False, False, False]), 'val_mask': tensor([False, False, False,  ..., False, False, False])}
Edge features
{}
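
The three masks listed above are used as boolean indices into the other node tensors. The following is a minimal sketch of that indexing on toy data, using NumPy arrays as a stand-in for the tensors in g.ndata; the six labels and the split are invented for illustration and are not the real Cora values.

```python
import numpy as np

# Toy stand-ins for g.ndata['label'] and the three masks (not real Cora data).
labels = np.array([3, 4, 4, 0, 3, 2])
train_mask = np.array([True, True, False, False, False, False])
val_mask = np.array([False, False, True, True, False, False])
test_mask = np.array([False, False, False, False, True, True])

# Boolean indexing keeps only the entries where the mask is True.
train_labels = labels[train_mask]
print(train_labels)            # labels of the training nodes only
print(int(train_mask.sum()))   # number of training nodes

# A node should never be in two subsets at once.
overlap = (train_mask & val_mask) | (train_mask & test_mask) | (val_mask & test_mask)
print(bool(overlap.any()))
```

In the real dataset the three masks select 140, 500, and 1000 nodes, so 1068 of the 2708 nodes belong to none of the three sets.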

Defining a GNN

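We will build a two-layer graph convolutional network (GCN). Each layer computes a new representation for every node by aggregating information from its neighbors.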

To build a multi-layer GCN like this, simply stack dgl.nn.GraphConv modules, which inherit from torch.nn.Module:

import torch
import torch.nn as nn
import dgl.data
from dgl.nn.pytorch import GraphConv
import torch.nn.functional as F

dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
g = dataset[0]
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)


class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h


# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
print(model)

DGL implements many popular neighbor aggregation modules, and each of them can be used with a single line of code.
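
To make "aggregating neighbor information" concrete, here is a rough NumPy sketch of one graph convolution step on a toy three-node graph. It assumes the symmetric normalization from the GCN paper, which is believed to match the default behavior of GraphConv, and it leaves out the bias and activation. The graph, features, weights, and the function name gcn_layer are all hypothetical.

```python
import numpy as np

# Toy undirected path graph 0 - 1 - 2 as a dense adjacency matrix.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

# One 2-dimensional feature vector per node (hypothetical values).
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])

# Hypothetical weight matrix; the identity keeps the arithmetic easy to follow.
W = np.eye(2)


def gcn_layer(A, H, W):
    """Symmetrically normalized neighbor sum followed by a linear map."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)  # assumes every node has degree > 0
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return A_hat @ H @ W


out = gcn_layer(A, H, W)
print(out)
```

Node 1 has two neighbors, so its output row is the sum of the features of nodes 0 and 2, each scaled by 1/sqrt(2). Stacking two such layers, as the GCN class does, lets each node see information from nodes up to two hops away.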

Training the GCN

Training this GCN with DGL works much like training any other PyTorch neural network:

import torch
import torch.nn as nn
import dgl.data
from dgl.nn.pytorch import GraphConv
import torch.nn.functional as F

dataset = dgl.data.CoraGraphDataset()
print('Number of categories:', dataset.num_classes)
g = dataset[0]
print('Node features')
print(g.ndata)
print('Edge features')
print(g.edata)


class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h


def train(g, model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(100):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))


# Create the model with given dimensions
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes)
print(model)
train(g, model)

Output:

In epoch 0, loss: 1.946, val acc: 0.134 (best 0.134), test acc: 0.138 (best 0.138)
In epoch 5, loss: 1.892, val acc: 0.506 (best 0.522), test acc: 0.499 (best 0.539)
In epoch 10, loss: 1.806, val acc: 0.600 (best 0.612), test acc: 0.633 (best 0.636)
In epoch 15, loss: 1.698, val acc: 0.594 (best 0.612), test acc: 0.626 (best 0.636)
In epoch 20, loss: 1.567, val acc: 0.632 (best 0.632), test acc: 0.653 (best 0.653)
In epoch 25, loss: 1.417, val acc: 0.712 (best 0.712), test acc: 0.700 (best 0.700)
In epoch 30, loss: 1.251, val acc: 0.738 (best 0.738), test acc: 0.737 (best 0.737)
In epoch 35, loss: 1.079, val acc: 0.746 (best 0.746), test acc: 0.751 (best 0.751)
In epoch 40, loss: 0.909, val acc: 0.746 (best 0.748), test acc: 0.758 (best 0.756)
In epoch 45, loss: 0.751, val acc: 0.738 (best 0.748), test acc: 0.766 (best 0.756)
In epoch 50, loss: 0.612, val acc: 0.744 (best 0.748), test acc: 0.767 (best 0.756)
In epoch 55, loss: 0.494, val acc: 0.752 (best 0.752), test acc: 0.773 (best 0.773)
In epoch 60, loss: 0.399, val acc: 0.762 (best 0.762), test acc: 0.776 (best 0.776)
In epoch 65, loss: 0.322, val acc: 0.762 (best 0.766), test acc: 0.776 (best 0.776)
In epoch 70, loss: 0.262, val acc: 0.764 (best 0.768), test acc: 0.778 (best 0.775)
In epoch 75, loss: 0.215, val acc: 0.766 (best 0.768), test acc: 0.778 (best 0.775)
In epoch 80, loss: 0.178, val acc: 0.766 (best 0.768), test acc: 0.779 (best 0.775)
In epoch 85, loss: 0.149, val acc: 0.766 (best 0.768), test acc: 0.780 (best 0.775)
In epoch 90, loss: 0.126, val acc: 0.768 (best 0.768), test acc: 0.779 (best 0.775)
In epoch 95, loss: 0.107, val acc: 0.768 (best 0.768), test acc: 0.776 (best 0.775)
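
In the log above, the "best" test accuracy is sometimes lower than the current one (for example at epoch 40: 0.758 against a best of 0.756). The reason is that train() records the test accuracy at the epoch with the highest validation accuracy, not the highest test accuracy seen so far. The sketch below isolates that selection rule in plain Python; the function name select_by_validation and the accuracy histories are hypothetical.

```python
def select_by_validation(val_accs, test_accs):
    """Return (best validation accuracy, test accuracy at that same epoch)."""
    best_val, best_test = 0.0, 0.0
    for val_acc, test_acc in zip(val_accs, test_accs):
        # Strict '<' keeps the earliest epoch when validation accuracy ties.
        if best_val < val_acc:
            best_val, best_test = val_acc, test_acc
    return best_val, best_test


# Hypothetical per-epoch accuracies, not taken from the run above.
val_history = [0.60, 0.75, 0.74, 0.75]
test_history = [0.62, 0.73, 0.78, 0.77]

best_val, reported_test = select_by_validation(val_history, test_history)
print(best_val, reported_test)  # test accuracy reported for the selected epoch
print(max(test_history))        # the highest test accuracy, which is not reported
```

Selecting the epoch by validation accuracy keeps the test set out of model selection, so the reported test number stays an honest estimate.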

Training on a GPU

Training on a GPU requires moving both the model and the data (here, the graph) onto the GPU with the to() method:

g = g.to('cuda')
model = GCN(g.ndata['feat'].shape[1], 16, dataset.num_classes).to('cuda')
train(g, model)
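
The snippet above fails on a machine without CUDA. A common pattern is to choose the device at run time and fall back to the CPU. The sketch below uses only PyTorch, with a small linear layer and a random tensor as hypothetical stand-ins for the GCN model and the node features.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no CUDA device is visible.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Stand-ins for the GCN model and node features (hypothetical sizes).
layer = nn.Linear(8, 2).to(device)
features = torch.randn(4, 8).to(device)

# Model and inputs must live on the same device for the forward pass to work.
out = layer(features)
print(out.device, out.shape)
```

With the objects from this tutorial, the same pattern would be g = g.to(device) and model = GCN(...).to(device) before calling train(g, model).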
