Node Classification on the Cora Dataset with GCN and GAT (pytorch-geometric)

The Cora dataset

Introduction

Download: https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz

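The Cora dataset consists of 2,708 machine learning papers. Each paper is a node, and a citation between two papers is an edge between the corresponding nodes. Every paper cites or is cited by at least one other paper, so the graph has no isolated nodes.

Each paper belongs to exactly one of the following seven classes: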

  • Case_Based
  • Genetic_Algorithms
  • Neural_Networks
  • Probabilistic_Methods
  • Reinforcement_Learning
  • Rule_Learning
  • Theory

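Dataset files

cora.cites  -- citations between papers (edges)
cora.content  -- paper contents (node features + labels)
	Node features are binary bag-of-words vectors over a filtered vocabulary of 1,433 words: if a word occurs in the paper, the corresponding position is set to 1, otherwise 0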
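To make the layout of cora.content concrete, here is a minimal sketch that parses a single line of it. The sample line is made up and shortened to five feature columns (the real file has 1,433), and `parse_content_line` is a hypothetical helper written for illustration, not part of the original code.

```python
def parse_content_line(line):
    """Split one whitespace-separated cora.content line into (paper_id, features, label)."""
    fields = line.split()
    paper_id = int(fields[0])                   # first column: original paper id
    features = [int(v) for v in fields[1:-1]]   # middle columns: 0/1 word indicators
    label = fields[-1]                          # last column: class name
    return paper_id, features, label


# Made-up example line with only 5 feature columns, for illustration
sample = "31336\t0\t1\t0\t0\t1\tNeural_Networks"
paper_id, features, label = parse_content_line(sample)
print(paper_id, features, label)
```

The loading code in the next section applies the same column split to every line of the file.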

Loading the dataset

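import torch

path = "data/cora/"
cites = path + "cora.cites"
content = path + "cora.content"

# Index dictionary: maps each original paper id to a consecutive index starting at 0
index_dict = dict()
# Label dictionary: maps each string label to an integer class id
label_to_index = dict()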

features = []
labels = []
edge_index = []

with open(content,"r") as f:
    for node in f:
        node_info = node.split()
        # Column 0 is the paper id, the last column is the label, the rest are features
        index_dict[int(node_info[0])] = len(index_dict)
        features.append([int(i) for i in node_info[1:-1]])

        label_str = node_info[-1]
        if label_str not in label_to_index:
            label_to_index[label_str] = len(label_to_index)
        labels.append(label_to_index[label_str])

with open(cites,"r") as f:
    for edge in f:
        start, end = edge.split()
        # The citations are directed, but training treats the graph as undirected,
        # so every edge is added in both directions
        edge_index.append([index_dict[int(start)],index_dict[int(end)]])
        edge_index.append([index_dict[int(end)],index_dict[int(start)]])

# Self-loops could be added for every node here, but the GCN layer used later
# adds them by default, so this step is skipped
# for i in range(2708):
#     edge_index.append([i,i])
  
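# Convert to tensors
labels = torch.LongTensor(labels)
features = torch.FloatTensor(features)
# Optional row normalization
# features = torch.nn.functional.normalize(features, p=1, dim=1)
# Shape [num_edges, 2]; transposed to [2, num_edges] when the Data object is built
edge_index =  torch.LongTensor(edge_index)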
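Adding every citation in both directions can yield duplicate pairs whenever two papers cite each other or a citation is listed twice. Whether this actually occurs in cora.cites is not checked here. The following is only a sketch, using a hypothetical helper `symmetrize` and a toy edge list, of how such duplicates could be dropped with a set:

```python
def symmetrize(directed_edges):
    """Return a sorted list of unique (src, dst) pairs with both directions of every edge."""
    undirected = set()
    for src, dst in directed_edges:
        undirected.add((src, dst))
        undirected.add((dst, src))
    return sorted(undirected)


# Toy directed edges: nodes 0 and 1 cite each other, 1 -> 2 is one-way
toy_edges = [(0, 1), (1, 0), (1, 2)]
sym_edges = symmetrize(toy_edges)
print(sym_edges)  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

The resulting list of pairs could be converted to a tensor in the same way as `edge_index` above.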

Two-layer GCN architecture

import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNNet(torch.nn.Module):
    def __init__(self, num_feature, num_label):
        super(GCNNet,self).__init__()
        # Input features -> 16 hidden units -> one output per class
        self.GCN1 = GCNConv(num_feature, 16)
        self.GCN2 = GCNConv(16, num_label)
        self.dropout = torch.nn.Dropout(p=0.5)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.GCN1(x, edge_index)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.GCN2(x, edge_index)

        # Log-probabilities per class, to be used with F.nll_loss
        return F.log_softmax(x, dim=1)
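For intuition about what a GCN layer does besides applying its weight matrix: in the formulation of the GCN paper, node features are propagated with the symmetrically normalized adjacency matrix D^{-1/2} (A + I) D^{-1/2}, where the added identity corresponds to the self-loops that GCNConv adds by default. The sketch below computes that matrix in plain Python for a made-up 3-node path graph. `normalized_adjacency` is a hypothetical helper written for illustration and is not the library's implementation.

```python
import math


def normalized_adjacency(num_nodes, undirected_edges):
    """Build D^{-1/2} (A + I) D^{-1/2} as a dense list-of-lists matrix."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in undirected_edges:
        a[i][j] = 1.0
        a[j][i] = 1.0
    for i in range(num_nodes):
        a[i][i] = 1.0  # self-loop
    deg = [sum(row) for row in a]  # degree including the self-loop
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_nodes)]
            for i in range(num_nodes)]


# Path graph 0 - 1 - 2
a_hat = normalized_adjacency(3, [(0, 1), (1, 2)])
for row in a_hat:
    print([round(v, 3) for v in row])
```

Multiplying this matrix with the feature matrix averages each node's features with those of its neighbours, with weights that depend on the degrees of both endpoints.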

Two-layer GAT architecture

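import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GATNet(torch.nn.Module):
    def __init__(self, num_feature, num_label):
        super(GATNet,self).__init__()
        # 8 attention heads with 8 features each; concat = True yields 8*8 = 64 features.
        # dropout = 0.6 is applied to the attention coefficients during training
        self.GAT1 = GATConv(num_feature, 8, heads = 8, concat = True, dropout = 0.6)
        # Single output head (heads defaults to 1) producing one score per class
        self.GAT2 = GATConv(8*8, num_label, dropout = 0.6)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.GAT1(x, edge_index)
        x = F.relu(x)
        x = self.GAT2(x, edge_index)

        # Log-probabilities per class, to be used with F.nll_loss
        return F.log_softmax(x, dim=1)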
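In a GAT layer each node weighs its neighbours with attention coefficients: raw scores are passed through a LeakyReLU and then normalized with a softmax over the neighbourhood, as described in the GAT paper. The sketch below shows only that normalization step, for a single node and a single head. The raw scores are made-up numbers, `attention_coefficients` is a hypothetical helper, and the slope of 0.2 is an assumption taken from the value used in the GAT paper.

```python
import math


def leaky_relu(v, negative_slope=0.2):
    """LeakyReLU: identity for positive inputs, scaled by negative_slope otherwise."""
    return v if v > 0 else negative_slope * v


def attention_coefficients(raw_scores, negative_slope=0.2):
    """Softmax over the LeakyReLU-activated scores of one node's neighbours."""
    activated = [leaky_relu(s, negative_slope) for s in raw_scores]
    shift = max(activated)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in activated]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw scores for three neighbours of one node
raw = [1.0, -2.0, 0.5]
alpha = attention_coefficients(raw)
print([round(a, 3) for a in alpha])
```

With heads = 8 and concat = True, eight independent sets of such coefficients are computed and their 8-dimensional outputs are concatenated, which is why the second layer above takes 8*8 input features.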

Fixing the random seed

import numpy as np

seed = 1234
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
np.random.seed(seed)  # NumPy
# random.seed(seed)  # Python random module
# Make cuDNN deterministic (may be slower)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

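Splitting into training/validation/test sets and creating the Data object

Usage of the Data object is also shown in the source repository, which is linked at the end of this post.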

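from torch_geometric.data import Data

# Despite their names, these are index tensors taken from one random permutation,
# not boolean masks: 140 training, 500 validation and 1000 test nodes
mask = torch.randperm(len(index_dict))
train_mask = mask[:140]
val_mask = mask[140:640]
test_mask = mask[1708:2708]

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Data expects edge_index with shape [2, num_edges], hence the transpose
cora = Data(x = features, edge_index = edge_index.t().contiguous(), y = labels).to(device)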
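The slices above take 140 training, 500 validation and 1,000 test nodes from a single random permutation, so the three sets cannot overlap, and positions 640 to 1707 of the permutation are left unused. The following sketch reproduces the same slicing with Python's `random` module instead of `torch.randperm` (an assumption made only to keep the example dependency-free) and checks those properties.

```python
import random

num_nodes = 2708
rng = random.Random(1234)  # seed mirrors the one used in this post; any value works
perm = list(range(num_nodes))
rng.shuffle(perm)

train_idx = perm[:140]
val_idx = perm[140:640]
test_idx = perm[1708:2708]

all_idx = train_idx + val_idx + test_idx
unused = num_nodes - len(all_idx)
print(len(train_idx), len(val_idx), len(test_idx), unused)  # 140 500 1000 1068
print(len(set(all_idx)) == len(all_idx))                    # True
```

Because the three slices are disjoint, accuracy measured on the test indices is not contaminated by nodes whose labels were used for training.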

Training the network

model = GATNet(features.shape[1], len(label_to_index)).to(device)
# model = GCNNet(features.shape[1], len(label_to_index)).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    optimizer.zero_grad()
    out = model(cora)
    # Only the training nodes contribute to the loss
    loss = F.nll_loss(out[train_mask], cora.y[train_mask])
    print('epoch: %d loss: %.4f' %(epoch, loss.item()))
    loss.backward()
    optimizer.step()

    # Every 10 epochs, evaluate on the test nodes with dropout disabled
    if (epoch + 1) % 10 == 0:
        model.eval()
        with torch.no_grad():
            pred = model(cora).argmax(dim=1)
        correct = int(pred[test_mask].eq(cora.y[test_mask]).sum().item())
        acc = correct / len(test_mask)
        print('Accuracy: {:.4f}'.format(acc))
        model.train()

Output:

epoch: 0 loss: 1.9512
epoch: 1 loss: 1.7456
epoch: 2 loss: 1.5565
epoch: 3 loss: 1.3312
epoch: 4 loss: 1.1655
epoch: 5 loss: 0.9590
epoch: 6 loss: 0.8127
epoch: 7 loss: 0.7368
epoch: 8 loss: 0.6223
epoch: 9 loss: 0.6382
Accuracy: 0.8180
...
epoch: 190 loss: 0.4079
epoch: 191 loss: 0.2836
epoch: 192 loss: 0.3000
epoch: 193 loss: 0.2390
epoch: 194 loss: 0.2207
epoch: 195 loss: 0.2316
epoch: 196 loss: 0.2994
epoch: 197 loss: 0.2480
epoch: 198 loss: 0.2349
epoch: 199 loss: 0.2657
Accuracy: 0.8290
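The accuracy printed above is the fraction of test nodes whose predicted class equals the label. The validation indices are defined but not used in the training loop; one common use for them is to keep the checkpoint with the best validation accuracy. Below is a minimal sketch of both ideas in plain Python. `accuracy` and `best_epoch` are hypothetical helpers, and all numbers are made up for illustration (they are not results from this post).

```python
def accuracy(pred, target, idx):
    """Fraction of positions in idx where the prediction matches the target."""
    correct = sum(1 for i in idx if pred[i] == target[i])
    return correct / len(idx)


def best_epoch(val_acc_by_epoch):
    """Return the epoch with the highest validation accuracy (earliest on ties)."""
    return max(sorted(val_acc_by_epoch), key=lambda e: val_acc_by_epoch[e])


# Made-up predictions and labels for five nodes
pred = [0, 1, 2, 2, 1]
target = [0, 1, 1, 2, 0]
print(accuracy(pred, target, idx=[0, 1, 2, 3]))  # 0.75

# Made-up validation accuracies recorded every 10 epochs
val_history = {10: 0.70, 20: 0.78, 30: 0.78}
print(best_epoch(val_history))  # 20
```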

Visualizing the feature space with t-SNE

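from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Recompute the outputs in eval mode so that dropout does not perturb them
model.eval()
with torch.no_grad():
    out = model(cora)

# Project the per-class log-probabilities of the test nodes to 2D
ts = TSNE(n_components=2)
x = ts.fit_transform(out[test_mask].cpu().numpy())
y = cora.y[test_mask].cpu().numpy()

colors = ['mediumblue','green','red','yellow','cyan','mediumvioletred','mediumspringgreen']
plt.figure(figsize=(8, 6))
# One scatter call per class so that each class gets its own color
for i in range(7):
    plt.scatter(x[y == i, 0], x[y == i, 1], s=30, color=colors[i], marker='+', alpha=1)
plt.show()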

[Figure 1: t-SNE scatter plot of the test-node outputs, one color per class]

References

GCN paper
GAT paper
pytorch-geometric official documentation

Source code

https://gitee.com/swy9834/gnnlab
The author uses this repository to store GNN study code based on pytorch-geometric; it is updated occasionally.
