2021-4-21 Getting started with Paddle + PGL: reproducing the DeepWalk example

 

Contents

 

Installing Paddle + PGL

DeepWalk walkthrough

Code walkthrough


Installing Paddle + PGL

Paddle:

  Install command for Ubuntu 18.04 + CUDA 10.1: python -m pip install paddlepaddle-gpu==2.0.2.post101 -f https://paddlepaddle.org.cn/whl/mkl/stable.html

Official install guide: https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/windows-pip.html

PGL install command: pip install pgl
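
To verify the install (optional), Paddle ships a built-in sanity check; a minimal sketch, assuming Paddle 2.0+ and PGL 2.x are installed:

    import paddle
    import pgl

    # Checks that PaddlePaddle is installed correctly and reports whether
    # it can see the GPU; raises if something is broken.
    paddle.utils.run_check()
    print(paddle.__version__, pgl.__version__)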

  • DeepWalk walkthrough

This follows the official PGL tutorial: https://github.com/PaddlePaddle/PGL

After downloading the repository, copy the example folder out as a standalone project (the pgl directory inside the PGL repo conflicts with the installed pgl Python package, causing the error: No module named 'pgl.graph_kernel').

The data path in the original code is broken: go into the deepwalk folder and change the data file path on line 97 of train.py to "./tmp/edges.npy".

With that fix, train.py can be run directly; the model is saved in the current directory as "./model.pdparams".
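
To reuse the saved parameters later, a minimal loading sketch (the constructor arguments here are assumptions: they must match whatever was used at training time, e.g. the toy graph's 10 nodes below; the real embed_size/neg_num come from config.yaml):

    import paddle
    from model import SkipGramModel

    # Rebuild the model with the same arguments used during training,
    # then restore the parameters saved by train.py.
    model = SkipGramModel(num_nodes=10, embed_size=16, neg_num=5)
    model.set_state_dict(paddle.load("model.pdparams"))
    model.eval()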

  • Code walkthrough

  • The following analyzes the code provided officially.
  • Dataset: instead of the dataset used by the original code, we build a small toy dataset to understand the workflow; its graph structure is shown below (see the inspection sketch after the directory listing).
  • deepwalk directory structure

  • -deepwalk

    • -tmp: data files.

    • -config.yaml: project configuration

    • -dataset.py: data-processing code

    • -model.py: skip-gram model code

    • -train.py: program entry point

    • -README.md: the official instructions for running the project

  • (Figure 1)
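
Before walking through train.py, here is a quick sketch of inspecting the toy graph (the edge list is the one defined in build_graph below; indegree/outdegree/successor are the same PGL Graph methods used later in this post):

    import numpy as np
    import pgl

    edge_list = [(2, 0), (2, 1), (3, 1), (4, 0), (5, 0),
                 (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
                 (7, 2), (7, 3), (8, 0), (9, 7)]
    g = pgl.graph.Graph(num_nodes=10, edges=edge_list)

    print(g.num_nodes)                 # 10
    print(g.indegree())                # in-degree of every node
    print(g.outdegree())               # out-degree of every node
    print(g.successor(np.array([7])))  # out-neighbors of node 7: 0, 1, 2, 3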

  • train.py: initializes the graph data, generates random-walk sequences, and trains the model.

    import time
    import argparse
    
    import pgl
    import paddle
    import paddle.nn as nn
    from pgl.utils.logger import log
    import numpy as np
    import yaml
    from easydict import EasyDict as edict
    import tqdm
    from paddle.optimizer import Adam
    from pgl.utils.data import Dataloader
    from paddle.io import get_worker_info
    
    from model import SkipGramModel
    from dataset import ShardedDataset
    from dataset import BatchRandWalk
    from pgl import graph
    
    
    # Define a simple toy graph
    def build_graph():
        # Number of nodes in the graph; each node is identified by an integer id
        num_nodes = 10
    
        # Edge list of the graph
        edge_list = [(2, 0), (2, 1), (3, 1), (4, 0), (5, 0),
                     (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
                     (7, 2), (7, 3), (8, 0), (9, 7)]
    
        # Randomly initialize node features of dimension d (not used below)
        d = 16
        feature = np.random.randn(num_nodes, d).astype("float32")
    
        # Randomly assign a weight to each edge (not used below)
        edge_feature = np.random.randn(len(edge_list), 1).astype("float32")
    
        # Create the graph object (it takes up to four inputs; only two are used here)
        g = pgl.graph.Graph(num_nodes=num_nodes, edges=edge_list)
        return g
    
    def load(name):
        if name == 'cora':
            dataset = pgl.dataset.CoraDataset()
        elif name == "pubmed":
            dataset = pgl.dataset.CitationDataset("pubmed", symmetry_edges=True)
        elif name == "citeseer":
            dataset = pgl.dataset.CitationDataset("citeseer", symmetry_edges=True)
        elif name == "BlogCatalog":
            dataset = pgl.dataset.BlogCatalogDataset()
        else:
            raise ValueError(name + " dataset doesn't exist")
        indegree = dataset.graph.indegree()
        outdegree = dataset.graph.outdegree()
        return dataset.graph.to_mmap()
    
    
    def load_from_file(path):
        edges = []
        with open(path) as inf:
            for line in inf:
                u, t = line.strip("\n").split("\t")
                u, t = int(u), int(t)
                edges.append((u, t))
        edges = np.array(edges)
        graph = pgl.Graph(edges)
        return graph
    
    
    def train(model, data_loader, optim, log_per_step=10):
        model.train()
        total_loss = 0.
        total_sample = 0
    
        for batch, (src, dsts) in enumerate(data_loader):
            num_samples = len(src)
            src = paddle.to_tensor(src)
            dsts = paddle.to_tensor(dsts)
            loss = model(src, dsts)
            loss.backward()
            optim.step()
            optim.clear_grad()
    
            total_loss += loss.numpy()[0] * num_samples
            total_sample += num_samples
    
            if batch % log_per_step == 0:
                log.info("Batch %s %s-Loss %.6f" %
                         (batch, "train", loss.numpy()[0]))
    
        return total_loss / total_sample
    
    
    def main(args):
        # Decide whether to run on GPU or CPU
        if not args.use_cuda:
            paddle.set_device("cpu")
        if paddle.distributed.get_world_size() > 1:
            paddle.distributed.init_parallel_env()
    
        # Load data with the load function
        '''
        This would read a dataset into graph, but it is not used afterwards...
        '''
        # if args.edge_file:
        #     graph = load_from_file(args.edge_file)
        # else:
        #     graph = load(args.dataset)

        # Load the test data edges.npy and build the graph from the edge
        # relations (kept commented out here in favor of the simple graph below).
        # edges = np.load("./tmp/edges.npy")
        # edges = np.concatenate([edges, edges[:, [1, 0]]])
        # graph = pgl.Graph(edges)
    
    
        '''
        Use the simple 10-node graph defined above.
        '''
        g = build_graph()
    
        model = SkipGramModel(
            g.num_nodes,
            args.embed_size,
            args.neg_num,
            sparse=not args.use_cuda)
    
        # Run the dynamic-graph model in data-parallel mode.
        model = paddle.DataParallel(model)
        # Build the training data with our own ShardedDataset
    
        train_ds = ShardedDataset(g.nodes, repeat=args.epoch)
    
        # Number of training steps (dataset length // batch size)
        train_steps = int(len(train_ds) // args.batch_size)
        log.info("train_steps: %s" % train_steps)
        # paddle.optimizer.lr.PolynomialDecay decays the learning rate
        # polynomially from the initial value down to end_lr.
        '''
        learning_rate (float) - initial learning rate, a Python float.

        decay_steps (int) - number of steps over which to decay; this determines the decay period.

        end_lr (float, optional) - final (minimum) learning rate. Defaults to 0.0001.

        power (float, optional) - power of the polynomial. Defaults to 1.0.
        '''
        scheduler = paddle.optimizer.lr.PolynomialDecay(
            learning_rate=args.learning_rate,
            decay_steps=train_steps,
            end_lr=0.0001)
    
        # Optimizer
        optim = Adam(learning_rate=scheduler, parameters=model.parameters())
    
        # Our custom BatchRandWalk collate function
        collate_fn = BatchRandWalk(g, args.walk_len, args.win_size,
                                   args.neg_num, args.neg_sample_type)
    
        # Build each batch with the Dataloader
        data_loader = Dataloader(
            train_ds,
            batch_size=args.batch_size,
            shuffle=True,
          #  num_workers=args.sample_workers,
            collate_fn=collate_fn)
    
        # Train the model and save its parameters
        train_loss = train(model, data_loader, optim)
        paddle.save(model.state_dict(), "model.pdparams")
    
    
    if __name__ == '__main__':
        '''
        Read the project settings. Arguments can be passed on the command line:
        python train.py --<arg> <value>   (see the argparse documentation)
        e.g. to use a single GPU: python train.py --use_cuda
        '''
        parser = argparse.ArgumentParser(description='Deepwalk')
        parser.add_argument(
            "--dataset",
            type=str,
            default="BlogCatalog",
            help="dataset (cora, pubmed, BlogCatalog)")
        parser.add_argument("--use_cuda", action='store_true', help="use_cuda",default=True)
        parser.add_argument(
            "--conf",
            type=str,
            default="./config.yaml",
            help="config file for models")
        parser.add_argument("--epoch", type=int, default=40, help="Epoch")
        parser.add_argument("--edge_file", type=str, default=None)
        args = parser.parse_args()
    
        # merge user args and config file 
        config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader))
        config.update(vars(args))
        main(config)
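
To make the PolynomialDecay comment above concrete, a small sketch (learning_rate=0.025 is an assumed value for illustration; the real one comes from config.yaml):

    import paddle

    # Linear decay (power defaults to 1.0) from 0.025 down to 0.0001 over 100 steps.
    sched = paddle.optimizer.lr.PolynomialDecay(
        learning_rate=0.025, decay_steps=100, end_lr=0.0001)
    for step in range(101):
        if step % 25 == 0:
            print(step, sched.get_lr())
        sched.step()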
    

     

  • model.py: implements the skip-gram model
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
    Deepwalk model file.
"""
import math

import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class SkipGramModel(nn.Layer):
    '''
    Skip-gram model.
    Inputs:
    num_nodes: number of nodes in the graph
    embed_size: embedding dimension
    neg_num: number of negative samples per positive pair
    sparse: whether to use sparse gradient updates for the embedding
    sparse_embedding: whether to use a distributed sparse embedding table
    '''

    def __init__(self,
                 num_nodes,
                 embed_size=16,
                 neg_num=5,
                 sparse=False,
                 sparse_embedding=False):
        super(SkipGramModel, self).__init__()

        self.num_nodes = num_nodes
        self.neg_num = neg_num

        # embed_init = nn.initializer.Uniform(
        # low=-1. / math.sqrt(embed_size), high=1. / math.sqrt(embed_size))

        # paddle.nn.initializer.Uniform: random uniform-distribution initializer
        embed_init = nn.initializer.Uniform(low=-1.0, high=1.0)

        # Create a parameter attribute object: it configures the parameter's name,
        # initializer, learning rate, regularizer, trainability, gradient clipping,
        # model averaging, and so on.
        emb_attr = paddle.ParamAttr(
            name="node_embedding", initializer=embed_init)

        # If using a distributed sparse embedding table
        if sparse_embedding:

            def emb_func(x):
                # Get the shape of the input Tensor or SelectedRows.
                d_shape = paddle.shape(x)

                x_emb = paddle.static.nn.sparse_embedding(
                    paddle.reshape(x, [-1, 1]), [num_nodes, embed_size],
                    param_attr=emb_attr)
                return paddle.reshape(x_emb,
                                      [d_shape[0], d_shape[1], embed_size])

            self.emb = emb_func
        else:
            '''
            Parameters:
            num_embeddings (int) - size of the embedding dictionary; every id in input must satisfy 0 <= id < num_embeddings.
            embedding_dim (int) - dimension of each embedding vector.
            padding_idx (int|long|None) - must lie in [-weight.shape[0], weight.shape[0]); if set, the embedding for this id is zero-filled during training.
            sparse (bool) - whether to use sparse updates; with large embedding weights, sparse updates give faster training and a smaller memory/GPU-memory footprint.
            weight_attr (ParamAttr|None) - configuration of the embedding weights, including the initializer; see ParamAttr. Usually left unset; defaults to None.
            '''
            self.emb = nn.Embedding(
                num_nodes, embed_size, sparse=sparse, weight_attr=emb_attr)
        # BCEWithLogitsLoss is a callable that computes the binary cross entropy
        # with logits loss between the input logit and the label.
        self.loss = paddle.nn.BCEWithLogitsLoss()

    def forward(self, src, dsts):
        # src [b, 1]
        # dsts [b, 1+neg]

        src_embed = self.emb(src)
        dsts_embed = self.emb(dsts)

        pos_embed = dsts_embed[:, 0:1]
        neg_embed = dsts_embed[:, 1:]

        pos_logits = paddle.matmul(
            src_embed, pos_embed, transpose_y=True)  # [batch_size, 1, 1]

        neg_logits = paddle.matmul(
            src_embed, neg_embed, transpose_y=True)  # [batch_size, 1, neg_num]

        # Compare against all-ones / all-zeros labels to compute the loss
        ones_label = paddle.ones_like(pos_logits)
        pos_loss = self.loss(pos_logits, ones_label)

        zeros_label = paddle.zeros_like(neg_logits)
        neg_loss = self.loss(neg_logits, zeros_label)

        loss = (pos_loss + neg_loss) / 2
        return loss
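A quick shape check of the forward pass (a sketch; the batch size of 4 and the toy graph's 10 nodes are arbitrary choices):

    import paddle
    from model import SkipGramModel

    model = SkipGramModel(num_nodes=10, embed_size=16, neg_num=5)

    # src: [batch_size, 1] center nodes; dsts: [batch_size, 1 + neg_num],
    # where column 0 is the positive context node and the rest are negatives.
    src = paddle.randint(low=0, high=10, shape=[4, 1])
    dsts = paddle.randint(low=0, high=10, shape=[4, 6])

    loss = model(src, dsts)  # scalar loss averaged over positive and negative terms
    print(loss.numpy())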
  • dataset.py: data-processing code, including random-walk sequence generation and dataset construction; the random walks themselves come from random_walk in the official walk.py.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import paddle
import numpy as np

from pgl import graph_kernel
from pgl.utils.logger import log
from pgl.utils.data import Dataset
from pgl.sampling import random_walk
from pgl.graph_kernel import skip_gram_gen_pair

'''
Random-walk batching parameters.
'''
class BatchRandWalk(object):
    def __init__(self, graph, walk_len, win_size, neg_num, neg_sample_type):
        self.graph = graph   # the source graph
        self.walk_len = walk_len  # walk length
        self.win_size = win_size  # window size
        self.neg_num = neg_num  # number of negative samples
        self.neg_sample_type = neg_sample_type  # negative-sampling strategy

    def __call__(self, nodes):
        # nodes: a batch_size-sized set of start nodes drawn by the DataLoader
        print('Start nodes for this batch:', nodes)
        walks = random_walk(self.graph, nodes, self.walk_len)
        print('Random-walk sequences:', walks)
        src_list, pos_list = [], []
        for walk in walks:
            s, p = skip_gram_gen_pair(walk, self.win_size)
            src_list.append(s)
            pos_list.append(p)

        src = [s for x in src_list for s in x]
        pos = [s for x in pos_list for s in x]
        src = np.array(src, dtype=np.int64)
        pos = np.array(pos, dtype=np.int64)
        src, pos = np.reshape(src, [-1, 1]), np.reshape(pos, [-1, 1])

        neg_sample_size = [len(pos), self.neg_num]
        if self.neg_sample_type == "average":
            negs = np.random.randint(
                low=0, high=self.graph.num_nodes, size=neg_sample_size)
        elif self.neg_sample_type == "outdegree":
            pass
            #negs = alias_sample(neg_sample_size, alias, events)
        elif self.neg_sample_type == "inbatch":
            pass
        else:
            raise ValueError
        dsts = np.concatenate([pos, negs], 1)
        # [batch_size, 1] [batch_size, neg_num+1]
        return src, dsts

'''
    Inputs:
        nodes: all nodes in the graph
        repeat: number of repetitions (folds the epoch count into the dataset length)
    Output:
        self.data: the constructed dataset
'''
class ShardedDataset(Dataset):
    def __init__(self, nodes, mode="train", repeat=1):
        # number of repetitions
        self.repeat = repeat
        # paddle.distributed.get_world_size() returns the number of processes in the
        # current job, i.e. the environment variable PADDLE_TRAINERS_NUM (default 1).
        # single process
        if int(paddle.distributed.get_world_size()) == 1 or mode != "train":
            self.data = nodes
        else:

            # paddle.distributed.get_rank() returns the rank of the current
            # process, i.e. the environment variable PADDLE_TRAINER_ID (default 0).
            self.data = nodes[int(paddle.distributed.get_rank())::int(
                paddle.distributed.get_world_size())]

    def __getitem__(self, idx):
        return self.data[idx % len(self.data)]

    def __len__(self):
        return len(self.data) * self.repeat
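
pgl.graph_kernel.skip_gram_gen_pair is a compiled Cython kernel. As a rough mental model, the pure-Python sketch below pairs each center node with every node within win_size steps; this fixed-window behavior is an assumption for illustration (the real kernel, like word2vec, samples a random reduced window per center, so the exact pairs differ):

    def skip_gram_pairs(walk, win_size):
        # For each center position i, emit (center, context) pairs for every
        # position within win_size steps on either side.
        src, pos = [], []
        for i, center in enumerate(walk):
            lo = max(0, i - win_size)
            hi = min(len(walk), i + win_size + 1)
            for j in range(lo, hi):
                if j != i:
                    src.append(center)
                    pos.append(walk[j])
        return src, pos

    # Example: a walk of length 5 with window size 2
    print(skip_gram_pairs([9, 7, 0, 1, 3], 2))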
  • DeepWalk core code:
  • def random_walk(graph, nodes, max_depth):
        """Implement of random walk.
    
        This function get random walks path for given nodes and depth.
    
        Args:
            nodes: Walk starting from nodes
            max_depth: Max walking depth
    
        Return:
            A list of walks.
        """
        walk_paths = []
        # init: start each walk at its given start node
        for node in nodes:
            walk_paths.append([node])
    
        cur_walk_ids = np.arange(0, len(nodes))
        cur_nodes = np.array(nodes)
        for l in range(max_depth - 1):
            # select the walks that have not ended:
            # check whether each current node has successors; mask out those without
            cur_succs = graph.successor(cur_nodes)
            mask = [len(succ) > 0 for succ in cur_succs]
    
            if np.any(mask):
                cur_walk_ids = cur_walk_ids[mask]
                cur_nodes = cur_nodes[mask]
                cur_succs = cur_succs[mask]
            else:
                # stop when all nodes have no successor
                break
            # For each surviving node, draw a uniform random number in [0, 1),
            # multiply by its out-degree, and floor it to pick the next node.
            outdegree = [len(cur_succ) for cur_succ in cur_succs]
            sample_index = np.floor(
                np.random.rand(cur_succs.shape[0]) * outdegree).astype("int64")
    
            nxt_cur_nodes = []
            for s, ind, walk_id in zip(cur_succs, sample_index, cur_walk_ids):
                walk_paths[walk_id].append(s[ind])
                nxt_cur_nodes.append(s[ind])
            cur_nodes = np.array(nxt_cur_nodes)
        return walk_paths
    

    Sample random-walk sequences:

  • (Figure 2: printed random-walk sequences)
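
To reproduce output like the figure above, a usage sketch on the toy graph (walks stop early at nodes 0 and 1, which have no successors; the printed walks vary from run to run):

    import pgl
    from pgl.sampling import random_walk

    edge_list = [(2, 0), (2, 1), (3, 1), (4, 0), (5, 0),
                 (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
                 (7, 2), (7, 3), (8, 0), (9, 7)]
    g = pgl.graph.Graph(num_nodes=10, edges=edge_list)

    walks = random_walk(g, nodes=[9, 7, 6], max_depth=5)
    print(walks)  # e.g. [[9, 7, 2, 1], [7, 3, 1], [6, 5, 0]]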

  • This post records my understanding of the DeepWalk process. It is my first blog post, so feedback is very welcome~
