GraphSage Code Reading Notes (TensorFlow Version)

GraphSage Code Reading Notes (TensorFlow Version) — Contents

    • 1. Paper and code links
    • 2. File structure
    • 3. Data description
      • 3.1 Dataset information tables
      • 3.2 Data file details
    • 4. Running the code: environment setup and configuration
      • 4.1 Environment setup
      • 4.2 Running the code
    • 5. Code file walkthrough
      • 5.1 __init__.py
      • 5.2 unsupervised_train.py
      • 5.3 utils.py
      • 5.4 Scratch test code

1. Paper and code links

Paper: Inductive Representation Learning on Large Graphs

Code: GraphSage repository

2. File structure

1. Top-level directory
├── eval_scripts   // evaluation scripts
├── example_data   // the toy PPI dataset
└── graphsage      // model structure definitions, GCN layer definitions, ...

2. eval_scripts // contents of the evaluation-script directory
├──citation_eval.py
├──ppi_eval.py
└──reddit_eval.py

3. example_data // the toy PPI dataset
├── toy-ppi-class_map.json // maps graph node ids to classes
├── toy-ppi-feats.npy      // pretrained node features
├── toy-ppi-G.json         // the graph itself
├── toy-ppi-walks.txt      // random-walk co-occurrences starting from each node (198 walks per node)
└── toy-ppi-id_map.json    // one-to-one mapping between node ids and row indices

4. graphsage // model structure definitions
├── __init__.py           // package initialization / imports
├── aggregators.py        // aggregator function definitions
├── inits.py              // shared initialization helpers
├── layers.py             // GCN layer definitions
├── metrics.py            // evaluation metric computation
├── minibatch.py          // minibatch iterator definitions
├── models.py             // model architectures
├── neigh_samplers.py     // samplers that draw from a node's neighbors
├── prediction.py         // edge (link) prediction layer
├── supervised_models.py  // supervised GraphSage model
├── supervised_train.py   // supervised training script
├── unsupervised_train.py // unsupervised training script
└── utils.py              // utility functions

3. Data description

3.1 Dataset information tables

Table 1: (image in the original post, not reproduced here)
Table 2: (image in the original post, not reproduced here)
Note: Table 1 is taken from https://blog.csdn.net/yyl424525/article/details/102966617

3.2 Data file details

1. toy-ppi-G.json // the graph itself
The dataset contains a single graph, which is used for a node classification task.
The graph is undirected and consists of a "nodes" set and a "links" set; each is a list, and every node or link in it is stored as a dictionary.
(The raw JSON and its format are shown as screenshots in the original post, taken from https://blog.csdn.net/yyl424525/article/details/102966617)

2. toy-ppi-class_map.json: maps graph node ids to classes. Format: {"0": [1, 0, 0,…],…,"14754": [1, 1, 0, 0,…]}

3. toy-ppi-id_map.json // one-to-one mapping between node ids and row indices
Data format: {"0": 0, "1": 1,…, "14754": 14754}
4. toy-ppi-feats.npy // pretrained node features

  • A NumPy array of node features, in the order given by id_map.json. It can be omitted, in which case only identity features are used.

5. toy-ppi-walks.txt

  • Random-walk co-occurrences: each line holds a pair of node ids that co-occurred on a random walk (shown as a screenshot in the original post).
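To make these formats concrete, here is a minimal inspection sketch (my own, not part of the repository; it assumes the files sit under ./example_data/ and roughly mirrors what graphsage.utils.load_data does):

import json
import numpy as np
from networkx.readwrite import json_graph

prefix = "./example_data/toy-ppi"

G = json_graph.node_link_graph(json.load(open(prefix + "-G.json")))  # the graph itself
id_map = json.load(open(prefix + "-id_map.json"))                    # node id -> feature row index
class_map = json.load(open(prefix + "-class_map.json"))              # node id -> multi-hot label list
feats = np.load(prefix + "-feats.npy")                               # feature matrix, row order given by id_map

print(G.number_of_nodes(), G.number_of_edges(), feats.shape)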

4. Running the code: environment setup and configuration

4.1 Environment setup

1. The following package versions can be installed with pip install -r requirements.txt:

absl-py==0.2.2
astor==0.6.2
backports.weakref==1.0.post1
bleach==1.5.0
decorator==4.3.0
enum34==1.1.6
funcsigs==1.0.2
futures==3.1.0
gast==0.2.0
grpcio==1.12.1
html5lib==0.9999999
Markdown==2.6.11
mock==2.0.0
networkx==1.11
numpy==1.14.5
pbr==4.0.4
protobuf==3.6.0
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
Werkzeug==0.14.1

Notes:

  • networkx must be version 1.11 or lower, otherwise the version assertion in utils.py raises an error.

  • Docker setup:
    Install Docker first if it is not already available.
    GraphSage can also be run inside a [Docker](https://docs.docker.com/) image. After cloning the project, build and run the image as follows:
    $ docker build -t graphsage .
    $ docker run -it graphsage bash
    A GPU image can also be run with [nvidia-docker](https://github.com/NVIDIA/nvidia-docker):
    $ docker build -t graphsage:gpu -f Dockerfile.gpu .
    $ nvidia-docker run -it graphsage:gpu bash

4.2 Running the code

1. Run unsupervised_train.py from the command line.

  • unsupervised_train.py builds its loss from node/neighbor co-occurrence information, can output node embeddings once trained, and uses EdgeMinibatchIterator.
python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
# Reference: https://blog.csdn.net/yyl424525/article/details/102966617

Possible values for the model flag:

* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)

2. Run supervised_train.py; note that the train_prefix argument must likewise point to the data prefix, e.g. ./example_data/toy-ppi

python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid


5. Code file walkthrough

5.1 __init__.py

from __future__ import print_function
'''Even under Python 2.x, print must be called with parentheses, as in Python 3.x.'''

from __future__ import division
'''Imports true division from future Python versions:
without this import, "/" performs truncating division on integers;
with it, "/" performs true division and "//" performs truncating division.'''
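A quick sanity check of the difference (Python 2 semantics; the results are shown as comments):

from __future__ import division
print(7 / 2)   # 3.5 -- true division after the import
print(7 // 2)  # 3   -- "//" still truncates
# without the import, Python 2 would also print 3 for 7 / 2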

5.2 unsupervised_train.py

# _*_ coding:UTF-8
# supervised_train.py builds its loss from node classification labels, cannot output node embeddings, and uses NodeMinibatchIterator.

# unsupervised_train.py builds its loss from node/neighbor co-occurrence information, can output node embeddings once trained, and uses EdgeMinibatchIterator.

from __future__ import division
'''Imports true division: without this import, "/" performs truncating division on integers;
with it, "/" performs true division and "//" performs truncating division.'''

from __future__ import print_function
'''Even under Python 2.x, print must be called with parentheses, as in Python 3.x.'''

import os          # operating system utilities
import time        # timing utilities
import tensorflow as tf
import numpy as np

from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
'''If the machine has multiple GPUs, TensorFlow uses all of them by default;
restrict visibility with the CUDA_VISIBLE_DEVICES environment variable (set from FLAGS.gpu below).'''

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  # enumerate GPUs in PCI bus order, starting from 0

'''Set the random seed: with a fixed seed the same random numbers are generated on every run;
without it, each run produces different random numbers.'''
seed = 123
np.random.seed(seed)      # seed NumPy's random number generator
tf.set_random_seed(seed)  # seed TensorFlow's random number generator

# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS  # FLAGS holds the command-line arguments, so parameters can be passed externally, e.g. python train.py --model gcn

tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")  # boolean flag
# Core params: defined here and overridable from the command line
flags.DEFINE_string('model', 'graphsage', 'model names. See README for possible values.')  # model name
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'name of the object file that stores the training data. must be specified.')

# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')  # number of training epochs
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')  # dropout rate; randomly dropping units helps avoid overfitting
# the loss adds an L2 weight-decay term: self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')  # weight decay keeps the weights small and reduces overfitting
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')  # maximum node degree kept when building the adjacency table

flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')  # neighbors sampled at layer 1 (K = 1, S = 25)
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')  # neighbors sampled at layer 2 (K = 2, S = 10)

# with the concat operation the output dimension is doubled
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')  # number of negative samples
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')  # minibatch size
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')  # number of extra SGD epochs for the n2v baseline
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')  # set to a positive value to learn identity embeddings of that dimension (default 0)

# logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')  # whether to save embeddings after training
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')  # base directory for logs and saved embeddings
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")  # run a validation minibatch every this many iterations
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")  # validation batch size
flags.DEFINE_integer('gpu', 1, "which gpu to use.")  # which GPU to use; with a single GPU change 1 to 0
flags.DEFINE_integer('print_every', 50, "How often to print training info.")  # how often to print training info
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")  # maximum total number of training steps

os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)#传入参数 # 使用哪一块gpu,只有一块,需将1改为0

GPU_MEM_FRACTION = 0.8#分配GPU多少资源给它使用

def log_dir():  # build (and create if needed) the directory where logs and embeddings are saved
    log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]
    log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(
            model=FLAGS.model,
            model_size=FLAGS.model_size,
            lr=FLAGS.learning_rate)
    if not os.path.exists(log_dir):  # create the directory if it does not exist
        os.makedirs(log_dir)
    return log_dir

# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):  # evaluate the model on one validation minibatch
    t_test = time.time()
    feed_dict_val = minibatch_iter.val_feed_dict(size)  # build the feed_dict for a validation minibatch
    outs_val = sess.run([model.loss, model.ranks, model.mrr], 
                        feed_dict=feed_dict_val)  # compute loss, ranks and MRR
    return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)  # return loss, ranks, MRR and elapsed time

def incremental_evaluate(sess, model, minibatch_iter, size):  # evaluate over the full validation set, one minibatch at a time
    t_test = time.time()
    finished = False
    val_losses = []
    val_mrrs = []
    iter_num = 0
    while not finished:
        feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.ranks, model.mrr], 
                            feed_dict=feed_dict_val)
        val_losses.append(outs_val[0])
        val_mrrs.append(outs_val[2])
    return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)

def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):  # compute and save embeddings for the validation nodes
    val_embeddings = []
    finished = False
    seen = set([])
    nodes = []
    iter_num = 0
    name = "val"
    while not finished:
        feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.mrr, model.outputs1], 
                            feed_dict=feed_dict_val)
        #ONLY SAVE FOR embeds1 because of planetoid
        for i, edge in enumerate(edges):
            if not edge[0] in seen:
                val_embeddings.append(outs_val[-1][i,:])
                nodes.append(edge[0])
                seen.add(edge[0])
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    val_embeddings = np.vstack(val_embeddings)  # stack the per-node embeddings into one matrix
    np.save(out_dir + name + mod + ".npy",  val_embeddings)
    with open(out_dir + name + mod + ".txt", "w") as fp:
        fp.write("\n".join(map(str,nodes)))  # write the node ids, one per line, in the same order as the saved embeddings

def construct_placeholders():  # define the TensorFlow placeholders fed at each step
    # Define placeholders
    placeholders = {
        'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
        'batch2' : tf.placeholder(tf.int32, shape=(None), name='batch2'),
        # negative samples for all nodes in the batch
        'neg_samples': tf.placeholder(tf.int32, shape=(None,),
            name='neg_sample_size'),
        'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
        'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
    }
    return placeholders

def train(train_data, test_data=None):  # main training routine
    G = train_data[0]         # the graph
    features = train_data[1]  # node feature matrix
    id_map = train_data[2]    # node-id -> index mapping; nodes lacking the 'val'/'test' attributes were already removed in load_data

    if not features is None:
        # vstack appends a row of zeros to the feature matrix, so the extra "padding" node index
        # used in the adjacency lists looks up an all-zero feature vector
        features = np.vstack([features, np.zeros((features.shape[1],))])

    context_pairs = train_data[3] if FLAGS.random_context else None  # random-walk co-occurrence pairs
    placeholders = construct_placeholders()
    # construct_placeholders() defines: batch1, batch2, neg_samples, dropout, batch_size
    minibatch = EdgeMinibatchIterator(G, 
            id_map,
            placeholders, batch_size=FLAGS.batch_size,
            max_degree=FLAGS.max_degree, 
            num_neg_samples=FLAGS.neg_sample_size,
            context_pairs = context_pairs)
    adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
    adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")

    if FLAGS.model == 'graphsage_mean':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)
    elif FLAGS.model == 'gcn':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="gcn",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     concat=False,
                                     logging=True)

    elif FLAGS.model == 'graphsage_seq':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     identity_dim = FLAGS.identity_dim,
                                     aggregator_type="seq",
                                     model_size=FLAGS.model_size,
                                     logging=True)

    elif FLAGS.model == 'graphsage_maxpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="maxpool",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)
    elif FLAGS.model == 'graphsage_meanpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="meanpool",
                                     model_size=FLAGS.model_size,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    elif FLAGS.model == 'n2v':
        model = Node2VecModel(placeholders, features.shape[0],
                                       minibatch.deg,
                                       #2x because graphsage uses concat
                                       nodevec_dim=2*FLAGS.dim_1,
                                       lr=FLAGS.learning_rate)
    else:
        raise Exception('Error: model name unrecognized.')

    config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
    config.gpu_options.allow_growth = True
    # with allow_growth the process starts with a small amount of GPU memory and grows it as needed
    #config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
    # per_process_gpu_memory_fraction caps how much of each GPU the process may take
    # (0.4 would mean 40%); it is commented out here in favor of allow_growth

    config.allow_soft_placement = True  # if the requested device does not exist, let TF pick one automatically
    # With "with tf.device('/cpu:0'):" an op can be pinned to a device manually;
    # if that device does not exist or is unavailable the program would stall or raise an error,
    # so allow_soft_placement=True lets TF fall back to an existing, usable device.

    # Initialize session
    sess = tf.Session(config=config)
    merged = tf.summary.merge_all()  # merge all summaries so the training process and parameter
                                     # distributions can be written to disk and shown in TensorBoard
    # tf.summary.FileWriter(path, sess.graph) opens a file used to save the graph;
    # its add_summary() method then appends training summaries to that file
    summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
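    # (Side note, not in the original script: the event files written here can be viewed with
    #  TensorBoard, e.g. `tensorboard --logdir <directory returned by log_dir()>`; the exact
    #  path depends on train_prefix, model and learning_rate.)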
     
    # Init variables
    sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})
    
    # Train model
    
    train_shadow_mrr = None
    shadow_mrr = None

    total_steps = 0
    avg_time = 0.0
    epoch_val_costs = []

    train_adj_info = tf.assign(adj_info, minibatch.adj)
    val_adj_info = tf.assign(adj_info, minibatch.test_adj)
    for epoch in range(FLAGS.epochs): 
        minibatch.shuffle() 

        iter = 0
        print('Epoch: %04d' % (epoch + 1))
        epoch_val_costs.append(0)
        while not minibatch.end():
            # Construct feed dictionary
            feed_dict = minibatch.next_minibatch_feed_dict()
            feed_dict.update({placeholders['dropout']: FLAGS.dropout})

            t = time.time()
            # Training step
            outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all, 
                    model.mrr, model.outputs1], feed_dict=feed_dict)
            train_cost = outs[2]
            train_mrr = outs[5]
            if train_shadow_mrr is None:
                train_shadow_mrr = train_mrr
            else:
                train_shadow_mrr -= (1-0.99) * (train_shadow_mrr - train_mrr)

            if iter % FLAGS.validate_iter == 0:
                # Validation
                sess.run(val_adj_info.op)
                val_cost, ranks, val_mrr, duration  = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)
                sess.run(train_adj_info.op)
                epoch_val_costs[-1] += val_cost
            if shadow_mrr is None:
                shadow_mrr = val_mrr
            else:
                shadow_mrr -= (1-0.99) * (shadow_mrr - val_mrr)

            if total_steps % FLAGS.print_every == 0:
                summary_writer.add_summary(outs[0], total_steps)
    
            # Print results
            avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)

            if total_steps % FLAGS.print_every == 0:
                print("Iter:", '%04d' % iter, 
                      "train_loss=", "{:.5f}".format(train_cost),
                      "train_mrr=", "{:.5f}".format(train_mrr), 
                      "train_mrr_ema=", "{:.5f}".format(train_shadow_mrr), # exponential moving average
                      "val_loss=", "{:.5f}".format(val_cost),
                      "val_mrr=", "{:.5f}".format(val_mrr), 
                      "val_mrr_ema=", "{:.5f}".format(shadow_mrr), # exponential moving average
                      "time=", "{:.5f}".format(avg_time))

            iter += 1
            total_steps += 1

            if total_steps > FLAGS.max_total_steps:
                break

        if total_steps > FLAGS.max_total_steps:
                break
    
    print("Optimization Finished!")
    if FLAGS.save_embeddings:  # whether to save node embeddings after training
        sess.run(val_adj_info.op)

        save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())

        if FLAGS.model == "n2v":
            # stopping the gradient for the already trained nodes
            train_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],
                    dtype=tf.int32)
            test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']], 
                    dtype=tf.int32)
            update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))
            no_update_nodes = tf.nn.embedding_lookup(model.context_embeds,tf.squeeze(train_ids))
            update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))
            no_update_nodes = tf.stop_gradient(tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))
            model.context_embeds = update_nodes + no_update_nodes
            sess.run(model.context_embeds)

            # run random walks
            from graphsage.utils import run_random_walks
            nodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]
            start_time = time.time()
            pairs = run_random_walks(G, nodes, num_walks=50)
            walk_time = time.time() - start_time

            test_minibatch = EdgeMinibatchIterator(G, 
                id_map,
                placeholders, batch_size=FLAGS.batch_size,
                max_degree=FLAGS.max_degree, 
                num_neg_samples=FLAGS.neg_sample_size,
                context_pairs = pairs,
                n2v_retrain=True,
                fixed_n2v=True)
            
            start_time = time.time()
            print("Doing test training for n2v.")
            test_steps = 0
            for epoch in range(FLAGS.n2v_test_epochs):
                test_minibatch.shuffle()
                while not test_minibatch.end():
                    feed_dict = test_minibatch.next_minibatch_feed_dict()
                    feed_dict.update({placeholders['dropout']: FLAGS.dropout})
                    outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all, 
                        model.mrr, model.outputs1], feed_dict=feed_dict)
                    if test_steps % FLAGS.print_every == 0:
                        print("Iter:", '%04d' % test_steps, 
                              "train_loss=", "{:.5f}".format(outs[1]),
                              "train_mrr=", "{:.5f}".format(outs[-2]))
                    test_steps += 1
            train_time = time.time() - start_time
            save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")
            print("Total time: ", train_time+walk_time)
            print("Walk time: ", walk_time)
            print("Train time: ", train_time)

    
# main function: load the data and train
def main(argv=None):
    print("Loading training data..")
    train_data = load_data(FLAGS.train_prefix, load_walks=True)  # load_data is defined in graphsage.utils and loads the graph, features, id map, walks and labels
    print("Done loading training data..")
    train(train_data)  # train() is defined above as def train(train_data, test_data=None)

if __name__ == '__main__':
    tf.app.run()  # parse the command-line flags and then call main()
''' 
tf.app.run() parses the flags and then executes the main function.
If the entry point is not called main() but, say, test(), pass it explicitly as tf.app.run(main=test);
if it is called main(), a bare tf.app.run() is enough.
Because FLAGS = tf.app.flags.FLAGS was created above, the flag definitions are already registered;
tf.app.run() parses sys.argv against them and passes any remaining (unparsed) arguments on to main().
'''
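One detail of the training loop above worth spelling out: train_shadow_mrr and shadow_mrr are exponential moving averages of the MRR. The in-place update shadow -= (1 - 0.99) * (shadow - new) is algebraically the same as shadow = 0.99 * shadow + 0.01 * new. A small stand-alone sketch (the helper name update_ema is mine, not part of the repository):

def update_ema(shadow, new_value, decay=0.99):
    # Exponential moving average: keep `decay` of the history, blend in (1 - decay) of the new value.
    if shadow is None:      # first observation, no history yet
        return new_value
    return decay * shadow + (1.0 - decay) * new_value

# e.g. train_shadow_mrr = update_ema(train_shadow_mrr, train_mrr)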

5.3 utils.py

from __future__ import print_function

import numpy as np   # NumPy
import random        # random number utilities
import json          # JSON parsing
import sys           # system utilities
import os            # operating system utilities

import networkx as nx                        # networkx: graph creation and manipulation
from networkx.readwrite import json_graph    # converts between networkx graphs and node-link JSON data
version_info = list(map(int, nx.__version__.split('.')))  # parse the networkx version string into integers
major = version_info[0]  # major version number
minor = version_info[1]  # minor version number
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # networkx must be version 1.11 or lower, otherwise fail here

WALK_LEN=5
N_WALKS=50
'''
type() vs. isinstance():
type() does not treat a subclass instance as an instance of its parent class (inheritance is ignored).
isinstance() does take inheritance into account, so it is the recommended way to check whether two types match.

Arguments:
object    -- the instance to test.
classinfo -- a class/type name (direct or indirect), a basic type, or a tuple of them.

Return value:
True if the object's type matches classinfo, otherwise False.
'''
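# A quick illustration (my own, not part of utils.py):
#   class A(object): pass
#   class B(A): pass
#   isinstance(B(), A)   # True  -- isinstance() follows inheritance
#   type(B()) == A       # False -- type() compares the exact class only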
'''G.nodes() returns the nodes of the graph; with data=True it also returns each node's attribute dict.'''

def load_data(prefix, normalize=True, load_walks=False):
    G_data = json.load(open(prefix + "-G.json"))  # the graph is stored as JSON, so it is read with the json module
    G = json_graph.node_link_graph(G_data)        # return graph from node-link data format
    # define a conversion function for node ids:
    # if G.nodes()[0] is an int (i.e. no node data), cast ids to int, otherwise keep them unchanged
    if isinstance(G.nodes()[0], int):
        conversion = lambda n : int(n)
    else:
        conversion = lambda n : n

    if os.path.exists(prefix + "-feats.npy"):  # load the pretrained feature file if it exists
        feats = np.load(prefix + "-feats.npy")
    else:
        print("No features present.. Only identity features will be used.")
        feats = None

    # a JSON dictionary that maps graph node ids to consecutive integers
    id_map = json.load(open(prefix + "-id_map.json"))  # node id <-> row index mapping
    id_map = {conversion(k):int(v) for k,v in id_map.items()}  # JSON keys are strings; convert keys and values to the proper types
    walks = []
    class_map = json.load(open(prefix + "-class_map.json"))  # label data, e.g. {"0": [1, 0, 0,...], ..., "14754": [1, 1, 0, 0,...]}
    if isinstance(list(class_map.values())[0], list):  # labels are either a list (multi-label) or a single value
        lab_conversion = lambda n : n
    else:
        lab_conversion = lambda n : int(n)  # single labels are converted to int

    # convert the class_map keys with the same conversion as the node ids and the values with lab_conversion
    class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()}
    
    ## Remove all nodes that do not have val/test annotations
    ## (necessary because of networkx weirdness with the Reddit data)
    broken_count = 0
    for node in G.nodes():
        if not 'val' in G.node[node] or not 'test' in G.node[node]:
            G.remove_node(node)
            broken_count += 1
    print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))

    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
    # G.edges() returns the edge list [(u, v), (u, v), ...]; each element holds the two endpoints of one edge
    # (with data=True it would also return edge attributes such as weights).
    # If at least one endpoint of an edge is a val or test node, the edge is marked 'train_removed' = True,
    # otherwise False; this guarantees every edge carries the 'train_removed' attribute.
    print("Loaded data.. now preprocessing..")
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
            G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False

    # Normalize the training features: take the nodes whose val and test flags are both False as the
    # training set, look up their rows in the feature matrix via id_map, fit a StandardScaler on those
    # rows only, and then transform the whole matrix so every dimension has zero mean and unit variance
    # (this keeps dimensions with large raw values from dominating).
    if normalize and not feats is None:
        from sklearn.preprocessing import StandardScaler
        train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])
        train_feats = feats[train_ids]
        scaler = StandardScaler()
        scaler.fit(train_feats)          # mean and variance are computed on the training rows only
        feats = scaler.transform(feats)  # the whole feature matrix is standardized with those statistics
    if load_walks:  # False by default; set to True when the random-walk co-occurrences are needed
        with open(prefix + "-walks.txt") as fp:
            for line in fp:
                walks.append(map(conversion, line.split()))  # each line holds one co-occurring node pair

    return G, feats, id_map, walks, class_map
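With the function completed as above, a minimal usage sketch (my own, assuming the toy-ppi files sit under ./example_data/) looks like this:

from graphsage.utils import load_data

G, feats, id_map, walks, class_map = load_data("./example_data/toy-ppi", load_walks=True)
print(G.number_of_nodes(), feats.shape, len(walks), len(class_map))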

5.4 Scratch test code

from __future__ import print_function

import numpy as np   # NumPy
import random        # random number utilities
import json          # JSON parsing
import sys           # system utilities
import os            # operating system utilities
import networkx as nx                        # networkx: graph creation and manipulation
from networkx.readwrite import json_graph    # converts between networkx graphs and node-link JSON data
version_info = list(map(int, nx.__version__.split('.')))  # parse the networkx version string into integers
WALK_LEN=5
N_WALKS=50
major = version_info[0]  # major version number
minor = version_info[1]  # minor version number
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # networkx must be version 1.11 or lower
prefix ='toy-ppi'
G_data = json.load(open('./toy-ppi-G.json'))  # the graph is stored as JSON, so it is read with the json module
G = json_graph.node_link_graph(G_data)        # return graph from node-link data format
if isinstance(G.nodes()[0], int):
    conversion = lambda n : int(n)
else:
    conversion = lambda n : n  # keep n unchanged

print(G.nodes()[0])
# output: 0
print(G.nodes())
# output: [0, 1, 2, 3, 4, 5, ..., 14753, 14754]  (the full node id list, abridged here)
if isinstance(G.nodes()[0], int):
    conversion = lambda n : int(n)
else:
    conversion = lambda n : n  # keep n unchanged
print(conversion(1.0))
# output: 1
if os.path.exists(prefix + "-feats.npy"):  # load the pretrained feature file if it exists
    feats = np.load(prefix + "-feats.npy")
else:
    print("No features present.. Only identity features will be used.")
    feats = None
id_map = json.load(open(prefix + "-id_map.json"))  # load the node id <-> row index mapping
print(id_map)
print('*******************************************************')
id_map = {conversion(k):int(v) for k,v in id_map.items()}
print(id_map)
# output of the first print (keys are still strings before conversion): {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, ... (truncated in the original notes)
