Inductive Representation Learning on Large Graphs (paper link)
GraphSAGE code link
1. File layout
├── eval_scripts // evaluation scripts
├── example_data // the PPI dataset
└── graphsage // model definitions, GCN layer definitions, …
2. eval_scripts // contents of the evaluation-scripts directory
├──citation_eval.py
├──ppi_eval.py
└──reddit_eval.py
3. example_data // the PPI dataset
├── toy-ppi-class_map.json // maps graph node ids to classes
├── toy-ppi-feats.npy // node features obtained from pretraining
├── toy-ppi-G.json // the graph itself
├── toy-ppi-walks // random walks from each node to its neighbors (198 walks per node)
└── toy-ppi-id_map.json // one-to-one mapping from node id to index
4. graphsage // model definitions
├── __init__.py // module imports
├── aggregators // aggregator definitions
├── inits.py // shared initialization helpers
├── layers // layer (GCN layer) definitions
├── metrics // evaluation metric computation
├── minibatch // minibatch iterator definitions
├── models // the various model architectures
├── neigh_samplers // samplers that draw from a node's neighbors
├── prediction //
├── supervised_models
├── supervised_train
├── unsupervised_train
└── utils // utility functions
Table 1 and Table 2: (images not reproduced here)
Note: Table 1 is taken from https://blog.csdn.net/yyl424525/article/details/102966617
1. toy-ppi-G.json // the graph
The dataset contains a single graph, used for a node-classification task.
The graph is undirected and consists of a nodes set and a links set; each is a list, and every node or link in it is stored as a dictionary.
Data format:
Note: (the figure above is taken from https://blog.csdn.net/yyl424525/article/details/102966617)
2. toy-ppi-class_map.json // maps graph node ids to classes. Format: {"0": [1, 0, 0, …], …, "14754": [1, 1, 0, 0, …]}
3. toy-ppi-id_map.json // one-to-one mapping from node id to index
Data format: {"0": 0, "1": 1, …, "14754": 14754}
4. toy-ppi-feats.npy // node features obtained from pretraining
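A minimal sketch for inspecting these four files (the ./example_data/ path and variable names are assumptions for illustration only):
import json
import numpy as np
feats = np.load("./example_data/toy-ppi-feats.npy")  # pretrained node feature matrix
id_map = json.load(open("./example_data/toy-ppi-id_map.json"))  # node id (str) -> integer index
class_map = json.load(open("./example_data/toy-ppi-class_map.json"))  # node id -> multi-hot label list
G_data = json.load(open("./example_data/toy-ppi-G.json"))  # node-link dict with "nodes" and "links" lists
print(feats.shape, len(id_map), len(class_map), len(G_data["nodes"]))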
1. The following package versions can be installed via pip install -r requirements.txt:
absl-py==0.2.2
astor==0.6.2
backports.weakref==1.0.post1
bleach==1.5.0
decorator==4.3.0
enum34==1.1.6
funcsigs==1.0.2
futures==3.1.0
gast==0.2.0
grpcio==1.12.1
html5lib==0.9999999
Markdown==2.6.11
mock==2.0.0
networkx==1.11
numpy==1.14.5
pbr==4.0.4
protobuf==3.6.0
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
sklearn==0.0
tensorboard==1.8.0
tensorflow==1.8.0
termcolor==1.1.0
Werkzeug==0.14.1
Notes:
1. Run unsupervised_train.py from the command line:
python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
# see https://blog.csdn.net/yyl424525/article/details/102966617
Possible values for --model:
* graphsage_mean -- GraphSage with mean-based aggregator
* graphsage_seq -- GraphSage with LSTM-based aggregator
* graphsage_maxpool -- GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
* graphsage_meanpool -- GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
* gcn -- GraphSage with GCN-based aggregator
* n2v -- an implementation of [DeepWalk](https://arxiv.org/abs/1403.6652) (called n2v for short in the code.)
2. Run supervised_train.py; note that the value of the train_prefix argument also needs to point to the data: …/example_data/toy-ppi
python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid
GraphSAGE Code Reading Notes (TensorFlow version) (Part 1)
# _*_ coding:UTF-8
# supervised_train.py trains with node-classification labels as the loss; it cannot output node embeddings and uses NodeMinibatchIterator
# unsupervised_train.py trains with node/neighbor adjacency information as the loss; after training it can output node embeddings and uses EdgeMinibatchIterator
from __future__ import division
'''With this feature imported, "/" performs true (precise) division and "//" performs truncating division;
without the import, "/" on integer operands performs truncating division in Python 2.x.'''
from __future__ import print_function
'''Even in Python 2.x, print must then be called with parentheses, as in Python 3.x.'''
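# A quick sketch of what the division import changes (Python 2.x semantics):
#   without it:  7 / 2 == 3    (integer operands are truncated)
#   with it:     7 / 2 == 3.5  and  7 // 2 == 3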
import os  # operating-system interfaces
import time  # time utilities
import tensorflow as tf
import numpy as np
from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data
'''If the server has multiple GPUs, TensorFlow uses all of them by default. To use only some of them, set their visibility via CUDA_VISIBLE_DEVICES.'''
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  # enumerate GPUs in PCI bus order, starting from 0 (with a single GPU, change the gpu flag below from 1 to 0)
'''Set random seed: with the same seed the generated random numbers are identical across runs; without a fixed seed they differ every run.'''
seed = 123
np.random.seed(seed)  # fix NumPy's random seed so the generated random numbers are reproducible
tf.set_random_seed(seed)
# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS  # FLAGS acts as the argument parser: values can be passed in from the command line, e.g. python train.py --model gcn
tf.app.flags.DEFINE_boolean('log_device_placement', False,
"""Whether to log device placement.""")  # define a boolean flag
# core params: flags whose values are parsed from the command line
flags.DEFINE_string('model', 'graphsage', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '', 'name of the object file that stores the training data. must be specified.')
# left to default values in the main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')  # number of training epochs
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')  # dropout reduces overfitting by randomly dropping a fraction of units
# loss computation (weight decay / L2 regularization): self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')  # weight decay pushes weights toward smaller values, reducing overfitting
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')  # maximum node degree kept when building the adjacency lists
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')  # number of neighbors sampled at layer 1 (K=1, S=25)
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')  # number of neighbors sampled at layer 2 (K=2, S=10)
# with the concat operation the output dimension is doubled
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')  # number of SGD epochs for n2v test training
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')  # set to a positive value to use identity embedding features of that dimension; default 0
# logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")  # run a validation minibatch every this many training iterations
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")  # number of nodes per validation minibatch
flags.DEFINE_integer('gpu', 1, "which gpu to use.")  # which GPU to use (change to 0 if there is only one)
flags.DEFINE_integer('print_every', 50, "How often to print training info.")  # how often to print training info
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")  # maximum total number of training steps
os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)  # make only the selected GPU visible (use 0 with a single GPU)
GPU_MEM_FRACTION = 0.8  # fraction of GPU memory to allocate to this process
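# Sketch of how these flags are consumed: with the example command from above,
#   python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean
# FLAGS.train_prefix == './example_data/toy-ppi' and FLAGS.model == 'graphsage_mean',
# while every flag not given on the command line keeps the default defined above.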
def log_dir():  # build the directory path used for logging and saving embeddings
log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]
log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(
model=FLAGS.model,
model_size=FLAGS.model_size,
lr=FLAGS.learning_rate)
if not os.path.exists(log_dir):  # create the directory if it does not exist
os.makedirs(log_dir)
return log_dir
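# Example (sketch): with --train_prefix ./example_data/toy-ppi, --model graphsage_mean and the
# other defaults, FLAGS.train_prefix.split("/")[-2] == 'example_data', so this returns
# './unsup-example_data/graphsage_mean_small_0.000010/'.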
# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):  # evaluate the model on one validation minibatch
t_test = time.time()
feed_dict_val = minibatch_iter.val_feed_dict(size)  # build the feed_dict for a validation minibatch
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)  # run loss, ranks and MRR
return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)  # return loss, ranks, MRR and elapsed time
def incremental_evaluate(sess, model, minibatch_iter, size):  # evaluate over the whole validation set, one minibatch at a time
t_test = time.time()
finished = False
val_losses = []
val_mrrs = []
iter_num = 0
while not finished:
feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.ranks, model.mrr],
feed_dict=feed_dict_val)
val_losses.append(outs_val[0])
val_mrrs.append(outs_val[2])
return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)
def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):  # save the embeddings of validation nodes
val_embeddings = []
finished = False
seen = set([])
nodes = []
iter_num = 0
name = "val"
while not finished:
feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
iter_num += 1
outs_val = sess.run([model.loss, model.mrr, model.outputs1],
feed_dict=feed_dict_val)
#ONLY SAVE FOR embeds1 because of planetoid
for i, edge in enumerate(edges):
if not edge[0] in seen:
val_embeddings.append(outs_val[-1][i,:])
nodes.append(edge[0])
seen.add(edge[0])
if not os.path.exists(out_dir):
os.makedirs(out_dir)
val_embeddings = np.vstack(val_embeddings)  # stack the embeddings row-wise into one matrix
np.save(out_dir + name + mod + ".npy", val_embeddings)
with open(out_dir + name + mod + ".txt", "w") as fp:
fp.write("\n".join(map(str,nodes)))  # write the corresponding node ids, one per line
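# Output (sketch): save_val_embeddings writes two files into out_dir: "val<mod>.npy" with one
# embedding row per validation node (stacked via np.vstack) and "val<mod>.txt" with the matching
# node ids, one per line, so the two can be reloaded and re-aligned by row index.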
def construct_placeholders():  # define the TF placeholders (values are fed in at run time)
# Define placeholders
placeholders = {
'batch1' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
'batch2' : tf.placeholder(tf.int32, shape=(None), name='batch2'),
# negative samples for all nodes in the batch
'neg_samples': tf.placeholder(tf.int32, shape=(None,),
name='neg_sample_size'),
'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
}
return placeholders
def train(train_data, test_data=None):  # training entry point
G = train_data[0]  # the graph
features = train_data[1]  # node features of the training data
id_map = train_data[2]  # "n" : n mapping from node id to index (nodes lacking 'val'/'test' attributes have already been removed)
if not features is None:  # only if features are provided
# vstack appends one all-zero row to features (a dummy feature vector)
features = np.vstack([features, np.zeros((features.shape[1],))])
context_pairs = train_data[3] if FLAGS.random_context else None  # random-walk co-occurrence pairs
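# Why the zero row (a sketch of the assumption): graphsage/minibatch.py pads each node's neighbor
# list with the dummy index len(id_map) when it has fewer than max_degree neighbors, and that index
# looks up this appended all-zero feature row, i.e. an (N, D) feature matrix becomes (N+1, D).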
placeholders = construct_placeholders()
# construct_placeholders() defines the following placeholders:
# batch1, batch2, neg_samples, dropout, batch_size
minibatch = EdgeMinibatchIterator(G,
id_map,
placeholders, batch_size=FLAGS.batch_size,
max_degree=FLAGS.max_degree,
num_neg_samples=FLAGS.neg_sample_size,
context_pairs = context_pairs)
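# Sketch of what the iterator exposes (based on graphsage/minibatch.py): minibatch.adj is a dense
# [len(id_map)+1, max_degree] array of sampled neighbor ids built from training edges only,
# minibatch.test_adj is the same structure built from all edges, and minibatch.deg holds node degrees.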
adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")
if FLAGS.model == 'graphsage_mean':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
model_size=FLAGS.model_size,
identity_dim = FLAGS.identity_dim,
logging=True)
elif FLAGS.model == 'gcn':
# Create model
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="gcn",
model_size=FLAGS.model_size,
identity_dim = FLAGS.identity_dim,
concat=False,
logging=True)
elif FLAGS.model == 'graphsage_seq':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
identity_dim = FLAGS.identity_dim,
aggregator_type="seq",
model_size=FLAGS.model_size,
logging=True)
elif FLAGS.model == 'graphsage_maxpool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="maxpool",
model_size=FLAGS.model_size,
identity_dim = FLAGS.identity_dim,
logging=True)
elif FLAGS.model == 'graphsage_meanpool':
sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
model = SampleAndAggregate(placeholders,
features,
adj_info,
minibatch.deg,
layer_infos=layer_infos,
aggregator_type="meanpool",
model_size=FLAGS.model_size,
identity_dim = FLAGS.identity_dim,
logging=True)
elif FLAGS.model == 'n2v':
model = Node2VecModel(placeholders, features.shape[0],
minibatch.deg,
#2x because graphsage uses concat
nodevec_dim=2*FLAGS.dim_1,
lr=FLAGS.learning_rate)
else:
raise Exception('Error: model name unrecognized.')
config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
config.gpu_options.allow_growth = True
# with allow_growth, TF starts by allocating only a small amount of GPU memory and grows it as needed
#config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
# per_process_gpu_memory_fraction sets how much GPU memory each process may take;
# per_process_gpu_memory_fraction = 0.4 means 40%
config.allow_soft_placement = True  # if the specified device does not exist, let TF pick one automatically
# automatic device selection:
# with "with tf.device('/cpu:0'):" an op can be pinned to a device manually;
# if that device is missing or unavailable the program would hang or fail,
# so setting allow_soft_placement=True in tf.ConfigProto()
# lets TF fall back to a device that exists and is usable.
# Initialize session
sess = tf.Session(config=config)
merged = tf.summary.merge_all()  # merge all summaries so the training process and parameter distributions can be shown in TensorBoard
# merge_all collects every summary so it can be written to disk for TensorBoard display.
# Specify a file used to save the graph:
# tf.summary.FileWriter(path, sess.graph)
# its add_summary() method writes training data into the file specified by the FileWriter.
summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
# Init variables
sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})
# Train model
train_shadow_mrr = None
shadow_mrr = None
total_steps = 0
avg_time = 0.0
epoch_val_costs = []
train_adj_info = tf.assign(adj_info, minibatch.adj)
val_adj_info = tf.assign(adj_info, minibatch.test_adj)
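# These two assign ops let training switch adjacency information: sess.run(val_adj_info.op) swaps in
# the adjacency built from all edges before validating, and sess.run(train_adj_info.op) restores the
# training-only adjacency afterwards (see the validation branch below).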
for epoch in range(FLAGS.epochs):
minibatch.shuffle()
iter = 0
print('Epoch: %04d' % (epoch + 1))
epoch_val_costs.append(0)
while not minibatch.end():
# Construct feed dictionary
feed_dict = minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
t = time.time()
# Training step
outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all,
model.mrr, model.outputs1], feed_dict=feed_dict)
train_cost = outs[2]
train_mrr = outs[5]
if train_shadow_mrr is None:
train_shadow_mrr = train_mrr
else:
train_shadow_mrr -= (1-0.99) * (train_shadow_mrr - train_mrr)
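# Equivalent form: train_shadow_mrr = 0.99 * train_shadow_mrr + 0.01 * train_mrr,
# i.e. an exponential moving average of the training MRR with decay 0.99.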
if iter % FLAGS.validate_iter == 0:
# Validation
sess.run(val_adj_info.op)
val_cost, ranks, val_mrr, duration = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)
sess.run(train_adj_info.op)
epoch_val_costs[-1] += val_cost
if shadow_mrr is None:
shadow_mrr = val_mrr
else:
shadow_mrr -= (1-0.99) * (shadow_mrr - val_mrr)
if total_steps % FLAGS.print_every == 0:
summary_writer.add_summary(outs[0], total_steps)
# Print results
avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)
if total_steps % FLAGS.print_every == 0:
print("Iter:", '%04d' % iter,
"train_loss=", "{:.5f}".format(train_cost),
"train_mrr=", "{:.5f}".format(train_mrr),
"train_mrr_ema=", "{:.5f}".format(train_shadow_mrr), # exponential moving average
"val_loss=", "{:.5f}".format(val_cost),
"val_mrr=", "{:.5f}".format(val_mrr),
"val_mrr_ema=", "{:.5f}".format(shadow_mrr), # exponential moving average
"time=", "{:.5f}".format(avg_time))
iter += 1
total_steps += 1
if total_steps > FLAGS.max_total_steps:
break
if total_steps > FLAGS.max_total_steps:
break
print("Optimization Finished!")
if FLAGS.save_embeddings:  # whether to save node embeddings after training
sess.run(val_adj_info.op)
save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())
if FLAGS.model == "n2v":
# stopping the gradient for the already trained nodes
train_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],
dtype=tf.int32)
test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']],
dtype=tf.int32)
update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))
no_update_nodes = tf.nn.embedding_lookup(model.context_embeds,tf.squeeze(train_ids))
update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))
no_update_nodes = tf.stop_gradient(tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))
model.context_embeds = update_nodes + no_update_nodes
sess.run(model.context_embeds)
# run random walks
from graphsage.utils import run_random_walks
nodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]
start_time = time.time()
pairs = run_random_walks(G, nodes, num_walks=50)
walk_time = time.time() - start_time
test_minibatch = EdgeMinibatchIterator(G,
id_map,
placeholders, batch_size=FLAGS.batch_size,
max_degree=FLAGS.max_degree,
num_neg_samples=FLAGS.neg_sample_size,
context_pairs = pairs,
n2v_retrain=True,
fixed_n2v=True)
start_time = time.time()
print("Doing test training for n2v.")
test_steps = 0
for epoch in range(FLAGS.n2v_test_epochs):
test_minibatch.shuffle()
while not test_minibatch.end():
feed_dict = test_minibatch.next_minibatch_feed_dict()
feed_dict.update({placeholders['dropout']: FLAGS.dropout})
outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all,
model.mrr, model.outputs1], feed_dict=feed_dict)
if test_steps % FLAGS.print_every == 0:
print("Iter:", '%04d' % test_steps,
"train_loss=", "{:.5f}".format(outs[1]),
"train_mrr=", "{:.5f}".format(outs[-2]))
test_steps += 1
train_time = time.time() - start_time
save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")
print("Total time: ", train_time+walk_time)
print("Walk time: ", walk_time)
print("Train time: ", train_time)
# main: load the data and train
def main(argv=None):
print("Loading training data..")
train_data = load_data(FLAGS.train_prefix, load_walks=True)  # load_data is defined in graphsage.utils and loads the dataset (graph, features, id map, walks, labels)
print("Done loading training data..")
train(train_data)  # train() is defined above in this file: def train(train_data, test_data=None)
if __name__ == '__main__':
tf.app.run()  # parse the command-line flags and call main(sys.argv)
'''
tf.app.run() handles flag parsing and then executes the main function.
If the entry function is not called main() but, say, test(), pass it explicitly: tf.app.run(main=test).
If the entry function is called main(), tf.app.run() alone is enough.
Since FLAGS = tf.app.flags.FLAGS was created above, the flag definitions are already registered for parsing.
With argv=None, args = argv[1:] if argv else None yields args=None (i.e. nothing specified, so the command line is parsed automatically).
flags.FLAGS is the parser used to parse args; its parsing step handles the args list or the command-line input (the command line is used when the args list is empty), and the returned flags_passthrough list holds whatever could not be parsed (excluding the file name).
'''
from __future__ import print_function
import numpy as np
import random
import json
import sys
import os
import networkx as nx  # networkx: basic graph operations (creating graphs, etc.)
from networkx.readwrite import json_graph  # read/write networkx graphs in node-link JSON format
version_info = list(map(int, nx.__version__.split('.')))  # get the networkx version as a list of ints
major = version_info[0]  # number before the dot in the version string
minor = version_info[1]  # number after the dot in the version string
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # the networkx version must be <= 1.11, otherwise abort
WALK_LEN=5
N_WALKS=50
'''
type() vs isinstance():
type() does not treat an instance of a subclass as an instance of the parent class (inheritance is ignored).
isinstance() does treat an instance of a subclass as an instance of the parent class (inheritance is considered).
To check whether two types match, isinstance() is recommended.
Parameters:
object – the instance to check.
classinfo – a class name (direct or indirect), a basic type, or a tuple of these.
Return value:
True if the object's type matches classinfo, otherwise False.
'''
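# A small example of the difference:
#   class A: pass
#   class B(A): pass
#   isinstance(B(), A)   # True: isinstance follows inheritance
#   type(B()) == A       # False: type() compares the exact class only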
'''G.nodes() returns the graph's nodes n (and, with data=True, their attribute dicts nodedata).'''
def load_data(prefix, normalize=True, load_walks=False):
G_data = json.load(open(prefix + "-G.json"))  # the graph information is stored as a JSON file, so load it with the json module
G = json_graph.node_link_graph(G_data)  # return a graph from node-link data format
# define the conversion function:
# check whether G.nodes()[0] is an int (i.e. plain node ids without nodedata)
if isinstance(G.nodes()[0], int):
conversion = lambda n : int(n)  # lambda parameters : expression
else:
conversion = lambda n : n  # leave n unchanged
if os.path.exists(prefix + "-feats.npy"):  # if a pretrained feature file exists under this prefix
feats = np.load(prefix + "-feats.npy")
else:
print("No features present.. Only identity features will be used.")
feats = None
# a JSON-stored dict that maps graph node ids to consecutive integers
id_map = json.load(open(prefix + "-id_map.json"))  # load the one-to-one node-id -> index mapping
id_map = {conversion(k):int(v) for k,v in id_map.items()}
walks = []
class_map = json.load(open(prefix + "-class_map.json"))  # load the label data (a dict)
# print("class_map:",class_map )
#{"0": [1, 0, 0,...],...,"14754": [1, 1, 0, 0,...]}
if isinstance(list(class_map.values())[0], list):  # check whether the label values are lists (multi-label) or scalars
lab_conversion = lambda n : n
else:
lab_conversion = lambda n : int(n)  # convert the label to an int
class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()}
"""Rebuild the dicts with converted keys and values; in id_map's iteration the keys k are str and the values v int, so everything is converted to the proper integer types."""
# print("class_map:",class_map)
#{0: [1, 0, 0,...],...,14754: [1, 1, 0, 0,...]}
'''In the loop below, edge iterates over the edge list and edge[0], edge[1] are the two endpoints.
If at least one endpoint has val or test set, the edge's 'train_removed' attribute is set to True, otherwise False.
This guarantees that 'train_removed' is set for every edge.
'''
## Remove all nodes that do not have val/test annotations
## (necessary because of networkx weirdness with the Reddit data)
broken_count = 0
for node in G.nodes():
if not 'val' in G.node[node] or not 'test' in G.node[node]:
G.remove_node(node)
broken_count += 1
print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))#format():把传统的%替换为{}来实现格式化输出
'''G.edges() 得到edge_list, [( , ), ( , ), … ( , )]。list中每一个元素是所表示边的两个节点信息。若设置data = True,则会显示边的权重等属性信息。'''
## Make sure the graph has edge train_removed annotations
## (some datasets might already have this..)
print("Loaded data.. now preprocessing..")
for edge in G.edges():
if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
G[edge[0]][edge[1]]['train_removed'] = True
else:
G[edge[0]][edge[1]]['train_removed'] = False
# get the training features and standardize them
'''Nodes whose val and test flags are both False are selected as training nodes; their indices in the feature matrix are looked up via id_map and collected into train_ids, and train_feats gathers those rows of feats.'''
if normalize and not feats is None:  # if feats is not None, standardize the features using the training nodes
from sklearn.preprocessing import StandardScaler
train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])  # select nodes whose val and test flags are both False as training nodes
train_feats = feats[train_ids]  # gather the features of the training nodes
scaler = StandardScaler()
scaler.fit(train_feats)  # compute the mean and variance of the training features
feats = scaler.transform(feats)  # transform all features with the training mean/variance (standardization)
## standardize so each feature dimension has zero mean and unit variance, so that dimensions with large raw values do not dominate the result
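# In other words (sketch): feats <- (feats - mean_train) / std_train per feature dimension, where
# the mean and std come from the training-node rows only, so no statistics leak from val/test nodes.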
if load_walks:  # False by default
with open(prefix + "-walks.txt") as fp:
for line in fp:
walks.append(map(conversion, line.split()))
return G, feats, id_map, walks, class_map
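For reference, a minimal sketch of how the returned tuple lines up with the indexing used in train() earlier (the call and path below are assumptions for illustration):
G, feats, id_map, walks, class_map = load_data("./example_data/toy-ppi", load_walks=True)
# train_data[0] = G, train_data[1] = feats, train_data[2] = id_map, train_data[3] = walks (context pairs)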
from __future__ import print_function
import numpy as np
import random
import json
import sys
import os
import networkx as nx  # networkx: basic graph operations (creating graphs, etc.)
from networkx.readwrite import json_graph  # read/write networkx graphs in node-link JSON format
version_info = list(map(int, nx.__version__.split('.')))  # get the networkx version as a list of ints
WALK_LEN=5
N_WALKS=50
major = version_info[0]  # number before the dot in the version string
minor = version_info[1]  # number after the dot in the version string
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"  # the networkx version must be <= 1.11, otherwise abort
prefix = 'toy-ppi'
G_data = json.load(open('./toy-ppi-G.json'))  # the graph information is a JSON file, so load it with the json module
G = json_graph.node_link_graph(G_data)  # return a graph from node-link data format
if isinstance(G.nodes()[0], int):
conversion = lambda n : int(n)  # lambda parameters : expression
else:
conversion = lambda n : n  # leave n unchanged
print(G.nodes()[0])
0
print(G.nodes())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 14649, 14650, 14651, 14652, 14653, 14654, 14655, 14656, 14657, 14658, 14659, 14660, 14661, 14662, 14663, 14664, 14665, 14666, 14667, 14668, 14669, 14670, 14671, 14672, 14673, 14674, 14675, 14676, 14677, 14678, 14679, 14680, 14681, 14682, 14683, 14684, 14685, 14686, 14687, 14688, 14689, 14690, 14691, 14692, 14693, 14694, 14695, 14696, 14697, 14698, 14699, 14700, 14701, 14702, 14703, 14704, 14705, 14706, 14707, 14708, 14709, 14710, 14711, 14712, 14713, 14714, 14715, 14716, 14717, 14718, 14719, 14720, 14721, 14722, 14723, 14724, 14725, 14726, 14727, 14728, 14729, 14730, 14731, 14732, 14733, 14734, 14735, 14736, 14737, 14738, 14739, 14740, 14741, 14742, 14743, 14744, 14745, 14746, 14747, 14748, 14749, 14750, 14751, 14752, 14753, 14754]
if isinstance(G.nodes()[0], int):
conversion = lambda n : int(n)  # lambda parameters : expression
else:
conversion = lambda n : n  # leave n unchanged
print(conversion(1.0))
1
if os.path.exists(prefix + "-feats.npy"):  # if a pretrained feature file exists under this prefix
feats = np.load(prefix + "-feats.npy")
else:
print("No features present.. Only identity features will be used.")
feats = None
id_map = json.load(open(prefix + "-id_map.json"))  # load the one-to-one node-id -> index mapping
print(id_map)
print('*******************************************************')
id_map = {conversion(k):int(v) for k,v in id_map.items()}
print(id_map)
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, '10': 10, '11': 11, '12': 12, '13': 13, '14': 14, '15': 15, '16': 16, '17': 17, '18': 18, '19': 19, '20': 20, '21': 21, '22': 22, '23': 23, '24': 24, '25': 25, '26':