GraphSAGE (NIPS 2017) Code Analysis (TensorFlow Version)

Contents

    • Dataset
      • PPI dataset information
        • toy-ppi-G.json: graph information
        • toy-ppi-class_map.json
        • toy-ppi-id_map.json
        • toy-ppi-walks.txt
        • toy-ppi-feats.npy
    • Requirements
      • Installing Docker and running the program
      • Running
    • Code analysis
      • `__init__.py`
      • `utils.py`
      • `neigh_samplers.py`
      • `models.py`
      • `layers.py`
      • `minibatch.py`
      • `aggregators.py`
      • `prediction.py`
      • `supervised_train.py`
      • `unsupervised_train.py`
      • `inits.py`
    • Miscellaneous
      • `citation_eval.py`
      • `ppi_eval.py`
      • `reddit_eval.py`
    • References

Dataset

Dataset    #Graphs  #Nodes   #Edges     #Features  #Labels (y)
Cora       1        2708     5429       1433       7
Citeseer   1        3327     4732       3703       6
Pubmed     1        19717    44338      500        3
PPI        24       56944    818716     50         121
Reddit     1        232965   11606919   602        41
Nell       1        65755    266144     61278      210

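PPI dataset information

toy-ppi-G.json: graph information

{
  directed: false
  graph: { name: disjoint_union(,) }
  nodes: [
    {
      test: false
      id: 0
      features: [ ... ]
      val: false
      label: [ ... ]
    }
    {...}
    ...
  ]
  links: [
    {
      test_removed: false
      train_removed: false
      target: 800  # id of the node the link points to (by default from the smaller id to the larger id)
      source: 0    # links are listed in order, starting from node 0
    }
    {...}
    ...
  ]
}
  • name: disjoint_union(,) is the name of the graph.
  • toy-ppi-G.json contains only one graph (probably because node classification needs only a single graph, whereas graph classification needs several).
  • The graph is undirected and consists of a nodes collection and a links collection. Each is a list, and every node or link in it is stored as a dictionary.
  • The copy downloaded from GitHub may seem to have no links section. The data is there; the file is simply too large to be displayed in full. For example, the viewer only shows nodes up to 1883, while node ids run up to 14754.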
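The layout described above can be explored with the standard json module alone. The following is a minimal sketch on a hypothetical three-node miniature of the file (the field names follow the listing above; the values are made up for illustration), splitting nodes into training and held-out sets by their val/test flags:

```python
import json

# Hypothetical miniature of a node-link file; the real file holds thousands of entries.
raw = """
{
  "directed": false,
  "graph": {"name": "toy"},
  "nodes": [
    {"id": 0, "test": false, "val": false, "features": [0.1], "label": [1, 0]},
    {"id": 1, "test": false, "val": false, "features": [0.2], "label": [0, 1]},
    {"id": 2, "test": true,  "val": false, "features": [0.3], "label": [1, 1]}
  ],
  "links": [
    {"source": 0, "target": 1, "train_removed": false, "test_removed": false},
    {"source": 1, "target": 2, "train_removed": true,  "test_removed": false}
  ]
}
"""

data = json.loads(raw)
nodes = data["nodes"]   # list of dicts, one per node
links = data["links"]   # list of dicts, one per link

# Training nodes are those with both flags False; the rest are held out.
train_ids = [n["id"] for n in nodes if not n["val"] and not n["test"]]
held_out_ids = [n["id"] for n in nodes if n["val"] or n["test"]]

print("train:", train_ids, "held out:", held_out_ids, "links:", len(links))
```

Reading the file this way avoids rendering it in a browser or editor, which is where the truncated display comes from.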

toy-ppi-class_map.json

Format: {"0": [1, 0, 0, ...], ..., "14754": [1, 1, 0, 0, ...]}

toy-ppi-id_map.json

A one-to-one mapping from node id to index. Format: {"0": 0, "1": 1, ..., "14754": 14754}
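JSON object keys are always strings, which is why both maps need their keys converted before they can be matched against integer node ids. A minimal sketch on hypothetical three-entry maps (the literal contents are assumptions for illustration):

```python
import json

# Hypothetical miniature versions of the two map files.
raw_id_map = '{"0": 0, "1": 1, "2": 2}'
raw_class_map = '{"0": [1, 0], "1": [0, 1], "2": [1, 1]}'

# Keys arrive as str; convert them to int so they match integer node ids.
id_map = {int(k): int(v) for k, v in json.loads(raw_id_map).items()}
# Multi-label entries are lists and are kept as they are.
class_map = {int(k): v for k, v in json.loads(raw_class_map).items()}

print(id_map)
print(class_map)
```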

toy-ppi-walks.txt

0	708
0	3163
0	276
0	1789
...
1	15
1	1455
1	1327
1	317
1	63
1	1420
...
9715	7369
9715	8983
9715	6983
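  • Each line is a pair produced by a random walk that starts at the first node and reaches the second. Each node contributes 198 pairs, so duplicates can occur.
  • For example, "0 708" means that a walk starting at node 0 reached node 708.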
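A minimal sketch of reading this format, assuming whitespace-separated integer pairs as shown above. The helper name read_walk_pairs and the in-memory sample are hypothetical; the sample stands in for the real file:

```python
import io
from collections import Counter

# A few lines in the same format as toy-ppi-walks.txt (tab-separated pairs).
sample = "0\t708\n0\t3163\n0\t276\n1\t15\n1\t1455\n0\t708\n"

def read_walk_pairs(fp):
    """Parse 'source<TAB>target' lines into a list of (int, int) pairs."""
    pairs = []
    for line in fp:
        parts = line.split()
        if len(parts) != 2:      # skip blank or malformed lines
            continue
        pairs.append((int(parts[0]), int(parts[1])))
    return pairs

pairs = read_walk_pairs(io.StringIO(sample))
pairs_per_source = Counter(src for src, _ in pairs)

print(pairs[:3])
print(pairs_per_source)
```

Counting pairs per source node in this way is one quick check of how many pairs each node contributes and whether duplicates are present.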

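toy-ppi-feats.npy

Precomputed node features.

Two functions do most of the work when handling this kind of data:
(1) np.save("test.npy", data): store data
(2) data = np.load("test.npy"): load data
For example, storing a list:

import numpy as np

z = [[[1, 2, 3], ['w']], [[1, 2, 3], ['w']]]
# Ragged nested lists must be stored as an object array on recent NumPy versions.
np.save('test.npy', np.array(z, dtype=object))
# Object arrays are pickled; NumPy 1.16.3 and later require allow_pickle=True to load them.
x = np.load('test.npy', allow_pickle=True)

x:
->array([[list([1, 2, 3]), list(['w'])],
       [list([1, 2, 3]), list(['w'])]], dtype=object)

For example, storing a dictionary:

x
-> {0: 'wpy', 1: 'scg'}
np.save('test.npy', x)
x = np.load('test.npy', allow_pickle=True)
x
->array({0: 'wpy', 1: 'scg'}, dtype=object)

After loading data that was stored as a dictionary, call
data.item()
to convert the numpy.ndarray object back into a dict.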

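Requirements

  • The networkx version must be at most 1.11. I had version 2.3 installed, which caused an error.
  File "/home/yyl/桌面/Papers/GraphSage/GraphSAGE-master/graphsage/utils.py", line 20, in load_data
    G_data = json.load(open(prefix + "-G.json"))
FileNotFoundError: [Errno 2] No such file or directory: '-G.json'

Switching to version 1.11 fixes the version problem: pip install networkx==1.11
Note that the FileNotFoundError shown here means train_prefix was empty, since the path is built as prefix + "-G.json"; it goes away once --train_prefix is passed.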

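Installing Docker and running the program

Reference: https://www.cnblogs.com/shiyublog/p/9858786.html

Running

Run unsupervised_train.py from the command line:

python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
which is equivalent to
python ./graphsage/unsupervised_train.py  --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10

Note that the dataset path above differs from the one in the official instructions. When running in PyCharm, change the default values of parameters such as train_prefix and model, and keep in mind that parameters are specified differently in the IDE and on the command line. In the IDE: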

flags.DEFINE_string('model', 'graphsage_mean', 'model names. See README for possible values.')  
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi', 'prefix identifying training data. must be specified.')

Possible values of the model parameter

  • graphsage_mean – GraphSage with mean-based aggregator
  • graphsage_seq – GraphSage with LSTM-based aggregator
  • graphsage_maxpool – GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
  • graphsage_meanpool – GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
  • gcn – GraphSage with GCN-based aggregator
  • n2v – an implementation of DeepWalk (called n2v for short in the code.)
Loading training data..
Removed 0 nodes that lacked proper annotations due to networkx versioning issues
Loaded data.. now preprocessing..
Done loading training data..
Unexpected missing: 0
9716 train nodes
5039 test nodes
2019-11-05 20:49:37.036227: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 20:49:37.075335: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-11-05 20:49:37.076045: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: yyl-Z370-HD3
2019-11-05 20:49:37.076058: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: yyl-Z370-HD3
2019-11-05 20:49:37.076154: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 430.34.0
2019-11-05 20:49:37.076205: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 430.34.0
2019-11-05 20:49:37.076214: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 430.34.0
Epoch: 0001
2019-11-05 20:49:41.997221: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.001801: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.018491: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.024759: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0000 train_loss= 18.78112 train_mrr= 0.25004 train_mrr_ema= 0.25004 val_loss= 19.16457 val_mrr= 0.21838 val_mrr_ema= 0.21838 time= 1.29920
2019-11-05 20:49:42.517739: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0050 train_loss= 18.49755 train_mrr= 0.18471 train_mrr_ema= 0.22649 val_loss= 18.71147 val_mrr= 0.24803 val_mrr_ema= 0.20979 time= 0.14266
Iter: 0100 train_loss= 18.56170 train_mrr= 0.16419 train_mrr_ema= 0.21142 val_loss= 18.31936 val_mrr= 0.18485 val_mrr_ema= 0.21664 time= 0.12880
Iter: 0150 train_loss= 17.91184 train_mrr= 0.19551 train_mrr_ema= 0.20265 val_loss= 18.08941 val_mrr= 0.18543 val_mrr_ema= 0.21273 time= 0.12436
Iter: 0200 train_loss= 17.14944 train_mrr= 0.22213 train_mrr_ema= 0.19652 val_loss= 17.85653 val_mrr= 0.18931 val_mrr_ema= 0.20871 time= 0.12251
Iter: 0250 train_loss= 17.14205 train_mrr= 0.18210 train_mrr_ema= 0.19319 val_loss= 17.21363 val_mrr= 0.20572 val_mrr_ema= 0.20632 time= 0.12138
Iter: 0300 train_loss= 16.96532 train_mrr= 0.18807 train_mrr_ema= 0.19088 val_loss= 16.77304 val_mrr= 0.17903 val_mrr_ema= 0.20116 time= 0.12090
Iter: 0350 train_loss= 16.62087 train_mrr= 0.18385 train_mrr_ema= 0.18813 val_loss= 16.53304 val_mrr= 0.21569 val_mrr_ema= 0.21001 time= 0.12049
Iter: 0400 train_loss= 16.15347 train_mrr= 0.19938 train_mrr_ema= 0.18743 val_loss= 16.20924 val_mrr= 0.20107 val_mrr_ema= 0.20434 time= 0.11976
Iter: 0450 train_loss= 15.92187 train_mrr= 0.18782 train_mrr_ema= 0.18764 val_loss= 16.09361 val_mrr= 0.21507 val_mrr_ema= 0.20851 time= 0.11877
Iter: 0500 train_loss= 15.61726 train_mrr= 0.20567 train_mrr_ema= 0.18762 val_loss= 15.82525 val_mrr= 0.20445 val_mrr_ema= 0.20857 time= 0.11862
Iter: 0550 train_loss= 15.49840 train_mrr= 0.18336 train_mrr_ema= 0.18751 val_loss= 15.63526 val_mrr= 0.21188 val_mrr_ema= 0.20637 time= 0.11833
Iter: 0600 train_loss= 15.29428 train_mrr= 0.18559 train_mrr_ema= 0.18814 val_loss= 15.43966 val_mrr= 0.18432 val_mrr_ema= 0.20902 time= 0.11812
Iter: 0650 train_loss= 15.22701 train_mrr= 0.18361 train_mrr_ema= 0.18770 val_loss= 15.30805 val_mrr= 0.18943 val_mrr_ema= 0.20266 time= 0.11810
Iter: 0700 train_loss= 15.12402 train_mrr= 0.17775 train_mrr_ema= 0.18618 val_loss= 15.19020 val_mrr= 0.19980 val_mrr_ema= 0.20185 time= 0.11768
Iter: 0750 train_loss= 14.99245 train_mrr= 0.18521 train_mrr_ema= 0.18568 val_loss= 15.03517 val_mrr= 0.21548 val_mrr_ema= 0.20307 time= 0.11793
Iter: 0800 train_loss= 14.90705 train_mrr= 0.19703 train_mrr_ema= 0.18656 val_loss= 14.99781 val_mrr= 0.22909 val_mrr_ema= 0.20756 time= 0.11749
Iter: 0850 train_loss= 14.89672 train_mrr= 0.17020 train_mrr_ema= 0.18719 val_loss= 14.98761 val_mrr= 0.21074 val_mrr_ema= 0.21088 time= 0.11691
Iter: 0900 train_loss= 14.79581 train_mrr= 0.20458 train_mrr_ema= 0.18726 val_loss= 14.86050 val_mrr= 0.20545 val_mrr_ema= 0.20940 time= 0.11659
Iter: 0950 train_loss= 14.78135 train_mrr= 0.17666 train_mrr_ema= 0.18843 val_loss= 14.79613 val_mrr= 0.20859 val_mrr_ema= 0.20693 time= 0.11645
Iter: 1000 train_loss= 14.75321 train_mrr= 0.19232 train_mrr_ema= 0.18724 val_loss= 14.77326 val_mrr= 0.21229 val_mrr_ema= 0.20599 time= 0.11612
Optimization Finished!
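  • As the log shows, unsupervised_train.py ran for a single epoch of 1000 iterations, with validation every 10 iterations.
  • batch_size: 512
  • python -m graphsage.unsupervised_train runs the file as a module, so no file path is needed.
  • python ./graphsage/unsupervised_train.py runs the script file directly.

To run supervised_train.py, the value of the train_prefix parameter also needs to be changed: …/example_data/toy-ppi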

python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid
which is equivalent to
python ./graphsage/supervised_train.py --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid

If this error appears:

  File "/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/cluster/supervised.py", line 14, in <module>
    from scipy.misc import comb
ImportError: cannot import name 'comb'

then in lib\site-packages\sklearn\metrics\cluster\supervised.py change "from scipy.misc import comb" to "from scipy.special import comb".

Loading training data..
Removed 0 nodes that lacked proper annotations due to networkx versioning issues
Loaded data.. now preprocessing..
Done loading training data..
2019-11-05 21:37:11.763295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 21:37:11.770739: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-11-05 21:37:11.770767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: yyl-Z370-HD3
2019-11-05 21:37:11.770772: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: yyl-Z370-HD3
2019-11-05 21:37:11.770792: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 430.34.0
2019-11-05 21:37:11.770810: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 430.34.0
2019-11-05 21:37:11.770815: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 430.34.0
Epoch: 0001
2019-11-05 21:37:12.108357: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.112443: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1076: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
Iter: 0000 train_loss= 0.69289 train_f1_mic= 0.34109 train_f1_mac= 0.29393 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.38797
2019-11-05 21:37:12.349277: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.353043: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.411469: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0005 train_loss= 0.58413 train_f1_mic= 0.36509 train_f1_mac= 0.08875 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.11344
Iter: 0010 train_loss= 0.54682 train_f1_mic= 0.40012 train_f1_mac= 0.10253 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.08710
Iter: 0015 train_loss= 0.54642 train_f1_mic= 0.37319 train_f1_mac= 0.08884 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.07637
Epoch: 0002
..........
Epoch: 0010
Iter: 0004 train_loss= 0.48976 train_f1_mic= 0.49845 train_f1_mac= 0.27817 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05695
Iter: 0009 train_loss= 0.49231 train_f1_mic= 0.48284 train_f1_mac= 0.26799 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05683
Iter: 0014 train_loss= 0.48515 train_f1_mic= 0.50462 train_f1_mac= 0.28371 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05664
Optimization Finished!
Full validation stats: loss= 0.51017 f1_micro= 0.52730 f1_macro= 0.32655 time= 0.18505
Writing test set stats to file (don't peak!)

Note

/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1076: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)

Cause: some label has no true samples in the evaluated batch (it never occurs in y_true), so its recall is undefined and its F-score is set to 0.0.

/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

Cause: some label occurs in y_true but is never predicted in y_pred, so its precision is undefined and its F-score is set to 0.0.
For example:
y_true = (0, 1, 2, 3, 4)
y_pred = (0, 1, 1, 3, 4)
Label 2 is never predicted, so no F-score can be computed for it and it is treated as 0.0.
Because the average over all labels has to include this 0, scikit-learn raises a warning to point it out. It is a warning, not an error.
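To make the effect of such a zero on the averaged score concrete, here is a small pure-Python sketch that computes per-label F1 by hand, with undefined precision or recall treated as 0.0. The helper per_label_f1 is hypothetical and only mirrors the convention named in the warning; it is not scikit-learn's implementation.

```python
def per_label_f1(y_true, y_pred):
    """Per-label F1; undefined precision/recall are treated as 0.0."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)  # times the label was predicted
        actual = sum(1 for t in y_true if t == label)     # times the label truly occurs
        precision = tp / predicted if predicted else 0.0  # "no predicted samples" case
        recall = tp / actual if actual else 0.0           # "no true samples" case
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores

scores = per_label_f1((0, 1, 2, 3, 4), (0, 1, 1, 3, 4))
macro_f1 = sum(scores.values()) / len(scores)

print(scores)     # label 2 is never predicted, so its F1 is 0.0
print(macro_f1)   # the zero pulls the macro average down
```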

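Code analysis

__init__.py

from __future__ import print_function
# Even under Python 2.x, print must be called as a function with parentheses, as in Python 3.x.

from __future__ import division
# Imports the division feature (true division) from future Python versions.
# Without this import, the "/" operator performs truncating division on integers in Python 2.
# With it, "/" performs true division and "//" performs floor division.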
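A short illustration of the two operators. In Python 3 these semantics are the default; under Python 2 the same results require `from __future__ import division` at the top of the file. The variable names are only for illustration.

```python
# "/" is true division, "//" is floor division (rounds toward negative infinity).
true_quotient = 7 / 2      # 3.5
floor_quotient = 7 // 2    # 3
negative_floor = -7 // 2   # -4: floor division does not simply drop the fraction

print(true_quotient, floor_quotient, negative_floor)
```

The negative case shows that "//" floors the result rather than truncating toward zero.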

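utils.py

(1) Difference between type() and isinstance()
isinstance() checks whether an object is of a known type, similar to type().
isinstance(object, classinfo)
Parameters
object – the instance to check.
classinfo – a class, a built-in type, or a tuple of these.

Return value
True if the object is an instance of classinfo (or of one of its subclasses), otherwise False.

>>> a = 2
>>> isinstance(a, int)
True
>>> isinstance(a, str)
False
>>> isinstance(a, (str, int, list))    # True if it matches any type in the tuple
True

type() does not treat a subclass instance as an instance of the parent class; it ignores inheritance.
isinstance() treats a subclass instance as an instance of the parent class; it takes inheritance into account.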

class A:
    pass
 
class B(A):
    pass
 
isinstance(A(), A)    # returns True
type(A()) == A        # returns True
isinstance(B(), A)    # returns True
type(B()) == A        # returns False

(2) G.nodes()
Returns the nodes n of the graph, optionally together with their attribute dictionaries (nodedata).

    G_data = json.load(open(prefix + "-G.json"))
    G = json_graph.node_link_graph(G_data) #Return graph from node-link data format
    print("G.nodes():",G.nodes)
    # G.nodes(): <bound method Graph.nodes of <networkx.classes.graph.Graph object at 0x...>>

    print("list(G.nodes):",list(G))
    # list(G.nodes): [0, 1, 2, 3, 4,...,14754]

    # print("G.nodes.data():",G.nodes.data()) #AttributeError: 'function' object has no attribute 'data'

    print(list(G.nodes(data=True))) # with nodedata
    # Looks like: [(0, {'val': False, 'label': [...], ...}), (1, {...}), (2, {...}), ...]

    # Check whether G.nodes()[0] is an int, i.e. whether node ids are integers
    if isinstance(G.nodes()[0], int): # if so, convert n (e.g. a string key read from JSON) to int
        conversion = lambda n : int(n)
    else:
        conversion = lambda n : n

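(3) G.edges()
The loop iterates over the edges; each iteration takes one tuple from the list, and edge[0] and edge[1] are its two endpoints.
If at least one endpoint has val or test set to True, the edge's 'train_removed' is set to True; otherwise it is set to False.
This ensures that every edge has a 'train_removed' annotation.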
    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
            G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False

G.edges() returns the edge list [( , ), ( , ), … ( , )], where each element holds the two endpoints of an edge. With data=True, edge attributes such as weights are included as well.

>>> G = nx.Graph()   # or DiGraph, MultiGraph, MultiDiGraph, etc
>>> G.add_path([0,1,2])
>>> G.add_edge(2,3,weight=5)
>>> G.edges()
[(0, 1), (1, 2), (2, 3)]
>>> G.edges(data=True) # default edge data is {} (empty dictionary)
[(0, 1, {}), (1, 2, {}), (2, 3, {'weight': 5})]
>>> list(G.edges_iter(data='weight', default=1))
[(0, 1, 1), (1, 2, 1), (2, 3, 5)]
>>> G.edges([0,3])
[(0, 1), (3, 2)]
>>> G.edges(0)
[(0, 1)]
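The train_removed rule itself does not depend on networkx. The following sketch applies it to a hypothetical four-node graph stored in plain dictionaries (node_flags, edges and is_held_out are illustrative names, not part of the GraphSAGE code):

```python
# Hypothetical node annotations: node id -> val/test flags.
node_flags = {
    0: {"val": False, "test": False},
    1: {"val": False, "test": False},
    2: {"val": True,  "test": False},
    3: {"val": False, "test": True},
}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

def is_held_out(node):
    """A node is held out if it belongs to the validation or test set."""
    flags = node_flags[node]
    return flags["val"] or flags["test"]

# An edge is removed from training as soon as either endpoint is held out.
train_removed = {(u, v): is_held_out(u) or is_held_out(v) for u, v in edges}
train_edges = [edge for edge, removed in train_removed.items() if not removed]

print(train_removed)
print(train_edges)
```

Only edges between two training nodes remain available during training, so no information about validation or test nodes leaks through the graph structure.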

(4) Fetching the training features and standardizing them
Here "if not feats is None" is equivalent to "if feats is not None".
Nodes whose val and test flags are both False are selected as training data; id_map gives their row indices in the feature table, which are collected into the train_ids array.
train_feats then holds the features of these nodes, selected by the indices in train_ids.

    if normalize and not feats is None:
        from sklearn.preprocessing import StandardScaler
        train_ids = np.array([id_map[n] for n in G.nodes()
                              if not G.node[n]['val'] and not G.node[n]['test']])
        train_feats = feats[train_ids]
        scaler = StandardScaler()
        scaler.fit(train_feats)
        feats = scaler.transform(feats)

(5) Using StandardScaler
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Methods:

  • fit(X[, y]) : Compute the mean and std to be used for later scaling. Computes the per-column mean and standard deviation of the data.
  • transform(X[, y, copy]) : Perform standardization by centering and scaling. Returns the standardized matrix. fit must be called first; transform then applies the mean u and standard deviation s computed by fit: (x_i - u) / s.
  • fit_transform(X[, y]) : Fit to data, then transform it. Equivalent to calling fit followed by transform.
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]

# Worked out by hand:
# mean: [0.5, 0.5]
# variance: 1/4 * [(0 - 0.5)^2 * 2 + (1 - 0.5)^2 * 2] = 1/4 = 0.25
# standard deviation: 0.5
# standardizing [2, 2] with transform: (2 - 0.5) / 0.5 = 3
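The same arithmetic can be reproduced without scikit-learn. This sketch fits the statistics on the training rows only and then transforms every row, mirroring what load_data does with train_ids. The helpers fit_scaler and transform are hypothetical stand-ins. Replacing a zero standard deviation with 1.0 is an assumption made here to avoid division by zero.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Per-column mean and population standard deviation (divide by N)."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Assumption: a constant column gets scale 1.0 so it is left unscaled.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize each value as (x - mean) / std, column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

all_feats = [[0, 0], [0, 0], [1, 1], [1, 1], [2, 2]]
train_ids = [0, 1, 2, 3]                       # the last row plays a test node

means, stds = fit_scaler([all_feats[i] for i in train_ids])
scaled = transform(all_feats, means, stds)     # statistics come from train rows only

print(means, stds)
print(scaled)
```

Fitting on the training rows alone keeps test-node statistics out of the preprocessing step.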

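(6) Using map()
map(function, iterable, …)
map() applies the given function to every element of the given sequence.
The first argument, function, is called on each element in turn. In Python 2, map() returns a new list of the results; in Python 3 it returns a lazy iterator, so the examples below wrap it in list().
Example:

>>> def square(x):            # compute the square
...     return x ** 2
...
>>> list(map(square, [1,2,3,4,5]))   # square each element of the list
[1, 4, 9, 16, 25]
>>> list(map(lambda x: x ** 2, [1, 2, 3, 4, 5]))  # using an anonymous lambda function
[1, 4, 9, 16, 25]

# Two lists are given; elements at the same position are added together
>>> list(map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))
[3, 7, 11, 15, 19]

Full code of utils.py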

from __future__ import print_function

import numpy as np
import random
import json
import sys
import os

import networkx as nx
from networkx.readwrite import json_graph
version_info = list(map(int, nx.__version__.split('.')))
major = version_info[0]
minor = version_info[1]
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"

WALK_LEN=5
N_WALKS=50

def load_data(prefix, normalize=True, load_walks=False):
    G_data = json.load(open(prefix + "-G.json"))
    G = json_graph.node_link_graph(G_data) #Return graph from node-link data format
    print("G.nodes():",G.nodes)
    # G.nodes(): <bound method Graph.nodes of <networkx.classes.graph.Graph object at 0x...>>

    # print("list(G.nodes):",list(G))
    # list(G.nodes): [0, 1, 2, 3, 4,...,14754]

    # print("G.nodes.data():",G.nodes.data()) #AttributeError: 'function' object has no attribute 'data'

    # print(list(G.nodes(data=True))) # with nodedata
    # Looks like: [(0, {'val': False, 'label': [...], ...}), (1, {...}), (2, {...}), ...]

    # Define the conversion function
    # Check whether G.nodes()[0] is an int, i.e. whether node ids are integers
    if isinstance(G.nodes()[0], int): # if so, convert n to int
        conversion = lambda n : int(n)
    else:
        conversion = lambda n : n

    # Node features are stored in the .npy file; if it does not exist, feats = None
    if os.path.exists(prefix + "-feats.npy"):
        feats = np.load(prefix + "-feats.npy")
    else:
        print("No features present.. Only identity features will be used.")
        feats = None

    id_map = json.load(open(prefix + "-id_map.json"))
    # print("id_map:",id_map)
    # id_map :{'143': 143, '10758': 10758, '13438': 13438,..
    id_map = {conversion(k):int(v) for k,v in id_map.items()}   # in the loaded id_map the keys k are str and the values v are int; convert them all to integers
    # print("id_map:",id_map)
    #id_map: {0: 0, 1: 1, 2: 2...

    walks = []
    class_map = json.load(open(prefix + "-class_map.json"))
    # print("class_map:",class_map)
    #{"0": [1, 0, 0,...],...,"14754": [1, 1, 0, 0,...]}

    if isinstance(list(class_map.values())[0], list):
        lab_conversion = lambda n : n
    else:
        lab_conversion = lambda n : int(n)

    class_map = {conversion(k):lab_conversion(v) for k,v in class_map.items()} # as above: convert the keys (and scalar labels) of class_map to integers, producing a new class_map

    # print("class_map:",class_map)
    #{0: [1, 0, 0,...],...,14754: [1, 1, 0, 0,...]}


    # The nodes removed here are those that lack a 'val' or 'test' attribute, not those whose 'val' / 'test' value is None.
    # Note the difference between "if not 'val' in G.node[node]" and "if not G.node[n]['val']".
    ## Remove all nodes that do not have val/test annotations
    ## (necessary because of networkx weirdness with the Reddit data)
    broken_count = 0  # number of removed nodes that lack the val or test attribute
    for node in G.nodes():
        if not 'val' in G.node[node] or not 'test' in G.node[node]:
            G.remove_node(node)
            broken_count += 1
    print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))

    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
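    # The loop iterates over the edges; each iteration takes one tuple from the list, and edge[0] and edge[1] are its two endpoints.
    # If at least one endpoint has val or test set to True, the edge's 'train_removed' is set to True; otherwise False.
    # This ensures that every edge has a 'train_removed' annotation.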
    print("Loaded data.. now preprocessing..")
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
            G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False


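    # Fetch the training features and standardize them
    # Here "if not feats is None" is equivalent to "if feats is not None"
    if normalize and not feats is None:
        from sklearn.preprocessing import StandardScaler
        # Nodes whose val and test flags are both False are selected as training data; id_map gives their row indices in the feature table, collected into train_ids.
        # train_feats holds the features of these nodes, selected by the indices in train_ids.
        train_ids = np.array([id_map[n] for n in G.nodes() if not G.node[n]['val'] and not G.node[n]['test']])
        train_feats = feats[train_ids]
        ## Standardize so that each feature dimension has mean 0 and variance 1 on the training nodes, so that predictions are not dominated by dimensions with large values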
        scaler = StandardScaler()
        scaler.fit(train_feats)
        feats = scaler.transform(feats)

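    # If load_walks is True, load the walks
    if load_walks:
        with open(prefix + "-walks.txt") as fp:
            for line in fp:
                # map converts the node ids on the line with conversion
                # walks starts as []; each appended item is one node pair from the walks
                # list(...) materializes the pair, since map() returns a lazy iterator in Python 3
                walks.append(list(map(conversion, line.split())))

    # print("walks:",walks)
    # [[0, 708], [0, 3163], ...]
    print("len(walks):",len(walks))
    #len(walks): 1895817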


    return G, feats, id_map, walks, class_map

def run_random_walks(G, nodes, num_walks=N_WALKS):
    pairs = []
    for count, node in enumerate(nodes):
        if G.degree(node) == 0:
            continue
        for i in range(num_walks):
            curr_node = node
            for j in range(WALK_LEN):
                next_node = random.choice(G.neighbors(curr_node))
                # self co-occurrences are useless
                if curr_node != node:
                    pairs.append((node,curr_node))
                curr_node = next_node
        if count % 1000 == 0:
            print("Done walks for", count, "nodes")
    return pairs

if __name__ == "__main__":
    """ Run random walks """
    graph_file = sys.argv[1]
    out_file = sys.argv[2]
    G_data = json.load(open(graph_file))
    G = json_graph.node_link_graph(G_data)
    nodes = [n for n in G.nodes() if not G.node[n]["val"] and not G.node[n]["test"]]
    G = G.subgraph(nodes)
    pairs = run_random_walks(G, nodes)
    with open(out_file, "w") as fp:
        fp.write("\n".join([str(p[0]) + "\t" + str(p[1]) for p in pairs]))

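neigh_samplers.py

Defines a uniform sampler that samples from a node's neighbors.
Uniform: tf.random_shuffle shuffles only along dimension 0, i.e. it permutes rows. The adjacency matrix is therefore transposed before and after the shuffle, so that the neighbor columns are what gets permuted, which makes the following slice a "uniform" sample.
Sampling: the slice keeps the first num_samples columns, which amounts to randomly picking num_samples neighbors while keeping every row (one per input node).
The resulting adj_lists is the uniformly sampled matrix of neighbor information.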
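The transpose, shuffle, transpose, slice sequence can be mimicked in plain Python. This is only a sketch of the idea, assuming (as the class docstring states) that every adjacency row has been padded to the same width. The function sample_neighbors and the toy adj_info table are hypothetical.

```python
import random

def sample_neighbors(adj_info, ids, num_samples, rng=random):
    """Pick num_samples neighbor columns at random for the given node ids."""
    adj_lists = [adj_info[i] for i in ids]   # row lookup, like embedding_lookup
    columns = list(range(len(adj_lists[0])))
    rng.shuffle(columns)                     # one column permutation for the whole batch
    kept = columns[:num_samples]             # slice: keep the first num_samples columns
    return [[row[c] for c in kept] for row in adj_lists]

# Hypothetical padded adjacency table: row i lists neighbors of node i,
# with repeats where a node has fewer than four neighbors.
adj_info = [
    [1, 2, 3, 1],
    [0, 2, 0, 2],
    [0, 1, 3, 3],
    [0, 2, 2, 0],
]

sampled = sample_neighbors(adj_info, ids=[0, 2], num_samples=2, rng=random.Random(0))
print(sampled)
```

As in the TensorFlow version, the same column permutation is shared by all rows of a batch; the randomness per node comes from the padded rows having been filled by random re-sampling.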

Full code of neigh_samplers.py

from __future__ import division
from __future__ import print_function

from graphsage.layers import Layer

import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS


"""
Classes that are used to sample node neighborhoods
"""

class UniformNeighborSampler(Layer):
    """
    Uniformly samples neighbors.
    Assumes that adj lists are padded with random re-sampling
    """
    def __init__(self, adj_info, **kwargs):
        super(UniformNeighborSampler, self).__init__(**kwargs)
        self.adj_info = adj_info

    def _call(self, inputs):
        ids, num_samples = inputs
        adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)  # look up the adjacency rows for the given ids in adj_info
        adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
        adj_lists = tf.slice(adj_lists, [0,0], [-1, num_samples])
        return adj_lists

models.py

(1) namedtuple: a tuple whose type and fields have names
Example:

import collections

MyTupleClass = collections.namedtuple('MyTupleClass',['name', 'age', 'job'])
obj = MyTupleClass("Tomsom",12,'Cooker')
print(obj.name)# Tomsom
print(obj.age) # 12
print(obj.job) # Cooker


Person=collections.namedtuple('Person','name age gender')
# Field names are separated by spaces: this namedtuple has three fields

print( 'Type of Person:',type(Person)) # Type of Person: <class 'type'>
Bob=Person(name='Bob',age=30,gender='male') 
print( 'Representation:',Bob) #Representation: Person(name='Bob', age=30, gender='male')
Jane=Person(name='Jane',age=29,gender='female')
print( 'Field by Name:',Jane.name) # Field by Name: Jane
for people in [Bob,Jane]:
    print ("%s is %d years old %s" % people)
# Bob is 30 years old male
# Jane is 29 years old female


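# When using namedtuple, field names cannot be Python keywords such as class or def,
# and they cannot be repeated, e.g. 'age age'. Either case raises an error.
# In practice this cannot always be avoided:
# if the field names come from database records, for example, it is hard to guarantee that no Python keyword appears.
# The solution is to turn on namedtuple's rename mode,
# which automatically renames keywords and duplicate field names.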

with_class=collections.namedtuple('Person','name age class gender',rename=True)
print(with_class._fields)# ('name', 'age', '_2', 'gender')
two_ages=collections.namedtuple('Person','name age gender age',rename=True)
print(two_ages._fields)# ('name', 'age', 'gender', '_3')

# rename=True turns on renaming.
# In the first tuple type, class is renamed to '_2';
# in the second, the duplicate age is renamed to '_3'.
# A renamed field gets an underscore followed by the index of the field.

(2) Main functions of class SampleAndAggregate(GeneralizedModel)

  • def __init__(self, placeholders, features, adj, degrees, layer_infos, concat=True, aggregator_type="mean", model_size="small", identity_dim=0, **kwargs)
  • def sample(self, inputs, layer_infos, batch_size=None)
  • def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,aggregators=None, name=None, concat=False, model_size="small")
  • def _build(self)
  • def build(self)
  • def _loss(self)
  • def _accuracy(self)

(3) Classes and their inheritance

      Model 
     /   \
    /     \
  MLP   GeneralizedModel
          /  \
         /    \
Node2VecModel  SampleAndAggregate
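  • Model and GeneralizedModel differ in that Model's build() constructs a sequential layer model, whereas GeneralizedModel removes that part.
  • self.outputs must be assigned in build() of the GeneralizedModel subclasses.
  • The sequential layer model takes an input, passes it through layer() to get an output, feeds that output into the next layer(), and finally uses the result of the last layer as the output.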
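The sequential pattern in Model.build() boils down to a fold over the layer list. A minimal sketch with ordinary Python callables standing in for TensorFlow layers (build_sequential and the two toy layers are hypothetical):

```python
def build_sequential(inputs, layers):
    """Feed each layer the previous activation; keep every intermediate result."""
    activations = [inputs]
    for layer in layers:
        hidden = layer(activations[-1])
        activations.append(hidden)
    return activations

# Two toy "layers": double every value, then add one.
layers = [
    lambda xs: [v * 2 for v in xs],
    lambda xs: [v + 1 for v in xs],
]

activations = build_sequential([1, 2, 3], layers)
outputs = activations[-1]   # the last activation is the model output

print(activations)
print(outputs)
```

A GeneralizedModel subclass skips this loop and assigns its outputs itself, which is what allows the sample-then-aggregate structure instead of a plain chain.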

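(4) class SampleAndAggregate(GeneralizedModel)

  • Where self.features comes from in __init__()
para: features    tf.get_variable()-> identity features
     |                   |
self.features     self.embeds   --> At least one is not None
      \                 /       --> Concat if both are not None 
       \               /
        \             /
         self.features
  • self.dims in __init__():
    self.dims is a list whose entries record the dimension of each neural network layer.
    self.dims[0] is the number of columns of self.features: (0 if features is None else features.shape[1]) + identity_dim. (Note that features in this expression is the constructor argument, not self.features.)
    The remaining entries are the output_dim of each layer, i.e. the number of hidden units.

  • sample(inputs, layer_infos, batch_size=None)
    For a description of the sampling algorithm, see Appendix A, Algorithm 2 of the paper.

[Figure 1: illustration of the sampling process; samples[0], samples[1] and samples[2] appear as yellow, orange and pink regions]

sampler = layer_infos[t].neigh_sampler

layer_infos is populated by the caller. In unsupervised_train.py, neigh_sampler is set to UniformNeighborSampler, which is defined in neigh_samplers.py: class UniformNeighborSampler(Layer).

For the input samples[k], the sampler selects the ids of num_samples neighbors of each node (N(u) in the figure above). samples[k] holds the nodes obtained in the previous sampling step: the figure shows samples[0] (yellow), samples[1] (orange) and samples[2] (pink) in turn, where samples[k] is obtained by sampling the neighbors of the nodes in samples[k - 1]. The return value is adj_lists, the adjacency matrix truncated to num_samples columns.

Note the difference between support_size and num_samples:

num_samples is the number of neighbors selected for each node u at the current depth.

support_size is the number of nodes whose information influences the embedding of the current node u. The embedding is influenced by the num_samples direct neighbors at the current depth, each of those neighbors is influenced by num_samples neighbors at the previous depth, and so on. support_size is therefore the product of the num_samples values of all depths up to the current one. For batch_size input nodes, the total number of support nodes is support_size * batch_size.

Finally, support_size is appended to the support_sizes list.

sample() returns the samples list, which holds the sampled nodes at each depth, and the support_sizes list, which holds the number of supporting nodes per input node at each depth.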
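The relation between num_samples and support_size is a running product, which the following sketch makes explicit. LayerInfo, support_sizes_for and the sample counts 25 and 10 are illustrative assumptions, and the order in which layers are visited here is also an assumption; only the cumulative product is the point.

```python
from collections import namedtuple

# Hypothetical stand-in for one entry of layer_infos.
LayerInfo = namedtuple("LayerInfo", ["name", "num_samples"])

def support_sizes_for(layer_infos):
    """Running product of num_samples, starting from 1 for the input nodes."""
    support_size = 1
    sizes = [support_size]
    for info in layer_infos:
        support_size *= info.num_samples
        sizes.append(support_size)
    return sizes

layer_infos = [LayerInfo("hop_1", 25), LayerInfo("hop_2", 10)]
sizes = support_sizes_for(layer_infos)

batch_size = 512
# Number of node ids held at each depth for one minibatch.
nodes_per_depth = [size * batch_size for size in sizes]

print(sizes)
print(nodes_per_depth)
```

The product grows quickly with depth, which is why the number of sampled neighbors per hop is kept small.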

(5)tf.nn.fixed_unigram_candidate_sampler

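Samples according to a probability distribution supplied by the user.
If the classes are uniformly distributed, use uniform_candidate_sampler;
if the classes are words, which are known to follow a Zipfian distribution, use log_uniform_candidate_sampler;
if the class distribution is known from statistics or some other source, use nn.fixed_unigram_candidate_sampler;
if the class distribution is unknown, tf.nn.learned_unigram_candidate_sampler can be used.

(2) Parameters:
a. num_sampled and unique:

The elements of sampled_candidates are drawn without replacement (if unique = True)
or with replacement (if unique = False) from the base distribution.

unique = True corresponds to sampling without replacement; unique = False corresponds to sampling with replacement.

b. distortion:

distortion skews the unigram distribution using the word2vec frequency table formulation:
f^(3/4) / total(f^(3/4))
In word2vec, f is the word frequency;
in GraphSAGE, f is the node degree,
so in unigrams = [] the entry for each id records that node's degree.

c. unigrams: 

The degree of each node.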

(3) Returns:
a. sampled_candidates: A tensor of type int64 and shape [num_sampled]. The sampled classes.
b. true_expected_count: A tensor of type float. Same shape as true_classes. The expected counts under the sampling distribution of each of true_classes.
c. sampled_expected_count: A tensor of type float. Same shape as sampled_candidates. The expected counts under the sampling distribution of each of sampled_candidates.
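The effect of the distortion exponent can be checked by hand. This sketch computes the distorted distribution f^(3/4) / total(f^(3/4)) for a hypothetical list of node degrees; distorted_distribution is an illustrative helper, not a TensorFlow function.

```python
def distorted_distribution(unigrams, distortion=0.75):
    """Raise each count to the distortion power, then normalize to sum to 1."""
    weights = [count ** distortion for count in unigrams]
    total = sum(weights)
    return [w / total for w in weights]

degrees = [1, 16, 81]                          # hypothetical node degrees
probs = distorted_distribution(degrees)        # proportional to 1, 8, 27
raw = [d / sum(degrees) for d in degrees]      # plain degree-proportional sampling

print(probs)
print(raw)
```

Compared with sampling proportionally to the raw degree, the 3/4 power gives low-degree nodes a somewhat larger share of the negative samples and high-degree nodes a smaller one.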

Full code of models.py

from collections import namedtuple

import tensorflow as tf
import math

import graphsage.layers as layers
import graphsage.metrics as metrics

from .prediction import BipartiteEdgePredLayer
from .aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregator

flags = tf.app.flags
FLAGS = flags.FLAGS

# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package

class Model(object):
    def __init__(self, **kwargs):
        allowed_kwargs = {'name', 'logging', 'model_size'}
        for kwarg in kwargs.keys():
            assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
        name = kwargs.get('name')
        if not name:
            name = self.__class__.__name__.lower()
        self.name = name

        logging = kwargs.get('logging', False)
        self.logging = logging

        self.vars = {}
        self.placeholders = {}

        self.layers = []
        self.activations = []

        self.inputs = None
        self.outputs = None

        self.loss = 0
        self.accuracy = 0
        self.optimizer = None
        self.opt_op = None

    def _build(self):
        raise NotImplementedError

    def build(self):
        """ Wrapper for _build() """
        with tf.variable_scope(self.name):
            self._build()

        # Build sequential layer model
        self.activations.append(self.inputs)
        for layer in self.layers:
            hidden = layer(self.activations[-1])
            self.activations.append(hidden)
        self.outputs = self.activations[-1]
        # This sequential-layer block is removed in GeneralizedModel.build()
        
        
        # Store model variables for easy access
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
        self.vars = {var.name: var for var in variables}

        # Build metrics
        self._loss()
        self._accuracy()

        self.opt_op = self.optimizer.minimize(self.loss)

    def predict(self):
        pass

    def _loss(self):
        raise NotImplementedError

    def _accuracy(self):
        raise NotImplementedError

    def save(self, sess=None):
        if not sess:
            raise AttributeError("TensorFlow session not provided.")
        saver = tf.train.Saver(self.vars)
        save_path = saver.save(sess, "tmp/%s.ckpt" % self.name)
        print("Model saved in file: %s" % save_path)

    def load(self, sess=None):
        if not sess:
            raise AttributeError("TensorFlow session not provided.")
        saver = tf.train.Saver(self.vars)
        save_path = "tmp/%s.ckpt" % self.name
        saver.restore(sess, save_path)
        print("Model restored from file: %s" % save_path)


class MLP(Model):
    """ A standard multi-layer perceptron """
    def __init__(self, placeholders, dims, categorical=True, **kwargs):
        super(MLP, self).__init__(**kwargs)

        self.dims = dims
        self.input_dim = dims[0]
        self.output_dim = dims[-1]
        self.placeholders = placeholders
        self.categorical = categorical

        self.inputs = placeholders['features']
        self.labels = placeholders['labels']

        self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)

        self.build()

    def _loss(self):
        # Weight decay loss
        for var in self.layers[0].vars.values():
            self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)

        # Cross entropy error
        if self.categorical:
            self.loss += metrics.masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],
                    self.placeholders['labels_mask'])
        # L2
        else:
            diff = self.labels - self.outputs
            self.loss += tf.reduce_sum(tf.sqrt(tf.reduce_sum(diff * diff, axis=1)))

    def _accuracy(self):
        if self.categorical:
            self.accuracy = metrics.masked_accuracy(self.outputs, self.placeholders['labels'],
                    self.placeholders['labels_mask'])

    def _build(self):
        self.layers.append(layers.Dense(input_dim=self.input_dim,
                                 output_dim=self.dims[1],
                                 act=tf.nn.relu,
                                 dropout=self.placeholders['dropout'],
                                 sparse_inputs=False,
                                 logging=self.logging))

        self.layers.append(layers.Dense(input_dim=self.dims[1],
                                 output_dim=self.output_dim,
                                 act=lambda x: x,
                                 dropout=self.placeholders['dropout'],
                                 logging=self.logging))

    def predict(self):
        return tf.nn.softmax(self.outputs)

class GeneralizedModel(Model):
    """
    Base class for models that aren't constructed from traditional, sequential layers.
    Subclasses must set self.outputs in _build method

    (Removes the layers idiom from build method of the Model class)
    """

    def __init__(self, **kwargs):
        super(GeneralizedModel, self).__init__(**kwargs)
        

    def build(self):
        """ Wrapper for _build() """
        with tf.variable_scope(self.name):
            self._build()

        # Store model variables for easy access
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
        self.vars = {var.name: var for var in variables}

        # Build metrics
        self._loss()
        self._accuracy()

        self.opt_op = self.optimizer.minimize(self.loss)

# SAGEInfo is a namedtuple that specifies the parameters of the recursive GraphSAGE layers
SAGEInfo = namedtuple("SAGEInfo",
    ['layer_name', # name of the layer (to get feature embedding etc.)
     'neigh_sampler', # callable neigh_sampler constructor
     'num_samples',
     'output_dim' # the output (i.e., hidden) dimension
    ])

class SampleAndAggregate(GeneralizedModel):
    """
    Base implementation of unsupervised GraphSAGE
    """

    def __init__(self, placeholders, features, adj, degrees,
            layer_infos, concat=True, aggregator_type="mean", 
            model_size="small", identity_dim=0,
            **kwargs):
        '''
        Args:
            - placeholders: Standard TensorFlow placeholder object.
            - features: Numpy array with node features. 
                        NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
            - adj: Numpy array with adjacency lists (padded with random re-samples)
            - degrees: Numpy array with node degrees. 
            - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all 
                   the recursive layers. See SAGEInfo definition above.
            - concat: whether to concatenate during recursive iterations
            - aggregator_type: how to aggregate neighbor information
            - model_size: one of "small" and "big"
            - identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
        '''
        super(SampleAndAggregate, self).__init__(**kwargs)
        if aggregator_type == "mean":
            self.aggregator_cls = MeanAggregator
        elif aggregator_type == "seq":
            self.aggregator_cls = SeqAggregator
        elif aggregator_type == "maxpool":
            self.aggregator_cls = MaxPoolingAggregator
        elif aggregator_type == "meanpool":
            self.aggregator_cls = MeanPoolingAggregator
        elif aggregator_type == "gcn":
            self.aggregator_cls = GCNAggregator
        else:
            raise Exception("Unknown aggregator: ", aggregator_type)

        # get info from placeholders...
        self.inputs1 = placeholders["batch1"]
        self.inputs2 = placeholders["batch2"]
        self.model_size = model_size
        self.adj_info = adj
        if identity_dim > 0:
            self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
        else:
            self.embeds = None
        if features is None: 
            if identity_dim == 0:
                raise Exception("Must have a positive value for identity feature dimension if no input features given.")
            self.features = self.embeds
        else:
            self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)
            if not self.embeds is None:
                self.features = tf.concat([self.embeds, self.features], axis=1)
        self.degrees = degrees
        self.concat = concat

        self.dims = [(0 if features is None else features.shape[1]) + identity_dim]
        self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])
        self.batch_size = placeholders["batch_size"]
        self.placeholders = placeholders
        self.layer_infos = layer_infos

        self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)

        self.build()

    def sample(self, inputs, layer_infos, batch_size=None):
        """ Sample neighbors to be the supportive fields for multi-layer convolutions.

        Args:
            inputs: batch inputs
            batch_size: the number of inputs (different for batch inputs and negative samples).
        """
        
        if batch_size is None:
            batch_size = self.batch_size
        samples = [inputs]
        # size of convolution support at each layer per node
        support_size = 1
        support_sizes = [support_size]
        for k in range(len(layer_infos)):
            t = len(layer_infos) - k - 1
            support_size *= layer_infos[t].num_samples
            sampler = layer_infos[t].neigh_sampler
            node = sampler((samples[k], layer_infos[t].num_samples))
            samples.append(tf.reshape(node, [support_size * batch_size,]))
            support_sizes.append(support_size)
        return samples, support_sizes


    def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,
            aggregators=None, name=None, concat=False, model_size="small"):
        """ At each layer, aggregate hidden representations of neighbors to compute the hidden representations 
            at next layer.
        Args:
            samples: a list of samples of variable hops away for convolving at each layer of the
                network. Length is the number of layers + 1. Each is a vector of node indices.
            input_features: the input features for each sample of various hops away.
            dims: a list of dimensions of the hidden representations from the input layer to the
                final layer. Length is the number of layers + 1.
            num_samples: list of number of samples for each layer.
            support_sizes: the number of nodes to gather information from for each layer.
            batch_size: the number of inputs (different for batch inputs and negative samples).
        Returns:
            The hidden representation at the final layer for all nodes in batch
        """

        if batch_size is None:
            batch_size = self.batch_size

        # length: number of layers + 1
        hidden = [tf.nn.embedding_lookup(input_features, node_samples) for node_samples in samples]
        new_agg = aggregators is None
        if new_agg:
            aggregators = []
        for layer in range(len(num_samples)):
            if new_agg:
                dim_mult = 2 if concat and (layer != 0) else 1
                # aggregator at current layer
                if layer == len(num_samples) - 1:
                    aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1], act=lambda x : x,
                            dropout=self.placeholders['dropout'], 
                            name=name, concat=concat, model_size=model_size)
                else:
                    aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1],
                            dropout=self.placeholders['dropout'], 
                            name=name, concat=concat, model_size=model_size)
                aggregators.append(aggregator)
            else:
                aggregator = aggregators[layer]
            # hidden representation at current layer for all support nodes that are various hops away
            next_hidden = []
            # as layer increases, the number of support nodes needed decreases
            for hop in range(len(num_samples) - layer):
                dim_mult = 2 if concat and (layer != 0) else 1
                neigh_dims = [batch_size * support_sizes[hop], 
                              num_samples[len(num_samples) - hop - 1], 
                              dim_mult*dims[layer]]
                h = aggregator((hidden[hop],
                                tf.reshape(hidden[hop + 1], neigh_dims)))
                next_hidden.append(h)
            hidden = next_hidden
        return hidden[0], aggregators

    def _build(self):
        labels = tf.reshape(
                tf.cast(self.placeholders['batch2'], dtype=tf.int64),
                [self.batch_size, 1])
        self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
            true_classes=labels,
            num_true=1,
            num_sampled=FLAGS.neg_sample_size,
            unique=False,
            range_max=len(self.degrees),
            distortion=0.75,
            unigrams=self.degrees.tolist()))

           
        # perform "convolution"
        samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos)
        samples2, support_sizes2 = self.sample(self.inputs2, self.layer_infos)
        num_samples = [layer_info.num_samples for layer_info in self.layer_infos]
        self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples,
                support_sizes1, concat=self.concat, model_size=self.model_size)
        self.outputs2, _ = self.aggregate(samples2, [self.features], self.dims, num_samples,
                support_sizes2, aggregators=self.aggregators, concat=self.concat,
                model_size=self.model_size)

        neg_samples, neg_support_sizes = self.sample(self.neg_samples, self.layer_infos,
            FLAGS.neg_sample_size)
        self.neg_outputs, _ = self.aggregate(neg_samples, [self.features], self.dims, num_samples,
                neg_support_sizes, batch_size=FLAGS.neg_sample_size, aggregators=self.aggregators,
                concat=self.concat, model_size=self.model_size)

        dim_mult = 2 if self.concat else 1
        self.link_pred_layer = BipartiteEdgePredLayer(dim_mult*self.dims[-1],
                dim_mult*self.dims[-1], self.placeholders, act=tf.nn.sigmoid, 
                bilinear_weights=False,
                name='edge_predict')

        self.outputs1 = tf.nn.l2_normalize(self.outputs1, 1)
        self.outputs2 = tf.nn.l2_normalize(self.outputs2, 1)
        self.neg_outputs = tf.nn.l2_normalize(self.neg_outputs, 1)

    def build(self):
        self._build()

        # TF graph management
        self._loss()
        self._accuracy()
        self.loss = self.loss / tf.cast(self.batch_size, tf.float32)
        grads_and_vars = self.optimizer.compute_gradients(self.loss)
        clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var) 
                for grad, var in grads_and_vars]
        self.grad, _ = clipped_grads_and_vars[0]
        self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars)

    def _loss(self):
        for aggregator in self.aggregators:
            for var in aggregator.vars.values():
                self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)

        self.loss += self.link_pred_layer.loss(self.outputs1, self.outputs2, self.neg_outputs) 
        tf.summary.scalar('loss', self.loss)

    def _accuracy(self):
        # shape: [batch_size]
        aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
        # shape : [batch_size x num_neg_samples]
        self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
        self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])
        _aff = tf.expand_dims(aff, axis=1)
        self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
        size = tf.shape(self.aff_all)[1]
        _, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
        _, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
        self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
        tf.summary.scalar('mrr', self.mrr)


class Node2VecModel(GeneralizedModel):
    def __init__(self, placeholders, dict_size, degrees, name=None,
                 nodevec_dim=50, lr=0.001, **kwargs):
        """ Simple version of Node2Vec/DeepWalk algorithm.

        Args:
            dict_size: the total number of nodes.
            degrees: numpy array of node degrees, ordered as in the data's id_map
            nodevec_dim: dimension of the vector representation of node.
            lr: learning rate of optimizer.
        """

        super(Node2VecModel, self).__init__(**kwargs)

        self.placeholders = placeholders
        self.degrees = degrees
        self.inputs1 = placeholders["batch1"]
        self.inputs2 = placeholders["batch2"]

        self.batch_size = placeholders['batch_size']
        self.hidden_dim = nodevec_dim

        # following the tensorflow word2vec tutorial
        self.target_embeds = tf.Variable(
                tf.random_uniform([dict_size, nodevec_dim], -1, 1),
                name="target_embeds")
        self.context_embeds = tf.Variable(
                tf.truncated_normal([dict_size, nodevec_dim],
                stddev=1.0 / math.sqrt(nodevec_dim)),
                name="context_embeds")
        self.context_bias = tf.Variable(
                tf.zeros([dict_size]),
                name="context_bias")

        self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)

        self.build()

    def _build(self):
        labels = tf.reshape(
                tf.cast(self.placeholders['batch2'], dtype=tf.int64),
                [self.batch_size, 1])
        self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
            true_classes=labels,
            num_true=1,
            num_sampled=FLAGS.neg_sample_size,
            unique=True,
            range_max=len(self.degrees),
            distortion=0.75,
            unigrams=self.degrees.tolist()))

        self.outputs1 = tf.nn.embedding_lookup(self.target_embeds, self.inputs1)
        self.outputs2 = tf.nn.embedding_lookup(self.context_embeds, self.inputs2)
        self.outputs2_bias = tf.nn.embedding_lookup(self.context_bias, self.inputs2)
        self.neg_outputs = tf.nn.embedding_lookup(self.context_embeds, self.neg_samples)
        self.neg_outputs_bias = tf.nn.embedding_lookup(self.context_bias, self.neg_samples)

        self.link_pred_layer = BipartiteEdgePredLayer(self.hidden_dim, self.hidden_dim,
                self.placeholders, bilinear_weights=False)

    def build(self):
        self._build()
        # TF graph management
        self._loss()
        self._minimize()
        self._accuracy()

    def _minimize(self):
        self.opt_op = self.optimizer.minimize(self.loss)

    def _loss(self):
        aff = tf.reduce_sum(tf.multiply(self.outputs1, self.outputs2), 1) + self.outputs2_bias
        neg_aff = tf.matmul(self.outputs1, tf.transpose(self.neg_outputs)) + self.neg_outputs_bias
        true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.ones_like(aff), logits=aff)
        negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.zeros_like(neg_aff), logits=neg_aff)
        loss = tf.reduce_sum(true_xent) + tf.reduce_sum(negative_xent)
        self.loss = loss / tf.cast(self.batch_size, tf.float32)
        tf.summary.scalar('loss', self.loss)
        
    def _accuracy(self):
        # shape: [batch_size]
        aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)
        # shape : [batch_size x num_neg_samples]
        self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)
        self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])
        _aff = tf.expand_dims(aff, axis=1)
        self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])
        size = tf.shape(self.aff_all)[1]
        _, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)
        _, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)
        self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))
        tf.summary.scalar('mrr', self.mrr)
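
The two tf.nn.top_k calls in _accuracy() above amount to an argsort followed by its inverse permutation. The first sorts the columns of each row by descending affinity, and the second recovers the rank position of every column. Because the true pair's affinity is concatenated as the last column, ranks[:, -1] + 1 is its 1-based rank among the negatives. Below is a minimal pure-Python sketch of the same idea; the function names are illustrative and not part of GraphSAGE, and ties are not handled specially.

```python
def rank_of_last_column(row):
    """1-based rank of the last entry when the row is sorted descending."""
    # order[r] = column index that holds the r-th largest score
    order = sorted(range(len(row)), key=lambda j: -row[j])
    # inverse permutation: ranks[j] = position of column j inside `order`
    ranks = sorted(range(len(order)), key=lambda r: order[r])
    return ranks[-1] + 1

def mean_reciprocal_rank(aff_all):
    """aff_all: rows of [neg_aff_1, ..., neg_aff_n, true_aff]."""
    return sum(1.0 / rank_of_last_column(row) for row in aff_all) / len(aff_all)

aff_all = [
    [0.1, 0.9, 0.5],  # true pair (0.5) is beaten by one negative -> rank 2
    [0.2, 0.1, 0.8],  # true pair (0.8) scores highest            -> rank 1
]
mrr = mean_reciprocal_rank(aff_all)  # (1/2 + 1/1) / 2 = 0.75
```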

layers.py

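(1) _LAYER_UIDS = {}
_LAYER_UIDS = {} is a dictionary that maps each layer name to the number of times it has been instantiated.
In class Layer, when no name is given for the variable scope, this instantiation count is used to give each layer a distinct layer id.
Example:
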
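_LAYER_UIDS = {}

def get_layer_uid(layer_name=''):
    # Count how many times each layer name has been requested.
    _LAYER_UIDS[layer_name] = _LAYER_UIDS.get(layer_name, 0) + 1
    return _LAYER_UIDS[layer_name]

class Layer():
    def __init__(self):
        # The real Layer class lower-cases the class name; it is kept as-is here.
        layer = self.__class__.__name__
        name = layer + '_' + str(get_layer_uid(layer))
        print(name)

layer1 = Layer()
layer2 = Layer()

# Output:
# Layer_1
# Layer_2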

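(2) class Layer
class Layer defines the basic API shared by all layers.
Methods:

  • __init__(): accepts the name, logging and model_size keyword arguments and initializes the instance variables name, vars (an empty dict), logging and sparse_inputs.
  • _call(inputs): defines the layer's computation graph: it takes inputs and returns outputs.
  • __call__(inputs): a wrapper around _call(). It runs _call() inside tf.name_scope(self.name) and, when logging is enabled, uses tf.summary.histogram() to record histograms of the distribution of inputs and outputs.
  • _log_vars(): logs all variables by writing each entry of vars as a histogram.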
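
The split between __call__() and _call() is a template-method pattern: the base class owns the wrapper (scoping and logging) and subclasses override only the computation. Below is a minimal sketch without TensorFlow. SketchLayer, Double and the log list are hypothetical stand-ins, with the list playing the role of tf.summary.histogram().

```python
class SketchLayer:
    """Hypothetical stand-in for Layer: wrapper in __call__, work in _call."""

    def __init__(self, name, logging=False):
        self.name = name
        self.logging = logging
        self.log = []  # stand-in for tf.summary.histogram()

    def _call(self, inputs):
        # Default computation: identity, as in the base Layer class.
        return inputs

    def __call__(self, inputs):
        if self.logging:
            self.log.append((self.name + '/inputs', inputs))
        outputs = self._call(inputs)  # dispatches to the subclass override
        if self.logging:
            self.log.append((self.name + '/outputs', outputs))
        return outputs

class Double(SketchLayer):
    """Subclass that overrides only the computation."""

    def _call(self, inputs):
        return [2 * x for x in inputs]

layer = Double('double_1', logging=True)
result = layer([1, 2, 3])  # calling the object runs __call__, then _call
```

Here result is [2, 4, 6] and layer.log holds one 'double_1/inputs' entry followed by one 'double_1/outputs' entry, without Double containing any logging code of its own.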

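(3) class Dense
The Dense layer implements a basic fully connected layer: it computes act(xW + b), where the activation act defaults to ReLU.

  • __init__(): stores the constructor arguments as instance variables and creates the weights and bias variables. featureless is only stored and is never read in this file. num_features_nonzero is read from placeholders only when sparse_inputs=True; its comment describes it as a helper for sparse dropout, but _call() here does not use it.
  • _call(): applies dropout to the input, then computes and returns act(xW + b).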
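
To make the dense computation concrete, here is a minimal pure-Python sketch of act(xW + b) for a single input row, with dropout omitted. dense_forward, relu and the toy numbers are assumptions for illustration; the real layer uses tf.matmul on a batch of rows.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def dense_forward(x, weights, bias, act=relu):
    """x: list of length input_dim; weights: input_dim rows of output_dim."""
    output_dim = len(bias)
    out = [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(output_dim)
    ]
    return act(out)

x = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, -1.0]]
b = [0.0, 0.5]

hidden = dense_forward(x, W, b)                     # xW + b = [1.0, -2.5]
logits = dense_forward(x, W, b, act=lambda v: v)    # identity activation
```

With ReLU the negative entry is clipped, so hidden is [1.0, 0.0]. With the identity activation (as used for the last layer of MLP in models.py) logits stays [1.0, -2.5].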

Complete code of layers.py

from __future__ import division
from __future__ import print_function

import tensorflow as tf

from graphsage.inits import zeros

flags = tf.app.flags
FLAGS = flags.FLAGS

# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package

# global unique layer ID dictionary for layer name assignment
_LAYER_UIDS = {} # dictionary mapping each layer name to its instantiation count

# Purpose: in class Layer, when no name is given for the variable scope, the instantiation count is used to assign distinct layer ids
def get_layer_uid(layer_name=''):
    """Helper function, assigns unique layer IDs."""
    if layer_name not in _LAYER_UIDS:
        _LAYER_UIDS[layer_name] = 1  # first occurrence of layer_name: set its count to 1; otherwise increment it
        return 1
    else:
        _LAYER_UIDS[layer_name] += 1
        return _LAYER_UIDS[layer_name]

class Layer(object):
    """Base layer class. Defines basic API for all layer objects.
    Implementation inspired by keras (http://keras.io).
    # Properties
        name: String, defines the variable scope of the layer.
        logging: Boolean, switches Tensorflow histogram logging on/off

    # Methods
        _call(inputs): Defines computation graph of layer
            (i.e. takes input, returns output)
        __call__(inputs): Wrapper for _call()
        _log_vars(): Log all variables
    """

    def __init__(self, **kwargs):
        allowed_kwargs = {'name', 'logging', 'model_size'}
        for kwarg in kwargs.keys():
            assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwarg
        name = kwargs.get('name')
        if not name:
            layer = self.__class__.__name__.lower()
            name = layer + '_' + str(get_layer_uid(layer))
        self.name = name
        self.vars = {}
        logging = kwargs.get('logging', False)
        self.logging = logging
        self.sparse_inputs = False


    def _call(self, inputs):
        return inputs

    # Overrides Python's built-in __call__ so the object can be called like a function; __call__ takes inputs and returns outputs
    def __call__(self, inputs):
        with tf.name_scope(self.name):
            if self.logging and not self.sparse_inputs:
                tf.summary.histogram(self.name + '/inputs', inputs)
            outputs = self._call(inputs) # subclasses are callable too: __call__ dispatches to the subclass's _call()
            if self.logging:
                tf.summary.histogram(self.name + '/outputs', outputs)
            return outputs

    def _log_vars(self):
        for var in self.vars:
            tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])


class Dense(Layer):
    """Dense layer."""
    def __init__(self, input_dim, output_dim, dropout=0., 
                 act=tf.nn.relu, placeholders=None, bias=True, featureless=False, 
                 sparse_inputs=False, **kwargs):
        super(Dense, self).__init__(**kwargs)

        self.dropout = dropout

        self.act = act
        self.featureless = featureless
        self.bias = bias
        self.input_dim = input_dim
        self.output_dim = output_dim

        # helper variable for sparse dropout
        self.sparse_inputs = sparse_inputs
        if sparse_inputs:
            self.num_features_nonzero = placeholders['num_features_nonzero']

        with tf.variable_scope(self.name + '_vars'):
            self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim),
                                         dtype=tf.float32, 
                                         initializer=tf.contrib.layers.xavier_initializer(),
                                         regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay))
            if self.bias:
                self.vars['bias'] = zeros([output_dim], name='bias')

        if self.logging:
            self._log_vars()

    def _call(self, inputs):
        x = inputs

        x = tf.nn.dropout(x, 1-self.dropout)

        # transform
        output = tf.matmul(x, self.vars['weights'])

        # bias
        if self.bias:
            output += self.vars['bias']

        return self.act(output)
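
One detail in Dense._call() above is easy to misread: in TensorFlow 1.x the second positional argument of tf.nn.dropout is the keep probability, so 1-self.dropout converts the layer's drop rate into a keep rate. Kept entries are scaled by 1/keep_prob so that the expected value is unchanged. Below is a minimal pure-Python sketch of that behavior; the dropout function here is an illustrative stand-in, not TensorFlow's code.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each entry with probability keep_prob and rescale what is kept."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]

drop_rate = 0.5                                        # plays the role of self.dropout
train_out = dropout(x, 1 - drop_rate, rng)             # entries are 0.0 or 2 * value
eval_out = dropout(x, 1 - 0.0, rng)                    # drop rate 0 leaves x unchanged
```

Feeding a drop rate of 0 through the dropout placeholder therefore turns the layer into a plain act(xW + b) at evaluation time.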

minibatch.py
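
Both iterators in this file build a fixed-width adjacency matrix in construct_adj(). Every node gets exactly max_degree neighbor slots: longer neighbor lists are downsampled without replacement, shorter ones are resampled with replacement, and nodes without neighbors keep a dummy index that points at an extra last row. Below is a minimal pure-Python sketch of that scheme. build_adj and its list-of-lists input are assumptions for illustration, and the filtering of val/test nodes and train_removed edges done by the real code is left out.

```python
import random

def build_adj(neighbor_lists, max_degree, seed=123):
    """neighbor_lists[i] holds the neighbor indices of node i."""
    rng = random.Random(seed)
    num_nodes = len(neighbor_lists)
    # One extra row; every slot starts as the dummy index num_nodes.
    adj = [[num_nodes] * max_degree for _ in range(num_nodes + 1)]
    deg = [0] * num_nodes
    for idx, neighbors in enumerate(neighbor_lists):
        deg[idx] = len(neighbors)
        if len(neighbors) == 0:
            continue  # row keeps the dummy index
        if len(neighbors) > max_degree:
            # Too many neighbors: sample without replacement.
            row = rng.sample(neighbors, max_degree)
        elif len(neighbors) < max_degree:
            # Too few neighbors: resample with replacement.
            row = [rng.choice(neighbors) for _ in range(max_degree)]
        else:
            row = list(neighbors)
        adj[idx] = row
    return adj, deg

neighbor_lists = [[1, 2], [0], [0, 1, 3, 4, 5], [], [2], [2]]
adj, deg = build_adj(neighbor_lists, max_degree=3)
```

The fixed width is what lets a neighbor sampler pick neighbors with a simple row lookup followed by a column slice, at the cost of repeated entries for low-degree nodes.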

Complete code of minibatch.py

from __future__ import division
from __future__ import print_function

import numpy as np

np.random.seed(123)

class EdgeMinibatchIterator(object):
    
    """ This minibatch iterator iterates over batches of sampled edges or
    random pairs of co-occurring edges.

    G -- networkx graph
    id2idx -- dict mapping node ids to index in feature tensor
    placeholders -- tensorflow placeholders object
    context_pairs -- if not none, then a list of co-occurring node pairs (from random walks)
    batch_size -- size of the minibatches
    max_degree -- maximum size of the downsampled adjacency lists
    n2v_retrain -- signals that the iterator is being used to add new embeddings to a n2v model
    fixed_n2v -- signals that the iterator is being used to retrain n2v with only existing nodes as context
    """
    def __init__(self, G, id2idx, 
            placeholders, context_pairs=None, batch_size=100, max_degree=25,
            n2v_retrain=False, fixed_n2v=False,
            **kwargs):

        self.G = G
        self.nodes = G.nodes()
        self.id2idx = id2idx
        self.placeholders = placeholders
        self.batch_size = batch_size
        self.max_degree = max_degree
        self.batch_num = 0

        ## Both shuffle and permutation randomly reorder the elements of an array.
        # permutation does not operate on the original array; it returns a new shuffled array and leaves the original unchanged.
        self.nodes = np.random.permutation(G.nodes())
        self.adj, self.deg = self.construct_adj()
        self.test_adj = self.construct_test_adj()
        if context_pairs is None:
            edges = G.edges()
        else:
            edges = context_pairs
        self.train_edges = self.edges = np.random.permutation(edges)
        if not n2v_retrain:
            self.train_edges = self._remove_isolated(self.train_edges)
            self.val_edges = [e for e in G.edges() if G[e[0]][e[1]]['train_removed']]
        else:
            if fixed_n2v:
                self.train_edges = self.val_edges = self._n2v_prune(self.edges)
            else:
                self.train_edges = self.val_edges = self.edges

        print(len([n for n in G.nodes() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes')
        print(len([n for n in G.nodes() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes')
        self.val_set_size = len(self.val_edges)

    def _n2v_prune(self, edges):
        is_val = lambda n : self.G.node[n]["val"] or self.G.node[n]["test"]
        return [e for e in edges if not is_val(e[1])]

    def _remove_isolated(self, edge_list):
        new_edge_list = []
        missing = 0
        for n1, n2 in edge_list:
            if not n1 in self.G.node or not n2 in self.G.node:
                missing += 1
                continue
            if (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \
                    and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \
                    and (not self.G.node[n2]['test'] or self.G.node[n2]['val']):
                continue
            else:
                new_edge_list.append((n1,n2))
        print("Unexpected missing:", missing)
        return new_edge_list

    def construct_adj(self):
        adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
        deg = np.zeros((len(self.id2idx),))

        for nodeid in self.G.nodes():
            if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
                continue
            neighbors = np.array([self.id2idx[neighbor] 
                for neighbor in self.G.neighbors(nodeid)
                if (not self.G[nodeid][neighbor]['train_removed'])])
            deg[self.id2idx[nodeid]] = len(neighbors)
            if len(neighbors) == 0:
                continue
            if len(neighbors) > self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
            elif len(neighbors) < self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
            adj[self.id2idx[nodeid], :] = neighbors
        return adj, deg

    def construct_test_adj(self):
        adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
        for nodeid in self.G.nodes():
            neighbors = np.array([self.id2idx[neighbor] 
                for neighbor in self.G.neighbors(nodeid)])
            if len(neighbors) == 0:
                continue
            if len(neighbors) > self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
            elif len(neighbors) < self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
            adj[self.id2idx[nodeid], :] = neighbors
        return adj

    def end(self):
        return self.batch_num * self.batch_size >= len(self.train_edges)

    def batch_feed_dict(self, batch_edges):
        batch1 = []
        batch2 = []
        for node1, node2 in batch_edges:
            batch1.append(self.id2idx[node1])
            batch2.append(self.id2idx[node2])

        feed_dict = dict()
        feed_dict.update({self.placeholders['batch_size'] : len(batch_edges)})
        feed_dict.update({self.placeholders['batch1']: batch1})
        feed_dict.update({self.placeholders['batch2']: batch2})

        return feed_dict

    # Computes the start and end indices of the next edge minibatch, then passes those edges to batch_feed_dict(self, batch_edges), which fills the batch1, batch2 and batch_size placeholders
    def next_minibatch_feed_dict(self):
        start_idx = self.batch_num * self.batch_size
        self.batch_num += 1
        end_idx = min(start_idx + self.batch_size, len(self.train_edges))
        batch_edges = self.train_edges[start_idx : end_idx]
        return self.batch_feed_dict(batch_edges)

    def num_training_batches(self):
        return len(self.train_edges) // self.batch_size + 1

    def val_feed_dict(self, size=None):
        edge_list = self.val_edges
        if size is None:
            return self.batch_feed_dict(edge_list)
        else:
            ind = np.random.permutation(len(edge_list))
            val_edges = [edge_list[i] for i in ind[:min(size, len(ind))]]
            return self.batch_feed_dict(val_edges)

    def incremental_val_feed_dict(self, size, iter_num):
        edge_list = self.val_edges
        val_edges = edge_list[iter_num*size:min((iter_num+1)*size, 
            len(edge_list))]
        return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(self.val_edges), val_edges

    def incremental_embed_feed_dict(self, size, iter_num):
        node_list = self.nodes
        val_nodes = node_list[iter_num*size:min((iter_num+1)*size, 
            len(node_list))]
        val_edges = [(n,n) for n in val_nodes]
        return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(node_list), val_edges

    def label_val(self):
        train_edges = []
        val_edges = []
        for n1, n2 in self.G.edges():
            if (self.G.node[n1]['val'] or self.G.node[n1]['test'] 
                    or self.G.node[n2]['val'] or self.G.node[n2]['test']):
                val_edges.append((n1,n2))
            else:
                train_edges.append((n1,n2))
        return train_edges, val_edges

    # np.random.shuffle works in place and returns None; this method instead reassigns the new arrays returned by np.random.permutation
    def shuffle(self):
        """ Re-shuffle the training set.
            Also reset the batch number.
        """
        self.train_edges = np.random.permutation(self.train_edges)
        self.nodes = np.random.permutation(self.nodes)
        self.batch_num = 0

class NodeMinibatchIterator(object):
    
    """ 
    This minibatch iterator iterates over nodes for supervised learning.

    G -- networkx graph
    id2idx -- dict mapping node ids to integer values indexing feature tensor
    placeholders -- standard tensorflow placeholders object for feeding
    label_map -- map from node ids to class values (integer or list)
    num_classes -- number of output classes
    batch_size -- size of the minibatches
    max_degree -- maximum size of the downsampled adjacency lists
    """
    def __init__(self, G, id2idx, 
            placeholders, label_map, num_classes, 
            batch_size=100, max_degree=25,
            **kwargs):

        self.G = G
        self.nodes = G.nodes()
        self.id2idx = id2idx
        self.placeholders = placeholders
        self.batch_size = batch_size
        self.max_degree = max_degree
        self.batch_num = 0
        self.label_map = label_map
        self.num_classes = num_classes

        self.adj, self.deg = self.construct_adj()
        self.test_adj = self.construct_test_adj()

        self.val_nodes = [n for n in self.G.nodes() if self.G.node[n]['val']]
        self.test_nodes = [n for n in self.G.nodes() if self.G.node[n]['test']]

        self.no_train_nodes_set = set(self.val_nodes + self.test_nodes)
        self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set)
        # don't train on nodes that only have edges to test set
        self.train_nodes = [n for n in self.train_nodes if self.deg[id2idx[n]] > 0]

    def _make_label_vec(self, node):
        label = self.label_map[node]
        if isinstance(label, list):
            label_vec = np.array(label)
        else:
            label_vec = np.zeros((self.num_classes))
            class_ind = self.label_map[node]
            label_vec[class_ind] = 1
        return label_vec

    def construct_adj(self):
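        # adj stores, for each training node, the indices of its sampled neighbors.
        # Only max_degree neighbors are kept per node; the sampling scheme is shown below.
        # As before, one extra row is appended; every entry starts as the padding index len(id2idx).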
        adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))

        # deg stores the degree of each node
        deg = np.zeros((len(self.id2idx),))

        for nodeid in self.G.nodes():
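            # Neighbors are filtered as they are collected: from G.neighbors(nodeid),
            # keep only those whose connecting edge has train_removed == False,
            # which in effect keeps only neighbors that are not val/test nodes.
            # neighbors ends up as an array of neighbor indices.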
            if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:
                continue
            neighbors = np.array([self.id2idx[neighbor] 
                for neighbor in self.G.neighbors(nodeid)
                if (not self.G[nodeid][neighbor]['train_removed'])])


            # deg[i] is the degree of the node with index i,
            # i.e. the number of neighbors left after the filtering above
            deg[self.id2idx[nodeid]] = len(neighbors)
            if len(neighbors) == 0:
                continue
            if len(neighbors) > self.max_degree:  # np.random.choice draws an array of the given size
                neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
            elif len(neighbors) < self.max_degree:
                # after np.random.choice, every node has a fixed-size array of max_degree (25 by default in this class) direct neighbors
                neighbors = np.random.choice(neighbors, self.max_degree, replace=True)


            # write this node's neighbor array into the row of adj indexed by the node
            adj[self.id2idx[nodeid], :] = neighbors
        return adj, deg

    # construct_test_adj() differs from the above in that all neighbors are used directly, with no filtering by val/test/train_removed
    def construct_test_adj(self):
        adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))
        for nodeid in self.G.nodes():
            neighbors = np.array([self.id2idx[neighbor] 
                for neighbor in self.G.neighbors(nodeid)])
            if len(neighbors) == 0:
                continue
            if len(neighbors) > self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=False)
            elif len(neighbors) < self.max_degree:
                neighbors = np.random.choice(neighbors, self.max_degree, replace=True)
            adj[self.id2idx[nodeid], :] = neighbors
        return adj

    def end(self):
        return self.batch_num * self.batch_size >= len(self.train_nodes)

    # batch_feed_dict() builds the feed dict for a batch of nodes, so next_minibatch_feed_dict() returns the placeholder values for the next node minibatch
    def batch_feed_dict(self, batch_nodes, val=False):
        batch1id = batch_nodes
        batch1 = [self.id2idx[n] for n in batch1id]
              
        labels = np.vstack([self._make_label_vec(node) for node in batch1id])
        feed_dict = dict()
        feed_dict.update({self.placeholders['batch_size'] : len(batch1)})
        feed_dict.update({self.placeholders['batch']: batch1})
        feed_dict.update({self.placeholders['labels']: labels})

        return feed_dict, labels

    def node_val_feed_dict(self, size=None, test=False):
        if test:
            val_nodes = self.test_nodes
        else:
            val_nodes = self.val_nodes
        if not size is None:
            val_nodes = np.random.choice(val_nodes, size, replace=True)
        # add a dummy neighbor
        ret_val = self.batch_feed_dict(val_nodes)
        return ret_val[0], ret_val[1]

    def incremental_node_val_feed_dict(self, size, iter_num, test=False):
        if test:
            val_nodes = self.test_nodes
        else:
            val_nodes = self.val_nodes
        val_node_subset = val_nodes[iter_num*size:min((iter_num+1)*size, 
            len(val_nodes))]

        # add a dummy neighbor
        ret_val = self.batch_feed_dict(val_node_subset)
        return ret_val[0], ret_val[1], (iter_num+1)*size >= len(val_nodes), val_node_subset

    def num_training_batches(self):
        return len(self.train_nodes) // self.batch_size + 1

    def next_minibatch_feed_dict(self):
        start_idx = self.batch_num * self.batch_size
        self.batch_num += 1
        end_idx = min(start_idx + self.batch_size, len(self.train_nodes))
        batch_nodes = self.train_nodes[start_idx : end_idx]
        return self.batch_feed_dict(batch_nodes)

    def incremental_embed_feed_dict(self, size, iter_num):
        node_list = self.nodes
        val_nodes = node_list[iter_num*size:min((iter_num+1)*size, 
            len(node_list))]
        return self.batch_feed_dict(val_nodes), (iter_num+1)*size >= len(node_list), val_nodes

    def shuffle(self):
        """ Re-shuffle the training set.
            Also reset the batch number.
        """
        self.train_nodes = np.random.permutation(self.train_nodes)
        self.batch_num = 0
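
The fixed-width adjacency table built by construct_adj() can be tried out on a toy graph without networkx or TensorFlow. The following is only a minimal sketch under stated assumptions: `build_fixed_width_adj` and `toy_graph` are hypothetical names, the graph is given directly as a dict of neighbor indices, the val/test and train_removed filtering is left out, and an integer dtype is used where the original keeps NumPy's default float.

```python
import numpy as np

def build_fixed_width_adj(neighbor_lists, max_degree, rng):
    """Return (adj, deg) for a toy graph given as {node_index: [neighbor indices]}."""
    num_nodes = len(neighbor_lists)
    # One extra row; every entry starts as the padding index num_nodes.
    adj = num_nodes * np.ones((num_nodes + 1, max_degree), dtype=np.int64)
    deg = np.zeros(num_nodes, dtype=np.int64)
    for node, neighbors in neighbor_lists.items():
        neighbors = np.asarray(neighbors, dtype=np.int64)
        deg[node] = len(neighbors)
        if len(neighbors) == 0:
            continue  # the row keeps the padding index
        if len(neighbors) > max_degree:
            # Too many neighbors: subsample without replacement.
            neighbors = rng.choice(neighbors, max_degree, replace=False)
        elif len(neighbors) < max_degree:
            # Too few neighbors: resample with replacement to fill the row.
            neighbors = rng.choice(neighbors, max_degree, replace=True)
        adj[node, :] = neighbors
    return adj, deg

rng = np.random.RandomState(0)
toy_graph = {0: [1, 2, 3, 4], 1: [0], 2: [0, 3], 3: [0, 2], 4: []}
adj, deg = build_fixed_width_adj(toy_graph, max_degree=3, rng=rng)
print(adj.shape)  # (6, 3)
print(deg)
print(adj[4])     # isolated node: the row holds only the padding index 5
```

Every row has the same width, so a neighbor sampler can index the table with a batch of node indices in a single lookup. The padding index points one past the last real node, which is presumably why the feature matrix is extended by one dummy row elsewhere in the code.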

aggregators.py

This module mainly implements the following update rule:
$$\mathbf{h}^k_v \leftarrow \sigma(\mathbf{W} \cdot \mathrm{MEAN}(\lbrace \mathbf{h}^{k-1}_v \rbrace \cup \lbrace \mathbf{h}^{k-1}_u, \forall u \in \mathcal{N}(v) \rbrace))$$

(1) class GCNAggregator(Layer)
Its __init__() is essentially the same as MeanAggregator's; the difference lies in the implementation of _call():

def _call(self, inputs):
    self_vecs, neigh_vecs = inputs

    neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
    self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
    means = tf.reduce_mean(tf.concat([neigh_vecs, 
        tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)
   
    # [nodes] x [out_dim]
    output = tf.matmul(means, self.vars['weights'])

    # bias
    if self.bias:
        output += self.vars['bias']
   
    return self.act(output)

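means is computed as follows:

  • First, tf.expand_dims(self_vecs, axis=1) inserts a new axis at position 1 into self_vecs (this adds a dimension; it is not a transpose),
  • then, now that the leading dimension of self_vecs matches that of neigh_vecs, the two are concatenated along axis 1, which appends self_vecs to neigh_vecs as one extra entry,
  • finally, the result is averaged along axis 1, which gives means.
    means is then multiplied by the weight matrix vars['weights'] (tf.matmul), vars['bias'] is added if bias is enabled, and the result is passed through the activation function (ReLU by default).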

A small example illustrates this (the multiplication by W is omitted):

import tensorflow as tf

neigh_vecs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
self_vecs = [2, 3, 4]

means = tf.reduce_mean(tf.concat([neigh_vecs,
                                  tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)

print(tf.shape(self_vecs))

print(tf.expand_dims(self_vecs, axis=0))
# Tensor("ExpandDims_1:0", shape=(1, 3), dtype=int32)

print(tf.expand_dims(self_vecs, axis=1))
# Tensor("ExpandDims_2:0", shape=(3, 1), dtype=int32)

sess = tf.Session()
print(sess.run(tf.expand_dims(self_vecs, axis=1)))
# [[2]
#  [3]
#  [4]]

print(sess.run(tf.concat([neigh_vecs,
                          tf.expand_dims(self_vecs, axis=1)], axis=1)))
# [[1 2 3 2]
#  [4 5 6 3]
#  [7 8 9 4]]

print(means)
# Tensor("Mean:0", shape=(3,), dtype=int32)

print(sess.run(tf.reduce_mean(tf.concat([neigh_vecs,
                                         tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)))
# [2 4 7]

# [[1 2 3 2]   = 8 // 4  = 2
#  [4 5 6 3]   = 18 // 4 = 4
#  [7 8 9 4]]  = 28 // 4 = 7

bias = [1]
output = means + bias
print(sess.run(output))
# [3 5 8]
# [2 + 1, 4 + 1, 7 + 1] = [3, 5, 8]
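
The toy example above uses a 2-D integer neigh_vecs, so the mean is truncated (18 / 4 becomes 4). Judging by the reshape calls in the pooling aggregators, the real inputs appear to be float tensors of shape [batch, num_neighbors, dim] for neigh_vecs and [batch, dim] for self_vecs. The NumPy sketch below mirrors the same three steps with those shapes; the concrete sizes and values are made up for illustration.

```python
import numpy as np

batch_size, num_neighbors, dim = 2, 3, 4
# Hypothetical inputs; floats, so the mean is not truncated.
neigh_vecs = np.arange(batch_size * num_neighbors * dim,
                       dtype=np.float64).reshape(batch_size, num_neighbors, dim)
self_vecs = np.ones((batch_size, dim))

# Counterpart of tf.expand_dims(self_vecs, axis=1): [batch, dim] -> [batch, 1, dim]
self_expanded = np.expand_dims(self_vecs, axis=1)
# Counterpart of tf.concat(..., axis=1): the node joins its own neighbor set.
stacked = np.concatenate([neigh_vecs, self_expanded], axis=1)
# Counterpart of tf.reduce_mean(..., axis=1): average over the neighbor axis.
means = stacked.mean(axis=1)

print(self_expanded.shape)  # (2, 1, 4)
print(stacked.shape)        # (2, 4, 4)
print(means.shape)          # (2, 4)
print(means[0])
```

With these shapes the self vector is simply one more "neighbor" in the average, which is the set union inside MEAN in the formula at the top of this section.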

Full code of aggregators.py

import tensorflow as tf

from .layers import Layer, Dense
from .inits import glorot, zeros

class MeanAggregator(Layer):
    """
    Aggregates via mean followed by matmul and non-linearity.
    """
    # __init__() reads and initializes the members dropout, bias (False), act (ReLU), concat (False), input_dim, output_dim and name (variable scope)
    def __init__(self, input_dim, output_dim, neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, 
            name=None, concat=False, **kwargs):
        super(MeanAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''
        # glorot() initializes the weight matrix of node v, vars['self_weights'], and the weight matrix applied to the mean of its neighbors u, vars['neigh_weights']
        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['neigh_weights'] = glorot([neigh_input_dim, output_dim],
                                                        name='neigh_weights')
            self.vars['self_weights'] = glorot([input_dim, output_dim],
                                                        name='self_weights')
            # vars['bias'] is initialized as a zero vector
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        # if logging is True, call _log_vars() of class Layer in layers.py to write a histogram for each variable in vars
        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs


        #tf.nn.dropout(x, keep_prob, noise_shape=None, seed=None, name=None)
        # surviving (non-zero) elements are scaled by 1/keep_prob so that the expected sum is unchanged
        neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
        self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
        neigh_means = tf.reduce_mean(neigh_vecs, axis=1)
       
        # [nodes] x [out_dim]
        from_neighs = tf.matmul(neigh_means, self.vars['neigh_weights'])

        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
         
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)  # after concat the output dimension is doubled

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)




# __init__() is essentially the same as MeanAggregator's; _call() differs slightly
class GCNAggregator(Layer):
    """
    Aggregates via mean followed by matmul and non-linearity.
    Same matmul parameters are used self vector and neighbor vectors.
    """

    def __init__(self, input_dim, output_dim, neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(GCNAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''

        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['weights'] = glorot([neigh_input_dim, output_dim],
                                                        name='neigh_weights')
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs

        neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
        self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)

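        # means is computed as follows:
        # 1. tf.expand_dims(self_vecs, axis=1) inserts a new axis at position 1 into self_vecs (not a transpose),
        # 2. self_vecs and neigh_vecs are then concatenated along axis 1, appending self_vecs to neigh_vecs as one extra entry,
        # 3. the result is averaged along axis 1, which gives means.
        # means is then multiplied by the weight matrix vars['weights'] (tf.matmul),
        # vars['bias'] is added if enabled, and the result goes through the activation function (ReLU)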
        means = tf.reduce_mean(tf.concat([neigh_vecs, 
            tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)
       
        # [nodes] x [out_dim]
        output = tf.matmul(means, self.vars['weights'])

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)


class MaxPoolingAggregator(Layer):
    """ Aggregates via max-pooling over MLP functions.
    """
    def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(MaxPoolingAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''

        if model_size == "small":
            hidden_dim = self.hidden_dim = 512
        elif model_size == "big":
            hidden_dim = self.hidden_dim = 1024

        self.mlp_layers = []
        self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
                                 output_dim=hidden_dim,
                                 act=tf.nn.relu,
                                 dropout=dropout,
                                 sparse_inputs=False,
                                 logging=self.logging))

        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
                                                        name='neigh_weights')
           
            self.vars['self_weights'] = glorot([input_dim, output_dim],
                                                        name='self_weights')
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.neigh_input_dim = neigh_input_dim

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs
        neigh_h = neigh_vecs

        dims = tf.shape(neigh_h)
        batch_size = dims[0]
        num_neighbors = dims[1]
        # [nodes * sampled neighbors] x [hidden_dim]
        h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))

        for l in self.mlp_layers:
            h_reshaped = l(h_reshaped)
        neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
        neigh_h = tf.reduce_max(neigh_h, axis=1)
        
        from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
        
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)

class MeanPoolingAggregator(Layer):
    """ Aggregates via mean-pooling over MLP functions.
    """
    def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(MeanPoolingAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''

        if model_size == "small":
            hidden_dim = self.hidden_dim = 512
        elif model_size == "big":
            hidden_dim = self.hidden_dim = 1024

        self.mlp_layers = []
        self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
                                 output_dim=hidden_dim,
                                 act=tf.nn.relu,
                                 dropout=dropout,
                                 sparse_inputs=False,
                                 logging=self.logging))

        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
                                                        name='neigh_weights')
           
            self.vars['self_weights'] = glorot([input_dim, output_dim],
                                                        name='self_weights')
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.neigh_input_dim = neigh_input_dim

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs
        neigh_h = neigh_vecs

        dims = tf.shape(neigh_h)
        batch_size = dims[0]
        num_neighbors = dims[1]
        # [nodes * sampled neighbors] x [hidden_dim]
        h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))

        for l in self.mlp_layers:
            h_reshaped = l(h_reshaped)
        neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
        neigh_h = tf.reduce_mean(neigh_h, axis=1)
        
        from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
        
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)


class TwoMaxLayerPoolingAggregator(Layer):
    """ Aggregates via pooling over two MLP functions.
    """
    def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(TwoMaxLayerPoolingAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''

        if model_size == "small":
            hidden_dim_1 = self.hidden_dim_1 = 512
            hidden_dim_2 = self.hidden_dim_2 = 256
        elif model_size == "big":
            hidden_dim_1 = self.hidden_dim_1 = 1024
            hidden_dim_2 = self.hidden_dim_2 = 512

        self.mlp_layers = []
        self.mlp_layers.append(Dense(input_dim=neigh_input_dim,
                                 output_dim=hidden_dim_1,
                                 act=tf.nn.relu,
                                 dropout=dropout,
                                 sparse_inputs=False,
                                 logging=self.logging))
        self.mlp_layers.append(Dense(input_dim=hidden_dim_1,
                                 output_dim=hidden_dim_2,
                                 act=tf.nn.relu,
                                 dropout=dropout,
                                 sparse_inputs=False,
                                 logging=self.logging))


        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['neigh_weights'] = glorot([hidden_dim_2, output_dim],
                                                        name='neigh_weights')
           
            self.vars['self_weights'] = glorot([input_dim, output_dim],
                                                        name='self_weights')
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.neigh_input_dim = neigh_input_dim

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs
        neigh_h = neigh_vecs

        dims = tf.shape(neigh_h)
        batch_size = dims[0]
        num_neighbors = dims[1]
        # [nodes * sampled neighbors] x [hidden_dim]
        h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))

        for l in self.mlp_layers:
            h_reshaped = l(h_reshaped)
        neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim_2))
        neigh_h = tf.reduce_max(neigh_h, axis=1)
        
        from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
        
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)

class SeqAggregator(Layer):
    """ Aggregates via a standard LSTM.
    """
    def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
            dropout=0., bias=False, act=tf.nn.relu, name=None,  concat=False, **kwargs):
        super(SeqAggregator, self).__init__(**kwargs)

        self.dropout = dropout
        self.bias = bias
        self.act = act
        self.concat = concat

        if neigh_input_dim is None:
            neigh_input_dim = input_dim

        if name is not None:
            name = '/' + name
        else:
            name = ''

        if model_size == "small":
            hidden_dim = self.hidden_dim = 128
        elif model_size == "big":
            hidden_dim = self.hidden_dim = 256

        with tf.variable_scope(self.name + name + '_vars'):
            self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],
                                                        name='neigh_weights')
           
            self.vars['self_weights'] = glorot([input_dim, output_dim],
                                                        name='self_weights')
            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if self.logging:
            self._log_vars()

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.neigh_input_dim = neigh_input_dim
        self.cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs

        dims = tf.shape(neigh_vecs)
        batch_size = dims[0]
        initial_state = self.cell.zero_state(batch_size, tf.float32)
        used = tf.sign(tf.reduce_max(tf.abs(neigh_vecs), axis=2))
        length = tf.reduce_sum(used, axis=1)
        length = tf.maximum(length, tf.constant(1.))
        length = tf.cast(length, tf.int32)

        with tf.variable_scope(self.name) as scope:
            try:
                rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
                        self.cell, neigh_vecs,
                        initial_state=initial_state, dtype=tf.float32, time_major=False,
                        sequence_length=length)
            except ValueError:
                scope.reuse_variables()
                rnn_outputs, rnn_states = tf.nn.dynamic_rnn(
                        self.cell, neigh_vecs,
                        initial_state=initial_state, dtype=tf.float32, time_major=False,
                        sequence_length=length)
        batch_size = tf.shape(rnn_outputs)[0]
        max_len = tf.shape(rnn_outputs)[1]
        out_size = int(rnn_outputs.get_shape()[2])
        index = tf.range(0, batch_size) * max_len + (length - 1)
        flat = tf.reshape(rnn_outputs, [-1, out_size])
        neigh_h = tf.gather(flat, index)

        from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
         
        output = tf.add_n([from_self, from_neighs])

        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)

        # bias
        if self.bias:
            output += self.vars['bias']
       
        return self.act(output)
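
The data flow of MaxPoolingAggregator._call() (flatten, MLP, reshape, max over neighbors, two projections) can be followed in plain NumPy. This is only a sketch under assumptions: `max_pool_aggregate` and its weight arguments are hypothetical stand-ins for the Dense layer and vars[...] entries, the Dense layer is taken to be a single ReLU layer with a bias, and dropout is omitted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def max_pool_aggregate(self_vecs, neigh_vecs, mlp_w, mlp_b, neigh_w, self_w, concat=False):
    batch_size, num_neighbors, neigh_dim = neigh_vecs.shape
    # Flatten so the one-layer MLP sees every sampled neighbor as a row.
    flat = neigh_vecs.reshape(batch_size * num_neighbors, neigh_dim)
    hidden = relu(flat @ mlp_w + mlp_b)
    # Restore the neighbor axis and take the element-wise max over it.
    hidden = hidden.reshape(batch_size, num_neighbors, -1)
    pooled = hidden.max(axis=1)
    from_neighs = pooled @ neigh_w
    from_self = self_vecs @ self_w
    if concat:
        out = np.concatenate([from_self, from_neighs], axis=1)
    else:
        out = from_self + from_neighs
    return relu(out)

rng = np.random.RandomState(0)
batch, k, in_dim, hidden_dim, out_dim = 5, 10, 8, 16, 4
self_vecs = rng.randn(batch, in_dim)
neigh_vecs = rng.randn(batch, k, in_dim)
mlp_w = rng.randn(in_dim, hidden_dim)
mlp_b = np.zeros(hidden_dim)
neigh_w = rng.randn(hidden_dim, out_dim)
self_w = rng.randn(in_dim, out_dim)

summed = max_pool_aggregate(self_vecs, neigh_vecs, mlp_w, mlp_b, neigh_w, self_w)
concatenated = max_pool_aggregate(self_vecs, neigh_vecs, mlp_w, mlp_b, neigh_w, self_w, concat=True)
print(summed.shape)        # (5, 4)
print(concatenated.shape)  # (5, 8)
```

Because the max is taken over the neighbor axis, reordering the sampled neighbors does not change the result. MeanPoolingAggregator differs only in replacing the max with a mean, and SeqAggregator gives up this order invariance by feeding the neighbors to an LSTM.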

prediction.py

Full code of prediction.py

from __future__ import division
from __future__ import print_function

from graphsage.inits import zeros
from graphsage.layers import Layer
import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS


class BipartiteEdgePredLayer(Layer):
    def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid,
            loss_fn='xent', neg_sample_weights=1.0,
            bias=False, bilinear_weights=False, **kwargs):
        """
        Basic class that applies skip-gram-like loss
        (i.e., dot product of node+target and node and negative samples)
        Args:
            bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set to
                false, it is assumed that input dimensions are the same and the affinity will be 
                based on dot product.
        """
        super(BipartiteEdgePredLayer, self).__init__(**kwargs)
        self.input_dim1 = input_dim1
        self.input_dim2 = input_dim2
        self.act = act
        self.bias = bias
        self.eps = 1e-7

        # Margin for hinge loss
        self.margin = 0.1
        self.neg_sample_weights = neg_sample_weights

        self.bilinear_weights = bilinear_weights

        if dropout:
            self.dropout = placeholders['dropout']
        else:
            self.dropout = 0.

        # output a likelihood term
        self.output_dim = 1
        with tf.variable_scope(self.name + '_vars'):
            # bilinear form
            if bilinear_weights:
                #self.vars['weights'] = glorot([input_dim1, input_dim2],
                #                              name='pred_weights')
                self.vars['weights'] = tf.get_variable(
                        'pred_weights', 
                        shape=(input_dim1, input_dim2),
                        dtype=tf.float32, 
                        initializer=tf.contrib.layers.xavier_initializer())

            if self.bias:
                self.vars['bias'] = zeros([self.output_dim], name='bias')

        if loss_fn == 'xent':
            self.loss_fn = self._xent_loss
        elif loss_fn == 'skipgram':
            self.loss_fn = self._skipgram_loss
        elif loss_fn == 'hinge':
            self.loss_fn = self._hinge_loss

        if self.logging:
            self._log_vars()

    def affinity(self, inputs1, inputs2):
        """ Affinity score between batch of inputs1 and inputs2.
        Args:
            inputs1: tensor of shape [batch_size x feature_size].
        """
        # shape: [batch_size, input_dim1]
        if self.bilinear_weights:
            prod = tf.matmul(inputs2, tf.transpose(self.vars['weights']))
            self.prod = prod
            result = tf.reduce_sum(inputs1 * prod, axis=1)
        else:
            result = tf.reduce_sum(inputs1 * inputs2, axis=1)
        return result

    def neg_cost(self, inputs1, neg_samples, hard_neg_samples=None):
        """ For each input in batch, compute the sum of its affinity to negative samples.

        Returns:
            Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities to
                negative samples is computed.
        """
        if self.bilinear_weights:
            inputs1 = tf.matmul(inputs1, self.vars['weights'])
        neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples))
        return neg_aff

    def loss(self, inputs1, inputs2, neg_samples):
        """ negative sampling loss.
        Args:
            neg_samples: tensor of shape [num_neg_samples x input_dim2]. Negative samples for all
            inputs in batch inputs1.
        """
        return self.loss_fn(inputs1, inputs2, neg_samples)

    def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
        aff = self.affinity(inputs1, inputs2)
        neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
        true_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.ones_like(aff), logits=aff)
        negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.zeros_like(neg_aff), logits=neg_aff)
        loss = tf.reduce_sum(true_xent) + self.neg_sample_weights * tf.reduce_sum(negative_xent)
        return loss

    def _skipgram_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
        aff = self.affinity(inputs1, inputs2)
        neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
        neg_cost = tf.log(tf.reduce_sum(tf.exp(neg_aff), axis=1))
        loss = tf.reduce_sum(aff - neg_cost)
        return loss

    def _hinge_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):
        aff = self.affinity(inputs1, inputs2)
        neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)
        diff = tf.nn.relu(tf.subtract(neg_aff, tf.expand_dims(aff, 1) - self.margin), name='diff')
        loss = tf.reduce_sum(diff)
        self.neg_shape = tf.shape(neg_aff)
        return loss

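    def weights_norm(self):
        # NOTE: tf.nn provides l2_loss and l2_normalize but apparently no l2_norm,
        # so this method would likely raise AttributeError if it were ever called
        return tf.nn.l2_norm(self.vars['weights'])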
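
To see what _xent_loss() computes without building a TensorFlow graph, the dot-product branch (bilinear_weights=False) can be sketched in NumPy. The helper names `sigmoid_xent` and `xent_loss` are hypothetical; `sigmoid_xent` uses the numerically stable formula given in the documentation of tf.nn.sigmoid_cross_entropy_with_logits.

```python
import numpy as np

def sigmoid_xent(logits, labels):
    # Stable form of -labels*log(sigmoid(x)) - (1 - labels)*log(1 - sigmoid(x)).
    return np.maximum(logits, 0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def xent_loss(inputs1, inputs2, neg_samples, neg_sample_weights=1.0):
    # Affinity of each positive pair: row-wise dot product, shape [batch].
    aff = np.sum(inputs1 * inputs2, axis=1)
    # Affinity of every input to every negative sample, shape [batch, num_neg].
    neg_aff = inputs1 @ neg_samples.T
    true_xent = sigmoid_xent(aff, np.ones_like(aff))                # positives labelled 1
    negative_xent = sigmoid_xent(neg_aff, np.zeros_like(neg_aff))   # negatives labelled 0
    return true_xent.sum() + neg_sample_weights * negative_xent.sum()

inputs1 = np.array([[1.0, 0.0], [0.0, 1.0]])
neg_samples = np.array([[-1.0, -1.0]])
aligned_loss = xent_loss(inputs1, inputs1, neg_samples)     # positive pairs point the same way
opposed_loss = xent_loss(inputs1, -inputs1, neg_samples)    # positive pairs point opposite ways
print(aligned_loss < opposed_loss)  # True
```

The negative samples are shared by the whole batch, so a single matrix product yields all batch-by-negative affinities at once. The loss falls when positive pairs have a large dot product and negatives have a small one.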

supervised_train.py

Full code of supervised_train.py

from __future__ import division
from __future__ import print_function

import os
import time
import tensorflow as tf
import numpy as np
import sklearn
from sklearn import metrics

from graphsage.supervised_models import SupervisedGraphsage
from graphsage.models import SAGEInfo
from graphsage.minibatch import NodeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"

# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)

# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS

tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
#core params..
flags.DEFINE_string('model', 'graphsage_mean', 'model names. See README for possible values.')  
flags.DEFINE_float('learning_rate', 0.01, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi', 'prefix identifying training data. must be specified.')

# left to default values in main experiments 
flags.DEFINE_integer('epochs', 10, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 128, 'maximum node degree.')
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of samples in layer 2')
flags.DEFINE_integer('samples_3', 0, 'number of users samples in layer 3. (Only for mean model)')
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_boolean('sigmoid', False, 'whether to use sigmoid loss')
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')

#logging, saving, validation settings etc.
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 1, "which gpu to use.")
flags.DEFINE_integer('print_every', 5, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")

os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)

GPU_MEM_FRACTION = 0.8

def calc_f1(y_true, y_pred):
    if not FLAGS.sigmoid:
        y_true = np.argmax(y_true, axis=1)
        y_pred = np.argmax(y_pred, axis=1)
    else:
        y_pred[y_pred > 0.5] = 1
        y_pred[y_pred <= 0.5] = 0
    return metrics.f1_score(y_true, y_pred, average="micro"), metrics.f1_score(y_true, y_pred, average="macro")

# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
    t_test = time.time()
    feed_dict_val, labels = minibatch_iter.node_val_feed_dict(size)
    node_outs_val = sess.run([model.preds, model.loss], 
                        feed_dict=feed_dict_val)
    mic, mac = calc_f1(labels, node_outs_val[0])
    return node_outs_val[1], mic, mac, (time.time() - t_test)

def log_dir():
    log_dir = FLAGS.base_log_dir + "/sup-" + FLAGS.train_prefix.split("/")[-2]
    log_dir += "/{model:s}_{model_size:s}_{lr:0.4f}/".format(
            model=FLAGS.model,
            model_size=FLAGS.model_size,
            lr=FLAGS.learning_rate)
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    return log_dir

def incremental_evaluate(sess, model, minibatch_iter, size, test=False):
    t_test = time.time()
    finished = False
    val_losses = []
    val_preds = []
    labels = []
    iter_num = 0
    finished = False
    while not finished:
        feed_dict_val, batch_labels, finished, _  = minibatch_iter.incremental_node_val_feed_dict(size, iter_num, test=test)
        node_outs_val = sess.run([model.preds, model.loss], 
                         feed_dict=feed_dict_val)
        val_preds.append(node_outs_val[0])
        labels.append(batch_labels)
        val_losses.append(node_outs_val[1])
        iter_num += 1
    val_preds = np.vstack(val_preds)
    labels = np.vstack(labels)
    f1_scores = calc_f1(labels, val_preds)
    return np.mean(val_losses), f1_scores[0], f1_scores[1], (time.time() - t_test)

def construct_placeholders(num_classes):
    # Define placeholders
    placeholders = {
        'labels' : tf.placeholder(tf.float32, shape=(None, num_classes), name='labels'),
        'batch' : tf.placeholder(tf.int32, shape=(None), name='batch1'),
        'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
        'batch_size' : tf.placeholder(tf.int32, name='batch_size'),
    }
    return placeholders

def train(train_data, test_data=None):

    G = train_data[0]
    features = train_data[1]
    id_map = train_data[2]
    class_map  = train_data[4]
    if isinstance(list(class_map.values())[0], list):
        num_classes = len(list(class_map.values())[0])
    else:
        num_classes = len(set(class_map.values()))

    if not features is None:
        # pad with dummy zero vector
        features = np.vstack([features, np.zeros((features.shape[1],))])

    context_pairs = train_data[3] if FLAGS.random_context else None
    placeholders = construct_placeholders(num_classes)
    minibatch = NodeMinibatchIterator(G, 
            id_map,
            placeholders, 
            class_map,
            num_classes,
            batch_size=FLAGS.batch_size,
            max_degree=FLAGS.max_degree, 
            context_pairs = context_pairs)
    adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
    adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")

    if FLAGS.model == 'graphsage_mean':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        if FLAGS.samples_3 != 0:
            layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                                SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2),
                                SAGEInfo("node", sampler, FLAGS.samples_3, FLAGS.dim_2)]
        elif FLAGS.samples_2 != 0:
            layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                                SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]
        else:
            layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1)]

        model = SupervisedGraphsage(num_classes, placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos, 
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)
    elif FLAGS.model == 'gcn':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]

        model = SupervisedGraphsage(num_classes, placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="gcn",
                                     model_size=FLAGS.model_size,
                                     concat=False,
                                     sigmoid_loss = FLAGS.sigmoid,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    elif FLAGS.model == 'graphsage_seq':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SupervisedGraphsage(num_classes, placeholders, 
                                     features,
                                     adj_info,
                                     minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="seq",
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    elif FLAGS.model == 'graphsage_maxpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SupervisedGraphsage(num_classes, placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="maxpool",
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    elif FLAGS.model == 'graphsage_meanpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                            SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SupervisedGraphsage(num_classes, placeholders, 
                                    features,
                                    adj_info,
                                    minibatch.deg,
                                     layer_infos=layer_infos, 
                                     aggregator_type="meanpool",
                                     model_size=FLAGS.model_size,
                                     sigmoid_loss = FLAGS.sigmoid,
                                     identity_dim = FLAGS.identity_dim,
                                     logging=True)

    else:
        raise Exception('Error: model name unrecognized.')

    config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
    config.gpu_options.allow_growth = True
    #config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION
    config.allow_soft_placement = True
    
    # Initialize session
    sess = tf.Session(config=config)
    merged = tf.summary.merge_all()
    summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)
     
    # Init variables
    sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})
    
    # Train model
    
    total_steps = 0
    avg_time = 0.0
    epoch_val_costs = []

    train_adj_info = tf.assign(adj_info, minibatch.adj)
    val_adj_info = tf.assign(adj_info, minibatch.test_adj)
    for epoch in range(FLAGS.epochs): 
        minibatch.shuffle() 

        iter = 0
        print('Epoch: %04d' % (epoch + 1))
        epoch_val_costs.append(0)
        while not minibatch.end():
            # Construct feed dictionary
            feed_dict, labels = minibatch.next_minibatch_feed_dict()
            feed_dict.update({placeholders['dropout']: FLAGS.dropout})

            t = time.time()
            # Training step
            outs = sess.run([merged, model.opt_op, model.loss, model.preds], feed_dict=feed_dict)
            train_cost = outs[2]

            if iter % FLAGS.validate_iter == 0:
                # Validation
                sess.run(val_adj_info.op)
                if FLAGS.validate_batch_size == -1:
                    val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)
                else:
                    val_cost, val_f1_mic, val_f1_mac, duration = evaluate(sess, model, minibatch, FLAGS.validate_batch_size)
                sess.run(train_adj_info.op)
                epoch_val_costs[-1] += val_cost

            if total_steps % FLAGS.print_every == 0:
                summary_writer.add_summary(outs[0], total_steps)
    
            # Print results
            avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)

            if total_steps % FLAGS.print_every == 0:
                train_f1_mic, train_f1_mac = calc_f1(labels, outs[-1])
                print("Iter:", '%04d' % iter, 
                      "train_loss=", "{:.5f}".format(train_cost),
                      "train_f1_mic=", "{:.5f}".format(train_f1_mic), 
                      "train_f1_mac=", "{:.5f}".format(train_f1_mac), 
                      "val_loss=", "{:.5f}".format(val_cost),
                      "val_f1_mic=", "{:.5f}".format(val_f1_mic), 
                      "val_f1_mac=", "{:.5f}".format(val_f1_mac), 
                      "time=", "{:.5f}".format(avg_time))
 
            iter += 1
            total_steps += 1

            if total_steps > FLAGS.max_total_steps:
                break

        if total_steps > FLAGS.max_total_steps:
            break
    
    print("Optimization Finished!")
    sess.run(val_adj_info.op)
    val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)
    print("Full validation stats:",
                  "loss=", "{:.5f}".format(val_cost),
                  "f1_micro=", "{:.5f}".format(val_f1_mic),
                  "f1_macro=", "{:.5f}".format(val_f1_mac),
                  "time=", "{:.5f}".format(duration))
    with open(log_dir() + "val_stats.txt", "w") as fp:
        fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f} time={:.5f}".
                format(val_cost, val_f1_mic, val_f1_mac, duration))

    print("Writing test set stats to file (don't peak!)")
    val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size, test=True)
    with open(log_dir() + "test_stats.txt", "w") as fp:
        fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f}".
                format(val_cost, val_f1_mic, val_f1_mac))

def main(argv=None):
    print("Loading training data..")
    train_data = load_data(FLAGS.train_prefix)
    print("Done loading training data..")
    train(train_data)

if __name__ == '__main__':
    tf.app.run()
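
`calc_f1` above reports both micro- and macro-averaged F1. The following hand-rolled sketch (plain Python, not scikit-learn's implementation; the helper names and toy labels are invented for illustration) shows how the two averages differ on multi-label 0/1 data, which corresponds to the `sigmoid=True` branch after thresholding at 0.5.

```python
def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length rows of 0/1 labels."""
    num_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) triple per label column
    for k in range(num_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        counts.append((tp, fp, fn))
    # micro: pool the counts over all labels, then compute a single F1
    micro = _f1(sum(c[0] for c in counts),
                sum(c[1] for c in counts),
                sum(c[2] for c in counts))
    # macro: compute F1 per label, then take the unweighted mean
    macro = sum(_f1(*c) for c in counts) / num_labels
    return micro, macro


# Label 0 is frequent and predicted well; label 1 is rare and predicted badly.
y_true = [[1, 0], [1, 0], [1, 0], [0, 1]]
y_pred = [[1, 0], [1, 0], [1, 1], [0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

Micro-F1 is dominated by the frequent label, while macro-F1 weights every label equally, so a poorly predicted rare label pulls macro-F1 down much more.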

unsupervised_train.py

Full code of unsupervised_train.py

from __future__ import division
from __future__ import print_function

import os
import time
import tensorflow as tf
import numpy as np

from graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_data

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)

# Settings
flags = tf.app.flags
FLAGS = flags.FLAGS

tf.app.flags.DEFINE_boolean('log_device_placement', False,
                            """Whether to log device placement.""")
# core params..
flags.DEFINE_string('model', 'graphsage_maxpool', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi',
                    'name of the object file that stores the training data. must be specified.')

# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')

# Corresponds to the paper: K = 1, first-layer sample size S1 = 25; K = 2, second-layer sample size S2 = 10.
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')

# With concat, the output dimension is doubled
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')


flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')
flags.DEFINE_integer('identity_dim', 0,
                     'Set to positive value to use identity embedding features of that dimension. Default 0.')

# logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 0, "which gpu to use.")
flags.DEFINE_integer('print_every', 50, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10 ** 10, "Maximum total number of iterations")

os.environ["CUDA_VISIBLE_DEVICES"] = str(FLAGS.gpu)  # which GPU to use; I have only one, so the 'gpu' default had to be changed from 1 to 0
GPU_MEM_FRACTION = 0.8


def log_dir():
    log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]
    log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(
        model=FLAGS.model,
        model_size=FLAGS.model_size,
        lr=FLAGS.learning_rate)
    if not os.path.exists(log_dir):
        os.makedirs(log_dir)
    return log_dir


# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):
    t_test = time.time()
    feed_dict_val = minibatch_iter.val_feed_dict(size)
    outs_val = sess.run([model.loss, model.ranks, model.mrr],
                        feed_dict=feed_dict_val)
    return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)


def incremental_evaluate(sess, model, minibatch_iter, size):
    t_test = time.time()
    finished = False
    val_losses = []
    val_mrrs = []
    iter_num = 0
    while not finished:
        feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.ranks, model.mrr],
                            feed_dict=feed_dict_val)
        val_losses.append(outs_val[0])
        val_mrrs.append(outs_val[2])
    return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)


def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):
    val_embeddings = []
    finished = False
    seen = set([])
    nodes = []
    iter_num = 0
    name = "val"
    while not finished:
        feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)
        iter_num += 1
        outs_val = sess.run([model.loss, model.mrr, model.outputs1],
                            feed_dict=feed_dict_val)
        # ONLY SAVE FOR embeds1 because of planetoid
        for i, edge in enumerate(edges):
            if not edge[0] in seen:
                val_embeddings.append(outs_val[-1][i, :])
                nodes.append(edge[0])
                seen.add(edge[0])
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    val_embeddings = np.vstack(val_embeddings)
    np.save(out_dir + name + mod + ".npy", val_embeddings)
    with open(out_dir + name + mod + ".txt", "w") as fp:
        fp.write("\n".join(map(str, nodes)))


def construct_placeholders():
    # Define placeholders
    placeholders = {
        'batch1': tf.placeholder(tf.int32, shape=(None), name='batch1'),
        'batch2': tf.placeholder(tf.int32, shape=(None), name='batch2'),
        # negative samples shared by all nodes in the batch
        'neg_samples': tf.placeholder(tf.int32, shape=(None,),
                                      name='neg_sample_size'),
        'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),
        'batch_size': tf.placeholder(tf.int32, name='batch_size'),
    }
    return placeholders


def train(train_data, test_data=None):
    G = train_data[0]
    features = train_data[1]   # feature matrix of the training data
    id_map = train_data[2]  # {"n": n}; nodes without a 'val' or 'test' attribute have already been removed

    class_map=train_data[4]
    # print("class_map:",class_map)

    print("G:", G)
    # G: disjoint_union(, )
    print("features:", features)
    # features: [[-0.08760569 -0.08760569 -0.1132336 ... -0.13184157 -0.14681277
    #              -0.14717815]
    #            [-0.08760569 -0.08760569 -0.1132336 ... -0.13184157 -0.14681277
    #              -0.14717815]
    #            ...
    #           ]
    print("id_map:", id_map)
    # id_map: {0: 0, 1: 1,...,14754: 14754}
    print("feature.length",len(features))
    #feature.length 14755
    print("features.shape:", features.shape)
    # features.shape: (14755, 50)

    if not features is None:
        # pad with dummy zero vector
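        # vstack appends one all-zero row to features (not a column); its index equals the number of
        # nodes, so it can serve as the dummy node that empty slots of the padded adjacency table point to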
        features = np.vstack([features, np.zeros((features.shape[1],))])
    print("features.shape:", features.shape)
    # features.shape: (14756, 50)
    print(features[14755])  # the appended zero vector
    # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
    #  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
    #  0. 0.]

    context_pairs = train_data[3] if FLAGS.random_context else None  # node pairs from the random walks
    placeholders = construct_placeholders()
    minibatch = EdgeMinibatchIterator(G,
                                      id_map,
                                      placeholders, batch_size=FLAGS.batch_size,
                                      max_degree=FLAGS.max_degree,
                                      num_neg_samples=FLAGS.neg_sample_size,
                                      context_pairs=context_pairs)
    adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)
    adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")

    if FLAGS.model == 'graphsage_mean':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders,
                                   features,
                                   adj_info,
                                   minibatch.deg,
                                   layer_infos=layer_infos,
                                   model_size=FLAGS.model_size,
                                   identity_dim=FLAGS.identity_dim,
                                   logging=True)
    elif FLAGS.model == 'gcn':
        # Create model
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2 * FLAGS.dim_1),
                       SAGEInfo("node", sampler, FLAGS.samples_2, 2 * FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders,
                                   features,
                                   adj_info,
                                   minibatch.deg,
                                   layer_infos=layer_infos,
                                   aggregator_type="gcn",
                                   model_size=FLAGS.model_size,
                                   identity_dim=FLAGS.identity_dim,
                                   concat=False,
                                   logging=True)

    elif FLAGS.model == 'graphsage_seq':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders,
                                   features,
                                   adj_info,
                                   minibatch.deg,
                                   layer_infos=layer_infos,
                                   identity_dim=FLAGS.identity_dim,
                                   aggregator_type="seq",
                                   model_size=FLAGS.model_size,
                                   logging=True)

    elif FLAGS.model == 'graphsage_maxpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders,
                                   features,
                                   adj_info,
                                   minibatch.deg,
                                   layer_infos=layer_infos,
                                   aggregator_type="maxpool",
                                   model_size=FLAGS.model_size,
                                   identity_dim=FLAGS.identity_dim,
                                   logging=True)
    elif FLAGS.model == 'graphsage_meanpool':
        sampler = UniformNeighborSampler(adj_info)
        layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
                       SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

        model = SampleAndAggregate(placeholders,
                                   features,
                                   adj_info,
                                   minibatch.deg,
                                   layer_infos=layer_infos,
                                   aggregator_type="meanpool",
                                   model_size=FLAGS.model_size,
                                   identity_dim=FLAGS.identity_dim,
                                   logging=True)

    elif FLAGS.model == 'n2v':  # node2vec
        model = Node2VecModel(placeholders, features.shape[0],
                              minibatch.deg,
                              # 2x because graphsage uses concat
                              nodevec_dim=2 * FLAGS.dim_1,
                              lr=FLAGS.learning_rate)
    else:
        raise Exception('Error: model name unrecognized.')

    config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)
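    config.gpu_options.allow_growth = True  # allow_growth: allocate a small amount of GPU memory at first, then grow on demand

    # Set the fraction of each GPU's memory that the process may use,
    # e.g. per_process_gpu_memory_fraction = 0.4 means 40%
    # config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION


    # Automatic device selection
    # In TF, "with tf.device('/cpu:0'):" pins operations to a device by hand.
    # If that device does not exist or is unavailable, the program may hang or raise an error.
    # To avoid this, set allow_soft_placement=True in tf.ConfigProto(),
    # which lets TF pick an existing, available device to run the operation.
    config.allow_soft_placement = True  # if the requested device does not exist, let TF assign one automatically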


    # Initialize session
    sess = tf.Session(config=config)


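    # The tf.summary module records training statistics and parameter distributions for display in TensorBoard.
    # merge_all() merges every summary op defined in the graph into a single op; running that op yields
    # one serialized summary, which the FileWriter below then writes to disk.
    # Unless you have special requirements, this one call is usually enough to show the training information.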
    merged = tf.summary.merge_all()



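    # Specify the directory in which the graph and event files are saved
    # Form: tf.summary.FileWriter(path, sess.graph)
    # Call its add_summary() method to write training data to the location given to the FileWriter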
    summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)


    # Init variables
    sess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})

    # Train model

    train_shadow_mrr = None
    shadow_mrr = None

    total_steps = 0
    avg_time = 0.0
    epoch_val_costs = []

    train_adj_info = tf.assign(adj_info, minibatch.adj)
    val_adj_info = tf.assign(adj_info, minibatch.test_adj)
    for epoch in range(FLAGS.epochs):
        minibatch.shuffle()

        iter = 0
        print('Epoch: %04d' % (epoch + 1))
        epoch_val_costs.append(0)
        while not minibatch.end():
            # Construct feed dictionary
            feed_dict = minibatch.next_minibatch_feed_dict()
            feed_dict.update({placeholders['dropout']: FLAGS.dropout})

            t = time.time()
            # Training step
            outs = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all,
                             model.mrr, model.outputs1], feed_dict=feed_dict)
            train_cost = outs[2]
            train_mrr = outs[5]
            if train_shadow_mrr is None:
                train_shadow_mrr = train_mrr  #
            else:
                train_shadow_mrr -= (1 - 0.99) * (train_shadow_mrr - train_mrr)

            if iter % FLAGS.validate_iter == 0:
                # Validation
                sess.run(val_adj_info.op)
                val_cost, ranks, val_mrr, duration = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)
                sess.run(train_adj_info.op)
                epoch_val_costs[-1] += val_cost
            if shadow_mrr is None:
                shadow_mrr = val_mrr
            else:
                shadow_mrr -= (1 - 0.99) * (shadow_mrr - val_mrr)

            if total_steps % FLAGS.print_every == 0:
                summary_writer.add_summary(outs[0], total_steps)

            # Print results
            avg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)

            if total_steps % FLAGS.print_every == 0:
                print("Iter:", '%04d' % iter,
                      "train_loss=", "{:.5f}".format(train_cost),
                      "train_mrr=", "{:.5f}".format(train_mrr),
                      "train_mrr_ema=", "{:.5f}".format(train_shadow_mrr),  # exponential moving average (EMA)
                      "val_loss=", "{:.5f}".format(val_cost),
                      "val_mrr=", "{:.5f}".format(val_mrr),
                      "val_mrr_ema=", "{:.5f}".format(shadow_mrr),  # exponential moving average
                      "time=", "{:.5f}".format(avg_time))

            iter += 1
            total_steps += 1

            if total_steps > FLAGS.max_total_steps:
                break

        if total_steps > FLAGS.max_total_steps:
            break

    print("Optimization Finished!")
    if FLAGS.save_embeddings:  # whether to save the node embeddings after training
        sess.run(val_adj_info.op)

        save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())

        if FLAGS.model == "n2v":
            # stopping the gradient for the already trained nodes
            train_ids = tf.constant(
                [[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],
                dtype=tf.int32)
            test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']],
                                   dtype=tf.int32)
            update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))
            no_update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(train_ids))
            update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))
            no_update_nodes = tf.stop_gradient(
                tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))
            model.context_embeds = update_nodes + no_update_nodes
            sess.run(model.context_embeds)

            # run random walks
            from graphsage.utils import run_random_walks
            nodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]
            start_time = time.time()
            pairs = run_random_walks(G, nodes, num_walks=50)
            walk_time = time.time() - start_time

            test_minibatch = EdgeMinibatchIterator(G,
                                                   id_map,
                                                   placeholders, batch_size=FLAGS.batch_size,
                                                   max_degree=FLAGS.max_degree,
                                                   num_neg_samples=FLAGS.neg_sample_size,
                                                   context_pairs=pairs,
                                                   n2v_retrain=True,
                                                   fixed_n2v=True)

            start_time = time.time()
            print("Doing test training for n2v.")
            test_steps = 0
            for epoch in range(FLAGS.n2v_test_epochs):
                test_minibatch.shuffle()
                while not test_minibatch.end():
                    feed_dict = test_minibatch.next_minibatch_feed_dict()
                    feed_dict.update({placeholders['dropout']: FLAGS.dropout})
                    outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all,
                                     model.mrr, model.outputs1], feed_dict=feed_dict)
                    if test_steps % FLAGS.print_every == 0:
                        print("Iter:", '%04d' % test_steps,
                              "train_loss=", "{:.5f}".format(outs[1]),
                              "train_mrr=", "{:.5f}".format(outs[-2]))
                    test_steps += 1
            train_time = time.time() - start_time
            save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")
            print("Total time: ", train_time + walk_time)
            print("Walk time: ", walk_time)
            print("Train time: ", train_time)


# main function: load the data, then train
def main(argv=None):
    print("Loading training data..")
    train_data = load_data(FLAGS.train_prefix, load_walks=True)
    print("Done loading training data..")
    train(train_data)


if __name__ == '__main__':
    tf.app.run()
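# What tf.app.run() does: it parses the command-line flags, then calls the main function.
# If the entry function is not named main() but something else, e.g. test(), pass it explicitly: tf.app.run(test)
# If the entry function is named main(), the entry point can simply be written as tf.app.run()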
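
The flag-parsing-then-dispatch behaviour described in the comments above can be imitated with the standard library alone. The following is only a simplified sketch and not TensorFlow's implementation: the `run` helper, the `FLAGS` namespace, and the `toy-ppi` argument value are hypothetical stand-ins, and the real `tf.app.run()` additionally looks up `main` in the `__main__` module when no function is passed and exits the process with its return value.

```python
import argparse

# Hypothetical stand-in for tf.app.flags.FLAGS (not the TensorFlow object).
FLAGS = argparse.Namespace()


def run(main, argv):
    """Parse the flags we know about, then call `main` with the leftover arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_prefix", default="")
    known, remaining = parser.parse_known_args(argv[1:])
    for key, value in vars(known).items():
        setattr(FLAGS, key, value)
    # Like tf.app.run, hand over the program name plus whatever was not parsed.
    return main([argv[0]] + remaining)


def main(argv=None):
    return "main called with train_prefix=%s" % FLAGS.train_prefix


def test(argv=None):
    return "test called with train_prefix=%s" % FLAGS.train_prefix


# Entry function named main(): corresponds to tf.app.run()
result_main = run(main, ["prog", "--train_prefix", "toy-ppi"])
# Entry function with another name: corresponds to tf.app.run(test)
result_test = run(test, ["prog", "--train_prefix", "toy-ppi"])
print(result_main)
print(result_test)
```

The point of the pattern is that flags are resolved before the entry function runs, so `main` (or `test`) can read `FLAGS` directly instead of parsing `sys.argv` itself.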

inits.py

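  • Glorot initialization: it aims to keep the variance consistent from layer to layer in both the forward and backward passes. In the forward pass, the variance of each layer's activations stays the same; in the backward pass, the variance of each layer's gradients stays the same. The range of the random initialization is determined by the layer's number of inputs and outputs: the weights are drawn from the uniform distribution U(-r, r) with r = sqrt(6 / (n_in + n_out)).
    (For the derivation, see: https://blog.csdn.net/yyl424525/article/details/100823398#4_Xavier_21)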
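
As a quick sanity check of the range described above, here is a minimal pure-Python sketch (no TensorFlow) that samples a weight matrix from U(-r, r) and compares its empirical variance with the value 2 / (n_in + n_out) implied by the variance of a uniform distribution, r^2 / 3. The helper names `glorot_limit` and `glorot_uniform` and the layer sizes 50 and 128 are illustrative assumptions, not part of the GraphSAGE code.

```python
import math
import random


def glorot_limit(fan_in, fan_out):
    """Half-width r of the Glorot uniform range: sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, seed=0):
    """Return a fan_in x fan_out nested list sampled from U(-r, r)."""
    rng = random.Random(seed)
    limit = glorot_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


fan_in, fan_out = 50, 128  # illustrative layer sizes
limit = glorot_limit(fan_in, fan_out)
weights = glorot_uniform(fan_in, fan_out)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
empirical_var = sum((w - mean) ** 2 for w in flat) / len(flat)
# Var[U(-r, r)] = r^2 / 3 = 2 / (fan_in + fan_out)
target_var = 2.0 / (fan_in + fan_out)

print("limit:", limit)
print("empirical variance:", empirical_var)
print("target variance:", target_var)
```

With several thousand samples the empirical variance should land close to the target, which is the property the initialization relies on.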

This file is the same as the one in GCN:

import tensorflow as tf
import numpy as np

# Create a Tensor of the given shape with values drawn uniformly from (-scale, scale), i.e. (-0.05, 0.05) by default
def uniform(shape, scale=0.05, name=None):
    """Uniform init."""
    initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
    return tf.Variable(initial, name=name)


def glorot(shape, name=None):
    """Glorot & Bengio (AISTATS 2010) init."""
    # Sampling bound sqrt(6 / (fan_in + fan_out)), where shape = [fan_in, fan_out]
    init_range = np.sqrt(6.0/(shape[0]+shape[1]))
    initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
    return tf.Variable(initial, name=name)

# Create a Tensor of the given shape filled with 0
def zeros(shape, name=None):
    """All zeros."""
    initial = tf.zeros(shape, dtype=tf.float32)
    return tf.Variable(initial, name=name)

# Create a Tensor of the given shape filled with 1
def ones(shape, name=None):
    """All ones."""
    initial = tf.ones(shape, dtype=tf.float32)
    return tf.Variable(initial, name=name)

Other

citation_eval.py

ppi_eval.py

reddit_eval.py

References

The GraphSAGE code-analysis series by listenviolet on cnblogs (博客园)
