DeepWalk Code Explanation

Running the Code

Generate node embeddings for a given graph dataset:

  • First, run random walks over the nodes of the graph
  • Then feed the random-walk paths into Word2Vec to produce the node embeddings

The code is run as follows:

python __main__.py --input ../example_graphs/karate.adjlist --output ../example_graphs/karate.embeddings
python __main__.py --format mat --input ../example_graphs/blogcatalog.mat --number-walks 80 --representation-size 128 --walk-length 40 --window-size 10 --workers 1 --output ../example_graphs/blogcatalog.embeddings
python scoring.py --emb blogcatalog.embeddings --network blogcatalog.mat --num-shuffle 10 --all

Below is a brief analysis of each file.

__main__.py

# __main__.py
import graph
import walks as serialized_walks  # referenced below as serialized_walks

Both graph and walks contain a random-walk implementation.
The main difference is that the former targets small datasets and the latter large ones: the paths sampled by the former stay in memory and are never serialized, while the latter serializes them to local disk. Even so, the random walk used by walks is still the method defined in graph; it is really just calling into graph.

parser.add_argument('--format', default='adjlist',
                        help='File format of input file')

The program accepts three input formats ('adjlist', 'edgelist', 'mat'); the default is 'adjlist'.
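For intuition, here is roughly what one tiny graph looks like in the two plain-text formats (illustrative toy contents, not the actual example files):

adjlist — each line is a node followed by all of its neighbors:

1 2 3 4
2 1 3
3 1 2
4 1

edgelist — each line is a single edge:

1 2
1 3
1 4
2 3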

if data_size < args.max_memory_data_size:
    print("Walking...")
    walks = graph.build_deepwalk_corpus(G, num_paths=args.number_walks,
                                        path_length=args.walk_length, alpha=0, rand=random.Random(args.seed))
    print("Training...")
    model = Word2Vec(walks, size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)

This calls build_deepwalk_corpus, which starts multiple random walks from every node.

num_paths: the number of walks started from each node

path_length: the maximum length of a single random walk

alpha: with probability 1 - alpha the walk continues from the current node; with probability alpha it restarts, i.e. jumps back to the start node

Training then feeds the walks produced by the random walks into the Word2Vec model to obtain an embedding for each node.
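Putting the small-data branch together, a minimal end-to-end sketch (assuming gensim 3.x, which provides the size= keyword used above, and running next to graph.py):

import random
from gensim.models import Word2Vec
import graph

# load the karate club graph and mirror edges so it is undirected
G = graph.load_adjacencylist("../example_graphs/karate.adjlist", undirected=True)

# 10 walks per node, each at most 40 nodes long
walks = graph.build_deepwalk_corpus(G, num_paths=10, path_length=40,
                                    alpha=0, rand=random.Random(0))

# skip-gram (sg=1) with hierarchical softmax (hs=1), as in __main__.py
model = Word2Vec(walks, size=64, window=5, min_count=0, sg=1, hs=1, workers=1)

print(model.wv['1'])  # the embedding of node 1 (node ids are strings)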

else:
    print("Data size {} is larger than limit (max-memory-data-size: {}).  Dumping walks to disk.".format(data_size, args.max_memory_data_size))
    print("Walking...")

    walks_filebase = args.output + ".walks"
    walk_files = serialized_walks.write_walks_to_disk(G, walks_filebase, num_paths=args.number_walks,
                                                      path_length=args.walk_length, alpha=0,
                                                      rand=random.Random(args.seed), num_workers=args.workers)

    print("Counting vertex frequency...")
    if not args.vertex_freq_degree:
      vertex_counts = serialized_walks.count_textfiles(walk_files, args.workers)
    else:
      # use degree distribution for frequency in tree
      vertex_counts = G.degree(nodes=G.iterkeys())

    print("Training...")
    walks_corpus = serialized_walks.WalksCorpus(walk_files)
    model = Skipgram(sentences=walks_corpus, vocabulary_counts=vertex_counts,
                     size=args.representation_size,
                     window=args.window_size, min_count=0, trim_rule=None, workers=args.workers)

When the configured memory limit cannot hold the walk results, the walk paths are dumped into a series of files output.walks.x. After the program finishes there are two kinds of output: the file file_path, plus file_path.walks.0, file_path.walks.1, ..., file_path.walks.x.
file_path stores each node's embedding;
the output.walks.x files store the sampled walk paths, where x indicates that the file was written by the x-th worker.
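Roughly, the two kinds of files look like this (contents are illustrative; __main__.py saves the embeddings through gensim's save_word2vec_format, so the first line is a "node-count dimension" header):

# blogcatalog.embeddings (word2vec text format)
10312 128
35 0.0913 -0.2467 ...
43 0.1521 0.0178 ...

# blogcatalog.embeddings.walks.0 (one space-separated walk per line)
7 1 13 4 2 31 9 33 ...
22 2 18 1 5 7 ...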

In fact, it is enough to know that serialized_walks.write_walks_to_disk is essentially calling the random_walk defined in graph; it merely wraps it with parallelization and disk-writing code.

graph.py

class Graph(defaultdict):
  def __init__(self):
    super(Graph, self).__init__(list)
    # super().__init__(list)  # Python 3.x syntax

Here the graph is built as a dictionary whose keys are nodes and whose values are lists (of neighbors).

If you construct a Graph instance and print it, you get:

defaultdict(<class 'list'>,
	{
		1: [2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 18, 20, 22, 32],
		2: [1, 3, 4, 8, 14, 18, 20, 22, 31],
		...
		34: [9, 10, 14, 15, 16, 19, 20, 21, 23, 24, 27, 28, 29, 30, 31, 32, 33]
	}
)
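A minimal sketch of populating such a Graph by hand (the edges are made up; load_adjacencylist does the same thing when reading a file):

import graph

G = graph.Graph()
G[1].extend([2, 3])  # defaultdict(list): a missing key starts as an empty list
G[2].extend([1, 1])  # duplicate edge, on purpose
G[3].append(3)       # self-loop, on purpose

G.make_undirected()  # mirrors missing reverse edges, then runs make_consistent(),
                     # which dedups/sorts each list and removes self-loops
print(G)             # defaultdict(<class 'list'>, {1: [2, 3], 2: [1], 3: [1]})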

  def make_undirected(self):
  
    t0 = time()

    for v in self.keys():
      for other in self[v]:
        if v != other:
          self[other].append(v)
    
    t1 = time()
    logger.info('make_undirected: added missing edges {}s'.format(t1-t0))

    self.make_consistent()
    return self

  def make_consistent(self):
    t0 = time()
    for k in iterkeys(self):
      self[k] = list(sorted(set(self[k])))
    
    t1 = time()
    logger.info('make_consistent: made consistent in {}s'.format(t1-t0))

    self.remove_self_loops()

    return self

  def remove_self_loops(self):

    removed = 0
    t0 = time()

    for x in self:
      if x in self[x]: 
        self[x].remove(x)
        removed += 1
    
    t1 = time()

    logger.info('remove_self_loops: removed {} loops in {}s'.format(removed, (t1-t0)))
    return self

  def check_self_loops(self):
    for x in self:
      for y in self[x]:
        if x == y:
          return True
    
    return False

  def random_walk(self, path_length, alpha=0, rand=random.Random(), start=None):
    """ Returns a truncated random walk.

        path_length: Length of the random walk.
        alpha: probability of restarts.
        start: the start node of the random walk.
    """
    G = self
    if start:
      path = [start]
    else:
      # Sampling is uniform w.r.t V, and not w.r.t E
      path = [rand.choice(list(G.keys()))]

    while len(path) < path_length:
      cur = path[-1]
      if len(G[cur]) > 0:
        if rand.random() >= alpha:
          # taken with probability 1 - alpha: keep walking from the current
          # node; otherwise (probability alpha) restart from the start node
          path.append(rand.choice(G[cur]))
        else:
          path.append(path[0])
      else:
        break
    return [str(node) for node in path]

These Graph methods mainly clean up the graph itself: make_undirected mirrors every edge, make_consistent deduplicates and sorts each adjacency list, and remove_self_loops drops self-loop edges.

The random_walk function generates a random-walk path starting from the given start node (or from a uniformly sampled node when start is not given).
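A small usage sketch, reusing a loaded Graph G: with alpha=0 the walk never restarts, while a large alpha makes it keep jumping back to the start node.

import random

rand = random.Random(0)
print(G.random_walk(5, alpha=0, rand=rand, start=1))    # e.g. ['1', '3', '1', '2', '1']
print(G.random_walk(5, alpha=0.9, rand=rand, start=1))  # mostly '1': restarts dominate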

def build_deepwalk_corpus(G, num_paths, path_length, alpha=0,
                      rand=random.Random(0)):
  walks = []

  nodes = list(G.nodes())
  print(nodes)
  
  for cnt in range(num_paths):
    rand.shuffle(nodes)
    for node in nodes:
      walks.append(G.random_walk(path_length, rand=rand, alpha=alpha, start=node))
  
  return walks

This function builds a corpus for the graph: num_paths random-walk paths starting from every node, i.e. num_paths * |V| walks in total (the node order is reshuffled on every pass).

def build_deepwalk_corpus_iter(G, num_paths, path_length, alpha=0,
                      rand=random.Random(0)):
  walks = []

  nodes = list(G.nodes())

  for cnt in range(num_paths):
    rand.shuffle(nodes)
    for node in nodes:
      yield G.random_walk(path_length, rand=rand, alpha=alpha, start=node)

This function generates the same corpus, but lazily, as a generator. It suits the case where the configured memory limit cannot hold all the paths at once.
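Note that gensim's Word2Vec has to scan the corpus more than once (once to build the vocabulary, then once per training epoch), so a bare generator would be exhausted after the first pass. That is why __main__.py wraps the walk files in serialized_walks.WalksCorpus rather than feeding build_deepwalk_corpus_iter in directly. A restartable wrapper looks roughly like this (a sketch of the idea, not necessarily the repo's exact code):

class WalksCorpus(object):
  """A restartable iterable over walk files: each line is one walk."""
  def __init__(self, file_list):
    self.file_list = file_list

  def __iter__(self):
    # a fresh pass over the files every time iteration starts
    for name in self.file_list:
      with open(name) as f:
        for line in f:
          yield line.split()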

walks.py

def count_words(file):
  """ Counts the word frequences in a list of sentences.

  Note:
    This is a helper function for parallel execution of `Vocabulary.from_text`
    method.
  """
  c = Counter()
  with open(file, 'r') as f:
    for l in f:
      words = l.strip().split()
      c.update(words)
  return c


def count_textfiles(files, workers=1):
  c = Counter()
  with ProcessPoolExecutor(max_workers=workers) as executor:
    for c_ in executor.map(count_words, files):
      c.update(c_)
  return c

Here are two functions for computing word frequencies.

In count_words, each line of the file argument is one walk; the function returns a Counter with the number of times each word (node id) occurs in that file.

count_textfiles parallelizes the counting across worker processes with ProcessPoolExecutor and merges the per-file Counters.
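A small worked example of the pair (file names are hypothetical):

# suppose out.walks.0 contains the two walks:
#   1 2 3
#   2 3 4
print(count_words("out.walks.0"))
# -> Counter({'2': 2, '3': 2, '1': 1, '4': 1})

# merge the counts from every walk file using 4 worker processes
vertex_counts = count_textfiles(["out.walks.0", "out.walks.1"], workers=4)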

For background on this pattern, see an introduction to parallel programming in Python, e.g. the Chinese edition of Python Parallel Programming Cookbook.

def write_walks_to_disk(G, filebase, num_paths, path_length, alpha=0, rand=random.Random(0), num_workers=cpu_count(),
                        always_rebuild=True):
  global __current_graph
  __current_graph = G
  files_list = ["{}.{}".format(filebase, str(x)) for x in list(range(num_paths))]
  expected_size = len(G)
  args_list = []
  files = []

  if num_paths <= num_workers:
    paths_per_worker = [1 for x in range(num_paths)]
  else:
    paths_per_worker = [len(list(filter(lambda z: z!= None, [y for y in x])))
                        for x in graph.grouper(int(num_paths / num_workers)+1, range(1, num_paths+1))]

  with ProcessPoolExecutor(max_workers=num_workers) as executor:
    for size, file_, ppw in zip(executor.map(count_lines, files_list), files_list, paths_per_worker):
      if always_rebuild or size != (ppw*expected_size):
        args_list.append((ppw, path_length, alpha, random.Random(rand.randint(0, 2**31)), file_))
      else:
        files.append(file_)

  # the function then dispatches the actual walk-and-write jobs to the
  # worker pool and collects the resulting file names
  with ProcessPoolExecutor(max_workers=num_workers) as executor:
    for file_ in executor.map(_write_walks_to_disk, args_list):
      files.append(file_)

  return files

This is where the walks are written to disk: the num_paths walks are split across num_workers worker processes, each writing its share to its own filebase.x file; a file that already contains the expected number of lines is reused rather than regenerated, unless always_rebuild is set.
