【GNN】Graph embedding

Originally published as a WeChat article.

Why do we need graph embedding

Machine learning algorithms are designed for continuous data, which is why graphs are embedded into a continuous vector space.

What is graph embedding

Definition

Graph embedding is an approach that is used to transform nodes, edges, and their features into vector space (a lower dimension) whilst maximally preserving properties like graph structure and information. (Source: https://towardsdatascience.com/overview-of-deep-learning-on-graph-embeddings-4305c10ad4a4)

Representation learning for networks

[Figure: graph representation]

Embedding methods

There are a variety of ways to go about embedding graphs, each with a different level of granularity. Embeddings can be performed on the node level, the sub-graph level, or through strategies like graph walks.

DeepWalk

DeepWalk belongs to the family of graph embedding techniques that use walks, a concept from graph theory that enables the traversal of a graph by moving from one node to another, as long as they are connected by a common edge.

Random walk

  • Generate $\gamma$ random walks for each vertex
  • Each random walk has length $l$
  • Every jump is uniform

Code for a random walk on the example graph, revised according to the repository on GitHub.

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt

# Create Graph
G = nx.Graph()

# Add nodes 0 through 9 (node 9 is used by the edges below)
G.add_nodes_from(range(10))

# Add edges
G.add_edge(0, 1)
G.add_edge(1, 2)
G.add_edge(1, 6)
G.add_edge(6, 4)
G.add_edge(4, 3)
G.add_edge(4, 5)
G.add_edge(6, 7)
G.add_edge(7, 8)
G.add_edge(7, 9)
G.add_edge(2, 3)
G.add_edge(2, 4)
G.add_edge(2, 5)

# Draw graph
#nx.draw(G)
#plt.show()

# Number of walks per start vertex (gamma) and walk length (l)
num_walks = 1
walk_length = 9

for step in range(num_walks):
    # Start node of the walk
    vertexid = 1
    # Dictionary that maps each node to the number of times it was visited
    visited_vertices = {}
    # Store the walk path, starting from the start node
    path = [vertexid]

    print("Walk: %d" % step)
    # Execute the random walk for walk_length steps
    for counter in range(walk_length):
        # Extract the neighbourhood of the current vertex
        vertex_neighbors = list(G.neighbors(vertexid))
        # The probability of jumping to each neighbour is uniform
        probability = [1. / len(vertex_neighbors)] * len(vertex_neighbors)
        # Choose the next vertex from the neighbourhood
        vertexid = np.random.choice(vertex_neighbors, p=probability)
        # Accumulate the number of times each vertex is visited
        visited_vertices[vertexid] = visited_vertices.get(vertexid, 0) + 1
        # Append to the walk path
        path.append(vertexid)

    # Sort the vertices by visit count in descending order
    mostvisited = sorted(visited_vertices, key=visited_vertices.get, reverse=True)
    print("Path: ", path)
    # Print the (up to) 10 most visited vertices
    print("Most visited nodes: ", mostvisited[:10])

RW path to matrix ${\bf \Phi}$

  • Define a window.
  • The steps in a window of that traversal can be aggregated by arranging the node representation vectors $\pmb{v} \in \mathbb{R}^d$ next to each other in a matrix ${\bf \Phi}$, as in the toy sketch below.
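A toy illustration of this arrangement (the dimension, the window contents, and the lookup table of vectors are arbitrary stand-ins for learned representations):

import numpy as np

d = 4                      # embedding dimension
window = [1, 2, 4, 6, 7]   # one window of steps from a traversal

# Hypothetical lookup table: one d-dimensional representation vector per node
vectors = {v: np.random.randn(d) for v in range(10)}

# Arrange the windowed node vectors next to each other in a matrix Phi
Phi = np.column_stack([vectors[v] for v in window])
print(Phi.shape)  # (4, 5): d rows, one column per step in the window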

The approach taken by DeepWalk is to complete a series of random walks and to model the probability
\begin{equation*}
P\big(v_i \mid \big({\bf \Phi}(v_1), {\bf \Phi}(v_2), \dots, {\bf \Phi}(v_{i-1})\big)\big).
\end{equation*}
The goal is to estimate the likelihood of observing node $v_i$ given all the previous nodes visited so far in the random walk.

Next, that matrix representation of the graph is fed to a neural network to make a prediction about a node, e.g. a node feature or a classification. The model used to make these predictions is skip-gram.
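As a minimal sketch of this step, assuming the gensim library is available and reusing the graph G from the snippet above (the walk counts and embedding dimension are chosen arbitrarily), the walks can be fed to a skip-gram Word2Vec model to obtain node embeddings:

import random
from gensim.models import Word2Vec

# Build a DeepWalk-style corpus: gamma walks of length l from every vertex
def generate_walks(G, gamma=10, l=9):
    walks = []
    for _ in range(gamma):
        for start in G.nodes():
            walk = [start]
            for _ in range(l):
                # Uniform jump to a neighbour of the current vertex
                walk.append(random.choice(list(G.neighbors(walk[-1]))))
            # Word2Vec expects sequences of strings
            walks.append([str(v) for v in walk])
    return walks

walks = generate_walks(G)

# Train skip-gram (sg=1); window is the context size over each walk
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=1, workers=1)

# Embedding vector of node 1
print(model.wv['1'])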

Node2vec

Idea: use flexible, biased random walks that can trade off between local and global views of the network.

The difference between Node2vec and DeepWalk is subtle but significant. Node2vec introduces a walk bias variable $\alpha$, parameterized by $p$ and $q$. The return parameter $p$ controls how strongly the walk stays near the previous node, yielding breadth-first-search (BFS)-like behavior, while the in-out parameter $q$ controls how strongly it explores outward, yielding depth-first-search (DFS)-like behavior. The decision of where to walk next is therefore weighted by the unnormalized probabilities $1/p$, $1$, or $1/q$:
\begin{equation*}
\alpha_{pq}(t, x) =
\begin{cases}
\frac{1}{p} & \text{if } d_{tx} = 0 \\
1 & \text{if } d_{tx} = 1 \\
\frac{1}{q} & \text{if } d_{tx} = 2
\end{cases}
\end{equation*}
where $t$ is the last node visited, $x$ is the candidate next node, the walk currently resides at node $v$, and $d_{tx}$ is the shortest-path distance between $t$ and $x$.
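A minimal sketch of this bias, assuming the networkx library (the helper names are hypothetical, and a real implementation would know $d_{tx}$ without a shortest-path query, since every candidate $x$ is a neighbour of $v$):

import networkx as nx
import numpy as np

def node2vec_bias(G, t, x, p, q):
    # d_tx: shortest-path distance between the previous node t and candidate x
    d_tx = nx.shortest_path_length(G, source=t, target=x)
    if d_tx == 0:     # x is t itself: the walk returns to the previous node
        return 1.0 / p
    elif d_tx == 1:   # x is also a neighbour of t: the walk stays close (BFS-like)
        return 1.0
    else:             # d_tx == 2: the walk moves outward (DFS-like)
        return 1.0 / q

def next_node(G, t, v, p=1.0, q=0.5):
    # Weight each neighbour of the current node v by its bias, then normalize
    neighbors = list(G.neighbors(v))
    weights = [node2vec_bias(G, t, x, p, q) for x in neighbors]
    probs = [w / sum(weights) for w in weights]
    return np.random.choice(neighbors, p=probs)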

[Figure 1: BFS vs. DFS neighborhood exploration]

As the visualization implies, BFS is ideal for learning about local neighborhoods, while DFS is better for capturing the global structure of the network.

Graph2vec

A modification of node2vec, graph2vec essentially learns to embed a graph's sub-graphs.

Using an analogy with word2vec: if a document is made of sentences (which are in turn made of words), then a graph is made of sub-graphs (which are in turn made of nodes).

[Figure 2: the document/sentence/word vs. graph/sub-graph/node analogy]

Three steps:

  • Sample and relabel all sub-graphs in the graph;
  • Train the skip-gram model;
  • Compute the embedding by feeding the sub-graph id index vector to the input (a rough sketch of the full pipeline is given below).
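A rough sketch of this pipeline, using gensim's Doc2Vec as the skip-gram-style model and a simplified Weisfeiler-Lehman-style relabeling to enumerate rooted sub-graphs (the helper and the example graphs are illustrative assumptions, not the exact graph2vec implementation):

import networkx as nx
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Step 1: relabel the rooted sub-graph around every node with a few
# Weisfeiler-Lehman-style iterations; each label acts as one 'word'
def rooted_subgraph_labels(G, iterations=2):
    labels = {v: str(G.degree(v)) for v in G.nodes()}
    words = list(labels.values())
    for _ in range(iterations):
        labels = {v: labels[v] + '|' + ''.join(sorted(labels[u] for u in G.neighbors(v)))
                  for v in G.nodes()}
        words += list(labels.values())
    return words

# Each graph becomes a 'document' whose 'words' are its sub-graph labels
graphs = [nx.karate_club_graph(), nx.cycle_graph(10), nx.path_graph(10)]
documents = [TaggedDocument(words=rooted_subgraph_labels(g), tags=[str(i)])
             for i, g in enumerate(graphs)]

# Step 2: train the model (doc2vec generalizes skip-gram to documents)
model = Doc2Vec(documents, vector_size=32, min_count=1, epochs=50)

# Step 3: the embedding is looked up via the graph's id tag
print(model.dv['0'])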

[Figure 3]

A brief history of graph embedding
