CS224W: Machine Learning with Graphs - 02 Traditional Feature-based Methods

Traditional Feature-based Methods

1. Traditional ML Pipeline

  • Design features for nodes/links/graphs
  • Obtain features for all training data

2. Design Choices

  • Features: d d d-dimensional vectors
  • Objects: Nodes, edges, set of nodes, entire graphs
  • Objective functions: What task are we aiming to solve?

3. Node-level Tasks

1). Node-level features: Overview

Goal: characterize the structure and position of a node in the network

  • Node degree
  • Node centrality
  • Clustering coefficient
  • Graphlets

2). Node Degree

  • The degree k v k_v kv of node v v v is the number of edges (neighboring nodes) the node has
  • Treats all neighboring nodes equally

3). Node Centrality

  • Node degree counts the neighboring nodes without capturing their importance.
  • Node centrality c v c_v cv takes the node importance in a graph into account
  • Different ways to model importance: eigenvector centrality, betweenness centrality, closeness centrality and so on.
a). Eigenvector centrality
  • A node v v v is important if surrounded by important neighboring nodes u ∈ N ( v ) u \in N(v) uN(v).
  • c v = 1 λ ∑ u ∈ N ( v ) c u c_v=\frac{1}{\lambda} \sum_{u \in N(v)} c_u cv=λ1uN(v)cu, λ \lambda λ is some positive constant.
  • λ c = A c \lambda c = Ac λc=Ac, A A A is the adjacency matrix A u v = 1 A_{uv}=1 Auv=1 if u ∈ N ( v ) u \in N(v) uN(v) and c c c is centrality vector
  • The largest eigenvector λ m a x \lambda_{max} λmax is always positive and unique
  • The leading eigenvector c m a x c_{max} cmax is used for centrality
b). Betweenness centrality
  • A node is important if it lies on many shortest paths between other nodes
    c v = ∑ s ≠ v ≠ t # ( shortest paths between  s  and  t  that contain  v ) # ( shortest paths between  s  and  t ) c_v = \sum _{s \neq v \neq t} \frac{\#(\text{shortest paths between \textit{s} and \textit{t} that contain \textit{v}})}{\#(\text{shortest paths between \textit{s} and \textit{t}})} cv=s=v=t#(shortest paths between s and t)#(shortest paths between s and t that contain v)
c). Closeness centrality
  • A node is important if it has small shortest path lengths to all other nodes
    c v = ∑ u ≠ v 1 # ( shortest paths length between  u  and  v ) c_v = \sum _{u \neq v} \frac{1}{\#(\text{shortest paths length between \textit{u} and \textit{v}})} cv=u=v#(shortest paths length between u and v)1

4). Node Features: Clustering Coefficient

  • Measures how connected the neighboring nodes of v v v are
    e v = ∑ u ≠ v # ( edges among neighboring nodes ) ( 2 k v ) ∈ [ 0 , 1 ] e_v = \sum _{u \neq v} \frac{\#(\text{edges among neighboring nodes})}{(_2^{k_v})} \in [0, 1] ev=u=v(2kv)#(edges among neighboring nodes)[0,1]
    where ( 2 k v ) (_2^{k_v}) (2kv) is the number of node pairs among k v k_v kv neighboring nodes

5). Node Features: Graphlets

Observation: clustering coefficient counts the #(triangles) in the ego-network
Idea: we can generalize the above by counting #(pre-specified subgraphs, i.e., graphlets)

  • Graphlets: rooted connected non-isomorphic subgraphs.
  • Graphlet Degree Vector (GDV): Graphlet-based features for nodes

Compare with node degree and clustering coefficient
Degree counts #(edges) that a node touches
Clustering coefficient counts #(triangles) that a node touches
GDV counts #(graphlets) that a node touches

  • Graphlet Degree Vector (GDV): a count vector of graphlets rooted at a given node.
  • Considering graphlets on 2 to 5 nodes we get vector of 73 coordinates is a signature of a node that describes the topology of its neighborhood and captures its interconnectivities out to a distance of 4 hops
  • GDV provides a more detailed measure of a node’s local network topology than node degrees or clustering coefficient

4. Link-level Tasks

The task is to predict new links based on existing links.
When testing, all node pairs (no existing links) are ranked, and top K K K node pairs are predicted.
The key is to design features for a pair of nodes
Two formulations

  • Links missing at random: remove a random set of links and then aim to predict them
  • Links over time: given G [ t 0 , t 0 ′ ] G[t_0, t_0^{'}] G[t0,t0], a graph on edges up to time t 0 ′ t_0^{'} t0, output a ranked list L L L of links (not in G [ t 0 , t 0 ′ ] G[t_0, t_0^{'}] G[t0,t0]) that are predicted to appear in G [ t 1 , t 1 ′ ] G[t_1, t_1^{'}] G[t1,t1]

Methodology: compute score c ( x , y ) c(x, y) c(x,y) for each pair of nodes ( x , y ) (x, y) (x,y), sort pairs ( x , y ) (x, y) (x,y) by the decreasing score c ( x , y ) c(x, y) c(x,y), predict top n n n pairs as new links, see which of these links actually appear in G [ t 1 , t 1 ′ ] G[t_1, t_1^{'}] G[t1,t1]

1). Link-level Features: Overview

  • Distance-based feature
  • Local neighborhood overlap
  • Global neighborhood overlap

2). Distance-based Feature

Shortest-path distance between two nodes. However, this does not capture the degree of neighborhood overlap.

3). Local Neighborhood Overlap

Captures the number of neighboring nodes shared between two nodes v 1 v_1 v1 and v 2 v_2 v2

  • Common neighbors: ∣ N ( v 1 ) ∩ N ( v 1 ) ∣ |N(v_1)\cap N(v_1)| N(v1)N(v1)
  • Jaccard’s coefficient: ∣ N ( v 1 ) ∩ N ( v 1 ) ∣ ∣ N ( v 1 ) ∪ N ( v 1 ) ∣ \frac{|N(v_1)\cap N(v_1)|}{|N(v_1)\cup N(v_1)|} N(v1)N(v1)N(v1)N(v1)
  • Adamic-Adar index: ∑ u ∈ N ( v 1 ) ∩ N ( v 1 ) 1 l o g ( k u ) \sum_{u \in N(v_1)\cap N(v_1)} \frac{1}{log(k_u)} uN(v1)N(v1)log(ku)1
    Limitation: metric is always zero if the two nodes do not have any neighbors in common. However, the two nodes may still potentially be connected in the future.

4). Global Neighborhood Overlap

  • Power of adjacency matrix: let P u v ( K ) = P_{uv}^{(K)} = Puv(K)= #(paths of lenght K K K between u u u and v v v), then P ( K ) = A K P^{(K)}=A^K P(K)=AK
  • Katz index: count the number of paths of all lengths between a given pair of nodes
    S u v = ∑ l = 1 ∞ β l A u v l S_{uv}=\sum_{l=1}^{\infty}\beta^l A_{uv}^l Suv=l=1βlAuvl
    S = ∑ l = 1 ∞ β l A l = ( I − β A ) − 1 − I S=\sum_{l=1}^{\infty}\beta^l A^l=(I-\beta A)^{-1}-I S=l=1βlAl=(IβA)1I

5. Graph-level Features

1). Kernel Methods

Idea: design kernels instead of feature vectors

  • Kernel k ( G , G ′ ) ∈ R k(G, G^{'}) \in \mathtt {R} k(G,G)R measures similarity between data
  • Kernel matrix K = ( k ( G , G ′ ) ) G , G ′ \mathtt {K}=(k(G, G^{'}))_{G,G^{'}} K=(k(G,G))G,G must always be positive semidefinite (i.e., has positive eigenvalues)
  • There exists a feature representation ϕ \phi ϕ such that k ( G , G ′ ) = ϕ ( G ) T ϕ ( G ′ ) k(G, G^{'}) = \phi (G)^T\phi(G^{'}) k(G,G)=ϕ(G)Tϕ(G)
  • Once the kernel is defined, off-the-shelf ML model, such as kernel SVM, can be used to make predictions.

Graph kernels
Goal: design graph feature vector ϕ ( G ) \phi (G) ϕ(G)
Key idea: Bag-of-Words (BoW)* for a graph (regard nodes as words)
*: BoW simply uses the word counts as features for documents

  • Graphlet kernel
  • Weisfeiler-Lehman kernel
  • random-walk kernel, shortest-path graph kernel and many more

2). Graphlet Features

Key idea: count the number of graphlets in a graph
Note: Definition of graphlets here is different from node-level features.

  • Nodes in graphlets here do not need to be connected
  • The graphlets here are not rooted

Given a graph G G G and a graphlet list G k = ( g 1 , g 2 , . . . , g n ) G_k=(g_1, g_2,...,g_n) Gk=(g1,g2,...,gn), define the graphlet count vector f G f_G fG as
( f G ) i = # ( g i ⊆ G ) for  i =1,2,..., n (f_G)_i= \#(g_i \subseteq G) \text{for \textit{i}=1,2,...,\textit{n}} (fG)i=#(giG)for i=1,2,...,n
Given two graphs G G G and G ′ G^{'} G, graphlet kernel is
k ( G , G ′ ) = h G T h G ′ k(G, G^{'}) = h_G^T h_G^{'} k(G,G)=hGThG
where h G = f G s u m ( f G ) h_G=\frac{f_G}{sum(f_G)} hG=sum(fG)fG
Limitations: counting graphlets is expensive.

  • Counting size- k k k graphlets for a graph with size n n n by enumeration takes n k n^k nk
  • This is unavoidable in the worst-case since subgraph isomorphism test is NP-hard
  • If a graph’s node degree is bounded by d d d, an O ( n d k − 1 ) O(nd^{k-1}) O(ndk1) algorithm exists to count all the graphlets of size k k k

3). Weisfeiler-Lehman Kernel

Goal: design an efficient graph feature descriptor ϕ ( G ) \phi (G) ϕ(G)
Idea: use neighborhood structure to iteratively enrich node vocabulary
Color refinement: given a graph G G G with a set of nodes V V V, assign an initial color c 0 ( v ) c^0(v) c0(v) to each ndoe v v v. Then iteratively refine node colors by
c k + 1 ( v ) = HASH ( { c k ( v ) , { c k ( u ) } u ∈ N ( v ) } ) c^{k+1}(v)=\text{HASH}(\{c^k(v), \{c^k(u)\}_{u\in N(v)}\}) ck+1(v)=HASH({ck(v),{ck(u)}uN(v)})
where HASH maps different inputs to different colors. After K K K steps of color refinement, c k ( v ) c^k(v) ck(v) summarizes the structure of K K K-hop neighborhood
WL is computationally efficient
The time complexity for color reinement at each step is linear in #(degs), since it involves aggregating neighboring colors.

你可能感兴趣的:(机器学习,神经网络,数据挖掘,图论)