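Gelly ships with a growing library of graph algorithms that make it easy to analyze large graphs. They are simple to use: call run() on the graph you want to analyze. For example, community detection with label propagation looks like this: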
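import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Graph;
import org.apache.flink.graph.Vertex;
import org.apache.flink.graph.library.LabelPropagation;
import org.apache.flink.types.NullValue;
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Graph<Long, Long, NullValue> graph = ...
// run Label Propagation for 30 iterations to detect communities on the input graph
DataSet<Vertex<Long, Long>> verticesWithCommunity = graph.run(new LabelPropagation<>(30));
// print the result
verticesWithCommunity.print();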
Community Detection
Overview
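In graph theory, a community is a group of nodes that are densely connected internally but sparsely connected to other groups. The community detection algorithm in the Gelly library implements the algorithm described in the paper Towards real-time community detection in large networks.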
Details
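The algorithm is implemented using scatter-gather iterations. Initially, each vertex is assigned a Tuple2 holding its original label and a score of 1.0. In each iteration, a vertex sends a message containing its current label and score to its neighbors. On receiving messages from its neighbors, a vertex adopts the label with the highest score as its new label and then re-scores it using the edge values, the user-defined hop attenuation parameter delta, and the current superstep number. The algorithm converges when vertices no longer update their labels or when the maximum number of iterations is reached.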
Usage
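The algorithm takes as input a graph with any vertex ID type, Long vertex values, and Double edge values. It returns a new graph of the same type as the input, in which the vertex values are the community labels: two vertices with the same vertex value belong to the same community. The constructor takes two parameters:
- maxIterations: the maximum number of iterations.
- delta: the hop attenuation parameter, default 0.5.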
Label Propagation
Overview
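This is an implementation of the well-known label propagation algorithm described in the paper Near linear time algorithm to detect community structures in large-scale networks. The algorithm discovers communities by propagating labels between neighboring vertices. Unlike the Community Detection method above, this implementation does not use scores associated with the labels.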
Details
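The algorithm is also implemented using scatter-gather iterations. Labels must be of a Comparable type and are initialized from the vertex values of the input graph (for example, each vertex's ID can be assigned to its value as the initial label). The algorithm refines the communities iteratively by propagating labels: in each iteration, a vertex adopts the label that occurs most frequently among its neighbors. If several labels are tied for most frequent, the largest one is chosen. (In theory the algorithm allows picking one of the tied labels at random, but the Gelly implementation takes the largest, which is why labels must be Comparable.) The algorithm converges when vertices no longer update their labels or when the maximum number of iterations is reached. Because label propagation is not stable, different initializations may lead to different communities.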
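The update rule described above can be sketched in plain Java, without Flink. This is a minimal illustration rather than Gelly's implementation: the class LabelPropagationSketch, its run method, the integer vertex indices, and the synchronous update order are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of synchronous label propagation; not Gelly's implementation. */
class LabelPropagationSketch {

    /**
     * @param edges  undirected edges as {u, v} pairs over vertices 0..labels.length-1
     * @param labels initial label per vertex, e.g. the vertex ID itself
     */
    static long[] run(int[][] edges, long[] labels, int maxIterations) {
        int n = labels.length;
        List<List<Integer>> neighbors = new ArrayList<>();
        for (int v = 0; v < n; v++) neighbors.add(new ArrayList<>());
        for (int[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }
        long[] current = labels.clone();
        for (int i = 0; i < maxIterations; i++) {
            long[] next = current.clone();
            boolean changed = false;
            for (int v = 0; v < n; v++) {
                // Count how often each label occurs among the neighbors of v.
                Map<Long, Integer> counts = new HashMap<>();
                for (int u : neighbors.get(v)) counts.merge(current[u], 1, Integer::sum);
                // The most frequent label wins; ties go to the larger label.
                long best = current[v];
                int bestCount = 0;
                for (Map.Entry<Long, Integer> c : counts.entrySet()) {
                    int count = c.getValue();
                    long label = c.getKey();
                    if (count > bestCount || (count == bestCount && label > best)) {
                        best = label;
                        bestCount = count;
                    }
                }
                if (best != current[v]) {
                    next[v] = best;
                    changed = true;
                }
            }
            current = next;
            if (!changed) break; // converged: no vertex changed its label
        }
        return current;
    }
}
```

On two disjoint triangles whose vertices start with the labels 1 to 3 and 4 to 6, this sketch settles on label 3 for the first triangle and label 6 for the second.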
Usage
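The input graph must satisfy two conditions: the vertex ID type and the vertex value type must both be Comparable. The edge value type is irrelevant. The algorithm returns a DataSet of vertices, where each vertex value identifies the community the vertex belongs to. The constructor takes one parameter:
- maxIterations: the maximum number of iterations.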
Connected Components
Overview
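This is an implementation of the weakly connected components algorithm. On convergence, two vertices belong to the same component if there is a path between them, regardless of edge direction.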
Details
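The algorithm is also implemented using scatter-gather iterations. Initially, each vertex holds a unique Comparable vertex value that serves as its component ID. In each iteration, a vertex propagates its current value to its neighbors. Each vertex picks the minimum of all received IDs and compares it with its own value: if the minimum is smaller than its current value, the vertex adopts it as its new value. The algorithm converges when vertices no longer update their values or when the maximum number of iterations is reached.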
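The minimum-ID propagation described above can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Gelly's code: the class ConnectedComponentsSketch and the array-based graph are made up for the example, and every vertex sends its value in every round.

```java
import java.util.Arrays;

/** Plain-Java sketch of min-ID propagation for weakly connected components. */
class ConnectedComponentsSketch {

    /**
     * @param edges        edges as {source, target} pairs; direction is ignored
     * @param componentIds unique initial component ID per vertex
     */
    static long[] run(int[][] edges, long[] componentIds, int maxIterations) {
        long[] value = componentIds.clone();
        for (int i = 0; i < maxIterations; i++) {
            // Scatter: every vertex sends its current ID along each incident edge;
            // the receiver remembers only the smallest ID it was sent.
            long[] minReceived = new long[value.length];
            Arrays.fill(minReceived, Long.MAX_VALUE);
            for (int[] e : edges) {
                minReceived[e[1]] = Math.min(minReceived[e[1]], value[e[0]]);
                minReceived[e[0]] = Math.min(minReceived[e[0]], value[e[1]]);
            }
            // Gather: adopt the smallest received ID only if it beats the current one.
            boolean changed = false;
            for (int v = 0; v < value.length; v++) {
                if (minReceived[v] < value[v]) {
                    value[v] = minReceived[v];
                    changed = true;
                }
            }
            if (!changed) break; // converged: no vertex updated its value
        }
        return value;
    }
}
```

For the edges 0-1, 1-2 and 3-4 with initial IDs 10, 20, 30, 40, 50, the sketch ends with the values 10, 10, 10, 40, 40, i.e. two components.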
Usage
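The algorithm returns a DataSet of vertices, where each vertex value identifies the component the vertex was assigned to. The constructor takes one parameter:
- maxIterations: the maximum number of iterations.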
GSA Connected Components
Overview
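Like the previous algorithm, this is an implementation of the weakly connected components algorithm. On convergence, two vertices belong to the same component if there is a path between them, regardless of edge direction.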
Details
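Unlike the previous algorithm, this one is implemented using gather-sum-apply iterations. Initially, each vertex holds a unique Comparable vertex value that serves as its component ID. In the gather phase, each vertex collects the values of all its neighbors. In the sum phase, the minimum of the collected values is selected. In the apply phase, the vertex adopts that minimum as its new value if it is smaller than its current value. The algorithm converges when vertices no longer update their values or when the maximum number of iterations is reached.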
Usage
Parameters:
- maxIterations: the maximum number of iterations.
Single Source Shortest Paths
Overview
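This is an implementation of the single source shortest paths algorithm for weighted graphs. Given a source vertex, it computes the shortest paths from that vertex to all other vertices in the graph.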
Details
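The algorithm is implemented using scatter-gather iterations. In each iteration, a vertex sends each neighbor a candidate distance: its own distance from the source plus the weight of the edge leading to that neighbor. On receiving candidate distances, a vertex takes the minimum and updates its distance if the minimum is smaller than its current value. Otherwise the value is left unchanged. A vertex whose distance did not change in an iteration sends no messages to its neighbors in the next one. The algorithm converges when vertices no longer update their values or when the maximum number of iterations is reached.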
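The message flow described above can be sketched in plain Java as follows. This is an illustrative sketch, not Gelly's implementation: the class ShortestPathsSketch, the array-based edge list and the parameter names are assumptions made for the example.

```java
import java.util.Arrays;

/** Plain-Java sketch of scatter-gather single source shortest paths. */
class ShortestPathsSketch {

    /**
     * @param edges   directed edges as {source, target} index pairs
     * @param weights weights[k] is the weight of edges[k]
     */
    static double[] run(int n, int[][] edges, double[] weights, int src, int maxIterations) {
        double[] dist = new double[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[src] = 0.0;
        boolean[] updated = new boolean[n];
        updated[src] = true; // only the source sends messages in the first round
        for (int i = 0; i < maxIterations; i++) {
            // Scatter: vertices updated in the previous round send
            // (own distance + edge weight) to each out-neighbor.
            double[] candidate = new double[n];
            Arrays.fill(candidate, Double.POSITIVE_INFINITY);
            for (int k = 0; k < edges.length; k++) {
                int u = edges[k][0];
                int v = edges[k][1];
                if (updated[u]) candidate[v] = Math.min(candidate[v], dist[u] + weights[k]);
            }
            // Gather: keep the smallest candidate if it improves the current distance.
            boolean anyUpdate = false;
            boolean[] nextUpdated = new boolean[n];
            for (int v = 0; v < n; v++) {
                if (candidate[v] < dist[v]) {
                    dist[v] = candidate[v];
                    nextUpdated[v] = true;
                    anyUpdate = true;
                }
            }
            updated = nextUpdated;
            if (!anyUpdate) break; // converged: no distance changed
        }
        return dist;
    }
}
```

Vertices that cannot be reached from the source keep the distance Double.POSITIVE_INFINITY in this sketch.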
Usage
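The algorithm requires the edge values to be of type Double. The other types are not constrained. The output is a DataSet of vertices, where each vertex value is the shortest distance from the source to that vertex.
Parameters:
- srcVertexId: the ID of the source vertex.
- maxIterations: the maximum number of iterations.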
Triangle Enumerator
Overview
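This method enumerates all unique triangles in the input graph. A triangle consists of three vertices that are pairwise connected by three edges. The implementation ignores edge direction.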
Details
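The basic idea is simple. First, group all edges that share a common vertex and build triads, i.e. triples of vertices connected by two edges. Then filter out every triad for which no third, closing edge exists. What remains is the set of triangles in the graph. A group of n edges sharing a common vertex yields ((n*(n-1))/2) triads, so one way to optimize the algorithm is to group edges on the vertex with the smaller output degree, which reduces the number of triads. The implementation extends the basic algorithm in this way: it computes the output degrees of the edge vertices and groups each edge on its endpoint with the smaller degree.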
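The triad-building and filtering steps above can be sketched in plain Java. This is a hedged illustration and not Gelly's implementation: the class TriangleEnumerationSketch is hypothetical, the graph is held in memory, and ties between equal degrees are broken by vertex index, which is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of triad-based triangle enumeration; not Gelly's implementation. */
class TriangleEnumerationSketch {

    /** Returns each triangle once as {a, b, c}; edge direction is ignored. */
    static List<int[]> run(int n, int[][] edges) {
        List<Set<Integer>> adj = new ArrayList<>();
        for (int v = 0; v < n; v++) adj.add(new HashSet<>());
        for (int[] e : edges) {
            if (e[0] != e[1]) {
                adj.get(e[0]).add(e[1]);
                adj.get(e[1]).add(e[0]);
            }
        }
        // Group each edge on its endpoint with the smaller degree (ties: smaller index),
        // so that high-degree vertices build few triads.
        List<List<Integer>> grouped = new ArrayList<>();
        for (int v = 0; v < n; v++) grouped.add(new ArrayList<>());
        for (int u = 0; u < n; u++) {
            for (int v : adj.get(u)) {
                if (before(u, v, adj)) grouped.get(u).add(v);
            }
        }
        List<int[]> triangles = new ArrayList<>();
        for (int u = 0; u < n; u++) {
            List<Integer> out = grouped.get(u);
            // Every pair of edges grouped on u forms a triad (v - u - w).
            for (int i = 0; i < out.size(); i++) {
                for (int j = i + 1; j < out.size(); j++) {
                    int v = out.get(i);
                    int w = out.get(j);
                    // Keep the triad only if the closing edge v - w exists.
                    if (adj.get(v).contains(w)) triangles.add(new int[] {u, v, w});
                }
            }
        }
        return triangles;
    }

    /** Total order on vertices by (degree, index). */
    private static boolean before(int u, int v, List<Set<Integer>> adj) {
        int du = adj.get(u).size();
        int dv = adj.get(v).size();
        return du < dv || (du == dv && u < v);
    }
}
```

Because every triangle has exactly one vertex that comes first in the (degree, index) order, each triangle is reported once, from that vertex.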
Usage
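The algorithm takes a directed graph as input and outputs a DataSet of Tuple3. The vertex ID type must be Comparable. Each Tuple3 corresponds to one triangle, and each of its fields holds the ID of one vertex.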
Local Clustering Coefficient
Overview
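The local clustering coefficient measures how connected the neighborhood of each vertex is. Scores range from 0.0 (no edges between neighbors) to 1.0 (every pair of neighbors is connected).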
Details
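An edge between two neighbors of a vertex closes a triangle. Counting the edges between a vertex's neighbors is therefore equivalent to counting the triangles that contain the vertex. The clustering coefficient is the number of edges between neighbors divided by the number of potential edges between neighbors. Consider the following example:
Vertex A has the neighbors B, C and D. There are 2 edges between these neighbors (the red ones) out of 3 potential edges (all possible edges, including the dashed one), so the clustering coefficient of A is 2/3 = 0.6667. In the same way, the clustering coefficient is 0.3333 for B, 0.6667 for C and 0.3333 for D.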
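The ratio of existing to potential edges between neighbors can be computed with a few lines of plain Java. The sketch below is illustrative only: the class LocalClusteringCoefficientSketch is hypothetical, it uses an adjacency matrix for brevity, and reporting 0.0 for vertices with fewer than two neighbors is a choice of this example.

```java
/** Plain-Java sketch of the local clustering coefficient on an undirected graph. */
class LocalClusteringCoefficientSketch {

    /** Returns one score per vertex for the undirected edges given as {u, v} pairs. */
    static double[] run(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            if (e[0] != e[1]) {
                adj[e[0]][e[1]] = true;
                adj[e[1]][e[0]] = true;
            }
        }
        double[] score = new double[n];
        for (int v = 0; v < n; v++) {
            int degree = 0;
            int links = 0; // edges between neighbors of v = triangles containing v
            for (int a = 0; a < n; a++) {
                if (!adj[v][a]) continue;
                degree++;
                for (int b = a + 1; b < n; b++) {
                    if (adj[v][b] && adj[a][b]) links++;
                }
            }
            long possible = (long) degree * (degree - 1) / 2; // potential neighbor pairs
            score[v] = possible == 0 ? 0.0 : (double) links / possible;
        }
        return score;
    }
}
```

As a made-up example (not the graph from the figure), take the edges 0-1, 0-2, 0-3, 1-2 and 2-3: vertex 0 has three neighbors with two edges between them, so its score is 2/3, while vertex 1 has two neighbors that are connected, so its score is 1.0.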
Usage
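The method supports both directed and undirected graphs. It returns a DataSet of UnaryResult, where each UnaryResult contains the vertex ID, the vertex degree, and the number of triangles containing the vertex. UnaryResult also provides a method to compute the local clustering coefficient score, which can be called directly. Note that the vertex ID type must be Comparable and Copyable. The method has the following parameters:
- setIncludeZeroDegreeVertices: whether to include results for vertices with a degree of zero.
- setParallelism: sets the parallelism.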
Average Clustering Coefficient
Overview
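The average clustering coefficient measures the mean connectedness of a graph. Scores range from 0.0 (no edges between neighbors) to 1.0 (complete graph).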
Details
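The average clustering coefficient is the mean of the local clustering coefficient scores over all vertices that have at least two neighbors.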
Usage
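The method works on both directed and undirected graphs. It returns an AnalyticResult containing the average clustering coefficient of the graph and the number of vertices. Note that the vertex ID type must be Comparable and Copyable. Parameters:
- setParallelism: sets the parallelism.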
Global Clustering Coefficient
Overview
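The global clustering coefficient measures the connectedness of a graph. Scores range from 0.0 (no edges between neighbors) to 1.0 (complete graph).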
Details
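The global clustering coefficient is the fraction of neighbor pairs that are themselves connected, taken over the entire graph. Vertices with more neighbors carry more weight in this score, because the number of neighbor pairs grows quadratically with the degree.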
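As a rough illustration, the global score can be computed by counting all neighbor pairs (triplets) and the ones that are closed by an edge. The plain-Java sketch below uses a hypothetical class GlobalClusteringCoefficientSketch and an adjacency matrix. It follows the textbook definition for undirected graphs and is not taken from Gelly's source.

```java
/** Plain-Java sketch of the global clustering coefficient on an undirected graph. */
class GlobalClusteringCoefficientSketch {

    /** Returns closed triplets divided by all triplets, or 0.0 if there are no triplets. */
    static double run(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            if (e[0] != e[1]) {
                adj[e[0]][e[1]] = true;
                adj[e[1]][e[0]] = true;
            }
        }
        long triplets = 0; // neighbor pairs (a, b) of a centre vertex v
        long closed = 0;   // triplets whose end vertices a and b are also connected
        for (int v = 0; v < n; v++) {
            for (int a = 0; a < n; a++) {
                if (!adj[v][a]) continue;
                for (int b = a + 1; b < n; b++) {
                    if (!adj[v][b]) continue;
                    triplets++;
                    if (adj[a][b]) closed++;
                }
            }
        }
        // Every triangle contributes three closed triplets, one per centre vertex.
        return triplets == 0 ? 0.0 : (double) closed / triplets;
    }
}
```

For a triangle 0-1-2 with an extra edge 2-3, the degrees are 2, 2, 3 and 1, which gives 1 + 1 + 3 = 5 triplets, 3 of them closed, so the score is 0.6.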
Usage
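The method works on both directed and undirected graphs. It returns an AnalyticResult containing the number of triplets and the number of triangles in the graph, and provides a method to compute the global clustering coefficient score of the graph. Note that the vertex ID type must be Comparable and Copyable. Parameters:
- setParallelism: sets the parallelism.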
Triadic Census
Overview
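Any three vertices in a graph form a triad, and each pair among them may or may not be connected. The Triadic Census counts the occurrences of each type of triad in the graph.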
Details
This analytic counts the four undirected triad types (formed with 0, 1, 2, or 3 connecting edges) or 16 directed triad types by counting the triangles from Triangle Listing and running Vertex Metrics to obtain the number of triplets and edges. Triangle counts are then deducted from triplet counts, and triangle and triplet counts are removed from edge counts.
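For the undirected case, the deductions described above can be written down directly. The plain-Java sketch below is an illustration under assumptions: the class TriadicCensusSketch is hypothetical, the graph is simple (no self-loops or duplicate edges), and the counts are derived in memory rather than from Triangle Listing and Vertex Metrics.

```java
/** Plain-Java sketch of the undirected triadic census for a simple graph. */
class TriadicCensusSketch {

    /** Returns the number of triads with {0, 1, 2, 3} connecting edges. */
    static long[] run(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        long edgeCount = 0;
        long triplets = 0;  // paths a - v - b, counted at the centre vertex v
        long triangles = 0; // each triangle counted once, at its smallest vertex
        for (int v = 0; v < n; v++) {
            for (int a = 0; a < n; a++) {
                if (!adj[v][a]) continue;
                if (v < a) edgeCount++;
                for (int b = a + 1; b < n; b++) {
                    if (!adj[v][b]) continue;
                    triplets++;
                    if (v < a && adj[a][b]) triangles++;
                }
            }
        }
        long three = triangles;
        // A triangle contains three triplets; the rest are open, two-edge triads.
        long two = triplets - 3 * triangles;
        // Each edge lies in (n - 2) triads; a triad with k edges is counted k times.
        long one = edgeCount * (n - 2) - 2 * two - 3 * three;
        long total = (long) n * (n - 1) * (n - 2) / 6;
        long zero = total - one - two - three;
        return new long[] {zero, one, two, three};
    }
}
```

For a triangle 0-1-2 with an extra edge 2-3, the four possible triads split into one triangle, two triads with two edges, one triad with one edge, and none without edges.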
Usage
Directed and undirected variants are provided. The analytics take a simple graph as input and output an AnalyticResult with accessor methods for querying the count of each triad type. The graph ID type must be Comparable and Copyable.
- setParallelism: override the parallelism of operators processing small amounts of data
Hyperlink-Induced Topic Search
Overview
Hyperlink-Induced Topic Search (HITS) computes two interdependent scores, a hub score and an authority score, for every vertex in a directed graph. Good hubs are those which point to many good authorities and good authorities are those pointed to by many good hubs.
Details
Every vertex is assigned the same initial hub and authority scores. The algorithm then iteratively updates the scores until termination. During each iteration new hub scores are computed from the authority scores, then new authority scores are computed from the new hub scores. The scores are then normalized and optionally tested for convergence. HITS is similar to PageRank but vertex scores are emitted in full to each neighbor whereas in PageRank the vertex score is first divided by the number of neighbors.
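The iteration described above can be sketched in plain Java. This is a minimal illustration and not Gelly's implementation: the class HitsSketch is hypothetical, the run uses a fixed number of iterations without a convergence test, and normalizing by the Euclidean norm is an assumption of this example.

```java
import java.util.Arrays;

/** Plain-Java sketch of the HITS hub and authority iteration. */
class HitsSketch {

    /** Returns {hubScores, authorityScores} for directed edges {source, target}. */
    static double[][] run(int n, int[][] edges, int iterations) {
        double[] hub = new double[n];
        double[] auth = new double[n];
        Arrays.fill(hub, 1.0);
        Arrays.fill(auth, 1.0);
        for (int i = 0; i < iterations; i++) {
            // New hub score: sum of the authority scores of the vertices pointed to.
            double[] newHub = new double[n];
            for (int[] e : edges) newHub[e[0]] += auth[e[1]];
            // New authority score: sum of the new hub scores of the vertices pointing in.
            // Scores are passed on in full, not divided by the number of neighbors.
            double[] newAuth = new double[n];
            for (int[] e : edges) newAuth[e[1]] += newHub[e[0]];
            hub = normalize(newHub);
            auth = normalize(newAuth);
        }
        return new double[][] {hub, auth};
    }

    /** Scales the vector to unit Euclidean length; a zero vector is returned as is. */
    private static double[] normalize(double[] x) {
        double sumOfSquares = 0.0;
        for (double value : x) sumOfSquares += value * value;
        double norm = Math.sqrt(sumOfSquares);
        if (norm == 0.0) return x;
        double[] result = new double[x.length];
        for (int i = 0; i < x.length; i++) result[i] = x[i] / norm;
        return result;
    }
}
```

With the two edges 0 -> 2 and 1 -> 2, vertex 2 ends up as the only authority and vertices 0 and 1 share the hub score equally.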
Usage
The algorithm takes a simple directed graph as input and outputs a DataSet of UnaryResult containing the vertex ID, hub score, and authority score. Termination is configured by the number of iterations and/or a convergence threshold on the iteration sum of the change in scores over all vertices.
- setIncludeZeroDegreeVertices: whether to include zero-degree vertices in the iterative computation
- setParallelism: override the operator parallelism
PageRank
Overview
PageRank is an algorithm that was first used to rank web search engine results. Today, the algorithm and many variations are used in various graph application domains. The idea of PageRank is that important or relevant vertices tend to link to other important vertices.
Details
The algorithm operates in iterations, where pages distribute their scores to their neighbors (pages they have links to) and subsequently update their scores based on the sum of values they receive. In order to consider the importance of a link from one page to another, scores are divided by the total number of out-links of the source page. Thus, a page with 10 links will distribute 1/10 of its score to each neighbor, while a page with 100 links will distribute 1/100 of its score to each neighboring page.
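The score distribution described above can be sketched in plain Java. This is an illustrative sketch, not Gelly's implementation: the class PageRankSketch is hypothetical, and both the damping factor parameter and the uniform redistribution of score from pages without out-links are assumptions that the text above does not spell out.

```java
import java.util.Arrays;

/** Plain-Java sketch of PageRank by repeated score distribution. */
class PageRankSketch {

    /** Returns one score per vertex for directed edges {source, target}. */
    static double[] run(int n, int[][] edges, double dampingFactor, int iterations) {
        int[] outDegree = new int[n];
        for (int[] e : edges) outDegree[e[0]]++;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int i = 0; i < iterations; i++) {
            // Assumption: score held by pages without out-links is spread over all pages.
            double dangling = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) dangling += rank[v];
            }
            // Each page splits its score evenly over its out-links.
            double[] received = new double[n];
            for (int[] e : edges) received[e[1]] += rank[e[0]] / outDegree[e[0]];
            // Each page updates its score from the sum of the values it received.
            for (int v = 0; v < n; v++) {
                rank[v] = (1.0 - dampingFactor) / n
                        + dampingFactor * (received[v] + dangling / n);
            }
        }
        return rank;
    }
}
```

A call such as PageRankSketch.run(3, new int[][] {{0, 1}, {1, 2}, {2, 0}}, 0.85, 20) on a directed cycle leaves every page with the same score of about 1/3, since each page passes its whole score to exactly one neighbor.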
Usage
The algorithm takes a directed graph as input and outputs a DataSet where each Result contains the vertex ID and PageRank score. Termination is configured with a maximum number of iterations and/or a convergence threshold on the sum of the change in score for each vertex between iterations.
- setParallelism: override the operator parallelism
Adamic-Adar
Overview
Adamic-Adar measures the similarity between pairs of vertices as the sum of the inverse logarithm of degree over shared neighbors. Scores are non-negative and unbounded. A vertex with higher degree has greater overall influence but is less influential to each pair of neighbors.
Details
The algorithm first annotates each vertex with the inverse of the logarithm of the vertex degree then joins this score onto edges by source vertex. Grouping on the source vertex, each pair of neighbors is emitted with the vertex score. Grouping on vertex pairs, the Adamic-Adar score is summed.
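The scoring can be illustrated with a small plain-Java sketch. It is not Gelly's implementation: the class AdamicAdarSketch is hypothetical, the graph is assumed to be simple and undirected, an adjacency matrix replaces the joins and groupings, and the natural logarithm is an assumption of this example.

```java
/** Plain-Java sketch of Adamic-Adar similarity on a simple undirected graph. */
class AdamicAdarSketch {

    /** Returns a symmetric matrix; score[u][w] is the similarity of u and w. */
    static double[][] run(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        int[] degree = new int[n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
            degree[e[0]]++;
            degree[e[1]]++;
        }
        double[][] score = new double[n][n];
        for (int v = 0; v < n; v++) {
            if (degree[v] < 2) continue; // no pair of neighbors to emit
            // Annotate the shared neighbor with the inverse logarithm of its degree.
            double weight = 1.0 / Math.log(degree[v]);
            // Emit every pair of neighbors of v and sum the weights per pair.
            for (int u = 0; u < n; u++) {
                if (!adj[v][u]) continue;
                for (int w = u + 1; w < n; w++) {
                    if (!adj[v][w]) continue;
                    score[u][w] += weight;
                    score[w][u] += weight;
                }
            }
        }
        return score;
    }
}
```

In the 4-cycle 0-1-2-3-0, vertices 0 and 2 share the neighbors 1 and 3, each of degree 2, so their score is 2 / ln 2, while the adjacent vertices 0 and 1 share no neighbor and score 0.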
Usage
The algorithm takes a simple undirected graph as input and outputs a DataSet of BinaryResult containing two vertex IDs and the Adamic-Adar similarity score. The graph ID type must be Copyable.
Parameters:
- setMinimumRatio: filter out Adamic-Adar scores less than the given ratio times the average score
- setMinimumScore: filter out Adamic-Adar scores less than the given minimum
- setParallelism: override the parallelism of operators processing small amounts of data
Jaccard Index
Overview
The Jaccard Index measures the similarity between vertex neighborhoods and is computed as the number of shared neighbors divided by the number of distinct neighbors. Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are shared).
Details
Counting shared neighbors for pairs of vertices is equivalent to counting connecting paths of length two. The number of distinct neighbors is computed by storing the sum of degrees of the vertex pair and subtracting the count of shared neighbors, which are double-counted in the sum of degrees.
The algorithm first annotates each edge with the target vertex’s degree. Grouping on the source vertex, each pair of neighbors is emitted with the degree sum. Grouping on vertex pairs, the shared neighbors are counted.
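The degree-sum trick from the paragraphs above can be shown in a few lines of plain Java. The sketch is illustrative only: the class JaccardIndexSketch and its helper methods are hypothetical, the graph is assumed to be simple and undirected, and a single pair is evaluated directly instead of grouping over all pairs.

```java
/** Plain-Java sketch of the Jaccard Index for one pair of vertices. */
class JaccardIndexSketch {

    /** Builds an adjacency matrix from undirected edges given as {u, v} pairs. */
    static boolean[][] adjacency(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }

    /** Returns {sharedNeighbors, distinctNeighbors} for the vertices u and w. */
    static int[] counts(boolean[][] adj, int u, int w) {
        int shared = 0;
        int degreeSum = 0;
        for (int v = 0; v < adj.length; v++) {
            if (adj[u][v]) degreeSum++;
            if (adj[w][v]) degreeSum++;
            if (adj[u][v] && adj[w][v]) shared++;
        }
        // Shared neighbors are counted twice in the degree sum, so subtract them once.
        return new int[] {shared, degreeSum - shared};
    }

    /** Shared neighbors divided by distinct neighbors. */
    static double score(int[] counts) {
        return (double) counts[0] / counts[1];
    }
}
```

With the edges 0-1, 0-2, 3-1 and 3-4, vertices 0 and 3 share one neighbor (vertex 1) out of three distinct neighbors, which gives a score of 1/3.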
Usage
The algorithm takes a simple undirected graph as input and outputs a DataSet of tuples containing two vertex IDs, the number of shared neighbors, and the number of distinct neighbors. The result class provides a method to compute the Jaccard Index score. The graph ID type must be Copyable.
Parameters:
- setMaximumScore: filter out Jaccard Index scores greater than or equal to the given maximum fraction
- setMinimumScore: filter out Jaccard Index scores less than the given minimum fraction
- setParallelism: override the parallelism of operators processing small amounts of data
Vertex Metrics
Overview
Gelly's vertex metrics analytic supports directed and undirected graphs and computes the following statistics:
- number of vertices
- number of edges
- average degree
- number of triplets
- maximum degree
- maximum number of triplets
The following statistics are additionally computed for directed graphs:
- number of unidirectional edges
- number of bidirectional edges
- maximum out degree
- maximum in degree
Details
These vertex statistics are computed over the vertex degrees generated by degree.annotate.directed.VertexDegrees and degree.annotate.undirected.VertexDegree.
Usage
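The method works on both directed and undirected graphs. It returns an AnalyticResult with accessor methods for the computed vertex statistics. Note that the vertex ID type must be Comparable and Copyable. Parameters:
- setIncludeZeroDegreeVertices: whether to include results for vertices with a degree of zero.
- setParallelism: sets the parallelism.
- setReduceOnTargetId: undirected graphs only. The degree can be counted from either the edge source IDs or the edge target IDs, and this option selects the target IDs (the source IDs are counted by default).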
Edge Metrics
Overview
The edge metrics analytic computes the following statistics:
- number of triangle triplets
- number of rectangle triplets
- maximum number of triangle triplets
- maximum number of rectangle triplets
Details
These edge statistics are computed over the edge degrees generated by degree.annotate.directed.EdgeDegreesPair and degree.annotate.undirected.EdgeDegreePair.
Usage
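Parameters:
- setParallelism: sets the parallelism.
- setReduceOnTargetId: undirected graphs only. The degree can be counted from either the edge source IDs or the edge target IDs, and this option selects the target IDs (the source IDs are counted by default).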