1. Basic structure
A VertexRDD can be thought of as a collection of tuples, each made up of a globally unique key of type Long, the VertexId, and the user-defined vertex attributes, VD.
/**
* Extends `RDD[(VertexId, VD)]` by ensuring that there is only one entry for each vertex and by
* pre-indexing the entries for fast, efficient joins. Two VertexRDDs with the same index can be
* joined efficiently. All operations except [[reindex]] preserve the index. To construct a
* `VertexRDD`, use the [[org.apache.spark.graphx.VertexRDD$ VertexRDD object]].
*
* Additionally, stores routing information to enable joining the vertex attributes with an
* [[EdgeRDD]].
*
* @example Construct a `VertexRDD` from a plain RDD:
* {{{
* // Construct an initial vertex set
* val someData: RDD[(VertexId, SomeType)] = loadData(someFile)
* val vset = VertexRDD(someData)
* // If there were redundant values in someData we would use a reduceFunc
* val vset2 = VertexRDD(someData, reduceFunc)
* // Finally we can use the VertexRDD to index another dataset
* val otherData: RDD[(VertexId, OtherType)] = loadData(otherFile)
* val vset3 = vset2.innerJoin(otherData) { (vid, a, b) => b }
* // Now we can construct very fast joins between the two sets
* val vset4: VertexRDD[(SomeType, OtherType)] = vset.leftJoin(vset3)
* }}}
*
* @tparam VD the vertex attribute associated with each vertex in the set.
*/
abstract class VertexRDD[VD](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps)
The comment shows how to assemble a VertexRDD from a plain RDD, that is, an ordinary RDD whose elements carry the vertex properties.
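To make the "only one entry for each vertex" guarantee concrete, here is a minimal sketch in plain Scala collections. It is not the GraphX implementation and does not use Spark; `VertexSetSketch` and `dedupe` are hypothetical names, and a `Map` stands in for the indexed vertex set. It only mimics what passing a `reduceFunc` to the constructor is described as doing.

```scala
object VertexSetSketch {
  type VertexId = Long

  // Merge entries that share a vertex id with `reduceFunc`,
  // so the result holds exactly one attribute per vertex.
  def dedupe[VD](data: Seq[(VertexId, VD)])(reduceFunc: (VD, VD) => VD): Map[VertexId, VD] =
    data.groupBy(_._1).map { case (vid, pairs) =>
      vid -> pairs.map(_._2).reduce(reduceFunc)
    }
}
```

For example, `VertexSetSketch.dedupe(Seq((1L, 2), (1L, 3), (2L, 5)))(_ + _)` merges the two entries for vertex 1 into a single one.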
2. Internally maintained structures
After the vertex cut, VertexRDD has to keep track of the partitioned data. It does so through partitionsRDD, which is an RDD[ShippableVertexPartition[VD]].
implicit protected def vdTag: ClassTag[VD]
private[graphx] def partitionsRDD: RDD[ShippableVertexPartition[VD]]
override protected def getPartitions: Array[Partition] = partitionsRDD.partitions
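3. Methods related to whole-graph operations
- minus - set difference: keeps the vertices of this set whose ids do not appear in the other set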
/**
* For each VertexId present in both `this` and `other`, minus will act as a set difference
* operation returning only those unique VertexId's present in `this`.
*
* @param other an RDD to run the set operation against
*/
def minus(other: RDD[(VertexId, VD)]): VertexRDD[VD]
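The set-difference behaviour described in the comment can be modelled on plain Scala maps. This is only an illustrative sketch under the assumption that a `Map` stands in for a vertex set; `MinusSketch` is a hypothetical name and nothing here touches Spark.

```scala
object MinusSketch {
  type VertexId = Long

  // Keep the entries of `self` whose vertex id does not occur in `other`.
  // The attribute values stored in `other` play no role.
  def minus[VD](self: Map[VertexId, VD], other: Map[VertexId, VD]): Map[VertexId, VD] =
    self.filter { case (vid, _) => !other.contains(vid) }
}
```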
- diff - the vertices whose values differ from those in another vertex set
/**
* For each vertex present in both `this` and `other`, `diff` returns only those vertices with
* differing values; for values that are different, keeps the values from `other`. This is
* only guaranteed to work if the VertexRDDs share a common ancestor.
*
* @param other the other RDD[(VertexId, VD)] with which to diff against.
*/
def diff(other: RDD[(VertexId, VD)]): VertexRDD[VD]
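Read literally, the comment says that only vertices present on both sides are considered, and that a vertex is returned, with the value from `other`, when the two values differ. A rough model of that reading on plain Scala maps (hypothetical `DiffSketch`, no Spark involved) could look like this:

```scala
object DiffSketch {
  type VertexId = Long

  // For vertices present in both sets, keep only those whose values differ,
  // taking the value from `other`. Vertices found on one side only are dropped.
  def diff[VD](self: Map[VertexId, VD], other: Map[VertexId, VD]): Map[VertexId, VD] =
    other.filter { case (vid, value) => self.get(vid).exists(_ != value) }
}
```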
- leftJoin - left join with another vertex set; every vertex of this set is kept
/**
* Left joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
* backed by a VertexRDD with the same index then the efficient [[leftZipJoin]] implementation is
* used. The resulting VertexRDD contains an entry for each vertex in `this`. If `other` is
* missing any vertex in this VertexRDD, `f` is passed `None`. If there are duplicates,
* the vertex is picked arbitrarily.
*
* @tparam VD2 the attribute type of the other VertexRDD
* @tparam VD3 the attribute type of the resulting VertexRDD
*
* @param other the other VertexRDD with which to join
* @param f the function mapping a vertex id and its attributes in this and the other vertex set
* to a new vertex attribute.
* @return a VertexRDD containing all the vertices in this VertexRDD with the attributes emitted
* by `f`.
*/
def leftJoin[VD2: ClassTag, VD3: ClassTag]
(other: RDD[(VertexId, VD2)])
(f: (VertexId, VD, Option[VD2]) => VD3)
: VertexRDD[VD3]
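The contract of `f` is easier to see on a small model. The sketch below assumes plain Scala maps in place of the two vertex sets and uses the hypothetical name `LeftJoinSketch`; it copies the shape of the signature above but is not the GraphX code path.

```scala
object LeftJoinSketch {
  type VertexId = Long

  // Every vertex of `self` appears in the result. `f` receives Some(attr)
  // when `other` has the vertex and None when it does not.
  def leftJoin[VD, VD2, VD3](self: Map[VertexId, VD], other: Map[VertexId, VD2])(
      f: (VertexId, VD, Option[VD2]) => VD3): Map[VertexId, VD3] =
    self.map { case (vid, attr) => vid -> f(vid, attr, other.get(vid)) }
}
```

Vertices that exist only in `other` never reach `f`, which is what makes the join a left join.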
- innerJoin - inner join of two vertex sets; only vertices present in both are kept
/**
* Inner joins this VertexRDD with an RDD containing vertex attribute pairs. If the other RDD is
* backed by a VertexRDD with the same index then the efficient [[innerZipJoin]] implementation
* is used.
*
* @param other an RDD containing vertices to join. If there are multiple entries for the same
* vertex, one is picked arbitrarily. Use [[aggregateUsingIndex]] to merge multiple entries.
* @param f the join function applied to corresponding values of `this` and `other`
* @return a VertexRDD co-indexed with `this`, containing only vertices that appear in both
* `this` and `other`, with values supplied by `f`
*/
def innerJoin[U: ClassTag, VD2: ClassTag](other: RDD[(VertexId, U)])
(f: (VertexId, VD, U) => VD2): VertexRDD[VD2]
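For comparison with the left join, an inner join can be modelled the same way. Again this is a sketch on plain Scala maps with a hypothetical name, `InnerJoinSketch`, rather than anything taken from GraphX.

```scala
object InnerJoinSketch {
  type VertexId = Long

  // Only vertices present in both `self` and `other` survive;
  // `f` combines the two attributes into the new value.
  def innerJoin[VD, U, VD2](self: Map[VertexId, VD], other: Map[VertexId, U])(
      f: (VertexId, VD, U) => VD2): Map[VertexId, VD2] =
    self.collect {
      case (vid, attr) if other.contains(vid) => vid -> f(vid, attr, other(vid))
    }
}
```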
- reverseRoutingTables - returns a VertexRDD whose routing tables reflect reversing every edge of the directed graph
/**
* Returns a new `VertexRDD` reflecting a reversal of all edge directions in the corresponding
* [[EdgeRDD]].
*/
def reverseRoutingTables(): VertexRDD[VD]
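4. Methods implemented in the companion object
- createRoutingTables - builds the table that maps vertices to partitions
As mentioned earlier, one very important feature of GraphX is its use of vertex cut: after cutting, different edges are placed in different partitions, and a routing table is maintained to record the mapping from each vertex to the partitions.
The inputs to this method are an EdgeRDD and a Partitioner.
As covered in the chapter on RDD, a Partitioner is what assigns an RDD's tuples to partitions by key, and several partitioning strategies exist. The number of partitions in an RDD determines the degree of parallelism.
This method is called from the apply method of the same object.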
private[graphx] def createRoutingTables(
    edges: EdgeRDD[_], vertexPartitioner: Partitioner): RDD[RoutingTablePartition] = {
  // Determine which vertices each edge partition needs by creating a mapping from vid to pid.
  val vid2pid = edges.partitionsRDD.mapPartitions(_.flatMap(
    Function.tupled(RoutingTablePartition.edgePartitionToMsgs)))
    .setName("VertexRDD.createRoutingTables - vid2pid (aggregation)")
  val numEdgePartitions = edges.partitions.size
  vid2pid.partitionBy(vertexPartitioner).mapPartitions(
    iter => Iterator(RoutingTablePartition.fromMsgs(numEdgePartitions, iter)),
    preservesPartitioning = true)
}
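The two steps in this method (emit a vid-to-pid message for every edge endpoint, then regroup the messages by vertex partition) can be sketched without Spark. Everything below is an assumption made for illustration: `RoutingTableSketch`, the modulo-based `vertexPartition` standing in for the Partitioner, and the nested `Map` standing in for `RoutingTablePartition`, whose real layout is not shown in this section.

```scala
object RoutingTableSketch {
  type VertexId = Long
  type PartitionID = Int

  // Hypothetical stand-in for the vertex Partitioner: non-negative modulo.
  def vertexPartition(vid: VertexId, numVertexPartitions: Int): PartitionID = {
    val m = (vid % numVertexPartitions).toInt
    if (m < 0) m + numVertexPartitions else m
  }

  // edgePartitions(i) holds the (src, dst) pairs stored in edge partition i.
  // Result: vertex partition -> (edge partition -> vertex ids it references).
  def routingTables(
      edgePartitions: Seq[Seq[(VertexId, VertexId)]],
      numVertexPartitions: Int): Map[PartitionID, Map[PartitionID, Set[VertexId]]] = {
    // Step 1: one (vid, edge partition id) message per edge endpoint.
    val vid2pid: Seq[(VertexId, PartitionID)] =
      for {
        (edges, pid) <- edgePartitions.zipWithIndex
        (src, dst) <- edges
        vid <- Seq(src, dst)
      } yield (vid, pid)

    // Step 2: regroup the messages by the vertex partition that owns each vid.
    vid2pid
      .groupBy { case (vid, _) => vertexPartition(vid, numVertexPartitions) }
      .map { case (vpid, msgs) =>
        vpid -> msgs.groupBy(_._2).map { case (epid, ms) => epid -> ms.map(_._1).toSet }
      }
  }
}
```

With this table, a vertex partition knows which edge partitions need each of its vertices, which is the information a join between vertex attributes and edges relies on.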
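- apply - assembles a VertexRDD
As the code shows, if the vertices have no Partitioner, a HashPartitioner is used to partition them. In practice, though, how things are partitioned here depends more on the vertex-cut strategy discussed earlier.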
/**
* Constructs a `VertexRDD` from an RDD of vertex-attribute pairs. Duplicate vertex entries are
* merged using `mergeFunc`. The resulting `VertexRDD` will be joinable with `edges`, and any
* missing vertices referred to by `edges` will be created with the attribute `defaultVal`.
*
* @tparam VD the vertex attribute type
*
* @param vertices the collection of vertex-attribute pairs
* @param edges the [[EdgeRDD]] that these vertices may be joined with
* @param defaultVal the vertex attribute to use when creating missing vertices
* @param mergeFunc the commutative, associative duplicate vertex attribute merge function
*/
def apply[VD: ClassTag](
    vertices: RDD[(VertexId, VD)], edges: EdgeRDD[_], defaultVal: VD, mergeFunc: (VD, VD) => VD
  ): VertexRDD[VD] = {
  val vPartitioned: RDD[(VertexId, VD)] = vertices.partitioner match {
    case Some(p) => vertices
    case None => vertices.partitionBy(new HashPartitioner(vertices.partitions.size))
  }
  val routingTables = createRoutingTables(edges, vPartitioned.partitioner.get)
  val vertexPartitions = vPartitioned.zipPartitions(routingTables, preservesPartitioning = true) {
    (vertexIter, routingTableIter) =>
      val routingTable =
        if (routingTableIter.hasNext) routingTableIter.next() else RoutingTablePartition.empty
      Iterator(ShippableVertexPartition(vertexIter, routingTable, defaultVal, mergeFunc))
  }
  new VertexRDDImpl(vertexPartitions)
}
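The observable effect of `mergeFunc` and `defaultVal` can be shown on a toy model. The sketch below uses plain Scala collections and the hypothetical names `VertexApplySketch` and `build`; it ignores partitioning and routing entirely and only mimics the two behaviours stated in the comment: duplicates are merged, and vertices referenced by edges but missing from the input get the default attribute.

```scala
object VertexApplySketch {
  type VertexId = Long

  def build[VD](
      vertices: Seq[(VertexId, VD)],
      edges: Seq[(VertexId, VertexId)],
      defaultVal: VD,
      mergeFunc: (VD, VD) => VD): Map[VertexId, VD] = {
    // Merge duplicate vertex entries with mergeFunc.
    val merged: Map[VertexId, VD] =
      vertices.groupBy(_._1).map { case (vid, ps) => vid -> ps.map(_._2).reduce(mergeFunc) }
    // Collect every vertex id that some edge refers to.
    val referenced: Set[VertexId] = edges.flatMap { case (src, dst) => Seq(src, dst) }.toSet
    // Create the vertices the edges mention but the input lacks.
    val missing: Map[VertexId, VD] =
      (referenced -- merged.keySet).map(vid => vid -> defaultVal).toMap
    merged ++ missing
  }
}
```

Since the comment asks for a commutative, associative `mergeFunc`, the order in which duplicates are combined should not matter in real use; the string concatenation in a quick test would be order-sensitive and is only handy for making the merge visible.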