Creating a Graph in GraphX

How to build a graph

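Building a graph in GraphX is straightforward and takes three steps (steps 1 and 2 are independent of each other, so either can come first):

  1. Build the edge RDD
  2. Build the vertex RDD
  3. Create the Graph object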

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Assumes `spark` is an existing SparkSession (for example, the one provided by spark-shell)
val myVertices: RDD[(Long, String)] = spark.sparkContext.makeRDD(Array((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
val myEdges: RDD[Edge[String]] = spark.sparkContext.makeRDD(Array(Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
      Edge(3L, 4L, "is-friends-with"), Edge(4L, 5L, "Likes-status"), Edge(3L, 5L, "Wrote-status")))
val myGraph: Graph[String, String] = Graph(myVertices, myEdges)

The Graph companion object defines an apply method, so the expression Graph(myVertices, myEdges) is actually a call to Graph.apply(). Its source is:

  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }
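
The defaultVertexAttr parameter in this signature is easy to overlook. It supplies the attribute for vertices that are referenced by an edge but missing from the vertex RDD. The following is a minimal sketch of that behaviour, not code from the original example: the object name, the local master setting and the "unknown" placeholder value are all assumptions made for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the name is not part of GraphX.
object DefaultVertexAttrExample {
  // Builds a graph in which vertex 3 appears only in the edge list.
  def build(sc: SparkContext): Graph[String, String] = {
    val vertices: RDD[(Long, String)] = sc.makeRDD(Seq((1L, "a"), (2L, "b")))
    val edges: RDD[Edge[String]] = sc.makeRDD(Seq(
      Edge(1L, 2L, "is-friends-with"),
      Edge(2L, 3L, "is-friends-with")))
    // The third argument is defaultVertexAttr; "unknown" is an illustrative placeholder.
    Graph(vertices, edges, "unknown")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("default-vertex-attr")
      .getOrCreate()
    // Sort on the driver because collect() does not guarantee an order.
    build(spark.sparkContext).vertices.collect().sortBy(_._1).foreach(println)
    spark.stop()
  }
}
```

With the two-argument call used earlier, the default is null.asInstanceOf[VD], so a vertex that appears only in the edge list would carry a null attribute.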

vertices

Once a graph has been built in GraphX, calling vertices returns its vertex set as a VertexRDD:

myGraph.vertices.collect().foreach(println)

(4,d)
(1,a)
(5,e)
(2,b)
(3,c)

The VertexRDD source is:

abstract class VertexRDD[VD](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) 

VertexRDD[VD] extends RDD[(VertexId, VD)]. VertexId is the vertex ID, which GraphX defines as a 64-bit Long (type VertexId = Long), and VD is the type of the vertex attribute.
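
Because a VertexRDD is an RDD of (VertexId, VD) pairs, the ordinary RDD operations work on it. The sketch below illustrates this with filter and map. The object and method names (VertexRddExample, sampleGraph, upperCaseFrom) are hypothetical helpers introduced here, and the sample graph is rebuilt only so that the block runs on its own.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names below are not part of GraphX.
object VertexRddExample {
  // Rebuilds the article's sample graph so that the example is self-contained.
  def sampleGraph(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.makeRDD(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
    val edges = sc.makeRDD(Seq(
      Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
      Edge(3L, 4L, "is-friends-with"), Edge(4L, 5L, "Likes-status"),
      Edge(3L, 5L, "Wrote-status")))
    Graph(vertices, edges)
  }

  // Keeps the vertices whose id is at least minId and upper-cases their attribute.
  def upperCaseFrom(graph: Graph[String, String], minId: VertexId): Array[(VertexId, String)] =
    graph.vertices
      .filter { case (id, _) => id >= minId }            // each element is a (VertexId, VD) pair
      .map { case (id, attr) => (id, attr.toUpperCase) } // plain RDD map
      .collect()
      .sortBy(_._1) // collect() order is not guaranteed, so sort on the driver

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("vertex-rdd")
      .getOrCreate()
    upperCaseFrom(sampleGraph(spark.sparkContext), 4L).foreach(println)
    spark.stop()
  }
}
```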

edges

Once a graph has been built in GraphX, calling edges returns its edge set as an EdgeRDD:

myGraph.edges.collect().foreach(println)

Edge(1,2,is-friends-with)
Edge(2,3,is-friends-with)
Edge(3,4,is-friends-with)
Edge(3,5,Wrote-status)
Edge(4,5,Likes-status)

The EdgeRDD source is:

abstract class EdgeRDD[ED](
    sc: SparkContext,
    deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps)

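EdgeRDD[ED] extends RDD[Edge[ED]], where ED is the type of the edge attribute. Edge[ED] represents a directed edge made up of a source vertex, a destination vertex and an edge attribute. Its source is: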

case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
    var srcId: VertexId = 0,
    var dstId: VertexId = 0,
    var attr: ED = null.asInstanceOf[ED])
  extends Serializable

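Here ED is the type of the edge attribute, srcId is the VertexId of the source vertex, dstId is the VertexId of the destination vertex, and attr is the edge's attribute value.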
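
As a small illustration of these three fields, the sketch below selects the edges with a given attribute and returns their (srcId, dstId) pairs. EdgeFieldsExample, sampleGraph and endpointsWithAttr are hypothetical names introduced for this example, and the sample graph is rebuilt so that the block runs on its own.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names below are not part of GraphX.
object EdgeFieldsExample {
  // Rebuilds the article's sample graph so that the example is self-contained.
  def sampleGraph(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.makeRDD(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
    val edges = sc.makeRDD(Seq(
      Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
      Edge(3L, 4L, "is-friends-with"), Edge(4L, 5L, "Likes-status"),
      Edge(3L, 5L, "Wrote-status")))
    Graph(vertices, edges)
  }

  // Returns the (source, destination) ids of every edge whose attribute equals attr.
  def endpointsWithAttr(graph: Graph[String, String], attr: String): Array[(VertexId, VertexId)] =
    graph.edges
      .filter(e => e.attr == attr)  // attr is the edge attribute (ED = String here)
      .map(e => (e.srcId, e.dstId)) // srcId and dstId are VertexId values
      .collect()
      .sorted // collect() order is not guaranteed, so sort on the driver

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("edge-fields")
      .getOrCreate()
    endpointsWithAttr(sampleGraph(spark.sparkContext), "is-friends-with").foreach(println)
    spark.stop()
  }
}
```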

triplets

Once a graph has been built, calling triplets returns an RDD[EdgeTriplet[VD, ED]], that is, an RDD whose elements are of type EdgeTriplet[VD, ED]:

myGraph.triplets.collect().foreach(println)

((1,a),(2,b),is-friends-with)
((2,b),(3,c),is-friends-with)
((3,c),(4,d),is-friends-with)
((3,c),(5,e),Wrote-status)
((4,d),(5,e),Likes-status)

The EdgeTriplet source is:

class EdgeTriplet[VD, ED] extends Edge[ED] {
  /**
   * The source vertex attribute
   */
  var srcAttr: VD = _ // nullValue[VD]
  /**
   * The destination vertex attribute
   */
  var dstAttr: VD = _ // nullValue[VD]
  /**
   * Set the edge properties of this triplet.
   */
  protected[spark] def set(other: Edge[ED]): EdgeTriplet[VD, ED] = {
    srcId = other.srcId
    dstId = other.dstId
    attr = other.attr
    this
  }
  ...
}

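EdgeTriplet[VD, ED] extends Edge[ED], where VD is the type of the vertex attribute and ED is the type of the edge attribute. It also adds two member variables, srcAttr and dstAttr, which hold the attribute values of the source and destination vertices.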
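
Since a triplet carries both endpoint attributes along with the edge attribute, it is convenient for turning each edge into a readable sentence. The following is a sketch under the same assumptions as the sample graph above. TripletExample, sampleGraph and describe are hypothetical names, and the results are mapped to strings before collect() so that no triplet objects are held on the driver.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

// Hypothetical example object; the names below are not part of GraphX.
object TripletExample {
  // Rebuilds the article's sample graph so that the example is self-contained.
  def sampleGraph(sc: SparkContext): Graph[String, String] = {
    val vertices = sc.makeRDD(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
    val edges = sc.makeRDD(Seq(
      Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
      Edge(3L, 4L, "is-friends-with"), Edge(4L, 5L, "Likes-status"),
      Edge(3L, 5L, "Wrote-status")))
    Graph(vertices, edges)
  }

  // Renders every triplet as "<source attr> <edge attr> <destination attr>".
  def describe(graph: Graph[String, String]): Array[String] =
    graph.triplets
      .map(t => s"${t.srcAttr} ${t.attr} ${t.dstAttr}") // srcAttr and dstAttr come from EdgeTriplet
      .collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: run locally for the demo
      .appName("triplets")
      .getOrCreate()
    describe(sampleGraph(spark.sparkContext)).foreach(println)
    spark.stop()
  }
}
```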

The UML diagram of the Graph class and its dependencies in GraphX:
[Figure: GraphX UML diagram (GraphX的UML.png)]

References

  1. Spark GraphX in Action, Michael S. Malak.
  2. https://www.bookstack.cn/read...
