Graph Operations in GraphX

GraphX ships with many built-in functions for operating on graphs.

mapVertices

mapVertices transforms each vertex attribute in the graph using the map function. In other words, it returns a new graph with the same structure whose vertex attributes have been converted.

  def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2)
    (implicit eq: VD =:= VD2 = null): Graph[VD2, ED]

Here VD2 is the vertex attribute type after the transformation, and map is the transformation function.

Example

myGraph.vertices.collect().foreach(println)
(4,d)
(1,a)
(5,e)
(2,b)
(3,c)

myGraph.mapVertices[Int]((vertexId, _) => if (vertexId < 2L) 0 else 1).vertices.collect().foreach(println)
(4,1)
(1,0)
(5,1)
(2,1)
(3,1)
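
The article does not show how myGraph was built. The sketch below reconstructs it from the vertex and edge listings printed in this article, so the construction itself is an assumption. It then applies mapVertices with a function that uses both the vertex id and the old attribute. It assumes Spark with GraphX on the classpath; in spark-shell, getOrCreate simply reuses the existing context. The app name and the value name tagged are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Local context for a standalone run; spark-shell already provides one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

// The map function receives (VertexId, old attribute) and returns the new attribute.
val tagged: Graph[String, String] =
  myGraph.mapVertices((id, name) => s"$name#$id")

tagged.vertices.collect().sortBy(_._1).foreach(println)
```

The edges and the graph structure are untouched; only the vertex attributes differ between myGraph and tagged.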

mapEdges

mapEdges transforms each edge attribute in the graph using the map function, returning a new graph with the converted edge attributes.

// (1)
  def mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): Graph[VD, ED2] = {
    mapEdges((pid, iter) => iter.map(map))
  }
// (2)
  def mapEdges[ED2: ClassTag](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2])
    : Graph[VD, ED2]

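ED2 is the edge attribute type after the transformation. Edge[ED] holds only the edge's attribute and the VertexIds of its two endpoints, not the endpoints' attributes. Overload (1) calls overload (2) internally; the map function of overload (2) takes an iterator over all the Edges in one partition as input and returns an iterator over the new edge attributes.

Example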

myGraph.edges.collect().foreach(println)
Edge(1,2,is-friends-with)
Edge(2,3,is-friends-with)
Edge(3,4,is-friends-with)
Edge(3,5,Wrote-status)
Edge(4,5,Likes-status)

myGraph.mapEdges(e => if (e.attr == "is-friends-with") 0 else 1).edges.collect().foreach(println)
Edge(1,2,0)
Edge(2,3,0)
Edge(3,4,0)
Edge(3,5,1)
Edge(4,5,1)
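
The partition-level overload (2) can also be called directly. The following is a minimal sketch, not taken from the article, that tags every edge attribute with the id of the partition holding it. The graph construction is an assumption reconstructed from the listings printed in this article, and the name tagged is illustrative. The lambda's parameter types are written out so the compiler picks overload (2), and the returned iterator yields exactly one value per input edge, in the same order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionID}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

// One output value per input edge, produced in iteration order.
val tagged: Graph[String, (PartitionID, String)] =
  myGraph.mapEdges[(PartitionID, String)](
    (pid: PartitionID, iter: Iterator[Edge[String]]) => iter.map(e => (pid, e.attr)))

tagged.edges.collect().foreach(println)
```

Which partition id each edge reports depends on how the edge RDD happens to be partitioned, so the printed ids vary between runs and cluster configurations.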

mapTriplets

  def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2] = {
    mapTriplets((pid, iter) => iter.map(map), TripletFields.All)
  }
  def mapTriplets[ED2: ClassTag](
      map: EdgeTriplet[VD, ED] => ED2,
      tripletFields: TripletFields): Graph[VD, ED2] = {
    mapTriplets((pid, iter) => iter.map(map), tripletFields)
  }

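According to the official documentation, mapTriplets "transforms each edge attribute using the map function, passing it the adjacent vertex attributes as well. If adjacent vertex values are not required, consider using mapEdges instead." Like mapEdges, mapTriplets converts the edge attributes of an existing graph; the difference is that an EdgeTriplet carries the attributes of the two adjacent vertices, whereas an Edge does not.

Example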

myGraph.mapTriplets(et => if (et.attr == "is-friends-with" && et.srcAttr == "a") 0 else 1).edges.collect().foreach(println)
Edge(1,2,0)
Edge(2,3,1)
Edge(3,4,1)
Edge(3,5,1)
Edge(4,5,1)
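
The second signature above takes a TripletFields argument that declares which triplet fields the map function reads. As a hedged sketch (the graph construction is an assumption reconstructed from the listings printed in this article, and the name labelled is illustrative), the call below reads only the source attribute and the edge attribute, so it passes TripletFields.Src:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, TripletFields}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

// Only srcAttr and attr are read, so dstAttr need not be shipped.
val labelled: Graph[String, String] = myGraph.mapTriplets(
  (et: EdgeTriplet[String, String]) => s"${et.srcAttr}:${et.attr}",
  TripletFields.Src)

labelled.edges.collect().foreach(println)
```

With TripletFields.Src the destination attribute should not be relied on inside the map function; a function that needs both endpoints would use TripletFields.All, which is what the single-argument overload passes.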

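It is worth noting that mapVertices applies its map function to one vertex at a time, whereas the mapEdges and mapTriplets overloads shown above delegate to a partition-level variant whose map function receives an iterator over all the edges of one partition. This follows from GraphX storing graphs with vertex-cut partitioning: edges are distributed across partitions, and a vertex may be replicated to every partition that holds one of its edges.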

joinVertices

Sometimes we want to merge extra attributes into a graph's vertex attributes. The joinVertices operation does this.

  def joinVertices[U: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD)
    : Graph[VD, ED] = {
    val uf = (id: VertexId, data: VD, o: Option[U]) => {
      o match {
        case Some(u) => mapFunc(id, data, u)
        case None => data
      }
    }
    graph.outerJoinVertices(table)(uf)
  }

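Here table holds the extra attributes, and mapFunc defines how an extra attribute is merged with a vertex's existing attribute. joinVertices returns a new graph with the merged vertex attributes; because mapFunc must return VD, the vertex attribute type cannot change. When mapFunc runs, each vertex's VertexId and attribute are matched with the extra attribute for that vertex. As the line case None => data shows, a vertex with no matching extra attribute keeps its original attribute.

Example

The graph's vertex attributes are vertex names (a, b, c, ...). The goal is to replace each vertex attribute with the vertex's out-degree.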

val outDegrees: VertexRDD[Int] = myGraph.aggregateMessages[Int](_.sendToSrc(1), _ + _)
myGraph.joinVertices(outDegrees)((_,_, d) => d.toString).vertices.collect().foreach(println)

(4,1)
(1,1)
(5,e)
(2,1)
(3,2)

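Vertices 1, 2, 3, and 4 were updated as required, but vertex 5 has no matching extra attribute (its out-degree is 0, so outDegrees has no record for it) and therefore keeps its original attribute. To set the attribute of zero-out-degree vertices to 0, there are two options:

Add records for the zero-out-degree vertices to outDegrees;
Use the outerJoinVertices operation.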
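
The first option is not spelled out in the article. One possible way to do it, sketched here under assumptions, is to left-join the graph's own vertex set with outDegrees using VertexRDD.leftJoin, so that every vertex gets a record and missing ones default to 0. The graph construction is reconstructed from the listings printed in this article, and the names allOutDegrees and degreeGraph are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

// Out-degrees as computed in the article; vertex 5 is absent here.
val outDegrees: VertexRDD[Int] =
  myGraph.aggregateMessages[Int](_.sendToSrc(1), _ + _)

// Give every vertex a record, defaulting to 0 where none exists.
val allOutDegrees: VertexRDD[Int] =
  myGraph.vertices.leftJoin(outDegrees)((_, _, d) => d.getOrElse(0))

// joinVertices now finds a match for every vertex, including vertex 5.
val degreeGraph: Graph[String, String] =
  myGraph.joinVertices(allOutDegrees)((_, _, d) => d.toString)

degreeGraph.vertices.collect().sortBy(_._1).foreach(println)
```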

outerJoinVertices

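The source code of joinVertices shows that joinVertices is a special case of outerJoinVertices.

Example

The graph's vertex attributes are vertex names (a, b, c, ...). The goal is to replace each vertex attribute with the vertex's out-degree.

// Method 1: keep the Option as the new vertex attribute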
val outDegrees: VertexRDD[Int] = myGraph.aggregateMessages[Int](_.sendToSrc(1), _ + _)
myGraph.outerJoinVertices(outDegrees)((_,_, d) => d).vertices.collect().foreach(println)

(4,Some(1))
(1,Some(1))
(5,None)
(2,Some(1))
(3,Some(2))
// Method 2: unwrap the Option, defaulting to 0
val outDegrees: VertexRDD[Int] = myGraph.aggregateMessages[Int](_.sendToSrc(1), _ + _)
myGraph.outerJoinVertices(outDegrees)((_,_, d) => d.getOrElse(0)).vertices.collect().foreach(println)

(4,1)
(1,1)
(5,0)
(2,1)
(3,2)

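With outerJoinVertices, the extra attribute matched to each vertex is passed to the map function as an Option, so getOrElse can supply a default value. This gives vertices with out-degree 0, which have no record in outDegrees, the correct value.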
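
Unlike joinVertices, the map function of outerJoinVertices may return a different vertex attribute type. A minimal sketch under assumptions (the graph construction is reconstructed from the listings printed in this article, and the name nameAndDegree is illustrative) keeps the vertex name and pairs it with the out-degree:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

val outDegrees: VertexRDD[Int] =
  myGraph.aggregateMessages[Int](_.sendToSrc(1), _ + _)

// Vertex type changes from String to (String, Int).
val nameAndDegree: Graph[(String, Int), String] =
  myGraph.outerJoinVertices(outDegrees)((_, name, d) => (name, d.getOrElse(0)))

nameAndDegree.vertices.collect().sortBy(_._1).foreach(println)
```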

aggregateMessages

aggregateMessages sends messages along each edge to its adjacent vertices and then merges the messages that each vertex receives.

  def aggregateMessages[A: ClassTag](
      sendMsg: EdgeContext[VD, ED, A] => Unit,
      mergeMsg: (A, A) => A,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[A] = {
    aggregateMessagesWithActiveSet(sendMsg, mergeMsg, tripletFields, None)
  }

Here A is the message type, and aggregateMessages returns a VertexRDD[A].

sendMsg

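The sendMsg function sends messages. It takes an EdgeContext as its parameter and returns nothing. EdgeContext resembles EdgeTriplet in that both expose srcId, dstId, srcAttr, dstAttr, and attr. In addition, the EdgeContext abstract class provides two methods for sending messages: sendToSrc sends a message of type A to the source vertex, and sendToDst sends a message of type A to the destination vertex. The toEdgeTriplet method converts an EdgeContext into an EdgeTriplet.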

/**
 * Represents an edge along with its neighboring vertices and allows sending messages along the
 * edge. Used in [[Graph#aggregateMessages]].
 */
abstract class EdgeContext[VD, ED, A] {
  /** The vertex id of the edge's source vertex. */
  def srcId: VertexId
  /** The vertex id of the edge's destination vertex. */
  def dstId: VertexId
  /** The vertex attribute of the edge's source vertex. */
  def srcAttr: VD
  /** The vertex attribute of the edge's destination vertex. */
  def dstAttr: VD
  /** The attribute associated with the edge. */
  def attr: ED

  /** Sends a message to the source vertex. */
  def sendToSrc(msg: A): Unit
  /** Sends a message to the destination vertex. */
  def sendToDst(msg: A): Unit

  /** Converts the edge and vertex properties into an [[EdgeTriplet]] for convenience. */
  def toEdgeTriplet: EdgeTriplet[VD, ED] = {
    val et = new EdgeTriplet[VD, ED]
    et.srcId = srcId
    et.srcAttr = srcAttr
    et.dstId = dstId
    et.dstAttr = dstAttr
    et.attr = attr
    et
  }
}

mergeMsg

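The mergeMsg function merges messages. When a vertex receives more than one message, the messages are combined pairwise with mergeMsg until a single value of type A remains, which becomes the result for that vertex. Because the order of combination is not fixed, mergeMsg should be commutative and associative.

Example

Count the out-degree of each vertex. Vertex 5 has no outgoing edges, so it receives no message and does not appear in the result.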

myGraph.aggregateMessages[Int](_.sendToSrc(1), _+_).collect().foreach(println)

(4,1)
(1,1)
(2,1)
(3,2)
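
Messages need not be counters. As a hedged sketch (the graph construction is an assumption reconstructed from the listings printed in this article, and the name inNeighbours is illustrative), the call below sends each source vertex's name to the destination vertex and merges the names with set union, which is commutative and associative. Only the source attribute is read, so TripletFields.Src is passed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-sketch"))

// Assumption: graph reconstructed from the printed vertices and edges.
val myGraph: Graph[String, String] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e"))),
  sc.parallelize(Seq(
    Edge(1L, 2L, "is-friends-with"), Edge(2L, 3L, "is-friends-with"),
    Edge(3L, 4L, "is-friends-with"), Edge(3L, 5L, "Wrote-status"),
    Edge(4L, 5L, "Likes-status"))))

// sendMsg: forward the source name to the destination; mergeMsg: set union.
val inNeighbours: VertexRDD[Set[String]] =
  myGraph.aggregateMessages[Set[String]](
    ctx => ctx.sendToDst(Set(ctx.srcAttr)),
    _ ++ _,
    TripletFields.Src)

inNeighbours.collect().sortBy(_._1).foreach(println)
```

Vertex 1 has no incoming edge, so it is missing from inNeighbours, just as vertex 5 is missing from the out-degree result above. Joining the result back with outerJoinVertices is one way to give such vertices a default value.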