更多代码请见:https://github.com/xubo245/SparkLearning
1解释
使用pregel函数求单源最短路径
GraphX中的单源点最短路径例子,使用的是类Pregel的方式。
核心部分是三个函数:
1.节点处理消息的函数 vprog: (VertexId, VD, A) => VD (节点id,节点属性,消息) => 节点属性
2.节点发送消息的函数 sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId,A)] (边元组) => Iterator[(目标节点id,消息)]
3.消息合并函数 mergeMsg: (A, A) => A) (消息,消息) => 消息
具体请参考【3】主要代码:
val sssp = initialGraph.pregel(Double.PositiveInfinity)( (id, dist, newDist) => math.min(dist, newDist), // Vertex Program triplet => { // Send Message if (triplet.srcAttr + triplet.attr < triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a, b) => math.min(a, b) // Merge Message )
源码:
/** * Execute a Pregel-like iterative vertex-parallel abstraction. The * user-defined vertex-program `vprog` is executed in parallel on * each vertex receiving any inbound messages and computing a new * value for the vertex. The `sendMsg` function is then invoked on * all out-edges and is used to compute an optional message to the * destination vertex. The `mergeMsg` function is a commutative * associative function used to combine messages destined to the * same vertex. * * On the first iteration all vertices receive the `initialMsg` and * on subsequent iterations if a vertex does not receive a message * then the vertex-program is not invoked. * * This function iterates until there are no remaining messages, or * for `maxIterations` iterations. * * @tparam A the Pregel message type * * @param initialMsg the message each vertex will receive at the on * the first iteration * * @param maxIterations the maximum number of iterations to run for * * @param activeDirection the direction of edges incident to a vertex that received a message in * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only * out-edges of vertices that received a message in the previous round will run. * * @param vprog the user-defined vertex program which runs on each * vertex and receives the inbound message and computes a new vertex * value. On the first iteration the vertex program is invoked on * all vertices and is passed the default message. On subsequent * iterations the vertex program is only invoked on those vertices * that receive messages. * * @param sendMsg a user supplied function that is applied to out * edges of vertices that received messages in the current * iteration * * @param mergeMsg a user supplied function that takes two incoming * messages of type A and merges them into a single message of type * A. ''This function must be commutative and associative and * ideally the size of A should not increase.'' * * @return the resulting graph at the end of the computation * */ def pregel[A: ClassTag]( initialMsg: A, maxIterations: Int = Int.MaxValue, activeDirection: EdgeDirection = EdgeDirection.Either)( vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], mergeMsg: (A, A) => A) : Graph[VD, ED] = { Pregel(graph, initialMsg, maxIterations, activeDirection)(vprog, sendMsg, mergeMsg) }
object Pregel extends Logging { /** * Execute a Pregel-like iterative vertex-parallel abstraction. The * user-defined vertex-program `vprog` is executed in parallel on * each vertex receiving any inbound messages and computing a new * value for the vertex. The `sendMsg` function is then invoked on * all out-edges and is used to compute an optional message to the * destination vertex. The `mergeMsg` function is a commutative * associative function used to combine messages destined to the * same vertex. * * On the first iteration all vertices receive the `initialMsg` and * on subsequent iterations if a vertex does not receive a message * then the vertex-program is not invoked. * * This function iterates until there are no remaining messages, or * for `maxIterations` iterations. * * @tparam VD the vertex data type * @tparam ED the edge data type * @tparam A the Pregel message type * * @param graph the input graph. * * @param initialMsg the message each vertex will receive at the first * iteration * * @param maxIterations the maximum number of iterations to run for * * @param activeDirection the direction of edges incident to a vertex that received a message in * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only * out-edges of vertices that received a message in the previous round will run. The default is * `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where * *both* vertices received a message. * * @param vprog the user-defined vertex program which runs on each * vertex and receives the inbound message and computes a new vertex * value. On the first iteration the vertex program is invoked on * all vertices and is passed the default message. On subsequent * iterations the vertex program is only invoked on those vertices * that receive messages. * * @param sendMsg a user supplied function that is applied to out * edges of vertices that received messages in the current * iteration * * @param mergeMsg a user supplied function that takes two incoming * messages of type A and merges them into a single message of type * A. ''This function must be commutative and associative and * ideally the size of A should not increase.'' * * @return the resulting graph at the end of the computation * */ def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] (graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue, activeDirection: EdgeDirection = EdgeDirection.Either) (vprog: (VertexId, VD, A) => VD, sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], mergeMsg: (A, A) => A) : Graph[VD, ED] = { var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() // compute the messages var messages = g.mapReduceTriplets(sendMsg, mergeMsg) var activeMessages = messages.count() // Loop var prevG: Graph[VD, ED] = null var i = 0 while (activeMessages > 0 && i < maxIterations) { // Receive the messages and update the vertices. prevG = g g = g.joinVertices(messages)(vprog).cache() val oldMessages = messages // Send new messages, skipping edges where neither side received a message. We must cache // messages so it can be materialized on the next line, allowing us to uncache the previous // iteration. messages = g.mapReduceTriplets( sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache() // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages // and the vertices of g). activeMessages = messages.count() logInfo("Pregel finished iteration " + i) // Unpersist the RDDs hidden by newly-materialized RDDs oldMessages.unpersist(blocking = false) prevG.unpersistVertices(blocking = false) prevG.edges.unpersist(blocking = false) // count the iteration i += 1 } g } // end of apply
2.代码:
/** * @author xubo * ref http://spark.apache.org/docs/1.5.2/graphx-programming-guide.html * time 20160503 */ package org.apache.spark.graphx.learning import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.graphx.Graph import org.apache.spark.graphx.Graph.graphToGraphOps import org.apache.spark.graphx.VertexId import org.apache.spark.graphx.util.GraphGenerators object Pregeloperator { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("CollectingNeighbors").setMaster("local[4]") // Assume the SparkContext has already been constructed val sc = new SparkContext(conf) // A graph with edge attributes containing distances val graph: Graph[Long, Double] = GraphGenerators.logNormalGraph(sc, numVertices = 5).mapEdges(e => e.attr.toDouble) val sourceId: VertexId = 2 // The ultimate source // Initialize the graph such that all vertices except the root have distance infinity. println("graph:"); println("vertices:"); graph.vertices.collect.foreach(println) println("edges:"); graph.edges.collect.foreach(println) println(); val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity) println("initialGraph:"); println("vertices:"); initialGraph.vertices.collect.foreach(println) println("edges:"); initialGraph.edges.collect.foreach(println) val sssp = initialGraph.pregel(Double.PositiveInfinity)( (id, dist, newDist) => math.min(dist, newDist), // Vertex Program triplet => { // Send Message if (triplet.srcAttr + triplet.attr < triplet.dstAttr) { Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) } else { Iterator.empty } }, (a, b) => math.min(a, b) // Merge Message ) println(); println("sssp:"); println("vertices:"); println(sssp.vertices.collect.mkString("\n")) println("edges:"); sssp.edges.collect.foreach(println) } }
3.结果:
graph: vertices: (4,3) (0,3) (1,2) (2,3) (3,4) edges: Edge(0,0,1.0) Edge(0,0,1.0) Edge(0,4,1.0) Edge(1,1,1.0) Edge(1,3,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(3,1,1.0) Edge(3,2,1.0) Edge(3,2,1.0) Edge(3,4,1.0) Edge(4,0,1.0) Edge(4,2,1.0) Edge(4,4,1.0) initialGraph: vertices: (4,Infinity) (0,Infinity) (1,Infinity) (2,0.0) (3,Infinity) edges: Edge(0,0,1.0) Edge(0,0,1.0) Edge(0,4,1.0) Edge(1,1,1.0) Edge(1,3,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(3,1,1.0) Edge(3,2,1.0) Edge(3,2,1.0) Edge(3,4,1.0) Edge(4,0,1.0) Edge(4,2,1.0) Edge(4,4,1.0) 2016-05-04 14:43:01 WARN BlockManager:71 - Block rdd_23_1 already exists on this machine; not re-adding it sssp: vertices: (4,3.0) (0,4.0) (1,1.0) (2,0.0) (3,2.0) edges: Edge(0,0,1.0) Edge(0,0,1.0) Edge(0,4,1.0) Edge(1,1,1.0) Edge(1,3,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(2,1,1.0) Edge(3,1,1.0) Edge(3,2,1.0) Edge(3,2,1.0) Edge(3,4,1.0) Edge(4,0,1.0) Edge(4,2,1.0) Edge(4,4,1.0)
分析:
由上诉结果画图可得:
黑色部分为初始化图Graph的点和边,initGraph会将除了第二个节点外的所有节点的值初始化为无穷大,自己设为0,然后从0开始pregel处理。红色部分为实际求单源最短路径可能的路线,所以节点2到节点1为1,到3为2,到4为3,到0为4
参考
【1】 http://spark.apache.org/docs/1.5.2/graphx-programming-guide.html
【2】https://github.com/xubo245/SparkLearning