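Contents
1. Pregel API:
2. Code implementation:
Using Pregel to find the minimum cost from the source vertex to every vertex
Using Pregel to find the maximum depth from the source vertex to every vertex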
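A graph is an inherently recursive data structure, because a vertex's properties may depend on those of its neighbors, which in turn depend on the properties of their own neighbors. Many important graph algorithms therefore recompute the properties of each vertex iteratively until a stable state (a fixed point) is reached.
The Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. It executes in a series of supersteps in which vertices receive the sum of their inbound messages from the previous superstep (the "sum" being whatever mergeMsg computes), compute a new value for the vertex property, and then send messages to neighboring vertices in the next superstep. Messages are computed in parallel as a function of the edge triplet, and the message computation can use the attributes of both the source and the destination vertex. Vertices that receive no message are skipped within a superstep. Iteration stops when no messages remain, and the final graph is returned.
Definition of pregel: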
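def pregel[A]
    (initialMsg: A, // message that every vertex receives in the first iteration
     maxIter: Int = Int.MaxValue, // maximum number of iterations
     activeDir: EdgeDirection = EdgeDirection.Out
    )(
     // Vertex program. Runs on a vertex and computes its new attribute from the
     // vertex ID, the current attribute and the merged inbound message. In the
     // first iteration the inbound message is initialMsg and every vertex runs
     // this function once; afterwards only vertices that received a message in
     // the previous iteration run it.
     vprog: (VertexId, VD, A) => VD,
     // Applied to the out edges of vertices that received a message in the
     // current iteration; returns the messages to send along that edge.
     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
     // Combines two messages addressed to the same vertex into one.
     mergeMsg: (A, A) => A
    ): Graph[VD, ED]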
Rough outline of the implementation:
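// Step 1: run vprog once on every vertex with initialMsg,
// so every vertex attribute goes through one update.
var g = mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
// Compute the first round of messages.
var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
var activeMessages = messages.count()
var i = 0
// Loop until no messages remain or the iteration limit is reached.
while (activeMessages > 0 && i < maxIter) {
  // Run vprog only on the vertices that received a message.
  g = g.joinVertices(messages)(vprog).cache()
  val oldMessages = messages
  // Send new messages, skipping edges whose endpoints (in direction
  // activeDir) did not receive a message in this round.
  messages = g.mapReduceTriplets(
    sendMsg,
    mergeMsg,
    Some((oldMessages, activeDir))
  ).cache()
  activeMessages = messages.count()
  i += 1
}
g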
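The loop above can be mimicked on plain Scala collections, which makes the superstep semantics easier to follow without a Spark cluster. This is only a sketch, not GraphX code: `MiniPregel`, `MiniEdge`, `run` and `MiniPregelDemo` are hypothetical names, `sendMsg` receives the edge plus the source and destination attributes instead of an `EdgeTriplet`, and as a simplifying assumption only edges whose source vertex just received a message may send.

```scala
// Plain-Scala sketch of the Pregel loop. NOT the GraphX implementation;
// all names here are hypothetical and exist only for illustration.
object MiniPregel {
  final case class MiniEdge[ED](src: Long, dst: Long, attr: ED)

  def run[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[MiniEdge[ED]],
      initialMsg: A,
      maxIter: Int = Int.MaxValue)(
      vprog: (Long, VD, A) => VD,
      sendMsg: (MiniEdge[ED], VD, VD) => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {

    // Compute the messages sent along edges whose source is in `active`,
    // then merge all messages addressed to the same vertex.
    def collect(g: Map[Long, VD], active: Set[Long]): Map[Long, A] =
      edges.iterator
        .filter(e => active.contains(e.src))
        .flatMap(e => sendMsg(e, g(e.src), g(e.dst)))
        .toList
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

    // Superstep 0: every vertex runs vprog on the initial message.
    var g = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    var messages = collect(g, g.keySet)
    var i = 0
    while (messages.nonEmpty && i < maxIter) {
      // Only vertices that received a message run vprog.
      g = g.map { case (id, attr) =>
        id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
      }
      // Only edges leaving those vertices may send in the next superstep.
      messages = collect(g, messages.keySet)
      i += 1
    }
    g
  }
}

// Usage: shortest distances from vertex 1 on a three-vertex toy graph.
object MiniPregelDemo {
  import MiniPregel._

  val vertices: Map[Long, Double] =
    Map(1L -> 0.0, 2L -> Double.PositiveInfinity, 3L -> Double.PositiveInfinity)
  val edges: Seq[MiniEdge[Double]] =
    Seq(MiniEdge(1L, 2L, 1.0), MiniEdge(2L, 3L, 2.0), MiniEdge(1L, 3L, 5.0))

  val dist: Map[Long, Double] = run(vertices, edges, Double.PositiveInfinity)(
    (_, d, m) => math.min(d, m),
    (e, s, t) => if (s + e.attr < t) Iterator(e.dst -> (s + e.attr)) else Iterator.empty,
    (a, b) => math.min(a, b))
}
```

In this toy run vertex 3 first receives 5.0 over the direct edge and is then improved to 3.0 via vertex 2 in the following superstep, after which no messages remain and the loop ends.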
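An example of using Pregel: associate the graph with some initial scores, then have each vertex spread its score outward in proportion to its out-degree while keeping one share for itself: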
// Attach each vertex's out-degree to its attribute
val graphWithDegree = graph.outerJoinVertices(graph.outDegrees) {
  case (vid, name, deg) => (name, deg match {
    case Some(d) => d.toDouble
    case None => 1.25 // vertices without out edges are absent from outDegrees
  })
}
// Join the graph with the initial scores
val graphWithScoreAndDegree = graphWithDegree.outerJoinVertices(scoreRDD) {
  case (vid, (name, deg), score) => (name, deg, score.getOrElse(0.0))
}
graphWithScoreAndDegree.vertices.foreach(x => println("++++++++++++id:" + x._1 + "; deg: " + x._2._2 + "; score:" + x._2._3))
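First step of the algorithm: 0.0 (the initial message, initialMsg) is added to the value of each vertex, which leaves the value unchanged, and the result is then divided by the vertex's out-degree. This step is important and must not be overlooked: when designing a vertex program, always check whether the final result is affected by this initial pass.
Source of this explanation: https://www.jianshu.com/p/d9170a0723e4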
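The effect of that first step can be seen in isolation with a few lines of plain Scala. The actual pregel call of the score example is not shown here, so the vertex program below is only an assumption reconstructed from the description (add the message to the score, then divide by the out-degree); `FirstStepDemo`, `Attr` and `vprog` are hypothetical names.

```scala
object FirstStepDemo {
  // (name, outDegree, score), mirroring the vertex attribute built above.
  type Attr = (String, Double, Double)

  // Hypothetical vertex program matching the description: add the incoming
  // message to the score, then divide by the out-degree.
  def vprog(id: Long, attr: Attr, msg: Double): Attr = {
    val (name, deg, score) = attr
    (name, deg, (score + msg) / deg)
  }

  val initialMsg = 0.0
  val before: Attr = ("a", 4.0, 2.0)
  // Superstep 0 runs vprog on every vertex with initialMsg, so the score
  // is already divided by the out-degree before any real message arrives.
  val after: Attr = vprog(1L, before, initialMsg)
}
```

Even though the initial message is 0.0, the score of this sample vertex drops from 2.0 to 0.5, which is why the initial pass has to be taken into account when choosing initialMsg and writing vprog.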
Using Pregel to find the minimum cost from the source vertex to every vertex:
package homeWork
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object MapGraphX5 {
  def main(args: Array[String]): Unit = {
    // Set up the runtime environment
    val conf = new SparkConf().setAppName("Pregel API GraphX").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // Build the graph
    val myVertices = sc.parallelize(Array((1L, 0), (2L, 0), (3L, 0), (4L, 0), (5L, 0)))
    val myEdges = sc.makeRDD(Array(
      Edge(1L, 2L, 2.5), Edge(2L, 3L, 3.6), Edge(3L, 4L, 4.5),
      Edge(4L, 5L, 0.1), Edge(3L, 5L, 5.2)))
    val myGraph = Graph(myVertices, myEdges)
    // Choose the source vertex
    val sourceId: VertexId = 1L
    // Initialize the vertex attributes: 0.0 for the source vertex,
    // positive infinity for every other vertex
    val initialGraph = myGraph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    /*
    Signature of pregel as shown for the compiled class:
    def pregel[A](
      initialMsg: A,
      maxIterations: Int = ...,
      activeDirection: EdgeDirection = ...
    )(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A
    )(implicit evidence: ClassTag[A]): Graph[VD, ED]
    */
    val sssp: Graph[Double, Double] = initialGraph.pregel(
      // initialMsg
      Double.PositiveInfinity
      // maxIterations and activeDirection keep their default values
    )(
      // vprog: keep the smaller of the current distance and the incoming one
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: look for a cheaper path from vertex 1L through this edge
      triplet => {
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          // The source's distance plus the edge cost is smaller than the
          // destination's current distance, so send the sum to the destination
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      // mergeMsg: of all messages arriving at a vertex, keep the minimum
      (a, b) => math.min(a, b)
    )
    sssp.vertices.collect.foreach(println(_))
  }
}
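To know what the program above should print, the same edge list can be cross-checked without Spark. The sketch below uses a plain Bellman-Ford relaxation rather than Pregel; `SsspCheck` and `shortest` are hypothetical names introduced only for this check.

```scala
object SsspCheck {
  // Same edges as in the GraphX program above: (src, dst, cost).
  val edges: Seq[(Long, Long, Double)] = Seq(
    (1L, 2L, 2.5), (2L, 3L, 3.6), (3L, 4L, 4.5), (4L, 5L, 0.1), (3L, 5L, 5.2))

  def shortest(source: Long): Map[Long, Double] = {
    val ids = edges.flatMap(e => Seq(e._1, e._2)).distinct
    var dist: Map[Long, Double] =
      ids.map(id => id -> (if (id == source) 0.0 else Double.PositiveInfinity)).toMap
    // Bellman-Ford: |V| - 1 rounds of relaxing every edge are enough.
    for (_ <- 1 until ids.size; (s, d, w) <- edges)
      if (dist(s) + w < dist(d)) dist = dist.updated(d, dist(s) + w)
    dist
  }
}
```

Up to floating-point rounding, `SsspCheck.shortest(1L)` gives 0.0, 2.5, 6.1, 10.6 and 10.7 for vertices 1 to 5. Vertex 5 is reached more cheaply through vertex 4 (10.6 + 0.1) than over the direct edge from vertex 3 (6.1 + 5.2), and the Pregel version should report the same distances.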
Using Pregel to find the maximum depth from the source vertex to every vertex:
package pregel
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object Demo2 {
  def main(args: Array[String]): Unit = {
    // Set up the runtime environment
    val conf = new SparkConf().setAppName("Pregel API GraphX").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // Build the graph
    val myVertices = sc.parallelize(Array((1L, "Zhang San"), (2L, "Li Si"), (3L, "Wang Wu"),
      (4L, "Qian Liu"), (5L, "Leader")))
    val myEdges = sc.makeRDD(Array(
      Edge(1L, 2L, "friend"), Edge(2L, 3L, "friend"), Edge(3L, 4L, "friend"),
      Edge(4L, 5L, "supervisor"), Edge(3L, 5L, "supervisor")))
    val myGraph = Graph(myVertices, myEdges)
    // Replace every vertex attribute with an initial depth of 0
    val g = myGraph.mapVertices((vid, vd) => 0)
    val newGraph: Graph[Int, String] = g.pregel(0)(
      // vprog: adopt the incoming depth. A message is only sent when it is
      // larger than the destination's current depth, so this never lowers it.
      (id, attr, maxValue) => maxValue,
      // sendMsg: offer the destination a depth one greater than the source's
      triplet => {
        if (triplet.srcAttr + 1 > triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + 1))
        } else {
          Iterator.empty
        }
      },
      // mergeMsg: of all depths arriving at a vertex, keep the maximum
      (a: Int, b: Int) => math.max(a, b)
    )
    newGraph.vertices.collect.foreach(println(_))
  }
}
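The expected depths for the program above can be worked out without Spark. The following plain-Scala sketch computes, for each vertex, the largest number of hops on any path that ends there; `DepthCheck` and `depth` are hypothetical names, and the recursion assumes the graph has no cycles.

```scala
object DepthCheck {
  // Same edges as in the GraphX program above, labels dropped: (src, dst).
  val edges: Seq[(Long, Long)] =
    Seq((1L, 2L), (2L, 3L), (3L, 4L), (4L, 5L), (3L, 5L))

  // Largest number of hops on any path ending at `id`.
  // Assumes an acyclic graph; a cycle would make this recurse forever.
  def depth(id: Long): Int = {
    val parents = edges.collect { case (s, d) if d == id => s }
    if (parents.isEmpty) 0 else parents.map(depth).max + 1
  }
}
```

For this graph the depths of vertices 1 to 5 are 0, 1, 2, 3 and 4: vertex 5 takes the longer route through vertex 4 rather than the direct edge from vertex 3. The same caveat about cycles applies to the Pregel version: on a cyclic graph each pass around the cycle raises the depth again, so messages never run out and it would be safer to pass an explicit maxIterations.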