spark-graphx实现的pagerank源代码分为两个:
spark-1.6.\graphx\src\main\scala\org\apache\spark\graphx\Pregel.scala有一个还有一个在spark-1.6.\graphx\src\main\scala\org\apache\spark\graphx\lib\PageRank.scala中
该PageRank模型提供了两种调用方式:
*
第一种:(静态)在调用时提供一个参数number,用于指定迭代次数,即无论结果如何,该算法在迭代number次后停止计算,返回图结果。
* for a fixed number of iterations:
* {{{
* var PR = Array.fill(n)( 1.0 )
* val oldPR = Array.fill(n)( 1.0 )
* for( iter <- 0 until numIter ) {
* swap(oldPR, PR)
* for( i <- 0 until n ) {
* PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum
* }
* }
* }}}
*
第二种:(动态)在调用时提供一个参数tol,用于指定前后两次迭代的结果差值应小于tol,以达到最终收敛的效果时才停止计算,返回图结果。* {{{
先介绍第一种静态方法的实现源码:
“`
def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], numIter: Int,
resetProb: Double = 0.15): Graph[Double, Double] =
{
runWithOptions(graph, numIter, resetProb)
}
/**
* Run PageRank for a fixed number of iterations returning a graph
* with vertex attributes containing the PageRank and edge
* attributes the normalized edge weight.
*
* @tparam VD the original vertex attribute (not used)即顶点属性的类型
* @tparam ED the original edge attribute (not used)
*边的属性的类型
* @param graph the graph on which to compute PageRank
需要计算的图
* @param numIter the number of iterations of PageRank to run
控制pagerank迭代次数
* @param resetProb the random reset probability (alpha)
//用于初始化PageRank图模型,具体内容是赋予每个顶点属性为值1.0,赋予每条边属性为值“1.0/该边的出发顶点的出度数
var rankGraph: Graph[Double, Double] = graph
// 计算每个节点的出度,与原图的顶点进行连接,是原图的顶点属性为该点的出度
.outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) }
// 根据边的出度,计算边的权重TripletFields.Src该值为指定在使用聚合消息函数的时候,需要传播的为一个Triplets的src部分,其余不需要传播
.mapTriplets( e => 1.0 / e.srcAttr, TripletFields.Src )
//设置图的顶点的初始值
.mapVertices { (id, attr) =>
if (!(id != src && personalized)) resetProb else 0.0
}
//为persionliaed的时候用到的
def delta(u: VertexId, v: VertexId): Double = { if (u == v) 1.0 else 0.0 }
var iteration = 0
var prevRankGraph: Graph[Double, Double] = null
while (iteration < numIter) {
rankGraph.cache()
val rankUpdates = rankGraph.aggregateMessages[Double](
ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src)
prevRankGraph = rankGraph
val rPrb = if (personalized) {
(src: VertexId , id: VertexId) => resetProb * delta(src, id)
} else {
(src: VertexId, id: VertexId) => resetProb
}
rankGraph = rankGraph.joinVertices(rankUpdates) {
(id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
}.cache()
rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices
logInfo(s"PageRank finished iteration $iteration.")
prevRankGraph.vertices.unpersist(false)
prevRankGraph.edges.unpersist(fal```
e)
iteration += 1
}
rankGraph
}
“`虽然graphx实现了一种,但是在现实。。
需要根据自己的业务需要去实现自己的pagerank。概算发由于控制迭代次数,所以pagerank值可能会没有达到。/**
* 动态的pagerank
*
* @tparam VD the original vertex attribute (not used)
* @tparam ED the original edge attribute (not used)
*
* @param graph the graph on which to compute PageRank
* @收敛值
* @param resetProb the random reset probability (alpha)
*
* @return the graph containing with each vertex containing the PageRank and each edge
* containing the normalized weight.
*/
def runUntilConvergence[VD: ClassTag, ED: ClassTag](
graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15): Graph[Double, Double] =
{
runUntilConvergenceWithOptions(graph, tol, resetProb)
}
* @tparam VD the original vertex attribute (not used)
* @tparam ED the original edge attribute (not used)
*
* @param graph the graph on which to compute PageRank
* @param tol the tolerance allowed at convergence (smaller => more accurate).
* @param resetProb the random reset probability (alpha)
* @param srcId the source vertex for a Personalized Page Rank (optional)
*
* @return the graph containing with each vertex containing the PageRank and each edge
* containing the normalized weight.
*/
def runUntilConvergenceWithOptions[VD: ClassTag, ED: ClassTag](
graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15,
srcId: Option[VertexId] = None): Graph[Double, Double] =
{
val personalized = srcId.isDefined
val src: VertexId = srcId.getOrElse(-1L)
//同静态方法
val pagerankGraph: Graph[(Double, Double), Double] = graph
// Associate the degree with each vertex
.outerJoinVertices(graph.outDegrees) {
(vid, vdata, deg) => deg.getOrElse(0)
}
.mapTriplets( e => 1.0 / e.srcAttr )
.mapVertices { (id, attr) =>
if (id == src) (resetProb, Double.NegativeInfinity) else (0.0, 0.0)
}
.cache()
//图中每个节点的权重处理函数
def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
val (oldPR, lastDelta) = attr
val newPR = oldPR + (1.0 - resetProb) * msgSum
(newPR, newPR - oldPR)
}
def personalizedVertexProgram(id: VertexId, attr: (Double, Double),
msgSum: Double): (Double, Double) = {
val (oldPR, lastDelta) = attr
var teleport = oldPR
val delta = if (src==id) 1.0 else 0.0
teleport = oldPR*delta
val newPR = teleport + (1.0 - resetProb) * msgSum
val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR
(newPR, newDelta)
}
def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
if (edge.srcAttr._2 > tol) {
Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
} else {
Iterator.empty
}
}
def messageCombiner(a: Double, b: Double): Double = a + b
// 初始化消息
val initialMessage = if (personalized) 0.0 else resetProb / (1.0 - resetProb)
// Execute a dynamic version of Pregel.
val vp = if (personalized) {
(id: VertexId, attr: (Double, Double), msgSum: Double) =>
personalizedVertexProgram(id, attr, msgSum)
} else {
(id: VertexId, attr: (Double, Double), msgSum: Double) =>
vertexProgram(id, attr, msgSum)
}
Pregel(pagerankGraph, in
itialMessage, activeDirection = EdgeDirection.Out)(
vp, sendMessage, messageCombiner)
.mapVertices((vid, attr) => attr._1)
} 如果在应用中,需要使用边的权重而不是按照graphx的实现的方法,该怎么去实现呢,需要在初始化图的时候进行处理,后期会补上相关实现。