Spark GraphX: A Brief Introduction

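    Spark GraphX is a distributed graph-processing framework built on Spark. In a social network, users are linked by a dense web of relationships: the friendships and follows among WeChat, QQ, and Weibo users, for example, form one enormous graph. A graph of that size is impractical to process on a single machine, so it calls for a distributed graph-processing framework such as Spark GraphX.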

    A simple example illustrates the basics:

1. POM file

    Add the Spark GraphX dependency to the project's pom file:


    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-graphx_2.10</artifactId>
        <version>1.6.0</version>
    </dependency>
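    For readers who build with sbt instead of Maven, the same artifact could be declared roughly as follows. This is only a sketch: the scalaVersion value is an assumption chosen to match the _2.10 suffix of the artifact, and the original project uses Maven.

```scala
// build.sbt (sketch, not part of the original project)
// Assumption: any Scala 2.10.x release matches the _2.10 artifact suffix.
scalaVersion := "2.10.6"

// %% appends the Scala binary version, so this resolves to spark-graphx_2.10
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.0"
```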

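2. Setting up the runtime environment

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Set up the runtime environment: the cluster master URL and the application jar to ship to the workers
val conf = new SparkConf().setAppName("Simple GraphX").setMaster("spark://master:7077").setJars(Seq("E:\\Intellij\\Projects\\SimpleGraphX\\SimpleGraphX.jar"))
val sc = new SparkContext(conf)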

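3. Building the graph

    A graph consists of vertices and edges, and a Spark GraphX graph is no different, so before creating the graph we first define its vertices and edges: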

// Vertices: (id, (name, age))
val vertexArray = Array(    
  (1L,("Alice", 38)),       
  (2L,("Henry", 27)),       
  (3L,("Charlie", 55)),     
  (4L,("Peter", 32)),       
  (5L,("Mike", 35)),        
  (6L,("Kate", 23))         
)                           

                          
// Edges: Edge(srcId, dstId, attr)
val edgeArray = Array(      
  Edge(2L, 1L, 5),          
  Edge(2L, 4L, 2),          
  Edge(3L, 2L, 7),          
  Edge(3L, 6L, 3),          
  Edge(4L, 1L, 1),          
  Edge(5L, 2L, 3),          
  Edge(5L, 3L, 8),          
  Edge(5L, 6L, 8)           
)                           

    Next, turn the vertices and the edges into their own RDDs:

// Build vertexRDD and edgeRDD
val vertexRDD:RDD[(Long,(String,Int))] = sc.parallelize(vertexArray)
val edgeRDD:RDD[Edge[Int]] = sc.parallelize(edgeArray)

    Finally, build the graph from the two RDDs:

// Build the graph
val graph:Graph[(String,Int),Int] = Graph(vertexRDD, edgeRDD)
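
    A quick way to confirm the graph was built as intended is to count its vertices and edges. The sketch below repeats the construction steps in one self-contained program. It assumes a local master ("local[*]") instead of the cluster URL used earlier, and the names GraphSizeCheck and buildGraph are made up for this illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the original project.
object GraphSizeCheck {

  // Builds the same six-vertex, eight-edge graph as the walkthrough.
  def buildGraph(sc: SparkContext): Graph[(String, Int), Int] = {
    val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(Seq(
      (1L, ("Alice", 38)), (2L, ("Henry", 27)), (3L, ("Charlie", 55)),
      (4L, ("Peter", 32)), (5L, ("Mike", 35)), (6L, ("Kate", 23))
    ))
    val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(2L, 1L, 5), Edge(2L, 4L, 2), Edge(3L, 2L, 7), Edge(3L, 6L, 3),
      Edge(4L, 1L, 1), Edge(5L, 2L, 3), Edge(5L, 3L, 8), Edge(5L, 6L, 8)
    ))
    Graph(vertexRDD, edgeRDD)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; swap in a cluster URL as needed.
    val conf = new SparkConf().setAppName("GraphSizeCheck").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      // numVertices and numEdges are provided by GraphOps and return Long
      println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")
    } finally {
      sc.stop() // release the context even if the job fails
    }
  }
}
```

    With the data above, the counts should come out as 6 vertices and 8 edges. A larger vertex count would suggest that an edge refers to an id missing from the vertex list, because Graph(...) creates such vertices with a default attribute.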

 

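4. Working with graph properties

A Spark GraphX graph exposes the following properties:

    • Graph.vertices: all vertices in the graph;

    • Graph.edges: all edges in the graph;

    • Graph.triplets: one triplet per edge, bundling the edge with its source vertex and its destination vertex;

    • Graph.degrees: the degree of each vertex;

    • Graph.inDegrees: the in-degree of each vertex;

    • Graph.outDegrees: the out-degree of each vertex.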
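
    To make the triplet view concrete, here is a plain-Scala sketch (no Spark involved) that joins each edge from section 3 with the attributes of its two endpoints. The Triplet case class and the TripletSketch object are hypothetical stand-ins for illustration; GraphX's own type is EdgeTriplet, and its implementation is distributed rather than a simple map lookup.

```scala
// Plain Scala collections only; a conceptual model, not GraphX's implementation.
object TripletSketch {

  // Hypothetical stand-in for GraphX's EdgeTriplet
  case class Triplet(srcId: Long, srcAttr: (String, Int),
                     dstId: Long, dstAttr: (String, Int),
                     attr: Int)

  // Vertex id -> (name, age), copied from the vertex array in section 3
  val vertices: Map[Long, (String, Int)] = Map(
    1L -> (("Alice", 38)), 2L -> (("Henry", 27)), 3L -> (("Charlie", 55)),
    4L -> (("Peter", 32)), 5L -> (("Mike", 35)), 6L -> (("Kate", 23))
  )

  // (srcId, dstId, attr), copied from the edge array in section 3
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 5), (2L, 4L, 2), (3L, 2L, 7), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 3), (5L, 3L, 8), (5L, 6L, 8)
  )

  // One triplet per edge: the edge plus the attributes of both endpoints
  val triplets: Seq[Triplet] = edges.map { case (src, dst, attr) =>
    Triplet(src, vertices(src), dst, vertices(dst), attr)
  }

  def main(args: Array[String]): Unit =
    triplets.foreach { t =>
      println(s"${t.srcAttr._1} likes ${t.dstAttr._1} (edge attribute ${t.attr})")
    }
}
```

    Each triplet therefore carries everything needed to describe one relationship, which is why it is convenient for queries that mix vertex and edge attributes.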

    The code below exercises each of these properties:


// Graph property operations
println("*************************************************************")
println("Property demo")
println("*************************************************************")

// Approach 1: pattern-match on the (id, (name, age)) tuple
println("Vertices older than 30, approach 1:")
graph.vertices.filter{case(id,(name,age)) => age>30}.collect.foreach {
  case(id,(name,age)) => println(s"$name is $age")
}

// Approach 2: positional tuple accessors
println("Vertices older than 30, approach 2:")
graph.vertices.filter(v => v._2._2>30).collect.foreach {
  v => println(s"${v._2._1} is ${v._2._2}")
}

// Edge operations
println("Edges with attribute greater than 5:")
graph.edges.filter(e => e.attr>5).collect.foreach(e => println(s"${e.srcId} to ${e.dstId} att ${e.attr}"))
println

// Triplet operations
println("All triplets:")
for(triplet <- graph.triplets.collect){
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
println

println("Triplets with edge attribute > 5:")
for(triplet <- graph.triplets.filter(t => t.attr > 5).collect){
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
println

// Degree operations
println("Maximum out-degree, in-degree, and degree:")
// Keep the (vertexId, degree) pair with the larger degree; on a tie, b is returned
def max(a:(VertexId,Int), b:(VertexId,Int)):(VertexId,Int) = {
  if (a._2>b._2) a else b
}
println("Max of OutDegrees:" + graph.outDegrees.reduce(max))
println("Max of InDegrees:" + graph.inDegrees.reduce(max))
println("Max of Degrees:" + graph.degrees.reduce(max))
println

    Output:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/05/22 20:45:35 INFO Slf4jLogger: Slf4jLogger started
17/05/22 20:45:35 INFO Remoting: Starting remoting
17/05/22 20:45:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:53375]
*************************************************************
Property demo
*************************************************************
Vertices older than 30, approach 1:
Peter is 32
Alice is 38
Charlie is 55
Mike is 35
Vertices older than 30, approach 2:
Peter is 32
Alice is 38
Charlie is 55
Mike is 35
Edges with attribute greater than 5:
3 to 2 att 7
5 to 3 att 8
5 to 6 att 8

All triplets:
Henry likes Alice
Henry likes Peter
Charlie likes Henry
Charlie likes Kate
Peter likes Alice
Mike likes Henry
Mike likes Charlie
Mike likes Kate

Triplets with edge attribute > 5:
Charlie likes Henry
Mike likes Charlie
Mike likes Kate

Maximum out-degree, in-degree, and degree:
Max of OutDegrees:(5,3)
Max of InDegrees:(1,2)
Max of Degrees:(2,4)
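
    The degree maxima printed above can be cross-checked by hand. The following plain-Scala sketch (no Spark; the DegreeCheck object and its helper names are made up for this illustration) counts how often each vertex id appears as a source, as a destination, and in either role, using the edge list from section 3.

```scala
// Plain Scala collections, no Spark required.
object DegreeCheck {

  // (srcId, dstId, attr), copied from the edge array in section 3
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 5), (2L, 4L, 2), (3L, 2L, 7), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 3), (5L, 3L, 8), (5L, 6L, 8)
  )

  // Count how often each vertex id occurs in a list of ids.
  def counts(ids: Seq[Long]): Map[Long, Int] =
    ids.groupBy(identity).map { case (id, occurrences) => id -> occurrences.size }

  val outDegrees: Map[Long, Int] = counts(edges.map(_._1))
  val inDegrees: Map[Long, Int] = counts(edges.map(_._2))
  val degrees: Map[Long, Int] = counts(edges.flatMap(e => Seq(e._1, e._2)))

  // All vertex ids that share the maximum value (sorted), plus that value.
  def argMax(m: Map[Long, Int]): (List[Long], Int) = {
    val best = m.values.max
    (m.filter(_._2 == best).keys.toList.sorted, best)
  }

  def main(args: Array[String]): Unit = {
    def show(label: String, m: Map[Long, Int]): Unit = {
      val (ids, best) = argMax(m)
      println(s"$label: degree $best at vertex id(s) ${ids.mkString(", ")}")
    }
    show("max out-degree", outDegrees)
    show("max in-degree", inDegrees)
    show("max degree", degrees)
  }
}
```

    The maximum out-degree and the maximum degree each belong to a single vertex (5 and 2 respectively), matching the run above. The maximum in-degree of 2, however, is shared by vertices 1, 2, and 6. Because the max function in the walkthrough returns its second argument on a tie, which of those ids reduce reports can depend on the order in which pairs are combined, so a different run might print a different id with the same in-degree.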

 
