Spark GraphX Tutorial 04 (join operators)

To demonstrate the graph join operators, we first define a graph:

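import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertices: (id, (name, job)). `sc` is the SparkContext, e.g. the one provided by spark-shell.
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array(
    (1L, ("a", "student")), (2L, ("b", "salesman")),
    (3L, ("c", "programmer")), (4L, ("d", "doctor")),
    (5L, ("e", "postman"))
  ))

// Edges: Edge(srcId, dstId, relationship)
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(1L, 2L, "customer"), Edge(3L, 2L, "customer"),
    Edge(3L, 4L, "patient"), Edge(5L, 4L, "patient"),
    Edge(3L, 4L, "friend"), Edge(5L, 99L, "father")))

// Attribute for vertices that appear only in an edge (here, id 99)
val defaultUser = ("f", "none")

val graph = Graph(users, relationships, defaultUser)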

This graph records each person's name and job. Next we give each person one more attribute: age.
To do that, we define an RDD that holds each person's age:

val userWithAge: RDD[(VertexId, Int)] =
  sc.parallelize(Array(
    (3L, 2), (4L, 19), (5L, 23), (6L, 42), (7L, 59)
  ))

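Here we define ages for five people with ids 3 through 7. Note that the users in the original graph have ids 1 through 5 (the graph also contains vertex 99, which is created by the edge 5 -> 99 and filled with defaultUser).
There are two ways to attach this age attribute to the people in the graph.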

outerJoinVertices

The first way is outerJoinVertices:

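graph.outerJoinVertices(userWithAge) { (id, attr, age) =>
  age match {
    // A matching age exists: append it to (name, job)
    case Some(a) => (attr._1, attr._2, a)
    // No age for this vertex: use -1 as a placeholder so both branches return the same type
    case None => (attr._1, attr._2, -1)
  }
}.vertices.collect.foreach(println)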

Output:

(1,(a,student,-1))
(2,(b,salesman,-1))
(3,(c,programmer,2))
(99,(f,none,-1))
(4,(d,doctor,19))
(5,(e,postman,23))

joinVertices

The second way is joinVertices:

graph.joinVertices(userWithAge) { (id, attr, age) =>
  (attr._1, attr._2 + "、" + age)
}.vertices.collect.foreach(println)

Output:

(1,(a,student))
(2,(b,salesman))
(3,(c,programmer、2))
(99,(f,none))
(4,(d,doctor、19))
(5,(e,postman、23))

In fact, reading the Spark GraphX source code shows that joinVertices is implemented by calling outerJoinVertices. The source is:

def joinVertices[U: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD)
    : Graph[VD, ED] = {
    val uf = (id: VertexId, data: VD, o: Option[U]) => {
      o match {
        case Some(u) => mapFunc(id, data, u)
        case None => data
      }
    }
    graph.outerJoinVertices(table)(uf)
  }
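
The same delegation can be mimicked on ordinary Scala collections, which may make the left-join behaviour easier to see. The sketch below is only a model: `JoinModel`, `outerJoin` and `join` are hypothetical names over plain `Map`s, not GraphX APIs, and the data is copied from the example in this post.

```scala
// A toy model of the two operators on plain Maps (not GraphX code).
// JoinModel, outerJoin and join are hypothetical names used only for illustration.
object JoinModel {
  def outerJoin[V, U, V2](vertices: Map[Long, V], table: Map[Long, U])(
      f: (Long, V, Option[U]) => V2): Map[Long, V2] =
    // Every vertex is kept; ids that exist only in `table` are ignored.
    vertices.map { case (id, attr) => id -> f(id, attr, table.get(id)) }

  def join[V, U](vertices: Map[Long, V], table: Map[Long, U])(
      f: (Long, V, U) => V): Map[Long, V] =
    // Same shape as joinVertices: delegate, and keep the attribute when nothing matches.
    outerJoin(vertices, table) { (id, attr, opt) =>
      opt match {
        case Some(u) => f(id, attr, u)
        case None    => attr
      }
    }

  val people: Map[Long, (String, String)] = Map(
    1L -> ("a", "student"), 2L -> ("b", "salesman"), 3L -> ("c", "programmer"),
    4L -> ("d", "doctor"), 5L -> ("e", "postman"), 99L -> ("f", "none"))

  val ages: Map[Long, Int] = Map(3L -> 2, 4L -> 19, 5L -> 23, 6L -> 42, 7L -> 59)

  // Only ids 3, 4 and 5 have an age; ids 6 and 7 never appear in the result.
  val joined: Map[Long, (String, String)] =
    join(people, ages) { (_, attr, age) => (attr._1, attr._2 + "、" + age) }

  def main(args: Array[String]): Unit =
    joined.toSeq.sortBy(_._1).foreach(println)
}
```

Because `join` hands unmatched vertices back unchanged, its function only ever sees vertices that do have a row in the table, which is why it can take a plain `U` instead of an `Option[U]`.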

outerJoinVertices and joinVertices

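  • Both operators behave like a left join in SQL.
    The example above shows this: the graph's vertex ids are 1 to 5 (plus 99), the age RDD's ids are 3 to 7, and the graph returned by either operator has exactly the same vertex ids as the original graph. Ids 6 and 7 are dropped, just as in a SQL left join.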

  • The graph returned by outerJoinVertices can have vertex attributes of any type, whereas the graph returned by joinVertices must keep the original vertex attribute type.
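
The type difference shows up directly in the static types of the results. The following is a minimal sketch, assuming Spark with GraphX is on the classpath and that running with a local master is acceptable. The object name `JoinTypes` and the reduced two-vertex graph are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical demo object; not part of the original tutorial code.
object JoinTypes {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode is fine for a demo; getOrCreate reuses a running context.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setAppName("join-types").setMaster("local[*]"))

    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Seq((3L, ("c", "programmer")), (4L, ("d", "doctor"))))
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(3L, 4L, "patient")))
    val graph: Graph[(String, String), String] = Graph(users, edges, ("f", "none"))
    val ages: RDD[(VertexId, Int)] = sc.parallelize(Seq((3L, 2), (6L, 42)))

    // outerJoinVertices: the vertex attribute type may change (here to a 3-tuple).
    val widened: Graph[(String, String, Int), String] =
      graph.outerJoinVertices(ages) { (_, attr, age) =>
        (attr._1, attr._2, age.getOrElse(-1))
      }

    // joinVertices: the function must return the original type (String, String).
    val sameType: Graph[(String, String), String] =
      graph.joinVertices(ages) { (_, attr, age) => (attr._1, attr._2 + "、" + age) }

    widened.vertices.collect.foreach(println)
    sameType.vertices.collect.foreach(println)
    sc.stop()
  }
}
```

Returning a 3-tuple from the `joinVertices` function would be rejected by the compiler, since its signature fixes the result type to the graph's existing vertex attribute type.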
