Spark Transformations and Actions

Transformation and action operators

groupByKey

groupByKey([numTasks]) is a grouping operation: called on a dataset of (K, V) pairs, it returns a dataset of (K, Iterable[V]) pairs, as the collect output below shows.

val rdd0 = sc.parallelize(Array((1,1), (1,2), (1,3), (2,1), (2,2), (2,3)), 3)

val rdd1 = rdd0.groupByKey()

rdd1.collect

res0: Array[(Int, Iterable[Int])] = Array((1,ArrayBuffer(1, 2, 3)), (2,ArrayBuffer(1, 2, 3)))
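
Because the signature takes an optional numTasks argument, the number of result partitions can also be set explicitly. A minimal spark-shell sketch, reusing the rdd0 defined above (the argument 2 is just an illustrative value):

val rdd1b = rdd0.groupByKey(2)

rdd1b.partitions.length

res: Int = 2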

union

union(otherDataset) merges datasets: it returns a new dataset formed by combining the source dataset with otherDataset.

val rdd1 = sc.parallelize(1 to 9, 3)

val rdd2 = rdd1.map(x => x * 2)

rdd2.collect

val rdd3 = rdd2.filter(x => x > 10)

rdd3.collect

val rdd4 = rdd1.union(rdd3)

rdd4.collect

res: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 14, 16, 18)
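
Note that union keeps duplicate elements rather than deduplicating them; the standard distinct transformation can be applied afterwards if set semantics are needed. A minimal spark-shell sketch, reusing rdd3 from above:

val rdd4b = rdd3.union(rdd3)

rdd4b.collect

res: Array[Int] = Array(12, 14, 16, 18, 12, 14, 16, 18)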

join

join(otherDataset, [numTasks]) is a join operation: it joins an input dataset of (K, V) pairs with another dataset of (K, W) pairs, producing a dataset of (K, (V, W)) pairs. For each key K, the operation takes the Cartesian product of the V and W collections, i.e., every combination of V and W.

val rdd0 = sc.parallelize(Array((1,1), (1,2), (1,3), (2,1), (2,2), (2,3)), 3)

val rdd5 = rdd0.join(rdd0)

rdd5.collect

res: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (1,(2,1)), (1,(2,2)), (1,(2,3)), (1,(3,1)), (1,(3,2)), (1,(3,3)), (2,(1,1)), (2,(1,2)), (2,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)), (2,(3,1)), (2,(3,2)), (2,(3,3)))
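
join is an inner join, so keys that appear in only one of the two datasets are dropped. A minimal spark-shell sketch; rddW is an illustrative dataset (its name and contents are assumptions for this example) that shares only key 1 with rdd0:

val rddW = sc.parallelize(Array((1,10), (3,30)), 3)

val rdd5b = rdd0.join(rddW)

rdd5b.collect

res: Array[(Int, (Int, Int))] = Array((1,(1,10)), (1,(2,10)), (1,(3,10)))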

cogroup

cogroup(otherDataset, [numTasks]) cogroups an input dataset of (K, V) pairs with another dataset of (K, W) pairs, producing a dataset of (K, (Iterable[V], Iterable[W])) tuples.

val rdd6 = rdd0.cogroup(rdd0)

rdd6.collect

res: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))), (2,(ArrayBuffer(1, 2, 3),ArrayBuffer(1, 2, 3))))
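
Unlike join, cogroup keeps keys that appear in only one of the datasets, pairing them with an empty collection on the other side. A minimal spark-shell sketch, reusing the illustrative rddW from the join example above; the exact buffer type and key ordering in the output depend on the Spark version and partitioning:

val rdd7 = rdd0.cogroup(rddW)

rdd7.collect

res: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(ArrayBuffer(1, 2, 3),ArrayBuffer(10))), (2,(ArrayBuffer(1, 2, 3),ArrayBuffer())), (3,(ArrayBuffer(),ArrayBuffer(30))))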
