Using JOIN with RDDs

Using JOIN in Spark Core

1. inner join

An inner join returns only the records whose keys match on both sides.

>>> data2 = sc.parallelize(range(6,15)).map(lambda line:(line,1))
>>> data2.collect()
[(6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]   
>>> data1 = sc.parallelize(range(10)).map(lambda line:(line,1))
>>> data1.collect()
[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
>>> data1.join(data2)
PythonRDD[14] at RDD at PythonRDD.scala:43

Note that join is a transformation and therefore lazy: the line above only builds a new RDD. Nothing is computed until an action such as collect() runs:

>>> data1.join(data2).collect()
[(6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]  
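
One behavior the example above does not show: if a key appears more than once on either side, join emits every pairing of the values for that key (a per-key Cartesian product). A minimal sketch in the same pyspark shell, using two made-up RDDs a and b (sc is the shell's SparkContext, as above):

>>> a = sc.parallelize([(1, "x"), (1, "y"), (2, "z")])
>>> b = sc.parallelize([(1, "p"), (1, "q")])
>>> sorted(a.join(b).collect())
[(1, ('x', 'p')), (1, ('x', 'q')), (1, ('y', 'p')), (1, ('y', 'q'))]

Key 2 has no match in b and is dropped; key 1 expands into 2 × 2 = 4 result records. sorted() is used because the output order of a join is not guaranteed.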

2. left outer join

left: the left side is the basis; the result keeps every key of the left RDD.

Every record from the left RDD (data1) appears in the result. Where the right RDD (data2) has a matching key, that value is returned; where it does not, the slot is filled with None. (The Some(x)/None wrapping is the Scala API's behavior; in PySpark you get the bare value or None, as shown below.)

>>> data1.leftOuterJoin(data2).collect()
[(0, (1, None)), (1, (1, None)), (2, (1, None)), (3, (1, None)), (4, (1, None)), (5, (1, None)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]
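
Because unmatched keys come back with None on the right, a common next step is to substitute a default. A minimal sketch, where the default 0 is just an illustrative choice (output order as in the run above):

>>> data1.leftOuterJoin(data2).mapValues(lambda v: (v[0], v[1] if v[1] is not None else 0)).collect()
[(0, (1, 0)), (1, (1, 0)), (2, (1, 0)), (3, (1, 0)), (4, (1, 0)), (5, (1, 0)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1))]

mapValues applies the function to the (left, right) value pair while leaving the key untouched.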

3. right outer join

right: the right side is the basis; the result keeps every key of the right RDD.

>>> data1.rightOuterJoin(data2).collect()
[(6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1)), (10, (None, 1)), (11, (None, 1)), (12, (None, 1)), (13, (None, 1)), (14, (None, 1))]

Every record from the right RDD (data2) appears in the result. Where the left RDD (data1) has a matching key, that value is returned; where it does not, the slot is filled with None.
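
One practical use of that None slot: filtering on it turns a right outer join into an anti-join, i.e. it finds the keys that exist only in data2. A minimal sketch:

>>> sorted(data1.rightOuterJoin(data2).filter(lambda kv: kv[1][0] is None).keys().collect())
[10, 11, 12, 13, 14]

PySpark also offers subtractByKey for this pattern: data2.subtractByKey(data1) returns the (key, value) pairs of data2 whose keys are absent from data1.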

4. full outer join

A full outer join keeps every key from both sides; whichever side is missing a key gets None in its slot.

Note: before using any JOIN, be clear about the structure of the joined data; every result element is a pair of the form (key, (left_value, right_value)).

>>> data1.fullOuterJoin(data2).collect()
[(0, (1, None)), (1, (1, None)), (2, (1, None)), (3, (1, None)), (4, (1, None)), (5, (1, None)), (6, (1, 1)), (7, (1, 1)), (8, (1, 1)), (9, (1, 1)), (10, (None, 1)), (11, (None, 1)), (12, (None, 1)), (13, (None, 1)), (14, (None, 1))]
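
To make that post-join structure concrete: every element is a pair (key, (left_value, right_value)), so downstream code usually unpacks it. A minimal sketch that flattens the result into plain triples:

>>> sorted(data1.fullOuterJoin(data2).map(lambda kv: (kv[0],) + kv[1]).collect())
[(0, 1, None), (1, 1, None), (2, 1, None), (3, 1, None), (4, 1, None), (5, 1, None), (6, 1, 1), (7, 1, 1), (8, 1, 1), (9, 1, 1), (10, None, 1), (11, None, 1), (12, None, 1), (13, None, 1), (14, None, 1)]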

Reference: https://blog.csdn.net/wawa8899/article/details/81027633
