spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name
> ),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 4 as user_id,21 as age
> )
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> left join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id;
1 xiaoming 1 19
2 xiaolan 2 20
3 xiaoxin NULL NULL
Time taken: 0.936 seconds, Fetched 3 row(s)
以左表为主,左表数据全部返回,右表数据只返回和左表通过关联条件能关联上的数据
spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name
> ),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 4 as user_id,21 as age
> )
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> right join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id;
1 xiaoming 1 19
2 xiaolan 2 20
NULL NULL 4 21
Time taken: 0.936 seconds, Fetched 3 row(s)
以右表为主,右表数据全部返回,左表只返回和右表通过关联条件能关联上的数据
spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name
> ),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 4 as user_id,21 as age
> )
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> inner join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id;
1 xiaoming 1 19
2 xiaolan 2 20
Time taken: 1.108 seconds, Fetched 2 row(s)
通过关联条件只取两个表数据的交集,关联不上的全部剔除了
spark-sql> with test1 as (
> select 1 as user_id,'xiaoming' as name
> union all
> select 2 as user_id,'xiaolan' as name
> union all
> select 3 as user_id,'xiaoxin' as name
> ),
>
> test2 as (
> select 1 as user_id,19 as age
> union all
> select 2 as user_id,20 as age
> union all
> select 4 as user_id,21 as age
> )
>
> select
> t1.user_id,
> t1.name,
> t2.user_id,
> t2.age
> from
> (select
> user_id,
> name
> from test1) t1
>
> full join
> (select
> user_id,
> age
> from test2) t2
> on t1.user_id=t2.user_id;
1 xiaoming 1 19
3 xiaoxin NULL NULL
NULL NULL 4 21
2 xiaolan 2 20
通过关联条件取两个表数据的并集,关联上的数据返回,关联不上的数据,另一部分返回NULL
总结:
如果只做数据拼接,一般用left join;如果只取几个表共同的部分,用inner join;如果所有的关联数据都要保留用full join