Problem:
Keep only one row per user under each app.
Example:
spark-sql> with test1 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid)
> select
> distinct userid,apptypeid
> from test1;
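The spark-sql transcript above does not show its output, so for reference here is a minimal sketch of the same DISTINCT deduplication using Python's built-in sqlite3 module (the table name test1 and the sample rows mirror the Spark example; an in-memory database stands in for the Spark session):

```python
import sqlite3

# In-memory database standing in for the Spark session.
conn = sqlite3.connect(":memory:")

# Same three rows as the CTE above: userid 123 appears twice.
rows = conn.execute(
    """
    WITH test1(userid, apptypeid) AS (
        VALUES (122, 100024), (123, 100024), (123, 100024)
    )
    SELECT DISTINCT userid, apptypeid FROM test1
    ORDER BY userid
    """
).fetchall()

print(rows)  # [(122, 100024), (123, 100024)]
```

DISTINCT collapses the duplicate (123, 100024) row, leaving one row per (userid, apptypeid).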
Problem:
Keep only one row per user under each app (same problem, this time solved with group by).
Example:
spark-sql> with test1 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid)
> select
> userid,
> apptypeid
> from
> (select
> userid,
> apptypeid
> from test1) t1
> group by userid,apptypeid;
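Grouping on every selected column deduplicates exactly like DISTINCT does. A small sqlite3 sketch (same assumed sample data as above) confirms the two queries return the same result set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cte = ("WITH test1(userid, apptypeid) AS "
       "(VALUES (122, 100024), (123, 100024), (123, 100024)) ")

# GROUP BY on all selected columns deduplicates like DISTINCT.
grouped = conn.execute(
    cte + "SELECT userid, apptypeid FROM test1 "
          "GROUP BY userid, apptypeid ORDER BY userid"
).fetchall()
deduped = conn.execute(
    cte + "SELECT DISTINCT userid, apptypeid FROM test1 ORDER BY userid"
).fetchall()

assert grouped == deduped  # identical result sets
print(grouped)
```

The results are identical; the difference the summary at the end of these notes cares about is how each is executed, not what it returns.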
Problem:
For each app, take each user's most recent channel (qid), version (ver), and operating system (os).
Analysis:
distinct only removes exact duplicates, so it cannot solve this; group by is likewise simple grouping-based deduplication and cannot solve it; order by only sorts and cannot solve it either. This is where row_number() comes in: partition the data into groups, sort within each group, and keep the top row.
Example:
spark-sql> with test1 as
> (select 122 as userid,100024 as apptypeid,'appstore' as qid,'ios' as os,'1.0.2' as ver,1627440618 as dateline
> union all
> select 123 as userid,100024 as apptypeid,'huawei' as qid,'android' as os,'1.0.3' as ver,1627440620 as dateline
> union all
> select 123 as userid,100024 as apptypeid,'huawei' as qid,'android' as os,'1.0.4' as ver,1627440621 as dateline)
> select
> userid,
> apptypeid,
> qid,
> os,
> ver
> from
> (select
> userid,
> apptypeid,
> qid,
> os,
> ver,
> row_number() over(distribute by apptypeid,userid sort by dateline desc) as rank
> from test1) t1
> where t1.rank=1;
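The `distribute by ... sort by` clause in the Spark window above corresponds to `partition by ... order by` in standard SQL, which sqlite3 (SQLite 3.25+) also supports. A minimal runnable sketch with the same assumed sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Spark's "distribute by ... sort by" inside OVER() is written as
# "partition by ... order by" in standard SQL.
rows = conn.execute(
    """
    WITH test1(userid, apptypeid, qid, os, ver, dateline) AS (
        VALUES (122, 100024, 'appstore', 'ios',     '1.0.2', 1627440618),
               (123, 100024, 'huawei',   'android', '1.0.3', 1627440620),
               (123, 100024, 'huawei',   'android', '1.0.4', 1627440621)
    )
    SELECT userid, apptypeid, qid, os, ver
    FROM (
        SELECT userid, apptypeid, qid, os, ver,
               ROW_NUMBER() OVER (PARTITION BY apptypeid, userid
                                  ORDER BY dateline DESC) AS rn
        FROM test1
    )
    WHERE rn = 1
    ORDER BY userid
    """
).fetchall()

print(rows)  # one latest row per (apptypeid, userid)
```

For userid 123, the row with the larger dateline (version 1.0.4) survives; the 1.0.3 row is dropped.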
Problem:
Compute the daily new users. There is a daily user table test1 and a historical new-user table test2 (new user: each user has exactly one row per app).
Analysis:
1. Deduplicate the daily user table test1 with group by to get that day's users.
2. Left join the result against the historical new-user table; users with no match in the historical table are that day's new users.
Example:
spark-sql> with test1 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid),
>
> test2 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 124 as userid,100024 as apptypeid
> union all
> select 125 as userid,100024 as apptypeid)
> select
> t1.userid,
> t1.apptypeid
> from
> (select
> userid,
> apptypeid
> from test1
> group by userid,apptypeid) t1
>
> left join
> (select
> userid,
> apptypeid
> from test2) t2
> on t1.apptypeid=t2.apptypeid and t1.userid=t2.userid
> where t2.userid is null;
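The left-join anti-join above can be reproduced with sqlite3 as well; this sketch uses the same assumed sample data, where only userid 123 is absent from the historical table test2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# t1: today's users, deduplicated; t2: historical new users.
# Unmatched left-side rows get NULL on the right, so
# "t2.userid IS NULL" keeps exactly the never-before-seen users.
rows = conn.execute(
    """
    WITH test1(userid, apptypeid) AS (
        VALUES (122, 100024), (123, 100024), (123, 100024)
    ),
    test2(userid, apptypeid) AS (
        VALUES (122, 100024), (124, 100024), (125, 100024)
    )
    SELECT t1.userid, t1.apptypeid
    FROM (SELECT userid, apptypeid FROM test1
          GROUP BY userid, apptypeid) t1
    LEFT JOIN test2 t2
      ON t1.apptypeid = t2.apptypeid AND t1.userid = t2.userid
    WHERE t2.userid IS NULL
    """
).fetchall()

print(rows)  # only userid 123 is new today
```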
Problem:
Compute the daily new users (same problem as above, this time solved with tags instead of a join).
Analysis:
1. Deduplicate the daily user table test1 with group by to get that day's users.
2. Tag that day's users with 10 and the historical new users with 1 (bit-style tags).
3. Concatenate the two sets with union all and sum the tags per (userid, apptypeid): a user in both sets sums to 11, a user only in the historical table sums to 1, and a user only in today's table sums to 10. Rows whose tag sum is 10 are that day's new users.
Example:
spark-sql> with test1 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid
> union all
> select 123 as userid,100024 as apptypeid),
>
> test2 as
> (select 122 as userid,100024 as apptypeid
> union all
> select 124 as userid,100024 as apptypeid
> union all
> select 125 as userid,100024 as apptypeid)
>
> select
> userid,
> apptypeid
> from
> (select
> sum(tag) as tag,
> userid,
> apptypeid
> from
> (select
> 10 as tag,
> t1.userid,
> t1.apptypeid
> from
> (select
> userid,
> apptypeid
> from test1
> group by userid,apptypeid) t1
>
> union all
> select
> 1 as tag,
> userid,
> apptypeid
> from test2) t2
> group by userid,apptypeid) t3
> where t3.tag=10;
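The tag-sum variant can be checked the same way; with the same assumed sample data it must return the same single row as the left-join version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Today's deduplicated users carry tag 10, historical new users tag 1.
# After union all + sum: 11 = in both, 1 = history only, 10 = new today.
rows = conn.execute(
    """
    WITH test1(userid, apptypeid) AS (
        VALUES (122, 100024), (123, 100024), (123, 100024)
    ),
    test2(userid, apptypeid) AS (
        VALUES (122, 100024), (124, 100024), (125, 100024)
    )
    SELECT userid, apptypeid
    FROM (
        SELECT SUM(tag) AS tag, userid, apptypeid
        FROM (
            SELECT 10 AS tag, userid, apptypeid
            FROM test1 GROUP BY userid, apptypeid
            UNION ALL
            SELECT 1 AS tag, userid, apptypeid FROM test2
        )
        GROUP BY userid, apptypeid
    )
    WHERE tag = 10
    """
).fetchall()

print(rows)  # userid 123, matching the left-join result
```

Because the tag values occupy different "digits" (10 vs 1), the sum unambiguously encodes which side(s) each user came from, replacing the join with a union-all plus aggregation.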
Summary:
1. For simple deduplication, prefer group by over distinct. With distinct, all the data ends up in a single reduce task, which wastes resources, runs slowly, and risks out-of-memory errors.
2. For computing daily new users on large data volumes, prefer the tag (bit-operation) approach over the join.