Group by one column, order by another
row_number() by itself gives the per-group ordering:
select id,up,row_number() over(partition by substring(id,1,2) order by up)
from temp.setup_cleanup ;
id up row_number
13760778710 120 1
13926435656 132 2
13480253104 180 3
13926251106 240 4
13719199419 240 5
13826544101 264 6
15989002119 1938 1
15920133257 3156 2
15013685858 3659 3
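If ties in up should share a rank, rank() or dense_rank() can be swapped in for row_number(); a minimal sketch against the same table (the comments describe standard Hive window-function semantics):

select id,up,
row_number() over(partition by substring(id,1,2) order by up) as rn,  -- 1,2,3,... even on ties
rank()       over(partition by substring(id,1,2) order by up) as rk,  -- ties share a rank, then skip (1,1,3)
dense_rank() over(partition by substring(id,1,2) order by up) as drk  -- ties share a rank, no gaps (1,1,2)
from temp.setup_cleanup;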
Top N per group (compute a rank in a subquery, then filter in the outer query)
select * from
(select id,up,row_number() over(partition by substring(id,1,2) order by up) as rank
from temp.setup_cleanup) a
where rank<=3;
id up rank
13760778710 120 1
13926435656 132 2
13480253104 180 3
15989002119 1938 1
15920133257 3156 2
15013685858 3659 3
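If "top N" should mean the N largest up values rather than the smallest, only the window ordering changes; a sketch under the same assumptions:

select * from
(select id,up,row_number() over(partition by substring(id,1,2) order by up desc) as rank
from temp.setup_cleanup) a
where rank<=3;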
Group aggregation plus each group's share of the total
type qty
a 1
a 2
b 3
c 5
a 6
c 3
select type,sum(qty),
sum(sum(qty)) over(partition by 1),
sum(qty)/sum(sum(qty)) over(partition by 1)
from temp.x group by type;
type sum total per
c 8 20 0.4
b 3 20 0.15
a 9 20 0.45
Run explain on the statement to see how it executes
The plan shows that Hive first runs select type,sum(qty) from x group by type; and then applies the window computation to that intermediate result.
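The plan below was presumably produced by prefixing the original statement with explain:

explain
select type,sum(qty),
sum(sum(qty)) over(partition by 1),
sum(qty)/sum(sum(qty)) over(partition by 1)
from temp.x group by type;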
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Spark
Edges:
Reducer 2 <- Map 1 (GROUP, 60)
Reducer 3 <- Reducer 2 (PARTITION-LEVEL SORT, 60)
DagName: hadoop_20190407123138_845b100c-26b2-48f8-bfe3-a8a2e0b6e29b:19
Vertices:
//(Map 1 + Reducer 2) execute: select type,sum(qty) from x group by type;
Map 1
Map Operator Tree:
TableScan
alias: x // data is read from table x here
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Select Operator // the two columns type and qty are read
expressions: type (type: string), qty (type: int)
outputColumnNames: type, qty
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Group By Operator // note: the first group-by sum aggregation happens on the map side, acting like a combiner
aggregations: sum(qty)
keys: type (type: string)
mode: hash
outputColumnNames: _col0, _col1 // yields _col0: type, _col1: sum(qty)
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reducer 2 // the second group-by sum aggregation happens on the reduce side
Reduce Operator Tree:
Group By Operator
aggregations: sum(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: 1 (type: int)
sort order: +
Map-reduce partition columns: 1 (type: int)
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string), _col1 (type: bigint)
// the window computation is applied to the aggregated result above
Reducer 3
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), VALUE._col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
PTF Operator
Function definitions: // windowing starts here; a window total is computed for every row as sum_window_0
Input definition
input alias: ptf_0
output shape: _col0: string, _col1: bigint
type: WINDOWING
Windowing table definition
input alias: ptf_1
name: windowingtablefunction
order by: 1 ASC NULLS FIRST
partition by: 1 // groups (_col0, _col1) by the constant 1, i.e. partition by 1
raw input shape:
window functions:
window function definition
alias: sum_window_0
arguments: _col1
name: sum
window function: GenericUDAFSumLong
window frame: PRECEDING(MAX)~FOLLOWING(MAX)
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
// assemble the final output: _col0, _col1, sum_window_0, _col1/sum_window_0
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint), sum_window_0 (type: bigint), (UDFToDouble(_col1) / UDFToDouble(sum_window_0)) (type: double)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
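In recent Hive versions the same grand total can usually be obtained with an empty over() clause instead of partition by 1 (both put every aggregated row into a single window); a minimal sketch against the same temp.x:

select type,
sum(qty) as sum_qty,
sum(sum(qty)) over() as total,
sum(qty)/sum(sum(qty)) over() as per
from temp.x group by type;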
Three consecutive days with spend > 100
stdate num
2019-01-01 12
2019-01-02 135
2019-01-03 129
2019-01-04 0
2019-01-05 166
2019-01-06 110
2019-01-07 178
2019-01-08 198
2019-01-09 13
2019-01-10 178
2019-01-11 190
2019-01-12 121
2019-01-13 16
select s1.stdate,s1.num from temp.xzq_y s1
LEFT JOIN temp.xzq_y s2 on s1.stdate=date_add(s2.stdate,-2)
LEFT JOIN temp.xzq_y s3 on s1.stdate=date_add(s3.stdate,-1)
LEFT JOIN temp.xzq_y s4 on s1.stdate=date_add(s4.stdate,1)
LEFT JOIN temp.xzq_y s5 on s1.stdate=date_add(s5.stdate,2)
where (s1.num>100 and s2.num>100 and s3.num>100)
or (s1.num>100 and s3.num>100 and s4.num>100)
or (s1.num>100 and s4.num>100 and s5.num>100);
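An alternative that avoids the four self-joins is the usual "date minus row_number" grouping trick; a sketch, assuming stdate is a yyyy-MM-dd string and that whole runs of qualifying days (rather than each individual day) are wanted:

select min(stdate) as start_date, max(stdate) as end_date, count(1) as days
from
(select stdate,
date_sub(stdate, row_number() over(order by stdate)) as grp -- consecutive dates share one grp value
from temp.xzq_y
where num>100) t
group by grp
having count(1)>=3;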
Customer retention rate
create table temp.a as
select DISTINCT cust_id,acct_id from temp.x where stdate='2019-01-01';
create table temp.b as
select DISTINCT cust_id,acct_id from temp.x where stdate='2019-01-02';
select s1.cust_id,sum(case when s2.cust_id is not null then 1 else 0 end)/count(s1.cust_id)
from temp.a s1
LEFT JOIN temp.b s2
on s1.cust_id=s2.cust_id and s1.acct_id=s2.acct_id
group by s1.cust_id;
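For one overall retention figure (the share of 2019-01-01 cust_id/acct_id pairs that also appear on 2019-01-02), the group by can simply be dropped; a sketch against the same temp.a and temp.b:

select sum(case when s2.cust_id is not null then 1 else 0 end)/count(1) as retention_rate
from temp.a s1
LEFT JOIN temp.b s2
on s1.cust_id=s2.cust_id and s1.acct_id=s2.acct_id;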
Collect multiple rows into an array: collect_set() deduplicates, collect_list() keeps duplicates
select type,concat_ws(',',collect_list(cast(qty as string))) from temp.x group by type;
type qty                type  collected
a    1                  a     1,2,6
a    2      ==>>        b     3
b    3                  c     5,3
c    5
a    6
c    3
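Going the other way, a collected array can be expanded back into rows with lateral view explode; a small sketch over the same temp.x:

select t.type, e.q
from (select type, collect_list(qty) as qs from temp.x group by type) t
lateral view explode(qs) e as q;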
Hive: salted two-stage aggregation for group by count (random suffix on the key)
select split(s2.year_key,'_')[0] as y,sum(cnt) from -- strip the suffix and aggregate a second time
(select year_key,count(1) cnt from -- first aggregation on the salted key
(select concat(substring(date_id,1,4),'_',round(rand()*6,0)) as year_key from ka.tb_prod) s1 -- append a random suffix
group by year_key) s2
GROUP BY split(s2.year_key,'_')[0];
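One small variant (an assumption, not from the original): round(rand()*6,0) returns a DOUBLE, so the salted key comes out as e.g. 2019_3.0; floor(rand()*6) cast to int keeps it as 2019_3. The salt must still be materialized in the innermost subquery, as above, so the non-deterministic rand() is evaluated only once per row:

select split(s2.year_key,'_')[0] as y,sum(cnt)
from
(select year_key,count(1) cnt
from (select concat(substring(date_id,1,4),'_',cast(floor(rand()*6) as int)) as year_key from ka.tb_prod) s1
group by year_key) s2
group by split(s2.year_key,'_')[0];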