一般用来分组取topN等
依次为部门 dep,员工 emp,月薪 sal
a 101 7
a 103 11
a 102 9
a 109 14
b 105 12
b 104 12
b 106 6
b 107 13
b 108 9
(1)3个区别?
SELECT *,
row_number() OVER(PARTITION BY dep ORDER BY sal DESC),
rank() OVER(PARTITION BY dep ORDER BY sal DESC),
dense_rank() OVER(PARTITION BY dep ORDER BY sal DESC)
FROM test
可以看出,row_number当两个值相同时,排名不一样,rank当两个值相同时,排名一样,
dense_rank当两个值相同时,排名一样,并且总数会减少
(2)取各部门薪资top2员工?
SELECT tmp.*
FROM
(
SELECT dep,emp,sal,
dense_rank() OVER(PARTITION BY dep ORDER BY sal DESC) rank
FROM test
) tmp
WHERE rank<=2
聚合函数sum,count,avg等,是将多行数据按某种规则聚合成一行,聚合后的行数小于等于聚合前行数的,有些场景我们不仅需要聚合前数据,也需要聚合后数据,这就需要over()
一般用来求占比,将各值与总数直接求比等
依次为部门 dep,员工 emp,月薪 sal
a 101 7
a 103 11
a 102 9
a 109 14
b 105 12
b 104 12
b 106 6
b 107 13
b 108 9
(1)每个员工占部门总薪资比例?
SELECT dep,emp,
sal/sum(sal) OVER(PARTITION BY dep)
FROM test
sum(sal) OVER(PARTITION BY dep)//求每个部门薪资总和
(2)每个员工占公司总薪资比例?
SELECT dep,emp,
sal/sum(sal) OVER()
FROM test
sum(sal) OVER() //求所有部门薪资总和
通常用来计算页面停留时间,同比环比增长率等
依次为sessionID(从进入app到离开的唯一标识),页面id,进入时间
s1 p1 2019-02-24 12:23:40
s1 p2 2019-02-24 12:23:43
s1 p3 2019-02-24 12:23:49
s2 p2 2019-02-24 12:27:01
s2 p3 2019-02-24 12:27:19
s2 p5 2019-02-24 12:27:31
(1)求页面的平均停留时长?
SELECT session_id,page_id,`date`,
lead(`date`,1,'end') OVER(PARTITION BY session_id ORDER BY `date`)
FROM rs
lead三个参数:第一个为hive列,第二个为取往后第几行,第三个为取不到时默认值
最后一个页面无离开时间,故无法计算最后页面的停留时间,
接下来过滤掉end数据,后面时间减去前面时间
SELECT spd.page_id,avg(unix_timestamp(spd.stop)-unix_timestamp(spd.start))
FROM
(
SELECT session_id,page_id,`date` start,
lead(`date`,1,'end') OVER(PARTITION BY session_id ORDER BY `date`) stop
FROM rs
) spd
WHERE spd.stop!='end'
GROUP BY spd.page_id
lag与lead刚好相反,取前面的某行
同比环比例子,依次为月份,销售额
1 2610
2 813
3 855
4 818
5 873
6 953
7 1242
8 1199
9 1884
10 796
11 1197
12 2900
(2)求每月环比增长率?
SELECT month,num,
num-(lag(num,1) OVER(ORDER BY month)),
lag(num,1) OVER(ORDER BY month)
FROM test
然后我们用第三列除以第四列就可以了。
同比的话,lag(num,12)就ok.
非常强大的函数,lag(),lead(),first_value(),last_value()都可用window()实现
依次为月份,销售额
1 2610
2 813
3 855
4 818
5 873
6 953
7 1242
8 1199
9 1884
10 796
11 1197
12 2900
(1)用window()实现lead(字段,1)功能?
SELECT month,num,
sum(num) OVER(ORDER BY month ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING)
FROM test
rows between ....and...是窗口区间
ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING//取排序后范围在当前行后一行与后一行之间数据
由于只有一行,sum换为avg也ok,故效果与lead类似,1 FOLLOWING表示当前行后一行
lead(字段,n)只需将1 FOLLOWING改为n FOLLOWING即可。
(2)用window()实现lag(字段,1)功能?
SELECT month,num,
sum(num) OVER(ORDER BY month ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
FROM test
VROWS BETWEEN 1 PRECEDING AND 1 PRECEDING//取排序后范围在当前行前一行与前一行的值
(3)求截止每个月月底的销量?
SELECT month,num,
sum(num) OVER(ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM test
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW//表示取排序后范围从起始行到当前行
目前想不到有什么应用场景