最近遇到了hive开窗函数的工作,并且面试时很有可能会问到hive的开窗函数,特整理加深印象,并方便以后复习
普通的聚合函数聚合的行集是组,开窗函数聚合的行集是窗口。因此,普通的聚合函数每组(Group by)只返回一个值,而开窗函数则可为窗口中的每行都返回一个值。简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。
开窗函数一般分为两类,聚合开窗函数和排序开窗函数。
表的名称:
rt_data -> 存放了当天每半小时的店铺销售数据
表的字段:
shop_id, stat_date, ordamt
字段stat_date格式yyMMddHHmm
表的数据:
10026 201901230030 5170
10026 201901230100 5669
10026 201901230130 2396
10026 201901230200 1498
10026 201901230230 1997
10026 201901230300 1188
10026 201901230330 598
10026 201901230400 479
10026 201901230430 1587
10026 201901230530 799
10027 201901230030 2170
10027 201901230100 1623
10027 201901230130 3397
10027 201901230200 1434
10027 201901230230 1001
10028 201901230300 1687
10028 201901230330 1298
10028 201901230400 149
10029 201901230430 2587
10029 201901230530 589
设置本地模式
set hive.exec.mode.local.auto=true;
创建数据库
CREATE DATABASE IF NOT EXISTS db_window;
创建表
CREATE TABLE IF NOT EXISTS db_window.rt_data(
shop_id INT,
stat_date STRING,
ordamt DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
加载数据到表的指定位置
hdfs dfs -put rt_data.txt /user/hive/warehouse-3.1.1/db_window.db/rt_data/
开窗函数中指定行及值的语法
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
主要场景算截止到某位置的数目
select shop_id,stat_date,ordamt,
-- 以符合条件的所有行作为窗口
count(shop_id) over() as count1,
-- 以按shop_id分组的所有行作为窗口
count(shop_id) over(partition by shop_id) as count2,
-- 以按shop_id分组、按stat_date排序的所有行作为窗口
count(shop_id) over(partition by shop_id order by stat_date) as count3,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口
count(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as count4,
-- 以按shop_id分组、按stat_date排序、按从起点到末尾,默认从起点到末尾和count2结果相同
count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as count5,
-- 以按shop_id分组、按stat_date排序、按从起点到当前行的前一行
count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as count6,
-- 以按shop_id分组、按stat_date排序、按从起点到当前行
count(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as count7,
-- 以按shop_id分组、按stat_date排序、按从当前行到末尾
count(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as count8,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到末尾
count(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as count9,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行
count(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as count10
from rt_data;
OK
shop_id stat_date ordamt count1 count2 count3 count4 count5 count6 count7 count8 count9 count10
10026 201901230030 5170.0 20 10 1 3 10 0 1 10 9 2
10026 201901230100 5669.0 20 10 2 4 10 1 2 9 8 2
10026 201901230130 2396.0 20 10 3 4 10 2 3 8 7 2
10026 201901230200 1498.0 20 10 4 4 10 3 4 7 6 2
10026 201901230230 1997.0 20 10 5 4 10 4 5 6 5 2
10026 201901230300 1188.0 20 10 6 4 10 5 6 5 4 2
10026 201901230330 598.0 20 10 7 4 10 6 7 4 3 2
10026 201901230400 479.0 20 10 8 4 10 7 8 3 2 2
10026 201901230430 1587.0 20 10 9 3 10 8 9 2 1 1
10026 201901230530 799.0 20 10 10 2 10 9 10 1 0 0
10027 201901230030 2170.0 20 5 1 3 5 0 1 5 4 2
10027 201901230100 1623.0 20 5 2 4 5 1 2 4 3 2
10027 201901230130 3397.0 20 5 3 4 5 2 3 3 2 2
10027 201901230200 1434.0 20 5 4 3 5 3 4 2 1 1
10027 201901230230 1001.0 20 5 5 2 5 4 5 1 0 0
10028 201901230300 1687.0 20 3 1 3 3 0 1 3 2 2
10028 201901230330 1298.0 20 3 2 3 3 1 2 2 1 1
10028 201901230400 149.0 20 3 3 2 3 2 3 1 0 0
10029 201901230430 2587.0 20 2 1 2 2 0 1 2 1 1
10029 201901230530 589.0 20 2 2 2 2 1 2 1 0 0
主要场景统计截止到某个位置的值
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、统计每个商品截止到当前时间的销售总额,默认从起点到当前行
sum(ordamt) over(partition by shop_id order by stat_date) as sum_amt1,
-- 以按shop_id分组、按stat_date排序、统计每个商品前半小时到后一小时的销售额(按当前行+往前1行+往后2行的行作为窗口)
sum(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as sum_amt2,
-- 以按shop_id分组、按stat_date排序、统计每个商品的销售总额(从起点到末尾)
sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as sum_amt3,
-- 以按shop_id分组、按stat_date排序、统计截止到前半小时的销售总额(从起点到当前行的前一行)
sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as sum_amt4,
-- 以按shop_id分组、按stat_date排序、统计每个商品截止到当前时间的销售总额,默认从起点到当前行(从起点到当前行)
sum(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as sum_amt5,
-- 以按shop_id分组、按stat_date排序、统计当前时间及之后的销售总额(从当前行的末尾)
sum(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as sum_amt6,
-- 以按shop_id分组、按stat_date排序、统计当前时间的后半小时及之后的销售额(当前行后一行到末尾)
sum(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as sum_amt7,
-- 以按shop_id分组、按stat_date排序、统计当前时间后半小时到后一小时之间的销售额(按从当前行往后一行到当前行往后2行)
sum(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as sum_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt sum_amt1 sum_amt2 sum_amt3 sum_amt4 sum_amt5 sum_amt6 sum_amt7 sum_amt8
10026 201901230030 5170.0 5170.0 13235.0 21381.0 NULL 5170.0 21381.0 16211.0 8065.0
10026 201901230100 5669.0 10839.0 14733.0 21381.0 5170.0 10839.0 16211.0 10542.0 3894.0
10026 201901230130 2396.0 13235.0 11560.0 21381.0 10839.0 13235.0 10542.0 8146.0 3495.0
10026 201901230200 1498.0 14733.0 7079.0 21381.0 13235.0 14733.0 8146.0 6648.0 3185.0
10026 201901230230 1997.0 16730.0 5281.0 21381.0 14733.0 16730.0 6648.0 4651.0 1786.0
10026 201901230300 1188.0 17918.0 4262.0 21381.0 16730.0 17918.0 4651.0 3463.0 1077.0
10026 201901230330 598.0 18516.0 3852.0 21381.0 17918.0 18516.0 3463.0 2865.0 2066.0
10026 201901230400 479.0 18995.0 3463.0 21381.0 18516.0 18995.0 2865.0 2386.0 2386.0
10026 201901230430 1587.0 20582.0 2865.0 21381.0 18995.0 20582.0 2386.0 799.0 799.0
10026 201901230530 799.0 21381.0 2386.0 21381.0 20582.0 21381.0 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 7190.0 9625.0 NULL 2170.0 9625.0 7455.0 5020.0
10027 201901230100 1623.0 3793.0 8624.0 9625.0 2170.0 3793.0 7455.0 5832.0 4831.0
10027 201901230130 3397.0 7190.0 7455.0 9625.0 3793.0 7190.0 5832.0 2435.0 2435.0
10027 201901230200 1434.0 8624.0 5832.0 9625.0 7190.0 8624.0 2435.0 1001.0 1001.0
10027 201901230230 1001.0 9625.0 2435.0 9625.0 8624.0 9625.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 3134.0 3134.0 NULL 1687.0 3134.0 1447.0 1447.0
10028 201901230330 1298.0 2985.0 3134.0 3134.0 1687.0 2985.0 1447.0 149.0 149.0
10028 201901230400 149.0 3134.0 1447.0 3134.0 2985.0 3134.0 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 3176.0 3176.0 NULL 2587.0 3176.0 589.0 589.0
10029 201901230530 589.0 3176.0 3176.0 3176.0 2587.0 3176.0 589.0 NULL NULL
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、
min(ordamt) over(partition by shop_id order by stat_date) as min_amt1,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as min_amt2,
-- 以按shop_id分组、按stat_date排序、从起点到末尾的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as min_amt3,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as min_amt4,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as min_amt5,
-- 以按shop_id分组、按stat_date排序、从当前行的末尾的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as min_amt6,
-- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as min_amt7,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最小数
min(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as min_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt min_amt1 min_amt2 min_amt3 min_amt4 min_amt5 min_amt6 min_amt7 min_amt8
10026 201901230030 5170.0 5170.0 2396.0 479.0 NULL 5170.0 479.0 479.0 2396.0
10026 201901230100 5669.0 5170.0 1498.0 479.0 5170.0 5170.0 479.0 479.0 1498.0
10026 201901230130 2396.0 2396.0 1498.0 479.0 5170.0 2396.0 479.0 479.0 1498.0
10026 201901230200 1498.0 1498.0 1188.0 479.0 2396.0 1498.0 479.0 479.0 1188.0
10026 201901230230 1997.0 1498.0 598.0 479.0 1498.0 1498.0 479.0 479.0 598.0
10026 201901230300 1188.0 1188.0 479.0 479.0 1498.0 1188.0 479.0 479.0 479.0
10026 201901230330 598.0 598.0 479.0 479.0 1188.0 598.0 479.0 479.0 479.0
10026 201901230400 479.0 479.0 479.0 479.0 598.0 479.0 479.0 799.0 799.0
10026 201901230430 1587.0 479.0 479.0 479.0 479.0 479.0 799.0 799.0 799.0
10026 201901230530 799.0 479.0 799.0 479.0 479.0 479.0 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 1623.0 1001.0 NULL 2170.0 1001.0 1001.0 1623.0
10027 201901230100 1623.0 1623.0 1434.0 1001.0 2170.0 1623.0 1001.0 1001.0 1434.0
10027 201901230130 3397.0 1623.0 1001.0 1001.0 1623.0 1623.0 1001.0 1001.0 1001.0
10027 201901230200 1434.0 1434.0 1001.0 1001.0 1623.0 1434.0 1001.0 1001.0 1001.0
10027 201901230230 1001.0 1001.0 1001.0 1001.0 1434.0 1001.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 149.0 149.0 NULL 1687.0 149.0 149.0 149.0
10028 201901230330 1298.0 1298.0 149.0 149.0 1687.0 1298.0 149.0 149.0 149.0
10028 201901230400 149.0 149.0 149.0 149.0 1298.0 149.0 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 589.0 589.0 NULL 2587.0 589.0 589.0 589.0
10029 201901230530 589.0 589.0 589.0 589.0 2587.0 589.0 589.0 NULL NULL
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、
max(ordamt) over(partition by shop_id order by stat_date) as max_amt1,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as max_amt2,
-- 以按shop_id分组、按stat_date排序、从起点到末尾的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as max_amt3,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as max_amt4,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as max_amt5,
-- 以按shop_id分组、按stat_date排序、从当前行的末尾的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as max_amt6,
-- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as max_amt7,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最大数
max(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as max_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt max_amt1 max_amt2 max_amt3 max_amt4 max_amt5 max_amt6 max_amt7 max_amt8
10026 201901230030 5170.0 5170.0 5669.0 5669.0 NULL 5170.0 5669.0 5669.0 5669.0
10026 201901230100 5669.0 5669.0 5669.0 5669.0 5170.0 5669.0 5669.0 2396.0 2396.0
10026 201901230130 2396.0 5669.0 5669.0 5669.0 5669.0 5669.0 2396.0 1997.0 1997.0
10026 201901230200 1498.0 5669.0 2396.0 5669.0 5669.0 5669.0 1997.0 1997.0 1997.0
10026 201901230230 1997.0 5669.0 1997.0 5669.0 5669.0 5669.0 1997.0 1587.0 1188.0
10026 201901230300 1188.0 5669.0 1997.0 5669.0 5669.0 5669.0 1587.0 1587.0 598.0
10026 201901230330 598.0 5669.0 1587.0 5669.0 5669.0 5669.0 1587.0 1587.0 1587.0
10026 201901230400 479.0 5669.0 1587.0 5669.0 5669.0 5669.0 1587.0 1587.0 1587.0
10026 201901230430 1587.0 5669.0 1587.0 5669.0 5669.0 5669.0 1587.0 799.0 799.0
10026 201901230530 799.0 5669.0 1587.0 5669.0 5669.0 5669.0 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 3397.0 3397.0 NULL 2170.0 3397.0 3397.0 3397.0
10027 201901230100 1623.0 2170.0 3397.0 3397.0 2170.0 2170.0 3397.0 3397.0 3397.0
10027 201901230130 3397.0 3397.0 3397.0 3397.0 2170.0 3397.0 3397.0 1434.0 1434.0
10027 201901230200 1434.0 3397.0 3397.0 3397.0 3397.0 3397.0 1434.0 1001.0 1001.0
10027 201901230230 1001.0 3397.0 1434.0 3397.0 3397.0 3397.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 1687.0 1687.0 NULL 1687.0 1687.0 1298.0 1298.0
10028 201901230330 1298.0 1687.0 1687.0 1687.0 1687.0 1687.0 1298.0 149.0 149.0
10028 201901230400 149.0 1687.0 1298.0 1687.0 1687.0 1687.0 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 2587.0 2587.0 NULL 2587.0 2587.0 589.0 589.0
10029 201901230530 589.0 2587.0 2587.0 2587.0 2587.0 2587.0 589.0 NULL NULL
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、
round(avg(ordamt) over(partition by shop_id order by stat_date),2) as avg_amt1,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following), 2) as avg_amt2,
-- 以按shop_id分组、按stat_date排序、从起点到末尾的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following), 2) as avg_amt3,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding), 2) as avg_amt4,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row), 2) as avg_amt5,
-- 以按shop_id分组、按stat_date排序、从当前行的末尾的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following), 2) as avg_amt6,
-- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following), 2) as avg_amt7,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的平均值
round(avg(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following), 2) as avg_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt avg_amt1 avg_amt2 avg_amt3 avg_amt4 avg_amt5 avg_amt6 avg_amt7 avg_amt8
10026 201901230030 5170.0 5170.0 4411.67 2138.1 NULL 5170.0 2138.1 1801.22 4032.5
10026 201901230100 5669.0 5419.5 3683.25 2138.1 5170.0 5419.5 1801.22 1317.75 1947.0
10026 201901230130 2396.0 4411.67 2890.0 2138.1 5419.5 4411.67 1317.75 1163.71 1747.5
10026 201901230200 1498.0 3683.25 1769.75 2138.1 4411.67 3683.25 1163.71 1108.0 1592.5
10026 201901230230 1997.0 3346.0 1320.25 2138.1 3683.25 3346.0 1108.0 930.2 893.0
10026 201901230300 1188.0 2986.33 1065.5 2138.1 3346.0 2986.33 930.2 865.75 538.5
10026 201901230330 598.0 2645.14 963.0 2138.1 2986.33 2645.14 865.75 955.0 1033.0
10026 201901230400 479.0 2374.38 865.75 2138.1 2645.14 2374.38 955.0 1193.0 1193.0
10026 201901230430 1587.0 2286.89 955.0 2138.1 2374.38 2286.89 1193.0 799.0 799.0
10026 201901230530 799.0 2138.1 1193.0 2138.1 2286.89 2138.1 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 2396.67 1925.0 NULL 2170.0 1925.0 1863.75 2510.0
10027 201901230100 1623.0 1896.5 2156.0 1925.0 2170.0 1896.5 1863.75 1944.0 2415.5
10027 201901230130 3397.0 2396.67 1863.75 1925.0 1896.5 2396.67 1944.0 1217.5 1217.5
10027 201901230200 1434.0 2156.0 1944.0 1925.0 2396.67 2156.0 1217.5 1001.0 1001.0
10027 201901230230 1001.0 1925.0 1217.5 1925.0 2156.0 1925.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 1044.67 1044.67 NULL 1687.0 1044.67 723.5 723.5
10028 201901230330 1298.0 1492.5 1044.67 1044.67 1687.0 1492.5 723.5 149.0 149.0
10028 201901230400 149.0 1044.67 723.5 1044.67 1492.5 1044.67 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 1588.0 1588.0 NULL 2587.0 1588.0 589.0 589.0
10029 201901230530 589.0 1588.0 1588.0 1588.0 2587.0 1588.0 589.0 NULL NULL
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、
first_value(ordamt) over(partition by shop_id order by stat_date) as first_amt1,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as first_amt2,
-- 以按shop_id分组、按stat_date排序、从起点到末尾的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as first_amt3,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as first_amt4,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as first_amt5,
-- 以按shop_id分组、按stat_date排序、从当前行的末尾的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as first_amt6,
-- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as first_amt7,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的第一个值
first_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as first_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt first_amt1 first_amt2 first_amt3 first_amt4 first_amt5 first_amt6 first_amt7 first_amt8
10026 201901230030 5170.0 5170.0 5170.0 5170.0 NULL 5170.0 5170.0 5669.0 5669.0
10026 201901230100 5669.0 5170.0 5170.0 5170.0 5170.0 5170.0 5669.0 2396.0 2396.0
10026 201901230130 2396.0 5170.0 5669.0 5170.0 5170.0 5170.0 2396.0 1498.0 1498.0
10026 201901230200 1498.0 5170.0 2396.0 5170.0 5170.0 5170.0 1498.0 1997.0 1997.0
10026 201901230230 1997.0 5170.0 1498.0 5170.0 5170.0 5170.0 1997.0 1188.0 1188.0
10026 201901230300 1188.0 5170.0 1997.0 5170.0 5170.0 5170.0 1188.0 598.0 598.0
10026 201901230330 598.0 5170.0 1188.0 5170.0 5170.0 5170.0 598.0 479.0 479.0
10026 201901230400 479.0 5170.0 598.0 5170.0 5170.0 5170.0 479.0 1587.0 1587.0
10026 201901230430 1587.0 5170.0 479.0 5170.0 5170.0 5170.0 1587.0 799.0 799.0
10026 201901230530 799.0 5170.0 1587.0 5170.0 5170.0 5170.0 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 2170.0 2170.0 NULL 2170.0 2170.0 1623.0 1623.0
10027 201901230100 1623.0 2170.0 2170.0 2170.0 2170.0 2170.0 1623.0 3397.0 3397.0
10027 201901230130 3397.0 2170.0 1623.0 2170.0 2170.0 2170.0 3397.0 1434.0 1434.0
10027 201901230200 1434.0 2170.0 3397.0 2170.0 2170.0 2170.0 1434.0 1001.0 1001.0
10027 201901230230 1001.0 2170.0 1434.0 2170.0 2170.0 2170.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 1687.0 1687.0 NULL 1687.0 1687.0 1298.0 1298.0
10028 201901230330 1298.0 1687.0 1687.0 1687.0 1687.0 1687.0 1298.0 149.0 149.0
10028 201901230400 149.0 1687.0 1298.0 1687.0 1687.0 1687.0 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 2587.0 2587.0 NULL 2587.0 2587.0 589.0 589.0
10029 201901230530 589.0 2587.0 2587.0 2587.0 2587.0 2587.0 589.0 NULL NULL
select
shop_id, stat_date, ordamt,
-- 以按shop_id分组、按stat_date排序、
last_value(ordamt) over(partition by shop_id order by stat_date) as last_amt1,
-- 以按shop_id分组、按stat_date排序、按当前行+往前1行+往后2行的行作为窗口的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 preceding and 2 following) as last_amt2,
-- 以按shop_id分组、按stat_date排序、从起点到末尾的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and unbounded following) as last_amt3,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的前一行的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and 1 preceding) as last_amt4,
-- 以按shop_id分组、按stat_date排序、从起点到当前行的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between unbounded preceding and current row) as last_amt5,
-- 以按shop_id分组、按stat_date排序、从当前行的末尾的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between current row and unbounded following) as last_amt6,
-- 以按shop_id分组、按stat_date排序、当前行后一行到末尾的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and unbounded following) as last_amt7,
-- 以按shop_id分组、按stat_date排序、按从当前行往后一行到当前行往后2行的最后一个值
last_value(ordamt) over(partition by shop_id order by stat_date rows between 1 following and 2 following) as last_amt8
from db_window.rt_data;
OK
shop_id stat_date ordamt last_amt1 last_amt2 last_amt3 last_amt4 last_amt5 last_amt6 last_amt7 last_amt8
10026 201901230030 5170.0 5170.0 2396.0 799.0 NULL 5170.0 799.0 799.0 2396.0
10026 201901230100 5669.0 5669.0 1498.0 799.0 5170.0 5669.0 799.0 799.0 1498.0
10026 201901230130 2396.0 2396.0 1997.0 799.0 5669.0 2396.0 799.0 799.0 1997.0
10026 201901230200 1498.0 1498.0 1188.0 799.0 2396.0 1498.0 799.0 799.0 1188.0
10026 201901230230 1997.0 1997.0 598.0 799.0 1498.0 1997.0 799.0 799.0 598.0
10026 201901230300 1188.0 1188.0 479.0 799.0 1997.0 1188.0 799.0 799.0 479.0
10026 201901230330 598.0 598.0 1587.0 799.0 1188.0 598.0 799.0 799.0 1587.0
10026 201901230400 479.0 479.0 799.0 799.0 598.0 479.0 799.0 799.0 799.0
10026 201901230430 1587.0 1587.0 799.0 799.0 479.0 1587.0 799.0 799.0 799.0
10026 201901230530 799.0 799.0 799.0 799.0 1587.0 799.0 799.0 NULL NULL
10027 201901230030 2170.0 2170.0 3397.0 1001.0 NULL 2170.0 1001.0 1001.0 3397.0
10027 201901230100 1623.0 1623.0 1434.0 1001.0 2170.0 1623.0 1001.0 1001.0 1434.0
10027 201901230130 3397.0 3397.0 1001.0 1001.0 1623.0 3397.0 1001.0 1001.0 1001.0
10027 201901230200 1434.0 1434.0 1001.0 1001.0 3397.0 1434.0 1001.0 1001.0 1001.0
10027 201901230230 1001.0 1001.0 1001.0 1001.0 1434.0 1001.0 1001.0 NULL NULL
10028 201901230300 1687.0 1687.0 149.0 149.0 NULL 1687.0 149.0 149.0 149.0
10028 201901230330 1298.0 1298.0 149.0 149.0 1687.0 1298.0 149.0 149.0 149.0
10028 201901230400 149.0 149.0 149.0 149.0 1298.0 149.0 149.0 NULL NULL
10029 201901230430 2587.0 2587.0 589.0 589.0 NULL 2587.0 589.0 589.0 589.0
10029 201901230530 589.0 589.0 589.0 589.0 2587.0 589.0 589.0 NULL NULL
这两个函数的应用场景的语法使用在我的第一篇博客上已经有了,就不重复写了,这两个函数可以求连续登录的用户数
https://blog.csdn.net/lz6363/article/details/82470278
计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:
小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。
select
shop_id, stat_date, ordamt,
-- 统计小于等于当前销售额的订单占总订单的比例
cume_dist() over(order by ordamt) as cume_dist1,
-- 统计大于等于当前销售额的订单占总订单的比例
cume_dist() over(order by ordamt desc) as cume_dist2,
-- 统计分区内小于等于当前销售额的订单占总订单的比例
round(cume_dist() over(partition by shop_id order by ordamt), 2) as cume_dist3
from db_window.rt_data;
shop_id stat_date ordamt cume_dist1 cume_dist2 cume_dist3
10026 201901230400 479.0 0.1 0.95 0.1
10026 201901230330 598.0 0.2 0.85 0.2
10026 201901230530 799.0 0.25 0.8 0.3
10026 201901230300 1188.0 0.35 0.7 0.4
10026 201901230200 1498.0 0.5 0.55 0.5
10026 201901230430 1587.0 0.55 0.5 0.6
10026 201901230230 1997.0 0.7 0.35 0.7
10026 201901230130 2396.0 0.8 0.25 0.8
10026 201901230030 5170.0 0.95 0.1 0.9
10026 201901230100 5669.0 1.0 0.05 1.0
10027 201901230230 1001.0 0.3 0.75 0.2
10027 201901230200 1434.0 0.45 0.6 0.4
10027 201901230100 1623.0 0.6 0.45 0.6
10027 201901230030 2170.0 0.75 0.3 0.8
10027 201901230130 3397.0 0.9 0.15 1.0
10028 201901230400 149.0 0.05 1.0 0.33
10028 201901230330 1298.0 0.4 0.65 0.67
10028 201901230300 1687.0 0.65 0.4 1.0
10029 201901230530 589.0 0.15 0.9 0.5
10029 201901230430 2587.0 0.85 0.2 1.0
rank 开窗函数基于 over 子句中的 order by 确定一组值中一个值的排名。
如果存在partition by ,则为每个分区组中的每个值排名。排名可能不是连续的。例如,如果两个行的排名为 1,则下一个排名为 3。
select
shop_id, stat_date, ordamt,
-- 以按shop_id排序
rank() over(order by shop_id) as rank_amt1,
-- 以按shop_id分区、按stat_date排序
rank() over(partition by shop_id order by stat_date) as rank_amt2
from db_window.rt_data;
OK
shop_id stat_date ordamt rank_amt1 rank_amt2
10026 201901230030 5170.0 1 1
10026 201901230100 5669.0 1 2
10026 201901230130 2396.0 1 3
10026 201901230200 1498.0 1 4
10026 201901230230 1997.0 1 5
10026 201901230300 1188.0 1 6
10026 201901230330 598.0 1 7
10026 201901230400 479.0 1 8
10026 201901230430 1587.0 1 9
10026 201901230530 799.0 1 10
10027 201901230030 2170.0 11 1
10027 201901230100 1623.0 11 2
10027 201901230130 3397.0 11 3
10027 201901230200 1434.0 11 4
10027 201901230230 1001.0 11 5
10028 201901230300 1687.0 16 1
10028 201901230330 1298.0 16 2
10028 201901230400 149.0 16 3
10029 201901230430 2587.0 19 1
10029 201901230530 589.0 19 2
dense_rank与rank有一点不同,当排名一样的时候,接下来的行是连续的。如两个行的排名为 1,则下一个排名为 2。
select
shop_id, stat_date, ordamt,
-- 以按shop_id排序
dense_rank() over(order by shop_id) as dense_amt1,
-- 以按shop_id分区、按stat_date排序
dense_rank() over(partition by shop_id order by stat_date) as dense_amt2
from db_window.rt_data;
OK
shop_id stat_date ordamt dense_amt1 dense_amt2
10026 201901230030 5170.0 1 1
10026 201901230100 5669.0 1 2
10026 201901230130 2396.0 1 3
10026 201901230200 1498.0 1 4
10026 201901230230 1997.0 1 5
10026 201901230300 1188.0 1 6
10026 201901230330 598.0 1 7
10026 201901230400 479.0 1 8
10026 201901230430 1587.0 1 9
10026 201901230530 799.0 1 10
10027 201901230030 2170.0 2 1
10027 201901230100 1623.0 2 2
10027 201901230130 3397.0 2 3
10027 201901230200 1434.0 2 4
10027 201901230230 1001.0 2 5
10028 201901230300 1687.0 3 1
10028 201901230330 1298.0 3 2
10028 201901230400 149.0 3 3
10029 201901230430 2587.0 4 1
10029 201901230530 589.0 4 2
将分区中已排序的行划分为大小尽可能相等的指定数量的排名的组,并返回给定行所在的组的排名。
select
shop_id, stat_date, ordamt,
-- 以按shop_id分成两组、按stat_date排序
ntile(2) over(partition by shop_id order by stat_date) as ntile_amt1,
-- 以按shop_id分成三组、按stat_date排序
ntile(3) over(partition by shop_id order by stat_date) as ntile_amt2
from db_window.rt_data;
OK
shop_id stat_date ordamt ntile_amt1 ntile_amt2
10026 201901230030 5170.0 1 1
10026 201901230100 5669.0 1 1
10026 201901230130 2396.0 1 1
10026 201901230200 1498.0 1 1
10026 201901230230 1997.0 1 2
10026 201901230300 1188.0 2 2
10026 201901230330 598.0 2 2
10026 201901230400 479.0 2 3
10026 201901230430 1587.0 2 3
10026 201901230530 799.0 2 3
10027 201901230030 2170.0 1 1
10027 201901230100 1623.0 1 1
10027 201901230130 3397.0 1 2
10027 201901230200 1434.0 2 2
10027 201901230230 1001.0 2 3
10028 201901230300 1687.0 1 1
10028 201901230330 1298.0 1 2
10028 201901230400 149.0 2 3
10029 201901230430 2587.0 1 1
10029 201901230530 589.0 2 2
从1开始对分区内的数据排序。
select
shop_id, stat_date, ordamt,
-- 以按shop_id分区、按stat_date排序
row_number() over(partition by shop_id order by stat_date) as row_amt
from db_window.rt_data;
OK
shop_id stat_date ordamt row_amt
10026 201901230030 5170.0 1
10026 201901230100 5669.0 2
10026 201901230130 2396.0 3
10026 201901230200 1498.0 4
10026 201901230230 1997.0 5
10026 201901230300 1188.0 6
10026 201901230330 598.0 7
10026 201901230400 479.0 8
10026 201901230430 1587.0 9
10026 201901230530 799.0 10
10027 201901230030 2170.0 1
10027 201901230100 1623.0 2
10027 201901230130 3397.0 3
10027 201901230200 1434.0 4
10027 201901230230 1001.0 5
10028 201901230300 1687.0 1
10028 201901230330 1298.0 2
10028 201901230400 149.0 3
10029 201901230430 2587.0 1
10029 201901230530 589.0 2
计算给定行的百分比排名。可以用来计算超过了百分之多少的人。如360小助手开机速度超过了百分之多少的人。
(当前行的rank值-1)/(分组内的总行数-1)
select
shop_id, stat_date, ordamt,
-- 以按shop_id分区、按stat_date排序
row_number() over(partition by shop_id order by stat_date) as row_amt,
round(percent_rank() over(partition by shop_id order by stat_date), 2) as percent_amt
from db_window.rt_data;
OK
shop_id stat_date ordamt row_amt percent_amt
10026 201901230030 5170.0 1 0.0
10026 201901230100 5669.0 2 0.11
10026 201901230130 2396.0 3 0.22
10026 201901230200 1498.0 4 0.33
10026 201901230230 1997.0 5 0.44
10026 201901230300 1188.0 6 0.56
10026 201901230330 598.0 7 0.67
10026 201901230400 479.0 8 0.78
10026 201901230430 1587.0 9 0.89
10026 201901230530 799.0 10 1.0
10027 201901230030 2170.0 1 0.0
10027 201901230100 1623.0 2 0.25
10027 201901230130 3397.0 3 0.5
10027 201901230200 1434.0 4 0.75
10027 201901230230 1001.0 5 1.0
10028 201901230300 1687.0 1 0.0
10028 201901230330 1298.0 2 0.5
10028 201901230400 149.0 3 1.0
10029 201901230430 2587.0 1 0.0
10029 201901230530 589.0 2 1.0