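In statistical analysis with Hive, window functions are both commonly used and important.
Today I'll go over the window functions sum, avg, min, and max in Hive; other common ones will follow later.
First, create a mock call-record table with three fields: caller number, call start time, and call duration (in minutes).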
> create table `call_test` (
`pone_number` string,
`createtime` string, --day
`call_minute` int
);
OK
Time taken: 0.369 seconds
Check the table structure:
> desc call_test;
OK
pone_number string
createtime string
call_minute int
Time taken: 0.149 seconds, Fetched: 3 row(s)
Insert the mock data:
insert into call_test values('18600000000', '2018-12-10 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-11 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-12 13:00:00', 8);
insert into call_test values('18600000000', '2018-12-13 13:00:00', 4);
insert into call_test values('18600000000', '2018-12-14 13:00:00', 7);
insert into call_test values('18600000000', '2018-12-15 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-16 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-17 13:00:00', 8);
insert into call_test values('18600000000', '2018-12-18 13:00:00', 2);
insert into call_test values('18600000000', '2018-12-19 13:00:00', 4);
insert into call_test values('18600000000', '2018-12-20 13:00:00', 7);
insert into call_test values('18600000000', '2018-12-21 13:00:00', 1);
insert into call_test values('18600000000', '2018-12-22 13:00:00', 6);
insert into call_test values('18600000000', '2018-12-23 13:00:00', 8);
insert into call_test values('15600000000', '2018-12-10 13:00:00', 2);
insert into call_test values('15600000000', '2018-12-11 13:00:00', 4);
insert into call_test values('15600000000', '2018-12-12 13:00:00', 7);
insert into call_test values('15600000000', '2018-12-13 13:00:00', 1);
insert into call_test values('15600000000', '2018-12-14 13:00:00', 6);
insert into call_test values('15600000000', '2018-12-15 13:00:00', 8);
insert into call_test values('15600000000', '2018-12-16 13:00:00', 2);
insert into call_test values('15600000000', '2018-12-17 13:00:00', 4);
insert into call_test values('15600000000', '2018-12-18 13:00:00', 7);
SUM (note: the result depends on ORDER BY, which sorts in ascending order by default)
> select pone_number,
createtime,
call_minute,
sum(call_minute) OVER(partition by pone_number order by createtime) as call_minute1, -- default frame with ORDER BY: partition start through current row
sum(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row) as call_minute2, -- partition start through current row; same result as call_minute1
sum(call_minute) OVER(partition by pone_number) as call_minute3, -- all rows in the partition
sum(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row) as call_minute4, -- current row + 3 preceding rows
sum(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following) as call_minute5, -- current row + 3 preceding rows + 1 following row
sum(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following) as call_minute6 -- current row + all following rows
FROM call_test;
Query ID = hdfs_20181211000153_8870b5b2-ecaf-46aa-90f2-49a73e9e4ddf
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 0.66 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 20 | 20 | 41 | 18 | 26 | 27 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 14 | 14 | 41 | 14 | 20 | 28 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 13 | 13 | 41 | 13 | 14 | 35 |
| 15600000000 | 2018-12-11 13:00:00 | 4 | 6 | 6 | 41 | 6 | 13 | 39 |
| 15600000000 | 2018-12-10 13:00:00 | 2 | 2 | 2 | 41 | 2 | 6 | 41 |
| 15600000000 | 2018-12-18 13:00:00 | 7 | 41 | 41 | 41 | 21 | 21 | 7 |
| 15600000000 | 2018-12-17 13:00:00 | 4 | 34 | 34 | 41 | 20 | 27 | 11 |
| 15600000000 | 2018-12-16 13:00:00 | 2 | 30 | 30 | 41 | 17 | 21 | 13 |
| 15600000000 | 2018-12-15 13:00:00 | 8 | 28 | 28 | 41 | 22 | 24 | 21 |
| 18600000000 | 2018-12-23 13:00:00 | 8 | 69 | 69 | 69 | 22 | 22 | 8 |
| 18600000000 | 2018-12-22 13:00:00 | 6 | 61 | 61 | 69 | 18 | 26 | 14 |
| 18600000000 | 2018-12-21 13:00:00 | 1 | 55 | 55 | 69 | 14 | 20 | 15 |
| 18600000000 | 2018-12-20 13:00:00 | 7 | 54 | 54 | 69 | 21 | 22 | 22 |
| 18600000000 | 2018-12-19 13:00:00 | 4 | 47 | 47 | 69 | 20 | 27 | 26 |
| 18600000000 | 2018-12-18 13:00:00 | 2 | 43 | 43 | 69 | 17 | 21 | 28 |
| 18600000000 | 2018-12-17 13:00:00 | 8 | 41 | 41 | 69 | 22 | 24 | 36 |
| 18600000000 | 2018-12-16 13:00:00 | 6 | 33 | 33 | 69 | 18 | 26 | 42 |
| 18600000000 | 2018-12-15 13:00:00 | 1 | 27 | 27 | 69 | 20 | 26 | 43 |
| 18600000000 | 2018-12-14 13:00:00 | 7 | 26 | 26 | 69 | 25 | 26 | 50 |
| 18600000000 | 2018-12-13 13:00:00 | 4 | 19 | 19 | 69 | 19 | 26 | 54 |
| 18600000000 | 2018-12-11 13:00:00 | 6 | 7 | 7 | 69 | 7 | 15 | 68 |
| 18600000000 | 2018-12-10 13:00:00 | 1 | 1 | 1 | 69 | 1 | 7 | 69 |
| 18600000000 | 2018-12-12 13:00:00 | 8 | 15 | 15 | 69 | 15 | 19 | 62 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
Time taken: 1.14 seconds, Fetched: 23 row(s)
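Explanation (the result rows above come back in no particular order because the query has no final ORDER BY; the frames below follow createtime order within each pone_number):
call_minute1: running total of call_minute from the start of the partition through the current row. For example, call_minute1 for the 11th = call_minute of the 10th + call_minute of the 11th, and the 12th = 10th + 11th + 12th
call_minute2: same as call_minute1
call_minute3: the sum of all call_minute values in the partition (partition by pone_number), so it is the same on every row of that partition
call_minute4: current row + 3 preceding rows within the partition. For example, 11th = 10th + 11th, 12th = 10th + 11th + 12th, 13th = 10th + 11th + 12th + 13th, 14th = 11th + 12th + 13th + 14th
call_minute5: current row + 3 preceding rows + 1 following row within the partition. For example, 14th = 11th + 12th + 13th + 14th + 15th
call_minute6: current row + all following rows within the partition. For example, for 15600000000, whose data ends on the 18th, 16th = 16th + 17th + 18th and 17th = 17th + 18th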
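The per-column frames described above can be checked outside Hive. The following is a minimal sketch in plain Python, not Hive code: `window_sum` is a hypothetical helper written for this illustration, and the list holds the call_minute values of 15600000000 in createtime order as inserted earlier.

```python
# call_minute for 15600000000, ordered by createtime (2018-12-10 .. 2018-12-18)
minutes = [2, 4, 7, 1, 6, 8, 2, 4, 7]


def window_sum(values, i, preceding, following):
    """Sum the rows in the frame around row i.

    preceding/following are row counts; None stands for 'unbounded'.
    A frame that would run past the partition edge is simply clipped.
    """
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(values) if following is None else min(len(values), i + following + 1)
    return sum(values[start:end])


rows = []
for i in range(len(minutes)):
    rows.append((
        window_sum(minutes, i, None, 0),     # call_minute1 / call_minute2
        window_sum(minutes, i, None, None),  # call_minute3
        window_sum(minutes, i, 3, 0),        # call_minute4
        window_sum(minutes, i, 3, 1),        # call_minute5
        window_sum(minutes, i, 0, None),     # call_minute6
    ))

for day, row in zip(range(10, 19), rows):
    print(f"2018-12-{day}", row)
# The line for 2018-12-14 prints (20, 41, 18, 26, 27),
# matching the 15600000000 / 2018-12-14 row of the SUM result.
```

Reading the printed rows in date order against the SUM table is a quick way to confirm which physical rows each frame covers.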
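If ORDER BY is given but no rows between clause, the frame defaults to partition start through current row. Strictly speaking, Hive's default is range between unbounded preceding and current row, which also includes any rows tied with the current row on the ORDER BY key; createtime has no ties within a partition here, so the result matches the rows version.
If ORDER BY is omitted, all values in the partition are summed.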
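The two defaults can be modeled in a few lines. This is a sketch in plain Python under stated assumptions: the `calls` data is hypothetical (two calls share one createtime, which never happens in call_test), and the function names are made up for the illustration. It shows where the default RANGE frame and an explicit ROWS frame would differ.

```python
def default_frame_sum(rows, i, ordered=True):
    """rows: (order_key, value) pairs of one partition, sorted by order_key."""
    if not ordered:
        # No ORDER BY: the frame is the whole partition.
        return sum(value for _, value in rows)
    # ORDER BY without a frame clause: RANGE from partition start to current row,
    # which includes every row whose key is <= the current row's key (ties too).
    current_key = rows[i][0]
    return sum(value for key, value in rows if key <= current_key)


def rows_frame_sum(rows, i):
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: physical rows only.
    # With tied keys a real engine may order the ties arbitrarily;
    # this sketch just uses list order.
    return sum(value for _, value in rows[: i + 1])


# Hypothetical partition with a tie on 2018-12-11
calls = [("2018-12-10", 1), ("2018-12-11", 6), ("2018-12-11", 8), ("2018-12-12", 4)]

range_sums = [default_frame_sum(calls, i) for i in range(len(calls))]
rows_sums = [rows_frame_sum(calls, i) for i in range(len(calls))]
partition_sums = [default_frame_sum(calls, i, ordered=False) for i in range(len(calls))]

print(range_sums)      # [1, 15, 15, 19]
print(rows_sums)       # [1, 7, 15, 19]
print(partition_sums)  # [19, 19, 19, 19]
```

Only the tied rows differ between the first two lists, which is why call_minute1 and call_minute2 agree on the call_test data.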
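The key is understanding what rows between means; it is also called the window clause:
preceding: rows before the current row
following: rows after the current row
current row: the current row
unbounded: no limit; unbounded preceding means starting from the first row of the partition, and unbounded following means extending to the last row of the partition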
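"Before" and "after" in these bounds are relative to the ORDER BY order, which is why the result depends on the sort direction. A small sketch in plain Python (not Hive code), using the 15600000000 values from the table above, suggests that a running total under descending order gives the same numbers as `current row and unbounded following` under ascending order:

```python
from itertools import accumulate

# call_minute for 15600000000 in ascending createtime order (2018-12-10 .. 2018-12-18)
minutes_asc = [2, 4, 7, 1, 6, 8, 2, 4, 7]

# ORDER BY createtime ASC, unbounded preceding .. current row
running_asc = list(accumulate(minutes_asc))

# ORDER BY createtime DESC, unbounded preceding .. current row:
# accumulate from the latest date backwards, then map back to ascending dates
running_desc = list(accumulate(reversed(minutes_asc)))[::-1]

# ORDER BY createtime ASC, current row .. unbounded following
suffix_asc = [sum(minutes_asc[i:]) for i in range(len(minutes_asc))]

print(running_asc)   # [2, 6, 13, 14, 20, 28, 30, 34, 41]  (call_minute1)
print(running_desc)  # [41, 39, 35, 28, 27, 21, 13, 11, 7]
print(suffix_asc)    # [41, 39, 35, 28, 27, 21, 13, 11, 7] (call_minute6)
```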
avg, min, and max are used in the same way as sum.
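To illustrate that only the aggregate changes while the frame stays the same, here is a sketch in plain Python (not Hive code) that applies sum, avg, min, and max to the `3 preceding and current row` frame. The helper names are hypothetical; the values are those of 18600000000 in createtime order.

```python
# call_minute for 18600000000, ordered by createtime (2018-12-10 .. 2018-12-23)
minutes = [1, 6, 8, 4, 7, 1, 6, 8, 2, 4, 7, 1, 6, 8]


def frame(values, i, preceding=3, following=0):
    """Rows covered by 'rows between <preceding> preceding and <following> following'."""
    return values[max(0, i - preceding): i + following + 1]


def window_stats(values, i):
    """Apply the four aggregates to the same frame (the call_minute4 column)."""
    f = frame(values, i)
    return {
        "sum": sum(f),
        "avg": round(sum(f) / len(f), 2),
        "min": min(f),
        "max": max(f),
    }


stats_14th = window_stats(minutes, 4)  # index 4 is 2018-12-14
print(stats_14th)  # {'sum': 25, 'avg': 6.25, 'min': 4, 'max': 8}
```

These four numbers correspond to the call_minute4 column of the 18600000000 / 2018-12-14 row in the SUM, AVG, MIN, and MAX results.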
AVG
> select pone_number,
createtime,
call_minute,
round(avg(call_minute) OVER(partition by pone_number order by createtime), 2) as call_minute1, -- default frame with ORDER BY: partition start through current row
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row), 2) as call_minute2, -- partition start through current row; same result as call_minute1
round(avg(call_minute) OVER(partition by pone_number), 2) as call_minute3, -- all rows in the partition
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row), 2) as call_minute4, -- current row + 3 preceding rows
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following), 2) as call_minute5, -- current row + 3 preceding rows + 1 following row
round(avg(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following), 2) as call_minute6 -- current row + all following rows
FROM call_test;
Query ID = hdfs_20181211000203_53ab6fb6-628c-4ac8-81aa-244c73b701f0
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 4.04 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 4.0 | 4.0 | 4.56 | 4.5 | 5.2 | 5.4 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 3.5 | 3.5 | 4.56 | 3.5 | 4.0 | 4.67 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 4.33 | 4.33 | 4.56 | 4.33 | 3.5 | 5.0 |
| 15600000000 | 2018-12-11 13:00:00 | 4 | 3.0 | 3.0 | 4.56 | 3.0 | 4.33 | 4.88 |
| 15600000000 | 2018-12-10 13:00:00 | 2 | 2.0 | 2.0 | 4.56 | 2.0 | 3.0 | 4.56 |
| 15600000000 | 2018-12-18 13:00:00 | 7 | 4.56 | 4.56 | 4.56 | 5.25 | 5.25 | 7.0 |
| 15600000000 | 2018-12-17 13:00:00 | 4 | 4.25 | 4.25 | 4.56 | 5.0 | 5.4 | 5.5 |
| 15600000000 | 2018-12-16 13:00:00 | 2 | 4.29 | 4.29 | 4.56 | 4.25 | 4.2 | 4.33 |
| 15600000000 | 2018-12-15 13:00:00 | 8 | 4.67 | 4.67 | 4.56 | 5.5 | 4.8 | 5.25 |
| 18600000000 | 2018-12-23 13:00:00 | 8 | 4.93 | 4.93 | 4.93 | 5.5 | 5.5 | 8.0 |
| 18600000000 | 2018-12-22 13:00:00 | 6 | 4.69 | 4.69 | 4.93 | 4.5 | 5.2 | 7.0 |
| 18600000000 | 2018-12-21 13:00:00 | 1 | 4.58 | 4.58 | 4.93 | 3.5 | 4.0 | 5.0 |
| 18600000000 | 2018-12-20 13:00:00 | 7 | 4.91 | 4.91 | 4.93 | 5.25 | 4.4 | 5.5 |
| 18600000000 | 2018-12-19 13:00:00 | 4 | 4.7 | 4.7 | 4.93 | 5.0 | 5.4 | 5.2 |
| 18600000000 | 2018-12-18 13:00:00 | 2 | 4.78 | 4.78 | 4.93 | 4.25 | 4.2 | 4.67 |
| 18600000000 | 2018-12-17 13:00:00 | 8 | 5.13 | 5.13 | 4.93 | 5.5 | 4.8 | 5.14 |
| 18600000000 | 2018-12-16 13:00:00 | 6 | 4.71 | 4.71 | 4.93 | 4.5 | 5.2 | 5.25 |
| 18600000000 | 2018-12-15 13:00:00 | 1 | 4.5 | 4.5 | 4.93 | 5.0 | 5.2 | 4.78 |
| 18600000000 | 2018-12-14 13:00:00 | 7 | 5.2 | 5.2 | 4.93 | 6.25 | 5.2 | 5.0 |
| 18600000000 | 2018-12-13 13:00:00 | 4 | 4.75 | 4.75 | 4.93 | 4.75 | 5.2 | 4.91 |
| 18600000000 | 2018-12-11 13:00:00 | 6 | 3.5 | 3.5 | 4.93 | 3.5 | 5.0 | 5.23 |
| 18600000000 | 2018-12-10 13:00:00 | 1 | 1.0 | 1.0 | 4.93 | 1.0 | 3.5 | 4.93 |
| 18600000000 | 2018-12-12 13:00:00 | 8 | 5.0 | 5.0 | 4.93 | 5.0 | 4.75 | 5.17 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
Time taken: 4.55 seconds, Fetched: 23 row(s)
MIN
> select pone_number,
createtime,
call_minute,
min(call_minute) OVER(partition by pone_number order by createtime) as call_minute1, -- default frame with ORDER BY: partition start through current row
min(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row) as call_minute2, -- partition start through current row; same result as call_minute1
min(call_minute) OVER(partition by pone_number) as call_minute3, -- all rows in the partition
min(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row) as call_minute4, -- current row + 3 preceding rows
min(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following) as call_minute5, -- current row + 3 preceding rows + 1 following row
min(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following) as call_minute6 -- current row + all following rows
FROM call_test;
Query ID = hdfs_20181211000210_2e8b0633-0e95-4ace-a964-79ed946da362
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 0.31 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 1 | 1 | 1 | 1 | 1 | 2 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 2 | 2 | 1 | 2 | 1 | 1 |
| 15600000000 | 2018-12-11 13:00:00 | 4 | 2 | 2 | 1 | 2 | 2 | 1 |
| 15600000000 | 2018-12-10 13:00:00 | 2 | 2 | 2 | 1 | 2 | 2 | 1 |
| 15600000000 | 2018-12-18 13:00:00 | 7 | 1 | 1 | 1 | 2 | 2 | 7 |
| 15600000000 | 2018-12-17 13:00:00 | 4 | 1 | 1 | 1 | 2 | 2 | 4 |
| 15600000000 | 2018-12-16 13:00:00 | 2 | 1 | 1 | 1 | 1 | 1 | 2 |
| 15600000000 | 2018-12-15 13:00:00 | 8 | 1 | 1 | 1 | 1 | 1 | 2 |
| 18600000000 | 2018-12-23 13:00:00 | 8 | 1 | 1 | 1 | 1 | 1 | 8 |
| 18600000000 | 2018-12-22 13:00:00 | 6 | 1 | 1 | 1 | 1 | 1 | 6 |
| 18600000000 | 2018-12-21 13:00:00 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-20 13:00:00 | 7 | 1 | 1 | 1 | 2 | 1 | 1 |
| 18600000000 | 2018-12-19 13:00:00 | 4 | 1 | 1 | 1 | 2 | 2 | 1 |
| 18600000000 | 2018-12-18 13:00:00 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-17 13:00:00 | 8 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-16 13:00:00 | 6 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-15 13:00:00 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-14 13:00:00 | 7 | 1 | 1 | 1 | 4 | 1 | 1 |
| 18600000000 | 2018-12-13 13:00:00 | 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-11 13:00:00 | 6 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-10 13:00:00 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 18600000000 | 2018-12-12 13:00:00 | 8 | 1 | 1 | 1 | 1 | 1 | 1 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
Time taken: 0.823 seconds, Fetched: 23 row(s)
MAX
> select pone_number,
createtime,
call_minute,
max(call_minute) OVER(partition by pone_number order by createtime) as call_minute1, -- default frame with ORDER BY: partition start through current row
max(call_minute) OVER(partition by pone_number order by createtime rows between unbounded preceding and current row) as call_minute2, -- partition start through current row; same result as call_minute1
max(call_minute) OVER(partition by pone_number) as call_minute3, -- all rows in the partition
max(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and current row) as call_minute4, -- current row + 3 preceding rows
max(call_minute) OVER(partition by pone_number order by createtime rows between 3 preceding and 1 following) as call_minute5, -- current row + 3 preceding rows + 1 following row
max(call_minute) OVER(partition by pone_number order by createtime rows between current row and unbounded following) as call_minute6 -- current row + all following rows
FROM call_test;
Query ID = hdfs_20181211000216_bdde124f-79b2-4d0f-b3c2-a8b7339a02a6
Total jobs = 1
Launching Job 1 out of 1
Status: Running (Executing on YARN cluster with App id application_1541064601030_38864)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 1 1 0 0 0 0
Reducer 2 ...... container SUCCEEDED 1 1 0 0 0 0
Reducer 3 ...... container SUCCEEDED 1 1 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME: 0.34 s
----------------------------------------------------------------------------------------------
OK
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| pone_number | createtime | call_minute | call_minute1 | call_minute2 | call_minute3 | call_minute4 | call_minute5 | call_minute6 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
| 15600000000 | 2018-12-14 13:00:00 | 6 | 7 | 7 | 8 | 7 | 8 | 8 |
| 15600000000 | 2018-12-13 13:00:00 | 1 | 7 | 7 | 8 | 7 | 7 | 8 |
| 15600000000 | 2018-12-12 13:00:00 | 7 | 7 | 7 | 8 | 7 | 7 | 8 |
| 15600000000 | 2018-12-11 13:00:00 | 4 | 4 | 4 | 8 | 4 | 7 | 8 |
| 15600000000 | 2018-12-10 13:00:00 | 2 | 2 | 2 | 8 | 2 | 4 | 8 |
| 15600000000 | 2018-12-18 13:00:00 | 7 | 8 | 8 | 8 | 8 | 8 | 7 |
| 15600000000 | 2018-12-17 13:00:00 | 4 | 8 | 8 | 8 | 8 | 8 | 7 |
| 15600000000 | 2018-12-16 13:00:00 | 2 | 8 | 8 | 8 | 8 | 8 | 7 |
| 15600000000 | 2018-12-15 13:00:00 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-23 13:00:00 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-22 13:00:00 | 6 | 8 | 8 | 8 | 7 | 8 | 8 |
| 18600000000 | 2018-12-21 13:00:00 | 1 | 8 | 8 | 8 | 7 | 7 | 8 |
| 18600000000 | 2018-12-20 13:00:00 | 7 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-19 13:00:00 | 4 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-18 13:00:00 | 2 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-17 13:00:00 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-16 13:00:00 | 6 | 8 | 8 | 8 | 7 | 8 | 8 |
| 18600000000 | 2018-12-15 13:00:00 | 1 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-14 13:00:00 | 7 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-13 13:00:00 | 4 | 8 | 8 | 8 | 8 | 8 | 8 |
| 18600000000 | 2018-12-11 13:00:00 | 6 | 6 | 6 | 8 | 6 | 8 | 8 |
| 18600000000 | 2018-12-10 13:00:00 | 1 | 1 | 1 | 8 | 1 | 6 | 8 |
| 18600000000 | 2018-12-12 13:00:00 | 8 | 8 | 8 | 8 | 8 | 8 | 8 |
+--------------+----------------------+--------------+---------------+---------------+---------------+---------------+---------------+---------------+--+
Time taken: 0.832 seconds, Fetched: 23 row(s)
I'll continue to write up and share other Hive window functions in later posts.