Hive中提供了很多的分析函数,用于完成负责的统计分析。
本文先介绍SUM、AVG、MIN、MAX这四个函数。
环境信息:
Hive版本为apache-hive-0.14.0-bin
Hadoop版本为hadoop-2.6.0
Tez版本为tez-0.7.0
构造数据:
P088888888888,2016-02-10,1
P088888888888,2016-02-11,3
P088888888888,2016-02-12,1
P088888888888,2016-02-13,9
P088888888888,2016-02-14,3
P088888888888,2016-02-15,12
P088888888888,2016-02-16,3
创建表:
hive (hiveinaction)> create table windows_func
>(
> polno string,
> createtime string,
> pnum int
>)
>ROW FORMAT DELIMITED
>FIELDS TERMINATED BY ','
>stored as textfile;
导入数据到表中:
load data local inpath'/home/hadoop/testhivedata/windows_func.txt' into table windows_func;
测试:
SELECT polno,
createtime,
pnum,
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime) AS pnum1, --默认为从起点到当前行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, --从起点到当前行
SUM(pnum) OVER(PARTITION BY polno) ASpnum3, --分组内所有行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, --当前行+往前3行(当前行的值+前面三行的值)
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, --当前行+往前3行+往后1行
SUM(pnum) OVER(PARTITION BY polno ORDERBY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 ---当前行+往后所有行
FROM windows_func;
结果:
polno |
createtime |
pnum |
pnum1 |
pnum2 |
pnum3 |
pnum4 |
pnum5 |
pnum6 |
P088888888888 |
2016/2/10 |
1 |
1 |
1 |
32 |
1 |
4 |
32 |
P088888888888 |
2016/2/11 |
3 |
4 |
4 |
32 |
4 |
5 |
31 |
P088888888888 |
2016/2/12 |
1 |
5 |
5 |
32 |
5 |
14 |
28 |
P088888888888 |
2016/2/13 |
9 |
14 |
14 |
32 |
14 |
17 |
27 |
P088888888888 |
2016/2/14 |
3 |
17 |
17 |
32 |
16 |
28 |
18 |
P088888888888 |
2016/2/15 |
12 |
29 |
29 |
32 |
25 |
28 |
15 |
P088888888888 |
2016/2/16 |
3 |
32 |
32 |
32 |
27 |
27 |
3 |
注释:
1. 如果不指定ROWS BETWEEN,默认为从起点到当前行;
2. 如果不指定ORDER BY,则将分组内所有值累加;
理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
其他AVG,MIN,MAX,和SUM用法一样。
演示AVG环境:
SELECT polno,
createtime,
pnum,
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime) AS pnum1, --默认为从起点到当前行
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) ASpnum2, --从起点到当前行
AVG(pnum) OVER(PARTITION BYpolno) AS pnum3, --分组内所有行
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) ASpnum4, --当前行+往前3行(当前行的值+前面三行的值)
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) ASpnum5, --当前行+往前3行+往后1行
AVG(pnum) OVER(PARTITION BYpolno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) ASpnum6 ---当前行+往后所有行
FROM windows_func;
结果:
polno |
createtime |
pnum |
pnum1 |
pnum2 |
pnum3 |
pnum4 |
pnum5 |
pnum6 |
P088888888888 |
2016/2/10 |
1 |
1 |
1 |
4.57142857 |
1 |
2 |
4.5714286 |
P088888888888 |
2016/2/11 |
3 |
2 |
2 |
4.57142857 |
2 |
1.666667 |
5.1666667 |
P088888888888 |
2016/2/12 |
1 |
1.66667 |
1.6667 |
4.57142857 |
1.666667 |
3.5 |
5.6 |
P088888888888 |
2016/2/13 |
9 |
3.5 |
3.5 |
4.57142857 |
3.5 |
3.4 |
6.75 |
P088888888888 |
2016/2/14 |
3 |
3.4 |
3.4 |
4.57142857 |
4 |
5.6 |
6 |
P088888888888 |
2016/2/15 |
12 |
4.83333 |
4.8333 |
4.57142857 |
6.25 |
5.6 |
7.5 |
P088888888888 |
2016/2/16 |
3 |
4.57143 |
4.5714 |
4.57142857 |
6.75 |
6.75 |
3 |
其他类似的函数就不举例了。