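Hive Analytic Window Functions: SUM, AVG, MIN, and MAX

Hive provides many analytic functions for complex statistical analysis.

This article starts with four of them: SUM, AVG, MIN, and MAX.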


Environment:

Hive version: apache-hive-0.14.0-bin

Hadoop version: hadoop-2.6.0

Tez version: tez-0.7.0


Sample data:

P088888888888,2016-02-10,1
P088888888888,2016-02-11,3
P088888888888,2016-02-12,1
P088888888888,2016-02-13,9
P088888888888,2016-02-14,3
P088888888888,2016-02-15,12
P088888888888,2016-02-16,3

Create the table:

hive (hiveinaction)> create table windows_func
                   > (
                   >     polno string,
                   >     createtime string,
                   >     pnum int
                   > )
                   > ROW FORMAT DELIMITED
                   > FIELDS TERMINATED BY ','
                   > stored as textfile;

Load the data into the table:

load data local inpath '/home/hadoop/testhivedata/windows_func.txt' into table windows_func;

Test query:

SELECT polno,
       createtime,
       pnum,
       SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default: from the start of the partition to the current row
       SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- from the start of the partition to the current row
       SUM(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
       SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row plus the 3 preceding rows
       SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row plus 3 preceding rows and 1 following row
       SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row plus all following rows
FROM windows_func;

Result:

polno          createtime  pnum  pnum1  pnum2  pnum3  pnum4  pnum5  pnum6
P088888888888  2016-02-10     1      1      1     32      1      4     32
P088888888888  2016-02-11     3      4      4     32      4      5     31
P088888888888  2016-02-12     1      5      5     32      5     14     28
P088888888888  2016-02-13     9     14     14     32     14     17     27
P088888888888  2016-02-14     3     17     17     32     16     28     18
P088888888888  2016-02-15    12     29     29     32     25     28     15
P088888888888  2016-02-16     3     32     32     32     27     27      3
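
The frame arithmetic behind these columns can be checked outside Hive. The following is a minimal Python sketch, not Hive code: rows_frame_sum is a hypothetical helper that treats a ROWS frame as a slice over one partition already sorted by createtime, with None standing in for UNBOUNDED and 0 for CURRENT ROW. pnum1 is omitted because, for this data, it equals pnum2.

```python
# Illustrative only: emulates ROWS frames in plain Python, not Hive itself.
pnum = [1, 3, 1, 9, 3, 12, 3]  # one partition, already ordered by createtime


def rows_frame_sum(values, preceding, following):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means CURRENT ROW.
    """
    result = []
    last = len(values) - 1
    for i in range(len(values)):
        # Clamp the frame to the partition boundaries.
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        result.append(sum(values[start:end + 1]))
    return result


pnum2 = rows_frame_sum(pnum, None, 0)     # UNBOUNDED PRECEDING .. CURRENT ROW
pnum3 = rows_frame_sum(pnum, None, None)  # whole partition
pnum4 = rows_frame_sum(pnum, 3, 0)        # 3 PRECEDING .. CURRENT ROW
pnum5 = rows_frame_sum(pnum, 3, 1)        # 3 PRECEDING .. 1 FOLLOWING
pnum6 = rows_frame_sum(pnum, 0, None)     # CURRENT ROW .. UNBOUNDED FOLLOWING

for name, column in [("pnum2", pnum2), ("pnum3", pnum3), ("pnum4", pnum4),
                     ("pnum5", pnum5), ("pnum6", pnum6)]:
    print(name, column)
```

Note how the frame is simply cut off at the partition edges: the first row of pnum5 covers only two rows (1 + 3 = 4), because there is nothing before it.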

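Notes:

1.   If ORDER BY is given without ROWS BETWEEN, the window defaults to everything from the start of the partition through the current row. Strictly speaking, Hive's default in this case is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives the same result as the ROWS form whenever the ORDER BY values are unique, as createtime is in this data.

2.   If ORDER BY is omitted, all values in the partition are aggregated.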
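
The difference between the default RANGE frame and an explicit ROWS frame only shows up when several rows share the same ORDER BY value. The sketch below illustrates this in plain Python under stated assumptions: the rows are hypothetical (they are not the article's data, and one createtime is deliberately duplicated), and both helper functions are illustrative names, not Hive APIs.

```python
# Illustrative only: hypothetical rows with a duplicated createtime.
rows = [
    ("2016-02-10", 1),
    ("2016-02-11", 3),
    ("2016-02-11", 1),  # same createtime as the previous row (a "peer")
    ("2016-02-12", 9),
]


def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: counts physical rows."""
    total, out = 0, []
    for _, value in rows:
        total += value
        out.append(total)
    return out


def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: includes all peers."""
    return [
        sum(value for key, value in rows if key <= current_key)
        for current_key, _ in rows
    ]


print(running_sum_rows(rows))   # [1, 4, 5, 14]
print(running_sum_range(rows))  # [1, 5, 5, 14]
```

With RANGE, both rows dated 2016-02-11 see the same running total, because each one's frame includes its peer. With ROWS, the totals differ and depend on the order in which the tied rows happen to be processed.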

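The ROWS BETWEEN clause, also called the WINDOW clause, uses these keywords:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit. UNBOUNDED PRECEDING means from the first row of the partition, and UNBOUNDED FOLLOWING means through the last row of the partition

AVG, MIN, and MAX are used in the same way as SUM.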

 

AVG example:

SELECT polno,
       createtime,
       pnum,
       AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default: from the start of the partition to the current row
       AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- from the start of the partition to the current row
       AVG(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
       AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row plus the 3 preceding rows
       AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row plus 3 preceding rows and 1 following row
       AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row plus all following rows
FROM windows_func;

Result:

polno          createtime  pnum  pnum1    pnum2   pnum3       pnum4     pnum5     pnum6
P088888888888  2016-02-10  1     1        1       4.57142857  1         2         4.5714286
P088888888888  2016-02-11  3     2        2       4.57142857  2         1.666667  5.1666667
P088888888888  2016-02-12  1     1.66667  1.6667  4.57142857  1.666667  3.5       5.6
P088888888888  2016-02-13  9     3.5      3.5     4.57142857  3.5       3.4       6.75
P088888888888  2016-02-14  3     3.4      3.4     4.57142857  4         5.6       6
P088888888888  2016-02-15  12    4.83333  4.8333  4.57142857  6.25      5.6       7.5
P088888888888  2016-02-16  3     4.57143  4.5714  4.57142857  6.75      6.75      3
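
One detail worth seeing in these AVG columns is that the divisor is the number of rows actually inside the frame, not the nominal frame width. The Python sketch below (an emulation under the same assumptions as the data above, with a hypothetical helper rows_frame_avg, not Hive code) recomputes pnum4 and pnum5 to show this.

```python
# Illustrative only: emulates AVG over ROWS frames in plain Python.
pnum = [1, 3, 1, 9, 3, 12, 3]  # one partition, already ordered by createtime


def rows_frame_avg(values, preceding, following):
    """Average over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means CURRENT ROW.
    """
    out = []
    last = len(values) - 1
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        frame = values[start:end + 1]
        # Divide by the rows actually present in the frame.
        out.append(sum(frame) / len(frame))
    return out


pnum4 = [round(x, 4) for x in rows_frame_avg(pnum, 3, 0)]
pnum5 = [round(x, 4) for x in rows_frame_avg(pnum, 3, 1)]
print(pnum4)
print(pnum5)
```

For example, the first row of pnum5 has no preceding rows, so its frame holds only two values and the average is (1 + 3) / 2 = 2, not 4 / 5.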

 

MIN and MAX follow the same pattern, so their Hive queries and results are not listed here.
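
For readers who want to see what MIN and MAX would do with the same frames, here is a small Python emulation. It is a sketch, not Hive output: rows_frame_apply is a hypothetical helper, and the values it prints are computed by this script from the sample pnum column rather than taken from a Hive run.

```python
# Illustrative only: applies any aggregate over ROWS frames in plain Python.
pnum = [1, 3, 1, 9, 3, 12, 3]  # one partition, already ordered by createtime


def rows_frame_apply(values, preceding, following, agg):
    """Apply agg over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED; 0 means CURRENT ROW.
    """
    out = []
    last = len(values) - 1
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = last if following is None else min(last, i + following)
        out.append(agg(values[start:end + 1]))
    return out


max_last4 = rows_frame_apply(pnum, 3, 0, max)    # MAX, 3 PRECEDING .. CURRENT ROW
min_last4 = rows_frame_apply(pnum, 3, 0, min)    # MIN, 3 PRECEDING .. CURRENT ROW
min_rest = rows_frame_apply(pnum, 0, None, min)  # MIN, CURRENT ROW .. UNBOUNDED FOLLOWING
print(max_last4)
print(min_last4)
print(min_rest)
```

Passing the aggregate as a parameter mirrors the point made above: only the function name changes between SUM, AVG, MIN, and MAX, while the frame definition stays the same.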

 
