Big Data Study Notes 24: a small Hive SQL running-total (cumulative sum) example

Environment:
Hive

Requirement:
The input data is each user's daily traffic, with fields separated by \t.

user    date    access
panda   2017-1-01   5t
gifshow 2017-1-01   3t
yy  2017-1-01   2t
laifeng 2017-1-01   2t
panda   2017-1-02   5t
gifshow 2017-1-02   3t
yy  2017-1-02   2t
laifeng 2017-1-02   2t
panda   2017-2-01   4t
gifshow 2017-2-01   3t
yy  2017-2-01   1t
laifeng 2017-2-01   4t
panda   2017-2-02   4t
gifshow 2017-2-02   3t
yy  2017-2-02   1t
laifeng 2017-2-02   4t
panda   2017-3-01   4t
gifshow 2017-3-01   3t
yy  2017-3-01   1t
laifeng 2017-3-01   4t
panda   2017-3-02   4t
gifshow 2017-3-02   3t
yy  2017-3-02   1t
laifeng 2017-3-02   4t
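=============> Produce the table below: for each user, the total per month (acc), plus a new column holding the month-by-month running total (acc_sum)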
user    date    acc     acc_sum
gifshow 2017-1  6       6
gifshow 2017-2  6       12
gifshow 2017-3  6       18
laifeng 2017-1  4       4
laifeng 2017-2  8       12
laifeng 2017-3  8       20
panda   2017-1  10      10
panda   2017-2  8       18
panda   2017-3  8       26
yy      2017-1  4       4
yy      2017-2  2       6
yy      2017-3  2       8

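Analysis:
Because the input is a text file, every column of the Hive table shipin_origin that it is loaded into is declared as String.
The result has to be aggregated by month, so the month must be extracted from date with substr().
In this example there are two ways to implement the running total:
1) Use Hive's built-in window function sum() over(): simple to write and fast to run.
2) Use a standard SQL inner join (a self-join): slower, but it makes it easier to understand the inner join and how the running total is actually computed.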
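
To see what the two substr() calls used in this example actually extract, here is a minimal Python sketch that mimics Hive's 1-based substr(str, pos, len). The helper name hive_substr and the inputs 2017-10-01 and 12t are illustrative assumptions and do not appear in the sample data. They suggest that the fixed-width approach works here only because every month and every traffic value in the sample is a single digit.

```python
def hive_substr(s: str, pos: int, length: int) -> str:
    """Mimic Hive's substr(str, pos, len) for pos >= 1 (1-based start)."""
    return s[pos - 1 : pos - 1 + length]


# Values taken from the sample data
print(hive_substr("2017-1-01", 1, 6))   # month key used for grouping
print(hive_substr("5t", 1, 1))          # numeric part of access

# Hypothetical inputs that are NOT in the sample data
print(hive_substr("2017-10-01", 1, 6))  # a two-digit month is cut short
print(hive_substr("12t", 1, 1))         # a two-digit value is cut short
```

If real data can contain such values, a pattern-based extraction (for example with Hive's regexp_extract) would probably be safer than fixed character positions.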

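Steps:
Method 1:
1. Create the table in Hive and load the data (note: user and date are reserved keywords in newer Hive releases, where they may need to be wrapped in backticks)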

create table shipin_origin (
user String,
date String,
access String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

2. This query aggregates the raw table by month, producing an intermediate result with each user's monthly total

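-- substr(date,1,6) keeps the 'yyyy-m' prefix; substr(access,1,1) keeps the leading digit of values such as '5t'
select user, substr(date,1,6) c1, sum(cast(substr(access,1,1) as INT)) c2
from shipin_origin a group by user, substr(date,1,6);
Result: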
gifshow 2017-1  6
gifshow 2017-2  6
gifshow 2017-3  6
laifeng 2017-1  4
laifeng 2017-2  8
laifeng 2017-3  8
panda   2017-1  10
panda   2017-2  8
panda   2017-3  8
yy      2017-1  4
yy      2017-2  2
yy      2017-3  2
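
The monthly aggregation above can also be traced in plain Python. This is only a sketch of the GROUP BY logic, not of how Hive executes it. The names daily, rows and monthly are made up for illustration, and the 24 raw rows are rebuilt from the sample input (each user has the same value on day 01 and day 02 of a month).

```python
from collections import defaultdict

# Per-month daily values copied from the sample input
daily = {
    "2017-1": {"panda": "5t", "gifshow": "3t", "yy": "2t", "laifeng": "2t"},
    "2017-2": {"panda": "4t", "gifshow": "3t", "yy": "1t", "laifeng": "4t"},
    "2017-3": {"panda": "4t", "gifshow": "3t", "yy": "1t", "laifeng": "4t"},
}

# Rebuild the 24 raw rows: (user, date, access)
rows = [
    (user, f"{month}-{day}", access)
    for month, per_user in daily.items()
    for day in ("01", "02")
    for user, access in per_user.items()
]

# group by user, substr(date,1,6) with sum(cast(substr(access,1,1) as INT))
monthly = defaultdict(int)
for user, date, access in rows:
    month = date[:6]                 # same characters as substr(date, 1, 6)
    monthly[(user, month)] += int(access[:1])

for (user, month), total in sorted(monthly.items()):
    print(user, month, total)
```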

3. Wrap this intermediate query in a subquery and apply the sum() over() window function to it

select A.user ,A.c1 ,A.c2 ,sum(A.c2)over(partition by A.user order by A.c1 ) acc_sum from 
(
select  user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2 
from shipin_origin  a group by user,substr(date,1,6)
) A;

Result:
gifshow 2017-1  6       6
gifshow 2017-2  6       12
gifshow 2017-3  6       18
laifeng 2017-1  4       4
laifeng 2017-2  8       12
laifeng 2017-3  8       20
panda   2017-1  10      10
panda   2017-2  8       18
panda   2017-3  8       26
yy      2017-1  4       4
yy      2017-2  2       6
yy      2017-3  2       8
Time taken: 23.988 seconds, Fetched: 12 row(s)
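
What sum(c2) over(partition by user order by c1) computes can be sketched in Python as one pass per user that keeps a running sum. The function name running_totals is hypothetical. The sketch also assumes that each (user, month) pair appears only once, which holds here after the GROUP BY, so that the window's default frame behaves like a plain row-by-row running total.

```python
from itertools import accumulate, groupby

# Monthly totals (user, month, acc) copied from the intermediate result
monthly = [
    ("gifshow", "2017-1", 6), ("gifshow", "2017-2", 6), ("gifshow", "2017-3", 6),
    ("laifeng", "2017-1", 4), ("laifeng", "2017-2", 8), ("laifeng", "2017-3", 8),
    ("panda", "2017-1", 10), ("panda", "2017-2", 8), ("panda", "2017-3", 8),
    ("yy", "2017-1", 4), ("yy", "2017-2", 2), ("yy", "2017-3", 2),
]


def running_totals(rows):
    """Return (user, month, acc, acc_sum) with acc_sum restarting per user."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))    # partition, then order
    for _, group in groupby(ordered, key=lambda r: r[0]):  # one partition per user
        group = list(group)
        sums = accumulate(acc for _, _, acc in group)      # running sum in month order
        out.extend((u, m, acc, s) for (u, m, acc), s in zip(group, sums))
    return out


for row in running_totals(monthly):
    print(*row)
```

Each user's rows are visited once, which is in line with this method being the faster of the two.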

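Note: because this SQL uses a nested query, the subquery in the FROM clause must be given an alias (A above); otherwise Hive fails with the following error: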

FAILED: ParseException line 3:53 cannot recognize input near '' '' '' in subquery source

Method 2:
1. Create the table in Hive and load the data (same as step 1 of Method 1)

create table shipin_origin (
user String,
date String,
access String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;

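2. Again aggregate the raw table by month, but this time store the result in a real intermediate table
(The intermediate table is only there to keep the SQL short. It also works without one, but then the logic needs more care!)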

hive (default)>
create table mid_shipin ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' as 
select  user,substr(date,1,6) c1 ,sum (cast(substr(access,1,1) as INT ))  c2 
from shipin_origin  a group by user,substr(date,1,6) ; 
Result:
gifshow 2017-1  6
gifshow 2017-2  6
gifshow 2017-3  6
laifeng 2017-1  4
laifeng 2017-2  8
laifeng 2017-3  8
panda   2017-1  10
panda   2017-2  8
panda   2017-3  8
yy      2017-1  4
yy      2017-2  2
yy      2017-3  2

hive (default)> desc mid_shipin;
OK
user                    string                                      
c1                      string                                      
c2                      bigint                                      
Time taken: 0.326 seconds, Fetched: 3 row(s)

3. Query this table to get the result:

select A.user,A.c1,max(A.c2) c2 ,sum(B.c2) accumulate 
from 
(select user,c1,sum(c2) c2  from  mid_shipin group by user,c1)  A 
inner join 
(select user,c1,sum(c2)  c2  from  mid_shipin group by user,c1 ) B 
on A.user = B.user
where B.c1<=A.c1
group by A.user,A.c1
order by A.user,A.c1

====> Result:
gifshow 2017-1  6       6
gifshow 2017-2  6       12
gifshow 2017-3  6       18
laifeng 2017-1  4       4
laifeng 2017-2  8       12
laifeng 2017-3  8       20
panda   2017-1  10      10
panda   2017-2  8       18
panda   2017-3  8       26
yy      2017-1  4       4
yy      2017-2  2       6
yy      2017-3  2       8
Time taken: 114.184 seconds, Fetched: 12 row(s)

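This query ran four map/reduce job pairs plus one map-only job, so it is much slower (about 114 seconds versus about 24 seconds for Method 1)!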

Step-by-step analysis:

# To test, load the monthly table into MySQL and look at the SQL logic there
create table sp1 (
user varchar(10),
date varchar(10),
acc int
);
insert into  sp1 values ('gifshow', '2017-1',  6    );
insert into  sp1 values ('gifshow', '2017-2' , 6    );
insert into  sp1 values ('gifshow', '2017-3' , 6    );
insert into  sp1 values ('laifeng', '2017-1' , 4    );
insert into  sp1 values ('laifeng', '2017-2' , 8    );
insert into  sp1 values ('laifeng', '2017-3' , 8    );
insert into  sp1 values ('panda' ,  '2017-1' , 10   );
insert into  sp1 values ('panda' ,  '2017-2' , 8    );
insert into  sp1 values ('panda' ,  '2017-3' , 8    );
insert into  sp1 values ('yy'    ,  '2017-1' , 4    );
insert into  sp1 values ('yy'    ,  '2017-2' , 2    );
insert into  sp1 values ('yy'    ,  '2017-3' , 2    );
select * from (select user,date ,acc from sp1)  A inner join (select user,date ,acc from sp1) B on A.user = B.user ;

Note the rows marked with * at the end

+---------+--------+------+---------+--------+------+
| user    | date   | acc  | user    | date   | acc  |
+---------+--------+------+---------+--------+------+
| gifshow | 2017-1 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-1 |    6 | gifshow | 2017-2 |    6 |*
| gifshow | 2017-1 |    6 | gifshow | 2017-3 |    6 |*
| gifshow | 2017-2 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-2 |    6 | gifshow | 2017-2 |    6 |
| gifshow | 2017-2 |    6 | gifshow | 2017-3 |    6 |*
| gifshow | 2017-3 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-3 |    6 | gifshow | 2017-2 |    6 |
| gifshow | 2017-3 |    6 | gifshow | 2017-3 |    6 |
| laifeng | 2017-1 |    4 | laifeng | 2017-1 |    4 |
| laifeng | 2017-1 |    4 | laifeng | 2017-2 |    8 |*
| laifeng | 2017-1 |    4 | laifeng | 2017-3 |    8 |*
| laifeng | 2017-2 |    8 | laifeng | 2017-1 |    4 |
| laifeng | 2017-2 |    8 | laifeng | 2017-2 |    8 |
| laifeng | 2017-2 |    8 | laifeng | 2017-3 |    8 |*
| laifeng | 2017-3 |    8 | laifeng | 2017-1 |    4 |
| laifeng | 2017-3 |    8 | laifeng | 2017-2 |    8 |
| laifeng | 2017-3 |    8 | laifeng | 2017-3 |    8 |
| panda   | 2017-1 |   10 | panda   | 2017-1 |   10 |
| panda   | 2017-1 |   10 | panda   | 2017-2 |    8 |*
| panda   | 2017-1 |   10 | panda   | 2017-3 |    8 |*
| panda   | 2017-2 |    8 | panda   | 2017-1 |   10 |
| panda   | 2017-2 |    8 | panda   | 2017-2 |    8 |
| panda   | 2017-2 |    8 | panda   | 2017-3 |    8 |*
| panda   | 2017-3 |    8 | panda   | 2017-1 |   10 |
| panda   | 2017-3 |    8 | panda   | 2017-2 |    8 |
| panda   | 2017-3 |    8 | panda   | 2017-3 |    8 |
| yy      | 2017-1 |    4 | yy      | 2017-3 |    2 |*
| yy      | 2017-1 |    4 | yy      | 2017-1 |    4 |
| yy      | 2017-1 |    4 | yy      | 2017-2 |    2 |*
| yy      | 2017-2 |    2 | yy      | 2017-3 |    2 |*
| yy      | 2017-2 |    2 | yy      | 2017-1 |    4 |
| yy      | 2017-2 |    2 | yy      | 2017-2 |    2 |
| yy      | 2017-3 |    2 | yy      | 2017-3 |    2 |
| yy      | 2017-3 |    2 | yy      | 2017-1 |    4 |
| yy      | 2017-3 |    2 | yy      | 2017-2 |    2 |
+---------+--------+------+---------+--------+------+
36 rows in set (0.00 sec)
mysql> select * from (select user,date ,acc from sp1)  A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date ;

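The rows marked with * above are gone: after the inner join, the condition B.date <= A.date filters out the rows that must not be added to the running total!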

+---------+--------+------+---------+--------+------+
| user    | date   | acc  | user    | date   | acc  |
+---------+--------+------+---------+--------+------+
| gifshow | 2017-1 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-2 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-2 |    6 | gifshow | 2017-2 |    6 |
| gifshow | 2017-3 |    6 | gifshow | 2017-1 |    6 |
| gifshow | 2017-3 |    6 | gifshow | 2017-2 |    6 |
| gifshow | 2017-3 |    6 | gifshow | 2017-3 |    6 |
| laifeng | 2017-1 |    4 | laifeng | 2017-1 |    4 |
| laifeng | 2017-2 |    8 | laifeng | 2017-1 |    4 |
| laifeng | 2017-2 |    8 | laifeng | 2017-2 |    8 |
| laifeng | 2017-3 |    8 | laifeng | 2017-1 |    4 |
| laifeng | 2017-3 |    8 | laifeng | 2017-2 |    8 |
| laifeng | 2017-3 |    8 | laifeng | 2017-3 |    8 |
| panda   | 2017-1 |   10 | panda   | 2017-1 |   10 |
| panda   | 2017-2 |    8 | panda   | 2017-1 |   10 |
| panda   | 2017-2 |    8 | panda   | 2017-2 |    8 |
| panda   | 2017-3 |    8 | panda   | 2017-1 |   10 |
| panda   | 2017-3 |    8 | panda   | 2017-2 |    8 |
| panda   | 2017-3 |    8 | panda   | 2017-3 |    8 |
| yy      | 2017-1 |    4 | yy      | 2017-1 |    4 |
| yy      | 2017-2 |    2 | yy      | 2017-1 |    4 |
| yy      | 2017-2 |    2 | yy      | 2017-2 |    2 |
| yy      | 2017-3 |    2 | yy      | 2017-1 |    4 |
| yy      | 2017-3 |    2 | yy      | 2017-2 |    2 |
| yy      | 2017-3 |    2 | yy      | 2017-3 |    2 |
+---------+--------+------+---------+--------+------+
24 rows in set (0.00 sec)
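
The two row counts above (36 after the join, 24 after the filter) can be reproduced with a small Python sketch that pairs every row with every row of the same user and then keeps only the pairs where B's month is not later than A's. The variable names joined and kept are illustrative only.

```python
from itertools import product

# Same 12 rows as table sp1: (user, date, acc)
sp1 = [
    ("gifshow", "2017-1", 6), ("gifshow", "2017-2", 6), ("gifshow", "2017-3", 6),
    ("laifeng", "2017-1", 4), ("laifeng", "2017-2", 8), ("laifeng", "2017-3", 8),
    ("panda", "2017-1", 10), ("panda", "2017-2", 8), ("panda", "2017-3", 8),
    ("yy", "2017-1", 4), ("yy", "2017-2", 2), ("yy", "2017-3", 2),
]

# inner join ... on A.user = B.user
joined = [(a, b) for a, b in product(sp1, sp1) if a[0] == b[0]]

# where B.date <= A.date (string comparison, as in the SQL)
kept = [(a, b) for a, b in joined if b[1] <= a[1]]

print(len(joined), len(kept))
```

With 3 months per user, the join yields 3 x 3 = 9 pairs per user, and the filter keeps 1 + 2 + 3 = 6 of them. The comparison is a string comparison, so it orders months correctly here only because all month keys have the same width.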

Group by user and date, use max to pick out acc (it is the same on every joined row of a given month), and use sum to add up the running total.

mysql> 
select A.user,A.date,max(A.acc) acc_month, sum(B.acc) acc_sum from 
(select user,date ,acc from sp1) A 
inner join 
(select user,date ,acc from sp1) B 
on A.user = B.user 
where B.date <= A.date 
group by A.user,A.date 
order by A.user,A.date;
+---------+--------+-----------+---------+
| user    | date   | acc_month | acc_sum |
+---------+--------+-----------+---------+
| gifshow | 2017-1 |         6 |       6 |
| gifshow | 2017-2 |         6 |      12 |
| gifshow | 2017-3 |         6 |      18 |
| laifeng | 2017-1 |         4 |       4 |
| laifeng | 2017-2 |         8 |      12 |
| laifeng | 2017-3 |         8 |      20 |
| panda   | 2017-1 |        10 |      10 |
| panda   | 2017-2 |         8 |      18 |
| panda   | 2017-3 |         8 |      26 |
| yy      | 2017-1 |         4 |       4 |
| yy      | 2017-2 |         2 |       6 |
| yy      | 2017-3 |         2 |       8 |
+---------+--------+-----------+---------+
12 rows in set (0.00 sec)
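
For readers without a MySQL instance at hand, the same self-join query can be tried with Python's built-in sqlite3 module. This is a sketch that uses an in-memory SQLite database as a stand-in; it is neither the MySQL nor the Hive setup from this post, and behaviour may differ in details between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("create table sp1 (user text, date text, acc integer)")
conn.executemany(
    "insert into sp1 values (?, ?, ?)",
    [
        ("gifshow", "2017-1", 6), ("gifshow", "2017-2", 6), ("gifshow", "2017-3", 6),
        ("laifeng", "2017-1", 4), ("laifeng", "2017-2", 8), ("laifeng", "2017-3", 8),
        ("panda", "2017-1", 10), ("panda", "2017-2", 8), ("panda", "2017-3", 8),
        ("yy", "2017-1", 4), ("yy", "2017-2", 2), ("yy", "2017-3", 2),
    ],
)

# Self-join: each row A is paired with all rows B of the same user
# whose month is not later than A's, then B.acc is summed per (user, month).
query = """
select A.user, A.date, max(A.acc) as acc_month, sum(B.acc) as acc_sum
from sp1 A
inner join sp1 B on A.user = B.user
where B.date <= A.date
group by A.user, A.date
order by A.user, A.date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```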

End of the SQL analysis!
