环境:
hive
需求:
输入数据,每天用户的流量 分隔符 \t
user date access
panda 2017-1-01 5t
gifshow 2017-1-01 3t
yy 2017-1-01 2t
laifeng 2017-1-01 2t
panda 2017-1-02 5t
gifshow 2017-1-02 3t
yy 2017-1-02 2t
laifeng 2017-1-02 2t
panda 2017-2-01 4t
gifshow 2017-2-01 3t
yy 2017-2-01 1t
laifeng 2017-2-01 4t
panda 2017-2-02 4t
gifshow 2017-2-02 3t
yy 2017-2-02 1t
laifeng 2017-2-02 4t
panda 2017-3-01 4t
gifshow 2017-3-01 3t
yy 2017-3-01 1t
laifeng 2017-3-01 4t
panda 2017-3-02 4t
gifshow 2017-3-02 3t
yy 2017-3-02 1t
laifeng 2017-3-02 4t
=============>求出下表,每个用户按月统计当月数值,并新增按月累加字段
user date acc acc_sum
gifshow 2017-1 6 6
gifshow 2017-2 6 12
gifshow 2017-3 6 18
laifeng 2017-1 4 4
laifeng 2017-2 8 12
laifeng 2017-3 8 20
panda 2017-1 10 10
panda 2017-2 8 18
panda 2017-3 8 26
yy 2017-1 4 4
yy 2017-2 2 6
yy 2017-3 2 8
分析:
由于原输入数据为文本,所以从本地导入hive 建 shipin_origin 的时候,hive表中字段均为String。
那么结果需要按月统计,则必须对 date 进行 substr() 。
在本案例中,有两种方式实现这个功能。
1)通过 hive 自带的函数 sum()over() 进行求解,效果好,时间短
2)通过 标准 sql 进行 inner join 进行求解,效果差,但是更能理解inner join 和 实现原理
操作:
方法一:
1、先在hive上建表,并load数据
create table shipin_origin (
user String,
date String,
access String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;
2、这个查询是用来将原始表进行按月累计,形成一个每个月总量的中间表
select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2
from shipin_origin a group by user,substr(date,1,6)
结果:
gifshow 2017-1 6
gifshow 2017-2 6
gifshow 2017-3 6
laifeng 2017-1 4
laifeng 2017-2 8
laifeng 2017-3 8
panda 2017-1 10
panda 2017-2 8
panda 2017-3 8
yy 2017-1 4
yy 2017-2 2
yy 2017-3 2
3、通过这个中间表,采用sum()over() 函数
select A.user ,A.c1 ,A.c2 ,sum(A.c2)over(partition by A.user order by A.c1 ) acc_sum from
(
select user,substr(date,1,6) c1 ,sum(cast(substr(access,1,1) as INT )) c2
from shipin_origin a group by user,substr(date,1,6)
) A;
结果:
gifshow 2017-1 6 6
gifshow 2017-2 6 12
gifshow 2017-3 6 18
laifeng 2017-1 4 4
laifeng 2017-2 8 12
laifeng 2017-3 8 20
panda 2017-1 10 10
panda 2017-2 8 18
panda 2017-3 8 26
yy 2017-1 4 4
yy 2017-2 2 6
yy 2017-3 2 8
Time taken: 23.988 seconds, Fetched: 12 row(s)
注意:在上面这个sql中,因为使用了嵌套查询,一定要把嵌套内的查询结果进行别名处理,否则会出现以下的报错:
FAILED: ParseException line 3:53 cannot recognize input near '' '' '' in subquery source
方法二:
1、先在hive上建表,并load数据
create table shipin_origin (
user String,
date String,
access String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/data/shipin_origin' OVERWRITE INTO TABLE shipin_origin;
2、同样将原始表进行按月累计,形成一个每个月总量的中间表,但是这里我们创建一个中间表去存储
(为什么这里进行了中间表存储,是为了简便sql写法,不建表其实也可以,但是逻辑上面要注意!!)
hive (default)>
create table mid_shipin ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' as
select user,substr(date,1,6) c1 ,sum (cast(substr(access,1,1) as INT )) c2
from shipin_origin a group by user,substr(date,1,6) ;
结果:
gifshow 2017-1 6
gifshow 2017-2 6
gifshow 2017-3 6
laifeng 2017-1 4
laifeng 2017-2 8
laifeng 2017-3 8
panda 2017-1 10
panda 2017-2 8
panda 2017-3 8
yy 2017-1 4
yy 2017-2 2
yy 2017-3 2
hive (default)> desc mid_shipin;
OK
user string
c1 string
c2 bigint
Time taken: 0.326 seconds, Fetched: 3 row(s)
3、开始对这个表进行操作,并获得结果:
select A.user,A.c1,max(A.c2) c2 ,sum(B.c2) accumulate
from
(select user,c1,sum(c2) c2 from mid_shipin group by user,c1) A
inner join
(select user,c1,sum(c2) c2 from mid_shipin group by user,c1 ) B
on A.user = B.user
where B.c1<=A.c1
group by A.user,A.c1
order by A.user,A.c1
====>结果:
gifshow 2017-1 6 6
gifshow 2017-2 6 12
gifshow 2017-3 6 18
laifeng 2017-1 4 4
laifeng 2017-2 8 12
laifeng 2017-3 8 20
panda 2017-1 10 10
panda 2017-2 8 18
panda 2017-3 8 26
yy 2017-1 4 4
yy 2017-2 2 6
yy 2017-3 2 8
Time taken: 114.184 seconds, Fetched: 12 row(s)
这个过程中,用到了4对 mapreduce 和 1次 map 时间消耗很大!
过程分析:
#为了测试,放到mysql 里面看sql逻辑
create table sp1 (
user varchar(10),
date varchar(10),
acc int
);
insert into sp1 values ('gifshow', '2017-1', 6 );
insert into sp1 values ('gifshow', '2017-2' , 6 );
insert into sp1 values ('gifshow', '2017-3' , 6 );
insert into sp1 values ('laifeng', '2017-1' , 4 );
insert into sp1 values ('laifeng', '2017-2' , 8 );
insert into sp1 values ('laifeng', '2017-3' , 8 );
insert into sp1 values ('panda' , '2017-1' , 10 );
insert into sp1 values ('panda' , '2017-2' , 8 );
insert into sp1 values ('panda' , '2017-3' , 8 );
insert into sp1 values ('yy' , '2017-1' , 4 );
insert into sp1 values ('yy' , '2017-2' , 2 );
insert into sp1 values ('yy' , '2017-3' , 2 );
select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user ;
注意后面带 * 的行
+---------+--------+------+---------+--------+------+
| user | date | acc | user | date | acc |
+---------+--------+------+---------+--------+------+
| gifshow | 2017-1 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-1 | 6 | gifshow | 2017-2 | 6 |*
| gifshow | 2017-1 | 6 | gifshow | 2017-3 | 6 |*
| gifshow | 2017-2 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-2 | 6 | gifshow | 2017-2 | 6 |
| gifshow | 2017-2 | 6 | gifshow | 2017-3 | 6 |*
| gifshow | 2017-3 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-3 | 6 | gifshow | 2017-2 | 6 |
| gifshow | 2017-3 | 6 | gifshow | 2017-3 | 6 |
| laifeng | 2017-1 | 4 | laifeng | 2017-1 | 4 |
| laifeng | 2017-1 | 4 | laifeng | 2017-2 | 8 |*
| laifeng | 2017-1 | 4 | laifeng | 2017-3 | 8 |*
| laifeng | 2017-2 | 8 | laifeng | 2017-1 | 4 |
| laifeng | 2017-2 | 8 | laifeng | 2017-2 | 8 |
| laifeng | 2017-2 | 8 | laifeng | 2017-3 | 8 |*
| laifeng | 2017-3 | 8 | laifeng | 2017-1 | 4 |
| laifeng | 2017-3 | 8 | laifeng | 2017-2 | 8 |
| laifeng | 2017-3 | 8 | laifeng | 2017-3 | 8 |
| panda | 2017-1 | 10 | panda | 2017-1 | 10 |
| panda | 2017-1 | 10 | panda | 2017-2 | 8 |*
| panda | 2017-1 | 10 | panda | 2017-3 | 8 |*
| panda | 2017-2 | 8 | panda | 2017-1 | 10 |
| panda | 2017-2 | 8 | panda | 2017-2 | 8 |
| panda | 2017-2 | 8 | panda | 2017-3 | 8 |*
| panda | 2017-3 | 8 | panda | 2017-1 | 10 |
| panda | 2017-3 | 8 | panda | 2017-2 | 8 |
| panda | 2017-3 | 8 | panda | 2017-3 | 8 |
| yy | 2017-1 | 4 | yy | 2017-3 | 2 |*
| yy | 2017-1 | 4 | yy | 2017-1 | 4 |
| yy | 2017-1 | 4 | yy | 2017-2 | 2 |*
| yy | 2017-2 | 2 | yy | 2017-3 | 2 |*
| yy | 2017-2 | 2 | yy | 2017-1 | 4 |
| yy | 2017-2 | 2 | yy | 2017-2 | 2 |
| yy | 2017-3 | 2 | yy | 2017-3 | 2 |
| yy | 2017-3 | 2 | yy | 2017-1 | 4 |
| yy | 2017-3 | 2 | yy | 2017-2 | 2 |
+---------+--------+------+---------+--------+------+
36 rows in set (0.00 sec)
mysql> select * from (select user,date ,acc from sp1) A inner join (select user,date ,acc from sp1) B on A.user = B.user where B.date <= A.date ;
上表后缀星号被过滤,采用inner join 的方式,过滤掉 不用累加的行!!
+---------+--------+------+---------+--------+------+
| user | date | acc | user | date | acc |
+---------+--------+------+---------+--------+------+
| gifshow | 2017-1 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-2 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-2 | 6 | gifshow | 2017-2 | 6 |
| gifshow | 2017-3 | 6 | gifshow | 2017-1 | 6 |
| gifshow | 2017-3 | 6 | gifshow | 2017-2 | 6 |
| gifshow | 2017-3 | 6 | gifshow | 2017-3 | 6 |
| laifeng | 2017-1 | 4 | laifeng | 2017-1 | 4 |
| laifeng | 2017-2 | 8 | laifeng | 2017-1 | 4 |
| laifeng | 2017-2 | 8 | laifeng | 2017-2 | 8 |
| laifeng | 2017-3 | 8 | laifeng | 2017-1 | 4 |
| laifeng | 2017-3 | 8 | laifeng | 2017-2 | 8 |
| laifeng | 2017-3 | 8 | laifeng | 2017-3 | 8 |
| panda | 2017-1 | 10 | panda | 2017-1 | 10 |
| panda | 2017-2 | 8 | panda | 2017-1 | 10 |
| panda | 2017-2 | 8 | panda | 2017-2 | 8 |
| panda | 2017-3 | 8 | panda | 2017-1 | 10 |
| panda | 2017-3 | 8 | panda | 2017-2 | 8 |
| panda | 2017-3 | 8 | panda | 2017-3 | 8 |
| yy | 2017-1 | 4 | yy | 2017-1 | 4 |
| yy | 2017-2 | 2 | yy | 2017-1 | 4 |
| yy | 2017-2 | 2 | yy | 2017-2 | 2 |
| yy | 2017-3 | 2 | yy | 2017-1 | 4 |
| yy | 2017-3 | 2 | yy | 2017-2 | 2 |
| yy | 2017-3 | 2 | yy | 2017-3 | 2 |
+---------+--------+------+---------+--------+------+
24 rows in set (0.00 sec)
按照user,date 进行分类,用 max 取出 acc (因为每个月的都一样) ,用sum 进行累加。
mysql>
select A.user,A.date,max(a.acc) acc_month, sum(B.acc) acc_sum from
(select user,date ,acc from sp1) A
inner join
(select user,date ,acc from sp1) B
on A.user = B.user
where B.date <= A.date
group by A.user,A.date
order by A.user,A.date;
+---------+--------+-----------+---------+
| user | date | acc_month | acc_sum |
+---------+--------+-----------+---------+
| gifshow | 2017-1 | 6 | 6 |
| gifshow | 2017-2 | 6 | 12 |
| gifshow | 2017-3 | 6 | 18 |
| laifeng | 2017-1 | 4 | 4 |
| laifeng | 2017-2 | 8 | 12 |
| laifeng | 2017-3 | 8 | 20 |
| panda | 2017-1 | 10 | 10 |
| panda | 2017-2 | 8 | 18 |
| panda | 2017-3 | 8 | 26 |
| yy | 2017-1 | 4 | 4 |
| yy | 2017-2 | 2 | 6 |
| yy | 2017-3 | 2 | 8 |
+---------+--------+-----------+---------+
12 rows in set (0.00 sec)
sql分析结束!