hive ---求行差值案例---判断行值是否相等案例 【lag, lead, frist_value, last_value】的使用案例

假设有表goods(日期,产品id,产品当日收入,产品当日成本),日期和产品id是组合主键,有若干条数据,日期范围2016年1月1日至今,且一定每个产品,每天都有数据

写出SQL实现如下要求:

数据文件:goods.txt数据:
2018-03-01,a,3000,2500
2018-03-01,b,4000,3200
2018-03-01,c,3200,2400
2018-03-01,d,3000,2500
2018-03-02,a,3000,2500
2018-03-02,b,1500,800
2018-03-02,c,2600,1800
2018-03-02,d,2400,1000
2018-03-03,a,3100,2400
2018-03-03,b,2500,2100
2018-03-03,c,4000,1200
2018-03-03,d,2500,1900
2018-03-04,a,2800,2400
2018-03-04,b,3200,2700
2018-03-04,c,2900,2200
2018-03-04,d,2700,2500
2018-03-05,a,2700,1000
2018-03-05,b,1800,200
2018-03-05,c,5600,2200
2018-03-05,d,1200,1000
2018-03-06,a,2900,2500
2018-03-06,b,4500,2500
2018-03-06,c,6700,2300
2018-03-06,d,7500,5000
2018-04-01,a,3000,2500
2018-04-01,b,4000,3200
2018-04-01,c,3200,2400
2018-04-01,d,3000,2500
2018-04-02,a,3000,2500
2018-04-02,b,1500,800
2018-04-02,c,4600,1800
2018-04-02,d,2400,1000
2018-04-03,a,6100,2400
2018-04-03,b,4500,2100
2018-04-03,c,6000,1200
2018-04-03,d,3500,1900
2018-04-04,a,2800,2400
2018-04-04,b,3200,2700
2018-04-04,c,2900,2200
2018-04-04,d,2700,2500
2018-04-05,a,4700,1000
2018-04-05,b,3800,200
2018-04-05,c,5600,2200
2018-04-05,d,5200,1000
2018-04-06,a,2900,2500
2018-04-06,b,4500,2500
2018-04-06,c,6700,2300
2018-04-06,d,7500,5000

要求:输出2018年3月中每一天与上一天相比,总成本的变化。

1)先求2018每一天的总成本。创建中间表方便使用。

create table goods_day as
select dt,sum(income) income,sum(cost) cost 
from goods 
where substring(dt,1,4)="2018" 
group by dt;

表内容:

select * from goods_day limit 10;

+---------------+-------------------+-----------------+
| goods_day.dt  | goods_day.income  | goods_day.cost  |
+---------------+-------------------+-----------------+
| 2018-03-01    | 13200             | 10600           |
| 2018-03-02    | 9500              | 6100            |
| 2018-03-03    | 12100             | 7600            |
| 2018-03-04    | 11600             | 9800            |
| 2018-03-05    | 11300             | 4400            |
| 2018-03-06    | 21600             | 12300           |
| 2018-04-01    | 13200             | 10600           |
| 2018-04-02    | 11500             | 6100            |
| 2018-04-03    | 20100             | 7600            |
| 2018-04-04    | 11600             | 9800            |
+---------------+-------------------+-----------------+

2)
方法一:开窗over函数中取当前[ current ]数据与前一条数据 [ following ] ,的第一个值(first_value)和最后一个值(last_value)得差。

select dt,cost,
cost - (LAST_VALUE(cost) OVER(ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)) as change 
from goods_day 
where substring(dt,6,2)="03" 
order by dt;

方法二:使用lead函数获取前一行的值。

select dt,cost,
cost - lead(cost,1,cost)over() as change 
from goods_day 
substring(dt,6,2)="03"
 order by dt;

结果如下:

+-------------+--------+---------+
|     dt      |  cost  | change  |
+-------------+--------+---------+
| 2018-03-01  | 10600  | 0       |
| 2018-03-02  | 6100   | -4500   |
| 2018-03-03  | 7600   | 1500    |
| 2018-03-04  | 9800   | 2200    |
| 2018-03-05  | 4400   | -5400   |
| 2018-03-06  | 12300  | 7900    |
+-------------+--------+---------+

 方法三:传统自连接查询。

select distinct  
g1.dt,g1.cost,g1.cost-g2.cost change
from goods_day g1 left join goods_day g2
on cast(substring(g1.dt,-1) as int)-1=cast(substring(g2.dt,-1) as int)
where substring(g1.dt,1,7)="2018-03";

 结果:

+-------------+----------+---------+
|    g1.dt    | g1.cost  | change  |
+-------------+----------+---------+
| 2018-03-01  | 10600    | NULL    |
| 2018-03-02  | 6100     | -4500   |
| 2018-03-03  | 7600     | 1500    |
| 2018-03-04  | 9800     | 2200    |
| 2018-03-05  | 4400     | -5400   |
| 2018-03-06  | 12300    | 7900    |
+-------------+----------+---------+

 

案例二:

编写一个SQL实现查找所有至少连续三次出现的数字

1	1
2	1
3	1
4	2
5	1
6	2
7	2
8	3
9	3
10	3
11	3
12	4
准备操作:
create database if not exists hive_interview;
use hive_interview;
drop table if exists numbers;
create table numbers(id int, number int) row format delimited fields terminated by "\t";
load data local inpath "/home/hadoop/numbers.txt" into table numbers;
select * from numbers;

hql解决方案:

分析:连续3次,就是当前行的值与前两行的值相同。

hql如下:

select distinct(a.number) number
from 
(select number,
(case when 
    number=(LAG(number,1,0) OVER(order by id))  //前一行
    and number=(LAG(number,2,0) OVER(order by id)) //前第二行
then 1 else 0                                      //相同返回1,不同返回0
end 
) as b  
from numbers) a
where a.b=1;

结果:

+---------+
| number  |
+---------+
| 1       |
| 3       |
+---------+

   传统SQL解决方案--自连接:

select  
distinct n1.number
from numbers n1 
join numbers n2 on n1.id-1=n2.id and n2.number=n1.number
join numbers n3 on n1.id-2=n3.id and n3.number=n2.number;

结果:

+------------+
| n1.number  |
+------------+
| 1          |
| 3          |
+------------+

 

这两个题提示我们SQL世界里永远不要忘记自连接!

 

 

 

你可能感兴趣的:(hive)