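Case: daily visitor traffic is recorded in three columns: sequence number (id), date (visit_date), and visitor count (people). Find the date and visitor count of every row that belongs to a run of at least three consecutive rows in which the visitor count is at least 100.
Create the table: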
create table public.stadium(
id int,
visit_date varchar(100),
people int
);
Insert the data:
insert into public.stadium
select 1,'2017-01-01',10
union select 2,'2017-01-02',109
union select 3,'2017-01-03',150
union select 4,'2017-01-04',99
union select 5,'2017-01-05',145
union select 6,'2017-01-06',1455
union select 7,'2017-01-07',199
union select 8,'2017-01-08',188
;
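Test data layout:
id | visit_date | people
1 | 2017-01-01 | 10
2 | 2017-01-02 | 109
3 | 2017-01-03 | 150
4 | 2017-01-04 | 99
5 | 2017-01-05 | 145
6 | 2017-01-06 | 1455
7 | 2017-01-07 | 199
8 | 2017-01-08 | 188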
Solution 1:
Problems about several consecutive rows can usually be solved by joining the table to itself multiple times.
select distinct a.*
from public.stadium a,
     public.stadium b,
     public.stadium c
where a.people >= 100
  and b.people >= 100
  and c.people >= 100
  and (
        (a.id = b.id - 1 and b.id = c.id - 1)   -- a is the first row of the run
     or (a.id = b.id - 1 and a.id = c.id + 1)   -- a is the middle row of the run
     or (a.id = b.id + 1 and b.id = c.id + 1)   -- a is the last row of the run
  )
order by a.id;
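Result:
id | visit_date | people
5 | 2017-01-05 | 145
6 | 2017-01-06 | 1455
7 | 2017-01-07 | 199
8 | 2017-01-08 | 188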
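Notes: 1. distinct is required here. When more than three consecutive days have at least 100 visitors, the or branches in the where clause match the same day more than once, so that day would appear repeatedly in the result.
2. If the production data volume is large, the multi-way self-join makes the query slow and can even fail with an out-of-memory (OOM) error.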
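To see the duplication that note 1 describes, one can drop distinct and count how often each row of a is matched. This is only an inspection sketch against the sample table; the alias matches is made up here for illustration.

```sql
-- Same join as solution 1, without distinct, counting matches per row of a.
select a.id,
       count(*) as matches   -- "matches" is an illustrative alias
from public.stadium a,
     public.stadium b,
     public.stadium c
where a.people >= 100
  and b.people >= 100
  and c.people >= 100
  and (
        (a.id = b.id - 1 and b.id = c.id - 1)
     or (a.id = b.id - 1 and a.id = c.id + 1)
     or (a.id = b.id + 1 and b.id = c.id + 1)
  )
group by a.id
order by a.id;
-- On the sample data: id 5 -> 1, id 6 -> 2, id 7 -> 2, id 8 -> 1
```

Ids 6 and 7 sit inside a run of four qualifying days, so each is matched by two different branches; without distinct they would be returned twice.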
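Solution 2:
Vertica's analytic functions lag/lead: in most cases they can replace a self-join.
lag(field, num, defaultvalue) over(partition by A order by B): reads a row that precedes the current row. field is the column whose value is returned, num is how many rows back to look, defaultvalue is the value returned when no such row exists, and over() defines the window the rows are taken from.
lead(field, num, defaultvalue) over(partition by A order by B): reads a row that follows the current row. field is the column whose value is returned, num is how many rows ahead to look, defaultvalue is the value returned when no such row exists, and over() defines the window the rows are taken from.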
select
    id,
    visit_date,
    p3 as people
from
    (select
        id,
        visit_date,
        lag(people, 2, null) over(order by id) as p1,   -- two rows back
        lag(people, 1, null) over(order by id) as p2,   -- previous row
        people as p3,                                   -- current row
        lead(people, 1, null) over(order by id) as p4,  -- next row
        lead(people, 2, null) over(order by id) as p5   -- two rows ahead
    from public.stadium
    ) tmp1
where (p1 >= 100 and p2 >= 100 and p3 >= 100)   -- current row ends the run
   or (p2 >= 100 and p3 >= 100 and p4 >= 100)   -- current row is in the middle
   or (p3 >= 100 and p4 >= 100 and p5 >= 100)   -- current row starts the run
order by id;
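To get a feel for what lag and lead return, it can help to look at the neighbouring values on their own. The query below is a minimal sketch against the sample table; the aliases prev_people and next_people are illustrative names, not part of the original solution.

```sql
-- Show each row next to its previous and next visitor count.
select id,
       people,
       lag(people, 1)  over (order by id) as prev_people,  -- illustrative alias
       lead(people, 1) over (order by id) as next_people   -- illustrative alias
from public.stadium
order by id;
-- On the sample data:
-- id | people | prev_people | next_people
--  1 |     10 |        NULL |         109
--  2 |    109 |          10 |         150
--  3 |    150 |         109 |          99
--  4 |     99 |         150 |         145
--  5 |    145 |          99 |        1455
--  6 |   1455 |         145 |         199
--  7 |    199 |        1455 |         188
--  8 |    188 |         199 |        NULL
```

The first and last rows have no neighbour on one side, so they get NULL there. A comparison such as NULL >= 100 is not true, so those branches of the where clause simply do not match.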
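Solution 3:
Use Vertica's window function row_number() over(order by ...) to label the rows.
Because id is auto-incremented and there is exactly one row per day, the difference between the label and id stays constant within a run of consecutive rows that each have at least 100 visitors.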
select
    id,
    visit_date,
    people
from
    (select
        id,
        visit_date,
        people,
        count(1) over(partition by diff) as cnt   -- number of consecutive days with at least 100 visitors
    from
        (select
            id,
            visit_date,
            people,
            (row_number() over(order by id)) - id as diff   -- difference between the label of a qualifying row and its id
        from public.stadium
        where people >= 100
        ) tmp1
    ) tmp2
where cnt >= 3
order by id;
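The grouping trick is easier to follow when the intermediate label and difference are visible. The query below is a small sketch against the sample table that uses the problem's threshold of at least 100; the alias rn is an illustrative name.

```sql
-- Show the row_number label and its difference from id for qualifying rows.
select id,
       people,
       row_number() over (order by id)      as rn,    -- illustrative alias
       row_number() over (order by id) - id as diff
from public.stadium
where people >= 100
order by id;
-- On the sample data:
-- id | people | rn | diff
--  2 |    109 |  1 |   -1
--  3 |    150 |  2 |   -1
--  5 |    145 |  3 |   -2
--  6 |   1455 |  4 |   -2
--  7 |    199 |  5 |   -2
--  8 |    188 |  6 |   -2
```

Every time a day below the threshold is skipped, id jumps ahead of the label and diff changes, so each run of consecutive qualifying days gets its own diff value. The group with diff = -1 has only two rows and is filtered out by cnt >= 3, while the group with diff = -2 has four rows and is kept.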