hive sql复杂场景-1

startdate version
2021-02-28 12
2021-03-27 11
2021-03-29 10
2021-10-28 12
2021-11-26 11
2021-11-27 1
2021-11-28 10
2021-12-26 9

需求:
如果两个startdate相差小于30天,那么认为这两个startdate对应的数据在一个集合中;如果第三条数据的startdate与这个集合中任意一条数据的startdate相差小于30天,认为这条数据也是属于这个集合;以此类推,将所有数据分为不同的集合。
输出每个集合中最大的version和它对应的那条数据。


思路:
重点在于分组。

先对所有数据按日期进行升序排序,如果一条数据的日期与前面集合最大日期的差值在30以内,则认为这条数据属于前一个集合。
在排好序的基础上,一条数据的日期与前一条数据的差值如果小于30就是0;如果大于30就是1(认为是一个新的集合)
按顺序是1,0,0,0,1,0,0等的情况。
然后sum over() ,按顺序是1,1,1,1,2,2,2等的情况
这样不同的集合就有了不同的值。

与前一条数据日期差是否在30以内为0
startdate version per_startdate flag sum over
2021-02-28 12 1900-01-01 1 1
2021-03-27 11 2021-02-28 0 1
2021-03-29 10 2021-03-27 0 1
2021-10-28 12 2021-03-29 1 2
2021-11-26 11 2021-10-28 0 2
2021-11-27 1 2021-11-26 0 2
2021-11-28 10 2021-11-27 0 2
2021-12-26 9 2021-11-28 0 2
with testdataset as (
  select '2021-02-28' as startdate, 12 as version
  union
  select '2021-03-27' as startdate, 11 as version
  union
  select '2021-03-29' as startdate, 10 as version
  union
  select '2021-10-28' as startdate, 12 as version
  union
  select '2021-11-26' as startdate, 11 as version
  union
  select '2021-11-27' as startdate, 1 as version
  union
  select '2021-11-28' as startdate, 10 as version
  union
  select '2021-12-26' as startdate, 9 as version
)


select startdate,version,row_number()over(partition by sumr order by version desc) as rank
from (
    select startdate,version,sum(case when datediff(startdate,per_startdate) < 30 then 0 else 1 end) over (order by startdate asc) as sumr
    from (       
        select startdate,version,lag(startdate,1,'1900-01-01') over (ORDER BY startdate asc) as per_startdate
        from testdataset  
    )
) 
having rank = 1


-- startdate	version	rank
-- 2021-02-28	12	1	
-- 2021-10-28	12	1	

你可能感兴趣的:(hive,sql复杂场景,sql)